Abstract
Modern object detection algorithms still suffer from imbalance problems, especially foreground–background and foreground–foreground class imbalance. Existing methods generally adopt re-sampling based on class frequency or re-weighting based on the predicted class probability. Focal loss, for example, was proposed to rebalance the loss assigned to easy negative examples and hard positive examples in single-stage detectors. However, two critical issues remain unresolved. First, in practical applications such as autonomous driving, the class imbalance becomes more extreme because of the wider detection field and the distribution of targets, so a more effective way to balance foreground and background classes is needed. Second, existing methods typically employ the sigmoid or softmax cross-entropy loss for the classification task, which we believe cannot achieve foreground–foreground class balance. In this paper, we propose a new form of focal loss by redesigning the re-weighting scheme so that it computes the weight from the predicted probability and also widens the weight difference between examples. We then extend this loss to the multi-class classification task by reformulating the standard softmax cross-entropy loss to better exploit the discriminative differences between foreground categories, yielding a class-discriminative focal loss. Comprehensive experiments are conducted on the KITTI and BDD datasets. The results show that our approach surpasses focal loss with no additional training or inference time. Moreover, when trained with the proposed loss function, current state-of-the-art object detectors, whether one-stage or two-stage, achieve significant performance gains.
1 Introduction
Modern object detection algorithms are developed based on convolutional neural networks (CNNs) and can be roughly divided into two categories, two-stage detectors [13] and one-stage detectors [24, 40]. Compared with the classical object detector, the modern object detector has evolved from the traditional manual feature extractor (e.g. LBP [25], Haar [19, 27, 38], or HOG [7, 17, 23]) to the semantic feature extractor based on CNN, but inherits the two-stage and proposal-driven mechanism. As popularized in the R-CNN framework [13], an appropriate amount of candidate target locations is generated in the first stage, and then finely classified into foreground or background classes in the second stage. Since then, a series of advanced two-stage detectors [6, 12, 14, 18, 20, 33] have been proposed, and have achieved a constant improvement in accuracy on the challenging PASCAL VOC [8] and COCO benchmark [22].
Although the two-stage detector can achieve high detection accuracy, it is slow because both training and inference must pass through two stages. In contrast, one-stage detectors, such as the YOLO series [1, 5, 30, 31, 32], SSD [10, 24], RetinaNet [21] and FCOS [36], complete target classification and location regression in a single stage. The one-stage detector greatly reduces the inference time while achieving considerable detection accuracy, and is considered a more efficient and elegant detection method, especially for autonomous driving applications, which require a good trade-off between accuracy and speed.
While the two-stage detector and the one-stage detector are structurally different, both are subject to class imbalance problems [26]. The two-stage framework typically applies region proposal mechanisms (e.g. Selective Search [37], Edge Boxes [42], RPN [33], DeepMask [28, 29]) to screen a large number of candidate target locations in the first stage. In the second stage, sampling methods such as a fixed ratio of foreground to background [33] or online hard example mining (OHEM) [35] are used to obtain a reasonable balance between foreground and background. For one-stage detectors, solving the class imbalance problem is a bigger challenge, because they densely sample target locations, aspect ratios, and scales, and all possible candidates need to be learned during training. To improve the training efficiency, techniques such as data augmentation [31], hard example mining [35, 38], and loss function design [2, 16, 21] have been proposed. The recent focal loss [21] has received increasing attention. It reduces the class imbalance by modifying the sigmoid cross-entropy loss to down-weight the loss assigned to easy negative examples.
However, practical applications such as autonomous driving require multiclass object detection in wide-angle scenes. In high-resolution images, the anchor mechanism creates a more extreme imbalance between foreground and background candidates, making it harder to balance the numbers of negative and positive examples. Therefore, the form of focal loss needs further improvement to accommodate more imbalanced application scenarios. In addition, focal loss only considers the balance between foreground and background. It makes no use of the discriminative information between foreground classes, which is helpful for distinguishing foreground categories.
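To give a feel for how dense anchoring scales with image resolution, the following is a minimal sketch that counts anchors for a RetinaNet-style detector. The pyramid strides (8 to 128), the nine anchors per location, the image sizes, and the number of positive anchors are all assumptions chosen for illustration and are not taken from this paper's experiments.

```python
import math

# Assumed RetinaNet-style settings (not taken from this paper):
# pyramid levels with strides 8..128 and 9 anchors per location.
STRIDES = (8, 16, 32, 64, 128)
ANCHORS_PER_LOCATION = 9


def count_anchors(height, width, strides=STRIDES, per_location=ANCHORS_PER_LOCATION):
    """Return the number of dense anchors generated for one image."""
    total = 0
    for stride in strides:
        # Each pyramid level has one cell per `stride` pixels (rounded up).
        rows = math.ceil(height / stride)
        cols = math.ceil(width / stride)
        total += rows * cols * per_location
    return total


def background_to_foreground_ratio(height, width, num_positive_anchors):
    """Ratio of background anchors to foreground anchors for one image."""
    total = count_anchors(height, width)
    return (total - num_positive_anchors) / num_positive_anchors


if __name__ == "__main__":
    # Hypothetical image sizes and positive-anchor count, for illustration only.
    for h, w in [(480, 640), (720, 1280)]:
        ratio = background_to_foreground_ratio(h, w, num_positive_anchors=50)
        print(h, w, count_anchors(h, w), round(ratio))
```

With the number of positive anchors held fixed, the larger image yields more anchors and therefore a larger background-to-foreground ratio, which is the effect the paragraph above describes.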
In this paper, we first explore the form of the weighting factor in the focal loss, which we call the focal weight, as shown in Fig. 1. The loss weight of an example is determined by its prediction probability: the greater the probability, the smaller the loss weight. Inspired by this, we propose an extended form of the focal weight, shown in Fig. 2. Like the focal weight, it adaptively computes the weight from the probability, but it further widens the weight difference between examples. Second, we investigate the functional form of focal loss. We believe that the original focal loss, being based on the sigmoid cross-entropy loss, cannot exploit the constraint relationship between foreground classes, so we apply the extended focal weight to the softmax cross-entropy loss. Finally, we propose a new loss function called class-discriminative focal loss, which aims to achieve foreground–foreground class balance. On the one hand, we reformulate the standard softmax cross-entropy loss to calculate the negative logarithmic loss of the prediction probability for both the ground-truth class and the wrong categories. On the other hand, as shown in Fig. 3, we define a weighting factor called the discriminative weight, which adjusts the loss of each wrong prediction probability according to its similarity with the ground truth. In short, the contributions of our research are as follows:
1. We propose a new form of focal loss, namely extended focal loss, that is capable of further mitigating extreme class imbalance.
2. We propose the class-discriminative focal loss by introducing the extended focal loss to the multi-class classification task and reshaping the standard softmax cross-entropy loss, which improves the discriminability of foreground categories and thereby reduces the foreground–foreground class imbalance.
3. Our proposed loss function surpasses the state-of-the-art method, focal loss, by nearly 1.1 mAP with no additional training or inference time. It is easy to generalize and apply to other detection models.
4. When trained with our proposed loss function, the network achieves significant performance gains, outperforming other state-of-the-art methods on two major datasets for autonomous-driving detection tasks, KITTI and Berkeley DeepDrive (BDD).
The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 details the proposed method. Experimental results are given in Sect. 4, and the conclusions are presented in Sect. 5.
2 Related works
2.1 CNN-based object detection
Since the introduction of CNNs, the accuracy of object detection methods has greatly improved. Especially after AlexNet, the impressive work of Krizhevsky et al. in 2012 [15], deep neural networks have come to dominate object detection and various other tasks in computer vision. As neural network structures have developed, object detection algorithms have progressed as well, and they have gradually divided into two main directions: two-stage detectors and one-stage detectors.
The two-stage framework, applied in classical object detection methods, has a long history. The two-stage detector has adopted this framework into CNN architectures. R-CNN [13] was the pioneer in using the CNN as the feature extractor in the first stage, followed by the support vector machine (SVM) for the classification task in the second stage. After that, Fast R-CNN [12] upgraded the classifier in the second stage to a convolutional neural network, largely improving the accuracy. Faster R-CNN [33] proposed the region proposal mechanism, turning the object detection system into a single neural network structure. Numerous extensions to this structure have been proposed, e.g. [6, 18, 20].
One-stage detectors typically perform feature extraction, object localization, and object classification in a single convolutional neural network. OverFeat [34] was one of the first one-stage detectors. SSD [10, 24] and YOLO [30, 31, 32] drew on many ideas from two-stage detectors, such as anchor boxes and feature pyramids. The recent RetinaNet [21] has received great attention for its elegant architecture and high efficiency.
2.2 Class imbalance
Imbalance problems in object detection have received significant attention, especially class imbalance [26]. For two-stage detectors, owing to the region proposal mechanism [33], this problem is handled fairly well by common sampling schemes [33, 35]. Although these sampling heuristics can be applied to one-stage detectors, they remain inefficient because easily classified background examples dominate the training process [21]. To address this, several forms of hard negative mining [24, 35, 38] that dig out the hard examples have been proposed to improve the training efficiency. Another influential approach is to modify the loss function. Bulo et al. [2] put forward a loss function called Loss Max-Pooling to eliminate the influence of datasets with long-tailed distributions on training. Liu et al. [24] integrated the so-called \(\alpha \)-balance into the cross-entropy loss to weight the losses of different classes according to their frequency. Lin et al. [21] introduced the focal loss to down-weight the easy negatives while leaving the hard examples unaffected. Weber et al. [39] introduced a focal loss variant called automated focal loss, which can greatly reduce the training convergence time. The above methods hold that examples of minor classes should have higher losses than those of major classes, as the features learned from the minor classes are poorer. While the focal loss focuses on addressing inliers (easy examples), the Huber loss [9] is designed to reduce the contribution of outliers (hard examples). The recent Gradient Harmonizing Mechanism [16] also considers the harmfulness of very hard examples, but it is based on the statistical distribution of the gradient rather than that of the loss. Meanwhile, as discussed in [16], the optimal distribution of the gradient is unclear. In our work, we also take the idea of reshaping the loss function. However, in addition to reducing the class imbalance, our proposed class-discriminative focal loss is also able to exploit the interrelationship between foreground classes, which increases the discriminability of foreground categories and improves the accuracy.
2.3 Objective function design
The loss function of an object detection system usually combines two parts, one for object classification and the other for object location regression. In general, softmax cross-entropy [10, 31, 34] or sigmoid cross-entropy [21] is adopted for the classification loss. In [21], focal loss takes the form of sigmoid cross-entropy. The work presented in [3] introduced focal loss to softmax cross-entropy and demonstrated that sigmoid cross-entropy is more stable when training with a variety of aspect ratios and scales, while softmax cross-entropy can reach higher performance. We also discuss the difference between them in our work, and our class-discriminative focal loss is based on softmax cross-entropy because of its ability to generate category prediction probabilities with constraints.
For the box regression loss, the \(L_2\) loss [30, 34], the smooth \(L_2\) loss [24] or the similar smooth \(L_1\) loss [10] is usually used. Modifying the regression loss is not our aim, so we follow RetinaNet and adopt the smooth \(L_1\) loss.
3 Class-discriminative focal loss
The focal loss introduced by [21] aims to eliminate the training inefficiency caused by the imbalanced data distribution for one-stage detectors. However, while focal loss achieves competitive results on the COCO benchmark [22], it performs slightly worse on much more imbalanced datasets such as KITTI [11] and BDD [41]. The reason is that the images in these autonomous-driving datasets have a higher resolution than those in COCO, which is closer to practical applications such as automated driving and, with the anchor mechanism, leads to a more extreme imbalance between foreground and background. Besides, as discussed in [3], extending the focal loss to the multi-class task works better. When the focal loss is applied to binary classification, the sigmoid operation is used in the loss layer to compute the probability of the targets together with the loss. For multi-class classification, the softmax operation is adopted instead. The former offers greater numerical stability, while the latter offers higher accuracy. Based on the above considerations, we extend the form of the focal weight to further widen the weight difference between examples, forming the extended focal loss. In this way, the hard positive examples contribute more to the loss, which suits extremely imbalanced situations. In addition, we apply the extended focal loss to multi-class classification. In contrast to the previous work [3], we use the softmax operation in the loss layer in order to obtain class prediction probabilities with constraints. With the help of these constrained category probabilities, we further propose the class-discriminative focal loss to increase the difference in loss weight between foreground categories, which helps to improve the discriminability of foreground categories, especially similar ones.
To introduce our class-discriminative focal loss clearly, we first recall the definition of focal loss. Focal loss was first applied to the sigmoid cross-entropy:
\[ \mathrm{CE}(p, y) = {\left\{ \begin{array}{ll} -\log (p), &{} \text{if } y = 1 \\ -\log (1 - p), &{} \text{otherwise} \end{array}\right. } \tag{1} \]
where \(y = 1\) marks a positive example and p is the prediction probability of the class, generated by the sigmoid function:
\[ p = \frac{1}{1 + e^{-z}} \tag{2} \]
where z is the output of the network. For notational convenience, the probability that the network assigns to the positive example or the negative example can be unified as:
\[ p_t = {\left\{ \begin{array}{ll} p, &{} \text{if } y = 1 \\ 1 - p, &{} \text{otherwise} \end{array}\right. } \tag{3} \]
and the sigmoid cross-entropy can be simplified as:
\[ \mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log (p_t) \tag{4} \]
The main contribution of focal loss is the adaptive weight w formulated as Eq. 5:
\[ w = (1 - p_t)^{\gamma } \tag{5} \]
In the above, w is determined by two variables, \(p_t\) and \(\gamma \). The former is the probability of the ground-truth class estimated by the model and the latter is a modulating parameter. According to [21], since the range of \(p_t\) is [0,1], it can be used to quantify the classification difficulty of the examples. When \(p_t\) is large enough (\(p_t\gg 0.5\)), the corresponding example is well classified. In this case, \(1-p_t\) is near 0, which down-weights the loss. In contrast, \(1-p_t\) is near 1 when \(p_t\) is small, leaving the loss for the hard examples unaffected. The modulating factor \(\gamma \) is used for smooth adjustment. We call w the focal weight and plot it for \(\gamma \in [0,5]\) in Fig. 1.
In addition to reducing the imbalance between hard examples and easy examples, focal loss also integrates a weighting factor \(\alpha _t\) to address the class imbalance between negative examples and positive examples:
\[ \alpha _t = {\left\{ \begin{array}{ll} \alpha , &{} \text{if } y = 1 \\ 1 - \alpha , &{} \text{otherwise} \end{array}\right. } \tag{6} \]
In the above, \(\alpha \) is the weighting factor for positive examples and \(1-\alpha \) is that for negative examples; \(\alpha \) can be set by the inverse class frequency.
Finally, the focal loss can be defined as:
\[ \mathrm{FL}(p_t) = -\alpha _t (1 - p_t)^{\gamma } \log (p_t) \tag{7} \]
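The sigmoid focal loss of [21], as described in this section, can be sketched for a single example as follows. This is a minimal illustration rather than the implementation used in our experiments: the function names are hypothetical, the default values \(\gamma = 2\) and \(\alpha = 0.25\) are chosen only for illustration, and no numerical safeguards (such as clipping the probability) are included.

```python
import math


def sigmoid(z):
    """Map a raw network output z to a probability p."""
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(z, y, gamma=2.0, alpha=0.25):
    """Sigmoid focal loss for one example with label y in {0, 1}."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    focal_weight = (1.0 - p_t) ** gamma         # small for well-classified examples
    return -alpha_t * focal_weight * math.log(p_t)


if __name__ == "__main__":
    # Hypothetical logits: an easy negative and a hard positive.
    print("easy negative:", focal_loss(z=-4.0, y=0))
    print("hard positive:", focal_loss(z=-2.0, y=1))
```

Setting `gamma=0` removes the focal weight and leaves the \(\alpha \)-balanced sigmoid cross-entropy, which is the setting used to isolate the \(\alpha \)-balance strategy.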
3.1 Extended sigmoid focal loss
Since the class imbalance in the practical applications is even greater, we try to improve the form of focal weight and propose the extended focal weight \(w_\mathrm{extended}\) as Eq. 8:
As shown in Fig. 2, \(w_\mathrm{extended}\) is a piecewise symmetric function. We assume that an example is easy to classify when the corresponding probability is greater than 0.5, in which case the assigned loss weight should be small. (We have also experimented with other thresholds such as 0.3, 0.4, 0.6, and 0.7, but we found 0.5 to work best in our experiments. Further discussion is presented in Sect. 4.3.6.)
Figure 2 shows the graph of the extended focal weight. Intuitively, the extended focal weight reduces the loss contribution from the well-classified examples just like the focal weight. For difficult examples, however, the extended focal weight assigns a higher loss weight than the focal weight does. In this way, the difference between the hard examples and the easy examples is widened to adapt to more imbalanced situations.
With the extended focal weight, the extended focal loss for binary classification can be formulated as:
3.2 Extended softmax focal loss
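In order to investigate the relationship between the foreground categories, we introduce the extended focal weight to the softmax cross-entropy loss:
\[ \mathrm{CE}(\mathbf{p} , \mathbf{y} ) = -\sum \nolimits _i {y_i} \log \left( {p_i} \right) = -\log (p_t^{\prime }) \tag{10} \]
In the above, \(\mathbf{p} \) is the vector of probabilities estimated by the network for multiclass prediction and \(\mathbf{y} \) is the one-hot ground-truth label. Since \(\mathbf{y} \) is a one-hot label, we define \(p_t^{\prime }\) as the probability of the ground-truth class. The elements of \(\mathbf{p} \) are \(p_i\), generated by the softmax operation:
\[ p_i = \frac{e^{z_i}}{\sum \nolimits _j e^{z_j}} \tag{11} \]
where \(z_i\) is the output of the network for class i.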
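As a minimal sketch of the softmax operation and the standard softmax cross-entropy with a one-hot label, the following computes the class probabilities and the negative logarithm of the ground-truth probability. The function names and the example logits are hypothetical, and subtracting the maximum logit is a common stabilization step that is assumed here rather than taken from this paper.

```python
import math


def softmax(logits):
    """Convert a list of network outputs z_i into probabilities p_i."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, target_index):
    """Standard softmax cross-entropy: -log of the ground-truth probability."""
    p = softmax(logits)
    return -math.log(p[target_index])


if __name__ == "__main__":
    # Hypothetical logits for three classes; class 0 is the ground truth.
    logits = [2.0, 1.0, 0.1]
    print(softmax(logits), softmax_cross_entropy(logits, target_index=0))
```

Because the probabilities are normalized to sum to one, raising the score of one class necessarily lowers the probabilities of the others. This coupling is the constraint between categories that the sigmoid formulation lacks.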
Similar to Eq. 6, we define a weighting factor \(\alpha _t^{\prime }\) to rebalance the loss assigned to foreground and background:
Here, however, \(\alpha ^{\prime }\) applies to the ground-truth foreground class, while \(1-\alpha ^{\prime }\) applies to the other foreground classes and the background.
With the extended focal weight and the softmax cross-entropy as well as the weighting factor \(\alpha _t^{\prime }\), the extended focal loss for multiclass classification can be formulated as:
3.3 Class-discriminative focal loss
Because the label \(\mathbf{y} \) is one-hot, the traditional softmax cross-entropy loss only calculates the negative logarithmic loss of the ground-truth class, \(-\log (p_t^{\prime })\), and ignores the prediction probabilities of the wrong categories, \(p_i (y_i\ne 1)\). In our opinion, the same \(p_t^{\prime }\) may be produced with different \(p_i (y_i\ne 1)\), which implies that the similarities between the predicted classes and the ground-truth class differ. This information helps to improve the discriminability of the foreground categories. Therefore, we reshape the original softmax cross-entropy to calculate the negative logarithmic loss on the ground-truth class as well as on the wrong classes, \(\sum \nolimits _i { - I\left( {{y_i} \ne 1} \right) } \log \left( {1 - {p_i}} \right) \). Besides, we also define a weighting factor called the discriminative weight \(w_\mathrm{discriminative}\):
In the above, both \(\frac{p_i}{1-p_t^{\prime }}\) and \(\frac{p_i}{p_t^{\prime }}\) quantify the difference between the predicted wrong classes and the ground-truth class. The former is the ratio of each wrong predicted probability to the total wrong predicted probability and weights the hard examples. The latter is the ratio of each wrong predicted probability to the ground-truth probability and weights the easy examples. We plot \(w_\mathrm{discriminative}\) for \(\gamma \in [0,2]\) in Fig. 3.
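The following minimal sketch illustrates the unweighted reshaped softmax cross-entropy and the two ratios described above. It deliberately omits the extended focal weight and the discriminative weight, so it is not the full class-discriminative focal loss. The function names and the two probability vectors are hypothetical, and the code assumes every probability lies strictly between 0 and 1.

```python
import math


def reshaped_cross_entropy(p, target_index):
    """Unweighted reshaped loss: -log(p_t') - sum over wrong classes of log(1 - p_i)."""
    loss = -math.log(p[target_index])
    for i, p_i in enumerate(p):
        if i != target_index:
            loss -= math.log(1.0 - p_i)
    return loss


def wrong_class_ratios(p, target_index):
    """Return (p_i / (1 - p_t'), p_i / p_t') for every wrong class i."""
    p_t = p[target_index]
    return [
        (p_i / (1.0 - p_t), p_i / p_t)
        for i, p_i in enumerate(p)
        if i != target_index
    ]


# Two hypothetical predictions with the same ground-truth probability 0.6.
concentrated = [0.6, 0.38, 0.02]   # confusion focused on one wrong class
spread = [0.6, 0.2, 0.2]           # confusion spread evenly

if __name__ == "__main__":
    for name, p in [("concentrated", concentrated), ("spread", spread)]:
        print(name, reshaped_cross_entropy(p, 0), wrong_class_ratios(p, 0))
```

The standard softmax cross-entropy is identical for both predictions because \(p_t^{\prime }\) is the same, whereas the reshaped loss is larger when the wrong probability mass is concentrated on a single confusing class.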
With the reshaped softmax cross-entropy and the extended focal weight as well as the discriminative weight, we define our class-discriminative focal loss as Eq. 15:
4 Experiments
To compare with the original focal loss and other rebalancing strategies, we choose RetinaNet [21] as the detection network and adopt ResNet-50 with the feature pyramid network (FPN) architecture as the backbone for the ablation study. Furthermore, to better demonstrate the effectiveness of our methods, we conduct a horizontal study in which current state-of-the-art detection networks are improved with our methods. For comprehensive evaluation, the mean average precision (mAP) is reported.
4.1 Datasets
We present experimental results on the challenging detection tasks of KITTI and BDD, since these public datasets are collected from real application environments and have extremely imbalanced distributions.
KITTI consists of 7481 images containing three object categories: car, pedestrian and cyclist. These categories are highly imbalanced, with a ratio of ground-truth counts of 17.7:2.7:1, as shown in Fig. 4. Besides, separating pedestrians from cyclists is quite difficult because of their similarity. We divide the dataset into two parts, 90% for training and 10% for validation.
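To make the reported KITTI ratio concrete, the following sketch converts the ratio 17.7:2.7:1 into per-class fractions and into normalized inverse-frequency weights. The normalization scheme and the function names are assumptions made for illustration; the sketch does not reproduce how \(\alpha \) is set in our experiments.

```python
# Ground-truth count ratio car : pedestrian : cyclist reported for KITTI.
RATIO = {"car": 17.7, "pedestrian": 2.7, "cyclist": 1.0}


def class_fractions(ratio):
    """Normalize a count ratio into per-class fractions that sum to 1."""
    total = sum(ratio.values())
    return {name: value / total for name, value in ratio.items()}


def inverse_frequency_weights(ratio):
    """Weights proportional to 1 / frequency, normalized to sum to 1 (an assumed scheme)."""
    inverse = {name: 1.0 / value for name, value in ratio.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


if __name__ == "__main__":
    print(class_fractions(RATIO))
    print(inverse_frequency_weights(RATIO))
```

Under this ratio, cars account for roughly 83% of the ground-truth objects, so an inverse-frequency scheme assigns the largest weight to the cyclist class.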
BDD has a larger amount of data: 70K images in the training set, 10K in the validation set, and 20K in the test set. The target category distribution in BDD is also highly uneven, and it contains ten target categories, more than KITTI. The distribution of target counts is shown in Fig. 5. We only use the 10K validation set for our study, since the training set and the validation set of BDD have the same uneven category distribution. As with KITTI, we divide this set into two parts, 90% for training and 10% for validation.
4.2 Implementation details
For all models except YOLOv3 [32] and YOLOv4 [1], our implementation is based on the OpenMMLab Detection Toolbox [4]; for YOLOv3 and YOLOv4, it is based on their PyTorch implementations. All models are trained with the default settings in the original code of each algorithm, adapted to run on an NVIDIA GTX 1080Ti.
4.3 Ablation study on KITTI dataset
Comprehensive experiments are conducted on both the KITTI and BDD datasets. In this section, we mainly present the ablation study on the KITTI dataset, which covers hyperparameter tuning and validation of the proposed methods, since the experiments on BDD give similar results.
4.3.1 Sigmoid focal loss
We first train the network with the original sigmoid focal loss as the baseline; the results are presented in Table 1. According to [21], the parameter \(\gamma \) cannot be set too large, and the parameter \(\alpha \) usually ranges from 0.25 to 0.9. As shown in Table 1, the original focal loss achieved a best mAP of 85.3, with the APs of car and cyclist much higher than that of pedestrian. The parameter setting of \(\gamma =1.0\) and \(\alpha =0.5\) achieved the highest APs for car and cyclist. Besides, when \(\gamma \) was set to 0 and \(\alpha \) was set to 0.75, the APs of pedestrian and cyclist declined greatly while the AP of car was unaffected. In this case only the \(\alpha \)-balance strategy is in effect, which indicates that the focal weight is effective in improving the detection accuracy of hard examples.
4.3.2 Extended sigmoid focal loss
Results using our extended sigmoid focal loss are shown in Table 2. The extended sigmoid focal loss achieved a best mAP of 85.9 with the parameter setting of \(\gamma =1.0\) and \(\alpha =0.75\). In this case, the APs of pedestrian and cyclist are highest while the AP of car is high enough, showing that hard positive examples have gotten more attention and the loss distribution is more balanced. We can see our extended sigmoid focal loss has slightly better performance than the original focal loss.
4.3.3 Extended softmax focal loss
Table 3 shows the results using our extended softmax focal loss. For the multi-class classification task, the weighting factor \(\alpha ^{\prime }\) is only applied to the ground-truth class. The best mAP of the extended softmax focal loss is 85.7 with \(\gamma =1.0\) and \(\alpha ^{\prime }=0.8\), which is higher than that of the original focal loss. When \(\gamma =0\), our loss is equivalent to the softmax cross-entropy with the \(\alpha \)-balance scheme, which outperforms the sigmoid cross-entropy. When \(\gamma =1.0\) and \(\alpha ^{\prime }=0.75\), the performance of the extended softmax focal loss is almost equal to that of the extended sigmoid focal loss. However, a small increase in \(\alpha ^{\prime }\) to 0.8 brought a considerable improvement in mAP, illustrating both the higher accuracy and the lower numerical stability of softmax cross-entropy. We also found that the best mAP of the extended softmax focal loss is slightly lower than that of the extended sigmoid focal loss. We attribute this deficiency to the traditional softmax cross-entropy, which only calculates the negative logarithmic loss of the ground-truth class.
4.3.4 Class-discriminative focal loss
Results using our class-discriminative focal loss are given in Table 4. The class-discriminative focal loss achieved a best mAP of 86.4 with \(\gamma =1.0\) and \(\alpha ^{\prime }=0.8\), surpassing the original focal loss by 1.1 mAP. The results are better than those of the extended softmax focal loss, except for the setting \(\alpha ^{\prime }=0.5\) and \(\gamma =1.0\), demonstrating the effectiveness of our approach. When \(\alpha ^{\prime }=0.25\), the model cannot converge because of the excessively imbalanced distribution of the examples. However, this situation can easily be avoided by setting \(\alpha ^{\prime }\) based on the inverse class frequency. When \(\gamma =0\) and \(\alpha ^{\prime }=0.75\), the loss function, which then uses only the \(\alpha \)-balance strategy, achieved an mAP of 86.0, outperforming the other \(\alpha \)-balance variants of both sigmoid cross-entropy and softmax cross-entropy. Focusing on the best mAP, we found that the AP of pedestrian, the most difficult class to classify, is the highest, and the gap between the APs has narrowed. These results show that our class-discriminative focal loss is able to fully exploit the relationship between foreground classes as well as to mitigate the problem of imbalanced data distribution.
4.3.5 Analysis of the various focal loss
For an in-depth understanding of the various focal loss functions, we plot the loss curves in Fig. 6. Our proposed focal loss variants converge within the same number of iterations as the original focal loss, which shows that they do not slow down training. Besides, since the loss calculation is not required in the inference stage, our method has the same inference speed as focal loss. Furthermore, we collected statistics on the loss contributions of negative examples and positive examples during training and obtained the ratio of the loss of negative examples to that of positive examples, as shown in Table 5. These ratios also reflect the loss proportion of easy examples to hard examples, because most easy examples are negative examples. For sigmoid cross-entropy, the ratio of EFL-B is 5.31:1, slightly smaller than that of FL. For softmax cross-entropy, the ratio of CDFL is 260.25:1, much smaller than that of EFL-M. These results confirm that our proposed loss functions, especially CDFL, achieve a better rebalancing of categories.
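The negative-to-positive loss ratio discussed above can be computed along the following lines. This is a minimal sketch under stated assumptions: label 0 is taken to denote background, any positive label denotes a foreground class, and the per-example loss values are hypothetical. It is not the script used to produce Table 5.

```python
def negative_to_positive_loss_ratio(losses, labels):
    """Ratio of summed loss on negatives (label 0) to summed loss on positives (label > 0)."""
    negative = sum(loss for loss, y in zip(losses, labels) if y == 0)
    positive = sum(loss for loss, y in zip(losses, labels) if y > 0)
    if positive == 0:
        raise ValueError("no positive examples in this batch")
    return negative / positive


if __name__ == "__main__":
    # Hypothetical per-example losses: many cheap negatives, one costly positive.
    losses = [0.01] * 1000 + [2.0]
    labels = [0] * 1000 + [1]
    print(negative_to_positive_loss_ratio(losses, labels))
```

A smaller ratio means that the many easy negative examples contribute less to the total loss relative to the few positive examples.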
4.3.6 Analysis of different threshold setting
According to focal loss [21], the threshold of \(p_t\) is defined as 0.5 without in-depth analysis. In this paper, we further explore the impact of different threshold settings. Based on RetinaNet+CDFL, we ran experiments with thresholds from 0.3 to 0.7 on both the KITTI and BDD datasets, and the results are shown in Table 6. The results show that 0.5 is the best threshold, which is consistent with focal loss [21].
4.3.7 Qualitative results
Some qualitative results are shown in Figs. 7 and 8. As the figures show, our proposed method can detect more of the difficult targets that are highly similar to each other, such as pedestrians and cyclists. Besides, our method produces more accurate bounding box locations.
4.4 Horizontal study on BDD and KITTI datasets
In this section, experiments are conducted on both KITTI and BDD datasets. We first compare our methods with other rebalance schemes and the results are shown in Tables 7 and 8. The CDFL outperforms all current state-of-the-art methods for weakening the damage of the class imbalance in object detection. In both Tables 7 and 8, it achieves a \(\sim \) 1.1 point mAP gap (86.4 vs. 85.3, 41.72 vs. 40.71) with the closest competitor, focal loss [21]. Compared to GHM [16], we can see a gain of 2–3.2 mAP based on CDFL.
To further verify the effectiveness of our proposed methods, we conduct thorough ablation experiments in which the proposed mechanisms are applied to current state-of-the-art detectors. Besides RetinaNet, we also apply our proposed methods to the mainstream two-stage detector Faster R-CNN, the one-stage anchor-free detector FCOS and the one-stage anchor-based detector YOLOv4. For faster analysis in our ablation experiments, we implement a simplified version of YOLOv4, namely YOLOv4m, which is identical to YOLOv4 except for the model depth and width. The results are shown in Tables 9 and 10. When trained with the proposed mechanisms, the baseline networks achieve significant performance gains: 1.01/1.01 for RetinaNet, 0.62/0.21 for Faster R-CNN, and 0.56/0.89 for FCOS on KITTI/BDD. Although YOLOv4m+EFL-B shows a slight performance degradation on KITTI, it shows a significant performance improvement on BDD, which has a more serious category imbalance. As shown in Table 11, the mAPs of minor classes, such as pedestrian and cyclist in KITTI as well as rider and motor in BDD, improve remarkably. These results confirm that our proposed methods, namely EFL-B and CDFL, can effectively improve the performance of mainstream one-stage as well as two-stage detectors in imbalanced application scenarios.
5 Conclusions
In this work, we analyse the limitations of existing rebalancing schemes for object detection in view of practical, extremely imbalanced scenarios and the multi-class classification task. To address them, we propose an extended focal loss to further mitigate the foreground–background class imbalance. Moreover, we propose the class-discriminative focal loss by introducing the extended focal loss to the multi-class classification task and reformulating the standard softmax cross-entropy loss, which improves the discriminability of foreground categories and thereby reduces the foreground–foreground class imbalance. Extensive experiments conducted on the KITTI and BDD datasets show that our approach surpasses the state-of-the-art method, focal loss, with no additional training or inference time. Besides, our method is easy to generalize and apply to current state-of-the-art one-stage or two-stage object detectors, with which it achieves the best performance.
References
Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
Bulo, S.R., Neuhold, G., Kontschieder, P.: Loss max-pooling for semantic image segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 7082–7091. IEEE (2017)
Chen, C., Song, X., Jiang, S.: Focal loss for region proposal network. In: Pattern Recognition and Computer Vision—First Chinese Conference, pp. 368–380. Springer, Berlin (2018)
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
Chen, W., Huang, H., Peng, S., Zhou, C., Zhang, C.: Yolo-face: a real-time face detector. Visual Comput., 1–9 (2020)
Dai, J., Li, Y., He, K., Sun, J.: R-fcn: Object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems, pp. 379–387 (2016)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 886–893 (2005)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning: Springer Series in statistics. Springer, Berlin (2001)
Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659 (2017)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)
Girshick, R.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1106–1114 (2012)
Li, B., Liu, Y., Wang, X.: Gradient harmonized single-stage detector. In: The Thirty-Third AAAI Conference on Artificial Intelligence, vol. 33, pp. 8577–8584 (2019)
Li, T., Ye, M., Ding, J.: Discriminative hough context model for object detection. Visual Comput. 30(1), 59–69 (2014)
Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J.: Light-head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264 (2017)
Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid object detection. In: Proceedings of the 2002 International Conference on Image Processing, pp. 900–903. IEEE (2002)
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 936–944 (2017)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: IEEE International Conference on Computer Vision, pp. 2999–3007 (2017)
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: 13th European Conference on Computer Vision, pp. 740–755. Springer, Berlin (2014)
Liu, B., Wu, H., Su, W., Zhang, W., Sun, J.: Rotation-invariant object detection using sector-ring hog and boosted random ferns. Visual Comput. 34(5), 707–719 (2018)
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: 14th European Conference on Computer Vision, pp. 21–37. Springer, Berlin (2016)
Ojala, T., Pietikäinen, M., Mäenpää, T.: Gray scale and rotation invariant texture classification with local binary patterns. In: 6th European Conference on Computer Vision, pp. 404–420. Springer, Berlin (2000)
Oksuz, K., Cam, B.C., Kalkan, S., Akbas, E.: Imbalance problems in object detection: a review. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
Papageorgiou, C.P., Oren, M., Poggio, T.: A general framework for object detection. In: Proceedings of the Sixth International Conference on Computer Vision, pp. 555–562. IEEE (1998)
Pinheiro, P.O., Collobert, R., Dollár, P.: Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp. 1990–1998 (2015)
Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P.: Learning to refine object segments. In: 14th European Conference on Computer Vision, pp. 75–91. Springer, Berlin (2016)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 6517–6525 (2017)
Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229 (2013)
Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769 (2016)
Tian, Z., Shen, C., Chen, H., He, T.: Fcos: Fully convolutional one-stage object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 9627–9636 (2019)
Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)
Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 511–518 (2001)
Weber, M., Fürst, M., Zöllner, J.M.: Automated focal loss for image based object detection. arXiv preprint arXiv:1904.09048 (2019)
Wei, L., Cui, W., Hu, Z., Sun, H., Hou, S.: A single-shot multi-level feature reused neural network for object detection. Visual Comput. 1–10 (2020)
Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2636–2645 (2020)
Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: 13th European Conference on Computer Vision, pp. 391–405. Springer, Berlin (2014)
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Chen, G., Qin, H. Class-discriminative focal loss for extreme imbalanced multiclass object detection towards autonomous driving. Vis Comput 38, 1051–1063 (2022). https://doi.org/10.1007/s00371-021-02067-9