
1 Introduction

Fig. 1.

Our Distribution-Balanced Loss performs re-balanced weighting along with re-sampling that takes label co-occurrence into consideration, and it leverages negative-tolerant regularization to avoid over-suppression of the negative labels caused by the dominance of negative classes in binary cross entropy (BCE)

Along with the wide adoption of deep learning, recent years have seen great progress in visual recognition, especially remarkable breakthroughs in classification tasks. However, mainstream benchmarks are often constructed under two common conditions: 1) all classes have comparable numbers of instances and 2) each instance belongs to a unique class. While providing a clean setting for various studies, these conventional conditions conceal a number of complexities that often arise in real-world applications  [14, 16, 25, 26]. In contrast, the distribution of different object categories typically exhibits a long tail in practical contexts, while individual images can generally be associated with multiple semantic labels  [15, 23, 24, 35]. Previous works  [1, 12, 17] have repeatedly shown that such issues can cause a substantial performance drop if not appropriately handled.

A widely adopted approach to the multi-label problem is to use binary cross-entropy  [7] in place of the softmax loss, and to use class-specific re-weighting to balance the contributions of different classes, e.g. setting the class weights to be inversely proportional to the class sizes. Such simple methods often result in limited improvement, as they fail to take into account two important issues, namely label co-occurrence and the dominance of negative labels.

First, label co-occurrence is very common in natural images. For example, an image that contains unusual concepts, e.g. “tigers” and “leopards”, is likely to also be associated with more common labels, e.g. “trees” and “river”. Therefore, re-sampling such images may not necessarily result in a more balanced distribution of classes. Second, each image is usually associated with a very small fraction of all the classes in the list. Consequently, given an image, most classes are negative. However, the binary cross entropy (BCE) loss is designed to be symmetric, where positive and negative classes are treated uniformly. This symmetric formulation, in conjunction with the dominant portion of negative classes, leads to over-suppression of the negative side, thus introducing significant bias to the classification boundaries.

In response to the issues above, we propose a new loss function, called Distribution-Balanced Loss. This loss function consists of two key modifications to the standard BCE loss: 1) re-balanced weighting, which adjusts the weights in a way that closes the gap between expected sampling times and actual sampling times, with label co-occurrence taken into account; and 2) negative-tolerant regularization, which avoids over-suppression of the negative labels by setting a margin and a re-scaling factor. Experiments on two multi-label recognition benchmarks, i.e.  Pascal VOC  [8] and MS COCO  [22], show that the proposed loss achieves remarkable improvement over previous methods.

2 Related Work

Previous works on long-tailed recognition  [18, 26, 33] mainly follow two directions: re-sampling and cost-sensitive learning. In addition, many efforts have been dedicated to the multi-label classification task.

Re-sampling. To achieve a more balanced distribution, researchers have proposed to either over-sample the minority classes  [1, 2, 30], or under-sample the majority classes  [1, 10, 17]. The downside of the former is that it might lead to over-fitting on minority classes with duplicated samples, while the latter might weaken feature learning capacity due to omitting a number of valuable instances. While previous works mainly focus on single label datasets, we extend re-sampling to the multi-label scenario.

Cost-Sensitive Learning. Assigning different costs to different training samples has proved to be an effective strategy for dealing with imbalanced data. Typically, researchers apply class-level re-weighting in inverse proportion to class frequency  [13, 33], or to its square root for smoothing. Recently, Cui et al   [5] proposed to re-weight by the inverse of the effective number of samples, and Cao et al   [3] emphasized a larger margin for rare classes. Further, various works adopted sample-level control of cost based on individual properties, e.g. example difficulty  [21], estimated Bayesian uncertainty  [19], or gradient direction  [29]. Our method applies re-weighting based on class frequency and individual ground-truth labels, and also modifies the loss gradient with a regularization for better optimization.

Multi-label Classification. Earlier solutions for multi-label recognition include decomposing it into independent binary classification tasks  [31] and the k-nearest-neighbor method ML-kNN  [36], etc. Recently, many approaches attempted to take label relationships into consideration to better regularize the embedding space. CNN-RNN  [32] utilized RNNs combined with a CNN to learn a joint image-label embedding, and Wang et al   [34] took advantage of a spatial transformer layer and long short-term memory (LSTM) units to capture contextual dependencies. There is also a popular trend of modeling label correlation with graph structures  [4, 20]. Our method is based on the widely used binary cross-entropy loss  [7] and gains improvement by combining it with re-sampling and re-weighting.

3 Distribution-Balanced Loss

The problem we address here is how to train a model effectively when training samples follow a long-tailed distribution. Suppose the dataset we use is \(\mathcal {D}= \{(\mathbf {x}^1, \mathbf {y}^1), \cdots , (\mathbf {x}^N, \mathbf {y}^N)\}\), where N is the number of training samples and \(( \mathbf {x}^k, \mathbf {y}^k ),k \in \{1, ... , N\}\) is a sample-label pair. Denoting the number of classes as C, we have \(\mathbf {y}^k = [y^k_1, \cdots , y^k_C] \in \{0,1\}^C\). Let \(n_i = \sum _{k=1}^{N} y_i^k\) denote the number of training examples that contain class i. Note that \(N \le \sum _{i=0}^C n_i\), since a single example can be counted several times, once for each class it contains.

As mentioned before, our distribution-balanced loss consists of two components, namely re-balanced weighting and negative-tolerant regularization. In Sect. 3.1, we introduce why a re-balanced weight is needed in long-tailed multi-label classification and derive its optimal value mathematically. In Sect. 3.2, we demonstrate the over-suppression of negative samples brought by sigmoid and how to overcome it with our negative-tolerant regularization. Finally, these two components can be integrated into a unified loss function, i.e. the distribution-balanced loss, for end-to-end training, which is shown in Sect. 3.3.

Fig. 2.

Visualization of the class-aware re-sampling procedure, and the sample number distribution before (left) and after (right) re-sampling. The distribution may not necessarily be balanced due to label co-occurrence, and the unexpected sampling by associated labels introduces inner-class imbalance

3.1 Re-balanced Weighting After Re-sampling

The most common sampling rule is to select each example from the training set with equal probability, so that the probability of a sampled example containing class i is \(p_i = n_i/N\). To alleviate the discrepancy in sampling probability among classes caused by the data distribution, many re-sampling strategies have been proposed. One popular strategy is known as class-aware sampling  [30]. It first uniformly samples a class from the whole C classes, and then randomly samples an example from the selected class. This process runs iteratively in each training epoch. Let \(N_e\) denote the number of times each class is visited by the iterator in one epoch, which is usually set as \(N_e = \max (n_1, \cdots , n_C)\). In cases of extreme imbalance, \(N_e\) can be set smaller to control the data scale in one epoch.
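To make the procedure concrete, below is a minimal sketch of such a class-aware sampler (an illustration under our own naming, not the authors' implementation); `labels` holds the ground-truth class set of each training sample, and `num_visits_per_class` plays the role of \(N_e\):

```python
import random
from collections import defaultdict

def class_aware_sample_indices(labels, num_visits_per_class):
    """Sketch of class-aware sampling: visit every class N_e times and
    draw one sample uniformly from that class on each visit."""
    # Index from each class to the samples that contain it.
    class_to_samples = defaultdict(list)
    for idx, label_set in enumerate(labels):
        for c in label_set:
            class_to_samples[c].append(idx)

    indices = []
    for _ in range(num_visits_per_class):
        for samples in class_to_samples.values():
            indices.append(random.choice(samples))
    random.shuffle(indices)
    return indices

# Toy usage: four multi-label samples over three classes.
labels = [{0, 1}, {0}, {1, 2}, {2}]
epoch_indices = class_aware_sample_indices(labels, num_visits_per_class=2)
```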

However, in the multi-label scenario, an example usually contains several ground-truth labels, so the selections for different classes are no longer independent. That is to say, re-sampling instances from one specific class inevitably influences the sample numbers of the other co-occurring classes. This leads to the following problems. First, it induces inner-class imbalance, because samples within a class are no longer selected with equal probability. More importantly, the class imbalance is not necessarily eliminated and may even be exaggerated, as explained below.

In fact, the numbers of samples for different classes after re-sampling do not follow a uniform distribution as expected. Here we estimate them using the label co-occurrence statistics of the original training set. Let p(i|j) be the conditional probability that an instance contains label i given that it contains label j, so that \(p(i|j) = n_{i\cap j}/n_j\), where \(n_{i\cap j}\) denotes the number of examples that contain both label i and label j. Therefore, when we randomly choose a class and sample an instance from it, the probability that the instance contains label i is given by Eq. 1.

$$\begin{aligned} \hat{p}_i = \frac{1}{C}\sum _{j=0}^Cp(i|j) =\frac{1}{C}\sum _{j=0}^C{\frac{n_{i\cap j}}{n_j}} \end{aligned}$$
(1)
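As a minimal sketch (our illustration, not from the paper), \(\hat{p}_i\) can be estimated directly from the binary label matrix of the training set:

```python
import numpy as np

def expected_class_probs_after_resampling(Y):
    """Estimate p_hat_i of Eq. (1) from a binary label matrix Y of shape (N, C)."""
    n = Y.sum(axis=0)                          # n_j: samples containing class j
    co = Y.T @ Y                               # co[i, j] = n_{i and j}
    cond = co / np.maximum(n[None, :], 1)      # cond[i, j] = p(i | j)
    return cond.mean(axis=1)                   # average over the uniformly chosen class j

# Toy example: four samples, three classes.
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
print(expected_class_probs_after_resampling(Y))
```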

The class distribution after re-sampling is shown in Fig. 2, and the theoretical estimate matches our statistics of the data sampled in one epoch during the real training procedure. Based on this distribution, we propose a re-balanced weighting strategy to overcome the extra imbalance caused by re-sampling. First, without taking label co-occurrence into consideration, for each instance k and class i with \(y^k_i=1\), the expected class-level sampling frequency can be calculated as \(P^C_i(x^k)\) in Eq. 2. Then, given an instance \(x^k\) and its corresponding label \(y^k\), it is expected to be repeatedly sampled via each positive class i it contains, so the expected instance-level sampling frequency can be estimated as \(P^I(x^k)\) in Eq. 2. Correspondingly, we define a re-balancing weight, namely \(r^k_i\), to close the gap between expected sampling times and actual sampling times, as shown in Eq. 3.

$$\begin{aligned} P^C_i(x^k)=\frac{1}{C}\frac{1}{n_i}, \ P^I(x^k) = \frac{1}{C}\sum _{y^k_i = 1}\frac{1}{n_i} \end{aligned}$$
(2)
$$\begin{aligned} r^k_i = \frac{P^C_i(x^k)}{P^I(x^k)} \end{aligned}$$
(3)
$$\begin{aligned} \hat{r} = \alpha + \frac{1}{1 + \exp (-\beta \times (r - \mu ))} \end{aligned}$$
(4)

However, the weight elements are sometimes close to zero, which may increase the difficulty of optimization. To make the optimization process stable, we further design a smoothing function to map r into a proper range of values, as shown in Eq. 4. Here \(\alpha \) is an overall lift in weight, while \(\beta \) and \(\mu \) control the shape of the mapping function, which rises rapidly near 0 and flattens near 1.
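A hedged sketch of Eqs. 2-4 for a single instance is given below; `y` and `n` denote the instance's binary label vector and the per-class counts, and the default hyper-parameter values are the ones later reported for VOC-MLT:

```python
import numpy as np

def rebalanced_weights(y, n, alpha=0.1, beta=10.0, mu=0.3):
    """Sketch of Eqs. (2)-(4): smoothed re-balancing weights for one instance."""
    C = len(n)
    p_class = 1.0 / (C * n)                   # P^C_i(x^k), Eq. (2)
    p_instance = (y / n).sum() / C            # P^I(x^k),   Eq. (2)
    r = p_class / p_instance                  # r^k_i,      Eq. (3)
    return alpha + 1.0 / (1.0 + np.exp(-beta * (r - mu)))   # Eq. (4)

n = np.array([775.0, 120.0, 4.0])   # head, medium and tail class sizes
y = np.array([1.0, 0.0, 1.0])       # instance contains a head and a tail class
print(rebalanced_weights(y, n))     # one weight per class, used for positives and negatives
```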

Finally, the loss function, which we name Re-balanced-BCE, becomes Eq. 5, where \(z^k\) denotes the output of the classifier.

$$\begin{aligned} \mathcal {L}_{R-BCE}(x^k, y^k) = \frac{1}{C}\sum _{i=0}^C{ \left[ y^k_i \log (1 + e^{-z^k_i}) + (1 - y^k_i)\log (1 + e^{z^k_i})\right] \times \hat{r}^k_i} \end{aligned}$$
(5)

It is worth noting that \(\hat{r}^k_i\) is applied to both positive and negative labels, although it was originally derived from the sampling procedure regarding only the positive ones, in order to keep consistency at the class level.

Fig. 3.

Visualization of the gradient with respect to a negative logit. (a) The gradient for the CE loss can be relatively small given a high positive logit; (b) for the BCE loss it is only affected by the negative logit itself, which results in continuous suppression; (c) NTR encourages a sharp decrease when the logit is lower than a threshold and slows down the optimization

3.2 Negative-Tolerant Regularization

As mentioned above, the binary cross entropy (BCE) loss, which is widely used for multi-label classification, sometimes suffers from over-suppression of negative labels because of the dominance of negative classes. To be more specific, BCE treats the recognition task as a series of binary classification tasks, calculating independent class-wise probabilities with the sigmoid function. In contrast, the cross entropy (CE) loss, which is popular in single-label classification, utilizes softmax to emphasize mutual exclusion. Unlike softmax, where the optimization step becomes rather small once the logit of the positive class is much higher than those of the negative classes, sigmoid treats the classes independently and encourages the logits of both positive and negative classes to move away from zero in the same gradient-descent manner. The difference between them can be observed in their gradients, shown in Eq. 6 and visualized in Fig. 3(a)(b).

$$\begin{aligned} \left\{ \begin{aligned} \frac{\partial {\mathcal {L}_{CE}(z_j, y)}}{\partial {(z_j)}} = \frac{e^{z_j}}{\sum _{i=0}^Ce^{z_i}}, y_j=0 \\ \frac{\partial {\mathcal {L}_{BCE}(z_j, y)}}{\partial {(z_j)}} = \frac{1}{C}\frac{e^{z_j}}{1+e^{z_j}}, y_j=0 \end{aligned} \right. \end{aligned}$$
(6)
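The contrast in Eq. 6 can be checked numerically; the snippet below (an illustration of ours, not the paper's code) compares the gradient on a negative logit under softmax cross entropy and under BCE, assuming one positive class with a large logit:

```python
import torch
import torch.nn.functional as F

C = 5
z = torch.tensor([6.0, 0.0, 0.0, 0.0, 0.0], requires_grad=True)  # class 0 is positive

# Softmax CE: the gradient on a negative logit shrinks once the positive
# logit dominates (first line of Eq. 6): e^0 / (e^6 + 4) ~ 0.0025.
ce = F.cross_entropy(z.unsqueeze(0), torch.tensor([0]))
grad_ce = torch.autograd.grad(ce, z)[0]

# BCE (mean over classes): the gradient on a negative logit depends only on
# that logit (second line of Eq. 6): (1/C) * e^0 / (1 + e^0) = 0.1.
y = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])
bce = F.binary_cross_entropy_with_logits(z, y)
grad_bce = torch.autograd.grad(bce, z)[0]

print(grad_ce[1].item(), grad_bce[1].item())
```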

A straightforward consequence is that the classifiers for the tail classes would over-fit to a limited number of positive samples in the feature space, and meanwhile they would push a huge number of negative samples away to produce lower logits. This can be regarded as class-specific over-fitting for the tail categories, which leads to poor generalization of the classifiers. As shown in Fig. 1(c), the output distribution becomes sharp, and the predictions for the testing samples are easily influenced by the head classes.

To address the problem, we need a regularization to overcome the over-suppression. Specifically, the loss contributed by a negative logit should drop sharply once the logit is optimized below a threshold, so that the resulting small gradient prevents the logit from being continuously suppressed. Based on this idea, we propose negative-tolerant regularization (NTR): we first use a non-zero bias initialization to act as the threshold, and then apply a linear scaling to the negative logits before they enter the standard BCE, together with a regularization parameter that constrains the gradient between 0 and 1. The Negative-Tolerant-BCE thus becomes Eq. 7.

$$\begin{aligned} \mathcal {L}_{NT-BCE}(x^k, y^k) = \frac{1}{C}\sum _{i=0}^C{y^k_i \log (1+e^{-(z^k_i-\nu _i)}) + \frac{1}{\lambda }(1 - y^k_i)\log (1+e^{\lambda (z^k_i-\nu _i)})} \end{aligned}$$
(7)

\(\lambda \) is the scale factor that affects the loss gradient as shown in Fig. 3(c), controlling how “tolerant” we are to \(z_i\), and \(\nu \) is a class-specific bias. The design of \(\nu \) is supposed to take the intrinsic model bias into consideration. Concretely, a network trained with imbalanced data is likely to give passive predictions on the tail classes on average, so the thresholds for them should correspondingly be lower, ensuring that they are not too easily reached. This shares a similar idea with  [3] that a larger margin is needed for rare classes. Assuming that we use a fully-connected layer as the classifier, the intrinsic bias of the model can be estimated by minimizing the loss function at the very beginning of training, where the classifiers are randomly initialized and the dot-product distance between classifier vectors and instance features is zero on average. For a regular BCE loss, considering the bias \(b_i\) as the only variable and assuming the class prior to be \(p_i = n_i/N_0\), we can deduce an approximation of the average loss for class i:

$$\begin{aligned} L_{i} = p_i \log (1 + e^{-b_i}) + (1 - p_i) \log (1 + e^{b_i}) \end{aligned}$$
(8)
$$\begin{aligned} \hat{b}_i = -\log (\frac{1}{p_i} - 1), \ \nu _i = -\kappa ~\hat{b}_i \end{aligned}$$
(9)

Eq. 8 is minimized at \(\hat{b}_i\), and we use \(\kappa \) as a scale factor to obtain \(\nu _i\), which is then applied in Eq. 7.
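For completeness, the minimizer in Eq. 9 follows from setting the derivative of Eq. 8 to zero, with \(\sigma \) denoting the sigmoid function:

$$\begin{aligned} \frac{\partial L_i}{\partial b_i} = -p_i\,\bigl (1-\sigma (b_i)\bigr ) + (1-p_i)\,\sigma (b_i) = 0 \;\Rightarrow \; \sigma (\hat{b}_i) = p_i \;\Rightarrow \; \hat{b}_i = \log \frac{p_i}{1-p_i} = -\log \Bigl (\frac{1}{p_i}-1\Bigr ) \end{aligned}$$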

Fig. 4.

Pipeline of the training procedure. Given a mini-batch of instances drawn by class-aware sampling, the calculation of the re-balanced weight is shown in the upper stream, while NTR is shown in the lower. The two techniques are combined into our final distribution-balanced loss

3.3 Distribution-Balanced Loss

So far, R-BCE performs a re-balanced weighting strategy, where the weight vector is fixed given an instance, while NT-BCE regularizes the classifier outputs and affects the training by modifying the loss gradient. They can be naturally integrated into a unified loss function for end-to-end training, as shown in Fig. 4, and we finally obtain our distribution-balanced loss as Eq. 10.

$$\begin{aligned} \mathcal {L}_{DB}(x^k, y^k) = \frac{1}{C}\sum _{i=0}^C \hat{r}^k_i \left[ y^k_i \log (1+e^{-(z^k_i-\nu _i)}) + \frac{1}{\lambda }(1 - y^k_i)\log (1+e^{\lambda (z^k_i-\nu _i)}) \right] \end{aligned}$$
(10)

The DB loss helps to smooth the distribution of the classifier outputs, especially for the tail classes. It achieves superior performance on multi-label datasets with a long-tailed distribution, as will be validated in Sect. 4.
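To make the combination concrete, below is a hedged PyTorch sketch of Eq. 10 (not the authors' released implementation); `class_freq` holds the per-class counts \(n_i\), `train_num` is the total number of training images used for the class prior, each instance is assumed to carry at least one positive label, and the default hyper-parameters are the COCO-MLT values reported in Sect. 4.3:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistributionBalancedLoss(nn.Module):
    """Sketch of the DB loss in Eq. (10); assumes float multi-hot targets."""
    def __init__(self, class_freq, train_num,
                 alpha=0.1, beta=10.0, mu=0.2, lam=2.0, kappa=0.05):
        super().__init__()
        self.register_buffer('n', class_freq.float())
        self.alpha, self.beta, self.mu, self.lam = alpha, beta, mu, lam
        # Class-specific bias nu_i from Eqs. (8)-(9), with p_i = n_i / N_0.
        p = self.n / float(train_num)
        b_hat = -torch.log(1.0 / p - 1.0)
        self.register_buffer('nu', -kappa * b_hat)

    def forward(self, logits, targets):
        C = logits.size(1)
        # Re-balancing weights, Eqs. (2)-(4), applied to every class.
        p_class = 1.0 / (C * self.n)                               # (C,)
        p_inst = (targets / self.n).sum(dim=1, keepdim=True) / C   # (B, 1)
        r = p_class.unsqueeze(0) / p_inst                          # (B, C)
        r_hat = self.alpha + torch.sigmoid(self.beta * (r - self.mu))

        # Negative-tolerant BCE, Eq. (7), on the shifted logits z - nu.
        z = logits - self.nu.unsqueeze(0)
        pos = targets * F.softplus(-z)                             # log(1 + e^{-(z - nu)})
        neg = (1.0 - targets) * F.softplus(self.lam * z) / self.lam
        return (r_hat * (pos + neg)).sum(dim=1).div(C).mean()
```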

4 Experiments

4.1 Datasets

The proposed Distribution-Balanced Loss is analyzed on artificially created long-tailed versions of two multi-label image recognition benchmarks, named VOC-MLT and COCO-MLT, respectively. They are subsets sampled from the original datasets following a Pareto distribution \(pdf(x) = \alpha \frac{x_{min}^\alpha }{x^{\alpha + 1}}\), as in  [26], where \(\alpha \) controls how fast the data scale decays. Regarding the interaction of sampling among classes, we construct the datasets in a head-to-tail manner so that we can strictly constrain the scale of the tail classes: we first rank all the classes by \(\hat{p}_i\) calculated on the original data, and for each class i from head to tail, we add or eliminate instances that contain class i from the subset by referring to the expected distribution. Details on the construction of VOC-MLT and COCO-MLT can be found in the appendix.

VOC-MLT. We construct the long-tailed version of VOC  [8] from its 2012 train-val set, with the power parameter \(\alpha =6\). It contains 1,142 images from 20 classes, with a maximum of 775 images per class and a minimum of 4 images per class. We evaluate the performance on the VOC2007 test set with 4,952 images.

COCO-MLT. The long-tailed version of MS COCO-2017  [22] is created with the same \(\alpha \), containing 1,909 images from 80 classes. The maximum number of training images per class is 1,128 and the minimum is 6. We use the test set of COCO2017 with 5,000 images for evaluation. It is worth noting that the test sets of COCO and VOC are not perfectly balanced; they share a similar distribution with the original training sets. However, the per-class ranking of sample scale in both the original training and test sets is roughly consistent with the long-tailed version.

4.2 Experimental Settings

Evaluation Metrics. Following  [26], we split the classes into three groups by the number of their training examples: head classes each contain over 100 samples, medium classes each have between 20 and 100 samples, and tail classes each have under 20 samples. We evaluate mean average precision (mAP) over all the classes, and we also report mAP for each subset to observe how the techniques affect them.
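A small sketch of this protocol is shown below, assuming per-class average precision from scikit-learn; the inclusive/exclusive boundaries of the splits are our reading of the description above:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def group_map(scores, labels, class_counts):
    """mAP over all classes and over the head/medium/tail splits.
    scores, labels: (num_test, C) arrays; class_counts: (C,) training counts n_i."""
    ap = np.array([average_precision_score(labels[:, i], scores[:, i])
                   for i in range(labels.shape[1])])
    head, tail = class_counts > 100, class_counts < 20
    medium = ~head & ~tail
    return ap.mean(), ap[head].mean(), ap[medium].mean(), ap[tail].mean()
```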

Comparing Methods. We compare our method with several state-of-the-art techniques dealing with multi-label classification or long-tailed recognition. We also report the results of their effective combinations for fair comparison. The standard binary cross entropy loss with the sigmoid function is used or modified by all the methods. The compared methods include: (1) Empirical risk minimization (ERM): the plain model with all examples having the same weight and sampling probability. (2) Re-weighting (RW): we perform a smooth version of re-weighting, inversely proportional to the square root of class frequency, and we normalize the weights to be between 0 and 1 in a mini-batch. (3) Re-sampling (RS)  [30]: we use class-aware re-sampling without extra tricks as a baseline, and we also evaluate the combination of RS and other techniques in comparison. (4) ML-GCN  [4]: a recently proposed method for multi-label classification with a graph convolutional network (GCN). (5) Focal loss  [21]: we use \(\gamma =2\) with a balance parameter of 2 for focal loss. (6) Class-balanced loss (CB)  [5]: a class-wise re-weighting guided by the effective number of each class \(E_n=(1-\beta ^n)/(1-\beta )\). (7) Label-distribution-aware margin loss (LDAM)  [3]: a recently proposed margin loss that has proved effective for softmax classifiers.

Implementation Details. We adopt ResNet-50  [11] pretrained on ImageNet  [6] as the backbone feature extractor, followed by global average pooling and a \(2048 \times 256\) fully connected (FC) layer to obtain image-level features. The final classifier is a \(256 \times C\) FC layer which outputs the logits. The input images are organized with a batch size of 32, randomly cropped and resized to \(224\times 224\) together with standard data augmentation. We use SGD with momentum of 0.9 and weight decay of \(1\times 10^{-4}\) as our optimizer, and we also use a linear warm-up learning rate schedule  [9] for the first 500 iterations with a ratio of \(\frac{1}{3}\). Models trained without re-sampling are trained for 80 epochs with an initial learning rate of 0.02, which decays by a factor of 10 after 55 and 70 epochs, respectively. Re-sampling-enhanced methods are trained for 8 epochs with the same initial learning rate, which decays after 5 and 7 epochs, respectively. We use class-aware re-sampling  [30], and the number of times the iterator visits each class in one epoch is set to \(\frac{N_{max}}{4}\), where \(N_{max}\) denotes the maximum number of training samples per class. When \(N_e<N_{max}\), it effectively controls how we over-sample the tail classes and under-sample the head classes that do not co-occur with tail ones. The experiments are implemented in PyTorch.
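A sketch of this optimization setup with standard PyTorch components is given below; the warm-up is implemented as a manual multiplicative factor, which is one plausible reading of the schedule in  [9] rather than the authors' exact code:

```python
import torch

def build_optimizer(model, resampled=True):
    """Sketch of the schedule in Sect. 4.2 (not the authors' code)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                                momentum=0.9, weight_decay=1e-4)
    # Decay by 10x after epochs 55/70 (80-epoch run) or 5/7 (8-epoch run).
    milestones = [5, 7] if resampled else [55, 70]
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1)

    def warmup_factor(it, warmup_iters=500, ratio=1.0 / 3):
        # Linearly ramp the learning rate from lr * ratio to lr over 500 iterations.
        return 1.0 if it >= warmup_iters else ratio + (1.0 - ratio) * it / warmup_iters

    return optimizer, scheduler, warmup_factor
```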

Table 1. Experimental results in mAP of our method and other compared approaches on VOC-MLT and COCO-MLT. We evaluate the results on the whole class set and on the three subsets, respectively

4.3 Benchmarking Results

VOC-MLT. VOC-MLT contains 6, 6, and 8 classes in the head, medium, and tail groups, respectively. We tuned the hyper-parameters of the other methods so that they work best on our dataset. For our method, we choose \(\alpha =0.1\), \(\beta =10\), and \(\mu =0.3\) for the smoothing function during re-balanced weighting, and we set \(\lambda =5, \kappa =0.05\) for NTR. The experimental results compared with other traditional and state-of-the-art approaches are shown in Table 1. It is worth noting that, unlike COCO-MLT, whose tail classes always co-occur with head classes, the tail classes of VOC usually appear as single labels, which lowers the complexity and difficulty of the classification task on them and brings higher performance for the tail that even outperforms the head. This characteristic also alleviates the inner-class imbalance caused by head-tail connection; note that after regular re-sampling, all subsets including the head improve, especially the tail classes. Compared with the best baseline, re-sampling trained with focal loss (RS-Focal), our re-balanced weighting strategy gains a further improvement of about \(1.0\%\) in total mAP, with \(0.4\%\) and \(2.2\%\) for head and tail classes and a drop of \(0.3\%\) for medium classes. Our final DB-Loss further achieves remarkable improvements of \(1.5\%\) compared with R-BCE, and of \(0.8\%\), \(1.0\%\) and \(2.6\%\) for the three subsets, respectively. It can be seen that NTR is especially beneficial for the tail classes.

Fig. 5.

In the matrix heatmap, the element in the \(i^{th}\) column, \(j^{th}\) row represents the conditional probability of class existence, p(i|j), which is usually higher for head and medium classes. The histograms below show the data distribution after re-sampling, confirming that the imbalance is not eliminated by re-sampling. The sampling frequency for each instance is different, and the line chart shows the variance within each class. The high variance at the head indicates inner-class imbalance

COCO-MLT. The 80 classes of COCO-MLT are split into 22, 33, and 25 classes for the head, medium, and tail groups, respectively. We choose \(\alpha =0.1\), \(\beta =10\), and \(\mu =0.2\) for the smoothing function during re-balanced weighting, and we set \(\lambda =2, \kappa =0.05\) for NTR. The experimental results compared with other traditional and state-of-the-art approaches are shown in Table 1. COCO-MLT has a heavy head-tail connection, i.e. some tail classes have a \(100\%\) probability of co-occurring with certain head classes, as shown in Fig. 5. A direct result of class-aware re-sampling is a sharp rise in mAP for the tail classes, while the performance for head classes drops by about \(0.9\%\). With re-balanced weighting, the negative effect on the head classes is remedied, and mAP for head, medium, and tail classes all improve, by \(2.3\%\), \(1.1\%\) and \(2.1\%\), respectively. With focal loss combined with re-sampling trained with either BCE or R-BCE, we see an extra improvement. Replacing R-BCE with DB-Loss further brings an average improvement of about \(0.8\%\), and raises mAP for the tail classes by about \(1.1\%\).

4.4 Ablation Study

Visualization of the Imbalance Caused by Re-sampling. We visualize the conditional probability matrix that reveals the label co-occurrence relationship in Fig. 5. As can be seen, the most frequently appearing classes usually have the highest co-existence probability conditioned on other classes. This makes them repeatedly sampled, and the imbalance is not eliminated after re-sampling. We also roughly estimate the inner-class imbalance by the variance of normalized sampling times: for each class and each of the instances containing it, we calculate its expected sampling times. We normalize them within a class to a mean value of 1, and the variance of the normalized sampling times is calculated, as shown in Fig. 5. The variance can roughly represent the extent of imbalance in sampling. Classes with high variance gain little or negative increment in mAP despite heavy sampling on them. A more precise treatment combining sampling variance and data scale is out of the scope of this paper and is reserved for future work.

Fig. 6.

Corresponding to the four training strategies, i.e. ERM, regular class-aware re-sampling, re-sampling with R-BCE, and the final DB-Loss, we show the per-class mAP increments between consecutive steps to evaluate our pipeline piecemeal, presented from left to right. Results on COCO-MLT are on the left and VOC-MLT on the right

Step-Wise Evaluation of Our Framework. We perform a step-wise evaluation on the test set by showing the per-class mAP increments, to better understand how re-balanced weighting and negative-tolerant regularization work on different parts of the dataset distribution. As shown in Fig. 6 and mentioned above, regular re-sampling is not friendly to head and medium classes, with little or negative increment, while re-balanced weighting brings a general improvement across classes and amends the performance drop caused by regular re-sampling. Negative-tolerant regularization also benefits a large range of classes, and it leads to a remarkable improvement for tail classes as we expected, indicating an improved generalization ability once the suppression of negative samples is relaxed.

The Combination of Re-sampling and Various Re-weighting Methods. Re-sampling and traditional re-weighting methods based on the original distribution share the similar idea of assigning more importance to the rare classes, and they are usually applied at the instance level. As a result, their combinations are at risk of redundancy: the head classes are over-ignored and the tail classes are over-emphasized. Our re-balanced weight is also calculated from the training distribution, but it is designed to fine-tune and enhance re-sampling rather than duplicate its effect. We therefore combine the traditional re-weighting methods with re-sampling for comparison, as shown in Table 2. Our superior performance shows the benefit of applying R-BCE to re-sampling.

Table 2. Experimental results on re-sampling combined with several re-weighting techniques. CB loss with focal is reported by  [5] to perform better, so all the other techniques are enhanced with focal loss for fair comparison

4.5 Further Analysis

The Effect of Hyper-Parameters of the Smoothing Function. The smoothing function in Eq. 4 has three hyper-parameters: \(\alpha \) applies an overall lift in weight, while \(\beta \) and \(\mu \) control the shape of the mapping function. We report the results for \(\beta \) with \(\mu =0.2\) fixed, as shown in Fig. 7(a).

Fig. 7.

(a) We show how mAP is affected by \(\beta \) independently. The total mAP has a peak at around \(5<\beta <10\) for both datasets, and we observe an inverse tendency of results for different subsets when \(\beta >5\). (b) We show how mAP is affected by \(\lambda \) independently. We observe a peak for both datasets away from \(\lambda =1\). COCO is more sensitive to it, and we settle on \(\lambda =2\) in the main experiment

The Effect of \(\lambda \) in Negative-tolerant Regularization. To understand how \(\lambda \) and \(\nu \) in Eq. 7 affect the results independently, we first fix \(\nu =0\) as in the main experiments and vary \(\lambda \) over a large range, and then fix \(\lambda =2,5\) for COCO-MLT and VOC-MLT, respectively, and vary \(\nu \). The effect of \(\nu \) is relatively small, and we report the results in the supplementary material. In Fig. 7b, we observe that VOC performs best at \(\lambda =5-10\), with an improvement of about \(2.5\%\) for the tail classes and \(1\%\) for head and medium classes. On COCO, head and medium classes are only slightly affected when \(\lambda <3\), and tail classes have an improvement of about \(1\%\) at around \(2<\lambda <3\). It is worth noting that adding the same form of regularization to the positive logits slightly degrades the results, as expected.

Group-Wise Analysis. Medium classes always have a better mAP on average than both the head and tail classes. The reason may be that medium classes neither suffer from the over-fitting problem that tail classes face due to insufficient training samples, nor get hurt by the imbalance induced by re-sampling. Another explanation is that the average number of classes per instance gradually decreases from the head classes to the tail, which indicates a lower complexity and difficulty for the recognition task. For instance, quite a number of the training samples for the tail classes of VOC-MLT have only one ground-truth label. As a result, the defect of re-sampling is relieved and the mAP of the tail classes surprisingly outperforms that of the head classes by a margin.

5 Conclusion

In this work, we propose a simple yet powerful loss function, the Distribution-Balanced Loss, to tackle the multi-label long-tailed recognition problem. This problem has two intrinsic challenges, namely the co-occurrence of labels and the dominance of negative labels (when treated as multiple binary classification problems). To tackle these two obstacles, the Distribution-Balanced Loss consists of two key ingredients: 1) a new way to re-balance the weights that takes into account the impact caused by label co-occurrence, and 2) a negative-tolerant regularization to mitigate the over-suppression of negative labels. Extensive experiments on both Pascal VOC and COCO validate the effectiveness of the Distribution-Balanced Loss on multi-label long-tailed visual data. The models trained with our new loss function achieve significant performance gains over existing methods, which we believe will serve as a strong baseline for future research.