
1 Introduction

Deep neural networks (DNNs) have been widely used for machine learning applications. Despite their success, it has been shown that training DNNs requires large-scale labeled and unbiased data. However, in many real-world applications, training set biases are prevalent [9, 21, 27, 28] and typically come in two forms: i) class-imbalanced data distributions; and ii) noisy labels. For example, in autonomous driving, the vast majority of the training data consists of standard vehicles, yet models must also recognize rarely seen classes such as emergency vehicles or animals with very high accuracy. This can lead to biased models that do not perform well in practice. Moreover, large-scale high-quality annotations are expensive and time-consuming to obtain. Although coarse labels are cheap and readily available, the presence of noise hurts model performance. Therefore, it is desirable to develop machine learning algorithms that can accommodate not only class-imbalanced training sets, but also the presence of label noise.

Both learning with noisy labels and class-imbalanced learning (a.k.a. long-tailed learning) have been studied for many years. When dealing with label noise, the most popular approach is sample selection, where correctly-labeled examples are identified by capturing the training dynamics of DNNs [11, 29]. When dealing with class imbalance, many existing works reweight examples or design unbiased loss functions by taking into account the class distribution of the training set [3, 8, 26]. However, most existing methods address only one of these two training set biases.

Fig. 1. Illustration of a normal classifier and Prototypical Classifier.

In this paper, we address both training set biases simultaneously. As shown in Fig. 1a, it is known that a classifier learned directly on class-imbalanced data is biased towards head classes [8, 32], which results in poor generalization on tail classes. Moreover, using the sample loss/confidence produced by biased classifiers fails to detect label noise, because both clean and noisy samples of tail classes have large loss and low confidence. To solve this problem, we propose to use Prototypical Classifier, which is demonstrated to produce balanced predictions even though the training set is class-imbalanced. Our basic idea is that there exists an embedding in which examples cluster around a single prototype representation for each class. To this end, we learn a non-linear mapping of the input into an embedding space using a neural network and take a class's prototype to be the normalized mean vector of its examples in the embedding space. Classification is then performed for an embedded test example by simply finding the nearest class prototype. Notably, Prototypical Classifier needs no additional learnable parameters given the embeddings of examples. Unfortunately, simply using prototypes for classification may lead to many wrong predictions for samples of head classes, as shown in Fig. 1b. The reason is that the representations need to be adjusted as the classification boundaries of tail classes expand. We therefore train the neural network to pull the embeddings of examples towards the prototype of their class, while pushing them apart from the prototypes of other classes. Doing so avoids many misclassifications for samples of head classes, as shown in Fig. 1c. Subsequently, we find that the confidence scores produced by Prototypical Classifier are balanced and comparable across classes. Leveraging this property, we can detect noisy labels by simple thresholding, where the threshold is dynamically adjusted, followed by a sample re-weighting strategy.

In summary, the key contributions of this work are:

  • We propose to learn from training sets with mixed biases, which is practical but has been understudied;

  • Our approach, Prototypical Classifier, is simple yet powerful. It produces more balanced predictions over all classes than normal classifiers, even when the training set is class-imbalanced. This property further benefits the detection of label noise.

  • On both simulated datasets and the real-world dataset WebVision with label noise, Prototypical Classifier achieves substantial performance improvements.

2 Related Work

Class-Imbalanced Learning. Recently, many approaches have been proposed to handle class-imbalanced training sets. Most existing approaches can be categorized into three types according to what they modify: (i) the inputs to a model, by re-balancing the training data [16, 22, 32]; (ii) the outputs of a model, for example by post-hoc adjustment of the classifier [8, 17, 25]; and (iii) the internals of a model, by modifying the loss function [2, 6, 20, 23]. Each of the above methods is intuitive and has shown strong empirical performance. However, these methods assume the training examples are correctly labeled, an assumption that is often difficult to satisfy in real-world applications. Instead, we study the realistic problem of learning from class-imbalanced data with label noise.

Label Noise Detection. Plenty of methods have been proposed to detect noisy labels [4, 7, 10]. Many works adopt the small-loss trick, which treats samples with small training losses as correctly labeled. In particular, MentorNet [7] reweights samples with small loss so that noisy samples contribute less to the loss. Co-teaching [4] trains two networks, where each network selects small-loss samples in a mini-batch to train the other. DivideMix [10] fits a Gaussian mixture model on the per-sample loss distribution to divide the training data into a clean set and a noisy set. In addition, AUM [19] introduces a margin statistic to identify noisy samples by measuring the average difference between the logit values for a sample's assigned class and its highest non-assigned class. The above methods only consider class-balanced training sets and thus are not directly applicable to class-imbalanced problems. Ref. [12] observes that real-world datasets with label noise also have an imbalanced number of samples per class. Nevertheless, they only inspect a particular setup of class imbalance.

3 Prototypical Classifier with Dynamic Threshold

3.1 Motivation

Consider a binary classification problem with the data generating distribution \(\mathbb {P}_{XY}\) being a mixture of two Gaussians. In particular, the label Y is either positive (+1) or negative (−1) with equal probability (i.e., \(\frac{1}{2}\)). Conditioned on \(Y = +1\), \(\mathbb {P} (X \mid Y = +1) \sim \mathcal {N} (\mu _1, \sigma _1)\), and similarly \(\mathbb {P} (X \mid Y = -1) \sim \mathcal {N} (\mu _2, \sigma _2)\). Without loss of generality, let \(\mu _1 > \mu _2\). When \(\sigma _1 = \sigma _2\), it is straightforward to verify that the optimal Bayes classifier is \(f(x) = \operatorname{sign}(x - \frac{\mu _1+\mu _2}{2})\) [30], i.e., classify x as +1 if \(x > \frac{\mu _1+\mu _2}{2}\). This is reminiscent of the nearest-neighbor classifier, whose classification boundary lies at the midpoint between two data points (i.e., a balanced classification boundary). For general multi-class tasks, this motivates us to measure the distance of samples to class prototypes, which is empirically observed to produce balanced classification boundaries even when the training set is class-imbalanced, as shown in Fig. 2.
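For completeness, the midpoint boundary follows from a one-line check of the log-likelihood ratio under the equal-variance assumption \(\sigma _1 = \sigma _2 = \sigma \):

$$\begin{aligned} \log \frac{\mathbb {P}(X=x \mid Y=+1)}{\mathbb {P}(X=x \mid Y=-1)} = \frac{(x-\mu _2)^2 - (x-\mu _1)^2}{2\sigma ^2} = \frac{\mu _1-\mu _2}{\sigma ^2}\Big (x - \frac{\mu _1+\mu _2}{2}\Big ), \end{aligned}$$

which is positive, and hence predicts +1, exactly when \(x > \frac{\mu _1+\mu _2}{2}\), since \(\mu _1 > \mu _2\).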

Fig. 2. Experiment on CIFAR-100-LT. The x-axis shows class labels sorted by decreasing number of training samples; the y-axis shows the marginal likelihood p(y) on the test set.

To this end, we learn a non-linear mapping of the input into an embedding space via a neural network \(f_{\theta }\), parameterized by \(\theta \), trained on the training set \(\mathcal {D} = \{({\boldsymbol{x}}_i, y_i)\}_{i=1}^N\). A class prototype is taken as the normalized mean vector of the embedded examples belonging to that class. Specifically, the prototype for class \(k \in \{1,\dots , K\}\) is computed as:

$$\begin{aligned} \boldsymbol{c}_{k} = {\text {Normalize}}\bigg ( \frac{1}{|\mathcal {D}_k|} \sum _{ i \in \mathcal {D}_k } f_\theta ({\boldsymbol{x}}_i) \bigg ), \mathcal {D}_k = \left\{ i \mid y_{i}=k \right\} . \end{aligned}$$
(1)

Prototypical Classifier produces a distribution over classes for a sample \({\boldsymbol{x}}\) based on a softmax over distances to the prototypes in the embedding space. In particular, when using cosine similarity as the distance measure, we have:

$$\begin{aligned} \mathbb {P}_{\theta }( Y=k \mid {\boldsymbol{x}})=\frac{\exp \left( f_{\theta }({\boldsymbol{x}})^{\top } \mathbf {c}_{k}\right) }{\sum _{k^{\prime }} \exp \left( f_{\theta }({\boldsymbol{x}})^{\top } \mathbf {c}_{k^{\prime }}\right) }. \end{aligned}$$
(2)
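As a minimal sketch (not the authors' implementation), the following PyTorch-style snippet computes the prototypes of Eq. (1) and the probabilities of Eq. (2); all function and variable names are illustrative, and the embeddings are assumed to be produced by \(f_\theta \) elsewhere:

```python
import torch
import torch.nn.functional as F

def class_prototypes(embeddings, labels, num_classes):
    """Eq. (1): normalized mean embedding of each class."""
    protos = torch.zeros(num_classes, embeddings.size(1), device=embeddings.device)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            protos[k] = embeddings[mask].mean(dim=0)
    return F.normalize(protos, dim=1)  # unit-norm prototypes c_k

def prototype_probs(embeddings, prototypes):
    """Eq. (2): softmax over cosine similarities to the class prototypes."""
    sims = F.normalize(embeddings, dim=1) @ prototypes.t()  # (N, K) cosine similarities
    return sims.softmax(dim=1)
```

Normalizing both the embeddings and the prototypes makes the inner product in Eq. (2) a cosine similarity, matching the distance measure described above.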

Learning proceeds by minimizing the negative log-probability \(J(\theta )=-\log \mathbb {P}_{\theta }(Y=k \mid {\boldsymbol{x}})\) of the true class label k via SGD. Notably, the model in Eq. (2) is equivalent to a linear model with a particular parameterization [18]. To see this, expand the term in the exponent:

$$\begin{aligned} \mathbf {c}_{k}^{\top } f_{\theta }({\boldsymbol{x}}) = \mathbf {w}_{k}^{\top } f_{\theta }({\boldsymbol{x}})+b_{k}, \text{ where } \mathbf {w}_{k}= \mathbf {c}_{k} \text{ and } b_{k}=0. \end{aligned}$$
(3)

Our results indicate that Prototypical Classifier is effective despite the equivalence to a linear model. We hypothesize this is because all of the required non-linearity can be learned within the embedding function [24]. Indeed, this is the approach that modern neural network classification systems currently use.

3.2 Dynamic Thresholding for Label Noise Detection

However, the existence of label noise may hurt the representation learning of the network. To tackle this issue, it is common practice to correct noisy labels. Let \(\hat{{\boldsymbol{y}}} = [\hat{y}_1, \cdots , \hat{y}_K] = \mathbb {P}_{\theta }( Y \mid {\boldsymbol{x}}_i)\) be the prediction of Prototypical Classifier for sample \({\boldsymbol{x}}_i\); its label is then refined according to the following rule:

$$\begin{aligned} \tilde{y}_i = \left\{ \begin{array}{ll} y_{i} &{} \text{ if } \hat{y}_{y_i} > \tau _t \\ \hbox {arg max}_j \hat{y}_{j} &{} \text{ otherwise. } \end{array}\right. \end{aligned}$$
(4)

In words, we deem a sample clean if the confidence score on its original label is greater than a threshold \(\tau _t\). Notably, normal classifiers cannot achieve this goal due to their biased predictions, whereas the predictions of Prototypical Classifier are balanced and comparable across classes. We illustrate this finding in Fig. 3.
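A minimal sketch of the refinement rule in Eq. (4), where the dynamic threshold \(\tau _t\) is supplied as an argument (its schedule is given in Eq. (5) below) and the names are illustrative:

```python
import torch

def refine_labels(probs, labels, tau_t):
    """Eq. (4): keep a label if its confidence exceeds tau_t, otherwise use the argmax."""
    conf_on_label = probs.gather(1, labels.view(-1, 1)).squeeze(1)  # \hat{y}_{y_i}
    is_clean = conf_on_label > tau_t
    refined = torch.where(is_clean, labels, probs.argmax(dim=1))
    return refined, is_clean
```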

Fig. 3. Experiment on CIFAR-100-LT. The x-axis shows class labels sorted by decreasing number of training samples; the y-axis shows the classifiers' confidence scores on the training set.

We then need to specify \(\tau _t\). Intuitively, as the optimization iteration t increases, the predictive confidence generally increases as well, so \(\tau _t\) should also increase. Mathematically, we set the dynamic threshold \(\tau _t\) to be an increasing function of t:

$$\begin{aligned} \tau _t = \gamma ^{t} \tau _0. \end{aligned}$$
(5)

Here, \(\tau _0\) is the initial threshold and \(\gamma \) is set to 1.005 in our experiments. We provide more analysis of \(\tau _t\) in the supplementary material. Lemma 1 summarizes the performance bound of the label noise detection method.

Lemma 1

With probability at least p, the F\(_1\)-score of detecting noisy labels in \(\mathcal {D}_j\) by thresholding the predictive scores of Prototypical Classifier is at least \(1-\frac{e^{-v} \max \left( N^{-}, N^{+}\right) +\alpha }{N^{-}}\) when the noise ratio is known, where \(p=\int _{-1}^{\mu ^{\mathrm {true}}-\mu ^{\mathrm {false}}-\varDelta } f(t)\, dt\) and f(t) is the probability density function of the difference \(\beta _1 - \beta _2\) of two independent beta-distributed random variables with \(\beta _{1} \sim {\text {Beta}}\left( N^{-}, 1\right) \) and \(\beta _{2} \sim {\text {Beta}}\left( \alpha +1, N^{+}-\alpha \right) \).

Lemma 1 shows that the performance of noise detection depends on the intra-class concentration of clean samples in the embedding space (captured by \(\frac{\varDelta ^2}{v}\)), which is optimized by the prototypical contrastive loss defined in Eq. (6). We refer the reader to Ref. [33] for the proof of Lemma 1. We further illustrate the effectiveness of our method in Fig. 4, which shows high F\(_1\)-scores for both head and tail classes.

Fig. 4. Experiment on CIFAR-100-LT. F\(_1\)-score of the clean-example selection module for many, medium, and few classes.

3.3 Example Reweighting

In standard training, we minimize the expected loss over the training set, where each example is weighted equally. Here, we instead learn a reweighting of the examples to cope with hard mislabeled samples whose labels are not correctly refined, and minimize a weighted loss:

$$\begin{aligned} \mathcal {L}_{\text{pc}}= \frac{-1}{ \sum _{i=1}^N w_i } \sum _{i=1}^{N} w_i \log \frac{\exp \left( f_{\theta }({\boldsymbol{x}}_i) \cdot \boldsymbol{c}_{y_{i}} / \tau \right) }{\sum _{k=1}^{K} \exp \left( f_{\theta }({\boldsymbol{x}}_i) \cdot \boldsymbol{c}_{k} / \tau \right) }. \end{aligned}$$
(6)

With a slight abuse of notation, we re-define \(w_i\) to be the weight of the i-th example, and \(\tau \) is a temperature parameter. We expect the weights to reflect the likelihood of examples being correctly labeled. In that regard, we devise a weighted version of the prototype computation:

$$\begin{aligned} \boldsymbol{c}_{k} = {\text {Normalize}}\bigg ( \frac{1}{\sum _{i \in \mathcal {D}_k} w_i} \sum _{ i \in \mathcal {D}_k } w_i f_\theta ({\boldsymbol{x}}_i) \bigg ), \mathcal {D}_k = \left\{ i \mid y_{i}=k \right\} . \end{aligned}$$
(7)
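The following sketch combines the weighted prototype update of Eq. (7) with the weighted prototypical contrastive loss of Eq. (6); it is illustrative only and assumes the same conventions as the earlier snippets:

```python
import torch
import torch.nn.functional as F

def weighted_prototypes(embeddings, labels, weights, num_classes):
    """Eq. (7): prototypes as weighted, normalized class means."""
    protos = torch.zeros(num_classes, embeddings.size(1), device=embeddings.device)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            w = weights[mask].unsqueeze(1)
            protos[k] = (w * embeddings[mask]).sum(dim=0) / w.sum()
    return F.normalize(protos, dim=1)

def prototypical_contrastive_loss(embeddings, labels, weights, protos, tau=0.1):
    """Eq. (6): weighted cross-entropy over similarities to the prototypes."""
    logits = F.normalize(embeddings, dim=1) @ protos.t() / tau
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # -log p(y_i | x_i)
    return (weights * per_sample).sum() / weights.sum()
```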

Recall that one appealing property of Prototypical Classifier is its balanced predictions across all classes, as opposed to biased normal classifiers. We therefore simply set example weights to the predicted score of Prototypical Classifier on the training label, i.e., for the i-th example we set \(w_i = \mathbb {P}_{\theta }(Y=y_i \mid {\boldsymbol{x}}_i)\), where \(y_i\) is the training label of \({\boldsymbol{x}}_i\). For samples whose labels are rectified, we update their weights by \(w' = \frac{\tau _t - w}{2}\) to reflect the uncertainty. The modified example weights are always non-negative, since a label is refined if and only if \(w = \mathbb {P}_{\theta }(Y = y_i \mid {\boldsymbol{x}}_i) \le \tau _t\).

The optimization of \(\mathcal {L}_{\text{pc}}\) is realized by contrastive learning, which has been demonstrated to be effective for representation learning [13]. Observing that the presence of label noise may have a negative effect on representation learning, we additionally train the network with an unsupervised contrastive loss, which does not use the possibly corrupted training labels. The basic idea of unsupervised contrastive learning is to pull together two embeddings of the same example, while pushing them apart from the embeddings of other examples. Formally, let \({\boldsymbol{z}}_i = f_{\theta }({\boldsymbol{x}}_i)\) and let \({\boldsymbol{z}}_i^{\prime }\) be the embedding of an augmented view of \({\boldsymbol{x}}_i\); the unsupervised contrastive loss is computed as:

$$\begin{aligned} \mathcal {L}_{\text{cc}}^{i}=-\log \frac{\exp \left( \boldsymbol{z}_{i} \cdot \boldsymbol{z}_{i}^{\prime } / \tau \right) }{\sum _{b=1}^{B} \exp \left( {\boldsymbol{z}}_i \cdot \boldsymbol{z}_{b}^{\prime } / \tau \right) }, \end{aligned}$$
(8)

where \(\tau \) is a scalar temperature parameter and B is mini-batch size.
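A sketch of the unsupervised contrastive term of Eq. (8) for one mini-batch of two augmented views (a simplified, one-directional variant with illustrative names):

```python
import torch
import torch.nn.functional as F

def unsup_contrastive_loss(z, z_prime, tau=0.1):
    """Eq. (8): pull two views of an example together, push other examples apart."""
    z = F.normalize(z, dim=1)
    z_prime = F.normalize(z_prime, dim=1)
    logits = z @ z_prime.t() / tau                       # (B, B) pairwise similarities
    targets = torch.arange(z.size(0), device=z.device)   # positive pairs on the diagonal
    return F.cross_entropy(logits, targets)              # batch average of Eq. (8)
```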

Given the above definitions and denoting by \(\mathcal {L}_{\mathrm {ce}}\) the conventional cross-entropy loss, the overall training objective is written as:

$$\begin{aligned} \mathcal {L}=\mathcal {L}_{\mathrm {ce}}+\lambda _{1} \mathcal {L}_{\mathrm {cc}}+\lambda _{2} \mathcal {L}_{\mathrm {pc}}, \end{aligned}$$
(9)

where \(\lambda _{1}\) and \(\lambda _{2}\) are trade-off hyperparameters. We adopt a DNN as the feature extractor and a linear layer as the projector that generates the latent representation \(\boldsymbol{z}_i\); another linear layer following the feature extractor serves as the normal classifier. When minimizing \(\mathcal {L}_{\mathrm {pc}}\), we apply mixup [31] to improve generalization, which has been shown to be effective for learning with noisy labels [29].
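Assembling the pieces, one possible way to combine the three terms of Eq. (9) is sketched below; it relies on the helper functions from the earlier sketches, omits the mixup/AugMix wiring, and uses illustrative names throughout:

```python
import torch.nn.functional as F

def total_loss(logits, z, z_aug, refined_labels, weights, protos,
               lambda1=1.0, lambda2=5.0):
    """Eq. (9): L = L_ce + lambda1 * L_cc + lambda2 * L_pc."""
    loss_ce = F.cross_entropy(logits, refined_labels)                            # normal classifier
    loss_cc = unsup_contrastive_loss(z, z_aug)                                   # Eq. (8)
    loss_pc = prototypical_contrastive_loss(z, refined_labels, weights, protos)  # Eq. (6)
    return loss_ce + lambda1 * loss_cc + lambda2 * loss_pc
```

The default \(\lambda _1=1\) and \(\lambda _2=5\) mirror the CIFAR settings reported in the experiments.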

4 Experiments

We perform experiments on the CIFAR-10 and CIFAR-100 datasets by controlling the label noise ratio and the imbalance factor of the training set. Additionally, we perform experiments on WebVision, a commonly used dataset with real-world label noise.

4.1 Results on Simulated Datasets

Class-Imbalanced Dataset Generation. Formally, for a dataset with K classes and N training examples per class, given an imbalance factor \(\rho \), the number of examples retained for the k-th class is set to \(N_k={N}/{\rho ^{\frac{k-1}{K-1}}}\).
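For instance, the per-class sample counts can be generated as in the short sketch below (illustrative only; the actual sub-sampling of the dataset is omitted):

```python
def class_sizes(num_classes, n_per_class, rho):
    """N_k = N / rho^((k-1)/(K-1)) for classes k = 1, ..., K."""
    return [int(n_per_class / rho ** (k / (num_classes - 1)))
            for k in range(num_classes)]

# CIFAR-10-LT with imbalance factor rho = 100: class_sizes(10, 5000, 100)
# keeps 5000 examples for the most frequent class down to 50 for the rarest.
```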

Label Noise Injection. Let Y denote the clean label, \(\bar{Y}\) the noisy label, and X the instance/feature; the noise transition matrix is defined as \(T_{ij}(X = x) = \mathbb {P}(\bar{Y}=j \mid Y=i, X=x)\). In this work, we follow the setup in RoLT+ [28] and set \(T(X = x)\) according to the estimated class priors \(\mathbb {P}(y)\), e.g., the empirical class frequencies in the training dataset. Formally, given the noise proportion \(\gamma \in [0,1]\), we define:

$$\begin{aligned} T_{ij}(X = x) = \mathbb {P}(\bar{Y}=j \mid Y=i, X=x) = \left\{ \begin{array}{ll} 1 - \gamma &{} i = j \\ \frac{N_j}{N - N_i} \gamma &{} \text{ otherwise. } \end{array}\right. \end{aligned}$$
(10)

Here, N is the size of the training set and \(N_j\) is the number of training examples of class j.
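A sketch of how noisy labels can be injected according to Eq. (10); the function and variable names are illustrative:

```python
import numpy as np

def build_transition_matrix(class_sizes, gamma):
    """Eq. (10): T[i, j] = P(noisy label = j | clean label = i)."""
    sizes = np.asarray(class_sizes, dtype=float)
    n_total, num_classes = sizes.sum(), len(sizes)
    T = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        for j in range(num_classes):
            T[i, j] = 1 - gamma if i == j else gamma * sizes[j] / (n_total - sizes[i])
    return T  # each row sums to one

def inject_noise(labels, T, seed=0):
    """Resample each label from the row of T indexed by its clean label."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])
```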

Table 1. Test accuracy (%) on CIFAR-10. \(^*\) denotes ensemble models.
Table 2. Test accuracy (%) on CIFAR-100. \(^*\) denotes ensemble models.

Result. We train a PreAct ResNet-18 network using the SGD optimizer with momentum 0.9 for all methods. We set \(\lambda _1=1\) and \(\lambda _2=5\). We use \(\tau _0 = 0.1\) for CIFAR-10 and \(\tau _0=0.01\) for CIFAR-100. Tables 1 and 2 summarize the results for CIFAR-10 and CIFAR-100, respectively. We compare our method with several commonly used baselines for long-tailed learning (1–3) and learning with noisy labels (4–5). As the results show, the performance of previous methods degrades severely as the noise ratio and imbalance factor increase, while our method remains robust. In particular, compared with CE, Prototypical Classifier improves the test accuracy by 9% on average. The improvement becomes more significant when the noise ratio is high, benefiting from the proposed noise detection method.

DivideMix [10] and RoLT+ [28] are two strong baselines for this task; accordingly, (4) and (5) obtain much higher performance than (1–3), particularly when the noise ratio is high. Although (4) and (5) use an ensemble of two networks, our method (6) outperforms them in most cases. On CIFAR-100, Prototypical Classifier achieves the best results among all approaches and outperforms the others by a large margin for both head and tail classes, as shown in Fig. 5.

Fig. 5. Experiment on CIFAR-100-LT. Accuracy for many (#inst > 100), medium (#inst \(\in [20, 100]\)), and few (#inst < 20) classes.

4.2 Results on Real-World Dataset

We test the performance of our method on a real-world dataset. WebVision [14] contains 2.4 million images collected from Flickr and Google, with real-world label noise and class imbalance. Following previous literature, we train on a subset, mini WebVision, which contains the first 50 classes. In Table 3, we report results against state-of-the-art approaches, including MentorNet [7], Co-teaching [4], ELR [15], HAR [1], and DivideMix [10]. We use InceptionResNet-v2 for all methods and set \(\tau _0 = 0.05\), \(\lambda _1=1\), and \(\lambda _2=2\) in all experiments. The results show that, using a single model, the proposed method achieves performance competitive with DivideMix and outperforms the other baselines.

4.3 Ablation Studies

We examine the effectiveness of each module of our method by removing it and comparing the performance with that of the full framework. The results are reported in Table 4. In general, removing any part of the method significantly degrades performance, or even causes training to fail in some cases. The results for re-weighting and dynamic thresholding show their effectiveness in dealing with label noise. Although we do not use the normal classifier trained via \(\mathcal {L}_{\mathrm {ce}}\) for prediction, it is observed to improve representation learning; we have a similar observation for the unsupervised contrastive loss \(\mathcal {L}_{\mathrm {cc}}\). The strong augmentation method AugMix [5] also provides a substantial improvement.

Table 3. Accuracy (%) on WebVision and ImageNet. \(^*\) denotes ensemble models.

Additionally, we test our method on class-balanced training sets with label noise in Table 5. Prototypical Classifier outperforms the other methods in most cases, even though both DivideMix and RoLT+ use an ensemble of two networks, which shows the generality of Prototypical Classifier.

Table 4. Ablation studies with noise ratio \(\gamma =0.5\) and imbalance factor \(\rho =100\). Numbers in parentheses indicate performance loss (gain) compared with Prototypical Classifier.
Table 5. Accuracy (%) on class-balanced datasets. \(^*\) denotes ensemble models.

5 Conclusion

We propose Prototypical Classifier for learning with training set biases. Prototypical Classifier is shown to produce balanced predictions for all classes even when learned on class-imbalanced training sets. This appealing property provides a way to detect label noise by thresholding the predicted scores of examples. Experiments demonstrate the superiority of the proposed method. We believe Prototypical Classifier can motivate solutions to further problems with class-imbalanced training sets, for instance semi-supervised and self-supervised learning.