Meta-prototype Decoupled Training for Long-Tailed Learning

Fu, Siming; Chu, Huanpeng; He, Xiaoxuan; Wang, Hualiang; Yang, Zhenyu; Hu, Haoji

doi:10.1007/978-3-031-26351-4_16


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13846)

Included in the following conference series:

Asian Conference on Computer Vision


Abstract

Long-tailed learning aims to tackle the crucial challenge that head classes dominate the training procedure under severe class imbalance in real-world scenarios. Supervised contrastive learning has turned out to be a research direction worth exploring: it seeks to learn class-specific feature prototypes to enhance long-tailed learning performance. However, little attention has been paid to how to calibrate the empirical prototypes, which are severely biased due to the scarce data in tail classes. Without the aid of correct prototypes, these explorations have not shown the significant promise expected. Motivated by this, we propose meta-prototype contrastive learning to automatically learn reliable prototype representations and a more discriminative feature space in a meta-learning manner. In addition, we use the calibrated prototypes to replace the mean of class statistics and predict the targeted distribution of balanced training data. By this procedure, we formulate a feature augmentation algorithm which samples additional features from the predicted distribution and further reduces the overwhelming dominance of head classes. We summarize the above two stages as the meta-prototype decoupled training scheme and conduct a series of experiments to validate the effectiveness of the framework. Our method outperforms previous work by a large margin and achieves state-of-the-art performance on long-tailed image classification and semantic segmentation tasks (e.g., we achieve 55.1% overall accuracy with ResNeXt-50 on ImageNet-LT).


1 Introduction

Most real-world data is long-tailed by nature: a few head classes contribute the majority of the data, while most tail classes comprise relatively few samples. An undesired consequence is that models [2, 29] trained on long-tailed data perform better on head classes while exhibiting extremely low accuracy on tail ones.

To address this problem, a large number of studies have been conducted in recent years, making promising progress in the field of deep long-tailed learning. Supervised contrastive learning (SCL) has been the main focus of many techniques for long-tailed learning. Mainstream SCL methods [17, 43] seek to learn class-specific feature prototypes to enhance long-tailed learning performance. DRO-LT [21] innovatively explores the idea of feature prototypes to handle long-tailed recognition in an open world. Following that, TSC [15] makes the features of different classes converge during training to targets that are uniformly distributed over the hyper-sphere.

Nevertheless, when a class has only a few samples, the distribution of training samples may not represent the true distribution of the data well. The shift between test distribution and training distribution causes the offset of the prototypes in tail classes [21]. The above works are all based on the empirical prototype under imbalanced data, limiting the effectiveness of feature representation. Therefore, sub-optimal prototypes become an obstacle to learning high-quality representations for SCL methods, because they mislead the optimization intended to improve long-tailed learning.

To alleviate the above issues, we propose the supervised meta-prototype contrastive learning which calibrates the empirical prototype under the imbalanced setting. Specifically, we extend meta-learner to automatically restore the meta-prototypes of feature embeddings via two nested loops of optimization, guaranteeing the efficiency of the meta-prototype contrastive learning algorithm. Our major insight here is to parameterize the mapping function as a meta-network, which is theoretically a universal approximator for almost all continuous functions, and then use the meta-data (a small unbiased validation set) to guide the training of all the meta-network parameters. The meta-prototypes provide more meaningful feature prototypes which are designed to be robust against possible shifts of the test distribution and guide the SCL to obtain the discriminative feature representation space.

To further ease the dominance of the head classes in classification decisions, we develop a calibrated feature augmentation algorithm based on the learned meta-prototype in the classifier training stage. Specifically, we use the meta-prototype as the ‘anchor’ of the corresponding class, standing in for the mean of the class statistics under the imbalanced setting. In contrast to typical methods, which generate new feature samples based on the class statistics of imbalanced training data, our meta-prototype calibrates the bias and provides a reasonable feature distribution of new feature samples for tail classes. The newly generated features are sampled from the calibrated distribution and help to find the correct classifier decision boundary by improving the performance of severely under-represented tail classes.

We summarize the above processes as the meta-prototype decoupled training framework, which includes calibrating the empirical prototype for SCL in the representation learning stage and enhancing feature embedding for tail classes based on the learned meta-prototype in the classifier learning stage. We extensively validate our model on typical visual recognition tasks, including image classification on three benchmarks (CIFAR-100-LT [12], ImageNet-LT [18] and iNaturalist2018 [25]) and semantic segmentation on the ADE20K dataset [40]. The experimental results demonstrate that our method consistently outperforms the state-of-the-art approaches on all the benchmarks.

Summary of Contributions:

To the best of our knowledge, we are the first in long-tailed learning to complete the meta-prototype to promote the representation quality of supervised prototype contrastive learning in the representation learning stage.
On top of the learned meta-prototype, we develop a feature augmentation algorithm for tail classes to ease the dominance of the head classes in classification decisions in the classifier learning stage.
Our method outperforms previous works by a large margin and achieves state-of-the-art performance on long-tailed image classification and semantic segmentation tasks.

2 Related Work

Supervised Contrastive Learning. Existing supervised contrastive learning-based methods for long-tailed learning seek to help alleviate the biased label effect. DRO-LT [21] extends standard contrastive loss and optimizes against the worst possible centroids within a safety hyper ball around the empirical centroid. KCL [10] develops a new method to explicitly pursue balanced feature space for representation learning. TSC [15] generates a set of targets uniformly distributed on a hypersphere and makes the features of different classes converge to these distinct and uniformly distributed targets during training. Hybrid-SC [28] explores the effectiveness of supervised contrastive learning. It introduces prototypical supervised learning to obtain better features and resolve the memory bottleneck. The above works are all based on the empirical prototype under imbalanced data, which limits the effectiveness of feature representation. To alleviate the above issue, we introduce the meta-prototype to calibrate the empirical prototype, further constructing a discriminative feature space.

Meta-learning. The recent development of meta-learning [1, 7] inspires researchers to leverage meta-learning to handle class imbalance. Meta-weight-net [22] introduces a method capable of adaptively learning an explicit weighting function directly from data. MetaSAug [14] proposes to augment tail classes with a variant of ISDA [30] by estimating the covariance matrices for tail classes. Motivated by these works, our method attempts to automatically estimate the meta-prototype of each class to calibrate the empirical prototype for high-quality feature representation.

Data Augmentation for Long-Tailed Learning. In long-tail learning, transfer-based augmentation has been explored. Transfer-based augmentation seeks to transfer the knowledge from head classes to augment model performance on tail classes. TailCalibX [26] and GLAG [38] explore a direction that attempts to generate meaningful features by estimating the tail category’s distribution. RSG [27] dynamically estimates a set of feature centers for each class, and uses the feature displacement between head-class sample features and the nearest intra-class feature center to augment each tail sample feature. However, the estimated distribution of tail category and the intra-class feature center are biased or unreasonable due to the imbalanced size of training dataset. Our meta-prototype feature augmentation algorithm calibrates the bias and predicts likely shifts of the test distribution.

Decoupled Scheme for Long-Tailed Learning. Decoupling [9] is a pioneering work that introduces a two-step training scheme. It empirically evaluates different sampling strategies for representation learning in the first step, and then, in the second step, evaluates different classifier training schemes while fixing the feature extractor trained in the first step. Decoupling [9] and Bag of tricks [37] decouple the learning procedure into representation learning and classification, and systematically explore how different balancing strategies affect them for long-tailed recognition. BBN [41] further unifies the two stages to form a cumulative learning strategy. MiSLAS [39] proposes to enhance the representation learning with data mixup in the first stage. During the second stage, MiSLAS applies a label-aware smoothing strategy for better model generalization. Our method also adopts the two-stage decoupled training scheme, which leads to better long-tailed learning performance.

3 The Proposed Methods

3.1 Problem Definition

For long-tailed learning, let $ \mathcal { D }^{tra} = \{ \boldsymbol{x}^i, y^i\}$, $ i\in \{1, \cdots , N\} $, be the training set, where $\boldsymbol{x}^i$ denotes an image sample and $y^i$ indicates its class label. Let K be the total number of classes and $N_k$ be the number of samples in class k, where $\sum _{k=1}^{K} N_{k}=N$. A long-tail setup can be defined by ordering the number of samples per category, i.e. $N_{1} \ge N_{2} \ge \ldots \ge N_{K}$ and $N_{1} \gg N_{K}$ after sorting the $N_{k}$. Under the long-tailed setting, the training dataset is imbalanced, leading to poor performance on tail classes.

We train a network $\boldsymbol{ \varPsi }(\cdot ; \boldsymbol{W})$ consisting of two components: (i) a backbone or representation network (CNN for images) that translates an image to a feature representation $\boldsymbol{z}_{i}= \boldsymbol{\varPsi }(\boldsymbol{x}^i; \boldsymbol{w^{E}}) \in \mathbb {R}^{1 \times d}$ and (ii) a classifier $\boldsymbol{w}^{C} \in \mathbb {R}^{K \times d} $ that predicts the category-specific scores (logits). As shown in Fig. 1, given a pair $ (\boldsymbol{x}^i, y^i) $ sampled from a mini-batch $ \mathcal { B } \subset \mathcal { D }^{tra} $, the feature vector $\boldsymbol{z}_{i}$ is extracted by the feature extractor and then projected onto the classifier to output the classification logits. Too few samples belonging to the tail classes result in inadequate learning of tail-class representations.

3.2 Supervised Meta-prototype Contrastive Learning in the Representation Learning Stage

Supervised contrastive learning introduces cluster-based prototypes and encourages embeddings to gather around their corresponding prototypes. Our original feature prototypes follow the MoPro [13], adopting the exponential-moving-average (EMA) algorithm during training by:

$$\begin{aligned} \boldsymbol{c}_{k} \leftarrow m \boldsymbol{c}_{k}+(1-m) \boldsymbol{z}_{i}, \quad \forall i \in \left\{ i \mid \hat{y}_{i}=k\right\} , \end{aligned}$$

(1)

where $\boldsymbol{c}_{k}$ is the prototype for class k and m is the momentum coefficient, usually set to 0.999. Then, given the embedding $z_i^f$, the prototypes are queried with contrastive similarity matching. The prototype contrastive loss [13, 21] is defined as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\textrm{PC}}=-\log \left[ \frac{\exp \left( z_{i}^{f} \cdot c^{k} / \tau \right) }{\sum _{j=1}^{K} \exp \left( z_{i}^{f} \cdot c^{j} / \tau \right) }\right] , \end{aligned} \end{aligned}$$

(2)

where $\tau $ is a hyper-parameter and usually set as 0.07 [11]. The neural network is denoted as $f(\cdot , \textbf{W})$, and $\textbf{W}$ denotes all of its parameters. Generally, the optimal network parameter $\textbf{W}^{*}$ can be extracted by minimizing the training loss:

$$\begin{aligned} \begin{aligned} \mathcal {L}^{{\text {train}}}(\textbf{W} ; \boldsymbol{c}^{k}) = \mathcal {L}_{CE}(\textbf{W}) + \lambda \cdot \mathcal {L}_{{\text {PC}}}(\textbf{W}, \boldsymbol{c}^{k}), \end{aligned} \end{aligned}$$

(3)

where $\lambda $ denotes the weighting coefficient to balance the two loss terms and $\mathcal {L}_{CE}$ is the cross-entropy loss. As aforementioned, the empirical prototypes of tail classes can be far away from the ground-truth prototypes due to the limited features of tail classes and large variances in data distribution between training and test datasets. Therefore, we aim to learn appropriate feature prototypes to perform reasonable feature representation learning.
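As a rough illustration of Eqs. 1 to 3, the following minimal sketch implements the EMA prototype update, the prototype contrastive loss, and the weighted training loss in plain Python. The toy two-dimensional prototypes, the helper names, and the default `lam=0.5` are assumptions made for the example, not part of the original implementation.

```python
import math

def ema_update(prototype, feature, m=0.999):
    """Eq. 1: c_k <- m * c_k + (1 - m) * z_i, applied element-wise."""
    return [m * c + (1.0 - m) * z for c, z in zip(prototype, feature)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def prototype_contrastive_loss(z, prototypes, k, tau=0.07):
    """Eq. 2: negative log-softmax over z . c^j / tau, evaluated at class k."""
    logits = [dot(z, c) / tau for c in prototypes]
    mx = max(logits)  # subtract the max for numerical stability
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[k]

def train_loss(ce, pc, lam=0.5):
    """Eq. 3: L_train = L_CE + lambda * L_PC (lam=0.5 is an illustrative default)."""
    return ce + lam * pc

# Hypothetical toy data: two class prototypes and one unit-norm embedding of class 0.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
z = [0.8, 0.6]
prototypes[0] = ema_update(prototypes[0], z)
loss_pc = prototype_contrastive_loss(z, prototypes, k=0)
```

Because the embedding is closer to the class-0 prototype, the loss for `k=0` is smaller than it would be for `k=1`.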

The whole process of the meta-prototype contrastive learning is summarized in Algorithm 1. In the presence of imbalanced training data, our method calibrates the empirical prototypes with a prototype meta network, denoted as $\mathcal {C}(\boldsymbol{c}^{k} ; \varTheta )$, where $\boldsymbol{c}^{k}$ is the input of the meta network and $\varTheta $ represents its parameters. The meta network is an MLP that maps the empirical prototype $\boldsymbol{c}^{k}$ into the meta-prototype $\boldsymbol{\hat{c}}^{k}$. It has an encoder-decoder structure, where the encoder contains one linear layer with a ReLU activation function, and the decoder consists of a Linear-ReLU-Linear structure. The optimal parameter $\textbf{W}^{*}$ is calculated by minimizing the following training loss:

$$\begin{aligned} \begin{aligned} \textbf{W}^{*}(\varTheta )&=\underset{\textbf{W}}{\arg \min } \mathcal {L}^{{\text {train}}}(\textbf{W} ; \boldsymbol{c}^{k}; \varTheta ) \\ {}&= \underset{\textbf{W}}{\arg \min }\left\{ \mathcal {L}_{CE}(\textbf{W}) + \lambda \cdot \mathcal {L}_{{\text {PC}}}( \textbf{W}, \mathcal {C}(\boldsymbol{c}^{k}; \varTheta ))\right\} . \end{aligned} \end{aligned}$$

(4)

The parameters contained in the meta-network can be optimized by using the meta-learning idea. The optimal parameter $\varTheta ^{*}$ can be obtained by minimizing the following meta-loss:

$$\begin{aligned} \begin{aligned} \varTheta ^{*}&=\underset{\varTheta }{\arg \min }\ \mathcal {L}^{m e t a}\left( \textbf{W}^{*}(\varTheta )\right) , \end{aligned} \end{aligned}$$

(5)

where $\mathcal {L}^{\text{ meta } }(\textbf{W})=\mathcal {L}_{CE}\left( y_{i}^{(\text{ meta } )}, f\left( x_{i}^{(\text{ meta } )}, \textbf{W}\right) \right) $ is computed on the meta-data. Specifically, following the meta-learning methods [14, 22] for long-tailed learning, we construct a small balanced meta-data set (i.e., with balanced data distribution) $\left\{ x_{i}^{(\text{ meta } )}, y_{i}^{(\text{ meta } )}\right\} _{i=1}^{M}$ to represent the meta-knowledge of the ground-truth sample-label distribution, where M is the number of meta-samples and $M \ll N$.
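The encoder-decoder structure described for the prototype meta network (one Linear-ReLU encoder followed by a Linear-ReLU-Linear decoder) could look roughly like the forward pass below. This is only a sketch: the hidden width, the uniform initialisation, and the dictionary layout of the parameters are assumptions, since the text does not specify them.

```python
import random

def linear(x, weight, bias):
    """y = W x + b, with the weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(n_in, n_out, rng):
    # Small uniform weights and zero biases (an assumed initialisation).
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def make_meta_network(d, hidden, seed=0):
    """Parameters Theta of the meta network C(c; Theta); `hidden` is hypothetical."""
    rng = random.Random(seed)
    return {
        "enc": init_linear(d, hidden, rng),
        "dec1": init_linear(hidden, hidden, rng),
        "dec2": init_linear(hidden, d, rng),
    }

def meta_prototype(c, theta):
    """Map an empirical prototype c^k to a meta-prototype of the same dimension."""
    h = relu(linear(c, *theta["enc"]))    # encoder: Linear + ReLU
    h = relu(linear(h, *theta["dec1"]))   # decoder: Linear + ReLU
    return linear(h, *theta["dec2"])      # decoder: final Linear

theta = make_meta_network(d=4, hidden=8)
c_hat = meta_prototype([0.5, -0.2, 0.1, 0.3], theta)
```

The output has the same dimension as the input prototype, so it can be substituted for $\boldsymbol{c}^{k}$ in the prototype contrastive loss.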

Online approximation. To estimate the optimal feature prototypes for different classes, we adopt a double optimization loop to guarantee the efficiency of the algorithm. We optimize the model in a meta-learning setup as follows. i) The network parameter is updated by moving the current $\textbf{W}^{(t)}$ along the descent direction of the objective loss in Eq. 4 on a mini-batch of training data:

$$\begin{aligned} \begin{aligned} \hat{\textbf{W}}^{(t)}(\varTheta )\leftarrow \textbf{W}^{(t)}-\alpha \times \nabla _{\textbf{W}^{(t)}}\mathcal {L}^{{\text {train}}}(\textbf{W}; \boldsymbol{c}^{k}; \varTheta ), \end{aligned} \end{aligned}$$

(6)

where $\alpha $ is the step size. ii) After receiving the updated network parameters $\hat{\textbf{W}}^{(t)}(\varTheta )$, the parameter $\varTheta $ of the meta-network can then be readily updated by Eq. 5, i.e., moving the current parameter $\varTheta ^{(t)}$ along the objective gradient to be calculated on the meta-data by

$$\begin{aligned} \begin{aligned} \varTheta ^{(t+1)}=\varTheta ^{(t)}-\beta \frac{1}{n} \sum _{i=1}^{n} \nabla _{\varTheta ^{(t)}} \mathcal {L}^{meta}\left( \hat{\textbf{W}}^{(t)}(\varTheta )\right) , \end{aligned} \end{aligned}$$

(7)

where $\beta $ is the step size. iii) Then, the updated $\varTheta ^{(t+1)}$ is employed to update the parameter $\textbf{W}$ of the network, completing the loop:

$$\begin{aligned} \begin{aligned} \textbf{W}^{(t+1)}=\textbf{W}^{(t)}-\alpha \times \nabla _{\textbf{W}^{(t)}} \mathcal {L}^{\text{ train } }(\textbf{W}^{(t)}; \boldsymbol{c}^{k}; \varTheta ^{(t+1)}). \end{aligned} \end{aligned}$$

(8)

Since the updated meta-network $\mathcal {C}(\boldsymbol{c}^{k} ; \varTheta ^{(t+1)})$ is learned from balanced meta-data, we can expect $\mathcal {C}(\boldsymbol{c}^{k} ; \varTheta ^{(t+1)})$ to contribute to learning better network parameters $\textbf{W}^{(t+1)}$.
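The three-step loop in Eqs. 6 to 8 can be illustrated on a deliberately tiny scalar problem where all gradients are available in closed form. Everything in this sketch is an assumption made for illustration: the network is a single scalar `w`, the meta network is a simple shift `c + theta`, and both losses are quadratics. It is not the authors' training code.

```python
def meta_network(c, theta):
    """Toy stand-in for C(c; Theta): shift the empirical prototype by theta."""
    return c + theta

def train_grad(w, c, theta):
    """d/dW of the toy training loss 0.5 * (W - C(c; Theta))**2."""
    return w - meta_network(c, theta)

def meta_grad_theta(w_hat, target, alpha):
    """d/dTheta of the toy meta loss 0.5 * (W_hat(Theta) - target)**2.

    By the chain rule, dW_hat/dTheta = alpha for the look-ahead step of Eq. 6.
    """
    return (w_hat - target) * alpha

def run(c=2.0, target=1.0, alpha=0.5, beta=0.5, steps=200):
    w, theta = 0.0, 0.0
    for _ in range(steps):
        # Eq. 6: look-ahead step on the training loss, kept as a function of theta.
        w_hat = w - alpha * train_grad(w, c, theta)
        # Eq. 7: update theta using the meta loss evaluated at the look-ahead weights.
        theta = theta - beta * meta_grad_theta(w_hat, target, alpha)
        # Eq. 8: actual weight update using the refreshed meta network.
        w = w - alpha * train_grad(w, c, theta)
    return w, theta

w_final, theta_final = run()
```

In this toy setting the biased prototype `c = 2.0` is pulled toward the value favoured by the balanced meta-data (`target = 1.0`), so `theta` approaches `target - c` and `w` approaches `target`.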

3.3 Meta-prototype Feature Augmentation in the Classifier Training Stage

In the classifier training phase, the goal is to generate additional feature embeddings to further reduce the overwhelming dominance of head classes in the representation space. It is natural to use feature augmentation to calibrate the ill-defined decision boundary. Following the Joint Bayesian face model [3], typical feature augmentation methods [26, 34, 38] assume that the features $\textbf{z}_{i}$ of class k follow a Gaussian distribution with a class mean $\mu _k$ and a covariance matrix $\varSigma _k$. The mean of a class is estimated as the arithmetic mean of all features in the same class by $\mu _{k}=\frac{1}{N_{k}} \sum _{i \in \mathcal {F}_{k}} \textbf{z}_{i}$.

However, the means of the Gaussian distributions for tail classes are biased due to the small sample size of the tail categories and the large variance in data distribution between the train and test datasets. This bias causes the distribution of the generated data to deviate significantly from the data distribution of the validation set. It leads to a significant performance drop, and even to the destruction of the original representation space. Therefore, as illustrated in Fig. 2, we leverage the meta-prototypes $\boldsymbol{\hat{c}}^{k}$ as the ‘anchor’ to replace the typical class statistics $\mu _{k}$ and provide a reasonable feature distribution of new feature samples for tail classes.

Given a trained backbone (discussed in Sect. 3.2), we first pre-compute feature representations for the entire dataset. These features of true samples are denoted as $\mathcal {F}=\left\{ \textbf{z}_{i}\right\} _{i=1}^{N}$. $\mathcal {F}_k $ denotes the features of images belonging to category k. For each class k, we sample $N_1 - N_k$ additional features, such that the resulting feature dataset is completely balanced and all classes have $N_1$ instances. Sampling is performed based on an instance-specific calibrated distribution. Specifically, each $\textbf{z}_{ik}$ ($i^{th}$ feature from category k) is responsible for generating $ s_\mathrm{{new}} = \max \left\{ \left[ N_{1} / N_{k}-1\right] _{+}, 1\right\} $ features, where $[\cdot ]_{+}$ is the ceiling function.
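The per-instance budget $s_\mathrm{new}$ can be computed directly from the class counts. The short sketch below does so for a few hypothetical class sizes chosen within the ImageNet-LT range (5 to 1,280 images per class); the function name and the example counts are illustrative assumptions.

```python
import math

def features_per_instance(n_head, n_k):
    """s_new = max(ceil(N_1 / N_k - 1), 1) for a class with N_k samples."""
    return max(math.ceil(n_head / n_k - 1), 1)

# Hypothetical class counts, sorted so that the first entry is the head class N_1.
class_counts = [1280, 320, 20, 5]
s_new = [features_per_instance(class_counts[0], n) for n in class_counts]
```

For these counts, the rarest class (5 samples) generates 255 features per real instance, which brings it to 5 × (1 + 255) = 1,280 features in total, matching the head-class size.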

Based on the learned meta-prototype, the features covariance for the corresponding class can be calculated as:

$$\begin{aligned} \begin{aligned} \varSigma _{k}=\frac{1}{N_{k}-1} \sum _{i \in \mathcal {F}_{k}}\left( \textbf{z}_{i}-\boldsymbol{\hat{c}}^{k}\right) \left( \textbf{z}_{i}-\boldsymbol{\hat{c}}^{k}\right) ^{T}, \end{aligned} \end{aligned}$$

(9)

where $\varSigma _{k} \in \mathbb {R}^{d \times d}$ denotes the full covariance of the Gaussian distribution for category k. Next, for each feature $\tilde{\textbf{z}_{i}}$ belonging to a tail class and processed by Tukey’s Ladder of Powers transformation [24], we calculate its similarity with each other class k that has more training samples as $ S_{i,k} = \tilde{\textbf{z}_{i}}^{\top } \cdot \boldsymbol{\hat{c}}^{k} / \left( \Vert \tilde{\textbf{z}_{i}} \Vert \cdot \Vert \boldsymbol{\hat{c}}^{k}\Vert \right) $. We identify the set $\mathcal {N}_{i}$ of the M category indices with the maximum cosine similarity as neighbors (here M denotes the number of neighbors, not the number of meta-samples). We calibrate the distribution of feature $\tilde{\textbf{z}_{i}}$ as:

$$\begin{aligned} \begin{aligned} \mu _{\tilde{\textbf{z}_{i}}} = (1-\alpha ) \cdot \tilde{\textbf{z}_{i}} + \alpha \cdot \frac{1}{M} \sum _{k \in \mathcal {N}_{i}} \frac{e^{S_{i,k}}}{\sum _{j=1}^{M} e^{S_{i,j}}} \cdot \hat{c}^{k} \\ \varSigma _{\tilde{\textbf{z}_{i}}} = (1-\alpha )^{2} \cdot \varSigma _{i} + \alpha ^{2} \cdot \frac{1}{M} \sum _{k \in \mathcal {N}_{i}} \frac{e^{S_{i,k}}}{\sum _{j=1}^{M} e^{S_{i,j}}} \cdot \varSigma _{k} + \beta , \end{aligned} \end{aligned}$$

(10)

where $\alpha $ is the hyper-parameter that balances the degree of calibration and $\beta $ is an optional constant hyper-parameter that increases the spread of the calibrated distribution (both are distinct from the step sizes in Eqs. 6 and 7). We found that $\beta = 0.05 $ works reasonably well for multiple experiments. We generate the new samples with the same associated class label and denote the new samples for category k as $\mathcal {F}_{k}^{*}$. This combined set of features is generated for all categories and used to train the classifier. As shown in Fig. 3, we generate features using our meta-prototype feature augmentation and re-build the t-SNE visualization in the right plot. Compared with the left plot, which shows the features before generation, the right plot is easier to interpret and has clearer feature boundaries. In addition, due to the meta-prototype, the newly generated features are close to the validation samples, which further improves the performance of the classifier.
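The mean part of Eq. 10 can be sketched as follows: transform the tail feature, rank the meta-prototypes by cosine similarity, keep the top neighbors, and blend the feature with their softmax-weighted average. The Tukey exponent `lam=0.5`, the number of neighbors, the toy two-dimensional prototypes, and all function names are assumptions for illustration; the covariance line of Eq. 10 would follow the same weighting pattern.

```python
import math

def tukey(x, lam=0.5):
    """Tukey's ladder of powers for non-negative inputs: x**lam, or log(x) if lam == 0."""
    if lam == 0:
        return [math.log(v) for v in x]
    return [v ** lam for v in x]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def calibrated_mean(z, prototypes, num_neighbors=2, alpha=0.4):
    """Mean of Eq. 10, including the literal 1/M factor from the formula."""
    sims = {k: cosine(z, c) for k, c in prototypes.items()}
    # Neighbor set N_i: the classes with the highest cosine similarity.
    neighbors = sorted(sims, key=sims.get, reverse=True)[:num_neighbors]
    denom = sum(math.exp(sims[k]) for k in neighbors)
    weights = {k: math.exp(sims[k]) / denom for k in neighbors}
    pooled = [
        sum(weights[k] * prototypes[k][j] for k in neighbors) / num_neighbors
        for j in range(len(z))
    ]
    mean = [(1.0 - alpha) * zj + alpha * pj for zj, pj in zip(z, pooled)]
    return mean, weights

# Hypothetical meta-prototypes of three better-represented classes.
meta_prototypes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
z_tilde = tukey([4.0, 1.0])
mu, w = calibrated_mean(z_tilde, meta_prototypes)
```

Setting `alpha=0.0` returns the transformed feature unchanged, which corresponds to no calibration at all.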

4 Experiments

4.1 Long-Tailed Image Classification Task

Table 1. Top 1 accuracy of CIFAR-100-LT with various imbalance factors (100, 50, 10). RL, DT, and DA indicate representation learning, decouple training, and data augmentation, respectively.


Datasets and Setup. We perform experiments on long-tailed image classification datasets, including the CIFAR-100-LT [12], ImageNet-LT [18] and iNaturalist2018 [25].

CIFAR-100-LT is based on the original CIFAR-100 dataset, with the number of training samples per class determined by the imbalance ratio (we adopt imbalance ratios of 10, 50 and 100 in our experiments).
ImageNet-LT is a long-tailed version of the ImageNet dataset, built by sampling a subset following the Pareto distribution with power value 6. It contains 115.8K images from 1,000 categories, with class cardinality ranging from 5 to 1,280.
iNaturalist2018 is the largest dataset for long-tailed visual recognition. It contains 437.5K images from 8,142 categories. It is extremely imbalanced, with an imbalance factor of 512.
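To make the imbalance ratio concrete, the sketch below builds a per-class sample-count profile for a CIFAR-100-LT style split. It assumes the exponential decay profile commonly used for such benchmarks, which the text above does not spell out, and the function name is hypothetical. The 500 training images per class is the size of the original CIFAR-100 training set.

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Assumed exponential profile: N_k = n_max * (1 / ratio) ** (k / (K - 1))."""
    return [
        int(n_max * (1.0 / imbalance_ratio) ** (k / (num_classes - 1)))
        for k in range(num_classes)
    ]

# CIFAR-100 has 500 training images per class; ratio 100 is one of the three settings.
counts = long_tailed_counts(n_max=500, num_classes=100, imbalance_ratio=100)
head, tail = counts[0], counts[-1]
```

Under this assumed profile, the head class keeps all 500 images and the counts shrink monotonically toward the tail, so that the ratio between the largest and smallest class is approximately the imbalance ratio.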

Experimental Details. We implement all experiments in PyTorch. On CIFAR-100-LT, following [20], we use ResNet-32 [31] as the feature extractor for all methods. We conduct model training with the SGD optimizer with batch size 256 and momentum 0.9 under three imbalance ratios (10, 50 and 100). For image classification on ImageNet-LT, following [5, 8, 23], we use ResNeXt-50 [31] as the feature extractor for all methods. We conduct model training with the SGD optimizer with batch size 512 and momentum 0.9. Under both training schedules (90 and 200 epochs), the learning rate is decayed by a cosine scheduler [19] from 0.2 to 0.0. On the iNaturalist2018 [25] dataset, we use ResNet-50 [31] as the feature extractor for all methods with 200 training epochs and otherwise the same experimental settings. Moreover, we use the same basic data augmentation (i.e., random resize and crop to 224, random horizontal flip, color jitter, and normalization) for all methods.
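The cosine decay from 0.2 to 0.0 mentioned above can be written in a few lines. This sketch uses the standard cosine annealing formula without warm-up or restarts; whether the original training used warm-up, or stepped per epoch or per iteration, is not stated, so those details are assumptions.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=0.2, lr_min=0.0):
    """Cosine annealing: lr_max at epoch 0, lr_min at the final epoch."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

# Learning rates over a 90-epoch schedule (one of the two schedules in the text).
schedule_90 = [cosine_lr(e, 90) for e in range(91)]
```

The schedule starts at 0.2, passes through 0.1 at the midpoint, and reaches 0.0 at the last epoch.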

Comparison with State of the Arts. As shown in Table 1, to prove the versatility of our method, we employ our method on the CIFAR-100-LT dataset with three imbalance ratios. We compare against the most relevant methods and choose methods that are recently published and representative of different types, such as class re-balancing, decouple training and data augmentation. Our method surpasses the DRO-LT [21] under various imbalance factors, especially on the largest imbalance factor (52.3% vs 47.3%). Furthermore, compared with the data augmentation methods [38], our model achieves competitive performance (52.3% vs 51.7% with 100 imbalance factor).

Table 2 shows the long-tailed results on ImageNet-LT. We adopt the performance data from the deep long-tailed survey [36] for various methods at 90 and 200 training epochs to make a fair comparison. Our approach achieves 53.8% and 55.1% in overall accuracy, which outperforms the state of the art methods by a significant margin at 90 and 200 training epochs, respectively. Compared with representation learning methods, our method surpasses SSP by 0.7% (53.8% vs 53.1%) at 90 training epochs and outperforms SSP by 1.8% (55.1% vs 53.3%) at 200 training epochs. In addition, our method obtains higher performance by 1.1% (53.8% vs 52.7%) and 0.7% (55.1% vs 54.4%) than PaCo at 90 and 200 training epochs, respectively.

Table 2. Results on ImageNet-LT in terms of accuracy (Acc) under 90 and 200 training epochs. In this table, CR, DT, and RL indicate class re-balancing, decouple training, and representation learning, respectively.


Furthermore, Table 3 presents the experimental results on the naturally-skewed dataset iNaturalist2018. Compared with the improvements brought by representation learning, decoupled training and data augmentation approaches, our method consistently achieves a competitive result (71.0%).

4.2 Semantic Segmentation on ADE20K Dataset

To further validate our method, we apply our strategy to segmentation networks and report our performance on the semantic segmentation benchmark, ADE20K.

Dataset and Setup. ADE20K is a scene parsing dataset covering 150 fine-grained semantic concepts and it is one of the most challenging semantic segmentation datasets. The training set contains 20,210 images with 150 semantic classes. The validation and test set contain 2,000 and 3,352 images respectively.

Table 3. Benchmarking on iNaturalist2018 in Top 1 accuracy (%). RL, DT, and DA indicate representation learning, decouple training, and data augmentation.


Experimental Details. We evaluate our method using two widely adopted segmentation models (OCRNet [33] and DeepLabV3+ [4]) based on different backbone networks. We initialize the backbones using the models pre-trained on ImageNet [6] and the rest of the framework randomly. All models are trained with an image size of $512 \times 512$ and 80K/160K iterations in total. We train the models using the Adam optimizer with an initial learning rate of 0.01, weight decay of 0.0005, and momentum of 0.9. The learning rate decays dynamically according to the ‘poly’ strategy.
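For reference, the 'poly' (polynomial) learning-rate schedule widely used in semantic segmentation can be sketched as below. The exponent `power=0.9` is a common default and an assumption here, since the text does not give it; the base learning rate 0.01 and the 80K iteration budget come from the setup above.

```python
def poly_lr(iteration, max_iters, base_lr=0.01, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iters) ** power."""
    return base_lr * (1.0 - iteration / max_iters) ** power

# Learning rate at the start, the midpoint, and the end of an 80K-iteration run.
lr_start = poly_lr(0, 80_000)
lr_mid = poly_lr(40_000, 80_000)
lr_end = poly_lr(80_000, 80_000)
```

The rate falls smoothly from the base value to zero, and with `power` below 1 it stays above a straight linear decay at the midpoint.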

Comparison with State of the Arts. The numerical results and comparison with other peer methods are reported in Table 4. Our method achieves 1.1% and 0.5% improvement in mIoU using OCRNet with HRNet-W18 when the iterations are 80K and 160K, respectively. Moreover, our method outperforms the baseline by 0.9% and 1.1% in mIoU using DeepLabV3+ with ResNet-50 when the iterations are 80K and 160K, respectively. Even with a stronger backbone, ResNet-101, our method also achieves improvements of 0.8% mIoU and 0.9% over the baseline. Compared with DisAlign, our method consistently outperforms it in both mIoU and mAcc with various backbones.

Table 4. Performance of semantic segmentation on ADE20K. R-50 and R-101 denote ResNet-50 and ResNet-101, respectively.


4.3 Ablation Study

We conduct an ablation study on the ImageNet-LT dataset to further understand the hyper-parameters of our method and the effect of each proposed component. All variants are trained with ResNeXt-50 for 90 epochs for a fair comparison.

$\lambda $ in Meta Training Loss. One major hyper-parameter in our method is $\lambda $ in Eq. 3, which adjusts the weight of the prototype contrastive term in the meta training loss. We set the hyper-parameter $\lambda \in \left\{ 0.1, 0.3, 0.5, 0.7, 0.9\right\} $ and study the sensitivity of the accuracy to the value of $\lambda $. Figure 4(a) quantifies the effect of the trade-off parameter $\lambda $ on the validation accuracy. It shows that combining $\mathcal {L}_{PC}$ and $\mathcal {L}_{CE}$ with $\lambda = 0.5$ gives the best results.

$\alpha $ in Meta-Prototype Feature Generation. In Eq. 10, we introduce a class-wise confidence score $\alpha $ which controls the degree of distribution calibration. We initialize $\alpha $ to 0.2 for each tail class and it changes adaptively during training. We set the hyper-parameter $\alpha $ in the interval from 0.2 to 1 with a stride of 0.2 and take the five sets of values to conduct ablation experiments as shown in Fig. 4(b). Overall, the larger $\alpha $ means more confidence to transfer the knowledge from head to tail classes. The optimal $\alpha $ for ImageNet-LT is 0.4.

Effectiveness of MPCL and MPFA. Table 5 verifies the critical roles of our adaptive modules for meta-prototype contrastive learning (MPCL) and meta-prototype feature augmentation (MPFA). The baseline only performs the decoupled training pipeline without using any components of our method. In the representation learning stage, our MPCL module significantly surpasses DRO-LT and KCL (52.7% vs 51.9% vs 51.5%). Moreover, in the classifier training stage, our MPFA module further boosts the performance, especially in the tail classes (53.8% vs 53.2%). The results suggest the effectiveness of both the MPCL and MPFA components in improving the training performance.

Table 5. Ablation study on ImageNet-LT for different decouple methods.


5 Conclusion

In this paper, we have proposed a novel meta-prototype decoupled training framework to tackle the long-tail challenge. Our decoupled training framework includes calibrating the empirical prototype for SCL in the representation learning stage and enhancing feature embedding for tail classes based on learned meta-prototype in the classifier learning stage. The first module of our method completes the meta-prototype to promote the representation quality of supervised prototype contrastive learning. The second module leverages the learned meta-prototype to provide the reasonable feature distribution of new feature samples for tail classes. We sample features from the calibrated distribution to ease the dominance of the head classes in classification decisions. The experimental results show that our method achieves state-of-the-art performances for various settings on long-tailed learning.

References

Andrychowicz, M., et al.: Learning to learn by gradient descent by gradient descent. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-distribution-aware margin loss. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Chen, D., Cao, X., Wang, L., Wen, F., Sun, J.: Bayesian face revisited: a joint formulation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 566–579. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_41
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 833–851. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_49
Cui, J., Zhong, Z., Liu, S., Yu, B., Jia, J.: Parametric contrastive learning (2021)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126–1135. PMLR (2017)
Hong, Y., Han, S., Choi, K., Seo, S., Kim, B., Chang, B.: Disentangling label distribution for long-tailed visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6626–6636 (2021)
Kang, B., et al.: Decoupling representation and classifier for long-tailed recognition (2019)
Kang, B., Li, Y., Xie, S., Yuan, Z., Feng, J.: Exploring balanced feature spaces for representation learning. In: International Conference on Learning Representations (2021)
Khosla, P., et al.: Supervised contrastive learning. Adv. Neural. Inf. Process. Syst. 33, 18661–18673 (2020)
Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
Li, J., Xiong, C., Hoi, S.C.: Mopro: webly supervised learning with momentum prototypes. arXiv preprint arXiv:2009.07995 (2020)
Li, S., Gong, K., Liu, C.H., Wang, Y., Qiao, F., Cheng, X.: Metasaug: meta semantic augmentation for long-tailed visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5212–5221 (2021)
Li, T., et al.: Targeted supervised contrastive learning for long-tailed recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6918–6928 (2022)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., Yu, S.X.: Large-scale long-tailed recognition in an open world. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2537–2546 (2019)
Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., Yu, S.X.: Large-scale long-tailed recognition in an open world. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
Ren, J., Yu, C., Ma, X., Zhao, H., Yi, S., et al.: Balanced meta-softmax for long-tailed visual recognition. Adv. Neural. Inf. Process. Syst. 33, 4175–4186 (2020)
Samuel, D., Chechik, G.: Distributional robustness loss for long-tail learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9495–9504 (2021)
Shu, J., et al.: Meta-weight-net: learning an explicit mapping for sample weighting. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Tang, K., Huang, J., Zhang, H.: Long-tailed classification by keeping the good and removing the bad momentum causal effect. Adv. Neural. Inf. Process. Syst. 33, 1513–1524 (2020)
Tukey, J.W., et al.: Exploratory data analysis, vol. 2. Reading, MA (1977)
Van Horn, G., et al.: The inaturalist species classification and detection dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8769–8778 (2018)
Vigneswaran, R., Law, M.T., Balasubramanian, V.N., Tapaswi, M.: Feature generation for long-tail classification. In: Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing, pp. 1–9 (2021)
Wang, J., Lukasiewicz, T., Hu, X., Cai, J., Xu, Z.: RSG: a simple but effective module for learning imbalanced datasets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3784–3793 (2021)
Wang, P., Han, K., Wei, X.S., Zhang, L., Wang, L.: Contrastive learning based hybrid networks for long-tailed image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 943–952 (2021)
Wang, X., Lian, L., Miao, Z., Liu, Z., Yu, S.: Long-tailed recognition by routing diverse distribution-aware experts. In: International Conference on Learning Representations (2021)
Wang, Y., Pan, X., Song, S., Zhang, H., Huang, G., Wu, C.: Implicit semantic data augmentation for deep networks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431 (2016)
Yang, Y., Xu, Z.: Rethinking the value of labels for improving class-imbalanced learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 19290–19301 (2020)
Yuan, Y., Wang, J.: OCNET: object context network for scene parsing (2018)
Zang, Y., Huang, C., Loy, C.C.: FASA: feature augmentation and sampling adaptation for long-tailed instance segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3457–3466 (2021)
Zhang, S., Li, Z., Yan, S., He, X., Sun, J.: Distribution alignment: a unified framework for long-tail visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2361–2370 (2021)
Zhang, Y., Kang, B., Hooi, B., Yan, S., Feng, J.: Deep long-tailed learning: a survey. arXiv preprint arXiv:2110.04596 (2021)
Zhang, Y., Wei, X.S., Zhou, B., Wu, J.: Bag of tricks for long-tailed visual recognition with deep convolutional neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 3447–3455 (2021)
Zhang, Z., Xiang, X.: Long-tailed classification with gradual balanced loss and adaptive feature generation (2022)
Zhong, Z., Cui, J., Liu, S., Jia, J.: Improving calibration for long-tailed recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16489–16498 (2021)
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641 (2017)
Zhou, B., Cui, Q., Wei, X.S., Chen, Z.M.: BBN: bilateral-branch network with cumulative learning for long-tailed visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9719–9728 (2020)
Zhu, B., Niu, Y., Hua, X.S., Zhang, H.: Cross-domain empirical risk minimization for unbiased long-tailed classification. In: AAAI Conference on Artificial Intelligence (2022)
Zhu, L., Yang, Y.: Inflated episodic memory with region self-attention for long-tailed visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4344–4353 (2020)

Acknowledgments

This work is supported by the National Natural Science Foundation of China (U21B2004), the Zhejiang Provincial Key R&D Program of China (2021C01119), and the Zhejiang University-Angelalign Inc. R&D Center for Intelligent Healthcare.

Author information

Authors and Affiliations

College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
Siming Fu, Huanpeng Chu, Xiaoxuan He, Hualiang Wang & Haoji Hu
Shenzhen TP-LINK Digital Technology Co., Ltd., Shenzhen, China
Zhenyu Yang

Corresponding author

Correspondence to Haoji Hu.

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

About this paper

Cite this paper

Fu, S., Chu, H., He, X., Wang, H., Yang, Z., Hu, H. (2023). Meta-prototype Decoupled Training for Long-Tailed Learning. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13846. Springer, Cham. https://doi.org/10.1007/978-3-031-26351-4_16

DOI: https://doi.org/10.1007/978-3-031-26351-4_16
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26350-7
Online ISBN: 978-3-031-26351-4
eBook Packages: Computer Science, Computer Science (R0)
