1 Introduction

Most real-world data is long-tailed: a few head classes contribute the majority of the data, while most tail classes contain relatively few samples. An undesired consequence is that models [2, 29] trained on long-tailed data perform well on head classes but exhibit extremely low accuracy on tail classes.

To address this problem, a large number of studies have been conducted in recent years, making promising progress in deep long-tailed learning. Supervised contrastive learning (SCL) has been the main focus of many of these techniques. Mainstream SCL methods [17, 43] seek to learn class-specific feature prototypes to enhance long-tailed learning performance. DRO-LT [21] explores the idea of feature prototypes to handle long-tailed recognition in an open world. Following that, TSC [15] drives the features of different classes to converge to targets that are uniformly distributed over the hyper-sphere during training.

Nevertheless, when a class has only a few samples, the distribution of its training samples may not represent the true data distribution well. The shift between the test and training distributions causes the prototypes of tail classes to be offset [21]. The above works are all based on the empirical prototype under imbalanced data, which limits the effectiveness of the feature representation. The sub-optimal prototypes therefore hinder SCL methods from learning high-quality representations and confuse the optimization for long-tailed learning.

To alleviate the above issues, we propose supervised meta-prototype contrastive learning, which calibrates the empirical prototype under the imbalanced setting. Specifically, we extend a meta-learner to automatically restore the meta-prototypes of feature embeddings via two nested optimization loops, which keeps the meta-prototype contrastive learning algorithm efficient. Our major insight is to parameterize the mapping function as a meta-network, which is theoretically a universal approximator for almost all continuous functions, and then use the meta-data (a small unbiased validation set) to guide the training of all the meta-network parameters. The meta-prototypes provide more meaningful feature prototypes that are designed to be robust against possible shifts of the test distribution and that guide SCL toward a discriminative feature representation space.

To further ease the dominance of the head classes in classification decisions, we develop a calibrated feature augmentation algorithm based on the learned meta-prototype in the classifier training stage. Specifically, we use the meta-prototype as the ‘anchor’ of the corresponding class, where it stands in for the mean of the class statistics under the imbalanced setting. In contrast to typical methods, which generate new feature samples based on the class statistics of the imbalanced training data, our meta-prototype calibrates the bias and provides a reasonable feature distribution for new feature samples of tail classes. The newly generated features are sampled from the calibrated distribution and help to find the correct classifier decision boundary by improving the performance on severely under-represented tail classes.

We summarize the above processes as the meta-prototype decoupled training framework, which calibrates the empirical prototype for SCL in the representation learning stage and enhances the feature embeddings of tail classes based on the learned meta-prototype in the classifier learning stage. We extensively validate our model on typical visual recognition tasks, including image classification on three benchmarks (CIFAR-100-LT [12], ImageNet-LT [18] and iNaturalist2018 [25]) and semantic segmentation on the ADE20K dataset [40]. The experimental results demonstrate that our method consistently outperforms the state-of-the-art approaches on all the benchmarks.

Summary of Contributions:

  • To the best of our knowledge, we are the first in long-tailed learning to introduce a meta-prototype to promote the representation quality of supervised prototype contrastive learning in the representation learning stage.

  • On top of the learned meta-prototype, we develop a feature augmentation algorithm for tail classes to ease the dominance of the head classes in classification decisions in the classifier learning stage.

  • Our method outperforms previous works by a large margin and achieves state-of-the-art performance on long-tailed image classification and semantic segmentation tasks.

2 Related Work

Supervised Contrastive Learning. Existing supervised contrastive learning-based methods for long-tailed learning seek to help alleviate the biased label effect. DRO-LT [21] extends standard contrastive loss and optimizes against the worst possible centroids within a safety hyper ball around the empirical centroid. KCL [10] develops a new method to explicitly pursue balanced feature space for representation learning. TSC [15] generates a set of targets uniformly distributed on a hypersphere and makes the features of different classes converge to these distinct and uniformly distributed targets during training. Hybrid-SC [28] explores the effectiveness of supervised contrastive learning. It introduces prototypical supervised learning to obtain better features and resolve the memory bottleneck. The above works are all based on the empirical prototype under imbalanced data, which limits the effectiveness of feature representation. To alleviate the above issue, we introduce the meta-prototype to calibrate the empirical prototype, further constructing a discriminative feature space.

Meta-learning. The recent development of meta-learning [1, 7] inspires researchers to leverage meta-learning to handle class imbalance. Meta-weight-net [22] introduces a method capable of adaptively learning an explicit weighting function directly from data. MetaSAug [14] proposes to augment tail classes with a variant of ISDA [30] by estimating the covariance matrices for tail classes. Motivated by these works, our method attempts to automatically estimate the meta-prototype of each class to calibrate the empirical prototype for high-quality feature representation.

Data Augmentation for Long-Tailed Learning. In long-tailed learning, transfer-based augmentation has been explored. Transfer-based augmentation seeks to transfer knowledge from head classes to improve model performance on tail classes. TailCalibX [26] and GLAG [38] attempt to generate meaningful features by estimating the distribution of each tail category. RSG [27] dynamically estimates a set of feature centers for each class, and uses the feature displacement between head-class sample features and the nearest intra-class feature center to augment each tail sample feature. However, the estimated distribution of a tail category and the intra-class feature centers are biased or unreasonable due to the imbalanced size of the training dataset. Our meta-prototype feature augmentation algorithm calibrates the bias and predicts likely shifts of the test distribution.

Decoupled Scheme for Long-Tailed Learning. Decoupling [9] is a pioneering work that introduces a two-step training scheme. It empirically evaluates different sampling strategies for representation learning in the first step, and then, in the second step, evaluates different classifier training schemes with the feature extractor trained in the first step kept fixed. Decoupling [9] and Bag of tricks [37] both separate the learning procedure into representation learning and classification, and systematically explore how different balancing strategies affect each stage for long-tailed recognition. BBN [41] further unifies the two stages to form a cumulative learning strategy. MiSLAS [39] enhances representation learning with data mixup in the first stage and applies a label-aware smoothing strategy in the second stage for better model generalization. Our method also adopts the two-stage decoupled training scheme, which leads to better long-tailed learning performance.

3 The Proposed Methods

3.1 Problem Definition

For long-tailed learning, let \( \mathcal { D }^{tra} = \{ \boldsymbol{x}^i, y^i\}\), \( i\in \{1, \cdots , N\} \), be the training set, where \(\boldsymbol{x}^i\) denotes an image sample and \(y^i\) indicates its class label. Let K be the total number of classes and \(N_k\) be the number of samples in class k, where \(\sum _{k=1}^{K} N_{k}=N\). A long-tailed setup can be defined by ordering the classes by their number of samples, i.e. \(N_{1} \ge N_{2} \ge \ldots \ge N_{K}\) and \(N_{1} \gg N_{K}\) after sorting. Under the long-tailed setting, the training dataset is imbalanced, leading to poor performance on tail classes.

We train a network \(\boldsymbol{ \varPsi }(\cdot ; \boldsymbol{W})\) consisting of two components: (i) a backbone or representation network (a CNN for images) that translates an image into a feature representation \(\boldsymbol{z}_{i}= \boldsymbol{\varPsi }(\boldsymbol{x}^i; \boldsymbol{w^{E}}) \in \mathbb {R}^{1 \times d}\) and (ii) a classifier \(\boldsymbol{w}^{C} \in \mathbb {R}^{K \times d} \) that predicts the category-specific scores (logits). As shown in Fig. 1, given a pair \( (\boldsymbol{x}^i, y^i) \) sampled from a mini-batch \( \mathcal { B } \subset \mathcal { D }^{tra} \), the feature vector \(\boldsymbol{z}_{i}\) is extracted by the feature extractor and then projected onto the classifier to output the classification logits. Too few samples in the tail classes result in inadequate learning of tail-class representations.

Fig. 1. Overview of our proposed method during training. The upper box introduces the meta-prototype, which consists of the following steps in sequence: sampling a mini-batch of images \(\mathcal {B}\) from the training set \(\mathcal {D}^{tra}\), learning features with the feature extractor \(\boldsymbol{\varPsi }(\cdot ; \boldsymbol{w^{E} })\), embedding the features onto the hyper-sphere, estimating the prototypes for the classes, and learning meta-prototypes for a discriminative representation space. The bottom box introduces the meta-prototype feature augmentation algorithm, which enriches the samples of tail classes to re-build the classifier decision boundaries.

3.2 Supervised Meta-prototype Contrastive Learning in the Representation Learning Stage

Supervised contrastive learning introduces cluster-based prototypes and encourages embeddings to gather around their corresponding prototypes. Our original feature prototypes follow MoPro [13] and are updated with an exponential moving average (EMA) during training:

$$\begin{aligned} \boldsymbol{c}_{k} \leftarrow m \boldsymbol{c}_{k}+(1-m) \boldsymbol{z}_{i}, \quad \forall i \in \left\{ i \mid \hat{y}_{i}=k\right\} , \end{aligned}$$
(1)

where \(\boldsymbol{c}_{k}\) is the prototype for class k and m is the momentum coefficient, usually set to 0.999. Then, given the embedding \(z_i^f\), the prototypes are queried with contrastive similarity matching. The prototype contrastive loss [13, 21] is defined as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\textrm{PC}}=-\log \left[ \frac{\exp \left( z_{i}^{f} \cdot c^{k} / \tau \right) }{\sum _{j=1}^{K} \exp \left( z_{i}^{f} \cdot c^{j} / \tau \right) }\right] , \end{aligned} \end{aligned}$$
(2)

where \(\tau \) is a hyper-parameter and usually set as 0.07 [11]. The neural network is denoted as \(f(\cdot , \textbf{W})\), and \(\textbf{W}\) denotes all of its parameters. Generally, the optimal network parameter \(\textbf{W}^{*}\) can be extracted by minimizing the training loss:

$$\begin{aligned} \begin{aligned} \mathcal {L}^{{\text {train}}}(\textbf{W} ; \boldsymbol{c}^{k}) = \mathcal {L}_{CE}(\textbf{W}) + \lambda \cdot \mathcal {L}_{{\text {PC}}}(\textbf{W}, \boldsymbol{c}^{k}), \end{aligned} \end{aligned}$$
(3)

where \(\lambda \) denotes the weighting coefficient to balance the two loss terms and \(\mathcal {L}_{CE}\) is the cross-entropy loss. As aforementioned, the empirical prototypes of tail classes can be far away from the ground-truth prototypes due to the limited features of tail classes and large variances in data distribution between training and test datasets. Therefore, we aim to learn appropriate feature prototypes to perform reasonable feature representation learning.
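Eqs. 1 to 3 can be illustrated with a minimal pure-Python sketch. It is not the training code of the method: the helper names (`normalize`, `ema_update`, `prototype_contrastive_loss`, `train_loss`) are hypothetical, features are plain lists, and the unit-norm step is an assumption based on the embeddings being placed on the hyper-sphere.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (assumed: embeddings lie on the hyper-sphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ema_update(prototype, z, m=0.999):
    """Eq. 1: c_k <- m * c_k + (1 - m) * z_i for one sample z_i of class k."""
    return [m * c + (1.0 - m) * x for c, x in zip(prototype, z)]


def prototype_contrastive_loss(z, prototypes, k, tau=0.07):
    """Eq. 2: negative log-softmax of (z . c^k / tau) over all K prototypes."""
    logits = [sum(a * b for a, b in zip(z, c)) / tau for c in prototypes]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[k]


def train_loss(ce_loss, pc_loss, lam=0.5):
    """Eq. 3: L_train = L_CE + lambda * L_PC."""
    return ce_loss + lam * pc_loss


# Toy usage with two classes in a 2-D embedding space.
prototypes = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
z = normalize([0.9, 0.1])
loss_correct = prototype_contrastive_loss(z, prototypes, k=0)
loss_wrong = prototype_contrastive_loss(z, prototypes, k=1)
prototypes[0] = ema_update(prototypes[0], z)
```

In this toy setting the loss is small when the embedding is matched with its nearest prototype and large otherwise, which is the behaviour the prototype contrastive term is meant to encourage.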

The whole process of meta-prototype contrastive learning is summarized in Algorithm 1. In the presence of imbalanced training data, our method calibrates the empirical prototypes with a prototype meta-network, denoted as \(\mathcal {C}(\boldsymbol{c}^{k} ; \varTheta )\), where \(\boldsymbol{c}^{k}\) is the input of the meta-network and \(\varTheta \) represents its parameters. The meta-network is an MLP that maps the empirical prototype \(\boldsymbol{c}^{k}\) to the meta-prototype \(\boldsymbol{\hat{c}}^{k}\). It has an encoder-decoder structure, where the encoder contains one linear layer with a ReLU activation function, and the decoder consists of a Linear-ReLU-Linear structure. The optimal network parameter \(\textbf{W}^{*}\) is calculated by minimizing the following training loss:

$$\begin{aligned} \begin{aligned} \textbf{W}^{*}(\varTheta )&=\underset{\textbf{W}}{\arg \min } \mathcal {L}^{{\text {train}}}(\textbf{W} ; \boldsymbol{c}^{k}; \varTheta ) \\ {}&= \underset{\textbf{W}}{\arg \min }\left\{ \mathcal {L}_{CE}(\textbf{W}) + \lambda \cdot \mathcal {L}_{{\text {PC}}}( \textbf{W}, \mathcal {C}(\boldsymbol{c}^{k}; \varTheta ))\right\} . \end{aligned} \end{aligned}$$
(4)

The parameters contained in the meta-network can be optimized by using the meta-learning idea. The optimal parameter \(\varTheta ^{*}\) can be obtained by minimizing the following meta-loss:

$$\begin{aligned} \begin{aligned} \varTheta ^{*}&=\underset{\varTheta }{\arg \min }\ \mathcal {L}^{m e t a}\left( \textbf{W}^{*}(\varTheta )\right) , \end{aligned} \end{aligned}$$
(5)

where \(\mathcal {L}^{\text{ meta } }(\textbf{W})=\mathcal {L}_{CE}\left( y_{i}^{(\text{ meta } )}, f\left( x_{i}^{(\text{ meta } )}, \textbf{W}\right) \right) \) is evaluated on the meta-data. Specifically, following the meta-learning methods [14, 22] for long-tailed learning, we construct a small balanced meta-data set (i.e., with a balanced class distribution) \(\left\{ x_{i}^{(\text{ meta } )}, y_{i}^{(\text{ meta } )}\right\} _{i=1}^{M}\) to represent the meta-knowledge of the ground-truth sample-label distribution, where M is the number of meta-samples and \(M \ll N\).
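The text does not specify how the balanced meta-data set is drawn. As one hypothetical construction, the sketch below picks the same number of samples from every class of a labelled pool; the function name `sample_balanced_meta_set` and the parameters `per_class` and `seed` are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_balanced_meta_set(labels, per_class, seed=0):
    """Return indices containing exactly `per_class` samples of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    meta_indices = []
    for y in sorted(by_class):
        pool = by_class[y]
        if len(pool) < per_class:
            raise ValueError(f"class {y} has only {len(pool)} samples")
        # Sampling without replacement keeps the meta-set free of duplicates.
        meta_indices.extend(rng.sample(pool, per_class))
    return meta_indices


# Toy long-tailed label list: 50 / 10 / 3 samples for classes 0 / 1 / 2.
labels = [0] * 50 + [1] * 10 + [2] * 3
meta = sample_balanced_meta_set(labels, per_class=2)
```

With `per_class` samples from each of the K classes, the meta-set has M = K × `per_class` entries, which stays far below N as long as `per_class` is small.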

Online approximation. To estimate the optimal feature prototypes for different classes efficiently, we adopt two nested optimization loops and optimize the model in a meta-learning setup. i) The network parameter is updated by moving the current \(\textbf{W}^{(t)}\) along the descent direction of the objective loss in Eq. 4 on a mini-batch of training data:

$$\begin{aligned} \begin{aligned} \hat{\textbf{W}}^{(t)}(\varTheta )\leftarrow \textbf{W}^{(t)}-\alpha \times \nabla _{\textbf{W}^{(t)}}\mathcal {L}^{{\text {train}}}(\textbf{W}; \boldsymbol{c}^{k}; \varTheta ), \end{aligned} \end{aligned}$$
(6)

where \(\alpha \) is the step size. ii). After receiving the updated network parameters \(\hat{\textbf{W}}^{(t)}(\varTheta )\), the parameter \(\varTheta \) of the meta-network can then be readily updated by Eq. 5, i.e., moving the current parameter \(\varTheta ^{(t)}\) along the objective gradient to be calculated on the meta-data by

$$\begin{aligned} \begin{aligned} \varTheta ^{(t+1)}=\varTheta ^{(t)}-\beta \frac{1}{n} \sum _{i=1}^{n} \nabla _{\varTheta ^{(t)}} \mathcal {L}^{meta}\left( \hat{\textbf{W}}^{(t)}(\varTheta )\right) , \end{aligned} \end{aligned}$$
(7)

where \(\beta \) is the step size. iii) Then, the updated \(\varTheta ^{(t+1)}\) is employed to update the parameter \(\textbf{W}\) of the network, completing the loop:

$$\begin{aligned} \begin{aligned} \textbf{W}^{(t+1)}=\textbf{W}^{(t)}-\alpha \times \nabla _{\textbf{W}^{(t)}} \mathcal {L}^{\text{ train } }(\textbf{W}^{(t)}; \boldsymbol{c}^{k}; \varTheta ^{(t+1)}), \end{aligned} \end{aligned}$$
(8)

Since the updated meta-network \(\mathcal {C}(\boldsymbol{c}^{k} ; \varTheta ^{(t+1)})\) is learned from balanced meta-data, we can expect \(\mathcal {C}(\boldsymbol{c}^{k} ; \varTheta ^{(t+1)})\) to contribute to learning better network parameters \(\textbf{W}^{(t+1)}\).
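The three-step loop of Eqs. 6 to 8 can be traced on a scalar toy problem where every gradient is available in closed form. This is only a sketch of the update order, not the actual network: the meta-network \(\mathcal {C}(c; \varTheta ) = c + \varTheta \), the two quadratic losses, and the names `meta_step` and `run` are all assumptions chosen so that the chain rule through the virtual step can be written by hand. A real implementation would rely on automatic differentiation.

```python
def meta_step(W, theta, c, target, alpha=0.5, beta=1.0):
    """One round of Eqs. 6-8 on a scalar toy problem.

    Toy assumptions (not taken from the paper):
      meta-network   C(c; theta) = c + theta
      training loss  L_train = 0.5 * (W - C(c; theta)) ** 2
      meta loss      L_meta  = 0.5 * (W - target) ** 2
    """
    # Eq. 6: virtual update of W, which remains a function of theta.
    W_hat = W - alpha * (W - (c + theta))
    # Eq. 7: differentiate L_meta(W_hat) w.r.t. theta; dW_hat/dtheta = alpha.
    grad_theta = (W_hat - target) * alpha
    theta_new = theta - beta * grad_theta
    # Eq. 8: actual update of W using the refreshed meta-network.
    W_new = W - alpha * (W - (c + theta_new))
    return W_new, theta_new


def run(steps=100, c=0.2, target=1.0):
    """Iterate the loop from W = theta = 0 and return the final values."""
    W, theta = 0.0, 0.0
    for _ in range(steps):
        W, theta = meta_step(W, theta, c, target)
    return W, theta


W_final, theta_final = run()
```

In this toy setting the empirical prototype `c` is biased away from `target`, and the meta-parameter converges to the offset `target - c`, so the calibrated prototype `c + theta` ends up where the balanced meta-data wants the network to be.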

Algorithm 1. Meta-prototype contrastive learning.

3.3 Meta-prototype Feature Augmentation in the Classifier Training Stage

In the classifier training phase, our goal is to generate additional feature embeddings to further counteract the overwhelming dominance of head classes in the representation space. It is natural to use feature augmentation to calibrate the ill-defined decision boundary. Following the Joint Bayesian face model [3], typical feature augmentation methods [26, 34, 38] assume that the features \(\textbf{z}_{i}\) of class k follow a Gaussian distribution with a class mean \(\mu _k\) and a covariance matrix \(\varSigma _k\). The mean of a class is estimated as the arithmetic mean of all features in the same class, \(\mu _{k}=\frac{1}{N_{k}} \sum _{i \in \mathcal {F}_{k}} \textbf{z}_{i}\).

However, the means of the Gaussian distributions for tail classes are biased due to the small sample size of the tail categories and the large discrepancy between the training and test data distributions. This bias causes the distribution of the generated data to deviate significantly from the data distribution of the validation set, which leads to a significant performance drop and can even destroy the original representation space. Therefore, as illustrated in Fig. 2, we leverage the meta-prototypes \(\boldsymbol{\hat{c}}^{k}\) as the ‘anchor’ to replace the typical class statistics \(\mu _{k}\) and provide a reasonable feature distribution for new feature samples of tail classes.

Fig. 2. Illustration of the feature augmentation process based on the learned meta-prototype \(\hat{c}\). Tukey’s Ladder of Power transformation function transfers the feature instance \(\textbf{z}_{i}\) into \(\tilde{\textbf{z}_{i}}\). Meta-prototypes replace the means \(\mu \) of class statistics to calculate the neighbors \(\mathcal {N}_{i}\) via \(S_{i,k}\) and the calibrated distribution \(\mu _{\tilde{\textbf{z}_{i}}}\) and \(\varSigma _{\tilde{\textbf{z}_{i}}}\). Additional features for tail classes are sampled from the calibrated statistics so as to ease the dominance of the head classes in classification decisions.

Given a trained backbone (discussed in Sect. 3.2), we first pre-compute feature representations for the entire dataset. The features of these true samples are denoted as \(\mathcal {F}=\left\{ \textbf{z}_{i}\right\} _{i=1}^{N}\), and \(\mathcal {F}_k \) denotes the features of images belonging to category k. For each class k, we sample \(N_1 - N_k\) additional features, such that the resulting feature dataset is completely balanced and all classes have \(N_1\) instances. Sampling is performed based on an instance-specific calibrated distribution. Specifically, each \(\textbf{z}_{ik}\) (\(i^{th}\) feature from category k) is responsible for generating \( s_\mathrm{{new}} = \max \left\{ \left[ N_{1} / N_{k}-1\right] _{+}, 1\right\} \) features, where \([\cdot ]_{+}\) is the ceiling function.
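As a quick arithmetic check of the per-instance budget \(s_\mathrm{{new}}\), the following sketch evaluates the formula directly. The function name `features_per_instance` is hypothetical; the class sizes 1,280 and 5 are the largest and smallest ImageNet-LT cardinalities quoted in Sect. 4.1, and 500 is an arbitrary intermediate value.

```python
import math


def features_per_instance(n_head, n_k):
    """s_new = max(ceil(N_1 / N_k - 1), 1) for one instance of class k."""
    return max(math.ceil(n_head / n_k - 1), 1)


# Rarest class: every one of its 5 instances spawns many new features.
print(features_per_instance(1280, 5))     # 255
# Largest class: the inner term is 0, so the outer max yields 1.
print(features_per_instance(1280, 1280))  # 1
# Intermediate class: 1280 / 500 - 1 = 1.56, rounded up to 2.
print(features_per_instance(1280, 500))   # 2
```

For the rarest class, 5 real instances plus 5 × 255 generated ones give 1,280 features, which matches the size of the largest class.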

Based on the learned meta-prototype, the feature covariance of the corresponding class can be calculated as:

$$\begin{aligned} \begin{aligned} \varSigma _{k}=\frac{1}{N_{k}-1} \sum _{i \in \mathcal {F}_{k}}\left( \textbf{z}_{i}-\boldsymbol{\hat{c}}^{k}\right) \left( \textbf{z}_{i}-\boldsymbol{\hat{c}}^{k}\right) ^{T}, \end{aligned} \end{aligned}$$
(9)

where \(\varSigma _{k} \in \mathbb {R}^{d \times d}\) denotes the full covariance of the Gaussian distribution for category k. Next, for each tail-class feature \(\tilde{\textbf{z}_{i}}\) processed by Tukey’s Ladder of Power transformation [24], we calculate its similarity to each other class k that has more training samples as \( S_{i,k} = \tilde{\textbf{z}_{i}}^{\top } \boldsymbol{\hat{c}}^{k} / \left( \Vert \tilde{\textbf{z}_{i}} \Vert \cdot \Vert \boldsymbol{\hat{c}}^{k}\Vert \right) \). We take the M categories with the largest cosine similarity as the neighbor set \(\mathcal {N}_{i}\) (here M denotes the number of neighbors, not the number of meta-samples). We calibrate the distribution of feature \(\tilde{\textbf{z}_{i}}\) as:

$$\begin{aligned} \begin{aligned} \mu _{\tilde{\textbf{z}_{i}}} = (1-\alpha ) \cdot \tilde{\textbf{z}_{i}} + \alpha \cdot \frac{1}{M} \sum _{k \in \mathcal {N}_{i}} \frac{e^{S_{i,k}}}{\sum _{j=1}^{M} e^{S_{i,j}}} \cdot \hat{c}^{k} \\ \varSigma _{\tilde{\textbf{z}_{i}}} = (1-\alpha )^{2} \cdot \varSigma _{i} + \alpha ^{2} \cdot \frac{1}{M} \sum _{k \in \mathcal {N}_{i}} \frac{e^{S_{i,k}}}{\sum _{j=1}^{M} e^{S_{i,j}}} \cdot \varSigma _{k} + \beta , \end{aligned} \end{aligned}$$
(10)

where \(\alpha \) is a hyper-parameter that balances the degree of calibration and \(\beta \) is an optional constant hyper-parameter that increases the spread of the calibrated distribution. We found that \(\beta = 0.05 \) works reasonably well across multiple experiments. We generate the new samples with the same associated class label and denote the new samples for category k as \(\mathcal {F}_{k}^{*}\). This combined set of features is generated for all categories and used to train the classifier. As shown in Fig. 3, we generate features using our meta-prototype feature augmentation and re-build the t-SNE visualization in the right plot. Compared with the left plot, which shows the features before generation, the right plot is easier to interpret and shows clearer feature boundaries. In addition, owing to the meta-prototype, the newly generated features are close to the validation samples, which further promotes the performance of the classifier.
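The calibration of the mean can be sketched in pure Python as follows. The sketch follows the first line of Eq. 10 as written, including its \(1/M\) factor, and normalizes the softmax over the selected neighbors. Several parts are assumptions made for illustration: the helper names, the Tukey exponent `lam=0.5`, the dictionary of head-class meta-prototypes passed in by the caller, and the diagonal covariance used for sampling (the method itself uses the full covariance \(\varSigma \)).

```python
import math
import random


def tukey(z, lam=0.5):
    """Tukey's Ladder of Powers: x ** lam if lam != 0, else log(x). Needs x >= 0."""
    return [x ** lam if lam != 0 else math.log(x) for x in z]


def cosine(a, b):
    """Cosine similarity S between a feature and a meta-prototype."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def calibrated_mean(z_tilde, prototypes, m=2, alpha=0.4):
    """First line of Eq. 10; `prototypes` maps class index -> meta-prototype."""
    sims = {k: cosine(z_tilde, c) for k, c in prototypes.items()}
    # Neighbor set N_i: the m classes with the largest cosine similarity.
    neighbors = sorted(sims, key=sims.get, reverse=True)[:m]
    denom = sum(math.exp(sims[k]) for k in neighbors)
    mixed = [0.0] * len(z_tilde)
    for k in neighbors:
        weight = math.exp(sims[k]) / denom
        for d, c_d in enumerate(prototypes[k]):
            mixed[d] += weight * c_d / m  # 1/M factor kept as in Eq. 10
    mu = [(1 - alpha) * z + alpha * p for z, p in zip(z_tilde, mixed)]
    return mu, neighbors


def sample_feature(mu, var_diag, rng):
    """Draw one feature from N(mu, diag(var_diag)); diagonal is a simplification."""
    return [rng.gauss(m_d, math.sqrt(v_d)) for m_d, v_d in zip(mu, var_diag)]


# Toy usage: three head-class meta-prototypes in a 2-D feature space.
head_prototypes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
z_tilde = tukey([0.81, 0.01])
mu, neighbors = calibrated_mean(z_tilde, head_prototypes)
new_feature = sample_feature(mu, [0.05, 0.05], random.Random(0))
```

Setting `alpha=0` leaves the feature untouched, while larger values pull the sampling centre toward the meta-prototypes of the most similar head classes.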

Fig. 3. t-SNE visualization of a few head and tail classes from ImageNet-LT. The plot on the left is before generation, and the plot on the right is after generation. We show 10 validation samples for each class and limit to 40 training + generated samples for ease of interpretation. Markers: \(\cdot \) (dot) indicate training samples; + (plus) are validation samples; and \(\times \) (cross) are generated features also shown with a lighter version of the base color. Best seen in colour.

4 Experiments

4.1 Long-Tailed Image Classification Task

Table 1. Top 1 accuracy of CIFAR-100-LT with various imbalance factors (100, 50, 10). RL, DT, and DA indicate representation learning, decouple training, and data augmentation, respectively.

Datasets and Setup. We perform experiments on long-tailed image classification datasets, including the CIFAR-100-LT [12], ImageNet-LT [18] and iNaturalist2018 [25].

  • CIFAR-100-LT is based on the original CIFAR-100 dataset, whose training samples per class are subsampled according to an imbalance ratio (we adopt imbalance ratios of 10, 50 and 100 in our experiments).

  • ImageNet-LT is a long-tailed version of the ImageNet dataset by sampling a subset following the Pareto distribution with power value 6. It contains 115.8K images from 1,000 categories, with class cardinality ranging from 5 to 1,280.

  • iNaturalist2018 is the largest dataset for long-tailed visual recognition. It contains 437.5K images from 8,142 categories. It is extremely imbalanced with an imbalance factor of 512.
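To make the notion of an imbalance ratio concrete, the sketch below builds per-class sample counts whose largest-to-smallest ratio equals a chosen value. The exponential decay profile and the function name `long_tailed_counts` are assumptions, since the text only states that the per-class counts are constructed from the imbalance ratio; the 500 images per class correspond to the original CIFAR-100 training set.

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class counts decaying exponentially from n_max to n_max / imbalance_ratio."""
    counts = []
    for k in range(num_classes):
        frac = k / (num_classes - 1)  # 0 for the head class, 1 for the last tail class
        counts.append(int(n_max * (1.0 / imbalance_ratio) ** frac))
    return counts


# CIFAR-100 has 500 training images per class before subsampling.
counts = long_tailed_counts(n_max=500, num_classes=100, imbalance_ratio=100)
observed_ratio = counts[0] / counts[-1]
```

Under this profile the counts satisfy \(N_{1} \ge N_{2} \ge \ldots \ge N_{K}\) by construction, and the head class keeps all of its original samples.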

Experimental Details. We implement all experiments in PyTorch. On CIFAR-100-LT, following [20], we use ResNet-32 [31] as the feature extractor for all methods. We train with the SGD optimizer using a batch size of 256 and momentum 0.9 under three imbalance ratios (10, 50 and 100). For image classification on ImageNet-LT, following [5, 8, 23], we use ResNeXt-50 [31] as the feature extractor for all methods. We train with the SGD optimizer using a batch size of 512 and momentum 0.9. For both training schedules (90 and 200 epochs), the learning rate is decayed by a cosine scheduler [19] from 0.2 to 0.0. On the iNaturalist2018 [25] dataset, we use ResNet-50 [31] as the feature extractor for all methods with 200 training epochs and otherwise the same experimental settings. Moreover, we use the same basic data augmentation (i.e., random resize and crop to 224, random horizontal flip, color jitter, and normalization) for all methods.
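The cosine decay from 0.2 to 0.0 can be written out explicitly. The sketch below uses the standard cosine annealing formula without restarts; per-epoch stepping and the function name `cosine_lr` are assumptions, as the text does not say whether the rate is updated per epoch or per iteration.

```python
import math


def cosine_lr(epoch, total_epochs, lr_max=0.2, lr_min=0.0):
    """Cosine annealing from lr_max at epoch 0 down to lr_min at total_epochs."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rates over the 90-epoch schedule.
schedule = [cosine_lr(e, 90) for e in range(91)]
```

The schedule starts at 0.2, passes through roughly 0.1 at the midpoint, and reaches 0.0 at the final epoch, decaying slowly at both ends and fastest in the middle.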

Comparison with the State of the Art. As shown in Table 1, to demonstrate the versatility of our method, we evaluate it on the CIFAR-100-LT dataset with three imbalance ratios. We compare against the most relevant methods and choose methods that are recently published and representative of different types, such as class re-balancing, decouple training and data augmentation. Our method surpasses DRO-LT [21] under various imbalance factors, especially on the largest imbalance factor (52.3% vs 47.3%). Furthermore, compared with the data augmentation method [38], our model achieves competitive performance (52.3% vs 51.7% with an imbalance factor of 100).

Table 2 shows the long-tailed results on ImageNet-LT. We adopt the performance data from the deep long-tailed survey [36] for various methods at 90 and 200 training epochs to make a fair comparison. Our approach achieves 53.8% and 55.1% in overall accuracy, which outperforms the state of the art methods by a significant margin at 90 and 200 training epochs, respectively. Compared with representation learning methods, our method surpasses SSP by 0.7% (53.8% vs 53.1%) at 90 training epochs and outperforms SSP by 1.8% (55.1% vs 53.3%) at 200 training epochs. In addition, our method obtains higher performance by 1.1% (53.8% vs 52.7%) and 0.7% (55.1% vs 54.4%) than PaCo at 90 and 200 training epochs, respectively.

Table 2. Results on ImageNet-LT in terms of accuracy (Acc) under 90 and 200 training epochs. In this table, CR, DT, and RL indicate class re-balancing, decouple training, and representation learning, respectively.

Furthermore, Table 3 presents the experimental results on the naturally skewed dataset iNaturalist2018. Compared with representation learning, decouple training and data augmentation approaches, our method achieves a competitive result (71.0%).

4.2 Semantic Segmentation on the ADE20K Dataset

To further validate our method, we apply our strategy to segmentation networks and report our performance on the semantic segmentation benchmark, ADE20K.

Dataset and Setup. ADE20K is a scene parsing dataset covering 150 fine-grained semantic concepts, and it is one of the most challenging semantic segmentation datasets. The training set contains 20,210 images with 150 semantic classes. The validation and test sets contain 2,000 and 3,352 images, respectively.

Table 3. Benchmarking on iNaturalist2018 in Top 1 accuracy (%). RL, DT, and DA indicate representation learning, decouple training, and data augmentation.

Experimental Details. We evaluate our method using two widely adopted segmentation models (OCRNet [33] and DeepLabV3+ [4]) based on different backbone networks. We initialize the backbones using the models pre-trained on ImageNet [6] and the rest of the framework randomly. All models are trained with an image size of \(512 \times 512\) and 80K/160K iterations in total. We train the models using the Adam optimizer with an initial learning rate of 0.01, weight decay of 0.0005, and momentum of 0.9. The learning rate decays dynamically according to the ‘poly’ strategy.
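The polynomial-decay learning-rate strategy referred to above is commonly written as the base rate multiplied by a power of the remaining training fraction. The sketch below shows this form; the exponent `power=0.9` is a widely used default and is an assumption here, since the text does not state it, and the function name `poly_lr` is hypothetical.

```python
def poly_lr(iteration, max_iterations, base_lr=0.01, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Learning rate at a few checkpoints of the 80K-iteration schedule.
checkpoints = [poly_lr(it, 80000) for it in (0, 20000, 40000, 60000, 80000)]
```

With an exponent below 1, the rate falls almost linearly for most of training and drops more steeply toward zero near the end.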

Comparison with the State of the Art. The numerical results and the comparison with peer methods are reported in Table 4. Our method achieves 1.1% and 0.5% improvement in mIoU using OCRNet with HRNet-W18 when the iterations are 80K and 160K, respectively. Moreover, our method outperforms the baseline by 0.9% and 1.1% in mIoU using DeepLabV3+ with ResNet-50 when the iterations are 80K and 160K, respectively. Even with a stronger backbone, ResNet-101, our method still achieves improvements of 0.8% and 0.9% in mIoU over the baseline. Compared with DisAlign, our method consistently performs better in both mIoU and mAcc with various backbones.

Table 4. Performance of semantic segmentation on ADE20K. R-50 and R-101 denote ResNet-50 and ResNet-101, respectively.

Fig. 4. Ablation study on \(\lambda \) in Eq. 3 and \(\alpha \) in Eq. 10.

4.3 Ablation Study

We conduct ablation studies on the ImageNet-LT dataset to further understand the hyper-parameters of our method and the effect of each proposed component. All models are trained with ResNeXt-50 for 90 epochs for a fair comparison.

\(\lambda \) in Meta Training Loss. One major hyper-parameter in our method is \(\lambda \) in Eq. 3, which controls the weight of the prototype contrastive term in the meta training loss. We set the hyper-parameter \(\lambda \in \left\{ 0.1, 0.3, 0.5, 0.7, 0.9\right\} \) and study the sensitivity of the accuracy to its value. Figure 4(a) quantifies the effect of the trade-off parameter \(\lambda \) on the validation accuracy. It shows that combining \(\mathcal {L}_{PC}\) and \(\mathcal {L}_{CE}\) with \(\lambda = 0.5\) gives the best results.

\(\alpha \) in Meta-Prototype Feature Generation. In Eq. 10, we introduce a class-wise confidence score \(\alpha \) which controls the degree of distribution calibration. We initialize \(\alpha \) to 0.2 for each tail class and it changes adaptively during training. We set the hyper-parameter \(\alpha \) in the interval from 0.2 to 1 with a stride of 0.2 and take the five sets of values to conduct ablation experiments as shown in Fig. 4(b). Overall, the larger \(\alpha \) means more confidence to transfer the knowledge from head to tail classes. The optimal \(\alpha \) for ImageNet-LT is 0.4.

Effectiveness of MPCL and MPFA. Table 5 verifies the critical roles of our adaptive modules for meta-prototype contrastive learning (MPCL) and meta-prototype feature augmentation (MPFA). The baseline only performs the decoupled training pipeline without using any components of our method. In the representation learning stage, our MPCL module outperforms DRO-LT and KCL (52.7% vs 51.9% vs 51.5%). Moreover, in the classifier training stage, our MPFA module further boosts the performance, especially on the tail classes (53.8% vs 53.2%). The results suggest that both the MPCL and MPFA components are effective in improving the training performance.

Table 5. Ablation study on ImageNet-LT for different decouple methods.

5 Conclusion

In this paper, we have proposed a novel meta-prototype decoupled training framework to tackle the long-tail challenge. Our decoupled training framework calibrates the empirical prototype for SCL in the representation learning stage and enhances the feature embeddings of tail classes based on the learned meta-prototype in the classifier learning stage. The first module of our method learns the meta-prototype to promote the representation quality of supervised prototype contrastive learning. The second module leverages the learned meta-prototype to provide a reasonable feature distribution for new feature samples of tail classes. We sample features from the calibrated distribution to ease the dominance of the head classes in classification decisions. The experimental results show that our method achieves state-of-the-art performance in various long-tailed learning settings.