1 Introduction

Most real-world data is long-tailed: a few head classes contribute the majority of the data, while most tail classes comprise relatively few samples. An undesired phenomenon is that models [2, 36, 42] trained on long-tailed data perform better on head classes while exhibiting extremely low accuracy on tail ones.

To remedy this, one mainstream line of work devises balanced classifiers [16, 44, 45] against imbalanced data. The cosine-based classifier discards the weight norms, which have been shown to be larger for head classes [53]. The \(\tau \)-norm classifier [16] manually shrinks the discrepancy among the norms of classifier weights through a \(\tau \)-normalization function. In addition, some works [2, 13, 23, 32] attach extra margin or scale terms to the output scores to prompt classifiers to focus on data-scarce classes. Another prevailing line of work is devoted to learning discriminative features from imbalanced data [5, 31, 34, 43, 51]. Range loss [51] is proposed to enlarge the inter-class feature distance and reduce the intra-class feature variation within the mini-batch data. Unsupervised discovery (UD) [43] uses self-supervised learning to help the model highlight tail classes at the feature level. In addition, LDA [31] transfers the learned feature distribution from the training domain to an ideal balanced domain.

Although these methods achieve promising performance, there is a lack of measures to quantitatively evaluate to what extent these classifiers or features achieve the presumed “balanced” classifiers or “discriminative” features. Hence, one cannot measure how severely head classes dominate the features and classifiers in the high-dimensional representation space, which leaves further optimization for improved long-tailed learning without clear guidance.

To this end, we first extend the cosine-based classifier to a von Mises-Fisher (vMF) mixture model on the hyper-sphere, denoted as the vMF classifier. Second, based on the representation space constructed by the vMF classifier, we mathematically define a novel measure between two probability density functions, denoted as the distribution overlap coefficient \(o_\varLambda \), to quantify to what extent the classifiers are “balanced” or the features are “discriminative”. A high \(o_\varLambda \) means that the two distributions (classes) are severely intertwined. We suppose that \(o_\varLambda \) among classes in a “balanced” classifier should be low enough, i.e., no class is overwhelmingly dominated by the others. “Discriminative” features mean that \(o_\varLambda \) between the features and the corresponding classifier weights is high enough, i.e., features are well matched with the correct classes.

On top of \(o_\varLambda \), we provide an explicit optimization objective to boost the representation quality on the hyper-sphere, i.e., to let classifier weights be distributed separately while aligning the classifier weights with the features. Specifically, we propose two loss terms: the inter-class discrepancy loss and the class-feature consistency loss. The first minimizes the overlap among classifier weights, and the second maximizes the overlap between features and the corresponding classifier weights. To further ease the dominance of head classes in classification decisions during inference, we develop a post-training calibration algorithm for the classifier, at zero cost, based on the learned class-wise overlap coefficients.

We extensively validate our model on three typical visual recognition tasks: image classification on ImageNet-LT [25] and iNaturalist2018 [39], semantic segmentation on the ADE20K dataset [55], and instance segmentation on the LVIS-v1.0 dataset [9]. The experimental results and ablation study demonstrate that our method consistently outperforms the state-of-the-art approaches on all the benchmarks.

Summary of Contributions:

  • To the best of our knowledge, we are the first in long-tailed learning to define the distribution overlap coefficient to evaluate representation quality for features and the proposed vMF classifiers.

  • We formulate overlap-based inter-class discrepancy and class-feature consistency loss terms to alleviate the interference among the classifier weights and align features with classifier weights.

  • We develop a post-training calibration algorithm for the classifier, at zero cost, based on the learned class-wise overlap coefficients to ease the dominance of head classes in classification decisions during inference.

  • Our models outperform previous work by a large margin and achieve state-of-the-art performance on long-tailed image classification, semantic segmentation and instance segmentation tasks.

2 Related Works

Classifier Design for Deep Long-Tailed Learning. In generic visual problems [11, 54], the common practice in deep learning is to use a linear classifier. However, long-tailed class imbalance often results in larger classifier weight norms for head classes than for tail classes, which makes the linear classifier easily biased toward dominant classes. To address long-tailed class imbalance, researchers have designed different types of classifiers. The scale-invariant cosine classifier [44] normalizes both the classifier weights and the sample features. The \(\tau \)-normalized classifier [16] rectifies the imbalance of decision boundaries by introducing the \(\tau \) temperature factor for normalization [48]. The realistic taxonomic classifier (RTC) [45] addresses the issue with hierarchical classification, where different samples are classified adaptively at different hierarchical levels. The GistNet classifier [24] leverages the over-fitting to the popular classes to transfer class geometry from popular to few-shot classes. The causal classifier [37] records the bias by computing the exponential moving average of features during training, and then removes the bad causal effect by subtracting the bias from the prediction logits during inference.

Representation Learning for Long-Tailed Learning. Existing representation learning methods for long-tailed learning mainly focus on metric learning and prototype learning. Metric learning based methods [17, 34, 41] explore distance-based losses to learn a more discriminative feature space. LMLE [14] introduces a quintuple loss to learn representations that maintain both inter-cluster and inter-class margins. Prototype learning based methods [26, 58] seek to learn class-specific feature prototypes to enhance long-tailed learning performance. Open long-tailed recognition (OLTR) [26] explores the idea of feature prototypes to handle long-tailed recognition in an open world. Self-supervised pre-training (SSP) [47] uses self-supervised learning for model pre-training, followed by standard training on long-tailed data.

von Mises-Fisher Distribution. In directional statistics, the von Mises-Fisher distribution [15] is a probability distribution on the hyper-sphere. Many methods in machine learning and deep learning build on the von Mises-Fisher distribution. The vMF Mixture Model (vMFMM) [10] proposes the SFR model, which assumes that the facial features are unit vectors distributed according to a mixture of vMFs. The vMF k-means algorithm [28] is proposed based on the mixture vMF distribution to evaluate the compactness and orientation of clusters in an unsupervised manner. More recently, the t-vMF similarity [19] rebuilds the classifier with a similarity based on the vMF distribution to regularize features within deteriorated data. Sphere Confidence Face [20] minimizes the KL divergence between a spherical Dirac delta and an r-radius vMF to achieve superior performance on face uncertainty learning.

Different from all of them, to the best of our knowledge, we are the first to quantify the distribution overlap coefficient between vMF distributions. Benefiting from it, we conduct a series of comprehensive and in-depth analyses to explore how to achieve a high-quality representation space built upon the vMF distribution.

Fig. 1. Overview of our proposed method during training. The bottom box consists of the following steps in sequence: sampling a mini-batch of images \(\mathcal {B}\) from the training set \(\mathcal {D}^{tra}\), learning features with the feature extractor \(\boldsymbol{\varPsi }(\cdot ; \boldsymbol{ \theta })\), embedding the features onto the hyper-sphere, predicting the output via our proposed vMF classifier \( \boldsymbol{ \varPhi }( \cdot ; \boldsymbol{ \mathcal {K} }, \boldsymbol{ \mathcal {M} } )\) and calculating the performance loss value. The upper boxes introduce our proposed class-feature consistency loss term \(\mathcal {L}_{cfc}\) and inter-class discrepancy loss term \(\mathcal {L}_{icd}\).

3 Methodology

First, we briefly review the canonical pipeline of long-tailed learning, exemplified by long-tailed image classification, and elaborate on our proposed vMF classifier. Afterward, we mathematically define the distribution overlap coefficient. On top of it, we present the proposed inter-class discrepancy loss and class-feature consistency loss terms. Finally, a post-training calibration algorithm is devised to boost performance at zero cost.

3.1 Build vMF Classifier on Hyper-Sphere

Let \( \mathcal { D }^{tra} = \{ \boldsymbol{I}^l, y^l\}\), \( l\in \{1, \cdots , N\} \) be the training set, where \(\boldsymbol{I}^l\) denotes an image sample and \(y^l =i\) indicates that it belongs to class i. Let C be the total number of classes and \(n_i\) be the number of samples in class i, where \(\sum _{i=1}^{C} n_{i}=N\). The class prior distribution on the training set can be defined as \(p^{tra}_{\mathcal {D}}(i) = n_i / N\).

As shown in Fig. 1, given a pair \( (\boldsymbol{I}^l, y^l) \) sampled from a mini-batch \( \mathcal { B } \subset \mathcal { D }^{tra} \), feature vector \(\boldsymbol{x}^l = \boldsymbol{\varPsi }(\boldsymbol{I}^l; \boldsymbol{ \theta }) \in \mathbb {R}^{1 \times d}\) is extracted by the feature extractor \( \boldsymbol{ \varPsi }(\cdot ; \boldsymbol{ \theta }) \), of which learnable parameter \(\boldsymbol{\theta }\) is instantiated by a neural network (e.g., ResNet). Then \(\boldsymbol{x}^l\) is projected onto the unit hyper-sphere \(\mathbb {S}^{d-1}\) via \(\tilde{ \boldsymbol{x} }^l = \boldsymbol{x}^l / \Vert \boldsymbol{x}^l \Vert _2\) and subsequently fed into the vMF classifier.

We depict the classifier with C classes as a mixture model with C von Mises-Fisher distributions on \(\mathbb {S}^{d-1}\), each class containing two variables: the compactness \(\kappa _i \in \mathbb {R}^+\) and the unit orientation vector \(\tilde{ \boldsymbol{\mu } }_i \in \mathbb {R}^{1 \times d}\). Consequently, vMF classifier is well-defined as \( \boldsymbol{ \varPhi }( \cdot ; \boldsymbol{ \mathcal {K} }, \boldsymbol{ \mathcal {M} } )\), where \(\boldsymbol{ \mathcal {K} } = \{ \kappa _1, ..., \kappa _C \}\) and \(\boldsymbol{ \mathcal {M} } = \{ \tilde{ \boldsymbol{\mu } }_1, ..., \tilde{ \boldsymbol{\mu } }_C\}\) are learnable compactness and orientation vectors for C classes, respectively. The probability density function (PDF) \(p(\tilde{ \boldsymbol{x} } | \kappa _i, \tilde{ \boldsymbol{ \mu } }_i )\) of i-th class is mathematically defined as:

$$\begin{aligned} \begin{aligned} p(\tilde{ \boldsymbol{x} } | \kappa _i, \tilde{ \boldsymbol{ \mu } }_i ) = C_d(\kappa _i) e^{ \kappa _i \cdot \tilde{\boldsymbol{x}} \tilde{ \boldsymbol{\mu } }_i^{\top } } = \frac{ {\kappa _i}^{ \frac{d}{2}-1} }{ (2\pi )^{ \frac{d}{2}} \cdot I_{ \frac{d}{2}-1}(\kappa _i) } e^{ \kappa _i \cdot \tilde{\boldsymbol{x}} \tilde{ \boldsymbol{\mu } }_i^{\top } }, \end{aligned} \end{aligned}$$
(1)

where \(I_v(\kappa )\) is the modified Bessel function [18] of the first kind of real order v and \(C_d(\kappa )\) is a normalization constant.
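
As a minimal sketch of Eq. 1 (not the authors' implementation), the log-density can be evaluated in log space, which avoids underflow of the Bessel function when d is large. The helper names `log_bessel_i`, `log_vmf_normalizer` and `log_vmf_pdf`, the plain-list vector representation, and the truncation of the Bessel series at 200 terms are assumptions made for illustration.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log I_v(x) for x > 0, from the power series summed in log space.

    I_v(x) = sum_m (x/2)^(2m+v) / (m! * Gamma(m+v+1)). The truncation at
    `terms` is an illustrative choice that is ample for moderate x.
    """
    log_half_x = math.log(x / 2.0)
    logs = [(2 * m + v) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + v + 1)
            for m in range(terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def log_vmf_normalizer(kappa, d):
    """log C_d(kappa), the normalization constant of Eq. 1."""
    v = d / 2.0 - 1.0
    return (v * math.log(kappa)
            - (d / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i(v, kappa))


def log_vmf_pdf(x, kappa, mu):
    """log p(x | kappa, mu) for unit-norm x and mu given as lists of floats."""
    cos = sum(a * b for a, b in zip(x, mu))
    return log_vmf_normalizer(kappa, len(x)) + kappa * cos


# On the ordinary sphere (d = 3) the constant has the closed form
# C_3(kappa) = kappa / (4 * pi * sinh(kappa)), which gives a quick check.
kappa = 2.0
closed_form = math.log(kappa / (4.0 * math.pi * math.sinh(kappa)))
print(log_vmf_normalizer(kappa, 3), closed_form)
```

Working with the logarithm is a design choice for numerical stability: for d = 512 the Bessel values themselves are far too small to exponentiate safely, while their logarithms stay well within floating-point range.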

From the view of Bayes Theorem [29], given the class prior distribution \(p^{tra}_{\mathcal {D}}(i)\) and \(p(\tilde{ \boldsymbol{x} }^l | \kappa _i, \tilde{ \boldsymbol{ \mu } }_i )\), the probability \(p^l_i \) for \(\boldsymbol{I}^l\) belonging to class i can be formulated by the posterior probability \(p(y^l=i | \tilde{ \boldsymbol{x} }^l)\) as:

$$\begin{aligned} \begin{aligned} p^l_i = p(y^l=i | \tilde{ \boldsymbol{x} }^l) = \frac{ p^{tra}_{\mathcal {D}}(i) \cdot p(\tilde{ \boldsymbol{x} }^l | \kappa _i, \tilde{ \boldsymbol{ \mu } }_i ) }{ \sum _{j=1}^C p_{\mathcal {D}}^{tra}(j) \cdot p(\tilde{\boldsymbol{x}}^l | \kappa _j, \tilde{\boldsymbol{ \mu }}_j ) }. \end{aligned} \end{aligned}$$
(2)

Equation 2 is the formulation of our vMF classifier. Our vMF classifier degenerates to a balanced cosine classifier [32] with temperature \(\sigma \) when all classes share a constant compactness, i.e., \(\kappa _i = \sigma , \forall i \in [1, C] \).
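
The posterior of Eq. 2 can be sketched as a softmax over per-class log scores. This is an illustrative sketch, assuming unit-normalized features and orientation vectors stored as plain Python lists; the function names are hypothetical. When all \(\kappa _i\) are equal, \(\log C_d(\kappa _i)\) is identical for every class and cancels in the softmax, leaving only the scaled cosine plus the log prior.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log I_v(x) for x > 0 via the power series, summed in log space."""
    log_half_x = math.log(x / 2.0)
    logs = [(2 * m + v) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + v + 1)
            for m in range(terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def log_vmf_normalizer(kappa, d):
    """log C_d(kappa) from Eq. 1."""
    v = d / 2.0 - 1.0
    return (v * math.log(kappa)
            - (d / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i(v, kappa))


def vmf_posterior(x, kappas, mus, priors):
    """Posterior p(y = i | x) of Eq. 2 for one unit-norm feature x."""
    d = len(x)
    scores = []
    for kappa, mu, prior in zip(kappas, mus, priors):
        cos = sum(a * b for a, b in zip(x, mu))
        # log prior + log C_d(kappa_i) + kappa_i * cos(x, mu_i)
        scores.append(math.log(prior) + log_vmf_normalizer(kappa, d) + kappa * cos)
    top = max(scores)  # subtract the maximum before exponentiating
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: three classes with a long-tailed prior and a shared kappa.
mus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
priors = [0.7, 0.2, 0.1]
print(vmf_posterior([0.0, 0.0, 1.0], [16.0, 16.0, 16.0], mus, priors))
```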

The performance loss \(\mathcal {L}_{perf}\) of the mini-batch \(\mathcal {B}\) is calculated by the cross-entropy function as follows:

$$\begin{aligned} \mathcal {L}_{perf} = -\frac{1}{N'} \sum _{l=1}^{N'} \sum _{i=1}^C \mathbbm {1}[y^l=i] \cdot \log {p^l_i}, \end{aligned}$$
(3)

where \(\mathbbm {1}[y=i]\) is the binary indicator that denotes whether the corresponding image comes from the i-th class and \(N'\) is the number of samples in a mini-batch.

The total loss \(\mathcal {L}\) for mini-batch \(\mathcal {B}\) in one iteration is calculated as:

$$\begin{aligned} \begin{aligned} \mathcal {L} = \mathcal {L}_{perf} + \lambda \cdot ( \mathcal {L}_{icd} + \mathcal {L}_{cfc} ), \end{aligned} \end{aligned}$$
(4)

where \(\mathcal {L}_{icd}\) and \(\mathcal {L}_{cfc}\) are the proposed additional loss terms that regularize the features and the classifier, introduced in Sect. 3.3, and \(\lambda \) is a hyper-parameter that adjusts the weight of the additional loss terms.

Table 1. Derivatives for compactness and orientation of vMF classifier.

3.2 Quantify Distribution Overlap Coefficient on Hyper-Sphere

As aforementioned, we geometrically depict the classifier as a set of vMF distributions on \(\mathbb {S}^{d-1}\). The distribution overlap coefficient [7] is mathematically explained as the area of intersection between two probability density functions. Based on it, we quantify the distribution overlap coefficient to measure the degree of intersection of two classes (vMF distributions) on \(\mathbb {S}^{d-1}\). In this paper, we provide an analytic expression of \(o_\varLambda \) based on the Kullback-Leibler divergence [30] for the vMF distribution [8]. Specifically, \(o_\varLambda \) is defined as:

$$\begin{aligned} \begin{aligned} o_\varLambda (\kappa _i, \kappa _j, \tilde{ \boldsymbol{\mu } }_i, \tilde{ \boldsymbol{\mu } }_j) = \frac{1}{ 1 + KL\{ p( \tilde{ \boldsymbol{x} } | \kappa _i, \tilde{ \boldsymbol{\mu } }_i), p( \tilde{ \boldsymbol{x} } | \kappa _j, \tilde{ \boldsymbol{\mu } }_j) \} }, \end{aligned} \end{aligned}$$
(5)

where \(KL\{ p( \tilde{ \boldsymbol{x} } | \kappa _i, \tilde{ \boldsymbol{\mu } }_i), p( \tilde{ \boldsymbol{x} } | \kappa _j, \tilde{ \boldsymbol{\mu } }_j) \} \) is the Kullback-Leibler divergence between two vMF distributions, abbreviated as \(KL_{ij}\):

$$\begin{aligned} \begin{aligned} KL_{ij}&= - \int _{ \tilde{\boldsymbol{x}} } p( \tilde{ \boldsymbol{x} } | \kappa _i, \tilde{ \boldsymbol{\mu } }_i) \cdot \ln \frac{ p( \tilde{ \boldsymbol{x} } | \kappa _j, \tilde{ \boldsymbol{\mu } }_j) }{ p( \tilde{ \boldsymbol{x} } | \kappa _i, \tilde{ \boldsymbol{\mu } }_i) } \,d\tilde{\boldsymbol{x}} \\&= \ln \frac{C_d(\kappa _i)}{C_d(\kappa _j)} \cdot \underbrace{ \int _{ \tilde{\boldsymbol{x}} } C_d(\kappa _i) \cdot e^{ \kappa _i \cdot \tilde{\boldsymbol{x}} \tilde{ \boldsymbol{\mu } }_i^{\top } } \,d\tilde{\boldsymbol{x}} }_{=1}\\&+ \underbrace{ ( \int _{ \tilde{\boldsymbol{x}} } \tilde{\boldsymbol{x}} \cdot C_d(\kappa _i) \cdot e^{ \kappa _i \cdot \tilde{\boldsymbol{x}} \tilde{ \boldsymbol{\mu } }_i^{\top } } \,d\tilde{\boldsymbol{x}} ) }_{= \mathbb {E}[ \tilde{\boldsymbol{x}} ] = A_d(\kappa _i) \cdot \tilde{\boldsymbol{\mu }}_i } ( \kappa _i \cdot \tilde{\boldsymbol{\mu }}_i^\top - \kappa _j \cdot \tilde{\boldsymbol{\mu }}_j^\top ) \\&= \ln \frac{C_d(\kappa _i)}{C_d(\kappa _j)} + A_d(\kappa _i) \cdot ( \kappa _i - \kappa _j \tilde{\boldsymbol{\mu }}_i \tilde{\boldsymbol{\mu }}_j^{\top }), \end{aligned} \end{aligned}$$
(6)

where \(A_d(\kappa _i) = I_{d/2}(\kappa _i) / I_{d/2 - 1}(\kappa _i) \) is non-decreasing and \(0<A_d(\kappa _i) <1 \). \( \mathbb {E}[ \tilde{\boldsymbol{x}} ] \) is the expectation vector for \(\tilde{\boldsymbol{x}} \sim p( \tilde{ \boldsymbol{x} } | \kappa _i, \tilde{ \boldsymbol{\mu } }_i) \) [33]. Generally \(0 <o_\varLambda \le 1\). \(o_\varLambda = 1\) (i.e., \(\kappa _i = \kappa _j\) and \(\tilde{\boldsymbol{\mu }}_i \tilde{\boldsymbol{\mu }}_j^{\top }=1\)) means they are completely congruent. \(o_\varLambda \rightarrow 0\) indicates there is nearly no intersection between two distributions.
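
The closed form of Eqs. 5 and 6 can be sketched numerically as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the Bessel functions are evaluated with a truncated log-space series, and the two orientation vectors enter only through their cosine `cos_ij`.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log I_v(x) for x > 0 via the power series, summed in log space."""
    log_half_x = math.log(x / 2.0)
    logs = [(2 * m + v) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + v + 1)
            for m in range(terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def log_vmf_normalizer(kappa, d):
    """log C_d(kappa) from Eq. 1."""
    v = d / 2.0 - 1.0
    return (v * math.log(kappa)
            - (d / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i(v, kappa))


def bessel_ratio(kappa, d):
    """A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa)."""
    return math.exp(log_bessel_i(d / 2.0, kappa) - log_bessel_i(d / 2.0 - 1.0, kappa))


def kl_vmf(kappa_i, kappa_j, cos_ij, d):
    """KL_ij of Eq. 6; cos_ij is the inner product of the two unit orientations."""
    return (log_vmf_normalizer(kappa_i, d) - log_vmf_normalizer(kappa_j, d)
            + bessel_ratio(kappa_i, d) * (kappa_i - kappa_j * cos_ij))


def overlap_coefficient(kappa_i, kappa_j, cos_ij, d):
    """o_Lambda of Eq. 5."""
    return 1.0 / (1.0 + kl_vmf(kappa_i, kappa_j, cos_ij, d))


# Overlap for identical, orthogonal and opposite orientations (d = 512, kappa = 16).
for cos in (1.0, 0.0, -1.0):
    print(cos, overlap_coefficient(16.0, 16.0, cos, 512))
```

With equal compactness the log-normalizer terms cancel, so the overlap depends only on \(A_d(\kappa )\cdot \kappa \cdot (1 - \tilde{\boldsymbol{\mu }}_i \tilde{\boldsymbol{\mu }}_j^{\top })\) and equals 1 when the orientations coincide.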

Fig. 2. Visualization of the overlap coefficient \(o_\varLambda (\kappa _i, \kappa _j, \tilde{\boldsymbol{\mu }}_i, \tilde{\boldsymbol{\mu }}_j)\) and its partial derivatives with respect to \(\kappa _i\) and \(\tilde{ \boldsymbol{\mu } }_i \tilde{ \boldsymbol{\mu } }_j^\top \). To display them in 3D coordinates, \(\kappa _j\) is fixed to 16. \(\kappa _i\) and \(\tilde{ \boldsymbol{\mu } }_i \tilde{ \boldsymbol{\mu } }_j^\top \) (\(\tilde{\boldsymbol{\mu }}_i \in \mathbb {R}^{1 \times 512}\)) are each uniformly sampled at 100 values from the ranges [12, 20] and [−1, 1], respectively.

The derivatives of \(o_\varLambda \) with respect to \(\kappa _i\), \(\kappa _j\), \(\tilde{ \boldsymbol{\mu } }_i\) and \(\tilde{ \boldsymbol{\mu } }_j\) are listed in Col 1 of Table 1, and they are visualized in Fig. 2. Specifically, the partial derivative with respect to \(\tilde{\boldsymbol{\mu }}_i \tilde{\boldsymbol{\mu }}_j^{\top }\) is non-negative.
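
The sign of the derivative with respect to the cosine term follows from a short calculation. As a sketch based on Eqs. 5 and 6, write \(s = \tilde{\boldsymbol{\mu }}_i \tilde{\boldsymbol{\mu }}_j^{\top }\) (a shorthand introduced here only for illustration) and treat \(\kappa _i\) and \(\kappa _j\) as fixed:

$$\begin{aligned} \begin{aligned} \frac{\partial KL_{ij}}{\partial s}&= - A_d(\kappa _i) \cdot \kappa _j, \\ \frac{\partial o_\varLambda }{\partial s}&= - \frac{1}{(1 + KL_{ij})^2} \cdot \frac{\partial KL_{ij}}{\partial s} = o_\varLambda ^2 \cdot A_d(\kappa _i) \cdot \kappa _j \ge 0, \end{aligned} \end{aligned}$$

since \(A_d(\kappa _i) > 0\) and \(\kappa _j > 0\). Increasing the cosine similarity between two orientation vectors therefore never decreases their overlap.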

The partial derivatives with respect to \(\kappa _i\) or \(\kappa _j\) are non-monotonic. An empirical conclusion is that, when using \(o_\varLambda \) as the optimization objective, \(\kappa _i\) and \(\kappa _j\) need to be kept at the same order of magnitude to achieve reliable performance.

3.3 Improve Representation of Feature and Classifier via \(o_\varLambda \)

Inter-class Discrepancy Loss. To achieve a discriminative representation space in long-tailed learning, we optimize our vMF classifier by shrinking the overlap among classes as much as possible, which alleviates the dominance of the head classes over the tail ones. We denote this optimization objective as the inter-class discrepancy loss term \(\mathcal {L}_{icd}\), which acts on the weights \(\boldsymbol{\mathcal {K}}\) and \(\boldsymbol{\mathcal {M}}\) of the vMF classifier.

First, we measure the average overlap coefficient \(o_i\) between class i and all the other classes, formulated by:

$$\begin{aligned} \begin{aligned} o_i&= \frac{1}{C-1} \sum _{j=1, j \ne i}^C o_\varLambda (\kappa _i, \kappa _j, \tilde{\boldsymbol{\mu }}_i, \tilde{\boldsymbol{\mu }}_j). \end{aligned} \end{aligned}$$
(7)

Furthermore, we define the inter-class discrepancy loss term \(\mathcal {L}_{icd}\) as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{icd}&= \frac{1}{C} \sum _{i=1}^C o_i, \end{aligned} \end{aligned}$$
(8)

The proposed \(\mathcal {L}_{icd}\) minimizes the average distribution overlap coefficient to regularize the class distributions, which yields a classifier whose classes are better separated on \(\mathbb {S}^{d-1}\).
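
Eqs. 7 and 8 can be sketched as a double loop over classes. The code below is a minimal illustration with hypothetical function names; it uses plain Python lists and a truncated log-space Bessel series, and it is not meant to be differentiable or efficient the way a training implementation would be.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log I_v(x) for x > 0 via the power series, summed in log space."""
    log_half_x = math.log(x / 2.0)
    logs = [(2 * m + v) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + v + 1)
            for m in range(terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def log_vmf_normalizer(kappa, d):
    """log C_d(kappa) from Eq. 1."""
    v = d / 2.0 - 1.0
    return (v * math.log(kappa)
            - (d / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i(v, kappa))


def overlap_coefficient(kappa_i, kappa_j, cos_ij, d):
    """o_Lambda of Eq. 5 using the closed-form KL of Eq. 6."""
    a_d = math.exp(log_bessel_i(d / 2.0, kappa_i) - log_bessel_i(d / 2.0 - 1.0, kappa_i))
    kl = (log_vmf_normalizer(kappa_i, d) - log_vmf_normalizer(kappa_j, d)
          + a_d * (kappa_i - kappa_j * cos_ij))
    return 1.0 / (1.0 + kl)


def inter_class_discrepancy(kappas, mus):
    """Return (L_icd, [o_1, ..., o_C]) following Eqs. 7 and 8."""
    num_classes, d = len(kappas), len(mus[0])
    per_class = []
    for i in range(num_classes):
        total = 0.0
        for j in range(num_classes):
            if j != i:
                cos_ij = sum(a * b for a, b in zip(mus[i], mus[j]))
                total += overlap_coefficient(kappas[i], kappas[j], cos_ij, d)
        per_class.append(total / (num_classes - 1))  # Eq. 7
    return sum(per_class) / num_classes, per_class   # Eq. 8


# Mutually orthogonal orientations give a small loss.
mus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss, overlaps = inter_class_discrepancy([16.0, 16.0, 16.0], mus)
print(loss, overlaps)
```

Note that \(o_\varLambda \) is not symmetric in its two classes when \(\kappa _i \ne \kappa _j\), so the loop visits both (i, j) and (j, i).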

Class-Feature Consistency Loss. In addition, poor matching between the feature vectors and the corresponding classifier weights leads to unsatisfactory performance, especially for the sample-starved classes. The class-feature consistency loss term \(\mathcal {L}_{cfc}\) is proposed to alleviate this issue by aligning features with the corresponding classifier weights as closely as possible.

Specifically, we first fit the class-wise feature distribution (\(\kappa ^{\boldsymbol{x}}\), \(\tilde{ \boldsymbol{\mu } }^{\boldsymbol{x}}\)) within the mini-batch \(\mathcal {B}\). The set of classes present in \(\mathcal {B}\) is denoted as \(\mathcal {C}'\). For a class \(i \in \mathcal {C}'\), the feature-level orientation vector \(\tilde{ \boldsymbol{\mu } }_i^{\boldsymbol{x}}\) is defined as:

$$\begin{aligned} \begin{aligned} \tilde{ \boldsymbol{\mu } }_i^{\boldsymbol{x}} = \frac{ \sum _{l=1, y^l = i}^{N'} \boldsymbol{x}^l }{ \Vert \sum _{l=1, y^l = i}^{N'} \boldsymbol{x}^l \Vert _2}. \end{aligned} \end{aligned}$$
(9)

Considering that the compactness \(\kappa \) is over-sensitive to the number of samples and intractable to estimate [10], \(\kappa \) is shared between the feature and the corresponding classifier weight, i.e., the feature-level compactness \(\kappa ^{\boldsymbol{x}}_{i}\) for class i is equal to \(\kappa _i\). Then, \(\mathcal {L}_{cfc}\) is formulated as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{cfc} = \mathbb {E}_{i \in \mathcal {C}^{'}}[1 - o_\varLambda (\kappa _i, \kappa _{i}^{\boldsymbol{x}}, \tilde{\boldsymbol{\mu }}_i, \tilde{\boldsymbol{\mu }}_i^{\boldsymbol{x}}) ], \end{aligned} \end{aligned}$$
(10)

where \(\mathbb {E}\) denotes the average over the classes in \(\mathcal {C}'\). Minimizing \(\mathcal {L}_{cfc}\) is, in effect, equivalent to maximizing the distribution overlap coefficient between features and the corresponding classifier weights.
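
Because the compactness is shared (\(\kappa ^{\boldsymbol{x}}_i = \kappa _i\)), the log-normalizer terms in Eq. 6 cancel and the divergence reduces to \(A_d(\kappa _i) \cdot \kappa _i \cdot (1 - \tilde{\boldsymbol{\mu }}_i \tilde{\boldsymbol{\mu }}_i^{\boldsymbol{x}\top })\). The following sketch of Eqs. 9 and 10 uses that simplification; the function names and the list-based mini-batch layout are assumptions for illustration only.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log I_v(x) for x > 0 via the power series, summed in log space."""
    log_half_x = math.log(x / 2.0)
    logs = [(2 * m + v) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + v + 1)
            for m in range(terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def bessel_ratio(kappa, d):
    """A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa)."""
    return math.exp(log_bessel_i(d / 2.0, kappa) - log_bessel_i(d / 2.0 - 1.0, kappa))


def class_feature_consistency(features, labels, kappas, mus):
    """L_cfc of Eq. 10 over the classes present in one mini-batch."""
    d = len(mus[0])
    sums = {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * d)
        for k, value in enumerate(x):
            acc[k] += value
    losses = []
    for i, total in sums.items():
        norm = math.sqrt(sum(c * c for c in total))
        mu_x = [c / norm for c in total]  # Eq. 9: normalized sum of class features
        cos = sum(a * b for a, b in zip(mus[i], mu_x))
        kl = bessel_ratio(kappas[i], d) * kappas[i] * (1.0 - cos)  # shared kappa
        losses.append(1.0 - 1.0 / (1.0 + kl))
    return sum(losses) / len(losses)


# Toy mini-batch: class 0 is well aligned with its weight, class 1 is not.
mus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
features = [[2.0, 0.1, 0.0], [3.0, -0.1, 0.0], [0.5, 0.5, 0.0]]
labels = [0, 0, 1]
print(class_feature_consistency(features, labels, [16.0, 16.0, 16.0], mus))
```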

3.4 Calibrate Classifier Weight Beyond Training via \(o_\varLambda \)

Despite the additional loss terms that regularize features and classifiers, the dominance of the head classes over the tail ones is, in effect, hard to eradicate on a highly imbalanced dataset. We visualize the compactness of the classifier and the average overlap coefficients from a well-trained model, as shown in Col 1 of Fig. 3. The head classes have larger compactness and smaller overlap coefficients, whereas the opposite holds for the tail classes.

Fig. 3. The calibrated compactness of the vMF classifier (trained on ImageNet-LT with a ResNeXt-50 feature extractor). Under different \(\alpha \) settings, we adjust \(\kappa \) via Eq. 11 and Eq. 12. Each panel shows the re-scaled \(\hat{ \boldsymbol{ \mathcal {K} } }\) for the corresponding value of \(\alpha \). When \(\alpha = 0\), \(\hat{\kappa }_i = \hat{ o }_i\); when \(\alpha = 1\), \(\hat{\kappa }_i = \kappa _i\).

In short, the calibration strategy increases the compactness of classes that are severely overlapped with other classes. Specifically, given a well-trained vMF classifier \( \boldsymbol{ \varPhi }( \cdot ; \boldsymbol{ \mathcal {K} }, \boldsymbol{ \mathcal {M} } )\), we first apply Eq. 7 to obtain the average overlap coefficient for each class, denoted as \(\boldsymbol{ \mathcal {O} }= \{o_1, ..., o_C \}\). Then we use min-max normalization to rescale \(\boldsymbol{ \mathcal {O} } \) to the same value range as \(\boldsymbol{ \mathcal {K} }\), so that both are on the same order of magnitude:

$$\begin{aligned} \begin{aligned} \hat{ o }_i = \frac{ o_i - o^{min} }{ o^{max} - o^{min} } \cdot ( \kappa ^{max} - \kappa ^{min} ) + \kappa ^{min}, \end{aligned} \end{aligned}$$
(11)

where \(o^{max}\) and \(o^{min}\) are the maximum and minimum values of the set \(\boldsymbol{ \mathcal {O} }\), respectively, and \(\kappa ^{max}\) and \(\kappa ^{min}\) are those of \(\boldsymbol{ \mathcal {K} }\). We reset the compactness vector as \(\hat{ \boldsymbol{ \mathcal {K} } } = \{ \hat{\kappa }_1, ..., \hat{\kappa }_C \}\), formulated as follows:

$$\begin{aligned} \begin{aligned} \hat{\kappa }_i = \kappa _i^\alpha \cdot \hat{ o }_i^{ 1 - \alpha }, \end{aligned} \end{aligned}$$
(12)

where \(\alpha \in [0, 1]\) is a hyper-parameter that balances the contributions of \(\kappa _i\) and \(\hat{ o }_i\) to the re-scaled \(\hat{ \boldsymbol{ \mathcal {K} } }\), as shown in Fig. 3. During inference, we follow the canonical assumption that the classes in the test set follow the uniform distribution, i.e., \( p_{\mathcal {D}}^{test}(i) = 1 / C \). Consequently, we replace \(p^{tra}_{\mathcal {D}}(i)\) by \(p^{test}_{\mathcal {D}}(i)\) in Eq. 2, and the vMF classifier is calibrated as \( \boldsymbol{ \varPhi }( \cdot ; \hat{\boldsymbol{ \mathcal {K} }}, \boldsymbol{ \mathcal {M} } )\).
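
The calibration of Eqs. 11 and 12 is simple arithmetic on two per-class lists. The sketch below assumes the average overlap coefficients have already been computed with Eq. 7; the function name and the guard for the degenerate case where all overlaps are equal (in which Eq. 11 is undefined) are illustrative additions.

```python
def calibrate_compactness(kappas, overlaps, alpha):
    """Return the re-scaled compactness values of Eq. 12.

    kappas:   learned compactness kappa_i per class
    overlaps: average overlap coefficient o_i per class (Eq. 7)
    alpha:    interpolation weight in [0, 1]
    """
    o_min, o_max = min(overlaps), max(overlaps)
    k_min, k_max = min(kappas), max(kappas)
    if o_max == o_min:
        # Assumption for this sketch: Eq. 11 divides by zero here, so refuse.
        raise ValueError("all overlap coefficients are equal; Eq. 11 is undefined")
    calibrated = []
    for kappa, o in zip(kappas, overlaps):
        # Eq. 11: min-max rescale o_i into the value range of the kappas.
        o_hat = (o - o_min) / (o_max - o_min) * (k_max - k_min) + k_min
        # Eq. 12: geometric interpolation between kappa_i and o_hat_i.
        calibrated.append(kappa ** alpha * o_hat ** (1.0 - alpha))
    return calibrated


# Toy example: the head class has large kappa and low overlap, the tail class
# the opposite. Smaller alpha shifts compactness toward overlapped classes.
kappas = [20.0, 16.0, 12.0]
overlaps = [0.1, 0.2, 0.4]
for alpha in (1.0, 0.5, 0.0):
    print(alpha, calibrate_compactness(kappas, overlaps, alpha))
```

The geometric form of Eq. 12 keeps every calibrated value between \(\kappa _i\) and \(\hat{o}_i\), so the result stays inside the original range \([\kappa ^{min}, \kappa ^{max}]\).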

Moreover, our post-training calibration algorithm can be extended to several widely used classifiers for a cost-free performance boost. Next, we show how to apply the algorithm above to calibrate the weights of \(\tau \)-norm [16], causal [38] and linear classifiers. Given the weight vector \(\boldsymbol{w^{\tau }_i}\) of class i from a well-trained \(\tau \)-norm classifier \(\boldsymbol{W}^{\tau }\), we equivalently convert \(\boldsymbol{w^{\tau }_i}\) into the compactness \(\kappa _i = \Vert \boldsymbol{w^{\tau }_i} \Vert _2^{1 - \tau }\) and the orientation vector \(\tilde{\boldsymbol{\mu }}_i = \boldsymbol{w^{\tau }_i} / \Vert \boldsymbol{w^{\tau }_i} \Vert _2\). After calibration via Eq. 11 and Eq. 12, \(\boldsymbol{w^{\tau }_i}\) is rebuilt by multiplying the orientation vector by the re-balanced compactness. Along the same lines, the weight vector \(\boldsymbol{w^{cau}_i}\) of a well-trained causal classifier \(\boldsymbol{W}^{cau}\) is converted to \(\kappa _i = \Vert \boldsymbol{w^{cau}_i} \Vert _2/ (\Vert \boldsymbol{w^{cau}_i} \Vert _2 + \gamma )\) and \(\tilde{\boldsymbol{\mu }}_i = \boldsymbol{w^{cau}_i} / \Vert \boldsymbol{w^{cau}_i} \Vert _2\). The weight vector \(\boldsymbol{w^{lin}_i}\) of a well-trained linear classifier \(\boldsymbol{W}^{lin}\) is converted to \(\kappa _i = \Vert \boldsymbol{w^{lin}_i} \Vert _2 \) and \(\tilde{\boldsymbol{\mu }}_i = \boldsymbol{w^{lin}_i} / \Vert \boldsymbol{w^{lin}_i} \Vert _2\). \(\gamma \) and \(\tau \) are hyper-parameters of the classifiers above. (Detailed proofs are given in Appendix \({\textbf {A.2}}\).)
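
The weight conversions described above can be sketched as small helpers. The function names are hypothetical, the weights are plain Python lists, and the values of \(\tau \) and \(\gamma \) in the example are placeholders rather than settings from the paper.

```python
import math


def l2_norm(w):
    """Euclidean norm of a weight vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in w))


def split_tau_norm_weight(w, tau):
    """tau-norm classifier: kappa = ||w||^(1 - tau), mu = w / ||w||."""
    norm = l2_norm(w)
    return norm ** (1.0 - tau), [c / norm for c in w]


def split_causal_weight(w, gamma):
    """Causal classifier: kappa = ||w|| / (||w|| + gamma), mu = w / ||w||."""
    norm = l2_norm(w)
    return norm / (norm + gamma), [c / norm for c in w]


def split_linear_weight(w):
    """Linear classifier: kappa = ||w||, mu = w / ||w||."""
    norm = l2_norm(w)
    return norm, [c / norm for c in w]


def rebuild_weight(kappa_hat, mu):
    """Multiply the orientation by the (calibrated) compactness."""
    return [kappa_hat * c for c in mu]


# Round trip for a linear weight: split, then rebuild with the same kappa.
w = [3.0, 4.0]
kappa, mu = split_linear_weight(w)
print(kappa, mu, rebuild_weight(kappa, mu))
```

In a full pipeline the compactness values obtained this way would be passed through Eqs. 11 and 12 before `rebuild_weight` is called, so that only the norm of each weight vector changes while its direction is preserved.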

4 Experiments

In this section, we conduct a series of experiments to validate the effectiveness of our method. Below we present our experimental analysis and ablation study on the image classification task in Sect. 4.1, followed by our results on semantic segmentation task and instance segmentation task in Sect. 4.2.

Table 2. Results on ImageNet-LT in terms of accuracy (Acc) under 90 and 200 training epochs. In this table, CR, DT, RL and CD indicate class re-balancing, decouple training, representation learning and classifier design, respectively. \(\dagger \) indicates only vMF classifier is applied w/o additional loss terms and post-training calibration algorithm.

4.1 Long-Tailed Image Classification Task

Datasets and Setup. We perform experiments on long-tailed image classification datasets, including the ImageNet-LT [25] and iNaturalist2018 [39].

  • ImageNet-LT is a long-tailed version of the ImageNet dataset by sampling a subset following the Pareto distribution with power value 6. It contains 115.8K images from 1,000 categories, with class cardinality ranging from 5 to 1,280.

  • iNaturalist2018 is the largest dataset for long-tailed visual recognition. It contains 437.5K images from 8,142 categories. It is extremely imbalanced with an imbalance factor of 512.

Experimental Details. For image classification on ImageNet-LT, we implement all experiments in PyTorch. Following [5, 13, 38], we use ResNeXt-50 [46] as the feature extractor for all methods. We train the models with the SGD optimizer, using batch size 512 and momentum 0.9. In both training settings (90 and 200 epochs), the learning rate is decayed by a cosine scheduler [27]. On the iNaturalist2018 [39] dataset, we use ResNet-50 [46] as the feature extractor for all methods with 200 training epochs and otherwise the same experimental settings. By default, the learnable \(\kappa \) of every category is initialized to 16 and \(\lambda \) is set to 0.2. Moreover, we use the same basic data augmentation (i.e., random resize and crop to 224, random horizontal flip, color jitter, and normalization) for all methods.

Table 3. Benchmarking on iNaturalist2018 in accuracy (%). DT, CD and RL indicate decouple training, classifier design and representation learning, respectively.

Comparison with the State of the Art. In our paper, the comparison methods use single models. Note that there are also ensemble models for long-tailed classification, e.g., RIDE [42] and TADE [52]. For fair comparison, following xERM [57], we do not include their results in the experiments. Table 2 shows the long-tailed results on ImageNet-LT. To make a fair comparison, we adopt the performance data from the deep long-tailed survey [53] for various methods at 90 and 200 training epochs. Our approach achieves 53.7% and 55.0% overall accuracy at 90 and 200 training epochs, respectively, outperforming the state-of-the-art methods in both settings. Compared with representation learning methods, our method surpasses SSP by 0.6% (53.7% vs 53.1%) at 90 training epochs and by 1.7% (55.0% vs 53.3%) at 200 training epochs. In addition, our method outperforms PaCo by 1.0% (53.7% vs 52.7%) and 0.6% (55.0% vs 54.4%) at 90 and 200 training epochs, respectively. We observe that our vMF classifier alone (w/o the proposed additional loss terms and post-training calibration algorithm) still achieves better performance than previous classifier design strategies: it surpasses \(\tau \)-norm and TDE by 2.6% (52.2% vs 49.6%) and 0.4% (52.2% vs 51.8%) at 90 epochs, respectively. Moreover, our vMF classifier performs better when trained for 200 epochs than for 90 epochs (53.4% vs 52.2%), in contrast to TDE (51.3% vs 51.8%). This suggests that our vMF classifier has more potential to fit the data and learn better representations.

Furthermore, Table 3 presents the experimental results on the naturally skewed dataset iNaturalist2018. Compared with representation learning and classifier design approaches, our method consistently achieves a competitive result (71.0%).

4.2 Long-Tailed Semantic and Instance Segmentation Task

To further validate our method, we conduct comprehensive experiments on the semantic and instance segmentation datasets, i.e., ADE20K [55] and LVIS-v1.0 [9].

Dataset and Setup

  • ADE20K is a scene parsing dataset covering 150 fine-grained semantic concepts and it is one of the most challenging semantic segmentation datasets. The training set contains 20,210 images with 150 semantic classes. The validation and test set contain 2,000 and 3,352 images respectively.

  • LVIS-v1.0 contains 1230 categories with both bounding box and instance mask annotations. LVIS-v1.0 divides all categories into 3 groups based on the number of images that contain those categories: frequent (>100 images), common (11–100 images) and rare (≤10 images). We train the models with 57K train images and report the accuracy on 5K val images.

Table 4. Performance of semantic segmentation on ADE20K and instance segmentation on LVIS-v1.0. R-50 and R-101 denote ResNet-50 and ResNet-101, respectively. ‘Cascade-R101’ is for Cascade Mask R-CNN [1].

Experimental Details. We evaluate our method using two widely adopted segmentation models (OCRNet [49] and DeepLabV3+ [4]) with different backbone networks. We initialize the backbones with models pre-trained on ImageNet [6] and initialize the rest of the framework randomly. All models are trained with an image size of \(512 \times 512\) for 160K iterations in total. We train the models using the Adam optimizer with an initial learning rate of 0.01, weight decay of 0.0005 and momentum of 0.9. Furthermore, we implement our method on LVIS-v1.0 with mmdetection [3] and train Mask R-CNN [12] with a random sampler under the 2x training schedule. The model is trained with a batch size of 16 for 24 epochs. The optimizer is SGD with momentum 0.9 and weight decay 0.0001. The initial learning rate is 0.02 with a warm-up of 500 iterations. For both tasks, all learnable \(\boldsymbol{\mathcal {K}}\) are initialized to 16, which is the best configuration in our experiments.

Table 5. Ablation on our proposed two loss terms and the loss weight \(\lambda \). ‘None’ indicates that only the performance loss term is applied to train the model. ‘0.1’ means \(\lambda \) is set to 0.1.
Table 6. Ablation on the hyper-parameter \(\alpha \) of the post-training calibration algorithm with different classifiers. \(\ddagger \) indicates that the corresponding classifier is calibrated under the optimal \(\alpha \).

Comparison with the State of the Art. For the semantic segmentation task, the numerical results and the comparison with peer methods are reported in the left part of Table 4. Our method achieves a 0.7% (41.5% vs 40.8%) improvement in mIoU using OCRNet with HRNet-W18. Moreover, our method outperforms the baseline by 1.0% (45.9% vs 44.9%) in mIoU using DeepLabV3+ with ResNet-50 at 160K iterations. Even with a stronger backbone, ResNet-101, our method still achieves a 0.8% (47.2% vs 46.4%) mIoU improvement over the baseline. For the instance segmentation task, we report quantitative results and compare our method with recent work in the right part of Table 4. Our method achieves 29.8% in AP and 32.9% in \(\mathrm{AP}_b\) when applied to Cascade-R101. Apart from the CE loss baseline, we further compare our method with recent designs for long-tailed instance segmentation. Our method surpasses Seesaw Loss by 0.2% (29.8% vs 29.6%) AP and DisAlign by 0.9% (29.8% vs 28.9%) AP, which demonstrates the effectiveness of our method (Table 6).
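The gains quoted above are absolute differences between two percentages, i.e., percentage points. A minimal check of that arithmetic, using only the numbers reported in this comparison (the dictionary keys are illustrative labels):

```python
# (ours, reference) pairs as reported in the comparison above, in percent.
reported = {
    "OCRNet / HRNet-W18 (mIoU)": (41.5, 40.8),
    "DeepLabV3+ / ResNet-50 (mIoU)": (45.9, 44.9),
    "DeepLabV3+ / ResNet-101 (mIoU)": (47.2, 46.4),
    "vs. Seesaw Loss (AP)": (29.8, 29.6),
    "vs. DisAlign (AP)": (29.8, 28.9),
}

# Absolute gain in percentage points, rounded to the reported precision.
gains = {name: round(ours - ref, 1) for name, (ours, ref) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} points")
```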

4.3 Ablation Study

We conduct an ablation study on the ImageNet-LT dataset to better understand the hyper-parameters of our method and the effect of each proposed component.

Ablation Study on Two Additional Loss Terms and the Loss Weight \(\boldsymbol{\lambda }\). First, we evaluate the effectiveness of the proposed \(\mathcal {L}_{icd}\) and \(\mathcal {L}_{cfc}\). Setting \(\lambda =0.2\) and initializing \(\kappa = 16\), we train the vMF classifier w/o additional loss terms, w/ \(\mathcal {L}_{icd}\), w/ \(\mathcal {L}_{cfc}\), and w/ both of them, respectively. Experimental results are reported in Rows 1–4 of Table 5. Our baseline is the balanced cosine classifier [32]. We draw three conclusions. (1) Additional supervision via \(\mathcal {L}_{icd}\) benefits the performance on tail classes. As the second and third rows of Table 5 show, tail-class accuracy improves substantially with this loss term (26.9% vs 31.7%). (2) \(\mathcal {L}_{cfc}\) brings non-trivial performance improvements on all classes. (3) Adopting both loss terms together further improves the accuracy by 1.3\(\%\), widening the gap over the baseline to 1.7\(\%\). Second, we conduct four experiments with different values of \(\lambda \). Rows 5–8 of Table 5 show that \(\lambda =0.2\) is the optimal setting.
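The four ablation settings can be sketched as switches on the two extra loss terms. This section does not spell out how the terms are combined, so the sketch assumes they are simply added to the performance loss with a shared weight \(\lambda \); the function name and the placeholder loss values are hypothetical and carry no experimental meaning.

```python
LAMBDA = 0.2  # loss weight used for rows 1-4 of the ablation


def total_loss(l_perf, l_icd, l_cfc, use_icd, use_cfc, lam=LAMBDA):
    """Assumed combination: performance loss plus lambda-weighted extra terms."""
    extra = (l_icd if use_icd else 0.0) + (l_cfc if use_cfc else 0.0)
    return l_perf + lam * extra


# Placeholder loss values, chosen only to make the arithmetic visible.
l_perf, l_icd, l_cfc = 1.0, 0.5, 0.25

# (use_icd, use_cfc): w/o extra terms, w/ L_icd, w/ L_cfc, w/ both.
settings = [(False, False), (True, False), (False, True), (True, True)]
losses = [total_loss(l_perf, l_icd, l_cfc, icd, cfc) for icd, cfc in settings]
print(losses)
```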

Ablation Study on Post-calibration Algorithm with Different Classifiers. To verify the versatility of our post-training calibration algorithm, we apply it to our vMF, linear, \(\tau \)-norm (\(\tau =0.7\), the optimal setting in [16]), and causal [38] classifiers, following Sect. 3.4. All of them are trained on ImageNet-LT with ResNeXt-50. We vary the hyper-parameter \(\alpha \) from 0 to 1 with a stride of 0.1 and use the resulting eleven values to conduct ablation experiments on the above classifiers. For the linear classifier, the optimal setting is \(\alpha =0.7\), where our algorithm improves the overall accuracy by 5.1%. For the \(\tau \)-norm and causal classifiers, under the optimal \(\alpha = 0.1\), the overall accuracy is improved by 4.4% and 1.9%, respectively. When \(\alpha = 0.2\), our vMF classifier achieves its highest accuracy of 53.7%. The improvement on our classifier is slight, possibly because it is already trained with the proposed loss terms (\(\mathcal {L}_{cfc}\) and \(\mathcal {L}_{icd}\)), which are also based on the distribution overlap coefficient.
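The sweep over \(\alpha \) amounts to a one-dimensional grid search over eleven values. A minimal sketch follows; `sweep_alpha` and `toy_accuracy` are hypothetical names, and `toy_accuracy` is only a stand-in for "calibrate a trained classifier with \(\alpha \), then measure validation accuracy", not the calibration algorithm of Sect. 3.4.

```python
def sweep_alpha(evaluate, num_steps: int = 10):
    """Evaluate alpha = 0.0, 0.1, ..., 1.0 and return (best_alpha, best_score, scores)."""
    alphas = [i / num_steps for i in range(num_steps + 1)]
    scores = {alpha: evaluate(alpha) for alpha in alphas}
    best_alpha = max(scores, key=scores.get)
    return best_alpha, scores[best_alpha], scores


# Hypothetical stand-in for the real evaluation; peaks at alpha = 0.7 by construction.
def toy_accuracy(alpha: float) -> float:
    return 50.0 - 10.0 * (alpha - 0.7) ** 2


best_alpha, best_score, scores = sweep_alpha(toy_accuracy)
print(best_alpha, best_score)
```

In practice the sweep would be repeated once per classifier, since the optimal \(\alpha \) reported above differs between them.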

5 Conclusions

In this paper, we extend cosine-based classifiers to a vMF distribution mixture model on the hyper-sphere, denoted as the vMF classifier. Building on the representation space constructed by the vMF classifier, we define the distribution overlap coefficient to measure the representation quality of features and classifiers. Based on the distribution overlap coefficient, we formulate the inter-class discrepancy and class-feature consistency loss terms to alleviate the interference among classifier weights and to align features with classifier weights. Furthermore, we develop a novel post-training calibration algorithm that boosts performance at no additional training cost. Our method outperforms previous work by a large margin and achieves state-of-the-art performance on long-tailed image classification, semantic segmentation, and instance segmentation tasks.