
1 Introduction

With the development of deep convolutional neural networks (CNNs) [16] trained on large-scale datasets [32], computer vision research has advanced significantly in recent years. These large-scale datasets are usually carefully curated so that the number of instances in each class is artificially balanced, which is inconsistent with real-world scenarios. Images of some categories are easy to collect while others are difficult, so the number of samples in each head class far exceeds the number of samples in each tail class, as shown in Fig. 1(a). Due to the insufficient information of tail classes, the feature space CNNs learn for tail classes is under-represented and the decision boundary is biased toward head classes, leading to poor classification performance on tail classes.

Fig. 1. Motivation of this work: we select three head classes and two tail classes from the CIFAR-LT-100 dataset [25] and plot t-SNE [28] visualizations to compare: (a) LDAM [6], a reweighting-based method without tail-class augmentation during training; (b) RSG [38], an augmentation-based method without sample-specific augmentation; and (c) SAFA, our proposed augmentation-based method with sample-specific augmentation. (a) “w/o Augmentation”: the imbalanced distribution of head-class and tail-class samples causes CNNs to under-represent tail classes in feature space. (b) “w/o Sample-Specific”: CNNs enlarge the feature space of tail classes with augmented tail-class features. However, these augmented features are not sample-specific and drift away from real tail-class features, so the tail-class feature space fails to generalize to the test phase and remains under-represented. (c) “Sample-Specific”: sample-specific augmented features recover the distribution of the limited tail-class samples, which enlarges the tail-class feature space, generalizes better to the test phase, and helps CNNs perform more homogeneously across different classes

To address the issue of imbalanced data distribution, a natural solution is to augment training samples to compensate for tail classes in feature space. Data augmentation techniques such as cropping, mirroring, and mixup [16, 18, 44] are adopted to alleviate the data imbalance problem. However, these conventional techniques are typically performed inside each tail class without considering information in head classes. As a result, the diversity of augmented samples is inherently limited by the scarce training samples in tail classes, so the augmented data cannot recover the true data distribution of tail classes. Considering that head classes with abundant samples provide diverse intra-class variance, previous works [9, 10, 23, 33, 38, 39, 43] enlarge the feature space of tail classes by generating new tail-class features during training, transferring intra-class variance information from head classes to tail classes. [33, 43] utilize feature variation information, such as different poses or lighting conditions, among samples from the same head class to generate new tail-class features. However, these methods introduce no mechanism to ensure that the variation information obtained from head classes is class-irrelevant; the augmented tail-class samples may shift toward other classes due to class-relevant information from head classes, hurting the performance of CNN classifiers. Moreover, these approaches are not end-to-end. To augment tail classes with class-irrelevant information, noise vectors are used in [39] to encode sample variation information. But noise vectors are too random to reflect the true variations among images, so using them for generation can produce unstable or low-quality features. In [38], a feature augmentation module is integrated into CNNs for end-to-end training; the variation information is extracted by removing the center of each class, and a vector transformation module is used to enlarge the distance between the feature variance and tail-class features. All of the above-mentioned methods directly combine the intra-class variance extracted from head classes with random tail-class samples to produce abundant augmented tail-class features. However, these augmented features are not sample-specific: the incompatibility between a tail-class sample and the applied intra-class variance yields implausible augmented features that drift away from real features in feature space, as shown in Fig. 1(b). CNNs do enlarge (resp., reduce) the feature space of tail classes (resp., head classes) during training with these non-sample-specific augmented features, but the enlarged feature space fails to generalize to the test phase, where tail classes remain under-represented.

Fig. 2. The illustration of integrating our proposed SAFA into the N-th layer of a deep network to produce diverse and effective tail-class features that reshape the feature space. SAFA is only used during training (orange dotted lines), leaving no computational burden at test time (black solid lines) (Color figure online)

In this paper, to alleviate these limitations, we propose a novel semantic Sample-Adaptive Feature Augmentation (SAFA) to generate reliable and diverse augmented features for tail classes during training, enlarging the under-represented tail-class feature space and yielding classifiers with less biased decision boundaries. SAFA is a novel plug-in approach that can be conveniently integrated into various networks to effectively augment tail classes without additional computational burden at test time, as shown in Fig. 2. Note that we only show a simple CNN in Fig. 2, but SAFA can be used in any network architecture. SAFA aims to extract diverse and transferable semantic directions (e.g., intra-class transformations) from head-class features and translate tail-class samples along the extracted directions adaptively to produce diverse and effective features. SAFA adopts an auto-encoder structure consisting of a sample-specific encoder and a sample-adaptive generator. The encoder extracts transferable class-irrelevant information from head classes, while the sample-adaptive generator corrects the extracted variance information to produce sample-specific tail-class features. SAFA leverages a recycling training scheme that enforces consistency of the relevant semantics before and after translation, ensuring the augmented features are sample-specific. A contrastive loss ensures the transferable semantic directions are class-irrelevant, and a mode seeking loss is adopted to exploit diverse semantic directions, producing diverse tail-class features and enlarging the tail-class feature space. Figure 1(c) demonstrates the effect of SAFA: it generates diverse and effective augmented features that recover the real distribution of tail classes, enlarging their feature space and generalizing well to the test phase. As a plug-in, SAFA is convenient and versatile to combine with different architectures and loss functions during training without additional computational burden at test time. With extensive experimental evaluations, we verify the effectiveness of SAFA: it obtains outstanding results on Imbalanced CIFAR, Places-LT, ImageNet-LT, and iNaturalist 2018.

2 Related Work

2.1 Long-Tail Classification Methods

Re-sampling. Over-sampling the tail classes [4, 5, 34] or under-sampling the head classes [4, 15, 21] is widely used to balance the data distribution of imbalanced datasets. Although effective, over-sampling might result in over-fitting of tail classes, while under-sampling may weaken the feature learning of head classes due to the absence of valuable samples [6, 7, 11, 42].

Re-weighting. Reweighting-based methods aim to assign weights to training samples at either the class or sample level. A classic scheme reweights classes with weights inversely proportional to their frequencies [17, 40]. The method in [11] further improves this scheme with the proposed effective number. L2RW [31] assigns weights to examples sample-wisely based on gradient directions. Meta-class-weight [20] exploits meta-learning to estimate precise class-wise weights, while [6] allocates large margins to tail classes. Apart from the above works, Focal Loss [26] and meta-weight-net [35] assign weights to examples sample-wisely. In addition, for learning better representations, some approaches separate training into two stages: representation learning and classifier re-balancing [6, 12, 20, 22]. BBN [48] further unifies the two stages into a cumulative learning strategy.

Augmentation. Data augmentation is widely adopted in CNNs to alleviate over-fitting. For example, rotation and horizontal flipping are employed to keep the predictions of CNNs invariant [16, 18, 36]. Complementary to traditional data augmentation, semantic data augmentation that performs semantic altering is also effective for enhancing classifier performance [2, 41]. A hallucinator [39] was designed to generate new samples for tail classes: it uses tail-class samples and noise vectors to produce new hallucinated tail-class samples. A Delta-encoder framework [33] was proposed for generating new samples: it is first trained to reconstruct the pre-computed feature vectors of input images from head classes, and is then used to generate new samples by combining tail-class samples, with the newly generated ones further used to train the classifier. A feature transfer learning (FTL) framework [43] transfers the intra-class variance from head classes to tail classes by generating new tail-class samples. Our method belongs to this family of augmentation-based methods, which mainly focus on augmenting tail-class samples to overcome the imbalance issue. Different from other augmentation methods that simply apply the same transformation (e.g., adding random noise) to all tail-class samples, we distinguish different samples and design a sample-adaptive augmentation method to produce effective and diverse augmented tail-class samples. Our method fully considers individual differences combined with intra-class variance to generate semantically rational augmentations.

2.2 Semantic Transformations in Deep Feature Space

Our work is motivated by the fact that high-level representations learned by deep convolutional networks can potentially capture abstractions with semantics [3]. In fact, translating deep features along certain directions has been shown to correspond to meaningful semantic transformations of the input images. For example, deep feature interpolation [8, 49] leverages simple interpolations of deep features from pre-trained neural networks to achieve semantic image transformations. Variational Auto-Encoder (VAE) [24] and Generative Adversarial Network (GAN) based methods [14] establish latent representations corresponding to the abstractions of images, which can be manipulated to edit image semantics. Generally, these methods reveal that certain directions in deep feature space correspond to meaningful semantic transformations and can be leveraged to perform semantic data augmentation. In this work, we focus on learning adaptive semantic transformations for tail classes by leveraging diverse class-irrelevant variations from head classes.

Fig. 3. The framework of SAFA, including a delta extraction module E, a sample-specific delta generator D, a sample-adaptive generator G, and a contrastive module Q. E extracts the class-irrelevant delta \(\boldsymbol{\varDelta }^{ij}\) from head-class pairs \(\{\boldsymbol{F}^i_h,\boldsymbol{F}^j_h\}\); D combines the extracted \(\boldsymbol{\varDelta }^{ij}\) with the tail-class feature \(\boldsymbol{F}_t^i\) to produce the sample-specific delta \(\boldsymbol{\varDelta }_t^{ij}\), which, coupled with \(\boldsymbol{F}_t^i\), is fed into the sample-adaptive generator G to generate the sample-specific tail-class feature \(\tilde{\boldsymbol{F}}_t^j\)

3 Methodology

Given an imbalanced training dataset \(\mathbb {S}=\{\boldsymbol{x}^{i},{y}^{i}\}_{i=1}^n\), where \({y}^{i} \in \{1, \cdots , C\}\) is the label of the i-th sample \(\boldsymbol{x}^i\), C is the number of classes, and \(n_c\) denotes the number of samples belonging to the c-th class. We assume the classes are sorted by cardinality in decreasing order, i.e., \(n_{i+1} \le n_i\). The data obeys a long-tail distribution: most samples belong to only a few head classes, denoted as \(\{\boldsymbol{x}_h^i\}\), while each of the remaining tail classes, denoted as \(\{\boldsymbol{x}_t^i\}\), has only a few samples. Feeding head-class samples \(\{\boldsymbol{x}_h^i\}\) (resp., tail-class samples \(\{\boldsymbol{x}_t^i\}\)) into a CNN, the corresponding feature maps from a specific layer of the backbone are denoted as \(\{\boldsymbol{F}_h^i\}\) (resp., \(\{\boldsymbol{F}_t^i\}\)).

3.1 SAFA: Sample-Specific Feature Augmentation

In this section, we introduce how to integrate SAFA into CNNs to produce diverse tail-class features that effectively enlarge the tail-class feature space during training and generalize to the test phase. SAFA is inspired by [33], in which an intra-class transformation (i.e., the difference between two samples within the same category) is called a “delta”. Deltas are extracted from paired samples of the same class, where the delta is the additional information required to reconstruct one sample of the pair from the other. In [33], deltas are directly combined with random target-class samples to generate new features for target classes. However, the effect of a delta may depend on the target-class sample it is combined with [1]: a delta effective for one sample may be unsuitable for another. On one hand, the extracted deltas differ in semantic scale, e.g., pose rotations of different degrees (90 or 180). On the other hand, the difficulty of translating tail-class samples along semantic directions of different scales also differs, e.g., translating a left-facing dog into another left-facing dog may be easier than translating it into a frontal-facing one. Naive augmentation may lead to corrupted features or features without class-preserving characteristics, which drift away from real tail-class features, as shown in Fig. 1(b).

To extract effective and transferable deltas, and to adaptively apply these deltas to tail-class samples to produce effective augmented features, we propose SAFA as illustrated in Fig. 3. SAFA consists of a delta feature extractor E, a sample-specific delta generator D, a sample-adaptive feature generator G, and a contrastive module Q. All these modules are built up with Conv-BN-ReLU-Conv layers (Q has an additional FC layer). During training, given a random pair of feature maps \(\{\boldsymbol{F}_h^i, \boldsymbol{F}_h^j\}\) from the same head class (i.e., \(y_h^i=y_h^j\)), the delta extraction module E extracts the class-irrelevant delta \(\boldsymbol{\varDelta }^{ij}\), which is combined with a random tail-class feature map \(\boldsymbol{F}_t^i\) and fed into the sample-specific delta generator D to generate the sample-specific delta \(\boldsymbol{\varDelta }_t^{ij}\). After that, \(\boldsymbol{\varDelta }_t^{ij}\) is combined with \(\boldsymbol{F}_t^i\) and fed into the sample-adaptive generation module G to produce the sample-specific tail-class feature \(\tilde{\boldsymbol{F}}_t^j\). Finally, the real feature maps \(\boldsymbol{F}\) coupled with the augmented tail-class feature maps \(\tilde{\boldsymbol{F}}_t\) are fed into the deeper layers of the network.
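
To make the module layout concrete, the following is a minimal PyTorch sketch of E, D, and G under two simplifying assumptions of ours: feature maps have shape (B, C, H, W), and \(C_{\boldsymbol{\varDelta }} = C\) so that the additive combination in Eq. (3) below is well defined. The class and function names are ours, and the contrastive module Q (sketched together with its loss in Sect. 3.2) is omitted here.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv-BN-ReLU-Conv block, the layer pattern used by all SAFA modules."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
    )

class SAFA(nn.Module):
    """Delta extractor E, sample-specific delta generator D, generator G."""
    def __init__(self, c):
        super().__init__()
        c_delta = c  # simplifying assumption: C_delta = C
        self.E = conv_block(c, c_delta)            # class-irrelevant delta
        self.D = conv_block(c_delta + c, c_delta)  # sample-specific delta
        self.G = conv_block(c_delta, c)            # sample-adaptive generation

    def forward(self, f_h_i, f_h_j, f_t_i):
        delta = self.E(f_h_i - f_h_j)                       # Eq. (2)
        delta_t = self.D(torch.cat([delta, f_t_i], dim=1))  # attend to f_t_i
        f_t_aug = self.G(delta_t + f_t_i)                   # Eq. (3)
        return f_t_aug, delta
```

A forward pass reproduces the path \(\{\boldsymbol{F}_h^i, \boldsymbol{F}_h^j\} \rightarrow \boldsymbol{\varDelta }^{ij} \rightarrow \boldsymbol{\varDelta }_t^{ij} \rightarrow \tilde{\boldsymbol{F}}_t^j\) in Fig. 3.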

To ensure the transferability of the extracted delta, a modified recycle reconstruction loss [19] is adopted to enforce that the delta encoder and the sample-adaptive generator are inverses of each other. As shown in Fig. 4, the delta \(\boldsymbol{\varDelta }^{ij}\) extracted from \(\{\boldsymbol{F}_h^i, \boldsymbol{F}_h^j\}\) is reconstructed from the fake tail-class pair \(\{\tilde{\boldsymbol{F}}_t^j, \boldsymbol{F}_t^i\}\) as \(\hat{\boldsymbol{\varDelta }}^{ij}\). Further, \(\hat{\boldsymbol{\varDelta }}^{ij}\) is combined with \(\boldsymbol{F}_h^i\) to reconstruct \(\boldsymbol{F}_h^j\) as \(\hat{\boldsymbol{F}}_h^j\). In this way, delta information and sample information are reconstructed bidirectionally, effectively improving the transferability of the extracted delta and enforcing the generated features to be sample-specific. To ensure the class-preserving characteristic of augmented tail-class samples, we introduce contrastive learning in Q to push apart paired samples from different classes while pulling together pairs from the same class. To further improve the diversity of augmented samples and enlarge the tail-class feature space, a modified mode seeking loss [29] is integrated into SAFA by maximizing the ratio of the distance between augmented tail-class samples to the distance between the extracted deltas.

The overall objective function of SAFA can be given as follows,

$$\begin{aligned} \mathcal {L}_{\text{ overall }}= \mathcal {L}_{cls} + \lambda _1 \mathcal {L}_{\text{ r }} + \lambda _2 \mathcal {L}_{\text{ ms }}^{t} + \lambda _3 \mathcal {L}_{\text{ ms }}^{h} + \lambda _4 \mathcal {L}_{c} \end{aligned}$$
(1)

where \(\mathcal {L}_{cls}\) denotes any classification loss, such as softmax with cross-entropy loss, focal loss [26], or LDAM [6]; \(\lambda _1\) (resp., \(\lambda _2\), \(\lambda _3\), \(\lambda _4\)) denotes the corresponding coefficient; and \(\mathcal {L}_{\text{ r }}\), \(\mathcal {L}_{\text{ ms }}\), and \(\mathcal {L}_{c}\) are the cycle reconstruction loss, mode seeking losses, and contrastive loss, respectively, which will be introduced in the next subsection.
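
For concreteness, Eq. (1) can be assembled as in the sketch below, using the coefficient values reported later in Sect. 4.1; the individual terms are computed as sketched in Sect. 3.2.

```python
def safa_overall_loss(l_cls, l_r, l_ms_t, l_ms_h, l_c,
                      lambdas=(100.0, 1e-2, 1e2, 1e-1)):
    """Eq. (1): weighted sum of the classification and SAFA losses.
    Default lambdas follow the lambda_1..lambda_4 values in Sect. 4.1."""
    l1, l2, l3, l4 = lambdas
    return l_cls + l1 * l_r + l2 * l_ms_t + l3 * l_ms_h + l4 * l_c
```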

3.2 Module Details and Objective Functions

Class-Irrelevant Delta Extraction. The delta feature extraction module E aims to capture diverse and transferable delta information. Given a pair of features \(\{\boldsymbol{F}_h^i, \boldsymbol{F}_h^j\} \in \mathcal {R}^{C\times W \times H}\) from the same head class, where C (resp., W, H) denotes the channel (resp., width, height) dimension, the delta feature extraction module E extracts the delta feature \(\boldsymbol{\varDelta }^{ij}\):

$$\begin{aligned} \boldsymbol{\varDelta }^{ij}= E(\boldsymbol{F}^i_h-\boldsymbol{F}^j_h), \end{aligned}$$
(2)

where \(\boldsymbol{\varDelta }^{ij} \in \mathcal {R}^{C_{\boldsymbol{\varDelta }} \times W \times H}\), and \(C_{\boldsymbol{\varDelta }}\) represents the channel dimension of delta features. The extracted delta feature \(\boldsymbol{\varDelta }^{ij}\) captures the variance (i.e., rich transformation information) between \(\boldsymbol{F}_h^i\) and \(\boldsymbol{F}_h^j\). By feeding various pairs of features from the same head class into E, we can obtain abundant and diverse delta features \(\boldsymbol{\varDelta }\), which can be applied to tail-class features to enlarge the feature space of tail classes.

Sample-Adaptive Generation. The sample-specific delta generator D and the feature generator G are designed to produce sample-specific deltas and features. The delta \(\boldsymbol{\varDelta }^{ij}\) extracted from different paired head-class features \(\{\boldsymbol{F}_h^i, \boldsymbol{F}_h^j\}\) may vary due to complicated scene geometry and light sources, and may thus be compatible with some tail-class samples but not others. It is therefore crucial to attend to the relevant information in \(\boldsymbol{\varDelta }^{ij}\) according to the tail-class feature \(\boldsymbol{F}_t^i\) to produce a sample-specific delta feature \(\boldsymbol{\varDelta }_t^{ij}\) more compatible with \(\boldsymbol{F}_t^i\). D attends to the relevant variance information in the extracted \(\boldsymbol{\varDelta }^{ij}\) according to the specific tail-class feature \(\boldsymbol{F}_t^i\), producing the sample-adaptive delta feature \(\boldsymbol{\varDelta }_t^{ij} \in \mathcal {R}^{C_{\boldsymbol{\varDelta }} \times W \times H}\), where \(\boldsymbol{\varDelta }_t^{ij}=D\left( \textrm{concat}\left( \boldsymbol{\varDelta }^{ij}, \boldsymbol{F}_t^i\right) \right) \). Then, we feed it together with \(\boldsymbol{F}_{t}^i\) into the generator G to produce the augmented tail-class feature \(\tilde{\boldsymbol{F}}_t^j\) belonging to class \(y_t^j\):

$$\begin{aligned} \tilde{\boldsymbol{F}}_{t}^j = G(\boldsymbol{\varDelta }_t^{ij} + \boldsymbol{F}_t^i). \end{aligned}$$
(3)
Fig. 4. The illustration of cycle delta reconstruction and feature reconstruction

Cycle Reconstruction Loss. To enforce the delta extractor E to extract effective class-irrelevant delta features and to ensure the augmented features are faithful to the input tail-class features (i.e., sample-specific), we apply a cycle reconstruction loss [19] in SAFA. We use objective functions that encourage reconstruction in the feature direction: paired head-class features \(\{\boldsymbol{F}_h^i, \boldsymbol{F}_h^j\}\) \(\rightarrow \) \(\boldsymbol{\varDelta }^{ij}\) \(\rightarrow \) reconstructed head-class feature \(\hat{\boldsymbol{F}}_h^j\), and in the delta direction: \(\boldsymbol{\varDelta }^{ij}\) \(\rightarrow \) augmented tail-class feature \(\tilde{\boldsymbol{F}}_t^j\) \(\rightarrow \) \(\hat{\boldsymbol{\varDelta }}^{ij}\). For the delta direction, with the augmented tail-class pair \(\{\tilde{\boldsymbol{F}}_t^j, \boldsymbol{F}_t^i\}\), we extract the reconstructed class-irrelevant delta \({\hat{\boldsymbol{\varDelta }}^{ij}}=E(\tilde{\boldsymbol{F}}_t^j - \boldsymbol{F}_t^i)\) and optimize:

$$\begin{aligned} \mathcal {L}_{\text{ r }}^{\boldsymbol{\varDelta }}= ||\hat{\boldsymbol{\varDelta }}^{ij} - \boldsymbol{\varDelta }^{ij}||_{2}. \end{aligned}$$
(4)

Note that \({\hat{\boldsymbol{\varDelta }}}^{ij}\) and \({{\boldsymbol{\varDelta }}}^{ij}\) are extracted from the tail class and head class respectively, which means \(\mathcal {L}_{\text{ r }}^{\boldsymbol{\varDelta }}\) forces \(\boldsymbol{\varDelta }^{ij}\) to be class-irrelevant. For the feature reconstruction direction, the reconstructed delta feature \({\hat{\boldsymbol{\varDelta }}}^{ij}\) combined with the head-class feature \(\boldsymbol{F}_h^i\) is fed into the sample-adaptive generation modules D and G to produce the reconstructed head-class feature \(\hat{\boldsymbol{F}}_h^j= G(D(\textrm{concat}(\hat{\boldsymbol{\varDelta }}^{ij}, \boldsymbol{F}_h^i))+\boldsymbol{F}_h^i)\); then we have:

$$\begin{aligned} \mathcal {L}_{\text{ r }}^{F}= ||\hat{\boldsymbol{F}}_h^j - {\boldsymbol{F}}_h^j||_{2}. \end{aligned}$$
(5)

The recycle reconstruction loss \(\mathcal {L}_{\text{ r }} = \mathcal {L}_{\text{ r }}^{\boldsymbol{\varDelta }} + \mathcal {L}_{\text{ r }}^{F}\) enforces the delta extraction module E and the sample-adaptive generation modules D and G to work consistently: extracting transferable deltas and adaptively combining them with tail-class features to produce sample-specific tail-class features.
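
The two directions can be written compactly as in the sketch below, reusing the SAFA module sketched in Sect. 3.1 (f_t_aug and delta denote the outputs of its forward pass); the helper name is ours.

```python
import torch

def cycle_reconstruction_loss(safa, f_h_i, f_h_j, f_t_i, f_t_aug, delta):
    # Delta direction: re-extract the delta from the fake tail-class pair.
    delta_hat = safa.E(f_t_aug - f_t_i)                 # \hat{Delta}^{ij}
    l_r_delta = torch.norm(delta_hat - delta, p=2)      # Eq. (4)

    # Feature direction: rebuild F_h^j from \hat{Delta}^{ij} and F_h^i.
    delta_h = safa.D(torch.cat([delta_hat, f_h_i], dim=1))
    f_h_j_hat = safa.G(delta_h + f_h_i)
    l_r_feat = torch.norm(f_h_j_hat - f_h_j, p=2)       # Eq. (5)

    return l_r_delta + l_r_feat                         # L_r
```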

Contrastive Loss. To ensure the category-preserving characteristics of augmented tail-class features, we adopt a contrastive module Q and compute a contrastive loss to ensure that the delta feature \({\boldsymbol{\varDelta }}^{ij}\) does not leak head-class information into the augmented tail-class features (i.e., \({\boldsymbol{\varDelta }}^{ij}\) is class-irrelevant). In a mini-batch \(\boldsymbol{F}_a\) consisting of real head-class features \(\boldsymbol{F}_h\), real tail-class features \(\boldsymbol{F}_t\), and augmented tail-class features \(\tilde{\boldsymbol{F}}_t\), we shuffle all samples with batch size s and form s/2 pairs by random sampling to train the contrastive module Q, using \(y_c \in \{0,1\}\) as the ground truth indicating whether the paired features come from the same class:

$$\begin{aligned} \mathcal {L}_{c}= -\left\langle \left( y_c \log \beta +(1-y_c) \log \left( 1-\beta \right) \right) \right\rangle _{\frac{s}{2}} \end{aligned}$$
(6)

where \(\beta = Q (\boldsymbol{F}_a^i,\boldsymbol{F}_a^j)\) represents the probability that \(\{\boldsymbol{F}_a^i, \boldsymbol{F}_a^j\}\) belong to the same class, and \( \langle \cdot \rangle _{\frac{s}{2}} \) denotes that \(\mathcal {L}_{c}\) is averaged over the s/2 paired features.
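
A minimal sketch of this procedure, assuming Q is a callable that maps a feature pair to the same-class probability \(\beta \) (i.e., it ends with a sigmoid; the paper's Q stacks Conv-BN-ReLU-Conv with an extra FC layer); the shuffle-and-pair logic follows our reading of the text.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(Q, feats, labels):
    """feats: (s, C, H, W) mini-batch mixing real head/tail and augmented
    tail features; labels: (s,) class indices (an augmented feature keeps
    the label of its source tail class)."""
    s = feats.size(0)
    perm = torch.randperm(s)                    # shuffle, then pair up
    i, j = perm[: s // 2], perm[s // 2: 2 * (s // 2)]
    y_c = (labels[i] == labels[j]).float()      # 1 if same class, else 0
    beta = Q(feats[i], feats[j]).squeeze(-1)    # same-class probability
    return F.binary_cross_entropy(beta, y_c)    # Eq. (6), mean over s/2 pairs
```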

Mode Seeking Loss. To further produce diverse augmented tail-class features and enlarge the tail-class feature space, we employ a mode seeking loss [29] to increase the distance between paired augmented tail-class features generated from the same delta \(\boldsymbol{\varDelta }^{ij}\), and likewise the distance between paired augmented features generated from the same tail-class feature. In detail, given the delta feature \(\boldsymbol{\varDelta }^{ij}\) extracted from \(\{\boldsymbol{F}_h^i, \boldsymbol{F}_h^j\}\) and a pair of features \(\{\boldsymbol{F}_t^i, \boldsymbol{F}_t^j\}\) from the same tail class, we produce the paired augmented features \(\{\tilde{\boldsymbol{F}}_t^i, \tilde{\boldsymbol{F}}_t^j\}\) following Eq. (3); the mode seeking losses are written as:

$$\begin{aligned} \mathcal {L}_{\text{ ms }}^{t} = \left\langle \frac{|| \boldsymbol{F}^i_t - \boldsymbol{F}^j_t||_1}{||\tilde{\boldsymbol{F}}_t^i - \tilde{\boldsymbol{F}}_t^j||_1}\right\rangle _{\frac{s}{2}}, \quad \mathcal {L}_{\text{ ms }}^{h} = \left\langle \frac{||{{\boldsymbol{F}}_h^i - {\boldsymbol{F}}_h^j} ||_1}{||\tilde{\boldsymbol{F}}_t^i - \tilde{\boldsymbol{F}}_t^j||_1} \right\rangle _{\frac{s}{2}}. \end{aligned}$$
(7)
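
Per pair, the two terms can be sketched as below; in practice both ratios are averaged over the s/2 pairs, and the small eps is our own addition for numerical stability.

```python
import torch

def mode_seeking_losses(f_t_i, f_t_j, f_t_aug_i, f_t_aug_j,
                        f_h_i, f_h_j, eps=1e-8):
    # Minimizing these ratios pushes the two augmented tail-class
    # features apart (their distance sits in the denominator).
    d_aug = torch.norm(f_t_aug_i - f_t_aug_j, p=1) + eps
    l_ms_t = torch.norm(f_t_i - f_t_j, p=1) / d_aug   # tail term of Eq. (7)
    l_ms_h = torch.norm(f_h_i - f_h_j, p=1) / d_aug   # head term of Eq. (7)
    return l_ms_t, l_ms_h
```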

4 Experiment

We conduct experiments on CIFAR-LT-10/CIFAR-LT-100 [25], ImageNet-LT [27], Places-LT [47], and iNaturalist 2018 [37]. For comparison experiments conducted under the same settings, we directly quote results from the original papers. Next, we briefly introduce these datasets and the basic experimental settings; details of the datasets and implementation are reported in the Supplementary.

4.1 Implementation Details and Datasets

In the following experiments, SAFA is inserted before the second-to-last down-sampling layer, where we obtained the best results. In addition, we report experimental results on CIFAR-LT-10/CIFAR-LT-100 with SAFA integrated into different layers in the Supplementary. The hyperparameter \(\lambda _1\) (resp., \(\lambda _2\), \(\lambda _3\), \(\lambda _4\)) is set to 100 (resp., 1e−2, 1e2, 1e−1) by observing the validation accuracy on the CIFAR-LT-100 dataset. We provide an in-depth analysis of each loss term in Sect. 4.3 and the Supplementary.

During training, a threshold epoch \(\mathbb {T}_{th}\) decides when to activate the SAFA module to produce new tail-class features. Before \(\mathbb {T}_{th}\), the network without SAFA is optimized with \(\mathcal {L}_{cls}\) only. After \(\mathbb {T}_{th}\), SAFA is activated, optimized with \(\mathcal {L}_{\text{ overall }}\) in Eq. (1), and produces augmented tail-class features. In each mini-batch, we sample same-class pairs from the dataset and split them into two parts according to a manually set head-class ratio \(\gamma = n_h/ (n_h + n_t)\), where \(n_h\) and \(n_t\) denote the number of head classes and the number of tail classes, respectively. Following [38], we set \(\gamma =0.2\) for all datasets. The sketch after this paragraph illustrates the schedule.
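
In the sketch below, the split of the backbone into feat_net/head_net around SAFA's insertion layer and the pre-split pair arguments are our own simplifications; only the classification term of Eq. (1) is shown, and the remaining terms are added as sketched in Sect. 3.2.

```python
import torch

def training_loss(feat_net, head_net, safa, criterion_cls,
                  x, y, f_h_i, f_h_j, f_t_i, y_t, epoch, t_th):
    """One step; (f_h_i, f_h_j) is a same-class head-class feature pair and
    (f_t_i, y_t) a tail-class feature/label, pre-split with gamma = 0.2."""
    feats = feat_net(x)                          # features at SAFA's layer
    if epoch < t_th:                             # SAFA not yet activated
        return criterion_cls(head_net(feats), y)
    f_t_aug, _ = safa(f_h_i, f_h_j, f_t_i)       # augmented tail features
    logits = head_net(torch.cat([feats, f_t_aug], dim=0))
    return criterion_cls(logits, torch.cat([y, y_t], dim=0))
```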

CIFAR-LT: For CIFAR-LT-10 (resp., CIFAR-LT-100) with 10 (resp., 100) classes, following [11], we create 5 training sets by varying the imbalance factor \(\rho \) over \(\{200,100,50,20,10\}\), where \(\rho \) is the ratio between the number of images in the largest and smallest classes. We use the original balanced test sets. Following [38], the main results on CIFAR-LT-10/CIFAR-LT-100 are obtained by training ResNet-32 [16] for 200 epochs with a batch size of 128. The learning rate was set to 0.1 at the beginning, then decayed by a factor of 0.01 at the 160-th epoch and again at the 180-th epoch. SAFA is activated at \(\mathbb {T}_{th}=159\).
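
For reference, the long-tailed training sets of [11] are commonly built with exponentially decaying per-class counts so that the largest/smallest ratio equals \(\rho \); the sketch below is our reading of that protocol, not the authors' released code.

```python
def long_tail_counts(n_max=5000, num_classes=10, rho=100):
    """Per-class sample counts with largest/smallest ratio equal to rho,
    e.g. CIFAR-LT-10 with rho=100 decays from 5000 down to 50 images."""
    return [int(n_max * rho ** (-c / (num_classes - 1)))
            for c in range(num_classes)]
```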

ImageNet-LT: ImageNet-LT was built in [27] from the ImageNet dataset [32] with 1000 classes; its imbalance factor \(\rho \) is 1280/5. Our ImageNet-LT experiments use ResNeXt-50-32x4d [16], trained with a batch size of 256 for 100 epochs, as described in [22]. The initial learning rate was set to 0.1 and decayed by 0.1 at the 60-th, 80-th, and 95-th epochs. Following [38], the test-set classes are further divided into three groups: many-shot (over 100 samples), medium-shot (between 20 and 100 samples), and few-shot (less than 20 samples), to better examine performance differences across classes with different numbers of training samples. SAFA is integrated into ResNeXt-50-32x4d at \(\mathbb {T}_{th}=59\).

Places-LT: Places-LT is a subset of the large-scale scene classification dataset [47]. It comprises 365 categories with class cardinality ranging from 5 to 4980. Following [27], we fine-tune a ResNet-152 pre-trained on the entire ImageNet dataset [32]. The network was trained with a batch size of 256. The initial learning rate was set to 0.01 and decayed by 0.1 every ten epochs until training terminated after 30 epochs. SAFA is employed at \(\mathbb {T}_{th}=9\). As in the ImageNet-LT evaluation, we report the top-1 accuracy of the many-shot, medium-shot, and few-shot groups.

iNaturalist 2018: The iNaturalist 2018 [37] dataset is a large-scale dataset with real-world images collected from 8142 classes, exhibiting an extremely imbalanced class distribution with an imbalance factor of 1000/2. With a batch size of 256, we train ResNet-50 from scratch for 90 epochs. The learning rate was initially set to 0.1 and decayed by 0.1 at the 50-th, 70-th, and 85-th epochs. SAFA is activated at \(\mathbb {T}_{th}=69\), and we report the top-1 error rate as the final evaluation.

Table 1. Test top-1 errors (%) of ResNet-32 on CIFAR-LT-10 and CIFAR-LT-100 with imbalance ratio \(\rho \in \{200,100,50,20,10\}\)

4.2 Comparison with Previous Methods

Since SAFA works as a plug-in that can be integrated into different networks and combined with different loss functions, we conduct comparison experiments against typical long-tailed methods [6, 11, 26] and several state-of-the-art methods [20, 38, 45]. For brevity, we refer to the baseline trained with cross-entropy (resp., Class-Balanced Cross-Entropy loss [11]) as “CE loss” (resp., “CB-CE loss”), and refer to “A-SAFA” as the combination of our SAFA with method “A”.

Table 2. Top-1 accuracy of ResNeXt-50 on ImageNet-LT
Table 3. Top-1 accuracy of ResNet-152 on Places-LT
Table 4. Top-1 error rates of ResNet-50 on iNaturalist 2018

Results on CIFAR-LT: Comparison results on CIFAR-LT-10 and CIFAR-LT-100 with imbalance factor \(\rho \in \{200,100,50,20,10\}\) are shown in Table 1, categorized into three groups according to the adopted basic loss (i.e., CE, focal [26], and LDAM [6]). We evaluate our method with all three basic losses. The results reveal that our method consistently and significantly improves the performance of the basic losses. In particular, our method notably surpasses mixup, which augments the inputs, and RSG [38], which augments tail classes by leveraging knowledge from head classes, showing that our augmentation method is more effective in long-tailed scenarios. Furthermore, SAFA outperforms the re-weighting strategies, illustrating that our augmentation method can indeed improve classifier performance. SAFA still obtains stable performance gains when the dataset is less imbalanced (imbalance factor \(\rho = 10\)), demonstrating that SAFA does not harm the classifier's performance in a moderately balanced scenario. Another observation is that re-weighting strategies [6, 11] are beneficial for long-tailed issues, since re-weighting methods including CB-CE, CB Focal loss, CB-RSG, and our CB-SAFA surpass cross-entropy training (CE loss) by a significant margin. Moreover, we compare our method with other sample generation methods [33, 39, 43] in the Supplementary (Table 4).

Results on ImageNet-LT: We present the results for ImageNet-LT in Table 2. Compared to LDAM-DRS and LDAM-DRW-RSG [38], LDAM-DRS-SAFA (ours) achieves higher accuracy, demonstrating that SAFA can address the problem of imbalanced datasets. On medium-shot and few-shot classes, SAFA produces effective and diverse tail-class features that enlarge the tail-class feature space, improving the model and considerably enhancing its generality.

Table 5. Top-1 error rates of different network architectures combined with LDAM-DRW [6] on CIFAR-LT
Table 6. Results of ablated methods by removing each proposed loss from Eq. (1). We report the top-1 error rates of ResNet-32 combined with SAFA and LDAM-DRW [6] on CIFAR-LT-10/CIFAR-LT-100 with different imbalance ratios

Results on Places-LT: Table 3 shows the top-1 accuracy on Places-LT. The results reveal that pairing SAFA with LDAM-DRS increases performance even further, demonstrating that SAFA is useful. Furthermore, compared with recent prominent approaches, including tau-normalized, BBN, DisAlign, and RSG, SAFA improves the model's performance on medium-shot and few-shot classes while causing less accuracy loss on many-shot classes, resulting in higher overall accuracy and competitive results.

Results on iNaturalist 2018: We report experimental results under the same setting as [38] on the iNaturalist 2018 dataset. The results reveal that, by leveraging the proposed sample-adaptive feature augmentation, we achieve superior results, demonstrating the efficacy of SAFA. As can be observed, SAFA helps the model achieve competitive outcomes, showing that SAFA can effectively cope with imbalanced datasets.

4.3 Ablation Studies

Adaptivity to Different Backbone Networks: First, we analyze the effectiveness of the proposed SAFA module by integrating it into different network architectures, including ResNet-32, ResNet-56, ResNet-110, DenseNet-40, and ResNeXt-29 (8\(\times \)64d), and report comparison results on CIFAR-LT-10/CIFAR-LT-100 with \(\rho =\{200,10\}\) in Table 5, where “w/o SAFA” denotes removing SAFA during training. From Table 5, we can see that all models equipped with SAFA are consistently better for both \(\rho =200\) and \(\rho =10\), indicating that SAFA can be employed in various deep neural networks to improve long-tail classification performance.

Combination with Different Loss Functions: In Table 1, comparing the results of CE loss (resp., Focal loss, LDAM-DRW loss) with those of CB-SAFA (resp., SAFA focal loss, LDAM-DRW SAFA loss) shows that SAFA is compatible with different loss functions and consistently improves classification performance on top of each of them.

Analysis of Each Loss Term of SAFA: In SAFA, we employ a reconstruction loss \(\mathcal {L}_{r}\), a tail mode seeking loss \(\mathcal {L}^{t}_{ms}\), a head mode seeking loss \(\mathcal {L}^{h}_{ms}\), and a contrastive loss \(\mathcal {L}_{c}\). To investigate the impact of each term, we conduct ablation studies on CIFAR-LT-10 and CIFAR-LT-100 by removing each loss term from the final objective in Eq. (1). The results are summarized in Table 6. Firstly, the classification performance is compromised when \(\mathcal {L}_r\) is removed, dropping even below the LDAM-DRW baseline [6] without augmentation, which implies that the recycle reconstruction loss is necessary: it enforces the SAFA module to extract transferable deltas and achieve sample-adaptive augmentation. Removing \(\mathcal {L}_{c}\) results in slight performance degradation on both datasets, since without the contrastive loss the generated features may not belong to the category of the combined tail class. Removing the head mode seeking loss \(\mathcal {L}^{h}_{ms}\) makes the classification performance much worse in extremely imbalanced scenarios with \(\rho =\{200,100\}\) on both datasets, while leaving less impact on relatively balanced settings with \(\rho =\{20,10\}\). Conversely, ablating the tail mode seeking loss \(\mathcal {L}^{t}_{ms}\) causes only a minor deterioration of classification performance in extremely imbalanced settings with \(\rho =\{200,100\}\), but a more significant decline in less imbalanced settings with \(\rho =\{20,10\}\). This can be explained as follows: in extremely imbalanced settings like \(\rho =\{200,100\}\), head-class samples may be compact in feature space, leading to more compact deltas; in other words, the distance among deltas is limited. In this scenario, using the head mode seeking loss \(\mathcal {L}^{h}_{ms}\) to enlarge the distance between augmented tail-class features relative to the distance of the head-class pair helps produce diverse tail-class samples, whereas in less imbalanced settings the distance among deltas may already suffice to produce different samples without \(\mathcal {L}^{h}_{ms}\). Similarly, \(\mathcal {L}^{t}_{ms}\) enlarges the distance between two tail-class features augmented from the same tail-class feature. Leveraging \(\mathcal {L}^{t}_{ms}\) to make SAFA sensitive to the difference between paired features from the same tail class is helpful in a less imbalanced setting, where the tail-class feature space may be compact, but is less necessary for a relatively loose tail-class feature space.

5 Conclusions

In this paper, we propose SAFA, a novel plug-in approach that can be conveniently integrated into various networks and coupled with different loss functions. SAFA extracts transferable deltas from head classes and applies them sample-adaptively to tail classes to enlarge the tail-class feature space. Extensive experiments demonstrate the effectiveness of the proposed SAFA.