1 Introduction

Supervised learning methods rely on large numbers of manually labeled samples. In many fields, the lack of labeled samples limits the reliability and generalization ability of a model. Few-shot learning is proposed to address this problem: it aims to enable a model to classify new classes that do not appear in the training set from only a few annotated examples [1].

At its core, a few-shot learning task is a cross-domain problem. Chen et al. [2] find that meta-learning methods lose their advantage when the domain gap is too large. A similar view is put forward in [3]: Guo et al. find that different meta-learning models perform similarly within the same domain, but a single meta-learning model performs very differently across domains. This phenomenon is called the cross-domain problem [4]. Many researchers have found that learning an excellent feature encoder can greatly improve model performance [5]. Therefore, more and more work focuses on learning an initial feature encoder with strong generalization ability to address the cross-domain problem.

In this work, we propose a novel few-shot classification framework based on metric learning and self-supervised learning, which consists of a classification model and a meta-learning model. Specifically, the classification model is trained on the base class set to obtain a feature extractor with strong extraction ability, and this feature extractor is then used as the initial feature encoder of the meta-learning model, which is evaluated on the new class set [6, 7]. The contributions are as follows:

  • We add a rotation self-supervised auxiliary loss to the classification network, which aims to improve the feature representation ability of the model.

  • We propose the meta loss (ML) based on pairwise-samples, which aims to reduce the intra-class difference and increase the inter-class difference.

  • We propose a new regularization technique named resistance regularization, which improves the generalization ability of the model. Resistance regularization consists of exchange processing and the NT-Xent (normalized temperature-scaled cross entropy) loss.

2 Related work

2.1 Meta-learning

Meta-learning aims to generalize previously learned knowledge or experience to many new tasks autonomously and quickly [8]. For example, meta transfer learning (MTL) is proposed in [6], which uses the hard task (HT) meta-batch scheme to force the meta-learner to “grow up in difficulties”. The task-aware feature embedding network (TAFE-Net) is proposed in [9] to obtain task-aware embeddings for few-shot classification tasks. Latent embedding optimization (LEO) is introduced in [10], which applies a parameter-generation model to capture useful parameters for the tasks. Chen et al. [2] show that model performance is related to the domain gap, and that a shallow network outperforms deeper backbones when the domain gap is small. In addition, a new meta-learning method based on a dual formulation and the KKT conditions is proposed in [11] to improve computational efficiency.

2.2 Metric learning

Metric learning aims to reduce the intra-class difference and increase the inter-class difference, and it is widely used in many fields. Recent deep metric learning losses are built on pairwise-samples. For example, a hierarchical triplet loss (HTL) is proposed in [12] to automatically collect informative training samples. The triplet-center loss is proposed in [13], which further enhances the distinctiveness of features. The multi-class N-pair loss is proposed in [14] to address the slow convergence of the contrastive loss and triplet loss. An angle-based loss is proposed in [15], which aims to learn valuable features by considering the angular relationship between samples. Wu et al. [16] argue that the selection of training samples plays an equally important role in model training and propose a sampling method based on distance weighting. In our work, we propose the meta loss (ML) based on pairwise-samples, which is used in the meta-learning model to account for the influence of other samples on the target sample in the feature space.

Classification by metric learning is performed in two steps. First, the feature centroid βi of class i is calculated by formula (1), where K is the number of input samples of that class; second, the cosine similarity score pi between the predicted sample x and each centroid is calculated by formula (2). The category with the highest similarity score is taken as the category of the target sample x.

$$ \beta_{i} = \frac{1}{K}{{\sum}_{j=1}^{K} f_{\theta}(x_{i,j})} $$
(1)
$$ p_{i} = <f_{\theta}(x), \beta_{i}> = \frac{f_{\theta}(x) \cdot \beta_{i}}{\lVert f_{\theta}(x) \rVert \lVert \beta_{i} \rVert} $$
(2)
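To make the two steps concrete, the following PyTorch sketch computes the centroids of (1) and the cosine scores of (2); the function and argument names are placeholders, and `encoder` stands for the feature extractor fθ.

```python
import torch
import torch.nn.functional as F

def classify_by_centroids(encoder, support, query):
    """support: (N, K, C, H, W) images of N classes; query: (Q, C, H, W) images."""
    N, K = support.shape[:2]
    # Eq. (1): class centroid = mean of the K support embeddings of each class.
    s_feat = encoder(support.flatten(0, 1)).view(N, K, -1)
    centroids = s_feat.mean(dim=1)                      # (N, D)
    # Eq. (2): cosine similarity between every query embedding and every centroid.
    q_feat = F.normalize(encoder(query), dim=-1)        # (Q, D)
    centroids = F.normalize(centroids, dim=-1)
    scores = q_feat @ centroids.t()                     # (Q, N) similarity scores p_i
    return scores.argmax(dim=-1)                        # predicted class per query
```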

2.3 Self-supervised learning

Self-supervised learning aims to mine supervision information from large-scale unlabeled data through auxiliary tasks, which helps the model capture more valuable features. Doersch et al. [17] construct a new auxiliary loss by predicting the context of the image, and similar work is carried out in [18] and [19]. The auxiliary task of predicting the color of an image is designed in [20, 21] and is used to extract semantic information. Gidaris et al. [22] propose a rotation loss that predicts the rotation angle of the image, which improves the robustness of the model. Hjelm et al. [23] design an auxiliary task to distinguish between the global and local features of an image. Tian et al. [24] propose constructing samples from multi-view information. Chen et al. propose SimCLR in [25], which designs auxiliary tasks by augmenting the input samples. Many researchers [26, 27] combine such self-supervised auxiliary tasks with classification networks, which greatly improves the performance of the models. In view of the above, the rotation self-supervised loss is applied in our classification network to obtain an initial feature encoder with strong representational ability.

3 Method

3.1 The overall framework

The framework of our self-supervised pairwise-sample resistance model (SPRM) is shown in Fig. 1; it consists of a classification model and a meta-learning model. fc1 and fc2 are the two classification layers, f𝜃 is the feature encoder, and fϕ is the projection head. l(⋅) is the cross-entropy loss, and the rotation self-supervised loss (\({\mathscr{L}}_{R}\)) is added to the classification model as an auxiliary loss. The trained feature encoder of the classification model is used as the initial feature encoder of the meta-learning model. The meta loss (\({\mathscr{L}}_{ML}\)), built on pairwise-samples, is applied in the meta-learning model. Inspired by SimCLR, a new regularization technique called resistance regularization (\({\mathscr{L}}_{NT}\)) is proposed for few-shot learning and is added to the ML as a regularization term.

Fig. 1

The framework of SPRM

The algorithm flow of SPRM is shown in Algorithm 1; it is divided into the following two training steps:

Algorithm 1

SPRM feature backbone training.

Step 1: Training the classification model by \(l(\cdot )+{\mathscr{L}}_{R}\); training the projection head by \(l(\cdot )+ {\mathscr{L}}_{NT}\).

Step 2: Training the meta-learning model by \({\mathscr{L}}_{ML}+ {\mathscr{L}}_{NT}\). The resistance regularization \({\mathscr{L}}_{NT}\) is utilized as the regularization term in the meta loss \({\mathscr{L}}_{ML}\).
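For illustration, the following sketch outlines the two-step schedule under the assumption that the individual losses of formulas (3), (6) and (11) are provided as callables; all names, signatures and epoch counts here are placeholders rather than the actual implementation.

```python
import torch

def train_two_steps(classifier, encoder, proj_head, base_loader, episode_loader,
                    ce_loss, rot_loss, ml_loss, nt_loss,
                    epochs_cls=100, epochs_meta=50, lr=1e-3):
    """Schematic version of Algorithm 1 (placeholder names and signatures)."""
    # Step 1: classification model trained with l(.) + L_R,
    #         projection head trained with l(.) + L_NT.
    opt1 = torch.optim.SGD(list(classifier.parameters()) + list(proj_head.parameters()), lr=lr)
    for _ in range(epochs_cls):
        for x, y in base_loader:
            loss = ce_loss(classifier(x), y) + rot_loss(encoder, x) + nt_loss(proj_head, encoder, x)
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Step 2: meta-learning model trained with L_ML + L_NT (resistance regularization).
    opt2 = torch.optim.SGD(encoder.parameters(), lr=lr)
    for _ in range(epochs_meta):
        for support, s_labels, query, q_labels in episode_loader:
            loss = ml_loss(encoder, support, s_labels, query, q_labels) \
                   + nt_loss(proj_head, encoder, torch.cat([support, query]))
            opt2.zero_grad(); loss.backward(); opt2.step()
```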

3.2 Rotation self-supervised loss

The rotation self-supervised loss [22] is used to improve the feature extraction ability and robustness of the classification model. Specifically, each input image is rotated by 0°, 90°, 180° and 270°, so four images are obtained from one image. The task of the model is to predict the rotation angle of each rotated image. The self-supervised loss function is defined as:

$$ \mathcal{L}_{R} = \min_{\theta} \frac{1}{MC} \sum\limits_{i=1}^{M} \sum\limits_{j=1}^{C} l(f_{\theta}(x_{i,j}),y) $$
(3)

where M is the number of input samples and C is the number of rotation angles (C = 4 in our network). xi,j denotes the i-th sample rotated by the j-th angle.
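A minimal PyTorch sketch of this pretext loss is given below; `rot_head` (a 4-way classifier on top of the encoder) is a hypothetical name, not part of the original description.

```python
import torch
import torch.nn.functional as F

def rotation_self_supervised_loss(encoder, rot_head, images):
    """images: (M, C, H, W); returns the averaged cross-entropy over the 4 rotations (Eq. 3)."""
    rotated, targets = [], []
    for k in range(4):                                  # rotation by k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        targets.append(torch.full((images.size(0),), k, dtype=torch.long))
    x = torch.cat(rotated, dim=0)                       # (4M, C, H, W)
    y = torch.cat(targets, dim=0)                       # (4M,) rotation indices
    logits = rot_head(encoder(x))                       # predict the rotation angle
    return F.cross_entropy(logits, y)                   # equals 1/(MC) * sum of l(.)
```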

3.3 Meta loss

Wang et al. [28] propose the multi-similarity (MS) loss, which combines the self-similarity S, positive relative similarity P and negative relative similarity N; it is calculated as:

$$ \mathcal{L}_{MS} = \frac{1}{M} {\sum}_{i=1}^{M} \left\{\frac{1}{\alpha} \log\left[1 + \sum\limits_{k \in \mathcal{P}} \exp^{-\alpha(S_{ik}-\lambda)}\right]+\frac{1}{\beta} \log\left[1+{\sum}_{k \in \mathcal{N}} \exp^{\beta(S_{ik}-\lambda)}\right]\right\} $$
(4)
$$ S_{ik}=<x_{i},x_{k}>=\frac{x_{i} \cdot x_{k}}{\lVert x_{i} \rVert \lVert x_{k} \rVert} $$
(5)

where \(\mathcal {P}\) and \(\mathcal {N}\) are the sets of positive and negative samples respectively, xi is the anchor, xk is the k-th sample to be predicted, λ is the similarity threshold, and α and β are hyperparameters set by experience. α controls the compactness of the positive samples and penalizes positive samples whose cosine similarity is less than λ; β controls the compactness of the negative samples and penalizes negative samples whose cosine similarity is greater than λ.

Meta-learning aims to predict the query set samples by using the support set samples. However, the MS loss sets each sample as the anchor in turn, so query set samples end up being used to predict other samples, which is incompatible with the meta-learning training paradigm.

Based on the above, we propose the meta loss (ML) for few-shot learning, which takes only the centroids of the support set samples as anchors. ML is calculated as:

$$ \mathcal{L}_{ML} = \frac{1}{N} {\sum}_{a=1}^{N} \left\{\frac{1}{\alpha} \log\left[\mu+{\sum}_{k \in \mathcal{P}} \exp^{-\alpha(S_{ak}-\eta)}\right]+\frac{1}{\beta} \log\left[\mu+{\sum}_{k \in \mathcal{N}} \exp^{\beta(S_{ak}-\lambda)}\right]\right\} $$
(6)

where N is the number of categories, and μ and η are introduced in ML to constrain the positive and negative pairwise-samples (see Section 3.3.2 for details). Comparing formula (4) with formula (6), the index i of MS traverses all samples from 1 to M, whereas the index a of ML traverses the classes from 1 to N. That is, in an N-way task, ML uses only the centroid of the support set samples of each class as the anchor, which not only improves the computational efficiency of the model but also satisfies the principle that only support set samples are used as anchors in the meta-learning model. ML contains two steps, mining and weighting the pairwise-samples, which are shown in Fig. 2.
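For illustration, a minimal PyTorch sketch of formula (6) is given below, with the support centroids as anchors and the query embeddings as the positive/negative sets; the mining step of Section 3.3.1 is omitted for brevity, and the hyperparameter defaults are only illustrative.

```python
import torch
import torch.nn.functional as F

def meta_loss(centroids, query_feat, query_labels,
              alpha=2.0, beta=50.0, eta=1.0, lam=0.5, mu=1.0):
    """centroids: (N, D); query_feat: (Q, D); query_labels: (Q,) with values in [0, N)."""
    c = F.normalize(centroids, dim=-1)
    q = F.normalize(query_feat, dim=-1)
    sim = c @ q.t()                                     # (N, Q) cosine similarities S_ak
    loss = 0.0
    for a in range(c.size(0)):                          # anchor = centroid of class a
        pos = sim[a, query_labels == a]                 # positive pairwise-samples
        neg = sim[a, query_labels != a]                 # negative pairwise-samples
        loss = loss + torch.log(mu + torch.exp(-alpha * (pos - eta)).sum()) / alpha
        loss = loss + torch.log(mu + torch.exp(beta * (neg - lam)).sum()) / beta
    return loss / c.size(0)
```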

Fig. 2

Mining and weighting of pairwise-samples in ML. One circle is drawn from the negative pairwise-sample nearest to the anchor, with radius \(r_{\min \limits }\) such that \({\min \limits } ({d_{a\mathcal {N}}})=r_{\min \limits }+\epsilon \); another circle is drawn from the positive pairwise-sample farthest from the anchor, with radius \(r_{\max \limits }\) such that \({\max \limits } ({d_{a\mathcal {P}}})=r_{\max \limits }-\epsilon \)

3.3.1 Mining the pairwise-samples

ML aims to reduce the computational effort of the model by mining the more valuable pairwise-samples, namely the negative pairwise-samples (different categories) with large similarity scores and the positive pairwise-samples (same category) with small similarity scores.

Inspired by LMNN [29] and the MS loss [28], the positive relative similarity P is used to mine the difficult pairwise-samples by formulas (7) and (8), and the remaining, less informative pairwise-samples are discarded.

$$ S_{ai}^{-} > \min_{k \in \mathcal{P}} S_{ak}-\epsilon $$
(7)
$$ S_{aj}^{+} < \max_{k \in \mathcal{N}} S_{ak}+\epsilon $$
(8)

where \(S_{a\mathcal {P}}\) and \(S_{a\mathcal {N}}\) are the similarity score maps of the positive and negative pairwise-samples, respectively, 𝜖 is the mining threshold, \(S_{ai}^{-}\) is the cosine similarity range of the mined negative pairwise-samples, and \(S_{aj}^{+}\) is the cosine similarity range of the mined positive pairwise-samples. The mining process of ML is shown in Table 1.

Table 1 Mining process of ML
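A minimal sketch of the mining rule (7)-(8) for a single anchor is given below; the function name and argument shapes are assumptions made for illustration.

```python
import torch

def mine_pairs(sim_pos, sim_neg, eps=0.1):
    """sim_pos / sim_neg: 1-D tensors of cosine similarities between one anchor
    and its positive / negative pairwise-samples."""
    neg_keep = sim_neg > (sim_pos.min() - eps)          # Eq. (7): keep informative negatives
    pos_keep = sim_pos < (sim_neg.max() + eps)          # Eq. (8): keep informative positives
    return sim_pos[pos_keep], sim_neg[neg_keep]
```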

3.3.2 Weighting the pairwise-samples

The valuable pairwise-samples roughly mined by the positive relative similarity P are then further weighted by the self-similarity S and the negative relative similarity N. Specifically, given a negative pairwise-sample \(\{ x_{a}, x_{i}\}, i \in \mathcal {N}\), its weight \(w_{ai}^{-}\) (the partial derivative of the loss with respect to Sai) is calculated by formula (9), which penalizes negative samples whose cosine similarity is greater than λ; given a positive pairwise-sample \(\{ x_{a}, x_{j}\}, j \in \mathcal {P}\), its weight \(w_{aj}^{+}\) is calculated by formula (10), which penalizes positive samples whose cosine similarity is less than η. The parameter μ adjusts the proportion of the self-similarity S.

$$ w_{ai}^{-} = \frac{1}{\mu \exp^{\beta(\lambda-S_{ai})}+{\sum}_{k \in \mathcal{N}}\exp^{\beta(S_{ak}-S_{ai})}} = \frac{\exp^{\beta(S_{ai}-\lambda)}}{\mu+{\sum}_{k \in \mathcal{N}}\exp^{\beta(S_{ak}-\lambda)}} $$
(9)
$$ w_{aj}^{+} = \frac{1}{\mu \exp^{-\alpha(\eta-S_{aj})}+{\sum}_{k \in \mathcal{P}}\exp^{-\alpha(S_{ak}-S_{aj})}} $$
(10)
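As an illustration, the weights of formulas (9) and (10) for the mined pairs of a single anchor can be evaluated as in the sketch below; the hyperparameter defaults are illustrative only.

```python
import torch

def pair_weights(sim_pos, sim_neg, alpha=2.0, beta=50.0, eta=1.0, lam=0.5, mu=1.0):
    """sim_pos / sim_neg: 1-D tensors of cosine similarities of the mined pairs of one anchor."""
    # Eq. (9): weight of each mined negative pairwise-sample {x_a, x_i}
    w_neg = 1.0 / (mu * torch.exp(beta * (lam - sim_neg))
                   + torch.exp(beta * (sim_neg.unsqueeze(0) - sim_neg.unsqueeze(1))).sum(dim=1))
    # Eq. (10): weight of each mined positive pairwise-sample {x_a, x_j}
    w_pos = 1.0 / (mu * torch.exp(-alpha * (eta - sim_pos))
                   + torch.exp(-alpha * (sim_pos.unsqueeze(0) - sim_pos.unsqueeze(1))).sum(dim=1))
    return w_pos, w_neg
```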

3.4 Resistance regularization

Resistance regularization consists of exchange processing and the NT-Xent loss. Before applying it, the pairwise-sample labels need to be constructed.

In an N-way K-shot M-query task, there are N classes with K + M images each, giving N × (K + M) input images in total. Copying each image yields 2N × (K + M) input images. For the pairwise-sample labels, the N × (K + M) original images are treated as N × (K + M) distinct categories. After copying, each real category contains 2 × (K + M) images, but under the pairwise-sample labels each category contains only two images: an original and its copy. For example, in a 3-way 1-shot 1-query meta-learning task, the original input samples are [dog1, dog2, cat1, cat2, pig1, pig2] and their copies are [Dog1, Dog2, Cat1, Cat2, Pig1, Pig2], so the input data after copying is [dog1, dog2, cat1, cat2, pig1, pig2, Dog1, Dog2, Cat1, Cat2, Pig1, Pig2]. The pairwise-sample labels are constructed as in Fig. 3(a): a pairwise-sample with the same label is marked 1, otherwise 0.
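The label matrix of Fig. 3(a) for this toy 3-way 1-shot 1-query example can be built as follows (a sketch for illustration only):

```python
import torch

originals = ["dog1", "dog2", "cat1", "cat2", "pig1", "pig2"]
copies    = [name.capitalize() for name in originals]        # Dog1, Dog2, ...
samples   = originals + copies                               # 2 * N * (K + M) images

n = len(originals)
pair_labels = torch.zeros(len(samples), len(samples))
for i in range(n):                                           # each original and its copy form a pair
    pair_labels[i, n + i] = pair_labels[n + i, i] = 1
print(pair_labels.int())
```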

Fig. 3

Pairwise-sample labels for resistance regularization

3.4.1 Exchange processing

After the pairwise-sample labels are constructed, the labels are fixed and the positions of the 2N × (K + M) input images are exchanged, as shown in Fig. 3(b); the self-pairwise-sample labels are deleted. After exchanging, for the entries where the pairwise-sample label is non-zero, if the real labels of the two samples in a pair are the same, the exchange is called a soft exchange (denoted + 1); otherwise it is called a hard exchange (denoted -1). The signs “+ 1” and “-1” are only used to distinguish the two cases; both are treated as 1 when calculating the NT-Xent loss. In Fig. 3(b), the pairwise-sample {dog2, Dog1} is an example of a soft exchange, and the pairwise-sample {dog1, Pig2} is an example of a hard exchange.

To sum up, both soft and hard exchanges are designed to hinder the further learning of the model. Specifically, take the pairwise-sample {dog2, Dog1} as an example: its pairwise-sample label is “+ 1” and the real labels of the two samples are the same, which allows the model to reduce the intra-class difference. For the pairwise-sample {dog2, Dog2}, however, the pairwise-sample label is “0” although the real labels are the same, which prevents the model from learning their similar characteristics. Therefore, the soft exchange correctly increases the similarity of the pairwise-sample {dog2, Dog1} but incorrectly decreases the similarity of the pairwise-samples {dog2, Dog2} and {dog2, dog1}. Similarly, taking the pairwise-sample {dog1, Pig2} as an example, the hard exchange correctly decreases the similarity of the pairwise-sample {pig2, cat1} but incorrectly increases the similarity of the pairwise-samples {dog1, Pig2}, {cat1, Dog2} and {pig1, Cat2}. Both soft and hard exchanges correctly decrease the similarity between two samples whose real labels differ, but the hard exchange hinders the training of the model more strongly than the soft exchange.
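Continuing the toy example above, the following sketch swaps two positions while keeping the label matrix fixed and reports whether each non-zero pair has become a soft or hard exchange (illustrative names only):

```python
def real_class(name):
    return name.lower().rstrip("0123456789")                 # e.g. "Dog1" -> "dog"

def exchange(samples, pair_labels, i, j):
    samples = list(samples)
    samples[i], samples[j] = samples[j], samples[i]          # swap positions, labels stay fixed
    kinds = []
    for a in range(len(samples)):
        for b in range(a + 1, len(samples)):
            if pair_labels[a, b] == 1:                       # a non-zero pairwise-sample label
                same = real_class(samples[a]) == real_class(samples[b])
                kinds.append((samples[a], samples[b], "soft" if same else "hard"))
    return samples, kinds

# e.g. swapping dog2 (index 1) with Pig2 (index 11) turns {dog2, Dog2} into the
# hard pair {Pig2, Dog2} and {pig2, Pig2} into the hard pair {pig2, dog2}.
_, kinds = exchange(samples, pair_labels, 1, 11)
```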

The effect of the proportion of soft and hard exchanges on model performance is explored through the comparative experiments in Table 2, where “Soft%” denotes the proportion of soft exchanges.

Table 2 The experimental results of SPRM with different exchanging proportion on mini-ImageNet

Considering that the appropriate range of soft exchanging proportion can enhance the generalization ability of the model, the soft exchanging proportion of our method is randomly selected between 52.5% and 86.25%.

3.4.2 NT-Xent loss

The NT-Xent loss is proposed in SimCLR and is calculated as in formula (11). Zi and Zj are the feature vectors of an original image and its copy obtained from the feature extractor, respectively; Zk is the feature vector of the k-th image (k ≠ i). The NT-Xent loss is added to the ML as an auxiliary regularization term.

$$ \mathcal{L}_{NT}=-\frac{1}{2M}\sum\limits_{i=1}^{2M}\sum\limits_{j=1}^{2M}y_{ij}\log \frac{\exp^{sim(Z_{i},Z_{j})/\tau}}{{\sum}_{k=1[k\neq i]}^{2M}\exp^{sim(Z_{i},Z_{k})/\tau}}, \quad i \neq j $$
(11)
$$ sim(Z_{i},Z_{j})=\frac{Z_{i} \cdot Z_{j}}{\lVert Z_{i} \rVert \lVert Z_{j} \rVert} $$
(12)
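A minimal PyTorch sketch of formulas (11)-(12), using the (possibly exchanged) pairwise-sample label matrix y_ij from Section 3.4, is given below; the function name and the temperature default are assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, pair_labels, tau=0.5):
    """z: (2M, D) feature vectors; pair_labels: (2M, 2M) matrix of y_ij in {0, 1}."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                                # sim(Z_i, Z_k) / tau, Eq. (12)
    off_diag = ~torch.eye(z.size(0), dtype=torch.bool)   # exclude k == i in the denominator
    log_prob = sim - torch.logsumexp(sim.masked_fill(~off_diag, float("-inf")), dim=1, keepdim=True)
    return -(pair_labels * log_prob)[off_diag].sum() / z.size(0)   # averaged over the 2M samples
```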

4 Experiments

4.1 Datasets

Mini-ImageNet

The mini-ImageNet [1] dataset consists of 60,000 images selected from ImageNet, covering 100 categories with 600 images per category; the size of each image is 84 × 84. It is usually divided into a base class set (64 categories), a validation set (16 categories) and a new class set (20 categories).

Tiered-ImageNet

The tiered-ImageNet [30] dataset is also drawn from ImageNet. It contains 34 super-categories, each with 10-30 classes, for a total of 608 classes and 779,165 images. The 34 super-categories are divided into a base class set (20 super-categories), a validation set (6 super-categories) and a new class set (8 super-categories).

4.2 Implementation details

The models are trained on the base class set and then evaluated on the new class set. The model is implemented in Python 3.8 with CUDA 11.0, and two NVIDIA GeForce RTX 2080 Ti GPUs are used. Some hyperparameters of the models are listed in Table 3.

Table 3 Some hyperparameters of the models

The evaluation metric in these experiments is the 95% confidence interval (z = 1.96) of the average precision P over M samples, i.e. P ± Rinterval. The confidence interval radius Rinterval is calculated as:

$$ R_{interval}=Z\sqrt{\frac{P(1-P)}{M}} $$
(13)
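For reference, the radius of formula (13) can be computed as in the short sketch below (the numbers in the usage comment are only illustrative):

```python
import math

def interval_radius(p, m, z=1.96):
    """p: average precision in [0, 1]; m: number of evaluated samples."""
    return z * math.sqrt(p * (1.0 - p) / m)

# e.g. interval_radius(0.66, 600) gives the +/- term reported with the mean accuracy.
```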

5 Results and discussion

5.1 The performance evaluation of SPRM

In this paper, different few-shot learning methods are compared on the mini-ImageNet and tiered-ImageNet datasets. The experimental results are shown in Tables 4 and 5, where “Our-self” denotes only the classification model with the self-supervised technique, and “Our-self-ML” denotes the meta-learning model combined with the self-supervised classification model, using the ML without the resistance regularization term.

Table 4 Average accuracy confidence intervals (%) of different meta-learning methods on the mini-ImageNet dataset
Table 5 Average accuracy confidence intervals (%) of different meta-learning methods on the tiered-ImageNet dataset

In Tables 4 and 5, the classification accuracy of SPRM reaches 66.35% on the 5-way 1-shot task of mini-ImageNet and 82.24% on the 5-way 5-shot task, which is better than the other few-shot learning methods. On tiered-ImageNet, SPRM reaches 70.70% on the 5-way 1-shot task and 85.40% on the 5-way 5-shot task. These results show that SPRM has excellent performance and generalization.

5.2 Ablation study

The effects of the three techniques (the rotation self-supervised loss, ML and resistance regularization) are studied in the ablation experiments, with results shown in Table 6. When resistance regularization is used alone, it is treated as a loss function; when neither ML nor resistance regularization is used, the cross-entropy loss is used as the loss function; when none of the three techniques is used, the framework contains no meta-learning model.

Table 6 Ablation experiment results

In Table 6, our proposed method performs best. Applying the rotation self-supervised loss in the classification model greatly improves model performance. Compared with the cross-entropy loss, ML brings a clear improvement on the classification tasks. When resistance regularization is used alone as a loss function, it also increases the prediction accuracy of the few-shot classification model. ML with resistance regularization performs best when the self-supervised technique is not considered. In summary, all three proposed techniques improve the performance of the model.

In addition, the effects of the mining and weighting strategies of the MS loss and ML are also explored in the ablation study. The evaluation results on the mini-ImageNet and tiered-ImageNet datasets are shown in Tables 7 and 8, respectively.

Table 7 The evaluation results of different loss functions on mini-ImageNet
Table 8 The evaluation results of different loss functions on tiered-ImageNet

In Tables 7 and 8, the prediction accuracy of ML is the highest, and both the mining and weighting strategies in ML have a positive effect on model performance. Comparing the corresponding rows of the two tables, the mining strategy in ML works better than that in the MS loss, and the weighting strategy of ML improves model performance in most cases. In addition, the mining strategy in ML greatly reduces the calculation time. In summary, the mining strategy of ML not only enhances model performance but also improves computational efficiency.

5.3 Visualization analysis

The heat maps of different methods, visualized by Grad-CAM [56], are shown in Fig. 4; the feature encoder pays more attention to the warm-toned regions and ignores the cold-toned regions. The methods compared in Fig. 4 are listed in Table 9.

Fig. 4

The heat map visualization of different methods by Grad-CAM

Table 9 The different methods for visualization analysis

In Fig. 4, for “None” and “Self”, when there are many objects in the image, the attention of the feature encoder is easily distracted by interfering objects; when there are few objects and the target is large, the feature encoder quickly notices the target, but the attention area is small and the model cannot capture the complete semantic information. Compared with the feature encoders of “Self” and “None”, SPRM pays more attention to the whole target and captures more complete semantic information.

The feature vectors of several categories in the mini-ImageNet and tiered-ImageNet datasets are visualized by UMAP [57] in Figs. 5 and 6, respectively. In these figures, the feature vectors of the same category are more compact for “Self” than for “None”, which shows that the rotation self-supervised loss improves the feature representation ability of the classification model. The same phenomenon occurs for “ML+Rr”, which further confirms that ML reduces the intra-class difference and expands the inter-class difference. SPRM combines the advantages of the rotation self-supervised loss, ML and resistance regularization to optimize the decision boundary and improve model performance.

Fig. 5

The feature vector visualization of several categories in mini-ImageNet by UMAP(2-dim)

Fig. 6

The feature vector visualization of several categories in tiered-ImageNet by UMAP(2-dim)

6 Conclusion

In this paper, we propose a new few-shot classification model named the self-supervised pairwise-sample resistance model (SPRM), which consists of a classification model and a meta-learning model. The rotation self-supervised loss is used as an auxiliary loss in the classification model to obtain a feature extractor with strong representational ability, which then serves as the initial feature extractor of the meta-learning model; the meta loss (ML) and resistance regularization are proposed and applied in the meta-learning model to improve the model performance. SPRM is evaluated on the 5-way 1-shot and 5-way 5-shot tasks of mini-ImageNet and tiered-ImageNet. The experimental results indicate that our method is superior to other advanced methods on few-shot classification tasks.