
1 Introduction

Few-Shot Learning (FSL) aims to learn novel categories from a small number of images, and usually relies on an auxiliary dataset for training [41,42,43]. Standard image classification predicts the category of an image x, while few-shot image classification predicts which of \(c\times k\) images (c categories, each with k images) belong to the same category as image x. Few-shot image classification usually organizes images into this c-way k-shot form, which is called a task. FSL learns meta-knowledge related to tasks, rather than knowledge related to specific categories, and thereby generalizes this meta-knowledge to novel categories. The auxiliary dataset usually contains a large number of annotated images; for example, the training set of tieredImageNet [38] contains 448,695 images. When FSL methods are applied directly to a new field, the auxiliary dataset must first be labeled, which consumes considerable effort. To alleviate the burden caused by the auxiliary dataset, Unsupervised FSL (UFSL) has attracted attention [2, 6, 16, 17, 20, 21, 25, 36]. Under the supervised FSL setting, the labels of the auxiliary dataset are used to build tasks (as shown in Fig. 1(a)), while under the UFSL setting, the auxiliary dataset has no available labels. Our work explores how to build tasks from an unlabeled auxiliary dataset.

Fig. 1.

The four baselines from the view of sampling. (a) Label-based baseline, which is a supervised baseline. Since the images have category labels, two of the four sampled images belong to the same category. (b) Random-based baseline, in which four images are randomly sampled and the label of the task is randomly determined. (c) CSS-based baseline, in which three images are randomly sampled, and then one of them is selected to obtain another view through data augmentation. (d) Clustering-based baseline, in which all images are first divided into multiple clusters by a clustering algorithm, and then four images are selected with cluster ids as labels.

There are two common unsupervised ways to build tasks from the auxiliary dataset: 1) CSS-based methods (Comparative Self-Supervised, as shown in Fig. 1(c)) use data augmentations to obtain another view of each image to construct image pairs, and then use the image pairs to build tasks [17, 20]; 2) Clustering-based methods (as shown in Fig. 1(d)) divide the images into clusters by a clustering algorithm to obtain pseudo-labels, and then use the pseudo-labels to build tasks [5, 16]. Since CSS-based methods build tasks from different views of the same image, they are concise and effective, but the diversity of the resulting tasks may be poor. Conversely, since Clustering-based methods build tasks from clusters, task diversity may be good, but these methods usually involve multiple steps. For example, CACTUs [16] uses an existing unsupervised model [7] to extract features, and then clusters these features to obtain cluster ids (cluster assignments). This treats feature learning and clustering as two independent steps and requires an additional unsupervised model.

To combine the advantages of both, we propose Learn Features into Clustering Space (LF2CS), a single-stage clustering method. By fixing the clustering center matrix to the identity matrix, LF2CS sets a separable clustering space, and then uses a learnable model to learn features into that space. In this way, LF2CS does not require any other images when determining the cluster id of each image. Based on LF2CS, we put forward an image sampling and task building method, and thus propose a clustering-based unsupervised few-shot image classification method, as shown in Fig. 2. First, each image \(x_i\) is fed into a randomly initialized network to obtain the feature \(z_i\), and LF2CS is used to obtain the initial cluster id \(\hat{y}_i\), so that the cluster ids \(\hat{Y}\) of all images can be initialized. Then, for each image \(x_i\), \(c \times k\) images are sampled based on the cluster ids \(\hat{Y}\), of which k images have the same cluster id as \(x_i\), so a task can be built from these \(c \times k + 1\) images. Finally, encoders extract features of the images in the task, and the new cluster id \(\hat{y}_i\) of the image \(x_i\) is calculated by LF2CS and written back to the cluster ids \(\hat{Y}\).

Our method first generates cluster ids of images, then samples images to build tasks, and thus realizes joint learning of feature extraction, clustering, and FSL. After a task is built, any task-based few-shot learning method [41,42,43] can be used to optimize the network, such as MatchNet [43]. To assess the performance of our method, as shown in Fig. 1, we summarize four baselines according to how tasks are built from the perspective of image sampling. Based on the Conv-4 [43] and ResNet-12 backbones, we conduct experiments on the Omniglot [23], miniImageNet [43], tieredImageNet [38], and CIFARFS [3] datasets. In summary, our main contributions are as follows:

  1. By fixing the clustering center matrix to the identity matrix, we first set a separable clustering space and then use a learnable model to learn features into the clustering space;

  2. A clustering-based unsupervised few-shot image classification method is proposed through image sampling and task building, which jointly learns feature extraction, clustering, and few-shot image classification;

  3. From the view of sampling, we implement four baselines according to how tasks are built: Label-based, Random-based, CSS-based, and Clustering-based;

  4. We conduct experiments on a series of benchmarks based on Conv-4 and ResNet-12, and the experiments show that our method has a strong ability to generalize to novel categories and achieves state-of-the-art results.

2 Related Work

2.1 Few-Shot Learning

FSL aims to learn novel categories from a few labeled images, and it usually relies on a labeled auxiliary dataset [12, 26, 30, 32]. There are many kinds of methods: Metric-based, Finetune-based, Parameterize-based, etc. Metric-based methods learn the ability to measure similarity [28, 41,42,43]. Finetune-based methods finetune a base-learner on the few samples of a new task so that the base-learner converges within few parameter update steps [11, 37, 40]. Parameterize-based methods parameterize the base-learner or its sub-components, thereby generalizing the meta-learner to new tasks [4, 8, 44].

2.2 Unsupervised Feature Learning

Unsupervised feature learning includes many methods; we mainly introduce clustering and self-supervised learning. Clustering aims to divide data into different clusters based on its inherent structure, e.g., Gaussian mixture models [1], K-means [18], spectral clustering [48], and hierarchical clustering [19]. Recently, deep clustering methods [7, 27, 45, 47] have shown superior performance. Self-supervised learning uses the data itself to construct supervised signals that replace labels, such as colorization [46], solving jigsaw puzzles [34], and rotation prediction [13]. Recently, comparative self-supervised learning methods [9, 10, 14, 15, 31], which learn from multiple views of the images, have made great progress. These methods obtain multiple different views to construct positive and negative image pairs, so that positive pairs are close in the feature space and negative pairs are far apart [15].

2.3 Unsupervised Few-Shot Learning

Different from FSL, UFSL usually relies on an unlabeled auxiliary dataset. There are two common kinds of methods: CSS-based and Clustering-based. CSS-based methods use data augmentations to obtain multiple views of an image to construct image pairs, which are then used to build tasks. Generally, the data augmentations are common image transforms [2, 6, 17, 20, 25, 36], such as random resized crop and color jitter, and/or generative models [21, 36], such as autoencoders and generative adversarial networks. Clustering-based methods [5, 16] divide the images in the auxiliary dataset into multiple clusters to obtain pseudo-labels, and then use the pseudo-labels to build tasks.

Fig. 2.

The overall pipeline of our approach, illustrated with a 3-way 1-shot task. Cluster IDs \(\hat{Y}\) stores the cluster ids of all images in \(D_{unlabel}\), and \(|\hat{Y}|=N_{aux}\). First, for the image \(x_i\), its cluster id \(\hat{y_i}\) is obtained from \(\hat{Y}\) (Image Traversing). Then, according to \(\hat{y_i}\) and \(\hat{Y}\), three images are sampled from \(D_{unlabel}\): one image has cluster id \(\hat{y_i}\), and the other two do not (Image Sampling), so the 3-way 1-shot task can be built (Task Building). Next, the features of the task are obtained by encoders (Task Encoding), and the encoders are optimized by reducing the distance to features with the same cluster id as the image \(x_i\) (Pull) and increasing the distance to features with different cluster ids (Push) (Network Optimizing). Finally, the cluster id \(\hat{y_i}\) is updated (Cluster ID Updating).

3 Our Approach

3.1 Unsupervised Few-Shot Learning

In FSL, an auxiliary dataset \(D_{aux}=\{(x_i, y_i)\}_{i=1}^{N_{aux}}\) (\(N_{aux}\) is the number of images) is used to train a learnable model, which is then applied to a target dataset \(D_{target}=\{\{(x_i, y_i)\}_{i=1}^{N_{label}}, \{x_j\}_{j=1}^{N_{unlabel}}\}\) (where \(N_{label} \ll N_{unlabel}\), and \(N_{label}\) and \(N_{unlabel}\) are the numbers of labeled and unlabeled images), so that the target dataset with few labeled images can be classified. \(D_{aux}\) and \(D_{target}\) are also known as the base classes and the novel classes, respectively. \(D_{aux}\) contains a large number of labeled images. UFSL aims to train a model using an unlabeled auxiliary dataset \(D_{unlabel}=\{x_i\}_{i=1}^{N_{aux}}\), which contains a large number of unlabeled images, and then likewise apply the model to \(D_{target}\).

Task-based FSL organizes images into tasks in the form of c-way k-shot. Task \(T_i\) is composed of the support set \(T_i^{supp}\), the query set \(T_i^{quer}\) and the label \(y_i^{task}\):

$$\begin{aligned} T_i=\{T_i^{supp}, T_i^{quer}, y_i^{task}\}=\big \{\{(x_j, y_j)\}_{j=1}^{c\times k}, \{x_i\}, y_i^{task}\big \}, \end{aligned}$$
(1)

where \(y_i^{task}\in \{0,1\}^{c\times k}\), and \(T_i^{supp}\) contains c classes, each with k labeled images. Note that in our work, \(T_i^{quer}\) contains only one image \(x_i\). Under the supervised FSL setting, since image labels are available on the base classes, the task \(T_i\) is built in a supervised way. Under the UFSL setting, the task \(T_i\) needs to be built in an unsupervised way; most UFSL methods differ in how they build tasks on the base classes. There are two common unsupervised ways to build tasks: Clustering-based and CSS-based. Clustering-based methods divide the images of \(D_{unlabel}\) into multiple clusters, and then use the cluster ids as pseudo-labels to build tasks. CSS-based methods build tasks based on different views of the same image.
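To make the task structure of Eq. 1 concrete, here is a minimal Python sketch (the names Task and build_label are ours for illustration, not from the released code):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Task:
    support: List[Any]   # the c*k support images
    query: Any           # the single query image x_i
    y_task: List[int]    # 0/1 vector of length c*k (Eq. 1)

def build_label(support_ids: List[int], query_id: int) -> List[int]:
    # y_task marks which support images share the query's (pseudo-)label.
    return [1 if y == query_id else 0 for y in support_ids]

# Example: a 3-way 1-shot task whose query is assigned label 7.
print(build_label([7, 2, 5], 7))  # -> [1, 0, 0]
```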

3.2 Learn Features into Clustering Space

In DeepCluster [7], Caron et al. alternate between clustering the image features to generate pseudo-labels (Eq. 2) and updating the parameters of the learnable model \(f_\theta \) by predicting those pseudo-labels (Eq. 3):

$$\begin{aligned} \underset{C\in \mathbb {R}^{d\times K}}{\min } \frac{1}{N} \sum _{i=1}^{N} \underset{\hat{y}_i \in \{0, 1\}^K}{\min } ||f_\theta (x_i) - C\hat{y}_i||^2_2, \end{aligned}$$
(2)
$$\begin{aligned} \underset{\theta }{\min }\ \frac{1}{N} \sum _{i=1}^{N} \mathcal {L}(f_\theta (x_i), \hat{y}_i) , \end{aligned}$$
(3)

where \(\theta \) is the parameter of the network \(f_\theta (\cdot )\), C is the cluster center matrix, K is the number of clusters, d is the output feature dimension of \(f_\theta (\cdot )\), and \(\hat{y}_i\) is the cluster id of the image \(x_i\) with \(\hat{y}_i^T1_K=1\). This method has some limitations: all features need to be extracted before clustering, feature learning and clustering are two independent steps, and the clustering result usually depends on the initialization of the cluster centers.

Different from [7], we preset a separable d-dimensional clustering space by setting \(K=d\) and fixing the matrix C to the identity matrix \(E\in \mathbb {R}^{d\times d}\). Since all cluster centers lie on the coordinate axes of this space and are therefore mutually orthogonal, the clustering space has strong separability. Therefore, Eq. 2 simplifies to:

$$\begin{aligned} \frac{1}{N} \sum _{i=1}^{N}\underset{\hat{y}_i \in \{0, 1\}^d}{\min } ||f_\theta (x_i) - E\hat{y}_i||^2_2. \end{aligned}$$
(4)

To make Eq. 4 feasible, the network \(f_\theta (\cdot )\) needs to learn the features of all images into the d-dimensional clustering space. We call this Learn Features into Clustering Space (LF2CS). Specifically, given an image \(x_i\), the feature \(z_i=f_\theta (x_i)\) and the cluster to which \(x_i\) belongs can be quickly calculated by:

$$\begin{aligned} \hat{y}_i = \underset{m\in [1,d]}{\arg \min }||z_i-E_m||^2_2 = \underset{m\in [1,d]}{\arg \max }\ \ z_i^m, \end{aligned}$$
(5)

where \(z_i\in \mathbb {R}^{d}\), \(E_m\in \mathbb {R}^{d}\) is the cluster center of cluster m, \(z_i^m\) is the m-th component of \(z_i\), and \(\hat{y}_i\) is the cluster id of \(x_i\). By treating \(\hat{y}_i\) as the pseudo-label of \(x_i\), the parameters of \(f_\theta (\cdot )\) can be updated by Eq. 3.

From Eq. 5, it can be seen that, given a feature \(z_i\), the cluster center closest to \(z_i\) can be found immediately. The advantage is that the cluster id \(\hat{y}_i\) of the image \(x_i\) can be calculated in a simple way, without needing the features of any other images. Our LF2CS eliminates the process of solving for cluster centers in Eq. 2 with iterative algorithms such as K-means. Therefore, LF2CS is a single-stage clustering algorithm, instead of alternating between Eq. 3 and Eq. 2.
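As a minimal PyTorch sketch (ours, not the released code), Eq. 5 reduces cluster assignment to an argmax, since \(||z_i-E_m||^2_2 = ||z_i||^2_2 - 2z_i^m + 1\) and only the \(-2z_i^m\) term depends on m:

```python
import torch

def lf2cs_cluster_id(z: torch.Tensor) -> torch.Tensor:
    # z: (batch, d) features from the LF2CS head.
    # With C fixed to the identity E, the nearest cluster center E_m is the
    # index of the largest coordinate of z (Eq. 5). Cluster ids are 0-indexed
    # here, while the paper indexes them from 1 to d.
    return z.argmax(dim=1)
```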

Avoiding Trivial Solutions. Most clustering-based methods have trivial solutions [7, 27]. To avoid such invalid solutions, our LF2CS forces all images to be scattered across the clustering space by limiting the maximum number of images in each cluster. When the number of images in the m-th cluster exceeds \(n_{max} = N_{aux} / d\), we no longer assign images to the m-th cluster, but look for the next-closest cluster that satisfies the constraint. The per-cluster maximum \(n_{max}\) means that all clusters are almost the same size. In this way, our method avoids both over-clustering (many clusters are relatively small) and under-clustering (some clusters are relatively large).
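A hedged sketch of this capacity constraint follows (our reading of the rule above; the counter bookkeeping, e.g. decrementing the old cluster's count when an id is updated, is our assumption and omitted):

```python
import torch

def assign_with_capacity(z: torch.Tensor, counts: torch.Tensor, n_max: int) -> int:
    # z: (d,) feature of one image; counts: (d,) current cluster sizes.
    # Try clusters in order of decreasing score, i.e. increasing distance to
    # the center by Eq. 5, and take the first one that still has room.
    for m in torch.argsort(z, descending=True).tolist():
        if counts[m] < n_max:
            counts[m] += 1
            return m
    raise RuntimeError("all clusters are full; check n_max = N_aux / d")
```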

3.3 Image Sampling and Task Building

Given an unlabeled auxiliary dataset \(D_{unlabel}=\{x_i\}_{i=1}^{N_{aux}}\), our goal is to build tasks from unlabeled images. To record the cluster to which each image belongs, \(\hat{Y}=\{\hat{y}_i\}_{i=1}^{N_{aux}}\) stores the cluster ids of all images, that is, \(\hat{y}_i\) is the cluster id of the image \(x_i\), where \(\hat{y}_i\in [1, d]\). To facilitate joint training, we use a mini-batch of tasks as the input of each optimization step, instead of episodes [43].

Fig. 3.

Task Encoding and Network Optimizing. Both the FSL Head and the LF2CS Head are fully connected layers. We collectively call the Encoder, FSL Head, and LF2CS Head the Encoders. We treat the optimization as a multi-task learning problem: LF2CS and few-shot image classification. Note that \(z_i^{quer}\in \mathbb {R}^q\) and \(z_i\in \mathbb {R}^d\).

When evaluating, like other methods [41,42,43], we use episodes as input. To build a c-way k-shot task, the image \(x_i\) is used as the query set \(T_i^{quer}\), and \(c \times k\) images are selected as the support set \(T_i^{supp}\) through the cluster id \(\hat{y}_i\). Specifically, we get c clusters by taking \(\hat{y}_i\) and randomly selecting \(c-1\) other clusters:

$$\begin{aligned} \hat{Y}_c=\{\hat{y}_i\} \cup \{c_j\}_{j=1}^{c-1}, \ \ \ s.t. \ \ c_j \ne \hat{y}_i, \end{aligned}$$
(6)

where \(c_j\) is a cluster id and \(\hat{Y}_c\) is the set of c cluster ids. According to \(\hat{Y}\), by randomly selecting k images from \(D_{unlabel}\) for each cluster id in \(\hat{Y}_c\), we get \(c \times k\) images, which are treated as the support set \(T_i^{supp}\). Since images with the same cluster id are treated as the same category, the task \(T_i=\{T_i^{supp}, T_i^{quer}, y_i^{task}\}\) is built, where \(y_i^{task}\) is obtained by checking whether the cluster ids of the images in the query set \(T_i^{quer}\) and the support set \(T_i^{supp}\) are the same. The new cluster id \(\hat{y}_i\) of the image \(x_i\) can be calculated by Eq. 5, and \(\hat{Y}\) is then updated by replacing the old cluster id with the new one.
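Putting Eq. 6 and the sampling step together, a minimal sketch (identifier names such as build_task and images_by_cluster are ours, not from the paper's code; whether the query itself may be re-drawn into the support set is not specified and is left as-is here):

```python
import random

def build_task(i, Y_hat, images_by_cluster, c, k):
    # Y_hat[i] is the cluster id of image i; images_by_cluster[m] lists the
    # image indices currently assigned to cluster m.
    y_i = Y_hat[i]
    others = [m for m in images_by_cluster if m != y_i]
    clusters = [y_i] + random.sample(others, c - 1)            # Eq. 6
    support = [j for m in clusters
                 for j in random.sample(images_by_cluster[m], k)]
    # y_i^task: 1 for support images sharing the query's cluster id, else 0.
    y_task = [1 if Y_hat[j] == y_i else 0 for j in support]
    return support, i, y_task   # support ids, query id, task label
```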

3.4 Task Encoding and Network Optimizing

As shown in Fig. 3, we use the Encoder to embed the images of the task \(T_i\). In our method, the Encoder is Conv-4 or ResNet-12. The FSL Head learns features for few-shot image classification, and the LF2CS Head learns features into the clustering space. Each head is a fully connected layer used to embed features for its respective objective. The output dimension of the FSL Head is q and that of the LF2CS Head is d. The feature \(z_i\in \mathbb {R}^d\) is extracted by feeding the image \(x_i\) in the query set \(T_i^{quer}\) through the Encoder and the LF2CS Head, and the feature \(z_i^{quer}\in \mathbb {R}^q\) is extracted by feeding the same image through the Encoder and the FSL Head. In addition, the features \(Z_i^{supp}=\{z_{i,j}^{supp}\}_{j=1}^{c\times k}\) are extracted by feeding the images in the support set \(T_i^{supp}\) through the Encoder and the FSL Head, where \(z_{i,j}^{supp}\in \mathbb {R}^q\) is the feature of the j-th image in the support set \(T_i^{supp}\).
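A minimal PyTorch sketch of this architecture (the backbone and its output dimension h are abstracted; the defaults q = d = 1024 follow Sect. 4.1, the rest is our assumption):

```python
import torch.nn as nn

class Encoders(nn.Module):
    def __init__(self, backbone: nn.Module, h: int, q: int = 1024, d: int = 1024):
        super().__init__()
        self.backbone = backbone            # Conv-4 or ResNet-12
        self.fsl_head = nn.Linear(h, q)     # features for few-shot matching
        self.lf2cs_head = nn.Linear(h, d)   # features in the clustering space

    def forward(self, x):
        f = self.backbone(x)
        return self.fsl_head(f), self.lf2cs_head(f)
```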

Algorithm 1. The overall pipeline of our method.

After the task is built, any task-based few-shot learning method [41,42,43] can be used to optimize the network. We optimize the network based on MatchNet [43] and use the softmax over cosine to measure similarity between features:

$$\begin{aligned} \begin{aligned} p_i^{task} = \{p_{i,j}^{task}\}_{j=1}^{c\times k} = \Big \{\frac{exp(cos(z_{i,j}^{supp}, z_i^{quer}))}{\sum \limits _{z_{i,l}^{supp} \in Z_i^{supp}} exp(cos(z_{i,l}^{supp}, z_i^{quer}))}\Big \}_{j=1}^{c\times k}, \end{aligned} \end{aligned}$$
(7)

where \(cos(\cdot , \cdot )\) is the cosine similarity. FSL can be optimized by MSE (Mean Squared Error) loss between \(p_i^{task}\) and the task label \(y_i^{task}\), and LF2CS can be optimized by CE (Cross Entropy) loss between the feature \(z_i\) and the cluster id \(\hat{y}_i\). The optimization of our method can be treated as a multi-task learning problem. Therefore, we jointly learn LF2CS and FSL, and the total loss is:

$$\begin{aligned} \begin{aligned} \mathcal {L} = MSE(p_i^{task}, y_i^{task}) + CE(z_i, \hat{y}_i), \end{aligned} \end{aligned}$$
(8)

where \(y_i^{task}\in \{0, 1\}^{c\times k}\), \(p_i^{task} \in \mathbb {R}^{c\times k}\), \(z_i\in \mathbb {R}^d\), and \(\hat{y}_i \in [1, d]\). In the CE loss, \(\hat{y}_i\) is converted to a one-hot vector. Taking Eq. 8 as our total loss, our method increases the similarity of features with the same cluster id and reduces the similarity of features with different cluster ids.
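For concreteness, a hedged PyTorch sketch of Eqs. 7 and 8 (shapes and the 0-indexed cluster id tensor are our assumptions):

```python
import torch
import torch.nn.functional as F

def total_loss(z_supp, z_quer, y_task, z, y_hat):
    # z_supp: (c*k, q) support features; z_quer: (q,) query feature;
    # y_task: (c*k,) 0/1 task label; z: (d,) LF2CS feature;
    # y_hat: scalar long tensor holding the cluster id of the query.
    sims = F.cosine_similarity(z_supp, z_quer.unsqueeze(0), dim=1)
    p_task = F.softmax(sims, dim=0)                               # Eq. 7
    fsl_loss = F.mse_loss(p_task, y_task.float())                 # MSE term
    lf2cs_loss = F.cross_entropy(z.unsqueeze(0), y_hat.view(1))   # CE term
    return fsl_loss + lf2cs_loss                                  # Eq. 8
```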

For easy understanding, we show the overall pipeline of our method in Algorithm 1. Please refer to https://github.com/xidianai/LF2CS for the code.

Table 1. Few-shot image classification accuracies of 5-way and 20-way tasks on Omniglot. All accuracies are averaged over 1,000 test episodes and reported with 95% confidence intervals. The backbone of all methods is Conv-4. Bold indicates the best values.

4 Experimental Results

4.1 Setup

Our method includes an Encoder and two linear heads (FSL Head and LF2CS Head). For the Encoder, we use two widely used networks (Conv-4 [43] and ResNet-12). Each head is a fully connected layer. The output dimension q of the FSL Head is 1024, and the output dimension d of the LF2CS Head is 1024. All our models are trained from scratch without any pre-trained weights. We use the SGD optimizer with a learning rate of 0.01, weight decay of 5e-4, and batch size of 64. We use the learning rate strategy shown in Fig. 5 and train the model for 1,500 epochs. All our experiments are implemented in PyTorch [35]. We use the evaluation strategy of [42]. Following [42], each class uses 15 query samples in each episode: for c-way k-shot, each episode contains \(15 \times c\) query images and \(c \times k\) support images. All accuracies are averaged over 1,000 test episodes and reported with 95% confidence intervals. We also use KNN to evaluate our LF2CS, setting the number of nearest neighbors to 100 and reporting top-1 and top-5 KNN accuracies.
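A minimal sketch of this training configuration (the stand-in model and the momentum value are our assumptions; the cosine decay follows Fig. 5):

```python
import torch
import torch.nn as nn

model = nn.Linear(3 * 84 * 84, 1024)  # stand-in for the Encoders of Sect. 3.4
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)  # momentum assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1500)
for epoch in range(1500):
    # ... one epoch of task building and optimization (Sects. 3.3-3.4) ...
    scheduler.step()
```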

Table 2. 5-way 1-shot and 5-way 5-shot few-shot image classification accuracies on the miniImageNet dataset. All accuracies are averaged over 1,000 test episodes and reported with 95% confidence intervals. Bold indicates the best values.

4.2 Unsupervised Few-Shot Image Classification Results

Results on Omniglot. The Omniglot [23] dataset consists of 1,623 characters from 50 different alphabets, and each character has 20 samples drawn by different people. Following [41, 43], we resize the samples to 28\(\times \)28. We use 1,028 characters for training, 172 for validation, and the remaining 423 for testing. On Omniglot, we only use Conv-4 as the backbone of the Encoder. Table 1 shows the few-shot image classification accuracies on Omniglot. Among the unsupervised baselines, building tasks by randomly selecting images (Random-based) yields the worst accuracies, while our LF2CS yields the best accuracies in all cases. Compared with other unsupervised methods, ours achieves 97.31% for 5-way 1-shot with Conv-4 as the backbone, which also reaches the level of supervised methods.

Results on MiniImageNet. The miniImageNet [43] dataset consists of 100 categories sampled from ImageNet [39]; all images are 84\(\times \)84 in size, and each category has 600 images. Following [43], we divide the categories into 64 (training), 16 (validation), and 20 (test) categories. Table 2 shows the few-shot image classification accuracies on miniImageNet. We compare with 5 supervised methods and 8 unsupervised methods. Among the unsupervised baselines, building tasks by randomly selecting images yields the worst accuracies, while our methods yield the best. In particular, our ResNet-12 reaches 53.14% for 5-way 1-shot, outperforming all other unsupervised methods. Compared with other unsupervised methods, ours also have a smaller gap to the supervised few-shot image classification methods.

Table 3. 5-way 1-shot and 5-way 5-shot few-shot image classification accuracies on the tieredImageNet dataset. All accuracies are averaged over 1,000 test episodes and reported with 95% confidence intervals. Bold indicates the best values.

Results on TieredImageNet. The tieredImageNet [38] dataset consists of 608 classes with a total of 779,165 images of size 84\(\times \)84. These classes are selected from 34 higher-level nodes in the ImageNet hierarchy. Following [38], 351 classes from 20 high-level nodes are used for training, 97 classes from 6 nodes for validation, and 160 classes from 8 nodes for testing. Table 3 shows the few-shot image classification accuracies on tieredImageNet. Compared with other unsupervised methods, ours have the best accuracies. In particular, our ResNet-12 reaches 53.16% for 5-way 1-shot and 66.59% for 5-way 5-shot. Compared with other unsupervised baselines, our methods also have a smaller gap to the supervised few-shot image classification methods.

Results on CIFARFS. The CIFARFS [3] dataset consists of 100 classes sampled from CIFAR100 [22]. All images are 32\(\times \)32 in size. Following [3], we divide the classes into 64, 16, and 20 classes for training, validation, and testing, respectively. We compare with four supervised methods and one unsupervised method. Table 4 shows the few-shot accuracies on CIFARFS. Compared with the other unsupervised method, ours have the best accuracies. For 5-way 1-shot, our method with the Conv-4 backbone achieves an accuracy of 51.52%.

Table 4. Few-shot image classification accuracies on the CIFARFS dataset. All accuracies are averaged over 1,000 test episodes and reported with 95% confidence intervals. Bold indicates the best values.

4.3 Ablation Experiments and Visualization

The Performance of LF2CS. The number of clusters equals the dimension d, which usually affects the clustering results, so we conduct experiments to observe the effect of d. The Encoder also plays an important role in the clustering results, and we evaluate our LF2CS on two architectures: Conv-4 [43] and ResNet-12. Figure 4 shows the top-1 KNN accuracies and the training time per epoch on miniImageNet. Table 5 shows the accuracies of LF2CS with different Encoders on miniImageNet. We report top-1 and top-5 KNN accuracies, as well as 5-way 1-shot and 5-way 5-shot few-shot accuracies. Note that the categories in the training, validation, and test sets do not intersect. From Fig. 4, we find that as the dimension d increases, the KNN accuracy and the training time both increase. To balance time and accuracy, we set the dimension d to 1024. From the few-shot accuracies in Table 5, it can be seen that LF2CS generalizes well to novel categories, e.g., 48.41% on the training set vs. 48.32% on the test set with Conv-4 as the backbone, and 53.59% vs. 53.14% with ResNet-12.

Fig. 4.

Left: the top-1 KNN accuracies on the miniImageNet test set; Right: the training time per epoch for different dimensions d of the LF2CS Head on miniImageNet.

Table 5. Top-1 and top-5 KNN accuracies, and 5-way 1-shot and 5-way 5-shot few-shot accuracies of the LF2CS Head with different Encoders on the training, validation, and test sets of miniImageNet. All few-shot accuracies are averaged over 1,000 test episodes and reported with 95% confidence intervals. The dimension d is 1024.

The Change of the Cluster Assignments. We now show how the cluster ids gradually change over time; during training, we tracked these changes. The learning rate schedule is shown on the left of Fig. 5. There are two curves on the right of Fig. 5: the first is the change ratio of the cluster ids, and the second is the ratio of cluster ids that do not meet the limit on cluster size. At the beginning of training, both ratios are almost 98% or more, i.e., the assignments are almost completely random. As training progresses, the ratio of cluster ids that do not meet the size limit drops sharply; that is, the number of images violating the limit decreases rapidly. The change ratio of the cluster assignments also decreases, and by the end of training only about 20% of the cluster assignments change.

Fig. 5.

The gradual change of the cluster ids over time. Left: the learning rate decay using a cosine strategy; Right: the change ratio of the cluster ids (assignments).

Fig. 6.

Visualization of images in some clusters on the miniImageNet dataset. Each row shows images from the same cluster. The images on the left, in the middle, and on the right come from the training, validation, and test sets, respectively.

The Visualization of Images in Some Clusters. Since tasks are built by selecting images from clusters, the clustering results affect the quality of the tasks: the more images of the same category fall into the same cluster, the higher the task quality. Therefore, visualizing the images within a cluster helps to understand our method. Figure 6 shows images from several clusters, divided into three parts: the left, middle, and right show images from the training, validation, and test sets of miniImageNet, respectively. We can see that each cluster captures certain visual semantics; for example, the first row shares a similar texture, and the last row shares a similar shape.

5 Conclusion

In this work, we propose a novel single-stage clustering method, Learn Features into Clustering Space (LF2CS), which fixes the cluster center matrix to the identity matrix, thereby setting a strongly separable clustering space, and then learns features into that space. Based on this, we put forward an image sampling and task building method, and with it we propose an unsupervised few-shot image classification method. Experimental results and visualizations show that our LF2CS has a strong ability to generalize to novel categories. Based on Conv-4 and ResNet-12, we conduct experiments on four FSL datasets, and our method achieves state-of-the-art results.