Keywords

1 Introduction

Current state of the art on semantic segmentation, object detection, image classification and most other learning based tasks rely on deep neural networks. Deep neural networks are high-capacity powerful models which require large amounts of annotated data and millions of parameters. Large amounts of supervised training data per concept is required for deep learning algorithms to achieve great performance, and the learning process could take multiple days to weeks using specialized expensive hardware like GPUs. Adapting a deep learning model to recognize a new class includes 2 major steps: 1) Collection of the large scale dataset, 2) Fine-tune the existing model to recognize the new class. Both of these steps are time, memory and resource intensive. If new classes are to be recognized, then typically thousands of training examples are required for training and fine-tuning the model. Sometimes, unfortunately this fine-tuning might result in the model forgetting the initial classes on which it was trained. One of the most important objectives of few-shot learning based algorithms is to adapt the existing models at real-time, to recognize novel classes which were unseen during the initial training phase. The major challenge is that, these novel classes have less than 5 visual examples each for training the model. The performance of state-of-the-art classification models deteriorates when the number of images per new class reduces to less than 10, whereas humans are capable of learning new visual concepts reliably and effortlessly with very few examples. This has inspired scientists to adapt deep learning algorithms to work on few-shot domain, where the main goal is to learn novel concepts using limited number of examples. The main advantage of solving the few-shot problem is that it relies only on very few examples and eliminates or restricts the need to formulate large amount of labeled training data, which is usually a cumbersome and costly process.

In this paper, we propose a novel, computationally efficient, yet effective framework called Param-Net, which is a fusion of the best practices of parameter generation, gradient descent and data augmentation methods. Two publicly available dataset: MiniImageNet and Pascal-VOC have been used in this paper for evaluation and bench-marking. MiniImageNet dataset is the most popular dataset for benchmarking few-shot learning algorithms. Pascal-VOC is the most popular dataset for object detection tasks. Using Pascal-VOC dataset, an attempt can be made to scale the few-shot classification task to few-shot detection task. The dataset is split into: (1) classes that contain adequate number of samples denoted as \(\text {C}_\text {Base}\), this is considered as large scale dataset and (2) classes that contain 1–5 images each which are denoted as \(\text {C}_\text {Few}\), these are the few shot classes. The goal of our work is to devise a visual learning based few-shot system that achieves good performance on novel few shot classes, \(\text {C}_\text {Few}\) (with less than 5 samples each) and does not degrade the performance on the large scale base classes \(\text {C}_\text {Base}\) and has a fast inference with little or zero training for adding new classes to the existing model.

In neural networks, parameters of a particular class and its activations share a strong relationship and this property is used by Param-Net to predict weights for novel classes. For fair comparison with state-of-the-art approaches, we use a Res-Net based model for extracting the most relevant features/activations of the input images. The activations which are determined prior to the final fully connected layer in the base model, is used as input to the Param-Net, which is a multi-layer transformation function. Param-Net is used to convert activations of a particular class to its corresponding weights. Res-Net as well as Param-Net model is pre-trained on \(\text {C}_\text {Base}\). Param-Net can adapt to novel few shot classes with just a forward pass and zero training as the network is category agnostic.

In this paper, the models are initially tested on MiniImageNet dataset and compared with state-of-the-art few shot algorithms. The proposed Param-Net model outperforms the state-of-the-art methods on few-shot classes, while also not compromising on the efficiency on base classes (\(\text {C}_\text {Base}\)). On MiniImageNet dataset, Param-Net achieves an accuracy of 62.69% for 5-way 1-shot learning and 86.14% for 5 shot learning. The inference time of the model to add novel few shot classes is also very low (because of zero-training time), it only takes around 23 ms for adding a novel class on a Tesla K80 GPU. Inspite of this state-of-the-art performance it has been observed that the accuracy of Param-Net on closely similar classes is slightly compromised. To account for this, we suggest fine-tuning the Param-Net with base classes as well as few-shot classes. This significantly improves the accuracy at the cost of slightly slower convergence. Fine-tuned Param-Net achieves an accuracy of: a) 94.23% for 5-way and 5 shot learning on MiniImageNet dataset and b) 87.26% on Pascal-VOC dataset for 20 way 5-shot settings. The fine-tunable version of Param-Net takes around 138 s of training on Tesla K80 GPU for the addition of a set of novel classes for MiniImageNet dataset. Hence with little fine-tuning, Param-Net can be used to adapt a deep learning model to add novel classes with just a single training image.

In “Sect. 2”, we describe the techniques widely used and documented in literature to achieve current state-of-the-art results, most of the techniques described in this section are used to benchmark the Param-Net framework. Then in “Sect. 3” we elaborate the proposed Param-Net approach, in “Sect. 4”, we discuss the experimental setup, results and benchmarks and in “Sect.  5”, we discuss the conclusion and future scope.

2 Related Work

The ideas behind Param-Net has broad prior support in literature, but mostly appear in disjoint or in incompatible problem setting. Research literature on few-shot learning techniques exhibits great diversity. We adapt these concepts into a unified framework for recognition in real-world scenarios. In this section, we focus on methods using the supervised meta-learning paradigm [12], [51], [9] most relevant to ours and compared to in the experiments. We can divide these methods into 5 categories:

Data Generation and Data Augmentation Methods: In [9], a sampling method was proposed that extracts varying sequences of patches by decorrelating an image based on maximum entropy reinforcement learning. This is a form of “learned” data augmentation. In [19], GAN based approach was proposed to address the few shot learning, where GAN allows the few shot classifiers to learn sharper decision boundary, which could generalize better. In [30], a modified auto-encoder was proposed to synthesize new samples for an unseen category just by seeing few examples from it.

Gradient Descent Based Methods: Meta-LSTM [12], treats the model parameters as its hidden states and uses LSTM as a model updater, that not only learns the initial model, but also the update rule. In contrast to Meta-LSTM, MAML [14] only learns an initial update. In MAML, the updating rule is fixed to a stochastic gradient descent (SGD). In [26], a variant of MAML was proposed where only first order gradients were used. In [21], MetaSGD was proposed as an extension of MAML, which learns weight initialization as well as learner’s update step size. In [1, 2], a modification to MAML was proposed to prevent overfitting. In [2] entropy based and inequality minimization measures were introduced and in [1], Meta-Transfer Learning approach was introduced where it leverages transfer learning and benefits from referencing neuron knowledge in pre-trained deep nets. A framework was proposed in [18] to unify meta learning and gradient based hyperparameter optimization. In [20], Neuron-level adaptation was suggested, to reduce the complexity of weight modification when the connections are dense.

Metric Learning Methods: Siamese Neural Network which uses a two stream convolutional neural network was originally utilized by Koch et al. [30], to learn powerful discriminative representations and then generalized them to unseen classes. Vinyals et al. [13] proposed Matching-Nets and introduced the episodic training mechanism into few-shot learning. Prototypical Network was proposed in [15], which is built upon the matching network [13], uses cosine similarity and 4-layer network. Here, query image is compared with support images using class centroids to eliminate outliers in support set. In [16], a variant of Matching network [13] was proposed and named the Relation-Net. It uses additional network to learn similarity between image through a deep non-linear metric. Relationship of every query-support pair is evaluated using a neural network. As an extension to prototypical network in [15], three light weight and parameter free improvements were proposed in [5]. In [10, 25] modifications to Relation-Net was proposed. In [10], images were encoded into feature vectors by an encoding network. In [25], second order descriptors were proposed instead of first order descriptors. Given a new task with its few-shot support set, Garcia et al. [23] proposed to construct a graph where all examples of the support set and a query set are densely connected. There have been modifications proposed to [23], by [27] and [7]. In [27], transductive propagation network was proposed to propagate labels from known labeled instances to unlabeled test instances. In [7], Edge Labeling Graph Neural Network (EGNN) was proposed to predict edge-labels rather than node-labels, this is ideal for performing classification on various tasks without retraining. In [4], local descriptor based image-to-class measure was proposed which was obtained using deep local descriptors of convolutional feature maps.

Parameter Generation Methods: Using attention based mechanism to predict the classification weights of each novel class as a mixture of base classification weights of each novel class, Gidaris et al. in [22] proposed Dynamic-Net to capture class dependencies in the context of few-shot learning. But in Dynamic-Net [22], dependencies were considered between base classes and novel classes. In contrast, the dependencies were considered to exist between all the classes in [6], and these dependencies were proposed to be captured using GNN architectures. But this is computationally more expensive than the simple attention based mechanism proposed in [22]. The episodic formulation proposed in [13], was used by [6], to apply the Denoising Autoencoder framework in the context of few-shot learning, thereby improving the performance of parameter generation, by forcing it to reconstruct more discriminative classification weights. In GNN, input is the labeled training examples and unlabeled test examples of few shot problem and the model is trained to predict the label of test examples. But here, input to GNN is some initial estimate of classification weights of the classes that needs to be learnt and it is trained to reconstruct more discriminative classification weights.

Model Fine-Tuning Methods: Most of the ADAS based models, usually opt for fine-tuning the pre-trained model to add novel classes, but this method works well, only when there is sufficient number of novel class examples for training. The method that has been proposed in this paper is a fusion of the most effective features of Parameter generation, Model fine-tuning, data generation and gradient based methods. In the following section, we shall discuss the architecture of the model followed by the experimental results.

3 Methodology

The datasets are split into large-scale dataset (\(\text {D}_\text {Base}\)) and few-shot dataset (\(\text {D}_\text {Few}\)), where \(\text {D}_\text {Base}\) contains classes which have sufficient number of images for training, whereas \(\text {D}_\text {Few}\) contains classes with less than 5 images. \(\text {C}_\text {Base}\) refers to the classes present in \(\text {D}_\text {Base}\) and \(\text {C}_\text {Few}\) refers to the classes present in \(\text {D}_\text {Few}\). There is no overlap between \(\text {C}_\text {Base}\) and \(\text {C}_\text {Few}\). The distribution of the dataset into \(\text {D}_\text {Base}\) and \(\text {D}_\text {Few}\) is illustrated in Table 1.

Table 1. Random distribution of classes from public datasets into large scale and few-shot classes.

The goal of our work is to devise a visual learning based few-shot system that achieves good classification performance on novel few shot classes, \(\text {C}_\text {Few}\) (with less than 5 samples each) and does not degrade the performance on the large scale base classes \(\text {C}_\text {Base}\) and has a fast inference with little or zero training for adding new classes to the existing model.

In a neural network, for a particular class: weights and activations are closely related. In this paper, we propose a novel, computationally efficient, yet effective framework called Param-Net, which is a multi-layer dense regression transformation function to convert the activations of a particular class to its corresponding weights.

Initially, Resnet-101 deep neural network was considered as the base model and was pre-trained on the large scale dataset (\(\text {D}_\text {Base}\)). In the base-model, the entire network prior to the final fully connected layer was considered as feature extractor. The final fully connected layer was considered as the classifier network. For an input image ‘\(\text {X}_\text {i}\)’, the feature extractor will output a “d”-dimensional feature vector \(\text {Z}_{\text {X}_\text {i}} = \text {F}(\text {X}_{\text {i}})\). The weights “w” of the classifier network consists of “N” classification weights, where “N” is the number of classes in \(\text {D}_\text {Base}\):

$$\begin{aligned} w=[w_\text {i}]^\text {N}_\text {i=1} \end{aligned}$$
(1)

where \(\text {w}_\text {i}\) is the d-dimensional classification weight vector for the \(\text {i}^\text {th}\) class. For input image ‘\(\text {X}_\text {i}\) and for ‘N’ classes, the classifier will compute the classification scores as: For input image ‘\(\text {X}_\text {i}\)’,

$$\begin{aligned}{}[s_{1i}, s_{2i}, ..., s_{Ni}] = [Z_{xi} w_1, Z_{xi} w_2, .. , Z_{xi} w_N], \end{aligned}$$
(2)

For an image “\(\text {X}_\text {e}\)”, belonging to class “k”, the objective of the feature extractor and classifier is to maximize \(\text {s}_\text {ke}\), where \(\text {s}_\text {ke} = \text {Z}_\text {xe} \mathrm{{w}}_\text {k}\), and minimize \([\text {s}_\text {1e}\), \(\text {s}_\text {2e},..., \text {s}_\text {(k-1)e}]\) and \([\text {s}_\text {(k+1)e}\), \(\text {s}_\text {(k+2)e}, ..., \text {s}_\text {Ne}]\). The weights “w” are learnt through back-propagation using the loss function:

$$\begin{aligned} \dfrac{\sum \limits _{i=0}^{M}{L_i}}{M} \end{aligned}$$
(3)

where,

$$\begin{aligned} L_\text {i} = -log\dfrac{e^{\text {Z}_{\text {X}_\text {i}}\text {W}_{\text {y}_\text {i}}}}{\sum \limits _{j=1}^{N}{e^{\text {Z}_{\text {X}_\text {i}}\text {W}_\text {j}}}} \end{aligned}$$
(4)

where “M” is the number of images per epoch, for image “\(\text {X}_\text {i}\)”: “\(\text {L}_\text {i}\)” is the loss and “\(\text {y}_\text {i}\)” is the annotation label. To adapt the base model to include novel few-shot classes, a transformation layer named Param-Net has been proposed in this paper. The objective of the Param-Net is to predict parameters of a particular class based on its corresponding activations. The activations which are used as input to the Param-Net are determined using the feature extractor network. Parameters of the original base-model classifier network is replaced by the Parameters estimated from the Param-Net which can be denoted as:

$$\begin{aligned} T(Z_{\text {X}_\text {i}})=we_\text {i} \end{aligned}$$
(5)

where T() is the transformation function or the Param-Net, \(\text {Z}_{\text {X}_\text {i}}\) is the activation of the image “\(\text {X}_\text {i}\)” and \(\text {we}_\text {i}\) is the estimated parameters for the image “\(\text {X}_\text {i}\)”.

During the training phase of the Param-Net, initially a mini-batch of input images is formed from the \(\text {C}_\text {Base}\) classes, containing M\(^{'}\) images of each class. Hence the size of the mini-batch was N * M\(^{'}\), where N is the number of classes. Using the feature extractor network, d-dimensional activation vectors are extracted for the entire mini-batch, containing N classes and M\(^{'}\) images for each class. Mean activations are extracted for each class using:

$$\begin{aligned} Mz_\text {j} = \dfrac{\sum \limits _{k=0}^{M^{'}}{Z_{\text {X}_\text {jk}}}}{M^{'}} \end{aligned}$$
(6)

where \(\text {Mz}_\text {j}\) is the mean activation of the batch of images belonging to \(\text {j}^\text {th}\) class, \(\text {M}^{'}\) is the number of images per class in the batch and \(\text {Z}_{\text {X}_\text {jk}}\) is the activation of the \(\text {k}^\text {th}\) image of the \(\text {j}^\text {th}\) class.

The d-dimensional mean activations of each of the class is input to the Param-Net to estimate the parameters of the corresponding classes, as illustrated in Fig. 1(a). The Param-Net is a regression network whose input and output dimensions are of the same size, but to reduce overfit, we have posed it as a classification task. After estimating parameters of all the classes individually, they are concatenated into a “d * N” vector.

A mini-batch of training images \({\text {X}'_\text {1}}, {\text {X}'_\text {2}}, ...,{\text {X}'_\text {N}}\), containing an equal sample of all the classes is considered and features are extracted, where \(\text {X}'_\text {1}\), contains a subset of images belonging to class 1. These features are then convolved with the estimated parameters from the transformation layer and softmax activations are applied. The output is compared with the input annotated labels and the loss function is computed and the gradients are propagated. The loss function from Eq. 4 is modified to:

$$\begin{aligned} L_\text {i} = -log\dfrac{e^{\text {Z}_{\text {x}^{'}_{\text {i}}}\text {[T(Mz)]}_{\text {y}_\text {i}}}}{\sum \limits _{j=1}^{N}{e^{\text {Z}_{\text {x}^{'}_\text {i}}\text {[T(Mz)]}_\text {j}}}} \end{aligned}$$
(7)

where, \(\text {L}_\text {i}\) is the loss for the image ‘\(\text {X}_\text {i}\)’, ‘N’ is the number of classes, \(\text {Z}_{\text {x}^{'}_\text {i}}\) is the activations estimated or features extracted for the image ‘\(\text {X}_\text {i}\)’ using the feature extraction network, ‘\([\text {M}_{\text {z}}]_{\text {y}_\text {i}}\)’ is the mean activation of the actual class that the image ‘\(\text {X}_\text {i}\)’ belongs to, ‘\(\text {[Mz]}_\text {j}\)’ is the mean activation of all the other classes ranging from 1 to N and T[] is the estimated weights from the Param-Net. The entire flow of the network has been illustrated in Fig. 1 (b).

There are three distinct phases in this approach: Training, Parameter predictor and inference. During the training phase, only images from \(\text {D}_\text {Base}\) are used to train the transformation layer/Param-Net. None of the images from the \(\text {D}_\text {Few}\) are used. In the parameter predictor phase, the images 1–5 of a given class \(\text {C}_\text {Few}\) from \(\text {D}_\text {Few}\) are considered and passed through the feature extractor and mean activations of the extracted features are determined.

Fig. 1.
figure 1

Illustration of few-shot learning pipeline proposed in this paper using Param-Net based approach with zero-training. (a): Parameter Generator network (b): Training and inference pipeline for the proposed model.

These mean activations are input to the transformation layer to estimate the parameters for the few shot class \(\text {C}_\text {Few}\). These estimated parameters for few-shot classes are concatenated with the estimated parameters of the other base classes. This way, a new class is added to the existing model with zero training. At inference time, for any input image, features are extracted and convolved with the estimated parameters to determine the class of the input image.

We also propose to fine-tune the Param-Net with base classes as well as few-shot classes to significantly improve the accuracy. For addition of a novel class to the existing model, Param-Net is fine-tuned with mean-activations from \(\text {C}_\text {Base}\) as well as from \(\text {C}_\text {Few}\).

But the convergence of the fine-tuned model does not require too many epochs nor too many computation cycles because: (1) Param-Net is just a 2-layer dense network, (2) Input to the Param-Net is only mean activations which are of lower dimension, compared to the high-dimension raw input. Hence, with little training, the efficiency of Param-Net can be considerably increased. The modified flow for the proposed Param-Net is illustrated in Fig. 2.

Fig. 2.
figure 2

Illustration of few-shot learning pipeline proposed in this paper using fine-tuning based Param-Net approach with little-training. Training and inference pipeline for the proposed model has been shown with the Param-Net being fine-tuned, every time a novel few-shot class is added to the model.

4 Results

We evaluate the proposed Param-Net extensively in terms of few-shot recognition accuracy and model convergence speed on MiniImageNet and Pascal-VOC datasets. In the following sections, we describe the datasets, detailed experimental setup, comparison to the state-of-the-art methods and an ablation study with respect to different model architectures.

4.1 Dataset

MiniImageNet was proposed by Vinyals et al. in [13], for evaluating few-shot learning based algorithms. Its complexity is high due to the use of images from ImageNet dataset, but requires less resources and cheaper infrastructure than running on the full ImageNet dataset. It is the most popular dataset for benchmarking few-shot learning algorithms. We used the same split proposed by [1] for fair comparison. We consider 5-way classification for both 1-shot and 5-shot. Pascal-VOC dataset is one of the most popular datasets for object detection task. The logical extension of few-shot learning based algorithms is object detection based few-shot learning. Hence the use of Pascal-VOC helps us to make the model robust to scale to object detection tasks. Param-Net was evaluated on the Pascal VOC for 20-way 1-shot performance.

4.2 Model

For a fair comparison with the state-of-the-art we use pre-trained Resnet-101 as our base model for feature extraction, for MiniImageNet and Pascal-VOC datasets. Resnet-101 was pre-trained on ImageNet and the model was later fine-tuned on \(\text {D}_\text {Base}\) dataset. Model fine-tuning was performed by freezing the initial layers and only updating the weights of the final few layers. The layer prior to the final fully connected layer serves as a feature extractor for Param-Net. The dimension of the extracted features is 2048. The learning rate for fine-tuning Resnet-101 was 0.0001 using an Adam optimizer.

For MiniImageNet dataset, the \(\text {D}_\text {Base}\) dataset had 80 classes with 600 images each and these images were used for fine-tuning Resnet-101. Of the 600 images, 500 images were used for training and 50 images each for validation and testing. The same distribution of \(\text {D}_\text {Base}\) is used for training the Param-Net as well. One of the most important characteristics of Param-Net is that it is class-agnostic, which indicates that the network need not be trained on images from \(\text {D}_\text {Few}\). Adam optimizer was used with a learning rate of 0.001 and it took around 187 s for the Param-Net to be trained on a Tesla K80 GPU. Once the Param-Net is trained, it has the capability to add novel classes with zero-training and the addition of novel classes can be done at real time. It takes only 18 ms to add novel classes to the existing base-classes because Param-Net is just a two layer dense regression network with a low-dimensional input.

In the following sections, we shall discuss the key role that a robust feature extractor plays, in improving the performance of few-shot learning algorithms. We shall also discuss the effect of the number of dense layers in the Param-Net on the quality of the weights generated, in the following sections.

Table 2 shows the comparison between: a) Accuracy of conventional fine-tuning based approach, where Resnet-101 model was fine-tuned and test only using the \(\text {D}_\text {Base}\) dataset b) Accuracy of the Param-Net based approach on the \(\text {D}_\text {Base}\) dataset, where similarly, the Param-Net was trained and tested only using the \(\text {D}_\text {Base}\) dataset and the weights of the classification layer of the Res-Net model was replaced by the weights estimated using the Param-Net. It is evident from the results in Table  2, that the Param-Net is able to achieve comparable performance on the \(\text {D}_\text {Base}\) dataset, while achieving state-of-the-art results on \(\text {D}_\text {Few}\) dataset as depicted in later sections.

Table 2. Comparative study between: a) ResNet-101 based feature extractor and classifier and b) ResNet-101 based feature extractor with Param-Net based classifier. Both are trained and tested only on \(\text {D}_\text {Base}\) dataset.

4.3 Evaluation Results and Comparisons

In Table 3, we compare Param-Net with state-of-the-art methods of different few-shot techniques. The accuracies of all the techniques have been reported on the test dataset.

Data generation techniques [9, 11, 19, 30]: Different techniques like decorrelation [9], GAN [19], auto-encoder [30] and image deformation network [11] have been used. But these models need to be trained on more than just 5 examples for the generators to generate useful data, otherwise the generators under-perform on few-shot data. Hence the accuracy of the data augmentation based methods is lower than the most of the other approaches.

Metric learning techniques [4, 5, 7, 10, 13, 15, 16, 23, 25]: In [13], few-shot learning problem was addressed using cosine similarities between the features produced by CNN, which is a very simplistic metric to differentiate between the images. In [15] and [16], instead of cosine similarities, a non-linear similarity score was introduced using neural networks, but the descriptors were of first order. As an extension to [15], different approaches [5, 10, 25] were proposed like second order descriptors, encoding feature vectors using encoder network, batch folding, few shot localization (fsl) and covariance pooling (cp). But still they under-performed because of the inability of the feature extractors to extract meaningful features. GNN based approach, proposed by [22] under-performed because it uses node-labeling to model intra-cluster similarity and inter-cluster dissimilarity. This was addressed in [7] to achieve better performance. In [4], k-nearest neighbors metric was used, but one drawback of this approach was that it used Conv4 as a feature extractor, which is a weak feature extractor. Most of the mentioned metric learning techniques, learn a feature similarity metric between few-shot training examples and a test (query) example. But, these methods treat each training class independently from one another. Hence, the performance of metric learning frameworks is weakened.

Gradient descent techniques: In [12] and [14], element-wise fine tuning was used, hence inducing overfitting on the designed models, and in [12], LSTM was used to update the initial model as well as the update rule, hence it was time consuming as well, this was addressed in [14], by learning only the initial model. As an update to [14], different solutions were proposed like: using first order gradients [26], joint learning of weight initialization as well as learner’s step size [21], entropy based inequality minimization measures [2] and transfer learning and fine-tuning [1]. But the problem with gradient based approaches is that, they require many iterative steps over many examples to perform well.

Parameter generation: In [22], dependencies were considered between novel classes and base classes using simple attention based mechanism. In [6], GNN based techniques were used to differentiate between all the classes, novel and base.

The Param-Net that has been proposed here, is a combination of data augmentation, gradient descent and parameter generation methods. This has resulted in state-of-the-art performance on 1-shot and 5-shot settings for MiniImageNet, which is the most popular public dataset for benchmarking few-shot learning algorithms. The existing state-of-the-art methods mainly focus on making relation measure, concept representation and knowledge transfer, but do not pay enough attention to final classification. This issue has been addressed in this paper by posing a regression problem of the Param-Net into a classification task, thereby also reducing the overfit of the model onto the training dataset considerably.

Table 3. 5-way accuracy on MiniImageNet. Blue: Best accuracy.
Table 4. Results of ablation study

Table 3 and Table 4, indicates the importance of a robust feature extractor. In Table 3, the techniques which used Resnet based architecture for feature extraction performed better than the approaches that used Conv-4 or Conv-32 based feature extractors. Similarly, we also experimented using Nas-Net (network based on neural-architecture search) instead of Resnet-101 as feature extractor as shown in Table 4. Nas-Net improved the performance of the algorithm on both 1-shot as well as 5-shot settings. We also conducted experiments to ascertain the contribution of the number of layers in the Param-Net, to the eventual performance of the algorithm on MiniImageNet dataset. It has been observed that a 2-layer dense network performs better than a 1-layer dense network, as has been indicated in Table 4.

In this paper, we also propose an approach where, Param-Net is finetuned with \(\text {D}_\text {Base}\) as well as \(\text {D}_\text {Few}\). For every new class that needs to be added, the Param-Net needs to be finetuned. It significantly leads to an increase in the accuracy but with a little extra training time of 138 s for addition of a set of novel classes to the existing model, on a NVIDIA K80 GPU.

5 Conclusion

In this work, we contribute to few-shot learning by developing a novel, computationally efficient framework called Param-Net which achieves top performance in tackling few-shot learning problems.

The main objective of few-shot applications is to add novel classes at real time to the existing model in the presence of less than 5 visual examples. Hence Param-Net has been proposed in this paper. It is a dense transformation layer which converts the activations of a particular class to its corresponding weights. It is pre-trained on large-scale base classes and at inference time it adapts to novel few-shot classes with just a single forward pass and zero or little training as the network is class agnostic.

Extensive comparison with related works indicate that the Param-Net outperforms state-of-the-art algorithms in terms of accuracy (1-shot and 5-shot) and in terms of faster convergence (zero or very-little training). We evaluate the performance of Param-Net on two publicly available datasets: MiniImageNet and Pascal-VOC. MiniImageNet is the most popular dataset for benchmarking few-shot algorithms. Pascal-VOC dataset was used to verify the scalability of Param-Net from few-shot classification task to few-shot detection task.

The future scope of improvement for the proposed Param-Net would be to scale the algorithm to address few-shot detection rather than just few-shot classification problems. The first step to address this challenge has been successfully accomplished by testing the Param-Net on the Pascal-VOC dataset, which is the premier dataset for object detection tasks.