1 Introduction

Differentiating between objects that share many features is a complex task. Even the human eye is sometimes challenged to differentiate correctly between objects solely by comparing certain of their features. Face recognition is particularly intricate because of the structural and facial similarities between individuals, which make it difficult for computer programs to evaluate and distinguish between faces accurately.

Face recognition using AI models has been the subject of numerous studies and experiments. However, despite significant progress, face recognition remains a challenging task due to the structural similarity of facial features. Although facial representations differ across individuals, they often appear closer together in the latent space. Consequently, features or embeddings extracted from trained CNN models for face recognition may exhibit similarity. To overcome this issue and enable CNN models to effectively distinguish between different faces and generalize well, training on a large volume of data is essential. To address the limitation of requiring a high volume of data per identity, the concept of triplet loss-based modeling was introduced. While many studies and experiments have explored the use of the Triplet Loss Function for Face Recognition, there is a lack of comprehensive presentations on the experimental results of different combinations of triplet mining strategies.

Numerous studies have been published, and several methods and techniques have been proposed for this task. Training a face recognition CNN model is resource-intensive and demands a large amount of data to achieve optimal results. This reliance on high volumes of data in turn requires robust computational resources. To make effective use of such data, it is crucial to incorporate a diverse range of feature-rich datasets, including masked images, occluded faces, low-resolution images, and more. By introducing such variety, CNN models can capture and learn the subtle differences present in the data, ultimately leading to a comprehensive and distinct representation in the latent space. Manual data annotation is expensive and can cause high technical debt. Thus, obtaining accurately labeled datasets for face recognition can be challenging due to the requirements of high volume and variety of data.

To circumvent the problem of data sufficiency for model training, meta-learning with a metric-based learning methodology is used to perform face recognition from a few shots. The metric-based loss called Triplet Loss is used for face verification and recognition tasks with a Siamese Network. We used offline and online triplet mining strategies along with the Triplet Loss function by selecting hard and semi-hard triplets. Another factor introduced while selecting hard and semi-hard triplets is the selection of negative samples during triplet creation. We used a random or best-fit negative sample strategy and performed model training using these combinations. A triplet must be selected such that the model learns image embeddings for which the distance between the anchor and positive image embeddings is smaller than the distance between the anchor and negative image embeddings [1]. Similarity between these embeddings is measured using the cosine similarity metric, while distance is computed using the Euclidean distance. These metrics quantify the proximity of the embeddings and enable their comparison and analysis. This work also describes how the selection of hyperparameters may influence the outcome.

Our research distinguishes itself by diverging from conventional methods that heavily rely on extensive datasets and deep networks for achieving broad generalization. Recognizing the limitations of this approach, we conducted experiments in scenarios marked by limited data availability. Our study intentionally employs substantially fewer samples per class yet manages to achieve noteworthy levels of generalization. Also, we introduce a novel element to our methodology - the integration of a few-shot learning strategy. This approach not only addresses the challenges associated with limited data scenarios but also significantly reduces the dependency on high computational resources. Consequently, our research stands out for its efficiency in model training with higher number of classes and a significantly lower number of samples, offering a practical and resource-conscious alternative that contributes to the advancement of the field.

A pre-trained CNN is used as a transfer learning model to further train the Triplet Loss model using the Siamese network, and its weights are updated during training. The pre-trained CNN is modified, and custom convolutional and dense layers are added to perform the model training. We ensured that datasets were preprocessed, with images cropped and faces extracted and aligned, before training. As part of preprocessing, we used an image size of 112x112, and the alpha channel of the synthetic dataset (DigiFace-1m) was removed to eliminate transparency and reduce the number of image channels to 3.

Post-model training, results were evaluated on the standard real and synthetic datasets such as LFW, CelebA, VGGFace2 and DigiFace-1m. Results are evaluated using ‘Model Testing’ and ‘Real time’ evaluation. The evaluation is performed using few shot samples from support set and query set. These results are presented in Section 4.

To the best of our understanding, the studies that use Siamese network using triplet loss do not provide a comprehensive analysis on different triplet mining strategies under few shot settings. These studies [1,2,3,4,5,6] either utilize triplet loss directly or provide a task specific version [7] of triplet loss. Contributions of the study are:

  1. In-depth examination of triplet loss employing various triplet mining strategies for parametric few-shot learning.

  2. Model training and evaluation strategy designed within the constraints of a limited dataset, particularly in the realm of few-shot learning.

  3. The study conducts experiments on samples from classes for 1-shot, 2-shots and 5-shots that are not seen during training and validation process. This adds a distinctive dimension to the study.

  4. Design of two-stage training approach that improves the effectiveness of the model and provides insights into handling limited data scenarios in face recognition. Provides insights into the factors influencing performance of CNN models for face recognition.

  5. Experimental results obtained from these carefully designed methodologies will contribute to advancing the field and addressing the challenges associated with face recognition tasks.

The organization of the paper is as follows. In Section 1, we introduce the study of triplet loss-based face recognition and present the motivation behind our study. Section 2 provides a comprehensive review of literature and existing work related to usage of triplet loss and face recognition. Section 3 covers methodology adopted to accomplish different tasks related to the study. Section 4 discusses and summarizes performance evaluation results of the experiments conducted related to the study. Finally, Section 5 provides conclusions and future work.

2 Literature survey

Early stages of face recognition can be traced back to various research texts [8, 9]. At that time, much of the focus was on manually designing and crafting features for face recognition. However, in recent years, there has been a shift towards making machines more intelligent [10]. Researchers are now aiming to offload the responsibility of solving the aforementioned complex task to automated systems.

Over time, these initial research efforts evolved, leading to advancements in face detection and facial feature extraction techniques. Appearance-based methods such as fisherfaces [11], as well as feature-based approaches, were proposed to handle larger datasets consisting of facial images. Multiple approaches based on Support Vector Machines (SVM) [12,13,14], Principal Component Analysis (PCA) [11], and Hidden Markov Models (HMM) [15, 16] were also introduced to tackle face recognition tasks. Machine learning has also been applied through a subspace discriminant ensemble-based approach [17]. A hybrid approach to recognizing faces combines Viola Jones with PCA. Viola Jones is still used to detect a face, and PCA is used along with it to detect different parts of the face such as the left eye, right eye, nose and mouth [18]. Features are extracted from the detected parts of the face, and the face is recognized by applying PCA to them. These techniques served as fundamental building blocks for subsequent research conducted in controlled environments and with limited datasets.

Advances in technology now enable identity authentication and authorization using face verification, which has made face recognition a general-purpose means of authentication. To achieve this, it is important that data is normalized and segmented and that good quality features [19] are generated. Structural and facial similarity between different faces adds to the complexity of the face recognition task, and research is being done to extract texture features [20] from the eyes, nose, mouth and face. Face recognition generalizes well in controlled situations such as similarity matching using only the frontal view. However, there are scenarios where the face is posed rather than frontal. In that case, the frontal view is calculated from the pose-view [21] angle before the face similarity is computed. Deep learning has brought Face Recognition [22] to a fair level of maturity. This has been made possible by high-performance computing systems, the availability of datasets and the evolution of new techniques, technologies and algorithms. One such algorithmic technique uses triplet loss [1, 23] and different triplet mining [24] strategies to find the similarity [25] between faces. This is a novel technique that enables a CNN to produce face embeddings such that embeddings of similar faces lie closer together in the latent space while embeddings of different faces are farther apart. The technique is primarily used in Siamese [26, 27] network-based modelling. This approach has been used to conduct similar experiments on face recognition [28] where the face images are occluded with a mask [29,30,31,32,33]. The triplet-based model using a Siamese network has also been used in unsupervised learning to generate more accurate pseudo labels [34] for person re-identification tasks [35].
It has been observed that, for partial matching and to counter occlusion, an evolved version of the triplet loss function [29] can be used to further improve the performance [36, 37] of the model on the standard datasets. The triplet loss-based approach has been extended to the relatively new concept of few-shot learning [30], whereby only a limited amount of data is required to achieve significantly good performance on the given face recognition task. This significantly reduces the requirement of having a high volume of training data. The use of triplet loss in our experiments has proven to be effective in reducing the requirement for a large number of samples [38] per class. This approach mimics the behaviour of few-shot learning methods, as highlighted in a recent study by Holkar et al. [26].

The conventional methodologies in the field often rely on extensive datasets and employ deep neural networks to achieve broad generalization. However, this conventional approach faces a notable limitation due to its dependency on large datasets. It is this limitation that serves as a primary motivation for our study, prompting us to explore and conduct experiments in scenarios characterized by limited data availability. In contrast to the common practice of employing numerous samples per class, our study specifically investigates the efficacy of utilizing significantly fewer samples per class. The objective is to demonstrate that even with restricted data, it is possible to achieve substantial generalization.

In addition to addressing the challenges associated with limited data scenarios, our study also leverages a few-shot learning strategy. This innovative approach aims to minimize the reliance on extensive computational resources traditionally required for model training. By adopting a few-shot learning strategy, we effectively reduce the computational burden, leading to a notable reduction in the overall model training time. This not only contributes to resource efficiency but also underscores the practicality and applicability of the proposed methodology in scenarios where computational resources are constrained.

3 Methodology

The system pipeline consists of multiple stages, each corresponding to a specific task. In stage-1, the base network is trained, and in stage-2 the triplet loss network is trained using the base network from stage-1. The stage-1 task is intended to develop a model that is used for feature extraction. To achieve this, the widely adopted dataset DigiFace-1m is chosen. The data is augmented offline, and the model is trained thereafter. VGG16 is used as the base network for transfer learning. As shown in the computation graph in Fig. 2, the weights corresponding to block-5 of VGG16 and the fully connected layers are updated during the training process. Further details on base network selection, the dataset selection criteria and model training are presented in the subsequent sections.

3.1 Base network training methodology (stage-1)

3.1.1 Base network selection criterion

The pre-trained VGG16 model is selected based on the following considerations:

  1. Simple architecture - The architecture of VGG16 is straightforward and easy to understand, consisting of stacked convolutional layers and dense layers.

  2. Pre-trained - VGG16 has been pre-trained on the large-scale image dataset ‘ImageNet’. This enables the network to learn generic features from large-scale datasets.

  3. Performance on medium-sized dataset - VGG16 performs well on medium-sized datasets. It does not perform as well as more recent models such as ResNet50 or Inception. However, it fulfils our objective of training on a medium-sized dataset.

  4. Shorter training time - Because its training time is shorter than that of other standard networks, VGG16 is the ideal selection for the experiments.

The system pipeline and methodology adopted for training and validating the base network are displayed in Fig. 1.

Fig. 1 Overview of system pipeline and methodology for training and testing

3.1.2 Dataset selection

For selecting the datasets for model training and testing, state-of-the-art datasets are evaluated. The criteria for dataset selection are summarized in Table 1. The criteria are:

  1. Active - The dataset must be active for current research.

  2. Class Sufficiency - The dataset must have enough classes, each with enough samples for training.

  3. Sample sufficiency per class - One common method to increase the size of the dataset is by augmenting the data. Embeddings of an augmented image closely resemble those of the original image. When working with limited data, it is preferable to use a dataset that naturally incorporates variations in the samples. This allows the network to generate more accurate results without relying heavily on data augmentation. Based on these considerations, we opted for a dataset that contains a minimum of 50 samples per class (without augmentation) for training the base network.

  4. Balanced - To avoid bias and ensure a fair distribution, each class must be represented equally.

Table 1 Dataset selection criterion

Based on the criteria in Table 1, the DigiFace-1m dataset is selected for model training and testing. Details of the DigiFace-1m dataset are presented in Table 2.

Table 2 Selected dataset for base network (Stage -1)

3.1.3 Dataset preprocessing - base network

Because of the inherent complexity of facial features, the model is trained on the frontal view of the face. However, some images are pose-variant and do not present a frontal view. As part of the data preprocessing, the dataset is iterated and face alignment, extraction and resizing are performed on every sample. Samples on which automatic alignment or extraction could not be performed are discarded from model training and testing. Face alignment is performed using the OpenCV library. MTCNN is used to perform face extraction, and the image is resized to 112x112 pixels with the alpha channel removed.

3.1.4 Model training, validation and testing

VGG16 is used for the face recognition task through transfer learning, training only its last convolution layer and the subsequent dense layers. Input to the model is an RGB image with a width and height of 112 pixels and 3 channels. The DigiFace-1m dataset has an alpha channel, which was removed as part of the preprocessing. The model computational graph is presented in Fig. 2. The optimized model training configuration and parameters are presented in Table 3.

Fig. 2 VGG16 computation graph - face recognition task

Our objective was to train the model only minimally on the face recognition task, to a sufficient accuracy of around 80%, so that the model could detect the face at a basic level and be used by the downstream network for extracting the embeddings.

3.2 Triplet loss network training methodology (stage-2)

The triplet loss network uses the feature extraction network trained in stage-1 of the network pipeline. The softmax layer of the feature extraction network is removed, and the features are extracted using the last dense layer of 128 dimensions. This model is fine-tuned by attaching a convolutional layer and a couple of fully connected layers whose output is normalized, and the triplet loss function is applied thereafter. During training, the size of the mini batch plays a critical role in determining the optimal triplet for that particular batch. Hence, it is suggested to have as many different classes as possible represented in the mini batch. During the training process, the triplets are mined, and training is performed. The loss is computed from the distances between the anchor, positive and negative samples mined for each class in the mini batch. The methodology adopted for training the triplet loss network is displayed in Fig. 1.

Table 3 Optimized model training parameters for base network

3.2.1 Training and validation dataset selection

All the experiments have been conducted on standard datasets namely DigiFace-1m, CelebA and LFW. DigiFace-1m is a large dataset with 10k identities and 72 image samples per class. LFW and CelebA datasets have multiple identities but a limited number of samples per class. The selection of the dataset for training the model is based on the criterion mentioned in Table 1. The base model is trained on the DigiFace-1m dataset, specifically on 200 classes as indicated in Table 2. For training the triplet loss network, we also utilize the DigiFace-1m dataset, but with different classes and samples. By selecting different classes and samples, we ensure that the training data for the triplet loss network is distinct from the data used to train the base network. Further details about the dataset are mentioned in Table 4. The “Selected Classes” column in the table indicates that the class names range from 2000 to 2049.

Table 4 Selected dataset - triplet loss network (stage 2)

3.2.2 Dataset preprocessing - triplet loss network

This section outlines the preprocessing techniques applied to the selected datasets, including image resizing, face extraction, face alignment, and normalization. Face detection is a precursor to face recognition. To enable a CNN model to perform well, face extraction is done using MTCNN and face alignment using OpenCV as part of the preprocessing. The images in the DigiFace-1m dataset are of size 112x112 with 4 channels. During preprocessing, the alpha channel is removed, yielding 112x112 images with 3 channels.

3.2.3 Hyperparameters selection

Choosing appropriate hyperparameters significantly influences the training results of CNN models for face recognition. This section discusses the selection of hyperparameters such as learning rate, batch size, and margin value, together with the considerations involved and the rationale behind the chosen values. To achieve good results and faster convergence, the batch size must provide a sufficient representation of samples from each class for triplet mining to be effective. We chose a batch size of 1024 so that each class is sufficiently represented in a mini batch.

3.2.4 Triplets mining strategies and network training

This section explores the triplet mining strategies employed in face recognition tasks. Different techniques, such as hard triplets, semi-hard triplets, and offline triplets, are examined for their advantages and challenges. We provide insights into selecting suitable triplets to train the CNN models effectively. Triplets are formed by selecting an anchor (A) and a positive (P) sample from the same class and a negative (N) sample from any other class. There are primarily two ways of selecting a triplet, i.e., hard triplets and semi-hard triplets [1]. Triplets are selected such that the network learns that the Euclidean distance between the anchor and positive pair is smaller than that between the anchor and negative pair by a margin. This process of network training [1] is presented in Fig. 4.

Fig. 3 Triplet loss - network training

Schematic of our network design is represented in Fig. 3. Different sizes of shapes in the mini batch specify unequal distribution of samples per class. Deep learning architecture block specifies that transfer learning is performed on VGG16, and some custom layers have been added as well. These layers include a 1x1 filter-based convolution layer and a couple of fully connected layers. The output is L2 normalized to provide the normalized 128-dim embeddings.
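The L2 normalization of the output can be illustrated with a minimal sketch in plain Python. The function name `l2_normalize` and the toy 4-dimensional vector are assumptions made for illustration only; the network described above produces 128-dimensional embeddings, and this is not the authors' implementation.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length.

    eps is a hypothetical guard against division by zero for an all-zero vector.
    """
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


# Toy stand-in for a 128-dim embedding produced by the last dense layer.
embedding = [3.0, 4.0, 0.0, 0.0]
unit = l2_normalize(embedding)
print(unit)  # same direction as `embedding`, Euclidean length 1
```

After this step every embedding lies on the unit hypersphere, so distances between embeddings depend only on their direction and not on their magnitude.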

3.2.5 Triplet loss objective

The objective of the triplet loss function in (1) is to ensure that the embeddings of the anchor (A) and positive (P) samples lie close together in the latent space, while the embeddings of the anchor and negative (N) samples are separated by a distance greater than the anchor-positive distance. Thus, the objective is to minimize the distance between the anchor and the positive and maximize the distance between the anchor and the negative. Here \(\alpha \) is the margin, and the function f(x)\(_{\text {i}}\) produces the embedding of a sample ‘x’ that belongs to the i\(^{\text {th}}\) class.

$$\begin{aligned} L_{\text {A,P,N}}= \max (||f(A) - f(P)||_2 - ||f(A) - f(N)||_2 + \alpha , 0) \end{aligned}$$
(1)

For a mini batch of size ‘M’, the loss per batch is represented as (2):

$$\begin{aligned} L_{\text {A,P,N}}= \frac{\sum {\max (||f(A) - f(P)||_2 - ||f(A) - f(N)||_2 + \alpha , 0)}}{M} \end{aligned}$$
(2)
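Equations (1) and (2) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the training code used in the study: the function names, the 2-dimensional toy embeddings and the margin value of 0.2 are assumptions, and M is taken here to be the number of triplets formed from the mini batch.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, alpha):
    """Eq. (1): hinge on d(A, P) - d(A, N) + alpha."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + alpha, 0.0)


def batch_triplet_loss(triplets, alpha):
    """Eq. (2): per-triplet losses summed and divided by M (assumed: number of triplets)."""
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets) / len(triplets)


alpha = 0.2  # hypothetical margin, chosen only for this example
anchor, positive = [0.0, 0.0], [0.3, 0.4]   # d(A, P) is about 0.5
far_negative = [0.6, 0.8]                   # d(A, N) is about 1.0, margin satisfied
near_negative = [0.0, 0.4]                  # d(A, N) is about 0.4, margin violated

loss_far = triplet_loss(anchor, positive, far_negative, alpha)
loss_near = triplet_loss(anchor, positive, near_negative, alpha)
batch_loss = batch_triplet_loss(
    [(anchor, positive, far_negative), (anchor, positive, near_negative)], alpha
)
print(loss_far, loss_near, batch_loss)
```

The first triplet already satisfies the margin and contributes zero loss, while the second contributes a positive loss; only such margin-violating triplets produce a training signal.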

3.2.6 Hard triplets

A negative image sample (N) is selected such that the Euclidean distance between the anchor (A) and negative embeddings is less than the Euclidean distance between the anchor and positive (P) embeddings. Equation (3) represents this condition on the distance (d) for hard triplets.

$$\begin{aligned} d||A,N|| < d||A, P|| \end{aligned}$$
(3)

3.2.7 Semi-hard triplets

A negative sample image is selected such that the Euclidean distance between the anchor and negative image embeddings is greater than the Euclidean distance between the anchor and positive image embeddings, but by less than the margin \(\alpha \). Equation (4) represents this condition on the distance (d) for semi-hard triplets.

$$\begin{aligned} d||A,P||< d||A,N|| < d ||A, P|| + \alpha \end{aligned}$$
(4)
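The conditions in (3) and (4) can be expressed as a small classification helper. The sketch below is illustrative only: the function name, the labels "easy" and "boundary", and the margin of 0.2 are assumptions that do not come from the study. "Easy" denotes triplets that already satisfy the margin, and "boundary" covers the equality case that the strict inequalities in (3) and (4) leave out.

```python
def triplet_category(d_ap, d_an, alpha):
    """Classify a triplet from its anchor-positive and anchor-negative distances."""
    if d_an < d_ap:
        return "hard"        # Eq. (3): negative is closer to the anchor than the positive
    if d_ap < d_an < d_ap + alpha:
        return "semi-hard"   # Eq. (4): negative is farther, but still inside the margin
    if d_an >= d_ap + alpha:
        return "easy"        # margin satisfied, loss in Eq. (1) is zero
    return "boundary"        # d_an == d_ap, not covered by (3) or (4)


alpha = 0.2  # hypothetical margin
categories = [
    triplet_category(0.5, 0.4, alpha),
    triplet_category(0.5, 0.6, alpha),
    triplet_category(0.5, 1.0, alpha),
]
print(categories)
```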

3.2.8 Determining the number of triplets

With reference to (3) and (4), it is sufficient to select one best triplet per class per mini batch. This can be achieved by selecting the best anchor-positive and anchor-negative pairs. We created all possible anchor-positive combinations [1] of the triplets for a mini batch. Thus, if a class has ‘n’ samples, there are \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) possible anchor-positive pairs, and hence triplets, for that class. This ensures a sufficient representation of samples per class in a mini batch.
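The pair count can be checked with the Python standard library. The sample names below are placeholders; the figure of 72 samples is the per-class size of DigiFace-1m stated in this paper, used here only as an upper bound, since a mini batch will usually contain fewer samples of any one class.

```python
from itertools import combinations
from math import comb

# Placeholder identifiers for the samples of one class in a mini batch.
samples = [f"img_{i}" for i in range(5)]

# Every unordered (anchor, positive) pair within the class.
pairs = list(combinations(samples, 2))
print(len(pairs), comb(len(samples), 2))

# Upper bound if all 72 samples of a class were present in the batch.
max_pairs_per_class = comb(72, 2)
print(max_pairs_per_class)
```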

3.2.9 Approaches adopted for negative sample selection

The selection of negative samples is a crucial factor that significantly impacts the convergence of triplet loss. In our study, we conducted experiments using two primary strategies for selecting the negative sample either by selecting a random negative sample or by selecting the best negative sample. The approach to select a random negative and best negative sample is presented below.

  • Selecting a random negative

    1. There are ‘c’ classes and each class has ‘s’ number of samples.

    2. ‘S’ denotes the set of all samples from the other (c-1) classes.

    3. Then, a negative sample is selected at random from the set ‘S’ for each anchor and positive pair.

    Algorithm 1 is used for selecting a random negative sample from a mini batch for generating a triplet.

    $$\begin{aligned} S&= \{s_1, s_2, s_3, \ldots , s_{(c-1) \cdot s}\} \\ n&\in S, \text { where } n \text { represents the random negative sample} \end{aligned}$$

    Algorithm 1 Find a random negative embedding.

  • Selecting Best Negative

    1. This approach requires determination of a negative sample by iterating through all the classes in a mini batch and finding the negative sample image that is closest to the anchor image as presented in (5).

    $$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{N} ||f(A) - f(N)||_2 \end{aligned}$$
    (5)

    Algorithm 2 is used for selecting the best negative sample from a mini batch for generating a triplet.

    Algorithm 2 Find \(best\_negative\_embedding(anDist, anchorembedding, classes)\).
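The two negative selection strategies can be sketched as follows. The bodies of Algorithms 1 and 2 are not reproduced in the text, so this is a hedged reconstruction from the prose and from (5), not the authors' code. The representation of a mini batch as a list of `(label, embedding)` tuples and all function names are assumptions.

```python
import math
import random


def euclidean(u, v):
    """Euclidean (L2) distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def negative_pool(batch, anchor_label):
    """The set S: embeddings in the mini batch from every class except the anchor's."""
    return [emb for label, emb in batch if label != anchor_label]


def random_negative(batch, anchor_label, rng=random):
    """Random strategy: draw one negative uniformly at random from S."""
    return rng.choice(negative_pool(batch, anchor_label))


def best_negative(batch, anchor_embedding, anchor_label):
    """Best strategy, Eq. (5): the negative in S closest to the anchor."""
    pool = negative_pool(batch, anchor_label)
    return min(pool, key=lambda emb: euclidean(anchor_embedding, emb))


# Toy mini batch with three classes and 2-dim embeddings.
batch = [
    (0, [0.0, 0.0]), (0, [0.1, 0.0]),
    (1, [1.0, 0.0]), (1, [0.5, 0.0]),
    (2, [2.0, 2.0]),
]
anchor_label, anchor_embedding = batch[0]
closest = best_negative(batch, anchor_embedding, anchor_label)
sampled = random_negative(batch, anchor_label, random.Random(0))
print(closest, sampled)
```

The best negative requires a distance computation against every candidate in S, whereas the random negative needs none, which is the cost trade-off between the two strategies.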

3.2.10 Testing strategies

This section describes the testing strategies used to evaluate the performance of the triplet loss model. The testing strategies are divided into two categories: ‘model testing’ and ‘real time testing’. The categories are distinguished by the number of samples per class that are compared using cosine similarity when evaluating model performance. Data sampling and splitting strategies are presented in Table 5.

Table 5 Testing strategies - data sampling

Fig. 4 Network transformation - triplet learning

3.2.11 Similarity score estimation

As shown in the network design in Fig. 4, the output of the network is the extracted features or embeddings. Cosine similarity (6) is used as the similarity measure for testing and evaluating our experiments on completely unseen face recognition data.

$$\begin{aligned} \text {cosine similarity} = \frac{{\textbf{A} \cdot \textbf{B}}}{{\Vert \textbf{A}\Vert \cdot \Vert \textbf{B}\Vert }} \end{aligned}$$
(6)

In (6), A and B represent the embeddings of the “training and testing” or “support and query” samples.
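A minimal sketch of (6) is given below, together with a numerical check of a standard identity: for unit-length vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity. This is why a cosine-based comparison at test time ranks L2-normalized embeddings consistently with the Euclidean distance used during training. The function names and toy 3-dimensional vectors are assumptions for illustration, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Eq. (6): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes neither vector is all zeros


def l2_normalize(vec):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def euclidean(u, v):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy stand-ins for a support embedding and a query embedding.
a = l2_normalize([0.2, 0.9, 0.4])
b = l2_normalize([0.3, 0.7, 0.6])

cos = cosine_similarity(a, b)
squared_distance = euclidean(a, b) ** 2
print(cos, squared_distance, 2 - 2 * cos)  # the last two values agree up to rounding
```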

3.2.12 Hardware specifications

This section captures hardware resources used during the model training along with training times of different models. The hardware resources details are mentioned in Table 6.

Table 6 Hardware specifications

4 Discussions and results

This section lists and describes the performance evaluation results of the base network and the triplet loss network. The results of the triplet loss-based CNN are governed by the underlying pre-trained model that is used as the base network. The section also compiles the results of the experiments conducted with different triplet mining strategies. The influence of hyperparameters on the results is described in Section 3.2.3. Performance evaluation results for the triplet loss network are also presented for the ‘real time’ strategy with limited samples on unseen data.

4.1 Model training hours

The study treats the ‘Base Network’, which acts as the embedding network, as one CNN model and the ‘Triplet Loss’ model with different triplet mining strategies as another. Both models were trained on the same hardware, specified in Table 6. Table 7 reports the training time taken by the ‘Base Network’ and ‘Triplet Loss’ CNN models. The higher training times of the triplet loss networks are attributed to the non-vectorized implementation of the triplet selection algorithms.

Table 7 Model training hours

4.2 Results

This section provides quantitative results for the base network and the triplet loss network described in Section 3.

4.2.1 Base network

The model is trained on the DigiFace-1m dataset as specified in Table 2, using 200 classes with 72 samples each. A standard 80-10-10% split is used for training, validation and testing respectively. The model is trained on the face recognition task by transfer learning from the pre-trained VGG16 model. As shown in Fig. 2, transfer learning was applied from the last convolution block of the VGG16 network by unfreezing its last layer and adding custom fully connected layers. The input is classified using softmax. The model is run for 500 epochs, and the training and validation graphs suggest that convergence is achieved around the 90-100th epoch. The performance evaluation results are presented in Table 8. Intermediate results are reported as a confusion matrix for the intermediate model that is used as a feature extractor. Since the model is trained on 200 classes, the full confusion matrix is difficult to display; it is provided in Fig. A1 of the appendix, and Fig. 5 presents the confusion matrix for 10 randomly selected classes. Accuracy and loss graphs are presented in Figs. 6 and 7 respectively.

Fig. 5 Confusion matrix for base network for randomly selected 10 classes

Fig. 6 Base network - training and validation accuracy

Fig. 7 Base network - training and validation loss

4.2.2 Triplet loss network - quantitative results

The triplet losses for different triplet mining strategies are shown in Fig. 8. Selecting the best negative sample converges faster than selecting a random negative sample in both the hard and semi-hard triplet mining strategies.

Table 8 Base network - performance evaluation results

Fig. 8 Triplet losses for different triplet mining strategies

We report the quantitative evaluation results in Tables 10 and 11 for the different triplet mining strategies presented in Section 3. The optimized hyperparameters are presented in Table 9.

According to the experimental results in Table 11 for the ‘model testing’ performance evaluation strategy, semi-hard triplet selection with a randomly selected negative sample yields the best results on both seen and unseen data. The selection of a base network that provides good image embeddings plays a significant role in a triplet loss-based model. For the ‘real time’ testing strategy, a new dataset, VGGFace2, is used to evaluate the performance of the model with limited data. In this strategy, the classes in the query set and support set are the same, but the samples are different. Only 2 samples per class are selected for the support set in the different testing executions, and the query set has only one image per class. The performance evaluation results are presented in Table 10. Similarity matches between samples of different classes under the ‘real time’ test strategy are presented in Fig. 9. This suggests that triplet loss-based training is particularly useful in constrained environments where the number of samples is limited. The selection of the ‘margin’ and the ‘batch size’ plays a significant role in model training. The batch size must be chosen such that a significant number of samples per class is present in each mini batch. The results of the offline triplet mining experiments are inconclusive and are marked as ‘inconclusive’ in Table 11.
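The ‘real time’ evaluation described above, with a 2-shot support set and a single query image per class, can be sketched as a nearest-match lookup over embeddings. The text does not state how the two support samples of a class are combined, so scoring a class by its best-matching support sample is an assumption of this sketch, as are the function names, the class labels and the 2-dimensional toy embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors, as in Eq. (6)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_identity(query, support):
    """Return (label, score) of the support class most similar to the query.

    `support` maps a class label to its list of support embeddings.
    Assumption: a class is scored by its best-matching support sample.
    """
    scores = {
        label: max(cosine_similarity(query, emb) for emb in embeddings)
        for label, embeddings in support.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy 2-shot support set; 2-dim vectors stand in for 128-dim embeddings.
support = {
    "person_a": [[1.0, 0.1], [0.9, 0.2]],
    "person_b": [[0.1, 1.0], [0.2, 0.8]],
}
query = [0.8, 0.3]  # one query embedding, assumed to belong to person_a
label, score = predict_identity(query, support)
print(label, round(score, 3))
```

Accuracy under this strategy would then be the fraction of query images whose predicted label matches their true class.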

Table 9 Optimized hyperparameters for triplet loss network
Table 10 Performance evaluation results

4.2.3 Comparative analysis

This section presents a comparative analysis of different studies. Table 12 compares techniques that have primarily used triplet loss and few-shot learning in their experiments. Our experiments are performed in multiple few-shot configurations, provide a detailed analysis of triplet mining techniques, and are tested on unseen classes.

SOTA models such as FaceNet [1] and DeepFace [40] are designed to learn from many kinds of variation, such as changes in illumination and pose, in order to produce high-quality embeddings. Achieving this involves deep network architectures and training on extensive datasets comprising thousands of identities and millions of samples. Our study does not seek to draw comparisons with these state-of-the-art models, which often employ proprietary datasets and intricate network architectures. Instead, our investigation is centered on a more constrained dataset and a less complex network architecture. Our approach also differs from many existing few-shot learning studies in terms of dataset size, mining strategy, and network depth. Unlike studies that train and test on either \(\le \)20 or \(\ge \)1000 classes with a high number of samples per class, our experiments cover 50 classes with very few samples each. Moreover, our study varies the number of samples per class (one, two, or five), and these classes remain unseen during the training and validation phases. While many studies employ Siamese Networks [21, 26] with contrastive loss and some use quadruplet loss [39], our research examines the intricacies of different triplet mining strategies. Our particular focus is on parametric few-shot learning, with an emphasis on testing on classes that were not part of the training or validation process. This approach allows for a more comprehensive examination of the model’s generalization capabilities on unseen classes, given the limitations of our dataset.

Fig. 9 Similarity - samples from support and query sets

Table 11 Triplet loss network - performance evaluation results (model testing)
Table 12 Comparative analysis of our study with SOTA models

5 Conclusions and future work

In this paper, we have presented a comprehensive analysis of methodologies for few-shot face recognition via metric-based learning. The performance is evaluated on an unseen dataset, and we observed an accuracy of over 70% in both ‘real time’ and ‘model testing’ modes for each of the 50 unseen classes, with a processing time of about 100 milliseconds. The study compares the performance of various triplet selection techniques and demonstrates the effectiveness of triplet loss in training a CNN for face recognition tasks. Our results show that a two-stage training approach, incorporating a pre-trained VGG16 as the base feature extraction network, yields promising results in a limited-dataset scenario using few-shot learning. The study also highlights the impact of hyperparameters and data sampling on performance.

While this study provides valuable insights into few-shot face recognition with limited data, there are several directions for future research. Different few-shot learning techniques can be used with cross-domain datasets to further improve the training methodology and training times. Overall, this study lays a foundation for further advances in face recognition with limited data, and future research can build on these findings to address the remaining challenges in this field.