1 Introduction

Facial recognition technology has become one of the most widely used artificial intelligence applications today. Although it has not yet reached its full potential, it has matured enough to support a wide range of applications, and its uses have diversified considerably; the most prominent among them remain security-related applications.

The mission of a face recognition system is to identify or verify people by comparing and analyzing patterns extracted from facial features, generally in order to authenticate and distinguish human faces in a photograph or a video. A verification system performs a one-to-one matching process, where the features extracted from the user at verification time are compared only with his/her features extracted and stored in the system database during enrollment. An identification system performs a one-to-many matching process, where the system must match the input face image against all identities stored in the database. This paper focuses on the latter mode of use.

Generally, in addition to the complexity of the matching process, which is time consuming, a face identification system faces several issues such as significant changes due to age, the appearance of a beard or a mustache, the wearing of glasses, partial occlusion of the face, or difficult viewing angles where not all facial features are visible. In this paper, we are interested in identification systems in the case of multi-view face acquisitions.

Multi-view face recognition, which aims to handle variations in face pose, is a more difficult and complex task than frontal face recognition, mainly because of the nonlinear variations present in the data space. It has therefore attracted considerable research effort, both for its potential applications and for the challenge it presents.

In several critical present-day applications, such as surveillance or criminal identification systems, only a few face images per identity are usually available in the training database, and they are captured from different angles of view. This is a major issue for identification systems that aim at high precision, especially those based on Convolutional Neural Networks (CNNs) [1, 17, 22], which have proved robust for several applications including frontal face recognition. However, CNNs require several images per class for the training stage, which is not always possible for real-life identification systems, nor for multi-view face recognition. In addition, classical face recognition methods that focus on the local manifold structure can be efficient even with few face images per identity for training, but only in the case of frontal face identification. To achieve this trade-off and deal with these special circumstances, we believe that Siamese Neural Networks [2,3,4] are the best alternative, since they can learn from a small number of samples per subject using the few-shot learning technique [5]. In the literature, the proposed multi-view facial identification systems are built on databases containing several images per identity.

In this paper, we propose a few-shot multi-view face identification system, based on the Siamese Neural Network (SNN) [2, 6], for the case where only a few images per angle of view per identity are available in the training set [7]. We then compare this system with two CNN models trained from scratch and with the pre-trained VGGFACE model [6]. It should be mentioned that the proposed system requires at least two images for the training process.

The rest of the paper is organized as follows. Related works are presented in Sect. 2. Then, in Sect. 3, we describe the proposed system in detail. Our evaluation and experimental results are presented in Sect. 4. Finally, our conclusion and perspectives are given in Sect. 5.

2 Related works

Most face identification systems perform training using multiple images per subject during the feature extraction process. In several applications such as access control, surveillance systems, and criminal identification systems, usually only a few face images per identity are available, and they are taken from different angles, which makes the development of few-shot multi-view face identification systems very important for such applications. In the literature, the various methods used to build multi-view face identification systems can be classified into three categories: Machine Learning-based systems, Deep Learning-based systems, and hybrid systems that combine Machine Learning and Deep Learning methods.

Regarding Machine Learning algorithms, many works have used classical algorithms for both feature extraction and classification. Anand et al. [8] used the Local Binary Pattern algorithm for feature extraction followed by the Euclidean distance for classification. Kurita et al. [9] proposed to obtain aligned principal components from prior knowledge learned by principal component analysis on multi-view images, and then synthesized a virtual view of the observed face image and a frontal one using a linear object class. Fouad et al. [10] compared two feature extraction algorithms for face recognition, PCA and LDA, in order to determine the best technique. Yuehang et al. [11] presented a framework to improve the efficiency and the low accuracy of a multi-view face recognition system, based on a cascade face detector and an improved distance model built on DLIB face alignment. Li et al. [12] proposed a hybrid system composed of support vector regression and classification methods for multi-view face detection and recognition, where support vector regression detects the head pose in order to choose the detector for the specific view. Moujahdi et al. [13] used the LST [14] and SVDA [15] methods for feature extraction, together with a model for head pose estimation in a 2D image [16]; this model computes the accurate angle of view of an individual before starting the recognition task, thus effectively dealing with pose variability within the same class through inter-communication between several KNN classifiers. Tuncer et al. [17] proposed a new face recognition architecture based on Local Cross-Pattern, wavelet, and fuzzy logic methods for feature extraction, and on SVM, KNN, LDA, and KDA methods for classification.

Fig. 1

The general architecture of the Siamese Neural Network system

Fig. 2

The main process operations of the proposed system during the test stage

We believe that Machine Learning techniques typically improve in efficiency and accuracy when ever-increasing amounts of data about the views are processed. However, the major drawback of this category is the difficulty of generating relevant and discriminating features. In addition, building a Machine Learning model for a large-scale system is computationally expensive. Thus, improving the quality of feature extraction, measured by its ability to represent and discriminate face samples, has been one of the main challenges faced by the multi-view face recognition research community in recent years. Deep Learning techniques, and more specifically Convolutional Neural Networks (CNNs), are currently the most widely used techniques to address this challenge.

In the Deep Learning category, Zhu et al. [1] created a deep learning model named multi-view perceptron (MVP) to separate the identity and view representations of any multi-view face image. Cao et al. [18] proposed a Deep Residual Equivariant Mapping (DREAM) block that adds a residual to the input deep representation in order to transform face images from a profile pose to a canonical pose. Wanshun et al. [19] proposed a pose auto-augment framework based on a convolutional neural network model, where data augmentation is launched before training. Xiongjun et al. [20] proposed a deep convolutional framework based on the SphereFace-20 model and Batch Normalization (BN). Meddad et al. [21] proposed a hybrid face identification system based on a compressed CNN model with an indexation and parallelization method suitable for embedded devices.

We can say that Deep Learning is effective when a huge amount of data is fed into the neural network architectures during training. The main disadvantage is that Deep Learning approaches are data greedy: they require a large amount of training data, which is not always available in real-world applications, where the data rarely exceed a few samples per class, for instance in a company with only 30 employees. A more recent variant of deep learning algorithms, the Siamese Neural Network, is nowadays used for several applications that do not have a large amount of training data. For example, Bromley et al. [22] proposed an algorithm based on an SNN model for the verification of signatures written on a pen-input tablet. Chopra et al. [23] proposed a discriminative method for learning complex similarity metrics for face verification. Siamese Networks are a metric learning method, and they perform recognition using similarity scores.

Fig. 3

Architectures of the two proposed CNN Models

For hybrid Machine Learning and Deep Learning algorithms, Sarhan et al. [24] proposed a combined adaptive deep learning vector quantization (CADLVQ) classifier with a majority voting algorithm for classification and SURF for feature extraction, for only three different views. Kisku et al. [25] presented a multi-appearance fusion of the Generalization of Linear Discriminant Analysis and Principal Component Analysis (PCA) for multi-view face verification, using an SVM for binary classification. Vareto et al. [26] focused on the open-set face identification problem, evaluating both partial least squares (PLS) and multilayer perceptron (MLP) classification models in the pursuit of an approach that is not directly dependent on the gallery set size. In fact, the authors create a voting scheme (candidate list) and a collection of either PLS or MLP binary models, specified as hashing functions, to assess whether the requested subject is known or unknown. The subject is recognized if he or she stands out among the other candidates on the list. Hybrid Machine Learning and Deep Learning algorithms yield efficient systems by combining the relevant feature extraction of Deep Learning approaches with the good classification performance of Machine Learning algorithms.

Fig. 4

Training process of the scenario 1 (i.e., training samples with the same angle)

Fig. 5

Training process of scenario 2 (i.e., training samples with the same angle of view, then with different angles)

The main drawback of hybrid systems is that feature extraction from the image is slow when using Machine Learning algorithms, and if this step is not done well, the Machine Learning classifier cannot correctly predict the identity, since it depends entirely on the feature vector.

In this paper, to overcome the data scarcity limitation of classical deep learning algorithms, we use a Convolutional Siamese Neural Network for few-shot multi-view face identification.

3 Proposed system

In this section, we present our convolutional Siamese neural network for multi-view face identification, with a CNN encoder implemented in two different architectures.

3.1 Siamese neural network model

The Siamese Neural Network model usually takes two different inputs, an image A and an image B, and builds comparable feature vectors. As shown in Fig. 1, the Siamese neural network contains two identical Convolutional Neural Networks (sub-networks) with the same parameters, configuration, and weights. These sub-networks are used to compute the distance between the two inputs by comparing the feature vectors extracted by the CNN models.

Table 1 Overall face identification accuracy on the Schneiderman database using training Scenarios 1 and 2 with the full training set, with binary cross-entropy loss and contrastive loss
Table 2 Overall face identification accuracy on the Umist database using training Scenarios 1 and 2 with the full training set, with binary cross-entropy loss and contrastive loss

First, image A is fed into the first model and, after passing through the convolutional layers followed by a fully connected layer, a feature vector F(A) is extracted. Image B is passed through the second CNN model, which is identical to the first one in terms of layers, weights, and parameters, to extract the second feature vector F(B). Second, we compare the two face vectors by computing the distance between them. This distance should be smaller than the security threshold of the system if the two inputs belong to the same identity.

$$\begin{aligned} d\left( F(A),F(B)\right) = \sqrt{\sum \nolimits _{i=1}^{n} \left( F_{i}(A)-F_{i}(B)\right) ^2 } \end{aligned}$$
(1)

where F(A) and F(B) are the face vectors of images A and B, respectively. After each feature extraction phase, a distance value (the Euclidean distance, see Eq. 1) and the loss are computed to train the sub-networks. The loss functions used in our implementation are the binary cross-entropy (see Eq. 2) and the contrastive loss (see Eq. 3).

$$\begin{aligned} \textrm{Loss} = - \left[ y \cdot \log \left( P(y)\right) + (1-y) \cdot \log \left( 1-P(y)\right) \right] \end{aligned}$$
(2)

where y is the true label and P(y) is the predicted probability.

$$\begin{aligned} \textrm{Loss} = (1-y) \cdot \frac{1}{2} {D}_w^2 + y \cdot \frac{1}{2} \max \left( 0, m-{D}_w\right) ^2 \end{aligned}$$
(3)

where y is the true label: y equals 0 when the two inputs are similar and 1 when they are dissimilar, \({D}_w\) is the distance between the feature vectors of the input images, and m is the margin. In the test phase, shown in Fig. 2, we feed the test image and all reference images to the SNN architecture, and then take the label of the reference image with the smallest distance to the test image among the reference images of all identities. In the CNN encoder module, any type of neural network model and architecture can be used. In the next sub-section, we present the convolutional neural network models used in our SNN system.
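As an illustration, the following minimal PyTorch sketch shows how the shared encoder, the Euclidean distance of Eq. 1, and the contrastive loss of Eq. 3 fit together; the `encoder` module, the batch layout, and the margin value are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn.functional as F

def siamese_forward(encoder, img_a, img_b):
    # The same encoder (shared weights) embeds both inputs,
    # then the Euclidean distance of Eq. 1 is computed per pair.
    feat_a = encoder(img_a)
    feat_b = encoder(img_b)
    return F.pairwise_distance(feat_a, feat_b, p=2)

def contrastive_loss(distance, y, margin=2.0):
    # Eq. 3: y = 0 for similar pairs, y = 1 for dissimilar pairs.
    # The margin value m = 2.0 is an assumption, not taken from the paper.
    similar_term = (1 - y) * 0.5 * distance.pow(2)
    dissimilar_term = y * 0.5 * torch.clamp(margin - distance, min=0).pow(2)
    return (similar_term + dissimilar_term).mean()
```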

Table 3 Overall face identification accuracy on the Schneiderman database using training Scenario 2 with only one image per angle per identity in the training set, with contrastive loss

3.2 CNN structures

As shown in Fig. 3, which presents the two CNN models used in our SNN system, each model consists of a total of six layers: five convolutional layers followed by a fully connected layer. The images fed to the models are of size \({\textbf {112}} \times {\textbf {92}}\).

In the first convolutional model, the first convolutional layer has 8 kernels of size \({\textbf {3}} \times {\textbf {3}}\), followed by a downsampling layer (i.e., max pooling). The second layer is also convolutional, with 12 kernels of the same size as those of the first layer, followed by a max-pooling layer. The third layer is a convolutional layer with 16 kernels of size \({\textbf {3}} \times {\textbf {3}}\), followed by a downsampling layer. The last two convolutional layers have 32 and 64 kernels, respectively, of size \({\textbf {3}} \times {\textbf {3}}\), each followed by a ReLU function. The last layer of the first model is a fully connected layer with 512 neurons followed by a ReLU function.

In the second convolutional model, the first convolutional layer has 8 kernels of size \({\textbf {3}} \times {\textbf {3}}\), followed by a downsampling layer (i.e., max pooling). The second layer is also convolutional, with 16 kernels of the same size as those of the first layer, followed by a max-pooling layer. The third layer is a convolutional layer with 32 kernels of size \({\textbf {3}} \times {\textbf {3}}\), followed by a downsampling layer. The last two convolutional layers have 64 and 128 kernels, respectively, of size \({\textbf {3}} \times {\textbf {3}}\), each followed by a ReLU function. The last layer of the second model is a fully connected layer with 1028 neurons followed by a ReLU function.

The optimizer used for the training process is the Adam optimizer, with a learning rate of 0.0001 and 100,000 iterations.
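For illustration only, a possible PyTorch sketch of the first encoder is given below; the input is assumed to be a single-channel \({\textbf {112}} \times {\textbf {92}}\) face image, and the padding and pooling window size are assumptions, since the paper does not report them.

```python
import torch
import torch.nn as nn

class CNNEncoder1(nn.Module):
    """Sketch of CNN model 1: five 3x3 convolutional layers (8, 12, 16, 32, 64
    kernels) followed by a 512-unit fully connected layer, as described above."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Padding of 1 and 2x2 pooling are assumptions; the paper only gives
            # kernel counts and sizes, and mentions no activation for these blocks.
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.MaxPool2d(2),
            nn.Conv2d(8, 12, kernel_size=3, padding=1), nn.MaxPool2d(2),
            nn.Conv2d(12, 16, kernel_size=3, padding=1), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # LazyLinear infers the flattened size from the 112 x 92 input at the first call.
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(512), nn.ReLU())

    def forward(self, x):
        return self.fc(self.features(x))

encoder = CNNEncoder1()
# Adam optimizer with a learning rate of 0.0001, as stated above.
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
```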

4 Experimental results

In this section, we evaluate the accuracy of several identification systems including the proposed one following several scenarios of training and test. We will present first the used datasets, then we will describe the training process scenarios, and finally, we will present and discuss the results.

4.1 Databases description and evaluation

In this paper, we have used two multi-view face databases: Umist [27] and Schneiderman [28]. The Umist database contains 475 images of 20 identities, where each identity has 19 to 36 images at various angles from the left profile to the right profile. The Schneiderman database contains 6660 images of 90 identities, where each identity has 76 images taken every 5 degrees from the right profile to the left profile. To evaluate our system, we created two training sets and one test set. First, we split the Umist database into 5 angles of view: 0, 20, 60, 80 and 90, and took 3 images per angle per identity to build the training set. We also split the Schneiderman database into 10 angles of view: +10, +20, +40, +60, +80, 0, \(-20\), \(-40\), \(-60\) and \(-80\), and then took 6 images per angle of view per identity to build the training set. Second, we took one image per angle per identity as a test image. Finally, after the training phase, we need a small dataset that contains only one image per identity, used as the reference image of that identity, to be compared with the test image during the test phase. To evaluate our system, we created one reference dataset per angle by choosing one image per angle per identity. For example, if we have 3 angles, 20, 30 and 40, we create 3 reference datasets, each containing one image per identity for the angles 20, 30 and 40, respectively.
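A minimal sketch of this per-angle reference-set construction is shown below; the `(identity, angle, path)` sample layout is a hypothetical organization of the data, not the actual file structure of the Umist or Schneiderman databases.

```python
import random
from collections import defaultdict

def build_reference_sets(samples, angles):
    """Select one reference image per identity for every angle of view.

    `samples` is assumed to be an iterable of (identity, angle, image_path)
    tuples; the real databases may be organized differently.
    """
    candidates = defaultdict(list)
    for identity, angle, path in samples:
        candidates[(angle, identity)].append(path)

    reference_sets = {angle: {} for angle in angles}
    for (angle, identity), paths in candidates.items():
        if angle in reference_sets:
            reference_sets[angle][identity] = random.choice(paths)
    # reference_sets[angle][identity] -> one reference image path
    return reference_sets
```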

The metric used to evaluate the performance/accuracy of the models is computed as follows:

$$\begin{aligned} \text {Accuracy} = \frac{\text {Number of correct predictions}}{\text {Total number of predictions}} \end{aligned}$$
(4)
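As a hedged illustration of how the test-phase matching of Fig. 2 and the accuracy of Eq. 4 could be computed, the sketch below assumes `encoder` is a trained sub-network and that `test_set` and `reference_set` are hypothetical dictionaries mapping each identity to a preprocessed image tensor.

```python
import torch

@torch.no_grad()
def identification_accuracy(encoder, test_set, reference_set):
    """Identify each test image by the reference identity at the smallest
    Euclidean distance (Eq. 1), then compute the accuracy of Eq. 4."""
    ref_ids = list(reference_set.keys())
    # Each tensor is assumed to have shape (1, 1, 112, 92).
    ref_feats = torch.cat([encoder(reference_set[i]) for i in ref_ids])
    correct = 0
    for true_id, image in test_set.items():
        distances = torch.norm(encoder(image) - ref_feats, dim=1)
        predicted_id = ref_ids[int(torch.argmin(distances))]
        correct += int(predicted_id == true_id)
    return correct / len(test_set)
```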

4.2 Training process scenarios

We propose in this paper two training process scenarios:

  • Scenario 1: Training using only images of the same angle of view (see Fig. 4)

  • Scenario 2: Training using images with the same angle of view then using different angles (see Fig. 5)

In our system, we have two inputs: the first one represents the identity whose similarity or dissimilarity with the second one we want to learn. If the second input is an image of the same identity as the first input, we have a similarity learning process; if it is an image of a different identity, we have a dissimilarity learning process.

Table 4 Overall face identification accuracy on the Umist database using training Scenario 2 with only one image per angle per identity in the training set, with contrastive loss
Fig. 6

Overall face identification accuracy of the Schneiderman database using scenario 2 of training

Fig. 7

Overall face identification accuracy of Umist database using scenario 2 of training

As shown in Fig. 4, for the similarity training process, we take an image of identity 1 at angle 20 as the first input and another image of the same identity at angle 20 as the second input, to learn the similarity; in the next iteration, we keep the same first input and pair it with another image of the same identity at angle 20, different from the one used in the previous iteration.

For the dissimilarity training process, we take an image of identity 1 at angle 20 as the first input and an image of identity 2 at angle 20 as the second input; in the next iteration, we keep the same first input and pair it with an image at angle 20 of the next different identity.

As shown in Fig. 5, for the similarity training process, we take an image of identity 3 at angle 80 as the first input and another image of the same identity at the same angle as the second input; in the next iteration, we keep the same first input and pair it with an image of the same identity at angle 60, in order to learn the similarity across different angles of view.

For the dissimilarity training process, we take an image of identity 3 at angle 80 as the first input and an image of identity 4 at the same angle as the second input; in the next iteration, we keep the same first input and pair it with an image of identity 4 at angle 60, in order to learn the dissimilarity across different angles of view.

In the training process, we train all models using all images of the training set for scenarios 1 and 2, and we also train them, for scenario 2 only, using a single image per angle per identity in order to reduce the size of the training set.

We designed these two training scenarios to enable the system to learn from multi-view face images with limited samples per subject and, in particular, with only one image per angle per identity, as illustrated by the pair-generation sketch below.
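The following simplified sketch illustrates how similar (label 0) and dissimilar (label 1) training pairs could be generated for the two scenarios; the `train_samples` structure (identity -> angle -> list of images) and the random choice of the negative identity are assumptions made for illustration.

```python
import random

def make_pairs(train_samples, scenario=2):
    """Build (image_a, image_b, label) pairs: 0 = same identity (similarity),
    1 = different identities (dissimilarity), matching the convention of Eq. 3.

    Scenario 1 pairs the anchor only with images of the same angle; Scenario 2
    additionally pairs it with images taken at the other angles of view.
    """
    pairs = []
    identities = list(train_samples.keys())
    for identity, views in train_samples.items():
        for angle, images in views.items():
            anchor = images[0]
            other_id = random.choice([i for i in identities if i != identity])
            # Same-angle pairs (both scenarios).
            pairs += [(anchor, img, 0) for img in images[1:]]
            pairs += [(anchor, img, 1) for img in train_samples[other_id].get(angle, [])]
            if scenario == 2:
                # Cross-angle pairs with the same and with the other identity.
                for other_angle in views:
                    if other_angle != angle:
                        pairs += [(anchor, img, 0) for img in views[other_angle]]
                        pairs += [(anchor, img, 1)
                                  for img in train_samples[other_id].get(other_angle, [])]
    return pairs
```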

4.3 Results and discussion

It should be noted that the CNN models 1 and 2 cited in this sub-section are the CNN models presented in Sect. 3.2. We also recall that we used two loss functions (i.e., binary cross-entropy and contrastive loss) and that all tested CNN models were trained from scratch. As shown in Table 1, for scenarios 1 and 2 using all images of the training set, we found that, for the Schneiderman database, the best accuracy is between 95.6% and 100% when using VGGFACE, and for the Umist database (see Table 2) the best accuracy is between 95% and 97% when using SNN model 2. These results show that the accuracy of VGGFACE improves as the number of training images increases, since its accuracy is better on the Schneiderman database than on the Umist database. They also show that the SNN can preserve accuracy even with a minimal number of training images. In practice, having several images per class for the training stage is not always possible for multi-view face identification, which is why we repeated the evaluations using only one image per angle per identity for the training stage (see Tables 3 and 4).

As shown in Tables 3 and 4, for scenario 2 using only one image per angle per identity during the training stage, SNN model 2 obtains the best results, with an accuracy between 77% and 92% for the Umist database (see Table 4) and between 58.1% and 98.7% for the Schneiderman database (see Table 3). These results show that the SNN model preserves its performance in real-life circumstances of multi-view identification, compared with the classical CNN and VGGFACE models.

Figures 6 and 7 summarize the results of Tables 3 and 4 and show the superiority of the proposed model in the case of few-shot multi-view face identification.

As shown by the experimental results, the classification method used is well suited to building a multi-view face identification system with only a few training samples. The distance computed between the feature vector of the test input and the feature vectors of the reference images provides a good prediction module for our system.

The advantage of our approach is the ability to identify an identity using just one image per angle of view. The use of limited samples makes the model simple to train and speeds up the training process. The limitation of our approach is its ineffectiveness when there is only one image per identity, as the SNN model requires two images as inputs.

5 Conclusion

In this paper, we have proposed a few-shot multi-view face identification system based on a Convolutional Siamese Neural Network. We have shown that the proposed system outperforms Convolutional Neural Network models, such as VGGFace and CNN models trained from scratch, in the scenario where only a few images per angle per identity are available for training, which is the case of most real-life face identification applications: for the angle of view +10, the proposed system reaches an accuracy of 74.4%, against 37% for VGGFace and, respectively, 29.1% and 23.2% for the CNNs trained from scratch. In our future work, our main objective is to improve the performance of the proposed system while handling limited image samples per individual in a large-scale multi-view database with a high number of identities.