1 Introduction

Biometrics such as face, fingerprint and iris are defined as the physiological characteristics of an individual. There is now a need for more efficient biometric systems that can be easily embedded in smart devices such as tablets and mobile phones. The human face is considered one of the most effective biometric traits. Face recognition (FR) is characterized by its low cost, contactless nature and high acceptability during acquisition [1,2,3]. However, the performance of FR systems is affected by changes in resolution, pose and illumination [4, 5]. To overcome this problem, most FR techniques extract features from face images using well-known algorithms such as the scale-invariant feature transform (SIFT) [6], speeded-up robust features (SURF) [7] or the histogram of oriented gradients (HOG) [8]. In the last few years, researchers in the field of FR have shown that convolutional neural networks (CNNs) give more robust, representative and detailed features that improve the overall system performance. Furthermore, to reach optimal results, a CNN has to be trained on a large number of face images without relying on external pre-processing methods, since failures of these methods can degrade the system accuracy [9,10,11,12,13,14,15,16,17,18].

Improving the recognition accuracy should not come at the expense of protecting biometric data from attackers. For this purpose, encryption techniques could be applied [19]. In practice, however, illumination variations have a great effect on biometric templates, so a minor change in the input produces a different digest. Another drawback of using encryption with biometrics is the need to decrypt the data, which creates an attack point. Recently, cancelable biometrics has attracted considerable attention. In cancelable biometrics, a one-way function transforms the biometric template, and the transformed template is stored instead of the original one. This approach provides revocability, since a different transformation can be used to re-enroll a compromised biometric, and privacy, since the original biometric cannot easily be recovered from the transformed one. Additionally, the recognition accuracy is not degraded, as the statistical characteristics of the features are approximately maintained after transformation [20,21,22].

Motivated by the previous observations, we propose a cancelable multi-biometric FR method in which a fusion network combines deep features which are extracted using multiple CNNs. Security and privacy of biometric data are provided by performing a bio-convolving encryption step on the final facial descriptor.

Our main contributions are as follows:

  1. We propose a new, simple and efficient CNN architecture.

  2. We present an FR method that achieves remarkable results compared to the state-of-the-art techniques in terms of recognition accuracy, specificity, precision, recall and fscore.

  3. We provide protection of biometric data without degradation of the recognition accuracy.

  4. We perform extensive experiments on different datasets.

This paper is organized as follows. Section 2 reviews the related work. Section 3 describes the proposed method. Section 4 presents the experimental results, and Section 5 gives the concluding remarks of the paper.

2 Related work

Recently, CNN-based FR methods have achieved remarkable results on different datasets [23, 24]. To study these methods, the following aspects should be taken into consideration: (a) number and architecture of CNNs; (b) dataset used for training; (c) type of loss function; and (d) the incorporated learning strategies. To enhance the system performance, some operations are applied on face images to generate different types of inputs to train multiple CNNs. These operations could be: (a) cropping of certain face regions [25,26,27,28,29]; (b) utilization of synthesized or frontalized faces [30,31,32]; or (c) utilization of different numbers of training images from different datasets [26, 33]. Kurban et al. [34] proposed an approach in which score level fusion is applied between two different datasets to create a new one. In addition, energy imaging method is used for gesture feature extraction and principal component analysis (PCA) is used for dimensionality reduction. They achieved encouraging performance with high genuine match rate (GMR) and low false acceptance rate (FAR). Li et al. [35] designed a method that improves recognition accuracy through updating the classification model in real time by using self-detection, decision and learning (S-DDL).

Several studies are concerned with improving FR performance, but few are concerned with protecting biometric patterns and preserving users' privacy. Generation of cancelable biometric templates [36,37,38] is an approach that is used to protect users' data. With traditional techniques, biometric templates can be reconstructed through reverse engineering [39]. Cancelable FR systems can be classified by the number of biometrics into: (a) unimodal systems, in which cancelability is achieved using a single biometric trait, and (b) multi-biometric systems, in which multiple traits are used [22]. Cancelable biometric templates can be produced by one of two approaches: (a) the cryptosystem approach [40,41,42], in which a key is obtained from the original data, or (b) the transformation-based approach, in which discrimination between templates is enhanced using original templates with a specific key [43]. In practice, several techniques are used to provide cancelability for both unimodal and multi-biometric systems, such as bio-convolving with random kernels, non-invertible geometric transforms, random projections or cancelable biometric filters [44]. In this work, bio-convolving with random kernels is adopted due to its simplicity and great success with CNN-based FR methods.

3 The proposed method

Figure 1 presents the pipeline of the proposed approach, which consists of four main stages: (a) detection of different facial regions; (b) extraction of deep features using multiple CNNs; (c) utilization of a fusion network to form a discriminative facial descriptor; and (d) application of bio-convolving with random kernels to protect biometric data from different attacks.

Fig. 1 The proposed FR method

3.1 Detection of facial regions

Face, nose, eyes and mouth regions are detected from the original face images [45]. Figure 2 shows examples of the detection and separation operations. Eyes, nose and mouth are very effective regions on which several changes are clearly noticed. These changes include laughing, closing eyes, opening mouth or wearing glasses. Figure 3 depicts the visual activations of the first five convolutional layers of the proposed CNN model for each facial region.

Fig. 2 Detection and separation of different facial regions

Fig. 3 Visual activations of the first five convolutional layers of the proposed CNN model

3.2 Deep feature extraction

Multiple CNNs of the same architecture are used to extract deep features from the detected facial regions. We propose a CNN model that includes 22 convolutional [46], 8 max pooling [46], 1 batch normalization [47], 2 residual learning framework (ResBL) [48], 3 depth concatenation [49], 1 fully connected [50], 1 feature normalization [51] and 1 softmax [52] layers, as listed in Table 1. A multi-GPU platform could be used to reduce the computation time.

Table 1 Proposed CNN architecture

3.3 Fusion of deep features

We adopt a fusion network to combine the extracted features into a more representative, reliable, useful and detailed facial descriptor. This network consists of two layers: local and fusion layers. The local layer is composed of four parallel CNNs. If we consider that F(i)(.) represents the deep feature vector which is extracted from a CNN i, then the output of the fusion layer could be computed as illustrated in Eq. (1):

$$ {\text{Facial}}\,{\text{descriptor}} = \sum_{i = 1}^{N} \left( {\bf W}^{(i)} {\bf F}^{(i)} \left( . \right) + {\bf b}^{(i)} \right) $$
(1)

where the corresponding weights and bias in the fusion layer are denoted by \( {\bf W}^{\left( i \right)} \) and \( {\bf b}^{\left( i \right)} \), respectively. The number of CNNs is represented by N; here, N  = 4.
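As an illustration of Eq. (1), the following is a minimal sketch in plain Python, assuming each region CNN yields a fixed-length feature vector and that the bias term is accumulated once per CNN inside the summation. The function names (`matvec`, `fuse_features`) and all numeric values are hypothetical toy choices, not taken from the authors' implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse_features(features, weights, biases):
    """Compute sum_i (W_i F_i + b_i) over the N region CNNs (Eq. 1)."""
    out_dim = len(biases[0])
    descriptor = [0.0] * out_dim
    for f_i, w_i, b_i in zip(features, weights, biases):
        projected = matvec(w_i, f_i)
        for k in range(out_dim):
            descriptor[k] += projected[k] + b_i[k]
    return descriptor


# Toy example (assumed values): N = 4 regions, 3-D features, 2-D descriptor.
features = [
    [1.0, 0.0, 2.0],   # face
    [0.5, 0.5, 0.5],   # nose
    [0.0, 1.0, 0.0],   # eyes
    [2.0, 2.0, 2.0],   # mouth
]
weights = [[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]] for _ in range(4)]
biases = [[0.1, 0.2] for _ in range(4)]

descriptor = fuse_features(features, weights, biases)
print(descriptor)
```

In a trained system the weights and biases would be learned jointly with the CNNs; here they are fixed only to make the arithmetic easy to follow.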

3.4 Bio-convolving encryption

This method [44] adopts a convolution approach that leads to generating cancelable biometric templates. A transformed sequence \( f \left[ i \right], \;\;i = 1, \ldots ,F \) is obtained using an original sequence \( r \left[ i \right], \;\;i = 1, \ldots ,F \) through a convolution with a random kernel h[i].

$$ f \left[ i \right] = r \left[ i \right]* h \left[ i \right] $$
(2)
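The transformation in Eq. (2) can be sketched as follows. This is a minimal plain-Python illustration, assuming a full discrete linear convolution and a kernel drawn from a seeded pseudo-random generator, with the seed acting as a user-specific key. The kernel length, the seeds and the helper names are hypothetical. Under this convention the output has length F + K − 1 for a kernel of length K; how the output length is handled in the proposed system is not specified here.

```python
import random


def random_kernel(length, seed):
    """Draw a reproducible random kernel h; the seed plays the role of a key."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]


def convolve(r, h):
    """Full discrete linear convolution f = r * h."""
    f = [0.0] * (len(r) + len(h) - 1)
    for i, r_i in enumerate(r):
        for j, h_j in enumerate(h):
            f[i + j] += r_i * h_j
    return f


r = [0.2, 0.7, 0.1, 0.9, 0.4]         # toy feature sequence (assumed values)
h_old = random_kernel(3, seed=1234)   # kernel issued at enrolment
h_new = random_kernel(3, seed=5678)   # fresh kernel issued after revocation

f_old = convolve(r, h_old)            # stored cancelable template
f_new = convolve(r, h_new)            # re-enrolled template from the same r
```

Re-enrolling with a different kernel yields a different stored template from the same underlying features, which is the revocability property that cancelable biometrics relies on.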

4 Experimental results

To verify the performance of the proposed method, we present experiments on different datasets: FERET [53], LFW [54] and PaSC [55]. In addition, recognition accuracy, specificity, precision, recall and fscore are used for evaluation; see Eqs. 3, 4, 5, 6 and 7. All experiments have been performed using a platform with the specifications shown in Table 2. We used the stochastic gradient descent method to train the CNN. Momentum is set to 0.9. We applied L2 regularization with a weight decay of \( 5 \times 10^{-4} \). We began the CNN training with a learning rate equal to 0.1 and stopped the training after 5 epochs.

$$ {{Accuracy}} = \frac{{{{TP}} + {{TN}}}}{{{{TP}} + {{FP}} + {{FN}} + {{TN}}}} $$
(3)
$$ {{Specificity}} = \frac{{TN}}{{{{FP}} + {{TN}}}} $$
(4)
$$ {{Precision}} = \frac{{TP}}{{{{TP}} + {{FP}}}} $$
(5)
$$ {{Recall}} = \frac{{TP}}{{{{TP}} + {{FN}}}} $$
(6)
$$ F_{{Score}} = \frac{{2 \times {{Recall}} \times {{Precision}}}}{{{{Recall}} + {{Precision}}}} $$
(7)

where TP = true positive, FN = false negative, FP = false positive and TN = true negative.
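Equations (3) to (7) translate directly into code. The sketch below is a minimal plain-Python version; the function name and the confusion-matrix counts are hypothetical and are not results from this paper. It does not guard against zero denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Evaluate Eqs. (3)-(7) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    specificity = tn / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * recall * precision / (recall + precision)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Illustrative counts only (assumed, not taken from the experiments).
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```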

Table 2 Platform specifications

4.1 Evaluation of unimodal and multi-biometric FR methods

As mentioned before, the proposed method adopts an FR technique in which four regions are detected from the original face images. Firstly, we studied the performance of the unimodal FR method to know the most effective part of a face image in the recognition process. The unimodal FR method uses a specific region: face, nose, eyes or mouth to train a single CNN. Table 3 illustrates the results of the unimodal FR method using different facial regions.

Table 3 Experimental results of unimodal FR using different face regions for different datasets

We observe from Table 3 that the unimodal FR system based on the face region achieves better results than those based on other regions, as the face region contains plenty of features. The nose region comes next, as the nose changes less than the eyes and mouth under different positions and expressions of persons in images. Table 4 gives the comparison results of the unimodal FR system based on the face region using state-of-the-art CNNs. To guarantee a fair comparison, hyper-parameters are tuned for all methods.

Table 4 Comparison results of the unimodal FR system using state-of-the-art CNN models

Table 4 shows that the proposed CNN model is superior to other state-of-the-art CNN models. With the proposed CNN, the recognition accuracy reaches 97.14% on the FERET dataset compared to 97.02% for CoCo loss, 97.94% on the LFW dataset compared to 97.81% for CoCo loss, and 95.42% on the PaSC dataset compared to 95.27% for CoCo loss.

The performance of the proposed multi-biometric FR system has been studied. Table 5 provides the comparison results of the proposed multi-biometric FR system using the state-of-the-art CNN models.

Table 5 Comparison results of the proposed multi-biometric FR system using state-of-the-art CNN models

From Table 5, compared to the state-of-the-art CNN models, the proposed CNN achieves remarkable results. With the proposed model, the recognition accuracy reaches 98.89% on the FERET dataset compared to 98.73% for CoCo loss, 98.93% on the LFW dataset compared to 98.82% for CoCo loss, and 97.38% on the PaSC dataset compared to 97.27% for CoCo loss. Figure 4 depicts the receiver operating characteristic (ROC) curves of the proposed CNN model and CoCo loss model for different datasets.

Fig. 4 ROC plots of the multi-biometric systems using the proposed and CoCo loss CNNs on (a) FERET dataset, (b) LFW dataset and (c) PaSC dataset

As shown in Fig. 4, the proposed model gives better results than those of CoCo loss CNN on FERET and LFW datasets, while on PaSC dataset the performance of both CNN models is almost the same. Furthermore, Fig. 5 illustrates the ROC plots of both multi-biometric and unimodal systems on different datasets.

Fig. 5 ROC plots of both multi-biometric and unimodal systems on (a) FERET dataset, (b) LFW dataset and (c) PaSC dataset

Figure 5 confirms the superiority of the proposed multi-biometric FR method to the unimodal one as the area under the curve for the multi-biometric system is greater than that of the unimodal system.
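The area under an ROC curve can be estimated from sampled operating points with the trapezoidal rule. The sketch below is a minimal plain-Python illustration; the function name and the curve points are hypothetical and are not read from Fig. 5.

```python
def auc_trapezoid(fpr, tpr):
    """Trapezoidal area under an ROC curve; fpr must be sorted ascending."""
    area = 0.0
    for k in range(1, len(fpr)):
        width = fpr[k] - fpr[k - 1]
        mean_height = (tpr[k] + tpr[k - 1]) / 2.0
        area += width * mean_height
    return area


# Illustrative operating points only (assumed values).
fpr = [0.0, 0.1, 0.5, 1.0]
tpr_multi = [0.0, 0.90, 0.98, 1.0]   # hypothetical multi-biometric curve
tpr_uni = [0.0, 0.80, 0.95, 1.0]     # hypothetical unimodal curve

auc_multi = auc_trapezoid(fpr, tpr_multi)
auc_uni = auc_trapezoid(fpr, tpr_uni)
print(auc_multi, auc_uni)
```

A curve that lies above another at every false positive rate necessarily has the larger area, which is the comparison made between the two systems.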

4.2 Evaluation of bio-convolving method

The proposed method uses bio-convolving encryption to provide protection of biometric data without degradation in the system accuracy. To prove that, Table 6 illustrates the change in recognition accuracy for both unimodal and multi-biometric FR methods after applying bio-convolving and bloom filter [64] methods.

Table 6 Cancelable biometric effect on recognition accuracy for unimodal and multi-biometric systems

Table 6 shows that the recognition accuracy is slightly affected after applying bio-convolving. In addition, Fig. 6 shows a graphical comparison between different cancelable techniques and their influence on recognition accuracy for both unimodal and multi-biometric FR systems.

Fig. 6 Cancelable biometric influence on recognition accuracy for unimodal and multi-biometric FR

The ability of the bio-convolving method to provide security and privacy of users' data can be verified by performing encryption and decryption operations on a number of facial images using different de-convolution masks, as shown in Fig. 7. Figure 8 presents the probability density functions (PDFs) of the mean square error (MSE) and normalized absolute error (NAE) for both correct and slightly different de-convolution masks.

Fig. 7 Encryption and decryption operations on a number of facial images using different de-convolution masks

Fig. 8 PDF of MSE and NAE
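The two error measures can be sketched as follows, assuming their commonly used definitions: MSE as the mean of squared differences, and NAE as the sum of absolute differences divided by the sum of absolute original values. These definitions, the function names and the pixel values are assumptions for illustration; the exact formulas used to produce Fig. 8 are not stated in the text.

```python
def mse(original, recovered):
    """Mean square error between two equal-length sequences."""
    total = sum((o - r) ** 2 for o, r in zip(original, recovered))
    return total / len(original)


def nae(original, recovered):
    """Normalized absolute error relative to the original signal."""
    num = sum(abs(o - r) for o, r in zip(original, recovered))
    den = sum(abs(o) for o in original)
    return num / den


# Toy flattened pixel values (assumed, not experimental data).
original = [10.0, 20.0, 30.0, 40.0]
with_correct_mask = [10.0, 20.0, 30.0, 40.0]   # ideal recovery
with_wrong_mask = [12.0, 17.0, 35.0, 36.0]     # recovery with a mismatched mask

print(mse(original, with_correct_mask), nae(original, with_correct_mask))
print(mse(original, with_wrong_mask), nae(original, with_wrong_mask))
```

Both measures are zero for a perfect recovery and grow as the recovered image departs from the original, which is why their distributions separate correct from incorrect de-convolution masks.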

As illustrated in Figs. 7 and 8, correct and incorrect de-convolution results can be easily distinguished. Overall, the experimental results show that the proposed multi-biometric FR method is superior to the unimodal one, because it exploits multiple CNNs to obtain a variety of features from different facial regions and applies fusion to get more appropriate and sufficient facial descriptors. On the other hand, the utilization of four CNNs adds complexity to the proposed FR method. Furthermore, training a single CNN model takes about 4 h, so the proposed method is time-consuming to train.

5 Conclusions

In this paper, we presented a cancelable multi-biometric FR method that achieves a promising performance across different datasets. Thanks to depth concatenation and the residual learning framework, we proposed an efficient CNN architecture. In the proposed method, multiple CNNs extract deep features from different facial regions, a fusion network combines these features into a more representative, reliable and detailed output, and finally, a bio-convolving method maintains privacy and security of biometric templates without affecting the recognition accuracy. The experiments on the FERET, LFW and PaSC datasets demonstrated that (a) the proposed method outperforms the state-of-the-art techniques in the presence of cancelability and (b) the multi-biometric system achieves excellent results compared to the unimodal one.