1 Introduction

A significant number of feature extraction approaches have been proposed in the literature to represent face images. Existing face recognition approaches can broadly be classified into two categories, global and local [1]. Global face recognition methods are usually based on statistical approaches in which features are extracted from the entire face image. Global methods include principal component analysis (PCA) [2], the Fisher linear discriminant (FLD) [2], independent component analysis (ICA) [3], and space–frequency techniques such as the Fourier transform [4]. Although global techniques are the most common in face recognition, much recent work has focused on local feature extraction methods, as they are considered more robust to variations in facial expression, noise, and occlusion. These structure-based approaches deal with local information related to interior parts of the face image. Local sparse descriptors include the scale invariant feature transform (SIFT) [5], Gabor wavelets [6], and local binary patterns (LBP) [7]. Early face recognition research was based on 2D appearance images [8]. However, an increasing number of 3D-shape-based face recognition algorithms have emerged with the advent of 3D scanners. Although the appearance of a face in a 2D image encodes the shape of the face, aside from the face albedo, 2D face recognition alone has not been able to achieve the desired accuracy because of its sensitivity to illumination, pose, and expression variations. On the other hand, 3D face recognition can better handle pose variations (particularly rotations in depth), and the 3D data can also be used to correct the pose of the corresponding 2D texture image [9].

1.1 Overview of our proposed approach

In this paper, we investigate how local features extracted from 3D and 2D information contribute to face recognition when illumination changes, expression changes, and combined changes in expression under illumination are taken into account. All processes in our training and test steps are fully automated. Our system, as illustrated in Fig. 1, includes four main steps:

  1. Preprocessing: each input 3D scan is translated and rotated to align with a reference 3D face, which normalizes the head pose and the relative position between the face and the acquisition equipment.

  2. Feature extraction: a robust feature representation is crucial to the whole system; ideally, the features are invariant to rotation, scale, expression, and illumination. Existing work usually uses raw depth and intensity features. In our system, we combine one global feature and four local features.

  3. Classification: PCA combined with the enhanced Fisher linear discriminant model (EFM) is used to reduce the dimensionality of the feature space. Classification is performed using the normalized correlation metric.

  4. Fusion: the classification scores are fused with a support vector machine (SVM) [10]; beforehand, the scores are normalized with the Min-Max method [11], chosen for its simplicity (a minimal sketch of this normalization is given below).
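For concreteness, here is a minimal sketch of Min-Max score normalization (Python; the function name and defaults are ours, not part of the system):

```python
import numpy as np

def min_max_normalize(scores, s_min=None, s_max=None):
    """Map raw matching scores to [0, 1] with the Min-Max rule.

    s_min / s_max would typically be estimated on the evaluation set
    and reused at test time; here they default to the input range.
    """
    scores = np.asarray(scores, dtype=float)
    if s_min is None:
        s_min = scores.min()
    if s_max is None:
        s_max = scores.max()
    return (scores - s_min) / (s_max - s_min)
```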

Fig. 1 Overview of the proposed framework

1.2 Contributions of this paper

In this paper, we propose a new scheme that combines several local feature extraction methods applied to depth and intensity images to overcome the problems caused by illumination, expressions, and combined changes in expression under illumination. The main contributions of this paper are as follows:

  1. A novel feature extraction method, statistical local features (SLF), is proposed. It is based on computing statistical parameters in a neighborhood of each pixel, such as the mean, standard deviation, variance, skewness, and kurtosis.

  2. A study of the fusion of two multimodal systems: multi-algorithm (built by fusing several local features) and multi-sensor (built by fusing 2D and 3D information).

  3. A study of several feature extraction methods to gain insight into their complementarity.

The remainder of this paper is organized as follows. Section 2 introduces the preprocessing procedure, which is essential for robust recognition. Section 3 describes the features used for face representation. Section 4 reports the experimental results. Finally, Section 5 concludes the paper.

2 Preprocessing

In this paper, we assume that each face is described by a 3D point cloud captured by a 3D laser scanner. Each point cloud consists of thousands of points in 3D space that approximately describe the face surface. We use the CASIA 3D face database. Each point is described by its 3D spatial coordinates and the corresponding RGB color coordinates. In this section, we describe how the original 3D data are preprocessed: the data are registered and the depth and intensity images are obtained, preparing for the feature extraction in the next section. The preprocessing includes two main steps: registration of the 3D face surfaces and acquisition of the depth and intensity images.

2.1 Registration

We use the iterative closest point (ICP) algorithm [12]. ICP serves two purposes in our method. First, it aligns all faces with the first 3D face (neutral expression). Second, it checks whether the detected nose tip is correct.
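The paper does not detail the ICP variant used; under that caveat, the following is a minimal point-to-point ICP sketch (NumPy/SciPy; all names and parameters are ours) illustrating only the alignment step, not the nose-tip check:

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping the paired
    points src onto dst (Kabsch/SVD solution)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T              # D guards against reflections
    t = c_dst - R @ c_src
    return R, t

def icp(source, reference, n_iter=30, tol=1e-6):
    """Align a 3D face point cloud to a reference cloud with plain
    point-to-point ICP."""
    tree = cKDTree(reference)
    src = source.copy()
    prev_err = np.inf
    for _ in range(n_iter):
        dist, idx = tree.query(src)                    # closest points
        R, t = best_rigid_transform(src, reference[idx])
        src = src @ R.T + t                            # apply update
        err = dist.mean()
        if abs(prev_err - err) < tol:                  # converged
            break
        prev_err = err
    return src
```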

2.2 Depth and intensity images

Depth and intensity images are obtained from the registered 3D data. The data are converted into a depth image (see Fig. 2a) and a color image (see Fig. 3a). In most images, the nose is the part of the face closest to the 3D scanner, so it has the largest depth value among all points of the face. For each pixel, the average depth is computed over the \(9\times 9\) neighboring window around it. Then, a \(3\times 3\) window that sums the depths of its pixels is slid over the image, and the nose tip is taken as the center pixel of the window returning the maximum value (see Fig. 2b). After detecting the nose, a sub-image of size \(57\times 47\) centered on the nose tip is extracted (see Figs. 2c, 3b). For the RGB color images, we use the corresponding intensity images (see Fig. 3c). However, due to the quality of the original 3D data, the depth and intensity images usually contain considerable noise, such as holes and outliers. We obtain enhanced images as follows. The preprocessing of the depth images includes noise removal and hole filling. To remove the outliers, the mean is computed over the \(5\times 5\) neighboring window of each pixel (see Figs. 2d, 3d); if the pixel value is less than a given threshold, the pixel is replaced by this mean. The result is shown in Figs. 2e, 3e. Lighting variations strongly influence the appearance of the intensity images; to cope with this problem, histogram equalization is applied to reduce the influence of illumination (see Fig. 3f).
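A compact sketch of the nose-tip detection and outlier removal just described (Python; the window sizes follow the text, but border handling and the threshold value are our assumptions):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def detect_nose_tip(depth):
    """Nose tip = centre of the 3x3 window with the largest summed
    depth, after 9x9 mean smoothing (Sect. 2.2)."""
    smoothed = uniform_filter(depth.astype(float), size=9)  # 9x9 mean
    window_sum = uniform_filter(smoothed, size=3) * 9       # 3x3 sum
    return np.unravel_index(np.argmax(window_sum), depth.shape)

def crop_face(img, nose_rc, h=57, w=47):
    """Extract the 57x47 sub-image centred on the nose tip
    (clipping at the image border is ignored here)."""
    r, c = nose_rc
    return img[r - h // 2 : r - h // 2 + h,
               c - w // 2 : c - w // 2 + w]

def remove_outliers(img, thr):
    """Replace holes/outliers below a threshold by the 5x5 local mean."""
    local_mean = uniform_filter(img.astype(float), size=5)
    cleaned = img.astype(float).copy()
    mask = cleaned < thr            # thr is application-dependent
    cleaned[mask] = local_mean[mask]
    return cleaned
```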

Fig. 2 Preprocessing of the depth image: a the depth image, b detecting the nose tip, c extracted sub-image, d mean image \(5\times 5\), e depth image after removing noise and filling holes

Fig. 3 Preprocessing of the intensity image: a the color image, b extracted sub-image, c intensity image, d mean image \(5\times 5\), e intensity image after removing noise and filling holes, f intensity image after histogram equalization

3 Features for face representation

3.1 Multi-scale local binary patterns (MSLBP)

The original LBP operator was later generalized to deal with different neighborhoods [13]. A local neighborhood is defined as a set of sampling points evenly spaced on a circle centered at the pixel to be labeled; sampling points that do not fall at integer pixel positions are obtained by bilinear interpolation, allowing any radius and any number of sampling points. Formally, given a pixel at (\(x_{c}\), \(y_{c}\)), the resulting LBP can be expressed in decimal form as:

$$\begin{aligned} \hbox {LBP}_{P,R}(x_{c},y_{c})=\sum _{p=0}^{P-1}s(i_{p}-i_{c})2^{p} \end{aligned}$$
(1)

where \(i_{c}\) and \(i_{p}\) are, respectively, the gray-level values of the central pixel and of the \(P\) surrounding pixels on the circle of radius \(R\), and the function \(s(x)\) is defined as:

$$\begin{aligned} s(x)=\left\{ \begin{array}{l@{\quad }l} 1 &{} \mathrm{if}\ x \ge 0 \\ 0 &{}\mathrm{if}\ x < 0 \end{array} \right. \end{aligned}$$
(2)

Given facial depth and intensity images, we generate a set of multi-scale LBP maps for facial representation. Some examples are illustrated in Fig. 4. In this figure, the number of sampling points varies from 8 to 24 and the radius from 1 to 4 pixels. A minimal implementation sketch is given below.
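The following sketch computes \(\hbox {LBP}_{P,R}\) with circular sampling and bilinear interpolation (Python/NumPy; all names are ours):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def lbp(img, P=8, R=1.0):
    """LBP_{P,R} of Eq. (1): threshold P circularly sampled,
    bilinearly interpolated neighbours against the centre pixel."""
    img = np.asarray(img, dtype=float)
    m = int(np.ceil(R))
    yy, xx = np.mgrid[m:img.shape[0] - m, m:img.shape[1] - m]
    centre = img[yy, xx]
    code = np.zeros(centre.shape, dtype=np.int64)
    for p in range(P):
        a = 2 * np.pi * p / P
        coords = np.stack([yy - R * np.sin(a), xx + R * np.cos(a)])
        neigh = map_coordinates(img, coords, order=1)    # bilinear
        code |= (neigh >= centre).astype(np.int64) << p  # Eq. (1)
    return code

# One possible multi-scale configuration (the paper varies P over 8-24
# and R over 1-4; the exact pairings are not specified):
# maps = [lbp(depth, P, R) for P, R in [(8, 1), (16, 2), (24, 3), (24, 4)]]
```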

Fig. 4 The multi-scale LBP for facial depth and intensity images

3.2 Proposed statistical local features (SLF)

The main idea of the proposed method is to compute statistical parameters in the neighborhood of each pixel, using different radii and numbers of neighboring points. The computed parameters are:

3.2.1 The mean

It is defined as:

$$\begin{aligned} \hbox {mean}_{P,R}(x_{c},y_{c})=\frac{1}{P}\sum _{p=0}^{P-1}i_{p} \end{aligned}$$
(3)

where \(i_{p}\) are the gray-level values of the \(P\) pixels sampled on the circle of radius \(R\) around the central pixel (\(x_{c}\), \(y_{c}\)).

3.2.2 Standard deviation

The standard deviation shows how much variation exists around the mean. It is defined as:

$$\begin{aligned} \hbox {std}_{P,R}(x_{c},y_{c})=\sqrt{\frac{1}{P} \sum \nolimits _{p=0}^{P-1}(i_{p}-\hbox {mean}_{P,R}(x_{c},y_{c}))^{2}} \end{aligned}$$
(4)

3.2.3 Variance

The variance is a measure of how far a set of numbers is spread out. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value). It is defined as:

$$\begin{aligned} \hbox {VAR}_{P,R}(x_{c},y_{c})=\frac{1}{P} \sum _{p=0}^{P-1}(i_{p}-\hbox {mean}_{P,R}(x_{c},y_{c}))^{2} \end{aligned}$$
(5)

3.2.4 Skewness

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the central point. It is defined as:

$$\begin{aligned}&\hbox {skew}_{P,R}(x_{c},y_{c})\nonumber \\&\quad = \frac{\frac{1}{P} \sum _{p=0}^{P-1}(i_{p}-\hbox {mean}_{P,R}(x_{c},y_{c}))^{3}}{\left( \sqrt{\frac{1}{P} \sum _{p=0}^{P-1}(i_{p}-\hbox {mean}_{P,R}(x_{c},y_{c}))^{2}}\right) ^{3}} \end{aligned}$$
(6)

3.2.5 Kurtosis

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. It is defined as:

$$\begin{aligned} \hbox {kur}_{P,R}(x_{c},y_{c})= \frac{\frac{1}{P} \sum _{p=0}^{P-1}(i_{p}-\hbox {mean}_{P,R}(x_{c},y_{c}))^{4}}{\left( \sqrt{\frac{1}{P} \sum _{p=0}^{P-1}(i_{p}-\hbox {mean}_{P,R}(x_{c},y_{c}))^{2}}\right) ^{4}} \end{aligned}$$
(7)

Some examples of SLF features are illustrated in Fig. 5; a compact implementation sketch follows.
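Under the same circular sampling scheme as the LBP sketch above, the five SLF maps of Eqs. (3)-(7) can be computed jointly; a minimal sketch (the epsilon guard for flat neighborhoods is our addition):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def slf_maps(img, P=8, R=1.0):
    """SLF maps of Eqs. (3)-(7): mean, standard deviation, variance,
    skewness and kurtosis over the P circular neighbours of each pixel."""
    img = np.asarray(img, dtype=float)
    m = int(np.ceil(R))
    yy, xx = np.mgrid[m:img.shape[0] - m, m:img.shape[1] - m]
    samples = []
    for p in range(P):
        a = 2 * np.pi * p / P
        coords = np.stack([yy - R * np.sin(a), xx + R * np.cos(a)])
        samples.append(map_coordinates(img, coords, order=1))  # bilinear
    s = np.stack(samples)                       # shape (P, H', W')
    mean = s.mean(axis=0)
    var = ((s - mean) ** 2).mean(axis=0)        # Eq. (5)
    std = np.sqrt(var)                          # Eq. (4)
    eps = 1e-12                                 # guard flat neighbourhoods
    skew = ((s - mean) ** 3).mean(axis=0) / (std + eps) ** 3   # Eq. (6)
    kur = ((s - mean) ** 4).mean(axis=0) / (var + eps) ** 2    # Eq. (7)
    return mean, std, var, skew, kur
```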

Fig. 5 The statistical local features (SLF): a facial depth image, b intensity image

3.3 Overview of Gabor wavelet filters

In this paper, we use 2D Gabor filters on the depth and intensity images to characterize a person. Gabor wavelets exhibit desirable properties of spatial localization, orientation selectivity, and spatial frequency selectivity. The representation of faces using Gabor wavelets has been successfully used in 2D and 3D face recognition [14]. This representation describes the facial characteristics in terms of both spatial frequency and spatial relations. A sketch of such a filter bank is given below.
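A minimal sketch of a complex Gabor filter bank with eight orientations at one wavelength (Python/NumPy; the bandwidth heuristic and kernel size are our assumptions, not the paper's parameters):

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(lam, theta, gamma=1.0):
    """Complex 2D Gabor kernel: Gaussian envelope times a plane-wave
    carrier of wavelength lam, oriented at angle theta."""
    sigma = 0.56 * lam                       # common bandwidth heuristic
    half = int(np.ceil(3 * sigma))           # kernel truncated at 3 sigma
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    return envelope * np.exp(1j * 2 * np.pi * xr / lam)

# Eight orientations at a single wavelength (e.g. lambda = 4, the best
# resolution in Sect. 4.5); the phase response is kept, as in the paper:
bank = [gabor_kernel(lam=4, theta=k * np.pi / 8) for k in range(8)]
# phase = [np.angle(fftconvolve(img, k, mode="same")) for k in bank]
```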

3.4 Scale invariant feature transform (SIFT)

The SIFT [5] is a local 2D feature computed at keypoint locations. The interested reader is referred to Lowe's paper [5] for the details of keypoint localization and SIFT feature extraction. The SIFT operator works on each \(\hbox {LBP}_{P,R}\) map separately. Because \(\hbox {LBP}_{P,R}\) highlights the local characteristics of the smooth facial depth and intensity images, many more SIFT keypoints can be detected than in the original images. A statistical analysis on the CASIA 3D face database confirms this: the average number of descriptors extracted from each \(\hbox {LBP}_{P,R}\) depth map is 52, and from each \(\hbox {LBP}_{P,R}\) intensity map 162, while the original facial depth image yields only 14 and the original intensity image only 63. The SIFT descriptors were computed using Lowe's code [15]. Figure 6 shows the SIFT-based keypoints extracted from one depth and one intensity face image and their four associated \(\hbox {LBP}_{P,R}\) maps. To calculate the similarity between a learning face and an evaluation face, their SIFT descriptors are matched using the Euclidean distance (see Fig. 7); a matching sketch is given below.
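As an illustration, SIFT keypoints can be detected and matched with the Euclidean distance using OpenCV; this sketch stands in for Lowe's original code [15], and the acceptance threshold is an arbitrary placeholder:

```python
import cv2

def sift_match_score(img_a, img_b, thr=250.0):
    """Count SIFT descriptor matches between two images (e.g. two
    LBP_{P,R} maps rendered as 8-bit grayscale)."""
    sift = cv2.SIFT_create()
    _, des_a = sift.detectAndCompute(img_a, None)
    _, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:       # no keypoints detected
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_L2)     # Euclidean distance
    matches = matcher.match(des_a, des_b)
    return sum(m.distance < thr for m in matches)
```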

Fig. 6 The SIFT-based keypoints detected from an original depth and intensity facial image and the four associated \(\hbox {LBP}_{P,R}\) maps

Fig. 7 SIFT matches between learning and evaluation faces belonging to a the same identity and b different identities

4 Experimental results

4.1 The CASIA 3D database

We use the CASIA 3D face database [16] to test the proposed authentication system. The database was acquired with a non-contact Minolta VIVID 910 3D scanner operating in fast mode. It contains 123 subjects, each with 37 or 38 images exhibiting individual variations of pose, expression, illumination, combined changes of expression under illumination, and pose combined with expression. These complex variations are challenging for any algorithm. In this section, we study the variations of illumination (images 1-5), expression (images 6-10), and the combined changes of expression under illumination (images 11-15); we therefore use 15 images per subject. We adopt an assessment protocol that separates subjects into two classes, clients and impostors. The client group contains 100 subjects, while the impostor group is divided into 13 impostors for evaluation and 10 for testing. The repartition of images over the different sets is given in Table 1 and summarized in the sketch below.
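For reference, the protocol can be summarized as the following configuration (an illustrative encoding; the exact subject assignments come from Table 1):

```python
# Illustrative encoding of the protocol of Sect. 4.1 (names are ours).
PROTOCOL = {
    "images_per_subject": {
        "illumination": [1, 2, 3, 4, 5],
        "expression": [6, 7, 8, 9, 10],
        "expression_under_illumination": [11, 12, 13, 14, 15],
    },
    "clients": 100,               # subjects used as clients
    "impostors_evaluation": 13,   # impostor subjects for evaluation
    "impostors_test": 10,         # impostor subjects for testing
}
```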

Table 1 Distribution of images over the different sets

4.2 Global feature (PCA + EFM)

For this part, we use a holistic approach: the feature vector of the 2D and 3D images is built by concatenating the rows of the depth and intensity images. We use the PCA + EFM method for dimensionality reduction and class separation, and normalized correlation as the similarity measure. Table 2 shows the error rates on the evaluation and test sets for this global approach. Throughout the tables, EER denotes the equal error rate and RR the recognition rate (\(\hbox {RR} = 100 - \hbox {FRR} - \hbox {FAR}\), where FRR is the false reject rate and FAR the false accept rate). FN, the number of features extracted by the enhanced Fisher linear discriminant model (EFM), is determined experimentally [11]: we vary FN over \(10, 20, \ldots , 200\) and select the value giving the best result. \(P\) denotes the number of points in the pixel neighborhood and \(R\) the radius. The table shows that \(\hbox {PCA} + \hbox {EFM}\) performs poorly on the depth information (3D): the RR is 89.36 %, versus 93.14 % when the intensity information (2D) and the multimodal fusion (3D and 2D) are used. A minimal sketch of this pipeline follows.
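A rough sketch of the pipeline, using PCA followed by a Fisher discriminant step as a stand-in for EFM (EFM proper additionally regularizes the within-class scatter [11]); scikit-learn names, and the dimensions are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_pca_efm(X, y, n_pca=200, n_feat=60):
    """PCA followed by a Fisher discriminant step; n_feat plays the
    role of FN and is tuned on the evaluation set (10, 20, ..., 200).
    n_pca must not exceed the number of training samples."""
    pca = PCA(n_components=n_pca).fit(X)
    lda = LinearDiscriminantAnalysis(n_components=n_feat)
    lda.fit(pca.transform(X), y)
    return lambda Z: lda.transform(pca.transform(Z))

def normalized_correlation(u, v):
    """Similarity used for classification: cosine of the angle
    between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```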

Table 2 Performance of the \(\hbox {PCA} + \hbox {EFM}\) throughout evaluation and test set

4.3 Multi-scale LBP (MSLBP)

For this part, we use the local MSLBP method. Table 3 shows the error rates on the evaluation and test sets for this feature extraction method. The number of sampling points varies from 8 to 24 and the radius from 1 to 4 pixels. The LBP method gives better results for the 3D information than for the 2D information for all four radius values (\(R\)). The fusion of the four radii (MSLBP) improves performance for both 2D and 3D information, and the multi-sensor fusion (\(\hbox {3D} + \hbox {2D}\)) achieves an \(\hbox {EER} = 0.95\,\%\) on the evaluation set and \(\hbox {RR} = 94.54\,\%\) on the test set.

Table 3 Performance of the multi-scale LBP method throughout evaluation and test set

4.4 Statistical local features (SLF)

For this part, we use the proposed local method (SLF). Table 4 shows the error rates on the evaluation and test sets for the SLF method. The number of sampling points varies from 8 to 24 and the radius from 1 to 4 pixels. First, the fusion of the four radii (\(R = 1, 2, 3, 4\)) for the different neighborhoods does not improve performance for the five statistical descriptors. We also notice that for \(R = 3\), \(R = 4\) and \(P = 24\) points (the maximum neighborhood size in our application), we obtain the best result for all statistical descriptors; hence, increasing the number of points in the neighborhood improves the performance of the statistical descriptors. Four descriptors (mean, standard deviation, variance, skewness) give almost the same results. Kurtosis is the worst descriptor, which confirms the visual perception (image quality) of the \(\hbox {kur}_{24,4}\) images (see Fig. 5). The fusion of the five parameters of our local features improves face authentication; performance without kurtosis is better than the fusion of all five statistical parameters, with \(\hbox {EER} = 1.20\,\%\) on the evaluation set and \(\hbox {RR} = 96.32\,\%\) on the test set.

Table 4 Performance of the SLF throughout evaluation and test set

4.5 Gabor wavelets

The family of Gabor filters is characterized by a number of resolutions (frequencies) and orientations. In this work, we concatenate, for each resolution, the eight orientations in the feature vector. Gabor filters are complex-valued, and it is important to exploit the information provided by both the real and imaginary parts of the Gabor coefficients. As in [11], we use the filtered phase responses of the Gabor filters, which we have shown to be more relevant in this application. We evaluate Gabor wavelets for each resolution and for the fusion of the five resolutions. The best results are obtained at resolution \(\lambda = 4\), with \(\hbox {EER} = 1.57\,\%\) and \(\hbox {RR} = 96.13\,\%\); the fusion of the five resolutions does not improve performance.

4.6 MSLBP + SIFT

Table 5 shows the error rates on the evaluation and test sets for the \(\hbox {LBP}_{P,R}\) features combined with SIFT. The number of sampling points varies from 8 to 24 and the radius from 1 to 4 pixels; SIFT is then computed on the MSLBP maps. The table shows that the fusion of the four \(\hbox {LBP}_{P,R}\) maps (\(R = 1, 2, 3, 4\) and \(P = 8, 16, 24\)) plus SIFT gives the best result, with \(\hbox {EER} = 2.48\,\%\) and \(\hbox {RR} = 94.73\,\%\).

Table 5 Performance of the \(\hbox {LBP}_{P,R}\) + SIFT descriptor throughout evaluation and test set

4.7 Fusion of feature representation

Table 6 compares the error rates, scores, and computational load for the five considered descriptors and their fusion. Experiments were conducted in Matlab on an Intel i5 2.50 GHz CPU with 8 GB RAM. From this table, we can infer the following. The SLF features give the best individual results, with \(\hbox {EER} = 1.20\,\%\) on the evaluation set, \(\hbox {RR} = 96.32\,\%\) on the test set, and a low runtime of 0.399 s; this shows the effectiveness of the proposed descriptor compared with all the global and local descriptors studied. The fusion of our proposed SLF descriptor with the MSLBP descriptor combined with SIFT gives the best overall results: we obtain \(\hbox {EER} = 0.98\,\%\), \(\hbox {RR} = 97.22\,\%\), and a runtime of 1.182 s. The fusion of all considered descriptors does not improve the performance compared with the fusion of the two descriptors SLF and MSLBP combined with SIFT. The performance of the proposed system is also compared with the state of the art in 3D and 2D face recognition on the CASIA database; the comparison, based on the recognition rate, is shown in Table 7. The results show that the proposed system achieves a higher average recognition rate than the current systems in the literature tested on the same database. A sketch of the score-level fusion step is given below.
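A minimal sketch of the SVM score-level fusion (scikit-learn; the linear kernel and the layout of the score matrix are our assumptions):

```python
from sklearn.svm import SVC

def fit_score_fusion(scores_eval, labels_eval):
    """Fuse per-descriptor scores with an SVM (Sect. 1.1, step 4).

    scores_eval: (n_trials, n_descriptors) matrix of min-max-normalized
    matching scores from the evaluation set, one column per descriptor
    (e.g. MSLBP, SLF, Gabor, MSLBP+SIFT); labels_eval: 1 for genuine
    (client) trials, 0 for impostor trials.
    """
    fusion = SVC(kernel="linear")
    fusion.fit(scores_eval, labels_eval)
    return fusion

# At test time, fusion.decision_function(scores_test) gives the fused
# score from which EER / RR are computed.
```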

Table 6 Performance of five methods of feature extraction throughout evaluation and test set
Table 7 Comparison of recognition rate with the state of the art (CASIA database)

5 Conclusion

In this work, we presented an automatic multimodal authentication algorithm based on 2D intensity and 3D depth face images. We first used a holistic approach based on PCA dimensionality reduction followed by EFM, and then four local methods: MSLBP (based on coding a local neighborhood), SLF (based on computing statistical parameters in a neighborhood of each pixel), SIFT, and Gabor wavelets. The experiments were carried out on the CASIA 3D database following a protocol designed to address major problems in 3D and multimodal face recognition, including variations in illumination, variations in expression, and combined variations of expression under illumination. From all the experiments carried out, we can conclude that:

  • MSLBP is a better descriptor for the 3D depth modality than for the 2D intensity modality; a significant improvement in performance is obtained by fusing the 3D and 2D modalities.

  • The best results for the SLF descriptor are obtained when the neighborhood size becomes large (\(R = 3\), \(R = 4\)).

  • The fusion of the four LBP maps (\(R = 1, 2, 3, 4\) and \(P = 8, 16, 24\)) + SIFT gives the best result for this descriptor, with \(\hbox {EER} = 2.85\,\%\) and \(\hbox {RR} = 93.50\,\%\).

  • For the Gabor wavelets, the best result is obtained at the first resolution (\(\lambda = 4\)), with \(\hbox {EER} = 1.57\,\%\) and \(\hbox {RR} = 96.13\,\%\); the fusion of the five resolutions does not improve performance.

  • The statistical local descriptor SLF gives the best result compared with all the global and local descriptors studied, which justifies the effectiveness of our proposed descriptor (we obtained \(\hbox {EER} = 1.20\,\%\) and \(\hbox {RR} = 96.32\,\%\)).

  • The fusion of our proposed SLF descriptor with the MSLBP + SIFT descriptor gives the best result, with \(\hbox {EER} = 2.39\,\%\) and \(\hbox {RR} = 96.26\,\%\) for the depth image (3D) and \(\hbox {EER} = 3.56\,\%\) and \(\hbox {RR} = 95.70\,\%\) for the intensity image (2D). Finally, the multi-algorithm and multi-sensor fusion (\(\hbox {3D} + \hbox {2D}\)) gives \(\hbox {EER} = 0.98\,\%\) and \(\hbox {RR} = 97.22\,\%\).

Our approach is fully automatic and has been tested under different shifts, expressions, and illuminations; the performance obtained is stable. For future work, we propose to:

  • Study large rotations of the head,

  • Improve the detection of the nose (since our algorithm uses only the most salient point),

  • Study feature-level fusion for our statistical local features (SLF) method, for an adaptive selection of the best features.