1 Introduction

Face recognition has been widely studied by the vision community over the past few decades. The importance of this technology comes from its use in various applications such as law enforcement, security and access control. Recently, face recognition research has increasingly turned to the single sample face recognition (SSFR) problem [1]. However, recognizing human faces in the SSFR scenario is extremely challenging because only a single reference sample per person is available in the gallery, while probe images exhibit large intra-person variations such as pose, illumination, facial expression and partial occlusion. In particular, pose variation is considered the most complex problem, since out-of-plane rotations of the face produce self-occluded faces [2]. Such rotation alters the shape and appearance of the face so that some discriminative facial details are lost to self-occlusion. This loss of information leads to severe performance degradation in frontal face recognition systems. A vast number of pose invariant face recognition (PIFR) approaches have been introduced to address the pose variation problem from different perspectives. For comprehensive details on the PIFR literature, the reader is referred to the recent surveys [2, 3]. However, most current PIFR approaches may be impractical in the SSFR scenario for the reasons described at the end of Sect. 2.

The main contribution of this paper is a patch-based pose invariant feature extraction method that is efficiently applicable in the SSFR framework. The proposed method extracts pose invariant facial features from landmark-based patches located at the facial components, namely eyebrows, eyes, nose, and mouth, rather than from the whole face image. Since local patches exhibit relatively small out-of-plane rotations compared with the global image, extracting discriminative details from these small regions can produce pose invariant facial features.

The advantages of the proposed method over existing PIFR methods are threefold. First, the proposed method can efficiently handle moderate pose variations (\(\pm \,45^\circ\) yaw) and maintains strong performance under large poses (\(\pm \,60^\circ\) yaw) compared with state-of-the-art pose robust feature extraction methods. Second, it does not require the abundant multi-pose training face data demanded by learning-based PIFR approaches. Third, it is fully automatic and does not need manual landmark annotation.

The remainder of this paper is organized as follows. Section 2 reviews related research on PIFR approaches. In Sect. 3, the proposed method is described in detail. Section 4 presents the parameter settings. The experimental results carried out on a benchmark face database are reported in Sect. 5. Finally, the conclusion of this paper is given in Sect. 6.

2 Related work

This section presents a review of the state-of-the-art PIFR approaches that address the pose problem from different points of view. According to Ding and Tao [2], PIFR approaches can be classified into four categories, namely pose robust feature extraction, multi-view subspace learning, face synthesis, and hybrid approaches. The methods in the first category aim at designing face descriptors that extract discriminative facial features invariant to pose variations. For more details on face feature extraction approaches, the reader is referred to the recent survey [4]. The multi-view subspace learning-based methods use multi-pose face images to establish a shared latent subspace into which the features of different poses are projected and then matched. Face synthesis methods focus on transforming face images from one pose to another, so that two faces originally in different poses can be matched in the same pose. Lastly, hybrid methods are simply a combination of two or more of the previous three groups. The next paragraphs explore contemporary research endeavors in these categories and describe their limitations in the SSFR scenario.

Pose robust feature extraction methods extract facial features by either handcrafted or learning-based descriptors. The handcrafted approaches use a manually designed descriptor to extract features from either landmark-based or random facial keypoints-based patches. Zhou et al. [5] proposed a Huffman local binary pattern (LBP) to extract features from landmark-based patches. The authors applied a divide-and-rule strategy in both representation and classification to recognize faces across pose. Huang et al. [6] combined enhanced landmark-based multi-scale LBP (MSLBP) features with Gabor features by a proposed kernel-level fusion technique. Gao and Lee [7] presented a combined pose invariant scale invariant feature transform and personalized correspondence learning (PISIFT-PCL) method. The approach learns a generic correspondence between the poses to generate virtual patches from which the PISIFT features are extracted. The learning-based approaches extract facial features directly from the raw face images by machine learning techniques such as kernel-based and deep learning-based models. These methods learn to extract pose robust features by training on large-scale multi-pose face images. Duan and Tan [8] proposed a feature learning approach based on spatial self-similarity to extract the subject related information from a local feature by removing its pose related details. Shao et al. [9] proposed a pose invariant face representation learning approach based on sparse many-to-one encoders and a deep convolutional neural network. Ding and Tao [10] exploited convolutional neural networks (CNNs) to extract complementary facial features which are then compressed using a three-layer stacked auto-encoder (SAE).

The multi-view subspace learning-based PIFR approaches divide the nonlinear manifold of multi-pose face images into a separated set of pose spaces. Each pose space is considered as a single view from which pose specific projections to a shared latent subspace are learned. Guo et al. [11] utilized graph embedding to propose a multi-view linear discriminant analysis (MiLDA) for multi-pose face recognition. Cai et al. [12] proposed a regularized latent least square regression (RLLSR) method to map different poses of one person into a single point in the latent pose free space. Wang et al. [13] employed deep learning to design a deeply coupled autoencoder networks (DCAN) method to project samples from two poses into one common discriminating subspace.

Face synthesis methods generate a face image with the desired pose using a 2D or 3D model. In the 2D category, the process starts by fitting a 2D model to the face image, and then 2D geometrical transformations (e.g., piecewise affine and thin-plate spline) are often used to warp face images to the desired pose. Sagonas et al. [14] introduced a robust statistical frontalization (RSF) technique based on an iterative procedure of facial landmark detection and warping to construct frontal face images. Haghighat et al. [15] proposed an improved version of active appearance models (AAM) by adopting automatic facial landmark localization to enhance the AAM initialization in the fitting procedure. Gao et al. [16] employed discriminant appearance models (DAM) and partial least squares (PLS) to propose a view-based pose normalization method. The approaches in the 3D category accomplish face synthesis with the aid of a 3D model. First, the 3D model is fitted to the face image based on facial landmarks. The face image texture is then mapped to the aligned 3D model. Lastly, the textured 3D model is rendered to the desired pose and a new synthesized face image is generated. Ding and Tao [17] exploited a dense grid of 3D landmarks to design a homography-based pose normalization (HPN) method. Deng et al. [18] developed a lighting-aware face frontalization approach based on a generic 3D model with five landmarks and quotient image symmetry. Zhang et al. [19] combined a reference 3D face model, an occlusion localization procedure, a local face symmetry scheme, and Poisson image editing to design a face frontalization method. Recently, deep learning techniques have been incorporated into the 2D and 3D methods to achieve impressive face synthesis results. Huang et al. [20] proposed a deep architecture called the two-pathway generative adversarial network (TP-GAN) to synthesize a frontal face image by exploiting the global structure and local texture of the face image. Kan et al. [21] adopted a stacked progressive auto-encoders (SPAE) deep neural network to convert non-frontal face images to frontal views in a progressive manner.

Hybrid approaches consisting of two or more frameworks from the aforementioned categories have also been proposed for PIFR. Tran et al. [22] developed a disentangled representation learning-generative adversarial network (DR-GAN) approach to learn a generative and discriminative representation which can be used to synthesize frontal face image. Peng et al. [23] exploited a 3D model to synthesize multiple multi-view face images from which a rich feature embedding is learned by a deep neural network. Ding et al. [24] designed a hybrid approach based on combining a 3D-based frontal face synthesis, patch-based facial representation, and transformation dictionary-based subspace learning.

Despite the significant progress in PIFR research, most of the reported approaches may not be applicable under the strict SSFR condition for a number of reasons. Firstly, the performance of handcrafted pose robust feature extraction methods under large yaw angles (\(\pm \,60^\circ\)) is still limited. Secondly, the learning-based pose robust feature extraction, multi-view subspace learning, and deep learning-based approaches require ample multi-pose training face images, which are not available in the SSFR case. Thirdly, face synthesis methods need to fit a 2D or 3D model to the face image in order to generate a face with the desired pose. The fitting process, however, may be time-consuming and computationally expensive. Moreover, some face synthesis approaches may require manual landmark annotation, which is not feasible in real-world applications. They may also produce undesirable artifacts such as stretching due to inaccurate estimation of the shape or pose parameters of the 2D or 3D model. These artifacts distort the face appearance and subsequently degrade the extracted features, affecting recognition performance.

Fig. 1 The proposed patch-based pose invariant feature extraction method

3 Proposed approach

In this section, the proposed patch-based pose invariant feature extraction method is described in detail. The main steps of the proposed approach include landmark detection, patch extraction, Gabor/HOG feature extraction, dimension reduction, feature fusion, and normalization. The diagram in Fig. 1 illustrates these steps. The details of each step are elaborated in the next subsections.

3.1 Landmark detection and patch extraction

Each face image is converted to gray-scale. Then, 68 landmarks are detected using the constrained local neural field (CLNF) detection algorithm [25]. Only the 51 landmarks located at the facial components, namely eyebrows, eyes, nose, and mouth, are used to determine patches. Each landmark represents the centroid of a patch \(P(x,y)\) whose spatial size is predefined empirically for each dataset. Thus, 51 patches are segmented for patch-based local feature extraction. Because local patches exhibit small pose variations compared with the global image, patch-based local feature extraction can produce features that are robust against pose variations.
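
The following sketch illustrates this step. dlib's 68-point landmark predictor is used purely as a stand-in for the CLNF detector [25], and the model file name and default patch size are assumptions; the original implementation was in MATLAB.

```python
# Illustrative sketch of landmark detection and patch extraction (not the
# authors' code): dlib's 68-point predictor stands in for CLNF, and the
# model file name below is an assumption.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_patches(image_bgr, patch_size=40):
    """Return the 51 patches centred on the eyebrow, eye, nose and mouth landmarks."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return []
    shape = predictor(gray, faces[0])
    half = patch_size // 2
    patches = []
    for i in range(17, 68):  # points 0-16 are the jawline; 17-67 are the 51 used here
        x, y = shape.part(i).x, shape.part(i).y
        patch = gray[max(y - half, 0):y + half, max(x - half, 0):x + half]
        patches.append(cv2.resize(patch, (patch_size, patch_size)))
    return patches
```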

3.2 Gabor/HOG feature extraction

Gabor filter [26, 27] and histograms of oriented gradients (HOG) [28] methods were adopted for local feature extraction. According to their original articles, Gabor and HOG features are robust against facial expression and illumination variations, respectively. To address the pose variation, Gabor magnitudes and HOG features were extracted from the 51 patches rather than the entire image.

Gabor features are extracted from each patch as follows. Forty Gabor filters spanning 5 scales and 8 orientations are defined as in (1)

$$\begin{aligned} \psi _{u,v}(x,y) = \frac{f^2_u}{\pi \kappa \eta }e^{-((f^2_u/\kappa ^2)x'^{\;2}+(f^2_u/\eta ^2)y'^{\;2})}e^{j2\pi f_ux'} \end{aligned}$$
(1)

where u and v denote the scale and orientation, respectively, \(x'=x\,\cos \,\theta _v+y\,\sin \,\theta _v\), \(y'=-x\,\sin \,\theta _v+y\,\cos \theta _v\), \(f_u=f_{max}/2^{(u/2)}\), and \(\theta _v=v\pi /8\). The common parameters used for face recognition are \(\kappa =\eta =\sqrt{2}\) and \(f_{max}=0.25\). Gabor features are extracted by filtering the patch \(P(x,y)\) with the Gabor filter \(\psi _{u,v}(x,y)\) as in (2)

$$\begin{aligned} G_{u,v}(x,y) = P(x,y)*\psi _{u,v}(x,y) \end{aligned}$$
(2)

where \(G_{u,v}(x,y)\) is the complex filtering output, composed of real \(R(G_{u,v}(x,y))\) and imaginary \(I(G_{u,v}(x,y))\) parts. The magnitude \(M_{u,v}(x,y)\) of the filtering operation is defined as in (3)

$$\begin{aligned} M_{u,v}(x,y) = \sqrt{R(G_{u,v}(x,y))^2+I(G_{u,v}(x,y))^2} \end{aligned}$$
(3)

All 40 magnitude responses are downsampled by a factor of 5 and concatenated to form the Gabor feature vector of a single patch. The Gabor vectors of all 51 local patches are then concatenated to construct the Gabor feature vector of the global face image.
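
As a minimal illustration of Eqs. (1)-(3), the sketch below builds the 40-filter bank with the parameter values stated above (5 scales, 8 orientations, \(f_{max}=0.25\), \(\kappa=\eta=\sqrt{2}\)); the 31-pixel kernel window and the use of FFT-based convolution are assumptions not specified in the text.

```python
# Minimal numpy sketch of the Gabor feature step (Eqs. 1-3); the kernel
# window size and the convolution routine are assumptions.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(u, v, size=31, f_max=0.25, kappa=np.sqrt(2), eta=np.sqrt(2)):
    """Complex Gabor filter psi_{u,v} of Eq. (1)."""
    f_u = f_max / (2.0 ** (u / 2.0))
    theta_v = v * np.pi / 8.0
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xp = x * np.cos(theta_v) + y * np.sin(theta_v)
    yp = -x * np.sin(theta_v) + y * np.cos(theta_v)
    envelope = np.exp(-((f_u**2 / kappa**2) * xp**2 + (f_u**2 / eta**2) * yp**2))
    return (f_u**2 / (np.pi * kappa * eta)) * envelope * np.exp(1j * 2 * np.pi * f_u * xp)

def gabor_features(patch):
    """Eqs. (2)-(3): concatenated, downsampled magnitudes of all 40 responses."""
    responses = []
    for u in range(5):            # scales
        for v in range(8):        # orientations
            G = fftconvolve(patch.astype(float), gabor_kernel(u, v), mode="same")
            responses.append(np.abs(G)[::5, ::5].ravel())   # downsample by 5
    return np.concatenate(responses)
```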

HOG features are extracted from each patch as follows. The gradient filter \([-1,\,0,\,1]\) is used to compute the horizontal gradient \(G_x(x,y)\) and vertical gradient \(G_y(x,y)\) of the patch. The magnitude \(|G(x,y)|\) and angle \(\theta (x,y)\) of the gradient are defined as in (4) and (5), respectively

$$\begin{aligned}&|G(x,y)| = \sqrt{G_x(x,y)^2+G_y(x,y)^2} \end{aligned}$$
(4)
$$\begin{aligned}&\theta (x,y) = \arctan \Bigg (\frac{G_y(x,y)}{G_x(x,y)}\Bigg ) \end{aligned}$$
(5)

The patch is divided into cells of \(4\times 4\) pixels each. For every cell, a histogram of 10 evenly spaced orientation bins covering \(0^\circ\) to \(180^\circ\) is computed: each pixel votes into the bin to which its gradient angle \(\theta (x,y)\) belongs, with the vote weighted by its gradient magnitude \(|G(x,y)|\). A block is formed by grouping every four connected cells, and the cell histograms within each block are normalized by the \(L2\)-Hys (Lowe-style clipped L2 norm) method. Concatenating all block histograms constructs the HOG feature vector of the patch, and the HOG vectors of all local patches are concatenated to form the HOG feature vector of the global image.
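
The sketch below computes a patch's HOG descriptor with scikit-image's hog() using the parameters described above (10 bins over \(0^\circ\) to \(180^\circ\), \(4\times 4\)-pixel cells, \(2\times 2\)-cell blocks, L2-Hys normalization); the library call is a stand-in for the authors' MATLAB implementation, not their actual code.

```python
# Sketch of the HOG step using scikit-image as a stand-in implementation.
import numpy as np
from skimage.feature import hog

def hog_features(patch):
    """HOG descriptor of one 40x40 patch (10 bins, 4x4 cells, 2x2-cell blocks)."""
    return hog(patch,
               orientations=10,
               pixels_per_cell=(4, 4),
               cells_per_block=(2, 2),
               block_norm="L2-Hys",
               feature_vector=True)

def face_hog_vector(patches):
    """Concatenate the HOG vectors of all 51 patches into one global vector."""
    return np.concatenate([hog_features(p) for p in patches])
```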

3.3 Dimension reduction

Each of the Gabor and HOG vectors has a high dimension, which may slow down recognition. Many state-of-the-art approaches have been proposed for feature selection and dimension reduction [29,30,31]. However, these techniques have not proven effective for face recognition. A recent study [15] has shown that using principal component analysis (PCA) [32] with Gabor and HOG features yields strong face recognition performance. Hence, in the proposed method, the dimensionality of each feature type is reduced separately by PCA. Let the set of N feature vectors \(\Gamma _1,\,\Gamma _2,\,\ldots ,\,\Gamma _N\) be a training set. The average vector \(\Gamma _A\) of this set is defined as in (6)

$$\begin{aligned} \Gamma _A = \frac{1}{N}\sum _{n=1}^{N}\Gamma _n \end{aligned}$$
(6)

The difference between each training vector and the average vector is defined as in (7)

$$\begin{aligned} \Phi _n = \Gamma _n-\Gamma _A\quad (n = 1,\,2,\,\ldots ,\,N) \end{aligned}$$
(7)

The matrices \(A=[\Phi _1\,\Phi _2\,\ldots \,\Phi _N]\) and \(L=A^TA\) are constructed. Then, the N eigenvectors \(e_l\) and eigenvalues of the matrix L are calculated. The eigenspace basis vectors \(s_l\) are defined as in (8)

$$\begin{aligned} s_l = \sum _{k=1}^{N}e_{lk}\Phi _k\quad (l = 1,\,\ldots ,\,N) \end{aligned}$$
(8)

The eigenspace represents a basis set from which the weights \(w_k\) can be obtained. The authors of [32] argued that a smaller \(N^{'}\)-dimensional eigenspace is sufficient for obtaining the weights. Only the \(N^{'}\) eigenvectors with the highest eigenvalues are selected to generate this \(N^{'}\)-dimensional eigenspace. Each feature vector \(\Gamma\) is projected into the \(N^{'}\)-dimensional eigenspace to find its weights as in (9)

$$\begin{aligned} w_k = s^T_k(\Gamma -\Gamma _A) \end{aligned}$$
(9)

where \(k=1,\,\ldots ,\,N^{'}\) and \(w_k\) is the contribution weight of the kth eigenspace vector \(s_k\). The resulting weights are grouped to construct the reduced feature vector \(\Omega ^T=[w_1,\,w_2,\,\ldots ,\,w_{N^{'}}]\).
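
A numpy sketch of Eqs. (6)-(9) is given below. It uses the surrogate matrix \(L=A^TA\) exactly as described; normalising the basis columns is a common eigenface convention added here for clarity, not something stated in the text.

```python
# Sketch of the PCA step (Eqs. 6-9), following the eigenface trick [32] of
# diagonalising the (N x N) matrix L = A^T A instead of the full covariance.
import numpy as np

def fit_pca(Gamma, n_keep):
    """Gamma: (N, d) matrix whose rows are the N training feature vectors."""
    Gamma_A = Gamma.mean(axis=0)                   # Eq. (6): average vector
    Phi = Gamma - Gamma_A                          # Eq. (7): difference vectors
    A = Phi.T                                      # columns are Phi_n
    L = A.T @ A                                    # (N, N) surrogate matrix
    eigvals, eigvecs = np.linalg.eigh(L)
    order = np.argsort(eigvals)[::-1][:n_keep]     # keep the N' largest eigenvalues
    S = A @ eigvecs[:, order]                      # Eq. (8): s_l = sum_k e_lk Phi_k
    S /= np.linalg.norm(S, axis=0, keepdims=True)  # unit-length basis (added convention)
    return Gamma_A, S

def project(Gamma_A, S, x):
    """Eq. (9): weights w_k of a feature vector x in the N'-dimensional eigenspace."""
    return S.T @ (x - Gamma_A)
```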

3.4 Feature fusion and normalization

The reduced Gabor and HOG vectors are then fused by canonical correlation analysis (CCA) [33] to yield a more discriminative and robust feature vector. Let \(X \in {\mathbb {R}}^ {p\times n}\) and \(Y \in {\mathbb {R}}^ {q\times n}\) be matrices of two feature sets from two different modalities. Let \(S_{xx} \in {\mathbb {R}}^ {p\times p}\) and \(S_{yy} \in {\mathbb {R}}^ {q\times q}\) be the within-set covariance matrices of X and Y, and let \(S_{xy} \in {\mathbb {R}}^ {p\times q}\) be the between-set covariance matrix (note that \(S_{yx} = S^T_{xy}\)). CCA aims to find the linear combinations \(X^* = W^T_x X\) and \(Y^* = W^T_y Y\) that maximize the pair-wise correlations across the two feature sets. The transformation matrices \(W_x\) and \(W_y\) are found by solving the eigenvalue equations defined in (10) and (11), respectively

$$\begin{aligned}&S^{-1}_{xx}S_{xy}S^{-1}_{yy}S_{yx}\hat{W_x} = \Lambda ^2\hat{W_x} \end{aligned}$$
(10)
$$\begin{aligned}&S^{-1}_{yy}S_{yx}S^{-1}_{xx}S_{xy}\hat{W_y} = \Lambda ^2\hat{W_y} \end{aligned}$$
(11)

where \(\hat{W_x}\) and \(\hat{W_y}\) are the eigenvectors and \(\Lambda ^2\) is the diagonal matrix of eigenvalues or squares of the canonical correlations. The non-zero eigenvalues in each equation are then sorted in descending order. The transformation matrices, \(W_x\) and \(W_y\), consist of the sorted eigenvectors corresponding to the non-zero eigenvalues. Thus, \(X^*\) and \(Y^*\) are calculated and known as canonical variates. Feature-level fusion is performed either by concatenating or adding the transformed feature vectors as defined in (12) and (13), respectively

$$\begin{aligned}&Z_1 = \left( \begin{array}{c} X^*\\ Y^* \end{array} \right) = \left( \begin{array}{c} W^T_xX\\ W^T_yY \end{array} \right) = \left( \begin{array}{cc} W_x &{} 0 \\ 0 &{} W_y \end{array} \right) ^T \left( \begin{array}{c} X\\ Y \end{array} \right) \end{aligned}$$
(12)
$$\begin{aligned}&Z_2 = X^* + Y^* = W^T_xX + W^T_yY = \left( \begin{array}{c} W_x\\ W_y \end{array} \right) ^T \left( \begin{array}{c} X\\ Y \end{array} \right) \end{aligned}$$
(13)

where \(Z_1\) and \(Z_2\) are called the canonical correlation discriminant features. In this paper, the summation method defined in (13) is used. Finally, the fused vectors are normalized using min-max normalization.
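
The sketch below follows Eq. (10) to obtain \(W_x\), derives \(W_y\) from it rather than solving Eq. (11) independently (which keeps the eigenvector ordering consistent between the two sets), and applies the summation fusion of Eq. (13) followed by min-max normalization; the small ridge term and the per-vector normalization are assumptions not stated in the text.

```python
# Sketch of CCA fusion (Eqs. 10-13, summation form) and min-max normalisation.
# W_y is derived from W_x instead of re-solving Eq. (11), and `reg` is a small
# ridge added for numerical stability; both are simplifications.
import numpy as np

def cca_fuse(X, Y, reg=1e-6):
    """X: (p, n) and Y: (q, n) feature sets, columns are samples. Returns Z2."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    n = X.shape[1]
    Sxx = Xc @ Xc.T / (n - 1) + reg * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / (n - 1) + reg * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / (n - 1)
    # Eq. (10): eigenvectors of Sxx^-1 Sxy Syy^-1 Syx give the columns of W_x
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    vals, Wx = np.linalg.eig(M)
    order = np.argsort(-vals.real)
    Wx = Wx[:, order].real
    # W_y obtained from W_x (up to scaling) instead of re-solving Eq. (11)
    Wy = np.linalg.solve(Syy, Sxy.T) @ Wx
    d = min(X.shape[0], Y.shape[0])            # number of canonical pairs kept
    Z2 = Wx[:, :d].T @ Xc + Wy[:, :d].T @ Yc   # Eq. (13): summation fusion
    return Z2

def min_max_normalize(Z):
    """Scale each fused vector (column of Z) to the [0, 1] range."""
    zmin = Z.min(axis=0, keepdims=True)
    zmax = Z.max(axis=0, keepdims=True)
    return (Z - zmin) / (zmax - zmin + 1e-12)
```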

Fig. 2 Face images of a subject from the FERET b-series database

Table 1 Recognition rates on FERET b-series database

4 Parameters settings

For the proposed method, there are two main parameters (the patch size and the number of eigenfaces) and several auxiliary parameters (the Gabor filter's scales and orientations, and HOG's cell size, block size, and number of bins) that need to be tuned for better recognition performance. The patch size is set empirically depending on the dataset. The number of eigenfaces is set to the number of training classes for optimal performance. In the experiments, the patch size was set to \(40\times 40\) pixels and the number of eigenfaces was set to 200 for the FERET b-series [34] dataset. The Gabor filter's scales and orientations were set to 5 and 8, respectively. HOG's cell size, block size, and number of bins were set to \(4\times 4\) pixels, \(2\times 2\) cells, and 10, respectively.
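
For reference, these settings can be collected into a single configuration, as in the illustrative snippet below; the field names are ours, not from the paper.

```python
# The parameter values stated above, gathered as one configuration sketch.
PARAMS = {
    "patch_size": (40, 40),        # pixels, FERET b-series
    "num_eigenfaces": 200,         # equals the number of training classes
    "gabor_scales": 5,
    "gabor_orientations": 8,
    "hog_cell_size": (4, 4),       # pixels
    "hog_block_size": (2, 2),      # cells
    "hog_bins": 10,
}
```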

5 Experiments and results

In this section, an experimental evaluation of the proposed method is presented. The experiments were conducted on the publicly available FERET database [34]. Two statistical classifiers, namely k nearest neighbor (kNN) with the city-block distance and a support vector machine (SVM), were used to classify the proposed patch-based pose invariant features using a single sample per person. The experiments were implemented using MATLAB R2016a on a Windows 10 Professional laptop with an Intel Core i7-3630QM CPU at 2.4 GHz and 16 GB RAM. The b-series images of the FERET face database were used to evaluate the performance of the proposed method in comparison with several state-of-the-art approaches [5, 7, 8, 14,15,16, 21] that address the pose variation problem from different perspectives. The b-series images in the FERET database are of size \(256\times 384\) pixels and were collected for 200 subjects. A subset of nine samples per subject, consisting of one frontal and eight pose-varied images, was selected for the experiments. The frontal image is labeled ba and shows the face in neutral conditions. The remaining eight non-frontal images cover poses of \(+\,60^\circ\), \(+\,45^\circ\), \(+\,30^\circ\), \(+\,15^\circ\), \(-\,15^\circ\), \(-\,30^\circ\), \(-\,45^\circ\) and \(-\,60^\circ\) in yaw and are labeled bb, bc, bd, be, bf, bg, bh and bi, respectively. Figure 2 shows face images of a sample subject. The patch size was set to \(40\times 40\) pixels. Each subject in the gallery was represented by a single frontal face image ba, whereas the eight pose-varied images bb-bi were used as probes during testing. Table 1 reports the recognition rates of the proposed method and the peer approaches, where the results of the competing methods were taken directly from their original papers. As can be seen in the table, the proposed method showed comparable or better performance than the peer methods. The proposed scheme achieved \(100\%\) accuracy for pose variations with yaw between \(+\,45^\circ\) and \(-\,30^\circ\), as shown by the percentages highlighted in bold in the table. For the large pose variations (\(\pm \,60^\circ\) yaw), the proposed method achieved \(96\%\) and \(94.5\%\) recognition rates, outperforming the peer methods. This outstanding performance is due to the extraction of discriminative facial features from landmark-based patches located at the facial components rather than the entire face image. The experimental results under this constrained condition suggest that the proposed method is effective for a wide range of pose variations within \(\pm \,60^\circ\) yaw.
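
The snippet below sketches an equivalent classifier setup in scikit-learn as a stand-in for the MATLAB classifiers used in the experiments; the linear SVM kernel and the regularization constant are assumptions, since the paper does not specify them.

```python
# Illustrative single-sample-per-person classification setup (not the
# authors' MATLAB code); the SVM kernel and C value are assumptions.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def build_classifiers():
    knn = KNeighborsClassifier(n_neighbors=1, metric="manhattan")  # city-block (L1) distance
    svm = LinearSVC(C=1.0)
    return knn, svm

# Usage sketch: gallery_features holds one fused vector per subject (the ba
# images), gallery_labels the subject IDs, and probe_features the bb-bi images.
#   knn, svm = build_classifiers()
#   knn.fit(gallery_features, gallery_labels)
#   predictions = knn.predict(probe_features)
```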

6 Conclusion

In this paper, a patch-based pose invariant feature extraction method is presented for single sample face recognition. This technique can be adopted to develop face recognition systems for large-scale identification applications, such as driver license, national ID card, or passport identification systems, in which only one training sample per person is enrolled in the database. The proposed approach consists of CLNF-based landmark detection, patch extraction, Gabor and HOG-based feature extraction, PCA-based dimension reduction, CCA-based feature fusion, and min-max normalization. The proposed scheme was implemented in MATLAB, and its performance was tested on the FERET b-series database. The recognition rate was used as the evaluation metric to compare the proposed framework with recent approaches. Experimental results have shown the excellent performance of the proposed method under a wide range of pose variations: the proposed technique achieved \(100\%\) recognition for moderate pose variations and \(96\%\) and \(94.5\%\) under the large pose variations (\(\pm \,60^\circ\) yaw). Although the proposed feature extraction method demonstrated strong results for pose variations of \(-\,45^\circ\) and \(\pm \,60^\circ\) yaw, it still cannot reach \(100\%\) accuracy there. This is the limitation of the current method and motivates further investigation. In the future, the possibility of improving the performance of the proposed method under large pose variations in semi-profile and profile face images will be investigated.