1 Introduction

Automatic human face recognition is a well-defined research problem in the fields of computer vision and pattern recognition. The technical core is to define a distance that measures the similarity between two given face images X and Y. The simplest way to define the distance is to use the l2 metric on the whole raw images:

$$\begin{aligned} d = \left\| X - Y \right\| _{2}. \end{aligned}$$
(1)

Besides the l2 form, other norms, such as l0 and l1, are also widely used. A distance metric is considered good if d is small when both X and Y are from the same person and large when they are from two different persons (so-called small intra-personal variations and large inter-personal variations). Unfortunately, it is hard to employ raw face images directly for similarity measurement in practice, because human face images exhibit significant appearance variations in scale, pose, lighting, background, hairstyle, clothing, expression, color saturation, image resolution, focus, etc., as they occur in real-world applications.

To distinguish persons by their faces, a more effective and efficient way is to represent face images using visual features, so that face images can be projected into a feature space and classified. The similarity between two images X and Y is then measured with the following distance metric:

$$\begin{aligned} d = \sum _{i}\left\| x_{i} - y_{i} \right\| _{2}, \end{aligned}$$
(2)

where \(x_{i}\) and \(y_{i}\) are features extracted from two face images X and Y.

The power of using features for face recognition comes not only from the construction of visual features but, more importantly, from the flexibility of weighting visual features for classification. With weighting, the similarity is measured by calculating the distance metric:

$$\begin{aligned} d = \sum _{i} w_{i}\left\| x_{i} - y_{i} \right\| _{2} , \end{aligned}$$
(3)

where \(w_{i}\) is the weight assigned to feature i. The intuition behind weighting is that, for face image points in a high-dimensional space, such a metric should place points from the same person closer together than points from different persons.
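For illustration, Eqs. 1–3 translate directly into code. The following minimal Python/NumPy sketch is ours and not part of the method description:

```python
import numpy as np

def raw_distance(X, Y):
    """Eq. 1: l2 distance between whole raw images X and Y."""
    return np.linalg.norm(X - Y)

def feature_distance(xs, ys, weights=None):
    """Eqs. 2-3: sum of (optionally weighted) l2 distances between
    per-feature vectors x_i and y_i extracted from two images."""
    if weights is None:
        weights = np.ones(len(xs))   # Eq. 2 is the unweighted special case
    return sum(w * np.linalg.norm(x - y)
               for w, x, y in zip(weights, xs, ys))
```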

In face recognition, one of the most technically challenging issues is how to construct suitable facial features for face classification. The facial features constructed by conventional approaches are so-called “hand-crafted features”, i.e. features that are constructed mathematically or engineered. Commonly used mathematical tools include Wavelet and Gabor filtering. The two most notable engineered features used in face recognition are SIFT [28] and LBP [1]. An entirely different way to construct facial features is through learning from face image data, i.e. learning to extract facial representations from training sets. The classical Eigenface approach extracts principal facial components from training data sets for classification. Since the principal components are learned from training sets, the extracted facial features are called learned features. Another well-known algorithm for feature learning is Linear Discriminant Analysis (LDA). Today, with the rise of Deep Learning networks, almost all facial features used in face recognition are learned features.

Although it is easy to see that hand-crafted features and learned features are two different approaches, few realize that they stem from two different schools of facial feature construction, and there is consequently little debate around this topic. Success stories of deep neural networks have led us to believe that learning is king! The unspoken assumption is that hand-crafted features are out of date and that only approaches using learned features are viable. The consequence is that we have become blind to their inherent problems. Solutions that (over)learn from training sets (particularly Deep Learning) are becoming increasingly database-dependent; even worse, it is hard to distinguish cases where general progress is made in face recognition from cases that are merely good solutions to particular problems defined over specific databases.

In this paper, we argue that, in the interest of making fundamental progress in face recognition, we ought to study adequately how to develop database-independent face recognition algorithms. We are interested in how good a modern face recognition system can be without learning. We consider face identification for two main reasons: the problem itself is more challenging than face verification, and it has been a research topic for quite some time, so extensive experimental results are available for comparison. The scientific methodology we employ here is to construct a face identifier, then test it and compare it with state-of-the-art identifiers, to explore empirically the question of how good a face identifier can be without learning.

We propose a method that leverages merely the power of the Gabor phase to address the problem of face identification in controlled scenarios. A slim filter bank of only two Gabor filters is applied to extract the Gabor phase information, and explicit phase-code matching is performed on the quantized phase map via our Block Matching scheme [55]. Unlike other elastic matching schemes, the Block Matching scheme not only cancels the patch-wise spatial shift in the phase map but also evaluates the patch-wise utility during the learning-free matching process. Combining the matching scheme with phase codewords enables the employment of high-definition phase information (4 times higher than in [47]) from the 2 utilized Gabor filters. The proposed approach thus improves algorithmic efficiency significantly without sacrificing recognition accuracy, and it is fully comparable to state-of-the-art Gabor solutions and even Deep Learning based solutions.

The rest of the paper is organized as follows: we first briefly review related Gabor-based approaches in Sect. 2; our approach is then described in Sect. 3, followed by comparative experiments in Sect. 4, where we also compare the performance of our approach against Deep Learning solutions. Finally, we discuss our work as a whole and offer our conclusions.

2 Related Work

Gabor filtering enables the employment of rich low-level, multi-scale features by transforming images from the pixel domain to the complex Gabor space. A Gabor face is obtained by filtering a face image with the Gabor filter function, which is defined as:

$$\begin{aligned} \psi _{u,v}(z) = \frac{\left\| k_{u,v} \right\| ^2}{\sigma ^2} e^{(-\left\| k_{u,v} \right\| ^2 \left\| z \right\| ^2 / 2\sigma ^2)} [ e^{i k_{u,v} z} - {e^{-\sigma ^2 /2}} ] , \end{aligned}$$
(4)

where u and v define the orientation and scale of the Gabor kernels respectively, and the wave vector is defined as:

$$\begin{aligned} k_{u,v} = k_{v} e^{i\phi _{u} }, \end{aligned}$$
(5)

where \( k_{v} = k_{max} / f^{v}\), \(\phi _{u} =u \pi /8 \); \(k_{max}\) is the maximum frequency, \(\sigma \) is the relative width of the Gaussian envelope, and f is the spacing factor between kernels in the frequency domain [27]. A discrete filter bank of 5 spatial frequencies (\(v \in [0,\cdots ,4]\)) and 8 orientations (\(u \in [0, \cdots ,7]\)) is most commonly employed to filter face images, facilitating multi-scale analysis for face recognition.
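For reference, a minimal NumPy sketch of the kernel in Eqs. 4 and 5 follows; the kernel support size and the reading of \(k_{u,v}z\) as a dot product of the wave vector with the position z are our assumptions:

```python
import numpy as np

def gabor_kernel(u, v, size=31, k_max=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi):
    """Sample the Gabor kernel of Eq. 4 on a size x size grid.
    u: orientation index, v: scale index (Eq. 5)."""
    k = (k_max / f ** v) * np.exp(1j * u * np.pi / 8)    # wave vector, Eq. 5
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    z2 = xs ** 2 + ys ** 2                               # ||z||^2
    k2 = np.abs(k) ** 2                                  # ||k_{u,v}||^2
    # complex carrier minus the DC-compensation term, under a Gaussian envelope
    carrier = np.exp(1j * (k.real * xs + k.imag * ys)) - np.exp(-sigma ** 2 / 2)
    return (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2 * sigma ** 2)) * carrier

# the conventional 40-filter bank: 5 scales x 8 orientations
bank = [gabor_kernel(u, v) for v in range(5) for u in range(8)]
```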

In the complex Gabor-transformed space, most state-of-the-art face recognition approaches utilize the amplitude of the Gabor-filtered image for face representation and facial feature construction. In [51], for instance, the LGBP feature is extracted from the amplitude spectrum. One motivation is that the amplitude varies slowly with spatial shifts, making it robust to texture variations caused by dynamic expressions and imprecise alignment.

By constructing LBP-type features from the amplitude and adopting different learning techniques, many Gabor filtering based approaches have shown remarkable advantages over pixel feature based methods: the identification rate in benchmark evaluations has been improved by more than 20% (reaching around 90%) thanks to the so-called “blessing of dimensionality” [17], though at a considerable computational cost [8, 29].

The Gabor phase, in contrast, is robust to lighting changes. It is well known that phase is more important than amplitude for signal representation and reconstruction [30], so it is reasonable to believe that the Gabor phase should play a more important role in face identification. However, the use of the Gabor phase in face recognition is far from common, and it has often been unsuccessful, performing worse than or nearly the same as the amplitude in comparative experiments [6, 16, 47, 53].

This is largely due to two challenging issues: (1) the Gabor phase is a periodic function, and a hard quantization occurs in every period; (2) the Gabor phase is very sensitive to spatial shifts [45, 53], which imposes a rigid requirement on face image alignment. The first issue was partly solved by introducing the phase-quadrant demodulation technique [14], but the second is still far from solved. The state-of-the-art Gabor phase approach (LGXP in [47]) extracts a variant of LBP from the phase spectrum. Since the combination of phase and LBP is also sensitive to spatial shifts, the power of the Gabor phase has not yet been demonstrated in face identification.

Fusing other features that are independent of the local Gabor features can also lead to better performance: [38, 42, 52] fuse the global (holistic) features with local ones at feature level; [47] proposes a fusion of the Gabor phase and amplitude on score and feature levels; [8] fuses real, imaginary, amplitude and phase data. Alternatively, attaching an illumination normalization step and weighting the local Gabor features is shown to be helpful as well [6].

3 Our Learning-Free Face Matching Approach

In this section, we first introduce the philosophy of our proposed approach in Sect. 3.1, and then demonstrate how it is used in face identification to achieve competitive performance with respect to the state-of-the-art.

3.1 Overview

Repeatable features extracted from small face portions are known to be good discriminative traits for identifying persons. In addition, such local features are less susceptible than holistic features to pose changes and facial expressions. Thus, it is natural to divide face images into blocks and perform similarity measurements between them. In practice, most face recognition methods compare spatially corresponding patches after face alignment.

But such a matching process assumes that the spatially corresponding features are the best match. This assumption hardly holds, even after face alignment: because of the movement of facial components, head pose variability, and imprecise alignment, spatially corresponding patches easily become dislocated (see Fig. 2 in [58]). It is nearly impossible to achieve accurate face alignment using similarity transformations applied holistically to images.

In our approach, facial components are aligned individually by our Block Matching algorithm. Our Block Matching segments a face image into non-overlapping blocks and treats individual blocks as features explicitly. Given a pair of face images X and Y, the core of the algorithm is to use a given block (feature) \(x_{i}\) of image X to search for the corresponding block \(y_{i}\) in image Y. Then we measure the distance of two blocks as \( \left\| x_{i}- y_{i} \right\| _{2} \), which is used to form the similarity between two face images as in Eq. 2. This is a direct application of the Elastic Matching concept [45] in face recognition.

Moreover, since not all blocks contribute equally to face identity, it is natural to weight the face blocks during the matching process, as shown in Eq. 3. By computing proper weight factors, we can expect larger distance values for patches from different persons and smaller distances for patches from the same person. The key is how to acquire the weight factor \(w_{i}\).

Without doubt, we could learn the weights from training sets using metric learning techniques as in [12, 18], but the resulting algorithm would be database-dependent. To obtain good generalization performance, we instead developed an efficient on-line step in our Block Matching approach that calculates the weight factor \(w_{i}\) while matching the face pair at hand, as introduced in the following.

3.2 Algorithm

We designed the algorithm based on the observation that a face can be distinguished by unique features that are more informative than their surroundings, e.g. scars, moles, and nasolabial folds. This means that, in an Elastic Matching context, a discriminative patch yields a very small distance when a good match is found, while the distance varies dramatically when the patch is matched to surrounding locations. By considering both the minimum matching distance and the variation of the matching distance, we can evaluate the discriminative power of the local patches.

Specifically, given a face-matching pair, a probe image P (denoted by pb) and a gallery image G (denoted by gl), we first segment the probe image into N non-overlapping patches that are denoted by \(\{f^{(pb)}_{n}\}_{0}^{N-1}\). (The local features are simply formed by the corresponding patches, e.g. by pixels from a patch for gray-scale images.)

For each probe patch \( f^{(pb)}_{n} \) centered at image coordinate \((x_{n}, y_{n})\) (denoted by \( f^{(pb)}(x_{n}, y_{n}) \)), we search for its best matching block within the corresponding search window, which yields a patch-wise distance vector \(d_{n}\) where:

$$\begin{aligned} d_{n} = \{ d_{n}^{i} \}, i\in [0, L-1] \end{aligned}$$
(6)

where L is the number of candidate gallery patches within the \((2R+1)\times (2C+1)\) search window, i.e. \( L = (2R+1)\cdot (2C+1)\) when an “exhaustive” search is applied; R and C denote the search offsets in the vertical and horizontal directions, respectively. Each element in \(d_{n}\) is computed as:

$$\begin{aligned} d_{n}^{i}= \left\| f^{(pb)}(x_{n}, y_{n}) - f^{(gl)}(x_{i}, y_{i}) \right\| _{2}, \end{aligned}$$
(7)

where the patch-wise distance metric is the l2-norm of the element-wise distance of the local features (patches), and \(f^{(gl)}(x_{i}, y_{i})\) denotes the patch centered at image coordinate \((x_{i}, y_{i})\) within the search window on the gallery face image, so that

$$\begin{aligned} x_{i} = x_{n} + \varDelta x, \varDelta x \in [-C, C] ,\,\, y_{i} = y_{n} + \varDelta y, \varDelta y \in [-R, R] . \end{aligned}$$
(8)

We then calculate the slope \(k_{n}\) of a linear fit to the 5 smallest values of \(d_{n}\) (sorted in ascending order) to normalize the patch-wise distance of each patch, such that the weight factor for each local feature is calculated as:

$$\begin{aligned} w^{*}_{n} = k_{n} / {d}^{*}_{n}, \end{aligned}$$
(9)

where \( {d}^{*}_{n} = min(d_{n})\). \(w^{*}_{n}\) is then normalized by its l1-norm as:

$$\begin{aligned} w_{n} = w^{*}_{n} / \sum _{n = 0}^{N-1}w^{*}_{n}. \end{aligned}$$
(10)

Finally, the distance between a matching pair of probe and gallery face images is the weighted sum of \(d^{*}_{n}\) as:

$$\begin{aligned} dist^{(pb, gl)} = \sum _{n = 0}^{N-1} w_{n} \cdot d^{*}_{n}. \end{aligned}$$
(11)

More details of our Block Matching approach are given in [55].
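For concreteness, the following minimal Python/NumPy sketch implements Eqs. 6–11 under the exhaustive search described above (grayscale inputs assumed; implementation details beyond this sketch are covered in [55]):

```python
import numpy as np

def block_matching_distance(probe, gallery, H=29, W=19, R=7, C=6):
    """Learning-free Block Matching distance between two aligned
    grayscale face images (Eqs. 6-11)."""
    rows, cols = probe.shape
    d_star, slopes = [], []
    for y0 in range(0, rows - H + 1, H):          # non-overlapping H x W patches
        for x0 in range(0, cols - W + 1, W):
            block = probe[y0:y0 + H, x0:x0 + W].astype(float)
            d_n = []
            for dy in range(-R, R + 1):           # search window of Eq. 8
                for dx in range(-C, C + 1):
                    yy, xx = y0 + dy, x0 + dx
                    if 0 <= yy <= rows - H and 0 <= xx <= cols - W:
                        cand = gallery[yy:yy + H, xx:xx + W].astype(float)
                        d_n.append(np.linalg.norm(block - cand))    # Eq. 7
            d_n = np.sort(np.asarray(d_n))
            slopes.append(np.polyfit(np.arange(5), d_n[:5], 1)[0])  # slope k_n
            d_star.append(d_n[0])                 # d*_n = min(d_n)
    d_star, slopes = np.asarray(d_star), np.asarray(slopes)
    w = slopes / np.maximum(d_star, 1e-12)        # Eq. 9, guarded against d* = 0
    w /= w.sum()                                  # Eq. 10, l1 normalization
    return float(np.sum(w * d_star))              # Eq. 11
```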

3.3 Gabor Phase Block Matching (GPBM)

It is known that features constructed directly from pixels are vulnerable to lighting and pose variations. One way to further improve recognition performance is to construct more robust features. Another effective way is to increase the dimensionality of the features, which can raise the recognition rate dramatically thanks to the “blessing of dimensionality”. Traditionally, the most popular route is to exploit Gabor features via the Gabor transformation, which normally increases the dimensionality of image representations 40-fold [27].

Different strategies have been proposed to construct features from either the Gabor magnitude, the Gabor phase, or a hybrid of both. The prevailing option in many state-of-the-art approaches has been to utilize the Gabor amplitude for face representation and feature construction.

But high-dimensional features are costly and create difficulties for training, computation, and storage (as pointed out in [8, 29]). To build a practical solution, patch-based approaches and dimensionality reduction techniques, such as PCA, LDA, and rotated sparse regression, are commonly used to learn a subspace that reduces intra-class variation and expands inter-class variation. Since a learning process must be involved and training datasets are needed (e.g. for LDA, the leading eigenvectors of the covariance matrix must be calculated over training image pairs), the generalization advantage of using hand-crafted features is diminished.

Fig. 1. The matching process of our Gabor Phase Block Matching (GPBM) approach.

Can we remain learning-free (to promise generalization) in our face matching approach and still improve the recognition performance further, without suffering from the heavy computational load brought about by high-dimensional representations? We focus on the Gabor phase, since phase is more important than amplitude for signal reconstruction [30]. We combine the Gabor phase face representation with our Block Matching approach introduced in Sect. 3.2 and demonstrate that increasing the signal dimension is not the only way to boost recognition performance.

Specifically, we filter faces with only a single-scale Gabor filter pair and calculate the phase of the filtered face. That is, for each face image, only two demodulated Gabor phase spectra are used as the input to our Block Matching method; see Fig. 1. We first segment the probe phase spectrum into N non-overlapping patches \(\{f^{(pb)}_{n}\}_{0}^{N-1}\), which are simply formed by the raw phase codes of the patches. The Block Matching approach is then utilized to calculate the distance between the two faces. The only difference is that, when calculating the phase distance, each element in \(d_{n}\) is computed by performing an explicit matching over the raw demodulated phase as:

$$\begin{aligned} d_{n}^{i}= \left\| \text {XOR}(f^{(pb)}(x_{n}, y_{n}) , f^{(gl)}(x_{i}, y_{i}) )_{decimal} \right\| _{2}, \end{aligned}$$
(12)

where the patch-wise distance metric is the l2-norm of the element-wise XOR of the phase codes, each read as a decimal value (a decimal-valued Hamming-type distance). More technical details are provided in [56].
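A minimal sketch of this patch distance follows (ours; it assumes the phase maps have already been quantized to 4-bit Gray-coded integers, as described in Sect. 4.2):

```python
import numpy as np

def phase_code_distance(p_block, g_block):
    """Eq. 12: l2 norm of the element-wise XOR of two blocks of
    Gray-coded 16-PSK phase codes (integers in 0..15), with each
    XOR result read as a decimal value."""
    x = np.bitwise_xor(p_block.astype(np.uint8), g_block.astype(np.uint8))
    return float(np.linalg.norm(x.astype(float)))
```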

4 Experiments

4.1 Database Selection

There are a variety of large-scale datasets available for benchmark evaluation of different face recognition approaches, such as the FERET [33], FRGC2.0 [32] and the LFW [21] datasets. Since we focus on face recognition in controlled scenarios in this paper, the FERET database—the most commonly used face identification benchmark—is selected to evaluate and compare our method with state-of-the-art face identification approaches. In addition, the CMU-PIE [36] dataset is selected to evaluate our GPBM against variations of pose, expression and illumination.

4.2 Experimental Setup

Face images were first normalized (aligned) based on the positions of both eyes as in [47]. A central facial area of \(150 \times 136\), which maintained the same aspect ratio (1.1:1) as in [47, 54], was segmented from the face image and used for our experiments.

Thanks to our Block Matching scheme, Gabor phase information of higher definition can be utilized in our approach. We found that a single-scale Gabor filter pair with two orientations is sufficient for face identification. In our implementation, the selected Gabor filters had the following parameters: \(v = 0\), \(u \in \{2,6\}\), \(f = \sqrt{2}\), \(k_{max} = \pi /2\), \(\sigma = 2\pi \).

One can see that the chosen Gabor filters have broad high-frequency coverage. These high-frequency components correspond to facial texture variations and are insensitive to lighting, pose, and aging. Accordingly, to retain high phase definition while tolerating potential phase changes caused by texture shift, a Gray-coded 16-PSK demodulator was used for phase demodulation; the constellation is shown in Fig. 2. Compared with the quadrature phase demodulation used in [47, 52], 4 times as much phase information can be utilized, thanks to the employment of our Block Matching approach.

Fig. 2. 16-PSK demodulator constellation.
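As an illustration of the demodulation step, a minimal sketch follows (our own; the exact sector boundaries of the constellation in Fig. 2 are an assumption):

```python
import numpy as np

def demodulate_16psk(gabor_response):
    """Quantize the Gabor phase into 16 Gray-coded sectors (4 bits).
    gabor_response: complex array output by Gabor filtering."""
    phase = np.angle(gabor_response)                     # in (-pi, pi]
    sector = (np.floor((phase + np.pi) / (2 * np.pi / 16))
              .astype(np.uint8) % 16)
    return sector ^ (sector >> 1)                        # binary-reflected Gray code
```

The Gray coding ensures that adjacent sectors differ in a single bit, so a small phase shift across a sector boundary changes the code by one bit only, which is what makes the XOR distance of Eq. 12 tolerant to texture shift.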

To provide a thorough answer to the question “How good can a face identifier be without learning?”, we compare our Block Matching approach [55] and our GPBM with other methods in both the image domain and the Gabor-transformed space in the following.

4.3 Evaluations on the CMU-PIE Database

The CMU-PIE database contains 41368 images of 68 subjects. Images with pose labels 05, 07, 09, 27, and 29 under 21 illuminations (Flash 2 to 22) of all the 68 persons are selected as the probe set.

Fig. 3. Recognition rates under different parameters on the CMU-PIE dataset.

When applying the Block Matching method, the most important parameters are the block size (H and W) and the search offsets (R and C). Our empirical tests on other datasets indicate that it makes sense to divide a central facial area into \(5 \times 7\) patches, which semantically correspond to components of human faces, such as the eyes and nose. Thus, for a facial area of \(150 \times 136\), a reasonable block size is \(30 \times 20\); in our implementation the block size was \(29 \times 19\) (we prefer odd block side lengths). To obtain good coverage while keeping the computational complexity low, the search offset was set to around a quarter of the block size, and we selected \(R=7\), \(C=6\) pixels in our experiments. To test how sensitive the performance is to the selected parameters, we used the first 2000 probe images of CMU-PIE to evaluate the chosen parameters together with other parameter values randomly selected around them. The evaluation results in Fig. 3 show that the recognition performance is rather insensitive to the parameter selection.

We then conducted experiments on the CMU-PIE probe set and compared our GPBM with G_LBP and G_LDP [54]. G_LBP is the Gabor version of LBP, and G_LDP is an improved type of Gabor amplitude Local Binary Pattern. G_LDP achieved performance equivalent to LGXP (the Gabor Phase pattern) in the FERET evaluations, so it is a good reference for comparison. Since the Gabor phase is inadequate for handling non-monotonic illumination variations, we employed the Difference of Gaussians (DoG) image to equalize the illumination on the face; the corresponding results are denoted by \( GPBM+DoG\). The comparative rank-1 recognition rates are listed in Table 1. It can be seen that even when raw pixels are used for face matching, without any photometric processing, the Block Matching approach performs comparably to the LBP and LDP approaches, in which hand-crafted features are employed. Similarly, our GPBM is slightly better than G_LDP, even though LDP extracts much more complicated Gabor amplitude patterns. These results indicate that, under dramatic illumination and pose variations, our Block Matching approaches have recognition power equal to that of the hand-crafted features.

Table 1. Comparative rank-1 recognition rates on the CMU-PIE database.
Table 2. Comparative rank-1 recognition rates on the FERET database.

4.4 Evaluations on the FERET Database

The FERET database is the most commonly used face identification benchmark. It contains variations in illumination, expression, and aging. The gallery set “Fa” contains 1196 frontal face images, and the easiest probe set “Fb” contains 1195 images with variations mainly in expression. The probe set “Fc” has 194 images with illumination variations. The “Dup1” set contains 722 images taken later in time than the “Fa” images. The 234 images in “Dup1” taken at least 1 year after the “Fa” session form the hardest set, “Dup2”. We faithfully followed the evaluation protocol of the FERET dataset and compare with other feature-based methods in Table 2.

It is easy to observe that, when recognizing faces in the image domain, the Block Matching approach outperforms LBP and LDP (first 3 rows of Table 2). When the Gabor transformation is employed, in a fair comparison (no learning component involved and only the Gabor phase utilized for face matching), our GPBM is almost 12% better than LGXP on the hardest set, “Dup2”. Even in unfavorable comparisons, where pre-processing, training, and fusion methods are exploited by LN+LGXP and S[LGBP_Mag+LGXP], our GPBM still excels. To the best of our knowledge, S[LGBP_Mag+LGXP], aided by the Gabor amplitude and training procedures, was the state-of-the-art Gabor phase based method in terms of performance on the hardest FERET set “Dup2”, and our GPBM is entirely comparable to it.

Our approach also performs comparably to the Deep Learning based PCANet-2 [9]. This confirms again that, in a controlled scenario, by weighting the patch-wise distances via our Block Matching process, we can achieve face identification as effective as the state-of-the-art. Our results indicate that feature design and high-dimensional signal representation might be less important than commonly believed.

We further compare our GPBM with other state-of-the-art approaches, based on various techniques, on FERET in Table 3. From the table, one can see that all these approaches are based on Gabor features, which indicates that the Gabor filter is a very effective tool for signal representation. Our GPBM method outperforms all the other approaches on the hardest set “Dup2” and features three advantages: (1) it enables high-definition Gabor phase information to be utilized for face identification; (2) a single-scale Gabor filter pair with two orientations is sufficient to generate an effective face image representation, at roughly 1/20 of the computational complexity of methods that utilize 40 Gabor filters; (3) it is not a learning-based face identification method and therefore promises good generalization.

Table 3. Comparative summary of recent state-of-the-art face identification approaches.

We also evaluated our approach under pose variations using the pose probe sets “bd”, “be”, “bf”, “bg” and gallery set “ba” on the FERET dataset. These sets correspond to pose angles of \(+25^\circ \), \(+15^\circ \), \(-15^\circ \), \(-25^\circ \), and \(0^\circ \) to the camera, and each of these sets contains 200 persons. The comparative performance of our GPBM is listed in Table 4. It can be observed that our GPBM approach is significantly more accurate than the non-learning feature-based approaches (LGBP and LGXP) when probe faces have relatively large pose angles. In addition, it outperforms the learning-based methods on all probe sets and is even comparable to the recent Deep Learning approach SPAE on “be” and “bf” sets.

Table 4. Comparative rank-1 recognition rates on the FERET dataset with the posed probe sets.

Computational complexity is always a big concern. As reported in Table 4 of [29], for an image size of \(128 \times 128\) with a \(5 \times 8\) Gabor filter bank, the histogram extraction of LGBP takes around 0.45 s and that of S[LGBP_Mag+LGXP] takes 0.99 s; extracting the GOM feature takes 0.7 s [8]. In contrast, the “feature extraction” time in our method is zero, since only the raw phase is used for matching; demodulation is the only on-line computation on the probe face, and it is extremely fast. Our Matlab implementation matches a face pair in 0.05 s on average (Gabor filtering included) on a 3.4 GHz Intel CPU. We can therefore safely conclude that our GPBM outperforms the best Gabor-phase based approach (S[LGBP_Mag+LGXP]) in efficiency by a big margin. We can also infer that the other methods in Table 3 could hardly be more efficient than our GPBM, given their higher image resolutions, Gabor face dimensions, and additional photometric processing. We should mention that our GPBM needs to run block matching; at present we use an “exhaustive search” strategy, but since there are only a few blocks per probe image, matching is still fast. In future work, fast-search strategies from the video compression field could be incorporated to speed up face matching further, as sketched below.
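As an example of what such a strategy could look like, here is an illustrative three-step search (standard in video coding) adapted to our block search; it is not part of the evaluated method:

```python
import numpy as np

def three_step_search(block, gallery, y0, x0, step=4):
    """Find the best-matching block position around (y0, x0) by testing
    a coarse 3 x 3 offset pattern and halving the step around the best
    hit, instead of scanning all (2R+1)(2C+1) offsets exhaustively."""
    H, W = block.shape
    best = (y0, x0)
    while step >= 1:
        cands = [(best[0] + dy * step, best[1] + dx * step)
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        cands = [(y, x) for y, x in cands
                 if 0 <= y <= gallery.shape[0] - H and 0 <= x <= gallery.shape[1] - W]
        best = min(cands, key=lambda p: np.linalg.norm(
            block - gallery[p[0]:p[0] + H, p[1]:p[1] + W]))
        step //= 2
    return best  # covers displacements up to 4 + 2 + 1 = 7 pixels
```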

4.5 Deep Learning for Face Recognition in Controlled Scenarios

Before we conclude this paper, it is interesting to investigate how good Deep Learning can be for face recognition in controlled scenarios. To answer this, we trained several CNNs with the well-known architectures AlexNet [24], VGG-net [37], Google's InceptionNet [40], and FaceNet [35], and evaluated them on the most difficult probe set, “Dup2”, of the FERET database.

For fair comparison between the different architectures, the layers after the last spatial pooling in our implementations of InceptionNet and FaceNet were replaced by two concatenated Fully Connected (FC) layers, and Softmax was selected as the loss function; a sketch of this replacement head is given below. We used the WebFace dataset [50] to train our networks. WebFace is a collection of about half a million face images of around 10000 celebrities. Since the FERET dataset was collected from non-celebrities, all trained nets were carefully fine-tuned with the FERET gallery images.
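A hypothetical PyTorch sketch of such a replacement head follows (the layer names and the identity count are our assumptions; fc_dim corresponds to the FC sizes varied in Table 5):

```python
import torch.nn as nn

class TwoFCHead(nn.Module):
    """Replacement for everything after the last spatial pooling:
    two concatenated FC layers producing identity logits, trained
    with a softmax (cross-entropy) loss."""
    def __init__(self, in_features, fc_dim=1024, num_ids=10000):
        super().__init__()
        self.fc1 = nn.Linear(in_features, fc_dim)   # deep feature layer
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(fc_dim, num_ids)       # one logit per identity

    def forward(self, x):
        x = x.flatten(1)        # feature map from the last spatial pooling
        return self.fc2(self.relu(self.fc1(x)))

# loss = nn.CrossEntropyLoss()   # the "Softmax" loss over identity labels
```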

Table 5. Rank-1 accuracy of several well-known CNN architectures on FERET Dup2 (input image size to CNNs was \(120\times 120\)).

To illustrate how the architecture choice affects recognition performance, we investigated how the rank-1 accuracy varies with the size of the FC layers. The results are listed in Table 5. We can see that the architecture (the length of the FC layers) does influence recognition accuracy. Directly inheriting network architectures designed for one image classification task (e.g. networks with \(FC-4096\) layers work well on ImageNet) may not perform well on a novel face recognition task; suitable deep feature representations must be investigated correspondingly, and here we found that \(FC-1024\) is a good choice, which is also verified in [31]. On the other hand, the performance strongly correlates with the architecture in general: even with \(FC-1024\), InceptionNet outperformed the others.

Apparently, CNNs can outperform the proposed approach by around 4%, but this advantage is not statistically significant for this test: the best CNN correctly identified only 9 more probe faces than our proposed approach, which gave 222 correct answers out of the 234 probes in the “Dup2” set. We can see that even the most advanced CNNs, trained on a massive face dataset, do not solve the face identification problem defined over the FERET set significantly better than our non-learning approach. At the least, one can conclude that for an unseen face identification task, an approach developed without learning can be a promising solution.

5 Discussion

Although it involves no learning, we have shown that a combination of Block Matching with the Gabor phase can work as a very good identifier. Unlike most state-of-the-art face identifiers, where a highly engineered design of facial features or high-dimensional features (the “blessing of dimensionality”) is a must-have, the proposed face identifier has no designed features: it just uses blocks of raw face images and two orientation channels of a single-scale Gabor filter to construct the phase features for face recognition. Our experimental results show that the form and dimensionality of features are of course important, but they are not the key to building a good identifier. The key lies in how to handle the factors causing unlimited variations of facial textures. The crucial issue in constructing features is knowing how the features affect the relationship between within-class variability and between-class variability: face recognition can be performed reliably only when the between-class variability is larger than the within-class variability.

Why does the proposed identifier work? First, the spatial shift caused by camera, pose, expression, scale, and aging between two face images is effectively tackled by our Block Matching approach. Second, the 2D Gabor filter family uniquely achieves the theoretical lower bound on joint uncertainty over spatial position and frequency, a property particularly useful for characterizing facial textures. Leveraged by our Block Matching technique, the Gabor phase demonstrates its power in handling lighting factors and in detailing local facial texture changes.

Just like others, we tested the performance of our face identifier on standard database sets. However, there is a clear difference between the evaluations: since it is not learned from training data, our identifier generalizes well, and one can expect similar performance when applying it to other databases. Of course, according to the No Free Lunch Theorem, if we are interested solely in generalization performance, there is no reason to prefer one identifier over another; certain prior knowledge about the problem or a concrete application is always used, explicitly or implicitly, for example, the choice of operating parameters in our case. In fact, our identifier is a good technical platform into which learning can be well integrated. For example, instead of being computed on-line, the weights can be learned from database sets. Our approach then becomes the so-called metric learning:

$$\begin{aligned} d = (\mathbf x - \mathbf y )^{T} W (\mathbf x - \mathbf y ), \end{aligned}$$
(13)

where \(\mathbf x \) and \(\mathbf y \) are the feature representations of X and Y, and W is a weight matrix, typically a symmetric positive definite matrix. A typical example is the Mahalanobis metric. The weight matrix can be learned either from sets of labeled image pairs or just from sets of labeled images, with the objective of finding a matrix such that positive pairs have smaller distances than negative pairs. Of course, once learning is involved, the developed identifier becomes more database-dependent and less apt for generalization.
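For illustration, a minimal sketch of Eq. 13 follows, with W estimated as an inverse covariance matrix so that the metric reduces to the Mahalanobis distance (the data here are synthetic):

```python
import numpy as np

def metric_distance(x, y, W):
    """Eq. 13: d = (x - y)^T W (x - y), W symmetric positive definite."""
    diff = x - y
    return float(diff @ W @ diff)

# Mahalanobis example: W as the inverse covariance of (synthetic) features
feats = np.random.default_rng(0).normal(size=(100, 8))
W = np.linalg.inv(np.cov(feats, rowvar=False))
d = metric_distance(feats[0], feats[1], W)
```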

Face recognition has been under development for more than three decades, and a three-orders-of-magnitude reduction in error rate has been achieved. One has to realize that this performance increase was achieved only on the selected databases.

With the dominance of machine learning, it has become difficult to measure the real progress made in face recognition. Taking the “LFW benchmark” as an example: many advanced CNN networks have been trained for face recognition, and some of the best achieve \(99.5\%\) recognition accuracy in benchmark evaluations. But when such networks were deployed in a real-world application, they were found to be far from usable, mostly due to the divergence between the training dataset and the real-world data [57]. Though Deep Learning has made great strides in solving challenging face recognition tasks, it is too early to conclude that it is the only way forward. It is wise to retain diverse solutions using “hand-crafted” features and/or features learned from data. This is why, in this paper, we take a radical approach to see how far face recognition can go without learning.

6 Conclusions

In this paper, we have argued strongly that it makes sense to study how good a face identifier can be without learning, particularly today, when Deep Learning is so commonly used. We have shown how to construct such an identifier, which simply uses a Block Matching technique over Gabor phase codes to achieve state-of-the-art performance. We have demonstrated that engineered feature designs, or those adhering to the slogan “blessing of dimensionality”, are not essential ingredients of a good identifier. The key issue in constructing features is to make the between-class variability larger than the within-class variability. Since it is not learned from training data, our identifier lends itself well to generalization: one can expect similar performance when applying it to other databases. This is very important for developing algorithms that constitute real progress in face recognition.