1 Introduction

Motivated by a wide range of practical applications in surveillance, identification systems, access control, social networks, etc., face recognition has been an active research topic in pattern recognition for decades. Promoted by the face recognition grand challenge, recognition rates under well-controlled settings have almost saturated [47]. With this achievement, the recent focus of research has been directed towards recognising faces in the presence of undesired perturbations in imaging conditions, such as variability in lighting, subject pose and expression, misalignment, occlusion, low resolution, etc. [31]. The challenges in this case are caused by the large variability in appearance of the same subject and the small sample size compared to the dimensionality of the data.

Although not fundamental to the operational logic of a system, the quality of the feature representation adopted in an algorithm may impose serious limitations on performance. Consequently, much effort has recently been focused on designing new low-level image descriptors and/or combining multiple features to surpass standard representations such as SIFT [41], Gabor [40], HOG [22], LBP [61], LPQ [50], etc. Moreover, many of the current descriptors in image analysis, such as LBP [61] and SIFT [41], have a hand-crafted design and do not benefit much from statistical learning, which limits their representation capacity. An alternative is to develop new features via statistical learning [17, 32]. In this paper, a new face image representation based on binarised statistical image features (BSIF) [24] is introduced and then extended into a multiscale framework (MBSIF). Similar to other common representations in face image analysis such as LBP [61] and LPQ [50], the new descriptor converts the local micro-structures of a face image into a set of discrete codes. This is realised by applying a number of different filters, projecting an image/sub-image linearly onto a subspace whose basis vectors are estimated via unsupervised learning. In other words, the MBSIF descriptor benefits from a learning stage, in contrast to the ad hoc design schemes used in some of the alternatives.

The binary string code generation in many descriptors, including LBP and LPQ, is achieved by independently binarising each element. A fundamental prerequisite for independent binarisation of code elements is their statistical independence. While this condition is only approximately met in the LBP or LPQ descriptors, in the MBSIF descriptor the justification for independent processing is provided by the use of independent component analysis (ICA) in the filter design procedure. Extending the BSIF descriptor into a multiscale framework increases its representation capability, enabling the feature to capture image content at multiple resolutions. It is shown that the extension of the BSIF representation to a multiscale scheme is fundamentally beneficial, enabling the representation to perform on par with or better than widely employed descriptors in the field. By stacking the frequencies of occurrence of the MBSIF binary codes into a histogram, one may characterise the statistical distributions of filter responses at different scales.

In practice, prior to extracting features, an alignment step is performed on the images. The alignment is usually imposed via an affine or similarity transformation using detected facial landmark points. However, such 2D holistic alignments are insufficient in the presence of out-of-plane head rotations. Even in frontal poses, an error in the localisation of a landmark will result in misalignment of the whole face. To address this problem, a Markov random field (MRF) image matching model operating at the pixel level is employed to provide dense alignment between a pair of images [5, 6]. The benefits of employing such an approach are twofold. First, it provides dense pixelwise alignment between a pair of images, which is quite useful for face recognition in unconstrained settings [3]. Second, the matching is discriminative in the sense that two images of the same subject will most probably yield a good match, while images of different subjects are less likely to be matched accurately. As a result, the method acts as a discriminative pre-processing step for the subsequent stages of a recognition pipeline. The MBSIF histogram is then constructed locally, taking the correspondences into account, and mapped into an LDA space for comparison. Finally, the regional MBSIF descriptor similarities are summed to produce the final similarity score.

An appealing characteristic of the proposed approach is its capability to perform unseen face pair matching. That is, given a pair of face images not previously available to the system, the system should decide whether they belong to the same subject or to different individuals. The decision in this case can be made using class-specific Fisher discriminant analysis (CSLDA) [34]. The class-specific LDA transformation is used to construct discriminative subspaces for the features extracted from each image in a pair, using a single sample and a fixed set of training data (the imposter set). As will be described, the CSLDA transformation can be constructed in an unsupervised fashion, making it a suitable candidate for the unseen face matching task. A further characteristic of the proposed technique is symmetric face comparison. To this end, the method computes the similarity between a pair of face images by symmetrising the MRF matching process and, as a result, the LDA space feature construction and matching. This is in contrast to previous widely employed asymmetric methods, where the similarity is measured in only one direction, compromising performance. The similarity score of the proposed MBSIF + CSLDA descriptor is finally combined with those of the MLBP [18] and MLPQ [62] representations via a sum rule to further increase the accuracy. As will be illustrated, the proposed method provides better discrimination and robustness than many of the existing state-of-the-art approaches in the most challenging situations of real-life photos.

The main contributions of the present work can be summarised as follows.

  • A novel discriminative multiscale image descriptor (MBSIF + CSLDA) using statistical learning based on a variant of linear discriminant analysis is proposed. The discriminative descriptor can be learnt in an unsupervised fashion, suitable for unseen image pair matching tasks.

  • In order to gauge the similarity of a pair of images, the face pair matching task is symmetrised. For this purpose, the discriminative LDA subspace learning is performed symmetrically, improving recognition performance.

  • The proposed descriptor is combined with the MLBP and MLPQ features in a score level fusion scheme in an LDA space to further enhance the recognition accuracy.

  • Last but not least, a dense image pair matching method operating at the pixel level makes the proposed method applicable to the problem of pose-robust face recognition.

The rest of the paper is organised as follows. In Sect. 2, we briefly review the literature. Section 3 presents the details of our proposed multiscale local descriptor. In Sect. 4, the symmetric face matching approach is introduced. An evaluation of the proposed method including a comparison to the state-of-the-art methods is presented in Sect. 5 following which the conclusions are drawn in Sect. 6.

2 Related work

A great proportion of the early efforts at face recognition made extensive use of features extracted globally from an image and mapped onto a lower-dimensional space called a subspace. Two prominent examples in this group are eigenfaces [67] and fisherfaces [12]. However, as local feature-based approaches demonstrated a higher degree of robustness against image perturbations, the majority of the current best performing methods widely exploit local features for the characterisation of face image data. As an example, the authors in [17] use vector-quantised local pixels to extract discriminative information from different face regions. While references [2, 64] use histograms of local pattern features (such as LBP, LTP, etc.), reference [49] uses spatially localised Gabor filters in a multi-layer framework for face verification. In [44], the authors propose to use histograms of local binary patterns extracted from orientation images, achieving good performance using a single training sample per subject.

A more recent approach to boosting the performance under unconstrained settings is to jointly use multiple local descriptors [17, 36, 75], wherein the combination is performed via a wide range of methods, from combination at the decision level to multiple kernel learning (MKL). Some other recent methods adopt metric-learning approaches for improved similarity comparison [23, 28, 43]. In [72, 73], the authors propose a two-level classifier, training a small number of one-shot and two-shot classifiers for each test pair, employing one or both test images as positive samples together with an additional set of negative samples. Employing a set of attribute (race, gender, hair colour, etc.) classifiers, the authors in [35, 36] also make use of this two-level classifier. More recently, a blur-tolerant image descriptor called the local phase quantization (LPQ) operator was introduced by Rahtu et al. [50]. LPQ has been shown to perform better than the local binary pattern (LBP) operator in face recognition and texture classification. In [76], global and local Gabor phase pattern histograms are proposed for face recognition.

Graph-based approaches constitute a major category in part-based local face matching. In this framework [5, 7, 70, 71], different subregions of a face are processed independently of other non-neighbouring regions. Such a processing model is helpful in dealing with local geometrical distortions and in handling occlusions and cluttered backgrounds. In addition, under this framework, good performance may be achieved even using only one training image per class. The current work uses a graph-based method for dense symmetric pixelwise alignment of faces. After establishing dense correspondences, regional multi-resolution features are employed for decision making in an LDA space.

3 Face representation via multiscale binarised statistical image features (MBSIF)

3.1 BSIF image coding

The binarised statistical image features (BSIF) descriptor is built on a generative model based on independent component analysis (ICA) [32]. ICA represents the data as a linear transformation of some latent independent components. Let \(\mathbf{p}\) denote the pixel grey values of an image patch concatenated into a vector. Using ICA, \(\mathbf{p}\) can be represented using a feature matrix \(\fancyscript{F}\) as

$$\begin{aligned} {\mathbf{p}} = \fancyscript{F} {\mathbf{r}} \end{aligned}$$
(1)

where the elements of the vector \(\mathbf{r}\) are unknown random variables which differ from one patch to another. Conversely, the elements of \(\fancyscript{F}\) are constant and the same for all image patches. A fundamental assumption of this linear generative model is that the elements of \(\mathbf{r}\) are statistically independent. In this case, using a sufficiently large number of training samples, one may recover a reasonable approximation to \(\fancyscript{F}\) up to a multiplicative constant without explicitly knowing the latent vector \(\mathbf{r}\) [32]. Estimating \(\fancyscript{F}\) is equivalent to determining the matrix \(\mathbf{F}\) which produces \(\mathbf{r}\) as the output of a number of linear filters, i.e.

$$\begin{aligned} {\mathbf{r}} = {\mathbf{F}}{\mathbf{p}} \end{aligned}$$
(2)

where each row of \(\mathbf{F}\) represents a filter to be applied on the pixels of patch \(\mathbf{p}\).

In practice, statistical models are applied to pre-processed data. Suppose that the pixels of a single patch after pre-processing are collected into the vector \({\mathbf{z}}=(z_1,\ldots , z_N)\). Commonly, a linear transformation is used for pre-processing. In this case, the \(z_i\)'s are linear transformations of the independent components \(r_i\)'s. This can be readily observed by multiplying both sides of Eq. 1 by the matrix performing the pre-processing, obtaining

$$\begin{aligned} {\mathbf{z}}=\fancyscript{U}{\mathbf{r}} \end{aligned}$$
(3)

where the matrix \(\fancyscript{U}\) is obtained by multiplying the matrix \(\fancyscript{F}\) by the pre-processing transformation matrix, \(\mathbf{V}\). In practice, a whitening transformation is used as the pre-processing step, as it has been found to be instrumental in contrast gain and luminance control [32]. In this case, for the matrix \(\fancyscript{U}\) to be invertible, the number of independent components should be chosen to equal the number of variables produced by the whitening transformation. Under this condition, the system in Eq. 3 is uniquely invertible, producing the vector \(\mathbf{r}\) as a linear function of \(\mathbf{z}\) as

$$\begin{aligned} {\mathbf{r}}= {\mathbf{U}}{\mathbf{z}} \end{aligned}$$
(4)

where matrix \(\mathbf{U}\) is obtained by inverting matrix \(\fancyscript{U}\). The filter matrix \(\mathbf{F}\) in Eq. 2 can then be obtained by multiplying the linear transformations given by \(\mathbf{U}\) and \(\mathbf{V}\), i.e.

$$\begin{aligned} {\mathbf{F}}={\mathbf{U}}{\mathbf{V}} \end{aligned}$$
(5)

As a result, the independent components \(r_i\)’s of vector \(\mathbf{r}\) are obtained as

$$\begin{aligned} {\mathbf{r}}={\mathbf{U}}{\mathbf{V}}{\mathbf{p}} \end{aligned}$$
(6)

In summary, given an image patch \(\mathbf{p}\) of size \(d\times d\) pixels, one applies \(N\) filters on the pixels of \(\mathbf{p}\) using the filter matrix \({\mathbf{F}}\in \mathbb {R}^{N\times d^2}\) and obtains \(N\) responses, which are stacked into the vector \(\mathbf{r}\). As the filter responses \(r_i\)'s are independent, they can be processed independently. A useful post-processing step is binarising the \(r_i\)'s by thresholding at zero to produce the binarised features \(b_i\)'s as

$$\begin{aligned} b_i = \left\{ \begin{array}{ll} 1 & r_i > 0,\\ 0 & \mathrm{otherwise.} \end{array} \right. \end{aligned}$$
(7)

The binarised features of \(b_i\)’s can then be summarised using aggregate statistics such as histograms.
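
To make the coding step concrete, the following Python/NumPy sketch applies a learnt filter bank to an image and packs the thresholded responses into per-pixel codes. It assumes a `filters` array of shape \((N, d, d)\); the function names are illustrative rather than part of the original implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def bsif_encode(image, filters):
    """Map each pixel to an N-bit BSIF code (Eqs. 2 and 7)."""
    codes = np.zeros(image.shape, dtype=np.int64)
    for i, filt in enumerate(filters):
        # linear filter response r_i at every pixel (Eq. 2)
        response = convolve2d(image, filt, mode='same', boundary='symm')
        # b_i = 1 if r_i > 0, else 0 (Eq. 7); used as the i-th bit of the code
        codes |= (response > 0).astype(np.int64) << i
    return codes

def bsif_histogram(codes, n_filters=8):
    """Normalised frequency of occurrence of each of the 2^N codes."""
    hist = np.bincount(codes.ravel(), minlength=2 ** n_filters)
    return hist / hist.sum()
```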

3.1.1 Training for BSIF filters

The training procedure for the filter matrix \(\mathbf{F}\) can be summarised as follows. Given a training set of image patches randomly sampled from images, their covariance matrix is estimated and eigendecomposed. The dimensionality of each patch is then reduced using the \(N\) (the number of filters used) principal eigenvectors of the covariance matrix, each divided by the corresponding standard deviation (the square root of its eigenvalue). At the end of this step, whitened data samples \(\mathbf{z}\) are obtained. In more detail, if the eigendecomposition of the covariance matrix \(\mathbf{C}\) is \({\mathbf{C}} = {\mathbf{Y}}{\mathbf{D}}{\mathbf{Y}}^\top\), where \(\mathbf{D}\) is the diagonal matrix of eigenvalues of \(\mathbf{C}\) in descending order and the columns of \(\mathbf{Y}\) are the corresponding eigenvectors of \(\mathbf{C}\), then the matrix \(\mathbf{V}\) used for whitening and dimensionality reduction is given by

$$\begin{aligned} {\mathbf{V}} = \left[ {\mathbf{D}}^{-1/2}{\mathbf{Y}}^\top \right] _{1:N} \end{aligned}$$
(8)

where \([\cdot ]_{1:N}\) denotes the first \(N\) rows of a matrix. Next, given the whitened data samples \(\mathbf{z}\), independent component analysis is employed to estimate an orthogonal matrix \(\mathbf{U}\). Having estimated the matrices \(\mathbf{U}\) and \({\mathbf{V}}\), the final filter matrix is obtained as \({\mathbf{U}}{\mathbf{V}}\). Some sample learnt filters are depicted in Fig. 1. The figure shows eight BSIF filters of size \(17\times 17\). By applying the filters, eight filter responses are obtained, which are then binarised to form an 8-bit binary code for each pixel.

Fig. 1
figure 1

Sample \(17\times 17\) BSIF filters (\(N=8\))

An essential prerequisite for the binarisation is the independence of the filter responses [32, 46]. As ICA is used for filter design, the dependencies between the filter responses in the binarised statistical image features approach are minimised. This is in contrast to some commonly employed techniques, such as local binary patterns, where the independence holds only approximately.
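
The training procedure can be sketched in a few lines of NumPy. The symmetric fixed-point FastICA iteration with a tanh nonlinearity used below is one common choice and stands in for whatever ICA implementation is actually employed; patch sampling and mean removal are assumed to have been performed beforehand.

```python
import numpy as np

def learn_bsif_filters(patches, n_filters, n_iter=200, seed=0):
    """patches: (n_samples, d*d) array of vectorised, zero-mean patches."""
    # whitening/dimensionality reduction V = [D^{-1/2} Y^T]_{1:N} (Eq. 8)
    C = np.cov(patches, rowvar=False)
    evals, Y = np.linalg.eigh(C)                  # ascending eigenvalues
    top = np.argsort(evals)[::-1][:n_filters]     # N leading components
    V = (Y[:, top] / np.sqrt(evals[top])).T       # (N, d*d)
    Z = patches @ V.T                             # whitened samples z = Vp

    # symmetric FastICA: estimate the orthogonal unmixing matrix U
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n_filters, n_filters)))
    for _ in range(n_iter):
        S = Z @ U.T                               # current source estimates
        G, Gp = np.tanh(S), 1.0 - np.tanh(S) ** 2
        U_new = (G.T @ Z) / len(Z) - np.diag(Gp.mean(axis=0)) @ U
        w, E = np.linalg.eigh(U_new @ U_new.T)    # re-orthogonalise:
        U = E @ np.diag(w ** -0.5) @ E.T @ U_new  # U <- (UU^T)^{-1/2} U
    return U @ V                                  # filter matrix F = UV (Eq. 5)
```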

3.2 Multiscale analysis

Suppose the size of each individual BSIF filter is fixed at \(d\times d\). In this case, using a larger number of filters (increasing \(N\)) would include more high-frequency components in the descriptor. This is because the \(N\) eigenvectors of the covariance matrix of the training data are sorted in descending order with respect to their corresponding eigenvalues, and increasing \(N\) would include more eigenvectors corresponding to smaller eigenvalues in the whitening transformation. Conversely, using a fixed number of filters (\(N\)) and increasing the size of each filter takes the variations of the signal over a larger support region into account. In other words, the descriptor then captures large-scale image content. It has been observed that using eight filters (\(N=8\)) results in an acceptable frequency response, able to capture a wide range of the frequency content of images [24]. Hence, the number of filters in all experiments in this work is fixed to 8, producing an 8-bit binary code for each pixel. As noted earlier, the other parameter controlling the frequency content of the feature is the filter size. While smaller filters capture small-scale variations of texture, larger filters are better suited to deal with blurring effects and low-frequency content. In this work, the compromise brought about by this trade-off is moderated via a simple yet powerful texture representation, called the multiscale binarised statistical image feature.

The proposed multi-resolution representation is derived by varying the filter size and combining the BSIF descriptors at different scales. However, in this case the common problem of high dimensionality and small sample size may render the representation unstable in the presence of image noise. The problem can be mitigated by using histograms as aggregate statistics, which capture the most fundamental statistical properties of the feature. The benefits of employing histograms of the code words are threefold. First of all, using a histogram reduces the feature dimension from the image size to that of the histogram. Moreover, by optimising the dimensionality of the histogram and its projection onto other spaces, the effects of image noise on the feature can be regulated. Finally, a histogram summary is more robust to spatial image transformations such as rotation and translation, and hence the sensitivity to misalignment is decreased [39].
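
Combining the two preceding sketches, a hypothetical multiscale pipeline would learn a separate eight-filter bank for each odd filter size and produce one 8-bit code image per scale; `sample_patches`, `train_images` and `face` below are placeholder names.

```python
# one eight-filter bank per odd filter size d = 3, 5, ..., 17
filter_banks = {
    d: learn_bsif_filters(sample_patches(train_images, d), n_filters=8)
    for d in (3, 5, 7, 9, 11, 13, 15, 17)
}
# one 8-bit code image per scale for a given photometrically normalised face
code_images = [bsif_encode(face, F.reshape(8, d, d))
               for d, F in filter_banks.items()]
```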

Fig. 2
figure 2

a Original image, b normalised and cropped image, c–j BSIF coded images at different scales

3.3 MBSIF face descriptor

In the proposed approach to multi-resolution analysis, BSIF operators at \(Z\) scales are first applied to a face image after photometric normalisation [64]. A grey-level code for each pixel at each resolution is thus obtained (Fig. 2). The coded images (c)–(j) are each obtained by applying eight BSIF filters. The coded image in (c) corresponds to the finest scale, i.e. the result of applying \(3\times 3\) filters, while the coded image in (j) represents the output of applying BSIF filters at the coarsest scale, i.e. using filters of size \(17\times 17\). The resulting BSIF code images are divided into non-overlapping rectangular regions \(G_0,G_1,\ldots ,G_{J\times J-1}\) after cropping to the same size. The BSIF pattern histogram for region \(j\) at scale \(s\), \({\mathbf{h}}_{j,s}\), is computed by

$$\begin{aligned}&{\mathbf{{h}}}_{j,s} = \left[ h^0_{j,s},h^1_{j,s},\ldots ,h^{L-1}_{j,s}\right] \nonumber \\&h^i_{j,s} = \sum _{m\in G_j}{\mathbb {1}}\left\{ \mathrm {BSIF}_s(m)=i\right\} \nonumber \\&j \in \left\{ 0,1,\ldots ,J\times J-1\right\} ,\quad s\in \left\{ 1,2,\ldots ,Z\right\} ,\quad L = 256 \end{aligned}$$
(9)

where \({\mathbb {1}}\left\{ \cdot \right\}\) is the indicator function, equal to one when its argument is true and zero otherwise, \(L\) is the number of histogram bins (determined by the number of filters used) and the size of the BSIF filter at scale \(s\) is \(d\times d\), where \(d = 2s +1\). By concatenating all the histograms computed at different scales for each region into a single vector, the final multi-resolution regional face descriptor is obtained

$$\begin{aligned} {\mathbf{{q}}}_j = \left[ {\mathbf{h}}_{j,1},{\mathbf{h}}_{j,2},\ldots ,{\mathbf{h}}_{j,Z}\right] ^\top \end{aligned}$$
(10)
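
A hedged sketch of this regional histogramming step is given below; `code_images` is assumed to be the list of per-scale BSIF code maps of identical size, and the output is the list of regional multi-resolution vectors \(\mathbf{q}_j\).

```python
import numpy as np

def mbsif_descriptor(code_images, J, n_bins=256):
    """Concatenate per-scale code histograms for each of the J x J regions."""
    H, W = code_images[0].shape
    hs, ws = H // J, W // J
    descriptors = []                          # q_0, ..., q_{J*J-1}
    for row in range(J):
        for col in range(J):
            q_j = []
            for codes in code_images:         # scales s = 1, ..., Z
                region = codes[row*hs:(row+1)*hs, col*ws:(col+1)*ws]
                h_js = np.bincount(region.ravel(), minlength=n_bins)
                q_j.append(h_js)              # h_{j,s} of Eq. 9
            descriptors.append(np.concatenate(q_j).astype(np.float64))
    return descriptors                        # each entry is q_j of Eq. 10
```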

3.4 Single sample model construction using class-specific LDA

In order to obtain a discriminative regional descriptor, we use client-specific linear discriminant analysis (CSLDA) [34] to project the multi-resolution features onto a discriminative subject-specific subspace. The client-specific LDA operates in a two-class framework. That is, when comparing a pair of images, one of them is taken as the model (\(f\)) and the likelihood of the second image (\(f^\prime\)) belonging to the class represented by the model, rather than to a class of imposters, is measured. The two-class linear discriminant transformation for region \(j\), taking \(f\) as the model, \(\mathbf{{a}}_j^f\), is given by

$$\begin{aligned} {\mathbf{{a}}}_j^f= S_{j}^{-1} \left( \mu ^f_j-\mu _j\right) \end{aligned}$$
(11)

where \(S_{j}^{-1}\) denotes the inverse of the within-class scatter matrix for region \(j\), while \(\mu ^f_j\) and \(\mu _j\) are the mean histograms of the model image \(f\) and of the training data for the same region, respectively. In [34], it has been shown that if the number of training samples, excluding those belonging to subject \(f\), is large enough, the inverse of the within-class scatter matrix can be approximated as

$$\begin{aligned} S_{j}^{-1}\approx \Psi _j \Phi _j^{-1} \Psi _j^\top \end{aligned}$$
(12)

where \(\Psi _j\) is the matrix of leading eigenvectors of the mixture covariance matrix and \(\Phi _j\) is the diagonal matrix of the corresponding eigenvalues for region \(j\). The reasons supporting the use of client-specific LDA are its adaptability to unseen face pair matching, computational efficiency, ease of training and lower classification error rates [34]. Once a regional linear discriminant transformation is estimated, the similarity of two corresponding regions is measured by the cosine similarity \(\frac{{(\mathbf{{a}}_j^f)}^\top {\mathbf{{q}}_j}^{f^\prime }}{{\Vert {\mathbf{{a}}_j^f}\Vert }\Vert {{\mathbf{{q}}_j}^{f^\prime }}\Vert }\), and the final similarity between a pair of images, \(\mathrm {Sim}(f,f^\prime )\), is measured as the sum of the regional similarities, i.e.

$$\begin{aligned} \mathrm {Sim}\left( f,f^\prime \right) = \sum _j \frac{{\left( \mathbf{{a}}_j^f\right) }^\top {\mathbf{{q}}_j}^{f^\prime }}{{\left\| {\mathbf{{a}}_j^f}\right\| }\left\| {{\mathbf{{q}}_j}^{f^\prime }}\right\| } \end{aligned}$$
(13)
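
The following sketch illustrates Eqs. 11–13 under the approximation of Eq. 12, retaining enough leading eigenvectors to preserve a fixed fraction of the imposter-set variance. `imposter_feats[j]` is assumed to stack the region-\(j\) descriptors of the imposter set row-wise, while `q_f` and `q_fp` hold the regional descriptors of the model and test images.

```python
import numpy as np

def cslda_direction(q_f_j, imposter_j, var_kept=0.95):
    """a_j^f of Eq. 11 with S^{-1} approximated as in Eq. 12."""
    mu = imposter_j.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(imposter_j, rowvar=False))
    evals, evecs = evals[::-1], evecs[:, ::-1]     # descending order
    # keep enough leading eigenvectors (Psi) to preserve var_kept variance
    k = int(np.searchsorted(np.cumsum(evals) / evals.sum(), var_kept)) + 1
    Psi, Phi_inv = evecs[:, :k], 1.0 / evals[:k]
    return Psi @ (Phi_inv * (Psi.T @ (q_f_j - mu)))

def cslda_similarity(q_f, q_fp, imposter_feats):
    """Sum of regional cosine similarities (Eq. 13), q_f as the model."""
    score = 0.0
    for j, imp_j in enumerate(imposter_feats):
        a = cslda_direction(q_f[j], imp_j)
        score += (a @ q_fp[j]) / (np.linalg.norm(a) * np.linalg.norm(q_fp[j]))
    return score
```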

3.4.1 Discussion

The rationale for using CSLDA is to obtain a discriminative, compact descriptor for face representation and matching. However, common Fisher discriminant analysis is a supervised technique requiring class labels for the training examples. Thus, at first glance it might seem that for a pairwise face matching task, where the goal is to gauge the similarity of a pair of images, labelled training images of both subjects are required. This is a rather restrictive assumption in practical applications where the two images have never been seen before [31].

However, the problem is easily circumvented using the CSLDA approach as follows. Assume there is a set of random training face images. We call this set the imposter set. There is no restriction on this set, except that if by any chance a number of images belonging to either one of the subjects to be compared exists in the training set, the number of such samples should be small compared to the total number of training images. This requirement can easily be fulfilled by choosing a large number of training images of random subjects for the imposter set. This condition is studied in [34], and using it the approximation to the within-class scatter matrix in Eq. 12 is derived. Once the imposter set is selected, the within-class scatter matrix for the class-specific LDA can be approximated using Eq. 12. Note that the approximation in Eq. 12 does not require any labels, as it only entails an eigendecomposition of the features extracted from the imposter set. Next, we construct a class-specific LDA transformation using Eq. 11, taking \(\mu ^f\) to be the feature extracted from the first image and \(\mu\) the mean over the imposter set. That is, the CSLDA transformation can be constructed using only a single model sample. In this case, the second image would either belong to the imposter set or to the class represented by the first image. The probability of the second image belonging to the class of the first image, and not to the class represented by the imposters, is then measured by Eq. 13.

Exchanging the roles of the two images, we construct a second CSLDA transformation using the second image as the model and measure the probability of the first image belonging to the class of the second image, and not to the class represented by the imposter set. Finally, the similarity of the two images is taken as the average of the two similarity scores thus obtained. In practice, we also make use of the mirrored versions of both images in a pair to reduce the effect of self-occlusion in inconsistent poses. As both images and their horizontally flipped versions are used as model images, four CSLDA transformations are required. In addition, a pair of images and their horizontally mirrored versions can be matched in eight different ways by exchanging the roles of the model and test images in each pairing. As a result, four CSLDA subspaces are constructed and eight image pair comparisons are performed for each pair of images.
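
The bookkeeping of this symmetric comparison can be sketched as follows; `extract` (the regional MBSIF descriptor) and `score_fn` (the CSLDA similarity of the previous sketch) are assumed helpers, and the dense correspondences of Sect. 4 are omitted for brevity.

```python
import numpy as np

def symmetric_similarity(img1, img2, imposter_feats, extract, score_fn):
    """Average of the eight scores: four model subspaces, both directions."""
    scores = []
    for a in (img1, np.fliplr(img1)):       # first image and its mirror
        for b in (img2, np.fliplr(img2)):   # second image and its mirror
            qa, qb = extract(a), extract(b)
            scores.append(score_fn(qa, qb, imposter_feats))  # a as model
            scores.append(score_fn(qb, qa, imposter_feats))  # b as model
    return float(np.mean(scores))           # final similarity of the pair
```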

Note that the preceding approach for comparing a pair of images is completely unsupervised as no class labels are utilised in obtaining the CSLDA transformation, thanks to the approximation given by Eq. 12. This is extremely advantageous and different from most commonly employed approaches based on linear discriminant analysis in comparing a pair of face images.

4 Dense image alignment

Alignment prior to recognition has a fundamental impact on performance. This has fuelled research leading to a growing number of methods for object alignment [4, 10, 13, 16, 20, 25, 51, 53, 54, 59, 66, 68, 77]. However, aligning a non-planar object using a 2D transformation such as a similarity or affine transform can obviously only partly correct the existing misalignment. This shortcoming is successfully approached via 2D or 3D methods such as the well-known active appearance model (AAM) [20] or the 3D morphable model (3DMM) [14]. An alternative to these methods is dense image matching using Markov random fields, which estimates pixelwise alignment between a pair of images. For dense image alignment we adopt the method proposed in [4, 6, 7]. The reasons supporting this choice are as follows. First of all, it provides dense pixelwise alignment between a pair of images. This has been found to be quite advantageous in pose-invariant as well as frontal-pose face recognition. Second, unlike most MRF-based methods, which are rather slow due to the high computational complexity of the optimisation problem involved, the method in [4, 7] uses a variety of techniques, including multi-resolution analysis and GPU acceleration, to perform matching much faster than many alternatives. Next, the matching is performed in a discriminative way. That is, unlike 2D or 3D approaches such as AAMs [20] or 3DMMs [14], which try to fit a generic model to an image, the method in [4, 7] tries to find the best alignment between a pair of images without using a pre-learnt generic model. As a result, one expects good alignment (smooth deformation maps) when the two images belong to the same subject and poor alignment when the images are from different subjects. This in effect is likely to lead to high similarity scores in the subsequent stages of a recognition system for images of the same subject and low similarity for images of different individuals. Last but not least, the procedure can be modified to compare a pair of face images symmetrically. Some matching results of this method are depicted in Fig. 3.

In this work, we symmetrise the process of matching two images as follows. Initially, the template is matched to the target and then the roles of the two images are exchanged. The procedure is also repeated for the horizontally mirrored versions of both images. As a result, for each pair of images we perform eight matchings. The MBSIF histograms are then computed taking into account the correspondences thus obtained. Once the similarity between each pair of images is computed, the final score is obtained by averaging the similarity scores of all eight matchings. As will be illustrated in the experiments, the symmetric matching improves the performance considerably.

Fig. 3
figure 3

Some results of dense image-to-image matching using the method of [6, 7]

5 Experiments

5.1 Implementation details

In the following experiments, the images are first geometrically normalised (the details are given separately for each experiment) and the cropped face images are then pre-processed using an effective photometric normalisation scheme [64] before feature extraction. The applied method is designed to decrease the effects of changes in illumination conditions, highlights and local shadowing, while keeping the essential visual information. In the multi-resolution analysis, the numbers of scales for the multiscale local binary pattern (MLBP) and the multiscale local phase quantisation (MLPQ) operators are set to 10 and 7, respectively, as advocated in [19]. For the BSIF descriptor, while applying the BSIF operator at a small number of scales does not provide sufficient discriminatory information for face representation, an operator with a larger filter size captures lower frequency components, which tend to be influenced more severely by the illumination conditions. Here, the number of scales is optimised empirically and set to 8. The other parameter to tune is the number of local regions (\(J\times J\)) from which the histograms are extracted. While using fewer regions provides robustness against misalignment, in the case of dense correspondences a larger number of regions makes a larger amount of spatial information available for classification. We investigate the effect of varying \(J\) on system performance. Finally, for the construction of the within-class scatter matrices, the dimensionality of the \(\Psi _j\)'s and \(\Phi _j\)'s is chosen so that \(95\,\%\) of the variation in the training data is preserved.

Table 1 Comparison of the performance of different descriptors on the combined Yale database B and the extended Yale face database B

5.2 Comparison of different descriptors: combined Yale database B and the extended Yale face database B

In this section, a face identification experiment is performed on the combined Yale database B [27] and the extended Yale database B [37] under varying illumination conditions, to compare the single scale BSIF descriptor with the proposed discriminative multiscale representation as well as with the multiscale local binary pattern and multiscale local phase quantisation histograms. The data set consists of 2432 images of 38 subjects under 64 different illumination conditions. For each of the 38 individuals in the database, a single image corresponding to the normal illumination condition is selected as the gallery, and all the remaining images are considered as test samples. Each image in this data set is cropped to a size of \(192\times 160\) (rows \(\times\) columns) and then divided into \(16\times 16\) non-overlapping rectangular regions. For the construction of the imposter set for the class-specific LDA, frontal images of the XM2VTS database [42] are used. The BSIF filters used in this experiment are learnt using an external set of natural images, provided by the authors of [24]. As a result, the generalisation capability of the method is also evaluated.

A number of investigations are made in this experiment. The 8-bit single scale BSIF descriptor with varying filter sizes using a \(\chi ^2\) distance measure is examined. The multiscale BSIF descriptor using the \(\chi ^2\) distance measure is also evaluated and compared to the single scale BSIF descriptors. In addition, we have evaluated the multiscale LBP and multiscale LPQ descriptors for comparison, again using a \(\chi ^2\) distance measure. Finally, we have included the results obtained using the client-specific LDA with the multiscale LBP, multiscale LPQ and multiscale BSIF descriptors. For the client-specific LDA, for each probe-gallery pair, four client-specific LDA subspaces are learnt: two corresponding to the probe and the mirrored probe image and two for the gallery and the mirrored gallery image. As noted earlier, for each pair of images eight scores are obtained, which are averaged to produce the final score. The results are reported in Table 1.

A number of observations from the table are in order. First, the proposed multiscale descriptor using a \(\chi ^2\) distance measure consistently outperforms the single scale versions using the same measure by a large margin. Second, the MBSIF descriptor with a \(\chi ^2\) measure outperforms both the MLBP and MLPQ representations using the same distance metric. Third, all three multiscale descriptors using a client-specific LDA perform better than with the \(\chi ^2\) distance measure. The proposed MBSIF + CSLDA approach, while performing much better than MLBP + CSLDA, also outperforms the MLPQ + CSLDA approach by nearly \(6\,\%\). The improved representational capacity of the new descriptor can be analysed from different viewpoints. First, the filters used in constructing the MBSIF descriptor are estimated via a statistical analysis of image properties, in contrast to ad hoc design schemes such as those used in LBP. Second, the redundancy in the input data is minimised via a PCA whitening transform in the filter learning procedure. Finally, by using independent component analysis in the filter design, the generated codes become statistically independent and thus suitable for further processing under independence assumptions.
It can be observed that the proposed discriminative multiscale regional descriptor (MBSIF + CSLDA) improves upon the performance of the single scale BSIF descriptor to a large extent, making it competitive with the alternatives, a finding also borne out by the following experiments.

Fig. 4
figure 4

Comparison of mean recognition accuracies of the MBSIF, MLPQ and MLBP descriptors on the LFW data set for varying \(J\)

Fig. 5
figure 5

Effect of symmetric MRF matching on the mean recognition accuracy of different descriptors on the LFW data set for varying \(J\)

5.3 Experiment in unseen pair matching: LFW

The development of the LFW data set [31] has facilitated the study of face recognition methods in unconstrained settings. The LFW data set includes real-world variations of facial images such as pose, illumination, expression, occlusion, low resolution, blur, etc. It contains 13,233 images of 5749 subjects. The task is to determine whether a pair of images belongs to the same person or not. We evaluate the proposed approach on "View 2" of the data set, consisting of 3000 matched and 3000 mismatched pairs divided into 10 sets. The evaluation is performed in a leave-one-out cross-validation scheme over the entire test sets. The aggregate performance of the method over the ten folds is reported as the mean accuracy and the standard error on the mean.

There are different evaluation settings for this database: the image restricted setting and the unrestricted setting. The restricted setting provides training data for the image pairs labelled as "same" or "not same". The image unrestricted setting in addition provides the identities of the subjects in each pair. There is also the unsupervised setting, where no training data in the form of same/not same pairs are provided. We evaluate the proposed approach under the most restricted protocol, where strictly LFW data are used without any outside training data. In addition, as our method is unsupervised (both the MBSIF filter learning and the CSLDA approach are unsupervised), it is equally comparable with the results in the unsupervised setting.

In each of the ten experiments on the LFW data set, one of the ten subsets is used as the test set and the remaining nine as training data. We use one of the nine training subsets to learn the projection matrix of the class-specific LDA. Two separate subsets of the remaining eight are used to learn the filters for the BSIF descriptor. Filter learning is performed using 20,000 randomly sampled image patches. Filters are learnt at eight scales, i.e. \(d = \{3, 5, \dots , 17\}\), and at each scale eight filters are learnt (\(N = 8\)), giving rise to an 8-bit BSIF code. The remaining training subsets are used to set the acceptance/rejection threshold. We use the funnelled and aligned versions of the LFW data set and, after computing the LBP, LPQ or BSIF code images, crop the images and keep an area of \(80\times 96\) pixels in the centre of the code image.

In the experiments on the LFW, a number of investigations are made. First of all, the proposed MBSIF descriptor is compared against two other commonly used texture representations for face recognition, namely the MLBP [18] and MLPQ [62], for varying \(J\). The results are obtained using the proposed method described in earlier sections, i.e. using the symmetric matching and the client-specific LDA approach on the MBSIF histograms. The results are shown in Fig. 4. A number of observations can be made from the figure. First, it can be seen that the proposed MBSIF descriptor outperforms both the MLPQ and MLBP representations. Second, by increasing \(J\), and as a result the number of regions, the performance of all three descriptors improves. This is due to the fact that the underlying MRF matching method provides good pixelwise alignment, and by increasing \(J\) more spatial information becomes available for recognition. The boost in performance with increasing \(J\) is more pronounced from \(J = 2\) to \(J = 8\) than from \(J = 8\) to \(J = 16\), with the performance almost saturating around \(J = 16\).

Table 2 Comparison of the performance of the proposed approach to the state-of-the-art methods on the LFW database in the most restricted setting (strict LFW, no outside training data used)
Table 3 Comparison of the performance of the proposed approach to the state-of-the-art methods on the LFW database in the unsupervised setting

Next, we study the effect of symmetric MRF matching on recognition performance. We compare the mean accuracies obtained using each descriptor with the proposed symmetric matching method versus the non-symmetric approach. The results are illustrated in Fig. 5. It is observed that, irrespective of the value of \(J\), the proposed symmetric face matching method consistently performs better than the conventional non-symmetric approach. The improvement is more pronounced with a smaller number of subregions; yet even with the largest number of regions used (\(J = 16\)), the improvements for MLBP, MLPQ and MBSIF over the non-symmetric approach are more than 3.5, 4 and 1.4 %, respectively.

Next, as the MLBP, MLPQ and MBSIF descriptors provide different representations, it is expected that the information they provide is complementary and that the recognition performance can be boosted by combining them. For the combination, a sum rule over the scores obtained in different regions using the different descriptors is employed. The result of the fusion, along with other state-of-the-art results on the LFW data set (\(J = 16\)) under the most restricted protocol, is presented in Table 2. From the table, it is observed that using only the proposed MBSIF descriptor one can achieve performance comparable to the previous best results under this setting. Fusing the three MLBP, MLPQ and MBSIF descriptors, we achieve an impressive mean performance of 88.19 %, ranking the proposed approach first under this setting. As noted earlier, our method is unsupervised and can also be compared to other approaches under the unsupervised protocol. In this case, we ran the experiment on the aligned version of the LFW data set [63]. The results of this comparison are provided in Table 3. It can be observed that the proposed approach achieves the best result in this setting as well.

5.4 Experiment in identification: FERET

In real-world scenarios, in-depth rotation of faces is commonly present in face images. In this experiment, we evaluate the proposed method on the rotation shots of the FERET database [48], i.e. the b-series, in an identification scenario. For this experiment, frontal images of 200 clients of the XM2VTS [42] data set are used as the imposter set. This experiment is designed particularly to explore the capabilities of the proposed methodology for recognition under varying pose conditions. This part of the database consists of 200 subjects captured under 9 different yaw angles ranging from nearly \(-60^\circ\) to \(+60^\circ\). We use the \(ba\) image of each subject (almost frontal) as the gallery image and all the rest as test images. Frontal gallery images are cropped using manually annotated eye coordinates to a size of \(128\times 144\) pixels with an interocular distance of 70 pixels. The test/evaluation images are detected using the Viola and Jones [69] method and scaled so that the face area roughly corresponds to an area of \(128 \times 144\) pixels. Hence, the method is evaluated subject to misalignment and moderate scale deviations. The region parameter \(J\) is set to 16. Table 4 reports the correct identification rates obtained on these data. The results of some other methods are also included for comparison. From the table, it can be observed that the proposed approach outperforms all alternative methods in most poses, except the bb pose (corresponding to an extreme pose deviation of \(+60^{\circ }\) from frontal), in which it loses by only approximately 1 %.

Table 4 Comparison of the performance of the proposed approach to the state-of-the-art methods on the FERET database

5.5 Experiments in verification: XM2VTS

We also evaluate our method on the rotation shots of the XM2VTS database [42]. In the XM2VTS rotation data set, the evaluation protocol is based on 295 subjects, consisting of 200 clients, 25 evaluation imposters and 70 test imposters. The performance of a verification system is often stated in terms of the equal error rate (EER), at which the false acceptance and false rejection rates are equal; the threshold for acceptance or rejection of a claimant is set using the true identities of the test subjects. In this experiment, frontal training images are cropped using manually annotated eye coordinates to a size of \(128\times 144\) pixels so that the distance between the eyes is 70 pixels. As in the FERET experiment, the test/evaluation images are detected and cropped using the Viola and Jones [69] method. After face detection, each image is scaled so that the face area roughly corresponds to an area of \(128 \times 144\) pixels. The parameter \(J\) is set to 16. This experiment enables one to compare the proposed method to other similar pose-invariant approaches in a verification scenario, subject to the challenging settings of face misalignment and pose variation. The rest of the procedure is as described in Sect. 5.1. As in the previous experiment on the FERET database, the imposter set is chosen to be the frontal images of the 200 clients of the XM2VTS database. The best results obtained on this data set are listed in Table 5. It can be observed from the table that the proposed approach obtains the lowest error rate on the rotation shots of the XM2VTS [42] database. In addition to the multi-resolution nature of the descriptors employed, the high performance achieved is attributed to the dense pairwise matching provided by the symmetric matching process and the functionality of the client-specific LDA transformation.

Table 5 Comparison of the performance of the proposed method to the state-of-the-art methods on the XM2VTS database

6 Conclusion

This paper presented a novel discriminative multiscale image descriptor (MBSIF + CSLDA) using statistical learning based on a variant of linear discriminant analysis. The discriminative descriptor, which can be learnt in an unsupervised fashion, was shown to be a suitable solution for unseen image pair matching tasks. Next, in order to gauge the similarity of a pair of images more effectively, the face pair matching task was symmetrised. For this purpose, the discriminative LDA subspace learning was performed symmetrically, improving recognition performance. A dense pixelwise image pair matching method made the proposed technique applicable to pose-robust face recognition. Finally, the proposed descriptor was combined with the MLBP and MLPQ features in a score-level fusion scheme in an LDA space to further enhance the recognition accuracy.