1 Introduction

The importance of research on face recognition (FR) is driven by both its wide range of potential applications and its scientific challenges. The appearance-based paradigm, which uses principal component analysis (PCA) [1], also known as the Karhunen–Loève transformation, to extract features, has exhibited great advantages, producing the well-known eigenfaces method [2, 3]. The eigenfaces method has become one of the most successful approaches in FR (see Ref. [4] for an early survey of FR). However, the eigenfaces approach is vector-oriented; that is, in an FR system using the eigenfaces method, the two-dimensional (2D) facial image matrices must be transformed into one-dimensional (1D) vectors prior to performing PCA. The resulting image vectors are typically very high-dimensional, say 16,384-dimensional (a facial image with a resolution of 128 × 128), which makes the (possibly singular) covariance matrix difficult to handle because of the high dimension and the relatively small training sample size, even though the eigenvectors can be computed efficiently without the accurate calculation of the covariance matrix.

Recently, to attack this problem, Yang et al. [5] proposed two-dimensional principal component analysis (2DPCA), which is based directly on 2D image matrices rather than 1D vectors and thus obviates the transformation from 2D matrices to 1D vectors by lexicographic ordering of the pixel elements of each image. Immediately following the introduction of 2DPCA, Wang et al. [6] showed the equivalence of 2DPCA to line-based PCA, and generalizations of 2DPCA to bilateral and kernel-based versions were also presented [7, 8]. The 2DPCA-based methods are appealing, since the 2D spatial information of the image is well preserved and the computational complexity is significantly reduced compared with PCA. Moreover, 2DPCA naturally and effectively avoids the small sample size (SSS) problem.

Like PCA, however, the 2DPCA model has one flaw: the absence of an associated probability density. That is, 2DPCA is distribution-free—a mathematical method with no underlying statistical model. Probabilistic PCA (PPCA) [9], which defines PCA from a probabilistic perspective, was proposed to remedy this disadvantage of PCA. In this paper, by supposing a parametric Gaussian distribution over the image space (spanned by the row vectors of the 2D image matrices) and a spherical Gaussian noise model for the image, we endow 2DPCA with a probabilistic framework, which we refer to as probabilistic 2DPCA (P2DPCA). Such a probabilistic formulation enables the application of Bayesian methods. On the basis of maximum likelihood (ML) estimation, the model parameters can be estimated via the expectation maximization (EM) algorithm (which, under mild conditions, converges to the ML estimate) [10]. Besides, the probability model can fit the noise in the data, whereas 2DPCA is not robust to independent noise. Further, since P2DPCA only defines a global projection of the sample vectors in the image space, we extend P2DPCA to a mixture of local P2DPCA models (MP2DPCA). We adopt P2DPCA and MP2DPCA as mechanisms for extracting facial information, followed by discrimination. The experimental results demonstrate their superiority over 2DPCA.

A preliminary version of this work appeared in Ref. [11]. The rest of the paper is organized as follows: In Sect. 2, we briefly review the formulation of 2DPCA. In Sect. 3, we propose the P2DPCA model, while the MP2DPCA model is presented in Sect. 4. Section 5 reports experimental results, and Sect. 6 ends the paper with some concluding remarks.

2 Two-dimensional principal component analysis

In eigenfaces-based FR, the transformation from image matrices into vectors before performing PCA is a disadvantage and may lead to a singular covariance matrix when the number of training samples is relatively small. In contrast, 2DPCA seeks to carry out PCA directly on the 2D image matrices. The idea behind 2DPCA is to project an image X, an s × t random matrix, onto μ by the linear transformation z = Xμ. The projection direction μ is determined by maximizing the total scatter of the resulting projected samples, which is characterized by the trace of the covariance matrix of the projected feature vectors. Namely, 2DPCA seeks to maximize the criterion J(μ) = tr(COV), where COV denotes the covariance matrix of the projected feature vectors of the training samples and tr is the trace operator.

Mathematically, for a set of observed s × t image samples \({{\mathbf{X}}}_1, \ldots, {{\mathbf{X}}}_n,\) the q (q ≤ t, usually q ≪ t) principal axes \(\mu_j\) are given by the q dominant eigenvectors of the image covariance matrix \({{\mathbf{S}}}_{\texttt I}= \frac{1}{n}\sum_{i=1}^{n}({{\mathbf{X}}}_i-{\bar{{\mathbf{X}}}})^{{\tt T}} ({{\mathbf{X}}}_i-{\bar{{\mathbf{X}}}}),\) where \({\bar{{\mathbf{X}}}}\) is the mean image \({\bar{{\mathbf{X}}}} = \frac{1}{n}\sum_{i=1}^{n}{{\mathbf{X}}}_i,\) such that \({{\mathbf{S}}}_{\texttt I} \mu_j =\tau_j\mu_j.\) For a given image sample X, the principal components (vectors) are given by \(z_j = {{\mathbf{X}}}\mu_j,\; j=1,\ldots,q.\) The principal component vectors obtained form an s × q feature matrix \({{\mathbf{Z}}} = (z_1, \ldots, z_q).\) The linear reconstruction of X is given by \({\hat{{\mathbf{X}}}}={{\mathbf{Z}}} U^{{\tt T}},\) where \(U = (\mu_1, \ldots, \mu_q).\)
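For concreteness, the following NumPy sketch (our own illustration, not the authors' code; the array layout and function names are assumptions) computes the image covariance matrix \({{\mathbf{S}}}_{\texttt I},\) extracts its q dominant eigenvectors, and forms the feature matrices and reconstructions described above:

```python
import numpy as np

def twodpca(images, q):
    """2DPCA sketch: `images` is an (n, s, t) array, q <= t.

    Returns the t x q projection matrix U and the (n, s, q) feature
    matrices Z_i = X_i U (cf. z_j = X mu_j above).
    """
    n, s, t = images.shape
    mean = images.mean(axis=0)                        # mean image Xbar, s x t
    centered = images - mean
    # Image covariance S_I = (1/n) sum_i (X_i - Xbar)^T (X_i - Xbar), t x t
    S_I = np.einsum('isk,isl->kl', centered, centered) / n
    evals, evecs = np.linalg.eigh(S_I)                # ascending eigenvalues
    U = evecs[:, ::-1][:, :q]                         # q dominant eigenvectors
    Z = images @ U                                    # feature matrices
    return U, Z

def reconstruct(Z, U):
    """Linear reconstruction X_hat = Z U^T."""
    return Z @ U.T
```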

The essence of 2DPCA is to view each row vector of a 2D image matrix as an observed sample; the q principal axes \(\mu_j,\; j=1,\ldots,q,\) are those orthonormal axes onto which the sum of variances of the projected row vectors is maximal. In particular, when each image matrix comprises only one row vector, 2DPCA reduces to PCA. Therefore, 2DPCA is also an orthogonal transformation of the coordinate axes in which we describe the row vectors of the observed 2D images. By statistical manipulation and matrix representation, 2DPCA reduces to solving an eigenvalue problem for an image covariance matrix constructed directly from the image matrices.

3 Probabilistic model for 2DPCA

The equations \(z_j = {{\mathbf{X}}}\mu_j\) above relate the principal component vectors \(z_j\) to the observed random image X. In general, \(z_j\) will not have zero mean. For the principal component vectors to have zero mean, they should be defined as \(z_j = ({{\mathbf{X}}}-\Xi)\mu_j,\; j=1,\ldots,q,\) for a mean matrix Ξ. In practice, Ξ is the sample mean \({\bar{{\mathbf{X}}}}.\) Below, for the sake of simplicity, we make the assumption that the image samples are already mean centered (which is easy to achieve). We denote the kth row vector of the ith (mean-centered) image sample \({{\mathbf{X}}}_i\) as \({{\mathbf{X}}}_i^k\) for k = 1, …, s, i = 1, …, n; namely, \({{\mathbf{X}}}_i^{{\tt T}}=({{\mathbf{X}}}_i^1, \ldots,{{\mathbf{X}}}_i^s).\) Then, we have the following:

Theorem

Regard \({{\mathbf{X}}}_i^k,\) k = 1, …, s, i = 1, …, n, as t-dimensional observations of sample size sn, and apply the PPCA model to these samples:

$$ {{\mathbf{X}}}_i^k = \Lambda F + \varepsilon, \hbox{for}\;k=1, \ldots, s, i=1, \ldots, n, $$
(1)

in which Λ is a t × q factor loading matrix, F is a q-dimensional vector of latent (unobservable) variables (also known as common factors), and ɛ is a t-dimensional vector of specific factors. This generative model assumes that the latent variables are independently and identically distributed (i.i.d.) as Gaussian with zero mean and identity covariance, that the zero-mean specific factors are i.i.d. as Gaussian with covariance \(\sigma^2 I_t,\) and that F is independent of ɛ, their joint distribution being Gaussian. Then the columns of the ML estimator of Λ span the same principal subspace as in 2DPCA.

Proof

Since the image samples are already mean centered, that is \(\frac{1}{n}\sum_{i=1}^{n}{{\mathbf{X}}}_i=0,\) we obtain \(\frac{1}{sn}\sum_{i=1}^{n}\sum_{k=1}^{s} {{\mathbf{X}}}_i^k=0.\) Then we calculate the “sample” covariance matrix of \({{\mathbf{X}}}_i^k,\; k=1,\ldots,s,\; i=1,\ldots,n,\) as

$$ {{\mathbf{S}}} = \frac{1}{sn}\sum_{i=1}^{n}\sum_{k=1}^{s}{{\mathbf{X}}}_i^k {{\mathbf{X}}}_i^{k{\tt T}} = \frac{1}{sn}\sum_{i=1}^{n}{{\mathbf{X}}}_i^{{\tt T}}{{\mathbf{X}}}_i = \frac{1}{s}{{\mathbf{S}}}_{\texttt I}. $$
(2)

By the properties of the ML estimators of PPCA [9], the ML estimator of Λ can be written as

$$ {\hat{\Lambda}}=W_{q}(\Gamma_{q}-{\hat{\sigma}}^2 I_{q})^{\frac{1} {2}}R, $$
(3)

where the q column vectors of the t × q matrix \(W_q\) are the principal eigenvectors of S, with corresponding eigenvalues in the q × q diagonal matrix \(\Gamma_q;\) \({\hat{\sigma}}^2\) is the ML estimate of the global noise level; and R is an arbitrary q × q orthogonal matrix [9]. Since S is just the image covariance matrix \({{\mathbf{S}}}_{\texttt I}\) (up to a scalar), \(W_q\) comprises the same principal axes as 2DPCA. \(\square\)
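As a quick numerical sanity check of Eq. (2) (our own snippet, with arbitrary synthetic dimensions), one can verify that the covariance of the stacked row vectors equals \({{\mathbf{S}}}_{\texttt I}/s\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, t = 5, 8, 6
X = rng.standard_normal((n, s, t))
X -= X.mean(axis=0)                          # mean-centre the image samples
rows = X.reshape(n * s, t)                   # all sn row vectors X_i^k
S = rows.T @ rows / (n * s)                  # "sample" covariance of the rows
S_I = np.einsum('isk,isl->kl', X, X) / n     # image covariance matrix S_I
assert np.allclose(S, S_I / s)               # Eq. (2): S = S_I / s
```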

Such a formulation may be termed a generative model, since data vectors \({{\mathbf{X}}}_i^k\) can be generated by sampling from the F and ɛ distributions and applying (1). Correspondingly, in FR the generative model (1) can be interpreted as follows. An observed facial image is assumed to be generated in three steps: first, choose s points for the rows of the image from the principal subspace spanned by \({\hat{\Lambda}};\) second, add zero-mean Gaussian “noise” ɛ (including real noise and conceptual noise such as whisker detail, eyelash detail, hair detail, etc.) to each point; finally, pile up the chosen points (which are all t-dimensional) with the added noise and the mean image to form the observed facial image. Figure 1 shows the procedure schematically.

Fig. 1 The procedure of generating an observed facial image
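A minimal sketch of these three generating steps, assuming \({\hat{\Lambda}}\) and \({\hat{\sigma}}^2\) have already been estimated (function and variable names are our own):

```python
import numpy as np

def sample_image(mean, Lam, sigma2, rng):
    """Generate a facial image per model (1): for each of the s rows, draw
    F ~ N(0, I_q) and eps ~ N(0, sigma^2 I_t), then pile up the rows
    Lam F + eps and add the mean image."""
    s, t = mean.shape
    q = Lam.shape[1]
    F = rng.standard_normal((s, q))                       # latent variables
    eps = np.sqrt(sigma2) * rng.standard_normal((s, t))   # spherical noise
    return F @ Lam.T + eps + mean
```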

3.1 Feature extraction and reconstruction

The general objective of 2DPCA is to seek a reduced-dimensionality representation of the observed images. From the probabilistic perspective, it is natural to adopt the posterior mean \(E[F|{{\mathbf{X}}}^k]\) (an estimate of the factor scores) as the reduced-dimensionality transformation of \({{\mathbf{X}}}^k\) (the kth row vector of an image X). Specifically, in the P2DPCA model, the posterior mean \(E[F|{{\mathbf{X}}}^k]\) becomes

$$ \beta({{\mathbf{X}}}^k)=\left({\hat{\Lambda}}^{{\tt T}} {\hat{\Lambda}}+{\hat{\sigma}}^2 I_q\right)^{-1} {\hat{\Lambda}}^{{\tt T}}{{\mathbf{X}}}^k,\quad k=1, \ldots, s. $$
(4)

So, for an observed image X, the reduced-dimensionality representation, that is an s × q feature matrix, is given by

$$ {{\mathcal{Z}}} = \left(\beta({{\mathbf{X}}}^1), \ldots, \beta({{\mathbf{X}}}^s) \right)^{{\tt T}} = {{\mathbf{X}}}{\hat{\Lambda}}\left({\hat{\Lambda}}^{{\tt T}} {\hat{\Lambda}}+{\hat{\sigma}}^2 I_q\right)^{-1}. $$
(5)

As a result, the reconstructed image of X is given by

$$ {\hat{{\mathbf{X}}}}=\left({\hat{\Lambda}}\beta({{\mathbf{X}}}^1), \ldots, {\hat{\Lambda}}\beta({{\mathbf{X}}}^s) \right)^{{\tt T}} + {\bar{{\mathbf{X}}}}={{\mathcal{Z}}}{\hat{\Lambda}}^{{\tt T}}+{\bar{{\mathbf{X}}}}. $$
(6)

It can be seen that when \({\hat{\sigma}}^2\to 0,\) the rows of \({{\mathcal{Z}}}\) represent projections onto the latent space and the conventional 2DPCA is recovered.
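The following sketch illustrates Eqs. (3)–(6) (our own illustration; the closed-form estimator takes R = I_q in Eq. (3), and the mean image is subtracted explicitly rather than assumed):

```python
import numpy as np

def p2dpca_closed_form(S_I, s, q):
    """ML estimates per Eqs. (2)-(3), taking R = I_q.

    S = S_I / s, and sigma^2 is the average of the discarded eigenvalues
    (the ML noise estimate of PPCA [9]).
    """
    evals, evecs = np.linalg.eigh(S_I / s)
    evals, evecs = evals[::-1], evecs[:, ::-1]        # descending order
    sigma2 = evals[q:].mean()
    W_q, Gamma_q = evecs[:, :q], evals[:q]
    Lam = W_q * np.sqrt(Gamma_q - sigma2)             # Eq. (3) with R = I_q
    return Lam, sigma2

def p2dpca_features(X, mean, Lam, sigma2):
    """Feature matrix of Eq. (5): Z = (X - Xbar) Lam (Lam^T Lam + s2 I)^-1."""
    q = Lam.shape[1]
    M = Lam.T @ Lam + sigma2 * np.eye(q)
    return (X - mean) @ Lam @ np.linalg.inv(M)

def p2dpca_reconstruct(Z, mean, Lam):
    """Reconstruction of Eq. (6): X_hat = Z Lam^T + Xbar."""
    return Z @ Lam.T + mean
```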

3.2 An EM algorithm for P2DPCA

We can obtain an EM algorithm for finding the model parameters by maximizing the corresponding likelihood function. The EM approach may offer a computational advantage when t is large.

In the EM framework, the estimates \({\hat{\Lambda}}\) and \({\hat{\sigma}}^2\) are updated as follows:

$$ {\hat{\Lambda}}^{(k+1)}={{\mathbf{S}}}_{\texttt I} {\hat{\Lambda}}^{(k)}\left(s{\hat{\sigma}}^{2(k)}I_q + \Delta^{(k)-1}{\hat{\Lambda}}^{(k){\tt T}}{{\mathbf{S}}}_{\texttt I}{\hat{\Lambda}}^{(k)}\right)^{-1}, $$
(7)
$$ {\hat{\sigma}}^{2(k+1)}=\frac{1}{st}{\hbox{tr}}\left({{\mathbf{S}}}_{\texttt I} - {{\mathbf{S}}}_{\texttt I}{\hat{\Lambda}}^{(k)}\Delta^{(k)-1}{\hat{\Lambda}}^{(k+1){\tt T}}\right), $$
(8)

where the superscript (k) denotes the kth iterate, and \(\Delta^{(k)}= {\hat{\sigma}}^{2(k)} I_q + {\hat{\Lambda}}^{(k){\tt T}}{\hat{\Lambda}}^{(k)}.\) The complexity of the EM algorithm is O(sntq) per iteration.
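A minimal sketch of these EM updates (the random initialisation and fixed iteration count are our own choices; in practice one would monitor the likelihood for convergence):

```python
import numpy as np

def p2dpca_em(S_I, s, q, n_iter=100, seed=0):
    """EM updates of Eqs. (7)-(8) for the P2DPCA parameters.

    S_I is the t x t image covariance matrix; s is the number of rows
    per image.
    """
    t = S_I.shape[0]
    rng = np.random.default_rng(seed)
    Lam = rng.standard_normal((t, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        Delta = sigma2 * np.eye(q) + Lam.T @ Lam      # Delta^{(k)}
        SLam = S_I @ Lam
        # Eq. (7)
        Lam_new = SLam @ np.linalg.inv(
            s * sigma2 * np.eye(q) + np.linalg.solve(Delta, Lam.T @ SLam))
        # Eq. (8)
        sigma2 = np.trace(S_I - SLam @ np.linalg.solve(Delta, Lam_new.T)) / (s * t)
        Lam = Lam_new
    return Lam, sigma2
```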

4 Mixtures of P2DPCA models

2DPCA (or P2DPCA) is essentially a linear model for data representation in a low-dimensional subspace. It may be insufficient for modelling data with large variation caused by, for example, pose, expression and lighting. An alternative is to model the complex manifold with a mixture of local linear sub-models built from the probabilistic formulation of P2DPCA. For j = 1, …, s and i = 1, …, n, we suppose that the t-dimensional samples \({{\mathbf{X}}}_i^j\) are generated independently from a mixture of g underlying populations with unknown mixing proportions π 1, …, π g :

$$ {{\mathbf{X}}}_i^j =M_m+ \Lambda_m F_{m} + \varepsilon_{m},\quad \hbox{with probability}\;\pi_m\;(m=1,\ldots,g), $$
(9)

where \(M_m\) are t-dimensional mean vectors, and \(\pi_m\) are mixing proportions with \(\pi_m > 0\) and \(\sum_{m=1}^g \pi_m=1.\) Note that a separate \(M_m\) is associated with each component of the mixture model, allowing each component to model the data covariance structure in a different region of the data space. While the P2DPCA method uses a single set of features for the data points, the mixture of P2DPCAs (MP2DPCA) uses more than one set. MP2DPCA is therefore expected to represent the data more effectively and to achieve better recognition performance than the P2DPCA method.

The model parameters \(\Lambda_m, \sigma_m^2\) and \(M_m\) can also be estimated using the EM algorithm. Specifically, in the E-step, the posterior probability that \({{\mathbf{x}}}_i^j\) belongs to the mth component of the mixture, computed using the current estimate \({\hat{\Theta}}^{(k)}\) of the collection of model parameters \(\Theta=\{\pi_m, M_m, \Lambda_m, \sigma_m^2\}_{m=1}^g,\) is

$$ \tau_m^{(k)}(i,j)= \frac{\pi_m^{(k)} \varphi({{\mathbf{x}}}_i^j;M_m^{(k)},\Sigma_m^{(k)})} {\sum_{l=1}^{g}\pi_l^{(k)} \varphi({{\mathbf{x}}}_i^j;M_l^{(k)},\Sigma_l^{(k)})},\quad (m=1,\ldots,g; j=1,\ldots,s;i=1,\ldots,N), $$
(10)

where φ denotes the probability density function of the normal distribution, and

$$ \Sigma_m=\Lambda_m\Lambda_m^{{\tt T}}+\sigma^2_mI_t,\quad (m=1,\ldots,g). $$

In the M-step, for m = 1, …, g, we have the following updates:

$$ \pi_m^{(k+1)}=\frac{1}{sN}\sum_{i=1}^N\sum_{j=1}^s \tau_m^{(k)}(i,j), $$
(11)
$$ M_m^{(k+1)}=\frac{\sum_{i=1}^N\sum_{j=1}^s\tau_m^{(k)}(i, j){{\mathbf{x}}}_i^j} {\sum_{i=1}^N\sum_{j=1}^s\tau_m^{(k)}(i,j)}. $$
(12)

If we let

$$ S_m^{(k+1)}=\frac{1}{sN\pi_m^{(k+1)}} \sum_{i=1}^N\sum_{j=1}^s\tau_m^{(k)}(i,j) ({{\mathbf{x}}}_i^j-M_m^{(k+1)})({{\mathbf{x}}}_i^j-M_m^{(k+1)})^{{\tt T}}, $$

then the updates for \(\Lambda_m\) and \(\sigma_m^2\) can be obtained by eigen-decomposition [9]. Alternatively, they can be computed iteratively as

$$ \Lambda_m^{(k+1)}=S_m^{(k+1)} \Lambda_m^{(k)}\left(\sigma_m^{2(k)}I_q + \Delta_m^{(k)-1}\Lambda_m^{(k){\tt T}}S_m^{(k+1)}\Lambda_m^{(k)}\right)^{-1}, $$
(13)
$$ \sigma_m^{2(k+1)}= \frac{1}{t}{\hbox{tr}}\left(S_m^{(k+1)} - S_m^{(k+1)}\Lambda_m^{(k)}\Delta_m^{(k)-1}\Lambda_m^{(k+1){\tt T}}\right), $$
(14)

where

$$ \Delta_m^{(k)}= \sigma_m^{2(k)} I_q + \Lambda_m^{(k){\tt T}}\Lambda_m^{(k)}. $$
(15)

The derivations of these formulas are similar to those in Ref. [9]. Under the MP2DPCA model, all the samples are softly divided into g clusters, each modelled by a local P2DPCA. For a given sample, we use the most appropriate local P2DPCA in terms of the fitted posterior probability: based on the probabilistic framework, the natural choice is to assign the sample to the cluster for which its posterior probability is largest.
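A sketch of one EM sweep for Eqs. (10)–(12), together with the hard cluster assignment just described (our own illustration; SciPy's Gaussian density is assumed, and log-densities keep Eq. (10) numerically stable):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(rows, pis, Ms, Lams, sigma2s):
    """Eq. (10): posterior responsibilities tau_m(i, j) for every row vector.
    `rows` stacks all sN row vectors x_i^j into an (sN, t) array."""
    sN, t = rows.shape
    g = len(pis)
    log_p = np.empty((sN, g))
    for m in range(g):
        # Sigma_m = Lam_m Lam_m^T + sigma_m^2 I_t
        Sigma_m = Lams[m] @ Lams[m].T + sigma2s[m] * np.eye(t)
        log_p[:, m] = np.log(pis[m]) + multivariate_normal.logpdf(
            rows, mean=Ms[m], cov=Sigma_m)
    log_p -= log_p.max(axis=1, keepdims=True)        # stabilise the softmax
    tau = np.exp(log_p)
    return tau / tau.sum(axis=1, keepdims=True)

def m_step_pi_M(rows, tau):
    """Eqs. (11)-(12): update mixing proportions and component means."""
    pis = tau.mean(axis=0)
    Ms = (tau.T @ rows) / tau.sum(axis=0)[:, None]
    return pis, Ms

# Hard cluster assignment by the largest posterior probability:
# labels = e_step(rows, pis, Ms, Lams, sigma2s).argmax(axis=1)
```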

5 Experiments

In this section, we compare the recognition performance of 2DPCA, P2DPCA and MP2DPCA on three benchmark databases: the UMIST face database, the AR face database, and the FR data named “faces94” collected at the University of Essex, UK.

The UMIST database (available from: http://images.ee.umist.ac.uk/danny/database.html) consists of 564 images of 20 subjects [12]. The face images of each person cover a range of poses from profile to frontal views, and subjects vary with respect to race, sex and appearance. The files are all in PGM format, approximately 220 × 220 pixels with 256 grey levels. Pre-cropped versions of the images (with a size of 112 × 92) are also available from the same web site; these are used in our experiments. Figure 2 shows the 20 images of one subject.

The AR face database (available from: http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html) contains over 4,000 color images of 126 subjects (70 males and 56 females) [13]. For each person there are variations in facial expression, illumination conditions, and occlusions (sunglasses and scarf). Each individual has 26 frontal-view images taken in two sessions (separated by 2 weeks), with 13 images per session. The original images are of size 768 × 576 pixels with 24-bit RGB color values. Figure 3 shows the 26 images of one subject. In the experiment, we select 70 subjects (50 men and 20 women), and use only the 14 non-occluded images [i.e., those numbered (a) through (g) and (n) through (t)] of each selected subject for evaluation. These images are then cropped and resized to 100 × 100 pixels.

The FR data named “faces94” collected at the University of Essex (available from: http://cswww.essex.ac.uk/mv/allfaces/index.html) were taken while the participants sat at a fixed distance from the camera and were asked to speak. This database contains 152 individuals, with 20 color images per person and considerable expression changes within each person. The original image size is 200 × 180 pixels. In the experiment, we select 70 subjects (51 men and 19 women) for evaluation. The selected images are further down-sampled to 100 × 90 pixels.

Fig. 2 Twenty face examples of one subject from the UMIST face database

Fig. 3 Twenty-six face examples of one subject from the AR face database. The images (a) through (m) are from the first session, and the remaining 13 images are from the second session

The 2DPCA, P2DPCA and MP2DPCA methods are used to extract features of the facial images from the training samples, and a nearest-neighbor classifier then finds the most similar face among the training samples for a queried face. In our experiments, the distance between two feature matrices \({{\mathcal{Z}}}_{i_1}=\left (A_1^{(i_1)}, \ldots,A_s^{(i_1)}\right)^{{\tt T}}\) and \({{\mathcal{Z}}}_{i_2}=\left (A_1^{(i_2)}, \ldots,A_s^{(i_2)}\right)^{{\tt T}}\) is defined as the sum of the s Euclidean distances between corresponding rows of the two matrices.
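In code, this distance and the nearest-neighbor rule amount to the following (a sketch with our own names):

```python
import numpy as np

def feature_distance(Z1, Z2):
    """Sum of the s Euclidean distances between corresponding rows
    of two s x q feature matrices."""
    return np.linalg.norm(Z1 - Z2, axis=1).sum()

def nearest_neighbor(query_Z, train_Zs, train_labels):
    """Return the label of the training feature matrix closest to the query."""
    d = [feature_distance(query_Z, Z) for Z in train_Zs]
    return train_labels[int(np.argmin(d))]
```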

5.1 UMIST database

The difference in representation becomes evident when considering the quality of the images reconstructed by the 2DPCA and P2DPCA methods. Figure 4 compares the reconstructed images obtained with the two methods when trained on ten image samples per person. It can be seen that P2DPCA provides a better characterization of the object.

Fig. 4 2DPCA versus P2DPCA reconstructions for a novel testing view. The left and right four images correspond to 2DPCA and P2DPCA reconstructions with varying q, respectively

We now adopt two strategies to evaluate the performance of 2DPCA, P2DPCA and MP2DPCA. In the first series of experiments, interpolation performance is tested by training on a subset of the available views and testing on the intermediate views. We select ten images per person for training and another ten images per person for testing, so the numbers of training and testing samples are both 200. The experiment is run ten times, and the average recognition rates are recorded in Table 1. The second series of experiments tests extrapolation performance by training on a range of views and testing on novel views outside the training range; the average recognition rates over ten runs are listed in Table 2. From Table 1, it can be seen that all three methods achieve high recognition rates in the interpolation case. In the extrapolation case, as shown in Table 2, the MP2DPCA method obtains the best recognition rate among the three. The reason may be that, when there is large pose variation, different poses are more separable in the local subspaces, and MP2DPCA is well suited to model such complex facial images. In this experiment, 2DPCA serves as the benchmark for evaluation; indeed, in the FR domain 2DPCA works better than many other methods, including eigenfaces, fisherfaces, ICA, and kernel eigenfaces, in terms of recognition rate [5]. From Table 2, we notice the seemingly paradoxical fact that performance declines as q increases. A possible reason is that in the extrapolation situation only the leading principal components are effective, and the samples are “over”-learned if q is large.

Table 1 Interpolation performances of 2DPCA, P2DPCA and MP2DPCA with varying q on the UMIST database (%)
Table 2 Extrapolation performances of 2DPCA, P2DPCA and MP2DPCA with varying q on the UMIST database (%)

5.2 AR database

In this experiment, we randomly select seven images of each person as training samples, and use the remaining images to form the testing set. The cross-validation strategy is applied, and the experiment is run ten times. The average recognition rates over the ten rounds are shown in Table 3. From Table 3, we see that MP2DPCA achieves the overall best recognition performance, and that P2DPCA slightly outperforms 2DPCA. These two findings are consistent with the experimental results in the preceding subsection.

Table 3 The comparison of average recognition rates of 2DPCA, P2DPCA and MP2DPCA with varying q on the AR database (%)

5.3 Database faces94 of University of Essex

In this experiment, we randomly select ten images of each person to form the training set, and use the remaining images for testing, following the same experimental procedure as in the preceding subsection. The average recognition rates over ten rounds are shown in Table 4, from which we again see that MP2DPCA and P2DPCA overall outperform the conventional 2DPCA.

Table 4 The comparison of average recognition rates of 2DPCA, P2DPCA and MP2DPCA with varying q on the database faces94 of University of Essex (%)

6 Conclusion and future work

In this paper, a probabilistic model for 2DPCA, termed P2DPCA, has been developed. The proposed MP2DPCA offers the tempting prospect of modelling faces in unconstrained (complex) environments, for example under varying poses, facial expressions and illumination conditions. All the model parameters can be learned via the EM algorithm on the basis of ML estimation. The proposed methods provide a convenient way to extract and represent image features. Moreover, the probabilistic models can deal properly with missing data, and could thus cope with partially occluded images by regarding the occluded parts as missing values; this is our future work.