
1 Introduction

Over the past decades, dimension reduction has been a research hotspot in face recognition. It aims to discover a low-dimensional representation of the original high-dimensional data, capture the intrinsic structure of the data, and ultimately benefit classification. Principal Component Analysis (PCA) [1] and Linear Discriminant Analysis (LDA) [2] are the most classic methods. PCA seeks a set of optimal projection directions that maximize the data covariance in an unsupervised way, while LDA uses label information to construct the within-class and between-class scatters and thereby finds discriminant projection directions. LDA therefore generally outperforms PCA in face recognition and other recognition tasks. However, both methods fail to reveal the local structure of the data because of their global linearity.

To capture local information, manifold learning-based methods have been proposed, which generally assume that closely located samples are likely to have similar low-dimensional manifold embeddings. As the most representative manifold learning method, Locality Preserving Projections (LPP) [3] applies k-nearest-neighbor search to explore the local structure around each sample. In contrast to LPP, Marginal Fisher Analysis (MFA) [4] uses the k1 nearest intraclass samples to build an intrinsic graph and the k2 nearest interclass marginal samples to construct a penalty graph; these two graphs describe the intraclass compactness and the interclass separability, respectively. It has been verified that manifold learning-based dimensionality reduction methods can be well explained by the graph embedding framework [4]. However, unlike the k1 nearest intraclass samples and the k2 nearest interclass samples, the distant intraclass and interclass samples are not considered during graph construction. As a result, distant intraclass samples may remain far from each other and distant interclass samples may remain close to each other in the projected feature space, which inevitably harms classification. To address this problem, Huang et al. proposed Graph Discriminant Embedding (GDE) [5], which accounts for the importance of distant intraclass and interclass samples in the graph embedding framework. It uses a strictly monotonically decreasing function to describe the importance of nearby intraclass and interclass samples and a strictly monotonically increasing function to characterize the importance of distant ones. Consequently, the different contributions of nearby and distant samples to the within-class and between-class scatters are fully considered. In most graph embedding methods, however, the importance of a pair of samples is based directly on the similarity between the two samples. Outliers therefore influence graph construction and, in particular, degrade the intraclass compactness.

On the other hand, considering that unlabeled samples exist in practical applications, Cai et al. [6] proposed Semi-Supervised Discriminant Analysis (SDA), which performs LDA on the labeled samples and builds a k-nearest-neighbor graph on the unlabeled samples to preserve local neighbor relationships. It effectively applies semi-supervised learning to achieve a low-dimensional embedding. However, the neighbor parameter k is difficult to set; an inappropriate k degrades the quality of the constructed graph and thus the semi-supervised discriminant performance. In recent years, Low Rank Representation (LRR) [7] has been proposed to capture the global structure of samples via low-rank and sparse matrix decomposition. The decomposed low-rank part reflects the intrinsic affinity of the samples; in other words, similar samples are clustered into the same subspace. However, LRR is essentially unconstrained and tends to be sensitive to noise when the data are corrupted. Therefore, Guo et al. [8] imposed an L1-norm sparsity constraint on the low-rank decomposition, aiming to obtain more robust low-rank relationships.

Motivated by the above discussion, in this paper we introduce a novel graph embedding discriminant analysis method and its semi-supervised extension for face recognition. For the labeled samples, we redefine the weights of the intraclass graph through the mean of the intraclass samples: the similarity between two intraclass samples is based on their similarities to the intraclass mean rather than directly on the distance between them. This effectively reduces the influence of outliers on the within-class scatter. Meanwhile, following the idea of GDE, we account for the different contributions of interclass samples to the between-class scatter. We then build an improved discriminant criterion that pulls intraclass samples as close together as possible while pushing interclass samples far apart. More critically, the constructed graph embedding framework relieves the influence of outliers on the intraclass compactness. On the other hand, considering that a large number of unlabeled samples may exist in practical applications, we apply low rank representation constrained by L1-norm sparsity to capture their global low-rank relationships, and then minimize an error function that preserves the low-rank structure of the samples. Finally, we integrate the improved graph embedding discriminant criterion and the LRR error function to achieve semi-supervised learning. Experimental results on the ORL, AR and FERET face datasets demonstrate the effectiveness and robustness of our method.

The remainder of this paper is organized as follows. Section 2 briefly reviews MFA and GDE. Section 3 introduces the proposed method. Section 4 shows the experimental results and the conclusion is given in Sect. 5.

2 Related Works

Given a set of n training samples X = [x1, x2, …, xn] containing C classes, each sample xi belongs to a class lc, c ∈ {1, 2, …, C}. Using the projection matrix A, the representation of a sample xi in the low-dimensional space is yi = ATxi. In this section, we briefly review MFA and GDE.

2.1 Marginal Fisher Analysis (MFA)

As the most representative graph embedding method, MFA [4] builds an intrinsic graph to characterize the compactness of intraclass samples and a penalty graph to describe the separability of interclass samples. Let Gintrinsic = {V, E, WMFA} denote the intrinsic graph and Gpenalty = {V, E, BMFA} denote the penalty graph, where V is the set of vertices (samples) and E is the set of edges connecting samples. WMFA and BMFA are two adjacency matrices specifying the similarity between samples, whose weights are defined as [4]:

$$ {\mathbf{W}}_{ij}^{MFA} { = }\left\{ {\begin{array}{*{20}l} { 1 ,} \hfill & {{\text{if}}\;{\mathbf{x}}_{i} \in N_{{k_{1} }}^{ + } (j)\;\;{\text{or}}\;\;{\mathbf{x}}_{j} \in N_{{k_{1} }}^{ + } (i)} \hfill \\ { 0 ,} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(1)
$$ {\mathbf{B}}_{ij}^{MFA} { = }\left\{ {\begin{array}{*{20}l} { 1 ,} \hfill & {{\text{if}}\;\;{\mathbf{x}}_{i} \in P_{{k_{2} }} (j)\;\;{\text{or}}\;\;{\mathbf{x}}_{j} \in P_{{k_{2} }} (i)} \hfill \\ { 0 ,} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(2)

where \( N_{{k_{1} }}^{ + } (i) \) indicates the k1 nearest intraclass samples of xi, and \( P_{{k_{2} }} (i) \) indicates the k2 nearest interclass marginal samples of xi. The intraclass scatter \( {\mathbf{S}}_{w}^{MFA} \) and the interclass scatter \( {\mathbf{S}}_{b}^{MFA} \) can be formulated as [4]:

$$ {\mathbf{S}}_{w}^{MFA} = \sum\nolimits_{ij} { | |{\mathbf{y}}_{i} - {\mathbf{y}}_{j} | |^{ 2} {\mathbf{W}}_{ij}^{MFA} } = 2tr\left( {{\mathbf{A}}^{\text{T}} {\mathbf{XL}}^{MFA} {\mathbf{X}}^{\text{T}} {\mathbf{A}}} \right) $$
(3)
$$ {\mathbf{S}}_{b}^{MFA} = \sum\nolimits_{ij} { | |{\mathbf{y}}_{i} - {\mathbf{y}}_{j} | |^{ 2} {\mathbf{B}}_{ij}^{MFA} } = 2tr\left( {{\mathbf{A}}^{\text{T}} {\mathbf{XL}}^{{MFA^{\prime}}} {\mathbf{X}}^{\text{T}} {\mathbf{A}}} \right) $$
(4)

where \( {\mathbf{L}}^{MFA} = {\mathbf{D}}^{MFA} - {\mathbf{W}}^{MFA} \) and \( {\mathbf{L}}^{{MFA^{\prime}}} = {\mathbf{D}}^{{MFA^{\prime}}} - {\mathbf{B}}^{MFA} \) are the Laplacian matrices; \( {\mathbf{D}}_{ii}^{MFA} = \sum\nolimits_{j} {{\mathbf{W}}_{ij}^{MFA} } \) and \( {\mathbf{D}}_{ii}^{{MFA^{\prime}}} { = }\sum\nolimits_{j} {{\mathbf{B}}_{ij}^{MFA} } \) are the diagonal matrices. The discriminant criterion function of MFA can be expressed as [4]:

$$ J_{MFA} ({\mathbf{A}} )= \mathop { \hbox{min} }\limits_{{\mathbf{A}}} \;\frac{{{\mathbf{S}}_{w}^{MFA} }}{{{\mathbf{S}}_{b}^{MFA} }} = \mathop { \hbox{min} }\limits_{{\mathbf{A}}} \;\frac{{tr ({\mathbf{A}}^{\text{T}} {\mathbf{XL}}^{MFA} {\mathbf{X}}^{\text{T}} {\mathbf{A}} )}}{{tr ({\mathbf{A}}^{\text{T}} {\mathbf{XL}}^{{MFA^{\prime}}} {\mathbf{X}}^{\text{T}} {\mathbf{A}} )}} $$
(5)
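To make the construction above concrete, the following NumPy sketch builds the two MFA graphs of Eqs. (1)-(2) and solves the criterion of Eq. (5) through a generalized eigenvalue problem. The data layout (samples as columns of X), the small regularizer on the penalty term, and the helper name mfa_projection are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from scipy.linalg import eigh

def mfa_projection(X, labels, k1=5, k2=20, p=10):
    """Sketch of MFA (Eqs. 1-5); X is d x n with samples as columns."""
    labels = np.asarray(labels)
    n = X.shape[1]
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # pairwise squared distances
    W = np.zeros((n, n))
    B = np.zeros((n, n))
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        diff = labels != labels[i]
        # Eq. (1): k1 nearest intraclass neighbors define the intrinsic graph
        for j in np.where(same)[0][np.argsort(d2[i, same])][:k1]:
            W[i, j] = W[j, i] = 1.0
        # Eq. (2): k2 nearest interclass (marginal) neighbors define the penalty graph
        for j in np.where(diff)[0][np.argsort(d2[i, diff])][:k2]:
            B[i, j] = B[j, i] = 1.0
    L_w = np.diag(W.sum(axis=1)) - W          # Laplacian of the intrinsic graph
    L_b = np.diag(B.sum(axis=1)) - B          # Laplacian of the penalty graph
    Sw = X @ L_w @ X.T                        # Eq. (3) up to the factor 2
    Sb = X @ L_b @ X.T                        # Eq. (4) up to the factor 2
    # Eq. (5): keep the eigenvectors with the smallest generalized eigenvalues
    evals, evecs = eigh(Sw, Sb + 1e-6 * np.eye(Sb.shape[0]))
    return evecs[:, :p]                       # projection matrix A (d x p)
```

The same scaffolding (graphs, Laplacians, generalized eigensolver) is reused by the methods discussed next; only the weight definitions change.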

2.2 Graph Discriminant Embedding (GDE)

The purpose of MFA is to construct two graphs that jointly consider the local structure and the discriminant structure of the samples. However, MFA neglects the importance of distant samples, which may influence the discriminant performance. Therefore, GDE redefines the intrinsic graph Gintrinsic and the penalty graph Gpenalty. The weight matrices WGDE and BGDE of these two graphs are defined as [5]:

$$ {\mathbf{W}}_{ij}^{GDE} = \left\{ {\begin{array}{*{20}l} {e^{{ - \frac{{ | |{\mathbf{x}}_{i} - {\mathbf{x}}_{j} | |^{ 2} }}{t}}} ,} \hfill & {{\text{if}}\;\;{\mathbf{x}}_{j} \in N_{k1}^{ + } (i) \, \;{\text{or}}\; \, {\mathbf{x}}_{i} \in N_{k1}^{ + } (j),\,l_{i} = l_{j} } \hfill \\ {e^{{ - \frac{t}{{ | |{\mathbf{x}}_{i} - {\mathbf{x}}_{j} | |^{ 2} }}}} ,} \hfill & {{\text{if}}\;\;{\mathbf{x}}_{j} \notin N_{k1}^{ + } (i) \, \;{\text{and}}\; \, {\mathbf{x}}_{i} \notin N_{k1}^{ + } (j),\,l_{i} = l_{j} } \hfill \\ { 0 ,} \hfill & {{\text{if}}\;\;l_{i} \ne l_{j} } \hfill \\ \end{array} } \right. $$
(7)
$$ {\mathbf{B}}_{ij}^{GDE} = \left\{ {\begin{array}{*{20}l} {e^{{ - \frac{{ | |{\mathbf{x}}_{i} - {\mathbf{x}}_{j} | |^{ 2} }}{t}}} ,} \hfill & {{\text{if}}\;\;{\mathbf{x}}_{j} \in P_{k2} (i)\;{\text{or}}\;{\mathbf{x}}_{i} \in P_{k2} (j),\;l_{i} \ne l_{j} } \hfill \\ {e^{{ - \frac{t}{{ | |{\mathbf{x}}_{i} - {\mathbf{x}}_{j} | |^{ 2} }}}} ,} \hfill & {{\text{if}}\;\;{\mathbf{x}}_{j} \notin P_{k2} (i)\;{\text{and}}\;{\mathbf{x}}_{i} \notin P_{k2} (j),\;l_{i} \ne l_{j} } \hfill \\ { 0 ,} \hfill & {{\text{if}}\;\;l_{i} = l_{j} } \hfill \\ \end{array} } \right. $$
(8)

where the parameter t (t > 0) is used to tune the weights between samples. The intraclass scatter \( {\mathbf{S}}_{w}^{GDE} \) and the interclass scatter \( {\mathbf{S}}_{b}^{GDE} \) can be computed as [5]:

$$ {\mathbf{S}}_{w}^{GDE} = \sum\nolimits_{ij} { | |{\mathbf{y}}_{i} - {\mathbf{y}}_{j} | |^{ 2} {\mathbf{W}}_{ij}^{GDE} } = 2tr\left( {{\mathbf{A}}^{\text{T}} {\mathbf{XL}}^{GDE} {\mathbf{X}}^{\text{T}} {\mathbf{A}}} \right) $$
(9)
$$ {\mathbf{S}}_{b}^{GDE} = \sum\nolimits_{ij} { | |{\mathbf{y}}_{i} - {\mathbf{y}}_{j} | |^{ 2} {\mathbf{B}}_{ij}^{GDE} } = 2tr\left( {{\mathbf{A}}^{\text{T}} {\mathbf{XL}}^{{GDE^{\prime}}} {\mathbf{X}}^{\text{T}} {\mathbf{A}}} \right) $$
(10)

where \( {\mathbf{L}}^{GDE} = {\mathbf{D}}^{GDE} - {\mathbf{W}}^{GDE} \) and \( {\mathbf{L}}^{{GDE^{\prime}}} = {\mathbf{D}}^{{GDE^{\prime}}} - {\mathbf{B}}^{GDE} \) are the Laplacian matrices; \( {\mathbf{D}}_{ii}^{GDE} = \sum\nolimits_{j} {{\mathbf{W}}_{ij}^{GDE} } \) and \( {\mathbf{D}}_{ii}^{{GDE^{\prime}}} = \sum\nolimits_{j} {{\mathbf{B}}_{ij}^{GDE} } \) are diagonal matrices. Therefore, the discriminant criterion function of GDE can be written as [5]:

$$ J_{GDE} ({\mathbf{A}}) = \mathop {\hbox{min} }\limits_{{\mathbf{A}}} \;\frac{{{\mathbf{S}}_{w}^{GDE} }}{{{\mathbf{S}}_{b}^{GDE} }} = \mathop {\hbox{min} }\limits_{{\mathbf{A}}} \;\frac{{tr({\mathbf{A}}^{T} {\mathbf{XL}}^{GDE} {\mathbf{X}}^{T} {\mathbf{A}})}}{{tr({\mathbf{A}}^{T} {\mathbf{XL}}^{{GDE^{\prime}}} {\mathbf{X}}^{T} {\mathbf{A}})}} $$
(11)
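Since GDE differs from MFA only in the edge weights, a short sketch of Eqs. (7)-(8) suffices: nearby pairs receive a heat-kernel weight and distant pairs of the same kind an inverted heat-kernel weight. The helper name gde_weights and the default parameter values are assumptions for illustration.

```python
import numpy as np

def gde_weights(X, labels, k1=5, k2=20, t=8.0):
    """Sketch of the GDE graphs (Eqs. 7-8): exp(-d2/t) for nearby pairs,
    exp(-t/d2) for distant pairs, separately for intra- and interclass pairs."""
    labels = np.asarray(labels)
    n = X.shape[1]
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    same = labels[:, None] == labels[None, :]
    near_w = np.zeros((n, n), dtype=bool)     # k1-NN among intraclass samples
    near_b = np.zeros((n, n), dtype=bool)     # k2-NN among interclass samples
    for i in range(n):
        s = same[i].copy()
        s[i] = False
        near_w[i, np.where(s)[0][np.argsort(d2[i, s])][:k1]] = True
        d = ~same[i]
        near_b[i, np.where(d)[0][np.argsort(d2[i, d])][:k2]] = True
    near_w |= near_w.T                        # "xj in N(i) or xi in N(j)"
    near_b |= near_b.T
    heat = np.exp(-d2 / t)                    # weight for nearby pairs
    inv = np.exp(-t / (d2 + 1e-12))           # weight for distant pairs
    W = np.where(near_w, heat, inv) * same    # Eq. (7): intraclass pairs only
    np.fill_diagonal(W, 0.0)
    B = np.where(near_b, heat, inv) * ~same   # Eq. (8): interclass pairs only
    return W, B
```

These W and B then enter the Laplacian and trace-ratio machinery of Eqs. (9)-(11) exactly as in the MFA sketch above.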

In the GDE method, the similarity defined by the monotonic functions effectively handles the effect of distant samples on the recognition results. However, the similarity between samples is still defined directly by their pairwise distance, so outliers still have a certain impact on the discriminant performance. Furthermore, GDE computes similarities only for labeled samples, without taking unlabeled samples into account.

3 Methodology

3.1 The Proposed Method

As described in Sect. 2.2, GDE does not consider the influence of outliers and unlabeled samples on the recognition performance. In this paper, we propose a novel graph embedding discriminant analysis method in which the similarity between two intraclass samples is based on their similarities to the intraclass mean. This design aims to eliminate the impact of outliers on the within-class scatter and thus improve the recognition performance.

Let Gintrinsic = {V, E, W} denote the redefined intrinsic graph and Gpenalty = {V, E, B} denote the penalty graph; the weight matrices W and B are given as follows:

$$ {\mathbf{W}}_{ij} = \left\{ {\begin{array}{*{20}l} {e^{{ - \frac{{||{\mathbf{x}}_{i} - {\tilde{\mathbf{x}}}_{i} ||^{2} + ||{\mathbf{x}}_{j} - {\tilde{\mathbf{x}}}_{i} ||^{2} }}{2t}}} ,} \hfill & {{\text{if }}\;l_{i} = l_{j} } \hfill \\ {0,} \hfill & {{\text{if}}\;\;l_{i} \ne l_{j} } \hfill \\ \end{array} } \right. $$
(12)
$$ {\mathbf{B}}_{ij}^{{}} = \left\{ {\begin{array}{*{20}l} {e^{{ - \frac{{||{\mathbf{x}}_{i} - {\mathbf{x}}_{j} ||^{2} }}{t}}} ,} \hfill & {{\text{if}}\;\;{\mathbf{x}}_{i} \in D_{k} ({\mathbf{x}}_{j} )\;\;{\text{or}}\;\;{\mathbf{x}}_{j} \in D_{k} ({\mathbf{x}}_{i} ),\,l_{i} \ne l_{j} } \hfill \\ {e^{{ - \frac{t}{{||{\mathbf{x}}_{i} - {\mathbf{x}}_{j} ||^{2} }}}} ,} \hfill & {{\text{if}}\;\;{\mathbf{x}}_{i} \notin D_{k} ({\mathbf{x}}_{j} )\;\;{\text{and}}\;\;{\mathbf{x}}_{j} \notin D_{k} ({\mathbf{x}}_{i} ),\,l_{i} \ne l_{j} } \hfill \\ {0,} \hfill & {{\text{if}}\;\;l_{i} = l_{j} } \hfill \\ \end{array} } \right. $$
(13)

where \( {\tilde{\mathbf{x}}}_{i} \) denotes the mean of the samples in the class of xi, and Dk(xi) denotes the k nearest interclass neighbors of xi. Unlike GDE, we define the weight matrix W through the distances between the samples and their within-class mean instead of the distance between the samples themselves; following GDE, the importance of interclass samples is still described through monotonic functions.
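A sketch of Eqs. (12)-(13) may help: the intraclass weight depends only on how close xi and xj are to their shared class mean, so a single distant outlier no longer drags the whole weight down. The function and parameter names below are illustrative assumptions.

```python
import numpy as np

def proposed_weights(X, labels, k=4, t=8.0):
    """Sketch of Eqs. (12)-(13); X is d x n with samples as columns."""
    labels = np.asarray(labels)
    n = X.shape[1]
    same = labels[:, None] == labels[None, :]
    # Eq. (12): intraclass weights via distances to the shared class mean
    means = {c: X[:, labels == c].mean(axis=1) for c in np.unique(labels)}
    dm = np.array([np.sum((X[:, i] - means[labels[i]]) ** 2) for i in range(n)])
    W = np.exp(-(dm[:, None] + dm[None, :]) / (2.0 * t)) * same
    np.fill_diagonal(W, 0.0)
    # Eq. (13): GDE-style penalty graph over interclass pairs (k interclass neighbors)
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    near = np.zeros((n, n), dtype=bool)
    for i in range(n):
        idx = np.where(~same[i])[0]
        near[i, idx[np.argsort(d2[i, idx])][:k]] = True
    near |= near.T
    B = np.where(near, np.exp(-d2 / t), np.exp(-t / (d2 + 1e-12))) * ~same
    return W, B
```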

According to the two constructed graphs, the intraclass scatter \( {\mathbf{S}}_{w} \) and the interclass scatter \( {\mathbf{S}}_{b} \) can be formulated as:

$$ {\mathbf{S}}_{w} = \sum\limits_{ij} {||{\mathbf{y}}_{i} - {\mathbf{y}}_{j} ||^{2} {\mathbf{W}}_{ij}^{{}} } = \sum\limits_{ij} {||{\mathbf{A}}^{\text{T}} {\mathbf{x}}_{i} - {\mathbf{A}}^{\text{T}} {\mathbf{x}}_{j} ||^{2} {\mathbf{W}}_{ij}^{{}} } = 2tr({\mathbf{A}}^{\text{T}} {\mathbf{XLX}}^{\text{T}} {\mathbf{A}}) $$
(14)
$$ {\mathbf{S}}_{b} = \sum\limits_{ij} {||{\mathbf{y}}_{i} - {\mathbf{y}}_{j} ||^{2} {\mathbf{B}}_{ij} } = \sum\limits_{ij} {||{\mathbf{A}}^{\text{T}} {\mathbf{x}}_{i} - {\mathbf{A}}^{\text{T}} {\mathbf{x}}_{j} ||^{2} {\mathbf{B}}_{ij} } = 2tr({\mathbf{A}}^{\text{T}} {\mathbf{XL^{\prime}X}}^{\text{T}} {\mathbf{A}}) $$
(15)

where \( {\mathbf{L}} = {\mathbf{D}} - {\mathbf{W}} \) and \( {\mathbf{L^{\prime}}} = {\mathbf{D^{\prime}}} - {\mathbf{B}} \) are the Laplacian matrices, and \( {\mathbf{D}}_{ii} = \sum\nolimits_{j} {{\mathbf{W}}_{ij} } \) and \( {\mathbf{D^{\prime}}}_{ii} = \sum\nolimits_{j} {{\mathbf{B}}_{ij} } \) are diagonal matrices. Therefore, the proposed discriminant criterion function is given by:

$$ J_{Ours} ({\mathbf{A}}) = \mathop {\hbox{min} }\limits_{{\mathbf{A}}} \;\frac{{tr({\mathbf{A}}^{\text{T}} {\mathbf{XLX}}^{\text{T}} {\mathbf{A}})}}{{tr({\mathbf{A}}^{\text{T}} {\mathbf{XL^{\prime}X}}^{\text{T}} {\mathbf{A}})}} $$
(16)

Consequently, our method avoids the influence of outliers on the within-class scatter, making intraclass samples close to each other and interclass samples far away from each other in the projected space.

3.2 Semi-Supervised Extension (SSE)

For unlabeled samples, SDA [6] builds a k-nearest-neighbor graph to preserve local neighbor relationships. However, the neighbor parameter k in SDA is difficult to set. In recent years, LRR [7] was proposed; it uses low-rank and sparse matrix decomposition to express the global structure of the samples, so that the low-rank structure reflects their intrinsic affinity. Nevertheless, LRR is essentially unconstrained and may be affected by noise. In this paper, we use low rank representation with an L1-norm sparsity constraint to capture the low-rank relationships of the samples, and minimize an error function that preserves their low-rank structure. For the given training samples X, we reconstruct the original samples through the data dictionary D = [d1, d2, …, dm] ∈ Rd×m; that is, X = DZ, where Z = [z1, z2, …, zn] ∈ Rm×n is the reconstruction coefficient matrix. Considering that the samples may also be contaminated by noise, the problem can be modeled as follows [8]:

$$ \left\{ {\begin{array}{*{20}l} {\mathop {\hbox{min} }\limits_{z} \left\| {\mathbf{Z}} \right\|_{*} + \lambda \left\| {\mathbf{E}} \right\|_{2,1} + \gamma \left\| {\mathbf{Z}} \right\|_{1} } \hfill \\ {s.t.\;\;{\mathbf{X}} = {\mathbf{DZ}} + {\mathbf{E}}} \hfill \\ \end{array} } \right. $$
(17)

where ||Z||* denotes the nuclear norm of Z, i.e., the sum of the singular values of the matrix. \( \left\| {\mathbf{E}} \right\|_{2,1} = \sum\nolimits_{j = 1}^{n} {\sqrt {\sum\nolimits_{i = 1}^{d} {\left( {{\mathbf{E}}_{ij} } \right)^{2} } } } \) is the l2,1-norm of the noise matrix E, which models sample-specific noise. ||Z||1 denotes the L1-norm of the matrix, which ensures the sparsity of Z. The parameter λ balances the effect of the noise term, and the parameter γ balances the effect of the sparsity term. By applying the L1-norm sparsity constraint and adding the error term to handle noise contamination, we can obtain more robust low-rank relationships.
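Problem (17) is convex, and on small problems it can be handed directly to a generic solver as a sanity check of the formulation; practical implementations typically use an inexact ALM/ADMM scheme instead. The sketch below is such a small-scale check and assumes, as in the next paragraph, that the data matrix X itself serves as the dictionary D.

```python
import cvxpy as cp
import numpy as np

def lrr_l1(X, lam=0.1, gamma=0.1):
    """Small-scale sketch of Eq. (17) with D = X:
       min ||Z||_* + lam * ||E||_{2,1} + gamma * ||Z||_1  s.t.  X = X Z + E."""
    d, n = X.shape
    Z = cp.Variable((n, n))
    E = cp.Variable((d, n))
    l21 = cp.sum(cp.norm(E, 2, axis=0))            # column-wise l2,1-norm of E
    l1 = cp.sum(cp.abs(Z))                         # entrywise L1-norm of Z
    objective = cp.Minimize(cp.normNuc(Z) + lam * l21 + gamma * l1)
    problem = cp.Problem(objective, [X == X @ Z + E])
    problem.solve(solver=cp.SCS)
    return Z.value, E.value
```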

According to Eq. (17), with the data matrix X itself taken as the dictionary, we have xi = Xzi + ei, where ei is the i-th column of the noise matrix E. Assuming that xi retains its low-rank structure after linear projection into the low-dimensional feature space, the influence of noise on classification can be reduced by minimizing the representation error ei. Therefore, we minimize the norm of ei as follows:

$$ \sum\nolimits_{i} {\left\| {{\mathbf{A}}^{\text{T}} {\mathbf{e}}_{i} } \right\|^{2} } = \sum\nolimits_{i} {\left\| {{\mathbf{A}}^{\text{T}} \left( {{\mathbf{x}}_{i} - {\mathbf{Xz}}_{i} } \right)} \right\|^{2} } = tr\left( {{\mathbf{A}}^{\text{T}} {\mathbf{S}}_{{\mathbf{e}}} {\mathbf{A}}} \right) $$
(18)

where Se can be computed by:

$$ {\mathbf{S}}_{{\mathbf{e}}} = \sum\nolimits_{i} {\left( {{\mathbf{x}}_{i} - {\mathbf{Xz}}_{i} } \right)\left( {{\mathbf{x}}_{i} - {\mathbf{Xz}}_{i} } \right)^{\text{T}} } = {\mathbf{X}}\left( {{\mathbf{I}} - {\mathbf{Z}}} \right)\left( {{\mathbf{I}} - {\mathbf{Z}}} \right)^{\text{T}} {\mathbf{X}}^{\text{T}} $$
(19)
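Given the coefficients Z, the scatter of Eq. (19) is a one-liner; this illustrative helper is reused in the algorithm sketch of Sect. 3.3.

```python
import numpy as np

def lrr_error_scatter(X, Z):
    """Eq. (19): Se = X (I - Z)(I - Z)^T X^T."""
    IZ = np.eye(Z.shape[0]) - Z
    return X @ IZ @ IZ.T @ X.T
```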

Finally, we integrate the improved graph embedding discriminant criterion and the LRR error function to achieve semi-supervised learning. The discriminant criterion function of our extended method is given as follows:

$$ J_{Ours - SSE} ({\mathbf{A}}) = \mathop {\hbox{min} }\limits_{{\mathbf{A}}} \;\frac{{tr({\mathbf{A}}^{\text{T}} {\mathbf{XLX}}^{\text{T}} {\mathbf{A}}) + \beta tr({\mathbf{A}}^{\text{T}} {\mathbf{S}}_{{\mathbf{e}}} {\mathbf{A}})}}{{tr({\mathbf{A}}^{\text{T}} {\mathbf{XL^{\prime}X}}^{\text{T}} {\mathbf{A}})}} $$
(20)

where the regularization parameter β (β ∈ (0, 1)) balances the contribution of the LRR error term in the discriminant criterion function.

3.3 Algorithm

The dimensionality of face images is often very high, so we first use the PCA projection APCA for dimensionality reduction. Let X = [XL, XU] denote the dataset in the PCA subspace, where XL is the set of labeled samples and XU is the set of unlabeled samples. The proposed algorithm is given as follows:

Step1:

For XL, construct the two adjacency matrices W and B based on Eqs. (12) and (13), and compute the intraclass scatter \( {\mathbf{S}}_{w}^{{}} \) and the interclass scatter \( {\mathbf{S}}_{b}^{{}} \) based on Eqs. (14) and (15).

Step2:

Taking the unlabeled samples in the dataset into account, compute the reconstruction coefficient matrix Z based on Eq. (17), and then compute the matrix Se based on Eq. (19). Let a1, a2, …, ap be the eigenvectors associated with the p minimal eigenvalues of the generalized eigenvalue problem \( ({\mathbf{XLX}}^{\text{T}} + \beta {\mathbf{S}}_{{\mathbf{e}}} ){\mathbf{A}} = \lambda {\mathbf{XL^{\prime}X}}^{\text{T}} {\mathbf{A}} \). The optimal projection matrix learned by our extension method is then AOurs-SSE = [a1, a2, …, ap], and the final projection is A = APCA \( \cdot \) AOurs-SSE.

Step3:

For a test sample x, its representation in the low-dimensional space is y = ATx, and a classifier can then be used to predict its class.
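Putting Steps 1-3 together, an end-to-end sketch might look as follows. It reuses the illustrative helpers from the earlier sketches (proposed_weights, lrr_l1, lrr_error_scatter), uses scikit-learn's PCA for the preprocessing step, and solves Eq. (20) as a generalized eigenvalue problem; all names and defaults are assumptions rather than the authors' code.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.decomposition import PCA

def fit(X_l, labels, X_u, k=4, t=8.0, beta=0.3, pca_dim=50, p=30):
    """Steps 1-3 of the proposed semi-supervised method (illustrative sketch)."""
    X_all = np.hstack([X_l, X_u])                        # d x n, samples as columns
    pca = PCA(n_components=pca_dim).fit(X_all.T)         # PCA preprocessing (A_PCA)
    Xp = pca.transform(X_all.T).T                        # data in the PCA subspace
    n_l = X_l.shape[1]
    # Step 1: graphs and scatters on the labeled samples (Eqs. 12-15)
    W, B = proposed_weights(Xp[:, :n_l], labels, k=k, t=t)
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.diag(B.sum(axis=1)) - B
    Sw = Xp[:, :n_l] @ L @ Xp[:, :n_l].T
    Sb = Xp[:, :n_l] @ Lp @ Xp[:, :n_l].T
    # Step 2: low-rank coefficients on all samples and the error scatter Se (Eqs. 17, 19)
    Z, _ = lrr_l1(Xp)
    Se = lrr_error_scatter(Xp, Z)
    # Eq. (20): p smallest generalized eigenvectors of (Sw + beta*Se) a = lambda * Sb a
    evals, evecs = eigh(Sw + beta * Se, Sb + 1e-6 * np.eye(Sb.shape[0]))
    A = evecs[:, :p]
    return pca, A

def project(pca, A, x):
    """Step 3: embed a test sample x, then feed y to any classifier (NNC, SVM, ...)."""
    return A.T @ pca.transform(x.reshape(1, -1)).T
```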

4 Experiments

In this section, we verify the proposed method with multiple classifiers, i.e., NNC (nearest neighbor classifier), SVM (support vector machine), LRC (linear regression classifier) [9] and LLRC (locality-regularized linear regression classifier) [10], on three public face datasets (ORL, AR and FERET), and in particular evaluate its performance under noise and blur conditions. All experiments are conducted on a computer with an Intel Core i7-7700 3.6 GHz CPU and 8 GB RAM running Windows 7, and are implemented in MATLAB.

4.1 Experiments on ORL Dataset

The ORL face dataset contains 400 images of 40 people. All images are in grayscale and manually cropped to 92 × 112 pixels. On the ORL dataset, we select 5 images of each person as training samples and use the remaining images for testing. Furthermore, the first r (r = 2, 3) images of each person are labeled. NNC, SVM, LRC and LLRC are used for classification. The dimension of the PCA subspace is set to 50. In our method, the interclass neighbor parameter k is 4 and the tuning parameter t of the adjacency matrices is set to 2³. The neighbor parameter k in LPP is 4. For MFA and GDE, the intraclass neighbor parameter k1 is set to 1 and the interclass neighbor parameter k2 is set to 40.

Table 1 reports the maximal recognition rates of different methods on the ORL dataset. It can be seen that the recognition performance of our method is better than that of the other methods under all classifiers, especially GDE. When unlabeled samples exist in the dataset, the introduced L1-norm sparsity constraint makes our extension method more effective in classifying these samples. Furthermore, the regularization parameter β, which weights the terms of the discriminant criterion function, is critical to the recognition performance: too large a value of β degrades the recognition rate. In addition, Fig. 1 shows the recognition rate curves of all methods under different classifiers as the number of projection axes varies. It can easily be seen that all methods perform best under LLRC and that our extension method achieves the best performance, which further demonstrates its effectiveness.

Table 1. Maximal recognition rates (%) and corresponding dimensions (shown in the parentheses) of different methods under four classifiers on ORL dataset.
Fig. 1. Recognition rate curves of all methods under different classifiers when the number of labeled images per class is r = 3. (a) NNC; (b) SVM; (c) LRC; (d) LLRC.

4.2 Experiments on AR Dataset

The AR face dataset contains over 3000 face images of 120 people. All images are manually cropped to 50 × 40 pixels. On the AR dataset, we only use the images without occlusion: the first 7 images of each person are taken for training and images 14–20 are used for testing. Furthermore, the first r (r = 3, 4, 5) images of each person are labeled. NNC, SVM, LRC and LLRC are employed for classification. Before the experiments, the dimension of the PCA subspace is set to 150. The parameters are set as follows: in our method, the interclass neighbor parameter k is 6 and the tuning parameter t of the adjacency matrices is chosen as 2⁴; the neighbor parameter k in LPP is set to 6; for MFA and GDE, the intraclass neighbor parameter k1 is set to 1 and the interclass neighbor parameter k2 is set to 120.

Table 2 reports the maximal recognition rates of different methods on the AR dataset. The results indicate that, compared with GDE, our method brings only a small improvement (<1%) under the four classifiers, but the recognition performance of our extension method has obvious advantages when unlabeled samples exist in the dataset. Moreover, the more labeled samples there are, the better the performance our extension method achieves. In addition, the recognition rate curves of different methods under the various classifiers as the number of projection axes varies are shown in Fig. 2. We can see that LPP performs the worst, whereas with the regularization parameter β set to 0.3 the curves in Fig. 2 show that our extension method performs better than the other methods, which indicates that the L1-norm sparsity constraint we introduced can mitigate the influence of unlabeled samples on the results.

Table 2. Maximal recognition rates (%) and corresponding dimensions of different methods under four classifiers on AR dataset.
Fig. 2. Recognition rate curves of all methods under different classifiers when the number of labeled images per class is r = 5. (a) NNC; (b) SVM; (c) LRC; (d) LLRC.

4.3 Experiments on FERET Dataset

The FERET face dataset contains over 1000 face images of 200 people. All images are in grayscale and manually cropped to 80 × 80 pixels. On the FERET dataset, the first 4 images of each person are used for training and the remaining images for testing. As in the ORL experiment, the first r (r = 2, 3) images are labeled. NNC, SVM, LRC and LLRC are chosen for classification. The dimension of the PCA subspace is set to 150. The parameters of this experiment are set as follows: in our method, the interclass neighbor parameter k is set to 3 and the tuning parameter t of the adjacency matrices is 2³; for LPP, the neighbor parameter k is set to 3; for MFA and GDE, the intraclass neighbor parameter k1 is set to 1 and the interclass neighbor parameter k2 is 200.

Table 3 reports the maximal recognition rates of different methods on the FERET dataset. The overall recognition rates of all methods decrease on this dataset, but compared with the other methods, our method still has certain advantages. Furthermore, Fig. 3 presents the recognition rate curves of all methods under the four classifiers as the number of projection axes varies. The results show that, compared with the other three classifiers, all methods achieve better results with LLRC, and when β is set to 0.3, the curve of our extension method reaches the best performance in most cases, which again validates its effectiveness.

Table 3. Maximal recognition rates (%) and corresponding dimensions of different methods under four classifiers on FERET dataset.
Fig. 3. Recognition rate curves of different methods under different classifiers when the number of labeled samples per class is r = 3. (a) NNC; (b) SVM; (c) LRC; (d) LLRC.

4.4 Experiments Under Noise Condition

To evaluate robustness to noise, in this section Gaussian noise (mean 0, variance 0.05), salt & pepper noise (noise density 0.1) and multiplicative noise (mean 0, variance 0.1) are applied to the face samples of the ORL, AR and FERET datasets. Face samples under the three noise types are shown in Fig. 4. The recognition rates of the different methods on the three datasets are compared in Tables 4, 5 and 6.
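The paper does not state the exact noise generator; the following sketch reproduces comparable corruptions with skimage.util.random_noise, whose parameter names only approximate the settings above and are assumptions.

```python
import numpy as np
from skimage.util import random_noise

def corrupt(img, kind):
    """Apply one of the three noise types of Sect. 4.4 to an image scaled to [0, 1]."""
    img = img.astype(float)
    if kind == "gaussian":                # mean 0, variance 0.05
        return random_noise(img, mode="gaussian", mean=0.0, var=0.05)
    if kind == "salt_pepper":             # noise density 0.1
        return random_noise(img, mode="s&p", amount=0.1)
    if kind == "multiplicative":          # speckle: img + img * n, n ~ N(0, 0.1)
        return random_noise(img, mode="speckle", mean=0.0, var=0.1)
    raise ValueError(f"unknown noise type: {kind}")
```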

Fig. 4. Face samples under three different noise types on the ORL, AR and FERET datasets.

Table 4. Maximal recognition rates (%) of different methods on ORL dataset under different noise conditions (r = 3).
Table 5. Maximal recognition rates (%) of different methods on AR dataset under different noise conditions (r = 5).
Fig. 5. Face samples under the two blur conditions on the ORL, AR and FERET datasets.

Table 6. Maximal recognition rates (%) of different methods on FERET dataset under different noise conditions (r = 3).

It can be seen from Tables 4, 5 and 6 that, under the three noise conditions, our method performs better than LDA, LPP, MFA and GDE, owing to the improved definition of the similarities of intraclass samples. In particular, since the low rank representation with the L1-norm sparsity constraint is introduced, the recognition rate of our extension method shows obvious advantages under the three noise conditions, which further demonstrates its anti-noise ability. Furthermore, as the number of samples increases and the faces become more complex, our method still performs robustly under the three noise conditions.

4.5 Experiments Under Blur Condition

Considering that blurred images may exist in a dataset, in this section motion blur (motion angle 10°, motion distance 5) and defocus blur (radius 5) are applied to the face images of the ORL, AR and FERET datasets. Figure 5 shows face samples under the different blur conditions. The recognition rates of all methods on the three datasets are compared in Tables 7, 8 and 9.
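For reference, the two blur conditions can be simulated with simple convolution kernels; the rasterized motion kernel below only approximates MATLAB's fspecial('motion', 5, 10), and the defocus kernel is a uniform disk of radius 5. These helpers are illustrative, not the exact setup used in the experiments.

```python
import numpy as np
from scipy.ndimage import convolve

def motion_kernel(length=5, angle_deg=10.0):
    """Approximate linear motion-blur kernel (distance 5, angle 10 degrees)."""
    k = np.zeros((length, length))
    c = (length - 1) / 2.0
    a = np.deg2rad(angle_deg)
    for s in np.linspace(-c, c, 5 * length):          # rasterize a short line segment
        k[int(round(c - s * np.sin(a))), int(round(c + s * np.cos(a)))] = 1.0
    return k / k.sum()

def defocus_kernel(radius=5):
    """Uniform disk kernel of radius 5 (defocus blur)."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    k = (x ** 2 + y ** 2 <= radius ** 2).astype(float)
    return k / k.sum()

def blur(img, kernel):
    return convolve(img.astype(float), kernel, mode="nearest")
```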

Table 7. Maximal recognition rates (%) of different methods on ORL dataset under different blur conditions.
Table 8. Maximal recognition rates (%) of different methods on AR dataset under different blur conditions.
Table 9. Maximal recognition rates (%) of different methods on FERET dataset under different blur conditions.

According to the results in Tables 7, 8 and 9, our method has an advantage over the other methods in recognition rate under both blur conditions. Moreover, our extension method further exploits the sparsity-constrained low-rank representation, which makes it perform better than GDE. However, as the number of samples increases and the faces become more complex, the recognition performance degrades. It can be seen from Table 9 that when defocus-blurred images are present, the maximal recognition rates of all methods are low, not exceeding 40%, but our extension method still retains a slight advantage, which indicates its anti-interference ability.

5 Conclusion

In this paper, we propose a novel method to eliminate the influence of outliers in face recognition. It builds the intrinsic graph from the intraclass samples and their within-class mean, and constructs the penalty graph from the perspective of GDE. In addition, a low-rank representation with a sparsity constraint is introduced to explore the global low-rank relationships of unlabeled samples. Experimental results on three datasets under different noise and blur conditions demonstrate the effectiveness and robustness of the proposed method.