1 Introduction

Face recognition (FR) has been an active research topic in computer vision and pattern recognition due to its importance in many real-world applications, such as public security monitoring, access control and card identification. Although face images usually have high dimensionality, they possibly reside on a low-dimensional subspace[1,2]. Many subspace learning methods, such as principal component analysis (PCA)[3], linear discriminant analysis (LDA)[4], locality preserving projection (LPP)[5], marginal Fisher analysis (MFA)[6], local discriminant embedding (LDE)[7] and sparsity preserving discriminant embedding (SPDE)[8], have been proposed for reducing face image dimensionality. Subsequently, a classifier, such as the nearest neighbor classifier (NNC)[9] or support vector machine (SVM), is usually employed for classification. Although subspace learning methods have achieved good accuracy in FR, they are not robust to corrupted face images[10], where the corruption is caused by occlusion, disguise or pixel contamination. In particular, when the number of training samples is small, the learned subspace projections may be wrong or deficient[11].

Recently, a new robust FR framework, sparse representation-based classification (SRC)[12], was presented. SRC sparsely encodes a test sample over a dictionary consisting of the training samples via L1-norm optimization techniques[13-15]. Then, the test sample is classified into the class that yields the minimal reconstruction residual. For test samples suffering from corruption and occlusion, SRC appends an identity matrix as an occlusion dictionary. However, the dimensionality of the identity matrix is usually very high, which renders sparse coding computationally expensive. To solve this issue, Deng et al.[16] proposed an extended SRC (ESRC) method, in which an intra-class variation matrix, obtained by subtracting the class centroid from the samples of the same class, is used as the occlusion dictionary. The dimensionality of the occlusion dictionary used in ESRC is much smaller than that of the occlusion dictionary used in SRC, which greatly improves computational efficiency[16].

Although SRC and its extensions have been successfully used in FR, their working mechanism has not been clearly revealed. Recently, Zhang et al.[17] proposed an efficient FR method, namely collaborative representation-based classification (CRC). They argued that it is the collaborative representation over all training samples, rather than the L1-norm sparsity constraint, that plays the key role in the success of SRC[17]. Therefore, they replaced the L1-norm with the computationally much more efficient L2-norm and showed that CRC achieves FR performance competitive with SRC. Naseem et al.[18] presented another L2-norm based classification method, namely linear regression-based classification (LRC), built on the assumption that samples from the same subject reside on a linear subspace. LRC linearly encodes a test sample over the class-specific training samples by least squares and classifies the test sample based on the class-specific reconstruction residuals.

Recently, Liu et al.[19] presented a low-rank representation (LRR) method to reveal the global structure of data that are drawn from multiple subspaces. Lu et al.[20] incorporated a graph regularization term into the LRR to encode the local structure information of data. Ma et al.[21] proposed a discriminative low-rank dictionary learning algorithm for sparse representation. Zhang et al.[22] proposed a structured low-rank representation method for image classification, in which they used the discriminative low-rank representations of training data to learn a dictionary and design a linear multi-classifier.

Inspired by the above methods, we present a low-rank sparse representation-based classification (LRSRC) method for robust FR. Given a set of test samples, LRSRC seeks the lowest-rank and sparsest representation matrix over all the training samples. As observed in [19], a low-rank model can reveal the subspace structures of data, while sparsity helps to identify the data class. Since LRSRC imposes both the lowest-rank and sparsest constraints on the representation matrix, the obtained test sample representations are both representative and discriminative. Using the representation vector of a test sample, LRSRC assigns the test sample to the class that yields the minimal reconstruction error. Moreover, LRSRC can efficiently deal with corrupted test samples owing to its accurate corruption estimation. Although our work is related to [21, 22], there are some important differences: 1) our work focuses on the low-rankness and sparsity of the representation matrix of the test samples, whereas [21], as a dictionary learning method, mainly emphasizes the low-rankness of the learned dictionary; 2) our work directly uses the training data matrix as the dictionary to compute the lowest-rank and sparsest representation matrix of the test samples, and classifies them based on minimal reconstruction errors, whereas [22] mainly uses the discriminative low-rank representations of the training data to learn a dictionary and design a linear multi-classifier for image classification.

The rest of the paper is outlined as follows. In Section 2, we briefly review some related works, including sparse representation-based classification (SRC) and low-rank representation (LRR). Section 3 presents the proposed LRSRC method for robust FR in detail. Experimental results and corresponding discussions are presented in Section 4. Finally, conclusions are drawn in Section 5.

2 Related work

2.1 Sparse representation based classification

The sparse representation-based classification (SRC) method was proposed in [12] for robust face recognition. Suppose there are C classes of training samples, and let \({X_i} = [{x_{i1}},{x_{i2}}, \cdots ,{x_{i{N_i}}}] \in {{\bf{R}}^{D \times {N_i}}}\) be the matrix formed by the Ni training samples of the i-th class, in which xij is the column vector produced by the j-th sample of the i-th class. Then X = [X1, X2, ⋯, XC] ∈ RD×N is the matrix of all training samples, where \(N = \sum\nolimits_{i = 1}^C {{N_i}}\) is the total number of training samples. Using X as the dictionary, a test sample y can be approximately represented as a linear combination of the columns of X, i.e., y = Xa, where a is the representation coefficient vector of y. If D < N, the linear system y = Xa is underdetermined, so the solution a is not unique. Since the test sample y can be adequately represented by the training samples from the correct class, the representation is naturally sparse when the number of training samples is large enough. SRC claims that the sparser the representation coefficient vector a is, the easier it is to determine the test sample's class label[12]. This motivates finding the sparsest solution of y = Xa by solving the following optimization problem:

$$\hat a = \mathop {\arg \min}\limits_a {\left\Vert {y - Xa} \right\Vert _2} + \lambda {\left\Vert a \right\Vert _1}$$
(1)

where λ is a scalar constant. The optimization problem (1) can be efficiently solved by many algorithms, such as basis pursuit[13], l1_ls[14] and the alternating direction algorithm[15].
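For concreteness, the following is a minimal numpy sketch of one such solver, the iterative soft-thresholding algorithm (ISTA), applied to the common squared-residual form of (1), i.e., min_a (1/2)∥y − Xa∥2^2 + λ∥a∥1; the function name and the fixed iteration count are illustrative choices, not part of the original SRC formulation.

```python
import numpy as np

def ista_l1(X, y, lam=0.01, n_iter=500):
    """Sketch of ISTA for min_a 0.5 * ||y - X a||_2^2 + lam * ||a||_1."""
    a = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2                # step 1/L with L = ||X||_2^2
    for _ in range(n_iter):
        grad = X.T @ (X @ a - y)                          # gradient of the smooth term
        z = a - step * grad                               # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return a
```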

Having obtained the sparsest solution â, let δi(â) be the vector whose only nonzero entries are the entries in â that are associated with class i[12]. Using δi(â), the test sample y can be reconstructed as ŷi = Xδi(â). Then, the test sample y is classified into the class that minimizes the reconstruction error between y and ŷi:

$${\rm{identity}}(y) = \arg \mathop {\min}\limits_i {\left\Vert {y - X{\delta _i}(\hat a)} \right\Vert _2}.$$
(2)
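As an illustration of the decision rule (2), the following sketch assumes the sparse code â has already been computed (e.g., by the ISTA sketch above) and that the class labels of the training columns are supplied as an integer array; the helper name src_classify is ours.

```python
import numpy as np

def src_classify(X, labels, y, a_hat):
    """Assign y to the class whose coefficients best reconstruct it, as in Eq. (2)."""
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        delta_c = np.where(labels == c, a_hat, 0.0)   # keep only the class-c entries of a_hat
        residuals.append(np.linalg.norm(y - X @ delta_c))
    return classes[int(np.argmin(residuals))]
```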

In real-world applications, the observations are usually corrupted or occluded. A test sample y is rewritten as

$$y = {y_0} + e = Xa + e = [X,\;I]\left[ {\matrix{a \hfill \cr e \hfill \cr}} \right] = Bw$$
(3)

where B = [X, I] ∈ RD×(N+D) and I ∈ RD×D is the identity matrix. The clean test sample y0 and the corruption error e can be sparsely encoded over the dictionary X and the occlusion dictionary I, respectively. The sparse solution ŵ = [â, ê]T can be computed by solving an optimization problem similar to problem (1). Then, the test sample y is classified by the following decision rule:

$${\rm{identity}}(y) = \arg \mathop {\min}\limits_i {\left\Vert {y - X{\delta _i}(\hat a) - \hat e} \right\Vert _2}.$$
(4)
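A sketch of this occlusion-handling variant, assuming any L1 solver with the same calling convention as the ISTA sketch above; the helper name and the way labels are passed are our own conventions.

```python
import numpy as np

def occlusion_src_classify(X, labels, y, l1_solver, lam=0.01):
    """SRC with the augmented dictionary B = [X, I], following (3) and (4)."""
    D, N = X.shape
    B = np.hstack([X, np.eye(D)])                  # occlusion dictionary appended to X
    w = l1_solver(B, y, lam)                       # w stacks the code a and the error e
    a_hat, e_hat = w[:N], w[N:]
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - X @ np.where(labels == c, a_hat, 0.0) - e_hat)
                 for c in classes]
    return classes[int(np.argmin(residuals))]
```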

2.2 Low-rank representation

Recently, Liu et al.[19] presented a low-rank representation (LRR) method to recover the subspace structures from data that reside on a union of multiple subspaces. LRR seeks the lowest-rank representation matrix Z of data matrix X over a given dictionary A. The rank minimization problem is formulated as

$$\mathop {\min}\limits_{Z,E} \;{\rm{rank}}(Z) + \lambda {\left\Vert E \right\Vert _l},\quad {\rm{s}}.{\rm{t}}.\quad X = AZ + E$$
(5)

where λ is a scalar constant, and ∥E∥l denotes a regularization term chosen to characterize different types of corruption, e.g., ∥E∥0 for random corruption and ∥E∥2,0 for sample-specific corruption.

If one only considers the case in which the entries of the data matrix X are randomly corrupted, the rank minimization problem can be formulated as

$$\mathop {\min}\limits_{Z,E} \;{\rm{rank}}(Z) + \lambda {\left\Vert E \right\Vert _0},\quad {\rm{s}}.{\rm{t}}.\quad X = AZ + E.$$
(6)

As the optimization problem (6) is NP-hard and computationally expensive to solve, Liu et al.[19] replaced the rank function with the nuclear norm and the L0-norm with the L1-norm. The optimization problem (6) is thus relaxed to

$$\mathop {\min}\limits_{Z,E} {\left\Vert Z \right\Vert _{\ast}}+\lambda {\left\Vert E \right\Vert _1},\quad {\rm{s}}.{\rm{t}}.\quad X = AZ + E$$
(7)

where ∥Z∥* denotes the nuclear norm (i.e., the sum of the singular values) of Z. Given an appropriate dictionary A (in [19], the data matrix X itself is used as the dictionary), the augmented Lagrange multiplier (ALM) algorithm[23] can be used to solve the optimization problem (7). Having obtained the optimal solution (Z*, E*), the representation matrix Z* can be used to recover the underlying subspace structures of the data X.
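As a small numerical check of the relaxation in (7), the nuclear norm is simply the sum of singular values, which numpy exposes directly; the matrix here is random and purely illustrative.

```python
import numpy as np

Z = np.random.randn(6, 4)
nuclear = np.linalg.norm(Z, ord='nuc')                        # ||Z||_*
assert np.isclose(nuclear, np.linalg.svd(Z, compute_uv=False).sum())
```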

3 Low-rank sparse representation based classification

In this section, we present the low-rank sparse representation-based classification method for robust face recognition. LRSRC first seeks the lowest-rank and sparsest representation matrix of the test samples. Then, using the representation vector of a test sample, LRSRC assigns the test sample to the class that yields the minimal reconstruction error.

3.1 Low-rank sparse representation

Let the training data matrix X = [X1, X2, ⋯, XC] ∈ RD×N be composed of C classes of training samples, and let Y = [y1, y2, ⋯, yM] ∈ RD×M be the data matrix of all test samples. Using X as the dictionary, a test sample yi can be approximately represented as a linear combination of the columns of X, i.e., yi = Xzi, where zi is the representation coefficient vector of yi. Thus, Y = XZ, where Z = [z1, z2, ⋯, zM] ∈ RN×M is the representation matrix of Y.

SRC[12] imposes the sparsest constraint on the representation vector to make it discriminative. However, it does not consider the subspace structures of the data, which are essential for classification tasks. As shown in [19], a low-rank model can reveal the subspace structures of data. Therefore, we impose both the lowest-rank and sparsest constraints on the representation matrix Z and formulate the optimization problem as

$$\mathop {\min}\limits_Z \;{\rm{rank}}(Z) + \beta {\left\Vert Z \right\Vert _0},\quad {\rm{s}}.{\rm{t}}.\quad Y = XZ$$
(8)

where β is a scalar constant to balance low-rankness and sparsity.

As the optimization problem (8) is non-convex, we substitute the L1-norm for the L0-norm and the nuclear norm for the rank function[19]. Therefore, the optimization problem (8) is relaxed as

$$\mathop {\min}\limits_Z {\left\Vert Z \right\Vert _{\ast}}+\beta {\left\Vert Z \right\Vert _1},\quad {\rm{s}}.{\rm{t}}.\quad Y = XZ.$$
(9)

In many real-world applications, face images are usually corrupted or occluded. Thus, the test data matrix can be rewritten as Y = Y0 + E = XZ + E, in which Y0 is the clean test data matrix and E is the error matrix. The optimization problem is rewritten as

$$\matrix{{\,\mathop {\min}\limits_{Z,E} {{\left\Vert Z \right\Vert}_{\ast}}+\beta {{\left\Vert Z \right\Vert}_1} + \lambda {{\left\Vert E \right\Vert}_1}} \hfill \cr{{\rm{s}}.{\rm{t}}.\quad Y = XZ + E} \hfill \cr}$$
(10)

where λ is a scalar parameter.

3.2 Optimization

By introducing an auxiliary variable W, the problem (10) can be converted into an equivalent optimization problem as

$$\matrix{{\,\mathop {\min}\limits_{Z,E} {{\left\Vert Z \right\Vert}_{\ast}}+\beta {{\left\Vert W \right\Vert}_1} + \lambda {{\left\Vert E \right\Vert}_1}} \hfill \cr{{\rm{s}}.{\rm{t}}.\quad Y = XZ + E,Z = W.} \hfill \cr}$$
(11)

The optimization problem (11) can be solved by the linearized alternating direction method with adaptive penalty (LADMAP)[24-26]. We give the augmented Lagrange function of problem (11) as

$$\matrix{{L(Z,W,E,{T_1},{T_2},\mu) = {{\left\Vert Z \right\Vert}_\ast} + \beta {{\left\Vert W \right\Vert}_1} + \lambda {{\left\Vert E \right\Vert}_1} +} \hfill \cr{\qquad \left\langle {{T_1},Y - XZ - E} \right\rangle + \left\langle {{T_2},Z - W} \right\rangle +} \hfill \cr{\qquad {\mu \over 2}(\left\Vert {Y - XZ - E} \right\Vert _F^2 + \left\Vert {Z - W} \right\Vert _F^2)} \hfill \cr}$$
(12)

where T1 and T2 are Lagrange multipliers, and µ > 0 is a penalty parameter. With some simple algebra, the above augmented Lagrangian function can be rewritten as

$$\matrix{{L(Z,W,E,{T_1},{T_2},\mu) = {{\left\Vert Z \right\Vert}_\ast} + \beta {{\left\Vert W \right\Vert}_1} + \lambda {{\left\Vert E \right\Vert}_1} +} \hfill \cr{\qquad h(Z,W,E,{T_1},{T_2},\mu) - {1 \over {2\mu}}(\left\Vert {{T_1}} \right\Vert _F^2 + \left\Vert {{T_2}} \right\Vert _F^2)} \hfill \cr}$$
(13)

where

$$\matrix{{h(Z,W,E,{T_1},{T_2},\mu) =} \hfill \cr{\quad {\mu \over 2}\left({\left\Vert {Y - XZ - E + {{{T_1}} \over \mu}} \right\Vert _F^2 + \left\Vert {Z - W + {{{T_2}} \over \mu}} \right\Vert _F^2} \right).} \hfill \cr}$$
(14)

We update the variables Z, W and E alternately, minimizing the function L with the other two variables fixed. In each iteration, the update rules are given as

$$\matrix{{{Z_{k + 1}} = \mathop {\arg \min}\limits_Z {{\left\Vert Z \right\Vert}_\ast} +} \hfill \cr{\quad \left\langle {{\nabla _Z}h({Z_k},{W_k},{E_k},{T_{1,k}},{T_{2,k}},{\mu _k}),Z - {Z_k}} \right\rangle +} \hfill \cr{\quad {{{\mu _k}L} \over 2}\left\Vert {Z - {Z_k}} \right\Vert _F^2 =} \hfill \cr{\quad \mathop {\arg \min}\limits_Z {{\left\Vert Z \right\Vert}_\ast} + {{{\mu _k}L} \over 2}\left\Vert {Z - {A_k}} \right\Vert _F^2 =} \hfill \cr{\quad {\Theta _{{{({\mu _k}L)}^{- 1}}}}({A_k})} \hfill \cr}$$
(15)

where

$$\matrix{{{A_k} = {Z_k} + {X^{\rm{T}}}{{Y - X{Z_k} - {E_k} + {{{T_{1,k}}} \over {{\mu _k}}}} \over L} -} \hfill \cr{\qquad {{{Z_k} - {W_k} + {{{T_{2,k}}} \over {{\mu _k}}}} \over L}.} \hfill \cr}$$

\(L = \left\Vert X \right\Vert _2^2\), Θ is the singular value thresholding operator[27], and ∇Zh is the partial derivative of the function h with respect to Z.
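A minimal sketch of the singular value thresholding operator Θτ used in (15); it shrinks the singular values of its argument by τ and discards those that become non-positive. The function name is ours.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: Theta_tau(A) = U diag(max(s - tau, 0)) V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```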

$$\matrix{{{W_{k + 1}} = \mathop {\arg \min}\limits_W \beta {{\left\Vert W \right\Vert}_1} +} \hfill \cr{\qquad {{{\mu _k}} \over 2}\left\Vert {{Z_{k + 1}} - {W_k} + {{{T_{2,k}}} \over {{\mu _k}}}} \right\Vert _F^2 =} \hfill \cr{\qquad {S_{\beta \mu _k^{- 1}}}\left({{Z_{k + 1}} + {{{T_{2,k}}} \over {{\mu _k}}}} \right)} \hfill \cr}$$
(16)
$$\matrix{{{E_{k + 1}} = \mathop {\arg \min}\limits_E \lambda {{\left\Vert E \right\Vert}_1} +} \hfill \cr{\qquad {{{\mu _k}} \over 2}\left\Vert {Y - X{Z_{k + 1}} - E + {{{T_{1,k}}} \over {{\mu _k}}}} \right\Vert _F^2 =} \hfill \cr{\qquad {S_{\lambda \mu _k^{- 1}}}\left({Y - X{Z_{k + 1}} + {{{T_{1,k}}} \over {{\mu _k}}}} \right).} \hfill \cr}$$
(17)

In (16) and (17), S is the soft-thresholding operator[28]. The complete procedure is outlined in Algorithm 1.
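For completeness, a one-line sketch of the entry-wise soft-thresholding operator Sτ appearing in (16) and (17); the function name is ours.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entry-wise shrinkage: S_tau(A)_ij = sign(A_ij) * max(|A_ij| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)
```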

3.3 Classification

Given a training data matrix X = [X1, X2, ⋯, XC] ∈ RD×N and a test data matrix Y = [y1, y2, ⋯, yM] ∈ RD×M, we can find the lowest-rank and sparsest representation matrix Ẑ = [ẑ1, ẑ2, ⋯, ẑM] and the error matrix Ê = [ê1, ê2, ⋯, êM] of Y by Algorithm 1. Let δi(ẑj) be the vector whose only nonzero entries are the entries in ẑj that are associated with class i[12]. Using δi(ẑj), the test sample yj can be reconstructed as \(\hat y_j^i = X{\delta _i}({{\hat z}_j}) + {{\hat e}_j}\). Then, the test sample yj is classified into the class that minimizes the reconstruction error between yj and \(\hat y_j^i\):

$${\rm{identity}}({y_j}) = \arg {\min _i}{\left\Vert {{y_j} - X{\delta _i}({{\hat z}_j}) - {{\hat e}_j}} \right\Vert _2}.$$
(18)

The proposed LRSRC algorithm is summarized in Algorithm 2.

Algorithm 1. LADMAP algorithm for low-rank sparse representation

  • Input: Training data matrix X = [X1, X2, ⋯, XC] ∈ RD×N, test data matrix Y = [y1, y2, ⋯, yM] ∈ RD×M, and parameters λ > 0, β > 0.

  • Initialize: \({Z_0} = {W_0} = {E_0} = {T_{1,0}} = {T_{2,0}} = 0,{\mu _0} = 0.1,{\mu _{\max}} = {10^{10}},\rho = 1.1,{\varepsilon _1} = {10^{- 2}},{\varepsilon _2} = {10^{- 6}},L = \left\Vert X \right\Vert_2^2,{\rm{maxiter}} = 1000,k = 0\).

  while not converged and k ≤ maxiter do

    1) Update the variable Z according to (15).

    2) Update the variable W according to (16).

    3) Update the variable E according to (17).

    4) Update the Lagrange multipliers:

      $$\matrix{{{T_{1,k + 1}} = {T_{1,k}} + {\mu _k}(Y - X{Z_{k + 1}} - {E_{k + 1}})} \hfill \cr {{T_{2,k + 1}} = {T_{2,k}} + {\mu _k}({Z_{k + 1}} - {W_{k + 1}}).} \hfill \cr}$$

    5) Update the parameter μ: μk+1 = min (ρμk, μmax).

    6) Check the convergence condition:

      $${{{\mu _{k + 1}}\max \left({\sqrt L {{\left\Vert {{Z_{k + 1}} - {Z_k}} \right\Vert}_F},{{\left\Vert {{W_{k + 1}} - {W_k}} \right\Vert}_F},{{\left\Vert {{E_{k + 1}} - {E_k}} \right\Vert}_F}} \right)} \over {{{\left\Vert Y \right\Vert}_F}}} < {\varepsilon _1}\;{\rm{and}}\;{{{{\left\Vert {Y - X{Z_{k + 1}} - {E_{k + 1}}} \right\Vert}_F}} \over {{{\left\Vert Y \right\Vert}_F}}} < {\varepsilon _2}.$$

    7) k ← k + 1

  end while

  • Output: An optimal solution (Z k , W k , E k ).
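The following is a minimal numpy sketch of Algorithm 1 as we read it, repeating the svt and soft_threshold helpers sketched in Section 3.2 so the block is self-contained. The default parameter values mirror the initialization above; this is an illustration of the updates (15)-(17), not a reference implementation.

```python
import numpy as np

def soft_threshold(A, tau):
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def lrsr_ladmap(X, Y, lam=10.0, beta=0.09, mu=0.1, mu_max=1e10, rho=1.1,
                eps1=1e-2, eps2=1e-6, max_iter=1000):
    """Sketch of LADMAP for min ||Z||_* + beta ||Z||_1 + lam ||E||_1 s.t. Y = XZ + E."""
    D, N = X.shape
    M = Y.shape[1]
    Z, W = np.zeros((N, M)), np.zeros((N, M))
    E, T1, T2 = np.zeros((D, M)), np.zeros((D, M)), np.zeros((N, M))
    L = np.linalg.norm(X, 2) ** 2                          # L = ||X||_2^2
    nY = np.linalg.norm(Y, 'fro')
    for _ in range(max_iter):
        # Step 1: update Z via singular value thresholding of the linearized subproblem (15)
        A_k = Z + (X.T @ (Y - X @ Z - E + T1 / mu) - (Z - W + T2 / mu)) / L
        Z_new = svt(A_k, 1.0 / (mu * L))
        # Steps 2-3: update W and E by entry-wise soft-thresholding, (16) and (17)
        W_new = soft_threshold(Z_new + T2 / mu, beta / mu)
        E_new = soft_threshold(Y - X @ Z_new + T1 / mu, lam / mu)
        # Step 4: update the Lagrange multipliers
        T1 = T1 + mu * (Y - X @ Z_new - E_new)
        T2 = T2 + mu * (Z_new - W_new)
        # Step 5: update the penalty parameter
        mu = min(rho * mu, mu_max)
        # Step 6: convergence check (relative change and feasibility)
        change = max(np.sqrt(L) * np.linalg.norm(Z_new - Z, 'fro'),
                     np.linalg.norm(W_new - W, 'fro'),
                     np.linalg.norm(E_new - E, 'fro'))
        feas = np.linalg.norm(Y - X @ Z_new - E_new, 'fro')
        Z, W, E = Z_new, W_new, E_new
        if mu * change / nY < eps1 and feas / nY < eps2:
            break
    return Z, E
```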

Algorithm 2. Low-rank sparse representation-based classification

  • Input: Training data matrix X = [X1, X2, ⋯, X C ] ∈ RD×N belonging to C classes, test data matrix Y = [y1, y2, ⋯, y M ] ∈ RD×M, parameters λ > 0, β > 0.

    1) Normalize the columns of X to have unit L2-norm.

    2) Solve the following optimization problem by Algorithm 1 to obtain the representation matrix Ẑ and the error matrix Ê of the test data matrix Y:

      $$\mathop {\min}\limits_{Z,E} {\left\Vert Z \right\Vert_\ast} + \beta {\left\Vert Z \right\Vert_1} + \lambda {\left\Vert E \right\Vert_1},\quad {\rm{s}}.{\rm{t}}.\quad Y = XZ + E.$$

    3) for j = 1: M do

      $${\rm{identity}}({y_j}) = \arg {\min _i}{\left\Vert {{y_j} - X{\delta _i}({{\hat z}_j}) - {{\hat e}_j}} \right\Vert_2},\quad i = 1, \cdots ,C.$$

      end for

  • Output: Class labels of all the test samples.
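Below is a sketch of Algorithm 2 built on top of the lrsr_ladmap sketch given after Algorithm 1; the label handling and function names are our own conventions, not part of the paper.

```python
import numpy as np

def lrsrc_classify(X, train_labels, Y, lam=10.0, beta=0.09):
    """Sketch of LRSRC: normalize X, solve for (Z, E), then apply the rule in (18)."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)    # unit L2-norm columns
    Z, E = lrsr_ladmap(Xn, Y, lam=lam, beta=beta)        # helper sketched after Algorithm 1
    classes = np.unique(train_labels)
    preds = []
    for j in range(Y.shape[1]):
        residuals = [np.linalg.norm(Y[:, j]
                                    - Xn @ np.where(train_labels == c, Z[:, j], 0.0)
                                    - E[:, j])
                     for c in classes]
        preds.append(classes[int(np.argmin(residuals))])
    return np.array(preds)
```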

4 Experiments

We perform experiments on the Extended Yale B[29], CMU PIE[30] and AR[31] databases to evaluate the performance of the proposed LRSRC method. We compare LRSRC with several state-of-the-art methods, including the nearest neighbor classifier (NN)[9], SRC[12], ESRC[16], CRC[17] and LRC[18]. Note that PCA is used to reduce the dimensionality of the face images, and the resulting Eigenface features are used in all the methods.
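For reference, the feature-extraction step shared by all methods can be sketched as follows, using scikit-learn's PCA as a stand-in for the Eigenface projection; the function name and the rows-as-images layout are our assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def eigenface_features(train_imgs, test_imgs, dim=200):
    """Project vectorized face images (one per row) onto the first `dim` eigenfaces."""
    pca = PCA(n_components=dim)
    train_feat = pca.fit_transform(train_imgs)   # fit the eigenfaces on training images only
    test_feat = pca.transform(test_imgs)
    return train_feat.T, test_feat.T             # columns as samples, matching X and Y above
```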

4.1 Extended Yale B database

The Extended Yale B database[29] consists of 2414 face images of 38 individuals. Each individual has about 64 images, taken under various laboratory-controlled lighting conditions. Fig. 1 (a) shows several example images of one subject in the Extended Yale B database. In our experiments, each image is manually cropped and resized to 32 × 32 pixels. We randomly select l (= 16, 24, 32) samples per individual for training and use the remainder for testing. For each given l, we independently run all the methods 5 times and report the average recognition rates over the 5 runs. The Eigenface feature dimensions are set to 50, 100, 150, 200, 300 and 500 for comparing recognition rates across the methods. The LRSRC parameters are set to λ = 10, with β = 0.09, 0.1 and 0.08 for 16, 24 and 32 training samples per individual, respectively. Fig. 2 shows the curves of recognition rate versus feature dimension. Table 1 gives the best recognition rates of all the methods.

Fig. 1 Example images of the three databases

Fig. 2 Recognition rate versus feature dimension of all the methods on the Extended Yale B database

Table 1 The best recognition rates (%) and the corresponding dimensions (in parentheses) of all the methods on the Extended Yale B database

From Fig. 2, one can see that our method outperforms all the other methods at most feature dimensions for all three training set sizes. For example, at a feature dimension of 200, our method improves the recognition rate by more than 3% over all the other methods for the three training sets. Meanwhile, Table 1 shows that the best recognition rates of our method are always higher than those of the other methods. In particular, our method achieves a best recognition rate of 94.87% when 16 samples per individual are used for training, while SRC only reaches a best recognition rate of 94.26% even when 32 samples per individual are used for training.

This indicates that our method can achieve good recognition performance with a smaller number of training samples. The experimental results confirm the effectiveness and robustness of our method even when face images are corrupted by illumination variations.

4.2 CMU PIE database

The CMU PIE database[30] contains over 40 000 face images of 68 individuals. Images of each individual were acquired across 13 different poses, under 43 different illumination conditions, and with 4 different expressions. Fig. 1 (b) shows some example images of one subject in the CMU PIE database. Here, we use a near-frontal pose subset, namely C07, which contains 1629 images of 68 individuals, each with about 24 images. In our experiments, all images are manually cropped and resized to 32 × 32 pixels. A random subset with l (= 8, 10, 12) samples per individual is selected for training, and the rest are used for testing. For each given l, we independently run all the methods 5 times and report the average recognition rates over the 5 runs. The Eigenface feature dimensions are set to 50, 75, 100, 150, 200 and 300 for comparing recognition rates across the methods. The LRSRC parameters are set to λ = 10 and β = 0.09 in these experiments. The curves of recognition rate versus feature dimension are presented in Fig. 3. The best recognition rates of all the methods are shown in Table 2.

Fig. 3 Recognition rate versus feature dimension of all the methods on the CMU PIE database

Table 2 The best recognition rates (%) and the corresponding dimensions (in parentheses) of all the methods on the CMU PIE database

From Fig. 3, one can clearly see that our method outperforms all the other methods at most feature dimensions for the three training sets. Specifically, the recognition rates of our method are about 1.5% higher than those of the other methods at most dimensions for 8 and 10 training samples per individual, and about 1% higher for 12 training samples per individual. Furthermore, Table 2 shows that the best recognition rate of our method is always higher than those of the other methods for each training set. In addition, Table 2 also shows that SRC achieves a best recognition rate of 93.78% when 12 samples per individual are used for training, which is still lower than that of our method even when only 8 samples per individual are used for training. This again indicates that our method can obtain good recognition performance with a smaller number of training samples. These results further verify the effectiveness and robustness of our method in dealing with face images corrupted by illumination variations and expression changes.

4.3 AR database

The AR database[31] includes over 4000 color face images of 126 individuals. For each individual, 26 images were taken in 2 separate sessions. In each session, there are 13 images per person, among which 3 are occluded by sunglasses, 3 by scarves, and the remaining 7 exhibit only illumination and expression variations (and are thus referred to as neutral images). All images are cropped to 50 × 40 pixels and converted to gray scale. As in [10], we choose a subset containing 50 male subjects and 50 female subjects for the experiments. Fig. 1 (c) shows all the images of one subject in the AR database. Following the scheme in [32], we consider 3 scenarios to evaluate the performance of our method.

Sunglasses: In this scenario, we consider the situation in which both the training and test samples are corrupted by sunglass occlusion. Sunglasses occlude about 20% of the face image. We use 7 neutral images and 1 image with sunglasses from the first session for training, and 7 neutral images from the second session together with the remaining 5 images with sunglasses from both sessions for testing.

Scarf: In this scenario, we consider corruption of both the training and test images caused by scarf occlusion. The scarf occludes around 40% of the face image. Similarly, we choose 8 training images, consisting of 7 neutral images plus 1 image with a scarf from the first session, and 12 test images, consisting of 7 neutral images from the second session and the remaining 5 images with scarves from both sessions.

Mixed (Sunglasses + Scarf): In this scenario, we consider the case in which both the training and test images are occluded by sunglasses and scarves. We select 7 neutral images plus 2 occluded images (one with sunglasses and one with a scarf) from the first session for training, and the remaining 17 images from both sessions for testing.

In this experiment, the Eigenface feature dimensions are set to 50, 100, 150, 200, 300 and 500 for comparing recognition rates across the methods. The LRSRC parameters are set to λ = 10 and β = 0.09. Fig. 4 gives the recognition results for the 3 scenarios. Table 3 summarizes the best recognition rates of all the methods under the 3 scenarios.

Fig. 4 Recognition rate curves of all the methods on the AR database

Table 3 The best recognition rates (%) and the corresponding dimensions (in parentheses) of all the methods on the AR database

From Fig. 4, one can see that the proposed method consistently gives the highest recognition rate at most feature dimensions under the 3 scenarios. At the same time, Table 3 shows that the best recognition rate of our method is the highest for each scenario. Specifically, our method improves the best recognition rate over the strongest competitor, SRC, by more than 1% in the Sunglasses scenario and by about 3.5% in the Scarf and Mixed scenarios. Moreover, Table 3 also shows that our method outperforms the other 4 methods under all 3 scenarios. These results demonstrate the effectiveness and robustness of our method in processing occluded face images.

5 Conclusions

In this paper, we have presented a low-rank sparse representation-based classification (LRSRC) method for robust face recognition. LRSRC efficiently finds the lowest-rank and sparsest representation matrix of a set of test samples over all the training samples, and then classifies each test sample using its representation vector. Experimental results on the Extended Yale B, CMU PIE and AR databases confirm that the proposed method is effective and robust, even when face images are corrupted by illumination variations, expression changes, disguises or occlusions. It should be pointed out that LRSRC performs classification on the entire test set jointly. In future work, we will investigate how to extend LRSRC to the single-test-sample scenario.