1 Introduction

Because of the importance of face recognition, face recognition techniques have been widely studied in pattern recognition [1, 2], image processing [3], and machine learning [4, 5]. Over the last decade, many algorithms have been proposed for face recognition, such as Fisherface [6], Eigenface [7], and graph manifold learning [8]. In face recognition, appearance-based methods represent an image of size n × m by a vector in the n × m dimensional space and then produce lower-dimensional features of the face image for better classification [9]. In face recognition applications, these n × m dimensional spaces are too large to allow robust and fast recognition. Dimensionality reduction is a common way to address this problem, and many dimensionality reduction methods have been proposed to produce lower-dimensional features, such as principal component analysis (PCA) [10], linear discriminant analysis (LDA) [11], and 2D-PCA [12]. PCA performs dimensionality reduction by projecting the original data onto a much lower-dimensional linear subspace spanned by the leading eigenvectors of the covariance matrix of the original data. LDA searches for the projection axes on which data points of the same class are close to each other while data points of different classes are far from each other. The basic idea of 2D-PCA is to use 2D matrices directly to represent face images, which improves computational efficiency and increases the face recognition rate because of the preservation of the image structural information.

Sparsity has been a useful principle in neuroscience, information theory, and signal processing over the past few decades [13–15]. Sparse representation encodes an image parsimoniously, using only a small number of atoms chosen from an overcomplete dictionary. It has been developed in computer vision and pattern recognition with promising results [16]. The sparse representation classifier (SRC) [17] has been successfully used for robust face recognition. SRC is a nonparametric learning method similar to nearest neighbor (NN) [18] and nearest subspace (NS) [19].

The basic idea of SRC is to represent a test sample as a linear combination of all training samples and then classify the test sample into the class with the minimal reconstruction error. SRC uses \(l_{1}\) minimization instead of \(l_{0}\) minimization to seek the sparse solution, for computational convenience and efficiency.

Owing to the effectiveness of SRC in face recognition, many extensions have been proposed. In [20], Gao et al. proposed kernel sparse representation, which performs sparse coding in a mapped high-dimensional feature space. In [21], the authors considered a more general case in which the features lie in a union of low-dimensional linear subspaces. He et al. [22] incorporated a non-negativity constraint into the sparse graph to learn the probabilistic latent clustering relationship between data points. Yang et al. proposed robust sparse representation [23].

Despite the many applications of SRC, \(l_{1}\) minimization does not always yield a sufficiently sparse solution. In [24], the authors proposed \(l_{p}\) (0 < p < 1) sparse representation-based classification (\(l_{p}\)-SRC) to seek the optimal sparse representation of a test image. However, how to choose the optimal parameter p is still an open problem. If we could solve the sparse system directly, without iterative steps, we could save much time and improve the classification accuracy.

In this paper, we propose a novel method, named enhancing sparsity via full rank decomposition (ES-FRD), for face recognition. As in sparse representation, it first represents the test sample as a linear combination of the training data; it then performs a full rank decomposition of the training data matrix. From this decomposition, we can obtain the generalized inverse of the training data matrix and solve the general solution of the linear equation directly. To obtain the optimum solution for representing the test sample, we use the least squares method. We classify the test sample into the class with the minimal reconstruction error. The contributions of the proposed method are as follows:

1. Computational efficiency. By introducing the general solution of the linear equation, we obtain the optimum solution directly. Since no norm minimization needs to be solved, the proposed method is more efficient than SRC.

2. Closed-form solution. To our knowledge, the state-of-the-art SRC algorithms have no closed-form solution to the norm minimization problem, whereas the proposed method admits one.

3. Sparser solutions. Our method introduces full rank decomposition, which factorizes the image database into two low-rank matrices. This helps to obtain a much sparser solution.

The rest of this paper is organized as follows: We review the related works on sparse representation in Sect. 2. In Sect. 3, we give the details of the proposed method. Section 4 gives the experimental results on three public face image data sets. Finally, we conclude the paper in Sect. 5.

2 Related work

In this section, we briefly review SRC and \(l_{p}\) (0 < p < 1)-SRC.

2.1 Sparse representation classification (SRC)

Assume that there are n training samples from c object classes and let \(A = [A_{1} ,A_{2} , \ldots ,A_{c} ] \in R^{m \times n}\) denote the entire training set, where m is the dimension of the samples and \(A_{i}\) (i = 1, 2, …, c) is the set of training samples from the ith object class. Given a test sample \(y \in R^{m}\) of the ith class, the goal is to predict the label of y from the training samples of the c classes. The linear representation of y can be written in terms of all training samples A as [17]

$$y = Ax$$
(1)

where x is the vector of coefficients. If the test sample y belongs to the ith class, then the entries of x are expected to be zero except for those associated with the ith class.

In SRC, the problem of finding the coefficient vector is formulated as a convex programming problem

$$\mathop {\hbox{min} }\limits_{x} \left\| x \right\|_{1} \quad {\text{subject to}}\quad y = Ax$$
(2)

where \(\left\| \cdot \right\|_{1}\) denotes the \(l_{1}\) norm. The sparsity of the coefficient vector can be measured by the \(l_{0}\) norm; however, \(l_{0}\) norm minimization is an NP-hard problem. Recent studies [25, 26] show that if the solution x is sparse enough, \(l_{1}\) minimization can be employed to seek the sparse solution, which is computationally convenient and efficient.

After obtaining x from Eq. (2), the class reconstruction residual is used to design the sparse representation-based classifier (SRC). For each class i, let \(\delta_{i} :R^{n} \to R^{n}\) be the characteristic function that selects the coefficients associated with the ith class. For \(x \in R^{n}\), \(\delta_{i} (x) \in R^{n}\) is a new vector whose nonzero entries are the entries of x that are associated with class i. The test sample y can be approximated by \(\hat{y}_{i} = A\delta_{i} (x)\), which uses only the coefficients associated with class i. The reconstruction residual for class i is defined as:

$$r_{i} (y) = \left\| {y - \hat{y}_{i} } \right\|_{2}$$
(3)

We classify the given test sample y to the class i associated with the minimal reconstruction residual. We give the algorithm of SRC as follows.

Algorithm of SRC

1. Input: The training data matrix \(A = [A_{1} ,A_{2} , \ldots ,A_{c} ] \in R^{m \times n}\) for c classes, a test sample y

2. Solve the \(l_{1}\) minimization:

\(\mathop {\hbox{min} }\limits_{x} \left\| x \right\|_{1} \quad {\text{subject to}}\quad Ax = y\)

3. Then compute the residuals \(r_{i} (y) = \left\| {y - \hat{y}_{i} } \right\|_{2}\), \(i = 1,2, \ldots ,c\)

4. Output the identity of y as: Identity(y) = \(\arg \min_{i} \{ r_{i} (y)\}\)
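To make the SRC procedure above concrete, the following is a minimal Python sketch (not the authors' implementation). It recasts the equality-constrained \(l_{1}\) problem as a linear program and assumes SciPy's linprog, a column-wise class label array, and l2-normalized columns of A.

```python
# A minimal sketch of SRC. The constrained l1 problem  min ||x||_1  s.t.  Ax = y
# is recast as a linear program over [x; u] with |x_i| <= u_i.
import numpy as np
from scipy.optimize import linprog


def src_classify(A, y, class_labels):
    """A: m x n matrix whose l2-normalized columns are training samples,
    y: m-dimensional test sample, class_labels: length-n array of class indices."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])            # minimize sum_i u_i
    A_eq = np.hstack([A, np.zeros((m, n))])                   # Ax = y (u not involved)
    A_ub = np.vstack([np.hstack([np.eye(n), -np.eye(n)]),
                      np.hstack([-np.eye(n), -np.eye(n)])])   # encodes |x_i| <= u_i
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
                  A_eq=A_eq, b_eq=y, bounds=bounds)
    x = res.x[:n]
    classes = np.unique(class_labels)
    residuals = []
    for i in classes:                                         # r_i(y) = ||y - A delta_i(x)||_2
        delta_i = np.where(class_labels == i, x, 0.0)
        residuals.append(np.linalg.norm(y - A @ delta_i))
    return classes[int(np.argmin(residuals))]                 # identity(y) = arg min_i r_i(y)
```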

2.2 \(l_{p}\) (0 < p < 1) sparse representation for face recognition

The optimization problem of \(l_{1}\) minimization cannot always yield the sparsest solution. Recently, the \(l_{p}\) (0 < p < 1) norm has been used as an alternative to the \(l_{0}\) norm for sparse signal recovery. The \(l_{p}\) (0 < p < 1) sparse representation-based classification seeks the optimal sparse representation of a test image by choosing the most suitable parameter p. The \(l_{p}\) minimization problem is as follows:

$$\mathop {\hbox{min} }\limits_{x} \left\| x \right\|_{p} \quad {\text{subject to}}\quad Ax = y$$
(4)

The authors first proposed an iterative algorithm for solving the non-convex system (4) in [24].

An iterative algorithm for \(l_{p}\) minimization (0 < p < 1)

Step 1: Initialize the iteration count t = 0 and the coding coefficients \(x_{i}^{0} = 1,\ i = 1,2, \ldots ,n\)

Step 2: Update the coding vector \(x^{t + 1}\) by solving the weighted \(l_{1}\) minimization problem

$$x^{t + 1} = \arg \hbox{min} \sum\limits_{i = 1}^{n} {\frac{{\left| {x_{i} } \right|}}{{(\left| {x_{i}^{t} } \right| + \mu_{t} )^{1 - p} }}} \quad {\text{subject to}}\quad Ax = y$$
(5)

Step 3: Terminate on convergence or when the maximal number of iterations \(t_{\max }\) is reached; otherwise, let t = t + 1 and go to Step 2

The solution of the \(l_{p}\) minimization is sought by an iterative \(l_{1}\) minimization algorithm. In the first step, the coding coefficients are initialized as \(x_{i}^{0} = 1,\) i = 1, 2, …, n. Step 2 solves a weighted \(l_{1}\) minimization in which the weights \(w_{i}^{t + 1} = 1/(\left| {x_{i}^{t} } \right| + \mu_{t} )^{1 - p}\) (\(i = 1,2, \ldots ,n\)) depend on the solution of the previous iteration. The weights in Eq. (5) relate inversely to the magnitudes of the coefficients, so the \(l_{p}\) minimization can partially counteract the influence of the coefficient magnitude on the \(l_{1}\) penalty function [27].
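To make the reweighting scheme concrete, here is a minimal Python sketch of the iteration (a sketch under assumptions, not the reference implementation). The weighted \(l_{1}\) subproblem of Eq. (5) is solved with the same linear-programming reduction used for SRC above.

```python
# A sketch of the iterative reweighted-l1 scheme for l_p (0 < p < 1) minimization.
# The weighted l1 subproblem  min sum_i w_i*|x_i|  s.t.  Ax = y  is solved as an LP.
import numpy as np
from scipy.optimize import linprog


def weighted_l1(A, y, w):
    m, n = A.shape
    c = np.concatenate([np.zeros(n), w])                      # objective: sum_i w_i * u_i
    A_eq = np.hstack([A, np.zeros((m, n))])                   # Ax = y
    A_ub = np.vstack([np.hstack([np.eye(n), -np.eye(n)]),
                      np.hstack([-np.eye(n), -np.eye(n)])])   # |x_i| <= u_i
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
                  A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:n]


def lp_sparse_code(A, y, p=0.5, mu=1e-3, t_max=20, tol=1e-6):
    x = np.ones(A.shape[1])                                   # Step 1: x_i^0 = 1
    for _ in range(t_max):                                    # Step 3: at most t_max iterations
        w = 1.0 / (np.abs(x) + mu) ** (1.0 - p)               # weights from the previous iterate
        x_new = weighted_l1(A, y, w)                          # Step 2: weighted l1 minimization
        if np.linalg.norm(x_new - x) < tol:                   # Step 3: convergence check
            return x_new
        x = x_new
    return x
```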

The residuals are computed by

$$r_{i} = \left\| {y - A_{i} \hat{x}_{i} } \right\|_{2}$$
(6)

The test sample y is classified into the object class that has the minimum residual.

The state-of-the-art sparse representation algorithms use norm minimization to obtain the coefficient vector. Among \(l_{p}\) (0 < p < 1) minimization, \(l_{1}\) minimization, and \(l_{2}\) minimization, it is still unknown which one yields the sparsest solution, and which norm minimization is best suited to sparse representation remains an open problem in theory. In the next section, by introducing the full rank decomposition of the dictionary matrix, we solve the sparse representation system directly, precisely, and efficiently.

The algorithm of \(l_{p}\) (0 < p < 1) SRC is as follows.

Algorithm of \(l_{p}\) (0 < p < 1) SRC

1. Input: The training data matrix \(A = [A_{1} ,A_{2} , \ldots ,A_{c} ] \in R^{m \times n}\) for c classes, a test sample y and the error tolerance \(\varepsilon \ge 0\)

2. Solve the following \(l_{p}\) minimization problem

\(\mathop {\hbox{min} }\limits_{x} \left\| x \right\|_{p}\) subject to \(Ax = y\)

3. Then compute the residuals \(r_{i} = \left\| {y - A_{i} \cdot \hat{x}_{i} } \right\|_{2}\), \(i = 1,2, \ldots ,c.\)

4. Output the identity of y as: Identity(y) = \(\arg \min_{i} \left\{ {r_{i} } \right\}\)

3 The proposed method

In this section, we present the proposed method and analyze it. The main steps of the proposed method are as follows. First, we approximately obtain the full rank decomposition of the training data matrix. Second, we solve the general solution of Eq. (1) and calculate the minimum-residual solution. Finally, we classify the test sample into the class that has the minimal residual.

3.1 Full rank decomposition and general solution of the linear equation in our method

For the linear equation y = Ax, no matter what the dimensions of A are, a general solution can always be written down. We present the details of obtaining the general solution of a linear equation as follows.

Definition 1 (full rank decomposition)

Let \(A \in R^{m \times n}\) be a matrix of rank r. If A = FG with \(F \in R^{m \times r}\), \(G \in R^{r \times n}\), and rank(F) = rank(G) = r, then A = FG is a full rank decomposition of the matrix A.
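For instance, the rank-one matrix below admits a full rank decomposition with r = 1:

$$\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}\begin{pmatrix} 1 & 2 \end{pmatrix} = FG,\quad {\text{rank}}(F) = {\text{rank}}(G) = 1$$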

By exploiting the full rank decomposition, we can obtain the generalized inverse matrix of A via Eq. (7). Definition 2 gives the generalized inverse matrix of A.

Definition 2 (Generalized inverse matrix)

\(A \in R^{m \times n}\) is a matrix of rank r with full rank decomposition A = FG, where \(F \in R^{m \times r}\) has full column rank and \(G \in R^{r \times n}\) has full row rank. The generalized inverse matrix of A is

$$A^{ - } = G^{T} (F^{T} AG^{T} )^{ - 1} F^{T}$$
(7)

After obtaining the generalized inverse matrix of A, we can compute a general solution of the linear equation y = Ax by Eq. (8). Definition 3 gives the general solution of a linear equation.

Definition 3 (General solution of linear equation)

If \(A^{ - } \in R^{n \times m}\) is a generalized inverse matrix of \(A \in R^{m \times n}\), then the general solution of the linear equation y = Ax is

$$x = A^{ - } y + (I - A^{ - } A)z,$$
(8)

where \(z \in R^{n \times 1}\) is an arbitrary vector.
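As a numerical illustration of Definitions 1–3, the following Python sketch computes one particular full rank decomposition via the thin SVD (an assumption of this sketch; any rank-r factorization satisfying Definition 1 works), the generalized inverse of Eq. (7), and the general solution of Eq. (8).

```python
# A numerical sketch of Definitions 1-3; the thin SVD is used here only as one
# convenient way to obtain a full rank decomposition A = FG.
import numpy as np


def full_rank_decomposition(A, tol=1e-10):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))          # numerical rank of A
    F = U[:, :r] * s[:r]                     # F in R^{m x r}, full column rank
    G = Vt[:r, :]                            # G in R^{r x n}, full row rank
    return F, G


def generalized_inverse(A):
    F, G = full_rank_decomposition(A)
    # Eq. (7): A^- = G^T (F^T A G^T)^{-1} F^T
    return G.T @ np.linalg.inv(F.T @ A @ G.T) @ F.T


def general_solution(A, y, z=None):
    # Eq. (8): x = A^- y + (I - A^- A) z, where z is an arbitrary vector.
    A_inv = generalized_inverse(A)
    n = A.shape[1]
    z = np.zeros(n) if z is None else z
    return A_inv @ y + (np.eye(n) - A_inv @ A) @ z
```

One can check numerically that the matrix returned by generalized_inverse satisfies \(AA^{ - } A = A\), as required of a generalized inverse.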

3.2 The optimum solution in the proposed method for face recognition

Let \(A = [A_{1} ,A_{2} , \ldots ,A_{c} ] \in R^{m \times n}\) denote the entire training set, in which \(A_{i} \in R^{{m \times n_{i} }}\) contains the training samples of the ith object class and c is the number of classes. For any test sample \(y \in R^{m}\), the linear representation of y can be written as

$$y = Ax$$
(1)

where x is the coefficient vector. By full rank decomposition, we find two matrix factors F and G whose product is an approximation of the matrix A, represented as

$$A = FG$$
(9)

By Eq. (7), we obtain the generalized inverse matrix of A, that is, \(A^{ - } = G^{T} (F^{T} AG^{T} )^{ - 1} F^{T}\). We then obtain the general solution of the linear equation y = Ax by Eq. (8), i.e., \(x = A^{ - } y + (I - A^{ - } A)z\). Because z is an arbitrary vector, this general solution is not necessarily the optimum. We use the least squares method to find the optimum solution of Eq. (1). The details are as follows. Let

$$W = \left\| {y - Ax} \right\|_{2}^{2}$$
(10)

Substituting Eq. (8) into Eq. (10), we obtain

$$\begin{aligned} W & = \left\| {y - Ax} \right\|_{2}^{2} \\ & = (y - Ax)^{T} (y - Ax) \\ & = (y^{T} - x^{T} A^{T} )(y - Ax) \\ & = y^{T} y - 2x^{T} A^{T} y + x^{T} A^{T} Ax \\ & = y^{T} y - 2(A^{ - } y + (I - A^{ - } A)z)^{T} A^{T} y + (A^{ - } y + (I - A^{ - } A)z)^{T} A^{T} A(A^{ - } y + (I - A^{ - } A)z) \\ \end{aligned}$$

Since Eq. (10) is convex and differentiable, any stationary point is a global minimizer. Setting the derivative of W with respect to z to zero, we obtain the following equation

$$W^{\prime}(z) = - 2(I - A^{ - } A)^{T} A^{T} y + 2A^{T} A(I - A^{ - } A)(A^{ - } y + (I - A^{ - } A)z) = 0$$
(11)

We can derive from Eq. (11)

$$z = (I - A^{ - } A)^{ - 1} ((A^{T} A(I - A^{ - } A))^{ - 1} (I - A^{ - } A)^{T} A^{T} y - A^{ - } y).$$
(12)

Substituting the z given by Eq. (12) back into Eq. (8) yields the optimal coefficient vector \(\hat{x}\) for the test sample y.

Then, we calculate the residual of each class by

$$r_{i} (y) = \left\| {y - A_{i} \hat{x}_{i} } \right\|_{2}$$
(13)

where \(\hat{x}_{i}\) denotes the entries of \(\hat{x}\) associated with the ith class. If \(k = \arg \min_{i} \{ r_{i} (y)\} ,i = 1,2, \ldots ,c\), we classify y into the kth class, where c is the number of distinguished classes.

The classification procedure of the proposed method is shown in Algorithm 1.

Algorithm 1 Algorithm of the Proposed Method

1. Input: a set of training samples \(A = [A_{1} ,A_{2} , \ldots ,A_{c} ] \in R^{m \times n}\) for c classes, a test sample \(y \in R^{m}\)

2. Normalize the columns of A

3. Compute the full rank decomposition of A by Eq. (9)

4. Compute the generalized inverse matrix of A by Eq. (7)

5. Compute the general solution of the linear equation y = Ax (1) by Eq. (8)

6. Solve the optimal solution \(\hat{x}\) of Eq. (1) by Eqs. (11) and (12)

7. Compute the residuals \(r_{i} (y) = \left\| {y - A_{i} \hat{x}_{i} } \right\|_{2}\), \(i = 1,2, \ldots ,c\)

8. Output: identity(y) = \(\arg \min_{i} r_{i} (y)\)
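To summarize the procedure, the following is a hedged end-to-end Python sketch of the classification steps in Algorithm 1. As an assumption, it uses the minimum-norm solution \(\hat{x} = A^{ - } y\) of Eq. (8) (see Sect. 3.3) as the coefficient vector rather than evaluating Eq. (12), and class_sizes lists the number of training columns per class in the same order as the columns of A.

```python
# A hedged sketch of the classification procedure in Algorithm 1, using the
# minimum-norm solution x_hat = A^- y as the representation of the test sample.
import numpy as np


def es_frd_classify(A, y, class_sizes):
    A = A / np.linalg.norm(A, axis=0, keepdims=True)     # step 2: normalize columns of A
    # Steps 3-4: A^- of Eq. (7); for a full rank decomposition A = FG this equals
    # the Moore-Penrose pseudoinverse, so np.linalg.pinv is used here for brevity.
    A_inv = np.linalg.pinv(A)
    x_hat = A_inv @ y                                     # minimum-norm solution of y = Ax
    residuals, start = [], 0
    for n_i in class_sizes:                               # step 7: class-wise residuals
        A_i = A[:, start:start + n_i]
        x_i = x_hat[start:start + n_i]
        residuals.append(np.linalg.norm(y - A_i @ x_i))
        start += n_i
    return int(np.argmin(residuals))                      # step 8: identity of y (0-based index)
```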

Figure 1 describes the flowchart of the proposed method.

Fig. 1 Flowchart of the proposed method

3.3 Analysis of the proposed method

In this section, we analyze the characteristics, rationale, and potential advantages of our method. Our method differs from SRC and \(l_{p}\) (0 < p < 1) SRC as follows. Our method uses full rank decomposition to represent the training data matrix approximately and then best represents the test sample as a linear combination of the training data; here, "best" means that the residual between the obtained linear combination and the test sample is the smallest. SRC and \(l_{p}\) (0 < p < 1) SRC both use norm minimization to obtain an approximate solution for classifying a new test sample.

The proposed method represents the training data matrix by full rank decomposition and then expresses the test sample as a linear combination of the training data. We use the generalized inverse of the matrix to obtain the minimum-norm solution of the linear equation, and we classify the test sample by evaluating the reconstruction error class by class. Our method can also be viewed as a method that exploits a linear combination of all training samples to represent the test sample and calculates the solution of this linear representation. The underlying rationale is that, for different test samples, the coefficients of the linear representation are different; we can solve the optimal solution of Eq. (8) for the linear representation of each test sample.

The advantages of our work are as follows:

1. The proposed method introduces full rank decomposition, which factorizes the image database into two low-rank matrices. This helps to obtain a much sparser solution.

2. By introducing the general solution of the linear equation, we can find the optimum solution of the sparse system (1). There is no need to solve a norm minimization problem, which leads to a more efficient procedure.

3. To our knowledge, no existing method can solve the SRC coefficients directly; the state-of-the-art SRC algorithms use norm minimization to obtain the coefficient vector. Collaborative representation-based classification (CRC) uses the \(l_{2}\) norm to obtain the solution of the linear equation; it achieves strong performance for face recognition, but it must take the distribution of the data into account [28]. The proposed method solves the linear representation directly and efficiently without considering the data distribution; that is, it can be applied directly to data with any distribution.

4 Experimental results and analysis

We use the FERET [29], ORL [30], and AR [31] databases to evaluate the performance of the proposed method for face recognition. We compare our method with NN, NS, SRC, and \(l_{p}\) (0 < p < 1) SRC. For all of the above learning algorithms, we test the classification performance with feature subspace dimensions of 36, 49, 64, 81, and 100. The parameters are set as in Refs. [17] and [24].
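For orientation, the following Python sketch outlines the evaluation protocol used throughout this section (random per-subject splits, five trials, accuracy averaged per feature dimension). extract_features and classify are hypothetical placeholders standing in for the dimensionality reduction step and for any one of the compared classifiers; they are assumptions of this sketch, not part of the original protocol description.

```python
# A hedged sketch of the evaluation loop: per-subject random splits, five trials,
# accuracy averaged for each feature subspace dimension.
import numpy as np


def evaluate(images, labels, n_train, dims=(36, 49, 64, 81, 100), n_trials=5, seed=0):
    rng = np.random.default_rng(seed)
    results = {}
    for d in dims:
        accs = []
        for _ in range(n_trials):
            train_idx, test_idx = [], []
            for c in np.unique(labels):                       # split each subject separately
                idx = rng.permutation(np.where(labels == c)[0])
                train_idx.extend(idx[:n_train])
                test_idx.extend(idx[n_train:])
            X_tr = extract_features(images[train_idx], d)     # hypothetical feature extractor
            X_te = extract_features(images[test_idx], d)
            pred = [classify(X_tr.T, x, labels[train_idx])    # hypothetical classifier
                    for x in X_te]
            accs.append(np.mean(np.array(pred) == labels[test_idx]))
        results[d] = float(np.mean(accs))                     # averaged accuracy at dimension d
    return results
```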

4.1 Evaluation on the FERET database

The FERET database [29] was acquired without any restrictions imposed on facial expression and with at least two frontal images shot at different times during the same photo session. The image sets used for evaluating face recognition algorithms display diversity across gender, ethnicity, and age. For the FERET face database, we only use a subset made up of 1,400 images from 200 individuals, with each subject providing seven images [32]. We crop and normalize all FERET face images to 40 × 40 pixels. We randomly select different numbers (3, 4, and 5) of images from each subject to construct the training set, and the remaining images make up the test set. Figure 2 shows some example images used in our experiments. Figure 3 shows the results of NN, NS, SRC, \(l_{p}\) (0 < p < 1) SRC, and our method on the FERET face database.

Fig. 2 Example FERET images used in our experiments (images of one subject from the FERET database)

Fig. 3 Face recognition rate on the FERET database. We randomly select three (a), four (b), and five (c) images from each subject to construct the training set and the rest are used for testing. We conduct five trials for each partition and compare the performance of different algorithms based on the averaged accuracy of the five trials on each dimension for each type of the partition

From Fig. 3, the classification accuracy of the proposed method is higher than that of the other comparison methods. In particular, when the data dimension is 81 and three samples per subject are randomly chosen for training, our method achieves an accuracy approximately 13 % higher than that of SRC.

4.2 Evaluation on the ORL database

The ORL database [30] contains images from 40 individuals, each providing 10 different images. All subjects are in an upright, frontal position (with tolerance for some side movement). The size of each face image is 112 × 92, and the resulting standardized input vectors are of dimensionality 10,304. In the experiments, the images are resized to 40 × 40. Figure 4 shows images of the same subject of ORL. Figure 5 shows the results of NN, NS, SRC, \(l_{p}\) (0 < p < 1) SRC, and our method on the ORL face database.

Fig. 4 Images of one subject in ORL

Fig. 5 Face recognition rate on the ORL database. We randomly select four (a), five (b), and six (c) images from each subject to construct the training set and the rest are used for testing. We conduct five trials for each partition and compare the performance of different algorithms based on the averaged accuracy of the five trials on each dimension for each type of the partition

From Fig. 5, the classification accuracy of the proposed method is higher than that of the other comparison methods. Figure 5b shows that when the data dimension is 81 and five samples per subject are randomly chosen for training, our method achieves an accuracy approximately 1.5 % higher than that of SRC.

4.3 Evaluation on the AR database

The AR database [31] consists of more than 4,000 face images of 126 subjects (70 men and 56 women). The database characterizes divergence from ideal conditions by incorporating various facial expressions (neutral, smile, anger, and scream), occlusion modes (sunglasses and scarf), and luminance alterations (left light on, right light on, and all side lights on). Each individual participated in two sessions, separated by 2 weeks (14 days). In the experiments, we used cropped face images of 100 subjects (50 men and 50 women), and we crop and normalize all AR face images to 40 × 40 pixels. We test the robustness of the proposed method on the AR database. The experiments are conducted for variations in facial expression, variations in lighting conditions, and contiguous occlusion.

4.3.1 Variations in facial expressions

We selected a subset that involves variations in facial expressions. Figure 6 shows the images of one subject used for testing variations in facial expression. Figure 6a, e is used for training and the others are used for testing. The number of training samples is 240 and the number of test samples is 720. Figure 7 shows the results for variations in facial expressions.

Fig. 6 Facial expression variation in the AR database. a–d and e–h correspond to two different sessions incorporating neutral, happy, angry, and screaming expressions, respectively

Fig. 7 Face recognition rate for testing variations in facial expression. We conduct five trials for each partition and compare the performance of different algorithms based on the averaged accuracy of the five trials on each dimension for each type of the partition

From Fig. 7, we can see that SRC has better classification performance under facial expression variations than \(l_{p}\) (0 < p < 1) SRC. Our method achieves the best performance among all the comparison methods.

4.3.2 Variations in lighting conditions

We selected images that involve lighting changes on the left, right, and all sides as a subset for testing variations in lighting conditions. Figure 8a, e is used for training and the remaining images of Fig. 8 are used for testing. Thus, the total number of training samples is 240 and the total number of test samples is 720. Figure 9 shows the experimental results for lighting variation on the AR database.

Fig. 8 Lighting variation images in the AR database. a–d and e–h correspond to two different sessions incorporating neutral, left light on, right light on, and all side lights on, respectively

Fig. 9 Face recognition rate for testing variations in lighting condition. We conduct five trials for each partition and compare the performance of different algorithms based on the averaged accuracy of the five trials on each dimension for each type of the partition

From Fig. 9, we can see that our method achieves the best performance among all the comparison methods. SRC has better classification performance under lighting variations than \(l_{p}\) (0 < p < 1) SRC.

4.3.3 Contiguous occlusion

The problem of face identification in the presence of contiguous occlusion is arguably one of the most challenging paradigms in the context of robust face recognition [33]. To test the performance of ES-FRD under contiguous occlusion, we conduct two sets of experiments in this section, testing occlusion by sunglasses and by a scarf, respectively. Figures 10 and 11 show the images used to test sunglasses and scarf occlusion, respectively. Figure 10a, e is used for training and the others are used for testing; thus, the total number of training samples is 240. The experimental scheme for scarf occlusion is the same as that for sunglasses occlusion.

Fig. 10 Images with sunglasses occlusion in the AR database. b–d and f–h are sunglasses occlusion images of one individual

Fig. 11 Images with scarf occlusion in the AR database. b–d and f–h are scarf occlusion images of one individual

Figures 7, 9 and 12 verify the robustness of the proposed method: it achieves better classification performance under variations in facial expression, lighting variations, and contiguous occlusion on the AR database.

Fig. 12 Face recognition rate for testing contiguous occlusion. We conduct five trials for each partition and compare the performance of different algorithms based on the averaged accuracy of the five trials on each dimension for each type of the partition. a Sunglasses occlusion. b Scarf occlusion

4.4 Sparsity evaluation

In this section, we evaluate the sparsity of the proposed algorithm. According to Ref. [34], the sparseness of a vector can be calculated by the equation defined as:

$${\text{sparseness}}(\nu ) = \frac{{\sqrt t - \left( {\sum\nolimits_{i} {\left| {\nu_{i} } \right|} } \right)/\sqrt {\sum {\nu_{i}^{2} } } }}{\sqrt t - 1}$$

where t is the dimensionality of the vector ν. From Table 1, we can see that the proposed method yields sparser solutions than the other comparison methods.
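A small Python helper implementing the sparseness measure above (a sketch; it reproduces only the measure itself, not the experimental protocol behind Table 1):

```python
# Sparseness measure of a coefficient vector: 1 for a vector with a single nonzero
# entry, 0 for a vector whose entries all have equal magnitude.
import numpy as np


def sparseness(v):
    v = np.asarray(v, dtype=float)
    t = v.size
    l1 = np.abs(v).sum()
    l2 = np.sqrt((v ** 2).sum())
    return (np.sqrt(t) - l1 / l2) / (np.sqrt(t) - 1.0)
```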

Table 1 Sparseness of SRC, \(l_{p}\) (0 < p < 1) SRC, and our method on the ORL and FERET databases

5 Conclusion

Sparse representation-based classification (SRC) has been successfully applied to face recognition. SRC seeks the sparsest linear combination of the training samples for any test sample, but solving the norm minimization problem for the coding coefficients is time consuming. In this paper, we propose a novel method for face recognition that does not require solving a norm minimization problem. The proposed method first approximately represents the training data matrix by full rank decomposition and then represents the test sample as a linear combination of the training data. The generalized inverse of the matrix is used to solve the linear equation, and the test sample is classified into the class with the minimum residual. The experimental results suggest that the proposed method achieves higher accuracy for face recognition.