Abstract
We propose a fast kernel sparse representation based classification (SRC) for undersampling problem, i.e., each class has very few training samples, in face recognition. The proposed algorithm exploits a nonlinear mapping to map the data from the original input space into a high-dimensional feature space. Then, it performs very fast sparse representation and classification of samples in this space. Similar to the typical SRC methods, the proposed approach is based on the L1 norm minimization, whose direct solution can be very time-consuming. In order to improve the computational efficiency, our method uses the coordinate descent method in the feature space, which can avoid directly solving the L1 norm minimization problem, and significantly expedites the computational procedure. Compared with other SRC methods based on the L1 norm minimization, our proposed method achieves very high computational efficiency, without significantly degrading the classification performance. Several experiments on popular face databases demonstrate that our method is a promising efficient kernel SRC based method.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
In recent years, the representation based classification (RC) has been a hot research topic in pattern recognition and machine learning community, particularly in face recognition. The main procedure of RC is that for a test sample, the RC method exploits the features extracted from the training samples (or directly uses the training samples) to construct one model that is used to represent this test sample by using one or more types of vector norms, e.g., the L1 norm. Then, the test sample is classified via the representation results after the constructed model is solved. The RC method has some variants that can be used other applications such as the video analysis [6, 33]. Up to now, researchers have applied several types of vector norms to represent the samples, e.g., L1, L1/2, L2,1 and L2 [23, 27, 36, 38, 40, 41, 52]. Note that using the L1 norm minimization can yield a sparse representation for samples. The classification methods based on sparse representation are considered robust to outliers and data corruption, e.g., image occlusion. This type of classification methods is referred to as the sparse representation based classification (SRC) [1, 43, 47]. In principle, the representation result obtained by L1/2 norm is sparser than that obtained by L1 norm [41]. Nevertheless, solving the L1/2 norm minimization problem usually entails iteratively solving the L1 norm minimization. We know that the L2,1 norm based classification is robust to outliers through selecting samples to yield grouped sparsity [26, 27]. Although the solution of the L2 norm minimization is usually faster than that of the L1 norm minimization, the models based on the L2 norm cannot produce the sparse solution [52]. Moreover, it is believed that they are not robust to image noise or outliers.
Among the RC methods mentioned above, SRC is one of the most popular sparse representation methods for classification in face recognition. Theoretically, the RC approaches using the L0 norm minimization (i.e., the representation coefficients contain the fewest non-zero components) [12] can produce the sparsest representation result, which may lead to desirable recognition performance [38]. However, the L0 norm minimization is essentially an NP-hard problem. Furthermore, if enough training samples are available and the result of representing the test sample using them is sufficiently sparse, the solution of the L0 norm minimization problem is equal to that of the L1 norm minimization problem [4, 10, 38]. It has also been demonstrated that the solution of the L1 norm minimization is more computationally efficient than that of the L0 norm minimization which is an NP-hard problem. Moreover, L1 norm minimization is a convex problem which is easier to guarantee optimality. Therefore, the L0 norm minimization is often replaced by the L1 norm minimization in many applications involving representation methods [34, 39, 46, 48, 49, 55].
However, RC using the L1 norm minimization has a serious computational obstacle. When the training set is very small, that is, there are very few training samples in each class, using the L1 norm minimization may not obtain the sparse representation if the data dimensionality is high. Here, we refer to this problem as the undersampling classification problem. As a result, the typical SRC methods based on the L1 norm minimization may not achieve the desirable classification effectiveness when dealing with very small-scale training set and high dimensional data, e.g., face image data. In order to address the undersampling classification problem in face recognition, Deng et al. proposed an extended SRC (ESRC) [8] which exploits the training set combining with an auxiliary intraclass variation dictionary [9] to represent the test samples. This method is implemented in the original input space and can perform well in many cases. Nevertheless, ESRC fails to capture the nonlinear information within the data set, which is helpful for the classification.
If the data in the original input space are mapped into a high dimensional feature space, i.e., reproducing kernel Hilbert space (RKHS) [28], by using a nonlinear map, and they are exploited to construct the SRC based model, then the new constructed model can capture well the nonlinear information within the data in principle. Based on this idea, a number of SRC-based algorithms implemented in the transformed feature space are proposed to deal with the nonlinear similarity traits. These algorithms are referred to as the kernel sparse representation for classification (KSRC) [5, 51]. In [15, 16], Gao et al. proposed a kernel sparse representation for classification (KSRC) to capture the nonlinear similarity of features. KSRC can be viewed as an extension of SRC and overcome the drawbacks of the SRC [54]. One of the drawbacks is that SRC cannot classify a testing sample that has the same vector direction as training samples belonging to many classes [53]. To remedy this deficiency, in [19], Jian et al. used multi-objective optimization to develop the class-discriminative kernel sparse representation-based classification, called KSRC 2.0. The multi-objective functions were formulated by the Fisher discrimination criterion. By employing nonlinear KSRC in representing the nonlinearities in the high-dimensional RKHS, A. Shrivastava developed a multiple kernel learning approach which exploits a two-step training procedure to learn the kernel weights and sparse representation coefficients [31]. In order to perform robust multimodal biometrics recognition, S. Shekhar et al. proposed a multimodal sparse representation method and kernelized the algorithm to handle nonlinearity in data [30]. In addition, kernel sparse representation can be used in the one-class classification model for remote sensing image analysis [32]. All the above kernel sparse representation methods are based on the direct solution of the L1 norm minimization problem.
The main goal of our work is to address the underling problem that is also called the small sample size problem in many image recognition tasks in which the number of training images is few and the dimensionality of images is very high, whereas the deep learning based method [7, 21] that is currently popular approach may fail to perform successfully since it usually requires a very large-scale training set. Besides, our work aims at expediting the computing procedure in the high dimensional feature space and can by very easily implemented. Two contributions are made in this paper. The first contribution is that, in order to capture well the nonlinear information within the data and overcome the drawback of the ESRC performed in the original input space, we reformulate the ESRC algorithm in RKHS and propose a fast kernel ESRC (KESRC) algorithm in this paper. Furthermore, typical SRC and ESRC algorithms are based on the L1 norm minimization, and hence, their computational cost is usually very high. In our proposed algorithm, we make another contribution by applying a kernel coordinate descent (KCD) approach [37] to implement KESRC. KCD is the kernel version of the coordinate descent method that is simple and efficient [14]. The solution of our proposed approach does not involve the direct L1 norm minimization. Therefore, the computational cost of our method is much lower than that of the SRC and ESRC algorithms. That is, the proposed algorithm is far faster than typical SRC based algorithms including SRC, ESRC and KSRC, etc. Hereafter, our algorithm is referred to as the fast KESRC (FKESRC) algorithm. Experiments on several popular face data sets demonstrate that our algorithm can achieve very high computational efficiency without significantly degrading the classification effectiveness.
The rest of the paper is organized as follows: In Section 2, we review the traditional SRC and ESRC algorithms. Section 3 describes our FKESRC algorithm. Section 4 gives the details of the experimental results and illustrates the effectiveness of the proposed algorithm. Section 5 offers our conclusions.
2 Review of SRC and ESRC
In this section, we review two typical sparse representation for classification methods: SRC and ESRC. The typical SRC algorithm for image classification, especially face recognition, was originally proposed by Wright et al. [38]. It has attracted much attention in recent years. The typical SRC is introduced as follows.
Assume that the training set is denoted as X = [x1, x2, ..., xN] ∈RD × N where xi ∈ RD (i = 1, 2, .., N) denotes the ith training sample. Given a test sample yt ∈ RD (t = 1, 2, .., T), the traditional SRC essentially seeks the sparsest solution to yt = Xα, where α = [a1, a2, ..., aN]T and ai is the representation coefficient corresponding to the ith training sample, by solving the following optimization problem
where ‖α‖1 denotes the L1-norm of the coefficient vector α. Note that the above SRC model can achieve desirable classification effectiveness under the condition that sufficient training samples are available [38]. When the data dimensionality is very high and the number of the training samples is small, SRC may not perform well. As mentioned above, ESRC proposed by Deng et al. [8] can overcome the drawback of typical SRC algorithms that such algorithms cannot properly handle the learning setting in which the number of the training samples in each class is small. In ESRC, Deng introduced an intraclass variation matrix combining the training samples to represent the test samples. Specifically, the intraclass variation matrix is defined as [8]
where c is the number of sample classes, mi is the center of the ith class(i = 1,2,…,c), Ai denotes the sample matrix of the ith class in which each column represents one training sample, and \( {I}_j=\left[1,...,1\right]\in {R}^{1\times {n}_j} \)where nj is the number of the training samples in the jth class. For a test sample yt ∈ RD, ESRC represents this sample by using the intraclass variation matrix AI and the training set as follows
where η and β are the representation coefficients respectively associated with X and AI, and e denotes the small dense noise. Then, the ESRC model aims to solve the following problem
where ε is the error tolerance, which is a small positive constant.
3 Fast kernel ESRC(FKESRC)
Motivated by the basic idea of the KSRC algorithm, we improve the ESRC algorithm by employing the kernel trick and propose a fast kernel ESRC (FKESRC) algorithm in this section. In Subsection 3.1, we first construct the kernel ESRC (KESRC) model. Then, we give the fast algorithm of KESRC in Subsection 3.2. And finally, the time complexity is analyzed in Subsection 3.3.
3.1 Kernel ESRC(KESRC) model
We apply a nonlinear mapping ϕ(•) to map the samples from the original input space into the high-dimensional feature space (RKHS) in which the data dimensionality is M (M> > D). The total training samples in the original input space are mapped to be ϕ(X) = [ϕ(x1), ϕ(x2), ..., ϕ(xN)] ∈ RM × N, the testing sample in the original space yt is mapped to be φ(yt) ∈ RM, and the center of the ith class in the feature space is
where ni (i = 1,2,…,c) denotes the number of the training samples in the ith class. Similarly, the intraclass variation matrix in this space ϕ(AI) can be computed by
where \( \phi \left({x}_j^{\hbox{'}}\right)=\phi \left({x}_j\right)-\phi \left({m}_i\right) \)(j = 1,2,...,N), and ϕ(mi) denotes the center of the class to which the sample ϕ(xj) belongs in the transformed space. Then, the testing sample φ(yt) can be represented by the training samples combining the intraclass variation matrix ϕ(AI) as follows
where θ=[θ1, ..., θN]T and ω= [ω1, ..., ωN]T are the representation coefficients respectively associated with ϕ(X) and ϕ(AI), and δ denotes the small dense noise. Thus, the kernel ESRC model can be reformulated as
where ξ is a small positive constant.
The above Eq. (8) is equivalent to the following minimization problem
where λ is the tradeoff parameter that balances the reconstruction error and the sparseness of the coefficient vector \( \left[\begin{array}{c}\theta \\ {}\omega \end{array}\right] \). Substituting ϕ(X) = [ϕ(x1), ϕ(x2), ..., ϕ(xN)]and ϕ(AI)= \( \left[\phi \left({x}_1^{\hbox{'}}\right),\phi \left({x}_2^{\hbox{'}}\right),...,\phi \left({x}_N^{\hbox{'}}\right)\right] \) into Eq. (9), we obtain
where ϕ(•) is an implicit nonlinear mapping that can be specified by a kernel function. Here, we employ the Gaussian kernel function to perform the nonlinear mapping. Thus, ϕ(xi)Tϕ(xi)=1, and \( \phi {\left({x}_i^{\hbox{'}}\right)}^T\phi \left({x}_i^{\hbox{'}}\right) \)=1, (i = 1, 2, ..., N). Given two arbitrary samples ϕ(x) and ϕ(y), the Gaussian kernel function of these two samples is k(x, y)=ϕ(x)Tϕ(y)=exp(−‖x − y‖2/σ2) where σ is the Gaussian kernel parameter that is often manually tuned in the real applications, andk(x, x)=1. Therefore, we can reformulate Eq.(10) as
where k(•, yt)=(k(x1, yt), k(x2, yt), ..., k(xN, yt)), k(∘, yt)= \( \left(k\left({x}_1^{\hbox{'}},{y}_t\right),k\left({x}_2^{\hbox{'}},{y}_t\right),...,k\left({x}_N^{\hbox{'}},{y}_t\right)\right) \), \( K=\left(\begin{array}{ccc}k\left({x}_1,{x}_1\right)& \cdots & k\left({x}_1,{x}_N\right)\\ {}\vdots & \ddots & \vdots \\ {}k\left({x}_N,{x}_1\right)& \cdots & k\left({x}_N,{x}_N\right)\end{array}\right),\mathrm{G}\left(\begin{array}{ccc}k\left({x}_1^{\hbox{'}},{x}_1^{\hbox{'}}\right)& \cdots & k\left({x}_1^{\hbox{'}},{x}_N^{\hbox{'}}\right)\\ {}\vdots & \ddots & \vdots \\ {}k\left({x}_N^{\hbox{'}},{x}_1^{\hbox{'}}\right)& \cdots & k\left({x}_N^{\hbox{'}},{x}_N^{\hbox{'}}\right)\end{array}\right) \), and \( {K}^{\ast }=\left(\begin{array}{ccc}k\left({x}_1,{x}_1^{\hbox{'}}\right)& \cdots & k\left({x}_1,{x}_N^{\hbox{'}}\right)\\ {}\vdots & \ddots & \vdots \\ {}k\left({x}_N,{x}_1^{\hbox{'}}\right)& \cdots & k\left({x}_N,{x}_N^{\hbox{'}}\right)\end{array}\right) \).
Similar to the other typical SRC based models, our KESRC model is also based on the L1norm minimization, and its traditional solution is time-consuming. Next, we will introduce a new approach to solve the proposed KESRC model by using the coordinate descent approach [14].
3.2 Fast KESRC
In this subsection, we exploit the coordinate descent approach to solve the KESRC model. In the coordinate descent method, we do not need to compute the derivative of the objective function. The coordinate descent performs each coordinate-wise minimization with an explicit formula and exploits the sparsity of the model. It focuses mainly on evaluating only inner products for variables with non-zero coefficients [14]. The computational efficiency of the coordinate descent method is very high. This method can largely improve the computational efficiency of KESRC in theory. Hence, our proposed method is a fast KESRC (FKESRC) algorithm. In FKESRC, we need to compute two parameter vectors θ and ω which are associated with the training samples and the columns of the intraclass variation matrix, respectively. First, we determine the parameter vector θ based on the efficient coordinate descent method. Note that the objective function J(θ, ω) in Eq. (11) can be rewritten as follows
We take the partial derivative of J(θ, ω) with respect to θi(i = 1, 2, ..., N), and have
We set \( \frac{\partial J\left(\theta, w\right)}{\partial {\theta}_i} \) to 0, then,
Let
Thus, the updating of the coefficient θi is independent of the other coefficients θj(j = 1, 2, ..., N, j ≠ i) and ωj(j = 1, 2, ..., N) which are fixed in the calculation of θi. The coefficient θi is computed as
where the function [⋅]+ is defined as
Similarly, we take the partial derivative of J(θ, ω) with respect to ωi(i = 1, 2, ..., N), and have
We set \( \frac{\partial J\left(\theta, w\right)}{\partial {\omega}_i} \) to 0, then
Let
Also, the updating of the coefficient ωi is independent of the other coefficients ωj(j = 1, 2, ..., N, j ≠ i) and θj(j = 1, 2, ..., N) which are fixed in the calculation of ωi. The coefficient ωi is computed as
where the function [⋅]+ is defined in Eq. (17).
After obtaining the coefficients θiand ωi(i = 1, 2, ..., N), we can employ them to classify the testing sample yt. Then, we need to compute the following residuals
where δk(θ) ∈ RN is a vector whose only nonzero entries are the entries in θ associated with class k. Then, Eq. (22) can be rewritten as
Hence, the class label of yt is determined by
In our algorithm, the coefficient vectors θ and ω are updated iteratively by Eq. (16) and (21). The initialization of vectors θ and ω are computed through solving the following equation φ(yt) = ϕ(X)θ + ϕ(AI)ω by kernel trick, and we obtain
where P = [K K∗],S=(PTP + μI)−1PT, μis a small positive value, and I is the identity matrix. The following Algorithm 1 summarizes our FKESRC.
Algorithm 1. Fast KESRC algorithm
In the above algorithm, the number of iterations m is usually a small integer, say, 3 to 10, and can easily lead to a sparse solution. In our experiments, m is set to 3.
3.3 Time complexity
In this subsection, we provide an analysis of the time complexity of our algorithm. The analysis contains three main parts. The first part is the preprocessing or initialization procedure. This procedure needs to compute the matrices K, G and K∗ in Eq.(11), and (PTP + μI)−1PT in Eq. (25). The time complexity of this procedure is about O(N2(3D + 16N)) where N is the number of the training samples and D is the data dimensionality. The second part is to represent the testing samples. For each testing sample, the time complexity of representing this sample is O(mN2) where the number of iterations m is usually a relatively small integer (m = 3 in this work). Hence, the total time complexity of this part is O(mN2T) where T is the total number of the testing samples, which is often far larger than N. The third part is calculating the residuals and classifying each testing sample. The time complexity of this procedure is roughly O(N2). Totally, the time complexity of the third part is about O(N2T). Note that the typical SRC algorithm [38] contains two principal procedures: the representing and classifying testing sample procedures. The time complexities of the representing and classifying the total testing samples is O(NeD2T) where e > = 1.2, and O(NDT), respectively [13, 44].
From the above analysis, we know that if the number of testing samples is sufficiently large, the second part, i.e., the representing part whose time complexity is not related to the data dimensionality, would be the most time-consuming procedure in our algorithm. Also, the representing procedure of the typical SRC algorithm consumes the most time in this algorithm implementation. Nevertheless, the time complexity of representing procedure of the typical SRC algorithm is tightly related to the data dimensionality. Therefore, compared with the typical SRC algorithm, our algorithm is computationally efficient if the number of testing samples is large and the data dimensionality is high.
4 Experiments
This section will verify the computational efficiency and recognition effectiveness of our proposed FKESRC algorithm through conducting several experiments on four popular real-world face databases. The face databases are the GT, AR, YaleB and LFW (the labeled faces in the wild) databases [3, 25, 35]. For comparison, we have implemented the other state-of-the-art representation based classification algorithms. They are the collaborative representation for classification (CRC) [52] that is based on the L2 norm minimization, and typical classification algorithms based on the L1 norm minimization, namely SRC [38], SRC performed by using the coordinate descent scheme (SCD), KSRC [53] and ESRC [8] and its kernel version, i.e., kernel ESRC (KESRC). The sparse representation procedures of these algorithms are implemented by using l1_ls modular [20, 42] which produces high-quality classification performance, by solving the L1 norm minimization problem. Similar to ESRC, our proposed algorithm can be viewed as a dictionary learning method. For comparison, here we have implemented two recently proposed dictionary learning methods. One is the kernel extended dictionary (KED) [18] and the other one is the locality-constrained and label embedding dictionary learning (LCLEDL) [22] method. In the experiments, the kernel function in the KSRC, KESRC and our proposed FKESRC is the Gaussian kernel k(x, y)=exp(−‖x − y‖2/2σ2), where σ is the Gaussian kernel parameter which is set to be r × d, where d is the average Euclidean distance of the training samples and r, referred to as the kernel ratio, needs to be specified on each data set. The optimal error tolerance ε in Eq. (4) is set to 1e-3. Since the SRC, KSRC, ESRC and KESRC algorithms are very time-consuming, the dimensionality of all face images in these algorithms is reduced by using principal component analysis (PCA) in all experiments to save the computational time. For a fair comparison, the image dimensionality in our proposed algorithm and other algorithms is also reduced by PCA. All the experiments are run on a platform with 3.4 GHz CPU and 8.0 GB RAM using Matlab R2016b software.
4.1 Experiments on the GT database
The first experiment was conducted on the GT (Georgia Tech) face database which was built at Georgia Institute of Technology. This database contained 50 individuals with 15 color images per individual. The face images of each individual characterized several variations such as pose, expression, and illumination [25]. Figure 1 gives some samples of this database. All the images were first converted to greyscale images, and then cropped and resized to a resolution of 60 × 50 pixels. We randomly grouped the image samples of each subject into two parts, i.e., the training and testing parts. For each individual, a few images (N = 3, 4, 5 and 6) were randomly selected for training, and the rest were used for testing.
The face image data dimensionality was reduced to 40 by using PCA in this experiment. In the KSRC algorithm, the Gaussian kernel parameter in the kernel function was set to 0.1 × d. That is, the kernel ratio was 0.1 that is same as that of KESRC in all experiments. The kernel ratio in our FKESRC algorithm was set to 0.8. And the parameter λ in Eqs. (16) and (21) was set to 0.01. We randomly ran each algorithm 10 times on each training subset. Table 1 gives the recognition rates (MEAN ± STD-DEV PERCENT) on four training subsets (denoted by N = 3, 4, 5 and 6 in Table 1). From this table, we can see that when N = 4 and 5, FKESRC obtained the highest recognition rates among all the algorithms except for KESRC. Compared with other algorithms, our FKESRC can achieve desirable classification performance.
Our method is one typical SRC based algorithm, and the sparse model of FKESRC is resolved by using the L1 norm minimization procedure. Also, the sparse model of SRC, SCD, KSRC, ESRC and KESRC are resolved by using the L1 norm minimization procedure. Therefore, for fair comparison, we have compared them with our method in terms of the computational time. For each of these algorithms including our FKESRC, we denoted the total computational time as Ct. Then the average computational time was calculated as Ct/10 (randomly ran each algorithm 10 times and average, as mentioned above). Table 2 reports the average computational time of six algorithms (SRC, SCD, KSRC, ESRC, KESRC and our FKESRC). Note that the ESRC and KESRC algorithms are slower than the SRC and KSRC algorithms. The reason was that the ESRC and KESRC models involved the intraclass variation matrix, and consequently, directly solving the ESRC and KESRC models had to spend more CPU time. In fact, our FKESRC model also used the intraclass variation matrix. However, our algorithm is much more computationally efficient than the ESRC and KESRC algorithms since our algorithm exploited the coordinate descent approach. Since our method needs to compute the intraclass variation matrix, the proposed method is slightly slower than the SCD algorithm that does not compute the intraclass variation matrix. However, the recognition rates of our method are higher than those of SCD. From Table 2, we can observe that the computational time of our approach is only 0.6% of that of KESRC when N = 3. And when N = 4, the ratio of the computational time of FKESRC to that of KESRC is only 0.5%. From Tables 1 and 2, we can observe that the proposed FKESRC achieved desirable recognition results which being the second fastest algorithm among all tested algorithms.
4.2 Experiment on the AR database
The second experiment was conducted on the AR data set [29], which contained over 4000 face images of 126 individuals. Each individual included frontal views of faces with different facial expressions, lighting conditions, and occlusions [11, 45]. Figure 2 shows some examples of a test subject in the AR database. We used the face images of 120 people and each person with 26 images. All the images were first changed to greyscale images, and then cropped and resized to a resolution of 20 × 25 pixels. Again, we randomly grouped the image samples of each individual into two parts. One part was used for training and the other part for testing. The numbers of training images that were chosen for each person were also 3, 4, 5 and 6 like before (N = 3, 4, 5 and 6).
The face image data dimensionality was reduced to 300 by using PCA in this experiment. Similarly, we needed to set the parameters in KSRC and our FKESRC. In the KSRC algorithm, the kernel ratio was 1e-3. The kernel ratio in our FKESRC algorithm was set to 1.5 by experimentation. And the parameter λ in Eqs. (16) and (21) was also set to 0.01. Also, we randomly ran each algorithm 10 times on each subset. Figure 3 depicts the recognition rates of all algorithms on the AR database. In this figure, N denoted the number of training images of each individual. From this figure, we can see that our proposed algorithm achieved comparable recognition results with other algorithms. Most importantly, our approach was much more computationally efficient than SRC, KSRC, ESRC and KESRC approaches, except for the SCD algorithm. The computational time of these algorithms was reported in Table 3. As we expected, the ratio of computational time of our algorithm to that of other algorithms was still very small. For example, the ratio of the computational time of our algorithm to that of the KESRC algorithm was only 1.1% when N = 3. Compared with the SRC algorithm, our approach was approximately 40 times faster. From Table 3, we can find that FKESRC was the fastest algorithm among all tested. On the other hand, the average time of representing and classifying each testing sample was only 0.01 s in our approach when N = 3, and when N = 6, the average time was not more than 0.03 s, while the ESRC algorithm needed nearly 1 s for the same job. Therefore, our method can be applied in the real time recognition applications on the AR data set. The main reason of the very high computational efficiency of our proposed approach and the SCD algorithm was that these algorithms avoided directly solving the L1 norm minimization problem, whereas the other algorithms in our experiments need the direct solution of the L1 norm minimization problem which was very time consuming, as demonstrated in the experiments.
4.3 Experiment on the YaleB database
We conducted the third experiment on the YaleB face data set which contained 10 subjects with 5850 face images. Each individual had 585 face images [2]. The YaleB data set was a widely used illumination data set with images spanning a large range of possible illuminations [3]. Figure 4 shows some image examples of a subject in the YaleB database. The goal of this experiment is to investigate the performance of our algorithm on different dimensions.
In the YaleB data set, each image was manually aligned and cropped, and the size of each image was 30 × 40. Also, we randomly grouped the image samples of each individual into two parts for training and testing. The number of training images for each individual was 10. The face image data dimensionalities were reduced to 40, 50, 60, 70 and 80, respectively, by using PCA in this experiment. Thus, we obtained five training subsets.
In this experiment, the kernel ratio of the KSRC algorithm was 0.1. The kernel ratio in our FKESRC algorithm was set to 0.9. And the parameter λ in Eqs. (16) and (21) was again set to 0.01. Also, we randomly ran each algorithm 10 times on each subset. Figure 5 shows the recognition rates of all algorithms on the YaleB database. In this figure, D denoted the dimensionality of face images of each individual. From this figure, we can see our proposed algorithm achieved the desirable recognition results, and had comparable performance with other algorithms in terms of the recognition rates.
Similar to the first two experiments, we payed more attention on the computational efficiency of six SRC based algorithms. Table 4 reports the computational time of these algorithms. From this table, we can conclude that ESRC and KESRC were far slower than other algorithms since they used the intraclass variation matrix and directly solved the L1 norm minimization problem, as mentioned in the previous experiment section. Note that although our FKESRC model also involved the intraclass variation matrix, our method was far faster than ESRC and KESRC, since our approach avoided directly solving the L1 norm minimization problem. According to Table 4, the ratio of computational time of our algorithm to that of the KESRC algorithm was only 0.8% when D = 80. On the other hand, although the SRC and KSRC algorithms did not involve the intraclass variation matrix, both of them were far slower than our approach. Our approach was about 50 times faster than the KSRC algorithm that was faster than the SRC, ESRC and KESRC methods. From Table 4, we can observe that the average time of representing and classifying each testing sample was only about 0.002 s in our approach. Therefore, our method should be more suitable for use in real time recognition applications than other algorithms on the YaleB database.
4.4 Experiment on the LFW database
The fourth experiment was conducted on the LFW (the labeled face in the wild) database in which all face images were collected from the web [13]. This data set contains over 13,000 face images. Figure 6 gives several face image samples of one individual in the LFW database. We used the face images of 86 individuals and each person with 11 images. All the images were first changed to greyscale images, and then cropped and resized to a resolution of 32 × 32 pixels. Similarly, we randomly grouped the image samples of each individual into two parts. One part was used for training and the other part for testing. For each individual, a few images (N = 7 and 9) were randomly selected for training, and the rest were used for testing.
The face image data dimensionality was reduced to 200 by using PCA in this experiment. In the KSRC algorithm, the kernel ratio was 1e-3. The kernel ratio in our FKESRC algorithm was set to 0.5. Also, we randomly ran each algorithm 10 times on each training subset. Table 5 reports the recognition rates of all algorithms. Form this table, we can observe that our approach achieves comparable classification performance with the KSRC, ESRC and KESRC algorithms, and outperforms CRC, SRC, SCD, LCLEDL and KED methods.
In this experiment, we report the computational time of SRC, SCD, KSRC, ESRC, KESRC and our method in Table 6. According to this table, our method is the second fastest one among these six algorithms. Our method is a variant of KESRC and ESRC. The proposed FKESRC is far faster than these two algorithms. From Table 6, we can see that FKESRC is about 60 times faster than KESRC, although KESRC is slightly better than our method in terms of recognition rates. On the LFW database, the average time of representing and classifying each testing sample was only about 0.02 s in our approach when N = 7. Hence, we can make a conclusion that FKESRC is a real-time sparse representation based classification approach.
4.5 Determining the kernel ratio for FKESRC
In this subsection, we introduce how to determine the kernel ratio parameter used in our FKESRC algorithm on four face databases. We chose the optimal kernel ratio from the set Z = {0.01, 0.02, ..., 10}, termed kernel ratio set, that leads to the best recognition result on each database. In order to illustrate the selection procedure, we first chose one subset from each database. Similar to the previous experiments, this subset includes one training subset and one testing subset. Specifically, for the GT data set, the number of training images of a person is three, i.e., N = 3. Similarly, for the AR data set, N = 3; for the YelaB data set, N = 10; and for the LFW data set, N = 7. Then, we randomly ran FKESRC using each kernel ratio in Z one time on the chosen subsets. The recognition results obtained by different kernel ratios are shown in the following Fig. 7. In this figure, the value of abscissa denotes the kernel ratio value. The value of ordinate denotes the recognition rate. From this figure, we can observe that when the kernel ratio is near to 1, our FKESRC tends to achieve higher recognition rates. Hence, it is easy to determine the suitable kernel ratio on all databases.
5 Conclusion
In this work, we proposed a fast kernel sparse representation based classification, i.e., the FKESRC algorithm, which aims to solve the undersampling problem in face recognition. Theoretically, our approach can address well the undersampling problem. FKESRC can also capture the nonlinear information within the data set, and achieve the desirable classification effectiveness, which is demonstrated by the experiments in Section 4. Moreover, it is a computationally efficient sparse representation based classification. Although our algorithm and the typical SRC, KSRC, ESRC and KESRC algorithms are all based on the L1 norm minimization problem, FKESRC does not need to directly solve the L1 norm minimization problem, while the other algorithms do. Among all algorithms, our proposed method is a good trade-off between the recognition rate and the computational time. It can be applied in the real time classification applications such as face recognition. Besides, the basic idea of our method can be potentially exploited in lung prediction [50], background subtraction [24] and saliency detection [17].
References
Adamo A, Grossi G, Lanzarotti R, Lin J (2015) Robust face recognition using sparse representation in LDA space. Mach Vis Appl 26(6):837–847
Basri R, Hassner T, Zelnik-Manor L (2010) Approximate nearest subspace search. IEEE Trans Pattern Anal Mach Intell 33(2):266–278
Beveridge JR, Draper BA, Chang JM, Kirby M et al (2009) Principal angles separate subject illumination spaces in YDB and CMU-PIE. IEEE Trans Pattern Anal Mach Intell 31(2):351–356
Candes EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inf Theory 51(12):4203–4215
Chen Z, Zuo W, Hu Q, Lin L (2015) Kernel sparse representation for time series classification. Inf Sci 292:15–26
J. Chen, X. Song, L. Nie, X. Wang, et al.(2016) Micro tells macro: predicting the popularity of micro-videos via a Transductive model, presented at the proceedings of the 24th ACM international conference on multimedia, Amsterdam, The Netherlands
Choe C, Choe G, Wang T, Han S, et al (2019) "Deep feature learning with mixed distance maximization for person re-identification," Multimedia Tools and Applications, June 26
Deng W, Hu J, Guo J (2012) Extended SRC: Undersampled face recognition via intraclass variant dictionary. IEEE Trans Pattern Anal Mach Intell 34(9):1864–1870
Deng W, Hu J, Guo J, (2013) In defense of sparsity based face recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013 ), pp. 399–406
Donoho DL (2006) For most large underdetermined systems of linear equations the minimal minimal l1-norm solution is also the sparsest solution. Commun Pure Appl Math 59(6):797–829
Fan Z, Wang J, Zhu Q, Fang X et al (2013) Local minimum squared error for face and handwritten character recognition. Journal of Electronic Imaging 22(3):033027
Fan Z, Ni M, Zhu Q, Sun C et al (2015)L0-norm sparse representation based on modified genetic algorithm for face recognition. J Vis Commun Image Represent 28:15–20
Fan Z, Ni M, Zhu Q, Liu E (2015) Weighted sparse representation for face recognition. Neurocomputing 151:304–309
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Gao S, Tsang IW-H, Chia L-T(2013) Sparse representation with kernels. IEEE Trans Image Process 22(2):423–434
Gao S, Chia L, Tsang I, Ren Z (2014) Concurrent single-label image classification and annotation via efficient multi-layer group sparse coding. IEEE Transactions on multimedia 16(3):762–771
Han J, Zhang D, Hu X, Guo L et al (2015) Background prior-based salient object detection via deep reconstruction residual. IEEE Transactions on Circuits and Systems for Video Technology 25(8):1309–1321
Huang KK, Dai DQ, Ren CX, Lai ZR (2017) Learning kernel extended dictionary for face recognition. IEEE Transactions on Neural Networks and Learning Systems 28(5):1082–1094
Jian M, Jung C (2013)Class-discriminative kernel sparse representation-based classification using multi-objective optimization. IEEE Trans Signal Process 61(18):4416–4427
Kim S-J, Koh K, Lustig M, Boyd S et al (2007) An interior-point method for large-scale l 1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing 1(4):606–617
Kushwaha N, Pant M, (2018) "Textual data dimensionality reduction - a deep learning approach," Multimedia Tools and Applications, December 15
Li Z, Lai Z, Xu Y, Yang J et al (2017) A locality-constrained and label embedding dictionary learning algorithm for image classification. IEEE Transactions on Neural Networks and Learning Systems 28(2):278–293
Liu W, Yu Z, Lu L, Wen Y et al (2015) KCRC-LCD: discriminative kernel collaborative representation with locality constrained dictionary for visual categorization. Pattern Recogn 48(10):3076–3092
Liu X, Zhao G, Yao J, Qi C (2015) Background subtraction based on low-rank and structured sparse decomposition. IEEE Trans Image Process 24(8):2502–2514
Naseem I, Togneri R, Bennamoun M (2010) Linear regression for face recognition. IEEE Trans Pattern Anal Mach Intell 32(11):2106–2112
Nie F, Huang H, Cai X, Ding C (2010) Efficient and robust feature selection via joint L2, 1-norms minimization. Twenty-Fourth Annual Conference on Advances in Neural Information Processing Systems (NIPS 2010) 23:1813–1821
Ren C-X, Dai D-Q, Yan H (2012) Robust classification using L2,1-norm based regression model. Pattern Recogn 45:2708–2718
Scholkopf B, Mika S, Burges CJC, Knirsch P et al (1999) Input space versus feature space in kernel-based methods. IEEE Trans Neural Netw 10(5):1000–1017
Shao C, Song X, Feng Z-H, Wu X-J et al (2017) Dynamic dictionary optimization for sparse-representation-based face classification using local difference images. Inf Sci 393:1–14
Shekhar S, Patel VM, Nasrabadi NM, Chellappa R (2014) Joint sparse representation for robust multimodal biometrics recognition. IEEE Trans Pattern Anal Mach Intell 36(1):113–126
Shrivastava A, Patel V, Chellappa R (2014) Multiple kernel learning for sparse representation-based classification. IEEE Trans Image Process 23(7):3013–3024
Song B, Li P, Li J, Plaza A (2016)One-class classification of remote sensing images using kernel sparse representation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9(4):1613–1623
Song X, Feng F, Liu J, Li Z et al (2017) "NeuroStylist: neural compatibility modeling for clothing matching," presented at the proceedings of the 25th ACM international conference on multimedia. Mountain View, California
Wagner A, Wright J, Ganesh A, Zhou Z et al (2012) Toward a practical face recognition system: robust alignment and illumination by sparse representation. IEEE Trans Pattern Anal Mach Intell 34(2):372–386
Wang S-J, Yang J, Sun M-F, Peng X-J et al (2012) Sparse tensor discriminant color space for face verification. IEEE Transactions on Neural Networks and Learning Systems 23(6):876–888
Wang J, Lu C, Wang M, Li P et al (2014) Robust face recognition via adaptive sparse representation. IEEE Transactions on Cybernetics 44(12):2368–2378
Wang L, Yan H, Lv K, Pan C (2014) Visual tracking via kernel sparse representation with multi-kernel fusion. IEEE Transactions on Circuits and Systems for Video Technology 24(7):1132–1141
Wright J, Yang AY, Ganesh A, Sastry SS et al (2009) Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
Wright J, Ma Y, Mairal J, Sapiro G et al (2010) Sparse representation for computer vision and pattern recognition. Proc IEEE 98(6):1031–1044
Xu Y, Zhang D, Yang J, Yang J-Y(2011) A two-phase test sample sparse representation method for use with face recognition. IEEE Transactions on Circuits and Systems for Video Technology 21(9):1255–1262
Xu Z, Chang X, Xu F, Zhang H (2012) Regularization: a thresholding representation theory and a fast solver. IEEE Transactions on Neural Networks and Learning Systems 23(7):1013–1027
Xu Y, Fang X, Li X, Yang J et al (2014) Data uncertainty in face recognition. IEEE Transactions on Cybernetics 44(10):1950–1961
Xu Y, Zhang Z, Lu G, Yang J (2016) Approximately symmetrical face images for image preprocessing in face recognition and sparse representation based classification. Pattern Recogn 54:68–82
Yang M, Zhang L (2010) "Gabor feature based sparse representation for face recognition with Gabor occlusion dictionary," presented at the ECCV 2010. Crete, Greece
Yang J, Zhang D, Yang J-y, Niu B (2007) Globally maximizing, locally minimizing: unsupervised discriminant projection with applications to face and palm biometrics. IEEE Trans Pattern Anal Mach Intell 29(4):650–664
Yang J, Wright J, Huang TS, Ma Y (2010) Image super-resolution via sparse representation. IEEE Trans Image Process 19(11):2861–2873
Yang J, Zhang L, Xu Y, Yang J-y(2012) Beyond sparsity: the role of L1-optimizer in pattern classification. Pattern Recogn 45(3):1104–1118
Yang J, Chu D, Zhang L, Xu Y (2013) Sparse representation classifier steered discriminative projection with applications to face recognition. IEEE Transactions on Neural Networks and Learning Systems 24(7):1023–1035
Yang M, Zhang L, Feng X, Zhang D (2014) Sparse representation based fisher discrimination dictionary learning for image classification. Int J Comput Vis 109(3):209–232
Yao J, Wang S, Zhu X, Huang J (2016) "Imaging Biomarker Discovery for Lung Cancer Survival Prediction," presented at the International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2016, Athens, Greece
Yin J, Liu Z, Jin Z, Yang W (2012) Kernel sparse representation based classification. Neurocomputing 77(1):120–128
Zhang L, Yang M, Feng X (2011) Sparse Representation or Collaborative Representation: Which Helps Face Recognition? ," presented at the ICCV 2011, Barcelona, Spain.
Zhang L, Zhou W-D, Chang P-C, Liu J et al (2012) Kernel sparse representation-based classifier. IEEE Trans Signal Process 60(4):1684–1695
Zhang G, Sun H, Xia G, Sun Q (2016) Multiple kernel sparse representation-based orthogonal discriminative projection and its cost-sensitive extension. IEEE Trans Image Process 25(9):4271–4285
Zheng M, Bu J, Chen C, Wang C et al (2011) Graph regularized sparse coding for image representation. IEEE Trans Image Process 20(5):1327–1336
Acknowledgements
This article is partly supported by Natural Science Foundation of China (NSFC) under grants Nos. 61991401, 61673097, 61490704 and Jiangxi Provincial Natural Science Foundation of China under Grant 20192ACBL20010, as well as Science and Technology Foundation of Jiangxi Transportation Department of China (2015D0066).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Fan, Z., Wei, C. Fast kernel sparse representation based classification for Undersampling problem in face recognition. Multimed Tools Appl 79, 7319–7337 (2020). https://doi.org/10.1007/s11042-019-08211-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-019-08211-x