1 Introduction

In recent years, representation based classification (RC) has been a hot research topic in the pattern recognition and machine learning communities, particularly in face recognition. The main procedure of RC is as follows: for a test sample, the RC method exploits the features extracted from the training samples (or directly uses the training samples) to construct a model that represents this test sample by using one or more types of vector norms, e.g., the L1 norm. Then, once the constructed model is solved, the test sample is classified according to the representation result. RC has variants that can be used in other applications such as video analysis [6, 33]. Up to now, researchers have applied several types of vector norms to represent the samples, e.g., the L1, L1/2, L2,1 and L2 norms [23, 27, 36, 38, 40, 41, 52]. Note that the L1 norm minimization can yield a sparse representation of the samples. Classification methods based on sparse representation are considered robust to outliers and data corruption, e.g., image occlusion; this type of method is referred to as sparse representation based classification (SRC) [1, 43, 47]. In principle, the representation result obtained with the L1/2 norm is sparser than that obtained with the L1 norm [41]. Nevertheless, solving the L1/2 norm minimization problem usually entails iteratively solving L1 norm minimizations. The L2,1 norm based classification is robust to outliers because it selects samples so as to yield grouped sparsity [26, 27]. Although the L2 norm minimization can usually be solved faster than the L1 norm minimization, models based on the L2 norm cannot produce sparse solutions [52]; moreover, they are believed to be less robust to image noise and outliers.

Among the RC methods mentioned above, SRC is one of the most popular sparse representation methods for classification in face recognition. Theoretically, RC approaches based on the L0 norm minimization (i.e., requiring the representation coefficients to contain the fewest non-zero components) [12] produce the sparsest representation, which may lead to desirable recognition performance [38]. However, the L0 norm minimization is essentially an NP-hard problem. Furthermore, if enough training samples are available and the representation of the test sample over them is sufficiently sparse, the solution of the L0 norm minimization problem equals that of the L1 norm minimization problem [4, 10, 38]. The L1 norm minimization can be solved much more efficiently than the NP-hard L0 norm minimization; moreover, it is a convex problem, so global optimality is easier to guarantee. Therefore, the L0 norm minimization is often replaced by the L1 norm minimization in many applications involving representation methods [34, 39, 46, 48, 49, 55].

However, RC using the L1 norm minimization faces a serious obstacle. When the training set is very small, that is, when there are very few training samples in each class, the L1 norm minimization may not yield a sparse representation if the data dimensionality is high. Here, we refer to this problem as the undersampling classification problem. As a result, typical SRC methods based on the L1 norm minimization may not achieve desirable classification effectiveness on very small-scale training sets with high-dimensional data, e.g., face image data. To address the undersampling classification problem in face recognition, Deng et al. proposed an extended SRC (ESRC) [8] which combines the training set with an auxiliary intraclass variation dictionary [9] to represent the test samples. This method is implemented in the original input space and performs well in many cases. Nevertheless, ESRC fails to capture the nonlinear information within the data set, which is helpful for classification.

If the data in the original input space are mapped into a high-dimensional feature space, i.e., a reproducing kernel Hilbert space (RKHS) [28], by a nonlinear map, and the mapped data are used to construct the SRC based model, then the new model can in principle capture the nonlinear information within the data. Based on this idea, a number of SRC-based algorithms implemented in the transformed feature space have been proposed to handle nonlinear similarity; they are referred to as kernel sparse representation for classification (KSRC) [5, 51]. In [15, 16], Gao et al. proposed such a kernel sparse representation to capture the nonlinear similarity of features. KSRC can be viewed as an extension of SRC and overcomes some drawbacks of SRC [54]. One of these drawbacks is that SRC cannot classify a testing sample that has the same vector direction as training samples belonging to many classes [53]. To remedy this deficiency, Jian et al. [19] used multi-objective optimization to develop a class-discriminative kernel sparse representation-based classification, called KSRC 2.0, in which the multi-objective functions are formulated by the Fisher discrimination criterion. By employing KSRC to represent the nonlinearities in the high-dimensional RKHS, Shrivastava et al. developed a multiple kernel learning approach which uses a two-step training procedure to learn the kernel weights and the sparse representation coefficients [31]. For robust multimodal biometrics recognition, Shekhar et al. proposed a multimodal sparse representation method and kernelized the algorithm to handle nonlinearity in the data [30]. In addition, kernel sparse representation can be used in one-class classification models for remote sensing image analysis [32]. All the above kernel sparse representation methods rely on directly solving the L1 norm minimization problem.

The main goal of our work is to address the undersampling problem, also called the small sample size problem, which arises in many image recognition tasks where the number of training images is small and the dimensionality of the images is very high; currently popular deep learning based methods [7, 21] may fail in this setting since they usually require a very large-scale training set. Besides, our work aims to expedite the computation in the high-dimensional feature space and to be very easy to implement. Two contributions are made in this paper. First, in order to capture the nonlinear information within the data and overcome the drawback of ESRC being performed in the original input space, we reformulate the ESRC algorithm in RKHS and propose a kernel ESRC (KESRC) model. Furthermore, typical SRC and ESRC algorithms are based on the L1 norm minimization, and hence their computational cost is usually very high. Our second contribution is to apply a kernel coordinate descent (KCD) approach [37] to implement KESRC. KCD is the kernel version of the coordinate descent method, which is simple and efficient [14]. The solution of our proposed approach does not involve direct L1 norm minimization. Therefore, the computational cost of our method is much lower than that of the SRC and ESRC algorithms; that is, the proposed algorithm is far faster than typical SRC based algorithms including SRC, ESRC and KSRC. Hereafter, our algorithm is referred to as the fast KESRC (FKESRC) algorithm. Experiments on several popular face data sets demonstrate that our algorithm achieves very high computational efficiency without significantly degrading classification effectiveness.

The rest of the paper is organized as follows: In Section 2, we review the traditional SRC and ESRC algorithms. Section 3 describes our FKESRC algorithm. Section 4 gives the details of the experimental results and illustrates the effectiveness of the proposed algorithm. Section 5 offers our conclusions.

2 Review of SRC and ESRC

In this section, we review two typical sparse representation based classification methods: SRC and ESRC. The typical SRC algorithm for image classification, especially face recognition, was originally proposed by Wright et al. [38] and has attracted much attention in recent years. The typical SRC is introduced as follows.

Assume that the training set is denoted as X = [x1, x2, ..., xN] ∈ RD × N, where xi ∈ RD (i = 1, 2, ..., N) denotes the ith training sample. Given a test sample yt ∈ RD (t = 1, 2, ..., T), the traditional SRC essentially seeks the sparsest solution to yt = Xα, where α = [α1, α2, ..., αN]T and αi is the representation coefficient corresponding to the ith training sample, by solving the following optimization problem

$$ {\hat{\alpha}}_1=\mathrm{argmin}{\left\Vert \alpha \right\Vert}_1\;\mathrm{s}.\mathrm{t}.{y}_t= X\alpha $$
(1)

where ‖α‖1 denotes the L1 norm of the coefficient vector α. Note that the above SRC model can achieve desirable classification effectiveness under the condition that sufficient training samples are available [38]. When the data dimensionality is very high and the number of training samples is small, SRC may not perform well. As mentioned above, ESRC, proposed by Deng et al. [8], overcomes the drawback of typical SRC algorithms that they cannot properly handle the setting in which the number of training samples in each class is small. In ESRC, an intraclass variation matrix is combined with the training samples to represent the test samples. Specifically, the intraclass variation matrix is defined as [8]

$$ {A}_I=\left[{A}_1-{m}_1{I}_1,...,{A}_c-{m}_c{I}_c\right]\in {R}^{D\times N} $$
(2)

where c is the number of sample classes, mi is the center of the ith class (i = 1, 2, ..., c), Ai denotes the sample matrix of the ith class in which each column represents one training sample, and \( {I}_j=\left[1,...,1\right]\in {R}^{1\times {n}_j} \), where nj is the number of training samples in the jth class. For a test sample yt ∈ RD, ESRC represents this sample by using the intraclass variation matrix AI and the training set as follows

$$ {y}_t= X\eta +{A}_I\beta +e $$
(3)

where η and β are the representation coefficients respectively associated with X and AI, and e denotes the small dense noise. Then, the ESRC model aims to solve the following problem

$$ \left[\begin{array}{c}{\hat{\eta}}_1\\ {}{\hat{\beta}}_1\end{array}\right]=\mathrm{argmin}{\left\Vert \left[\begin{array}{c}\eta \\ {}\beta \end{array}\right]\right\Vert}_1,s.t.{\left\Vert \Big[X,{A}_I\Big]\left[\begin{array}{c}\eta \\ {}\beta \end{array}\right]-{y}_t\right\Vert}_2\le \varepsilon $$
(4)

where ε is the error tolerance, which is a small positive constant.
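
Before moving to the kernel version, it may help to see how the ESRC dictionary is assembled from a labeled training matrix. The NumPy sketch below of Eq. (2) is ours and purely illustrative; it keeps the input column order rather than grouping columns by class, which does not change the model since only the pairing between dictionary columns and coefficients matters.

```python
import numpy as np

def intraclass_variation_matrix(X, labels):
    """Eq. (2): subtract the class centre m_i from every training column of X (D x N)."""
    A_I = np.empty_like(X)
    for c in np.unique(labels):
        idx = labels == c
        A_I[:, idx] = X[:, idx] - X[:, idx].mean(axis=1, keepdims=True)
    return A_I

# toy usage: D = 6, two classes with two samples each
X = np.random.randn(6, 4)
labels = np.array([0, 0, 1, 1])
A_I = intraclass_variation_matrix(X, labels)
B = np.hstack([X, A_I])   # the extended dictionary [X, A_I] over which Eq. (4) represents y_t
```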

3 Fast kernel ESRC (FKESRC)

Motivated by the basic idea of the KSRC algorithm, we improve the ESRC algorithm by employing the kernel trick and propose a fast kernel ESRC (FKESRC) algorithm in this section. In Subsection 3.1, we first construct the kernel ESRC (KESRC) model. Then, we give the fast algorithm of KESRC in Subsection 3.2. And finally, the time complexity is analyzed in Subsection 3.3.

3.1 Kernel ESRC (KESRC) model

We apply a nonlinear mapping ϕ(•) to map the samples from the original input space into a high-dimensional feature space (RKHS) whose dimensionality is M (M ≫ D). The training samples in the original input space are mapped to ϕ(X) = [ϕ(x1), ϕ(x2), ..., ϕ(xN)] ∈ RM × N, the testing sample yt is mapped to ϕ(yt) ∈ RM, and the center of the ith class in the feature space is

$$ \phi \left({m}_i\right)=\frac{1}{n_i}\sum \limits_{x_j\in class\;i}\phi \left({x}_j\right) $$
(5)

where ni (i = 1,2,…,c) denotes the number of the training samples in the ith class. Similarly, the intraclass variation matrix in this space ϕ(AI) can be computed by

$$ \phi \left({A}_I\right)=\left[\phi \left({x}_1^{\hbox{'}}\right),\phi \left({x}_2^{\hbox{'}}\right),...,\phi \left({x}_N^{\hbox{'}}\right)\right] $$
(6)

where \( \phi \left({x}_j^{\hbox{'}}\right)=\phi \left({x}_j\right)-\phi \left({m}_i\right) \) (j = 1, 2, ..., N), and ϕ(mi) denotes the center of the class to which the sample ϕ(xj) belongs in the transformed space. Then, the testing sample ϕ(yt) can be represented by the training samples combined with the intraclass variation matrix ϕ(AI) as follows

$$ \phi \left({y}_t\right)=\phi (X)\theta +\phi \left({A}_I\right)\omega +\delta $$
(7)

where θ=[θ1, ..., θN]T and ω= [ω1, ..., ωN]T are the representation coefficients respectively associated with ϕ(X) and ϕ(AI), and δ denotes the small dense noise. Thus, the kernel ESRC model can be reformulated as

$$ \underset{\theta, \omega }{\min }{\left\Vert \begin{array}{c}\theta \\ {}\omega \end{array}\right\Vert}_1,s.t.{\left\Vert \phi \left({y}_t\right)-\phi (X)\theta -\phi \left({A}_I\right)\omega \right\Vert}_2\le \xi $$
(8)

where ξ is a small positive constant.

The above Eq. (8) is equivalent to the following minimization problem

$$ \left[\begin{array}{c}\hat{\theta}\\ {}\hat{\omega}\end{array}\right]=\underset{\theta, \omega }{\mathrm{argmin}}J\left(\theta, \omega \right)=\underset{\theta, \omega }{\mathrm{argmin}}\left(\frac{1}{2}{\left\Vert \phi \left({y}_t\right)-\phi (X)\theta -\phi \left({A}_I\right)\omega \right\Vert}_2^2\right)+\lambda {\left\Vert \begin{array}{c}\theta \\ {}\omega \end{array}\right\Vert}_1 $$
(9)

where λ is the tradeoff parameter that balances the reconstruction error and the sparseness of the coefficient vector \( \left[\begin{array}{c}\theta \\ {}\omega \end{array}\right] \). Substituting ϕ(X) = [ϕ(x1), ϕ(x2), ..., ϕ(xN)] and ϕ(AI) = \( \left[\phi \left({x}_1^{\hbox{'}}\right),\phi \left({x}_2^{\hbox{'}}\right),...,\phi \left({x}_N^{\hbox{'}}\right)\right] \) into Eq. (9), we obtain

$$ \underset{\theta, \omega }{\min }J\left(\theta, \omega \right)=\underset{\theta, \omega }{\min}\left(\frac{1}{2}{\left\Vert \phi \left({y}_t\right)-\sum \limits_{i=1}^N\left({\theta}_i\phi \left({x}_i\right)+{\omega}_i\phi \left({x}_i^{\hbox{'}}\right)\right)\right\Vert}_2^2\right)+\lambda {\left\Vert \begin{array}{c}\theta \\ {}\omega \end{array}\right\Vert}_1 $$
(10)

where ϕ(•) is an implicit nonlinear mapping that can be specified by a kernel function. Here, we employ the Gaussian kernel function to perform the nonlinear mapping. Thus, ϕ(xi)Tϕ(xi) = 1 and \( \phi {\left({x}_i^{\hbox{'}}\right)}^T\phi \left({x}_i^{\hbox{'}}\right)=1 \) (i = 1, 2, ..., N). Given two arbitrary samples x and y, the Gaussian kernel function is k(x, y) = ϕ(x)Tϕ(y) = exp(−‖x − y‖²/σ²), where σ is the Gaussian kernel parameter, which is often manually tuned in real applications, and k(x, x) = 1. Therefore, we can reformulate Eq. (10) as

$$ \underset{\theta, \omega }{\min }J\left(\theta, \omega \right)=\underset{\theta, \omega }{\min}\left(-2k\left(\bullet, {y}_t\right)\theta -2k\left(\circ, {y}_t\right)\omega +{\theta}^T K\theta +{\omega}^T G\omega +2{\theta}^T{K}^{\ast}\omega +\lambda {\left\Vert \begin{array}{c}\theta \\ {}\omega \end{array}\right\Vert}_1\right) $$
(11)

where k(•, yt) = (k(x1, yt), k(x2, yt), ..., k(xN, yt)), k(∘, yt) = \( \left(k\left({x}_1^{\hbox{'}},{y}_t\right),k\left({x}_2^{\hbox{'}},{y}_t\right),...,k\left({x}_N^{\hbox{'}},{y}_t\right)\right) \), \( K=\left(\begin{array}{ccc}k\left({x}_1,{x}_1\right)& \cdots & k\left({x}_1,{x}_N\right)\\ {}\vdots & \ddots & \vdots \\ {}k\left({x}_N,{x}_1\right)& \cdots & k\left({x}_N,{x}_N\right)\end{array}\right) \), \( G=\left(\begin{array}{ccc}k\left({x}_1^{\hbox{'}},{x}_1^{\hbox{'}}\right)& \cdots & k\left({x}_1^{\hbox{'}},{x}_N^{\hbox{'}}\right)\\ {}\vdots & \ddots & \vdots \\ {}k\left({x}_N^{\hbox{'}},{x}_1^{\hbox{'}}\right)& \cdots & k\left({x}_N^{\hbox{'}},{x}_N^{\hbox{'}}\right)\end{array}\right) \), and \( {K}^{\ast }=\left(\begin{array}{ccc}k\left({x}_1,{x}_1^{\hbox{'}}\right)& \cdots & k\left({x}_1,{x}_N^{\hbox{'}}\right)\\ {}\vdots & \ddots & \vdots \\ {}k\left({x}_N,{x}_1^{\hbox{'}}\right)& \cdots & k\left({x}_N,{x}_N^{\hbox{'}}\right)\end{array}\right) \).
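
The derivation above uses k(xi', xi') = 1, which holds when the intraclass variation matrix is formed in the input space by Eq. (2) and the Gaussian kernel is then applied to its columns. Under that reading, the kernel blocks of Eq. (11) can be computed as in the following NumPy sketch; the function and variable names are ours, not the paper's.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / sigma^2) for every column pair of A (D x n) and B (D x m)."""
    sq = np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

def kernel_blocks(X, A_I, y_t, sigma):
    """X: D x N training matrix, A_I: D x N intraclass variation matrix (Eq. (2)), y_t: test sample (D,)."""
    K = gaussian_gram(X, X, sigma)                            # K_ij  = k(x_i, x_j)
    G = gaussian_gram(A_I, A_I, sigma)                        # G_ij  = k(x'_i, x'_j)
    K_star = gaussian_gram(X, A_I, sigma)                     # K*_ij = k(x_i, x'_j)
    k_dot = gaussian_gram(X, y_t[:, None], sigma).ravel()     # k(., y_t)
    k_circ = gaussian_gram(A_I, y_t[:, None], sigma).ravel()  # k(o, y_t)
    return K, G, K_star, k_dot, k_circ
```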

Similar to the other typical SRC based models, our KESRC model is also based on the L1 norm minimization, and its traditional solution is time-consuming. Next, we will introduce a new approach to solve the proposed KESRC model by using the coordinate descent approach [14].

3.2 Fast KESRC

In this subsection, we exploit the coordinate descent approach to solve the KESRC model. Coordinate descent does not require computing the full gradient of the objective function: each coordinate-wise minimization has an explicit closed-form update, and the method exploits the sparsity of the model by mainly evaluating inner products for variables with non-zero coefficients [14]. Its computational efficiency is therefore very high, and it can greatly improve the efficiency of KESRC in theory. Hence, our proposed method is a fast KESRC (FKESRC) algorithm. In FKESRC, we need to compute two coefficient vectors θ and ω, which are associated with the training samples and the columns of the intraclass variation matrix, respectively. First, we determine the vector θ with the coordinate descent method. Note that the objective function J(θ, ω) in Eq. (11) can be rewritten as follows

$$ {\displaystyle \begin{array}{c}J\left(\theta, \omega \right)=-2\sum \limits_{i=1}^N{\theta}_ik\left({x}_i,{y}_t\right)-2\sum \limits_{i=1}^N{\omega}_ik\left({x}_i^{\hbox{'}},{y}_t\right)\\ {}+\sum \limits_{i=1}^N\sum \limits_{j=1}^N{\theta}_i{\theta}_jk\left({x}_i,{x}_j\right)+2\sum \limits_{i=1}^N\sum \limits_{j=1}^N{\theta}_i{\omega}_jk\left({x}_j^{\hbox{'}},{x}_i\right)+\sum \limits_{i=1}^N\sum \limits_{j=1}^N{\omega}_i{\omega}_jk\left({x}_i^{\hbox{'}},{x}_j^{\hbox{'}}\right)+\lambda {\left\Vert \begin{array}{c}\theta \\ {}\omega \end{array}\right\Vert}_1\end{array}} $$
(12)

We take the partial derivative of J(θ, ω) with respect to θi(i = 1, 2, ..., N), and have

$$ \frac{\partial J\left(\theta, \omega \right)}{\partial {\theta}_i}=-2k\left({x}_i,{y}_t\right)+2\sum \limits_{j=1}^N{\theta}_jk\left({x}_i,{x}_j\right)+2\sum \limits_{j=1}^N{\omega}_jk\left({x}_i,{x}_j^{\hbox{'}}\right)+\lambda \operatorname{sgn}\left({\theta}_i\right) $$
(13)

Setting \( \frac{\partial J\left(\theta, \omega \right)}{\partial {\theta}_i} \) to zero, we obtain

$$ {\theta}_i=k\left({x}_i,{y}_t\right)-\sum \limits_{j=1,j\ne i}^N{\theta}_jk\left({x}_i,{x}_j\right)-\sum \limits_{j=1}^N{\omega}_jk\left({x}_i,{x}_j^{\hbox{'}}\right)-\frac{\lambda }{2}\operatorname{sgn}\left({\theta}_i\right) $$
(14)

Let

$$ {e}_{\theta}\left({x}_i\right)=k\left({x}_i,{y}_t\right)-\sum \limits_{j=1,j\ne i}^N{\theta}_jk\left({x}_i,{x}_j\right)-\sum \limits_{j=1}^N{\omega}_jk\left({x}_i,{x}_j^{\hbox{'}}\right) $$
(15)

Thus, the coefficient θi is updated with the other coefficients θj (j = 1, 2, ..., N, j ≠ i) and ωj (j = 1, 2, ..., N) held fixed. The coefficient θi is computed as

$$ {\theta}_i=\operatorname{sgn}\left({e}_{\theta}\left({x}_i\right)\right){\left[\left|{e}_{\theta}\left({x}_i\right)\right|-\frac{\lambda }{2}\right]}_{+} $$
(16)

where the function [⋅]+ is defined as

$$ {\left[h\right]}_{+}=\left\{\begin{array}{cc}h, & h>0\\ {}0, & h\le 0\end{array}\right. $$
(17)
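
Eqs. (15)–(17) together are simply a soft-thresholding (shrinkage) step; the one-function NumPy sketch below (our own naming) makes this explicit.

```python
import numpy as np

def soft_threshold(e, lam):
    """Eqs. (16)-(17): sgn(e) * [|e| - lam/2]_+, where e = e_theta(x_i) from Eq. (15)."""
    return np.sign(e) * np.maximum(np.abs(e) - lam / 2.0, 0.0)
```

Any coordinate whose correlation eθ(xi) has magnitude below λ/2 is therefore set exactly to zero, which is what produces the sparsity of the solution.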

Similarly, we take the partial derivative of J(θ, ω) with respect to ωi(i = 1, 2, ..., N), and have

$$ \frac{\partial J\left(\theta, \omega \right)}{\partial {\omega}_i}=-2k\left({x}_i^{\hbox{'}},{y}_t\right)+2\sum \limits_{j=1}^N{\theta}_jk\left({x}_j,{x}_i^{\hbox{'}}\right)+2\sum \limits_{j=1}^N{\omega}_jk\left({x}_i^{\hbox{'}},{x}_j^{\hbox{'}}\right)+\lambda \operatorname{sgn}\left({\omega}_i\right) $$
(18)

Setting \( \frac{\partial J\left(\theta, \omega \right)}{\partial {\omega}_i} \) to zero, we obtain

$$ {\omega}_i=k\left({x}_i^{\hbox{'}},{y}_t\right)-\sum \limits_{j=1,j\ne i}^N{\omega}_jk\left({x}_i^{\hbox{'}},{x}_j^{\hbox{'}}\right)-\sum \limits_{j=1}^N{\theta}_jk\left({x}_j,{x}_i^{\hbox{'}}\right)-\frac{\lambda }{2}\operatorname{sgn}\left({\omega}_i\right) $$
(19)

Let

$$ {e}_{\omega}\left({x}_i\right)=k\left({x}_i^{\hbox{'}},{y}_t\right)-\sum \limits_{j=1,j\ne i}^N{\omega}_jk\left({x}_i^{\hbox{'}},{x}_j^{\hbox{'}}\right)-\sum \limits_{j=1}^N{\theta}_jk\left({x}_j,{x}_i^{\hbox{'}}\right) $$
(20)

Also, the coefficient ωi is updated with the other coefficients ωj (j = 1, 2, ..., N, j ≠ i) and θj (j = 1, 2, ..., N) held fixed. The coefficient ωi is computed as

$$ {\omega}_i=\operatorname{sgn}\left({e}_{\omega}\left({x}_i\right)\right){\left[\left|{e}_{\omega}\left({x}_i\right)\right|-\frac{\lambda }{2}\right]}_{+} $$
(21)

where the function [⋅]+ is defined in Eq. (17).

After obtaining the coefficients θi and ωi (i = 1, 2, ..., N), we can employ them to classify the testing sample yt. To this end, we compute the following residuals

$$ {r}_k\left({y}_t\right)={\left\Vert \phi \left({y}_t\right)-\Big[\phi (X),\phi \left({A}_I\right)\Big]\left[\begin{array}{c}{\delta}_k\left(\theta \right)\\ {}\omega \end{array}\right]\right\Vert}_2^2,\left(k=1,2,...,c\right) $$
(22)

where δk(θ) ∈ RN is a vector whose only nonzero entries are the entries in θ associated with class k. Then, Eq. (22) can be rewritten as

$$ {r}_k\left({y}_t\right)=k\left({y}_t,{y}_t\right)-2k\left(\bullet, {y}_t\right){\delta}_k\left(\theta \right)-2k\left(\circ, {y}_t\right)\omega +{\delta}_k{\left(\theta \right)}^TK{\delta}_k\left(\theta \right)+{\omega}^T G\omega +2{\delta}_k{\left(\theta \right)}^T{K}^{\ast}\omega $$
(23)

Hence, the class label of yt is determined by

$$ Identity\left({y}_t\right)=\arg \underset{k}{\min }{r}_k\left({y}_t\right),\left(k=1,2,...,c\right), $$
(24)

In our algorithm, the coefficient vectors θ and ω are updated iteratively by Eqs. (16) and (21). The initialization of θ and ω is computed by solving the equation ϕ(yt) = ϕ(X)θ + ϕ(AI)ω with the kernel trick, which gives

$$ {\left[\begin{array}{c}\theta \\ {}\omega \end{array}\right]}_{in}={\left({P}^TP+\mu I\right)}^{-1}{P}^Tk\left(\bullet, {y}_t\right)= Sk\left(\bullet, {y}_t\right) $$
(25)

where P = [K, K*], S = (PTP + μI)−1PT, μ is a small positive value, and I is the identity matrix. The following Algorithm 1 summarizes our FKESRC.


Algorithm 1. Fast KESRC algorithm

In the above algorithm, the number of iterations m is usually a small integer, say, 3 to 10, and can easily lead to a sparse solution. In our experiments, m is set to 3.
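
For concreteness, the whole test-time procedure, from the initialization of Eq. (25) through m coordinate-descent sweeps (Eqs. (15)–(16) and (20)–(21)) to the residual-based decision of Eqs. (23)–(24), can be sketched as follows. This is our own illustrative NumPy transcription, not the authors' reference code: it assumes the intraclass variation matrix is built in the input space with the Gaussian kernel k(x, y) = exp(−‖x − y‖²/σ²) of Subsection 3.1 applied to its columns, and it sweeps θ first and then ω within each of the m iterations.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / sigma^2) between the columns of A and B."""
    sq = np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :] - 2.0 * A.T @ B
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

def fkesrc_classify(X, labels, y_t, sigma, lam=0.01, mu=1e-3, m_iter=3):
    """Classify one test sample y_t (shape (D,)) with the FKESRC procedure sketched above."""
    N = X.shape[1]
    # intraclass variation matrix of Eq. (2), built in the input space
    A_I = np.empty_like(X)
    for c in np.unique(labels):
        idx = labels == c
        A_I[:, idx] = X[:, idx] - X[:, idx].mean(axis=1, keepdims=True)
    # kernel blocks of Eq. (11)
    K = gaussian_gram(X, X, sigma)
    G = gaussian_gram(A_I, A_I, sigma)
    Ks = gaussian_gram(X, A_I, sigma)                         # K*
    k_dot = gaussian_gram(X, y_t[:, None], sigma).ravel()     # k(., y_t)
    k_circ = gaussian_gram(A_I, y_t[:, None], sigma).ravel()  # k(o, y_t)
    # initialization, Eq. (25), with P = [K, K*]
    P = np.hstack([K, Ks])
    coef = np.linalg.solve(P.T @ P + mu * np.eye(2 * N), P.T @ k_dot)
    theta, omega = coef[:N].copy(), coef[N:].copy()
    # m coordinate-descent sweeps, Eqs. (15)-(16) and (20)-(21); k(x_i, x_i) = k(x'_i, x'_i) = 1
    for _ in range(m_iter):
        for i in range(N):
            e = k_dot[i] - (K[i] @ theta - theta[i]) - Ks[i] @ omega       # Eq. (15)
            theta[i] = np.sign(e) * max(abs(e) - lam / 2.0, 0.0)           # Eq. (16)
        for i in range(N):
            e = k_circ[i] - (G[i] @ omega - omega[i]) - Ks[:, i] @ theta   # Eq. (20)
            omega[i] = np.sign(e) * max(abs(e) - lam / 2.0, 0.0)           # Eq. (21)
    # class-wise residuals of Eq. (23); k(y_t, y_t) = 1 for the Gaussian kernel
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        d = np.where(labels == c, theta, 0.0)                              # delta_c(theta)
        r = (1.0 - 2.0 * k_dot @ d - 2.0 * k_circ @ omega
             + d @ K @ d + omega @ G @ omega + 2.0 * d @ Ks @ omega)
        residuals.append(r)
    return classes[int(np.argmin(residuals))]                              # Eq. (24)
```

In a batch setting, everything that depends only on the training set (K, G, K* and (PTP + μI)−1PT) can be computed once and reused for every test sample, which is what the complexity analysis in the next subsection assumes; the sketch recomputes these quantities per call only for brevity.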

3.3 Time complexity

In this subsection, we analyze the time complexity of our algorithm. The analysis contains three main parts. The first part is the preprocessing or initialization procedure, which computes the matrices K, G and K* in Eq. (11), and (PTP + μI)−1PT in Eq. (25). The time complexity of this procedure is about O(N2(3D + 16N)), where N is the number of training samples and D is the data dimensionality. The second part is representing the testing samples. For each testing sample, the time complexity of the representation is O(mN2), where the number of iterations m is usually a relatively small integer (m = 3 in this work). Hence, the total time complexity of this part is O(mN2T), where T is the total number of testing samples, which is often far larger than N. The third part is calculating the residuals and classifying each testing sample; its time complexity is roughly O(N2) per sample, i.e., about O(N2T) in total. Note that the typical SRC algorithm [38] contains two principal procedures: representing and classifying the testing samples. The time complexities of representing and classifying all the testing samples are O(NeD2T), where e ≥ 1.2, and O(NDT), respectively [13, 44].

From the above analysis, we know that if the number of testing samples is sufficiently large, the second part, i.e., the representation part, whose time complexity is not related to the data dimensionality, is the most time-consuming procedure in our algorithm. The representation procedure is also the most time-consuming part of the typical SRC algorithm; however, its time complexity is tightly related to the data dimensionality. Therefore, compared with the typical SRC algorithm, our algorithm is computationally efficient when the number of testing samples is large and the data dimensionality is high.

4 Experiments

This section verifies the computational efficiency and recognition effectiveness of the proposed FKESRC algorithm through several experiments on four popular real-world face databases: the GT, AR, YaleB and LFW (labeled faces in the wild) databases [3, 25, 35]. For comparison, we also implemented other state-of-the-art representation based classification algorithms: the collaborative representation for classification (CRC) [52], which is based on the L2 norm minimization, and typical classification algorithms based on the L1 norm minimization, namely SRC [38], SRC solved with the coordinate descent scheme (SCD), KSRC [53], ESRC [8] and its kernel version, i.e., kernel ESRC (KESRC). The sparse representation procedures of these algorithms are implemented with the l1_ls package [20, 42], which solves the L1 norm minimization problem and yields high-quality classification performance. Similar to ESRC, our proposed algorithm can be viewed as a dictionary learning method, so we also implemented two recently proposed dictionary learning methods: the kernel extended dictionary (KED) [18] and the locality-constrained and label embedding dictionary learning (LCLEDL) [22] method. In the experiments, the kernel function in KSRC, KESRC and our proposed FKESRC is the Gaussian kernel k(x, y) = exp(−‖x − y‖²/2σ²), where σ is the Gaussian kernel parameter; it is set to r × d, where d is the average Euclidean distance between the training samples and r, referred to as the kernel ratio, is specified on each data set. The error tolerance ε in Eq. (4) is set to 1e-3. Since the SRC, KSRC, ESRC and KESRC algorithms are very time-consuming, the dimensionality of all face images is reduced by principal component analysis (PCA) in all experiments to save computational time. For a fair comparison, the image dimensionality in our proposed algorithm and the other algorithms is also reduced by PCA. All the experiments are run on a platform with a 3.4 GHz CPU and 8.0 GB RAM using Matlab R2016b.
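
The mapping from the kernel ratio r to the Gaussian kernel width σ = r × d can be reproduced with the small NumPy sketch below (our own code, not the paper's; the paper does not state whether the average Euclidean distance d is taken over distinct sample pairs or over all ordered pairs, and the sketch averages over distinct pairs).

```python
import numpy as np

def kernel_width(X, r):
    """sigma = r * d, where d is the average Euclidean distance between the training columns of X (D x N)."""
    sq = np.sum(X**2, 0)[:, None] + np.sum(X**2, 0)[None, :] - 2.0 * X.T @ X
    dists = np.sqrt(np.maximum(sq, 0.0))
    d = dists[np.triu_indices(X.shape[1], k=1)].mean()   # mean over distinct pairs
    return r * d

# e.g., sigma = kernel_width(X_train, r=0.8)   # r = 0.8 is the FKESRC kernel ratio used on the GT subsets
```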

4.1 Experiments on the GT database

The first experiment was conducted on the GT (Georgia Tech) face database, which was built at the Georgia Institute of Technology. This database contains 50 individuals with 15 color images per individual. The face images of each individual exhibit several variations in pose, expression, and illumination [25]. Figure 1 shows some samples from this database. All the images were first converted to greyscale, and then cropped and resized to a resolution of 60 × 50 pixels. We randomly split the image samples of each subject into training and testing parts. For each individual, a few images (N = 3, 4, 5 and 6) were randomly selected for training, and the rest were used for testing.

Fig. 1

Some examples of face images in the GT face database

The face image data dimensionality was reduced to 40 by PCA in this experiment. In the KSRC algorithm, the Gaussian kernel parameter was set to 0.1 × d; that is, the kernel ratio was 0.1, the same as that of KESRC in all experiments. The kernel ratio in our FKESRC algorithm was set to 0.8, and the parameter λ in Eqs. (16) and (21) was set to 0.01. We randomly ran each algorithm 10 times on each training subset. Table 1 gives the recognition rates (mean ± standard deviation, in percent) on the four training subsets (denoted by N = 3, 4, 5 and 6 in Table 1). From this table, we can see that when N = 4 and 5, FKESRC obtained the highest recognition rates among all the algorithms except KESRC. Compared with the other algorithms, our FKESRC achieves desirable classification performance.

Table 1 Recognition rates on the GT database

Our method is a typical SRC based algorithm whose sparse model is built on the L1 norm minimization, as are the models of SRC, SCD, KSRC, ESRC and KESRC. Therefore, for a fair comparison, we compared these algorithms with our method in terms of computational time. For each algorithm, including our FKESRC, we denote the total computational time by Ct; the average computational time is then Ct/10 (each algorithm was randomly run 10 times, as mentioned above). Table 2 reports the average computational time of the six algorithms (SRC, SCD, KSRC, ESRC, KESRC and our FKESRC). Note that the ESRC and KESRC algorithms are slower than the SRC and KSRC algorithms because their models involve the intraclass variation matrix, and directly solving them therefore consumes more CPU time. Our FKESRC model also uses the intraclass variation matrix; however, our algorithm is much more computationally efficient than the ESRC and KESRC algorithms since it exploits the coordinate descent approach. Because our method needs to compute the intraclass variation matrix, it is slightly slower than the SCD algorithm, which does not; however, the recognition rates of our method are higher than those of SCD. From Table 2, we can observe that the computational time of our approach is only 0.6% of that of KESRC when N = 3, and when N = 4, the ratio of the computational time of FKESRC to that of KESRC is only 0.5%. From Tables 1 and 2, we can observe that the proposed FKESRC achieved desirable recognition results while being the second fastest algorithm among all tested algorithms.

Table 2 Computational time (s) on the GT data set

4.2 Experiment on the AR database

The second experiment was conducted on the AR data set [29], which contains over 4000 face images of 126 individuals. Each individual includes frontal views with different facial expressions, lighting conditions, and occlusions [11, 45]. Figure 2 shows some examples of a test subject in the AR database. We used the face images of 120 people, each with 26 images. All the images were first converted to greyscale, and then cropped and resized to a resolution of 20 × 25 pixels. Again, we randomly split the image samples of each individual into two parts, one for training and the other for testing. As before, the numbers of training images chosen for each person were 3, 4, 5 and 6 (N = 3, 4, 5 and 6).

Fig. 2

Some examples of face images in the AR database

The face image data dimensionality was reduced to 300 by PCA in this experiment. As before, we needed to set the parameters in KSRC and our FKESRC. In the KSRC algorithm, the kernel ratio was 1e-3; the kernel ratio in our FKESRC algorithm was set to 1.5 by experimentation, and the parameter λ in Eqs. (16) and (21) was again set to 0.01. We randomly ran each algorithm 10 times on each subset. Figure 3 depicts the recognition rates of all algorithms on the AR database, where N denotes the number of training images of each individual. From this figure, we can see that our proposed algorithm achieved recognition results comparable with those of the other algorithms. Most importantly, our approach was much more computationally efficient than the SRC, KSRC, ESRC and KESRC approaches. The computational time of these algorithms is reported in Table 3. As expected, the ratio of the computational time of our algorithm to that of the other algorithms was still very small; for example, it was only 1.1% of that of the KESRC algorithm when N = 3, and our approach was approximately 40 times faster than the SRC algorithm. From Table 3, we can find that FKESRC was the fastest algorithm among all tested. Moreover, the average time of representing and classifying each testing sample was only 0.01 s in our approach when N = 3, and not more than 0.03 s when N = 6, whereas the ESRC algorithm needed nearly 1 s for the same job. Therefore, our method can be applied in real-time recognition applications on the AR data set. The main reason for the very high computational efficiency of our approach and the SCD algorithm is that these two algorithms avoid directly solving the L1 norm minimization problem, whereas the other algorithms in our experiments require the direct solution of the L1 norm minimization problem, which is very time consuming, as demonstrated in the experiments.

Fig. 3

Recognition rates on AR database

Table 3 Computational time (s) on the AR data set

4.3 Experiment on the YaleB database

We conducted the third experiment on the YaleB face data set, which contains 10 subjects with 5850 face images in total, i.e., 585 face images per individual [2]. The YaleB data set is a widely used illumination data set whose images span a large range of possible illuminations [3]. Figure 4 shows some image examples of a subject in the YaleB database. The goal of this experiment is to investigate the performance of our algorithm under different reduced dimensionalities.

Fig. 4

Some examples of a test subject in the YaleB database

In the YaleB data set, each image was manually aligned and cropped, and the size of each image was 30 × 40. Also, we randomly split the image samples of each individual into training and testing parts, with 10 training images per individual. The face image data dimensionality was reduced to 40, 50, 60, 70 and 80 by PCA, which gave us five training subsets.

In this experiment, the kernel ratio of the KSRC algorithm was 0.1, the kernel ratio of our FKESRC algorithm was set to 0.9, and the parameter λ in Eqs. (16) and (21) was again set to 0.01. Also, we randomly ran each algorithm 10 times on each subset. Figure 5 shows the recognition rates of all algorithms on the YaleB database, where D denotes the dimensionality of the face images. From this figure, we can see that our proposed algorithm achieved desirable recognition results and performed comparably with the other algorithms in terms of recognition rates.

Fig. 5

Recognition rates on YaleB database

As in the first two experiments, we paid particular attention to the computational efficiency of the six SRC based algorithms. Table 4 reports their computational time. From this table, we can conclude that ESRC and KESRC were far slower than the other algorithms since they used the intraclass variation matrix and directly solved the L1 norm minimization problem, as mentioned in the previous experiments. Although our FKESRC model also involves the intraclass variation matrix, our method was far faster than ESRC and KESRC, since our approach avoids directly solving the L1 norm minimization problem. According to Table 4, the computational time of our algorithm was only 0.8% of that of the KESRC algorithm when D = 80. On the other hand, although the SRC and KSRC algorithms did not involve the intraclass variation matrix, both of them were far slower than our approach: our approach was about 50 times faster than the KSRC algorithm, which was itself faster than the SRC, ESRC and KESRC methods. From Table 4, we can also observe that the average time of representing and classifying each testing sample was only about 0.002 s in our approach. Therefore, our method is more suitable for real-time recognition applications on the YaleB database than the other algorithms.

Table 4 Computational time (s) on the YaleB data set

4.4 Experiment on the LFW database

The fourth experiment was conducted on the LFW (labeled faces in the wild) database, in which all face images were collected from the web [13]. This data set contains over 13,000 face images. Figure 6 gives several face image samples of one individual in the LFW database. We used the face images of 86 individuals, each with 11 images. All the images were first converted to greyscale, and then cropped and resized to a resolution of 32 × 32 pixels. Similarly, we randomly split the image samples of each individual into two parts, one for training and the other for testing. For each individual, a few images (N = 7 and 9) were randomly selected for training, and the rest were used for testing.

Fig. 6

Some examples of one individual in the LFW database

The face image data dimensionality was reduced to 200 by PCA in this experiment. In the KSRC algorithm, the kernel ratio was 1e-3, and the kernel ratio in our FKESRC algorithm was set to 0.5. Also, we randomly ran each algorithm 10 times on each training subset. Table 5 reports the recognition rates of all algorithms. From this table, we can observe that our approach achieves classification performance comparable with that of the KSRC, ESRC and KESRC algorithms, and outperforms the CRC, SRC, SCD, LCLEDL and KED methods.

Table 5 Recognition rates on the LFW database

In this experiment, we report the computational time of SRC, SCD, KSRC, ESRC, KESRC and our method in Table 6. According to this table, our method is the second fastest among these six algorithms. Our method is a variant of KESRC and ESRC, and the proposed FKESRC is far faster than both of them. From Table 6, we can see that FKESRC is about 60 times faster than KESRC, although KESRC is slightly better than our method in terms of recognition rates. On the LFW database, the average time of representing and classifying each testing sample was only about 0.02 s in our approach when N = 7. Hence, we can conclude that FKESRC is a real-time sparse representation based classification approach.

Table 6 Computational time (s) on the LFW data set

4.5 Determining the kernel ratio for FKESRC

In this subsection, we explain how to determine the kernel ratio parameter used in our FKESRC algorithm on the four face databases. We choose, from the set Z = {0.01, 0.02, ..., 10} (termed the kernel ratio set), the kernel ratio that leads to the best recognition result on each database. To illustrate the selection procedure, we first choose one subset from each database; as in the previous experiments, each subset includes one training subset and one testing subset. Specifically, for the GT data set, the number of training images per person is three, i.e., N = 3; similarly, for the AR data set, N = 3; for the YaleB data set, N = 10; and for the LFW data set, N = 7. Then, we ran FKESRC once with each kernel ratio in Z on the chosen subsets. The recognition results obtained with the different kernel ratios are shown in Fig. 7, where the abscissa denotes the kernel ratio and the ordinate the recognition rate. From this figure, we can observe that when the kernel ratio is close to 1, our FKESRC tends to achieve higher recognition rates. Hence, it is easy to determine a suitable kernel ratio on all databases.
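
This selection amounts to a plain grid search over Z. A hypothetical sketch is given below; it reuses the illustrative fkesrc_classify and kernel_width helpers defined in our earlier code sketches (not the authors' code), and the 0.01 step in the ratio grid is only our reading of the set Z above.

```python
import numpy as np

def select_kernel_ratio(X_tr, y_tr, X_te, y_te, ratios, kernel_width, classify):
    """Grid search: return the kernel ratio giving the highest accuracy on the chosen test subset."""
    best_r, best_acc = None, -1.0
    for r in ratios:
        sigma = kernel_width(X_tr, r)                                   # sigma = r * d
        pred = np.array([classify(X_tr, y_tr, x, sigma) for x in X_te.T])
        acc = float(np.mean(pred == y_te))
        if acc > best_acc:
            best_r, best_acc = r, acc
    return best_r, best_acc

# ratios = np.arange(0.01, 10.0 + 1e-9, 0.01)   # one reading of Z = {0.01, 0.02, ..., 10}
```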

Fig. 7

Recognition rates of different kernel ratios on four databases. a GT; b AR; c YaleB; d LFW

5 Conclusion

In this work, we proposed a fast kernel sparse representation based classification algorithm, FKESRC, which aims to solve the undersampling problem in face recognition. Theoretically, our approach addresses the undersampling problem well. FKESRC can also capture the nonlinear information within the data set and achieve desirable classification effectiveness, as demonstrated by the experiments in Section 4. Moreover, it is a computationally efficient sparse representation based classification method. Although our algorithm and the typical SRC, KSRC, ESRC and KESRC algorithms are all based on the L1 norm minimization problem, FKESRC does not need to solve the L1 norm minimization problem directly, while the other algorithms do. Among all the compared algorithms, our proposed method offers a good trade-off between recognition rate and computational time, and it can be applied in real-time classification applications such as face recognition. Besides, the basic idea of our method can potentially be exploited in lung prediction [50], background subtraction [24] and saliency detection [17].