1 Introduction

Inspired by the sparse coding mechanism of the human visual system [28, 29], sparse representation has become an appealing tool for data representation and has achieved competitive performance in image restoration [6, 10, 11, 21] and compressed sensing [8]. Sparse representation can also be used for effective image classification, e.g., face recognition (FR) [40, 42, 45, 48, 52] and digit and texture classification [16, 24, 33, 46].

The sparse representation based classification (SRC) framework involves two steps: coding and classification. First, a query signal/image is collaboratively coded over a dictionary of atoms under some sparsity constraint. Then, classification is performed using the coding coefficients and the dictionary. The dictionary for sparse coding can be predefined. For example, Wright et al. [42] populated a dictionary with the original training samples from all classes to code the query face image, and classified it by passing the coding coefficients into a minimum-reconstruction-error classifier. This so-called SRC classifier shows very competitive performance, but its dictionary is not effective enough to represent the query images for classification, due to two issues. First, taking the original training samples as the dictionary cannot effectively exploit the discriminative information in the training samples. Second, taking analytically designed off-the-shelf bases as the dictionary (e.g., [16] uses Haar and Gabor wavelets) might be effective for universal image types but not for specific types of images such as faces, digits, and textures. Both issues of a predefined dictionary can be addressed, to a certain extent, by properly learning a desired dictionary from the given training samples.

Dictionary learning (DL) aims to learn from training samples the optimal dictionary over which a given signal can be well represented or coded for processing. A number of DL methods have been proposed for image restoration [1, 6, 10, 11, 25, 55] and image classification [9, 14, 18, 19, 22, 24, 25, 31–34, 41, 46, 48–50, 52]. One representative DL method for image restoration is the K-means singular value decomposition (KSVD) algorithm [1], which learns an over-complete dictionary from example natural images and uses it for image restoration. Inspired by KSVD, many reconstructive DL methods [6, 10, 11, 25, 55] have been proposed and have shown state-of-the-art performance in image restoration tasks. Although these reconstructive DL methods achieve promising results in image restoration, they are not well suited to image classification, since their objective is only to represent the training samples faithfully. Different from image restoration, image classification aims to classify the input query sample correctly, so the discriminative ability of the learned dictionary is the major concern. To date, many discriminative DL methods have been proposed to promote the discriminative ability of the learned dictionary [9, 14, 18, 19, 22, 24, 25, 31–34, 41, 46, 48–50, 52] and have led to state-of-the-art results in pattern classification problems.

One popular type of discriminative DL method learns a shared dictionary for all classes while improving the discriminative ability of the coefficient vectors. In the DL methods proposed by Rodriguez et al. [34] and Jiang et al. [19], the coefficient vectors of samples from the same class are encouraged to be similar to each other. In addition to the l 0- or l 1-norm sparsity penalty, nonnegative [15], group [4, 38], and structured [17] sparsity penalties have been imposed on the representation coefficients in different applications. It is also popular to jointly learn a dictionary and a classifier over the coefficient vectors. In this spirit, Mairal et al. [24] and Pham et al. [25] proposed to learn discriminative dictionaries while jointly training linear classifiers. Inspired by the work of Pham et al. [25], Zhang et al. [52] extended the original KSVD algorithm by simultaneously learning a linear classifier for face recognition. Following Zhang et al. [52], Jiang et al. [19] proposed label-consistent KSVD (LC-KSVD), which introduces a label-consistency regularization to enforce the discrimination of coding vectors and exhibits good classification results. Recently, Mairal et al. [25] proposed a task-driven DL (TDDL) framework in which different risk functions of the representation coefficients are minimized for different tasks. Cai et al. [7] proposed a parameterization method that adaptively determines the weight of each coding-vector pair, leading to the support vector guided dictionary learning (SVGDL) model.

Another type of discriminative DL method learns a class-specific dictionary whose atoms correspond to the class labels. Mairal et al. [22] modified the KSVD model by introducing a discriminative reconstruction penalty term for texture segmentation and scene analysis. Yang et al. [47] and Sprechmann et al. [37] learned a dictionary for each class with sparse coefficients and used the learned dictionaries for face recognition and signal clustering, respectively. Castrodad et al. [9] imposed non-negativity penalties on both dictionary atoms and representation coefficients to learn a set of action-specific dictionaries. Wu et al. [44] learned active basis models from the training images of each category for object detection and recognition. Ramirez et al. [33] encouraged the dictionaries associated with different classes to be as independent as possible by introducing an incoherence-promoting term. Following Ramirez et al. [33], Wang et al. [41] presented a class-specific DL algorithm for sparse modeling in action recognition. Yang et al. [49, 50] proposed to learn a structured dictionary by imposing the Fisher discrimination criterion on the sparse coding coefficients to enhance the class discrimination power.

On the other hand, numerous dimensionality reduction (DR) methods have been developed for feature extraction; representative ones include Principal Component Analysis (PCA) [39], Linear Discriminant Analysis (LDA) [3], and Locality Preserving Projection (LPP) [27]. In all the previous DL methods, DR and DL are treated as two independent processes: traditionally, DR is first applied to the training samples, and the reduced-dimensionality features are then used for DL. However, the pre-learned DR projection may not preserve the features best suited to DL. Therefore, the DR and DL processes should be conducted simultaneously for more effective classification. Some work has been done to investigate dimensionality reduction for DL. For example, Zhang et al. [53] proposed an unsupervised learning method for dimensionality reduction in SRC, and Feng et al. [12] jointly trained a dimensionality reduction transform and a dictionary for face recognition. Both methods show higher FR rates than PCA and random projection.

In this paper, we propose a simultaneous dimensionality reduction and dictionary learning (SDRDL) model that learns a DR projection matrix P and a class-specific dictionary D (i.e., a dictionary whose atoms have correspondences to the class labels) simultaneously for pattern classification. In the proposed SDRDL, an objective function is defined and an iterative optimization algorithm is presented to alternately optimize the dictionary D and the projection P: in each iteration, SDRDL updates D with P fixed, then refines P with D fixed. After several iterations, the DR projection matrix P and the class-specific dictionary D are obtained together, so the simultaneously learned P and D match each other better and more effective pattern classification can be performed via the representation residual. In addition, both the representation residual and the representation coefficients of a query sample are discriminative, so a corresponding classification scheme is presented to exploit both kinds of discriminative information. Extensive experiments on image classification tasks show that SDRDL achieves performance competitive with state-of-the-art DL methods.

The rest of this paper is organized as follows. Section 2 briefly introduces the SRC scheme in [42]. Section 3 presents the proposed SDRDL model and its optimization procedure. Section 4 presents the SDRDL based classifier. Section 5 conducts experiments and Section 6 concludes the paper.

2 Sparse representation based classification

Wright et al. [42] proposed the sparse representation based classification (SRC) scheme for robust face recognition (FR). Given K classes of subjects, let D = [A 1, A 2, ⋯, A K ] be the dictionary formed by the set of training samples, where A i is the subset of training samples from class i. Let y be a query sample. The algorithm of SRC is summarized as follows.

  (a)

    Normalize each training sample in A i , i = 1, 2, ⋯, K.

  (b)

    Solve the l 1-minimization problem: \( \widehat{x}={\mathrm{argmin}}_x\left\{{\left\Vert y-Dx\right\Vert}_2^2+\gamma {\left\Vert x\right\Vert}_1\right\} \), where γ is a scalar constant.

  (c)

    Label the query sample y via Label(y) = argmin i {e i }, where \( {e}_i={\left\Vert y-{A}_i{\widehat{\alpha}}^i\right\Vert}_2^2 \), \( \widehat{x}={\left[{\widehat{\alpha}}^1,{\widehat{\alpha}}^2,\cdots, {\widehat{\alpha}}^K\right]}^T \) and \( {\widehat{\alpha}}^i \) is the coefficient sub-vector associated with class i.

Obviously, the underlying assumption of this scheme is that a query sample can be represented by a weighted linear combination of just those training samples belonging to the same class. Its impressive performance reported in [42] showed that sparse representation is naturally discriminative.
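The three steps above can be sketched directly in NumPy. The plain ISTA loop used below as the l 1 solver and the helper names (`ista_l1`, `src_classify`) are illustrative choices, not part of [42]; any l 1-minimization solver could be substituted.

```python
import numpy as np

def ista_l1(D, y, gamma=0.05, n_iter=500):
    """Solve min_x ||y - D x||_2^2 + gamma * ||x||_1 by iterative soft-thresholding."""
    L = 2 * np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the quadratic's gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = x - 2 * D.T @ (D @ x - y) / L      # gradient step on the quadratic term
        x = np.sign(x) * np.maximum(np.abs(x) - gamma / L, 0.0)   # soft threshold
    return x

def src_classify(A_list, y, gamma=0.05):
    """A_list[i] holds the (column-normalized) training samples of class i."""
    D = np.hstack(A_list)                      # step (a) assumed done by the caller
    x = ista_l1(D, y, gamma)                   # step (b)
    # step (c): split coefficients per class, pick the class with smallest residual
    sizes = np.cumsum([A.shape[1] for A in A_list])[:-1]
    residuals = [np.linalg.norm(y - A @ a) ** 2
                 for A, a in zip(A_list, np.split(x, sizes))]
    return int(np.argmin(residuals))
```

A query sample lying near the subspace spanned by one class's training samples is then assigned to that class by the residual rule.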

3 Simultaneous dimensionality reduction and dictionary learning (SDRDL)

3.1 Model construction

In most of the previous DL methods introduced in Section 1, the DR and DL processes are performed separately: the DR projection matrix is first learned from the original training data, and DL is then performed on the dimensionality-reduced training data. Different from these methods, we propose to learn the DR matrix P and the dictionary D simultaneously, to exploit the discriminative information in the training set more effectively.

Given K classes of subjects, let A = [A 1, A 2, ⋯, A K ] be the set of training samples, where A i is the subset of training samples from class i. The dimensionality reduced data of training set A can be obtained by PA and it should be represented by the dictionary D with the representation matrix Z, i.e., PA ≈ DZ. Z can be written as Z = [Z 1, Z 2, ⋯, Z K ], where Z i is the representation matrix of PA i over D. Beyond requiring that the dimensionality reduced data PA can be well represented by D (i.e., PA ≈ DZ), we also require that both of them can cooperate with each other to distinguish the samples in A. Therefore, we propose to simultaneously learn P and D = [D 1, D 2, ⋯, D K ], where D i is the class-specific sub-dictionary associated with class i, for performing pattern classification by simultaneous dimensionality reduction and dictionary learning (SDRDL) model:

$$ \begin{array}{c}\hfill \left\langle P,D,Z\right\rangle ={\mathrm{argmin}}_{P,D,Z}\left\{{\displaystyle \sum_{i=1}^Kr\left(P,{A}_i,D,{Z}_i\right)+{\lambda}_1{\left\Vert Z\right\Vert}_1+{\lambda}_2h(Z)}\right\}\hfill \\ {}\hfill s.t.{\left\Vert {d}_n\right\Vert}_2=1,\forall n,P{P}^T=I.\hfill \end{array} $$
(1)

where r(P, A i , D, Z i ) is the projective representation-constrained term; ‖Z‖ 1 is the sparsity penalty; h(Z) is the coefficients diversity term imposed on the coefficient matrix Z; and λ 1, λ 2, and γ are scalar parameters. Each atom d n of D is constrained to have unit l 2-norm to prevent the atoms of D from having arbitrarily large l 2-norms, which would lead to trivial solutions of the coefficient matrix Z. In the following sections, we discuss the terms r(P, A i , D, Z i ) and h(Z) in detail.

3.1.1 Projective representation-constrained term r(P, A i , D, Z i )

The dimensionality-reduced data PA i of the original data A i should be represented by D with Z i , i.e., PA i  ≈ DZ i . Z i can be written as Z i  = [Z 1 i ; ⋯; Z j i ; ⋯; Z K i ], where Z j i is the matrix of representation coefficients of PA i over D j . Denote by R k  = D k Z k i the component of the representation of PA i contributed by D k . Then:

$$ P{A}_i\approx D{Z}_i={D}_1{Z}_i^1+\cdots +{D}_i{Z}_i^i+\cdots +{D}_K{Z}_i^K={R}_1+\cdots +{R}_i+\cdots +{R}_K $$
(2)

where R i  = D i Z i i . Since D i is associated with the i-th class, it is natural to expect that PA i can be well represented by D i but not by D j , j ≠ i. This implies that Z i i should contain some significant coefficients such that ‖PA i  − D i Z i i ‖ 2 F is small, while ‖PA i  − D j Z j i ‖ 2 F should be big. The latter can be attained, to some extent, by forcing Z j i to have only very small coefficients so that ‖D j Z j i ‖ 2 F is small. Furthermore, since PA i is the dimensionality-reduced version of the original data A i , it is also expected that A i can be well reconstructed from the projected subspace by P. This can be accomplished by minimizing ‖A i  − P T PA i ‖ 2 F , which is the amount of energy discarded by the DR matrix P, or equivalently the difference between the low-dimensional approximation and the original training set. Therefore, we define the projective representation-constrained term r(P, A i , D, Z i ) as:

$$ r\left(P,{A}_i,D,{Z}_i\right)={\left\Vert P{A}_i-D{Z}_i\right\Vert}_F^2+{\left\Vert P{A}_i-{D}_i{Z}_i^i\right\Vert}_F^2+{\displaystyle \sum_{j=1,j\ne i}^K{\left\Vert {D}_j{Z}_i^j\right\Vert}_F^2}+\gamma {\left\Vert {A}_i-{P}^TP{A}_i\right\Vert}_F^2 $$
(3)

By using the four terms in Eq. (3), A i can be well reconstructed by P T PA i , and P and D can better fit each other, so that D i has not only the minimal but also a very small representation residual for PA i , while the other class-specific sub-dictionaries D j , j ≠ i, have big representation residuals for PA i .
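The four terms of Eq. (3) can be transcribed directly into NumPy for concreteness. This is a plain evaluation of the term, not part of the learning algorithm; the function name `r_term` and the list-based representation of the sub-dictionaries and coefficient blocks are illustrative conventions.

```python
import numpy as np

def r_term(P, A_i, D_list, Z_i_blocks, i, gamma):
    """Projective representation-constrained term r(P, A_i, D, Z_i) of Eq. (3).

    D_list     : class-specific sub-dictionaries D_j (each m x p_j)
    Z_i_blocks : blocks Z_i^j coding P A_i over the corresponding D_j
    """
    D = np.hstack(D_list)               # whole dictionary
    Z_i = np.vstack(Z_i_blocks)         # whole coefficient matrix of class i
    PA = P @ A_i
    fro2 = lambda M: np.linalg.norm(M, "fro") ** 2
    total = fro2(PA - D @ Z_i)                        # fidelity over the whole dictionary
    total += fro2(PA - D_list[i] @ Z_i_blocks[i])     # fidelity over the own-class sub-dictionary
    total += sum(fro2(D_list[j] @ Z_i_blocks[j])      # suppress wrong-class contributions
                 for j in range(len(D_list)) if j != i)
    total += gamma * fro2(A_i - P.T @ PA)             # energy discarded by P
    return total
```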

3.1.2 Coefficients diversity term h(Z)

Given a query sample, Wright et al. [42] showed that its sparse representation can be found by the SRC scheme and that the largest coefficients in the recovered coefficient vector are associated with the training samples sharing the query sample's class label. This implies that the query sample can be approximated by a weighted linear combination of its own class's training samples with these largest coefficients. Likewise, in our proposed SDRDL model, it is expected that the largest coefficients in Z i (or Z j ) are associated with D i (or D j ), as illustrated in Fig. 1. From this figure, it can also be seen that ‖Z T j Z i ‖ 2 F is big when A i and A j belong to the same class and small when they belong to different classes; that is, ‖Z T j Z i ‖ 2 F reflects the relative similarity of samples. The coefficients diversity term h(Z) is thus defined as:

$$ h(Z)={\displaystyle \sum_{j\ne i}{\left\Vert {Z}_j^T{Z}_i\right\Vert}_F^2} $$
(4)
Fig. 1
figure 1

Sparse representation of training samples over dictionary D. Training samples with different colors belong to different classes. And atoms with different colors in dictionary D have different class labels

When A i and A j belong to different classes (i.e., i ≠ j), minimizing the coefficients diversity term h(Z) encourages the largest coefficients in Z i and Z j to be associated with the corresponding different sub-dictionaries (i.e., D i and D j , respectively). Therefore, the discriminative ability of the dictionary D is further promoted.
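Eq. (4) is a sum of squared Frobenius norms of cross-class Gram blocks and is straightforward to evaluate; the sketch below is a direct NumPy transcription, with `Z_list` as an assumed per-class list layout.

```python
import numpy as np

def diversity_term(Z_list):
    """Coefficients diversity term h(Z) of Eq. (4).

    Z_list[i] is the coefficient matrix Z_i of class i (atoms x samples);
    the term sums ||Z_j^T Z_i||_F^2 over all ordered pairs with j != i.
    """
    K = len(Z_list)
    return sum(np.linalg.norm(Z_list[j].T @ Z_list[i], "fro") ** 2
               for i in range(K) for j in range(K) if j != i)
```

When the classes activate disjoint sets of atoms, the cross-Gram blocks vanish and h(Z) = 0, which is exactly the configuration the penalty encourages.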

3.1.3 SDRDL model

By incorporating Eqs. (3) and (4) into Eq. (1), SDRDL model can be formulated as:

$$ \begin{array}{l}\left\langle P,D,Z\right\rangle ={\mathrm{argmin}}_{P,\ D,Z}\left\{{\displaystyle {\sum}_{i=1}^K\left({\left\Vert P{A}_i-D{Z}_i\right\Vert}_F^2+{\left\Vert P{A}_i-{D}_i{Z}_i^i\right\Vert}_F^2+{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {D}_j{Z}_i^j\right\Vert}_F^2}+\right.}\right.\hfill \\ {}\left.\left.\gamma {\left\Vert {A}_i-{P}^TP{A}_i\right\Vert}_F^2\right)+{\lambda}_1{\left\Vert Z\right\Vert}_1+{\lambda}_2{\displaystyle {\sum}_{j\ne i}{\left\Vert {Z}_j^T{Z}_i\right\Vert}_F^2}\right\}s.t.{\left\Vert {d}_n\right\Vert}_2=1,\forall n,P{P}^T=I.\hfill \end{array} $$
(5)

The objective of SDRDL is to simultaneously learn the DR matrix P and the class-specific dictionary D. In the subspace determined by P, each sub-dictionary D i will have small representation residuals for the samples from class i but big representation residuals for samples from other classes. Besides, the representation coefficient vectors of samples from one class will be similar to each other but dissimilar to those of samples from other classes. Ideally, if P and D are well optimized, they will be well suited to the classification task.

3.2 Optimization

Eq. (5) is a single-objective optimization problem whose objective function is non-convex in 〈P, D, Z〉. As in the works [19, 33, 49, 50, 52] that address similar optimization problems, we divide the whole optimization into two sub-problems: updating D and Z with P fixed, and updating P with D and Z fixed. These two sub-problems are solved alternately and iteratively to obtain the desired dimensionality reduction projection matrix P and dictionary D.

3.2.1 Update D and Z

With P fixed, D and Z are updated; the objective function in Eq. (5) then reduces to:

$$ \begin{array}{c}\hfill \left\langle D,Z\right\rangle ={\mathrm{argmin}}_{D,Z}\left\{{\displaystyle {\sum}_{i=1}^K\left({\left\Vert P{A}_i-D{Z}_i\right\Vert}_F^2+{\left\Vert P{A}_i-{D}_i{Z}_i^i\right\Vert}_F^2+{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {D}_j{Z}_i^j\right\Vert}_F^2}\right)}+{\lambda}_1{\left\Vert Z\right\Vert}_1\right.\hfill \\ {}\hfill \left.+{\lambda}_2{\displaystyle {\sum}_{j\ne i}{\left\Vert {Z}_j^T{Z}_i\right\Vert}_F^2}\right\}s.t.{\left\Vert {d}_n\right\Vert}_2=1\hfill \end{array} $$
(6)

Let B i  = PA i ; Eq. (6) can then be rewritten as:

$$ \begin{array}{c}\hfill \left\langle D,Z\right\rangle ={\mathrm{argmin}}_{D,Z}\left\{{\displaystyle {\sum}_{i=1}^K\left({\left\Vert {B}_i-D{Z}_i\right\Vert}_F^2+{\left\Vert {B}_i-{D}_i{Z}_i^i\right\Vert}_F^2+{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {D}_j{Z}_i^j\right\Vert}_F^2}\right)}+{\lambda}_1{\left\Vert Z\right\Vert}_1\right.\hfill \\ {}\hfill \left.+{\lambda}_2{\displaystyle {\sum}_{j\ne i}{\left\Vert {Z}_j^T{Z}_i\right\Vert}_F^2}\right\}s.t.{\left\Vert {d}_n\right\Vert}_2=1\hfill \end{array} $$
(7)

Obviously, D and Z can be solved for alternately and iteratively. When D is fixed, the objective function in Eq. (7) reduces to an update of Z = [Z 1, Z 2, ⋯, Z K ], in which Z i can be computed class by class. Thus the objective function in Eq. (7) further reduces to:

$$ { \min}_{Z_i}\left\{{\left\Vert {B}_i-D{Z}_i\right\Vert}_F^2+{\left\Vert {B}_i-{D}_i{Z}_i^i\right\Vert}_F^2+{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {D}_j{Z}_i^j\right\Vert}_F^2}+{\lambda}_1{\left\Vert {Z}_i\right\Vert}_1+{\lambda}_2{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {Z}_j^T{Z}_i\right\Vert}_F^2}\right\} $$
(8)

To prevent Z j from having arbitrarily large l 2-norm, we normalize each column of Z j in Eq. (8) to unit l 2-norm. Z i is then computed via the following objective function:

$$ { \min}_{Z_i}\left\{{\left\Vert {B}_i-D{Z}_i\right\Vert}_F^2+{\left\Vert {B}_i-{D}_i{Z}_i^i\right\Vert}_F^2+{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {D}_j{Z}_i^j\right\Vert}_F^2}+{\lambda}_1{\left\Vert {Z}_i\right\Vert}_1+{\lambda}_2{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {\tilde{Z}}_j^T{Z}_i\right\Vert}_F^2}\right\} $$
(9)

where, \( {\tilde{Z}}_j=\left[{\tilde{z}}_{j,1},{\tilde{z}}_{j,2},\cdots, {\tilde{z}}_{j,{n}_j}\right] \) denotes the normalized Z j and \( {\tilde{z}}_{j,i}={z}_{j,i}/{\left\Vert {z}_{j,i}\right\Vert}_2 \), i = 1, 2, ⋯, n j . We rewrite Eq. (9) as:

$$ { \min}_{Z_i}\left\{{\varphi}_i\left({Z}_i\right)+{\lambda}_1{\left\Vert {Z}_i\right\Vert}_1\right\} $$
(10)

where \( {\varphi}_i\left({Z}_i\right)={\left\Vert {B}_i-D{Z}_i\right\Vert}_F^2+{\left\Vert {B}_i-{D}_i{Z}_i^i\right\Vert}_F^2+{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {D}_j{Z}_i^j\right\Vert}_F^2}+{\lambda}_2{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {\tilde{Z}}_j^T{Z}_i\right\Vert}_F^2} \). It can be proved that φ i (Z i ) is convex with Lipschitz continuous gradient (see the Appendix for the proof). Therefore, we adopt the fast iterative shrinkage-thresholding algorithm (FISTA) [2] to solve Eq. (10), as described in Table 1.

Table 1 Learning sparse code Z i
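The FISTA step used in Table 1 can be sketched generically: given the gradient of the smooth part φ i and a Lipschitz constant L of that gradient, each iteration combines a gradient step, a soft-thresholding step, and a momentum extrapolation. The sketch below follows the standard FISTA recursion of [2]; the function names are illustrative.

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise shrinkage operator, the prox of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def fista(grad_phi, L, lam, Z0, n_iter=200):
    """Generic FISTA for min_Z phi(Z) + lam * ||Z||_1 (cf. Eq. (10) and Table 1).

    grad_phi : callable returning the gradient of the smooth term phi
    L        : Lipschitz constant of grad_phi
    """
    Z, Y, t = Z0.copy(), Z0.copy(), 1.0
    for _ in range(n_iter):
        Z_new = soft_threshold(Y - grad_phi(Y) / L, lam / L)   # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = Z_new + ((t - 1.0) / t_new) * (Z_new - Z)          # momentum extrapolation
        Z, t = Z_new, t_new
    return Z
```

For the actual sub-problem of Eq. (10), `grad_phi` would be the gradient of φ i (Z i ), which is quadratic in Z i .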

When Z is fixed, the objective function in Eq. (7) reduces to an update of D = [D 1, D 2, ⋯, D K ]. \( {D}_i=\left[{d}_1,{d}_2,\cdots, {d}_{p_i}\right] \) is also updated class by class. Thus the objective function in Eq. (7) reduces to:

$$ { \min}_{D_i}\left\{{\left\Vert \overline{B}-{D}_i{Z}^i\right\Vert}_F^2+{\left\Vert {B}_i-{D}_i{Z}_i^i\right\Vert}_F^2+{\displaystyle {\sum}_{j=1,j\ne i}^K{\left\Vert {D}_i{Z}_j^i\right\Vert}_F^2}\right\}s.t.{\left\Vert {d}_l\right\Vert}_2=1,l=1,2,\cdots, {p}_i $$
(11)

where \( \overline{B}=B-{\displaystyle {\sum}_{j=1,j\ne i}^K{D}_j{Z}^j} \), B = [B 1, B 2, ⋯, B K ], and Z i denotes the coefficient matrix of B over D i . Equation (11) can be rewritten in the following form [49, 50]:

$$ { \min}_{D_i}{\left\Vert {\overline{B}}_i-{D}_i{X}_i\right\Vert}_F^2s.t.{\left\Vert {d}_l\right\Vert}_2=1,l=1,2,\cdots, {p}_i $$
(12)

where \( {\overline{B}}_i=\left[\overline{B}\kern0.5em {B}_i\kern0.5em 0\kern0.5em \cdots \kern0.5em 0\right] \) and X i  = [Z i  Z i i  Z i 1  ⋯ Z i i − 1  Z i i + 1  ⋯ Z i K ]. Equation (12) can be efficiently solved by updating the dictionary atoms one by one via the algorithm of [23, 49, 50], as presented in Table 2.

Table 2 Learning dictionary D i
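The atom-by-atom scheme behind Table 2 can be sketched as follows: each column d l is updated with the other atoms fixed, then renormalized to unit l 2-norm. This is a simplified block-coordinate sketch in the spirit of [23, 49, 50]; the cited algorithm also refits the corresponding coefficient rows, which is omitted here for brevity, and the helper name `update_subdictionary` is illustrative.

```python
import numpy as np

def update_subdictionary(B_bar, D_i, X_i, n_sweeps=5):
    """Approximate solver for Eq. (12): min ||B_bar - D_i X_i||_F^2, ||d_l||_2 = 1."""
    D = D_i.copy()
    for _ in range(n_sweeps):
        for l in range(D.shape[1]):
            # residual with atom l removed from the reconstruction
            E = B_bar - D @ X_i + np.outer(D[:, l], X_i[l, :])
            d = E @ X_i[l, :]                  # unnormalized least-squares direction
            nrm = np.linalg.norm(d)
            if nrm > 1e-12:                    # skip atoms with no active coefficients
                D[:, l] = d / nrm              # project back onto the unit sphere
    return D
```

Each atom update is optimal among unit-norm vectors given the fixed coefficients, so the objective of Eq. (12) is non-increasing over sweeps.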

3.2.2 Update P

With D and Z fixed, P is updated; the objective function in Eq. (5) then reduces to:

$$ P={\mathrm{argmin}}_P\left\{{\displaystyle {\sum}_{i=1}^K\left({\left\Vert P{A}_i-D{Z}_i\right\Vert}_F^2+{\left\Vert P{A}_i-{D}_i{Z}_i^i\right\Vert}_F^2+\gamma {\left\Vert {A}_i-{P}^TP{A}_i\right\Vert}_F^2\right)}\right\}s.t.P{P}^T=I. $$
(13)

Let W i  = DZ i and W i i  = D i Z i i ; Eq. (13) can then be reformulated as:

$$ \begin{array}{l}P={\mathrm{argmin}}_P\left\{{\displaystyle {\sum}_{i=1}^K\left({\left\Vert P{A}_i-{W}_i\right\Vert}_F^2+{\left\Vert P{A}_i-{W}_i^i\right\Vert}_F^2+\gamma {\left\Vert {A}_i-{P}^TP{A}_i\right\Vert}_F^2\right)}\right\}\hfill \\ {}\kern1em ={\mathrm{argmin}}_P{\displaystyle \sum_{i=1}^K\left\{tr\left(P{Q}_i(P){P}^T\right)+tr\left(P{Q}_i^i(P){P}^T\right)+\gamma tr\left({A}_i^T{A}_i-P{A}_i{A}_i^T{P}^T\right)\right\}}\hfill \\ {}\kern2.5em s.t.P{P}^T=I.\hfill \end{array} $$
(14)

where Q i (P) = (A i  − P T W i )(A i  − P T W i )T and Q i i (P) = (A i  − P T W i i )(A i  − P T W i i )T. Since the term A T i A i has no effect on the solution of P, the objective function in Eq. (14) further reduces to:

$$ \begin{array}{c}\hfill P={\mathrm{argmin}}_P{\displaystyle {\sum}_{i=1}^K\left\{tr\left(P{Q}_i(P){P}^T\right)+tr\left(P{Q}_i^i(P){P}^T\right)-\gamma tr\left(P{A}_i{A}_i^T{P}^T\right)\right\}}\hfill \\ {}\hfill ={\mathrm{argmin}}_Ptr\left\{P{\displaystyle {\sum}_{i=1}^K\left({Q}_i(P)+{Q}_i^i(P)-\gamma {A}_i{A}_i^T\right){P}^T}\right\}s.t.P{P}^T=I.\hfill \end{array} $$
(15)

The above minimization is solved iteratively. In iteration t, we use Q i (P (t − 1)) and Q i i (P (t − 1)) to approximate Q i (P (t)) and Q i i (P (t)) in Eq. (15), where P (t − 1) is the projection matrix obtained in iteration t − 1. By applying the eigenvalue decomposition (EVD) technique, we have:

$$ \left[U,M,V\right]=EVD\left({\displaystyle {\sum}_{i=1}^K\left({Q}_i(P)+{Q}_i^i(P)-\gamma {A}_i{A}_i^T\right)}\right) $$
(16)

where M is the diagonal matrix formed by the eigenvalues of ∑ K i = 1 (Q i (P) + Q i i (P) − γA i A T i ). We can then update P using the m eigenvectors in U associated with the m smallest eigenvalues in M, i.e., P (t) = U(1 : m, :). However, such a direct update changes P too sharply and makes the optimization of the whole system in Eq. (5) unstable. Therefore, we update P gradually in our implementation, using the following form in each iteration:

$$ {P}^{(t)}={P}^{\left(t-1\right)}+c\left(U\left(1:m,:\right)-{P}^{\left(t-1\right)}\right) $$
(17)

where c is a small positive constant that controls the update of P across iterations. The algorithm for updating P is presented in Table 3.

Table 3 Learning dimensionality reduction projection matrix P
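The projection update of Eqs. (16)–(17) can be sketched as follows. Note one added assumption: since the gradual step of Eq. (17) does not by itself keep PP T = I, the sketch re-orthonormalizes the rows of P via a QR factorization afterwards; the paper's Table 3 may enforce the constraint differently. The function name `update_projection` is illustrative.

```python
import numpy as np

def update_projection(P, A_list, W_list, Wi_list, gamma=0.9, c=0.1):
    """One update of P (Eqs. (16)-(17)): accumulate Q_i(P) + Q_i^i(P) - gamma*A_i A_i^T,
    take the eigenvectors of the m smallest eigenvalues, and move P gradually."""
    d, m = P.shape[1], P.shape[0]
    S = np.zeros((d, d))
    for A_i, W_i, Wi_i in zip(A_list, W_list, Wi_list):
        Q = (A_i - P.T @ W_i) @ (A_i - P.T @ W_i).T          # Q_i(P)
        Qi = (A_i - P.T @ Wi_i) @ (A_i - P.T @ Wi_i).T       # Q_i^i(P)
        S += Q + Qi - gamma * (A_i @ A_i.T)
    vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    P_new = P + c * (vecs[:, :m].T - P)       # gradual update, Eq. (17)
    return np.linalg.qr(P_new.T)[0].T         # re-orthonormalize so P P^T = I (assumption)
```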

3.2.3 Algorithm of SDRDL

The complete SDRDL algorithm is summarized in Table 4. The proposed SDRDL model in Eq. (5) is jointly non-convex in 〈P, D, Z〉, so the optimization algorithm presented in Table 4 can at most attain a local minimum. When SDRDL updates D and Z, the objective function in Eq. (6) is convex in each of D and Z with the other fixed, and the proposed SDRDL algorithm reaches a local minimum of this sub-problem. However, when SDRDL updates P, Q i (P (t − 1)) and Q i i (P (t − 1)) are used to approximate Q i (P (t)) and Q i i (P (t)) in Eq. (15), so the obtained solution is only an approximation to the local minimum of the objective function in Eq. (15). Overall, the proposed SDRDL algorithm is not guaranteed to converge in theory, but empirically it reaches a stable solution.

Table 4 The algorithm of simultaneous dimensionality reduction and dictionary learning
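The alternation in Table 4 can be illustrated with a deliberately simplified toy loop. To keep the sketch short, several stand-ins are used and should not be read as the paper's algorithm: the l 1-regularized coding of Eq. (9) is replaced by a ridge (l 2) update, the sub-dictionaries are refit jointly by least squares instead of the atom-wise scheme of Eq. (12), the projection update uses only Q i (P) with γ = 1, and P is re-orthonormalized by QR. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K, n_i, p_i = 20, 5, 3, 8, 4          # ambient dim, reduced dim, classes, samples/class, atoms/class
A = [rng.standard_normal((d, n_i)) for _ in range(K)]
D = rng.standard_normal((m, K * p_i))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms, as required by Eq. (5)
P = np.linalg.qr(rng.standard_normal((d, m)))[0].T   # row-orthonormal init, P P^T = I
c, lam = 0.1, 0.1

for t in range(15):                          # fixed 15 iterations, as chosen from Fig. 2
    # -- update Z: ridge stand-in for the l1-regularized coding of Eq. (9) --
    G = D.T @ D + lam * np.eye(D.shape[1])
    Z = [np.linalg.solve(G, D.T @ (P @ Ai)) for Ai in A]
    # -- update D: joint least-squares fit, then renormalize atoms --
    B = np.hstack([P @ Ai for Ai in A])
    Zs = np.hstack(Z)
    D = B @ Zs.T @ np.linalg.pinv(Zs @ Zs.T)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    # -- update P: eigenvectors of the accumulated matrix, moved gradually (Eqs. (16)-(17)) --
    S = sum((Ai - P.T @ (D @ Zi)) @ (Ai - P.T @ (D @ Zi)).T - Ai @ Ai.T
            for Ai, Zi in zip(A, Z))
    vals, vecs = np.linalg.eigh(S)           # ascending eigenvalues
    P = P + c * (vecs[:, :m].T - P)
    P = np.linalg.qr(P.T)[0].T               # re-orthonormalize rows so P P^T = I
```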

To illustrate the minimization process of SDRDL, we use the Extended Yale B database [13] and the AR database [26] as examples. The reduced dimensionality of the face images is set to 300. The curves of the objective function in Eq. (5) vs. the iteration number are plotted in Fig. 2a and b for these two databases, respectively. It can be seen that after 15 iterations, the value of the objective function varies little and becomes stable on both databases. Our experimental results also indicate that stopping the minimization with somewhat more or fewer than 15 iterations yields a learned projection P and dictionary D with almost the same classification accuracy. Therefore, we set the maximal iteration number to 15, which works well in our experiments.

Fig. 2
figure 2

Convergence curves of SDRDL algorithm on a Extended Yale B database b AR database

4 Classification scheme

Once the DR projection matrix P and the class-specific dictionary D are obtained, the low-dimensional feature of a query sample y can be computed as Py and then coded over the dictionary D. Here, the sparse representation model with the l 1-norm is adopted for coding:

$$ \widehat{\alpha}={\mathrm{argmin}}_{\alpha}\left\{{\left\Vert Py-D\alpha \right\Vert}_2^2+\lambda {\left\Vert \alpha \right\Vert}_1\right\} $$
(18)

where λ is a constant.

Denote by \( \widehat{\alpha}={\left[{\widehat{\alpha}}^1,{\widehat{\alpha}}^2,\cdots, {\widehat{\alpha}}^K\right]}^T \), where \( {\widehat{\alpha}}^i \) is the coefficient sub-vector associated with sub-dictionary D i . Once \( \widehat{\alpha} \) is computed, the reconstruction residual of each D i can be used to classify the input query sample, as in SRC [42]. On the other hand, the normalized coefficient matrix of each class, denoted by \( {\tilde{Z}}_j \), is also learned in the proposed SDRDL algorithm, and thus the dissimilarity between \( {\tilde{Z}}_j \) and \( \widehat{\alpha} \) is also discriminative. Therefore, the metric for classification can be defined as:

$$ {e}_i={\left\Vert Py-{D}_i{\widehat{\alpha}}^i\right\Vert}_2^2+w{\displaystyle {\sum}_{j\ne i}{\left\Vert {\tilde{Z}}_j^T\widehat{\alpha}\right\Vert}_F^2/{n}_j} $$
(19)

where w is a preset weight to balance the contribution of the two terms for classification; n j is the number of training samples from class j. When the number of training samples from each class is the same, Eq. (19) can be rewritten as:

$$ {e}_i={\left\Vert Py-{D}_i{\widehat{\alpha}}^i\right\Vert}_2^2+\delta {\displaystyle {\sum}_{j\ne i}{\left\Vert {\tilde{Z}}_j^T\widehat{\alpha}\right\Vert}_F^2} $$
(20)

where δ = w/n j . The classification rule is simply set as identity(y) = argmin i {e i }.

5 Experimental results

We verify the performance of SDRDL on face recognition and action classification. Section 5.1 discusses parameter selection; Section 5.2 conducts experiments on face recognition; Section 5.3 performs experiments on action classification. In Tables 5, 6, and 7, the highest classification rates are highlighted in boldface.

Table 5 The face recognition rates (%) of competing methods on the Extended Yale B database
Table 6 The face recognition rates (%) of competing methods on the AR database
Table 7 The accuracies (%) of competing methods on UCF sports action database

5.1 Parameter selection

In SDRDL, it is very important to set the number of atoms in D i , denoted by p i , i = 1, 2, ⋯, K. Usually, all p i are set equal. To evaluate the effect of p i on the performance of SDRDL, we conduct a face recognition experiment on the Extended Yale B database [13], which consists of 2414 frontal-face images from 38 individuals. For each individual, 20 images are randomly picked for training and the remaining images (about 44 per individual) are used for testing. In addition, SRC is used as the baseline method in this experiment. Since SRC uses the original training samples as its dictionary, we randomly pick p i training samples as the dictionary atoms, repeat the experiment ten times, and report the average classification accuracy. Figure 3 shows the classification accuracies of SDRDL and SRC vs. the number of dictionary atoms.

Fig. 3
figure 3

The Accuracy of SDRDL and SRC vs. the number of dictionary atoms

It can be seen that SDRDL achieves at least a 4.0 % improvement over SRC. Even with p i  = 8, SDRDL still obtains higher classification accuracy than SRC with p i  = 20. In addition, when p i decreases from 20 to 8, the classification accuracy of SDRDL drops by only 2.5 %, while that of SRC drops by 4.2 %. This demonstrates that SDRDL effectively learns a compact and representative dictionary, which reduces the computational complexity while improving the classification accuracy.

There are five parameters to tune in the proposed SDRDL model: three in the DL model (λ 1, λ 2 and γ) and two in the classification scheme (λ and w (or δ)). In all the experiments, unless otherwise specified, the tuning parameters of SDRDL are selected by fivefold cross-validation.

5.2 Face recognition

The proposed SDRDL method is applied to FR on the Extended Yale B [13] and AR [26] databases and compared with several recent DL-based FR methods, including discriminative KSVD (DKSVD) [52], label-consistent KSVD (LC-KSVD) [19], dictionary learning with structure incoherence (DLSI) [33] and Fisher discrimination dictionary learning (FDDL) [49, 50]. In addition, SDRDL is compared with SRC [42] and two general classifiers, the nearest neighbor classifier (NNC) and the linear support vector machine (SVM). Note that the original DLSI method represents the query sample class by class. For a fair comparison, we extend the original DLSI by representing the query sample over the whole dictionary and then using the representation residual for classification (denoted by DLSI*). The number of dictionary atoms in SDRDL is set to the number of training samples by default. Each face image is reduced to a dimension of 300 in all FR experiments. Since the number of training samples per class is equal in all the face databases, Eq. (20) is adopted for classification in all FR experiments. The parameters of SDRDL chosen by cross-validation are λ 1 = 0.005, λ 2 = 0.07, γ = 0.9, λ = 0.005 and δ = 0.8.

5.2.1 Extended Yale B database

The Extended Yale B database [13] consists of 2414 frontal face images from 38 individuals taken under varying illumination conditions (see Fig. 4 for example samples). Each individual has 64 images; we randomly selected 20 images for training and used the remaining images for testing. Each face image is normalized to 54 × 48 pixels.

Fig. 4 Some samples from the Extended Yale B database

The results of SDRDL, NNC, SRC, SVM, DKSVD, LC-KSVD, DLSI and FDDL are listed in Table 5. It can be seen that the proposed SDRDL achieves the highest classification accuracy among the competing methods. In particular, SDRDL achieves 4.8 % higher classification accuracy than FDDL, which achieves the second best performance in this experiment. The reason may be that FDDL performs dimensionality reduction separately from the discriminative dictionary learning process. SDRDL and FDDL outperform DKSVD and LC-KSVD, which only use the representation coefficients to perform classification. DLSI* has 4 % higher classification accuracy than DLSI. This can be explained by the fact that representing the query image over the whole dictionary is more reasonable for FR tasks.

5.2.2 AR database

The AR database [26] consists of over 4000 images of 126 individuals. For each individual, 26 pictures were collected in two separate sessions. Following [49, 50], we chose a subset consisting of 50 male and 50 female individuals for the standard evaluation procedure. For each individual, we selected 7 images with illumination and expression changes from Session 1 for training and 7 images under the same conditions from Session 2 for testing. The face images are of size 60 × 43. Figure 5 shows sample images from the AR database.

Fig. 5 Some samples from the AR database

The classification accuracies of SDRDL and the competing methods, including NNC, SRC, SVM, DKSVD, LC-KSVD, DLSI and FDDL, are shown in Table 6. Again, it can be seen that the proposed SDRDL outperforms FDDL and achieves the highest classification rate. This is because, by coupling the dimensionality reduction and dictionary learning processes, SDRDL can exploit the discriminative information in the training set more effectively for the FR task. Consistent with the results on the Extended Yale B database, FDDL and SDRDL achieve higher classification rates than DKSVD and LC-KSVD. Also, DLSI* yields a much better result than DLSI. This further demonstrates that representing the query image over the whole dictionary is more reasonable for FR tasks.

5.3 Action recognition

Finally, we evaluated SDRDL on the UCF sports action dataset [35] for the action classification experiment. The UCF sports action dataset [35] includes 140 videos collected from various broadcast sports channels (e.g., BBC and ESPN). These videos cover a wide range of scenarios and viewpoints, and span 10 sports action classes: diving, golfing, kicking, lifting, horse riding, running, skateboarding, swinging (pommel horse and floor), swinging (high bar) and walking. Some example images from this dataset are shown in Fig. 6. The action bank features of the 140 videos provided by [36] are adopted in the experiment.

Fig. 6 Some video frames from the UCF sports action dataset

Following the experimental settings in [32] and [19], we evaluated SDRDL via fivefold cross-validation. The number of atoms per sub-dictionary is set to the number of training samples, with λ 1 = 0.005, λ 2 = 0.04, γ = 0.9 and λ = 0.005. Because the number of training samples per class is not equal in this experiment, Eq. (19) is adopted to perform classification. Here w is set to 2.0 and the reduced dimension of the action bank feature [36] is set to 100. SDRDL is compared with SVM, SRC [42], KSVD [1], DKSVD [52], LC-KSVD [19], DLSI [33], FDDL [49, 50], dictionary learning with commonality and particularity (COPAR) [20] and joint dictionary learning (JDL) [54]. The performance of some methods designed specifically for action recognition, including Yao et al. [51] and Qiu et al. [32], is also listed. The classification accuracies are shown in Table 7. It can be observed that SDRDL obtains the best performance, followed by FDDL with the second best performance. With the action bank feature, all the DL based methods obtain classification accuracies over 90 % except KSVD and DKSVD.

6 Conclusion

In this paper, we proposed a simultaneous dimensionality reduction and dictionary learning (SDRDL) scheme for image classification. Unlike many DL methods that perform dimensionality reduction and dictionary learning independently, SDRDL formulates the DR and DL procedures in a unified framework to exploit the discriminative information in the training data more effectively. In classification, both the representation residual and the representation coefficients are considered. The experimental results on benchmark image databases demonstrate that the proposed SDRDL method surpasses many state-of-the-art methods.