1 Introduction

Dimensionality reduction is one of the simplest and most effective techniques in machine learning and classification [1, 2]. It was proposed to address the well-known “curse of dimensionality” [3], which means that models cannot perform well when processing high-dimensional data and may even suffer from singularity problems. Recently, many classification models based on image recognition have been proposed, such as multiple-instance learning (MIL) [4] and subspace learning. The most classical subspace learning methods are principal component analysis (PCA) [5,6,7] and linear discriminant analysis (LDA) [8,9,10]. PCA projects high-dimensional data onto a low-dimensional subspace along the directions of maximum variance, while LDA seeks an optimal discriminative subspace in which the ratio of between-class scatter to within-class scatter is maximized. However, neither PCA nor LDA exploits local structure information for the dimensionality reduction task. To address this problem, many manifold learning techniques based on locality information were proposed. Representative methods include locality preserving projections (LPP) [11, 12], locally linear embedding (LLE) [13, 14], neighborhood preserving projection (NPP) [15], neighborhood preserving embedding (NPE) [16] and a variant of NPE named discriminant sparse neighborhood preserving embedding (DSNPE) [17]. Recently, a structured optimal graph based sparse feature extraction (SOGSFE) [18] method was proposed to obtain the local manifold sparse structure and perform semi-supervised feature extraction simultaneously.

Previous studies show that the classical subspace learning methods can be characterized by a least square regression (LSR) formulation [19, 20]. LSR is usually mathematically tractable and has many variants, such as weighted LSR [21], partial LSR [22] and nonnegative least squares (NNLS) [23]. However, all of the above regression methods have two disadvantages. Firstly, most LSR-based methods are devoted only to preserving local geometric information, which is sensitive to illumination, corruptions and outliers; thus, these methods may fail to achieve good results in recognition tasks. Secondly, LSR-based techniques suffer from the small-class problem, which means the number of classes limits the number of projection vectors that can be learned for a low-dimensional subspace.

To address the first problem mentioned above, researchers have studied \( l_{2,1} \)-norm-based least square regression methods and low-rank learning methods. Many works [24,25,26,27] have shown that the robustness of least square regression can be improved by adopting the \( l_{2,1} \)-norm as the basic metric for pattern recognition. Therefore, more and more \( l_{2,1} \)-norm regression models have been used in subspace learning [28,29,30,31,32]. For example, \( l_{2,1} \)-norm regularized discriminative feature selection (UDFS) [33] was designed for robust unsupervised subspace learning. However, UDFS is incapable of modeling the global structure of image points. Fortunately, the low-rank learning technique can also be used to improve robustness to noisy/corrupted data and has recently attracted great attention in data mining, machine learning and computer vision. A large number of low-rank representation (LRR) methods have emerged in the past decade [34,35,36,37,38,39], which aim at corrupted data recovery and noise removal. LRR and its extensions can discover potential relations among the data vectors by imposing a low-rank constraint. As a result, LRR-based methods can capture the global information of data and weaken the disturbance of noise. For example, robust PCA (RPCA) [40] learns a clean matrix and a sparse error matrix from the input matrix to obtain robust subspace segmentation. However, RPCA presumes that all data points are distributed in a single subspace, which is inconsistent with the fact that data vectors usually lie in multiple subspaces. To solve this problem, Liu et al. [41, 42] proposed LRR to better acquire the underlying data structure with a low-rank matrix. Moreover, by taking full advantage of a graph regularizer, Yin et al. [39] developed a Laplacian regularized LRR-based method that exploits both the global and local geometric information in data for graph learning and subspace projection. Based on [39], Li et al. [43] imposed a rank constraint on their model to further learn an optimal graph with a clear clustering structure and obtained encouraging performance. Moreover, Zhuang et al. [44] introduced a non-negative low-rank and sparse graph (NNLRS) for classification and discriminative analysis. The NNLRS-graph not only captures the local linear relationships and global structures among image points but also preserves sparsity to ensure the robustness of the algorithm. To integrate dimensionality reduction with LRR, Liu and Yan [45] put forward latent LRR (LatLRR), which recovers the low-rank data representation and projects the clean data points into a low-dimensional subspace simultaneously. Since the two low-rank matrices must be learned separately, the effectiveness of LatLRR cannot be guaranteed. Thus, approximate low-rank projection (ALRP) [46] treats the two matrices in LatLRR as a whole so that they can be boosted mutually in the learning process. By combining with least square regression, ALRP was extended to a supervised version called SALRP, which can perform feature extraction and classification at the same time. Nevertheless, SALRP is limited by the size of the class label matrix to at most \( c \) projections, where \( c \) is the number of classes. In particular, when \( c \) is small, the dimension of the learned features is not sufficient for achieving good performance, which is undesirable in many applications.

To solve the problem mentioned above, we develop a novel model called discriminative low-rank projection (DLRP), which combines the properties of LRR and LSR to overcome the underlying problems of LRR and its variants. Our purpose is to obtain enough projections in a supervised manner and to learn a robust discriminative subspace with a low-rank representation. To ensure effectiveness, we employ some adaptive constraints and use the \( l_{2,1} \)-norm as the basic measurement. The major contributions of this paper are summarized as follows:

  1. Different from previous LRR-based techniques, our method alleviates the small-class problem, in which the number of features is bounded by the number of classes. This problem may degrade the recognition rate in image feature extraction and classification tasks. DLRP can not only learn projection vectors/matrices of arbitrary dimensions during dimensionality reduction but also provide a robust low-rank subspace to obtain better results.

  2. DLRP shares the advantages of both LSR and LRR. Therefore, it can not only utilize the clean data recovered by LRR to achieve more promising image classification performance but also employ the learned projection subspace to obtain better regression results for feature extraction.

  3. An efficient algorithm, which integrates the linearized alternating direction method with adaptive penalty (LADMAP) and singular value decomposition (SVD), is designed to solve the proposed model. Theoretical analyses of the convergence and computational complexity of the algorithm are also provided. The algorithm demonstrates its superiority on both corrupted datasets and a deep-learning feature dataset.

The remainder of this paper is organized as follows. In Sect. 2, some related methods are reviewed for ease of understanding. The proposed DLRP and the optimization algorithm that solves it iteratively are introduced in Sect. 3. Theoretical analyses of our model are given in Sect. 4. Experimental results and analysis are presented in Sect. 5, and conclusions are drawn in Sect. 6.

2 Related works

In this section, we first present the notations and definitions related to our model, and then review some related LRR-based models: the subspace learning model RPCA, the low-rank representation method LRR, and its extension latent LRR (LatLRR).

2.1 Notations and definitions

In this paper, \( X = [x_{1} ,x_{2} , \ldots ,x_{n} ] \) denotes a collection of data points, where \( n \) is the number of training samples, and \( Y \) is the label matrix covering \( c \) classes. \( Y \) is a binary label matrix: if \( x_{i} \) belongs to the \( j \)th class, then \( Y_{ij} = 1 \); otherwise, \( Y_{ij} = 0 \). \( \left\| X \right\|_{F} = \sqrt {\sum\nolimits_{i = 1}^{m} {\sum\nolimits_{j = 1}^{n} {x_{ij}^{2} } } } \) is the Frobenius norm (\( F \)-norm) of \( X \). \( \left\| X \right\|_{*} \) denotes the nuclear norm, i.e., the sum of all singular values of \( X \). \( diag\left( X \right) \) denotes the diagonal matrix formed from \( X \). Table 1 lists the detailed notations used in this paper.
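For concreteness, the short numpy sketch below evaluates these norms on a random matrix. The \( l_{2,1} \)-norm convention (summing the \( l_{2} \) norms of the columns, as in the LRR error term) is an assumption here, since row-wise conventions also appear in the literature.

```python
import numpy as np

# Illustrative only: the matrix norms used in this paper on a random 5 x 4 matrix.
X = np.random.default_rng(0).standard_normal((5, 4))

fro = np.linalg.norm(X, 'fro')           # ||X||_F: square root of the sum of squared entries
nuc = np.linalg.norm(X, 'nuc')           # ||X||_*: sum of the singular values of X
l21 = np.linalg.norm(X, axis=0).sum()    # ||X||_{2,1}: sum of column-wise l2 norms (one common convention)
print(fro, nuc, l21)
```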

Table 1 The notations used in this paper

2.2 Robust principal component analysis

As a dimensionality reduction technique, RPCA [40] decomposes contaminated image data into a clean low-rank term and a sparse error term simultaneously. Because PCA assumes that the noise in the data is Gaussian, its effectiveness is degraded by heavy noise or unexpected outliers. RPCA does not make this assumption (it instead assumes that the noise is sparse), and is thus a robust extension of PCA. This advantage of RPCA has been exploited in a variety of applications, such as video surveillance [47], face recognition [48] and singing-voice separation from monaural recordings [49].

RPCA is a low-rank matrix recovery technique whose goal is to decompose the data \( X \) into \( A + E \), where \( A \) is the clean low-rank component lying on a low-dimensional subspace and \( E \) records the sparse errors. Such a decomposition can be obtained by solving the following principal component pursuit problem [40]:

$$ \mathop {\hbox{min} }\limits_{A,E} rank(A) + \eta \left\| E \right\|_{0} , \, s.t. \, X = A + E $$
(1)

where \( rank\left( A \right) \) denotes the rank of \( A \) and \( \eta \) is a positive parameter that trades off low-rankness against sparsity. In practice, the rank and \( l_{0} \)-norm are replaced by the nuclear norm and \( l_{1} \)-norm, respectively, to obtain a tractable convex problem.
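As a reference point, a minimal sketch of this convex relaxation solved by an inexact augmented Lagrangian scheme is given below. The function name, default parameter choices and stopping rule are illustrative assumptions, not the exact solver of [40].

```python
import numpy as np

def rpca_pcp(X, eta=None, mu=None, rho=1.5, mu_max=1e7, n_iter=200, tol=1e-7):
    """Minimal inexact-ALM sketch of the convex relaxation of (1):
    min ||A||_* + eta*||E||_1  s.t.  X = A + E.
    Default eta = 1/sqrt(max(m, n)) follows a common heuristic."""
    m, n = X.shape
    if eta is None:
        eta = 1.0 / np.sqrt(max(m, n))
    if mu is None:
        mu = 1.25 / np.linalg.norm(X, 2)        # spectral norm of X
    A = np.zeros_like(X); E = np.zeros_like(X); C = np.zeros_like(X)
    for _ in range(n_iter):
        # A-step: singular value thresholding of X - E + C/mu
        U, s, Vt = np.linalg.svd(X - E + C / mu, full_matrices=False)
        A = U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # E-step: elementwise soft thresholding (l1 shrinkage)
        G = X - A + C / mu
        E = np.sign(G) * np.maximum(np.abs(G) - eta / mu, 0.0)
        # multiplier and penalty updates
        R = X - A - E
        C = C + mu * R
        mu = min(rho * mu, mu_max)
        if np.linalg.norm(R, 'fro') <= tol * np.linalg.norm(X, 'fro'):
            break
    return A, E
```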

2.3 Low-rank representation-based clustering methods

We first introduce the LRR method [41, 42] and then review LatLRR [45].

As a subspace learning method, LRR aims to find the lowest-rank representation for data clustering. The general formulation of LRR can be written as:

$$ \mathop {\hbox{min} }\limits_{Z} rank\left( Z \right), \, s.t. \, X = DZ $$
(2)

where \( D \) is the dictionary matrix and \( Z \) reveals the underlying subspace structure. In real applications, \( X \) itself is often chosen as the dictionary, so (2) can be reformulated as:

$$ \mathop {\hbox{min} }\limits_{Z} rank\left( Z \right), \, s.t. \, X = XZ $$
(3)

Since problem (3) is NP-hard, it is relaxed to the following nuclear-norm minimization problem:

$$ \mathop {\hbox{min} }\limits_{Z} \left\| Z \right\|_{*} , \, s.t. \, X = XZ $$
(4)

Similar to RPCA, LRR can also be written in the following form with an explicit error term:

$$ \mathop {\hbox{min} }\limits_{Z,E} \left\| Z \right\|_{*} + \lambda \left\| E \right\|_{p} , \, s.t. \, X = XZ + E $$
(5)

where \( \lambda \) is a positive penalty parameter that balances the low-rank part and the reconstruction error, playing the same role as \( \eta \) in RPCA, and \( \left\| \cdot \right\|_{p} \) denotes a certain norm, such as the \( F \)-norm, \( l_{1} \)-norm or \( l_{2,1} \)-norm. However, when the training samples are high-dimensional and insufficient, LRR fails to obtain good performance. LatLRR was proposed to exploit unobserved hidden samples. The objective of LatLRR is formulated as:

$$ \mathop {\hbox{min} }\limits_{Z,E} \left\| Z \right\|_{*} + \lambda \left\| E \right\|_{p} , \, s.t. \, X = \left[ {X,X_{H} } \right]Z + E $$
(6)

where \( X \) and \( X_{H} \) represent the observed data and the unobserved (hidden) data, respectively. Using Bayesian inference [50], \( X \) can be expressed as \( XZ + LX \), so that LatLRR can use \( L \) to reduce the dimension of the original data. Then (6) is rewritten as:

$$ \mathop {\hbox{min} }\limits_{Z,E} \left\| Z \right\|_{*} + \left\| L \right\|_{*} + \lambda \left\| E \right\|_{p} , \, s.t. \, X = XZ + LX + E $$
(7)

Although LatLRR addresses the issue that LRR-based methods cannot directly learn a low-dimensional subspace, it does not use any discriminant information. Therefore, LatLRR is not well suited to feature extraction and classification problems in this case.

In addition, it is desirable that the label information of the data be well integrated with LRR-based methods, so as to derive clean data with discriminating power. Moreover, if the terms \( \left\| {Y - XZ} \right\|_{F}^{2} \) or \( \left\| {Y - LX} \right\|_{F}^{2} \) were used directly in the above low-rank learning models, we would encounter the small-class problem, i.e., the dimension of the subspace learned through \( L \) or \( Z \) would be at most \( c \). Hence, how to recover a discriminative representation matrix \( Z \) for the robust regression task is a critical problem.
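To make the small-class problem concrete, the toy sketch below (all sizes hypothetical) regresses a \( c \times n \) label matrix directly onto the data: the resulting projector has only \( c \) rows, so the extracted features can never exceed \( c \) dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, c = 100, 60, 3                        # hypothetical sizes: 100-dim samples, 60 samples, 3 classes
X = rng.standard_normal((m, n))
Y = np.eye(c)[:, rng.integers(0, c, n)]     # c x n binary label matrix

# Ridge-regularized least squares projector L fitting Y ~= L X
L = Y @ X.T @ np.linalg.inv(X @ X.T + 1e-3 * np.eye(m))
print(L.shape)                              # (3, 100): at most c = 3 projection directions
print(np.linalg.matrix_rank(L @ X))         # <= 3: projected features span a c-dimensional space
```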

3 Objective function and optimization solution

In this section, we first introduce the motivations of our model. The objective function and update rules are then presented. Finally, an efficient algorithm is designed to solve the optimization problem iteratively.

3.1 Motivations of proposed method

Classical unsupervised dimensionality reduction techniques use the Euclidean distance as the metric to measure the local geometric structure and construct graphs from local patches. In real applications, however, the data are often noisy and even grossly corrupted. The local structural information of the data is easily damaged in such contaminated circumstances, which degrades the recognition results of previous methods. As a result, it is desirable to develop a technique that focuses on global information and remains robust to unexpected situations such as disguise, occlusion, specular reflections or pixel corruptions.

To this end, LRR is attractive because of its ability to capture the global structure of the data. It is also highly robust for subspace learning and can even recover the true data [42]. Therefore, it is natural to integrate the low-rank property of LRR into the dimensionality reduction process. Furthermore, to allow the algorithm to obtain more projections and inherit the discriminative ability of LSR, we propose a novel supervised subspace learning method called DLRP. The key idea is to use the clean recovered data to fit the labels so that samples with the same label stay close to each other in the low-dimensional subspace. In this way, our method can not only use the discriminant information to obtain higher recognition rates but also guarantee robustness in the low-dimensional subspace. Different from previous LRR-based methods, we impose some adaptive constraints on the representation matrix \( Z \) for better performance. In summary, our method jointly learns a robust low-dimensional subspace, in which the feature dimension can exceed the number of classes, and obtains a discriminative low-rank representation.

3.2 Objective function of DLRP

Unlike previous LRR-based methods, which cannot directly introduce the label matrix for the regression task, we aim to embed the label information into LRR and use the recovered data for supervised subspace learning. To make our model more robust, we adopt the \( l_{2,1} \)-norm as the basic metric for the reconstruction errors. Then we have

$$ \mathop {\hbox{min} }\limits_{Z,P,Q} \left\| Z \right\|_{*} + \lambda \left\| {X - XZ} \right\|_{2,1} + \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} $$
(8)

To avoid the trivial solution of problem (8) and to measure the reconstruction error conveniently, we introduce an orthogonal constraint on the matrix \( P \) and an explicit error matrix \( E \), so that (8) can be reformulated as:

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{Z,E,P,Q} \left\| Z \right\|_{*} + \lambda \left\| E \right\|_{2,1} + \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \\ & s.t.\;X = XZ + E,P^{T} P = I \\ \end{aligned} $$
(9)

where \( P \) is an orthogonal matrix, \( Q \) is the projection matrix that maps the recovered data representation \( XZ \) onto the low-dimensional space, and \( \alpha \) is a positive weighting parameter that controls the fit of the least squares term. To address the small-class problem, which may degrade performance in real applications, we use the two matrices \( P \in R^{c \times d} \) and \( Q \in R^{m \times d} \) in place of the single projector \( L \in R^{c \times m} \) in LatLRR (\( PQ^{T} \) obviously has the same size as \( L \)).

In conventional LSR, the single projection matrix can learn at most \( c \) projections, whereas the proposed method can obtain \( d \) (\( d \ge c \)) projections for extracting salient features. In this way, the small-class problem in LSR and LatLRR is alleviated. In summary, in the term \( \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \), the matrix \( Q^{T} \) first projects the low-rank representation \( XZ \) onto a robust low-dimensional subspace, and then the orthogonal matrix \( P \) regresses the projected features to the label matrix \( Y \) through an orthogonal rotation. This regularization term makes the low-dimensional representation \( PQ^{T} XZ \) share a structure similar to that of \( Y \).
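A toy shape check of this factorization is sketched below (all sizes hypothetical, with \( Y \) taken as a \( c \times n \) label matrix to match the regression term): \( Q^{T} XZ \) yields \( d \ge c \) features per sample, while \( PQ^{T} \) still matches the size of the single projector \( L \).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, c, d = 100, 60, 3, 20                 # hypothetical sizes with d >= c
X = rng.standard_normal((m, n))
Z = np.eye(n)                               # stand-in for the learned low-rank representation
Q = rng.standard_normal((m, d))             # maps data onto a d-dimensional subspace
P = np.eye(c, d)                            # simple c x d rotation back to the label space

features = Q.T @ X @ Z                      # d x n: d (>= c) features per sample
fitted = P @ features                       # c x n: compared against Y in the regression term
print(features.shape, fitted.shape, (P @ Q.T).shape)   # (20, 60) (3, 60) (3, 100)
```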

Algorithm 1 The iterative optimization procedure of DLRP (pseudocode figure)

In (9), the first term minimizes the rank of \( Z \), which preserves the global structure of the original data, while the second term records the errors caused by noise and corruption. To balance the representation coefficients, we further introduce a new constraint on the low-rank representation \( Z \). Thus, the final objective function of DLRP is defined as follows:

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{Z,E,P,Q} \left\| Z \right\|_{*} + \lambda \left\| E \right\|_{2,1} + \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \\ & s.t.\; \, X = XZ + E,P^{T} P = I,1_{n}^{T} Z = 1_{n}^{T} \\ \end{aligned} $$
(10)

where \( 1_{n} \in R^{n \times 1} \) is the vector with all elements equal to one. To prevent trivial all-zero solutions in \( Z \), we enforce the constraint \( 1_{n}^{T} Z = 1_{n}^{T} \) on the low-rank representation, which requires each column of \( Z \) to sum to one.

Problem (10) is our final objective function, which can be optimized iteratively. In the next subsection, we design an iterative algorithm to compute the optimal solution of DLRP.

3.3 Iterative algorithms

In this part, we describe the solution of the optimization problem. We first use LADMAP [51] to optimize \( Z \) and \( E \), and then optimize \( P \) and \( Q \) in sequence. To make the objective function separable and easier to solve, we introduce an auxiliary matrix \( J \) to replace \( Z \) in the nuclear-norm term. Thus, we have

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{Z,E,P,Q,J} \left\| J \right\|_{*} + \lambda \left\| E \right\|_{2,1} + \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \\ & s.t.\; \, X = XZ + E,P^{T} P = I,1_{n}^{T} Z = 1_{n}^{T} ,Z = J \\ \end{aligned} $$
(11)

The augmented Lagrangian function of problem (11) is:

$$ \begin{aligned} & L\left( {Z,E,J,C_{1} ,C_{2} ,C_{3} ,\mu } \right) \\ & \quad = \left\| J \right\|_{*} + \lambda \left\| E \right\|_{2,1} + \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \\ & \quad \quad + \,\left\langle {C_{1} ,X - XZ - E} \right\rangle + \left\langle {C_{2} ,Z - J} \right\rangle + \left\langle {C_{3} ,1_{n}^{T} Z - 1_{n}^{T} } \right\rangle \\ & \quad \quad + \,\frac{\mu }{2}\left( {\left\| {X - XZ - E} \right\|_{F}^{2} + \left\| {Z - J} \right\|_{F}^{2} + \left\| {1_{n}^{T} Z - 1_{n}^{T} } \right\|_{F}^{2} } \right) \\ & \quad = \left\| J \right\|_{*} + \lambda \left\| E \right\|_{2,1} + \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \\ & \quad \quad + \,\frac{\mu }{2}\left( {\left\| {X - XZ - E + \frac{{C_{1} }}{\mu }} \right\|_{F}^{2} + \left\| {Z - J + \frac{{C_{2} }}{\mu }} \right\|_{F}^{2} + \left\| {1_{n}^{T} Z - 1_{n}^{T} + \frac{{C_{3} }}{\mu }} \right\|_{F}^{2} } \right) \\ & \quad \quad - \,\frac{1}{2\mu }\left( {\left\| {C_{1} } \right\|_{F}^{2} + \left\| {C_{2} } \right\|_{F}^{2} + \left\| {C_{3} } \right\|_{F}^{2} } \right) \\ \end{aligned} $$
(12)

where \( C_{1} \), \( C_{2} \), \( C_{3} \) are Lagrange multipliers and \( \mu > 0 \) is a penalty parameter. With the other variables fixed, we update \( Z \), \( E \) and \( J \) by minimizing the augmented Lagrangian, update the multipliers \( C_{1} ,C_{2} ,C_{3} \) and the penalty \( \mu \), and then solve \( P \) and \( Q \) in turn.

\( Z \) -step

Given the other variables, the terms in (12) that depend on \( Z \) are

$$ \begin{aligned} & \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \\ & \quad + \,\frac{\mu }{2}\left( {\left\| {X - XZ - E + \frac{{C_{1} }}{\mu }} \right\|_{F}^{2} + \left\| {Z - J + \frac{{C_{2} }}{\mu }} \right\|_{F}^{2} + \left\| {1_{n}^{T} Z - 1_{n}^{T} + \frac{{C_{3} }}{\mu }} \right\|_{F}^{2} } \right) \\ \end{aligned} $$
(13)

Then, (13) can be rewritten as the following minimization problem:

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{Z} \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \\ & \quad + \,\frac{\mu }{2}\left( {\left\| {M - XZ} \right\|_{F}^{2} + \left\| {Z - N} \right\|_{F}^{2} + \left\| {1_{n}^{T} Z - L} \right\|_{F}^{2} } \right) \\ \end{aligned} $$
(14)

where \( M = X - E + \frac{{C_{1} }}{\mu } \), \( N = J - \frac{{C_{2} }}{\mu } \) and \( L = 1_{n}^{T} - \frac{{C_{3} }}{\mu } \). Setting the derivative of (14) with respect to \( Z \) to zero gives

$$ W_{a} Z = W_{b} + W_{c} $$
(15)

where

$$ W_{a} = 2\alpha X^{T} QQ^{T} X + \mu \left( {X^{T} X + I_{n} + 1_{n}^{{}} 1_{n}^{T} } \right) $$
(16)
$$ W_{b} = 2\alpha X^{T} QP^{T} Y $$
(17)
$$ W_{c} = \mu \left( {X^{T} M + N + 1_{n}^{{}} L} \right) $$
(18)

Finally, we obtain the solution as follows:

$$ Z = W_{a}^{ - 1} \left( {W_{b} + W_{c} } \right) $$
(19)
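A direct numpy transcription of this closed-form update is sketched below; it simply assembles \( W_{a} \), \( W_{b} \), \( W_{c} \) from the definitions above and solves the linear system. Variable names follow the text, and \( Y \) is taken as a \( c \times n \) label matrix.

```python
import numpy as np

def update_Z(X, Y, P, Q, E, J, C1, C2, C3, mu, alpha):
    """Closed-form Z update of Eqs. (15)-(19): Z = Wa^{-1}(Wb + Wc)."""
    n = X.shape[1]
    ones = np.ones((n, 1))
    M = X - E + C1 / mu                     # M = X - E + C1/mu
    N = J - C2 / mu                         # N = J - C2/mu
    L = ones.T - C3 / mu                    # L = 1_n^T - C3/mu
    Wa = 2 * alpha * X.T @ Q @ Q.T @ X + mu * (X.T @ X + np.eye(n) + ones @ ones.T)
    Wb = 2 * alpha * X.T @ Q @ P.T @ Y
    Wc = mu * (X.T @ M + N + ones @ L)
    return np.linalg.solve(Wa, Wb + Wc)     # equivalent to Wa^{-1}(Wb + Wc)
```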

\( E \) -step

\( E \) is obtained by solving the following problem:

$$ E = \arg \mathop {\hbox{min} }\limits_{E} \lambda \left\| E \right\|_{2,1} + \frac{\mu }{2}\left\| {X - XZ - E + \frac{{C_{1} }}{\mu }} \right\|_{F}^{2} $$
(20)

According to [52], \( E \) has the following closed-form solution:

$$ E = \varOmega_{{{\raise0.7ex\hbox{$\lambda $} \!\mathord{\left/ {\vphantom {\lambda \mu }}\right.\kern-0pt} \!\lower0.7ex\hbox{$\mu $}}}} \left( {X - XZ + \frac{{C_{1} }}{\mu }} \right) $$
(21)

where \( \varOmega \) is the \( l_{2,1} \)-norm shrinkage operator.
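A minimal sketch of this shrinkage operator is given below, assuming \( \left\| E \right\|_{2,1} \) sums the \( l_{2} \) norms of the columns of \( E \), as in LRR [41, 42]; if the row-wise convention is intended instead, the same operation is applied to rows.

```python
import numpy as np

def l21_shrink(G, tau):
    """Solve argmin_E tau*||E||_{2,1} + 0.5*||E - G||_F^2, where ||E||_{2,1}
    is taken as the sum of the l2 norms of the columns of E."""
    E = np.zeros_like(G)
    norms = np.linalg.norm(G, axis=0)       # l2 norm of each column of G
    keep = norms > tau                      # columns with norm <= tau shrink to zero
    E[:, keep] = G[:, keep] * (1.0 - tau / norms[keep])
    return E
```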

\( J \) -step

To solve \( J \), (12) reduces to the following minimization problem:

$$ J = \arg \mathop {\hbox{min} }\limits_{J} \left\| J \right\|_{*} + \frac{\mu }{2}\left\| {Z - J + \frac{{C_{2} }}{\mu }} \right\|_{F}^{2} $$
(22)

Then \( J \) can be obtained as follows by using the singular value thresholding (SVT) operator [53]:

$$ J = \varTheta_{{{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 \mu }}\right.\kern-0pt} \!\lower0.7ex\hbox{$\mu $}}}} \left( {Z + \frac{{C_{2} }}{\mu }} \right) $$
(23)

where \( \varTheta \) is the SVT operator.
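The SVT operator admits the standard closed form sketched below: shrink the singular values by the threshold and reassemble the matrix.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: argmin_J tau*||J||_* + 0.5*||J - M||_F^2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```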

\( C_{1} ,C_{2} ,C_{3} ,\mu \) -step

The Lagrange multipliers \( C_{1} ,C_{2} ,C_{3} \) and the penalty parameter \( \mu \) are updated as follows:

$$ C_{1} \leftarrow C_{1} + \mu \left( {X - XZ - E} \right) $$
(24)
$$ C_{2} \leftarrow C_{2} + \mu \left( {Z - J} \right) $$
(25)
$$ C_{3} \leftarrow C_{3} + \mu \left( {1_{n}^{T} Z - 1_{n}^{T} } \right) $$
(26)
$$ \mu \leftarrow \hbox{min} \left( {\rho \mu ,\mu_{\hbox{max} } } \right) $$
(27)

where \( \rho > 1 \) and \( \mu_{\hbox{max} } \) are constants.

\( P \) -step

Given the other variables, \( P \) can be solved by minimizing the following problem:

$$ P = \arg \mathop {\hbox{min} }\limits_{P} \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} s.t.P^{T} P = I $$
(28)

(28) can be expanded to:

$$ \begin{aligned} P & = \arg \mathop {\hbox{min} }\limits_{P} \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \\ & = \arg \mathop {\hbox{min} }\limits_{P} \alpha tr\left[ {\left( {Y - PQ^{T} XZ} \right)^{T} \left( {Y - PQ^{T} XZ} \right)} \right] \\ & = \arg \mathop {\hbox{min} }\limits_{P} \alpha tr\left( {Y^{T} Y - 2Y^{T} PQ^{T} XZ + Z^{T} X^{T} QQ^{T} XZ} \right) \\ & \quad s.t.\;P^{T} P = I \\ \end{aligned} $$
(29)

With the other variables fixed, (29) is equivalent to the following problem:

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{P} tr\left( { - 2Q^{T} XZY^{T} P} \right)s.t.P^{T} P = I \\ & \Leftrightarrow \mathop {\hbox{max} }\limits_{P} tr\left( {Q^{T} XZY^{T} P} \right)s.t.P^{T} P = I \\ \end{aligned} $$
(30)

Then the optimal \( P \) can be obtained by computing the singular value decomposition (SVD) [40] of the matrix \( Q^{T} XZY^{T} \):

$$ Q^{T} XZY^{T} = \tilde{U}\tilde{D}\tilde{V}^{T} $$
(31)

Finally, the optimal \( P \) is given by

$$ P = \tilde{V}\tilde{U}^{T} $$
(32)
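This is the classical orthogonal Procrustes solution; a short sketch is given below (shapes follow \( P \in R^{c \times d} \) and \( Q \in R^{m \times d} \)).

```python
import numpy as np

def update_P(Q, X, Z, Y):
    """Orthogonal Procrustes step of (30)-(32): maximize tr(Q^T X Z Y^T P)
    by taking the SVD of A = Q^T X Z Y^T and returning P = V U^T."""
    A = Q.T @ X @ Z @ Y.T                   # d x c matrix
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ U.T                       # c x d rotation
```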

\( Q \) -step

In (29), with \( P \) and \( Z \) given, we take the partial derivative of (29) with respect to \( Q \), set it to zero, and obtain

$$ Q = \left( {XZZ^{T} X^{T} + I} \right)^{ - 1} XZY^{T} P $$
(33)

By iterating the updates above, we obtain the optimal solution of DLRP. Algorithm 1 summarizes how the updates are applied iteratively. Theoretical analyses of our model, including convergence and computational complexity, are given in the next section.
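For reference, a compact end-to-end sketch of the iteration is given below. It reuses the update_Z, l21_shrink, svt and update_P helpers sketched above; the initializations, default parameter values and the way features are finally formed are illustrative assumptions rather than the exact recipe of Algorithm 1.

```python
import numpy as np

def dlrp(X, Y, d, lam=1.0, alpha=1.0, rho=1.1, mu=1e-2, mu_max=1e6, n_iter=30):
    """Sketch of the DLRP iteration (cf. Algorithm 1). X: m x n data matrix,
    Y: c x n binary label matrix, d: subspace dimension (d >= c).
    Reuses the update_Z, l21_shrink, svt and update_P helpers sketched earlier."""
    m, n = X.shape
    c = Y.shape[0]
    rng = np.random.default_rng(0)
    Z = np.eye(n); J = Z.copy(); E = np.zeros((m, n))
    Q = 0.01 * rng.standard_normal((m, d))
    P = np.eye(c, d)
    C1 = np.zeros((m, n)); C2 = np.zeros((n, n)); C3 = np.zeros((1, n))
    ones = np.ones((n, 1))
    for _ in range(n_iter):
        Z = update_Z(X, Y, P, Q, E, J, C1, C2, C3, mu, alpha)       # (19)
        E = l21_shrink(X - X @ Z + C1 / mu, lam / mu)               # (21)
        J = svt(Z + C2 / mu, 1.0 / mu)                              # (23)
        C1 = C1 + mu * (X - X @ Z - E)                              # (24)
        C2 = C2 + mu * (Z - J)                                      # (25)
        C3 = C3 + mu * (ones.T @ Z - ones.T)                        # (26)
        mu = min(rho * mu, mu_max)                                  # (27)
        P = update_P(Q, X, Z, Y)                                    # (31)-(32)
        XZ = X @ Z
        Q = np.linalg.solve(XZ @ XZ.T + np.eye(m), XZ @ Y.T @ P)    # (33)
    return Z, E, P, Q

# Training features can then be taken as Q.T @ X @ Z; one plausible mapping
# for an unseen test sample x is Q.T @ x (an assumption for illustration).
```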

4 Algorithm analysis and comparisons

In this section, we first discuss the convergence of our algorithm, then analyze its computational complexity, and finally detail the comparisons between our algorithm and the most closely related algorithms.

4.1 Convergence discussion

As shown in the previous section, the proposed model is optimized with an iterative algorithm, so it is necessary to prove the convergence of the scheme.

Let

$$ U\left( {Z,E,J,P,Q} \right) = \left\| J \right\|_{*} + \lambda \left\| E \right\|_{2,1} + \alpha \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} $$
(34)

We can obtain the following theorem.

Theorem 1

The iterative scheme in Algorithm 1 monotonically decreases the objective function value of (10) in each iteration.

Proof

Suppose the objective function value at the \( t \)th iteration is

$$ U\left( {Z_{t} ,E_{t} ,J_{t} ,P_{t} ,Q_{t} } \right) = \left\| {J_{t} } \right\|_{*} + \lambda \left\| {E_{t} } \right\|_{2,1} + \alpha \left\| {Y - P_{t} Q_{t}^{T} XZ_{t} } \right\|_{F}^{2} $$
(35)

where \( Z_{t} ,E_{t} ,J_{t} ,P_{t} \) and \( Q_{t} \) denote the matrices \( Z,E,J,P \) and \( Q \) at the \( t \)th iteration, respectively. With \( E_{t} ,J_{t} ,P_{t} \) and \( Q_{t} \) fixed, the closed-form update (19) for \( Z \) decreases the objective function, so we have

$$ U\left( {Z_{t + 1} ,E_{t} ,J_{t} ,P_{t} ,Q_{t} } \right) \le U\left( {Z_{t} ,E_{t} ,J_{t} ,P_{t} ,Q_{t} } \right) $$
(36)

With \( Z_{t + 1} ,J_{t} ,P_{t} \) and \( Q_{t} \) fixed, the \( E \)-update (21) further decreases the objective, as proven in [52]; thus we have

$$ U\left( {Z_{t + 1} ,E_{t + 1} ,J_{t} ,P_{t} ,Q_{t} } \right) \le U\left( {Z_{t} ,E_{t} ,J_{t} ,P_{t} ,Q_{t} } \right) $$
(37)

With \( Z_{t + 1} ,E_{t + 1} ,P_{t} \) and \( Q_{t} \) fixed, \( J_{t + 1} \) is obtained by the SVT operator, which further decreases the objective function value, as proven in [53]. Then, we have

$$ U\left( {Z_{t + 1} ,E_{t + 1} ,J_{t + 1} ,P_{t} ,Q_{t} } \right) \le U\left( {Z_{t} ,E_{t} ,J_{t} ,P_{t} ,Q_{t} } \right) $$
(38)

With \( Z_{t + 1} ,E_{t + 1} ,J_{t + 1} \) and \( Q_{t} \) fixed, the optimal \( P_{t + 1} \) is derived from the SVD of \( Q_{t}^{T} XZ_{t + 1} Y^{T} \), which further reduces the objective function value. Therefore, we have

$$ U\left( {Z_{t + 1} ,E_{t + 1} ,J_{t + 1} ,P_{t + 1} ,Q_{t} } \right) \le U\left( {Z_{t} ,E_{t} ,J_{t} ,P_{t} ,Q_{t} } \right) $$
(39)

With \( Z_{t + 1} ,E_{t + 1} ,J_{t + 1} \) and \( P_{t + 1} \) fixed, the closed-form solution obtained from the partial derivative provides the optimal \( Q_{t + 1} \). Finally, we have

$$ U\left( {Z_{t + 1} ,E_{t + 1} ,J_{t + 1} ,P_{t + 1} ,Q_{t + 1} } \right) \le U\left( {Z_{t} ,E_{t} ,J_{t} ,P_{t} ,Q_{t} } \right) $$
(40)

which completes the proof.□

Experimental results illustrating the convergence of DLRP are presented in the following sections. In practice, the proposed algorithm converges quickly, often within 5 iterations.

4.2 Computation complexity

The major time-consuming computations are

  1. the matrix inverse operations used to solve \( Z \) and \( Q \) in steps 1 and 6;

  2. the SVT operation used to solve \( J \) in step 3;

  3. the SVD used to solve \( P \) in step 5.

Thus, we only analyze the complexity of these steps in the DLRP algorithm. Firstly, computing the variable \( Q \) costs about \( {\rm O}\left( {m^{3} } \right) \), while computing \( Z \) needs \( {\rm O}\left( {n^{2} m + nmd} \right) \). Secondly, computing the variable \( J \) costs \( {\rm O}\left( {n^{3} } \right) \). Finally, optimizing the matrix \( P \) costs at most \( {\rm O}\left( {d^{3} } \right) \). In summary, the overall cost of DLRP is up to \( {\rm O}\left[ {T\left( {m^{3} + n^{2} m + nmd + n^{3} + d^{3} } \right)} \right] \), where \( T \) is the number of iterations. Therefore, PCA should be used to reduce the data dimension as a preprocessing step, which greatly decreases the computational cost.

4.3 Difference between our method and some similar methods

In this section, we further explain the essence of DLRP by comparing the proposed model with the most closely related methods in terms of feature extraction and classification. The first model is (41), proposed in [38], and the second is (42), proposed in [46].

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{Z,Q} \left\| Z \right\|_{*} + \lambda \left\| {Q^{T} X - Q^{T} XZ} \right\|_{2,1} \\ & s.t.\;Q^{T} Q = I \\ \end{aligned} $$
(41)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{P,Q,Z,E} \frac{1}{2}\left\| {Y - Q^{T} X} \right\|_{F}^{2} + \frac{1}{2}\lambda_{1} \left\| Q \right\|_{F}^{2} + \lambda_{2} \left\| E \right\|_{1} + \lambda_{3} \left\| Z \right\|_{*} \\ & s.t.\;X = PQ^{T} XZ + E,P^{T} P = I \\ \end{aligned} $$
(42)

It is easy to see that DLRP, (41) and (42) all seek a projection matrix \( Q \) that reduces the dimension of the original data \( X \) in a low-rank subspace and exploit the low-rankness to improve robustness to noise for feature extraction. However, our model (10) is essentially different from models (41) and (42). Although model (41) can reduce the negative impact of occlusions and corruptions by integrating the \( l_{2,1} \)-norm loss function with the low-rank property, the subspace it obtains does not capture the desired properties, e.g., it does not use supervised information to improve the discriminating power of the recovered data. Furthermore, (41) obviously cannot solve the small-class problem. Model (42) is very similar to our model: both aim to obtain a discriminative low-rank representation \( Z \) as well as a robust projection subspace \( Q \). In (42), the term \( \left\| {Y - Q^{T} X} \right\|_{F}^{2} \) fits the class label matrix \( Y \) with the low-dimensional representation \( Q^{T} X \). In contrast, the term \( \left\| {Y - PQ^{T} XZ} \right\|_{F}^{2} \) in (10) not only keeps the low-rank, low-dimensional representation \( PQ^{T} XZ \) closely correlated with the label indicator \( Y \) in the discriminant subspace but also alleviates the small-class problem through \( PQ^{T} \). In general, (42) cannot completely solve the small-class problem for the regression task. In addition, to strengthen the recognition power and representation ability of the data, we append some additional constraints to (10), which further distinguishes it from (41) and (42).

Figure 1 presents the recognition performance of LPP, RPCA, UDFS, LatLRR, DLRP, model (41) (LRE) and model (42) (SALRP) on the PIE face database with 5 × 5 block noise. From Fig. 1, it can be seen that DLRP and model (41) perform significantly better than model (42) on corrupted datasets, which indicates that combining the low-rank property with the \( l_{2,1} \)-norm as the distance metric greatly improves robustness to contaminated image samples.

Fig. 1
figure 1

Recognition rate (%) of different models on PIE face database with 5 × 5 block noise

5 Experiments

In this section, to systematically evaluate the performance of our algorithm, we test it on five databases with random pixel corruptions and contiguous occlusions of different levels. The first four databases are used to test the robustness of our model in different cases, and the last one, the LFW database, is used to test the performance of DLRP on deep-learning features. In addition to the proposed DLRP, the conventional low-rank representation models RPCA, LRR and its extension LatLRR, LRE [38] and SALRP [46] are included for comparison. Besides, the \( l_{2,1} \)-norm regularized unsupervised discriminative feature selection model UDFS [33] and the locality preserving method LPP are also compared.

5.1 The description of datasets

The AR face dataset [54] includes 2400 face images collected from 120 different individuals. The FERET face dataset [55] contains 1400 images of 200 people. The Yale face database [56] includes 2414 face images collected from 38 individuals. The CMU PIE face dataset contains 41,368 face images of 68 people. For the LFW face dataset [57], only a subset of 4324 images from 158 subjects is selected. Similar to [58], a deep convolutional neural network (CNN) was used to extract about 1024-dimensional deep features from the original data before LFW was used for the recognition task.

All images are cropped to \( 50 \times 40 \), \( 40 \times 40 \), \( 32 \times 32 \), \( 32 \times 32 \) and \( 112 \times 96 \) pixels for the AR, FERET, Yale, CMU PIE and LFW face databases, respectively. Figure 2a–e displays sample face images from the AR, FERET, Yale, CMU PIE and LFW face databases.

Fig. 2
figure 2

Face samples of different datasets. a AR dataset, b FERET dataset, c Yale dataset, d PIE dataset, e LFW dataset

5.2 Experimental settings

In our experiments, we compare the proposed model with RPCA, LPP, UDFS, LatLRR, LRE and SALRP. We use the Yale, FERET, CMU PIE and AR face datasets to verify the robustness of DLRP when the data are contaminated by random pixel corruptions and contiguous occlusions of different degrees. According to the number of samples in each dataset, we randomly select \( T_{w} \) images of each individual on the Yale, AR, FERET, CMU PIE and LFW datasets as the training set and the remainder as the test set, where \( T_{w} = 4,5,6 \), \( T_{w} = 4,5,6 \), \( T_{w} = 2,3,4 \), \( T_{w} = 4,5,6 \) and \( T_{w} = 5,6,7 \), respectively. The best parameters were determined on a validation set, which was also used to choose the optimal parameters for the compared models. The regularization parameters \( \lambda \) and \( \alpha \) are selected from \( 10^{ - 6} \) to \( 10^{6} \). To obtain reliable results, each algorithm was run ten times independently on each dataset and the average recognition accuracy is reported. The dimension of the low-dimensional subspace is selected in the range \( \left[ {50,200} \right] \) at intervals of 10. After feature extraction, the nearest neighbor (1NN) classifier with the Euclidean distance is used for classification. The average recognition rates and the corresponding numbers of dimensions on the Yale, FERET and PIE databases are presented in Tables 2, 3 and 4. The average recognition rates versus the number of dimensions on the LFW and AR databases are also reported in Fig. 5a, b.
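A plain numpy sketch of this 1NN evaluation step is shown below; the way test features are formed from the learned \( Q \) (and \( Z \) for the training data) is an assumption for illustration only.

```python
import numpy as np

def nn1_accuracy(train_feats, train_labels, test_feats, test_labels):
    """1-nearest-neighbour classification with the Euclidean distance
    (features are stored as rows; labels are numpy arrays)."""
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(axis=-1)
    pred = train_labels[d2.argmin(axis=1)]
    return float((pred == test_labels).mean())

# Hypothetical usage after DLRP feature extraction:
#   train_feats = (Q.T @ X_train @ Z).T   and   test_feats = (Q.T @ X_test).T
```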

Table 2 Best recognition rate and corresponding dimensions of different algorithms on contaminated Yale datasets (Tw = 6)
Table 3 Best recognition rate and corresponding dimensions of different algorithms on contaminated FERET datasets (Tw = 4)
Table 4 Best recognition rate and corresponding dimensions of different algorithms on contaminated CMU PIE datasets (Tw = 6)

5.3 Robustness evaluation with random pixel noises

To evaluate the robustness of the DLRP, RPCA, LPP, UDFS, LatLRR, LRE and SALRP algorithms when face images contain random pixel corruptions, we add Gaussian noise to the images of the Yale, FERET and PIE datasets. Figure 3 shows the original images and the corrupted images with noise densities of 0.1 and 0.15, respectively. Tables 2, 3 and 4 report the recognition rates and the corresponding numbers of dimensions of the compared algorithms under different densities of Gaussian noise. The experimental results confirm the robustness of DLRP to random pixel noise.

Fig. 3
figure 3

Some examples of the original and noise images with different densities (Den) of Gaussian noise on Yale, FERET and PIE databases

5.4 Robustness evaluation with random block corruptions

To evaluate the robustness of the DLRP, RPCA, LPP, UDFS, LatLRR, LRE and SALRP algorithms when face images contain contiguous noise and corruptions of various levels, we add blocks at random positions in the images. Figure 4 shows the original images and samples of the Yale, FERET and PIE datasets with block sizes of \( 5 \times 5 \) and \( 7 \times 7 \). Tables 2, 3 and 4 display the best recognition rates of the compared algorithms. As shown in Tables 2, 3 and 4, DLRP achieves the best results among all compared algorithms in most cases, which demonstrates the robustness of our model to contiguous occlusions.

Fig. 4
figure 4

Some examples of the original and noise images with different sizes of block noise on Yale, FERET and PIE datasets

5.5 Experimental analysis on AR face dataset

The AR face database is used to investigate the performance of our model and its competitors under variations in facial expression, lighting conditions and different degrees of occlusion, such as sunglasses and scarves. As is well known, the AR face database is challenging for most regression-based recognition methods. Nevertheless, the proposed DLRP outperforms its competitors in most situations, as Fig. 5a shows.

Fig. 5
figure 5

Recognition rate (%) of algorithms on a AR database (Tw = 6) and b LFW database (Tw = 7)

5.6 Experimental analysis on LFW database in deep learning conditions

The LFW database is used to test the recognition performance of our model and the compared models under deep-learning conditions. A CNN was first utilized to extract the deep features, and then the subspace learning techniques (i.e., RPCA, SPCA, LPP, UDFS, LatLRR, LRE and ALRP) were applied to these features. The experimental results on the deep features are shown in Fig. 5b, which indicates that DLRP is still the best choice among all the methods.

5.7 Parameter selection and convergence study

Figure 6 plots the convergence curves versus the number of iterations on the Yale dataset with 0.1 Gaussian noise, the AR database and the LFW dataset. As depicted in the figures, the DLRP algorithm converges within 5 iterations on all three databases. Similar convergence behavior is also observed on the other datasets.

Fig. 6
figure 6

Convergence curves on a AR database and b Yale database with 0.1 Gaussian noise c LFW database

There are two parameters in the proposed model, namely \( \lambda \) and \( \alpha \). Figure 7 shows the classification rates versus \( \lambda \) with \( \alpha \) fixed, and versus \( \alpha \) with \( \lambda \) fixed, on the AR, LFW, Yale, PIE and FERET datasets with 0.1 Gaussian noise, when \( \lambda \) and \( \alpha \) range from \( 10^{ - 6} \) to \( 10^{6} \). As can be seen, DLRP is very robust to the values of \( \alpha \) and \( \lambda \) except on the LFW database. The results show that DLRP achieves better average recognition rates when \( \lambda \) lies in \( [10^{0} ,10^{4} ] \) and \( \alpha \) lies in \( [10^{ - 6} ,10^{1} ] \), respectively. On the other hand, when \( \alpha \) ranges from \( 10^{1} \) to \( 10^{6} \) and \( \lambda \) ranges from \( 10^{ - 6} \) to \( 10^{0} \), the corresponding regularization terms play an important role in our model.

Fig. 7
figure 7

Parameter sensitivity of DLRP: variation of the recognition rate with a the regularization coefficient \( \alpha \) and b the regularization coefficient \( \lambda \)

5.8 Experimental conclusions

Based on the experimental results on the various datasets, the following observations and conclusions are reached.

  1. Compared with traditional dimensionality reduction methods such as LPP, RPCA and UDFS, the proposed model performs better when the images contain data-specific corruptions. The combination of LRR with a dimensionality reduction function is the key reason for DLRP's high classification accuracy.

  2. From Tables 2, 3 and 4, we can see that as the level of noise and occlusion in the data samples increases, most methods are noticeably affected, whereas our method is affected only slightly or even performs better. This indicates that our model is the best choice for contaminated circumstances.

  3. DLRP is superior to LRE for robust face image feature extraction and classification in most cases. This is because our method not only fully exploits the properties of LRR but also integrates supervised information to improve discriminative ability. In addition, alleviating the small-class problem and imposing adaptive constraints also have a positive effect on the face recognition rates.

  4. DLRP outperforms SALRP in most cases, improving the recognition rate by about 30% to 40% on the contaminated Yale database (see Table 2). The likely reason is that DLRP directly projects the clean data \( XZ \) into the label subspace, whereas SALRP projects the original data \( X \), which contains more noise and corruption. That is, DLRP uses the low-rank representation to weaken the disturbance of noise in the projected subspace, so the learned low-dimensional subspace is more robust than that of SALRP.

  5. From Figs. 1 and 5a, b, we can see that DLRP is not consistently the winner in very low-dimensional subspaces, especially below 80 dimensions. In other words, DLRP struggles to classify some data at very low dimensionality. The main reason is that when the dimensionality is too low, DLRP cannot maintain the manifold structure of the samples, which may lead to overfitting. However, DLRP performs well at higher dimensionality, which makes it a favorable choice for classifying high-dimensional data.

Based on the above experimental observations and analysis, we find that DLRP is superior to the classical models (i.e., LatLRR, LPP, RPCA and UDFS) and the most closely related models (i.e., SALRP and LRE). This indicates that integrating the low-rank property with supervised information and alleviating the small-class problem are important for achieving high recognition rates. Furthermore, DLRP inherits the robustness of low-rank representation on corrupted datasets. Thus, the proposed model can learn an optimal robust subspace and achieve good performance in our experiments.

6 Conclusion

In this paper, a novel model named DLRP, which combines low-rank representation with subspace learning, is proposed for image classification and feature extraction. Different from previous LRR-based methods, DLRP can not only obtain clean data under the low-rank constraint but also learn a robust projection subspace. In addition, DLRP alleviates the small-class problem that exists in typical least square regression and its derivatives, so the proposed model can obtain enough projections for feature extraction. Moreover, we impose some adaptive constraints on the low-rank matrix to improve performance, which most low-rank learning methods lack. To demonstrate the effectiveness and robustness of DLRP, we conducted comprehensive experiments on five benchmark datasets and compared it with related techniques. The limitation of our algorithm is that it does not perform well on some low-dimensional data, which we will try to improve in future work.