1 Introduction

In the field of computer vision and pattern recognition, dealing with high-dimensional data is a challenging task. In most cases, these high-dimensional data can be represented by a low-dimensional subspace. For example, the face data under different illumination conditions can be approximated by a low-dimensional linear subspace [1]. Subspace clustering aims to cluster the data points drawn from a union of low-dimensional subspaces.

The subspace clustering problem has attracted attention in recent years, and many methods have been proposed to address it, including iterative methods [2,3,4], algebraic methods [5,6,7], statistical methods [8,9,10,11], and methods based on spectral clustering [12,13,14,15,16,17,18,19,20,21,22]. Researchers favor the methods based on spectral clustering because they generally obtain better results. These methods divide the subspace clustering problem into two steps. First, an affinity matrix is built from the data. Then, spectral clustering is applied on the affinity matrix to segment the data. The key to such methods is constructing an informative affinity matrix.

Recently, various algorithms based on sparse representation [13, 14] or low-rank representation [15,16,17, 20, 22] have been proposed in the literature for subspace clustering. Such methods are robust to noise and outliers and do not require the knowledge of the dimensions and the number of subspaces. Most of these methods are based on a self-expressiveness model, which states that each sample can be expressed as a linear combination of other samples. They utilize all samples as a dictionary and find a sparse or low-rank representation matrix. With relatively clean data, \({\mathbf{X}} = {\mathbf{XR }}\), where \({\mathbf{X}}\) is the data matrix and \({\mathbf{R}}\) is the coefficient matrix. With corrupted data, the constraint is changed to \({\mathbf{X}} = {\mathbf{XR }} + {\mathbf{E}}\), where \({\mathbf{E}}\) denotes the reconstruction error. In general, the subspace clustering problem can be summed up as the following optimization problem:

$$\begin{aligned} \mathop {\min }_{\mathbf{R,E }} \Vert {\mathbf{R}} \Vert _k + \lambda \Vert {\mathbf{E}} \Vert _t \quad {\mathrm{s.t.}} \quad {\mathbf{X}} = {\mathbf{XR }} + {\mathbf{E}} , {\mathrm{diag}}({\mathbf{R}} ) = {\mathbf{0}} \end{aligned}$$
(1)

where \(\Vert \cdot \Vert _k\) and \(\Vert \cdot \Vert _t\) are suitable norms, \(\lambda\) is a trade-off parameter, and the constraint \({\mathrm{diag}}({\mathbf{R}} ) = {\mathbf{0}}\) is optionally used to avoid a trivial solution of \({\mathbf{R}}\).

The methods differ in the choice of norms for the regularization on \({\mathbf{R}}\) and/or the reconstruction error \({\mathbf{E}}\). For example, Elhamifar and Vidal proposed sparse subspace clustering (SSC) [13, 14] based on sparse representation (SR) [23]. The \(l_1\) norm is chosen for \(\Vert \cdot \Vert _k\) to get a relatively sparse matrix \({\mathbf{R}}\), and the F-norm and/or the \(l_1\) norm of \({\mathbf{E}}\) are used to handle Gaussian noise and/or outliers. Liu et al. [16, 17] argued that because SSC learns the sparse representation of each data vector individually, it may fail to capture the global structure of the data, and its performance is largely depressed when the data are grossly corrupted. Consequently, they proposed a method called low-rank representation (LRR), which finds the lowest rank representation of all data jointly. They used the nuclear norm \(\Vert \cdot \Vert _*\) for \(\Vert \cdot \Vert _k\) to obtain a relatively low-rank matrix \({\mathbf{R}}\) and the \(l_{2,1}\) norm of \({\mathbf{E}}\) to tackle sample-specific corruptions, where some data vectors are corrupted and others are clean. Vidal et al. [20] held that using the data as the dictionary is unreasonable when the data are grossly corrupted. They learned a noise-free dictionary from corrupted data and found the low-rank representation based on the learned dictionary. They named this algorithm low-rank subspace clustering (LRSC). In LRSC, the nuclear norm \(\Vert \cdot \Vert _*\) is applied to \(\Vert \cdot \Vert _k\), and the F-norm and the \(l_1\) norm of \({\mathbf{E}}\) are used to handle Gaussian noise and outliers, respectively. Wang et al. [21] reported that the representation matrix is often simultaneously sparse and low rank, and they therefore combined SSC and LRR into a method termed low-rank sparse subspace clustering (LRSSC). In LRSSC, the nuclear norm \(\Vert \cdot \Vert _*\) and the \(l_1\) norm are used for \(\Vert \cdot \Vert _k\), and the F-norm of \({\mathbf{E}}\) is used to deal with Gaussian noise. Several variants of these algorithms have also been proposed, such as least squares regression (LSR) [19], \(l_0\)-SSC [24], SSC-learned orthogonal matching pursuit (SSC-LOMP) [25], subspace clustering with learning an adaptive low-rank graph (SC-LALRG) [26], and subspace clustering with block diagonal representation (BDR) [27]. These methods have yielded state-of-the-art results in different subspace segmentation tasks.

However, finding a sparse or low-rank representation is time-consuming when the feature dimension is high [14]. To tackle this problem, we can reduce the high-dimensional data to low-dimensional data using principal component analysis [28] or random projections [29] prior to obtaining the sparse or low-rank representation. The major drawback of this approach is that it may lose too much information, leading to poor performance. A more reasonable approach is to find the low-dimensional embedding and the sparse or low-rank representation jointly. Patel et al. [30] proposed an extension of SSC: latent space sparse subspace clustering (LS3C). Given a dataset, LS3C finds a low-dimensional embedding and a sparse representation matrix simultaneously. They also proposed an extension of LS3C: latent space low-rank and sparse subspace clustering (LSLRSSC) [31]. LSLRSSC finds a low-dimensional embedding and a sparse and low-rank representation matrix simultaneously. Also, Wei et al. [32] proposed an extension of LS3C, which they called latent space robust subspace segmentation (LSRS2). In LSRS2, the weighted \(l_1\) norm and the nuclear norm are used for the representation matrix. These methods learn a low-dimensional space without losing too much information, and a PCA-like regularization term is added to their objective functions. PCA aims to discover the global structure of the Euclidean space; nevertheless, the local manifold structure is more important than the global Euclidean structure in many real-world applications [33, 34]. Tang et al. [35] therefore replaced the PCA-like term with a special term that describes the property of the learned space and preserves the local manifold structure of the high-dimensional data, resulting in robust subspace learning-based low-rank representation (RSLLRR). However, RSLLRR learns only a low-rank representation matrix, although the representation matrix is often both sparse and low rank. In addition, the methods mentioned above that use the F-norm or the \(l_{2,1}\) norm to capture the reconstruction error are not robust [36,37,38].

Consequently, in this paper, we propose a novel algorithm, robust latent space low-rank and sparse subspace clustering (RLSLRSSC). To preserve the local manifold structure of the high-dimensional data, we devise a graph constraint term instead of a PCA-like term in the objective function of our proposed algorithm. Moreover, to make our algorithm robust to noise, we characterize the reconstruction error more precisely. Due to the limited representational capability with insufficient data and the error in data acquisition and transmission, we assume a portion of the error obeys a Gaussian distribution. Hence, we use the F-norm to depict Gaussian noise. We use the \(l_1\) norm or the \(l_{2,1}\) norm to depict outliers. In summary, our main contributions are as follows:

  1. By adopting a graph constraint term rather than a PCA-like term, we propose a novel representation learning method that finds a discriminating latent space and a low-rank and sparse representation matrix simultaneously.

  2. We depict the noise more accurately, so the robustness of our algorithm will be enhanced.

  3. A simple and efficient iterative method is proposed to optimize our algorithm.

The remainder of this paper is organized as follows. In Sect. 2, we briefly review some related work. Section 3 presents the proposed method RLSLRSSC. In Sect. 4, we describe the optimization algorithms, the time complexity, and convergence analysis. Section 5 reports experimental results. Section 6 concludes this paper.

2 Related work

In this section, we briefly review sparse and low-rank subspace clustering methods, including SSC [13, 14], LRR [16, 17], LRSSC [21], and LSLRSSC [31]. For brevity, we summarize some notations in Table 1.

Table 1 Summary of main notations used in this paper

2.1 Sparse subspace clustering (SSC)

Sparse representation (SR) [23] is an effective tool for representing and compressing high-dimensional data. In recent years, an increasing number of methods have been proposed based on SR, and many applications, such as Tree2Vector [39, 40], derive from it. Similarly, SSC [13, 14] learns the affinity matrix for a dataset based on SR. Given a dataset \({\mathbf{X}} = [{\mathbf{x}} _1, {\mathbf{x}} _2, \ldots , {\mathbf{x}} _N] \in {\mathbf{R}} ^{D \times N}\) drawn from a union of Q independent linear subspaces \(S_1, S_2, \ldots , S_Q\), each sample \({\mathbf{x}} _i \in {\mathbf{R}} ^D\) can be approximately represented by a linear combination of a few points from the rest of the samples in \({\mathbf{X}}\). The idea of SSC can be expressed as follows:

$$\begin{aligned} \mathop {\min }_{\mathbf{R }} \ \Vert {\mathbf{R}} \Vert _1 \quad {\mathrm{s.t.}} \quad {\mathbf{X}} = {\mathbf{XR }}, {\mathrm{diag}}({\mathbf{R}} ) = {\mathbf{0}} \end{aligned}$$
(2)

where \({\mathbf{R}} = [{\mathbf {r}} _1, {\mathbf {r}} _2, \ldots , {\mathbf {r}} _N] \in {R}^{N \times N}\) is a coefficient matrix whose column \({\mathbf {r}} _i\) is the representation vector corresponding to \({\mathbf{x}} _i\), \({\mathrm{diag}}({\mathbf{R}} ) \in {R}^{N}\) is the vector whose elements equal the diagonal elements of \({\mathbf{R}}\), and \({\mathbf{0}} \in {R}^{N}\) is a vector containing zeros only.

In practice, the data often lie in a union of affine subspaces rather than linear subspaces. In this case, we obtain the following optimization problem:

$$\begin{aligned} \mathop {\min }_{\mathbf{R }} \ \Vert {\mathbf{R}} \Vert _1 \quad {\mathrm{s.t.}} \quad {\mathbf{X}} = {\mathbf{XR }}, {\mathrm{diag}}({\mathbf{R}} ) = {\mathbf{0}} , {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} = {\mathbf {1}} \end{aligned}$$
(3)

where \({\mathbf {1}}\) is a vector of all ones. Considering that the data may be corrupted, the constraint \({\mathbf{X}} = {\mathbf{XR }}\) should be relaxed to \({\mathbf{X}} = {\mathbf{XR }} + {\mathbf{E}}\), where \({\mathbf{E}}\) is the noise matrix and the F-norm is used to capture the noise. The following problem, therefore, can be solved to obtain the representation matrix \({\mathbf{R}}\):

$$\begin{aligned} \mathop {\min }_{\mathbf{R }} \ \Vert {\mathbf{X}} - {\mathbf{XR }} \Vert _F^2 + \lambda \Vert {\mathbf{R}} \Vert _1 \ {\mathrm{s.t.}} \ {\mathrm{diag}}({\mathbf{R}} ) = {\mathbf{0}} , {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} = {\mathbf {1}} \end{aligned}$$
(4)

where \(\lambda\) is a regularization parameter. This problem can be solved by using the alternating direction method of multipliers (ADMM) [41].

2.2 Low-rank representation (LRR)

LRR [16, 17] uses low-rank representation to construct an affinity matrix for a group of data. Similar to SSC, LRR uses the data \({\mathbf{X}}\) as the dictionary. When the data is noise-free, a low-rank representation \({\mathbf{R}}\) can be obtained by solving the problem

$$\begin{aligned} \mathop {\min }_{\mathbf{R }} \ {\mathrm{rank}}({\mathbf{R}} ) \quad {\mathrm{s.t.}} \quad {\mathbf{X}} = {\mathbf{XR }} \end{aligned}$$
(5)

However, the above optimization problem is difficult to solve. As a common approach to addressing the above problem, the rank function is replaced by its convex surrogate, the nuclear norm. When the data is corrupted by some arbitrary noise E, i.e., \({\mathbf{X}} = {\mathbf{XR }} + {\mathbf{E}}\), the \(l_{2,1}\) norm is employed to deal with the noise. The objective function of LRR is

$$\begin{aligned} \mathop {\min }_{\mathbf{R,E }} \Vert {\mathbf{R}} \Vert _* + \lambda \Vert {\mathbf{E}} \Vert _{2,1} \quad {\mathrm{s.t.}} \quad {\mathbf{X}} = {\mathbf{XR }} + {\mathbf{E}} \end{aligned}$$
(6)

The above problem can be solved by an inexact augmented Lagrange multiplier (ALM) method [42].

2.3 Low-rank and sparse subspace clustering (LRSSC)

The representation matrix \({\mathbf{R}}\) is often simultaneously sparse and low rank. By combining SSC and LRR, a low-rank and sparse representation can be learned. In LRSSC [21], the representation matrix can be found by solving the following optimization problem:

$$\begin{aligned}&\mathop {\min }_{\mathbf{R }} \ \Vert {\mathbf{X}} - {\mathbf{XR }} \Vert _F^2 + \lambda _1 \Vert {\mathbf{R}} \Vert _* + \lambda _2 \Vert {\mathbf{R}} \Vert _1 \\&\quad {\mathrm{s.t.}} \ {\mathrm{diag}}({\mathbf{R}} ) = {\mathbf{0}} , {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} = {\mathbf {1}} \end{aligned}$$
(7)

where \(\lambda _1\) and \(\lambda _2\) are two regularization parameters. This problem can be solved using the ADMM [41] method.
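As a concrete illustration of problem (7), the following sketch prototypes LRSSC with the CVXPY modeling library instead of the ADMM solver used in [21]; it is only practical for small N, and the function name and parameter defaults are illustrative rather than part of the original method.

```python
# Sketch: LRSSC problem (7) prototyped with CVXPY (small-scale only);
# the reference method [21] solves it with ADMM instead.
import numpy as np
import cvxpy as cp

def lrssc_cvxpy(X, lam1=0.1, lam2=0.01):
    """X: D x N data matrix; returns the N x N representation matrix R."""
    N = X.shape[1]
    R = cp.Variable((N, N))
    objective = cp.Minimize(
        cp.sum_squares(X - X @ R)      # reconstruction error ||X - XR||_F^2
        + lam1 * cp.normNuc(R)         # nuclear norm encourages low rank
        + lam2 * cp.sum(cp.abs(R))     # element-wise l1 norm encourages sparsity
    )
    constraints = [cp.diag(R) == 0,            # avoid the trivial solution
                   cp.sum(R, axis=0) == 1]     # affine constraint R^T 1 = 1
    cp.Problem(objective, constraints).solve()
    return R.value
```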

2.4 Latent space low-rank and sparse subspace clustering (LSLRSSC)

Finding a low-rank and sparse representation is very time-consuming for high-dimensional data. Unlike LRSSC, LSLRSSC [31] finds a latent low-dimensional space and a low-rank and sparse representation matrix simultaneously. LSLRSSC can be expressed as the following problem:

$$\begin{aligned}&\mathop {\min }_{\mathbf{P,R }} \ \Vert {\mathbf{PX }} - {\mathbf{PXR }} \Vert _F^2 + \lambda _1 \Vert {\mathbf{R}} \Vert _* + \lambda _2 \Vert {\mathbf{R}} \Vert _1 \\&\quad +\lambda _3 \Vert {\mathbf{X}} - {\mathbf{P}} ^{\mathrm{T}}{\mathbf{PX }} \Vert _F^2 \ {\mathrm{s.t.}} \ {\mathrm{diag}}({\mathbf{R}} ) = {\mathbf{0}} , {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} = {\mathbf {1}} , {\mathbf{P}} {} {\mathbf{P}} ^{\mathrm{T}} = {\mathbf {I}} \end{aligned}$$
(8)

where \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) are three trade-off parameters and \({\mathbf{P}}\) is a projection matrix. The last term guarantees that the projection does not lose too much information from the original domain. This problem can be solved by optimizing \({\mathbf{P}}\) and \({\mathbf{R}}\) alternately using ADMM [41]. In SSC, LRR, LRSSC, and LSLRSSC, once the representation matrix \({\mathbf{R}}\) is learned, the affinity matrix \({\mathbf{W}}\) can be constructed by \({\mathbf{W}} = |{\mathbf{R}} | + |{\mathbf{R}} ^{\mathrm{T}}|\), where \(|{\mathbf{R}} |\) denotes the element-wise absolute value of \({\mathbf{R}}\). Finally, spectral clustering [43] is used to obtain the clustering results.
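The common post-processing step shared by these methods can be sketched as follows; the helper assumes a representation matrix \({\mathbf{R}}\) obtained from any of the solvers above and uses scikit-learn's spectral clustering, which is an implementation choice rather than the specific algorithm of [43].

```python
# Sketch of the shared post-processing step: build the affinity matrix
# W = |R| + |R^T| and run spectral clustering on it.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(R, n_clusters):
    W = np.abs(R) + np.abs(R).T          # symmetric, non-negative affinity
    labels = SpectralClustering(
        n_clusters=n_clusters,
        affinity="precomputed",          # use W directly as the affinity matrix
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(W)
    return labels
```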

3 Robust latent space low-rank and sparse subspace clustering (RLSLRSSC)

In this section, we present a novel method, robust latent space low-rank and sparse subspace clustering (RLSLRSSC). Our model consists of three steps. First, it projects the high-dimensional data onto a lower-dimensional space, and we add a graph constraint to promote the discriminative ability of the projected space. Second, the projected data are represented by a low-rank and sparse representation matrix. Third, we deal with the noise and outliers. In short, we aim to learn a low-rank and sparse representation matrix and a discriminative projection matrix simultaneously by using a graph constraint. RLSLRSSC can be expressed as the following optimization problem:

$$\begin{aligned}&\mathop {\min }_{\mathbf{P,R,G,E }} \lambda f({\mathbf {G}} ) + \lambda _1 f_1({\mathbf{R}} ) + \lambda _2 f_2({\mathbf{P}} ) + \lambda _3 f_3({\mathbf{E}} ) \\&\quad {\mathrm{s.t.}} \quad {\mathbf{PX }} = {\mathbf{PXR }} + {\mathbf {G}} + {\mathbf{E}} , {\mathbf{P}} {} {\mathbf{P}} ^{\mathrm{T}} = {\mathbf {I}} , {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} = {\mathbf {1}} \end{aligned}$$
(9)

where \(\lambda , \lambda _1, \lambda _2, \lambda _3\) are four regularization parameters, \({\mathbf{P}}\) is a projection matrix, \({\mathbf{R}}\) is a low-rank and sparse representation matrix, \({\mathbf {G}}\) denotes universal noise, and \({\mathbf{E}}\) denotes outliers. In most cases, universal noise obeys a Gaussian distribution, and we can apply the F-norm for \({\mathbf {G}}\), \(f({\mathbf {G}} ) = \Vert {\mathbf {G}} \Vert _F^2\). One can find both sparse and low-rank representation as is performed in LRSSC by setting \(f_1({\mathbf{R}} ) = \Vert {\mathbf{R}} \Vert _* + \alpha \Vert {\mathbf{R}} \Vert _1\), where \(\alpha\) is a trade-off parameter. To deal with outliers, the \(l_1\) norm or the \(l_{2,1}\) norm is employed for \({\mathbf{E}}\), \(f_3({\mathbf{E}} ) = \Vert {\mathbf{E}} \Vert _1\) or \(f_3({\mathbf{E}} ) = \Vert {\mathbf{E}} \Vert _{2,1}\). \(f_2({\mathbf{P}} )\) describes the property of the learned space, which enforces discrimination of the learned space.

The graph constraint term \(f_2({\mathbf{P}} )\) learns a transformation matrix that preserves useful information and projects the data onto a discriminative latent space, where different classes are separated more easily than in the original space. Because the local manifold structure is more significant than the global Euclidean structure [33, 34], preserving the local manifold structure in the projected space is more reasonable. We hope that if two data points \({\mathbf{x}} _i\) and \({\mathbf{x}} _j\) are close in the original space, their corresponding low-dimensional embeddings \({\mathbf {y}} _i\) and \({\mathbf {y}} _j\) should be close, too. \(f_2({\mathbf{P}} )\) is designed to preserve this relationship between the data samples. Accordingly, we choose the k nearest neighbors of each data point by Euclidean distance. Then, the adjacency graph \({\mathbf{W}}\) can be built on the k nearest neighbors by using the radial basis function (RBF). In the RBF, it is important to select the value of \(\sigma\). However, when the data include clusters with different local statistics, the adjacency graph may not be accurate even with the optimal \(\sigma\). In addition, it is hard to find the optimal \(\sigma\) because there are many parameters in our model. To overcome these problems, \({\mathbf{W}}\) is defined as

$${\mathbf{W}} _{ij} = \left\{ \begin{array}{ll} \exp \left( -\frac{ \left\| {{\mathbf{x}} _i - {\mathbf{x}} _j} \right\| _2^2}{\sigma _i \sigma _j}\right) &\quad {\mathbf{x}} _i \in N({\mathbf{x}} _j) \,\, {\hbox{or}}\, \, {\mathbf{x}} _j \in N({\mathbf{x}} _i) \\ 0 &\quad {\mathrm{otherwise}} \end{array} \right.$$
(10)

where \(N({\mathbf{x}} _i)\) denotes the nearest neighbors of \({\mathbf{x}} _i\) and \(\sigma _i\) is a local scaling [44] parameter for \({\mathbf{x}} _i\), which permits self-tuning of the point-to-point distances according to the local statistics of the neighborhoods of \({\mathbf{x}} _i\). The local scale \(\sigma _i\) can be defined as \(\sigma _i = d({\mathbf{x}} _i,{\mathbf{x}} _t)\), where \({\mathbf{x}} _t\) is the tth neighbor of \({\mathbf{x}} _i\).
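A possible implementation of the locally scaled graph in Eq. (10) is sketched below; the values k = t = 4 follow the experimental setting in Sect. 5.4, and the function name is illustrative.

```python
# Sketch: adjacency graph W of Eq. (10) with local scaling sigma_i = d(x_i, x_t).
import numpy as np
from scipy.spatial.distance import cdist

def local_scaling_graph(X, k=4, t=4):
    """X: D x N data matrix. Returns an N x N symmetric adjacency matrix W."""
    N = X.shape[1]
    D2 = cdist(X.T, X.T, metric="sqeuclidean")       # pairwise squared distances
    order = np.argsort(D2, axis=1)                   # order[i, 0] is i itself
    sigma = np.sqrt(D2[np.arange(N), order[:, t]])   # distance to the t-th neighbor
    sigma = np.maximum(sigma, 1e-12)                 # guard against duplicate points
    W = np.exp(-D2 / np.outer(sigma, sigma))
    # keep W_ij only if x_j is among the k nearest neighbors of x_i, or vice versa
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(N), k)
    mask[rows, order[:, 1:k + 1].ravel()] = True
    mask = mask | mask.T
    W = np.where(mask, W, 0.0)
    np.fill_diagonal(W, 0.0)
    return W
```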

We formulate the graph constraint term \(f_2({\mathbf{P}} )\) as

$$\begin{aligned} \mathop {\min }_{\mathbf{P }} f_2({\mathbf{P}} ) = \sum _{i=1}^{N} \sum _{j=1}^{N} \frac{1}{2} \Vert {\mathbf {y}} _i - {\mathbf {y}} _j \Vert _2^2 {\mathbf{W}} _{ij} \end{aligned}$$
(11)

where \({\mathbf {y}} _i\) is the low-dimensional embedding corresponding to \({\mathbf{x}} _i\), with \({\mathbf {y}} _i = {\mathbf{P}} {} {\mathbf{x}} _i\). Let \({\mathbf {D}}\) be a diagonal matrix with \({\mathbf {D}} _{ii} = \sum _{j=1}^{N} {\mathbf{W}} _{ij}\), and \({\mathbf{L}}\) be the Laplacian matrix with \({\mathbf{L}} = {\mathbf {D}} - {\mathbf{W}}\). The objective function (11) can be rewritten as the following equivalent problem:

$$\begin{aligned}\mathop {\min }_{\mathbf{P }} f_2({\mathbf{P}} ) = tr({\mathbf{PXL }}{} {\mathbf{X}} ^{\mathrm{T}}{} {\mathbf{P}} ^{\mathrm{T}}) \ {\mathrm{s.t.}} \ {\mathbf{P}} {} {\mathbf{P}} ^{\mathrm{T}} = {\mathbf {I}} \end{aligned}$$
(12)

The above optimization problem can be seen as a simplified version of the one solved in orthogonal locality preserving projections (OLPP) [45,46,47]. If the sample size is smaller than the feature dimension, we encounter the undersampled-size problem. To ensure that the matrix \({\mathbf{ M}} = {\mathbf{XL }}{} {\mathbf{X}} ^{\mathrm{T}}\) is nonsingular, we may use PCA to reduce the dimensionality of the data in advance.
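For the graph term alone, the solution of problem (12) follows directly from an eigendecomposition; a minimal sketch is given below, assuming the data have already been reduced by PCA when D > N so that \({\mathbf{XL }}{} {\mathbf{X}} ^{\mathrm{T}}\) is well conditioned.

```python
# Sketch: minimize tr(P X L X^T P^T) s.t. P P^T = I  (problem (12)).
import numpy as np

def solve_graph_projection(X, W, d):
    """X: D x N data, W: N x N adjacency graph, d: latent dimension."""
    Dg = np.diag(W.sum(axis=1))          # degree matrix
    L = Dg - W                           # graph Laplacian
    M = X @ L @ X.T                      # D x D matrix X L X^T
    eigvals, eigvecs = np.linalg.eigh(M) # eigenvalues in ascending order
    P = eigvecs[:, :d].T                 # rows: eigenvectors of the d smallest eigenvalues
    return P                             # satisfies P P^T = I
```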

By expanding \(f({\mathbf {G}} )\), \(f_1({\mathbf{R}} )\), \(f_2({\mathbf{P}} )\) , and \(f_3({\mathbf{E}} )\), our proposed model (9) can be expressed as

$$\begin{aligned}&\mathop {\min }_{\mathbf{P,R,E }} \ \lambda \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \lambda _1 \Vert {\mathbf{R}} \Vert _* + \lambda _2 \Vert {\mathbf{R}} \Vert _1 \\&\quad + \lambda _3 tr({\mathbf{PXL }}{} {\mathbf{X}} ^{\mathrm{T}}{} {\mathbf{P}} ^{\mathrm{T}}) + \lambda _4 \Vert {\mathbf{E}} \Vert _l \quad {\mathrm{s.t.}} \ {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} = {\mathbf {1}} , {\mathbf{P}} {} {\mathbf{P}} ^{\mathrm{T}} = {\mathbf {I}} \end{aligned}$$
(13)

where \(\Vert \cdot \Vert _l\) denotes a certain norm. We consider two kinds of norms on \({\mathbf{E}}\): (i) the \(l_1\) norm, resulting in a model denoted by RLSLRSSC-L1; and (ii) the \(l_{2,1}\) norm, resulting in a model denoted by RLSLRSSC-L21.

4 Optimization

The optimization problem (13) is difficult to solve, and it is not jointly convex in (\({\mathbf{P}} ,{\mathbf{R}} ,{\mathbf{E}}\)). We therefore divide problem (13) into three subproblems to find the projection matrix \({\mathbf{P}}\), the representation matrix \({\mathbf{R}}\), and the error matrix \({\mathbf{E}}\) separately. These subproblems are solved alternately by updating one variable while fixing the others. Each subproblem is discussed in detail below.

4.1 Update of projection matrix \({\mathbf{P}}\)

Given fixed \({\mathbf{R}}\) and \({\mathbf{E}}\), the cost function in Eq. (13) is further reduced to

$$\begin{aligned}&J_{\mathbf{P}} = \mathop {\min }_{\mathbf{P }} \ \lambda \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \lambda _3 tr({\mathbf{PXL }}{} {\mathbf{X}} ^{\mathrm{T}}{} {\mathbf{P}} ^{\mathrm{T}}) \\&\quad {\mathrm{s.t.}} \quad {\mathbf{P}} {} {\mathbf{P}} ^{\mathrm{T}} = {\mathbf {I}} \end{aligned}$$
(14)

The above objective function is nonconvex; a local minimum can be obtained as follows. First, considering \({\mathbf{P}} {} {\mathbf{P}} ^{\mathrm{T}} = {\mathbf {I}}\), we have

$$\begin{aligned} \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 = tr({\mathbf{P}} \varPhi ({\mathbf{P}} ){\mathbf{P}} ^{\mathrm{T}}) \end{aligned}$$
(15)

where \(\varPhi ({\mathbf{P}} ) = ({\mathbf{X}} -{\mathbf{XR }}-{\mathbf{P}} ^{\mathrm{T}}{} {\mathbf{E}} )({\mathbf{X}} -{\mathbf{XR }}-{\mathbf{P}} ^{\mathrm{T}}{} {\mathbf{E}} )^{\mathrm{T}}\). In this case, the objective function in Eq. (14) can then be rewritten as

$$\begin{aligned} J_{\mathbf{P}} =&\mathop {\min }_{\mathbf{P }} \ \lambda tr({\mathbf{P}} \varPhi ({\mathbf{P}} ){\mathbf{P}} ^{\mathrm{T}}) + \lambda _3 tr({\mathbf{PXL }}{} {\mathbf{X}} ^{\mathrm{T}}{} {\mathbf{P}} ^{\mathrm{T}}) \\ =&\mathop {\min }_{\mathbf{P }} \ tr({\mathbf{P}} (\lambda \varPhi ({\mathbf{P}} ) + \lambda _3{\mathbf{XL }}{} {\mathbf{X}} ^{\mathrm{T}}){\mathbf{P}} ^{\mathrm{T}}) \\&\quad {\mathrm{s.t.}} \quad {\mathbf{P}} {} {\mathbf{P}} ^{\mathrm{T}} = {\mathbf {I}} \end{aligned}$$
(16)

To optimize the above minimization in the tth iteration, we use \(\varPhi ({\mathbf{P}} _{t-1})\) to estimate \(\varPhi ({\mathbf{P}} )\) in Eq. (16), where \({\mathbf{P}} _{t-1}\) is the projection matrix acquired in the \((t-1)\)th iteration. Using the eigenvalue decomposition (EVD) technique, we have

$$\begin{aligned} \left[ {\mathbf {U}} ,{\mathbf {S}} ,{\mathbf {U}} ^{\mathrm{T}}\right] = {\mathrm{EVD}}(\lambda \varPhi ({\mathbf{P}} _{t-1}) + \lambda _3{\mathbf{XL }}{} {\mathbf{X}} ^{\mathrm{T}}) \end{aligned}$$
(17)

Then, we can update \({\mathbf{P}}\) as the d eigenvectors in \({\mathbf {U}}\) associated with the d smallest eigenvalues in \({\mathbf {S}}\), i.e., \({\mathbf{P}} _t = {\mathbf {U}} (:,1:d)^{\mathrm{T}}\).
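The P-update can be written compactly as follows; this is a sketch under the assumption that \({\mathbf{E}}\) lives in the d-dimensional projected space (so \({\mathbf{P}} _{t-1}^{\mathrm{T}}{} {\mathbf{E}}\) maps it back to the original space), with illustrative variable names.

```python
# Sketch of the P-update in Sect. 4.1: EVD of lambda*Phi(P_{t-1}) + lambda3*X L X^T,
# keeping the eigenvectors associated with the d smallest eigenvalues.
import numpy as np

def update_P(X, R, E, P_prev, L, lam, lam3, d):
    residual = X - X @ R - P_prev.T @ E          # D x N, uses P from the previous iteration
    Phi = residual @ residual.T                  # Phi(P_{t-1}) in Eq. (15)
    M = lam * Phi + lam3 * (X @ L @ X.T)
    eigvals, eigvecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    return eigvecs[:, :d].T                      # P_t, with P_t P_t^T = I
```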

4.2 Update of representation matrix \({\mathbf{R}}\)

We can then optimize \({\mathbf{R}}\) with \({\mathbf{P}}\) and \({\mathbf{E}}\) fixed. By ignoring irrelevant terms, the cost function in Eq. (13) reduces to

$$\begin{aligned}&\mathop {\min }_{\mathbf{R }} \ \lambda \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \lambda _1 \Vert {\mathbf{R}} \Vert _* + \lambda _2 \Vert {\mathbf{R}} \Vert _1 \\&\quad {\mathrm{s.t.}} \quad {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} = {\mathbf {1}} \end{aligned}$$
(18)

To solve the above minimization, we introduce two auxiliary variables, \({\mathbf {J}}\) and \({\mathbf {S}}\), and reformulate the optimization problem. For brevity, let \({\mathbf{A}} = {\mathbf{PX }}\) and \({\mathbf{B}} = {\mathbf{PX }}-{\mathbf{E}}\). We can rewrite Eq. (18) as

$$\begin{aligned}&\mathop {\min }_{\mathbf{R,J,S }} \ \lambda \Vert {\mathbf{B}} - {\mathbf{AR}} \Vert _F^2 + \lambda _1 \Vert {\mathbf {J}} \Vert _* + \lambda _2 \Vert {\mathbf {S}} \Vert _1 \\&\quad {\mathrm{s.t.}} \quad {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} = {\mathbf {1}} ,\ {\mathbf{R}} = {\mathbf {J}} ,\ {\mathbf{R}} = {\mathbf {S}} \end{aligned}$$
(19)

We can solve the above problem by using the inexact augmented Lagrangian multiplier (ALM) [42] method. The augmented Lagrangian function of Eq. (19) is

$$\begin{aligned}&\mathop {\min }_{\mathbf{R }} \ \lambda \Vert {\mathbf{B}} - {\mathbf{AR}} \Vert _F^2 + \lambda _1 \Vert {\mathbf {J}} \Vert _* + \lambda _2 \Vert {\mathbf {S}} \Vert _1 \\&\quad + tr[{\varvec{\varLambda }}_1({\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} -{\mathbf {1}} )] + tr[{\varvec{\varLambda }}_2({\mathbf{R}} - {\mathbf {J}} )] + tr[{\varvec{\varLambda }}_3({\mathbf{R}} - {\mathbf {S}} )] \\&\quad + \frac{u_1}{2}\Vert {\mathbf{R}} ^{\mathrm{T}} {\mathbf {1}} -{\mathbf {1}} \Vert _F^2 + \frac{u_2}{2}\Vert {\mathbf{R}} - {\mathbf {J}} \Vert _F^2 + \frac{u_3}{2}\Vert {\mathbf{R}} - {\mathbf {S}} \Vert _F^2 \end{aligned}$$
(20)

where \({\varvec{\varLambda }}_1\), \({\varvec{\varLambda }}_2\), and \({\varvec{\varLambda }}_3\) are three Lagrangian multipliers and \(u_1\), \(u_2\), \(u_3\) are three penalty factors. The optimization of Eq. (20) is given in Algorithm 1.

4.3 Update of noise matrix \({\mathbf{E}}\)

In order to solve \({\mathbf{E}}\), we keep \({\mathbf{P}}\) and \({\mathbf{R}}\) fixed. Hence, the objective function in Eq. (13) can be rewritten as

$$\begin{aligned}&\mathop {\min }_{\mathbf{E }} \ \lambda \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \lambda _4 \Vert {\mathbf{E}} \Vert _l \quad \end{aligned}$$
(21)

When the \(l_1\) norm is used for the noise, we have

$$\begin{aligned}&\mathop {\min }_{\mathbf{E }} \ \lambda \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \lambda _4 \Vert {\mathbf{E}} \Vert _1 \quad \end{aligned}$$
(22)

This problem can be efficiently solved by shrinkage [48]. When the \(l_{2,1}\) norm is used for the noise, we have

$$\begin{aligned}&\mathop {\min }_{\mathbf{E }} \ \lambda \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \lambda _4 \Vert {\mathbf{E}} \Vert _{2,1} \quad \end{aligned}$$
(23)

According to Lemma 1, we can update \({\mathbf{E}}\) by

$$\begin{aligned} {\mathbf{E}} = {\mathrm{argmin}} \ \frac{1}{2} \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \frac{\lambda _4}{2\lambda } \Vert {\mathbf{E}} \Vert _{2,1}. \end{aligned}$$

Lemma 1

[49] Given a matrix \({\mathbf {y}} = [{\mathbf {y}} _1, {\mathbf {y}} _2, \ldots , {\mathbf {y}} _N]\), if the optimal solution of

$$\begin{aligned} \mathop {\min }_{\mathbf{Z}} \ \frac{1}{2} \Vert {\mathbf {Z}} - {\mathbf {y}} \Vert _F^2 + \eta \Vert {\mathbf {Z}} \Vert _{2,1} \end{aligned}$$

is \({\mathbf {Z}} ^*\), then the ith column of \({\mathbf {Z}} ^*\) is

$$\qquad {\mathbf {Z}} ^*(:,i) = \left\{ \begin{array}{ll} \frac{\Vert {\mathbf {y}} _i\Vert _2 - \eta }{\Vert {\mathbf {y}} _i\Vert _2}{} {\mathbf {y}} _i &\quad {\mathrm{if}} \ \eta \le \Vert {\mathbf {y}} _i\Vert _2 \\ 0 &\quad {\mathrm{otherwise}} \end{array} \right.$$
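Both choices of \(f_3({\mathbf{E}} )\) lead to closed-form proximal updates; the sketch below gives the element-wise soft-thresholding for the \(l_1\) case [48] and the column-wise shrinkage of Lemma 1 for the \(l_{2,1}\) case, with the call pattern of Sect. 4.3 shown as a comment.

```python
# Sketch of the E-update: element-wise soft-thresholding for the l1 penalty
# and column-wise shrinkage (Lemma 1) for the l2,1 penalty.
import numpy as np

def soft_threshold(Y, eta):
    """argmin_E 0.5*||E - Y||_F^2 + eta*||E||_1 (element-wise shrinkage)."""
    return np.sign(Y) * np.maximum(np.abs(Y) - eta, 0.0)

def column_shrink(Y, eta):
    """argmin_E 0.5*||E - Y||_F^2 + eta*||E||_{2,1} (Lemma 1, column-wise)."""
    norms = np.linalg.norm(Y, axis=0)                        # ||y_i||_2 per column
    scale = np.maximum(norms - eta, 0.0) / np.maximum(norms, 1e-12)
    return Y * scale                                         # broadcasts over columns

# E-update of Sect. 4.3, with Y = PX - PXR (illustrative calls):
# E = soft_threshold(Y, lam4 / (2 * lam))      # RLSLRSSC-L1
# E = column_shrink(Y, lam4 / (2 * lam))       # RLSLRSSC-L21
```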

4.4 The algorithms

We first describe the algorithmic procedure for solving \({\mathbf{R}}\) in Algorithm 1. Then, we discuss the whole algorithmic procedure for solving Eq. (13) in Algorithm 2. Once the representation matrix \({\mathbf{R}}\) is learned, spectral clustering is applied on the affinity matrix \({\mathbf{W}} = |{\mathbf{R}} |^{\mathrm{T}} + |{\mathbf{R}} |\) to obtain the clustering result. The proposed methods are summarized in Algorithm 3.

[Algorithm 1: solving problem (19) for the representation matrix \({\mathbf{R}}\) by inexact ALM]
[Algorithm 2: the overall alternating optimization of RLSLRSSC for Eq. (13)]
[Algorithm 3: the complete RLSLRSSC clustering procedure]
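Since the pseudocode of Algorithms 1-3 appears only in the algorithm figures, the following outline summarizes the alternating procedure of Sects. 4.1-4.3 as a sketch. It reuses the helper sketches given earlier (`local_scaling_graph`, `solve_graph_projection`, `update_P`, `soft_threshold`, `column_shrink`, `cluster_from_representation`); the initialization and stopping rule are illustrative assumptions, and `solve_R_alm` stands for the inexact-ALM solver of Algorithm 1, which is not reproduced here.

```python
# Outline of Algorithm 2 (alternating optimization of Eq. (13)) and Algorithm 3
# (clustering).  `solve_R_alm` is a placeholder for the ALM solver of Algorithm 1.
import numpy as np

def rlslrssc(X, n_clusters, d=100, lam=0.01, lam1=0.1, lam2=0.01, lam3=0.1,
             lam4=0.1, noise="l21", k=4, t=4, max_iter=30, tol=1e-4):
    N = X.shape[1]
    W = local_scaling_graph(X, k, t)                 # Eq. (10)
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian
    P = solve_graph_projection(X, W, d)              # illustrative initialization (problem (12))
    R = np.zeros((N, N))
    E = np.zeros((d, N))
    prev_cost = np.inf
    for _ in range(max_iter):
        # A = PX, B = PX - E as in Sect. 4.2 (Algorithm 1, hypothetical signature)
        R = solve_R_alm(P @ X, P @ X - E, lam, lam1, lam2)
        P = update_P(X, R, E, P, L, lam, lam3, d)    # Sect. 4.1
        Y = P @ X - P @ X @ R
        if noise == "l1":
            E = soft_threshold(Y, lam4 / (2 * lam))  # RLSLRSSC-L1, Sect. 4.3
        else:
            E = column_shrink(Y, lam4 / (2 * lam))   # RLSLRSSC-L21, Sect. 4.3
        cost = (lam * np.linalg.norm(Y - E, "fro") ** 2
                + lam1 * np.linalg.norm(R, "nuc")
                + lam2 * np.abs(R).sum()
                + lam3 * np.trace(P @ X @ L @ X.T @ P.T)
                + lam4 * (np.abs(E).sum() if noise == "l1"
                          else np.linalg.norm(E, axis=0).sum()))
        if abs(prev_cost - cost) < tol:              # illustrative stopping rule
            break
        prev_cost = cost
    return cluster_from_representation(R, n_clusters)   # Algorithm 3
```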

4.5 Complexity and convergence analysis

4.5.1 Time complexity

We analyze the time complexity of the two main subproblems of the RLSLRSSC optimization as follows:

  1. The most time-consuming subproblem is computing the low-rank and sparse representation. In Algorithm 1, step 5 is the most time-consuming because of the singular value decomposition (SVD), which costs \(O(N^3)\), where N is the number of data samples. Computing the matrix inverse in step 3 costs \(O(d^3)\), where d is the dimension of the latent space. In short, the total time complexity of Algorithm 1 is \(i_1 \ O(N^3)\), where \(i_1\) is the number of iterations of this algorithm.

  2. For updating the projection matrix \({\mathbf{P}}\), the most time-consuming step is the EVD, whose time complexity is also \(O(N^3)\).

Algorithm 1 is a step of Algorithm 2; as a result, the total time complexity of RLSLRSSC is \(i_1 \ i_2 \ O(N^3)\), where \(i_1\) is the number of iterations of Algorithm 1 and \(i_2\) is the number of iterations of Algorithm 2.

Then, we briefly analyze the time complexities of all the compared methods. For SSC, the time complexity is \(iD \times O(N^3)\), where i is the number of iterations and D is the feature dimension. For LSR and LRSC, the time complexity is \(O(N^3)\). For LRR and BDR, the time complexity is \(i \ O(N^3)\). The time complexities of LSLRSSC and our algorithm are the same, namely \(i_1 \ i_2 \ O(N^3)\).

4.5.2 Convergence analysis

A proof to ensure the convergence of the inexact ALM method can be found in [17], so we assume Algorithm 1 is convergent. Now, we consider the convergence of Algorithm 2.

Proposition 1

The cost function \(J = \lambda \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \lambda _1 \Vert {\mathbf{R}} \Vert _* + \lambda _2 \Vert {\mathbf{R}} \Vert _1 + \lambda _3 tr({\mathbf{PXL }}{} {\mathbf{X}} ^{\mathrm{T}}{} {\mathbf{P}} ^{\mathrm{T}}) + \lambda _4 \Vert {\mathbf{E}} \Vert _l\) converges to a minimum by employing Algorithm 2.

Proof

First, it is easy to prove that \(J \ge 0\) for any \({\mathbf{P}} , {\mathbf{R}} , {\mathbf{E}}\).

Then, based on Algorithm 2, assume we have completed t iterations and obtained \({\mathbf{P}} _t, {\mathbf{R}} _t, {\mathbf{E}} _t\). In the \((t+1)\)th iteration, \({\mathbf{R}} _{t+1} = {\mathrm{argmin}}_{\mathbf{R}} \ J({\mathbf{P}} _t, {\mathbf{R}} , {\mathbf{E}} _t)\) after step 3. Therefore, \(J({\mathbf{P}} _t, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t) \le J({\mathbf{P}} _t, {\mathbf{R}} _t, {\mathbf{E}} _t)\). Similarly, we have \({\mathbf{P}} _{t+1} = {\mathrm{argmin}}_{\mathbf{P}} \ J({\mathbf{P}} , {\mathbf{R}} _{t+1}, {\mathbf{E}} _t)\) and hence \(J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t) \le J({\mathbf{P}} _t, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t)\) after step 4. Finally, we have \({\mathbf{E}} _{t+1} = {\mathrm{argmin}}_{\mathbf{E}} \ J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} )\) and \(J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _{t+1}) \le J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t)\) after step 5. Hence, \(J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _{t+1}) \le J({\mathbf{P}} _{t}, {\mathbf{R}} _{t}, {\mathbf{E}} _t)\); since J is monotonically nonincreasing and bounded below by zero, it converges, and we can deduce that RLSLRSSC is convergent.

5 Experimental results

In this section, we carry out several experiments on three face benchmark databases to demonstrate the superior clustering performance of our proposed method RLSLRSSC. We compare our RLSLRSSC with SSC, LSR, LRR, LRSC, BDR, LRSSC, and LSLRSSC. Two measures are employed to evaluate the performance of all approaches, i.e., clustering accuracy (ACC) and normalized mutual information (NMI) [50]. To test the robustness of these methods, we show their results on the databases corrupted by four kinds of noise: Gaussian noise, Laplacian noise, sample-specific corruptions, and Gaussian and Laplacian mixed noise. We describe four kinds of noise experiments in detail in the following paragraphs.

For the first experiment, we add Gaussian noise to the database. The level of noise intensity (LNI) varies from 0 to 5. When the LNI = 0, the original database is used. The corrupted data are constructed by adding Gaussian noise with zero mean and variance \(0.01 \times \hbox {LNI}\).

For the second experiment, we add Laplacian noise to each image. Similar to the first experiment, the LNI varies from 1 to 5. The corrupted data are generated by adding Laplacian noise with the location parameter zero and the scale parameter \(0.05\times \hbox {LNI}\).

For the third experiment, we add Gaussian noise to some selected samples. The level of corruption percentage (LCP) varies from 1 to 10. We randomly select \(\hbox {LCP}\times 10\%\) of the data samples and add Gaussian noise with LNI = 2 to them.

For the last experiment, we add two kinds of noise to the images: Gaussian noise and Laplacian noise. The LNI varies from 1 to 5. To construct the contaminated data, Gaussian noise with zero mean and variance \(0.01 \times \hbox {LNI}\) is added to all images, and then Laplacian noise with location parameter zero and scale parameter \(0.05 \times \hbox {LNI}\) is added.
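The four corruption schemes can be reproduced as in the following sketch, which assumes the image matrix \({\mathbf{X}}\) stores samples column-wise with pixel values scaled to [0, 1]; the scaling and function names are assumptions made for illustration.

```python
# Sketch of the four corruption schemes of Sect. 5 (pixel values assumed in [0, 1]).
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian(X, lni):
    # variance 0.01 * LNI  ->  standard deviation sqrt(0.01 * LNI)
    return X + rng.normal(0.0, np.sqrt(0.01 * lni), size=X.shape)

def add_laplacian(X, lni):
    # location 0, scale 0.05 * LNI
    return X + rng.laplace(0.0, 0.05 * lni, size=X.shape)

def add_sample_specific(X, lcp, lni=2):
    Xc = X.copy()
    n_corrupt = int(round(lcp * 0.10 * X.shape[1]))    # LCP x 10% of the samples
    idx = rng.choice(X.shape[1], n_corrupt, replace=False)
    Xc[:, idx] = add_gaussian(Xc[:, idx], lni)
    return Xc

def add_mixed(X, lni):
    return add_laplacian(add_gaussian(X, lni), lni)
```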

Fig. 1
figure 1

Samples from the ORL database and contaminated ORL database. a The original ORL database. b The ORL database corrupted by Gaussian noise. c The ORL database corrupted by Laplacian noise. d The ORL database corrupted by Gaussian and Laplacian mixed noise. The LNI of the corrupted images is 2

Table 2 Clustering results on the ORL database corrupted by Gaussian noise
Table 3 Clustering results on the ORL database corrupted by Laplacian noise
Fig. 2
figure 2

Clustering performance on the ORL database with sample-specific corruptions. a Accuracy; b NMI

Table 4 Clustering results on the ORL database corrupted by Gaussian and Laplacian mixed noise

5.1 Experiment on the ORL database

The ORL database [51] consists of 400 human face images of 40 distinct subjects, with 10 different images per subject. For each subject, the images were taken at different times, under varying lighting conditions, and with different facial expressions and facial details. All the images were taken against a dark background with the subjects in upright, frontal positions. The size of the face images is \(32 \times 26\). Sample images of the ORL database are shown in Fig. 1.

We describe the parameter setting here. The parameters are set to \(\lambda =0.01, \lambda _1=0.1, \lambda _2=0.01, \lambda _3=0.1, \lambda _4=0.1\) for the clean data. For the corrupted data, tuning all parameters would cost too much time. Because \(\lambda\) and \(\lambda _4\) are the coefficients of the error terms, they reflect the pollution level; hence, we fine-tune only \(\lambda\) and \(\lambda _4\) for the corrupted data. We choose a set of optimal parameters, repeat five runs, and report the average result as the final performance.

First, we use the ORL database corrupted by Gaussian noise to evaluate our method. The performance of the related methods is listed in Table 2. When the original images are used, the accuracy of RLSLRSSC-L21 is \(86.10\%\), which is higher than that of LSLRSSC (the second-best method) by \(5.40\%\). This is because RLSLRSSC uses a graph constraint rather than a PCA-like term and therefore preserves the local manifold structure well. When the LNI increases gradually from 1 to 5, the performance of all the methods decreases. Overall, RLSLRSSC achieves the best results.

We then evaluate the performance of the related methods on the ORL database corrupted by Laplacian noise. As displayed in Table 3, RLSLRSSC-L21 obtains better results in most cases. Hence, we deduce that the \(l_{2,1}\) norm captures Laplacian noise well.

We then obtain the clustering performance on the ORL database with sample-specific corruptions. As displayed in Fig. 2, SSC has better results when the LCP is chosen from the set {4, 7, 8, 9}, and our algorithm can obtain competitive results. RLSLRSSC can obtain the best results when the LCP belongs to the set {1, 2, 3, 5, 6, 10}. Overall, RLSLRSSC can deal with sample-specific corruptions well.

As Table 4 shows, RLSLRSSC is robust to Gaussian and Laplacian mixed noise. When the LNI varies from 1 to 4, RLSLRSSC outperforms the competing methods. When the LNI = 5, LSLRSSC has the best performance, and RLSLRSSC and LRSSC are competitive, possibly because we fine-tune only the parameters \(\lambda , \lambda _4\) to save parameter-tuning time, so only a local optimum is obtained.

Fig. 3
figure 3

Samples from the UMIST database and contaminated UMIST database. a The original UMIST database. b The UMIST database corrupted by Gaussian noise. c The UMIST database corrupted by Laplacian noise. d The UMIST database corrupted by Gaussian and Laplacian mixed noise. The LNI of the corrupted images is 2

Table 5 Clustering results on the UMIST database corrupted by Gaussian noise
Table 6 Clustering results on the UMIST database corrupted by Laplacian noise
Fig. 4
figure 4

Clustering performance on the UMIST database with sample-specific corruptions. a Accuracy, b NMI

Table 7 Clustering results on the UMIST database corrupted by Gaussian and Laplacian mixed noise

5.2 Experiments on the UMIST database

The UMIST database [52] consists of 564 images of 20 individuals. We use a cropped version of the UMIST database that is publicly available on the Web page of the University of Sheffield.Footnote 1 Each image covers a range of poses from profile to frontal views. We resize the images from \(112 \times 92\) to \(32 \times 32\). Sample images of the UMIST database are shown in Fig. 3.

We set the parameters to \(\lambda =0.003, \lambda _1=0.1, \lambda _2=0.01, \lambda _3=0.1, \lambda _4=0.01\) for the clean data. Then, we fine-tune \(\lambda , \lambda _4\) for the corrupted data. Also, we choose a set of optimal parameters, repeat five runs, and calculate the average result to get the final performance.

First, we use the UMIST database corrupted by Gaussian noise. Table 5 shows the results of the different methods. When we use the original UMIST database, the accuracy of RLSLRSSC-L21 is 69.01%, which is higher than that of LRSC (the second-best method) by 19.65%. When the LNI increases gradually from 1 to 5, RLSLRSSC-L1 achieves the best performance in most cases.

Next, we evaluate the performance of all methods on the UMIST database corrupted by Laplacian noise. The results are displayed in Table 6. When \(\hbox {LNI} \le 3\), RLSLRSSC-L1 obtains the best results, and when \(\hbox {LNI} \ge 4\), RLSLRSSC-L21 obtains the best results.

As Fig. 4 shows, the performance of RLSLRSSC on the UMIST database with sample-specific corruptions is better than that of the other methods. When the LCP varies from 1 to 10, the accuracy and NMI of RLSLRSSC are higher than those of the other methods by 8% and 9% at least, respectively. Compared with other algorithms, RLSLRSSC more effectively handles sample-specific corruptions.

Table 7 lists the results of the different methods for the UMIST database corrupted by Gaussian and Laplacian mixed noise. When the UMIST database is corrupted slightly by Gaussian and Laplacian mixed noise, RLSLRSSC performs better than the other methods. However, when the LNI = 5, the performance of LSLRSSC is the best, which may be for the same reason as for the experiment on the ORL database. Overall, RLSLRSSC is suitable to deal with Gaussian and Laplacian mixed noise.

5.3 Experiment on the CMU face database

The CMU face database consists of 624 black-and-white face images of 20 individuals. The images vary in pose (straight, left, right, up), expression (neutral, happy, sad, angry), and eyes (wearing sunglasses or not). The CMU face images dataset with size \(30 \times 32\) is obtained from the UCI repository of machine learning databases.Footnote 2 Sample images of the CMU face database are shown in Fig. 5. We illustrate a 2D visualization of the CMU face database by applying t-SNE [53] to the original data, and the clustering results are shown in Fig. 6.

Fig. 5
figure 5

Samples from the CMU face database and contaminated CMU face database. a The original CMU face database. b The CMU face database corrupted by Gaussian noise. c The CMU face database corrupted by Laplacian noise. d The CMU face database corrupted by Gaussian and Laplacian mixed noise. The LNI of the corrupted images is 2

Table 8 Clustering results on the CMU face database corrupted by Gaussian noise
Table 9 Clustering results on the CMU face database corrupted by Laplacian noise
Table 10 Clustering results on the CMU face database corrupted by Gaussian and Laplacian mixed noise
Fig. 6
figure 6

Clustering result on the CMU face database. a Objective result; b cluster result; and c learned similarity graph

We set the parameters to \(\lambda =0.001\), \(\lambda _1=0.1, \lambda _2=0.01, \lambda _3=0.01, \lambda _4=0.001\) for the original CMU face database. Then, we fine-tune \(\lambda , \lambda _4\) for the corrupted data. Also, we choose a set of optimal parameters, repeat five runs, and report the average result.

First, Gaussian noise is added to the CMU face database. Table 8 shows the results of the different methods. RLSLRSSC achieves about 10% better performance than the other methods under all corruption conditions and thus outperforms the other algorithms on the CMU face database corrupted by Gaussian noise.

Next, we evaluate the performance of all methods on the CMU face database corrupted by Laplacian noise. The results are displayed in Table 9. Compared with other methods, RLSLRSSC-L1 achieves better results.

We then obtain the clustering results on the CMU face database with sample-specific corruptions, displayed in Fig. 7. We observe that when the LCP varies from 1 to 10, the clustering performances of RLSLRSSC-L1 and RLSLRSSC-L21 are close and stable. We conclude that RLSLRSSC is robust to sample-specific corruptions.

Table 10 lists the results of the different methods on the CMU face database corrupted by Gaussian and Laplacian mixed noise. We can see that RLSLRSSC-L21 yields the best results in most cases.

Fig. 7
figure 7

Clustering performance on the CMU face database with sample-specific corruptions. a Accuracy and b NMI

Fig. 8
figure 8

Clustering performance with different values of \(\lambda _1\) on the ORL database. a Accuracy and b NMI

Fig. 9
figure 9

Clustering performance with different values of \(\lambda _3\) on the ORL database. a Accuracy and b NMI

Fig. 10
figure 10

Clustering performance with different values of \(\lambda , \lambda _4\) on the ORL database corrupted by Gaussian and Laplacian mixed noise with the \({\mathrm{LNI}} = 3\). a Accuracy for RLSLRSSC-L1. b NMI for RLSLRSSC-L1. c Accuracy for RLSLRSSC-L21. d NMI for RLSLRSSC-L21

5.4 Parameter setting

In this subsection, we describe the parameter setting in all experiments. For SSC, the weighting parameter for the \(l_1\) norm is chosen from the set {0.1, 1, 10, 20, 50}. In LSR, the parameter for \(\Vert Z \Vert _F\) is chosen from the set {0.01, 0.1, 1, 10, 100}. In most cases, the parameter is set to 100. In LRR, the parameter for the nuclear norm is selected from the set {0.01, 0.1, 1, 10, 100}. In LRSC, the regularization parameter \(\tau\) is selected from the set {0.01, 0.1, 1, 10, 100}. The regularization parameter \(\alpha\) has the default setting, \(\alpha = 0.5 \times \tau\). In BDR, the parameters \(\lambda , \gamma\) are selected from the set {0.001, 0.01, 0.1, 1, 10, 100}. In LRSSC, LSLRSSC, and our proposed method RLSLRSSC, the parameter for the nuclear norm \(\lambda _1\) is selected from the set {0.01, 0.1, 1, 10}, and the parameter for the \(l_1\) norm \(\lambda _2\) is \(\lambda _2 = 0.1 \times \lambda _1\). In LSLRSSC and RLSLRSSC, the dimension of latent space is fixed at 100.

There are seven parameters in our model that need to be tuned: \(\lambda , \lambda _1, \lambda _2, \lambda _3, \lambda _4\) in Eq. (13) and k, t in Eq. (10). According to the suggestions of [44], the values of k and t should be between 4 and 7; the larger the dataset, the larger the values of k and t. We set k and t to 4 because the datasets used in the experiments are small. In the experiments, we find that our algorithm achieves better performance when \(\lambda _2 = 0.1 \times \lambda _1\). Because there are many combinations of the remaining four parameters \(\lambda , \lambda _1, \lambda _3, \lambda _4\), we choose their optimal values in two steps, as sketched below. In the first step, we assume that there are few outliers in the clean dataset, so we set the value of \(\lambda _4\) slightly large, \(\lambda _4 = 1\). Then, we fix \(\lambda _4\) and search for the optimal values of \(\lambda , \lambda _1, \lambda _3\). The value of \(\lambda\) is selected from the set {0.001, 0.01, 0.1, 1}, and the values of \(\lambda _1, \lambda _3\) are selected from the set {0.01, 0.1, 1, 10}. In the second step, we fix the other parameters and search for the optimal values of \(\lambda , \lambda _4\) in {0.001, 0.003, 0.005, 0.01, 0.05, 0.1, 0.5, 1}. For the contaminated dataset, we fine-tune only \(\lambda , \lambda _4\) to save time. We found that the optimal value of \(\lambda\) is between 0.001 and 0.01 and the optimal value of \(\lambda _4\) is between 0.001 and 0.1 for the clean dataset. Hence, for the contaminated dataset, \(\lambda\) is chosen from the set {0.001, 0.002, 0.003, 0.004, 0.005, 0.01} and \(\lambda _4\) is chosen from the set {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1}.
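The two-step search described above can be organized as a simple grid search; in the sketch below, `evaluate` is a hypothetical callback that runs RLSLRSSC with the given parameters \((\lambda , \lambda _1, \lambda _2, \lambda _3, \lambda _4)\) and returns a clustering score.

```python
# Sketch of the two-step parameter search of Sect. 5.4 for the clean data;
# `evaluate(lam, lam1, lam2, lam3, lam4)` is a hypothetical scoring callback.
from itertools import product

def two_step_search(evaluate):
    # Step 1: fix lambda4 = 1 and search lambda, lambda1, lambda3 (lambda2 = 0.1 * lambda1).
    lam, lam1, lam3 = max(
        product([0.001, 0.01, 0.1, 1], [0.01, 0.1, 1, 10], [0.01, 0.1, 1, 10]),
        key=lambda p: evaluate(p[0], p[1], 0.1 * p[1], p[2], 1.0))
    # Step 2: fix lambda1, lambda3 and search lambda, lambda4 over a finer grid.
    fine = [0.001, 0.003, 0.005, 0.01, 0.05, 0.1, 0.5, 1]
    lam, lam4 = max(
        product(fine, fine),
        key=lambda p: evaluate(p[0], lam1, 0.1 * lam1, lam3, p[1]))
    return lam, lam1, 0.1 * lam1, lam3, lam4
```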

We investigate the sensitivity of \(\lambda _1\) and \(\lambda _2\). The values \(\lambda = 0.01, \lambda _3 = 0.1, \lambda _4 = 0.1\) are set for RLSLRSSC, and the effects of the other two parameters are explored. \(\lambda _2\) has the default value \(\lambda _2 = 0.1 \times \lambda _1\), so we analyze only the effects of \(\lambda _1\). We use the original ORL database to conduct this experiment and average the results over five runs. The results are shown in Fig. 8. The performances of RLSLRSSC-L1 and RLSLRSSC-L21 are very close. When \(\lambda _1\) increases from 0.001 to 0.1, the performance of RLSLRSSC improves remarkably. When \(\lambda _1\) increases from 0.1 to 1, the performance decreases sharply. Hence, an optimal value is easily found at \(\lambda _1 = 0.1\).

Then, to study the parameter sensitivity of \(\lambda _3\), we set \(\lambda = 0.01, \lambda _1 = 0.1, \lambda _2 = 0.01, \lambda _4 = 0.1\) for RLSLRSSC and examine the effects of \(\lambda _3\). We also use the original ORL database and average the results over five runs. As shown in Fig. 9, the performance of RLSLRSSC is stable when \(\lambda _3 \ge 0.05\). We notice that when \(\lambda _3 = 0\), the performance decreases remarkably, which shows the importance of the graph constraint term.

Finally, we investigate how the clustering performance on the ORL database changes when the regularization parameters \(\lambda , \lambda _4\) are varied for RLSLRSSC-L1 and RLSLRSSC-L21. We set \(\lambda _1 = 0.1, \lambda _2 = 0.01, \lambda _3 = 0.1\) and then explore the effects of the other two parameters. We use the ORL database corrupted by Gaussian and Laplacian mixed noise with LNI = 3. For each set of parameters, we average the results over five runs. Figure 10 illustrates the final results. The larger the error, the smaller the corresponding weight; i.e., if the value of \(\lambda\) or \(\lambda _4\) decreases, the permitted error increases. The performance of RLSLRSSC-L1 improves to a certain extent when the value of \(\lambda\) or \(\lambda _4\) decreases, so the terms associated with \(\lambda\) and \(\lambda _4\) are shown to be effective. RLSLRSSC-L21 is not as stable as RLSLRSSC-L1. When \(\lambda = 0.001\) and \(\lambda _4\) decreases from 0.005 to 0.001, the performance of RLSLRSSC-L21 declines substantially. In brief, RLSLRSSC-L1 performs stably with different parameters and addresses Gaussian and Laplacian mixed noise well.

6 Conclusion and future works

This work proposes a novel subspace clustering method, RLSLRSSC, to deal with the problem of clustering corrupted face databases. The F-norm together with the \(l_1\) norm or the \(l_{2,1}\) norm is used to capture the noise and outliers. RLSLRSSC simultaneously learns a robust projection and a low-rank and sparse representation in the low-dimensional subspace. By incorporating a graph constraint, the local manifold structure can be preserved, and the discriminative ability of the learned projection can be promoted. We also present extensive experiments to demonstrate that RLSLRSSC is superior to other state-of-the-art methods.

There is, however, still an unsolved problem. To obtain a discriminative latent space, we introduced a graph constraint into our model, and the high-dimensional data can be represented by the low-dimensional embedding in the learned latent space. However, we obtain the low-dimensional embedding via a linear transformation, so for data with a nonlinear structure, our method may not work well. In recent years, deep learning has yielded fruitful results, and data can be nonlinearly mapped into a latent space by using deep learning. In the future, we plan to combine our method with deep learning.