Abstract
Recently, low-rank and sparse representation-based methods have achieved great success in subspace clustering, which aims to cluster data lying in a union of subspaces. However, most methods fail if the data samples are corrupted by noise and outliers. To solve this problem, we propose a novel robust method that uses the F-norm for dealing with universal noise and the \(l_1\) norm or the \(l_{2,1}\) norm for capturing outliers. The proposed method can find a low-dimensional latent space and a low-rank and sparse representation simultaneously. To preserve the local manifold structure of the data, we have adopted a graph constraint in our model to obtain a discriminative latent space. Extensive experiments on several face benchmark datasets show that our proposed method performs better than state-of-the-art subspace clustering methods.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
In the field of computer vision and pattern recognition, dealing with high-dimensional data is a challenging task. In most cases, these high-dimensional data can be represented by a low-dimensional subspace. For example, the face data under different illumination conditions can be approximated by a low-dimensional linear subspace [1]. Subspace clustering aims to cluster the data points drawn from a union of low-dimensional subspaces.
The subspace clustering problem has attracted attention in recent years, and many methods have been proposed to address it, including iterative methods [2,3,4], algebraic methods [5,6,7], statistical methods [8,9,10,11], and methods based on spectral clustering [12,13,14,15,16,17,18,19,20,21,22]. Researchers favor the methods based on spectral clustering, which can obtain better results. These methods divide the spectral clustering problem into two steps. First, an affinity matrix is built from the data. Then, spectral clustering is applied on the affinity matrix to segment the data. The basis of such methods is to construct an informative affinity matrix.
Recently, various algorithms based on sparse representation [13, 14] or low-rank representation [15,16,17, 20, 22] have been proposed in the literature for subspace clustering. Such methods are robust to noise and outliers and do not require the knowledge of the dimensions and the number of subspaces. Most of these methods are based on a self-expressiveness model, which states that each sample can be expressed as a linear combination of other samples. They utilize all samples as a dictionary and find a sparse or low-rank representation matrix. With relatively clean data, \({\mathbf{X}} = {\mathbf{XR }}\), where \({\mathbf{X}}\) is the data matrix and \({\mathbf{R}}\) is the coefficient matrix. With corrupted data, the constraint is changed to \({\mathbf{X}} = {\mathbf{XR }} + {\mathbf{E}}\), where \({\mathbf{E}}\) denotes the reconstruction error. In general, the subspace clustering problem can be summed up as the following optimization problem:
where \(\Vert \cdot \Vert _k\) and \(\Vert \cdot \Vert _t\) are suitable norms, \(\lambda\) is a trade-off parameter, and the constraint \({\mathrm{diag}}({\mathbf{R}} ) = {\mathbf{0}}\) is optionally used to avoid a trivial solution of \({\mathbf{R}}\).
The methods differ in the choice of norms for the regularization on \({\mathbf{R}}\) and/or the reconstruction error \({\mathbf{E}}\). For example, Elhamifar and Vidal proposed sparse subspace clustering (SSC) [13, 14] based on sparse representation (SR) [23]. The \(l_1\) norm is chosen for \(\Vert \cdot \Vert _k\) to get a relatively sparse matrix \({\mathbf{R}}\), and the F-norm and/or the \(l_1\) norm of \({\mathbf{E}}\) are used to handle Gaussian noise and/or outliers. Liu et al. [16, 17] argued that due to learning the sparse representation of each data vector individually, SSC may not be able to capture the global structure of data, and the performance of SSC will be largely depressed when the data is grossly corrupted. Consequently, they proposed a method called the low-rank representation (LRR), which finds the lowest rank representation of all data jointly. They used the nuclear norm \(\Vert \cdot \Vert _*\) for \(\Vert \cdot \Vert _k\) to obtain a relatively low-rank matrix \({\mathbf{R}}\) and the \(l_{2,1}\) norm of \({\mathbf{E}}\) to tackle sample-specific corruptions, where some data vectors are corrupted and others are clean. Vidal et al. [20] held that using data as the dictionary is unreasonable when data are grossly corrupted. They learned a noise-free dictionary from corrupted data and found the low-rank representation based on the learned dictionary. They named this algorithm low-rank subspace clustering (LRSC). In LRSC, the nuclear norm \(\Vert \cdot \Vert _*\) is applied to \(\Vert \cdot \Vert _k\), the F-norm, and the \(l_1\) norm of \({\mathbf{E}}\) is used to handle Gaussian noise and outliers. Wang et al. [21] reported that the representation matrix is often simultaneously sparse and low rank, and they therefore combined SSC and LRR, termed low-rank sparse subspace clustering (LRSSC). In LRSSC, the nuclear norm \(\Vert \cdot \Vert _*\) and the \(\Vert \cdot \Vert _1\) norm are used for \(\Vert \cdot \Vert _k\), and the F-norm of \({\mathbf{E}}\) is used to deal with Gaussian noise. Also, several variants of these algorithms have been proposed, such as least squares regression (LSR) [19], \(l_0\)-SSC [24], SSC-learned orthogonal matching pursuit (SSC-LOMP) [25], subspace clustering with learning an adaptive low-rank graph (SC-LALRG) [26], and subspace clustering with block diagonal representation (BDR) [27]. These methods have yielded state-of-the-art results in different subspace segmentation tasks.
However, finding sparse or low-rank representation is time-consuming when the dimension of the feature is high [14]. To tackle this problem, we can process the high-dimensional data into low-dimensional data using principal component analysis [28] or random projections [29] prior to obtaining sparse or low-rank representation. The major drawback of this method is that it may lose too much information, leading to poor performance. A more reasonable approach is to find the low-dimensional embedding and sparse or low-rank representation jointly. Patel et al. [30] proposed an extension of SSC: latent space sparse subspace clustering (LS3C). Given a dataset, LS3C finds a low-dimensional embedding and a sparse representation matrix simultaneously. They also proposed an extension of LS3C: latent space low-rank and sparse subspace clustering (LSLRSSC) [31]. LSLRSSC finds a low-dimensional embedding and a sparse and low-rank representation matrix simultaneously. Also, Wei et al. [32] proposed an extension LS3C, which they called latent space robust subspace segmentation (LSRS2). In LSRS2, the weighted \(l_1\) norm and the nuclear norm are used for the representation matrix. These methods learn a low-dimensional space without losing too much information, and a PCA-like regularization term is added to their objective functions. PCA aims to discover the global structure of the Euclidean space; nevertheless, the local manifold structure is more necessary than the global Euclidean structure in many real-world application problems [33, 34]. Tang et al. [35] designed a special term describing the property of learned space instead of PCA that can keep the local manifold structure of the high-dimensional data: robust subspace learning-based low-rank representation (RSLLRR). RSLLRR learns a low-rank representation matrix, but the representation matrix is often sparse and low rank. In addition, the methods mentioned above using the F-norm or the \(l_{2,1}\) norm to capture the reconstruction error are not robust [36,37,38].
Consequently, in this paper, we propose a novel algorithm, robust latent space low-rank and sparse subspace clustering (RLSLRSSC). To preserve the local manifold structure of the high-dimensional data, we devise a graph constraint term instead of a PCA-like term in the objective function of our proposed algorithm. Moreover, to make our algorithm robust to noise, we characterize the reconstruction error more precisely. Due to the limited representational capability with insufficient data and the error in data acquisition and transmission, we assume a portion of the error obeys a Gaussian distribution. Hence, we use the F-norm to depict Gaussian noise. We use the \(l_1\) norm or the \(l_{2,1}\) norm to depict outliers. In summary, our main contributions are as follows:
- 1.
By adopting a graph constraint term rather than a PCA-like term, we propose a novel representation learning method that finds a discriminating latent space and a low-rank and sparse representation matrix simultaneously.
- 2.
We depict the noise more accurately, so the robustness of our algorithm will be enhanced.
- 3.
A simple and efficient iterative method is proposed to optimize our algorithm.
The remainder of this paper is organized as follows. In Sect. 2, we briefly review some related work. Section 3 presents the proposed method RLSLRSSC. In Sect. 4, we describe the optimization algorithms, the time complexity, and convergence analysis. Section 5 reports experimental results. Section 6 concludes this paper.
2 Related work
In this section, we briefly review sparse and low-rank subspace clustering methods, including SSC [13, 14], LRR [16, 17], LRSSC [21], and LSLRSSC [31]. For brevity, we summarize some notations in Table 1.
2.1 Sparse subspace clustering (SSC)
Sparse representation (SR) [23] is an effective tool for representing and compressing high-dimensional data. In the recent years, an increasing number of methods have been proposed based on SR. Also, a lot of applications derive from SR, such as Tree2Vector [39, 40]. Similarly, SSC [13, 14] learns the affinity matrix for a dataset based on SR. Given a dataset, \({\mathbf{X}} = [{\mathbf{x}} _1, {\mathbf{x}} _2, \ldots , {\mathbf{x}} _N] \in {\mathbf{R}} ^{D \times N}\) drawn from a union of Q independent linear subspaces \(S_1, S_2, \ldots , S_Q\). Each sample \({\mathbf{x}} _i \in {\mathbf{R}} ^D\) can be linearly and approximately represented by a combination of a few points from the rest of the samples in \({\mathbf{X}}\). The idea of SSC can be expressed as follows:
where \({\mathbf{R}} = [{\mathbf {r}} _1, {\mathbf {r}} _2, \ldots , {\mathbf {r}} _N] \in {R}^{N \times N}\) is a coefficient matrix whose column \({\mathbf {r}} _i\) is the representation vector corresponding to \({\mathbf{x}} _i\), \({\mathrm{diag}}({\mathbf{R}} ) \in {R}^{N}\) is the vector whose elements equal the diagonal elements of \({\mathbf{R}}\), and \({\mathbf{0}} \in {R}^{N}\) is a vector containing zeros only.
In fact, the data lie in a union of affine subspaces. In this condition, we obtain the following optimization:
where \({\mathbf {1}}\) is a vector containing one only. Considering that the data may be corrupted, the constraint \({\mathbf{X}} = {\mathbf{XR }}\) should be relaxed to \({\mathbf{X}} = {\mathbf{XR }} + {\mathbf{E}}\). \({\mathbf{E}}\) is the noise matrix, and the F-norm is used to capture the noise. The following problem, therefore, can be solved to obtain the representation matrix \({\mathbf{R}}\):
where \(\lambda\) is a regularization parameter. This problem can be solved by using the alternating direction method of multipliers (ADMM) [41].
2.2 Low-rank representation (LRR)
LRR [16, 17] uses low-rank representation to construct an affinity matrix for a group of data. Similar to SSC, LRR uses the data \({\mathbf{X}}\) as the dictionary. When the data is noise-free, a low-rank representation \({\mathbf{R}}\) can be obtained by solving the problem
However, the above optimization problem is difficult to solve. As a common approach to addressing the above problem, the rank function is replaced by its convex surrogate, the nuclear norm. When the data is corrupted by some arbitrary noise E, i.e., \({\mathbf{X}} = {\mathbf{XR }} + {\mathbf{E}}\), the \(l_{2,1}\) norm is employed to deal with the noise. The objective function of LRR is
The above problem can be solved by an inexact augmented Lagrange multiplier (ALM) method [42].
2.3 Low-rank and sparse subspace clustering (LRSSC)
The representation matrix \({\mathbf{R}}\) is often simultaneously sparse and low rank. By combining SSC and LRR, a low-rank and sparse representation can be learned. In LRSSC [21], the representation matrix can be found by solving the following optimization problem:
where \(\lambda _1\) and \(\lambda _2\) are two regularization parameters. This problem can be solved using the ADMM [41] method.
2.4 Latent space low-rank and sparse subspace clustering (LSLRSSC)
Finding low-rank and sparse representation is very time-consuming. Dissimilar to LRSSC, LSLRSSC [31] finds a latent low-dimensional space and a low-rank and sparse representation matrix simultaneously. LSLRSSC can be expressed as the following problem:
where \(\lambda _1\), \(\lambda _1\), \(\lambda _1\) are three trade-off parameters and \({\mathbf{P}}\) is a projection matrix. The last term is used for guaranteeing that the projection does not lose too much information in the original domain. This problem can be solved by optimizing \({\mathbf{P}}\) and \({\mathbf{R}}\) alternatively using ADMM [41]. In SSC, LRR, LRSSC, and LSLRSSC, once the representation matrix \({\mathbf{R}}\) is learned, the affinity matrix \({\mathbf{W}}\) can be constructed by \({\mathbf{W}} = |{\mathbf{R}} | + |{\mathbf{R}} ^{\mathrm{T}}|\), where \(|{\mathbf{R}} |\) is the modulus of \({\mathbf{R}}\). Finally, spectral clustering [43] is used to obtain clustering results.
3 Robust latent space low-rank and sparse subspace clustering (RLSLRSSC)
In this section, we present a novel method, robust latent space low-rank and sparse subspace clustering (RLSLRSSC). Our model consists of three steps. First, it projects the high-dimensional data onto a lower dimensional space, and we add a graph constraint to promote the discriminative ability of the projected space. Second, the projected data is represented by a low-rank and sparse representation matrix. Third, we need to deal with the noise and outliers. Simply, we aim to learn a low-rank and sparse representation matrix and a discriminative projection matrix simultaneously by using a graph constraint. RLSLRSSC can be expressed as the following optimization problem:
where \(\lambda , \lambda _1, \lambda _2, \lambda _3\) are four regularization parameters, \({\mathbf{P}}\) is a projection matrix, \({\mathbf{R}}\) is a low-rank and sparse representation matrix, \({\mathbf {G}}\) denotes universal noise, and \({\mathbf{E}}\) denotes outliers. In most cases, universal noise obeys a Gaussian distribution, and we can apply the F-norm for \({\mathbf {G}}\), \(f({\mathbf {G}} ) = \Vert {\mathbf {G}} \Vert _F^2\). One can find both sparse and low-rank representation as is performed in LRSSC by setting \(f_1({\mathbf{R}} ) = \Vert {\mathbf{R}} \Vert _* + \alpha \Vert {\mathbf{R}} \Vert _1\), where \(\alpha\) is a trade-off parameter. To deal with outliers, the \(l_1\) norm or the \(l_{2,1}\) norm is employed for \({\mathbf{E}}\), \(f_3({\mathbf{E}} ) = \Vert {\mathbf{E}} \Vert _1\) or \(f_3({\mathbf{E}} ) = \Vert {\mathbf{E}} \Vert _{2,1}\). \(f_2({\mathbf{P}} )\) describes the property of the learned space, which enforces discrimination of the learned space.
The graph constraint term \(f_2({\mathbf{P}} )\) learns a transformation matrix, which can preserve useful information and projects the data onto a discriminative latent space, where different classes are separated more easily compared with the original space. Because the local manifold structure is more significant than the global Euclidean structure [33, 34], preserving the local manifold structure in the projected space is more reasonable. We hope that if two data points \({\mathbf{x}} _i\) and \({\mathbf{x}} _j\) are close in the original space, their corresponding low-dimensional embedding \({\mathbf {y}} _i\) and \({\mathbf {y}} _j\) should be close, too. \(f_2({\mathbf{P}} )\) is designed to preserve the preferable relationship between the data samples. As a result, we choose the k nearest neighbors of each data point by Euclidean distance. Then, the adjacency graph \({\mathbf{W}}\) can be obtained with the k nearest neighbors by using radial basis function (RBF). In RBF, it is important to select the value of \(\sigma\). However, when the data includes clusters with different local statistics, the adjacency graph may not be accurate even using the optimal \(\sigma\). In addition, it is hard to get the optimal \(\sigma\) because there are many parameters in our model. To overcome these problems, \({\mathbf{W}}\) is defined as
where \(N({\mathbf{x}} _i)\) denotes the nearest neighbors of \({\mathbf{x}} _i\) and \(\sigma _i\) is a local scaling [44] parameter for \({\mathbf{x}} _i\), which permits self-tuning of the point-to-point distances according to the local statistics of the neighborhoods of \({\mathbf{x}} _i\). The local scale \(\sigma _i\) can be defined as \(\sigma _i = d({\mathbf{x}} _i,{\mathbf{x}} _t)\), where \({\mathbf{x}} _t\) is the tth neighbor of \({\mathbf{x}} _i\).
We formulate the graph constraint term \(f_2({\mathbf{P}} )\) as
where \({\mathbf {y}} _i\) is the low-dimensional embedding corresponding to \({\mathbf{x}} _i\) and we have \({\mathbf {y}} _i = {\mathbf{P}} {} {\mathbf{x}} _i\). Let \({\mathbf {D}}\) be a diagonal matrix with \({\mathbf {D}} _{jj} = \sum _{j=1} {\mathbf{W}} _{ij}\), and \({\mathbf{L}}\) be the Laplacian matrix with \({\mathbf{L}} = {\mathbf {D}} - {\mathbf{W}}\). The objective function (11) can be modified as the following equivalent problem:
The above optimization problem can be seen as a simpler one in orthogonal locality preserving projections (OLPP) [45,46,47]. If the sample size is less than the feature dimension of the sample, we will come across the undersampled size problem. To ensure that the matrix \({\mathbf{ M}} = {\mathbf{XL }}{} {\mathbf{X}} ^{\mathrm{T}}\) will be nonsingular, we may use PCA to reduce the dimensionality of the data in advance.
By expanding \(f({\mathbf {G}} )\), \(f_1({\mathbf{R}} )\), \(f_2({\mathbf{P}} )\) , and \(f_3({\mathbf{E}} )\), our proposed model (9) can be expressed as
where \(\Vert \cdot \Vert _l\) denotes a certain norm. We consider two kinds of norms on \({\mathbf{E}}\): (i) the \(l_1\) norm, resulting in a model denoted by RLSLRSSC-L1; and (ii) the \(l_{2,1}\) norm, resulting in a model denoted by RLSLRSSC-L21.
4 Optimization
The optimization problem (13) is difficult to solve, and it is not jointly convex concerning (\({\mathbf{P}} ,{\mathbf{R}} ,{\mathbf{E}}\)). We can divide the optimization problem (13) into three subproblems to find the projection matrix \({\mathbf{P}}\), the representation matrix \({\mathbf{R}}\), and the error matrix \({\mathbf{E}}\) separately. These subproblems are solved alternatively by updating one variable and fixing the other ones. Each subproblem will be discussed in detail in the following section.
4.1 Update of projection matrix \({\mathbf{P}}\)
Given fixed \({\mathbf{R}}\) and \({\mathbf{E}}\), the cost function in Eq. (13) is further reduced to
The above objective function is nonconvex with a local minimum of it obtained as follows. First, considering \({\mathbf{P}} {} {\mathbf{P}} ^{\mathrm{T}} = {\mathbf {I}}\), we have
where \(\varPhi ({\mathbf{P}} ) = ({\mathbf{X}} -{\mathbf{XR }}-{\mathbf{P}} ^{\mathrm{T}}{} {\mathbf{E}} )({\mathbf{X}} -{\mathbf{XR }}-{\mathbf{P}} ^{\mathrm{T}}{} {\mathbf{E}} )^{\mathrm{T}}\). In this case, the objective function in Eq. (14) can then be rewritten as
To optimize the above minimization in the tth iteration, we use \(\varPhi ({\mathbf{P}} _{t-1})\) to estimate the \(\varPhi ({\mathbf{P}} )\) in Eq. (16), where \({\mathbf{P}} _{t-1}\) is the projection matrix acquired in \(({t-1})\)th iteration. Using the eigenvalue decomposition (EVD) technique, we have
Then, we can update \({\mathbf{P}}\) as the d eigenvectors in \({\mathbf {U}}\) associated with the first d smallest eigenvalues in \({\mathbf {S}}\), i.e., \({\mathbf{P}} _t = {\mathbf {U}} (:,1:d)^{\mathrm{T}}\).
4.2 Update of representation matrix \({\mathbf{R}}\)
We then can optimize \({\mathbf{R}}\) with \({\mathbf{P}}\) and \({\mathbf{E}}\) fixed. By ignoring irrelevant terms, the cost function in Eq. (13) reduces to
To solve the above minimization, we need to introduce two auxiliary variables, \({\mathbf {J}}\) and \({\mathbf {S}}\) and reformulate the optimization problem. For brevity, let \({\mathbf{A}} = {\mathbf{PX }}\) and \({\mathbf{B}} = {\mathbf{PX }}-{\mathbf{E}}\). We can rewrite Eq. (18) as
We can solve the above problem by using the inexact augmented Lagrangian multiplier (ALM) [42] method. The augmented Lagrangian function of Eq. (19) is
where \({\varvec{\varLambda }}_1\), \({\varvec{\varLambda }}_2\) , and \({\varvec{\varLambda }}_3\) are three Lagrangian multipliers and \(u_1\), \(u_1\), \(u_3\) are three penalty factors. The optimization of Eq. (20) can be found in Algorithm 1.
4.3 Update of noise matrix \({\mathbf{E}}\)
In order to solve \({\mathbf{E}}\), we keep \({\mathbf{P}}\) and \({\mathbf{R}}\) fixed. Hence, the objective function in Eq. (13) can be rewritten as
when the \(l_1\) norm is used for noise, we have
This problem can be efficiently solved by shrinkage [48]. Whereas the \(l_{2,1}\) norm is used for noise, we have
According to Lemma 1, we can update \({\mathbf{E}}\) by
Lemma 1
[49] Given a matrix\({\mathbf {y}} = [{\mathbf {y}} _1, {\mathbf {y}} _2, \ldots , {\mathbf {y}} _N]\), if the optimal solution of
is\({\mathbf {Z}} ^*\), and then, the ith column of\({\mathbf {Z}} ^*\)is
4.4 The algorithms
We first describe the algorithmic procedure for solving \({\mathbf{R}}\) in Algorithm 1. Then, we discuss the whole algorithmic procedure for solving Eq. (13) in Algorithm 2. Once the representation matrix \({\mathbf{R}}\) is learned, spectral clustering is applied on the affinity matrix \({\mathbf{W}} = |{\mathbf{R}} |^{\mathrm{T}} + |{\mathbf{R}} |\) to obtain the clustering result. The proposed methods are summarized in Algorithm 3.
4.5 Complexity and convergence analysis
4.5.1 Time complexity
We analyze the time complexity of two main subproblems of RLSLRSSC optimization as follows:
- 1.
The most time-consuming subproblem is computing low-rank and sparse representation. In Algorithm 1, step 5 is the most time-consuming due to singular value decomposition (SVD) with the cost of \(O(N)^3\), where N is the number of data samples. The time complexity of computing the matrix inverse in step 3 costs \(O(d)^3\), where d is the dimension of latent space. In short, the total time complexity of Algorithm 1 is \(i_1 \ O(N)^3\), where \(i_1\) is the iterative number of this algorithm.
- 2.
For updating the projection matrix \({\mathbf{P}}\), the most time-consuming step is EVD. The time complexity of this step is also \(O(N)^3\).
Algorithm 1 is a step of Algorithm 2, and as a result, the total time complexity of RLSLRSSC is \(i_1 \ i_2 \ O(N)^3\), where \(i_1\) is the iterative number of Algorithm 1, and \(i_2\) is the number of iterations of Algorithm 2.
Then, we analyze the time complexities of all the compared methods briefly. For SSC, the time complexity is \(iD \times O(N)^3\), where i is the iterative number and D is the size of dimension. For LSR and LRSC, the time complexity is \(O(N)^3\). For LRR and BDR, the time complexity is \(i \ O(N)^3\). The time complexity of LSLRSSC and our algorithm is same, which is \(i_1 \ i_2 \ O(N)^3\).
4.5.2 Convergence analysis
A proof to ensure the convergence of the inexact ALM method can be found in [17], so we assume Algorithm 1 is convergent. Now, we consider the convergence of Algorithm 2.
Proposition 1
The cost function \(J = \lambda \Vert {\mathbf{PX }} - {\mathbf{PXR }} - {\mathbf{E}} \Vert _F^2 + \lambda _1 \Vert {\mathbf{R}} \Vert _* + \lambda _2 \Vert {\mathbf{R}} \Vert _1 + \lambda _3 tr({\mathbf{PXL }}{} {\mathbf{X}} ^{\mathrm{T}}{} {\mathbf{P}} ^{\mathrm{T}}) + \lambda _4 \Vert {\mathbf{E}} \Vert _l\) converges to a minimum by employing Algorithm 2.
Proof
First, it is easy to prove that \(J \ge 0\) for any \({\mathbf{P}} , {\mathbf{R}} , {\mathbf{E}}\).
Then, based on Algorithm 2, assume we have completed t iterations and \({\mathbf{P}} _t, {\mathbf{R}} _t, {\mathbf{E}} _t\) are obtained. In the \((t+1)\)th iteration, \({\mathbf{R}} _{t+1} = {\mathrm{argmin}}_{\mathbf{R}} \ J({\mathbf{P}} _t, {\mathbf{R}} _t, {\mathbf{E}} _t)\) after step 3. Therefore, \(J({\mathbf{P}} _t, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t) \le J({\mathbf{P}} _t, {\mathbf{R}} _t, {\mathbf{E}} _t)\). Similarly, we also have \({\mathbf{P}} _{t+1} = {\mathrm{argmin}}_{\mathbf{P}} \ J({\mathbf{P}} _t, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t)\) and \(J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t) \le J({\mathbf{P}} _t, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t)\) after step 4. Finally, we have \({\mathbf{E}} _{t+1} = {\mathrm{argmin}}_{\mathbf{E}} \ J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t)\) and \(J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _{t+1}) \le J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _t)\) after step 5. Hence, \(J({\mathbf{P}} _{t+1}, {\mathbf{R}} _{t+1}, {\mathbf{E}} _{t+1}) \le J({\mathbf{P}} _{t}, {\mathbf{R}} _{t}, {\mathbf{E}} _t)\), and we can deduce that RLSLRSSC is convergent.
5 Experimental results
In this section, we carry out several experiments on three face benchmark databases to demonstrate the superior clustering performance of our proposed method RLSLRSSC. We compare our RLSLRSSC with SSC, LSR, LRR, LRSC, BDR, LRSSC, and LSLRSSC. Two measures are employed to evaluate the performance of all approaches, i.e., clustering accuracy (ACC) and normalized mutual information (NMI) [50]. To test the robustness of these methods, we show their results on the databases corrupted by four kinds of noise: Gaussian noise, Laplacian noise, sample-specific corruptions, and Gaussian and Laplacian mixed noise. We describe four kinds of noise experiments in detail in the following paragraphs.
For the first experiment, we add Gaussian noise to the database. The level of noise intensity (LNI) varies from 0 to 5. When the LNI = 0, the original database is used. The corrupted data are constructed by adding Gaussian noise with zero mean and variance \(0.01 \times \hbox {LNI}\).
For the second experiment, we add Laplacian noise to each image. Similar to the first experiment, the LNI varies from 1 to 5. The corrupted data are generated by adding Laplacian noise with the location parameter zero and the scale parameter \(0.05\times \hbox {LNI}\).
For the third experiment, we add Gaussian noise to some selected samples. The level of corruption percentage (LCP) varies from 1 to 10. We randomly select \(\hbox {LCP}\times 10\%\) percentage of the data samples and add Gaussian noise with the LNI = 2 to them.
For the last experiment, we add two kinds of noise to the images: Gaussian noise and Laplacian noise. The LNI varies from 1 to 5. To construct the contaminated data, Gaussian noise with zeros mean and variance \(0.01 \times \hbox {LNI}\) is added to all images, and then, Laplacian noise with the location parameter zero and the scale parameter \(0.05 \times \hbox {LNI}\) is added.
5.1 Experiment on the ORL database
The ORL database [51] consists of 400 human face images, with 40 distinct subjects containing 10 different images. For each subject, the images were taken under varying lighting conditions, with different facial expressions and facial details at different times. All the images were taken against a dark background with upright, frontal positions. The size of the face images is \(32 \times 26\). The images of the ORL database are shown in Fig. 1.
We describe the parameter setting here. The parameters are set to \(\lambda =0.01, \lambda _1=0.1, \lambda _2=0.01, \lambda _3=0.1, \lambda _4=0.1\) for the clean data. For the corrupted data, tuning all parameters will cost too much time. Because \(\lambda , \lambda _4\) are the coefficients of the error matrix, they indicate the pollution levels. Hence, we fine-tune only \(\lambda , \lambda _4\) for the corrupted data. We choose a set of optimal parameters and then repeat five runs and obtain five results and calculate the average result to obtain the final performance.
First, we use the ORL database corrupted by Gaussian noise to demonstrate our method. The performance of related methods is listed in Table 2. When the original images are used, the accuracy of RLSLRSSC-L21 is \(86.10\%\), which is higher than LSLR-SSC (the second place) by \(5.40\%\), which is because RLSLRSSC uses a graph constraint rather than a PCA-like term, so RLSLRSSC can preserve the local manifold structure well. When the LNI increases gradually from 1 to 5, the performance of all the methods decreases. Overall, RLSLRSSC achieves the best result.
We evaluate the performance of related methods on the ORL database corrupted by Laplacian noise. The results are listed in Table 3. As displayed in Table 3, in most cases, RLSLRSSC-L21 can obtain better results. Hence, we deduce that the \(l_{2,1}\) norm can capture Laplacian noise well.
We then obtain the clustering performance on the ORL database with sample-specific corruptions. As displayed in Fig. 2, SSC has better results when the LCP is chosen from the set {4, 7, 8, 9}, and our algorithm can obtain competitive results. RLSLRSSC can obtain the best results when the LCP belongs to the set {1, 2, 3, 5, 6, 10}. Overall, RLSLRSSC can deal with sample-specific corruptions well.
As Table 4 shows, RLSLRSSC is robust to Gaussian and Laplacian mixed noise. When the LNI varies from 1 to 4, the performance of RLSLRSSC precedes the competing methods. When the LNI = 5, LSLRSSC has the best performance, RLSLRSSC and LRSSC are competitive, possibly because if we fine-tune only the parameter \(\lambda , \lambda _4\) for saving the time of parameter adjustment, a local optimum is obtained.
5.2 Experiments on the UMIST database
The UMIST database [52] consists of 564 images of 20 individuals. We use a cropped version of the UMIST database that is publicly available on the Web page of University of Sheffield.Footnote 1 Each image covers a range of poses from profile to frontal views. We resize the images from \(112 \times 92\) to \(32 \times 32\). The images of UMIST database are shown in Fig. 3.
We set the parameters to \(\lambda =0.003, \lambda _1=0.1, \lambda _2=0.01, \lambda _3=0.1, \lambda _4=0.01\) for the clean data. Then, we fine-tune \(\lambda , \lambda _4\) for the corrupted data. Also, we choose a set of optimal parameters, repeat five runs, and calculate the average result to get the final performance.
First, we use the UMIST database corrupted by Gaussian noise. Table 5 shows the results of the different methods. When we use the original UMIST database, the accuracy of RLSLRSSC-L21 is 69.01%, which is higher than LRSC (the second place) by 19.65%. When the LNI increases gradually from 1 to 5, RLSLRSSC-L1 achieves the best performance in most cases.
Next, we evaluate the performance of all methods on the UMIST database corrupted by Laplacian noise. The results are displayed in Table 6. When \(\hbox {LNI} \le 3\), RLSLRSSC-L1 obtains the best results, and when \(\hbox {LNI} \ge 4\), RLSLRSSC-L21 obtains the best results.
As Fig. 4 shows, the performance of RLSLRSSC on the UMIST database with sample-specific corruptions is better than that of the other methods. When the LCP varies from 1 to 10, the accuracy and NMI of RLSLRSSC are higher than those of the other methods by 8% and 9% at least, respectively. Compared with other algorithms, RLSLRSSC more effectively handles sample-specific corruptions.
Table 7 lists the results of the different methods for the UMIST database corrupted by Gaussian and Laplacian mixed noise. When the UMIST database is corrupted slightly by Gaussian and Laplacian mixed noise, RLSLRSSC performs better than the other methods. However, when the LNI = 5, the performance of LSLRSSC is the best, which may be for the same reason as for the experiment on the ORL database. Overall, RLSLRSSC is suitable to deal with Gaussian and Laplacian mixed noise.
5.3 Experiment on the CMU face database
The CMU face database consists of 624 black-and-white face images of 20 individuals. The images vary in pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes (wearing sunglasses or not), and back. The CMU face images dataset with size \(30 \times 32\) is obtained from the UCI repository of machine learning databases.Footnote 2 The CMU face database can be found in Fig. 5. We illustrate the 2D visualization of the CMU face by conducting t-sne [53] on the original CMU face database, and the clustering results are shown in Fig. 6).
We set the parameters to \(\lambda =0.001\), \(\lambda _1=0.1, \lambda _2=0.01, \lambda _3=0.01, \lambda _4=0.001\) for the original CMU face database. Then, we fine-tune \(\lambda , \lambda _4\) for the corrupted data. Also, we choose a set of optimal parameters and then repeat five runs and obtain the average result.
First, Gaussian noise is added to the CMU face database. Table 8 shows the results of different methods. RLSLRSSC achieves about 10% better performance than the other methods under all corruption conditions. RLSLRSSC outperforms other algorithms on the CMU face database corrupted by Gaussian noise.
Next, we evaluate the performance of all methods on the CMU face database corrupted by Laplacian noise. The results are displayed in Table 9. Compared with other methods, RLSLRSSC-L1 achieves better results.
We then obtain the clustering results on the CMU face database with sample-specific corruptions displayed in Fig. 7. We observe that when the LCP varies from 1 to 10, the clustering performances of RLSLRSSC-L1 and RLSLRSSC-L21 are close and stable. We conclude that RLSLRSSC is robust to sample-specific corruptions.
Table 10 lists the results of the different methods on the CMU face database corrupted by Gaussian and Laplacian mixed noise. We can see that RLSLRSSC-L21 yields the best results in most cases.
5.4 Parameter setting
In this subsection, we describe the parameter setting in all experiments. For SSC, the weighting parameter for the \(l_1\) norm is chosen from the set {0.1, 1, 10, 20, 50}. In LSR, the parameter for \(\Vert Z \Vert _F\) is chosen from the set {0.01, 0.1, 1, 10, 100}. In most cases, the parameter is set to 100. In LRR, the parameter for the nuclear norm is selected from the set {0.01, 0.1, 1, 10, 100}. In LRSC, the regularization parameter \(\tau\) is selected from the set {0.01, 0.1, 1, 10, 100}. The regularization parameter \(\alpha\) has the default setting, \(\alpha = 0.5 \times \tau\). In BDR, the parameters \(\lambda , \gamma\) are selected from the set {0.001, 0.01, 0.1, 1, 10, 100}. In LRSSC, LSLRSSC, and our proposed method RLSLRSSC, the parameter for the nuclear norm \(\lambda _1\) is selected from the set {0.01, 0.1, 1, 10}, and the parameter for the \(l_1\) norm \(\lambda _2\) is \(\lambda _2 = 0.1 \times \lambda _1\). In LSLRSSC and RLSLRSSC, the dimension of latent space is fixed at 100.
There are seven parameters in our model, which need to be tuned: \(\lambda , \lambda _1, \lambda _2, \lambda _3, \lambda _4\) in Eq. (13), k, t in (10). According to the suggestions of [44], the values of k, t are between 4 and 7, the larger the dataset size, the larger the values of k, t. The values of k, t are 4 because the size of dataset used in experiment is small. In the experiments, we find out that our algorithm can get better performance when \(\lambda _2 = 0.1 \times \lambda _1\). Because there are many combinations of remaining four parameters: \(\lambda , \lambda _1, \lambda _3, \lambda _4\), we can choose the optimal value of the parameters in two steps. In the first step, we assume that there are few outliers in the clean dataset, so we set the value of \(\lambda _4\) slightly large, \(\lambda _4 = 1\). Then, we fix \(\lambda _4\) and search for the optimal value of \(\lambda , \lambda _1, \lambda _3\). The value of \(\lambda\) is selected from the set {0.001, 0.01, 0.1, 1}, and the values of \(\lambda _1, \lambda _3\) are selected from the set {0.01, 0.1, 1, 10}. In the second step, we search for the optimal value of \(\lambda , \lambda _4\) in {0.001, 0.003, 0.005, 0.01, 0.05, 0.1, 0.5, 1}, by fixing other parameters. For the contaminated dataset, we just fine-tune \(\lambda , \lambda _4\) for saving time. We found that the optimal value of \(\lambda\) is between 0.001 and 0.01 and the optimal value of \(\lambda _4\) is between 0.001 and 0.1 for the clean dataset. Hence, for the contaminated dataset, \(\lambda\) is chosen from the set {0.001, 0.002, 0.003, 0.004, 0.005, 0.01} and \(\lambda _4\) is chosen from the set {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1}.
We investigate how sensitive the \(\lambda _1, \lambda _2\) is. The value of \(\lambda = 0.01, \lambda _3 = 0.1, \lambda _4 = 0.1\) is set for RLSLRSSC, and the effects of the other two parameter are explored. \(\lambda _2\) has the default value of \(\lambda _2 = 0.1 \times \lambda _1\), so we analyze only the effects of \(\lambda _1\). We use the original ORL database to conduct our experiment and obtain the average results over five runs. The results are shown in Fig. 8. The performances of RLSLRSSC-L1 and RLSLRSSC-L21 are very close. When \(\lambda _1\) increases from 0.001 to 0.1, the performance of RLSLRSSC improves remarkably. When \(\lambda _1\) increases from 0.1 to 1, the performance decreases sharply. Hence, an optimal value can be discovered easily when \(\lambda _1\) = 0.1.
Then, to research the parameter sensitivity of \(\lambda _3\), we set the value of \(\lambda = 0.01, \lambda _1 = 0.1, \lambda _1 = 0.01, \lambda _4 = 0.1\) for RLSLRSSC and study the effects of \(\lambda _3\). We also use the original ORL database to conduct the experiment and get the average results over five runs. As shown in Fig. 9, the performance of RLSLRSSC is stable when \(\lambda _3 \ge 0.05\). We notice that when \(\lambda _3 = 0\), the performance decreases remarkably, which shows the importance of the graph constraint term.
Finally, we investigate the clustering performance on the ORL database changes by varying the regularization parameter \(\lambda , \lambda _4\) for RLSLRSSC-L1 and RLSLRSSC-L21. We set the value of \(\lambda _1 = 0.1, \lambda _2 = 0.01, \lambda _3 = 0.1\) and then explore the effects of the other two parameters. We use the ORL database corrupted by Gaussian and Laplacian mixed noise with the value of the LNI = 3. For each set of parameters, we average the results over five runs. Figure 10 illustrates the final results. The larger the error, the smaller the weight, i.e., if the value of \(\lambda\) or \(\lambda _4\) decreases, the error increases. The performance of RLSLRSSC-L1 is improved to a certain extent when the value of \(\lambda\) or \(\lambda _4\) decreases, so the terms associated with \(\lambda\) and \(\lambda _4\) are proved to work. RLSLRSSC-L21 is not as stable as RLSLRSSC-L1. When \(\lambda = 0.001\), \(\lambda _4\) decreases from 0.005 to 0.001, the performance of RLSLRSSC-L21 declines substantially. In brief, RLSLRSSC-L1 performs stably with different parameters and addresses Gaussian and Laplacian mixed noise well.
6 Conclusion and future works
This work proposes a novel subspace clustering method, RLSLRSSC, to deal with the corrupted face database clustering problem. The F-norm and the \(l_1\) norm or the \(l_{2,1}\) are used successfully to capture the noise and outliers. RLSLRSSC simultaneously learns a robust projection and a low-rank and sparse representation in the low-dimensional subspace. By incorporating a graph constraint, the local manifold structure can be preserved, and the discriminative ability of the learned projection can be promoted. We also present adequate experiments to demonstrate that RLSLRSSC is superior to other state-of-the-art methods.
There is, however, still an unsolved problem. To obtain a discriminative latent space, we introduced a graph constraint in our model. The high-dimensional data can be represented by the low-dimensional embedding in the learned latent space. In fact, we get the low-dimensional embedding in the manner of linear transformation. For the data with nonlinear structure, our method may not work well. In the recent years, deep learning has yielded fruitful results. The data can be nonlinearly mapped into a latent space by using deep learning. In the future, we plan to combine our method with deep learning.
References
Basri R, Jacobs D (2001) Lambertian reflectance and linear subspaces. In: ICCV
Agarwal P, Mustafa N (2004) k-means projective clustering. In: ACM symposium on principles of database systems, pp 155–165
Zhang T, Szlam A, Lerman G (2009) Median k-flats for hybrid linear modeling with many outliers. In: ICCV
Zhang T, Szlam A, Wang Y, Lerman G (2012) Hybrid linear modeling via local best-fit flats. Int J Comput Vis 100(3):217–240
Huang K, Ma Y, Vidal R (2004) Minimum effective dimension for mixtures of subspaces: a robust GPCA algorithm and its applications. In: CVPR
Ma Y, Yang AY, Derksen H, Fossum R (2008) Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Rev 50(3):413–458
Vidal R, Ma Y, Sastry S (2005) Generalized principal component analysis (GPCA). IEEE Trans Pattern Anal Mach Intell 27(12):1–15
Archambeau C, Delannay N, Verleysen M (2008) Mixtures of robust probabilistic principal component analyzers. Neurocomputing 71(7–9):1274–1282
Gruber A, Weiss Y (2004) Multibody factorization with uncertainty and missing data using the EM algorithm. In: CVPR
Ma Y, Derksen H, Hong W, Wright J (2007) Segmentation of multivariate mixed data via lossy coding and compression. IEEE Trans Pattern Anal Mach Intell 29(9):1546–1562
Yang AY, Rao SR, Ma Y (2006) Robust statistical estimation and segmentation of multiple subspaces. In: CVPR
Chen G, Lerman G (2009) Spectral curvature clustering (SCC). Int J Comput Vision 81(3):317–330
Elhamifar E, Vidal R (2009) Sparse subspace clustering. In: CVPR
Elhamifar E, Vidal R (2013) Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans Pattern Anal Mach Intell 35(11):2765–2781
Favaro P, Vidal R, Ravichandran A (2011) A closed form solution to robust subspace estimation and clustering. In: CVPR
Liu G, Lin Z, Yu Y (2010) Robust subspace segmentation by low-rank representation. In: ICML
Liu G, Lin Z, Yan S, Sun J, Ma Y (2013) Robust recovery of subspace structures by low-rank representation. IEEE Trans Pattern Anal Mach Intell 35(1):171–184
Lu C, Lin Z, Yan S (2013) Correlation adaptive subspace segmentation by trace lasso. In: ICCV
Lu C, Min H, Zhao ZQ, Zhu L, Huang DS, Yan S (2012) Robust and efficient subspace segmentation via least squares regression. In: ECCV
Vidal R, Favaro P (2014) Low rank subspace clustering (LRSC). Pattern Recognit Lett 43(1):47–61
Wang YX, Xu H, Leng C (2013) Provable subspace clustering: when LRR meets SSC. In: NIPS
Chen J, Yang J (2014) Robust subspace segmentation via low-rank representation. IEEE Trans Cybern 44(8):1432–1445
Donoho DL, Elad M, Temlyakov VN (2006) Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans Inf Theory 52(1):6–18
Yang Y, Feng J, Jojic N, Yang J, Huang TS (2016) L0-sparse subspace clustering. In: ECCV
Li J, Kong Y, Fu Y (2017) Sparse subspace clustering by learning approximation L0 codes. In: AAAI
Yin M, Xie S, Wu Z, Zhang Y, Gao J (2018) Subspace clustering via learning an adaptive low-rank graph. IEEE Trans Image Process 27(8):3716–3728
Lu C, Feng J, Lin Z, Mei T, Yan S (2018) Subspace clustering by block diagonal representation. IEEE Trans Pattern Anal Mach Intell 41(2):487–501
Jolliffe IT (2002) Principal component analysis, 2nd edn. Springer, New York
Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In: ACM SIGKDD international conference on knowledge discovery and data mining, pp 245–250
Patel VM, Nguyen HV, Vidal R (2013) Latent space sparse subspace clustering. In: ICCV
Patel VM, Nguyen HV, Vidal R (2015) Latent space sparse and low-rank subspace clustering. IEEE J Sel Topics Signal Process 9(4):691–701
Wei L, Wu A, Yin J (2015) Latent space robust subspace segmentation based on low-rank and locality constraints. Expert Syst Appl 42(19):6598–6608
He X, Niyogi P (2004) Locality preserving projections. In: NIPS
He X, Yan S, Hu Y, Niyogi P, Zhang H (2005) Face recognition using laplacianfaces. IEEE Trans Pattern Anal Mach Intell 27(3):328–340
Tang K, Su Z, Jiang W, Zhang J, Sun X, Luo X (2018) Robust subspace learning-based low-rank representation for manifold. Neural Comput Appl 29:329
Jalali A, Sujay S, Ruan C, Ravikumar PK (2010) A dirty model for multi-task learning. In: NIPS
Wang D, Liu Q, Xia Y, Dong P, Luo J, Huang Q, Feng DD (2013) Dictionary learning based impulse noise removal via L1–L1 minimization. Signal Process 93(9):2696–2708
Chen Z, Wu Y (2013) Robust dictionary learning by error source decomposition. In: ICCV
Zhang H, Wang S, Zhao M, Xu X, Ye Y (2018) Locality reconstruction models for book representation. IEEE Trans Knowl Data Eng 30(10):1873–1886
Zhang H, Wang S, Xu X, Chow TWS, Wu QMJ (2018) Tree2Vector: learning a vectorial representation for tree-structured data. IEEE Trans Neural Netw Learn Syst 29(11):5304–5318
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122
Lin Z, Chen M, Wu L, Ma Y (2009) The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical report, UIUC technical report UILU-ENG-09-2215
Ng AY, Jordan MI, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: NIPS
Zelnik-Manor L, Perona P (2005) Self-tuning spectral clustering. In: NIPS
Cai D, He X (2005) Orthogonal locality preserving indexing. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, pp 3–10
Kokiopoulou E, Saad Y (2007) Orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique. IEEE Trans Pattern Anal Mach Intell 29(12):2143–2156
Cai D, He X, Han J, Zhang H (2006) Orthogonal laplacianfaces for face recognition. IEEE Trans Image Process 15(11):3608–3614
Hale E, Yin W, Zhang Y (2008) Fixed-point continuation for l1-minimization: methodology and convergence. SIAM J Optim 19(3):1107–1130
Yang J, Yin W, Zhang Y, Wang Y (2009) A fast algorithm for edge-preserving variational multichannel image restoration. SIAM J Imaging Sci 2(2):569–592
Cai D, He X, Han J (2007) Spectral regression: a unified approach for sparse subspace learning. In: ICDM
Samaria FS, Harter AC (1994) Parameterisation of a stochastic model for human face identification. In: IEEE workshop on applications of computer vision
Phillips J, Bruce V, Soulie FF (1999) In face recognition: from theory to applications. Springer, Berlin
Maaten LVD, Hinton G (2008) Visualizing data using T-SNE. J Mach Learn Res 9:2579–2605
Acknowledgements
This work is supported by the National Natural Science Foundation of China (61402181, 61502174), the Natural Science Foundation of Guangdong Province (2015A030313215, 2017A030313358, 2017A030313355), the Science and Technology Planning Project of Guangdong Province (2016A040403046), the Guangzhou Science and Technology Planning Project (201704030051).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Xiao, Y., Wei, J., Wang, J. et al. Graph constraint-based robust latent space low-rank and sparse subspace clustering. Neural Comput & Applic 32, 8187–8204 (2020). https://doi.org/10.1007/s00521-019-04317-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-019-04317-3