1 Introduction

In many areas of machine learning, data mining and computer vision [1–3], data are usually high dimensional [4–6]. Moreover, a set of high-dimensional data is often drawn from multiple low-dimensional subspaces [7], such as face images, the point trajectories of moving objects [8] and the texture features of pixels in an image [9–11]. Subspace clustering [12], which processes such data by exploiting their underlying subspaces, has therefore attracted increasing attention, and a number of subspace clustering methods have been proposed. According to the principle used to represent the subspaces, the previous subspace clustering methods can be roughly grouped into three categories: algebraic methods [13, 14], statistical methods [11] and spectral clustering-based methods [15–20].

The early studies of subspace clustering are mostly based on algebraic or statistical methods. Although algebraic methods such as generalized principal component analysis (GPCA) [14] have elegant formulations, their performance drops quickly on datasets with noise or partially coupled subspaces. By contrast, statistical methods such as expectation maximization (EM) regard subspace clustering as a mixed-data inference problem, so that prevalent methods from the general statistical learning domain can be used. Although many new techniques have been introduced to improve this line of work [e.g., agglomerative lossy compression (ALC)], the performance of statistical methods is limited by their dependence on estimating precise subspace models.

The representative spectral clustering-based methods include sparse subspace clustering (SSC), least squares regression (LSR) and smooth representation clustering (SRC) [15, 16, 18, 19]. The key step of these methods is to construct an affinity matrix from the global or local information of the samples. Unlike previous works that compute the affinity matrix with existing algebraic or statistical methods, the recent spectral clustering methods are built on the self-representation concept, i.e., representing each sample as a linear combination of the other samples. More specifically, SSC obtains a block-diagonal and sparse affinity matrix when the subspaces are independent [21]. However, if the data from the same subspace are highly correlated, SSC tends to select only one of them at random. In this case, spectral clustering cannot reliably recover the proper groups, because the affinity matrix may be “too sparse” [22]. Therefore, although SSC can find a sparse affinity matrix, it ignores the correlation structure of the data within the same subspace, and its performance is unsatisfactory. By replacing the sparse representation with a low-rank representation and taking the correlation structure of the data into account, LRR is designed to group the related data from the same subspace together, achieving a block-diagonal affinity matrix when the data are noise free and the subspaces are independent. However, real-world datasets always contain outliers or noise, and the subspaces then overlap. On such data, LRR outputs a dense and far from block-diagonal solution, because the nuclear norm lacks the ability to select subspaces. In other words, LRR clusters correlated data together but yields a dense affinity matrix rather than a sparse one. The most recent works for subspace clustering are least squares regression (LSR) and SRC. Both of them encourage the grouping effect [18, 19] to improve clustering accuracy but lack sparsity. In summary, the previous spectral clustering methods exploit self-representation characterizations, such as sparse representation and low-rank representation (LRR), to improve performance, but each of them still has shortcomings. This paper therefore focuses on spectral clustering-based subspace clustering [23, 24].

To effectively cluster data drawn from real applications, a good subspace clustering model should take both the grouping effect and sparsity into account. To this end, we propose a new self-representation-based method for subspace clustering that utilizes the grouping effect together with an \(\ell_{2,1}\)-norm regularizer inducing row sparsity, named Subspace Clustering based on Grouping Effect (SRGE). On the one hand, the affinity matrix produced by SRGE is sparse (i.e., block-diagonal), with few connections between clusters. On the other hand, it is not too sparse, i.e., the nonzero connections within each cluster are sufficient to group the correlated data drawn from the same subspace [25]. In this way, the model can both group the correlated data from the same subspace (as LSR and LRR do) and reduce the connections between clusters (as SSC does).

To overcome the disadvantages of the algorithms mentioned above, we introduce a self-representation-based \(\ell_{2,1}\)-norm to achieve a proper level of sparsity in the affinity matrix, and we also exploit the grouping effect to ensure that similar samples are clustered together, rather than relying merely on sparsity or low rank.

The contributions of this paper are summarized as follows:

  1. 1.

    With the grouping effect in our subspace clustering model, SRGE is self-adaptive to different types of data since it takes the correlations of the data into account. The grouping effect ensures that SRGE groups highly correlated data together. Moreover, if the subspaces are independent and the objective function satisfies the enforced block grouping effect (EBGE) conditions, the optimal solution of SRGE is block-diagonal and also has the grouping effect. The term \({\text{tr}}({\varvec{Z}}{\tilde{\varvec{L}}}{\varvec{Z}}^{\text{T}} )\) is a special case that satisfies the EBGE conditions.

  2. 2.

    SRGE takes the correlations of samples into account through the self-representation of samples, i.e., each sample is represented by all the other samples rather than by data pairs. Moreover, the robust loss function (i.e., \(\left\| {{\varvec{X}} - {\varvec{XZ}}} \right\|_{2,1}\)) is a balance between the \(\ell_{1}\)-norm and the F-norm and has been verified to be robust to noise [26].

  3. 3.

    Finally, as the objective function of our model is convex but not smooth, we utilize ADMM [27] to solve it efficiently.

The rest of this paper is organized as follows: In Sect. 2, we provide a brief review of the previous subspace clustering methods. Then, we propose the new grouping effect-based self-representation for subspace clustering in Sect. 3. The experimental results are presented in Sect. 4. Finally, we state the conclusions and our future work in Sect. 5.

2 Related work

2.1 Self-representation

Given a dataset \({\varvec{X}} = \left[ {{\varvec{x}}_{1} ,{\varvec{x}}_{2} , \ldots,{\varvec{x}}_{n} } \right] \in {\varvec{R}}^{d \times n}\), where \({\varvec{x}}_{i} \in {\varvec{R}}^{d}\) is a data point, self-representation represents each sample x i by the other samples, i.e., \({\varvec{x}}_{i} = \sum\nolimits_{j \ne i} {{\varvec{Z}}_{ji} {\varvec{x}}_{j} }\). Since data drawn from real applications usually contain noise or outliers, we have \({\varvec{x}}_{i} = \sum\nolimits_{j \ne i} {{\varvec{Z}}_{ji} {\varvec{x}}_{j} } + {\varvec{e}}\), where e and Z are the representation error and the self-representation matrix, respectively. This representation is robust to outliers because the self-representation coefficients depend on all the other samples rather than on data pairs.
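As an illustration, the following sketch computes an unregularized self-representation matrix by solving, for each sample, a plain least squares problem over the remaining samples. The function name and the absence of any regularizer are our own simplifying assumptions; the actual model adds the regularization terms introduced later.

```python
import numpy as np

def self_representation(X):
    """Illustrative (unregularized) self-representation: x_i ~ sum_{j != i} Z_ji x_j.

    X is d x n (one sample per column). Returns an n x n coefficient matrix Z
    whose diagonal is zero, so each sample is reconstructed only from the others.
    """
    d, n = X.shape
    Z = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]          # exclude the sample itself
        coef, *_ = np.linalg.lstsq(X[:, idx], X[:, i], rcond=None)
        Z[idx, i] = coef                               # i-th column holds the coefficients of x_i
    return Z

# toy usage: 20 samples in R^5 drawn from two 2-dimensional subspaces
rng = np.random.default_rng(0)
B1, B2 = rng.standard_normal((5, 2)), rng.standard_normal((5, 2))
X = np.hstack([B1 @ rng.standard_normal((2, 10)), B2 @ rng.standard_normal((2, 10))])
Z = self_representation(X)
print(np.linalg.norm(X - X @ Z))                        # reconstruction residual
```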

The recent spectral clustering methods search the self-representation matrix by solving the following representation-based model:

$$\begin{array}{*{20}l} {\mathop {\hbox{min} }\limits_{\varvec{Z}} \left\| {\varvec{X} - \varvec{A}(\varvec{X})\varvec{Z}} \right\|_{l} + \lambda Q\left( \varvec{Z} \right)} \hfill & {{\text{s}}.{\text{t}}.} \hfill & {\varvec{Z} \in \varvec{C}} \hfill \\ \end{array}$$
(1)

where \({\varvec{X}} \in {\varvec{R}}^{d \times n}\) is the dataset with n samples and d-dimensional features, A(X) is the dictionary matrix, Q(Z) and C are the regularization term and the constraint set on Z, respectively, and λ > 0 is a weight parameter. In this paper, we also utilize this self-representation-based model and set A(X) = X.

2.2 Subspace clustering

The subspace clustering is defined as follows [18]:

Definition 1

(Subspace clustering) Given a group of data vectors \({\varvec{X}} = \left[ {{\varvec{X}}_{1} , \ldots ,{\varvec{X}}_{k} } \right] \in {\varvec{R}}^{d \times n}\) drawn from a union of k subspaces \(\{ {\varvec{S}}_{i} \}_{i = 1}^{k}\), let X i be the collection of n i data vectors drawn from the subspace S i , with \(n = \sum\nolimits_{i = 1}^{k} {n_{i} }\). The task of subspace clustering is to cluster these data according to the underlying subspaces from which they are drawn.

The detailed steps of spectral clustering are as follows: First, construct the affinity matrix J (ideally, it should be block-diagonal, with zero entries between points of different clusters, i.e., sparse) according to the underlying subspaces. Second, compute the first k eigenvalues and eigenvectors of the affinity matrix J to construct an eigenvector space. Finally, cluster the eigenvectors by k-means.
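The sketch below is a minimal instance of this three-step pipeline, operating directly on the eigenvectors of the affinity matrix J as described above; in practice one may instead use the eigenvectors of a graph Laplacian of J, and the row normalization is an optional step we add for numerical stability.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_from_affinity(J, k, random_state=0):
    """Cluster n points given an n x n symmetric affinity matrix J into k groups."""
    # eigenvectors associated with the k largest eigenvalues of J
    eigvals, eigvecs = np.linalg.eigh(J)            # eigenvalues in ascending order
    V = eigvecs[:, -k:]                             # n x k eigenvector space
    # normalize each row before running k-means (optional but common)
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(V)
```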

For noise-free data and mutually independent subspaces, early methods such as [28] can solve the subspace clustering problem well. However, in most practical applications, the data usually contain various kinds of noise or lie on the intersections of multiple dependent subspaces. As a result, data with different labels may be grouped into the same cluster, which is clearly incorrect. To eliminate the influence of such errors (i.e., noise), which strongly affect subspace clustering, various subspace clustering methods have been proposed [14–16, 18, 25]. In the following, we review the related work on recent spectral clustering-based subspace clustering methods.

Elhamifar and Vidal [15] proposed SSC, which represents each sample by a linear combination of the other samples and enforces the representation coefficients to be sparse. Liu et al. [18] then put forward the LRR algorithm for subspace clustering, which enforces the affinity matrix to be low rank so that the global information of the samples can be captured; LRR is robust because the rank becomes high when the data contain noise. More recently, Lu et al. [16] proposed a least squares regression-based algorithm for constructing the affinity matrix, arguing that the performance of subspace clustering can be enhanced by the grouping effect. Hu et al. [19] then proposed SRC, in which the grouping effect of the representation coefficients is used to construct the affinity matrix; by enforcing the grouping effect, SRC achieves robust subspace clustering performance. Furthermore, Peng et al. [29] proposed thresholding ridge regression for subspace clustering, which eliminates the effect of errors coming from linear projection spaces.

By utilizing \(\ell_{1}\)-norm minimization, SSC encourages a sparse block-diagonal structure for clustering, but it lacks the grouping effect. In contrast, LRR utilizes rank minimization, while SRC and LSR make use of the \(\ell_{2}\)-norm. These three methods enjoy a strong grouping effect but lack sparsity, and they are sensitive to noise and outliers. Therefore, although all of these methods have achieved significant success in subspace clustering, none of them can produce an exactly block-diagonal representation matrix for realistic data.

3 Method

3.1 Notations

In this work, we use bold italic capital letters and lowercase letters to denote matrices and vectors, respectively. The trace of a square matrix A is tr(A). A T and A −1 denote the transpose and the inverse of A, respectively. [A] j denotes the j-th column of the matrix A. We write \(v \to v_{0}\) when v converges to v 0. Several vector and matrix norms are used. \(\left\| {\varvec{v}} \right\|_{p}\) denotes the \(\ell_{p}\)-norm of the vector v. \(\left\| {\varvec{A}} \right\|_{1}\), \(\left\| {\varvec{A}} \right\|_{F}\) and \(\left\| {\varvec{A}} \right\|_{2,1}\) denote the \(\ell_{1}\)-norm (\(\left\| {\varvec{A}} \right\|_{1} = \sum\nolimits_{j = 1}^{n} {\sum\nolimits_{i = 1}^{n} {\left| {{\varvec{A}}_{ij} } \right|} }\)), the F-norm and the \(\ell_{2,1}\)-norm (\(\sum\nolimits_{j} {\left\| {\left[ {\varvec{A}} \right]_{j} } \right\|_{2} }\), i.e., the sum of the \(\ell_{2}\)-norms of the column vectors) of A, respectively. Rank(Z) denotes the rank of Z, and \(\left\| {\varvec{Z}} \right\|_{*}\) denotes the nuclear norm of Z, i.e., the sum of all its singular values.
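For concreteness, these norms can be computed with NumPy as in the short sketch below, following the column-wise \(\ell_{2,1}\) convention stated above.

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)

l1   = np.abs(A).sum()                   # ||A||_1: sum of absolute entries
fro  = np.linalg.norm(A, 'fro')          # ||A||_F: Frobenius norm
l21  = np.linalg.norm(A, axis=0).sum()   # ||A||_{2,1}: sum of column l2-norms
nuc  = np.linalg.norm(A, 'nuc')          # ||A||_*: sum of singular values
rank = np.linalg.matrix_rank(A)          # Rank(A)
```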

3.2 Grouping effect

The grouping effect of LSR holds only for its specific model and cannot be transferred to other models. To address this problem, in this section we reanalyze the reconstruction-based grouping effect and then put forward the enforced block grouping effect (EBGE) conditions, which allow the grouping effect to be applied flexibly.

The grouping effect was first explicitly stated by Lu et al. [16, 25]: if the samples are close to each other, their representation coefficients are also close to each other [19]. The grouping effect ensures that the data are clustered according to their underlying subspaces. In this way, data with highly similar representation coefficients, which normally come from the same subspace, are grouped together. It is defined as follows:

Definition 2

(Grouping effect) Given a set of data \({\varvec{X}} = [{\varvec{x}}_{1} ,{\varvec{x}}_{2} , \ldots ,{\varvec{x}}_{n} ] \in {\varvec{R}}^{d \times n}\), a self-representation matrix \({\varvec{Z}} = [{\varvec{z}}_{1} ,{\varvec{z}}_{2} , \ldots ,{\varvec{z}}_{n} ] \in {\varvec{R}}^{n \times n}\) has grouping effect if and only if \(||{\varvec{x}}_{i} - {\varvec{x}}_{j} ||_{2} \to 0 \Rightarrow ||{\varvec{z}}_{i} - {\varvec{z}}_{j} ||_{2} \to 0\), \(\forall i \ne j\).
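As a quick empirical check of this definition, one can compare the distance between the closest pair of samples with the distance between their representation coefficients. The sketch below is illustrative only; the helper name and the brute-force pair search are our own choices.

```python
import numpy as np

def grouping_effect_check(X, Z):
    """Return ||x_i - x_j||_2 and ||z_i - z_j||_2 for the closest pair of distinct samples.

    X is d x n (samples as columns), Z is the n x n self-representation matrix
    whose i-th column is the coefficient vector z_i.
    """
    n = X.shape[1]
    best, pair = np.inf, (0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[:, i] - X[:, j])
            if d < best:
                best, pair = d, (i, j)
    i, j = pair
    return best, np.linalg.norm(Z[:, i] - Z[:, j])
```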

SSC, LRR and LSR utilize different assumptions (independence, noise-free data or orthogonality) to control the affinities of the data points. Their ultimate goal is to obtain a block-diagonal affinity matrix for correct clustering. But what kind of criterion leads to such an affinity matrix? We first consider a very simple case in which a basis of the subspaces is used as the dictionary.

Theorem 1

(Block-diagonal) Given independent subspaces \(\left\{ {{\varvec{S}}_{i} } \right\}_{i = 1}^{k}\), let B i be a matrix whose columns form a basis of the subspace S i , B = [ B 1 , …, B k ], and let U i be a matrix composed of some column vectors from S i , U = [ U 1 , …, U k ]. The solution Z * of the following equation

$$\varvec{U} = \varvec{BZ}$$
(2)

is unique and block-diagonal.

Proof

To prove Theorem 1, we only need to prove that for any data point y from S i there is a unique decomposition y = B 1 z 1 + ··· + B i z i + ··· + B k z k , where \({\varvec{B}}_{j} {\varvec{z}}_{j} \in {\varvec{S}}_{j}\) (j = 1,…, k). Since the subspaces are independent, we have B i z i  = y and B j z j  = 0 for all \(j \ne i\). Because B i is a basis of the subspace S i , z i is unique, and z j  = 0 for all \(j \ne i\).

From Theorem 1, we know that if the subspaces are independent, the data can be clustered accurately by solving Eq. (2). However, this model is not suitable for data with noise, which damages the structure of the subspaces.

Therefore, we rewrite the model Eq. (1) into a more general form:

$$\begin{array}{*{20}l} {\mathop {\hbox{min} }\limits_{{\varvec{z}}} f({\varvec{Z}}) + \lambda {\text{Q}}\left( {\varvec{Z}} \right)} \hfill & {{\text{s}}.{\text{t}}.} \hfill & {{\varvec{Z}} \in {\varvec{C}} = \{ {\varvec{Z}}|{\varvec{X}} = {\varvec{XZ}}\} } \hfill \\ \end{array}$$
(3)

If f(Z) and Q(Z) satisfy the EBGE conditions, then the solution to problem Eq. (3) is block-diagonal and also has grouping effect.

Enforced block grouping effect (EBGE) conditions We assume f(Z) and Q(Z) are two functions defined on a set of matrices C (\({\varvec{C}} \ne \emptyset\)), and \({\varvec{Z}} = \left[ {\begin{array}{ll} {\varvec{A}} & {\varvec{U}} \\ {\varvec{V}} & {\varvec{B}} \\ \end{array} } \right] \in {\varvec{C}}\), where A and B are square matrices, U and V are of suitable dimensions, and A, B ∈ C. \({\varvec{Z}}^{M} = \left[ {\begin{array}{ll} {\varvec{A}} & {\varvec{0}} \\ {\varvec{0}} & {\varvec{B}} \\ \end{array} } \right] \in {\varvec{C}}\) denotes the block-diagonal part of Z.

The EBGE conditions of problem Eq. (3) are:

  1. 1.

    For any permutation matrix P with ZP ∈ C, we have f(Z) = f(ZP).

  2. 2.

    \(f\left( {\varvec{Z}} \right) \ge f\left( {{\varvec{Z}}^{M} } \right)\), where the equality holds if and only if Z = Z M.

  3. 3.

    Q(Z) = Q(ZP), and Z ∈ C if and only if ZP ∈ C, for any permutation matrix P.

  4. 4.

    Z ∈ C if and only if P T ZP ∈ C, and Q(Z) = Q(P T ZP), for any permutation matrix P.

Theorem 2

If f( Z ) and Q( Z ) satisfy the EBGE conditions, the optimal solution Z * to problem Eq. ( 3 ) is block-diagonal and also has grouping effect:

$${\varvec{Z}}^{*} = \left[ {\begin{array}{llll} {Z_{1}^{*} } &\quad 0 &\quad \cdots &\quad 0 \\ 0 &\quad {Z_{2}^{*} } &\quad \cdots &\quad 0 \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ 0 &\quad 0 &\quad \cdots &\quad {Z_{k}^{*} } \\ \end{array} } \right]$$

where \({\varvec{Z}}_{i}^{*} \in {\varvec{R}}^{{n_{i} \times n_{i} }}\) corresponds to X i , and n i is the number of samples drawn from the subspace S i , for each i.

Proof

  1. 1.

    The optimal solution of Eq. (3) is block-diagonal. Since f(Z) = f(ZP) for every permutation matrix P, the objective function is invariant to permutations. Let Z A denote the optimal solution of problem Eq. (3) and decompose it into two parts, Z A = Z F + Z *, where

    $$\begin{aligned} {\varvec{Z}}^{A} &= \left[ {\begin{array}{llll} {Z_{1}^{*} } &\quad * &\quad \cdots &\quad * \\ * &\quad {Z_{2}^{*} } &\quad \cdots &\quad * \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ * &\quad * &\quad \cdots &\quad {Z_{k}^{*} } \\ \end{array} } \right],\\ {\varvec{Z}}^{F} &= \left[ {\begin{array}{llll} 0 &\quad * &\quad \cdots &\quad * \\ * &\quad 0 &\quad \cdots &\quad * \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ * &\quad * &\quad \cdots &\quad 0 \\ \end{array} } \right]. \end{aligned}$$

    We assume [X] j  = [XZ A] j  ∈ S l ; thus, [XZ *] j  ∈ S l and [XZ F] j lies in the sum of the other subspaces \( \oplus_{i \ne l} {\varvec{S}}_{i}\). Since [XZ F] j  = [XZ A] j  − [XZ *] j and the subspaces are independent, i.e., \({\varvec{S}}_{l} \cap ( \oplus_{i \ne l} {\varvec{S}}_{i} ) = \{ 0\}\), we have [XZ F] j  = 0. It follows that XZ F = 0 and XZ * = X, so Z * is feasible for problem Eq. (3). By EBGE condition (2), \(f\left( {{\varvec{Z}}^{A} } \right) \ge f\left( {{\varvec{Z}}^{*} } \right)\); on the other hand, Z A is the optimal solution, so \(f\left( {{\varvec{Z}}^{A} } \right) \le f\left( {{\varvec{Z}}^{*} } \right)\). Hence \(f\left( {{\varvec{Z}}^{A} } \right) = f\left( {{\varvec{Z}}^{*} } \right)\), and by condition (2) the equality holds only if Z A = Z *. Therefore, the optimal solution of Eq. (3) is block-diagonal.

  2. 2.

    The optimal solution of Eq. (3) has grouping effect.

    Through Definition 2, we have \(||{\varvec{X}}_{2} - {\varvec{X}}_{1} ||_{F} \to 0 \Rightarrow ||{\varvec{Z}}_{2}^{*} - {\varvec{Z}}_{1}^{*} ||_{F} \to 0\). If EBGE condition (3) is satisfied, it is straightforward to verify \(||{\varvec{z}}_{i} - {\varvec{z}}_{j} ||_{2} \to 0\), since \({\varvec{Z}}_{2}^{*}\) and \({\varvec{Z}}_{1}^{*}\) only differ in the i-th and the j-th columns.

    Inspired by SRC [19], it is easy to verify that \({\text{tr}}({\varvec{ZLZ}}^{\text{T}} )\) and \({\text{tr}}({\varvec{Z}}^{\text{T}} {\varvec{LZ}})\) satisfy the EBGE conditions. In this paper, we utilize \({\text{tr}}({\varvec{ZLZ}}^{\text{T}} )\) as the regularization of our objective function.

3.3 SRGE Algorithm

As stated in Sect. 1, in this section we propose the SRGE method.

The SRGE algorithm proposed in this paper is very efficient. First, SRGE represents each sample by utilizing the correlations between samples. Second, the \(\ell_{2,1}\)-norm and the trace norm are introduced as the row-sparsity constraint and the grouping effect constraint, respectively. Finally, the block-diagonal self-representation matrix Z is generated, from which the affinity matrix J is obtained and fed into the spectral clustering model to cluster the data.

According to the definition of sample self-representation, we need to find a column vector \({\varvec{z}}_{i} \in {\varvec{R}}^{n \times 1}\) such that x i can be represented by Xz i . Normally, the noise and outliers in real-world data bring errors into the representation, i.e., x i  = Xz i  + e, where e is the representation error. The main purpose of this paper is to find the optimal representation matrix Z * that makes the error between X and XZ as small as possible.

The loss function of SRC is \(\left\| {{\varvec{X}} - {\varvec{XZ}}} \right\|_{F}^{2}\); its solution is not sparse, and the influence of the noise and outliers in the original data is not well handled. Therefore, we adopt \(\left\| {{\varvec{X}} - {\varvec{XZ}}} \right\|_{2,1}\) as the loss function of problem Eq. (1). Because \(\left\| {{\varvec{X}} - {\varvec{XZ}}} \right\|_{2,1}\) is not squared, the outliers become less influential than the other samples. The regularization term of LSR is as follows:

$${\text{Q}}({\varvec{Z}}) = {\text{tr}}({\varvec{ZZ}}^{\text{T}} ) = \frac{1}{n}\left( {\left\| {{\varvec{Ze}}} \right\|_{2}^{2} + \frac{1}{2}\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\left\| {{\varvec{z}}_{i} - {\varvec{z}}_{j} } \right\|_{2}^{2} } } } \right)$$
(4)

where e is the all-ones vector. This term assigns equal weights to all pairs of representations, regardless of whether the corresponding samples are close to each other or not; it does have a grouping effect, but only implicitly. In order to explicitly integrate the grouping effect into our representation model, we introduce the following term as the regularization:

$$Q({\varvec{Z}}) = \frac{1}{2}\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {w_{ij} \left\| {{\varvec{z}}_{i} - {\varvec{z}}_{j} } \right\|_{2}^{2} } } = {\text{tr}}({\varvec{ZLZ}}^{\text{T}} )$$
(5)

where W = (w ij ) is the weight matrix that reflects the closeness of the points x i and x j , and L = D − W is the Laplacian matrix, in which D is the diagonal matrix with \({\varvec{D}}_{ii} = \sum\nolimits_{j = 1}^{n} {w_{ij} }\). The most common way to generate W is the k-nearest-neighbor (k-nn) graph with heat kernel [30, 31] or 0–1 weights [32, 33]. In our experiments, we utilize the 0–1 weights, which are good enough to construct the k-nn graph (following [19], the default value of k is 4). As is well known, there are many other ways to construct more complex k-nn graphs [3], but that is not the main focus of this paper.
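A minimal sketch of this construction is given below, assuming a symmetric 0–1 k-nn graph built with scikit-learn; the default values k = 4 and ξ = 0.001 are those stated in the text, and the helper name is ours.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X, k=4, xi=1e-3):
    """Build the 0-1 k-nn weight matrix W, the Laplacian L = D - W and L_tilde = L + xi*I.

    X is d x n (one sample per column); kneighbors_graph expects samples as rows.
    """
    W = kneighbors_graph(X.T, n_neighbors=k, mode='connectivity').toarray()
    W = np.maximum(W, W.T)                  # symmetrize the 0-1 weights
    D = np.diag(W.sum(axis=1))              # degree matrix
    L = D - W                               # graph Laplacian
    return W, L, L + xi * np.eye(X.shape[1])
```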

To prevent instability, we use \({\tilde{\varvec{L}}} = {\varvec{L}} + \xi {\varvec{I}}\) as a substitute for \({\varvec{L}}\) (the default value of ξ is 0.001); then, problem Eq. (1) is transformed into the following objective function:

$$\mathop {\hbox{min} }\limits_{{\varvec{Z}}} {\text{J}}\left( {\varvec{Z}} \right) = \left\| {{\varvec{X}} - {\varvec{XZ}}} \right\|_{2,1} + \lambda {\text{tr}}\left( {\varvec{Z}}{\tilde{\varvec{L}}}{\varvec{Z}}^{\text{T} } \right)$$
(6)

The above objective function utilizes the trace norm [34] to ensure that each sample is represented by samples with similar representation coefficients (i.e., the closest samples), and it takes the row-sparse \(\ell_{2,1}\)-norm as the loss function to reduce the influence of noise and outliers. Both of these two terms help to guarantee that the model is robust.
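For reference, the objective value of Eq. (6) can be evaluated as in the short sketch below (the helper name is ours), which is convenient for monitoring the convergence of the solver described in Sect. 3.5. We follow the column-wise \(\ell_{2,1}\) convention from the notation section; if the row-wise convention is intended, `axis=0` should be replaced by `axis=1`.

```python
import numpy as np

def srge_objective(X, Z, L_tilde, lam):
    """Value of Eq. (6): ||X - XZ||_{2,1} + lam * tr(Z L~ Z^T)."""
    residual_l21 = np.linalg.norm(X - X @ Z, axis=0).sum()   # sum of column l2-norms
    return residual_l21 + lam * np.trace(Z @ L_tilde @ Z.T)
```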

Traditional methods utilize \({\varvec{J}}_{1} = {{\left( {\left| {{\varvec{Z}}^{{\varvec{*}}} } \right| + \left| {{\varvec{Z}}^{{*{\text{T}}}} } \right|} \right)} \mathord{\left/ {\vphantom {{\left( {\left| {{\varvec{Z}}^{{\varvec{*}}} } \right| + \left| {{\varvec{Z}}^{{*{\text{T}}}} } \right|} \right)} 2}} \right. \kern-0pt} 2}\) to obtain the affinity matrix and then use the spectral clustering algorithm to cluster the data. In this way, the clustering results have the following characteristics: samples in the same subspace have high similarity, samples drawn from different subspaces differ greatly, and the affinity matrix is block-diagonal. These characteristics help spectral clustering-based subspace clustering methods achieve a good clustering performance.

3.4 Subspace clustering by SRGE

The same as SSC, LSR and LRR, our algorithm is also a spectral clustering-based method. After obtaining the optimal self-representation coefficient matrix Z *, we compute the affinity matrix by:

$${\varvec{J}}_{1} = {{\left( {\left| {{\varvec{Z}}^{{\varvec{*}}} } \right| + \left| {{\varvec{Z}}^{{*{\text{T}}}} } \right|} \right)} \mathord{\left/ {\vphantom {{\left( {\left| {{\varvec{Z}}^{{\varvec{*}}} } \right| + \left| {{\varvec{Z}}^{{*{\text{T}}}} } \right|} \right)} 2}} \right. \kern-0pt} 2}$$
(7)

and then utilize the spectral clustering algorithm [23] to generate the final clustering result.

J 1 is efficient; however, its effectiveness is mainly due to the block-diagonal structure of Z *. To exploit the grouping effect, which is useful for obtaining a better subspace clustering result, we compute the affinity matrix by:

$$\left( {\varvec{J}_{2} } \right)_{ij} = \frac{{\left( {\varvec{z}_{i}^{{*{\text{T}}}} \varvec{z}_{j}^{*} } \right)^{2} }}{{\left\| {\varvec{x}_{i} } \right\|_{2}^{2} \left\| {\varvec{x}_{j} } \right\|_{2}^{2} }}$$
(8)

The new affinity is the squared inner product of the representation coefficient vectors, normalized by the norms of the original features. This normalization reduces the bias caused by the amplitudes of the original attributes and is commonly used in motion segmentation and handwritten digit recognition problems.
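Both affinity measures, Eq. (7) and Eq. (8), can be computed from an optimal coefficient matrix Z* whose i-th column is z_i*, as in the sketch below; the small epsilon guarding against zero-norm samples is our own addition.

```python
import numpy as np

def affinity_J1(Z):
    """Eq. (7): symmetrized absolute coefficients."""
    return (np.abs(Z) + np.abs(Z.T)) / 2.0

def affinity_J2(Z, X, eps=1e-12):
    """Eq. (8): squared inner products of coefficient columns,
    normalized by the squared norms of the original samples."""
    G = (Z.T @ Z) ** 2                            # G[i, j] = (z_i^T z_j)^2
    x_norm2 = np.maximum(np.sum(X ** 2, axis=0), eps)
    return G / np.outer(x_norm2, x_norm2)
```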

The details of subspace clustering by SRGE are summarized in Algorithm 1.

[Algorithm 1 (pseudocode figure): Subspace clustering by SRGE]

3.5 Optimization algorithm

The objective function Eq. (6) is convex but not smooth, so it cannot be solved directly. In this work, we propose an efficient optimization algorithm to solve problem Eq. (6).

First, differentiating Eq. (6) with respect to each column of Z and setting the derivative to zero, the optimal solution Z * can be obtained from the following equation:

$${\varvec{X}}^{\text{T}} {\varvec{DXZ}} + {\varvec{Z}}(\lambda {\tilde{\varvec{L}}}) + ( - {\varvec{X}}^{\text{T}} {\varvec{DX}}) = 0$$
(9)

Equation (9) is a normative Sylvester equation [35], and it has a unique solution:

$${\varvec{Z}} = {\text{lyap}}({\varvec{X}}^{\text{T}} {\varvec{DX}},\lambda {\tilde{\varvec{L}}}, - {\varvec{X}}^{\text{T}} {\varvec{DX}})$$
(10)

where X and \({\tilde{\varvec{L}}}\) are known and D is a diagonal matrix: letting U = X − XZ = [u 1, …, u n ]T, the diagonal elements of D are \({\varvec{d}}_{ii} = \frac{1}{{\left\| {{\varvec{u}}_{i} } \right\|_{2} }}\). Since D is unknown and depends on Z, we utilize the ADMM algorithm to solve the problem iteratively.
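The following is a minimal sketch of the iteratively reweighted solve of Eqs. (9)–(10), using scipy.linalg.solve_sylvester in place of the `lyap` routine. The fixed iteration count, the small epsilon in the weights and the interpretation of the u i as the rows of X − XZ are assumptions on our part.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def srge_sylvester(X, L_tilde, lam=1.0, n_iter=20, eps=1e-8):
    """Iteratively reweighted solve of X^T D X Z + Z (lam * L~) = X^T D X (Eqs. 9-10).

    X is d x n, L_tilde is n x n. D is a d x d diagonal weight matrix recomputed
    from the rows u_i of the residual U = X - XZ, with d_ii = 1 / ||u_i||_2.
    """
    d, n = X.shape
    D = np.eye(d)                                      # initial weights
    Z = np.zeros((n, n))
    for _ in range(n_iter):
        XtDX = X.T @ D @ X
        Z = solve_sylvester(XtDX, lam * L_tilde, XtDX)  # A Z + Z B = Q
        U = X - X @ Z                                   # residual, one row per feature
        D = np.diag(1.0 / np.maximum(np.linalg.norm(U, axis=1), eps))
    return Z
```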

First, we decompose Eq. (6) into n subproblems as follows:

$$\arg \mathop {\hbox{min} }\limits_{{{\varvec{z}}_{i} }} \left\| {{\varvec{x}}_{i} - {\varvec{Xz}}_{i} } \right\|_{2,1} + \lambda {\text{tr}}\left( {{\varvec{z}}_{i} {\tilde{\varvec{L}}\varvec{z}}_{i}^{\text{T}} } \right)$$
(11)

where z i is the i-th column vector of Z and \({\text{vec}}({\varvec{Z}}) = [{\varvec{z}}_{1}^{\text{T}} ,{\varvec{z}}_{2}^{\text{T}} , \ldots ,{\varvec{z}}_{n}^{\text{T}} ]^{\text{T}}\).

Because Eq. (6) utilizes the constrained trace norm as the regularization, the elements of Z cannot be solved independently; thus, it is difficult and inefficient to apply soft thresholding with λ directly. However, soft thresholding can be used efficiently within the optimization process through the alternating direction method of multipliers (ADMM). With the help of an auxiliary variable, the general form of this method can be rewritten as follows:

$$\begin{array}{*{20}l} {\arg \mathop {\hbox{min} }\limits_{{{\varvec{Z}},{\varvec{V}}}} \left\| {{\varvec{X}} - {\varvec{XZ}}} \right\|_{2,1} + \lambda {\text{tr}}\left( {\varvec{V}}{\tilde{\varvec{L}}}{\varvec{V}}^{\text{T}} \right) + \rho \left\| {{\varvec{Z}} - {\varvec{V}}} \right\|_{F}^{2} } \hfill & {{\text{s}} . {\text{t}} .} \hfill & {{\varvec{Z}} = {\varvec{V}}} \hfill \\ \end{array}$$
(12)

Since the objective function is hard to handle directly, we use the augmented Lagrangian of Eq. (12) instead:

$${\text{L}}({\varvec{Z}},{\varvec{V}},{\varvec{\Lambda}}) = \left\| {{\varvec{X}} - {\varvec{XZ}}} \right\|_{2,1} + \lambda {\text{tr}}\left( {\varvec{V}}{\tilde{\varvec{L}}}{\varvec{V}}^{\text{T}} \right) + \frac{\rho }{2}\left\| {{\varvec{Z}} - {\varvec{V}}} \right\|_{F}^{2} + {\text{vec}}({\varvec{\Lambda}})^{\text{T}} {\text{vec}}({\varvec{Z}} - {\varvec{V}})$$
(13)

The basic idea of ADMM includes the following iteration steps:

  1. 1.

    \({\varvec{Z}}^{{\left( {k + 1} \right)}} = \arg \mathop {\hbox{min} }\limits_{{\varvec{Z}}} {\text{L}}\left( {{\varvec{Z}},{\varvec{V}}^{\left( k \right)} ,{\varvec{\Lambda}}^{\left( k \right)} } \right)\)

  2. 2.

    \({\varvec{V}}^{{\left( {k + 1} \right)}} = \arg \mathop {\hbox{min} }\limits_{{\varvec{V}}} {\text{L}}\left( {{\varvec{Z}}^{{\left( {k + 1} \right)}} ,{\varvec{V}},{\varvec{\Lambda}}^{\left( k \right)} } \right)\)

  3. 3.

    \({\varvec{\Lambda}}^{{\left( {k + 1} \right)}} \leftarrow {\varvec{\Lambda}}^{\left( k \right)} + \rho \left( {{\varvec{Z}}^{{\left( {k + 1} \right)}} - {\varvec{V}}^{{\left( {k + 1} \right)}} } \right)\)

Given Λ and ρ, the key to this method is to obtain an optimal solution of Eq. (12). Problem Eq. (6) can be decomposed into two subproblems by the ADMM algorithm.

The first subproblem minimizes Eq. (12) with respect to Z only; since the trace-norm penalty \({\text{tr(}}{\varvec{V}}{\tilde{\varvec{L}}}{\varvec{V}}^{\text{T}} )\) does not depend on Z, this reduces to a simple least squares regression problem. The second subproblem minimizes Eq. (12) with respect to V only; since \(\left\| {{\varvec{X}} - {\varvec{XZ}}} \right\|_{2,1}\) does not depend on V, V can be solved independently. By decomposing the problem into these two subproblems, the thresholding step can be applied efficiently. Combining the current estimates of Z and V with the third step of ADMM, the current estimate of the Lagrange multiplier matrix Λ is updated. The penalty parameter ρ plays a special and important role: it allows Z and V to be solved even with an inexact estimate of Λ, as illustrated in the sketch below.
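The sketch below implements one plausible version of this ADMM iteration for Eqs. (12)–(13), where the \(\ell_{2,1}\) loss is handled inside the Z-step by the same diagonal reweighting D used in Eq. (9). The closed-form updates, the fixed iteration count and the row-wise reweighting convention are our assumptions rather than the exact updates of the paper.

```python
import numpy as np

def srge_admm(X, L_tilde, lam=1.0, rho=1.0, n_iter=50, eps=1e-8):
    """One plausible ADMM for Eq. (12): min ||X - XZ||_{2,1} + lam*tr(V L~ V^T) s.t. Z = V."""
    d, n = X.shape
    Z = np.zeros((n, n))
    V = np.zeros((n, n))
    Lam = np.zeros((n, n))                 # Lagrange multiplier matrix
    D = np.eye(d)                          # reweighting matrix for the l_{2,1} loss
    I = np.eye(n)
    for _ in range(n_iter):
        # Z-step: weighted least squares, (2 X^T D X + rho I) Z = 2 X^T D X + rho V - Lam
        XtDX = X.T @ D @ X
        Z = np.linalg.solve(2 * XtDX + rho * I, 2 * XtDX + rho * V - Lam)
        # V-step: V (2 lam L~ + rho I) = rho Z + Lam
        V = np.linalg.solve(2 * lam * L_tilde + rho * I, (rho * Z + Lam).T).T
        # dual update
        Lam = Lam + rho * (Z - V)
        # refresh the diagonal weights from the rows of the residual
        U = X - X @ Z
        D = np.diag(1.0 / np.maximum(np.linalg.norm(U, axis=1), eps))
    return Z
```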

4 Experiments

In this section, we evaluate our SRGE method on four applications of subspace clustering: motion segmentation, handwritten digit clustering, psychological balance clustering and face clustering. We compare our method with state-of-the-art representation-based reconstruction methods, namely SSC, LRR, LSR and SRC.

4.1 Experimental datasets and evaluation criteria

The datasets used in our experiments are Hopkins 155 [36], USPS [37], Jaffe [38] and Balance [39]. All of them are common benchmark datasets for evaluating subspace clustering methods. The best results for all methods are reported.


Hopkins 155 [36] is a motion segmentation dataset consisting of 155 video sequences. Each sequence contains two or three motions (each motion corresponds to a subspace), and every sequence is a separate clustering task; therefore, there are 155 subspace clustering tasks in total. Unlike the other algorithms, our method does not apply any dimensionality reduction. We compare our method with SSC, LRR, LSR and SRC, using the same parameter for all sequences, as the other methods do. All methods are run on each sequence, and the mean, maximum, minimum and standard deviation of the clustering error are reported.

USPS [37] is also widely used in subspace clustering. It contains 9298 handwritten digit images, each with 256 (16 × 16) pixels. We utilize the first 100 images in our experiments.

Jaffe [38] is an international standard face dataset consisting of facial expression images of 10 female subjects (213 samples in total), with each image having 32 × 32 pixels.

Balance [39] is a dataset of psychological experiment results. It contains 625 samples, each with 4-dimensional features.

As in other works, we utilize the clustering error (CE) to measure accuracy [40]. Under the optimal permutation, CE gives the minimum error obtained by matching the clustering result to the ground truth. It is defined as:

$${\text{CE}} = 1 - \frac{1}{N}\sum\limits_{i = 1}^{N} {\delta \left( {p_{i} ,{\text{map}}(q_{i} )} \right)}$$
(14)

where q i and p i denote the output label and the ground-truth label of the i-th sample, respectively. In the above function, δ(x, y) = 1 if and only if x = y, and δ(x, y) = 0 otherwise. The best mapping function map(q i ) permutes the clustering labels to match the ground-truth labels and can be computed efficiently by the Kuhn–Munkres algorithm [41–45].
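A small sketch of Eq. (14) is given below, using scipy.optimize.linear_sum_assignment as the Kuhn–Munkres solver for the optimal label permutation; the helper name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(truth, pred):
    """Eq. (14): fraction of samples misassigned under the best label permutation."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    labels_t, labels_p = np.unique(truth), np.unique(pred)
    # contingency table: C[a, b] = #samples with true label a and predicted label b
    C = np.zeros((len(labels_t), len(labels_p)), dtype=int)
    for a, lt in enumerate(labels_t):
        for b, lp in enumerate(labels_p):
            C[a, b] = np.sum((truth == lt) & (pred == lp))
    rows, cols = linear_sum_assignment(-C)           # maximize total matches
    return 1.0 - C[rows, cols].sum() / truth.size

# example: two labelings that agree up to a permutation of the label names
print(clustering_error([0, 0, 1, 1, 2], [2, 2, 0, 0, 1]))   # -> 0.0
```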

4.2 Experimental results and analysis

Table 1 reports the clustering error of the five methods on the Hopkins 155 database using the common affinity measure Eq. (7). SRGE achieves a clustering error of 3.35 %, while the best result among the other methods is 3.92 %, achieved by SSC. The improvement of SRGE on this database is limited because the reported error is the mean of 155 clustering errors; most of these errors are zero, so even large improvements on individual sequences have a limited effect on the mean. Another reason is that the correlations of the data are very strong, i.e., the dimension of each subspace is only two or three [16]. To distinguish the performance of each method, we highlight the best results in boldface in each table.

Table 1 The clustering results (%) of each method by J 1 on Hopkins 155

Tables 2, 4 and 5 show the clustering errors on USPS, Jaffe and Balance, respectively. For a fair comparison, we utilize the same affinity measure Eq. (7) in all algorithms. The performance improvement of our method over the others is prominent, especially on the Jaffe database, where the clustering error is 0.94 %. Moreover, as stated in Sect. 3.4, the affinity measure Eq. (8) exploits the grouping effect much better than Eq. (7). Therefore, for each method, we also utilize the new affinity measure J 2 to cluster the data on the USPS database, and the results are displayed in Table 3. Comparing Tables 2 and 3, the results of all methods with J 2 are significantly better than those with J 1, and the result of SRGE clearly outperforms the others.

Table 2 The clustering results (%) of each method by J 1 on USPS
Table 3 The clustering results (%) of each method by J 2 on USPS
Table 4 The clustering results (%) of each method by J 1 on Jaffe
Table 5 The clustering results (%) of each method by J 1 on Balance

In Fig. 1, we visualize the clustering results and the original images from USPS using the affinity measure Eq. (7). As the cluster labels produced in our experiments are randomly assigned, we rearrange them before generating Fig. 1. Moreover, we illustrate the affinity matrices obtained with Eq. (8) in Fig. 2. Obviously, the grouping effect of SRGE is better than that of the others, which also indicates that the block-diagonal structure obtained by SRGE is much clearer than that of the other methods.

Fig. 1
figure 1

The clustering results of each method on USPS obtained from the first five images of each number

Fig. 2
figure 2

Affinity matrices of SRGE, SRC, SSC, LSR and LRR on the USPS database, generated with the affinity measure Eq. (8) (the grouping effect of SRGE is much more prominent than those of the others)

5 Conclusions and future work

In this work, we propose a new subspace clustering method, namely SRGE. We utilize the self-similarity of samples and the trace norm, which has the grouping effect property, to construct the self-representation coefficient matrix. Afterward, we obtain the clustering result by spectral clustering. The effectiveness and robustness of SRGE mainly come from the self-representation of samples and the grouping effect: the former represents each sample by all the other samples rather than by data pairs, while the latter ensures that similar data have similar representation coefficients and are eventually assigned to the same group. Since our proposed objective function is convex but not smooth, we use the ADMM algorithm to solve it. Moreover, we propose a new affinity measure based on the grouping effect. Finally, the experimental results on benchmark datasets indicate that SRGE is more effective than the state-of-the-art subspace clustering methods. In future work, we will apply the grouping effect to large-scale subspace clustering and semi-supervised learning.