
1 Introduction

As an important branch of the field of pattern recognition, clustering has been extensively developed in the past few decades. Commonly used clustering methods include prototype-based clustering [1], matrix factorization-based clustering [2], and graph-based clustering [3], etc. Among them, graph-based clustering has attracted much attention because it can exploit the geometrical structure information of the data [4, 5].

As a typical graph-based clustering method, Spectral Clustering (SC) [6] often achieves superior clustering performance on high-dimensional datasets. SC first constructs a low-dimensional representation of the data from a predetermined similarity matrix. It then obtains the discrete clustering result by applying the spectral rotation method [7] or k-means [8] to the relaxed spectral solutions. The similarity matrix therefore plays a critical role in spectral clustering. Similarity matrix construction methods fall into two categories: point-pairwise distance-based methods and self-expression learning-based methods. Typical point-pairwise distance-based methods include the Gaussian kernel method, adaptive neighbors [9, 10], and the typicality-aware adaptive graph [11]. These methods construct sparse similarity matrices by learning sample neighborhood structure information.

The self-expression learning methods, including Sparse Subspace Clustering (SSC) [12], Low-Rank Representation (LRR) [13], Block-Diagonal Representation (BDR) [14], and their extensions [15, 16], have been reported and achieve desirable performance. Among them, SSC [12] aims to group data drawn from a union of multiple linear subspaces. When the subspaces from which the data are drawn are independent, SSC can obtain the desired performance by learning a sparse affinity matrix. LRR [13] minimizes the rank of the coefficient matrix to recover the row space of the data. BDR [14] proposes a novel regularizer that directly pursues a coefficient matrix with the block-diagonal property. Iteratively Reweighted Least Squares (IRLS) [15] solves a joint low-rank and sparse minimization problem. Least Squares Regression (LSR) [16] minimizes the F-norm of the coefficient matrix so that correlated samples have similar coefficient vectors. Simplex Sparse Representation (SSR) [17] introduces the simplex constraint to ensure the translation invariance of the model.

The above self-expression learning-based methods seek sparse or low-rank representation coefficient matrices to measure the similarity between pairs of samples. However, these methods use all sample points to represent each sample, which may not be ideal, since points from other classes also participate in the representation process. As a result, the generated similarity matrix is unreliable and degrades the performance of downstream tasks.

To alleviate this issue, we propose Sparse Subspace Clustering with the \(l_0\) inequality constraint (SSC-\(l_0\)). SSC-\(l_0\) aims to achieve a representation learning mechanism in which only the points coming from the same class are expected to be used for representing each sample in this class.

The main contributions of this work are listed as follows:

  1. A new self-expression learning method, named Sparse Subspace Clustering with the \(l_0\) inequality constraint (SSC-\(l_0\)), is proposed. In SSC-\(l_0\), the \(l_0\) inequality constraint constrains the number of non-zero elements in the coefficient vector, which helps SSC-\(l_0\) conduct representation learning among the points in the same class.

  2. By introducing the simplex constraint to ensure the translation invariance of the model, an optimization method for the \(l_0\) inequality constraint is presented to solve the proposed SSC-\(l_0\), and its convergence is theoretically analyzed. Since the \(l_0\) inequality-constrained problem is difficult to solve, the proposed optimization method has the potential to be widely used in sparse learning models.

  3. Extensive experiments on benchmark datasets demonstrate the superiority of SSC-\(l_0\).

The rest of this paper is organized as follows. In Sect. 2, we introduce some notations and the related methods. In Sect. 3, we elaborate on the proposed SSC-\(l_0\) and its corresponding optimization algorithm. Experimental results are reported in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Related Works

2.1 Notations

In this study, bold uppercase letters and lowercase italic letters stand for matrices and scalars, respectively; for example, \(\textbf{X}=\left[ \textbf{x}_1;\textbf{x}_2;\cdots ;\textbf{x}_n\right] \in \mathbb {R}^{n\times d}\) is a data matrix, where d and n denote the dimension and the number of samples, respectively. Bold lowercase letters stand for vectors; for example, \(\textbf{x}_i\in \mathbb {R}^{1\times d}\) is a sample point. \(Tr(\textbf{A})\) and \(\textbf{A}^T\) stand for the trace and the transpose of matrix \(\textbf{A}\), respectively. \( \textbf{A} \ge \textbf{0}\) means that every element of \(\textbf{A}\) satisfies \(a_{ij} \ge 0\).

2.2 Sparse Subspace Clustering

Let \(\left\{ \textbf{x}_j\in \mathbb {R}^{1\times d}\right\} _{j=1}^n\) be a set of data points drawn from a union of c independent linear subspaces \(\left\{ \mathcal {S}_i\right\} _{i=1}^c\). Let \(n_i\) be the number of data points drawn from subspace \(\mathcal {S}_i\), \(d_i\) be the dimension of \(\mathcal {S}_i\), and \(\textbf{Y}_i\) be the data matrix corresponding to subspace \(\mathcal {S}_i\). If \(n_i\ge d_i\), the points drawn from subspace \(\mathcal {S}_i\) are self-expressive [12]. This means that if \(\textbf{x}\) is a new data point in \(\mathcal {S}_i\), then it can be represented as a linear combination of \(d_i\) points in the same subspace. Denote by \(\textbf{z}_i\in \mathbb {R}^{1\times n_i}\) the fragment of the coefficient vector corresponding to the data matrix \(\textbf{Y}_i\). If we let \(\textbf{X}\) be a data matrix with proper permutation, i.e., \(\textbf{X}=\left[ \textbf{Y}_1,\textbf{Y}_2,\cdots ,\textbf{Y}_c\right] \), then for any data point \(\textbf{x}\in \mathcal {S}_i\) there exists a vector \(\textbf{s}=\left[ \textbf{z}_1,\textbf{z}_2,\cdots ,\textbf{z}_c\right] \in \mathbb {R}^{1\times n}\) such that \(\textbf{x}=\textbf{sX}\), where \(\textbf{z}_i\ne \textbf{0}\) and \(\textbf{z}_j=\textbf{0}\) for all \(j\ne i\) (\(\textbf{z}_i\in \mathbb {R}^{1\times n_i}\) is the fragment corresponding to the points in the same subspace as \(\textbf{x}\)). Such an \(\textbf{s}\) can be sought by solving the following problem:

$$\begin{aligned} \min _{ \textbf{s}} ||\textbf{s}||_0, \ \ \mathrm {s.t.}\ \ \textbf{x}= \textbf{s}\textbf{X}, \end{aligned}$$
(1)

where \(||\textbf{s}||_0\) represents the number of non-zero elements in row vector \(\textbf{s}\). This can be proved by the following Theorem 1.

Theorem 1

Let \(\textbf{s}^*\) be the optimal solution to problem (1), let \(\textbf{X}\) be a data matrix with proper permutation, and let \(\textbf{x}\in \mathcal {S}_i\). Then \(\textbf{s}^*=\left[ \textbf{z}_1,\textbf{z}_2,\cdots ,\textbf{z}_c\right] \in \mathbb {R}^{1\times n}\), where \(\textbf{z}_i\ne \textbf{0}\) and \(\textbf{z}_j=\textbf{0}\) for all \(j\ne i\).

Proof

Let \(\textbf{s}^*=\textbf{s}_\uparrow ^*+\textbf{s}_\downarrow ^*\), where \(\textbf{s}_\uparrow ^*=\left[ \textbf{0},\textbf{0},\cdots ,\textbf{z}_i,\cdots ,\textbf{0}\right] \in \mathbb {R}^{1\times n}\) and \(\textbf{s}_\downarrow ^*=\left[ \textbf{z}_1,\textbf{z}_2,\cdots ,\textbf{z}_{i-1},\textbf{0},\textbf{z}_{i+1},\cdots ,\textbf{z}_c\right] \in \mathbb {R}^{1\times n}\). Theorem 1 holds once we show that \(\textbf{s}_\downarrow ^*=\textbf{0}\).

Suppose \(\textbf{s}_\downarrow ^*\ne \textbf{0}\). Since \(\textbf{s}^*=\textbf{s}_\uparrow ^*+\textbf{s}_\downarrow ^*\), we have \(\textbf{x}=\textbf{s}^*\textbf{X}=(\textbf{s}_\uparrow ^*+\textbf{s}_\downarrow ^*)\textbf{X}\). Because \(\textbf{x}\in \mathcal {S}_i\) and \(\textbf{s}_\uparrow ^*\textbf{X}\in \mathcal {S}_i\), it follows that \(\textbf{s}_\downarrow ^*\textbf{X}=\textbf{x}-\textbf{s}_\uparrow ^*\textbf{X}\in \mathcal {S}_i\); at the same time, \(\textbf{s}_\downarrow ^*\textbf{X}\) lies in the sum of the other subspaces \(\mathcal {S}_j\ (j\ne i)\). Under the independence assumption [12], the intersection of \(\mathcal {S}_i\) with the sum of the other subspaces is \(\{\textbf{0}\}\). Thus, we have \(\textbf{s}_\downarrow ^*\textbf{X}=\textbf{0}\). This implies that

$$\begin{aligned} \textbf{x}=\textbf{s}^*\textbf{X}=\textbf{s}_\uparrow ^*\textbf{X} \end{aligned}$$
(2)

which means that \(\textbf{s}_\uparrow ^*\) is also a feasible solution of Eq. (1). Since \(\textbf{s}_\downarrow ^*\ne \textbf{0}\) and its support is disjoint from that of \(\textbf{s}_\uparrow ^*\), we have \(||\textbf{s}_\uparrow ^*||_0 < ||\textbf{s}_\uparrow ^*+\textbf{s}_\downarrow ^*||_0 = ||\textbf{s}^*||_0 \), which contradicts \(\textbf{s}^*\) being the optimal solution to Eq. (1). Hence \(\textbf{s}_\downarrow ^*=\textbf{0}\) and Theorem 1 holds.    \(\blacksquare \)

The matrix format of Eq. (1) is as follows:

$$\begin{aligned} \min _{ \textbf{S}} \sum _{i=1}^n {||\textbf{s}_i||_0}, \ \ \mathrm {s.t.}\ \ \textbf{X}= \textbf{S}\textbf{X}, \end{aligned}$$
(3)

where \(\textbf{S}=\left[ \textbf{s}_1;\textbf{s}_2;\cdots ;\textbf{s}_n\right] \in \mathbb {R}^{n\times n}\) is a coefficient matrix, and \(\textbf{s}_i\in \mathbb {R}^{1\times n}\) is a coefficient vector corresponding to the point \(\textbf{x}_i\).

Theorem 1 demonstrates that when the subspaces are independent, any sample \(\textbf{x}_i\) can be represented by the samples coming from the same class, and \(s_{ij}=0\) whenever samples \(\textbf{x}_j\) and \(\textbf{x}_i\) come from different classes.

According to Theorem 1, when \(\textbf{X}\) is properly permuted and subspaces are independent, the coefficient matrix \(\textbf{S}\) has a block diagonal structure, which can be used to improve clustering performance.

3 Sparse Subspace Clustering with the \(l_0\) Inequality Constraint

3.1 Motivation of SSC-\(l_0\)

Most self-expression learning methods seek a sparse or low-rank representation coefficient matrix and minimize the difference between the original samples and their reconstructed estimations. The objective functions of SSC, LRR, and their extensions [14,15,16] can be generalized as follows:

$$\begin{aligned} \min _{ \textbf{S}} ||\textbf{X}- \textbf{S}\textbf{X}||_F^2 + \gamma \mathcal {R}(\textbf{S}), \end{aligned}$$
(4)

where \(\textbf{S}\) is a reconstruction coefficient matrix, \(||\textbf{X}- \textbf{S}\textbf{X}||_F^2\) indicates the self-expression reconstruction loss. \(\mathcal {R}(\textbf{S})\) is a regularization term, which guarantees that \(\textbf{S}\) is sparse or low-rank, or has a strict block diagonal structure, etc.

Numerous studies have shown that self-expression learning methods achieve impressive performance in clustering tasks [13]. However, in Eq. (4), directly using all points to represent a fixed sample in a class may degrade the quality of the learned similarity matrix, as points from other classes also participate in the representation process. To alleviate this issue, we attempt to achieve representation learning among points of the same class. This is not easy, however, because the ground-truth labels are not provided beforehand. According to Theorem 1, the permuted coefficient matrix has a block-diagonal structure when the subspaces from which the data are drawn are independent. Thus, such block-diagonal structures can be used to guide the similarity measurements among samples. However, real data often do not satisfy the subspace independence assumption [18].

In practice, it is easier for data points from the same class to represent each other than for points from different classes. Thus, when reconstructing a point, if the number of non-zero elements in the coefficient vector is limited, a model is more likely to select data points from the class to which the reconstructed point belongs to complete the reconstruction. This suggests that introducing the \(l_0\) inequality constraint may improve the performance of the self-expression learning model.

3.2 Objective Function of SSC-\(l_0\)

According to the discussion in Sect. 3.1, introducing the \(l_0\) inequality constraint may improve the performance of the self-expression learning model. The proposed objective function is as follows:

$$\begin{aligned} \begin{aligned} &\min _{ \textbf{S}} \ \ ||\textbf{X}- \textbf{S}\textbf{X}||_F^2 + \gamma ||\textbf{S}||_F^2,\\ \mathrm {s.t.}\ \ \textbf{S} &\ge 0,\textbf{S}\textbf{1}^T=\textbf{1}^T,||\textbf{s}_i||_0 \le k \ (i=1,2,\cdots ,n), \end{aligned} \end{aligned}$$
(5)

where \(\textbf{S}=\left[ \textbf{s}_1;\textbf{s}_2;\cdots ;\textbf{s}_n\right] \in \mathbb {R}^{n\times n}\) is a reconstruction coefficient matrix, \(\gamma \) is a non-negative balance parameter, and k is a constant. The constraint \(||\textbf{s}_i||_0 \le k \) means that the maximum number of non-zero elements in the coefficient vector \(\textbf{s}_i\) is k.

By introducing the \(l_0\) inequality constraint \(||\textbf{s}_i||_0 \le k \), each sample in the first term of Eq. (5) is represented as a linear combination of samples that are more likely to come from the class to which the represented sample belongs. Since the coefficient matrix is non-negative, its elements can reflect the similarity between sample pairs. The squared \(l_2\) norm \(||\textbf{s}_i||_2^2\) is utilized to avoid overfitting on \(\textbf{s}_i\). The constraint \(\textbf{s}_i\textbf{1}^T=1\) ensures the translation invariance of the model [17].

3.3 Optimization of SSC-\(l_0\)

First, because model (5) is difficult to solve directly, we instead optimize the following model:

$$\begin{aligned} \begin{aligned} \min _{ \textbf{S},\textbf{M}} \ \ & ||\textbf{X}- \textbf{M}\textbf{X}||_F^2 + \alpha ||\textbf{M}- \textbf{S}||_F^2 + \gamma ||\textbf{S}||_F^2,\\ \mathrm {s.t.}\ \ {} & \textbf{S} \ge \textbf{0},\textbf{S}\textbf{1}^T=\textbf{1}^T,||\textbf{s}_i||_0 \le k \ (i=1,2,\cdots ,n), \end{aligned} \end{aligned}$$
(6)

where \(\alpha \) is a non-negative parameter. Problem (6) reduces to problem (5) when \(\alpha \) is large enough, and it can be minimized using the following alternating iterative algorithm:

Updating \(\textbf{M}\). When \(\textbf{S}\) is fixed, problem (6) becomes:

$$\begin{aligned} \begin{aligned} \min _{ \textbf{M}} \ \ ||\textbf{X}- \textbf{M}\textbf{X}||_F^2 + \alpha ||\textbf{M}- \textbf{S}||_F^2. \end{aligned} \end{aligned}$$
(7)

Setting the derivative of Eq. (7) with respect to \(\textbf{M}\) to zero, we obtain:

$$\begin{aligned} \textbf{M}=\left( \textbf{X}\textbf{X}^T+\alpha \textbf{S}\right) \left( \textbf{X}\textbf{X}^T+\alpha \textbf{I}\right) ^{-1}. \end{aligned}$$
(8)
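For illustration, the closed-form update in Eq. (8) can be computed with a single linear solve instead of an explicit matrix inverse. Below is a minimal NumPy sketch; the function name and the solve-based formulation are our own illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def update_M(X, S, alpha):
    """M update of Eq. (8): M = (X X^T + alpha S)(X X^T + alpha I)^{-1}.

    X is the (n, d) data matrix with samples as rows and S the current
    (n, n) coefficient matrix. A linear solve replaces the explicit inverse.
    """
    n = X.shape[0]
    G = X @ X.T                       # Gram matrix X X^T, shape (n, n)
    A = G + alpha * np.eye(n)         # X X^T + alpha I
    B = G + alpha * S                 # X X^T + alpha S
    # M A = B with A symmetric, so M^T solves A M^T = B^T.
    return np.linalg.solve(A, B.T).T
```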

Updating \(\textbf{S}\). When \(\textbf{M}\) is fixed, problem (6) becomes:

$$\begin{aligned} \begin{aligned} &\min _{ \textbf{S}} \ \ \alpha ||\textbf{M}- \textbf{S}||_F^2 + \gamma ||\textbf{S}||_F^2,\\ \mathrm {s.t.}\ \ \textbf{S}& \ge \textbf{0},\textbf{S}\textbf{1}^T=\textbf{1}^T,||\textbf{s}_i||_0 \le k \ (i=1,2,\cdots ,n). \end{aligned} \end{aligned}$$
(9)

Since problem (9) is separable over the rows \(\textbf{s}_i\), it can be decomposed into the following subproblems:

$$\begin{aligned} \begin{aligned} \min _{ \textbf{s}_i} \ \ & {\alpha ||\textbf{m}_i- \textbf{s}_i||_2^2 + \gamma ||\textbf{s}_i||_2^2} ,\\ \mathrm {s.t.}\ \ {} &\textbf{s}_i \ge \textbf{0},\textbf{s}_i\textbf{1}^T=1,||\textbf{s}_i||_0 \le k,\\ \Leftrightarrow \min _{ \textbf{s}_i} \ \ & \mathcal {O}(\textbf{s}_i) = ||\textbf{s}_i - \textbf{u}_i||_2^2 ,\\ \mathrm {s.t.}\ \ {} &\textbf{s}_i \ge \textbf{0},\textbf{s}_i\textbf{1}^T=1,||\textbf{s}_i||_0 \le k, \end{aligned} \end{aligned}$$
(10)

where \(\textbf{u}_i=\frac{\alpha }{\alpha +\gamma }\textbf{m}_i\).
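The equivalence stated in Eq. (10) follows from completing the square:

$$\begin{aligned} \alpha ||\textbf{m}_i-\textbf{s}_i||_2^2 + \gamma ||\textbf{s}_i||_2^2 = (\alpha +\gamma )\left\| \textbf{s}_i-\tfrac{\alpha }{\alpha +\gamma }\textbf{m}_i\right\| _2^2 + \tfrac{\alpha \gamma }{\alpha +\gamma }||\textbf{m}_i||_2^2, \end{aligned}$$

where the last term does not depend on \(\textbf{s}_i\), so minimizing the left-hand side under the constraints is equivalent to minimizing \(||\textbf{s}_i-\textbf{u}_i||_2^2\).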

At first glance, problem (10) appears to be NP-hard. Some methods sparsify the vector \(\textbf{s}_i\) obtained from all samples [19]; however, the convergence of such a solving strategy cannot be guaranteed theoretically. Here, we identify an equivalent problem to Eq. (10) that is readily solvable.

Denote by \(\check{\varGamma }\left( i\right) \) the set of indices of the k largest elements of \(\textbf{u}_i\). Then the optimal solution to the following problem (11) is also an optimal solution to problem (10). To prove this, we introduce the following definitions, lemmas, and theorems.

$$\begin{aligned} \min _{\textbf{s}_i\ge \textbf{0},\ \textbf{s}_i\textbf{1}^T=1,\ s_{ij}=0\ \mathrm {if}\ j\notin \check{\varGamma }\left( i\right) }{\left\{ \mathcal {O}_1=\sum _{j\in \check{\varGamma }\left( i\right) }\left( s_{ij}-u_{ij}\right) ^2\right\} }. \end{aligned}$$
(11)

Definition 1

Let \(\textbf{s}_i\) satisfy the constraints of Eq. (10). If there is at least one non-zero element \(s_{ij}\ \left( s_{ij}>0\right) \) in \(\textbf{s}_i\) that satisfies \(j\notin \check{\varGamma }\left( i\right) \), then \(\textbf{s}_i\) is called a \(\check{\zeta }\) solution of Eq. (10).

Definition 2

\(\varPi \left( \textbf{s}_i,\check{\varGamma }\left( i\right) \right) =\sum _{j=1}^{n}\mathcal {A}\left( s_{ij},j\right) \), where \(\mathcal {A}\left( s_{ij},j\right) =1\) if \(s_{ij}>0\) and \(j\in \check{\varGamma }\left( i\right) \); otherwise \(\mathcal {A}\left( s_{ij},j\right) =0\).

Lemma 1

For any \(\check{\zeta }\) solution \({\ddot{\textbf{s}}}_i\) of Eq. (10), there exists another solution \({\widetilde{\textbf{s}}}_i\) of Eq. (10) that satisfies \(\varPi \left( {\widetilde{\textbf{s}}}_i,\check{\varGamma }\left( i\right) \right) =\varPi \left( {\ddot{\textbf{s}}}_i,\check{\varGamma }\left( i\right) \right) +1\) such that \(\mathcal {O}\left( {\widetilde{\textbf{s}}}_i\right) \le \ \mathcal {O}\left( {\ddot{\textbf{s}}}_i\right) \).

Proof

Let \(\tau \notin \check{\varGamma }\left( i\right) \) and \(\epsilon \in \check{\varGamma }\left( i\right) \) be indices that satisfy \({\ddot{\textbf{s}}}_i=\left[ \cdots ,{\ddot{s}}_{i\tau },\cdots ,{\ddot{s}}_{i\epsilon },\cdots \right] \) and \({\widetilde{\textbf{s}}}_i=\left[ \cdots ,{\widetilde{s}}_{i\tau },\cdots ,{\widetilde{s}}_{i\epsilon },\cdots \right] \), where \(u_{i\tau } \le u_{i\epsilon }\), \({\ddot{s}}_{i\epsilon }=0\), \({\widetilde{s}}_{i\tau }=0\), \({\ddot{s}}_{i\tau }={\widetilde{s}}_{i\epsilon } > 0\), and \({\ddot{s}}_{ij}={\widetilde{s}}_{ij}\) for any \(j\ne \tau \) and \(j\ne \epsilon \).

Thus, one has:

$$\begin{aligned} \left( {\ddot{s}}_{i\tau }-u_{i\tau }\right) ^2+\left( {\ddot{s}}_{i\epsilon }-u_{i\epsilon }\right) ^2\ge \left( {\widetilde{s}}_{i\tau }-u_{i\tau }\right) ^2+\left( {\widetilde{s}}_{i\epsilon }-u_{i\epsilon }\right) ^2, \end{aligned}$$
(12)

Indeed, subtracting the right-hand side of Eq. (12) from its left-hand side gives \(2{\ddot{s}}_{i\tau }\left( u_{i\epsilon }-u_{i\tau }\right) \ge 0\), which means \(\mathcal {O}\left( {\widetilde{\textbf{s}}}_i\right) \le \mathcal {O}\left( {\ddot{\textbf{s}}}_i\right) \). Thus, Lemma 1 holds.    \(\blacksquare \)

Lemma 2

For any \(\check{\zeta }\) solution \({\ddot{\textbf{s}}}_i\) of Eq. (10), there exists another non-\(\check{\zeta }\) solution \({\widetilde{\textbf{s}}}_i\) of Eq. (10) such that \(\mathcal {O}\left( {\widetilde{\textbf{s}}}_i\right) \le \mathcal {O}\left( {\ddot{\textbf{s}}}_i\right) \).

Proof

By repeatedly applying Lemma 1, every non-zero element of \({\ddot{\textbf{s}}}_i\) indexed outside \(\check{\varGamma }\left( i\right) \) can be moved to a zero position inside \(\check{\varGamma }\left( i\right) \) without increasing the objective; such a zero position always exists because \(||{\ddot{\textbf{s}}}_i||_0 \le k = |\check{\varGamma }\left( i\right) |\). Thus, Lemma 2 holds when Lemma 1 holds.    \(\blacksquare \)

Lemma 3

Let \(\textbf{s}_i^*\) be an optimal solution of Eq. (10). Then there exists a non-\(\check{\zeta }\) solution \({\dddot{\textbf{s}}}_i^*\) that satisfies \(\mathcal {O}\left( {\dddot{\textbf{s}}}_i^*\right) =\mathcal {O}\left( \textbf{s}_i^*\right) \).

Proof

If \(\textbf{s}_i^*\) is itself a non-\(\check{\zeta }\) solution, the claim is trivial. Otherwise, by Lemma 2 there exists a non-\(\check{\zeta }\) solution \({\dddot{\textbf{s}}}_i^*\) with \(\mathcal {O}\left( {\dddot{\textbf{s}}}_i^*\right) \le \mathcal {O}\left( \textbf{s}_i^*\right) \); since \(\textbf{s}_i^*\) is optimal, equality holds. Thus, Lemma 3 holds when Lemma 2 holds.    \(\blacksquare \)

Lemma 3 demonstrates that if \(\textbf{s}_i^*\) is an optimal solution of Eq. (10) and it is a \(\check{\zeta }\) solution, then there exists a non-\(\check{\zeta }\) solution \({\dddot{\textbf{s}}}_i^*\) which is also an optimal solution to the problem (10).

Theorem 2

Let \({\bar{\textbf{s}}}_i^*\) be the optimal solution to the problem (11), then \({\bar{\textbf{s}}}_i^*\) is also an optimal solution to the problem (10).

Proof

Let \(\textbf{s}_i^*\) be the optimal solution of Eq. (10). According to Lemma 3, there exists a non-\(\check{\zeta }\) solution \({\dddot{\textbf{s}}}_i^*\) that satisfies \(\mathcal {O}\left( {\dddot{\textbf{s}}}_i^*\right) =\mathcal {O}\left( \textbf{s}_i^*\right) \). Because the entries of \({\dddot{\textbf{s}}}_i^*\) outside \(\check{\varGamma }\left( i\right) \) are zero, it follows that:

$$\begin{aligned} \mathcal {O}\left( {\dddot{\textbf{s}}}_i^*\right) =\check{\sigma }+\mathcal {O}_1\left( {\dddot{\textbf{s}}}_i^*\right) , \end{aligned}$$
(13)

where \(\check{\sigma }=\sum _{j=1,j\notin \check{\varGamma }(i)}^{n}\left( 0-u_{ij}\right) ^2\), and \({\dddot{\textbf{s}}}_i^*\ge \textbf{0}\), \({\dddot{\textbf{s}}}_i^*\textbf{1}^T=1\), \(||{\dddot{\textbf{s}}}_i^*||_0 \le k\). Since \(\check{\varGamma }\left( i\right) \) is fixed, \( \check{\sigma }\) is a constant. Similarly, we have \(\mathcal {O}\left( {\bar{\textbf{s}}}_i^*\right) =\check{\sigma }+\mathcal {O}_1\left( {\bar{\textbf{s}}}_i^*\right) \). Since \({\bar{\textbf{s}}}_i^*\) is the optimal solution to problem (11) and \({\dddot{\textbf{s}}}_i^*\) is feasible for problem (11), we have \(\mathcal {O}_1\left( {\bar{\textbf{s}}}_i^*\right) \le \mathcal {O}_1\left( {\dddot{\textbf{s}}}_i^*\right) \). So, \(\mathcal {O}\left( {\bar{\textbf{s}}}_i^*\right) \le \mathcal {O}\left( {\dddot{\textbf{s}}}_i^*\right) =\mathcal {O}\left( \textbf{s}_i^*\right) \). Since \({\bar{\textbf{s}}}_i^*\) is also feasible for problem (10), it is an optimal solution to problem (10). Thus, Theorem 2 holds.    \(\blacksquare \)

Problem (11) can be written in a vector format and solved efficiently by utilizing the KKT conditions and the Newton method [17].
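To make the row-wise update concrete, the following minimal Python sketch solves problem (11) by fixing the entries outside \(\check{\varGamma }\left( i\right) \) to zero and projecting the remaining k entries of \(\textbf{u}_i\) onto the probability simplex. We use the standard sorting-based simplex projection here in place of the KKT/Newton procedure of [17]; both solve the same subproblem, and sorting is consistent with the \(O(k\log k)\) per-row cost noted in the complexity analysis below.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1}
    using the classical sorting-based algorithm."""
    u = np.sort(v)[::-1]                                  # descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def update_row(u_i, k):
    """Solve problem (11) for one row: keep the k largest entries of u_i
    (the index set Gamma(i)), project them onto the simplex, zero the rest."""
    s_i = np.zeros_like(u_i)
    top_k = np.argpartition(u_i, -k)[-k:]                 # indices of the k largest entries
    s_i[top_k] = project_simplex(u_i[top_k])
    return s_i
```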

Algorithm 1. The optimization algorithm of SSC-\(l_0\).

The algorithm of SSC-\(l_0\) is summarized in Algorithm 1. Assume that Algorithm 1 iterates at most T times. The time cost of updating \(\textbf{M}\) is \(O(T\left( dn^2+n^3\right) )\), where n is the number of samples and d is the number of features. The time cost of solving problem (11) is \(O(Tnk\log k)\) [20], where k is the maximum number of non-zero elements in each coefficient vector. Given that \(k\ll n\), the overall time complexity of SSC-\(l_0\) is \(O(n^3)\).
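As a complement to Algorithm 1, the sketch below shows the alternating loop in Python, reusing update_M and update_row from the earlier sketches; the initialization, stopping tolerance, and iteration cap are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

def ssc_l0(X, k, alpha=1.0, gamma=1.0, max_iter=100, tol=1e-6):
    """Alternating optimization of problem (6): update M via Eq. (8),
    then update each row of S by solving problem (11)."""
    n = X.shape[0]
    S = np.eye(n)                                 # feasible start: s_i >= 0, sums to 1, ||s_i||_0 <= k
    prev_obj = np.inf
    for _ in range(max_iter):
        M = update_M(X, S, alpha)                 # closed-form M update, Eq. (8)
        U = (alpha / (alpha + gamma)) * M         # rows u_i = alpha/(alpha+gamma) * m_i
        S = np.vstack([update_row(U[i], k) for i in range(n)])
        obj = (np.linalg.norm(X - M @ X, 'fro') ** 2
               + alpha * np.linalg.norm(M - S, 'fro') ** 2
               + gamma * np.linalg.norm(S, 'fro') ** 2)
        if prev_obj - obj < tol:                  # monotone decrease, cf. Eq. (14)
            break
        prev_obj = obj
    return S
```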

After a non-negative coefficient matrix \(\textbf{S}\) is learned, spectral clustering can be executed based on the produced \(\textbf{S}\).
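As a usage example, the learned \(\textbf{S}\) can be symmetrized into an affinity matrix and passed to an off-the-shelf spectral clustering routine; the scikit-learn call below is one common choice and is not necessarily the implementation used in the experiments.

```python
from sklearn.cluster import SpectralClustering

def cluster_from_S(S, n_clusters):
    """Symmetrize the learned coefficient matrix and run spectral clustering on it."""
    W = 0.5 * (S + S.T)                           # symmetric non-negative affinity matrix
    model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return model.fit_predict(W)
```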

In Algorithm 1, the optimal solutions of the subproblems for \(\textbf{M}\) and \(\textbf{S}\) are obtained from Eq. (8) and problem (11), respectively. Then,

$$\begin{aligned} \begin{aligned} J_{SSC-l_0}\left( \textbf{M}^{\left( t\right) },\textbf{S}^{\left( t\right) }\right) \le J_{SSC-l_0}\left( \textbf{M}^{\left( t-1\right) },\textbf{S}^{\left( t\right) }\right) \le J_{SSC-l_0}\left( \textbf{M}^{\left( t-1\right) },\textbf{S}^{\left( t-1\right) }\right) , \end{aligned} \end{aligned}$$
(14)

where \(t-1\) and t represent the \((t-1)\)th and tth iterations, respectively. Inequality (14) indicates that the objective function values of SSC-\(l_0\) decrease monotonically; since the objective is bounded below by zero, SSC-\(l_0\) is convergent.

4 Experiments

In this study, all experiments are conducted on a personal computer with an Intel i5-9500 CPU @ 3.00 GHz and 8 GB RAM. The code is implemented in 64-bit MATLAB R2021a.

4.1 Datasets and Comparison Methods

Ten image datasets are utilized for the experiments, including five face image datasets (Yale, Jaffe [21], ORL1024 [22], PIE [23], and FERET [24]), four handwritten digit image datasets (Binary, Semeion, Digit, and Minist), and one palm image dataset (Palm) [9]. The details of these datasets are shown in Table 1.

Table 1. The Benchmark Datasets.

Seven clustering methods are selected for comparison, including Least Squares Regression (LSR1) [16], Sparse Subspace Clustering (SSC) [12], Low-Rank Representation (LRR) [13], Iteratively Reweighted Least Squares (IRLS) [15], Block-Diagonal Representation (BDR) [14], Clustering with Typicality-aware Adaptive Graph (CTAG) [11], and the Simplex Sparse Representation (SSR) [17]. Please see Sect. 1 for more details on these methods.

For SSC-\(l_0\) and the seven compared methods, the parameters are tuned over the range \(\{{10}^{-3},{10}^{-2},\) \({10}^{-1},{10}^0,{10}^1,{10}^2,{10}^3\}\). Spectral clustering is applied to the learned coefficient matrices, followed by k-means post-processing. To reduce sensitivity to initialization, k-means is repeated 50 times, and the reported result is the average of the 20 trials with the lowest loss; see the sketch below.
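A minimal sketch of this k-means post-processing protocol is given below; the accuracy function acc_fn (e.g., best-map ACC) and the seeding scheme are assumptions supplied by the caller, not details specified in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def repeated_kmeans(embedding, n_clusters, labels_true, acc_fn,
                    n_runs=50, n_keep=20, seed=0):
    """Run k-means n_runs times on the spectral embedding, keep the n_keep
    runs with the lowest k-means loss (inertia), and average their ACC."""
    rng = np.random.RandomState(seed)
    results = []
    for _ in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=1,
                    random_state=int(rng.randint(1 << 30)))
        pred = km.fit_predict(embedding)
        results.append((km.inertia_, acc_fn(labels_true, pred)))
    results.sort(key=lambda t: t[0])              # sort by k-means loss
    return float(np.mean([acc for _, acc in results[:n_keep]]))
```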

4.2 Clustering Analysis

Clustering Performance. Table 2 shows the ACC values obtained by the different methods on the selected benchmark datasets. In Table 2, the best results are shown in bold and the second-best results are marked in brackets. The average result of each method over all selected datasets is listed in the last row of Table 2. From Table 2, we have the following observations:

Table 2. ACC Values of the SSC-\(l_0\) and the Selected Methods on Benchmark Datasets. (%)
  1. LRR and IRLS rank second and third, respectively, in terms of average ACC values among the eight methods. This can be attributed to their ability to recover the row space of the data by learning a coefficient matrix with a low-rank structure. This characteristic aids in enhancing their robustness and improving their clustering ability.

  2. SSR exhibits the lowest average ACC value, which can be attributed to the absence of an effective regularization term.

  3. SSC-\(l_0\) achieves the best ACC values on all selected datasets except PIE and Digit. The reason is that, under the \(l_0\) inequality constraint, SSC-\(l_0\) tends to represent each sample as a linear combination of samples from the same class, an ability the other seven methods lack. In general, it is easier for data points from the same class to represent each other. When reconstructing a point, if the number of non-zero elements in the coefficient vector is limited, the model is more likely to select data points from the same class to complete the reconstruction. Therefore, the \(l_0\) inequality constraint improves the performance of SSC-\(l_0\).

Visualization of the Similarity Matrix. Figure 1 visualizes the coefficient matrices obtained by SSC-\(l_0\) on the Jaffe, PIE, and Palm datasets. In Fig. 1, the coefficient matrices exhibit a block-diagonal structure, indicating that in SSC-\(l_0\) similar samples tend to belong to the same class; this characteristic enhances the clustering performance of SSC-\(l_0\). The block-diagonal structure arises because, when the number of non-zero elements in the coefficient vector is limited, a point is more likely to be reconstructed by points from the class to which it belongs.

Fig. 1. The coefficient matrices obtained by SSC-\(l_0\) on different datasets. (a) Jaffe. (b) PIE. (c) Palm.

Parameter Sensitivity. SSC-\(l_0\) has three parameters: \(\alpha \), \(\gamma \), and k. Figure 2 illustrates the impact of these parameters on SSC-\(l_0\) performance across different datasets. The results indicate that SSC-\(l_0\) is sensitive to the values of \(\alpha \), \(\gamma \), and k. To obtain optimal performance, a grid search method is recommended for parameter selection in SSC-\(l_0\).

Fig. 2. ACC values of SSC-\(l_0\) on different datasets when varying the parameters \(\alpha \), \(\gamma \), and k. (a) PIE. (b) Binary. (c) Semeion. The color in the figure reflects the ACC values.

Convergence Analysis. Figure 3 shows the convergence curves of SSC-\(l_0\) on the Yale, ORL1024, and FERET datasets. As shown in Fig. 3, the objective function values of SSC-\(l_0\) decrease monotonically, and SSC-\(l_0\) converges after 100 iterations.

4.3 Robustness Analysis

To compare the robustness of the eight methods, the following experiments are designed:

  1. In the ORL2116 dataset, 20% of the samples are randomly selected and corrupted with salt-and-pepper noise at densities of 0.1, 0.2, 0.3, 0.4, and 0.5. Figure 4(a) illustrates the original ORL2116 dataset and the noisy ORL2116 datasets. Figure 5(a) presents the performance of the eight methods on these noisy datasets.

  2. In the ORL1024 dataset, 20% of the samples are randomly selected and corrupted with random noisy blocks of size 5\(\,\times \,\)5, 10\(\,\times \,\)10, 15\(\,\times \,\)15, 20\(\,\times \,\)20, and 25\(\,\times \,\)25 pixels. Figure 4(b) displays the original ORL1024 dataset and the corresponding noisy ORL1024 datasets. The performance of the different methods on these noisy datasets is presented in Fig. 5(b).

  3. In the Jaffe dataset, 20% of the samples are randomly selected and corrupted with random noisy blocks of size 4\(\,\times \,\)4, 8\(\,\times \,\)8, 12\(\,\times \,\)12, 16\(\,\times \,\)16, and 20\(\,\times \,\)20 pixels. Figure 4(c) displays the original Jaffe dataset and the corresponding noisy Jaffe datasets. The performance of the different methods on these noisy datasets is presented in Fig. 5(c).

Fig. 3. Convergence curves of SSC-\(l_0\) on different datasets. (a) Yale. (b) ORL1024. (c) FERET.

Fig. 4. Original datasets and noisy datasets. (a) ORL2116. (b) ORL1024. (c) Jaffe.

From Fig. 5, although the performance of all eight methods degrades as the noise level increases, SSC-\(l_0\) achieves almost the best results at every noise level, which further demonstrates its robustness. The reason is as follows. Clean points have better representation ability than noisy points. When the number of non-zero elements in the coefficient vector is limited, clean points are more likely to be selected for reconstruction; thus, constraining the number of non-zero elements improves model robustness. SSC and SSR use \(l_1\) regularization and the \(l_1\) equality constraint to induce sparsity in the coefficient matrix, but they do not directly constrain the number of non-zero elements in the coefficient vector. In contrast, SSC-\(l_0\) precisely controls the number of non-zero elements, leading to better performance.

Fig. 5. Performance of the different clustering methods on the noisy datasets at different noise levels. (a) Noisy ORL2116 datasets with different noise densities. (b) Noisy ORL1024 datasets with noisy blocks of different sizes. (c) Noisy Jaffe datasets with noisy blocks of different sizes.

5 Conclusions

Data points from the same class can represent each other more easily, and when the number of non-zero elements in the coefficient vector is limited, a model is more likely to select data points from the same class to complete the reconstruction. Motivated by this, we propose Sparse Subspace Clustering with the \(l_0\) inequality constraint (SSC-\(l_0\)) to conduct representation learning among the points in the same class. By introducing the simplex constraint, an optimization method for the \(l_0\) inequality constraint is proposed, and its convergence is theoretically analyzed. Since the \(l_0\) inequality-constrained problem is difficult to solve, the proposed optimization method has the potential to be widely used in other sparse learning models. Extensive experiments demonstrate the superiority of SSC-\(l_0\). Establishing a one-step clustering method based on SSC-\(l_0\) may further improve the performance of the proposed clustering model, which is our future work.