1 Introduction

Due to the so-called "curse of dimensionality," many traditional clustering algorithms do not perform well on high-dimensional data [1]. Subspace clustering (SC) was therefore conceived to handle this problem. SC operates under the assumption that a high-dimensional space can be represented as a union of multiple lower-dimensional subspaces, with each subspace corresponding to a different underlying factor or pattern in the data [2]. The obtained subspace structure is then used as a data prior in constructing an affinity matrix for clustering, usually followed by spectral post-processing. Notably, sparse subspace clustering (SSC) [3] and low-rank representation (LRR) [4, 5] are two popular methods for constructing the affinity matrix. SSC assumes that each data point can be represented as a linear combination of a limited number of other data points in the same subspace, whereas LRR assumes that the data can be represented by a low-rank matrix. The choice between the two methods often depends on the specific characteristics of the dataset; for example, LRR is particularly effective at handling the influence of noise and outliers [6, 7]. Moreover, many researchers have proposed hybrid methods that combine the sparsity of SSC with the global nature of LRR to improve the performance of SC.

Notably, Wang et al. [8] proposed a low-rank subspace sparse representation (LRSR) framework for subspace segmentation, which can recover and segment embedding subspaces simultaneously. Zhu et al. [9] introduced the low-rank sparse subspace (LSS) clustering method. LSS addresses the problems associated with the two-stage approach of traditional graph clustering methods by avoiding spectral post-processing. The LSS method dynamically learns the affinity matrix from the low-dimensional space of the original data: it uses a transformation matrix to project the original data into a low-dimensional space, conducts feature selection and subspace learning, and utilizes a rank constraint to obtain the clustering results directly. Sui, Wang and Zhang [10] presented a subspace clustering method that formulates the problem as structured representation learning. Specifically, they discovered that propagating a low-rank structure promotes sparsity in the data representation, resulting in a more robust clustering description; their method therefore leverages two cascaded self-expressions to implement the propagation. Xia et al. [11] proposed a nonconvex low-rank learning framework, which exploits the low-rank property of nonlinear data and induces a high-dimensional Hilbert space that approaches the true feature space. Recently, Teng et al. [12] introduced a method called kernel-based sparse representation learning with global and local low-rank label (KSR-GL3), which uses global and local low-rank label constraints to ensure the semantic invariance, low-rankness, and discrimination of features during learning. Despite the successes of these existing SC methods, most of them still face the problems of noise distortion and overlapping subspaces, which can result in samples being wrongly assigned to more than one subspace, whereas ideally each sample should belong to only one subspace.

To address the above problem, this paper proposes a new method that induces the data manifold via two representative structures. Motivated by the robustness of LRR, we begin by learning a low-rank matrix to characterize the original data and capture its global structure. To complement this, a two-way manifold learning strategy is incorporated into our model so that an affinity matrix is simultaneously learned from a \(k\)-symmetric nearest neighbour graph of the data. We further introduce a dual regularization term that enables the LRR and the affinity graph to guide each other adaptively, so that the global and neighbourhood structures of the data are captured jointly, leading to more accurate clustering results. By imposing several non-negative constraints and a rank constraint on the affinity matrix, we also avoid the spectral post-processing procedure, allowing us to obtain the final clustering results directly and avoid extra computational cost. Our main contributions are as follows.

  1)

    We propose a new method that uses two-way representative structures to effectively capture the data manifold and improve clustering performance, unlike traditional subspace clustering methods, which obtain the clustering from a single data structure.

  2)

    We avoid spectral post-processing procedures by constraining the learned affinity matrix with a rank constraint and several nonnegative constraints in order to directly derive the clustering matrix.

  3)

    Extensive experiments are performed to evaluate the effectiveness of the proposed method. The results show that our method outperforms state-of-the-art (SOTA) methods on six evaluation metrics.

The rest of this paper is structured as follows. Section 2 presents the related works. The proposed method formulation is described in Section 3. Results and analysis are provided in Section 4 while Section 5 concludes the paper.

2 Related work

Over the years, several SC approaches have been proposed to tackle the problem of subspace segmentation. These methods can be categorised by how they learn the affinity matrix. In the following paragraphs, we describe the LRR-based methods that are most closely related to our proposed approach. Given an unlabelled data matrix \(X\), the traditional LRR model can be formulated as:

$$\underset{A,E}{\min}\ \Vert A\Vert_{*}+\lambda \Vert E\Vert_{2,1},\quad s.t.\ X=XA+E,$$
(1)

where \(X\in {\mathbb{R}}^{d \times n}\) serves as the self-dictionary used to learn a low-rank matrix \(A \in {\mathbb{R}}^{n \times n}\) that characterizes the data, such that \({A}_{ij}\) is non-zero only if samples \({x}_{i}\) and \({x}_{j}\) belong to the same subspace. \(E \in {\mathbb{R}}^{d \times n}\) is an error matrix that captures the error components, since real-world data are rarely clean. Considering that data corruption is usually sample-specific, the \({L}_{2,1}\)-norm is imposed to encourage the columns of \(E\) to be zero. The goal is to recover the clean data as the product \(XA\) of the original data and the low-rank matrix, while isolating corrupted samples through the constraint \(E=X-XA\). Specifically, in order to reveal the true membership of \(X\), LRR expects \(A\) to have a \(k\)-block-diagonal structure. This is difficult to obtain in complex applications, so when the block-diagonal structure fails to hold, LRR relies on spectral clustering [13] (via spectral post-processing of \(A\)) to preserve the robustness of the segmentation, which is itself uncertain. As a result, various methods have been proposed to strengthen the robustness of LRR.
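For concreteness, the following is a minimal sketch (our own illustration, not the authors' code) of this conventional two-stage pipeline: a coefficient matrix \(A\) returned by an LRR solver is symmetrised into an affinity matrix and handed to spectral clustering as the post-processing step.

```python
# Hypothetical sketch of the two-stage LRR pipeline described above.
# `A` is assumed to be the n x n coefficient matrix of an LRR solver (not shown).
import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_postprocess(A, n_clusters):
    """Build an affinity from a coefficient matrix and cluster it spectrally."""
    W = 0.5 * (np.abs(A) + np.abs(A.T))   # symmetric, non-negative affinity
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(W)
    return labels
```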

Structure-constrained LRR (SC-LRR) [14] uses a predefined weight matrix to analyse the structure of multiple disjoint subspaces and reveal their relationship for clustering. Laplacian regularized low-rank representation (LRLRR) [15] incorporates hypergraph Laplacian regularization to capture intrinsic non-linear geometric information in data. LRLRR was motivated by LRR's inability to consider non-linear geometric structures within data, which may result in locality and similarity information being missed during the learning process. Low-rank representation with adaptive graph regularization (LRR_AGR) [16] integrates a distance regularization term and a non-negative constraint to enable the simultaneous exploitation of global and local information for graph learning; a novel rank constraint is also introduced to encourage the learned graph to have clear clustering structures. Adaptive Structure-constrained Low-Rank Coding (AS-LRC) [17] combines an adaptive weighting-based, block-diagonal structure-constrained low-rank representation with group sparse salient feature extraction in a unified framework. AS-LRC first performs a latent decomposition of the given data into three parts: a low-rank reconstruction by a block-diagonal code matrix, a group sparse locality-adaptive salient feature part, and a sparse error part. The auto-weighting matrix is computed from the locality-adaptive features, which encourages the codes to be block-diagonal and avoids the problem of choosing an optimal neighbourhood size or kernel width for weight assignment. Furthermore, considering that it is difficult for SC-LRR to determine the best value for the parameter of its constraint term, an improved structured low-rank representation (ISLRR) [18] was introduced as an extension of SC-LRR. ISLRR includes the structure information of datasets in the equality constraint term of LRR, avoiding the need to adjust the extra parameter. Similarly, SinNLRR [19] uses an adaptive penalty selection method to avoid sensitivity to the parameters; it also imposes a non-negative and low-rank structure on the affinity matrix to ensure more robust and accurate clustering results.

Recently, some adaptive affinity graph methods have emerged. For example, Adaptive Low-rank Representation (ALRR) [20] adaptively learns a dictionary from the data to make the resulting model more resistant to the negative impact of noise. Adaptive Data Correction-based Graph Clustering (ADCGC) [21] improves clustering performance by removing errors and noise from the original data. Adaptive Graph Regularization from Clean Data (RLRR-AGR) [22] unifies graph construction and the subsequent optimization in one framework to capture the data's underlying non-linear geometric information. Despite the effectiveness of these adaptive methods over fixed-graph ones, one drawback is that the data's manifold structure is pre-captured using an inflexible low-rank matrix. This is especially problematic when the data are severely corrupted, so that similar samples end up far apart while unrelated samples remain close together; in such cases, the constructed graph may not accurately represent the actual geometric structure of the data. Hence, our proposed method induces the geometric structure of the data via a two-way representative technique to ensure that only similar samples are connected while samples from different classes are kept apart.

3 The proposed method

This section formulates the proposed method and provides an optimization method to solve it.

3.1 Model formulation

To achieve excellent performance in subspace clustering, an accurate \(k\)-block-diagonal subspace structure, with each sample belonging to exactly one subspace, is required [23]. However, most SC methods are not robust enough to guarantee this because: (1) they focus on either the local or the global data structure and simply assume that similar samples reside close to one another (and that corrupted samples always stay far from all others, including their own kind) when reconstructing each sample from similar ones via the self-expressiveness property of data. In reality, two unrelated samples may lie close together and be regarded as similar because corruption causes the data to be arbitrarily distributed, so the subspace structure learned in this way has negative correlations embedded in it. (2) They construct an affinity matrix from such a subspace structure in advance and feed it to a spectral post-processing procedure, so an error in the first step propagates to the final step. To address this, our core idea in this paper is to let two representative subspace structures guide each other so as to capture the neighbourhood and global data structures simultaneously through a dual regularization term, without a spectral post-processing procedure. Thus, adopting the LRR \(A\) of Eq. (1) as our first structure, we further learn a second structure \(B\in {\mathbb{R}}^{n \times n}\). To this end, we use the \(k\)-nearest-neighbour graph approach to generate a matrix \(P \in {\mathbb{R}}^{n \times n}\) from \(X\). This involves identifying \(k\) symmetric nearest neighbours for each sample in \(X\) using the Euclidean distance as the distance metric (Reference [24] contains a detailed explanation of the \(k\)-nearest-neighbour graph). The spectral clustering algorithm is then applied directly to obtain the affinity matrix \(B\) through the low-dimensional embedding matrix of \(P\) as follows.

$$\text{min }Tr\left({Q}^{T}PQ\right)+{\Vert B-Q{Q}^{T}\Vert }_{F}^{2}, s.t.,{Q}^{T}Q=I, B1=1, B\ge 0,$$
(2)

where the constraints \(B1=1\) and \(B\ge 0\) ensure that the entries of \(B\) are non-negative and that each row of \(B\) sums to one. The low-dimensional embedding matrix \(Q\) is used to obtain the affinity matrix \(B\) because, according to Zhan et al. [25], such an embedding matrix can guarantee a block-diagonal structure. Ideally, \(B\) could then be used directly for clustering. However, since \(A\) and \(B\) are obtained independently, we align both structures to capture the global and neighbourhood structures of the data simultaneously and thereby obtain a more robust solution. To this end, we introduce a dual regularization term \({\Vert B-A\Vert }_{F}^{2}\), where the Frobenius norm further encourages a block-diagonal solution. A minimal sketch of constructing \(P\) and the embedding \(Q\) is given below; combining Eq. (1), Eq. (2) and the dual regularization term, we then arrive at the model in Eq. (3).
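As referenced above, the following is a hedged sketch of constructing the \(k\)-symmetric nearest-neighbour graph \(P\) and the embedding \(Q\) of the trace-minimization part of Eq. (2). The binary connectivity weighting and the union-style symmetrisation are assumptions on our part; the paper only specifies \(k\) symmetric nearest neighbours under the Euclidean distance.

```python
# Sketch of building P from X (features x samples) and of computing Q.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_knn_graph(X, k):
    """Symmetric k-NN graph: connect i and j if either is a k-nearest
    neighbour of the other (Euclidean distance); weighting is an assumption."""
    W = kneighbors_graph(X.T, n_neighbors=k, mode="connectivity").toarray()
    return np.maximum(W, W.T)                  # symmetrise the adjacency

def embedding_from_P(P, c):
    """min Tr(Q^T P Q) s.t. Q^T Q = I is solved by the eigenvectors of P
    associated with its c smallest eigenvalues (columns are orthonormal)."""
    eigvals, eigvecs = np.linalg.eigh(0.5 * (P + P.T))
    return eigvecs[:, :c]
```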

$$\begin{array}{c}\min\ \Vert A\Vert_{*}+\Vert B-QQ^{T}\Vert_{F}^{2}+Tr\left(Q^{T}PQ\right)+\Vert B-A\Vert_{F}^{2}+\Vert E\Vert_{2,1},\\ s.t.\ X=XA+E,\ Q^{T}Q=I,\ B1=1,\ B\ge 0,\end{array}$$
(3)

As depicted by Eq. (3), the matrices \(A\) and \(B\) are learned simultaneously by allowing them to approximate one another so that a better solution is found adaptively. Furthermore, unlike most existing approaches, we avoid the spectral post-processing procedure by imposing a rank constraint \(rank\left({L}_{B}\right)=n-c\) on the Laplacian matrix of \(B\), which allows \(B\) to directly encode our clustering structure according to Theorem 1. Hence, our proposed model, depicted in Fig. 1, is:

$$\begin{array}{c}\min\ \Vert A\Vert_{*}+{\lambda }_{1}\Vert B-QQ^{T}\Vert_{F}^{2}+{\lambda }_{2}Tr\left(Q^{T}PQ\right)+{\lambda }_{3}\Vert B-A\Vert_{F}^{2}+{\lambda }_{4}\Vert E\Vert_{2,1},\\ s.t.\ X=XA+E,\ Q^{T}Q=I,\\ B1=1,\ B\ge 0,\ rank\left({L}_{B}\right)=n-c,\end{array}$$
(4)

where \({L}_{B}={D}_{B}-B\), and \({D}_{B}\in {\mathbb{R}}^{n \times n}\) denotes the diagonal matrix whose \(i\)-th diagonal entry is \({\sum }_{j}{B}_{ij}\) (equal to 1 under the constraint \(B1=1\)). The parameters \({\lambda }_{1}\), \({\lambda }_{2}\), \({\lambda }_{3}\) and \({\lambda }_{4}\) are used to balance the different terms. To relax Eq. (4) and make it easier to solve, we introduce an auxiliary variable \(S=A\). Additionally, to address the non-linearity of the rank constraint, we follow a strategy similar to a previous study [7] and denote the \(i\)-th smallest eigenvalue of \({L}_{B}\) by \({\theta }_{i}({L}_{B})\). Hence, with a sufficiently large \({\lambda }_{5}\), Eq. (4) can be rewritten as:

Fig. 1
figure 1

The framework of the proposed method. Firstly, by adopting the LRR \(A\) (denoted as \(Z\) in the Figure) of Eq. (1) as our first structure, we further learn a second structure \(B\), a similarity matrix (denoted as \(S\) in the Figure). For this, we use the \(k\)-nearest neighbour graph approach to first generate a matrix \(P\), through which we learn the low embedding matrix \(Q\) used to obtain \(B\) directly

$$\begin{array}{c}\min\ \Vert S\Vert_{*}+{\lambda }_{1}\Vert B-QQ^{T}\Vert_{F}^{2}+{\lambda }_{2}Tr\left(Q^{T}PQ\right)+{\lambda }_{3}\Vert B-A\Vert_{F}^{2}+{\lambda }_{4}\Vert E\Vert_{2,1}+{\lambda }_{5}{\sum }_{i=1}^{c}{\theta }_{i}({L}_{B}),\\ s.t.\ X=XA+E,\ Q^{T}Q=I,\\ B1=1,\ B\ge 0,\ S=A,\end{array}$$
(5)

Theorem 1 [7]: If \(B\) is non-negative, the multiplicity \(c\) of the zero eigenvalue of the graph Laplacian \({L}_{B}\) equals the number of connected components in the graph associated with \(B\).

Refer to Proposition 2 of reference [26] for a detailed proof of Theorem 1.
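The following toy example illustrates Theorem 1 numerically: for a block-structured, non-negative \(B\) (an arbitrary example of ours), the number of (near-)zero eigenvalues of \({L}_{B}\) equals the number of connected components, i.e. the number of clusters.

```python
# Numerical illustration of Theorem 1 on a toy two-block affinity matrix.
import numpy as np

B = np.zeros((6, 6))
B[:3, :3] = 1.0 / 3.0          # first connected component (3 samples)
B[3:, 3:] = 1.0 / 3.0          # second connected component (3 samples)

L_B = np.diag(B.sum(axis=1)) - B
eigvals = np.linalg.eigvalsh(L_B)
n_components = np.sum(eigvals < 1e-10)
print(n_components)            # prints 2: two zero eigenvalues, two clusters
```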

3.2 Optimization

Owing to its efficiency, we adopt the Augmented Lagrange Multiplier (ALM) method [22, 23] to solve the objective function in Eq. (5). Firstly, by Ky Fan's theorem [28], minimizing \({\sum }_{i=1}^{c}{\theta }_{i}({L}_{B})\) is equivalent to minimizing \(Tr({F}^{T}{L}_{B}F)\) subject to \({F}^{T}F=I\), where \(F\in {\mathbb{R}}^{n \times c}\). Thus, we have:

$$\begin{array}{c}\min\ \Vert S\Vert_{*}+{\lambda }_{1}\Vert B-QQ^{T}\Vert_{F}^{2}+{\lambda }_{2}Tr\left(Q^{T}PQ\right)+{\lambda }_{3}\Vert B-A\Vert_{F}^{2}+{\lambda }_{4}\Vert E\Vert_{2,1}+{\lambda }_{5}Tr({F}^{T}{L}_{B}F),\\ s.t.\ X=XA+E,\ Q^{T}Q=I,\\ B1=1,\ B\ge 0,\ S=A,\ {F}^{T}F=I,\end{array}$$
(6)

Consequently, we obtain the Lagrange function of Eq. (6) as follows:

$$\begin{array}{c}\min\ \Vert S\Vert_{*}+{\lambda }_{1}\Vert B-QQ^{T}\Vert_{F}^{2}+{\lambda }_{2}Tr\left(Q^{T}PQ\right)+{\lambda }_{3}\Vert B-A\Vert_{F}^{2}+{\lambda }_{4}\Vert E\Vert_{2,1}+{\lambda }_{5}Tr\left({F}^{T}{L}_{B}F\right)\\ +Tr\left[{Y}_{1}^{T}\left(X-XA-E\right)\right]+Tr\left[{Y}_{2}^{T}\left(S-A\right)\right]+\frac{\mu }{2}\left(\Vert X-XA-E\Vert_{F}^{2}+\Vert S-A\Vert_{F}^{2}\right),\end{array}$$
(7)

where \({Y}_{1}\) and \({Y}_{2}\) are the Lagrange multipliers associated with the two equality constraints and \(\mu >0\) is the penalty parameter. By separating the terms of Eq. (7) that involve each variable, the minimization subproblem and its optimal solution for each variable are given below.

\({\varvec{S}}\) subproblem:

$$\text{arg}\underset{S}{\text{min}}\frac{1}{\mu }{\Vert S\Vert }_{*}+\frac{1}{2}{\Vert S-\left(A+\frac{{Y}_{2}}{\mu }\right)\Vert }_{F}^{2}$$
(8)

The optimal solution to Eq. (8) is given by the singular value thresholding (SVT) operator [27], defined as \({Z}_{\mu }\left[Y\right]=U{\vartheta }_{\mu }\left[\Sigma \right]{V}^{T}\) for the singular value decomposition \(Y=U\Sigma {V}^{T}\), which yields

$$S={Z}_{\frac{1}{\mu }}\left[A+\frac{{Y}_{2}}{\mu }\right],$$
(9)

where \({\vartheta }_{\mu }\left[\Sigma \right]=min\left(0,\Sigma +\mu \right)+max(0,\Sigma -\mu )\).
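A short sketch of this SVT step, in our own notation, is given below; it simply soft-thresholds the singular values by \(1/\mu\), as prescribed by Eq. (9).

```python
# Sketch of the singular value thresholding step of Eq. (9).
import numpy as np

def svt(Y, tau):
    """Shrink the singular values of Y by tau and reassemble the matrix."""
    U, sigma, Vt = np.linalg.svd(Y, full_matrices=False)
    sigma_shrunk = np.maximum(sigma - tau, 0.0)   # singular values are >= 0
    return U @ np.diag(sigma_shrunk) @ Vt

# Usage inside the ALM loop (variable names are ours):
# S = svt(A + Y2 / mu, 1.0 / mu)
```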

\({\varvec{E}}\) subproblem:

$$E=\text{arg}\underset{E}{\text{min}}\frac{{\lambda }_{4}}{\mu }{\Vert E\Vert }_{\text{2,1}}+\frac{1}{2}{\Vert E-\left(X-XA+\frac{{Y}_{1}}{\mu }\right)\Vert }_{F}^{2}$$
(10)
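The paper does not spell this update out, but Eq. (10) is a standard \(L_{2,1}\)-norm proximal problem whose well-known closed-form solution shrinks each column of \(X-XA+{Y}_{1}/\mu\) towards zero. A sketch of that column-wise shrinkage follows; it is our own, not quoted from the paper.

```python
# Column-wise shrinkage solving min_E tau*||E||_{2,1} + 0.5*||E - G||_F^2.
import numpy as np

def prox_l21(G, tau):
    """Zero out weak columns of G and shrink the remaining ones."""
    E = np.zeros_like(G)
    norms = np.linalg.norm(G, axis=0)
    keep = norms > tau
    E[:, keep] = G[:, keep] * (1.0 - tau / norms[keep])
    return E

# Usage (names are ours): E = prox_l21(X - X @ A + Y1 / mu, lam4 / mu)
```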

\({\varvec{Q}}\) subproblem

$$\begin{array}{c}\text{arg}\underset{Q}{\text{min}}\ {\lambda }_{1}\Vert B-QQ^{T}\Vert_{F}^{2}+{\lambda }_{2}Tr\left({Q}^{T}PQ\right),\\ s.t.\ {Q}^{T}Q=I.\end{array}$$
(11)

Expanding the Frobenius norm and discarding the terms that do not depend on \(Q\) (noting that \({Q}^{T}Q=I\)), Eq. (11) can be rewritten as follows:

$$\begin{array}{c}\text{arg}\underset{Q}{\text{max}}Tr\left({Q}^{T}\left(P+2{\lambda }_{1}B\right)Q\right),\\ s.t., {Q}^{T}Q=I.\end{array}$$
(12)

The optimal \(Q\) is given by the \(c\) eigenvectors corresponding to the \(c\) largest eigenvalues of \(P+2{\lambda }_{1}B\).
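A small sketch of this \(Q\) update, following the statement above (the explicit symmetrisation is our own numerical safeguard, not part of the paper):

```python
# Q update: eigenvectors of the c largest eigenvalues of (P + 2*lambda1*B).
import numpy as np

def update_Q(P, B, lam1, c):
    M = P + 2.0 * lam1 * B
    M = 0.5 * (M + M.T)                      # enforce symmetry numerically
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return eigvecs[:, -c:]                   # columns for the c largest ones
```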

\({\varvec{A}}\) subproblem:

$$\text{arg}\underset{A}{\text{min}}\ {\lambda }_{3}\Vert B-A\Vert_{F}^{2}+Tr\left[{Y}_{1}^{T}\left(X-XA-E\right)\right]+Tr\left[{Y}_{2}^{T}\left(A-S\right)\right]+\frac{\mu }{2}\left(\Vert X-XA-E\Vert_{F}^{2}+\Vert S-A\Vert_{F}^{2}\right).$$
(13)

Setting the derivative of Eq. (13) with respect to \(A\) to zero, we obtain \(A\) in closed form:

$$A={\left[\left(\frac{2{\lambda }_{3}}{\mu }+1\right)I+{X}^{T}X\right]}^{-1}\left(\left({X}^{T}X-{X}^{T}E+S\right)+\frac{2{\lambda }_{3}B+{X}^{T}{Y}_{1}-{Y}_{2}}{\mu }\right).$$
(14)
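The following sketch implements Eq. (14) as a single linear solve; the variable names are ours and mirror the text.

```python
# Closed-form A update of Eq. (14), written as one linear system.
import numpy as np

def update_A(X, B, E, S, Y1, Y2, lam3, mu):
    n = X.shape[1]
    XtX = X.T @ X
    lhs = (2.0 * lam3 / mu + 1.0) * np.eye(n) + XtX
    rhs = (XtX - X.T @ E + S) + (2.0 * lam3 * B + X.T @ Y1 - Y2) / mu
    return np.linalg.solve(lhs, rhs)         # avoids forming an explicit inverse
```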

\({\varvec{B}}\) subproblem:

$$\begin{array}{c}\text{arg}\underset{B}{\text{min}}\ {\lambda }_{1}\Vert B-QQ^{T}\Vert_{F}^{2}+{\lambda }_{3}\Vert B-A\Vert_{F}^{2}+{\lambda }_{5}Tr\left({F}^{T}{L}_{B}F\right),\\ s.t.\ B1=1,\ B\ge 0,\ {F}^{T}F=I.\end{array}$$
(15)

For simplicity, we rewrite Eq. (15) as:

$$\begin{array}{c}\underset{B}{\text{min}}{\Vert B\Vert }_{F}^{2}-2\langle {\lambda }_{1}Q{Q}^{T}+{\lambda }_{3}A,B\rangle +{\lambda }_{5}Tr\left({F}^{T}{L}_{B}F\right),\\ s.t., B1=1, B\ge 0, {F}^{T}F=I.\end{array}$$
(16)

By denoting \(2({\lambda }_{1}Q{Q}^{T}+{\lambda }_{3}A)\) by \(W\), we have:

$$\begin{array}{c}\text{arg}\underset{B}{\text{min}}{\Vert B\Vert }_{F}^{2}-Tr\left(W{B}^{T}\right)+{\lambda }_{5}Tr\left({F}^{T}{L}_{B}F\right),\\ s.t., B1=1, B\ge 0, {F}^{T}F=I.\end{array}$$
(17)

Analogously to Eq. (12), the optimal \(F\) is given by \(c\) eigenvectors of \({L}_{B}\), here those corresponding to its \(c\) smallest eigenvalues, since the trace term is minimized. With \(F\) fixed, Eq. (17) can be recast column-wise as:

$$\begin{array}{c}\underset{{b}_{j}}{\text{min}}\ {\sum }_{i}\left({\lambda }_{5}\Vert {f}_{i}-{f}_{j}\Vert_{2}^{2}-{w}_{ij}\right){b}_{ij}+{b}_{j}^{T}{b}_{j},\\ s.t.\ {b}_{j}\ge 0,\ {1}^{T}{b}_{j}=1.\end{array}$$
(18)

Then, denoting \({\lambda }_{5}\Vert {f}_{i}-{f}_{j}\Vert_{2}^{2}-{w}_{ij}\) by \({q}_{ij}\), Eq. (18) simplifies to:

$$\begin{array}{c}\underset{{b}_{j}}{\text{min}}\frac{1}{2}{\Vert {b}_{j}+\frac{{q}_{j}}{2}\Vert }_{2}^{2},\\ s.t., {b}_{j}\ge 0,{1}^{T}{b}_{j}=1.\end{array}$$
(19)

As in Eq. (30) of [28], \({b}_{j}\) is obtained by projecting \(-{q}_{j}/2\) onto the probability simplex, where \(\eta\) is the Lagrange multiplier that enforces \({1}^{T}{b}_{j}=1\):

$${b}_{j}={\left(-\frac{{q}_{j}}{2}+\eta 1\right)}_{+}.$$
(20)
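Equation (20) is a Euclidean projection onto the probability simplex. The sketch below uses a standard sort-based projection routine (our own implementation choice, not prescribed by the paper) and assembles the full column-wise \(B\) update.

```python
# Column-wise B update of Eq. (20) via projection onto the probability simplex.
import numpy as np

def project_simplex(v):
    """Project v onto {b : b >= 0, sum(b) = 1} (standard sort-based routine)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    eta = (css[rho] - 1.0) / (rho + 1.0)       # multiplier enforcing 1^T b = 1
    return np.maximum(v - eta, 0.0)

def update_B(Q, A, F, lam1, lam3, lam5):
    W = 2.0 * (lam1 * Q @ Q.T + lam3 * A)                       # matrix W of Eq. (17)
    sq = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=2)   # ||f_i - f_j||^2
    Qmat = lam5 * sq - W                                        # entries q_ij of Eq. (19)
    Bnew = np.zeros_like(Qmat)
    for j in range(Qmat.shape[1]):
        Bnew[:, j] = project_simplex(-Qmat[:, j] / 2.0)         # Eq. (20)
    return Bnew
```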

Algorithm 1 outlines the complete solution procedure for the proposed method.

figure a

Algorithm 1 Our Proposed Algorithm
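To summarise the optimization, the following high-level sketch assembles the subproblem sketches given above (build_knn_graph, svt, prox_l21, update_Q, update_A, update_B) into an ALM loop. It is our reading of Algorithm 1 rather than the authors' code: the penalty schedule (\(\mu\), \(\rho\)), the initialisation and the stopping rule are common ALM defaults, not values prescribed by the paper. Following Theorem 1, the final labels are read off as the connected components of the learned graph \(B\).

```python
# Hypothetical end-to-end ALM loop; relies on the helper sketches defined above.
import numpy as np
from scipy.sparse.csgraph import connected_components

def proposed_clustering(X, c, k, lams, max_iter=100, tol=1e-6):
    lam1, lam2, lam3, lam4, lam5 = lams   # lam2 weights Tr(Q^T P Q); Eq. (12) as printed absorbs it
    d, n = X.shape
    P = build_knn_graph(X, k)                         # k-symmetric kNN graph of the samples
    A, S, B = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
    E, Y1, Y2 = np.zeros((d, n)), np.zeros((d, n)), np.zeros((n, n))
    mu, rho = 1e-2, 1.2                               # assumed ALM defaults

    for _ in range(max_iter):
        S = svt(A + Y2 / mu, 1.0 / mu)                # Eq. (9)
        E = prox_l21(X - X @ A + Y1 / mu, lam4 / mu)  # Eq. (10)
        Q = update_Q(P, B, lam1, c)                   # Eq. (12)
        A = update_A(X, B, E, S, Y1, Y2, lam3, mu)    # Eq. (14)
        L_B = np.diag(B.sum(axis=1)) - B
        F = np.linalg.eigh(L_B)[1][:, :c]             # c smallest eigenvectors of L_B
        B = update_B(Q, A, F, lam1, lam3, lam5)       # Eq. (20), column by column

        R1, R2 = X - X @ A - E, S - A                 # primal residuals
        Y1, Y2 = Y1 + mu * R1, Y2 + mu * R2           # multiplier updates
        mu = min(rho * mu, 1e6)
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break

    # Following Theorem 1, clusters are the connected components of the learned graph B.
    _, labels = connected_components(np.maximum(B, B.T) > 1e-8)
    return labels
```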

4 Experiments

In this section, the effectiveness of the proposed method is evaluated through several experiments. The experimental settings, including the datasets used and the compared methods, are described in Section 4.1. The results of the experiments and their analysis are presented in Section 4.2.

4.1 Experimental settings

To demonstrate the effectiveness of our proposed method, we performed various experiments on several datasets: COIL20, USPS, ORL, and WebKB (see Table 1 for a summary of each dataset and Fig. 2 for sample images). We compare the performance of our method with seven state-of-the-art methods: LRR [5], LRR + SSC [29], SinNLRR [19], Implicit Block Diagonal Low-Rank Representation (IBDLR) [30], Nonlinear Orthogonal Non-Negative Matrix Factorization (NONMF) [31], Adaptive Low-Rank (ALR) [32], and Coupled Low-Rank Representation (CLRR) [33], using six standard evaluation metrics: accuracy (ACC), normalized mutual information (NMI), F-score, Recall, Precision, and adjusted Rand index (AR). For the parameter settings of each method, we adopt the optimal settings reported in the corresponding literature. We describe the datasets and compared methods below.

Table 1 Summary of the datasets
Fig. 2
figure 2

Illustration images of (a) ORL, (b) COIL20 and (c) USPS datasets

  • COIL20: It consists of a collection of 20 different objects, each captured from various angles under controlled conditions. The dataset contains a total of 1440 samples, where each sample represents an image of one of the objects. We extracted 1024 features for each image.

  • UCI Digits: It holds a collection of handwritten digit images, where each image represents one of the ten digits (0–9). The dataset contains a total of 2000 samples, with each sample represented by a feature vector of 240 dimensions.

  • ORL: It is a collection of grayscale facial images captured from 40 distinct subjects, with each subject having 10 different images taken under varying conditions, resulting in 400 samples. Each image in the ORL dataset has a dimension of 1024, representing a 32 × 32 pixel image.

  • WebKB: It is a widely used benchmark in the field of text classification and information retrieval. It consists of a collection of web pages from various university websites, with each web page representing a document. The dataset contains a total of 203 samples. Each document in the WebKB dataset is represented by a feature vector with a dimension of 1703.

The following is a description of each comparison method.

  • LRR: It aims to find the representation of data samples as linear combinations of bases in a given dictionary with the lowest possible rank.

  • LRR + SSC: It considers data correlation and performs automatic data selection while grouping together correlated data.

  • SinNLRR: It imposes a nonnegative and low rank structure on the similarity matrix to accurately identify and classify cell types.

  • IBDLR: It incorporates implicit feature representation and a block diagonal prior to improve the accuracy of the LRR model.

  • NONMF: It extends the nonlinear orthogonal Nonnegative Matrix Factorization (NMF) framework by incorporating a graph regularization.

  • ALR: It uses an adaptive solution of a low-rank kernel matrix such that the mapped data in the feature space possesses both low-rank characteristics and self-expression properties.

  • CLRR: It uses a k block diagonal regularization term to ensure a block diagonal clustering structure.

4.2 Experimental results

Table 2 shows the results in terms of ACC, NMI, F-score, Recall, Precision, and AR obtained by the different methods on the COIL20 dataset. As can be seen, the proposed method outperforms the other methods. Specifically, the performance of 88.78% (ACC), 95.19% (NMI), 86.72% (F-score), 76.80% (Precision), and 85.87% (AR) achieved by the proposed method exceeds that of the closest competitor, CLRR, by over 1%, and is much better than that of the remaining methods. The closer performance of CLRR is due to its use of the \(k\)-block-diagonal regularization, which has been shown to be very effective in several works such as [34]. Nonetheless, when computational runtime is compared, the proposed approach is more efficient than CLRR, since it avoids the \(k\)-block-diagonal regularization and the complexity of recovering the manifold structure, and instead induces the dual regularization strategy through the \(k\)-nearest-neighbour graph.

Table 2 Clustering results of different algorithms on COIL20 Dataset

Table 3 displays the results in terms of ACC, NMI, F-score, Recall, Precision, and AR obtained by the different methods on the UCI Digits dataset. It can be observed that the proposed method maintains its superiority in handwritten digit recognition. Although CLRR and the proposed method achieve competitive performance, with the proposed method outperforming CLRR in only three of the six metrics (F-score, Precision, and AR), the computational runtime of the proposed method is almost half that of CLRR. Following CLRR and our method are the NONMF and ALR methods. ALR introduces a novel kernel subspace clustering method capable of handling non-linear models, making it a more effective approach for non-linear subspace clustering. NONMF, on the other hand, allows a factorization that preserves the local geometric structure of the data after nonlinear mapping; in other words, NONMF takes the nonlinearity of the underlying manifold into account to enhance its performance in subspace clustering tasks.

Table 3 Clustering results of different algorithms on UCI Digits

Similar results are obtained on the ORL dataset, where CLRR and our method again achieve competitive performance, as shown in Table 4. However, as can be seen in Table 5, the proposed method clearly outperforms its closest competitors, CLRR and IBDLR, on the WebKB dataset. In terms of accuracy, the proposed method achieves 81.77%, surpassing CLRR's 79.80% and IBDLR's 71.67%. Specifically, the margin between the proposed method and CLRR, the closest competitor, is about 2% in ACC, NMI, Precision, and AR, confirming that our dual manifold learning strategy is very effective. In addition, Fig. 3 further demonstrates the effectiveness of the proposed method by showing the scatter plot of the optimized clustering results of the different methods on the COIL20 dataset.

Table 4 Clustering results of different algorithms on ORL Dataset
Table 5 Clustering results of different algorithms on WebKB Dataset
Fig. 3
figure 3

A scatter plot of the optimized clustering results of different methods on the COIL20 dataset

Furthermore, considering that the degree of corruption in real-world data is usually unknown in advance [35], the robustness of the various algorithms to corruption is further investigated by adding different degrees (5% and 10%) of occlusion to the arbitrarily selected ORL dataset. Precisely, we randomly introduced black blocks into the ORL images. According to the results shown in Fig. 4, all methods degrade as the corruption level increases. The reason is straightforward: as the size of the occlusion grows, it obscures the discriminative details present in the data, making it challenging for the clustering methods to accurately identify the underlying structure, and the occlusion can lead to misinterpretation and incorrect cluster assignments. However, the proposed method exhibits a relatively higher level of robustness than the compared methods. This is because the proposed method mitigates the detrimental effects of data corruption to a greater extent via the dual regularization term, allowing it to generate more accurate and reliable clustering results.

Fig. 4
figure 4

Clustering performances of different algorithms on corrupted ORL dataset

4.3 Parameter sensitivity

This section presents experimental results examining the sensitivity of the proposed clustering method's ACC to its parameters on the COIL20, UCI Digits and ORL datasets. The objective function includes five regularization parameters, \({\lambda }_{1}\), \({\lambda }_{2}\), \({\lambda }_{3}\), \({\lambda }_{4}\) and \({\lambda }_{5}\), which need to be set beforehand and determine the impact of the various terms. To investigate their effect on clustering, we define a range of candidate values, {0.001, 0.01, 0.1, 1, 10, 100, 1000}, for each parameter and apply the proposed method with various combinations, as sketched below. Firstly, we fix the other parameters to 1 while varying \({\lambda }_{1}\) and \({\lambda }_{2}\) to observe their influence on the clustering accuracy (ACC). Then, fixing \({\lambda }_{1}\) and \({\lambda }_{2}\) to their optimal values from the previous stage, we vary \({\lambda }_{3}\) and \({\lambda }_{4}\) to understand how the different combinations affect the accuracy of the clustering results. Finally, we vary \({\lambda }_{5}\) while fixing all the other parameters to obtain the final optimal set. Although this process seems expensive, it is reasonable in practice. Moreover, to the best of our knowledge, how to select optimal parameters for different datasets is still an open problem in the literature.
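As referenced above, the following is a hypothetical sketch of this staged search; `run_method` and `accuracy` are stand-ins for the proposed algorithm and the ACC metric, not names taken from the paper.

```python
# Staged grid search over the five regularization parameters (illustrative only).
from itertools import product

GRID = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

def staged_search(X, labels_true, run_method, accuracy):
    best = {"l1": 1, "l2": 1, "l3": 1, "l4": 1, "l5": 1}   # others fixed to 1

    def score(p):
        labels = run_method(X, p["l1"], p["l2"], p["l3"], p["l4"], p["l5"])
        return accuracy(labels_true, labels)

    # Stage 1: vary lambda1, lambda2 with the rest fixed to 1.
    best["l1"], best["l2"] = max(
        product(GRID, GRID), key=lambda v: score({**best, "l1": v[0], "l2": v[1]}))
    # Stage 2: vary lambda3, lambda4 with lambda1, lambda2 at their best values.
    best["l3"], best["l4"] = max(
        product(GRID, GRID), key=lambda v: score({**best, "l3": v[0], "l4": v[1]}))
    # Stage 3: vary lambda5 with everything else fixed.
    best["l5"] = max(GRID, key=lambda v: score({**best, "l5": v}))
    return best
```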

From Figs. 5 and 6, it can be seen that the clustering ACC is relatively insensitive to parameters \({\lambda }_{3}\) and \({\lambda }_{4}\), especially on the COIL20 and UCI Digits datasets. Figure 7 indicates that the clustering accuracy is also not significantly affected by the parameter \({\lambda }_{5}\). This insensitivity is primarily attributed to the behaviour of the rank constraint term associated with \({\lambda }_{5}\) in the graph learning process: when the value of \({\lambda }_{5}\) is too large, the rank constraint term dominates the graph learning and overshadows the preservation of the local and global structure. As a result, even though the generated graph may still have the desired connected components, it fails to reveal the actual underlying intrinsic structure of the data.

Fig. 5
figure 5

Visualization of the clustering performance of our method w.r.t. ACC by fixing other parameters and varying-parameters \({{\varvec{\lambda}}}_{1}\) and \({{\varvec{\lambda}}}_{2}\) on COIL20, UCI Digits, and ORL datasets

Fig. 6
figure 6

Visualization of the clustering performance of our method w.r.t. ACC by fixing other parameters and varying-parameters \({{\varvec{\lambda}}}_{3}\) and \({{\varvec{\lambda}}}_{4}\) on COIL20, UCI Digits, and ORL datasets

Fig. 7
figure 7

Visualization of the clustering performance of our method w.r.t. ACC by fixing other parameters and varying-parameters \({{\varvec{\lambda}}}_{5}\) on COIL20, UCI Digits, and ORL datasets

4.4 Convergence study

The convergence proof of an ALM algorithm typically involves establishing mathematical guarantees that the algorithm converges to an optimal solution under certain conditions. Notably, the convergence of the ALM algorithm with two sub-blocks has been proven in general [4]. However, it remains difficult to prove convergence with more sub-blocks, and our algorithm has five, making such a proof even more challenging. We therefore study the convergence behaviour of the proposed method empirically by calculating the relative error of the dual regularization term. As shown in Fig. 8, the proposed method exhibits strong convergence behaviour while simultaneously learning the two manifold structures: the relative error decreases after practically every iteration, and the method converges within about 10 iterations.

Fig. 8
figure 8

The convergence curve of the proposed method on the four datasets

5 Conclusion

This study proposes a novel method that uses a two-way learning technique to induce the data manifold via two representative structures. The first structure is a low-rank matrix obtained directly from the original data. The second structure, a similarity matrix, is learned via a \(k\)-symmetric nearest neighbour graph. To guarantee better clustering results, a dual regularization term is further introduced to allow both structures to guide each other adaptively towards a more optimal solution without a spectral post-processing procedure. Several experiments were performed on four popular benchmark datasets to evaluate the effectiveness of the proposed method. The results in terms of ACC, NMI, F-score, Recall, Precision, and AR show that the proposed method has clear advantages over the compared methods. Despite its effectiveness, our proposed method has several limitations. Notably, its computational efficiency may be limited on very large-scale problems [36], since it has many sub-blocks to update in each iteration. Secondly, finding a good combination of parameters is tricky across datasets [37]. Thus, this work can be improved by further studies exploring large-scale techniques to enhance computational efficiency and self-tuning strategies for the parameters to improve performance. Additionally, we will exploit multiview learning [38, 39] and the deep learning paradigm to further enhance the performance of our method.