1 Introduction

In recent years, considerable attention has been directed towards subspace clustering methods for the efficient processing of high-dimensional data. Subspace clustering is a valuable technique for grouping high-dimensional data distributed across a union of subspaces. Its fundamental premise is that data samples within the same cluster should reside in a common subspace. Subspace clustering has become a versatile tool, widely applied across computer vision domains such as face clustering [24, 55, 75], video analysis [66], image representation and compression [14], hyperspectral image processing [73], saliency detection [6], motion segmentation [65], and domain adaptation [28].

Existing subspace clustering methods fall into five broad categories [60]: iterative models [13], algebraic models [62], statistical models [8], spectral clustering-based models [6, 23, 30, 66], and deep learning-based models [15, 17, 18, 42, 78]. Among these, spectral clustering-based subspace clustering is the most widely studied, owing to its extensive exploration and practical applications [6, 54, 71]. It primarily consists of two steps: (1) construction of an affinity matrix and (2) spectral clustering [53]. Diverse forms of affinity matrices have been proposed, depending on the regularization term incorporated. The most notable models include sparse subspace clustering (SSC) [4, 5, 59] and clustering based on low-rank representation (LRR) [3, 12, 26, 63, 72, 77]. These approaches, rooted in sparse or low-rank representation techniques, acquire a coefficient matrix through a self-expression model: SSC enforces an \(l_1\) norm constraint on the coefficients, while LRR employs the \(l_*\) (nuclear) norm. Several works have focused on enhancing the robustness of subspace clustering algorithms by employing a multi-objective framework with subspace optimization techniques [41], integrating subspace fuzzy clustering techniques [64], or adjusting the threshold based on the distribution of data points within subspaces [22].

Various techniques for subspace clustering have been developed, each employing distinct constraints to derive an optimal affinity matrix [1, 2, 30, 44, 56, 61, 71]. However, these methods are limited to processing linear subspaces, proving inadequate for handling the prevalent nonlinear structures found in real-world data [21, 52]. Another constraint is the assumption that data can be separated into distinct subspaces. Recently, techniques based on multiview clustering [11, 45,46,47, 57, 58, 74, 76] have also gained attention.

To address the limitations associated with linear subspace clustering on nonlinear data, kernel self-expression methods have been introduced [16, 27, 38,39,40, 66]. Kernelized SSC (KSSC) [40] and kernelized LRR (KLRR) [37, 66] are two prominent methods that capture nonlinear structure information in the input space, exhibiting significant advancements in various applications. Despite their efficiency in processing nonlinear data, these kernel subspace clustering methods may lose some similarity information between samples during the reconstruction of the original data in kernel space.

Recently, transform learning-based approaches for subspace clustering [9, 10, 33,34,35] have garnered attention. These methods operate on data originally deemed inseparable into subspaces, transforming it into a high-dimensional feature space where linear separability into subspaces is achievable. Notably, these methods eliminate the need to manually choose a mapping function, as they autonomously learn the mapping from the data itself.

In practical scenarios, real-world data exhibiting manifold structures often entails complexities beyond mere sparsity or low-rank characteristics. Consequently, it becomes crucial to formulate a representation that can adeptly capture the intricate structural information inherent in the original data. Numerous methodologies have been devised to uncover underlying structures by delving into data relationships [33,34,35]. Recently introduced subspace clustering methodologies grounded in structure learning comprise Similarity Learning via Kernel-Preserving Embedding (SLKE) [20] and Structure Learning with Similarity Preserving (SLSP) [19]. SLKE constructs a model that preserves similarity information among data, resulting in improved performance. In contrast, SLSP establishes a structure learning framework that integrates the similarity information of the original data, addressing potential drawbacks associated with the SLKE algorithm, which could lead to the loss of certain low-order information. Despite these methods exhibiting commendable performance, their effectiveness is dependent on a learned similarity matrix that might lack an optimal block diagonal structure for spectral clustering.

In the literature, diverse norm regularization terms have been utilized in self-expressive models to obtain a block diagonal coefficient matrix, including the \(l_1\) norm, \(l_2\) norm, and nuclear norm. However, these regularization techniques exhibit two shortcomings: they cannot control the number of blocks in the coefficient matrix, and the learned coefficient matrix may be suboptimal due to data noise. Addressing these limitations, block diagonal representation (BDR) subspace clustering algorithms [30, 70] have been introduced, which directly pursue a block diagonal structure in the coefficient matrix. For example, implicit block diagonal low-rank representation (IBDLR) integrates block diagonal priors and implicit feature representation into the low-rank representation model, progressively enhancing clustering performance [67]. Notably, these BDR-based subspace clustering methods, while effective, have not been integrated with similarity-preserving mechanisms. The kernelized version of transform learning was introduced in [32].

This work proposes “Similarity Preserving Kernel Block Diagonal Representation based Transformed Subspace Clustering (KBD-TSC)”, which leverages the kernel self-expression framework. Kernelized transformed subspace clustering accounts for data that are not originally separable into subspaces by leveraging kernel self-expression-based transform learning. The proposed method transforms the data into a high-dimensional feature space in which they are linearly separable into subspaces; it does not require manually choosing a mapping function, as the mapping is learned from the data itself. Although kernel subspace clustering methods based on kernel self-expression can efficiently process nonlinearly structured data, some similarity information between samples may be lost when reconstructing the original data in kernel space. The integration of a similarity preserving regularizer and a block diagonal regularizer into the proposed model therefore facilitates enhanced preservation of similarity information between the original data points. Experimental results on nine datasets validate the effectiveness and robustness of the proposed KBD-TSC method.

The principal contributions of this paper are as follows:

  • A novel subspace clustering approach is proposed which accounts for the data that is not originally separable into subspaces by leveraging kernel self-expression-based transform learning.

  • A similarity preserving regularizer is incorporated in the proposed model to facilitate enhanced preservation of similarity information between the original data points.

  • A block diagonal representation is integrated into the proposed model to derive a similarity matrix characterized by an optimal block diagonal structure.

  • The proposed KBD-TSC model is evaluated on nine datasets featuring different types of manifolds, including handwritten digits clustering, face image clustering, object clustering, and text clustering. The experiments involved comparing the proposed model with several state-of-the-art approaches. The results strongly support the effectiveness of our proposed model.

The structure of the remaining paper is organized as follows: Sect. 2 provides a review of basic concepts in related work. Section 3 elaborates on the proposed algorithms and their solutions. Subsequently, Sect. 4 discusses experimental results, and Sect. 5 concludes the paper.

2 Background

2.1 Subspace clustering

The standard subspace clustering techniques are self-expression-oriented: the aim is to express each data point as a linear combination of the other data points lying in the same subspace. The basic prerequisite is that the data are separable into distinct subspaces. Let \(\varvec{X} = [x_1, x_2, \cdots , x_N] \in \Re ^{d \times N}\) be the matrix of data points, where every column vector \(\varvec{x_i}\) is drawn from a union of lower-dimensional subspaces \(\begin{Bmatrix} \varvec{S_1}\bigcup \varvec{S_2} \bigcup \cdots \varvec{S_n} \end{Bmatrix}\) of dimensions \(\begin{Bmatrix} d_k \end{Bmatrix}_{k=1}^n\), where n is the total number of subspaces. Subspace clustering aims to segment each set \(X_k\) of \(N_k\) points that belong to the same subspace \(S_k\) of dimension \(d_k\). The standard self-expression model is

$$\begin{aligned} \underset{{\textbf {Z}}}{\text {minimize }}\frac{1}{2}\left\| {\textbf {X}}- {\textbf {XZ}}\right\| _F^2 + \lambda (\Omega ({\textbf {Z}})), \; \text {s.t.\; diag}({\textbf {Z}})=0, \; {\textbf {Z}} \ge 0. \end{aligned}$$
(1)

where \(\Omega (\varvec{Z})\) is the regularization term and \(\lambda >0\) is a hyperparameter. \(\left\| {\textbf {Z}} \right\| _*\), \(\left\| {\textbf {Z}} \right\| _1\), and \(\left\| {\textbf {Z}} \right\| _F^2\) are three common regularizers. An affinity matrix is then constructed from \(\varvec{Z}\), to which a graph-cut (spectral clustering) technique is applied to obtain the clusters.
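To make this pipeline concrete, the following is a minimal sketch of Eq. 1 with the choice \(\Omega (\varvec{Z})=\Vert \varvec{Z}\Vert _F^2\), which admits a closed-form solution; the constraints \(\text {diag}(\varvec{Z})=0\) and \(\varvec{Z}\ge 0\) are imposed here by simple post-processing rather than exact constrained optimization, and the toy data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def self_expressive_affinity(X, lam=0.1):
    """Eq. (1) with Omega(Z) = ||Z||_F^2:
    minimize 0.5*||X - X Z||_F^2 + lam*||Z||_F^2  =>  Z = (X^T X + 2*lam*I)^{-1} X^T X.
    diag(Z) = 0 and Z >= 0 are imposed by post-processing."""
    N = X.shape[1]                      # columns of X are samples
    G = X.T @ X                         # N x N Gram matrix
    Z = np.linalg.solve(G + 2 * lam * np.eye(N), G)
    np.fill_diagonal(Z, 0.0)            # diag(Z) = 0
    Z = np.maximum(Z, 0.0)              # Z >= 0
    return 0.5 * (Z + Z.T)              # symmetric affinity matrix

# Usage: spectral clustering on the learned affinity (toy data).
X = np.random.randn(30, 100)            # 100 samples in R^30
W = self_expressive_affinity(X)
labels = SpectralClustering(n_clusters=5, affinity="precomputed").fit_predict(W)
```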

2.2 Kernelized subspace clustering

The kernelized version of subspace clustering maps the data into a kernel-induced feature space and performs self-expression there:

$$\begin{aligned} \underset{{\textbf {Z}}}{\text {minimize }} \frac{1}{2} \Vert \ker ({\textbf {X}}) - \ker ({\textbf {X}}){\textbf {Z}} \Vert _F^2 + \lambda (\Omega ({\textbf {Z}})) \nonumber \\ \equiv \underset{{\textbf {Z}}}{\text {minimize }} \frac{1}{2}Tr\left( {\textbf {K}}-2{\textbf {KZ}}+{\textbf {Z}}^\top {\textbf {KZ}}\right) + \lambda (\Omega ({\textbf {Z}})), \; \text {s.t. \; diag}({\textbf {Z}})=0, \; {\textbf {Z}} \ge 0. \end{aligned}$$
(2)

where \(\ker (\varvec{X})\) denotes the kernel mapping and \(\varvec{K}\) is the kernel matrix whose entries are \(K_{i,j} = \ker (\varvec{x}_i)^\top \ker (\varvec{x}_j)\).
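For illustration, a hedged sketch of Eq. 2 under the same Frobenius-norm choice of \(\Omega (\varvec{Z})\), for which the problem reduces to the closed form \(\varvec{Z} = (\varvec{K} + 2\lambda \varvec{I})^{-1}\varvec{K}\); the RBF kernel and the post-processing of the constraints are illustrative choices, not the exact setup used later in this paper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_self_expression(X, lam=0.1, gamma=1e-3):
    """Eq. (2) with Omega(Z) = ||Z||_F^2:
    0.5*||ker(X) - ker(X) Z||_F^2 + lam*||Z||_F^2  =>  Z = (K + 2*lam*I)^{-1} K.
    Constraints diag(Z) = 0 and Z >= 0 are applied as post-processing."""
    K = rbf_kernel(X.T, gamma=gamma)     # K[i, j] = ker(x_i)^T ker(x_j)
    N = K.shape[0]
    Z = np.linalg.solve(K + 2 * lam * np.eye(N), K)
    np.fill_diagonal(Z, 0.0)
    return np.maximum(Z, 0.0)
```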

2.3 Kernelized transform learning

Transform learning is a more recent unsupervised representation learning methodology that acts as the analysis counterpart of dictionary learning. It seeks to learn the transform matrix and the coefficient matrix from the input data in such a way that the learned transform analyses the data and generates the coefficient matrix [48,49,50,51]. Since we anticipate that the input data can be divided into several groups, raw data (e.g., pixel values) are typically used as input to the clustering method. Transform learning, however, can be applied to any high-dimensional data, yielding effective latent-space representations of the transform matrix and the coefficient matrix. The nonlinearity of the data can be handled effectively and efficiently by the kernel approach: kernelization can be used when the learned transform coefficients are not linearly separable into distinct subspaces. The kernel transform learning formulation [32] is given as Eq. 3.

$$\begin{aligned} \underset{{\textbf {A}},{\textbf {Z}}}{\text {minimize }} \Vert {\textbf {AK}} - {\textbf {Z}} \Vert _F^2 + \epsilon (\Vert {\textbf {A}} \Vert _F^2 - {\text{log det}}({\textbf {A}})) + \mu \Vert {\textbf {Z}} \Vert _1. \end{aligned}$$
(3)

where \(\varvec{K} = \ker ({\textbf {X}})^\top \ker ({\textbf {X}})\) is the kernel matrix, and \(\epsilon\) and \(\mu\) are hyperparameters. The regularization terms introduced in Eq. 3 help prevent trivial solutions. The term \(-{\text{log det}}(\varvec{A})\) guarantees that \(\varvec{A}\) is of full rank, while \(\left\| \varvec{A} \right\| _F^2\) prevents the transform from growing arbitrarily large. The remaining term, \(\left\| \varvec{Z} \right\| _1\), makes the coefficients sparser.

Equation 3 is solved using an alternating minimization method. Equation 4 describes the iterative updates of \(\varvec{A}\) and \(\varvec{Z}\), which are carried out alternately.

$$\begin{aligned} \begin{array}{l} {\textbf {A}} \leftarrow \underset{{\textbf {A}}}{\text {minimize }} \left\| {\textbf {AK}} - {\textbf {Z}} \right\| _F^2 + \epsilon ( \left\| {\textbf {A}} \right\| _F^2 - {\text{log det}}({\textbf {A}}));\\ {\textbf {Z}} \leftarrow \underset{{\textbf {Z}}}{\text {minimize }} \left\| {\textbf {AK}} - {\textbf {Z}} \right\| _F^2 + \mu \left\| {\textbf {Z}} \right\| _1. \end{array} \end{aligned}$$
(4)

The update of \(\varvec{A}\) is straightforward, as the objective is directly differentiable; it can also be solved in closed form using the linear-algebraic techniques given in [31]. Equation 5 describes the one-step soft thresholding used to update \(\varvec{Z}\).

$$\begin{aligned} {\textbf {Z}} \leftarrow \text {sign}({\textbf {AK}}) \odot \max (0, \vert {\textbf {AK}}\vert - \mu ). \end{aligned}$$
(5)
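As a small illustration, the one-step soft thresholding of Eq. 5 can be transcribed as follows (element-wise operations):

```python
import numpy as np

def soft_threshold_update(A, K, mu):
    """Eq. (5): Z <- sign(AK) * max(0, |AK| - mu), applied element-wise."""
    AK = A @ K
    return np.sign(AK) * np.maximum(0.0, np.abs(AK) - mu)
```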

2.4 Transformed subspace clustering

Subspace clustering operates on the underlying assumption that the given data can be segmented into numerous sub-spaces. Yet, in the context of high-dimensional data, this assumption often does not hold true. To address this challenge, we employ clustering algorithms on the features derived from transform learning. This approach involves projecting the data onto a latent space, where we acquire separable coefficients that define distinct sub-spaces [34]. The modified subspace clustering can be articulated as follows, focusing on acquiring sparse representations based on the transformed features.

$$\begin{aligned} \underset{{\textbf {A,Z,C}}}{\text {minimize }} \Vert {\textbf {AX}} - {\textbf {Z}} \Vert _F^2 + \epsilon (\Vert {\textbf {A}} \Vert _F^2 - {\text{log det}}({\textbf {A}})) + \gamma \Vert {\textbf {Z}} - {\textbf {ZC}} \Vert _F^2 + \lambda (\Omega ({\textbf {C}})). \end{aligned}$$
(6)

Alternating minimization, as described in Eq. 7, is utilised to solve Eq. 6: in each iteration, \(\varvec{A}\), \(\varvec{Z}\), and \(\varvec{C}\) are updated alternately.

$$\begin{aligned} \begin{array}{*{20}{l}} {\textbf {A}} \leftarrow \underset{{\textbf {A}}}{\text {minimize }} \Vert {\textbf {AX}} - {\textbf {Z}} \Vert _F^2 + \epsilon (\Vert {\textbf {A}} \Vert _F^2 - {\text{log det}}({\textbf {A}})); \\ {\textbf {Z}} \leftarrow \underset{{\textbf {Z}}}{\text {minimize }} \Vert {\textbf {AX}} - {\textbf {Z}} \Vert _F^2 + \gamma (\Vert {\textbf {Z}}-{\textbf {ZC}} \Vert _F^2); \\ {\textbf {C}} \leftarrow \underset{{\textbf {C}}}{\text {minimize }} \Vert {\textbf {Z}}-{\textbf {ZC}} \Vert _F^2 + \lambda (\Omega ({\textbf {C}})). \\ \end{array} \end{aligned}$$
(7)

Once \(\varvec{C}\) is obtained, the affinity matrix is calculated from it and spectral clustering is applied to discover the clusters. Additionally, to obtain the kernelized transformed subspace clustering formulation, the subspace clustering loss is combined with the kernel transform learning model of Eq. 3.

2.5 Block diagonal representation

By adding a block diagonal regularization term, the BDR method [29] directly pursues a block diagonal coefficient matrix and achieves higher clustering performance. The BDR optimization model is expressed as follows:

$$\begin{aligned} \underset{{\textbf {Z}}}{\text {minimize }} \frac{1}{2} \Vert {\textbf {Z}} \Vert _{\fbox {m}} + \Vert \ker ({\textbf {X}}) - \ker ({\textbf {X}}){\textbf {Z}} \Vert _F^2, \; \text {s.t.}\; {\textbf {Z}} \ge 0, \; {\textbf {Z}}^\top = {\textbf {Z}}, \; \text {diag}({\textbf {Z}})=0. \end{aligned}$$
(8)

Here, \(\varvec{X}\) and \(\varvec{Z}\) represent the data matrix and the coefficient matrix, respectively, while \(\left\| \varvec{Z} \right\| _{\fbox {m}}\) denotes the m-block diagonal regularizer, defined in the BDR literature as the sum of the m smallest eigenvalues of the Laplacian of \(\varvec{Z}\).
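For reference, the value of the block diagonal regularizer can be computed as in the sketch below; it follows the standard definition from the BDR literature, where the Laplacian \(Diag(\varvec{Z}\textbf{1})-\varvec{Z}\) is used and the regularizer vanishes exactly when \(\varvec{Z}\) has at least m connected components, i.e., at least m diagonal blocks.

```python
import numpy as np

def m_block_diag_reg(Z, m):
    """m-block diagonal regularizer of a symmetric, nonnegative affinity Z:
    the sum of the m smallest eigenvalues of its graph Laplacian."""
    L = np.diag(Z.sum(axis=1)) - Z       # Laplacian: Diag(Z 1) - Z
    eigvals = np.linalg.eigvalsh(L)      # eigenvalues in ascending order
    return float(eigvals[:m].sum())
```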

3 Proposed method: KBD-TSC

The fundamental assumption underlying self-expression-based subspace clustering is the requirement for data to be segregable into distinct subspaces. Traditional methods also rely on the assumption of an inherent linear structure. However, when these assumptions are not met, particularly when data samples are non-separable into subspaces and linear subspace clustering methods struggle with non-linear structures, a more adaptable model is needed. In response to this challenge, we present a model designed to generalize effectively on non-linear structured data, even when they are not easily separable into subspaces.

The proposed KBD-TSC method is adept at preserving similarity information among samples while concurrently achieving an optimal block diagonal structure in the obtained similarity matrix. This is accomplished by embedding a nonlinear model that integrates kernelized transformed subspace clustering with a kernel self-expression framework. Incorporating a block diagonal regularization term into the kernel self-expression framework is pivotal for obtaining a similarity matrix characterized by a block diagonal structure. Furthermore, the preservation of similarity information is secured by minimizing the difference between two inner products: the inner products among the original data in kernel space and the inner products of the reconstructed data in kernel space. The entire optimization problem is solved using alternating minimization.

3.1 Similarity preserving model

To uphold similarity information among samples, our objective is to minimize the difference between two inner products. One corresponds to the inner product among the original data in kernel space, and the other corresponds to the inner product of the reconstructed data in kernel space, drawing inspiration from the work of Kang et al. [20].

$$\begin{aligned} \underset{{\textbf {Z}}}{\text {minimize }} \frac{1}{2} \Vert {\textbf {K}}-{\textbf {Z}}^\top {\textbf {KZ}} \Vert _F^2. \end{aligned}$$
(9)

where \(\varvec{K} = \ker ({\textbf {X}})^\top \ker ({\textbf {X}})\) is a positive semi-definite matrix.
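A direct evaluation of the similarity preserving objective in Eq. 9 (a one-line sketch; \(\varvec{K}\) is any precomputed kernel matrix and \(\varvec{Z}\) any coefficient matrix):

```python
import numpy as np

def similarity_preserving_loss(K, Z):
    """Eq. (9): 0.5 * ||K - Z^T K Z||_F^2, the gap between the kernel-space
    inner products of the original data (K) and of the reconstruction (Z^T K Z)."""
    return 0.5 * np.linalg.norm(K - Z.T @ K @ Z, "fro") ** 2
```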

3.2 Proposed algorithm

We introduce a model designed for effective generalization on non-linear manifolds. This model integrates kernelized self-expression transformed subspace clustering with a similarity-preserving kernel block diagonal representation. The kernelized component of transform learning characterizes non-linear data as a linear combination of itself in the transform domain. The incorporation of transform learning with subspace clustering loss facilitates the separation of data into subspaces. To enhance this process, a block diagonal regularization term is introduced, aiming to achieve a similarity matrix between samples with a block diagonal structure. Consequently, our proposed model not only preserves the similarity information among non-linear samples in the transform domain but also simultaneously acquires a similarity matrix with an optimal block diagonal structure.

The complete joint formulation for the proposed model is expressed as Eq. 10.

$$\begin{aligned} \underset{{\textbf {A,Z}}}{\text {minimize }}&\underbrace{\Vert {\textbf {AK}}-{\textbf {Z}} \Vert _F^2 + \epsilon (\Vert {\textbf {A}} \Vert _F^2 - {\text{log det}}({\textbf {A}}))}_{\text {Kernelized transform learning}} + \underbrace{\frac{1}{2}Tr\left( {\textbf {K}}-2{\textbf {KZ}}+{\textbf {Z}}^\top {\textbf {KZ}}\right) }_{\text {Kernel self-expression subspace clustering}} \nonumber \\&\quad + \underbrace{\alpha \Vert {\textbf {K}}-{\textbf {Z}}^\top {\textbf {KZ}} \Vert _F^2 + \gamma \Vert {\textbf {Z}} \Vert _{\fbox {k}}}_{\text {Similarity preserving with block diagonal}} \nonumber \\&\quad \text {s.t.}\; {\textbf {Z}}\ge 0, \; \text {diag}({\textbf {Z}}) = 0, \; {\textbf {Z}}^\top ={\textbf {Z}}. \end{aligned}$$
(10)

where \(\alpha , \epsilon , \gamma\) are positive hyperparameters.

To simplify and separate out the variables, let us introduce an auxiliary matrix \(\varvec{B}\) and a regularization term \(\left\| {\varvec{Z-B}} \right\| _F^2\) into our proposed model. Thus, the optimization problem in equation 10 can be translated to

$$\begin{aligned} \underset{{\textbf {A,Z,B}}}{\text {minimize }}&\left\| {{\textbf {AK}} - {\textbf {Z}}} \right\| _F^2+\epsilon \left( {\left\| {\textbf {A}} \right\| _F^2 - {\text{log det}}({\textbf {A}})} \right) + \frac{1}{2}Tr\left( {\textbf {K}}-2{\textbf {KZ}}+{\textbf {Z}}^\top {\textbf {KZ}} \right) \nonumber \\&\quad + \alpha {\left\| {\textbf {K}}- {\textbf {Z}}^\top {\textbf {KZ}} \right\| _F^2}+ \frac{\beta }{2} \left\| {{\textbf {Z-B}}} \right\| _F^2 + \gamma \left\| {\textbf {B}} \right\| _{\fbox {k}} \nonumber \\&\quad \text {s.t.}\; {\textbf {Z}}\ge 0, \; \text {diag}\left( {\textbf {Z}} \right) = 0, \; {\textbf {Z}}^\top ={\textbf {Z}}. \end{aligned}$$
(11)

3.3 Optimization of the proposed KBD-TSC model

To facilitate the solution of the problem in Eq. 11, three new auxiliary variables \(\varvec{J}\), \(\varvec{G}\), and \(\varvec{H}\) are introduced, which leads to the following equivalent problem:

$$\begin{aligned} \underset{{\textbf {A,Z,B,J,G,H}}}{\text {minimize }}&\left\| {{\textbf {AK}}-{\textbf {Z}}} \right\| _F^2+\epsilon \left( {\left\| {\textbf {A}} \right\| _F^2 - {\text{log det}}({\textbf {A}})} \right) + \frac{1}{2}Tr\left( {\textbf {K}}-2{\textbf {KJ}}+{\textbf {Z}}^\top {\textbf {K J}} \right) \nonumber \\&\quad + \alpha {\left\| {\textbf {K}}- {\textbf {G}}^\top {\textbf {K H}} \right\| _F^2} + \frac{\beta }{2} \left\| {{\textbf {J-B}}} \right\| _F^2 + \gamma \left\| {\textbf {B}} \right\| _{\fbox {k}} \nonumber \\&\quad \text {s.t.}\; {\textbf {B}}\ge 0, \; \text {diag}\left( {\textbf {B}} \right) = 0, \; {\textbf {B}}^\top ={\textbf {B}}, \; {\textbf {J}}={\textbf {Z}}, \; {\textbf {G}}={\textbf {Z}}, \; {\textbf {H}}={\textbf {Z}}. \end{aligned}$$
(12)

We use ADMM to solve Eq. 12; the corresponding augmented Lagrangian [25] is given as follows:

$$\begin{aligned} \mathcal {L}\left( {\textbf {A,Z,J,G,H,B}}, \lambda _1, \lambda _2, \lambda _3 \right)&= \left\| {{\textbf {AK-Z}}} \right\| _F^2+\epsilon \left( {\left\| {\textbf {A}} \right\| _F^2 - {\text{log det}}({\textbf {A}})} \right) + \frac{1}{2}Tr\left( {\textbf {K}}-2{\textbf {KJ}}+{\textbf {Z}}^\top {\textbf {K J}}\right) \nonumber \\&\quad + \alpha {\left\| {\textbf {K}}- {\textbf {G}}^\top {\textbf {K H}} \right\| _F^2} + \frac{\beta }{2} \left\| {{\textbf {J-B}}} \right\| _F^2 + \gamma \left\| {\textbf {B}} \right\| _{\fbox {k}} \nonumber \\&\quad + \frac{\mu }{2} \left[ \left\| {\textbf {J-Z}}+\frac{\lambda _1}{\mu } \right\| _F^2 + \left\| {\textbf {G-Z}}+\frac{\lambda _2}{\mu } \right\| _F^2 + \left\| {\textbf {H-Z}}+\frac{\lambda _3}{\mu } \right\| _F^2\right] . \end{aligned}$$
(13)

where \(\lambda _1, \lambda _2, \lambda _3\) are Lagrangian multipliers and \(\mu > 0\) is a penalty parameter. Now, these variables can be updated alternately. The updates for all variables are given as follows:

  • Update A: After keeping other variables fixed, \(\varvec{A}\) can be updated as follows:

    $$\begin{aligned} \underset{{\textbf {A}}}{\text {minimize }} \Vert {\textbf {AK-Z}} \Vert _F^2+ \epsilon \left( {\Vert {\textbf {A}} \Vert _F^2 - {\text{log det}}({\textbf {A}})} \right) . \end{aligned}$$
    (14)

    For updating the transform given the original data, as in Eq. 15, the closed-form solution of Eq. 16 can be used.

    $$\begin{aligned} \underset{{\textbf {T}}}{\text {minimize }} \Vert {\textbf {TX-Z}} \Vert _F^2 + \epsilon (\Vert {\textbf {T}} \Vert _F^2 - {\text{log det}}({\textbf {T}})). \end{aligned}$$
    (15)

    The update of the transform matrix \(\varvec{T}\) is straightforward, as each term is directly differentiable; however, it can be solved more efficiently in closed form using the linear-algebraic techniques of [51].

    $$\begin{aligned} \begin{gathered} {\textbf {X}}{{\textbf {X}}^\top } + \epsilon {\textbf {I}} = {\textbf {L}}{{\textbf {L}}^\top }, \\ {{\textbf {L}}^{ - 1}}{} {\textbf {X}}{{\textbf {Z}}^\top } = {\textbf {US}}{{\textbf {V}}^\top }, \\ {\textbf {T}} \leftarrow \frac{1}{2}{} {\textbf {V}}({\textbf {S}} + {({{\textbf {S}}^2} + 2\epsilon {\textbf {I}})^{1/2}}){{\textbf {U}}^\top }{{\textbf {L}}^{ - 1}}. \\ \end{gathered} \end{aligned}$$
    (16)

    Now, the solution of Eq. 14 follows the update of \(\varvec{T}\) in Eq. 16: here \(\varvec{A}\) plays the role of the transform matrix \(\varvec{T}\), and instead of passing the original samples \(\varvec{X}\), we input the kernel matrix \(\varvec{K}\) (a numerical sketch of this update is given after Algorithm 1).

  • Update J: After keeping other variables fixed, \(\varvec{J}\) can be updated as follows:

    $$\begin{aligned} \begin{array}{ll} \underset{{\textbf {J}}}{\text {minimize }} \frac{1}{2}Tr\left( {\textbf {K}}-2{\textbf {KJ+Z}}^\top {\textbf {KJ}} \right) + \frac{\beta }{2} \Vert {{\textbf {J-B}}} \Vert _F^2 + \frac{\mu }{2} \Vert {\textbf {J-Z}}+\frac{\lambda _1}{\mu } \Vert _F^2\\ s.t.\; {\textbf {B}}\ge 0, diag\left( {\textbf {B}} \right) = 0,{\textbf {B}}^\top ={\textbf {B}}. \end{array} \end{aligned}$$
    (17)

    Taking its first derivative and equating it to 0 gives:

    $$\begin{aligned} {\textbf {J}} = \left( {\textbf {K}} +{\textbf {B}}+\mu {\textbf {I}} \right) ^{-1}\left( {\textbf {K}} + \mu {\textbf {Z}} - \lambda _1 + \beta {\textbf {B}} \right) . \end{aligned}$$
    (18)
  • Update G: After keeping other variables fixed, \(\varvec{G}\) can be updated as follows:

    $$\begin{aligned} \underset{{\textbf {G}}}{\text {minimize }} \alpha \Vert {{\textbf {K-G}}^\top {\textbf {KH}}} \Vert _F^2 + \frac{\mu }{2} \Vert {\textbf {G-Z}}+\frac{\lambda _2}{\mu } \Vert _F^2. \end{aligned}$$
    (19)

    Taking its first derivative and equating to 0 gives:

    $$\begin{aligned} {\textbf {G}} = \left( 2 \alpha {\textbf {KH}} {\textbf {H}}^\top {\textbf {K}}^\top + \mu {\textbf {I}} \right) ^{-1}\left( 2 \alpha {\textbf {KH}} {\textbf {K}}^\top + \mu {\textbf {Z}} - \lambda _2\right) . \end{aligned}$$
    (20)
  • Update H: After keeping other variables fixed, \(\varvec{H}\) can be updated as follows:

    $$\begin{aligned} \underset{{\textbf {H}}}{\text {minimize }} \alpha \Vert {{\textbf {K}} - {\textbf {G}}^\top {\textbf {KH}}} \Vert _F^2 + \frac{\mu }{2} \left\Vert {\textbf {H-Z}}+\frac{\lambda _3}{\mu } \right\Vert _F^2. \end{aligned}$$
    (21)

    Taking its first derivative and equating to 0 gives:

    $$\begin{aligned} {\textbf {H}} = \left( 2 \alpha {\textbf {K}}^\top {\textbf {G}} {\textbf {G}}^\top {\textbf {K}} + \mu {\textbf {I}} \right) ^{-1}\left( 2 \alpha {\textbf {K}}^\top {\textbf { G K}} + \mu {\textbf {Z}} - \lambda _3\right) . \end{aligned}$$
    (22)
  • Update Z: After keeping other variables fixed, the sub-problem becomes:

    $$\begin{aligned} \underset{{\textbf {Z}}}{\text {minimize }} \frac{3 \mu }{2}\left\Vert {\textbf {Z}} - \frac{{\textbf {J}}+{\textbf {G}}+{\textbf {H}}+{\textbf {AK}}+ \left( \lambda _1 + \lambda _2 +\lambda _3 \right) / \mu }{3} \right\Vert _F^2. \end{aligned}$$
    (23)

    Taking its first derivative and equating to 0 gives:

    $$\begin{aligned} {\textbf {Z}} = \frac{{\textbf {J}}+{\textbf {G}}+{\textbf {H}}+{\textbf {AK}}+ \left( \lambda _1 + \lambda _2 +\lambda _3 \right) / \mu }{3}. \end{aligned}$$
    (24)
  • Update B: After keeping other variables fixed, \(\varvec{B}\) can be updated as follows:

    $$\begin{aligned} \underset{{\textbf {B}}}{\text {minimize }} \frac{\beta }{2} \Vert {{\textbf {J}} - {\textbf {B}}} \Vert _F^2 + \gamma \Vert {\textbf {B}} \Vert _{\fbox {k}} \; s.t.\; {\textbf {B}}\ge 0, diag\left( {\textbf {B}} \right) = 0, {\textbf {B}}^\top ={\textbf {B}}. \end{aligned}$$
    (25)

    Using the Ky Fan theorem [7], Eq. 25 can be rewritten as follows:

    $$\begin{aligned} \begin{array}{ll} \underset{{\textbf {B}}}{\text {minimize }} \frac{\beta }{2} \Vert {{\textbf {J}} - {\textbf {B}}} \Vert _F^2 + \gamma \left\langle diag\left( {\textbf {B}} \right) - {\textbf {B}}, {\textbf {S}} \right\rangle \\ s.t.\; {\textbf {B}}\ge {\textbf {0}}, diag\left( {\textbf {B}} \right) = 0, {\textbf {B}}^\top ={\textbf {B}}, {\textbf {0}}\preceq {\textbf {S}} \preceq {\textbf {I}}, Tr({\textbf {S}})=k. \end{array} \end{aligned}$$
    (26)

    where \(\varvec{S}=\varvec{UU}^\top\), \(\varvec{U}\) consists of k eigenvectors that correspond to k smallest eigenvalues of \(diag(\varvec{B})-\varvec{B}\). Now, equation 26 can be translated to:

    $$\begin{aligned} \underset{{\textbf {B}}}{\text {minimize }} \frac{1}{2}\left\| {\textbf {B}}-{\textbf {J}}+ \frac{\gamma }{\beta } \left( diag({\textbf {S}})1^\top -{\textbf {S}} \right) \right\| _F^2. \end{aligned}$$
    (27)

    Let us define

    $$\varvec{Q} = \varvec{J} - \frac{\gamma }{\beta } \left( diag(\varvec{S})1^\top -\varvec{S} \right) , \; \tilde{\varvec{Q}} = \varvec{Q}- Diag(diag(\varvec{Q})), \; \text {then} \; \varvec{B} = \max (0, (\tilde{\varvec{Q}}+ \tilde{\varvec{Q}}^\top )/2).$$

Once we obtain the matrix \(\varvec{B}\), the similarity matrix can be computed as \((\varvec{B} + \varvec{B} ^\top )/2\). After this, the clustering results can be obtained by applying spectral clustering to the similarity matrix. The step-by-step procedure is summarized in Algorithm 1, and a minimal numerical sketch of the closed-form \(\varvec{A}\) and \(\varvec{B}\) updates is given after it.

Algorithm 1 The proposed algorithm: KBD-TSC
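As referenced above, the following is a minimal numerical sketch of two of the closed-form steps used in Algorithm 1, stated under the assumption of a square transform and a symmetric kernel matrix: the transform update of Eq. 16 (used for \(\varvec{A}\) by passing \(\varvec{K}\) in place of \(\varvec{X}\)) and the Ky Fan step with the constrained projection that yields \(\varvec{B}\) (Eqs. 26-27, with the Laplacian form \(Diag(\varvec{B}\textbf{1})-\varvec{B}\) taken from the BDR literature). It sketches these individual steps only, not the full ADMM solver.

```python
import numpy as np

def transform_update(X, Z, eps):
    """Closed-form transform update (Eq. 16):
    X X^T + eps*I = L L^T,  L^{-1} X Z^T = U S V^T,
    T <- 0.5 * V (S + (S^2 + 2*eps*I)^{1/2}) U^T L^{-1}.
    For the A update of Eq. 14, pass the kernel matrix K in place of X."""
    L = np.linalg.cholesky(X @ X.T + eps * np.eye(X.shape[0]))
    Linv = np.linalg.inv(L)
    U, s, Vt = np.linalg.svd(Linv @ X @ Z.T, full_matrices=False)
    mid = np.diag(0.5 * (s + np.sqrt(s ** 2 + 2.0 * eps)))
    return Vt.T @ mid @ U.T @ Linv

def ky_fan_S(B, k):
    """Ky Fan step for Eq. 26: S = U U^T, where U holds the eigenvectors of the
    k smallest eigenvalues of the Laplacian of B."""
    Lap = np.diag(B.sum(axis=1)) - B         # Diag(B 1) - B
    _, V = np.linalg.eigh(Lap)               # eigenvalues in ascending order
    U = V[:, :k]
    return U @ U.T

def block_diag_B_update(J, S, gamma, beta):
    """B update (Eq. 27): project Q = J - (gamma/beta)*(diag(S) 1^T - S)
    onto {B : B = B^T, diag(B) = 0, B >= 0}."""
    Q = J - (gamma / beta) * (np.diag(S)[:, None] - S)
    Q_tilde = Q - np.diag(np.diag(Q))        # zero out the diagonal
    return np.maximum(0.0, 0.5 * (Q_tilde + Q_tilde.T))
```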

4 Experimental results and analysis

4.1 Dataset description

The proposed KBD-TSC algorithm is evaluated on nine datasets (six image datasets and three text datasets), which are as follows:

  1. Yale: It consists of 165 grayscale facial images of 15 individuals. The images are resized to 32\(\times\)32 pixels.

  2. Jaffe: This dataset consists of 213 facial images corresponding to 7 facial expressions. The images are resized to 26\(\times\)26 pixels.

  3. ORL: It consists of 400 facial images of 40 subjects. Each image is of size 26\(\times\)26 pixels.

  4. ARFaces: This dataset comprises 4000 facial images of 126 people.

  5. COIL20: It consists of 1440 images of 20 objects. Each image is of size 32\(\times\)32 pixels.

  6. BA: This dataset contains 1404 images of handwritten digits and uppercase letters. Each image is of size 20\(\times\)16 pixels.

  7. tr11: This text dataset consists of 414 samples, 6429 features, and 9 classes.

  8. tr41: This text dataset consists of 878 samples, 7454 features, and 10 classes.

  9. tr45: This text dataset consists of 690 samples, 8261 features, and 10 classes.

4.2 Baseline methods

We compare our proposed KBD-TSC method with several state-of-the-art methods, including spectral clustering (SC) [36], kernelized sparse subspace clustering (KSSC) [40], Kernel low-rank representation (KLRR) [37], Implicit block diagonal low-rank representation (IBDLR) [67], Similarity Learning via Kernel Preserving Embedding sparse (SLKEs) [20], Similarity Learning via Kernel Preserving Embedding low rank (SLKEr) [20], Structure learning with similarity preserving sparse (SLSPs) [19], Structure learning with similarity preserving low-rank (SLSPr) [19] and Kernel block diagonal representation subspace clustering with similarity preservation (KBDSP) [68].

4.3 Evaluation metrics

In the experiments, it is presumed that the number of clusters is known in advance. Under this setting, three metrics are commonly employed [42, 43]: accuracy, Normalized Mutual Information (NMI), and purity. They are described below:

  • Accuracy: The accuracy is defined as the ratio of the number of data instances that are assigned the same cluster as in the ground truth to the total number of data instances.

  • Normalized Mutual Information (NMI): This metric computes a normalized measure of the mutual information between the predicted cluster labels and the ground-truth labels. The range of NMI is [0, 1], where 0 signifies no correlation and 1 signifies perfect correlation.

  • Purity: Purity measures the extent to which data points within each cluster are assigned to the same true class [69]. A larger purity value indicates better clustering performance.
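For completeness, the three metrics can be computed as in the sketch below; accuracy uses the Hungarian algorithm to optimally match predicted clusters to ground-truth classes, and NMI is taken directly from scikit-learn. Integer-coded labels are assumed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy after optimally matching cluster labels to class labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = int(max(y_pred.max(), y_true.max())) + 1
    count = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                          # contingency matrix
    rows, cols = linear_sum_assignment(-count)    # maximize matched instances
    return count[rows, cols].sum() / y_true.size

def purity(y_true, y_pred):
    """Fraction of samples belonging to the majority true class of their cluster."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return sum(np.bincount(y_true[y_pred == c]).max()
               for c in np.unique(y_pred)) / y_true.size

# NMI: normalized_mutual_info_score(y_true, y_pred)
```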

4.4 Kernel design

We have designed 12 different kernels in this work, including one linear kernel, four polynomial kernels, and seven Gaussian kernels (Table 1).

Table 1 Details of parameter values w.r.t. different kernel functions

For the Gaussian kernels, \(\sigma\) is set to the maximum pairwise distance between samples \(x_i\) and \(x_j\).
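The construction below is a hedged sketch of such a kernel family, with the columns of \(\varvec{X}\) as samples; the specific polynomial degrees and Gaussian bandwidth scalings are illustrative placeholders, since the exact values are those listed in Table 1.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_kernels(X, poly_degrees=(2, 3, 4, 5),
                  gauss_scales=(0.25, 0.5, 1, 2, 4, 8, 16)):
    """One linear kernel, polynomial kernels, and Gaussian kernels whose
    bandwidth is tied to sigma = max_{i,j} ||x_i - x_j|| (Sect. 4.4).
    Degrees and scale factors here are placeholders, not the values of Table 1."""
    S = X.T                                   # samples as rows
    D = cdist(S, S)                           # pairwise Euclidean distances
    sigma = D.max()                           # sigma = maximum pairwise distance
    kernels = [S @ S.T]                                        # 1 linear
    kernels += [(S @ S.T + 1.0) ** d for d in poly_degrees]    # 4 polynomial
    kernels += [np.exp(-D ** 2 / (2.0 * (c * sigma) ** 2))     # 7 Gaussian
                for c in gauss_scales]
    return kernels
```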

4.5 Computational complexity

In the proposed algorithm, the first part is the construction of the kernel matrix, which is bounded by \(O(n^2)\). The second part is the update step for the different variables, each of which is bounded by \(O(n^3)\) owing to the required matrix inversions and eigendecompositions. Thus, the proposed algorithm has an overall time complexity of \(O(tn^3)\), where t and n represent the number of iterations and the number of data samples, respectively.

4.6 Parameter sensitivity analysis

There are four hyperparameters in the proposed KBD-TSC algorithm (Algorithm 1), namely \(\epsilon , \alpha , \beta , \gamma\). The parameter \(\epsilon\) controls the good conditioning of the transform, \(\alpha\) balances the similarity-preserving term \(\left\| \varvec{K}-\varvec{Z}^T\varvec{KZ}\right\| _F^2\), \(\beta\) controls the term \(\left\| \varvec{Z}-\varvec{B}\right\| _F^2\), and \(\gamma\) controls the block-diagonal structure term \(\left\| \varvec{B}\right\| _{\fbox {k}}\). The YALE and JAFFE datasets are used for parameter evaluation using NMI. The parameters \(\epsilon , \alpha , \beta , \gamma\) take values from the sets \(\left\{ {1e-2, 1e-1, 0.5, 1}\right\}\), \(\left\{ {1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1} \right\}\), \(\left\{ {1e-5, 1e-3, 0.1, 1} \right\}\), and \(\left\{ {1e-2, 1e-1, 1, 10, 30, 50} \right\}\), respectively, and tuning is performed via grid search. We also performed a parameter sensitivity analysis on the JAFFE dataset over a wide range of \(\alpha\), \(\beta\), and \(\gamma\) values against the NMI score, and observed that the proposed technique is not very sensitive to these hyperparameters. The best clustering performance is achieved with \(\epsilon = 0.1\), \(\alpha = 0.01\), \(\beta = 0.001\), and \(\gamma = 0.1\); hence, these values are fixed in all the experiments of this paper. The parameter settings of all experiments are given in Table 2, in which the recommended parameters are indicated in bold.

Table 2 Hyperparameter settings

4.7 Results and discussion

The experimental results for all nine datasets are reported in terms of accuracy, NMI, and purity in Tables 3, 4, and 5, respectively. For each dataset, the results are averaged over the 12 kernels and over ten runs. From the results, it can be observed that the proposed KBD-TSC approach outperforms the state-of-the-art methods.

Table 3 Comparison of clustering results based upon accuracy
Table 4 Comparison of clustering results based upon NMI score
Table 5 Comparison of clustering results based upon purity

More specifically, the results of Tables 3, 4, and 5 are discussed below:

  1. Compared to the SC algorithm, the proposed KBD-TSC method obtains better results on all evaluation metrics: accuracy, NMI, and purity. From Tables 3, 4, and 5, it can be observed that the average accuracy, NMI, and purity of the proposed method are 20.81%, 19.32%, and 23.44% higher than those of SC, respectively. The reason is that the input to spectral clustering is the learned \(\varvec{Z}\) instead of the raw kernel matrix.

  2. Our proposed KBD-TSC method also outperforms the kernel-based methods KSSC and KLRR. This is because of the similarity-preserving trick in the transform domain.

  3. In comparison to SLKEs and SLKEr, our proposed algorithm exhibits superior performance. This improvement can be attributed to two key aspects: firstly, the proposed framework for kernel self-expression has the capability to retain specific low-order details from the input data; secondly, the introduction of the term representing block diagonal structures in our model, within the latent transform space, facilitates the acquisition of a similarity matrix characterized by a block diagonal arrangement.

  4. SLSPs and SLSPr, which are capable of handling nonlinear datasets and preserving similarity information, give better performance than SC, KSSC, KLRR, SLKEs, and SLKEr. However, the proposed KBD-TSC algorithm consistently outperforms them in most instances. Specifically, the average values of accuracy, NMI, and purity in Tables 3, 4, and 5 indicate that the proposed method surpasses SLSPs by 15.39%, 16.47%, and 3.17%, respectively. These findings confirm that the introduced term representing block diagonal structures significantly contributes to improving performance.

  5. Both IBDLR and the proposed KBD-TSC method facilitate the acquisition of a desired affinity matrix with an optimal block diagonal structure by integrating the block diagonal representation term. Tables 3, 4, and 5 demonstrate that the proposed KBD-TSC method and IBDLR outperform the other compared algorithms on all datasets. This underscores the effectiveness of methods incorporating the block diagonal representation term, particularly for datasets with multiple classes. Remarkably, on the COIL20 and BA datasets, which are characterized by a larger number of instances, the proposed method demonstrates a performance improvement of almost 15% compared to alternative methods, with the exception of IBDLR on the COIL20 dataset. Additionally, the proposed KBD-TSC method surpasses the performance of IBDLR specifically on the COIL20 dataset, underscoring the advantages of incorporating a similarity-preserving strategy in the transform domain.

  6. In datasets with high-dimensional features such as TR11, TR41, and TR45, SLSPr demonstrates superior performance to IBDLR, attributed to its integration of a similarity-preserving mechanism. Capitalizing on both the similarity-preserving strategy and the block diagonal representation term, the proposed KBD-TSC consistently outperforms IBDLR and even surpasses SLSPr in most instances across the TR11, TR41, and TR45 datasets. These outcomes underscore the effectiveness of the proposed KBD-TSC method in managing datasets with intricate features, enabling the extraction of inherent data structures.

In a nutshell, the experimental results demonstrate the effectiveness of our proposed KBD-TSC method, which combines a similarity preserving regularizer, a transform learning-based kernel self-expressing model, and a block diagonal representation term.

4.8 Convergence analysis

The convergence plots of the proposed method are shown in Fig. 1. For all the datasets, the proposed method converges within 10 iterations.

Fig. 1 Convergence graph of the proposed method with nine different datasets in 30 iterations

4.9 Computational time

The experiments are conducted on a 64-bit Windows system with an Intel i7 processor and 32 GB RAM. The running times of the proposed method and the various state-of-the-art methods on all the datasets are shown in Table 6. From Table 6, it can be observed that the proposed KBD-TSC method is the fastest among all kernel-based techniques.

Table 6 Runtime comparison (in seconds)

4.10 Ablation experiments

For the ablation experiments, we compare the NMI score of the proposed objective function against the objective function without the similarity preserving term and the objective function without the block diagonal term. The results, reported in Table 7, show that both terms are important for effective clustering: the NMI score attains its best value on all datasets when both terms are included in the objective function. The variant without the similarity preserving term is labelled “similarity-” and the variant without the block diagonal term is labelled “block-”.

Table 7 Comparison of NMI score of the proposed objective function against the objective function without similarity preserving term and the objective function without block diagonal term for all the datasets

5 Conclusion

This paper presents a novel subspace clustering approach that integrates transform learning-based kernel block diagonal representation and a similarity-preserving strategy. The method exhibits effective performance even when the raw data lacks inherent separability into subspaces and demonstrates robust generalization capabilities for non-linear manifolds. The proposed KBD-TSC operates through a three-step process. Initially, it captures the non-linear structure of the input data by incorporating the kernel self-expressing framework into the transform learning-based framework. The second step introduces the block diagonal representation term to create a similarity matrix with a block diagonal structure. In the final step, the similarity-preserving term is introduced to capture pairwise similarity information between various data points. The effectiveness of the proposed approach is evaluated on nine benchmark datasets, showcasing its superiority over several state-of-the-art methods. In future work, we aim to extend the proposed method to multiple kernel learning.