1 Introduction

Clustering is one of the most important unsupervised learning topics in machine learning, in which data samples are grouped into different clusters according to their similarity. It has been studied and applied in many research areas such as text mining [1,2,3,4,5], gene expression analysis [6,7,8,9], and image recognition [10,11,12]. In particular, many clustering algorithms have been used for image segmentation [13,14,15,16].

Among the various clustering methods, assigning data samples to clusters according to a cluster prototype is one of the most effective and well-studied approaches. k-means [17] is the most representative and classical prototype-based method: it clusters all data samples by minimizing the sum of distances from each data sample to its nearest cluster center. Since k-means uses a single point as the prototype of each cluster, it often fails when the data samples are not distributed around a few central points. In contrast to k-means, k-plane clustering (kPC) [18] and q-flat [19] use hyperplanes or affine subspaces as cluster prototypes and assign each data sample to its nearest hyperplane or \((m-q)\)-dimensional affine flat, where m is the original feature dimension. When the flat degenerates to a single point or to a hyperplane, q-flat reduces to k-means or kPC, respectively. As described above, k-means, kPC and q-flat exploit only the within-cluster structure by minimizing within-cluster distances, and therefore ignore the discriminative information between different clusters. k-proximal plane clustering (kPPC) [20] introduces between-cluster dissimilarity, which is a substantial improvement over kPC. Twin support vector clustering (TWSVC) [21] and least squares TWSVC (LSTWSVC) [22] also consider between-cluster separability, inspired by the twin support vector machine (TWSVM) [23] and the least squares twin support vector machine (LSTWSVM) [24] for classification. To improve robustness, \(l_{1}\)-norm-based clustering methods have also been investigated, such as robust TWSVC (RTWSVC) [25], fast RTWSVC (FRTWSVC) [25], and k-subspace discriminant clustering (kSDC) [26].

However, all of the above methods are vector based. If the input data are matrices, each matrix must be converted to a vector before a vector-based method can be applied. This leads to high-dimensional data and a high computational cost, and some of the underlying structural information is lost. To overcome these shortcomings, a two-dimensional embedded image clustering method (A2DEIC) [27], which can work directly with matrices instead of flattened vectors, was recently proposed. However, the objective function of A2DEIC is non-smooth and difficult to solve. Moreover, the A2DEIC algorithm is sensitive to its weighting parameter, and finding the optimal parameter is time-consuming. We also note that although much progress has been made in two-dimensional dimensionality reduction [28,29,30,31,32,33], little attention has been paid to two-dimensional clustering.

Recently, Li et al. [34] proposed a matrix-based dimensionality reduction method, two-dimensional Bhattacharyya bound linear discriminant analysis (2DBLDA). In 2DBLDA, the between-class distance and the within-class distance are weighted by a constant computed from the input data, and this constant allows the 2DBLDA objective to minimize an upper bound of the Bhattacharyya error. Moreover, the design of 2DBLDA avoids the small sample size problem, and the method can be solved by a standard eigenvalue decomposition. Inspired by 2DBLDA, in this paper we extend it to the clustering problem and propose a novel two-dimensional k-subspace clustering method (2DkSC) that considers both discriminative and underlying structural information. In particular, 2DkSC minimizes the within-cluster scatter while maximizing the between-cluster separation. Moreover, taking advantage of the 2DBLDA formulation, the data samples are clustered directly into k subspaces. In summary, 2DkSC has the following characteristics:

\(\bullet\) 2DkSC maximizes the matrix-based between-cluster distance, measured by the weighted pairwise distances between cluster centers, while minimizing the matrix-based within-cluster distance, and clusters the data samples into the resulting k subspaces directly. In this way, while preserving the original matrix structure of the data, 2DkSC exploits both local and discriminative information during clustering by seeking an appropriate reduced dimension for the low-dimensional subspaces.

\(\bullet\) The weighting constant between the between-cluster and within-cluster terms is determined by the data themselves, which makes the proposed 2DkSC adaptive and free of tuning parameters. Inherited from 2DBLDA, the constant is meaningful in the sense that it minimizes an upper bound of the Bhattacharyya error.

\(\bullet\) In image recognition experiments, 2DkSC achieves the highest ACC and NMI on five of the six datasets. For example, 2DkSC achieves 77.45% NMI on the Coil100 dataset, which is 3.89% higher than the vector-based q-flat algorithm and 6.50% higher than the matrix-based A2DEIC algorithm. This demonstrates the superiority of the proposed algorithm for image clustering.

The rest of the paper is organized as follows. In section 2, kPC, kPPC, q-flat and A2DEIC are briefly introduced. In section 3, our method is presented. The experiments and conclusions can be found in sections 4 and 5, respectively. Details of the weighting constant are provided in the appendix.

2 Related works

Consider a dataset \(T=\{{\textbf {X}}_1,\,{\textbf {X}}_2,\ldots ,\,{\textbf {X}}_N\}\), where \({\textbf {X}}_l\in \mathbb {R}^{m\times n}\) for \(l=1,\,2,\,\ldots ,N\); in particular, if a data sample is in vector form, then \(n=1\). The goal of clustering is to partition T into k disjoint clusters \(C_i\), \(i=1,2,\ldots ,k\), satisfying \(C_{i'}\cap C_i =\varnothing\) for \(i'\ne i\) and \(T=\cup _{i=1}^kC_i\). Correspondingly, \(y_l\in \{1,2,\cdots ,k\}\) denotes the cluster label of the data sample \({\textbf {X}}_l\). Assume that the i-th cluster contains \(N_{i}\) data samples, so that \(\sum \nolimits _{i=1}^{k}N_i=N\). Let \({\overline{{\textbf {X}}}}_i=\frac{1}{N_i}\sum \nolimits _{s=1}^{N_i}{\textbf {X}}_{s}^i\) be the mean of the data samples of the i-th cluster, \(i=1,2,\ldots ,k\), where \({\textbf {X}}_{s}^i\) is the s-th data sample of the i-th cluster. For a matrix \({\textbf {Q}}=({\textbf {q}}_1,\,{\textbf {q}}_2,\ldots ,{\textbf {q}}_n)\in \mathbb {R}^{m\times n}\), its Frobenius norm (F-norm) is defined as \(\Vert {\textbf {Q}}\Vert _F=\sqrt{\sum \nolimits _{i=1}^{n}\Vert {\textbf {q}}_i\Vert _2^2}\). The F-norm is a natural generalization of the vector \(l_2\)-norm to matrices.

2.1 kPC

kPC [18] divides the data samples into k clusters so that the data samples gather around their own cluster hyperplane. For the i-th cluster, the hyperplane is determined by minimizing the sum of the distances between the data samples of the i-th cluster and this hyperplane, i.e., by solving the following optimization problem

$$\begin{aligned} \begin{array}{ll} \underset{{\textbf {w}}_i,b_i}{\min }~&{}\Vert {\textbf {w}}_i^T{\textbf {A}}_i+b_i{\textbf {e}}_i\Vert _2^2\\ \hbox {s.t.}~~&{}\Vert {\textbf {w}}_i\Vert _2^2 = 1, \end{array} \end{aligned}$$
(1)

where \({\textbf {w}}_i\in \mathbb {R}^m\), \(b_i\in \mathbb {R}\), \({\textbf {A}}_i\in \mathbb {R}^{m\times N_i}\) is the matrix whose columns are the data samples with label i, and \({\textbf {e}}_i\) is a vector of ones of appropriate dimension, \(i=1,2,\ldots ,k\). The constraint normalizes the normal vector of the cluster center hyperplane.

The solution of problem (1) can be obtained by solving k eigenvalue problems, one per cluster. Once the k cluster center hyperplanes are obtained, a data sample \({\textbf {x}}\in \mathbb {R}^{m}\) is assigned to a cluster by

$$\begin{aligned} \text {Cluster}({\textbf {x}}) = \underset{i=1,2,\ldots ,k}{\arg \min }\,\,\Vert {\textbf {w}}_i^T{\textbf {x}}+b_i\Vert _2^2, \end{aligned}$$
(2)

kPC starts with a random initial assignment of the data samples. The k cluster center hyperplanes are then updated by solving (1), and each data sample is reassigned a label by (2). The final k cluster center hyperplanes are obtained when the overall objective function no longer decreases or the assignment of the data samples to the cluster center hyperplanes no longer changes.
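A minimal MATLAB sketch of one kPC round is given below (our own illustration under our own naming, not the authors' released code). For a fixed \({\textbf {w}}_i\), the optimal intercept in (1) is \(b_i=-{\textbf {w}}_i^T\) times the cluster mean, so \({\textbf {w}}_i\) is the eigenvector of the centered within-cluster scatter matrix associated with the smallest eigenvalue:

```matlab
% Sketch of one kPC round (illustrative only; names are ours).
% X: m-by-N data matrix (one sample per column), labels: 1-by-N current labels
function [W, b, labels] = kpc_update(X, labels, k)
    [m, ~] = size(X);
    W = zeros(m, k);  b = zeros(1, k);
    for i = 1:k
        Ai = X(:, labels == i);                  % samples currently in cluster i
        mu = mean(Ai, 2);
        S  = (Ai - mu) * (Ai - mu)';             % centered within-cluster scatter
        [V, D] = eig((S + S') / 2);              % symmetrize for numerical safety
        [~, idx] = min(diag(D));
        W(:, i) = V(:, idx);                     % w_i: smallest eigenvector of S
        b(i) = -W(:, i)' * mu;                   % optimal intercept for this w_i
    end
    dist = (W' * X + b(:)) .^ 2;                 % squared distances to each hyperplane, cf. (2)
    [~, labels] = min(dist, [], 1);              % reassignment step
end
```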

2.2 kPPC

In contrast to kPC, kPPC [20] introduces between-cluster information into the construction of each cluster hyperplane. kPPC not only requires the data samples in each cluster to be as close as possible to their own center hyperplane, but also pushes the data samples of the other clusters far away from this center hyperplane, by solving the following optimization problem

$$\begin{aligned} \begin{aligned} \underset{{\textbf {w}}_i,b_i}{\min }~&\Vert {\textbf {w}}_i^T{\textbf {A}}_i+b_i{\textbf {e}}_i\Vert _2^2- c\Vert {\textbf {w}}_i^T\widehat{{\textbf {A}}}_i+b_i\widehat{{\textbf {e}}}_i\Vert _2^2 \\ \hbox {s.t.}~~&\Vert {\textbf {w}}_i\Vert _2^2 = 1, \end{aligned} \end{aligned}$$
(3)

where \({\textbf {A}}_i\in \mathbb {R}^{m\times N_i}\) is the matrix consisting of the data samples with label i, \(\widehat{{\textbf {A}}}_i\in \mathbb {R}^{m\times (N-N_i)}\) is the matrix consisting of the data samples of the other labels, c is a positive parameter, and \(\widehat{{\textbf {e}}}_i\) is a vector of ones of appropriate dimension, like \({\textbf {e}}_i\).

Different from the random initialization in kPC, kPPC uses a Laplacian graph-based initialization, which makes kPPC more stable than kPC [20]. kPPC is also solved via an eigenvalue problem.

2.3 q-flat

q-flat [19] aims to partition the data samples into k clusters, each of which is well approximated by a flat, by minimizing the sum of squared distances from each data sample to its nearest flat. For the i-th cluster, q-flat minimizes the following problem to find its best-fit \((m-q)\)-dimensional flat

$$\begin{aligned} \begin{aligned} \underset{{\textbf {W}}_i, \varvec{\gamma }_i}{\min }~~&\Vert {\textbf {W}}_i^T{\textbf {X}}_i-\varvec{\gamma }_i{\textbf {e}}^T_i\Vert _F^2\\ \hbox {s.t. }\,&{\textbf {W}}_i^T{\textbf {W}}_i={\textbf {I}}, \end{aligned} \end{aligned}$$
(4)

where \({\textbf {W}}_{i}\in \mathbb {R}^{m\times q}\) with \(q\le m\), \(\varvec{\gamma }_i\in \mathbb {R}^{q}\), \(i=1,2,\ldots ,k\), \({\textbf {I}}\) is the identity matrix of appropriate dimension, and \({\textbf {e}}_i\) is a vector of ones of appropriate dimension.

In practice, q-flat also starts from a random initial assignment of the data samples and reassigns them with

$$\begin{aligned} \text {Cluster}({\textbf {x}}) = \underset{i=1,2,\ldots ,k}{\arg \min }\,\,\Vert {\textbf {W}}_i^T{\textbf {x}}-\varvec{\gamma }_i\Vert _2^2 \end{aligned}$$
(5)

after obtaining all \({\textbf {W}}_{i}\) and \(\varvec{\gamma }_i\).

Similar to kPC and kPPC, q-flat alternates between updating the k flats and reassigning the data samples in order to determine the k cluster flats and the k clusters.
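For a fixed \({\textbf {W}}_i\), the optimal offset in (4) is \(\varvec{\gamma }_i={\textbf {W}}_i^T\) times the cluster mean, so the columns of \({\textbf {W}}_i\) span the directions of smallest variance of the centered cluster data, which can be read off from a singular value decomposition. A minimal MATLAB sketch of one q-flat round under this reading (our own naming, not the original implementation):

```matlab
% Sketch of one q-flat round (illustrative only; names are ours).
% X: m-by-N data matrix, labels: 1-by-N current labels, q: number of constraints
function [W, Gamma, labels] = qflat_update(X, labels, k, q)
    [m, N] = size(X);
    W = zeros(m, q, k);  Gamma = zeros(q, k);
    for i = 1:k
        Xi = X(:, labels == i);
        mu = mean(Xi, 2);
        [U, ~, ~] = svd(Xi - mu);              % full SVD of the centered cluster data
        W(:, :, i) = U(:, end-q+1:end);        % q directions of smallest variance
        Gamma(:, i) = W(:, :, i)' * mu;        % optimal offset for fixed W_i
    end
    % reassignment step, cf. (5)
    D = zeros(k, N);
    for i = 1:k
        D(i, :) = sum((W(:, :, i)' * X - Gamma(:, i)) .^ 2, 1);
    end
    [~, labels] = min(D, [], 1);
end
```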

2.4 A2DEIC

Different from kPC, kPPC, and q-flat, A2DEIC [27] is an image clustering algorithm that deals directly with the matrix representation. It uses two projection matrices to map the original data samples into a low-dimensional subspace and performs clustering there. Given the image dataset T, A2DEIC minimizes the following objective function

$$\begin{aligned} \begin{aligned} \underset{{\textbf {U}}, {\textbf {V}}}{\min }~~&\sum \limits _{i=1}^{N}\sum \limits _{j=1}^{k}y_{ij}\Vert {\textbf {U}}^T({\textbf {X}}_i-{\overline{{\textbf {X}}}}_j){\textbf {V}}\Vert _F^2-\lambda \sum \limits _{i=1}^{N}\Vert {\textbf {U}}^T({\textbf {X}}_i-\overline{{\textbf {X}}}){\textbf {V}}\Vert _F^2\\ \hbox {s.t. }\,&{\textbf {U}}^T{\textbf {U}}={\textbf {I}}, {\textbf {V}}^T{\textbf {V}}={\textbf {I}}, \end{aligned} \end{aligned}$$
(6)

where \({\textbf {U}}\in \mathbb {R}^{m\times q_1}\) and \({\textbf {V}}\in \mathbb {R}^{n\times q_2}\) are projection matrices mapping the original data samples into a low-dimensional subspace \(\mathbb {R}^{q_1 \times q_2}\). \(y_{ij}\in \{0,1\}\) denotes the cluster indicator of data sample \({\textbf {X}}_i\): its value is 1 if \({\textbf {X}}_i\) is assigned to the j-th cluster, and 0 otherwise. \(\overline{{\textbf {X}}}\) is the mean of all data sample matrices and \({\overline{{\textbf {X}}}}_j\) is the mean of the data samples in the j-th cluster. \(\lambda\) is a positive parameter. A2DEIC is solved by an iterative algorithm.
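Since (6) fully specifies the A2DEIC objective, it can be evaluated directly for candidate projections. The following MATLAB sketch computes (6) for given \({\textbf {U}}\), \({\textbf {V}}\) and cluster labels; it is for illustration only and does not reproduce the authors' iterative solver (the cell-array data layout and the function name are ours):

```matlab
% Evaluate the A2DEIC objective (6) for given projections (illustrative only).
% Xs: 1-by-N cell array of m-by-n image matrices, labels: 1-by-N labels in {1,...,k}
% U: m-by-q1, V: n-by-q2, lambda: positive trade-off parameter
function J = a2deic_objective(Xs, labels, U, V, k, lambda)
    N = numel(Xs);
    Xbar = mean(cat(3, Xs{:}), 3);                           % mean of all samples
    J = 0;
    for j = 1:k
        Xbar_j = mean(cat(3, Xs{labels == j}), 3);           % mean of cluster j
        for i = find(labels == j)
            J = J + norm(U' * (Xs{i} - Xbar_j) * V, 'fro')^2;  % within-cluster term
        end
    end
    for i = 1:N
        J = J - lambda * norm(U' * (Xs{i} - Xbar) * V, 'fro')^2;  % global scatter term
    end
end
```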

3 Two-dimensional k-subspace clustering

3.1 Problem formulation

When the input data are in matrix (two-dimensional) form, such as images, vector-based algorithms must convert the matrices to vectors, which discards the spatial relationships between pixels and increases the computational cost. As discussed above, A2DEIC was proposed to process matrix input directly. However, the behavior of A2DEIC is greatly affected by its tuning parameter, and its optimization problem is complicated to solve. Inspired by 2DBLDA, we propose a new two-dimensional k-subspace clustering algorithm (2DkSC) for image matrices. Inheriting from 2DBLDA, 2DkSC automatically adapts to the given dataset, requires no parameter tuning, and its optimization problem can be solved efficiently. Moreover, it learns the clustering result in a most discriminant subspace of an appropriate dimension while preserving the original structural information of the image matrices.

Specifically, 2DkSC first initializes the cluster assignment and computes the k subspaces. A new assignment is then obtained from the k subspaces, and the whole procedure is repeated. For the i-th cluster, \(i=1,\ldots ,k\), we solve the following optimization problem

$$\begin{aligned} \begin{aligned} \underset{{\textbf {W}}_{i}}{\min }~~&\Delta _{i}\sum _{s=1}^{N_i}\Vert {\textbf {W}}_{i}^T({\textbf {X}}_s^i-\overline{{\textbf {X}}}_i)\Vert _{F}^{2}-\frac{1}{N}\sum _{j\not =i}\sqrt{N_iN_j}\Vert {\textbf {W}}_{i}^T(\overline{{\textbf {X}}}_{i}-\overline{{\textbf {X}}}_{j})\Vert _{F}^{2}\\ \hbox {s.t. }\,&{\textbf {W}}_{i}^T{\textbf {W}}_{i}={\textbf {I}}, \end{aligned} \end{aligned}$$
(7)

where \({\textbf {W}}_{i}\in \mathbb {R}^{m\times d}\) is the projection matrix for the i-th subspace, \(d \le m\), and \(\Delta _{i}=\frac{1}{4}\sum \nolimits _{j\not =i}\frac{\sqrt{N_iN_j}}{N}\Vert {\overline{{\textbf {X}}}_{i}-\overline{{\textbf {X}}}_{j}}\Vert _F^2\) is a weighting constant.

We now give the geometric meaning of model (7). Minimizing the first term in (7) pulls the data samples of the i-th cluster toward their own cluster center in the projected subspace. Minimizing the second (negative) term in (7), i.e., maximizing the between-cluster distances, keeps the centers of different clusters apart in the projected space, which guarantees between-cluster separability. The weighting constant \(\Delta _{i}\) in front of the first term balances the within-cluster and between-cluster terms; it is derived by minimizing an upper bound of the Bhattacharyya error, and the details can be found in the appendix. Since \(\Delta _{i}\) is determined by the given dataset, 2DkSC adapts to different data automatically. The constraint \({\textbf {W}}_{i}^T{\textbf {W}}_{i}={\textbf {I}}\) ensures that the obtained discriminant directions of the i-th cluster are orthonormal, which minimizes redundancy in the projected space.
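For completeness, the reduction of (7) to the eigenvalue problem of the next subsection follows from the identity \(\Vert {\textbf {W}}^T{\textbf {B}}\Vert _F^2={\textbf {tr}}({\textbf {W}}^T{\textbf {B}}{\textbf {B}}^T{\textbf {W}})\) applied to both terms of (7):

$$\begin{aligned} \Delta _{i}\sum _{s=1}^{N_i}\Vert {\textbf {W}}_{i}^T({\textbf {X}}_s^i-\overline{{\textbf {X}}}_i)\Vert _{F}^{2}-\frac{1}{N}\sum _{j\not =i}\sqrt{N_iN_j}\Vert {\textbf {W}}_{i}^T(\overline{{\textbf {X}}}_{i}-\overline{{\textbf {X}}}_{j})\Vert _{F}^{2}={\textbf {tr}}({\textbf {W}}_{i}^T{\textbf {M}}_{i}{\textbf {W}}_{i}), \end{aligned}$$

where \({\textbf {M}}_{i}\) is the matrix given in (9) below.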

3.2 Solving algorithm and computational complexity analysis

2DkSC can be solved by the following standard eigenvalue decomposition problem

$$\begin{aligned} \begin{aligned} \underset{{\textbf {W}}_{i}}{\min }&~~{\textbf {tr}}({\textbf {W}}_{i}^T{\textbf {M}}_{i}{} {\textbf {W}}_{i})\\ \hbox {s.t. }\,&{\textbf {W}}_{i}^T{\textbf {W}}_{i}={\textbf {I}}, \end{aligned} \end{aligned}$$
(8)

where

$$\begin{aligned} \begin{aligned} {\textbf {M}}_{i}=\,&\Delta _{i}\sum _{s=1}^{N_i}\left( {\textbf {X}}_s^i-\overline{{\textbf {X}}}_i\right) \left( {\textbf {X}}_s^i-\overline{{\textbf {X}}}_i\right) ^T-\frac{1}{N}\sum _{j\not =i}\sqrt{N_iN_j}\left( \overline{{\textbf {X}}}_{i}-\overline{{\textbf {X}}}_{j}\right) \left( \overline{{\textbf {X}}}_{i}-\overline{{\textbf {X}}}_{j}\right) ^T. \end{aligned} \end{aligned}$$
(9)

Starting from the initial cluster labels of all data samples, 2DkSC alternately updates the sample labels and the k clustering subspaces. After finding the optimal solution of model (8) for each cluster, a data sample \({\textbf {X}}_l\) is relabeled as follows

$$\begin{aligned} \text {Cluster}({\textbf {X}}_l) = \underset{i=1,2,\ldots ,k}{\arg \min }\,\,\Vert {\textbf {W}}_{i}^T({\textbf {X}}_l-\overline{{\textbf {X}}}_i)\Vert _{F}^{2}, ~l=1,\,2,\,\ldots ,N \end{aligned}$$
(10)

and the k clusters are updated accordingly. The updated clusters are then used to determine new projection directions by model (7). The entire process continues until the cluster assignment of all data samples repeats. The clustering procedure of 2DkSC is summarized in Algorithm 1.

Algorithm 1 The 2DkSC clustering algorithm
Input: dataset T, cluster number k, reduced dimension d. Output: cluster labels of all data samples.
1. Initialize the cluster labels of all data samples.
2. Repeat:
(a) For each cluster \(i=1,2,\ldots ,k\), compute \({\textbf {M}}_{i}\) by (9) and obtain \({\textbf {W}}_{i}\) by solving the eigenvalue problem (8).
(b) Reassign every data sample \({\textbf {X}}_l\) to a cluster by (10).
until the cluster assignment of all data samples repeats.

For 2DkSC, the main computational cost is solving the optimization problem (8). From Algorithm 1, the dominant cost is computing the matrix \({\textbf {M}}_{i}\) and performing its standard eigenvalue decomposition, whose complexity is \(O(m^3)\). Therefore, the computational complexity of Step (a) in Algorithm 1 is \(O(rkm^3)\), where r is the number of iterations and k is the number of clusters. The computational complexity of Step (b) is O(rkmnN). Since for high-dimensional data nN is typically much smaller than \(m^2\), the overall computational complexity of 2DkSC is \(O(rkm^3)\).
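The following MATLAB sketch illustrates the alternation of Algorithm 1: Step (a) builds \({\textbf {M}}_{i}\) from (9) and keeps the d eigenvectors with the smallest eigenvalues, and Step (b) reassigns the samples by (10) until the assignment no longer changes. It is our own illustration (the cell-array data layout and names are ours), not the released code:

```matlab
% Sketch of the 2DkSC iteration of Algorithm 1 (illustrative only; names are ours).
% Xs: 1-by-N cell array of m-by-n image matrices, labels0: 1-by-N initial labels,
% k: number of clusters, d: reduced dimension
function labels = two_d_ksc(Xs, labels0, k, d)
    N = numel(Xs);  labels = labels0;
    [m, ~] = size(Xs{1});
    while true
        W = zeros(m, d, k);  Means = cell(1, k);  Ni = zeros(1, k);
        for i = 1:k
            Means{i} = mean(cat(3, Xs{labels == i}), 3);       % cluster mean matrix
            Ni(i) = nnz(labels == i);
        end
        % Step (a): update the k projection subspaces by solving (8)
        for i = 1:k
            Delta = 0;  B = zeros(m, m);
            for j = [1:i-1, i+1:k]
                Dij = Means{i} - Means{j};
                wij = sqrt(Ni(i) * Ni(j)) / N;
                Delta = Delta + wij * norm(Dij, 'fro')^2 / 4;  % weighting constant Delta_i
                B = B + wij * (Dij * Dij');                    % between-cluster part of (9)
            end
            S = zeros(m, m);
            for s = find(labels == i)
                C = Xs{s} - Means{i};
                S = S + C * C';                                % within-cluster part of (9)
            end
            Mi = Delta * S - B;                                % matrix M_i in (9)
            [V, E] = eig((Mi + Mi') / 2);
            [~, idx] = sort(diag(E), 'ascend');
            W(:, :, i) = V(:, idx(1:d));                       % d smallest eigenvectors
        end
        % Step (b): reassign every sample by (10)
        newlabels = labels;
        for l = 1:N
            dist = zeros(1, k);
            for i = 1:k
                dist(i) = norm(W(:, :, i)' * (Xs{l} - Means{i}), 'fro')^2;
            end
            [~, newlabels(l)] = min(dist);
        end
        if isequal(newlabels, labels), break; end              % assignment repeated: stop
        labels = newlabels;
    end
end
```

Keeping the images as matrices throughout avoids the \(m\times n\)-dimensional vectorization discussed in section 1.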

To further illustrate the contribution of our method, we discuss the differences between the proposed 2DkSC and four closely related methods: kPPC, TWSVC, q-flat and A2DEIC.

(i) Difference from kPPC, TWSVC and q-flat: Compared with the vector-based clustering algorithms kPPC, TWSVC and q-flat, the proposed 2DkSC is a matrix-based method. The similarity between kPPC, TWSVC and 2DkSC is that their objective functions maximize the between-cluster distance while minimizing the within-cluster distance, whereas q-flat minimizes only the within-cluster distance. The weighting constant of 2DkSC is derived from the Bhattacharyya error bound and adapts to the data automatically, while the weighting parameters of kPPC and TWSVC must be tuned by grid search. In addition, 2DkSC and q-flat achieve clustering and dimensionality reduction simultaneously, while kPPC and TWSVC provide only clustering, without dimensionality reduction. 2DkSC and kPPC obtain their solutions by solving eigenvalue problems, while q-flat is solved by singular value decomposition and TWSVC by two quadratic programming problems.

(ii) Difference from A2DEIC: Although A2DEIC can also deal directly with matrix data, it is strongly influenced by its tuning parameter, and searching for the optimal parameter is difficult and time-consuming, while 2DkSC needs no parameter tuning and can solve its optimization problem efficiently. 2DkSC solves its optimization problem by a standard eigenvalue problem, while A2DEIC relies on an iterative technique.

4 Experiments

We compare the proposed approach with seven related clustering algorithms: k-means [17], q-flat [19], kPPC [20], TWSVC [21], FRTWSVC [25], kSDC [26], and A2DEIC [27]. All experiments are performed on a PC with an Intel 3.30 GHz CPU and 4 GB RAM under the Matlab 2017b platform. kPPC and A2DEIC obtain their solutions by solving eigenvalue problems, q-flat by singular value decomposition, TWSVC by two quadratic programming problems, FRTWSVC by a series of linear systems of equations, and kSDC by an alternating direction method of multipliers. As for parameter selection, the tuning parameter c in kPPC, TWSVC, FRTWSVC and A2DEIC is selected from the set \(\{2^{-8},2^{-7},\ldots ,2^{7}\}\) using the grid search technique for each investigated method. The cluster number k is set equal to the ground-truth number of classes of each dataset by default. For unknown k, one option is to estimate it with a non-parametric Bayesian method [35]; another is to run the clustering method with different numbers of clusters as input and pick the best one, whose quality can be measured by clustering accuracy or normalized mutual information. Once the optimal parameter is selected, it is used to learn the final clusters. For methods with random initialization, the average clustering result over ten runs is reported.
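As an illustration of this protocol, the following MATLAB sketch selects c by grid search and averages the accuracy over ten runs; run_method and cluster_acc are hypothetical placeholders for one of the compared methods and for the ACC measure defined in (11) below:

```matlab
% Sketch of the parameter-selection protocol (illustrative only).
% run_method and cluster_acc are hypothetical placeholders; X, k and truth
% denote the data, the cluster number and the ground-truth labels.
cgrid = 2 .^ (-8:7);                            % candidate values of c
bestacc = -inf;  bestc = cgrid(1);
for c = cgrid
    accs = zeros(1, 10);
    for run = 1:10                              % average over ten random initializations
        labels = run_method(X, k, c);           % e.g., kPPC/TWSVC/FRTWSVC/A2DEIC with parameter c
        accs(run) = cluster_acc(labels, truth); % clustering accuracy, cf. (11)
    end
    if mean(accs) > bestacc
        bestacc = mean(accs);  bestc = c;       % keep the best parameter
    end
end
```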

4.1 Evaluation metrics

Following most work on clustering, we use clustering accuracy (ACC) and normalized mutual information (NMI) [36,37,38] as evaluation measures; both lie in the range [0, 1], and a larger value indicates a more accurate clustering result. Suppose \(p_i\) is the label predicted by a clustering algorithm for data sample \({\textbf {X}}_i\) and \(t_i\) is the corresponding true label. The ACC is defined as follows:

$$\begin{aligned} \begin{aligned} ACC=\frac{\sum \nolimits _{i=1}^{N}\delta (t_i,map(p_i))}{N}, \end{aligned} \end{aligned}$$
(11)

where \(\delta (y_1,y_2)=1\) if \(y_1=y_2\) and \(\delta (y_1,y_2)=0\) otherwise, and \(map(p_i)\) is the best mapping function that permutes the clustering labels to match the true labels, obtained by the Kuhn-Munkres algorithm [38].

Let C denote the set of clusters given by the ground truth and \(C'\) the set obtained by a clustering algorithm. The mutual information \(MI(C,C')\) is defined as follows:

$$\begin{aligned} \begin{aligned} MI(C,C')=\sum \limits _{c_i\in C,c_j'\in C'}p(c_i,c_j')log_{2}\frac{p(c_i,c_j')}{p(c_i)p(c_j')}, \end{aligned} \end{aligned}$$
(12)

where \(p(c_i)\) and \(p(c_j')\) are the probabilities that a data sample arbitrarily selected from the dataset belongs to cluster \(c_i\) or \(c_j'\), respectively, and \(p(c_i,c_j')\) is the joint probability that the arbitrarily selected data sample belongs to both \(c_i\) and \(c_j'\) simultaneously. In our experiments, we use the normalized mutual information (NMI) as follows:

$$\begin{aligned} \begin{aligned} NMI(C,C')=\frac{MI(C,C')}{max(H(C),H(C'))}, \end{aligned} \end{aligned}$$
(13)

where H(C) and \(H(C')\) are the entropies of C and \(C'\), respectively. It is easy to verify that \(NMI(C,C')\) ranges from 0 to 1: \(NMI(C,C')=1\) if the two sets of clusters are identical, and \(NMI(C,C')=0\) if they are independent.
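A minimal MATLAB helper implementing (12) and (13) is sketched below (our own code, not a toolbox routine); the label-mapping step required for ACC in (11) is omitted, since it additionally needs a Kuhn-Munkres solver:

```matlab
% NMI computation following (12)-(13) (illustrative helper; names are ours).
% t: 1-by-N ground-truth labels, p: 1-by-N predicted labels
function nmi = cluster_nmi(t, p)
    N = numel(t);
    [~, ~, ti] = unique(t);                        % relabel ground truth to 1..|C|
    [~, ~, pj] = unique(p);                        % relabel predictions to 1..|C'|
    Pjoint = accumarray([ti(:), pj(:)], 1) / N;    % joint probabilities p(c_i, c_j')
    Pt = sum(Pjoint, 2);  Pp = sum(Pjoint, 1);     % marginals p(c_i) and p(c_j')
    Prod = Pt * Pp;                                % products p(c_i) p(c_j')
    nz = Pjoint > 0;
    MI = sum(Pjoint(nz) .* log2(Pjoint(nz) ./ Prod(nz)));   % mutual information (12)
    Ht = -sum(Pt(Pt > 0) .* log2(Pt(Pt > 0)));     % entropy H(C)
    Hp = -sum(Pp(Pp > 0) .* log2(Pp(Pp > 0)));     % entropy H(C')
    nmi = MI / max(Ht, Hp);                        % normalization (13)
end
```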

4.2 Datasets

The experiments are performed on six image datasets, including one object image, one handwritten image, and four face images.

Object recognition: We use the Coil100 dataset [39]. Coil100 contains 900 images with 100 different objects.

Handwritten digit recognition: We use the USPS dataset to evaluate handwritten digit clustering performance. The dataset contains 11000 samples from 10 classes, where each sample corresponds to one digit.

Face recognition: Four face image datasets (Yale [40], Indian [41], ORL and FERET [42]) are used. The Yale dataset contains 165 images of 15 individuals. The Indian dataset contains 242 face images of 22 females. The ORL dataset contains 400 images of 40 individuals. The FERET dataset contains 14126 images of 1199 individuals and 365 duplicate image sets; here we use a subset containing 1400 images of 200 individuals.

The numbers of samples and categories as well as the image sizes are listed in Table 1. Some of the gray images are shown in Fig. 1.

Table 1 The summary of data statistics
Fig. 1 Six image datasets

4.3 Experimental results

4.3.1 Performance analysis

The performance comparison of the different algorithms is shown in Tables 2 and 3, where the best results are indicated in bold. The p-values from the paired t-test at the 5% significance level are also reported. From the experimental results, the following observations can be made:

(1) 2DkSC achieves the best clustering results in terms of both average ACC and average NMI. Moreover, 2DkSC has the highest ACC and NMI on five of the six datasets. For example, 2DkSC achieves 77.45% NMI on the Coil100 dataset, which is 3.89% higher than the vector-based q-flat algorithm and 6.50% higher than the matrix-based A2DEIC algorithm. This demonstrates the superiority of the proposed algorithm for image clustering.

(2) As a two-dimensional embedding method for image clustering, A2DEIC achieves the second-best performance in terms of both average ACC and average NMI. The reason is that A2DEIC directly handles matrix representations, so the spatial information of the original data is preserved. For example, A2DEIC has the highest ACC of 70.30% on the Yale dataset, 4.02% higher than 2DkSC and 10.30% higher than the vector-based kSDC algorithm.

(3) We also find that q-flat performs better than the other vector-based algorithms, ranking third in both average ACC and average NMI. For example, q-flat has the highest NMI of 54.09% on the USPS dataset, 4.66% higher than 2DkSC and 13.13% higher than k-means. Similar to q-flat, kSDC is also a vector-based clustering algorithm, and it ranks fourth in both average ACC and average NMI. These results support the fact that q-flat and kSDC are able to capture the intrinsic structure of the data in a low-dimensional subspace.

(4) k-means performs better than the plane-based clustering algorithms kPPC, TWSVC, and FRTWSVC in terms of both average ACC and average NMI. TWSVC performs better than kPPC and FRTWSVC, and kPPC has the worst performance.

(5) In terms of CPU time, kPPC, TWSVC and kSDC are slower than the other methods. In contrast, 2DkSC costs the least CPU time among all the compared clustering algorithms. This is because 2DkSC requires no parameter tuning and its solution can be obtained quickly, which demonstrates the efficiency of the proposed method.

(6) The p-values between 2DkSC and the other methods show that, on most of the datasets, 2DkSC is statistically different from the other methods.

Table 2 ACC(%), CPU time (second) and p value for different algorithms
Table 3 NMI(%), CPU time (second) and p value for different algorithms

4.3.2 The influence of the dimension

To examine the discriminative ability, the clustering results of 2DkSC and A2DEIC for different reduced dimensions \(d = 1, 2,\ldots , m\) are shown in Fig. 2. The results show the following: (i) although the ACC and NMI curves of A2DEIC are obtained under its optimal parameter, its best ACC and NMI are not as good as those of our method; (ii) as the reduced dimension increases, the ACC and NMI of 2DkSC vary noticeably; (iii) 2DkSC achieves its best results at the optimal reduced dimension on all datasets; (iv) both A2DEIC and 2DkSC are strongly affected by the reduced dimension, and it is necessary to choose an appropriate reduced dimension.

Fig. 2 Clustering results of A2DEIC and 2DkSC along different dimensions

5 Conclusion

In this paper, a novel two-dimensional k-subspace clustering method named 2DkSC is investigated. Both discriminative and underlying structural information are embedded in 2DkSC, and it realizes dimensionality reduction and clustering simultaneously. The 2DkSC algorithm has no tuning parameters: its weighting constant is adaptively determined by the involved data, and its optimization problem has a closed-form solution. Experimental results on image recognition have shown the superiority of the proposed method. However, a drawback of 2DkSC is that it may not be very robust to noise, since it is based on the squared F-norm. Therefore, we will investigate robust two-dimensional subspace clustering algorithms in the future. Our MATLAB code can be downloaded from http://www.optimalgroup.org/Resources/Code/2DkSC.html.