1 Introduction

Clustering [1] is the process of grouping data points into clusters so that data points within a cluster are highly similar, while data points in different clusters are dissimilar. Clustering has been applied in many areas, including data mining, statistics, biology, and information retrieval [2–4]. There are many methods for clustering, such as partitioning methods [5–7], hierarchical methods [8, 9], and density-based methods [10, 11]. Many clustering methods [5, 6, 10] aim at grouping data points into several clusters around center points, such as k-means [3, 12] and fuzzy C-means [13], and may perform better when the data points are distributed in spherical shapes. However, data points may not be clustered around a single point; e.g., in the toy example shown in Fig. 1a [14], the data points of each cluster are distributed along a straight line.

Fig. 1 Different performance of k-means and kPC on artificial data points that are distributed along two crossed straight lines. Data points with the same color and shape belong to the same cluster

k-plane clustering (kPC) [15, 16] was proposed to handle the case where data points fall into clusters around planes. To obtain the final center planes, kPC alternates between center-plane updates and cluster assignments. Therefore, in Fig. 1, kPC performs better than k-means. However, kPC considers only the within-cluster compactness, which may lead to difficulties. A toy example is given in Fig. 2, in which kPC constructs two overlapping lines that misclassify the data points of cluster 2 into cluster 1. Hence, kPC fails to find reasonable clusters and effectively reduces the number of clusters to 2. In addition, randomly selecting the initial data points in both k-means and kPC affects the clustering results to a great extent [17–19]. A set of poor-quality initial seeds may lead to a poor-quality clustering result.

Fig. 2 Different methods on an artificial dataset whose cluster 1 and cluster 2 lie approximately on the same line

Recent studies [20–22] show that between-cluster data points can be used to improve clustering performance. Therefore, we introduce between-cluster data points into kPC and propose a novel plane-based clustering method, called k-proximal plane clustering (kPPC). Similar to kPC, kPPC also updates the center planes and the cluster assignments alternately. The difference is that, in the updating process, each center plane is required not only to be close to its own data points but also to be far away from the others, which leads to solving several eigenvalue problems. In addition, we extend the above procedure to the nonlinear case by the kernel trick [14, 23–27]. Compared with kPC, the main advantages of our kPPC are: (1) each center plane in kPPC is not only close to its own data points but also far away from the others; (2) instead of the random initialization in kPC, kPPC constructs a Laplacian graph to give efficient initial clusters; (3) a nonlinear model of kPPC is also proposed via the kernel trick.

The remainder of this paper is divided into the following parts. Section 2 gives a brief review of the related work. Section 3 presents the details of our kPPC. In Sect. 4, we report the experimental results on both artificial and benchmark datasets. Finally, Sect. 5 concludes the paper.

2 Related works

In this paper, we consider a set of m unlabeled data points in the n-dimensional real space represented by the matrix A in \(R^{m\times n}\). For the clustering problem, the purpose is to cluster A into k clusters with the labels \(y_i\in {N}, i=1,2,\ldots ,m\) represented by the vector \({\mathbf{y}} \in N^{m}\). Next, we briefly review k-means and kPC.

2.1 k-means

k-means [28, 29] wishes to cluster m data points into k clusters so that data points of each cluster lie near the cluster center data point.

k-means keeps each data point closest to its own cluster center point, using the sum of squared errors as the dissimilarity measure, which leads to solving the following problem

$$\begin{aligned} \underset{{{\mathbf {A}}}_i}{\min }\sum _{i=1}^{k}\sum _{{\mathbf {x}}\in {\mathbf {A}}_i}||{\mathbf {x}}-{{\mathbf {A}}}_i||_2^2, \end{aligned}$$
(1)

where \(\mathbf x\) denotes a data point, \({{\mathbf {A}}}_i\) denotes the center point of the ith cluster (and, by a slight abuse of notation, also the set of rows of \({{\mathbf {A}}}\) that belong to the ith cluster), and \(||\cdot ||_2\) denotes the \(L_2\) norm.

By using the Lagrangian multiplier method [30, 31], the cluster center points can be updated according to

$$\begin{aligned} {{\mathbf {A}}}_i=\frac{1}{m_i}\sum _{{\mathbf {x}}\in {{\mathbf {A}}}_i}{\mathbf {x}}, ~~i=1,\ldots ,k. \end{aligned}$$
(2)

where \(m_i\) is the number of data points corresponding to \({{\mathbf {A}}}_i\).

After obtaining the cluster center points, a data point \(\mathbf x\) is labeled by its nearest center as follows

$$\begin{aligned} y=\underset{i}{\arg \min } ||{\mathbf {x}}-{{\mathbf {A}}}_i||, \quad i=1,\ldots ,k. \end{aligned}$$
(3)

Given a set of m data points, k-means randomly chooses k data points as the initial cluster centers. In turn, every data point is assigned to its nearest cluster center by (3). Then, the k cluster center points are updated by (2). The final k clusters are obtained when no data point is redistributed and no cluster centroid changes.
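For concreteness, a minimal Python/NumPy sketch of this alternating procedure is given below. It only illustrates (1)–(3) and is not the code used in the experiments (which were run in Matlab); the function name and seed handling are ours.

```python
import numpy as np

def kmeans(A, k, max_iter=100, seed=0):
    # Alternate between the assignment step (3) and the update step (2).
    rng = np.random.default_rng(seed)
    centers = A[rng.choice(len(A), size=k, replace=False)].astype(float)
    labels = np.full(len(A), -1)
    for _ in range(max_iter):
        # assignment step, Eq. (3): label each point by its nearest center
        dists = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no redistribution: converged
            break
        labels = new_labels
        # update step, Eq. (2): each center becomes the mean of its points
        for i in range(k):
            if np.any(labels == i):
                centers[i] = A[labels == i].mean(axis=0)
    return labels, centers
```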

2.2 kPC

The kPC [15] wishes to cluster \({{\mathbf {A}}}\) into k clusters such that the data points of each cluster lie close to a cluster center plane defined as follows

$$\begin{aligned} f_i({\mathbf {x}})={\mathbf {x}} {{\mathbf {w}}}_i+b_i=0,\quad i=1,2,\ldots ,k, \end{aligned}$$
(4)

where \({{\mathbf {w}}}_i \in R^{n}\) is the weight vector and \(b_i\in R\) is the bias.

kPC keeps each data point closest to its own cluster center plane, which leads to solving the following programming problems with \(i=1,2,\ldots ,k\):

$$\begin{aligned} \begin{array}{ll} \underset{{{\mathbf {w}}}_i,b_i}{\min }&{}||{{\mathbf {A}}}_i{{\mathbf {w}}}_i+b_i{{\mathbf {e}}}_i||_2^2\\ s.t.&{}||{{\mathbf {w}}}_i||_2^2=1, \end{array} \end{aligned}$$
(5)

where \({\mathbf {A}}_i\) contains the rows of \({\mathbf {A}}\) that belong to the ith cluster and \({\mathbf {e}}_i\) is a vector of ones of appropriate dimension. The geometric interpretation is clear: given the ith cluster, the objective of (5) minimizes the sum of the squared distances between the data points of the ith cluster and the ith center plane, and the constraint normalizes the normal vector of the center plane. The solution of problem (5) can be obtained by solving k eigenvalue problems [15].

Once the k cluster center planes are obtained, a data point \(\mathbf {x}\) is assigned to the ith cluster by

$$\begin{aligned} y=\underset{i}{\arg \min } ||{\mathbf {x}} {{\mathbf {w}}}_i+b_i||,\quad i=1,\ldots ,k. \end{aligned}$$
(6)

Given the original data points \({\mathbf {A}}\), kPC starts by randomly initializing \(({\mathbf {w}}_i,b_i)\) with \(||{\mathbf {w}}_i||_2^2=1\). In turn, every data point is assigned a label by (6). Then, the k center planes are updated by solving (5). The final k cluster center planes are obtained when the overall assignment of data points to center planes repeats or the overall objective function no longer decreases.
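As a concrete illustration of the plane-update step, the sketch below solves (5) for one cluster by eliminating the bias and taking the eigenvector of the centered scatter matrix with the smallest eigenvalue; this is one standard way to reduce (5) to an eigenvalue problem, and the helper name is ours rather than from the paper.

```python
import numpy as np

def kpc_plane(Ai):
    # Solve (5) for one cluster: for fixed w the optimal b centers the cluster,
    # so w is the eigenvector of A_i^T (I - e e^T / m_i) A_i with the smallest
    # eigenvalue. A sketch, not the authors' reference implementation.
    mean = Ai.mean(axis=0)
    centered = Ai - mean
    scatter = centered.T @ centered
    _, eigvecs = np.linalg.eigh(scatter)     # eigenvalues in ascending order
    w = eigvecs[:, 0]                        # smallest eigenvalue -> plane normal
    b = -mean @ w                            # plane passes through the cluster mean
    return w, b
```

A data point \(\mathbf {x}\) is then assigned by (6), i.e., to the plane with the smallest \(|{\mathbf {x}} {\mathbf {w}}_i+b_i|\).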

3 kPPC formulation

3.1 Linear kPPC

For each cluster, kPPC requires not only that the data points of this cluster be close to the corresponding center plane, but also that the other data points be far away from it, which leads to solving the following programming problems with \(i=1,2,\ldots ,k\)

$$\begin{aligned} \begin{array}{ll} \underset{({{\mathbf {w}}}_i,b_i)}{\min }&{} ||{{\mathbf {A}}}_i{{\mathbf {w}}}_i+b_i{{\mathbf {e}}}_i||_2^2-c||{{\mathbf {B}}}_i{{\mathbf {w}}}_i+b_i{\bar{{\mathbf {e}}}}_i||_2^2\\ s.t.&{} ||{{\mathbf {w}}}_i||_2^2=1, \end{array} \end{aligned}$$
(7)

where \({\mathbf {A}}_i\) contains the rows of \({\mathbf {A}}\) that belong to the ith cluster and \({\mathbf {B}}_i\) contains the data points not belonging to the ith cluster; \(c>0\) is a parameter weighting the proximal term against the aloof term; \({\mathbf {e}}_i\) and \({\bar{{\mathbf {e}}}}_i\) are vectors of ones of appropriate dimensions. Once the k cluster center planes are obtained, the data points are assigned, as in kPC, by (6).

Now, we give the details of solving (7). For \(i=1,2,\ldots ,k\), problem (7) can be solved by the Lagrangian multiplier method [23, 30, 31] as follows. The Lagrangian function of (7) is

$$\begin{aligned} L=||{\mathbf {A}}_i{\mathbf {w}}_i+b_i{\mathbf {e}}_i||_2^2-c||{\mathbf {B}}_i{\mathbf {w}}_i+b_i{\bar{{\mathbf {e}}}}_i||_2^2-\lambda (||{\mathbf {w}}_i||_2^2-1), \end{aligned}$$
(8)

where \(\lambda\) is the Lagrangian multiplier. Setting the partial derivatives of L with respect to \(({\mathbf {w}}_i,b_i)\) to zero, we obtain the following two equalities:

$$\begin{aligned} \frac{1}{2}\frac{\partial L}{\partial {\mathbf {w}}_i}=({\mathbf {A}}_i^\top {\mathbf {A}}_i-c{\mathbf {B}}_i^\top {\mathbf {B}}_i){\mathbf {w}}_i+({\mathbf {A}}_i^\top {\mathbf {e}}_i-c{\mathbf {B}}_i^\top {\bar{{\mathbf {e}}}}_i)b_i-\lambda {\mathbf {w}}_i=0,\end{aligned}$$
(9)
$$\begin{aligned} \frac{1}{2}\frac{\partial L}{\partial b_i}=({\mathbf {e}}_i^\top {\mathbf {A}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\mathbf {B}}_i){\mathbf {w}}_i+({\mathbf {e}}_i^\top {\mathbf {e}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i)b_i=0. \end{aligned}$$
(10)

When \({\mathbf {e}}_i^\top {\mathbf {e}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i=0\), \(b_i\) disappears from (10), and thus the solution of (7) falls into two cases:

Case (1). \({{\mathbf {e}}}_i^\top {{\mathbf {e}}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i\ne 0\):

$$\begin{aligned} {\mathbf{Q}} {\mathbf {w}}_i=\lambda {\mathbf {w}}_i,\end{aligned}$$
(11)
$$\begin{aligned} b_i=\dfrac{({\mathbf {e}}_i^\top {\mathbf {A}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\mathbf {B}}_i){\mathbf {w}}_i}{-{\mathbf {e}}_i^\top {\mathbf {e}}_i+c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i}, \end{aligned}$$
(12)

where \({\mathbf{Q}} ={\mathbf {A}}_i^\top {{\mathbf {A}}}_i-c{\mathbf {B}}_i^\top {\mathbf {B}}_i +\dfrac{({\mathbf {e}}_i^\top {\mathbf {A}}_i-c{\bar{{{\mathbf {e}}}}}_i^\top {{\mathbf {B}}}_i)^\top ({\mathbf {e}}_i^\top {\mathbf {A}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\mathbf {B}}_i)}{-{\mathbf {e}}_i^\top {\mathbf {e}}_i+c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i}\).

Case (2). \({\mathbf {e}}_i^\top {\mathbf {e}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i=0\):

$$\begin{aligned} ({\mathbf {A}}_i^\top {\mathbf {A}}_i-c{\mathbf {B}}_i^\top {\mathbf {B}}_i){\mathbf {w}}_i=\lambda {{\mathbf {w}}}_i, \end{aligned}$$
(13)
$$\begin{aligned} b_i=\frac{1}{m_i}\sum _{\mathbf {x}\in {\mathbf {A}}_i}\mathbf {x} {\mathbf {w}}_i, \end{aligned}$$
(14)

where \(m_i\) is the number of data points corresponding to \({\mathbf {A}}_i\). It is easy to conclude that \({\mathbf {w}}_i\) is the eigenvector corresponding to the smallest eigenvalue of the matrix \({\mathbf{Q}}\) in (11) (or of \({\mathbf {A}}_i^\top {\mathbf {A}}_i-c{\mathbf {B}}_i^\top {\mathbf {B}}_i\) in (13)).
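The following sketch implements this plane update for one cluster, following Cases (1) and (2) above. It assumes NumPy; the function name and the numerical tolerance are ours.

```python
import numpy as np

def kppc_plane(Ai, Bi, c):
    # Update one center plane of linear kPPC: build Q from (11)-(12) when the
    # denominator is nonzero, otherwise fall back to (13)-(14).
    mi, mb = len(Ai), len(Bi)
    e, ebar = np.ones(mi), np.ones(mb)
    d = -mi + c * mb                      # -e^T e + c ebar^T ebar
    S = Ai.T @ Ai - c * (Bi.T @ Bi)
    r = e @ Ai - c * (ebar @ Bi)          # (e^T A_i - c ebar^T B_i)
    if abs(d) > 1e-12:                    # Case (1)
        Q = S + np.outer(r, r) / d
        _, eigvecs = np.linalg.eigh(Q)
        w = eigvecs[:, 0]                 # eigenvector of the smallest eigenvalue
        b = (r @ w) / d                   # Eq. (12)
    else:                                 # Case (2)
        _, eigvecs = np.linalg.eigh(S)
        w = eigvecs[:, 0]
        b = (Ai @ w).mean()               # Eq. (14)
    return w, b
```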

Now, we turn to the initialization of the proposed kPPC. As we know, the selection of the initial centers is very important in both k-means and kPC [18, 32–35]. Therefore, we propose an effective way to obtain the initial clusters. Specifically, we introduce a similarity graph to initialize the clusters. The following steps are adopted (a simplified code sketch is given after the list):

1. Calculate the P nearest neighbors undirected graph G [36] for \({\mathbf {A}}\) by

    $$\begin{aligned} G_{ij}=\left\{ \begin{array}{ll} 1&\quad\text{if }\ {\mathbf {x}}_i\in N(\mathbf {x}_j) ~ \text{ or }\ ~ \mathbf {x}_j\in N(\mathbf {x}_i)\\ 0 &\quad \text{otherwise } \end{array}\right. \end{aligned}$$
    (15)

    where \(N(\mathbf {x}_i)=\{\mathbf {x}_i^{(1)},\mathbf {x}_i^{(2)},\ldots ,\mathbf {x}_i^{(P)}\}\) is the set of the P nearest neighbors of \(\mathbf {x}_i\). The data points are grouped into \(\bar{c}\) connected components by assigning the same label to data points that are connected in G and different labels to those that are not.

2. Suppose k is the number of clusters one wants. If \(\bar{c}<k\), repeatedly find the two neighbors with the largest distance and break their neighborhood relation, so that one connected component is divided into two, until the number of connected components is k; if \(\bar{c}>k\), repeatedly find the two connected components with the closest distance and merge them into one, until the number of connected components is k.

3. Export the current cluster labels as the initial clusters.
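The sketch below illustrates this initialization. It covers only the merge branch (\(\bar{c}>k\)) and omits the split branch for brevity; it assumes NumPy/SciPy, and the helper name is ours, not the authors' reference implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def graph_init(A, k, P=3):
    # Build the symmetric P nearest neighbors graph (15), label its connected
    # components, and merge the closest components until exactly k remain.
    m = len(A)
    D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    nn = np.argsort(D, axis=1)[:, 1:P + 1]          # P nearest neighbors of each point
    G = np.zeros((m, m), dtype=int)
    G[np.repeat(np.arange(m), P), nn.ravel()] = 1
    G = np.maximum(G, G.T)                          # x_i in N(x_j) or x_j in N(x_i)
    cbar, labels = connected_components(csr_matrix(G), directed=False)
    while cbar > k:                                 # merge the two closest components
        best = None
        for a in range(cbar):
            for b in range(a + 1, cbar):
                dist = D[np.ix_(labels == a, labels == b)].min()
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        labels[labels == b] = a                     # merge component b into a
        labels[labels > b] -= 1                     # keep labels contiguous
        cbar -= 1
    return labels
```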

According to the above procedure, we obtain the following Algorithm 1.

Algorithm 1 Linear kPPC

We consider that the algorithm has converged when the overall assignment of data points to the cluster center planes repeats or the overall objective function no longer decreases.
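Putting the pieces together, the sketch below outlines the main loop of Algorithm 1, reusing the hypothetical graph_init and kppc_plane helpers from the earlier sketches; the convergence test mirrors the criterion just stated.

```python
import numpy as np

def kppc(A, k, c, P=3, max_iter=50):
    # Sketch of Algorithm 1: graph-based initialization, then alternate between
    # plane updates (7) and plane assignments (6) until the assignment repeats
    # or the overall objective of (7) stops decreasing.
    labels = graph_init(A, k, P)                     # initial clusters (Sect. 3.1)
    prev_obj = np.inf
    for _ in range(max_iter):
        planes = [kppc_plane(A[labels == i], A[labels != i], c) for i in range(k)]
        obj = sum(np.sum((A[labels == i] @ w + b) ** 2)
                  - c * np.sum((A[labels != i] @ w + b) ** 2)
                  for i, (w, b) in enumerate(planes))
        # assignment by (6): nearest center plane
        new_labels = np.abs(np.column_stack([A @ w + b for w, b in planes])).argmin(axis=1)
        if np.array_equal(new_labels, labels) or obj >= prev_obj:
            break
        labels, prev_obj = new_labels, obj
    return labels
```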

3.2 Nonlinear kPPC

When the data points are distributed on a nonlinear manifold (such as in Figs. 5 and 6), linear methods may not perform well. In this section, we extend our linear kPPC to manifold clustering by the kernel trick. The idea of the kernel trick [31, 37, 38] is to map the input into a higher-dimensional feature space by some nonlinear transformation. The trick is that this mapping is never given explicitly but is applied implicitly in the mapped feature space. The nonlinear kPPC formulation also follows this process.

Instead of the aforementioned linear planes in the input space, we consider the following kernel-generated surfaces [39–41]

$$\begin{aligned} K(\mathbf {x}^\top ,{\mathbf {A}}^\top ){\mathbf {w}}_i+b_i=0,~~i=1,2,\ldots ,k, \end{aligned}$$
(16)

where K is an appropriately chosen kernel.
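In the experiments of Sect. 4 the RBF kernel with parameter p is used. As an illustration, a common parameterization is sketched below; the exact form of the kernel in the original Matlab implementation may differ.

```python
import numpy as np

def rbf_kernel(X, A, p):
    # K(x, a) = exp(-p * ||x - a||^2), evaluated row-wise to give K(X, A^T).
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(A ** 2, axis=1)[None, :]
          - 2 * X @ A.T)
    return np.exp(-p * np.maximum(sq, 0.0))
```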

Nonlinear kPPC minimizes the following programming problems with \(i=1,2,\ldots ,k\):

$$\begin{aligned} \underset{({\mathbf {w}}_i,b_i)}{\min } ||K({\mathbf {A}}_i,{\mathbf {A}}^\top ){\mathbf {w}}_i+{\mathbf {e}}_ib_i||_2^2-c||K({\mathbf {B}}_i,{\mathbf {A}}^\top ){\mathbf {w}}_i+{\bar{{\mathbf {e}}}}_ib_i||_2^2\end{aligned}$$
(17)
$$\begin{aligned} s.t.\quad||{\mathbf {w}}_i||_2^2=1, \end{aligned}$$
(18)

where \({\mathbf {A}}_i\) contains the data points of the ith cluster and \({\mathbf {B}}_i\) contains the data points not in the ith cluster, \(c>0\) is a parameter weighting the proximal term against the aloof term, and \({\mathbf {e}}_i\) and \({\bar{{\mathbf {e}}}}_i\) are vectors of ones of appropriate dimensions. Once the k clustering surfaces are obtained, a data point \(\mathbf {x}\) is assigned to the ith cluster by

$$\begin{aligned} y=\underset{i}{\arg \min } ||K(\mathbf {x}^\top ,{\mathbf {A}}^\top ){\mathbf {w}}_i+b_i||,\quad i=1,\ldots ,k. \end{aligned}$$
(19)

The above problem (17)–(18) can also be solved by the Lagrangian multiplier method. The Lagrangian function is

$$\begin{aligned} L=||K({\mathbf {A}}_i,{\mathbf {A}}^\top ){\mathbf {w}}_i+{\mathbf {e}}_i b_i||_2^2-c||K({\mathbf {B}}_i,{\mathbf {A}}^\top ){\mathbf {w}}_i+{\bar{{\mathbf {e}}}}_ib_i||_2^2-\lambda (||{\mathbf {w}}_i||_2^2-1), \end{aligned}$$
(20)

where \(\lambda\) is the Lagrangian multiplier. Setting the partial derivatives of L with respect to \(({\mathbf {w}}_i,b_i)\) to zero, we again obtain two cases:

Case (i). \({\mathbf {e}}_i^\top {\mathbf {e}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i\ne 0\):

$$\begin{aligned} {\mathbf{Q}} {\mathbf {w}}_i=\lambda {\mathbf {w}}_i \end{aligned}$$
(21)
$$\begin{aligned} b_i=\dfrac{({\mathbf {e}}_i^\top K({\mathbf {A}}_i,{\mathbf {A}}^\top )-c{\bar{{\mathbf {e}}}}_i^\top K({\mathbf {B}}_i,{\mathbf {A}}^\top )){\mathbf {w}}_i}{-{\mathbf {e}}_i^\top {\mathbf {e}}_i+c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i}, \end{aligned}$$
(22)

where \({\mathbf{Q}} =K({\mathbf {A}}_i,{\mathbf {A}}^\top )^\top K({\mathbf {A}}_i,{\mathbf {A}}^\top )-cK({\mathbf {B}}_i,{\mathbf {A}}^\top )^\top K({\mathbf {B}}_i,{\mathbf {A}}^\top )+ \dfrac{({\mathbf {e}}_i^\top K({\mathbf {A}}_i,{\mathbf {A}}^\top )-c{\bar{{\mathbf {e}}}}_i^\top K({\mathbf {B}}_i,{\mathbf {A}}^\top ))^\top ({\mathbf {e}}_i^\top K({\mathbf {A}}_i,{\mathbf {A}}^\top )-c{\bar{{\mathbf {e}}}}_i^\top K({\mathbf {B}}_i,{\mathbf {A}}^\top ))}{-{\mathbf {e}}_i^\top {\mathbf {e}}_i+c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i}\).

Case (ii). \({\mathbf {e}}_i^\top {\mathbf {e}}_i-c{\bar{{\mathbf {e}}}}_i^\top {\bar{{\mathbf {e}}}}_i=0\):

$$\begin{aligned} (K({\mathbf {A}}_i,{\mathbf {A}}^\top )^\top K({\mathbf {A}}_i,{\mathbf {A}}^\top )-cK({\mathbf {B}}_i,{\mathbf {A}}^\top )^\top K({\mathbf {B}}_i,{\mathbf {A}}^\top )){\mathbf {w}}_i=\lambda {\mathbf {w}}_i\end{aligned}$$
(23)
$$\begin{aligned} b_i=\dfrac{1}{m_i}\sum _{\mathbf {x}\in K({\mathbf {A}}_i,{\mathbf {A}}^\top )}\mathbf {x} {\mathbf {w}}_i, \end{aligned}$$
(24)

where \(m_i\) is the number of data points corresponding to \({\mathbf {A}}_i\).
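Since (21)–(24) have the same structure as the linear case with \({\mathbf {A}}_i\) and \({\mathbf {B}}_i\) replaced by \(K({\mathbf {A}}_i,{\mathbf {A}}^\top )\) and \(K({\mathbf {B}}_i,{\mathbf {A}}^\top )\), the nonlinear update can reuse the linear routine. A short sketch, assuming the hypothetical kppc_plane and rbf_kernel helpers introduced earlier:

```python
def kppc_plane_kernel(Ai, Bi, A, c, p):
    # Same eigenvalue problem as the linear case, applied to the kernel-mapped
    # rows K(A_i, A^T) and K(B_i, A^T); assignment then follows (19).
    return kppc_plane(rbf_kernel(Ai, A, p), rbf_kernel(Bi, A, p), c)
```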

We also use a similarity graph to construct the initial cluster center planes instead of the random initialization for nonlinear kPPC. The process consists of the following steps: (1) choose an appropriate kernel function \(K(\cdot ,\cdot )\) to map the data points into a higher-dimensional feature space; (2) use the same strategy as in linear kPPC to construct the initial center planes. Then, we obtain Algorithm 2.

From Algorithms 1 and 2, we can see that the main computational cost of our kPPC lies in computing one P nearest neighbors undirected graph G and solving several eigenvalue decomposition problems. The computational complexity of each is at most O(\(n^3\)). Thus, the complexity of our kPPC is O(\(t n^3\)), where t is the number of iterations.

Algorithm 2 Nonlinear kPPC

The convergence criterion of Algorithm 2 is the same as that of Algorithm 1.

4 Experimental results

In this section, to validate our kPPC, experiments are carried out on several artificial and benchmark datasets [42]. We compare our kPPC with other clustering methods, including k-means [28, 29], kPC [15], and updated versions of DBSCAN [43, 44]. All these methods are implemented in Matlab (2008a) [45] on a PC with an Intel P4 processor (2.5 GHz) and 2 GB of RAM. Clustering accuracy and F-measure [46] are used as the performance evaluation criteria, which are defined as follows. Let \(\widetilde{C}\) = \((\widetilde{c_{1}},\ldots ,\widetilde{c_{k}})\) be the partition predicted by a clustering method and \(\widetilde{H}\) = \((\widetilde{h_{1}},\ldots ,\widetilde{h_{k}})\) be a manually defined partition of \({\mathbf {A}}\). We refer to a pair of data points \((\mathbf {x}_i,\mathbf {x}_j)\) from \({\mathbf {A}}\) using the following terms:

SS: if both data points belong to the same cluster of the clustering structure \(\widetilde{C}\) and to the same group of \(\widetilde{H}\).

SD: if the data points belong to the same cluster of \(\widetilde{C}\) and to different groups of \(\widetilde{H}\).

DS: if the data points belong to different clusters of \(\widetilde{C}\) and to the same group of \(\widetilde{H}\).

DD: if the data points belong to different clusters of \(\widetilde{C}\) and to different groups of \(\widetilde{H}\).

Assume now that \(f_{11}\), \(f_{10}\), \(f_{01}\) and \(f_{00}\) are the numbers of SS, SD, DS and DD pairs, respectively. We can then define the following index to measure the degree of similarity between \(\widetilde{C}\) and \(\widetilde{H}\):

$$\begin{aligned} \text {Rand statistic}=\frac{f_{11}+f_{00}}{f_{11}+f_{10}+f_{01}+f_{00}}. \end{aligned}$$
(25)

In this paper, the clustering accuracy is computed by the Rand statistic [46] as above.

The F-measure [46] is computed as follows:

$$\begin{aligned} \text {F-measure}=\frac{2\widetilde{P}\widetilde{R}}{\widetilde{P}+\widetilde{R}}, \end{aligned}$$
(26)

where \(\widetilde{P}\) is the precision and \(\widetilde{R}\) is the recall, defined as

$$\begin{aligned} \widetilde{P}=\frac{f_{11}}{f_{11}+f_{10}}. \end{aligned}$$
(27)
$$\begin{aligned} \widetilde{R}=\frac{f_{11}}{f_{11}+f_{01}}. \end{aligned}$$
(28)
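These pair-counting quantities are straightforward to compute. The sketch below, under the assumption that both partitions are given as label vectors, counts \(f_{11}\), \(f_{10}\), \(f_{01}\), \(f_{00}\) and evaluates (25)–(28); it is an O(\(m^2\)) illustration, not optimized code.

```python
from itertools import combinations

def pair_counts(pred, truth):
    # f11: SS pairs, f10: SD pairs, f01: DS pairs, f00: DD pairs
    f11 = f10 = f01 = f00 = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred, same_true = pred[i] == pred[j], truth[i] == truth[j]
        if same_pred and same_true:
            f11 += 1
        elif same_pred:
            f10 += 1
        elif same_true:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

def rand_and_fmeasure(pred, truth):
    f11, f10, f01, f00 = pair_counts(pred, truth)
    rand = (f11 + f00) / (f11 + f10 + f01 + f00)                  # Eq. (25)
    precision = f11 / (f11 + f10)                                 # Eq. (27)
    recall = f11 / (f11 + f01)                                    # Eq. (28)
    return rand, 2 * precision * recall / (precision + recall)    # Eq. (26)
```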

The parameter c in both linear and nonlinear kPPC is selected from \(\{2^i|i=-8,-7,\ldots ,8\}\). For the nonlinear clustering methods, we choose the RBF kernel and select the kernel parameter p from \(\{2^i|i=-8,-7,\ldots ,8\}\). The parameter P of the nearest neighbors undirected graph used for initialization is selected from \(\{1,3,5,7,9\}\).
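For reference, these search grids can be written down directly; the variable names below are ours.

```python
# Candidate values for the weighting parameter c, the RBF kernel parameter p,
# and the number of nearest neighbors P used in the graph-based initialization.
c_grid = [2.0 ** i for i in range(-8, 9)]
p_grid = [2.0 ** i for i in range(-8, 9)]
P_grid = [1, 3, 5, 7, 9]
```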

Fig. 3 Different methods on cross dataset 1. \((\cdot )^{l}\) denotes the linear case. In e, the value of parameter c corresponds to linear kPPC's clustering accuracy

Fig. 4 Different methods on cross dataset 2. \((\cdot )^{l}\) denotes the linear case. In e, the value of parameter c corresponds to linear kPPC's clustering accuracy

4.1 Artificial datasets

In this subsection, four artificial datasets are generated. The first two datasets (see Figs. 3a and 4a) are composed of three clusters, and the remaining two datasets (see Figs. 5a and 6a) are composed of two clusters. The numbers of data points are 160, 160, 126, and 150, respectively. Panels b–e of Figs. 3, 4, 5 and 6 show the results of the different clustering methods on these datasets. Tables 1 and 2 show the clustering accuracies and the F-measures of these methods, respectively.

The first two datasets are of the cross-plane type, and we use the linear clustering methods on them. From Figs. 3 and 4, we observe that both kPC and our linear kPPC are better than k-means and DBSCAN on these two datasets, and our kPPC achieves the best performance, while kPC does not perform as well as our kPPC. This shows that our linear kPPC is more reasonable than the others when the dataset has a linear structure. Table 1 shows that our linear kPPC gives the best performance, with both the highest clustering accuracy and the highest F-measure.

The remaining two datasets have nonlinear structure, and we use the nonlinear clustering methods on them. From Figs. 5 and 6, we observe that DBSCAN obtains the best performance and our nonlinear kPPC recovers the two clusters almost exactly as in the original datasets, while the other two methods do not perform well in the nonlinear case. This shows that DBSCAN and our nonlinear kPPC can capture the distribution of manifold datasets more precisely. DBSCAN also achieves better clustering performance than kPPC in Table 2. As is well known, DBSCAN is a density-based spatial clustering technique and may merge overlapping datasets into one cluster; this is why it obtains worse performance in Table 1. The datasets in Table 2, however, do not overlap and their spatial density is uniform, so the clustering performance of DBSCAN is slightly better there.

Fig. 5 Different methods on the two-circles dataset. \((\cdot )^{n}\) denotes the nonlinear case. In e, the values of parameters c and p correspond to nonlinear kPPC's clustering accuracy

Fig. 6 Different methods on the two-moons dataset. \((\cdot )^{n}\) denotes the nonlinear case. In e, the values of parameters c and p correspond to nonlinear kPPC's clustering accuracy

Table 1 Experiments on linear artificial datasets, where \((\cdot )^{l}\) denotes linear case
Table 2 Experiments on nonlinear artificial datasets, where \((\cdot )^{n}\) denotes nonlinear case
Table 3 Details of the benchmark datasets

4.2 Benchmark datasets

In this subsection, we compare our kPPC with the other methods in their ability to recover cluster labels [15] on several benchmark datasets listed in Table 3. We use accuracy and F-measure as the clustering evaluation criteria in the following tables. k-means, kPC, and kPPC are tested with both random initialization and graph-based initialization. For random initialization, all these methods are run ten times, and we record the average accuracy, the standard deviation and the one-run CPU time in Tables 4, 5, 6, 7, 8 and 9. The best results are marked in bold. The p value was calculated by performing a paired t test [47–49] comparing our kPPC to the other methods under the null hypothesis that there is no difference between the accuracy distributions on the datasets.

From Table 4, it is seen that our kPPC has a better ability to recover cluster labels than the other methods; in particular, the linear kPPC obtains the six highest accuracies on these eight datasets, and the accuracy of our kPPC is almost always higher than that of kPC under both random and graph-based initialization. This indicates that our kPPC is better than kPC, which means that the introduction of the between-cluster information can improve the clustering ability of kPC. Furthermore, the graph-based clusters are more stable because their standard deviations are always zero, while the standard deviations of the randomly initialized clusters range from 0.02 to 2.17%. Therefore, our graph-based initialized kPPC is more reasonable and stable than the randomly initialized kPPC. From Table 5, it is seen that k-means is more stable than the other methods with random initialization, which implies that the F-measures of the plane-based methods, including kPC and kPPC, strongly depend on the initialization. Our linear kPPC also obtains the five highest F-measures on these eight datasets. From Tables 4 and 5, we can see that our kPPC achieves the best clustering results compared with the others. One shortcoming of kPPC is that its running time is longer than that of the other methods, as shown in Table 6.

For the nonlinear case, Tables 7, 8 and 9 show results similar to those of the linear case above, which further confirms our conclusions.

Table 4 The clustering accuracy of the linear clustering methods on the benchmark datasets
Table 5 The F-measure of the linear clustering methods on the benchmark datasets
Table 6 The time of the linear clustering methods on the benchmark datasets
Table 7 The clustering accuracy of the manifold clustering methods on the benchmark datasets
Table 8 The F-measure of the manifold clustering methods on the benchmark datasets
Table 9 The time of the manifold clustering methods on the benchmark datasets
Fig. 7 Relationship between parameters c, P, and clustering accuracy for the linear case

Fig. 8 Relationship between parameters c, p, P, and clustering accuracy for the nonlinear case on Seeds (a–e), Glass (f–j), Spiral (k–o), and Jain (p–t)

4.3 Parameter influence

In this subsection, we show the influence of the parameters c and p in our kPPC on the above benchmark datasets. Figures 7 and 8 show the relations between the parameters and the accuracies of our kPPC on these datasets, where Fig. 7 corresponds to the linear kPPC and Fig. 8 to the nonlinear kPPC. In Fig. 8, we conduct experiments on four selected benchmark datasets listed in Table 3.

From Fig. 7, we obtain three interesting observations: (1) kPPC with the common setting \(P=1\) performs better than with other values of P, and kPPC often does not work well when P increases to 5, 7, or 9; therefore, we suggest setting P small to obtain better performance. (2) The parameter c has a great influence on the clustering accuracy, and as c increases, the clustering accuracy tends to become stable. (3) kPPC performs well when c is relatively small but larger than zero, and it can be seen that when \(c=0\) the clustering accuracies are mostly lower. The reason is that when \(c=0\), our kPPC degenerates to kPC. Therefore, we conclude that the parameter c can be set rather flexibly, such as from 0 to 64.

There are three parameters c, p, and P in the nonlinear kPPC. From Fig. 8, we can see that: (1) \(P=3\) makes nonlinear kPPC perform better than other values of P; (2) both c and p have a great influence on the clustering accuracy, and \(c \in [1,64]\) may be a good option to obtain good performance; (3) the best accuracy on the different datasets is obtained with different values of p, and \(p \in [0.02,32]\) often performs well.

5 Conclusions

In this paper, we have proposed an improved version of kPC, called kPPC. In kPPC, each cluster center plane is not only close to the objective data points but also far away from the other data points. Our kPPC has been extended to nonlinear clustering. In addition, the initial labels of the data points in kPPC are computed efficiently by a Laplacian graph. Preliminary experiments confirm the merits of our kPPC. Our kPPC Matlab codes can be downloaded from: http://www.optimal-group.org/Resource/kPPC.html. For future work, it seems worthwhile to explore localized representations of the planes such as [50, 51] and other methods of constructing the graph such as [52]. Extending kPPC by using other plane-based classifiers such as [53, 54] and by considering fuzzy clustering algorithms such as fuzzy linear c-means [55], fuzzy c-lines [56], and fuzzy c-varieties [57] is also interesting.