Introduction

Unlike traditional computing technologies, cognitive computing tries to achieve more natural human–computer interaction by mimicking human cognitive processes [1, 2]. In the future, machines should be able to learn, and their intelligence should grow as they are used. As a representative cognitive computing system, IBM's “Watson” computer can learn autonomously, much like a human brain, through real-time computation and analysis of large volumes of data [3]. Cognitive computing therefore centers on data rather than on the processor, and extracting useful information from complex data has become a new challenge for it. Clustering is an effective unsupervised learning technique in the data mining field. Based on the characteristics of the data or certain rules, clustering aims to classify data objects into different sets (or clusters) so that objects in the same set are highly similar, while objects from different sets have relatively low similarity [4]. Classifying unlabeled data objects into categories helps reveal the real structure of otherwise haphazard data and supports further data analysis. Clustering techniques have been widely used in many applications, such as business customer analysis, biological data processing, image segmentation, speech recognition and medical diagnosis [5].

The application of clustering techniques to cognitive computing may help address the imprecision and uncertainty found in biological systems [6, 7]. Traditional clustering algorithms, such as the k-means, FCM and EM (Expectation Maximization) algorithms, assume a convex data space. When the spatial distribution of points is not convex, these algorithms easily fall into local optima, so they are not suitable for clustering data of arbitrary shape [8]. In real applications, however, many data sets are distributed in irregular shapes. To solve this problem, many researchers have turned to spectral methods for non-convex data in recent years [9]. The main idea of the spectral method is to convert the clustering problem on the data space into a partitioning problem on an undirected weighted graph G, in which every point of the data set is treated as a vertex and the similarity between each pair of points serves as the weight of the connecting edge. Most spectral clustering algorithms have three steps: first, define an affinity matrix that describes the similarity of every pair of data objects in the given data set; second, construct the Laplacian matrix and compute its eigenvalues and eigenvectors; third, select appropriate eigenvectors to group the n data objects into k clusters.
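As a minimal illustration of this three-step pipeline (and not of the SNN-PSC method proposed later), the following Python sketch builds a Gaussian-kernel affinity matrix, forms the un-normalized graph Laplacian, and clusters the rows of its leading eigenvectors with k-means; the fixed kernel width sigma and the use of scikit-learn's KMeans are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Minimal three-step spectral clustering sketch (illustrative only)."""
    # Step 1: affinity matrix from a Gaussian kernel on pairwise squared distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Step 2: un-normalized graph Laplacian L = D - W and its spectrum.
    D = np.diag(W.sum(axis=1))
    eigvals, eigvecs = np.linalg.eigh(D - W)

    # Step 3: embed each point by the k smallest eigenvectors, then run k-means.
    U = eigvecs[:, :k]
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```

How well such a pipeline separates non-convex shapes depends strongly on how the affinity matrix is built, which is exactly the issue addressed later by the SNN similarity measure.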

In spectral clustering, the definition of the graph cut criterion is crucial. A graph cut criterion is the principle by which a weighted graph is divided into several parts so that an objective function reaches its optimum. Its ultimate goal is to maximize the connection weights within each sub-graph while minimizing the connection weights between sub-graphs [10]. The choice of graph cut criterion has a direct impact on the quality of the clustering results. Minimum cut, ratio cut, normalized cut and average cut are classic graph partition criteria [11]. However, seeking the optimal solution of these criteria is an NP-hard problem. Fortunately, spectral methods can provide a relaxed solution of these optimization problems within polynomial time. Here, “relaxed” means that the discrete optimization problem is first relaxed to the real number field and then converted back to a discrete solution by some heuristic [12]. The essence of graph partitioning can thus be summarized as the minimization or maximization of a matrix trace, and completing these tasks relies on the spectral decomposition of the graph Laplacian matrix.

Recently, an improved version of the normalized cut, named the Cheeger cut, has attracted much attention [13]. Research shows that the Cheeger cut can produce more balanced clusters by computing the second eigenvector of the graph p-Laplacian, a nonlinear generalization of the graph Laplacian matrix [14]. However, during clustering it is hard to build a high-quality similarity matrix that properly describes the intrinsic relationships between data points; moreover, the parameter p of the p-Laplacian is sensitive and often requires empirical knowledge to set. To solve these problems, this paper defines a novel similarity measure that exploits the local density information embedded in shared nearest neighbors (SNN), uses the fruit fly optimization algorithm (FOA) to automatically determine the optimal value of p, and on this basis proposes a self-tuning p-spectral clustering algorithm based on shared nearest neighbors (SNN-PSC).

This paper is organized as follows: Section 2 reviews typical graph cut criteria, describes the concepts of the p-Laplacian matrix and analyzes the relationship between the p-Laplacian and the Cheeger cut; Section 3 introduces the SNN similarity measure and the fruit fly optimization method to improve the performance of p-spectral clustering and presents the specific scheme of the SNN-PSC algorithm; Section 4 verifies the effectiveness of SNN-PSC through experiments on both artificial and real-world data sets; the final section summarizes the paper and discusses future research prospects.

Materials

Graph Cut Criteria

The idea of spectral clustering comes from spectral graph partition theory. Donath and Hoffman established a link between the graph partitioning problem and the eigenvectors of the similarity matrix, and Fiedler proved that the bipartition of a graph is closely related to the second eigenvector of its Laplacian matrix [15]. According to the Rayleigh–Ritz theorem, the eigenvectors of the Laplacian matrix can be used to approximate the optimal solution of graph cuts [16].

The definition of a graph cut criterion aims to find an optimal partition in which the similarities within a cluster are as large as possible and the similarities between clusters are as small as possible. Given a data set, we construct an undirected weighted graph G = (V, E), where V is the set of vertices represented by the data points and E is the set of edges, each weighted by the similarity between its two endpoints. The simplest graph partitioning problem on G is then described as follows:

$${\text{cut}}\left( {A_{1} , \ldots ,A_{k} } \right) = \sum\limits_{i = 1}^{k} {{\text{cut}}\left( {A_{i} ,\bar{A}_{i} } \right)}$$
(1)

where \({\text{cut}}\left( {A,B} \right) = \sum\limits_{i \in A,j \in B} {w_{ij} }\) and \(\bar{A}_{i} = \{ v_{p} |v_{p} \in V\;{\text{and}}\;v_{p} \notin A_{i} \}\) is the complement of sub-graph \(A_{i}\). A good graph partition minimizes objective function (1), but this objective only considers the external connections of clusters and neglects the density distribution within each cluster. In addition, it is sensitive to outliers in the data set, which can result in unbalanced partitions.

In order to obtain more balanced clusters, Hagen and Kahng proposed the ratio cut criterion, denoted Rcut [17]:

$${\text{Rcut}}\left( {A_{1} , \ldots ,A_{k} } \right) = \sum\limits_{i = 1}^{k} {\frac{{{\text{cut}}\left( {A_{i} ,\bar{A}_{i} } \right)}}{{\left| {A_{i} } \right|}}}$$
(2)

The Rcut objective function introduces the cluster size as a balancing term when minimizing the similarity between clusters, which avoids unbalanced partitions to some extent. However, a larger number of vertices does not necessarily imply a large total similarity within a cluster, so Shi and Malik proposed the normalized cut, denoted Ncut [18]:

$${\text{Ncut}}\left( {A_{1} , \ldots ,A_{k} } \right) = \sum\limits_{i = 1}^{k} {\frac{{{\text{cut}}\left( {A_{i} ,\bar{A}_{i} } \right)}}{{{\text{vol}}\left( {A_{i} } \right)}}}$$
(3)

where \({\text{vol}}(A_{i} ) = \sum\limits_{{p \in A_{i} ,q \in V}} {w_{pq} }\). Minimizing this objective function simultaneously minimizes the total weight between sub-graphs and maximizes the inner weight of each sub-graph, so it generally produces better partitions. However, as a nonlinear function of the projection matrix, the objective may fall into a local minimum.

Cheeger et al. [19] improved the normalized cut and proposed the Cheeger cut, denoted Ccut:

$${\text{Ccut}}(A_{1} , \ldots ,A_{k} ) = \sum\limits_{i = 1}^{k} {\frac{{{\text{cut}}\left( {A_{i} ,\bar{A}_{i} } \right)}}{{\hbox{min} \left( {A_{i} } \right)}}}$$
(4)

The Ccut objective aims to make the data within each cluster more compact: after clustering, the total cut weight of a cluster should be small relative to its size (the number of data points it contains). However, by the Rayleigh quotient principle, computing the optimal Cheeger cut is an NP-hard problem. Next we obtain an approximate solution of the Cheeger cut by introducing the p-Laplacian into spectral clustering.
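To make these criteria concrete, the short Python sketch below evaluates the cut value and the ratio, normalized and Cheeger cut objectives of formulas (1)–(4) for a single bipartition given by a symmetric affinity matrix W; reading min(A_i) in formula (4) as min{|A_i|, |Ā_i|} follows the ratio Cheeger cut used later in the paper and is the only interpretive assumption here.

```python
import numpy as np

def cut_value(W, mask):
    """Total weight of the edges crossing the bipartition (A, A-bar)."""
    return W[np.ix_(mask, ~mask)].sum()

def graph_cut_criteria(W, mask):
    """Evaluate the classic criteria for one bipartition of a weighted graph.

    W    : symmetric (n, n) affinity matrix
    mask : boolean array, True for vertices in A, False for those in A-bar
    """
    c = cut_value(W, mask)
    size_a, size_b = mask.sum(), (~mask).sum()
    vol_a, vol_b = W[mask].sum(), W[~mask].sum()        # vol(A) = sum of degrees in A
    return {
        "cut":  c,                                      # formula (1), k = 2
        "Rcut": c / size_a + c / size_b,                # ratio cut, formula (2)
        "Ncut": c / vol_a + c / vol_b,                  # normalized cut, formula (3)
        "Ccut": c / min(size_a, size_b),                # ratio Cheeger cut, formula (4)
    }
```

Comparing these values for different candidate partitions of the same toy graph shows how the balancing denominators penalize splitting off very small clusters.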

Graph p-Laplacian

Spectral graph theory is inextricably linked with the graph Laplacian matrix. The Rayleigh quotient principle indicates that optimizing a graph cut is equivalent to computing the eigenvalues of the Laplacian matrix, which provides a relaxed solution of the theoretically optimal graph partition. Hein et al. [20] defined the inner product form of the standard graph Laplacian \(\varDelta_{2}\) as follows:

$$\left\langle {f,\varDelta_{2} f} \right\rangle = \frac{1}{2}\sum\limits_{i,j = 1}^{n} {w_{ij} (f_{i} - f_{j} )^{2} }$$
(5)

where f is an eigenvector of the Laplacian matrix. Generalizing the Laplacian operator to \(\varDelta_{p}\), where p ∈ (1, 2], \(\varDelta_{p}\) can be written as:

$$\left\langle {f,\varDelta_{p} f} \right\rangle = \frac{1}{2}\sum\limits_{i,j = 1}^{n} {w_{ij} \left| {f_{i} - f_{j} } \right|^{p} }$$
(6)

The Laplacian matrix L has an un-normalized form and a normalized form: the un-normalized Laplacian is \(L = D - W\) and the normalized form is \(L = I - D^{-1}W\), where W is the affinity matrix and D is the diagonal degree matrix. It is not difficult to derive the corresponding un-normalized and normalized forms of the p-Laplacian matrix:

$$(\varDelta_{p}^{(u)} f)_{i} = \sum\limits_{j \in V} {w_{ij} \varphi_{p} (f_{i} - f_{j} )}$$
(7)
$$(\varDelta_{p}^{(n)} f)_{i} = \frac{1}{{d_{i} }}\sum\limits_{j \in V} {w_{ij} \varphi_{p} (f_{i} - f_{j} )}$$
(8)

where \(d_{i} = \sum\nolimits_{j = 1}^{n} {w_{ij} }\) is the degree function of the graph and \(\varphi_{p} (x) = \left| x \right|^{p - 1} {\text{sign}}(x)\); when p = 2, \(\varphi_{2} (x) = x\). Taking the un-normalized p-Laplacian matrix as an example, we next give the definition of the p-Laplacian's eigenvalue \(\lambda_{p}\).
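Before turning to the eigenvalue definition, a small Python sketch of formulas (7) and (8) is given here under the stated definitions (W symmetric, d_i = Σ_j w_ij); it only applies the operator to a given vector f and does not solve the nonlinear eigenproblem.

```python
import numpy as np

def phi_p(x, p):
    """phi_p(x) = |x|^(p-1) * sign(x); reduces to the identity when p = 2."""
    return np.abs(x) ** (p - 1) * np.sign(x)

def p_laplacian_apply(W, f, p, normalized=False):
    """Apply the graph p-Laplacian operator of formulas (7)/(8) to a vector f."""
    diff = f[:, None] - f[None, :]              # all pairwise differences f_i - f_j
    out = (W * phi_p(diff, p)).sum(axis=1)      # sum_j w_ij * phi_p(f_i - f_j)
    if normalized:
        out = out / W.sum(axis=1)               # divide by the degrees d_i
    return out
```

For p = 2 this reduces to the ordinary Laplacian products (D − W)f and (I − D⁻¹W)f, which is a quick consistency check for the sketch.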

Definition 1

If there is a real number \(\lambda_{p}\) satisfying formula (9), then \(\lambda_{p}\) is the eigenvalue corresponding to the eigenvector f.

$$\left( {\varDelta_{p}^{(u)} f} \right)_{i} = \lambda_{p} \varphi_{p} \left( {f_{i} } \right),\quad \forall i = 1, \ldots , n$$
(9)

In matrix computations, the eigenvector corresponding to the smallest eigenvalue is expected to contain important discriminative information. This property also plays an important role in clustering algorithms. Related studies [21] have shown that, in the linear case, if λ is the smallest eigenvalue of an operator Δ, then λ attains the lower bound given by formula (10):

$$\lambda = \mathop {\arg \hbox{min} }\limits_{{f \in R^{n} }} \frac{{\left\langle {f,\varDelta f} \right\rangle }}{{\left\| f \right\|^{2} }}$$
(10)

where \(\left\| f \right\|^{2} = \sum\limits_{i = 1}^{n} {\left| {f_{i} } \right|^{2} }\). Definition 2 applies this property to the nonlinear p-Laplacian operator.

Definition 2

If \(\lambda_{p}\) is the smallest eigenvalue of the p-Laplacian operator \(\varDelta_{p}\), then \(\lambda_{p}\) attains the lower bound given by formula (11):

$$\lambda_{p} = \mathop {\arg \hbox{min} }\limits_{{f \in R^{n} }} \frac{{\left\langle {f,\varDelta_{p} f} \right\rangle }}{{\left\| f \right\|^{p} }}$$
(11)

According to formula (11), the objective function of the un-normalized p-Laplacian operator is constructed as:

$$F_{p} (f) = \frac{{\left\langle {f,\varDelta_{p}^{(u)} f} \right\rangle }}{{\left\| f \right\|^{p} }} = \frac{{\frac{1}{2}\sum\limits_{i,j = 1}^{n} {w_{ij} \left| {f_{i} - f_{j} } \right|^{p} } }}{{\left\| f \right\|^{p} }}$$
(12)

The above analysis indicates that the lower bound of \(F_{p}(f)\) is related to the eigenvalues and eigenvectors of \(\varDelta_{p}\). Since the category information of the data points is implicit in the eigenvectors of the p-Laplacian matrix, the eigen-decomposition of the p-Laplacian matrix is an essential part of p-spectral clustering. The objective function of the normalized p-Laplacian matrix can be obtained in the same way.
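As a point of reference, a hedged Python sketch of the functional F_p(f) in formula (12) is given below; it simply evaluates the quotient for a candidate vector f and could, for example, serve as the fitness function handed to the fruit fly optimization step described in the Methods section.

```python
import numpy as np

def F_p(W, f, p):
    """Quotient of formula (12) for the un-normalized p-Laplacian."""
    diff = np.abs(f[:, None] - f[None, :]) ** p
    numerator = 0.5 * (W * diff).sum()          # (1/2) * sum_ij w_ij |f_i - f_j|^p
    denominator = (np.abs(f) ** p).sum()        # ||f||_p^p
    return numerator / denominator
```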

The Relationship Between p-Laplacian and Cheeger Cut

The Cheeger cut strengthens the relationships within clusters and weakens the relationships between clusters, so it can produce more balanced sub-graphs. Since seeking the optimal solution of the Cheeger cut is NP-hard, we introduce the p-Laplacian operator so that the original problem can be solved approximately in polynomial time. Related research shows that the graph can be recursively divided into two parts using the second eigenvector \(v_{p}^{(2)}\) of the p-Laplacian matrix until the number of sub-graphs meets the requirement.

Theorem 1

For a segmentation problem \((A_{i} ,\bar{A}_{i} )\), if the objective function of the Cheeger cut is \({\text{Ccut}}(A_{1} , \ldots ,A_{k} ) = \sum\limits_{i = 1}^{k} {\frac{{{\text{cut}}\left( {A_{i} ,\bar{A}_{i} } \right)}}{{\hbox{min} \left( {A_{i} } \right)}}}\) and the objective function of the p-Laplacian matrix is \(F_{p}(f)\), then the solution of \(F_{p}(f)\) approximates the relaxed solution of the Cheeger cut.

Proof

First define an associated partition \((f,A_{i})\), expressed as:

$$(f,A_{i} ) = \left\{ {\begin{array}{*{20}c} {\frac{1}{{\left| {A_{i} } \right|^{p - 1} }}\quad i \in A_{i} ,p \in (1,2]} \\ {\frac{1}{{\left| {\bar{A}_{i} } \right|^{p - 1} }}\quad i \in \bar{A}_{i} ,p \in (1,2]} \\ \end{array} } \right.$$
(13)

In order to obtain the minimum eigenvalue of the objective function, we use formula (13) to transform formula (12) into the following expression:

$$F_{p} (f,A_{i} ) = \frac{{\sum\limits_{ij} {w_{ij} \left| {\frac{1}{{\left| {A_{i} } \right|^{p - 1} }} - \frac{1}{{\left| {\bar{A}_{i} } \right|^{p - 1} }}} \right|^{p} } }}{{\left\| {f,A} \right\|_{p}^{p} }}$$
(14)

where \(\left\| {f,A} \right\|_{p}^{p} = \sum\limits_{i} {\left| {f_{i} ,A} \right|^{p} } = \frac{1}{{\left| {A_{i} } \right|^{p - 1} }} - \frac{1}{{\left| {\bar{A}_{i} } \right|^{p - 1} }}\), then formula (14) can be converted to

$$\begin{aligned} F_{p} (f,A_{i} ) \,=\, & \frac{{\sum\limits_{ij} {w_{ij} \left| {\frac{1}{{\left| {A_{i} } \right|^{p - 1} }} - \frac{1}{{\left| {\bar{A}_{i} } \right|^{p - 1} }}} \right|^{p} } }}{{\left| {\frac{1}{{\left| {A_{i} } \right|^{p - 1} }} - \frac{1}{{\left| {\bar{A}_{i} } \right|^{p - 1} }}} \right|}} \,=\, \sum\limits_{ij} {w_{ij} \left| {\frac{1}{{\left| {A_{i} } \right|^{p - 1} }} - \frac{1}{{\left| {\bar{A}_{i} } \right|^{p - 1} }}} \right|^{p - 1} } \\ & \le \sum\limits_{ij} {w_{ij} \left| {\frac{2}{{\hbox{min} \left\{ {\left| {A_{i} } \right|,\left| {\bar{A}_{i} } \right|} \right\}}}} \right|^{p - 1} =\, 2^{p - 1} } \frac{{{\text{cut}}\left( {A_{i} ,\bar{A}_{i} } \right)}}{{\hbox{min} \left\{ {\left| {A_{i} } \right|,\left| {\bar{A}_{i} } \right|} \right\}}} \\ \end{aligned}$$

Comparing the above inequality with the objective function of Cheeger cut, we can see that

$$\mathop {\lim }\limits_{p \to 1} F_{p} (f,A_{i} ) = {\text{Ccut}}\left( {A_{i} ,\bar{A}_{i} } \right)$$
(15)

So the solution of \(F_{p}(f)\) is a relaxed approximation of the Cheeger cut, and the optimal minimized objective can be expressed as:

$$\lambda_{p} = \mathop {\arg \hbox{min} }\limits_{p \to 1} F_{p} (f)$$
(16)

where \(\lambda_{p}\) is the eigenvalue corresponding to the eigenvector f.

The foregoing analysis shows that the Cheeger cut problem can be solved by constructing the p-Laplacian matrix and computing its eigenvectors. Specifically, the second eigenvector \(v_{p}^{(2)}\) of the p-Laplacian matrix yields a bipartition of the graph once an appropriate threshold is set [8]; the optimal threshold is the one that minimizes the corresponding Cheeger cut.

The un-normalized graph p-Laplacian matrix \(\varDelta_{p}^{(u)}\) corresponds to the ratio Cheeger cut \({\text{RCC}}(C,\bar{C}) = \frac{{{\text{cut}}\left( {C,\bar{C}} \right)}}{{\hbox{min} \left\{ {\left| C \right|,\left| {\bar{C}} \right|} \right\}}}\), whose second eigenvector \(v_{p}^{(2)}\) should satisfy formula (17):

$$\mathop {\arg \hbox{min} }\limits_{{A_{t} = \left\{ {i \in V|v_{p}^{(2)} (i)\, > \,t} \right\}}} {\text{RCC}}\left( {A_{t} ,\bar{A}_{t} } \right)$$
(17)

The normalized graph p-Laplacian matrix \(\varDelta_{p}^{(n)}\) corresponds to the normalized Cheeger cut \({\text{NCC}}\left( {C,\bar{C}} \right) = \frac{{{\text{cut}}\left( {C,\bar{C}} \right)}}{{\hbox{min} \left\{ {{\text{vol}}(C),{\text{vol}}(\bar{C})} \right\}}}\), whose second eigenvector \(v_{p}^{(2)}\) should satisfy formula (18):

$$\mathop {\arg \hbox{min} }\limits_{{A_{t} = \left\{ {i \in V|v_{p}^{(2)} (i)\, > \,t} \right\}}} {\text{NCC}}\left( {A_{t} ,\bar{A}_{t} } \right)$$
(18)
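One possible realization of this thresholding step is sketched below in Python: given the second eigenvector v2 (however it was obtained), each of its entries is tried as a threshold t and the bipartition A_t = {i | v2(i) > t} with the smallest ratio Cheeger cut is kept, as in formula (17); replacing the sizes in the denominator by volumes would give the normalized variant of formula (18).

```python
import numpy as np

def best_threshold_split(W, v2):
    """Scan thresholds over the second eigenvector and keep the bipartition
    A_t = {i : v2[i] > t} minimizing the ratio Cheeger cut (formula (17))."""
    best_rcc, best_mask = np.inf, None
    for t in np.sort(v2)[:-1]:                  # skip the largest entry (empty A_t)
        mask = v2 > t
        if mask.sum() == 0 or (~mask).sum() == 0:
            continue
        c = W[np.ix_(mask, ~mask)].sum()        # cut(A_t, A_t-bar)
        rcc = c / min(mask.sum(), (~mask).sum())
        if rcc < best_rcc:
            best_rcc, best_mask = rcc, mask
    return best_mask, best_rcc
```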

Methods

Selecting a distance measure that truly describes the internal structure of the data set is critical to p-spectral clustering. Data points in the same group should have high similarity and spatial consistency, and the quality of the similarity matrix greatly influences the performance of spectral clustering [22]. The most common way to measure the similarity between data points is the Gaussian kernel. In the Gaussian kernel, however, the scale parameter σ is usually fixed, so the pairwise similarity depends only on the Euclidean distance and cannot adapt to the points' surroundings [23]. When dealing with complex multi-scale data sets, a similarity based only on Euclidean distance cannot accurately reflect the data distribution, which significantly reduces the performance of spectral clustering and results in poor partitions.

Shared-Nearest-Neighbors Similarity Measure

This paper presents a novel similarity measure that adaptively adjusts the similarity between data points according to their local density. Researchers have observed that if two points belong to the same cluster, they usually lie in the same region of relatively high density, and many intermediate points bond them together. To capture this glue effect, we define the “shared nearest neighbors” (SNN) and use them to extend the self-tuning Gaussian kernel so that similarities within a cluster are enlarged.

SNN(a, b) is the number of data points in the intersection of the ε-neighborhoods of points a and b, where the ε-neighborhood of a point is the spherical region centered at that point with radius ε. SNN(a, b) reflects the local density around points a and b and helps distinguish whether two points lie in the same cluster or in different clusters. For example, in Fig. 1, points a and b share more common neighbors than points a and c, so SNN(a, b) > SNN(a, c). SNN can therefore guide the algorithm toward the correct cluster members.

Fig. 1 Common neighbors on the Two-Moon data set

As SNN is adaptive to the local density of the data, it can be used to improve the pairwise similarity of data points; the mathematical expression is as follows:

$$w_{\text{L}} (x_{i} ,x_{j} ) = \left\{ {\begin{array}{*{20}l} {\exp \left( { - \frac{{d(x_{i} ,x_{j} )^{2} }}{{\sigma_{i} \sigma_{j} ({\text{SNN}}(x_{i} ,x_{j} ) + 1)}}} \right)} \hfill & {i \ne j} \hfill \\ 0 \hfill & {i = j} \hfill \\ \end{array} } \right.$$
(19)

where σ_i and σ_j are, respectively, the Euclidean distances from points x_i and x_j to their m-th nearest neighbors. These scales adjust themselves automatically according to the sparse or dense distribution of data points within the neighborhoods of x_i and x_j. When two points are located in a relatively sparse cluster, their similarity is enlarged appropriately through the scale parameter σ, so the points of a sparse cluster have more chance to be grouped together.

This density adaptive similarity measurement has the following features:

  1. If d(x_i, x_j) ≥ 2ε, then SNN(x_i, x_j) = 0 and w_L(x_i, x_j) = exp(−d(x_i, x_j)²/(σ_iσ_j)), which means that the threshold ε is a local parameter and does not affect distant data points.

  2. For two pairs of points (x_i, x_j) and (x_m, x_n), suppose d(x_i, x_j) = d(x_m, x_n) < 2ε, but x_i, x_j actually lie in the same dense region while x_m, x_n lie in different dense regions; then it is very likely that w_L(x_i, x_j) > w_L(x_m, x_n). For example, in Fig. 1, w_L(a, b) > w_L(a, c).

Feature (2) shows that this measure is density sensitive: it enlarges the similarities within the same cluster and weakens the similarities between different clusters. Since clusters correspond to dense regions of the data set, the SNN-based similarity measure reflects the distribution of the data points better.
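A minimal Python sketch of the SNN-based similarity of formula (19) is given below; the neighborhood radius eps is user-chosen and, following the text, the local scale σ_i is taken as the distance to the m-th nearest neighbor (the default m = 7 is an arbitrary illustrative choice, not a value prescribed by the paper).

```python
import numpy as np
from scipy.spatial.distance import cdist

def snn_similarity(X, eps, m=7):
    """Density-adaptive similarity of formula (19): shared-nearest-neighbor counts
    inside the eps-neighborhoods combined with local scales sigma_i."""
    D = cdist(X, X)                                   # Euclidean distance matrix
    neighbors = D < eps                               # eps-neighborhood membership
    np.fill_diagonal(neighbors, False)                # a point is not its own neighbor here
    N = neighbors.astype(int)
    snn = N @ N.T                                     # SNN(i, j) = |N_eps(i) ∩ N_eps(j)|
    sigma = np.sort(D, axis=1)[:, m]                  # distance to the m-th nearest neighbor
    W = np.exp(-D ** 2 / (np.outer(sigma, sigma) * (snn + 1)))
    np.fill_diagonal(W, 0.0)                          # w_L(x_i, x_i) = 0 as in formula (19)
    return W
```

The matrix returned by this sketch can then play the role of the affinity matrix W in the p-spectral clustering steps described below.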

Fruit Fly Optimization Method

Although the p-Laplacian operator yields an approximate solution of the Cheeger cut, it is not easy to determine the parameter p of the p-Laplacian matrix, and an improper value of p can mislead the algorithm and produce poor clustering results. To solve this problem and improve the clustering accuracy, this paper introduces the fruit fly optimization method to find the optimal parameter p. The fruit fly optimization algorithm (FOA) [24] is a global optimization method derived from the foraging behavior of the fruit fly (Drosophila). Drosophila's senses of smell and vision are particularly acute: its olfactory organs can collect various odors floating in the air, and when flying near food, fruit flies lock onto the food position with their sharp eyesight and fly toward the place where their companions gather.

Compared with other swarm intelligence algorithms, FOA is simple and easy to understand (for example, the objective function of particle swarm optimization involves a second-order differential equation [25], while that of FOA is first order). FOA is also efficient and may spend less running time to find the optimal solution. Besides, FOA has only four parameters, whereas other swarm intelligence algorithms need to adjust at least seven or eight; in those algorithms the complex interactions between parameters are difficult to analyze, and improper parameter values seriously degrade performance and cause large errors. Based on the characteristics of fruit flies searching for food, the specific steps of FOA are shown in Algorithm 1.

Algorithm 1

Fruit fly optimization algorithm

Input: population size Sizepop, maximum iteration number Maxgen

Output: the optimal value of fitness function, the optimal parameter p

Step 1: Randomly initialize fruit flies’ population location X_axis, Y_axis.

Step 2: Assign each fruit fly a random direction and distance for searching food by smell; RandomValue is the search distance:

$$\left\{ {\begin{array}{*{20}c} {X_{i} = X\_{\text{axis}} + {\text{RandomValue}}} \\ {Y_{i} = Y\_{\text{axis}} + {\text{RandomValue}}} \\ \end{array} } \right.$$
(20)

Step 3: Since the exact location of the food is unknown, first estimate the distance Dist_i between the new position and the origin, and then calculate the smell concentration judgment value S_i at the new position, which is the reciprocal of that distance:

$${\text{Dist}}_{i} = \sqrt {X_{i}^{2} + Y_{i}^{2} }$$
(21)
$$S_{i} = 1/{\text{Dist}}_{i}$$
(22)

Step 4: Substitute the judgment value S_i into the smell concentration decision function (also called the fitness function) to calculate the smell concentration Smell_i at the position of the fruit fly:

$${\text{Smell}}_{i} = {\text{Function}}(S_{i} )$$
(23)

Step 5: Identify the fruit fly with the best smell concentration (the best individual) in the population:

$$[{\text{bestSmell bestindex}}] = \hbox{min} ({\text{Smell}}_{i} )$$
(24)

Step 6: Record and retain the best smell concentration value bestSmell and its X, Y coordinates; the fruit fly swarm then flies to this position using vision:

$$\left\{ {\begin{array}{*{20}c} {{\text{Smellbest}} = {\text{bestSmell}}} \\ {X\_{\text{axis}} = X({\text{bestindex}})} \\ {Y\_{\text{axis}} = Y({\text{bestindex}})} \\ \end{array} } \right.$$
(25)

Step 7: Enter the iterative refinement process: repeat Steps 2–5 and check whether the current best smell concentration is better than the previous one; if it is, execute Step 6. Stop once the iteration count reaches the maximum iteration number Maxgen.
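A compact Python sketch of Algorithm 1 for a single scalar parameter is given below; how the judgment value S_i is mapped into the admissible range p ∈ (1, 2] is not fixed by the algorithm itself, so the fitness function passed in is expected to handle that mapping (an assumption of this sketch rather than part of FOA).

```python
import numpy as np

def foa_minimize(fitness, sizepop=50, maxgen=100, seed=0):
    """Minimal fruit fly optimization sketch (cf. Algorithm 1) minimizing fitness(S_i)."""
    rng = np.random.default_rng(seed)
    x_axis, y_axis = rng.uniform(0.0, 1.0, size=2)    # Step 1: random swarm location
    best_smell, best_s = np.inf, None
    for _ in range(maxgen):                           # Step 7: iterate up to Maxgen
        x = x_axis + rng.uniform(-1.0, 1.0, sizepop)  # Step 2: random olfactory search
        y = y_axis + rng.uniform(-1.0, 1.0, sizepop)
        dist = np.sqrt(x ** 2 + y ** 2)               # Step 3: distance to the origin
        s = 1.0 / dist                                # smell concentration judgment value S_i
        smell = np.array([fitness(si) for si in s])   # Step 4: smell concentration (fitness)
        idx = int(np.argmin(smell))                   # Step 5: best individual
        if smell[idx] < best_smell:                   # Step 6: swarm flies to the best spot
            best_smell, best_s = smell[idx], s[idx]
            x_axis, y_axis = x[idx], y[idx]
    return best_s, best_smell
```

In SNN-PSC the fitness could, for instance, clip S_i into (1, 2] and return F_p(f) at that value of p; any such mapping is an illustrative choice.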

Self-Tuning p-Spectral Clustering Algorithm Based on Shared Nearest Neighbors

In order to improve the performance of p-spectral clustering, this paper combines the SNN similarity measure with the fruit fly optimization method and proposes a self-tuning p-spectral clustering algorithm based on shared nearest neighbors (SNN-PSC). Its main idea is as follows: first, measure the similarity of each pair of points according to the number of their shared neighbors and construct the similarity matrix; then recursively bipartition the (sub-)graph by computing the eigenvalues and eigenvectors of the p-Laplacian matrix, using the fruit fly optimization method to determine the optimal value of parameter p and minimizing the graph cut criterion; finally, obtain high-quality clustering results. The detailed steps of SNN-PSC are given in Algorithm 2, and its flowchart is shown in Fig. 2.

Fig. 2 Flowchart of the SNN-PSC algorithm

Algorithm 2

Self-tuning p-spectral clustering algorithm based on shared nearest neighbors

Input: data set X = {x_1, x_2, …, x_n}, the number of clusters k

Output: k divided clusters

Step 1: According to formula (19), calculate the similarities between data points based on “shared nearest-neighbors” and create the similarity matrix W ∊ R n×n.

Step 2: Initialize the first cluster A 1 = V and set the cluster number s = 1.

Step 3: Repeat Steps 4 to 7 until the stopping condition in Step 8 is met.

Step 4: Construct the p-Laplacian matrix according to formula (7) or (8); then treat the objective function \(F_{p}(f)\) as FOA's fitness function and use FOA to find the optimal parameter p that minimizes it.

Step 5: Calculate the second eigenvector \(v_{p}^{(2)}\) of the un-normalized p-Laplacian matrix \(\varDelta_{p}^{(u)}\) (or the normalized p-Laplacian matrix \(\varDelta_{p}^{(n)}\)) and search for an appropriate threshold value that satisfies formula (17) (or formula (18)).

Step 6: Use \(v_{p}^{(2)}\) to split each cluster A_i (i = 1, 2, …, s) and minimize the overall cut criterion (the ratio cut (2) corresponds to \(\varDelta_{p}^{(u)}\), the normalized cut (3) to \(\varDelta_{p}^{(n)}\)).

Step 7: s ⇐ s + 1.

Step 8: When the number of clusters s equals k, stop the loop and output the clustering results.
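The recursive bisection loop of Algorithm 2 can be outlined in Python as below, taking a precomputed similarity matrix W (e.g., from the SNN measure) as input. To stay self-contained, the sketch substitutes the second eigenvector of the ordinary (p = 2) Laplacian of each sub-graph for the p-Laplacian eigenvector and always splits the largest current cluster; these simplifications, together with the omission of the FOA-based tuning of p, are assumptions of the sketch, not part of SNN-PSC itself.

```python
import numpy as np

def snn_psc_sketch(W, k):
    """Recursive bisection outline of Algorithm 2 on a precomputed similarity matrix W."""
    n = W.shape[0]
    labels = np.zeros(n, dtype=int)                   # Step 2: start with a single cluster
    for _ in range(k - 1):                            # Steps 3-8: split until s == k
        target = int(np.argmax(np.bincount(labels)))  # split the largest current cluster
        idx = np.where(labels == target)[0]
        Ws = W[np.ix_(idx, idx)]
        L = np.diag(Ws.sum(axis=1)) - Ws              # un-normalized Laplacian of the sub-graph
        _, vecs = np.linalg.eigh(L)
        v2 = vecs[:, 1]                               # stand-in for v_p^(2) (Step 5)
        best_rcc, best_mask = np.inf, None            # Steps 5-6: threshold to minimize RCC
        for t in np.sort(v2)[:-1]:
            m = v2 > t
            if m.sum() == 0 or (~m).sum() == 0:
                continue
            rcc = Ws[np.ix_(m, ~m)].sum() / min(m.sum(), (~m).sum())
            if rcc < best_rcc:
                best_rcc, best_mask = rcc, m
        if best_mask is None:                         # degenerate sub-graph: nothing to split
            break
        labels[idx[best_mask]] = labels.max() + 1     # Step 6: split; Step 7: s <- s + 1
    return labels
```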

Discussion

In order to analyze the clustering performance of the proposed SNN-PSC algorithm, we conduct experiments on both artificial data sets and UCI real-world data sets and compare SNN-PSC with the standard spectral clustering algorithm (denoted SSC) [26] and the p-spectral clustering algorithm (denoted PSC) [14]. In the experiments, the SNN-PSC population size is Sizepop = 50 with maximum iterations Maxgen = 100, and the scale parameter of SSC is σ = 1.2. On the artificial data sets, the Laplace parameter of PSC is p = 1.5; on the UCI data sets, p is set to 1.8, 1.6, 1.4 and 1.2, respectively. The experimental environment is as follows: Pentium dual-core 2.60 GHz CPU, 2 GB RAM, MATLAB 2013a.

Experiments on Artificial Data Sets

Two kinds of artificial data sets are used in the experiments, the Two-Moon data set and the Two-Gaussian data set. Each comes in two versions, "Balanced" and "Unbalanced", and every version contains 1000 data points embedded in a 10-dimensional space. Gaussian noise N(0, σ²) with variance σ² = 0.01 is added to these data sets. The characteristics of the different data sets are given in Table 1.

Table 1 Characteristics of artificial data sets

Figure 3 shows the clustering results of the three algorithms on the Two-Gaussian data sets. The first row is the SSC algorithm, the second row is the PSC algorithm, and the third row is the SNN-PSC algorithm. As can be seen, all three algorithms perform quite well on the balanced Two-Gaussian data set. When the data set is unbalanced, PSC and SNN-PSC can still effectively separate the data points into two groups, but SSC erroneously assigns part of the sparse cluster to the dense cluster, so its result is not satisfactory.

Fig. 3 Clustering results on the Two-Gaussian data sets. a SSC algorithm (Balanced dataset), b SSC algorithm (Unbalanced dataset), c PSC algorithm (Balanced dataset), d PSC algorithm (Unbalanced dataset), e SNN-PSC algorithm (Balanced dataset), f SNN-PSC algorithm (Unbalanced dataset)

Figure 4 shows the clustering results of the three algorithms on the Two-Moon data sets, with the same row order as before. SSC performs badly on both the balanced and the unbalanced data set; PSC handles the balanced Two-Moon data set effectively but produces erroneous partitions on the unbalanced one; only SNN-PSC adapts well and obtains good clustering results on both types of data set.

Fig. 4 Clustering results on the Two-Moon data sets. a SSC algorithm (Balanced dataset), b SSC algorithm (Unbalanced dataset), c PSC algorithm (Balanced dataset), d PSC algorithm (Unbalanced dataset), e SNN-PSC algorithm (Balanced dataset), f SNN-PSC algorithm (Unbalanced dataset)

Experiments on Real-World Data Sets

To further verify the effectiveness of the SNN-PSC algorithm, we select five real data sets from the UCI machine learning repository. Their characteristics are shown in Table 2.

Table 2 Characteristics of UCI data sets

Clustering error rate (CE) is a commonly used metric for evaluating clustering performance [27]. After obtaining the clustering results, we build a permutation mapping function to find the correspondence between the cluster labels and the actual class labels. The clustering error rate is based on this mapping, and its calculation formula is as follows:

$${\text{CE}} = 1 - \frac{1}{n}\sum\limits_{i = 1}^{n} {\delta (y_{i} ,{\text{map}}(c_{i} ))}$$
(26)

where y_i is the true class label, c_i is the cluster index produced by the clustering algorithm, and \(\delta (x,y) = \left\{ {\begin{array}{*{20}c} 1 & {x = y} \\ 0 & {x \ne y} \\ \end{array} } \right.\) is an indicator function. Among all possible correspondences, the clustering error rate is determined by the mapping that minimizes the number of misclassified data points. The smaller the clustering error rate, the better the clustering algorithm.
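The permutation mapping map(·) can be found with the Hungarian algorithm; a small Python sketch using scipy.optimize.linear_sum_assignment is shown below (the use of SciPy is an implementation choice, not something prescribed by the paper).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Clustering error rate CE of formula (26): 1 - accuracy under the best
    one-to-one mapping between cluster labels and true class labels."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # contingency[i, j] = number of points placed in cluster i whose true class is j
    contingency = np.zeros((len(clusters), len(classes)), dtype=int)
    for ci, c in enumerate(clusters):
        for ki, k in enumerate(classes):
            contingency[ci, ki] = np.sum((y_pred == c) & (y_true == k))
    # the Hungarian algorithm picks the mapping that maximizes the matched points
    row, col = linear_sum_assignment(-contingency)
    matched = contingency[row, col].sum()
    return 1.0 - matched / len(y_true)
```

For example, clustering_error(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])) returns 0.0, since swapping the two cluster labels matches the true classes exactly.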

We first test the clustering performance of the PSC algorithm on the UCI real-world data sets; the results are shown in Fig. 5, where the horizontal axis is the value of the parameter p of the p-Laplacian matrix and the vertical axis is the CE rate on each data set. As p changes, the clustering performance of PSC swings up and down, and the fluctuation of the curves indicates that PSC is easily affected by p. For different data sets, selecting a proper parameter is therefore crucial to exploit the full performance of PSC, so we use the FOA algorithm to automatically find the optimal p. Table 3 compares the CE rates and clustering times of SSC, PSC, FOA-PSC and SNN-PSC on the UCI data sets. FOA-PSC, which serves as a control, applies FOA directly to the parameter selection of PSC without introducing the SNN measure.

Fig. 5 Clustering performance of PSC with different values of parameter p

Table 3 The clustering performance of different algorithms on UCI data sets

From the experimental data, we can see that SSC has the shortest clustering time on each data set, but its clustering performance is inferior to that of PSC, FOA-PSC and SNN-PSC. PSC is sensitive to the value of parameter p: different settings lead to very different clustering times, and, as shown in Table 3, the closer p gets to one, the longer PSC takes to converge. FOA-PSC performs much better than PSC because its parameter p is optimized, and it achieves the lowest clustering error rate on the Dermatology data set. On the remaining data sets, however, the CE rates of SNN-PSC are much lower than those of the other algorithms, which demonstrates that SNN-PSC provides better clustering results. This is because in SNN-PSC the similarity matrix is constructed by the SNN method, which describes the intrinsic links between data points well, and the FOA algorithm adaptively searches for the optimal value of p, avoiding blind parameter selection; together these improve the clustering accuracy.

Conclusion

In spectral clustering, an undirected weighted graph built from the data points is separated into a series of sub-graphs, each of which represents a cluster in the final result. The Cheeger cut criterion can produce balanced clusters, but according to algebraic graph theory, computing its global optimum is an NP-hard problem. An approximate solution is obtained by relaxing the discrete optimization problem to the real number field through the eigen-decomposition of the p-Laplacian matrix. Since the similarity measure plays an important role in p-spectral clustering, this paper develops a density-adaptive similarity measure based on shared nearest neighbors, which takes the neighborhood of each data point into account and reveals the local density information of the data structure. On this basis we present an SNN-based self-tuning p-spectral clustering algorithm called SNN-PSC. To avoid the negative influence of parameter selection, the FOA method is used to determine the optimal parameter p of the p-Laplacian operator. Comprehensive experiments on both artificial and real-world data sets demonstrate that SNN-PSC is robust to noise around the data points and performs much better than the standard spectral clustering and p-spectral clustering algorithms. In future work we will study how to reduce the computational complexity of SNN-PSC and improve its efficiency.