1 Introduction

Data clustering has been an important task in data mining, and existing clustering algorithms can be classified into five categories [1]: partitioning methods [2–4], hierarchical methods [2, 5–7], density-based methods [8–10], grid-based methods [11, 12], and model-based methods [13, 14].

Among the many clustering algorithms, support vector clustering (SVC) [15, 16] has become a significant boundary-based clustering algorithm with several applications such as community discovery, speech recognition and bioinformatics analysis [17]. SVC has the following features: first, it can handle clusters of various shapes; second, the number of clusters does not need to be specified in advance; third, it can deal with structured data by using kernel functions; fourth, it can reduce the impact of noise on the cluster partition.

However, there is still room for improvement in SVC. The algorithm suffers from two bottlenecks, expensive computational cost and poor labeling performance, which limit its popularity [17]. To address these limitations, a number of methods have been proposed. Ben-Hur et al. [15] improved the original algorithm and proposed the support vector graph (SVG), whose main idea is to use the support vectors (SVs) to construct the adjacency matrix and derive the connected components in order to reduce the time complexity. Yang et al. [18] proposed the proximity graph (PG), which reduces the time complexity to \( O(N\log N) \) or \( O(N) \). Lee et al. [19] devised gradient descent (GD), which looks for stable equilibrium points (SEPs). Jung et al. [20] proposed fast support vector clustering (FSVC), which improves both the speed of the algorithm and the quality of clustering. Sei-Hyung Lee [21] designed a cone-based cluster partition method called Cone Cluster Labeling (CCL) that avoids random operations; it improves the quality of clustering but increases the computational cost. Convex decomposition based cluster labeling (CDCL) [22] was proposed to improve both the efficiency and accuracy of clustering based on convex decomposition. L-CRITICAL is a novel SVC cluster labeling algorithm that solves the labeling phase of SVC within competitive processing time [23]. Proximity multi-sphere support vector clustering (PMS-SVC) was developed based on the multi-sphere approach to support vector data description [24]. Rough-fuzzy support vector clustering (RFSVC) obtains rough fuzzy clusters using the support vectors as cluster representatives [25].

The clustering algorithm of similarity segmentation based point sorting (CASS-PS) [26] clusters quickly. However, its similarity measure is based on distance alone, which tends to interleave elements of different clusters during sorting and thus reduces the accuracy of the clustering results.

In this paper, we propose an improved SVC algorithm called partitioning clustering based on support vector ranking (PC-SVR). Its crucial components are (1) sorting of the SVs based on their geometric properties in the feature space and (2) cluster partitioning using the clustering algorithm of similarity segmentation based point sorting (CASS-PS). The proposed algorithm guarantees the quality of clustering while improving its speed.

2 Partitioning Clustering Based on Support Vector Ranking

Our PC-SVR algorithm combines the first stage of SVC with CASS-PS and is composed of two stages: first, sort the support vectors (SVs) into an array; second, split the sorted array.

2.1 Support Vector Sorting

In the feature space, the data are mapped onto a minimal enclosing sphere. Denote this sphere by \( S \) and its center by \( a \). Since \( K(x,x) = \exp(-q \cdot \|x - x\|^{2}) = 1 \), we have \( K(x,x) = \langle \Phi(x) \cdot \Phi(x) \rangle = \|\Phi(x)\|^{2} = 1 \), which means all the mapped data points lie on the surface of the unit ball. Denote this ball by \( B \) and its center by \( O \). The mapped data are therefore covered by the intersection of \( S \) and \( B \), whose shape is like a cap; the center of this cap is denoted \( a^{\prime} \), as shown in Fig. 1. Since the SVs lie on the surface of \( S \), they also lie on the intersection hyper line of \( S \) and \( B \). \( \Phi(v_{i}) \) and \( \Phi(v_{j}) \) are the images of two SVs in the feature space, and \( \theta \) is the angle formed at \( O \) between the image of an SV and the cap center \( a^{\prime} \). The transverse section of the cap is illustrated in Fig. 2.

Fig. 1. Intersection between ball and sphere

Fig. 2. Transverse section of the cap

Given a dataset containing N data points \( \{x_{i} \mid x_{i} \in X, 1 \le i \le N\} \), let \( V = \{v_{i} \mid v_{i} \ \text{is a SV}, 1 \le i \le N_{SV}\} \). In this research, we will use the following geometric properties of the samples in the feature space [21]:

Lemma 1.

\( \angle(\Phi(v_{i})\,O\,a^{\prime}) = \angle(\Phi(v_{j})\,O\,a^{\prime}), \quad \forall v_{i}, v_{j} \in V \)

Lemma 2.

\( \forall x \in X, v \in V:\ \angle(\Phi(v)\,O\,\Phi(x)) < \theta \Leftrightarrow \|v - x\| < \|v - \Phi^{-1}(a^{\prime})\| \)

Lemma 3.

\( \forall x \in X, v \in V:\ \|v - x\| < \|v - \Phi^{-1}(a^{\prime})\| \Leftrightarrow x \) and \( v \) belong to the same cluster.

The following corollary can be proven by the above three properties, as in [17]:

Corollary.

In the feature space of the Gaussian kernel, the SVs are grouped by cluster on the intersection hyper line of \( S \) and \( B \). This is illustrated in Fig. 3.

Fig. 3. The distribution of SVs in three clusters

From the above properties, any \( v_{i}, v_{j} \in V \) form the same angle with \( Oa^{\prime} \) in the feature space (Fig. 1), and if the angle between a sample point and an SV is less than \( \theta \), the point and the SV belong to the same cluster. In the data space, a point whose distance from an SV is less than \( \|v - \Phi^{-1}(a^{\prime})\| \) has the same cluster label as that SV, so the computation in the feature space can be converted to a computation in the input space.

Therefore, we can use the angle formed at \( O \) by two SVs \( v_{i} \) and \( v_{j} \) to measure the distance between them. The relation between this angle and the Gaussian kernel function is \( \cos(\angle \Phi(v_{i})\,O\,\Phi(v_{j})) = \langle \Phi(v_{i}) \cdot \Phi(v_{j}) \rangle = K(v_{i}, v_{j}) \). Namely, comparing the distance between two SVs is transformed into comparing their kernel values: the greater the distance between the two SVs, the larger the angle and the smaller the kernel value.
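As a small illustration of this relation (our own sketch, not part of the original algorithm description; the function names are hypothetical), the cosine of the feature-space angle between two SVs can be obtained directly from the Gaussian kernel in the input space:

```python
import numpy as np

def gaussian_kernel(x, y, q):
    """Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-q * np.dot(diff, diff)))

def feature_space_cosine(v_i, v_j, q):
    """cos(angle(Phi(v_i), O, Phi(v_j))) equals K(v_i, v_j), because every
    mapped point has unit norm under the Gaussian kernel."""
    return gaussian_kernel(v_i, v_j, q)
```

A larger kernel value therefore directly indicates a smaller feature-space angle and a smaller distance between the two SVs.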

The similarity matrix for the SVs is constructed according to the values of the kernel function. Then, according to this matrix, the SVs are sorted as follows: first, the two SVs whose images \( \Phi(v_{i}) \) and \( \Phi(v_{j}) \) have the minimum distance are selected as the head and tail of an ordered array; second, the SV whose image \( \Phi(v_{k}) \) has the minimum distance from the head (or tail) of the sorted array is found and becomes the new head (or tail) of the array. This step is repeated until all SVs are stored in the array, which yields the sequence of the SVs along the circumference in the feature space, as shown in Fig. 4.

Fig. 4. The sequence of SVs
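The ordering procedure just described can be sketched as follows. This is a minimal illustration under our own assumptions (svs is an \( N_{SV} \times d \) NumPy array and the function name is hypothetical), not the authors' implementation:

```python
import numpy as np

def sort_support_vectors(svs, q):
    """Greedy SV ordering sketched from Sect. 2.1: seed the array with the
    closest pair of SVs as head and tail, then repeatedly attach the
    remaining SV with the largest kernel value (smallest distance) to the
    head or the tail."""
    n = len(svs)
    # similarity matrix K[i, j] = exp(-q * ||v_i - v_j||^2)
    sq_dists = np.sum((svs[:, None, :] - svs[None, :, :]) ** 2, axis=-1)
    K = np.exp(-q * sq_dists)

    # head and tail: the closest pair (largest off-diagonal kernel value)
    masked = K - 2.0 * np.eye(n)          # push the diagonal out of range
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    order = [i, j]
    remaining = set(range(n)) - {i, j}

    while remaining:
        head, tail = order[0], order[-1]
        # closest remaining SV to the head and to the tail
        best_head = max(remaining, key=lambda k: K[head, k])
        best_tail = max(remaining, key=lambda k: K[tail, k])
        if K[head, best_head] >= K[tail, best_tail]:
            order.insert(0, best_head)
            remaining.discard(best_head)
        else:
            order.append(best_tail)
            remaining.discard(best_tail)
    return order
```

The returned list contains the indices of the SVs in their order along the circumference.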

2.2 Partition Clustering

This stage mainly uses the CASS-PS algorithm to partition the data into clusters.

In the data space, we first calculate the distance between each sample point (other than the SVs) and every SV and find the nearest SV. Then we insert the point into the position adjacent to that SV in the array. This process is repeated until all the sample points are stored in the array. In order to observe the variation of the distance between adjacent elements more directly, we draw the curve of the distance between adjacent sample points. This distance curve shows obvious changes between adjacent elements; in particular, it fluctuates strongly between adjacent sample points that are attached to different SVs.
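A minimal sketch of this insertion step, under our own assumptions about the data layout (the function name is hypothetical and sv_order is the index list produced by the sorting stage), is given below:

```python
import numpy as np

def insert_points_by_nearest_sv(points, sv_order, svs):
    """Sketch of the insertion step above: each non-SV point is placed
    right after its nearest SV in the sorted SV array."""
    svs = np.asarray(svs)
    # one slot per sorted SV; the slot starts with the SV itself
    slots = [[svs[idx]] for idx in sv_order]
    position = {idx: pos for pos, idx in enumerate(sv_order)}
    for p in points:
        # squared Euclidean distance from p to every SV
        d2 = np.sum((svs - np.asarray(p)) ** 2, axis=1)
        nearest = int(np.argmin(d2))
        slots[position[nearest]].append(np.asarray(p))
    # flatten the slots into the final ordered array of sample points
    return [x for slot in slots for x in slot]
```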

At the same time, a wavelet filter function can be used to reduce the impact of noise points and isolated points, so that the best segmentation points can be found more accurately. We then set a threshold, whose value can be taken as the mean amplitude, and ignore the part of the curve below the threshold in order to simplify the determination of split points. The continuous curve is thus divided into several discontinuous curve segments. Within each segment, the position with the maximum distance between adjacent elements is taken as a candidate splitting point. After the splitting points of all the segments have been found, they are sorted, and the position with the maximum distance among adjacent splitting points is selected as the first splitting place. Following this procedure, the algorithm decides whether to re-segment or to terminate; after termination, the number of clusters is output.
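The following sketch illustrates the segmentation idea with two simplifications of our own: a moving average stands in for the wavelet filter, and the number of clusters is assumed to be known, so the largest peaks are taken directly instead of performing the iterative re-segmentation described above.

```python
import numpy as np

def find_split_points(ordered_points, num_clusters, window=3):
    """Sketch of the segmentation step: build the adjacent-distance curve,
    smooth it, keep only the part above the mean amplitude, and take the
    largest peak of each remaining segment as a candidate split point."""
    pts = np.asarray(ordered_points)
    # distance curve between adjacent elements of the sorted array
    dist = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    # simple smoothing to damp noise points and isolated points
    kernel = np.ones(window) / window
    smooth = np.convolve(dist, kernel, mode="same")
    threshold = smooth.mean()

    # collect contiguous runs above the threshold and keep each run's peak
    peaks, run = [], []
    for idx, value in enumerate(smooth):
        if value > threshold:
            run.append(idx)
        elif run:
            peaks.append(max(run, key=lambda k: smooth[k]))
            run = []
    if run:
        peaks.append(max(run, key=lambda k: smooth[k]))

    # the largest peaks become split points; num_clusters - 1 cuts are needed
    peaks.sort(key=lambda k: smooth[k], reverse=True)
    return sorted(peaks[:num_clusters - 1])
```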

2.3 The Implementation of PC-SVR Algorithm

In this research, we mainly use the geometric properties of the sample points in the feature space together with CASS-PS to improve cluster partitioning, which is the second stage of the SVC algorithm. In the feature space, the SVs are grouped by cluster on the intersection hyper line of the minimal sphere and the unit ball. Sorting the SVs is based on the similarity between pairs of SVs, which in turn is based on the value of the kernel function. Because the SVs are already sorted, the limitation of the CASS-PS algorithm, namely that sample points of different clusters tend to be interleaved, is avoided.

The detailed steps of our algorithm are given as follows (a compact code sketch of the whole pipeline is given after the list):

(1) Given a sample set \( S = \{x_{i} \mid x_{i} \in X\} \) of size N, set the parameters q and C, and initialize a one-dimensional array based on the sample size;

(2) Calculate the kernel matrix of the sample set;

(3) Calculate the radius R of the minimal sphere and the SVs by solving the Lagrangian dual problem;

(4) Calculate the kernel matrix of the SVs and construct the similarity matrix of the support vectors;

(5) Sort the SVs according to the similarity matrix to obtain a sorted array of SVs. At this point, the first stage of the algorithm is completed;

(6) Calculate the distance from each remaining sample point to all SVs, find the SV closest to the sample point to be sorted, and insert the sample point immediately after that SV;

(7) Repeat Step (6) until all remaining sample points have been inserted, which gives a new array of sample points;

(8) Draw the curve of the distance between adjacent sample points, apply the wavelet filter function to reduce noise, set a threshold, and retain the portions above the threshold as the candidate segments;

(9) Find the point with the maximum distance (the peak of the distance curve) in each segment and sort these points. According to the number of clusters, select the corresponding points as splitting points to split the array of sample points;

(10) Assign cluster labels to the sample points.
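The compact sketch below strings the hypothetical helpers introduced above together along steps (1)-(10); svdd_support_vectors stands for an SVDD/SVC solver (steps (2)-(3)) that is not reproduced here, and the whole listing is an illustration of the pipeline rather than the authors' implementation.

```python
import numpy as np

def pc_svr(X, q, C, num_clusters):
    """Compact sketch of the PC-SVR pipeline, steps (1)-(10)."""
    X = np.asarray(X)
    # steps (2)-(3): kernel matrix, minimal sphere and support vectors
    sv_idx, svs = svdd_support_vectors(X, q, C)      # hypothetical solver
    # steps (4)-(5): similarity matrix of SVs and SV sorting (first stage)
    sv_order = sort_support_vectors(svs, q)
    # steps (6)-(7): insert the remaining points next to their nearest SV
    sv_set = set(sv_idx)
    others = [x for i, x in enumerate(X) if i not in sv_set]
    ordered = insert_points_by_nearest_sv(others, sv_order, svs)
    # steps (8)-(9): distance curve, smoothing, threshold and split points
    splits = find_split_points(ordered, num_clusters)
    # step (10): label the segments between consecutive split points
    boundaries = [k + 1 for k in splits] + [len(ordered)]
    labels, start = [], 0
    for c, end in enumerate(boundaries):
        labels.extend([c] * (end - start))
        start = end
    return ordered, labels
```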

3 Experiment Analysis

3.1 Evaluation Criteria of Experimental Results

In this paper, the Rand index [27], the Adjusted Rand index [28] and the Accuracy index [29] are used to evaluate the clustering results.

The Rand index is an external evaluation metric that evaluates the effectiveness of clustering by comparing the known partition with the partition obtained by the clustering algorithm. Given a dataset that contains \( n \) elements and its known partition P, we run the algorithm to be evaluated to obtain another partition Q. Let \( r \) be the number of pairs of points that belong to the same cluster in both P and Q, \( s \) the number of pairs that belong to different clusters in both P and Q, \( t \) the number of pairs that belong to the same cluster in P but to different clusters in Q, and \( v \) the number of pairs that belong to the same cluster in Q but to different clusters in P. Thus \( r \) and \( s \) measure the agreement of the two partitions, while \( t \) and \( v \) describe their disagreement. The Rand index is given as follows:

$$ RI = \frac{r + s}{r + s + t + v} $$
(1)

The value of the Rand index ranges in [0, 1], and the greater the value of RI, the better the clustering result.
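As a small worked illustration of Eq. (1) (our own code, not taken from the paper), the index can be computed by counting the four kinds of point pairs directly:

```python
from itertools import combinations

def rand_index(p_labels, q_labels):
    """Rand index of Eq. (1): pairs placed consistently (together in both
    partitions, or apart in both) divided by all pairs."""
    r = s = t = v = 0
    for i, j in combinations(range(len(p_labels)), 2):
        same_p = p_labels[i] == p_labels[j]
        same_q = q_labels[i] == q_labels[j]
        if same_p and same_q:
            r += 1          # together in P and in Q
        elif not same_p and not same_q:
            s += 1          # apart in P and in Q
        elif same_p:
            t += 1          # together in P, apart in Q
        else:
            v += 1          # apart in P, together in Q
    return (r + s) / (r + s + t + v)
```

For example, rand_index([0, 0, 1, 1], [0, 0, 1, 1]) returns 1.0, while partitions that disagree on some pairs yield values below 1.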

In addition to comparing the known partition with the partition obtained by the algorithm, the Adjusted Rand index standardizes the result so that agreement by chance is corrected for. The formula is as follows:

$$ s_{1} = \sum\limits_{i = 1}^{{K_{P} }} {C_{{N_{i} }}^{2} } ,\;\;s_{2} = \sum\limits_{j = 1}^{{K_{Q} }} {C_{{N_{j} }}^{2} } ,\;\;s_{3} = \frac{{2s_{1} s_{2} }}{N(N - 1)},\;\;\;\;\;ARI = \;\frac{{\sum\limits_{i = 1}^{{K_{P} }} {\sum\limits_{j = 1}^{{K_{Q} }} {C_{{N_{ij} }}^{2} } - s_{3} } }}{{(s_{1} + s_{2} )/2 - s_{3} }} $$
(2)

In the above, P and Q represent the two clustering results of a sample set consisting of n elements, and \( K_{P} \) and \( K_{Q} \) are the numbers of clusters in P and Q, respectively. \( N_{i} \) and \( N_{j} \) denote the numbers of elements in cluster \( i \) of P and cluster \( j \) of Q, respectively, and \( N_{ij} \) is the number of elements belonging to both cluster \( i \) of P and cluster \( j \) of Q. The Adjusted Rand index ranges in [-1, 1], and the greater its value, the more similar the two clustering results are. It can also be used to determine whether an algorithm is applicable to a certain dataset.
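A corresponding sketch for Eq. (2), again our own illustration, computes the index from the contingency counts \( N_{ij} \) and the marginals \( N_{i} \), \( N_{j} \):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(p_labels, q_labels):
    """Adjusted Rand index following Eq. (2), computed from the
    contingency counts N_ij and the marginals N_i, N_j."""
    n = len(p_labels)
    n_ij = Counter(zip(p_labels, q_labels))
    n_i = Counter(p_labels)
    n_j = Counter(q_labels)

    s1 = sum(comb(c, 2) for c in n_i.values())
    s2 = sum(comb(c, 2) for c in n_j.values())
    s3 = 2 * s1 * s2 / (n * (n - 1))
    index = sum(comb(c, 2) for c in n_ij.values())
    return (index - s3) / ((s1 + s2) / 2 - s3)
```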

Accuracy is one of the most commonly used external evaluation indices. The formula is as follows:

$$ AC = \frac{{\sum\nolimits_{i = 1}^{m} {c_{i} } }}{N} $$
(3)

In the above, m represents the number of clusters, N represents the number of elements in the sample set, and \( c_{i} \) is the number of elements of cluster \( i \) that are correctly grouped, i.e., that belong to the true class matched to cluster \( i \). The formula is based on comparing the correct partition with the result obtained by the clustering algorithm.
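Because the matching behind \( c_{i} \) is not spelled out by the formula, the sketch below (our own illustration) uses the Hungarian algorithm, one common way of choosing the cluster-to-class matching, to evaluate Eq. (3):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Accuracy of Eq. (3): each cluster is matched to a true class and the
    matched (correctly grouped) counts c_i are summed and divided by N."""
    true_ids = {t: i for i, t in enumerate(sorted(set(true_labels)))}
    pred_ids = {p: j for j, p in enumerate(sorted(set(pred_labels)))}
    # contingency table: rows = true classes, columns = predicted clusters
    table = np.zeros((len(true_ids), len(pred_ids)), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        table[true_ids[t], pred_ids[p]] += 1
    # maximize the total number of correctly grouped elements
    rows, cols = linear_sum_assignment(-table)
    return table[rows, cols].sum() / len(true_labels)
```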

3.2 Experimental Datasets

In this research, the experiments are carried out on both artificial data and real data. All the datasets are described in Table 1. The two artificial datasets are Example 1 and Example 2.

Table 1. Description of datasets

Example 1 is a two-dimensional dataset of size \( 150 \times 2 \). The dataset is relatively easy to separate and is used to verify the feasibility of the algorithm.

Example 2 is a concentric-ring dataset of size \( .. \). This type of dataset is difficult to cluster, and its purpose is to verify whether the algorithm can handle linearly inseparable situations.

The four real datasets used in this research are taken from UCI [30], and they are frequently used in clustering analysis: Iris dataset, Wine dataset, Wisconsin dataset and Balance Scale dataset.

3.3 Experimental Results and Analysis

(1) We compare the experimental results of the PC-SVR algorithm and the original clustering algorithm of similarity segmentation based point sorting (CASS-PS) on the two artificial datasets, which are shown in Fig. 5.

Fig. 5. Two artificial datasets

As shown in Fig. 6, the clustering results of the two algorithms on Example 1 are both satisfactory, but the PC-SVR algorithm is more accurate than CASS-PS and produces no wrongly clustered points. Figure 6 shows that, because PC-SVR sorts the SVs before the sorting-and-clustering step, the sorting process becomes more accurate and data points of the same cluster are less likely to be assigned to the wrong cluster.

Fig. 6. A comparison of the clustering effect on Example 1

Figure 7 shows the results obtained by the two algorithms on the Example 2 dataset.

Fig. 7. A comparison of the clustering effect on Example 2

As shown in Fig. 7, PC-SVR inherits the ability of SVC to deal with linearly inseparable datasets and is more accurate than SVC. For this type of dataset, the clustering result of the CASS-PS algorithm is not satisfactory.

(2) The comparison of time cost between PC-SVR and SVC is presented in Table 2.

Table 2. Time comparison between SVC and PC-SVR (in seconds)

Table 2 presents the running-time comparison between SVC and PC-SVR on the two artificial datasets and the four classical real datasets. From this table, we can see that the efficiency of the PC-SVR algorithm is greatly improved compared with the original SVC algorithm.

(3) We use the Rand index to compare the experimental results of the PC-SVR algorithm and four existing algorithms on three real datasets. The four other clustering algorithms are support vector clustering (SVC), the clustering algorithm of similarity segmentation based point sorting (CASS-PS), convex decomposition based cluster labeling (CDCL) and cone cluster labeling (CCL). The experimental results for CDCL and CCL are taken from reference [22]. The results are shown in Fig. 8.

Fig. 8. A comparison of the Rand index for all five algorithms

Figure 8 reports the Rand index of the five algorithms on the three datasets. PC-SVR performs the best on the Iris dataset. Although it does not obtain the highest Rand index on the other two datasets, PC-SVR is only 3.46% lower than CDCL (the best on the Wine dataset) and 12.15% lower than CCL (the best on the Wisconsin dataset).

In addition, in order to fully verify the clustering performance of PC-SVR, we use two further indices, the Adjusted Rand index and Accuracy, to compare the clustering performance of PC-SVR with that of four classical algorithms, namely K-means, hierarchical clustering, support vector clustering and the clustering algorithm of similarity segmentation based point sorting, on the four real datasets. The results are listed in Tables 3 and 4.

Table 3. Adjusted Rand index
Table 4. Accuracy

The above results show that the PC-SVR algorithm ensures the quality of clustering while improving its speed, and its overall clustering performance is excellent.

4 Conclusions and Future Work

In this paper, the PC-SVR algorithm based on support vector sorting has been proposed. It is divided into two parts: support vector sorting and partitioning. In the first part, we sort the SVs on the basis of their geometric properties in the feature space. In the second part, we partition the samples using the point-sorting-based partition clustering algorithm and generate the clusters. Experimental results demonstrate the effectiveness of PC-SVR in improving the performance of SVC, and better clustering performance is achieved compared with existing approaches.

In future work, we will explore potential application fields of our approach, for instance bioinformatics and social media analysis.