
1 Introduction

Clustering analysis aims to discover the group structure of a data set and to disclose the knowledge, patterns and rules hidden in the data [5, 7, 14, 16]. It is an unsupervised learning process that groups similar objects into the same cluster and dissimilar ones into different clusters [1, 3,4,5, 10, 12, 13, 16]. Its applications range from astronomy to bioinformatics, bibliometrics, biomedicine and pattern recognition [1, 3,4,5, 11, 15]. With the emergence of big data in various real-world areas, more and more researchers have focused on clustering techniques that can automatically understand and summarize complex data and uncover the potential knowledge, rules and patterns embedded in it without any prior domain knowledge [1,2,3, 9, 11, 15].

There are many kinds of clustering algorithms, such as partitioning, hierarchical and density-based ones [1, 3,4,5, 12, 13, 16]. A novel density-based clustering algorithm was proposed by Alex Rodríguez and Alessandro Laio in [1]. It finds the clustering of a data set by taking the density peaks as cluster centers and assigning each remaining point to the same cluster as its nearest neighbor with higher density. We refer to this algorithm as DPC for short. DPC is powerful, except for its weaknesses in calculating the densities of points and in its one-step assignment strategy [1, 12, 13]. Several advanced density peaks based clustering algorithms have been proposed [1, 8, 12, 13]. The K-nearest-neighbor based density peaks finding clustering (KNN-DPC) [12] and the fuzzy weighted K-nearest-neighbor based density peaks finding clustering (FKNN-DPC) [13] were proposed to remedy the deficiencies of DPC, and extensive experiments demonstrated their excellent performance. However, because DPC uses an arbitrary cutoff distance \(d_c\) to define the local density of a point, the cluster centers of sparse clusters may not be found, since they cannot become density peaks. KNN-DPC and FKNN-DPC define the local density of a point via its K nearest neighbors and can alleviate this problem of DPC to some extent, but they do not solve it thoroughly. It is a common phenomenon that a data set contains both dense and sparse clusters simultaneously, which makes it difficult for the aforementioned clustering algorithms to find the proper cluster centers, and hence the clustering.

To let cluster centers become density peaks no matter whether they come from dense or sparse clusters, we absorb as much of the distributive information of the points in a data set as possible and propose a heuristic clustering algorithm that adopts the local standard deviation of a point to define its local density, since the local standard deviation of a point embodies how dense the local area around that point is. The resulting local standard deviation based density peaks finding clustering algorithm is named SD_DPC. We tested its power on some special synthetic data sets that contain both dense and sparse clusters simultaneously, as well as on a typical synthetic data set from [6]. The experimental results demonstrate that SD_DPC is powerful in finding the cluster centers via the local standard deviation based density peaks and in finding the clustering of a data set, and that its performance is superior to that of DPC, KNN-DPC and FKNN-DPC.

This paper is organized as follows: Sect. 2 introduces the new SD_DPC in detail. Section 3 tests its power on special synthetic data sets and compares its performance with that of DPC, KNN-DPC and FKNN-DPC in terms of several typical criteria for evaluating a clustering algorithm, namely clustering accuracy (Acc), adjusted mutual information (AMI) and adjusted Rand index (ARI). Section 4 concludes the paper.

2 The Main Idea of the Proposed Clustering Algorithm

DPC [1] has become a hotspot in machine learning for its strong power in detecting cluster centers, excluding outliers, and recognizing clusters of arbitrary shapes and dimensions. The basic assumption of DPC is that ideal cluster centers are always surrounded by neighbors with lower local density, and that they are at a relatively large distance from any point with higher local density. To find the ideal cluster centers, DPC introduces the local density of point i in (1) and the distance \(\delta _i\) of point i in (2).

$$\begin{aligned} \rho _i = |\{d_{ij}| d_{ij} < d_c \}| \end{aligned}$$
(1)
$$\begin{aligned} \delta _i = \left\{ \begin{array}{ll} \max _j \{d_{ij}\}, & \rho _i = \max _j \{\rho _j\}\\ \min (\{ d_{ij} \mid \rho _j > \rho _i \}), & \text {otherwise} \end{array} \right. \end{aligned}$$
(2)

where \(d_{ij}\) is the Euclidean distance between points i and j, and \(d_c\) is the cutoff distance specified manually. We can see from (1) that the local density \(\rho _i\) of point i is the number of points j that are closer to i than \(d_c\). DPC is robust to \(d_c\) for large data sets [1]. The definition in (2) shows that \(\delta _i\) is the maximum distance from point i to any other point j when point i has the highest density; otherwise \(\delta _i\) is the minimum distance between point i and any other point j with higher density [1]. It can be seen from (2) that \(\delta _i\) is much larger than the nearest-neighbor distance for points with local or global maximum density. The points with anomalously large values of \(\delta _i\) are chosen as cluster centers by DPC.
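As an illustration only (not the authors' released code), a minimal NumPy sketch of how the two quantities in (1) and (2) could be computed is given below; the function name dpc_rho_delta and the in-memory O(n²) distance matrix are our own assumptions.

```python
import numpy as np

def dpc_rho_delta(X, d_c):
    """Sketch of DPC's local density (1) and distance delta (2)."""
    # pairwise Euclidean distances; the diagonal is excluded with +inf
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)

    rho = (d < d_c).sum(axis=1)                 # (1): neighbors closer than d_c
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]      # points with strictly higher density
        if higher.size == 0:                    # i carries the maximum density
            delta[i] = d[i][np.isfinite(d[i])].max()
        else:                                   # (2): distance to nearest denser point
            delta[i] = d[i, higher].min()
    return rho, delta
```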

The most important contribution of DPC is the idea of the decision graph, which is the collection of points (\(\rho _i\), \(\delta _i\)) in a 2-dimensional space with \(\rho \) and \(\delta \) as the x-axis and y-axis respectively. Cluster centers are the points with high \(\delta \) and relatively high \(\rho \), that is, the points at the top right corner of the decision graph. The second innovation of DPC is its one-step assignment: each remaining point, apart from the density peaks, is assigned to the same cluster as its nearest neighbor with higher density. This one-step assignment makes DPC very efficient.
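The following sketch shows one plausible way to turn the decision graph into cluster centers and to carry out the one-step assignment; picking the n_centers largest \(\rho \cdot \delta \) products automatically is our simplification of the manual inspection of the decision graph described in [1].

```python
import numpy as np

def dpc_assign(d, rho, delta, n_centers):
    """Sketch of decision-graph center selection and the one-step assignment.

    d is the (n, n) distance matrix with np.inf on the diagonal; rho and delta
    come from dpc_rho_delta above.
    """
    # candidates at the top right of the decision graph: large rho * delta
    centers = np.argsort(rho * delta)[::-1][:n_centers]
    labels = np.full(len(rho), -1)
    labels[centers] = np.arange(n_centers)

    # visit points by decreasing density so a denser neighbor is labeled first
    for i in np.argsort(rho)[::-1]:
        if labels[i] != -1:
            continue
        higher = np.where(rho > rho[i])[0]
        # fall back to the nearest already-labeled point if no denser neighbor exists
        cand = higher if higher.size else np.where(labels != -1)[0]
        labels[i] = labels[cand[np.argmin(d[i, cand])]]
    return centers, labels
```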

However, everything has two sides. The local density definition of DPC in (1) may result in low densities for the cluster centers of sparse clusters because of the arbitrary cutoff distance \(d_c\). The very efficient one-step assignment of DPC for the remaining points may lead to a "Domino Effect" [12, 13]: once a point is assigned erroneously, many more points may subsequently be assigned incorrectly [12, 13], especially when there are several overlapping clusters.

$$\begin{aligned} \rho _i = \sum _{j \in KNN_i} exp(-d_{ij}) \end{aligned}$$
(3)

KNN-DPC [12] and FKNN-DPC [13] introduced a new definition of the density \(\rho _i\) of point i by (3). The new definition can be applied to data sets of any size. \(KNN_i\) in (3) denotes the set of the K nearest neighbors of point i. Furthermore, KNN-DPC and FKNN-DPC each proposed their own two-step assignment strategy; FKNN-DPC introduces the fuzzy weighted K-nearest-neighbor theory in the second step of its assignment strategy, whereas KNN-DPC uses only the K-nearest-neighbor theory in its two-step assignment strategy. Both KNN-DPC and FKNN-DPC were demonstrated to be superior to DPC [12, 13], and owing to the contribution of the fuzzy weighted K nearest neighbors, FKNN-DPC is more robust than KNN-DPC [13].
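A short sketch of the density in (3), again under our own naming conventions, could be:

```python
import numpy as np

def knn_exp_density(d, k):
    """Sketch of the K-nearest-neighbor density of (3).

    d is the (n, n) distance matrix with np.inf on the diagonal; k is the
    number of nearest neighbors."""
    knn_d = np.sort(d, axis=1)[:, :k]           # distances to the K nearest neighbors
    return np.exp(-knn_d).sum(axis=1)           # rho_i = sum_{j in KNN_i} exp(-d_ij)
```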

Although KNN-DPC and FKNN-DPC have been demonstrated to be very powerful in finding the patterns of data sets, they do not thoroughly solve the problem of DPC, namely finding the cluster centers of sparse clusters by finding density peaks.

To detect the cluster centers of sparse clusters by finding density peaks, we try to absorb as much of the distribution information of the points in a data set as possible, and therefore introduce the new local density \(\rho _i\) of point i in (4), defined via the local standard deviation of point i, with the knowledge that the local standard deviation of a point embodies how dense the local area around point i is. \(KNN_i\) in (4) is the same as in (3), that is, the set of the K nearest neighbors of point i, and \(d_{ij}\) is the Euclidean distance between points i and j. The resulting local standard deviation based density peaks finding clustering algorithm proposed in this paper is named SD_DPC. The assignment strategy of SD_DPC is the same as that of DPC, purely to demonstrate that the local density definition of \(\rho _i\) in (4) can solve the problem of DPC that the cluster center of a sparse cluster may not be detected by finding density peaks.

$$\begin{aligned} \rho _i = \frac{1}{\sqrt{\frac{1}{K-1}\sum _{j\in KNN_i} d^2_{ij}}} \end{aligned}$$
(4)
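For comparison with (3), a minimal sketch of the local standard deviation based density in (4) might look as follows (the helper name sd_density is our own, and K > 1 is required by the 1/(K-1) factor):

```python
import numpy as np

def sd_density(d, k):
    """Sketch of the local-standard-deviation density of (4)."""
    knn_d = np.sort(d, axis=1)[:, :k]                     # distances to KNN_i
    local_sd = np.sqrt((knn_d ** 2).sum(axis=1) / (k - 1))
    return 1.0 / local_sd                                 # denser neighborhood -> larger rho
```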

Here are the main steps of SD_DPC.

Step 1: calculate the density \(\rho _i\) of each point i by (4) and its distance \(\delta _i\) by (2);

Step 2: plot all points in the 2-dimensional decision graph space, with their densities \(\rho \) and distances \(\delta \) as their x and y coordinates respectively;

Table 1. Description of synthetic data sets
Table 2. The parameters for generating dataset1

Step 3: select the points at the top right corner of the decision graph, that is, the density peaks with relatively high densities and distances, as the cluster centers of the data set;

Step 4: assign each remaining point, apart from the cluster centers, to the same cluster as its nearest neighbor with higher density.
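Putting the four steps together, a minimal end-to-end sketch of SD_DPC could reuse the helpers sketched above (sd_density and dpc_assign, both our own illustrative names); the parameter n_centers stands in for the manual reading of the decision graph in Step 3.

```python
import numpy as np

def sd_dpc(X, k, n_centers):
    """End-to-end sketch of the four SD_DPC steps above."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)

    rho = sd_density(d, k)                        # Step 1: densities by (4)
    delta = np.empty(len(X))                      # Step 1: distances by (2)
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = (d[i][np.isfinite(d[i])].max() if higher.size == 0
                    else d[i, higher].min())
    # Steps 2-4: decision graph, center selection and one-step assignment
    return dpc_assign(d, rho, delta, n_centers)
```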

Table 3. The parameters for generating dataset2
Table 4. The parameters for generating dataset3
Fig. 1. The clusterings of a3 by 4 clustering algorithms, respectively.

Fig. 2. The clusterings of dataset1 by 4 clustering algorithms, respectively.

Fig. 3. The clusterings of dataset2 by 4 clustering algorithms, respectively.

3 Experiments and Analysis

This section presents the experimental results of SD_DPC and their analysis. In order to test the power of SD_DPC, especially its power to detect the cluster centers of sparse clusters and to find the clustering of a data set containing both dense and sparse clusters simultaneously, we synthetically generated data sets with both dense and sparse clusters. In addition, we test SD_DPC on the typical data set a3 from [6]. Because of page limitations we cannot display further experimental results on other typical benchmark data sets. Subsection 3.1 describes the data sets used in this paper. The experimental results of SD_DPC and their analysis are given in Subsect. 3.2, where we compare the performance of SD_DPC with that of DPC, KNN-DPC and FKNN-DPC. Because of page limitations, the decision graphs and some other experimental results, such as comparisons with other clustering algorithms, cannot be included in this paper.

We normalize the data using the min-max normalization given by (5), where \(x_{ij}\) is the value of attribute j of point i, and \(min(x_j)\) and \(max(x_j)\) are the minimum and maximum values of attribute j, respectively. The min-max normalization in (5) preserves the original relationships in the data [4], reduces the influence of the different scales of the attributes on the experimental results, and reduces the runtime of the algorithms as well.

$$\begin{aligned} x_{ij} \leftarrow \frac{x_{ij} - min(x_j)}{max(x_j) -min(x_j)} \end{aligned}$$
(5)
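A straightforward column-wise implementation of (5) is sketched below; guarding constant attributes against division by zero is our own addition.

```python
import numpy as np

def min_max_normalize(X):
    """Min-max scaling of (5), applied attribute by attribute."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # avoid division by zero
    return (X - x_min) / span
```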

3.1 Data Sets Description

Table 1 shows the information of the synthetic data sets used in our experiments. The data set a3 comes from [6], and the other three data sets were synthetically generated; the parameters for generating them are shown in Tables 2, 3 and 4, respectively. We designed these data sets of various sizes with dense and sparse clusters simultaneously to test the ability and the scalability of SD_DPC in finding the cluster center of a sparse cluster and the clustering of a data set as a whole.

3.2 Results and Analysis

Figures 1, 2, 3 and 4 respectively display the clusterings of the data sets from Table 1 by the 4 clustering algorithms. The square point in each cluster is the cluster center detected by the corresponding algorithm. Table 5 displays the quantitative results of the aforementioned clustering algorithms in terms of Acc, AMI and ARI, together with the column F/P, where F is the number of cluster centers found by an algorithm and P is the number of clusters in which those centers lie. The parameters pre-specified for each algorithm are also shown in Table 5 in the column Par.
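The three criteria can be computed as sketched below; AMI and ARI are taken from scikit-learn, and Acc is obtained by the usual optimal one-to-one matching of predicted and true labels (Hungarian algorithm). This is a common recipe, not necessarily the exact implementation used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Acc: fraction of points correctly grouped under the best label matching."""
    labels_true, labels_pred = np.unique(y_true), np.unique(y_pred)
    # contingency table: rows are predicted labels, columns are true labels
    count = np.array([[np.sum((y_pred == p) & (y_true == t)) for t in labels_true]
                      for p in labels_pred])
    row, col = linear_sum_assignment(-count)      # maximize the matched points
    return count[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    """Return (Acc, AMI, ARI) for one clustering result."""
    return (clustering_accuracy(y_true, y_pred),
            adjusted_mutual_info_score(y_true, y_pred),
            adjusted_rand_score(y_true, y_pred))
```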

From the clusterings of a3 and dataset1 by the 4 clustering algorithms shown in Figs. 1 and 2 respectively, we can see that all 4 clustering algorithms detect all the cluster centers and the clusterings of a3 and dataset1. The differences between the clusterings produced by the 4 algorithms only concern the border points between clusters. The quantitative comparison of the clustering results of the 4 algorithms on a3 and dataset1 is given in Table 5 in terms of Acc, AMI and ARI.

Fig. 4. The clusterings of dataset3 by 4 clustering algorithms, respectively.

Table 5. Comparison of 4 clustering algorithms on 4 synthetic data sets.

The clusterings shown in Fig. 3 disclose that the cluster centers of the three sparse clusters in dataset2 cannot be detected by DPC, while the other three algorithms detect all the cluster centers by finding density peaks. However, the cluster centers found by the 4 clustering algorithms are not completely the same, and the border points between cluster5 and cluster6 are grouped into different clusters by KNN-DPC, FKNN-DPC and SD_DPC. The clustering by SD_DPC is the closest to the original pattern of dataset2, which is not displayed because of page limitations.

The clusterings of dataset3 by the 4 clustering algorithms shown in Fig. 4 demonstrate the power of our SD_DPC. From the results in Fig. 4, it can be seen that DPC cannot detect the clustering of dataset3, while the other three algorithms can. KNN-DPC and FKNN-DPC are much better than DPC in finding the clustering of dataset3, but they are not as good as our SD_DPC because of the mistakes they make in finding cluster2 and cluster3 of dataset3. The detailed quantitative evaluation of the 4 clustering algorithms on dataset3 is shown in Table 5.

The clustering results in Table 5 reveal that our SD_DPC achieves the best performance among the 4 clustering algorithms, except that it is beaten by KNN-DPC on dataset1; KNN-DPC, our previous study, is also a good clustering algorithm. DPC achieves comparable performance only on a3.

The overall analysis demonstrates that our proposed SD_DPC is powerful in finding the cluster centers and the clustering of a data set, no matter whether the clusters are dense or sparse and whatever their shapes. We can therefore conclude that the proposed local density of a point based on its local standard deviation is valid.

4 Conclusions

This paper proposed to adopt the local standard deviation of point i to define its local density \(\rho _i\), so that the distribution information of the points in a data set can be absorbed as much as possible, overcoming the problem that DPC may not detect the cluster center of a sparse cluster by finding density peaks. As a consequence, the heuristic clustering algorithm named SD_DPC has been introduced. The performance of SD_DPC was tested on several synthetic data sets and compared with that of DPC, KNN-DPC and FKNN-DPC. The experimental results demonstrate that the proposed SD_DPC is superior to DPC, KNN-DPC and FKNN-DPC.