
1 Introduction

Clustering [1,2,3,4] is an important process in pattern recognition, machine learning, and other fields. It can be used as an independent tool to analyze data distribution and can be applied in image processing [3,4,5,6], data mining [3], intrusion detection [7, 8], and bioinformatics [9]. Without any prior knowledge, clustering methods assign points to different clusters according to their similarity, such that points in the same cluster are similar to each other while points in different clusters have low similarity. Clustering methods are divided into different categories [10]: density-based, centroid-based, model-based, and grid-based clustering methods.

Common centroid-based clustering methods include K-means [11] and K-medoids [12]. This type of method performs clustering by judging the distance between each point and the cluster centers. Therefore, these methods can only identify spherical or spherical-like clusters and need the number of clusters as prior knowledge [13].

Density-based clustering methods can identify clusters of arbitrary shapes and are not sensitive to noise [2]. Common and representative density-based clustering methods include DBSCAN [14], OPTICS [15], and DPC [16]. DBSCAN defines a density threshold by using a neighborhood radius Eps and a minimum number of points Minpts, and on this basis distinguishes core points from noisy points. As an effective extension of DBSCAN, OPTICS only needs to determine the value of Minpts and generates an augmented cluster ordering that represents the density-based clustering structure of each point. DPC, proposed by Rodriguez and Laio, is based on two assumptions: the cluster center is surrounded by neighbors with lower local density, and the distance between the cluster center and points with higher local density is relatively large. It can effectively identify high-density centers [16].

At present, most density-based clustering methods still have drawbacks. DBSCAN-like methods can produce good clustering results, but they depend on a distance threshold [17]. To avoid this dependence, ARKNN-DBSCAN [18] and RNN-DBSCAN [19] redefine the local density of points by using the number of reverse nearest neighbors. Jian et al. [20] proposed a cluster center recognition criterion based on relative density relationships, which is less affected by the density kernel and density differences. IDDC [21] uses a relative density based on K-nearest neighbors to estimate the local density of points. CSPV [22] is a potential-based clustering method; it replaces density with the potential energy calculated from the distribution of all points. The one-step clustering process in some methods may lead to continuity errors, that is, once a point is incorrectly assigned, more points may be assigned incorrectly [23]. To solve this problem, Yu et al. [24] proposed a method that assigns non-grouped points to suitable clusters according to evidence theory and the information of K-nearest neighbors, improving the accuracy of clustering. Liu et al. [25] proposed a fast density peak clustering algorithm based on shared nearest neighbors (SNN-DPC), which improves the clustering process and reduces the impact of the density peaks and the allocation process on the clustering results to a certain extent. However, the location and number of cluster centers still need to be selected manually from the decision graph.

These methods solve the problems of existing approaches to a certain extent, but they do not consider the importance of individual points, which may result in inaccurate density calculation. This paper attempts to solve the inaccurate definition of local density and the errors caused by a one-step clustering process, and therefore proposes a new density-based clustering method. Based on K-nearest neighbor density estimation [26, 27] and shared nearest neighbors [25, 28], we redefine the K-nearest neighbor density estimation used to calculate the local density, assigning a different importance to each neighbor of a given point. A new clustering process is also proposed: the number of shared nearest neighbors between a given point and the higher-density points is calculated first to identify the cluster the point belongs to, and the remaining points are allocated according to their distance to the nearest higher-density point. To some extent, this avoids the continuity error caused by directly assigning a point to the cluster of its nearest higher-density neighbor. After calculating the local density, all points are sorted in descending order, and the cluster centers are then selected from among the higher-density points. Through this process, the method can automatically discover both the cluster centers and the number of clusters.

The rest of this paper is organized as follows. Section 2 introduces relevant definitions and the proposed clustering method. In Sect. 3, we discuss the experimental results on synthetic and real-world datasets and compare the proposed method with other classical clustering methods according to several evaluation metrics. Section 4 summarizes the paper and discusses future work. Table 1 lists the symbols and notations used in this paper.

Table 1. Symbols and notations

2 Method

A new clustering method is proposed based on a new density estimation method and a new allocation strategy in the clustering process. The new density estimation method builds on K-nearest neighbor density estimation, the nonparametric density estimation method proposed by Fix and Hodges [26]. K-nearest neighbor density estimation [26, 27] is a well-known and simple density estimation method based on the following idea: the density function at a point of continuity can be estimated from the number of neighbors observed in a small region near the point. In the dataset \(X\left[N\right]={\left\{{x}_{i}\right\}}_{i=1}^{N}\), the estimate of the density function is based on the distance from \({x}_{i}\) to its K-th nearest neighbor. For points in regions of different density, the neighborhood size determined by the K nearest neighbors is adaptive, ensuring the resolution of high-density regions and the continuity of low-density regions.

For each \({x}_{i}\in {R}^{d}\), the estimated density of \({x}_{i}\) is:

$${\rho }_{k}\left(i\right)=\frac{K}{N*{V}_{d}*{{r}_{k}\left(i\right)}^{d}}$$
(1)

where \({V}_{d}\) is the volume of the unit sphere in \({R}^{d}\), K is the number of neighbors, and \({r}_{k}\left(i\right)\) represents the distance from \({x}_{i}\) to its K-th nearest neighbor in the dataset \(X\left[N\right]\).
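For illustration, the following minimal sketch implements the classical estimator in Eq. 1. It is not the authors' implementation: the function name knn_density and the use of SciPy for pairwise distances are our own illustrative choices.

```python
# A minimal sketch of the classical K-nearest-neighbor density estimate (Eq. 1).
import numpy as np
from math import gamma, pi
from scipy.spatial.distance import cdist

def knn_density(X, K):
    """Estimate the density of every point as K / (N * V_d * r_k(i)^d)."""
    N, d = X.shape
    V_d = pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit sphere in R^d
    D = cdist(X, X)                          # pairwise Euclidean distances
    # distance from each point to its K-th nearest neighbor
    # (column 0 of the sorted row is the point itself, at distance 0)
    r_k = np.sort(D, axis=1)[:, K]
    return K / (N * V_d * r_k ** d)
```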

2.1 The New Density Estimation Method

Based on the K-nearest neighbor density estimation, a new density estimation method is proposed. According to the number of shared neighbors between the given point and other points, the volume of a region containing the K shared nearest neighbors of the given point is used to estimate the local density. Generally, the parameter \(K={k}_{1}\times \sqrt{N}\), where k1 is a coefficient. The local density estimate is redefined as:

$${\rho }_{k}\left(i\right)=\frac{{K}_{i}}{N*{V}_{i}}$$
(2)

where N is the number of points in the dataset. For point \({x}_{i}\), \({K}_{i}\) is the weighted number of points according to shared nearest neighbors, and \({V}_{i}\) is the volume of the high-dimensional sphere with radius \({R}_{i}\); all the spheres considered in the experiments are closed Euclidean spheres.

The K-nearest neighbors [27, 29] of each point are the K points selected according to the distances between points. For points \({x}_{i}\) and \({x}_{j}\) in the dataset, their K-nearest neighbor sets are denoted \(KNN\left(i\right)\) and \(KNN\left(j\right)\). Based on the K-nearest neighbors, the shared nearest neighbors [25, 28] of \({x}_{i}\) and \({x}_{j}\) are their common K-nearest neighbors, expressed as:

$$SNN\left(i,j\right)=KNN\left(i\right)\cap KNN\left(j\right)$$
(3)

That is to say, the matrix \(SNN\) stores the numbers of shared nearest neighbors between pairs of points.
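As a concrete illustration of Eq. 3, the sketch below builds the K-nearest-neighbor sets and the matrix of shared-neighbor counts \(|SNN(i,j)|\); the names snn_matrix, knn_sets, and snn_counts are illustrative, not part of the original method.

```python
# A minimal sketch of Eq. 3: K-nearest-neighbor sets and shared-neighbor counts.
import numpy as np
from scipy.spatial.distance import cdist

def snn_matrix(X, K):
    N = X.shape[0]
    D = cdist(X, X)
    # indices of the K nearest neighbors of each point (excluding the point itself)
    knn_idx = np.argsort(D, axis=1)[:, 1:K + 1]
    knn_sets = [set(row) for row in knn_idx]
    snn_counts = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(i + 1, N):
            # |SNN(i, j)| = |KNN(i) ∩ KNN(j)|
            snn_counts[i, j] = snn_counts[j, i] = len(knn_sets[i] & knn_sets[j])
    return knn_sets, snn_counts, D
```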

For a given point \({x}_{i}\), the other points are sorted in descending order according to their number of shared nearest neighbors with \({x}_{i}\). Let point \({x}_{j}\) be the K-th shared nearest neighbor of \({x}_{i}\); then the neighborhood radius \({R}_{i}\) of point \({x}_{i}\) is defined as:

$${R}_{i}=D\left[i,j\right]$$
(4)

where D is the distance matrix of the dataset and the distances are Euclidean distances.

Given the radius \({R}_{i}\) of point \({x}_{i}\), the volume of its neighborhood can be calculated by:

$${V}_{i}={{R}_{i}}^{d}$$
(5)

where d is the dimension of the feature vector in the dataset.

In general, for point \({x}_{i}\), the importance of each of its neighbors is different, and their contributions to the density estimate of \({x}_{i}\) should also differ. In our definition, this contribution is related to the number of shared nearest neighbors between \({x}_{i}\) and each of its neighbors. Based on the number of shared neighbors between points \({x}_{i}\) and \({x}_{j}\), a weight coefficient is defined to assign different weights to the K-nearest neighbors of any point:

$$\upomega \left(i,j\right)=\frac{\left|SNN\left(i,j\right)\right|}{K}$$
(6)

where \(|SNN\left(i,j\right)|\) is the number of shared neighbors between point \({x}_{i}\) and point \({x}_{j}\).

\({K}_{i}\) is redefined by summing the weights of the K-nearest neighbors of point \({x}_{i}\), as shown in Eq. 7:

$${K}_{i}=\sum\nolimits_{j=1}^{K}\omega \left(i,j\right)$$
(7)

As a neighbor of \({x}_{i}\), if \({x}_{j}\) has more shared nearest neighbors with \({x}_{i}\), that is, if the weight of \({x}_{j}\) is larger, then \({x}_{j}\) contributes more to the local density of point \({x}_{i}\). Substituting Eq. 6 and Eq. 7 into Eq. 2 yields Eq. 8.

$${\rho }_{k}\left(i\right)=\frac{\sum_{j=1}^{K}\frac{\left|SNN\left(i,j\right)\right|}{K}}{N*{V}_{i}}$$
(8)

In summary, when calculating the local density of \({x}_{i}\), different weights are assigned to the K points falling in its neighborhood according to the number of shared neighbors. The more shared nearest neighbors a point has with \({x}_{i}\), the greater its contribution to the local density estimate of \({x}_{i}\).
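The sketch below ties Eqs. 4–8 together, reusing the distance matrix D, the sets knn_sets, and the counts snn_counts from the previous sketch. It is a rough illustration under our reading of Eq. 4 (the K-th point in the descending shared-neighbor ordering), not the authors' code.

```python
# A minimal sketch of the redefined local density (Eqs. 4-8).
import numpy as np

def weighted_knn_density(D, knn_sets, snn_counts, K, d):
    """D, knn_sets, snn_counts are assumed to come from the snn_matrix sketch above."""
    N = D.shape[0]
    rho = np.zeros(N)
    for i in range(N):
        # sort the other points by their number of shared neighbors with x_i
        order = np.argsort(-snn_counts[i])
        order = order[order != i]
        j = order[K - 1]                         # the K-th shared nearest neighbor of x_i
        R_i = D[i, j]                            # neighborhood radius (Eq. 4)
        V_i = R_i ** d                           # neighborhood volume (Eq. 5)
        # weights of the K nearest neighbors (Eq. 6) summed into K_i (Eq. 7)
        K_i = sum(snn_counts[i, n] / K for n in knn_sets[i])
        rho[i] = K_i / (N * V_i)                 # Eq. 8
    return rho
```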

2.2 A New Allocation Strategy in the Clustering Process

The allocation process of some clustering methods has poor fault tolerance: when one point is assigned incorrectly, more subsequent points are affected, which can have a severe negative impact on the clustering results [23, 24]. Therefore, a new clustering process is proposed to make the allocation more reasonable and, to a certain extent, avoid the continuity error caused by direct allocation.

In the proposed clustering method, all points are sorted in descending order according to their local density values. The sorted indices are stored in the array \(sortedIdx\left[1\dots N\right]\). Then, in the sorted queue, points are visited one by one from the highest local density to the lowest. The first point in the queue has the highest local density and automatically becomes the center of the first cluster. For each subsequent point \(sortedIdx\left[i\right]\) in the queue, two special points are identified among the already visited points: parent1 is the point nearest to \(sortedIdx\left[i\right]\), and parent2 is the point that shares the most nearest neighbors with \(sortedIdx\left[i\right]\). The number of shared nearest neighbors between \(sortedIdx\left[i\right]\) and parent2 is compared with K/2. If it is at least K/2 (that is, at least half of K), \(sortedIdx\left[i\right]\) is assigned to the cluster to which parent2 belongs. Otherwise, the distance between \(sortedIdx\left[i\right]\) and parent1 is compared with the given distance bandwidth parameter B. If the distance is greater than B, \(sortedIdx\left[i\right]\) becomes the center of a new cluster; if not, it is assigned to the cluster to which parent1 belongs. This process continues until all points have been visited and assigned to the proper clusters. The details of the proposed method are shown in Algorithm 1.

Algorithm 1. The proposed clustering method
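As a rough illustration of this allocation strategy (not the authors' implementation), the following sketch assumes the densities rho, the matrix snn_counts, and the distance matrix D from the earlier sketches, and reads the shared-neighbor condition as a threshold of K/2.

```python
# A sketch of the allocation strategy described in the text above (Algorithm 1).
import numpy as np

def assign_clusters(rho, snn_counts, D, K, B):
    N = len(rho)
    sorted_idx = np.argsort(-rho)             # points in descending order of density
    labels = np.full(N, -1, dtype=int)
    labels[sorted_idx[0]] = 0                 # the densest point starts the first cluster
    n_clusters = 1
    visited = [sorted_idx[0]]
    for p in sorted_idx[1:]:
        parent1 = min(visited, key=lambda v: D[p, v])           # nearest visited point
        parent2 = max(visited, key=lambda v: snn_counts[p, v])  # most shared neighbors
        if snn_counts[p, parent2] >= K / 2:
            labels[p] = labels[parent2]       # follow the shared-neighbor evidence
        elif D[p, parent1] > B:
            labels[p] = n_clusters            # far from all visited points: new cluster
            n_clusters += 1
        else:
            labels[p] = labels[parent1]       # otherwise join the nearest visited point
        visited.append(p)
    return labels
```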

3 Experiments

In this section, we use classical synthetic datasets and real-world datasets to test the performance of the proposed method. We take K-means, DBSCAN, CSPV, DPC, and SNN-DPC as the comparison methods. According to several evaluation metrics, the performance of the proposed method is compared with these five classical clustering methods.

3.1 Datasets and Processing

To verify the performance of the proposed method, we select real-world and synthetic datasets with different sizes, dimensions, and numbers of clusters. The synthetic datasets include Flame, R15, D31, S2, and A3. The real-world datasets include Iris, Wine, Seeds, Breast, Wireless, Banknote, and Thyroid. The characteristics of the datasets used in the experiments are presented in Table 2. The evaluation metrics used in the experiments are as follows: Normalized Mutual Information (NMI) [30], Adjusted Rand Index (ARI) [30], and the Fowlkes-Mallows Index (FMI) [31]. The upper bound of each metric is 1, and larger values indicate better clustering results.
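For reference, all three metrics are available in scikit-learn; the short sketch below shows one way to compute them, where evaluate, y_true, and y_pred are illustrative placeholders for ground-truth and predicted labels.

```python
# Computing NMI, ARI, and FMI with standard scikit-learn functions.
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score,
                             fowlkes_mallows_score)

def evaluate(y_true, y_pred):
    return {"NMI": normalized_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred),
            "FMI": fowlkes_mallows_score(y_true, y_pred)}
```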

Table 2. Characteristics of datasets

3.2 Parameters Selection

We set the parameters of each method so as to compare their best performance; the parameters corresponding to the optimal results of each method are chosen. The real number of clusters is supplied to K-means, DPC, and SNN-DPC.

The proposed method needs two key parameters: the number of nearest neighbors K and the distance bandwidth B. The selection of parameter B [22] is derived from the distance matrix D:

$$MinD\left(i\right)=\underset{j=1,\dots ,N,\,j\ne i}{\mathrm{min}}\left(D\left[i,j\right]\right)$$
(9)
$$B=\underset{i=1,\dots ,N}{\mathrm{max}}\left(MinD\left(i\right)\right)$$
(10)
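A minimal sketch of this parameter selection is given below; it assumes D is the pairwise Euclidean distance matrix and also applies the relation \(K={k}_{1}*\sqrt{N}\) discussed in the next paragraph. The function select_parameters and the default k1 value are illustrative only.

```python
# A sketch of the bandwidth selection in Eqs. 9-10 and of K = k1 * sqrt(N).
import numpy as np
from scipy.spatial.distance import cdist

def select_parameters(X, k1=1.0):
    N = X.shape[0]
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)          # exclude j = i when taking the minimum
    min_d = D.min(axis=1)                # MinD(i) in Eq. 9
    B = min_d.max()                      # Eq. 10
    K = int(round(k1 * np.sqrt(N)))      # relationship between K and N
    return K, B
```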

The parameter K is selected by the formula \(K={k}_{1}*\sqrt{N}\), which determines the relationship between K and N, where k1 is a coefficient. This parameter is related to the size of the dataset and its clusters. In the proposed method, k1 is limited to (0, 9] to adapt to different datasets. Figure 1 shows the FMI indices of some representative datasets for different k1 values. It can be seen that for datasets S2 and R15, the FMI index is not sensitive to k1 when k1 is within (0, 1.5), and for the Wine dataset, the FMI index is not sensitive to k1 over the whole range.

Fig. 1. Results on different datasets with different k1

3.3 Experimental Results

We conduct comparison experiments on 12 datasets and evaluate the clustering results with different evaluation metrics. In the following experiments, we first verify the effect of the new density estimation method proposed in this paper. Then, to test the effectiveness of automatically discovering the number of clusters, the proposed method is compared with the other methods. Finally, the complete proposed method is compared with the other five commonly used methods.

The New Density Estimation Method.

The new density estimation method, described in Sect. 2.1, is based on the original K-nearest neighbor density estimation [26]. To check whether the new method improves the accuracy of the local density calculation, a comparison experiment is conducted between the original method and the new method. Firstly, the original and the new density estimation methods are used to estimate the local density of the points. Secondly, after the local densities are calculated and sorted in descending order, the same clustering process, proposed by [22], is used to assign points. Finally, the clustering results on the datasets are evaluated with different metrics, as shown in Fig. 2 and Fig. 3.

Compared with the original method, the new method is superior on most real-world datasets but slightly worse on Seeds. On the synthetic datasets R15, D31, S2, and A3, the new method produces good clustering results, which differ little from those of the original method. In summary, the new method shows an advantage over the original density estimation method.

Fig. 2. Comparison of density estimation with the original method on FMI.

Fig. 3. Comparison of density estimation with the original method on NMI.

The Effectiveness of the Automatically Discovered Number of Clusters.

Comparison experiments are conducted among DBSCAN, CSPV, and the proposed method to verify the validity of the automatically discovered number of clusters. The proposed method is not compared with K-means, DPC, and SNN-DPC here because the real number of clusters is used in those methods. The experimental results are shown in Table 3. The accuracies of the number of clusters discovered by DBSCAN, CSPV, and the proposed method are 42%, 42%, and 83%, respectively. On the 5 synthetic datasets, the proposed method correctly discovers the number of clusters, and it outperforms DBSCAN and CSPV on the real-world datasets Iris, Seeds, Breast, and Wireless. In summary, the proposed method is better than DBSCAN and CSPV at automatically discovering the number of clusters.

Table 3. The number of clusters discovered by different methods

Experiments on the Different Datasets.

The experiments are conducted on the different datasets, and the results are presented in Table 4. From Table 4, the proposed method achieves the best clustering results among the compared methods on the real-world datasets Iris, Seeds, Breast, Wireless, and Banknote. On Wine, the result of the proposed method is also the best. For the Thyroid dataset, the proposed method performs better than DPC and SNN-DPC.

Table 4. Comparison of clustering results on datasets with three measures

The clustering results of the proposed method on the 5 synthetic datasets are shown in Fig. 4. Among these datasets, the proposed method has the best clustering results on Flame, S2, and A3; on Flame in particular, the result matches the original data labels exactly. On the dataset D31, the proposed method is slightly worse than the best method. The proposed method generates the same results as K-means, DPC, and SNN-DPC on dataset R15, but it can discover the number of clusters automatically. On the synthetic datasets, the results of the proposed method are similar to, but slightly better than, those of SNN-DPC. The proposed method also outperforms the other five methods on most real-world datasets.

Fig. 4. Clustering results of the proposed method on synthetic datasets

In summary, the proposed method outperforms the other methods in the effectiveness of the clustering results in most cases. These results show that our redefinition of the local density and the new clustering process are effective.

4 Conclusion

In this paper, a new clustering method is proposed based on K-nearest neighbor density estimation and shared nearest neighbors. When calculating the local density, the number of points in the neighborhood and their different contributions are considered, which improves the accuracy of the local density calculation to a certain extent. The experiments show that the proposed method can adapt to most datasets and that using the improved local density estimation improves clustering performance. The proposed method has a parameter K; the formula \(K={k}_{1}\times \sqrt{N}\) is used to determine the relationship between K and N, where k1 is a coefficient. Although k1 is limited to a reasonable range, it still has a considerable influence on the clustering results for some datasets. As a possible direction for future work, we will explore reducing the influence of the parameter K on the clustering results.