1 Introduction

Clustering is a fundamental approach to organize data into distinct groups for finding intrinsic hidden patterns of the data. It can be applied on various fields such as image processing [15], cyber security [6, 7], pattern recognition [8, 9], bioinformatics [1014], protein analysis [15, 16], micro-array analysis [17], and social networks [18]. Clustering algorithms attempt to group more similar data into the same cluster, while dissimilar data are organized into different clusters. Many clustering algorithms have been proposed based on different characteristics and can be further categorized into portioning-based [1921], density-based [2228], model-based [29, 30], hierarchical-based [3134], and grid-based [35] approaches.

K-means [20] is a state-of-the-art clustering algorithm. It partitions data into k number of partitions, and then each partition is iteratively optimized to get the optimized clusters. K-means creates spherical clusters and not sensible to detect outliers or noise in the data. K-means is simple to understand and implement. The effectiveness of K-means is subject to the appropriate knowledge of number of clusters and selection of initial centroids.

Affinity propagation (AP) [36] is effective clustering approach based on message passing between data points. Unlike K-means, etc., AP does not need the prior knowledge of the number of clusters or selection of initial centroid. However, time and space complexity of AP is much higher than K-means.

Mean shift [37] is a famous kernel density estimation-based clustering algorithm. It is successfully used to image segmentation, image clustering, visual tracking, and air-travel routines. However, the effectiveness of mean shift depends upon the window size, which is used as bandwidth to estimate the densities. Time complexity of mean shift is also higher than K-means.

Recently, density-based clustering approaches have gained popularity among researchers. Density-based clustering algorithms attempt to find arbitrary shapes of clusters in the large spatial domain of datasets even in the presence of noise. These approaches require minimum domain knowledge to organize data into clusters [22].

DBSCAN [23] is a state-of-the-art density-based clustering algorithm that discovers arbitrary shapes of clusters utilizing minimum domain knowledge about data. However, the effectiveness of DBSCAN is subjected to the appropriate selection of input parameters, and it is not fully deterministic for border points and could not perform well in overlapping densities. A various number of variant have been proposed to overcome these limitations, such as OPTICS [24], DBCLASD [25], VDBSCAN [26], ST-DBSCAN [28].

A new density-based clustering algorithm by fast search and find of density peaks (CFSFDP) was proposed by Alex et al. [38]. CFSFDP tends to find density peaks for the selection of the cluster centers, effectively and efficiently with minimum human interaction. CFSFDP provides the heuristic approach of decision graph to the analyzer for the selection of the cluster center. The human-based selection of cluster center is a big limitation toward the spontaneous analysis of data using CFSFDP.

To overcome the aforementioned limitation of CFSFDP, we propose a fuzzy-CFSFDP for adaptive selection of the center clusters. Fuzzy-CFSFDP finds all density peaks and treats each peak as local cluster and then merges local clusters to find the global cluster. The merging process on local clusters merges them into global if more than two density peaks would closer to each other and possess an average density at the shared border region of clusters.

The rest of paper is organized as follows. The related work is presented in Sect. 2. Section 3 describes the problem formulation and proposed method. Experimental results are presented and discussed in Sect. 4, and finally, the concluding remarks and future work are presented in Sect. 5.

2 Related work

CFSFDP provides a unique solution of fast clustering by finding of density peaks in the dataset. CFSFDP is based on two assumptions that the cluster center is a highly dense point as compared with its surrounding neighbors and it is located at a large distance from other cluster centers as compared with its local data points. For each data point i, CFSFDP calculates its local density (\(\rho _i\)) and distance (\(\delta _i\)) from nearest high dense point. \(\rho _i\) of a point i is calculated as follow:

$$\begin{aligned} \rho _i=\sum _jX\left( d_{ij}-d_c\right) , \end{aligned}$$
(1)

where

$$\begin{aligned} X(d)=\left\{ \begin{array}{ll} 1\quad d<0\\ 0 \quad {\text{otherwise}} \end{array}\right. \end{aligned}$$

where \(d_{ij}\) is the distance between point i to j and \({d}_{c}\) is the cutoff distance. \({d}_{c}\) is an important parameter to calculate the densities and can be selected based on the heuristic approach that in average there exist 1 to 2 % of neighbors in a dataset [38]. The effectiveness of CFSFDP potentially depends upon the appropriate choice of \(d_c\). For small datasets, estimation of \(\rho _i\) using Eq. 1 might be affected by a large statistical error [38], in such case the methods of [37, 39] for estimating the densities are suggested. Equation 1 simply counts the number of points that are closer than \({d}_{c}\) to i. However, the distance of each data point i can be calculated as follows:

$$\begin{aligned} \delta _i = \left\{ \begin{array}{ll} {\mathop {\min }_{ j:\rho {}_j>\rho {}_i} {\left( d_{ij}\right) }\ }\quad {\text{if}}\ \ \exists \ \ j \ \ s.t. \rho _j> \rho {}_i \\ {\mathop {\max }_{j :\rho _j >\rho _i} {\left( d_{ij}\right) }\ } \quad {\text{otherwise.}} \end{array} \right. \end{aligned}$$
(2)

Cluster centers possess large \(\rho\) and \(\delta\) as compared with other cluster points while the data points having higher \(\delta\) and low \(\rho\) are treated as hallo clusters (suitable to declare as noise or outliers).

Data points with high local or global density have the maximum value of \(\delta {}\). Therefore, \(\delta _i\) is much larger for locally or globally high dense data points. CFSFDP provides a heuristic approach to analyzer for the selection of expected cluster centers manually by plotting calculated statistics of \(\rho\) and \(\delta\) on a decision graph, as shown in Fig. 1b.

Figure 1a contains 28 data points that are shown with decreasing density order, while the calculated crossposting values of \(\rho\) and \(\delta\) are plotted in Fig. 1b. In Fig. 1b, points 1 and 10 have the high density with high value of \(\delta\), which is the characteristics of cluster center. However, points 26, 27 and 28 have high values of \(\delta\) and low values of \(\rho\), hence can be considered as outliers or noise. Thus, decision graph is a key feature of CFSFDP to select cluster centers with minimum human interaction.

After successful declaration of cluster centers, remaining data points are assigned to the cluster centers in a single round based on minimum nearest distance to cluster center.

Fig. 1
figure 1

CFSFDP in two dimensions. a Data points distribution. b Decision graph for data in a [38]

Furthermore, to refine and separate noise from the clusters, a border region is identified for each cluster. Border region is defined as a set of points that are at \({d}_{c}\) distance from other cluster’s points. CFSFDP finds maximum dense points at border region, known as \({\rho }_b\). Data points that have higher density than \(\rho _b\) are considered as cluster core, and rest of them are declared as noise or outlier know as cluster halos.

Algorithm 1: Fuzzy clustering by fast search and find of density peaks

Require: D distance matrix, Output: Organized clusters

  1. 1.

    calculate \(\rho _i\) from Eq. 1

  2. 2.

    calculate \(\delta _i\) from Eq. 2

  3. 3.

    plot \(\rho\) and \(\delta\) on decision graph

  4. 4.

    select cluster centers through decision graph

  5. 5.

    assign remaining points to local cluster centers

  6. 6.

    check the border point conditions for created clusters.

3 Proposed method

In this section, we explain the problem formulation and then propose fuzzy clustering by fast search and finding of density peaks, in detail.

3.1 Problem formulation

In CFSFDP, decision graph is a heuristic approach for analyzers to manually select the expected cluster centers on the basis of high density and high distance values. The human-based selection of cluster centers is a potential barrier toward automatic analysis of data. According to CFSFDP, a cluster center has higher \(\rho\) and large value of \(\delta\) as compared with non-center data points. Thus, on the decision graph, expected clusters always possess large value of \(\delta\) as compared with the non-cluster data points. However, in some cases single cluster contains more than one density peak and CFSFDP considers each different density peak as a potential cluster center that makes difficult for human to select exact number of clusters in a dataset. These phenomena can be easily observed from decision graph of aggregation dataset, as presented in Fig. 2c. To make effective selection of clusters on decision graph, human should be expert to the domain of underlying dataset.

The key limitations of CFSFDP are: (1) human-based selection of cluster centers is a big barrier in intelligent analysis of data, (2) when one cluster contains more than one density peaks, it is hard to identify cluster centers through decision graph.

3.2 Fuzzy clustering by fast search and finding of density peaks

Fuzzy-CFSFDP is an adaptive way to select exact number of cluster without human intervention and merge clusters if density peaks would closer to each other and possess an average density at the shared border region of clusters. Fuzzy-CFSFDP is based on the fact that:

$$\begin{aligned} {EC}_i=\left( \delta _i\right) \ge 2\sigma \left( \delta _i\right) , \end{aligned}$$
(3)

where, \({EC}_i\) presents the expected cluster centers, \(\sigma \left( {\delta _i}\right)\) is the standard deviation of all distances calculated by Eq. 2. Equation 3 is derived from the definition of cluster center given in CFSFDP. According to CFSFDP, cluster center has large distance from other cluster centers; therefore, the rest of data points should be less than \(2\sigma \left( \delta \right)\). Therefore, \({EC}_i\) contains noisy data points and all those points which potentially might be cluster centers. According to CFSFDP, noisy data points possess high value of \(\delta\) but low value of \(\rho\). Therefore, the noise from expected cluster centers can be separated by using the following equation:

$$\begin{aligned} {LC}_i={EC}_i\ge \mu ({\rho _i}), \end{aligned}$$
(4)
Fig. 2
figure 2

Decision graph of CFSFDP and clustering results of fuzzy-CFSFDP in aggregation dataset. a Decision graph of aggregation dataset representation created by CFSFDP. b Presents the decision graph of aggregation dataset created by fuzzy-CFSFDP. Colored points represent the number of expected cluster centers, and all non-center points are adjusted as \(\delta =0\) for better understanding of readers. c Points out the local clustering results of fuzzy-CFSFDP method without merging of closer densities. d Final cluster created by fuzzy-CFSFDP after merging the local clusters into global clusters (color figure online)

where \({LC}_i\) are local cluster centers without noisy data points, and \(\mu {(\rho _i})\) is the mean of all estimated values of \({{\rho _i}}\). The local cluster centers should be the points that have \(\rho\) greater or equal to the average values of \({\rho }\), because noisy data points are data points which have low \(\rho\) in a dataset. In this way, local cluster centers are the data points that have large distance and higher densities as compared with the neighbor’s data points. After the selection of local cluster centers, fuzzy-CFSFDP assigns remaining data points to each cluster center based on their minimum distance from each local cluster center. The next step is to merge the local cluster into the global clusters. To merge clusters into global clusters, fuzzy-CFSFDP finds minimum distance between local clusters and merges into single cluster if that cluster resides at a \(d_c\) distance from other cluster with average density.

Algorithm 2: Fuzzy clustering by fast search and find of density peaks

Require: D distance matrix, Output: Organized clusters

  1. 1.

    calculate \(\rho _i\) from Eq. 1

  2. 2.

    calculate \(\delta _i\) from Eq. 2

  3. 3.

    find \({EC}_i\) from Eq. 3

  4. 4.

    find \({LC}_i\) from Eq. 4

  5. 5.

    assign remaining points to local cluster centers

  6. 6.

    merge local clusters into global clusters

  7. 7.

    check the border point conditions for created clusters.

3.3 Complexity analysis

The fuzzy-CFSFDP uses \({\mathcal {O}}(n)\) operation to find the expected cluster centers and further takes \({\mathcal {O}}(n)\) operation to refine the expected clusters to discover local clusters. To merge local cluster into global clusters , proposed method finds two nearest clusters and merges only and only if border points are at \(d_c\) distance with average density from other cluster border region. To merge two local clusters into global clusters, it needs \({\mathcal {O}}(LC_i * LC_j)\) operations, where i and j are data points that belong to cluster \(LC_i\) and \(LC_j\).

4 Experiments

To evaluate the robustness of our proposed method, clusters created by fuzzy-CFSFDP are compared with state-of-the-art clustering methods on synthetic clustering datasets. The details of used datasets are shown in Table 1.

Table 1 The detail description of datasets

4.1 Result and discussion

To evaluate the performance of fuzzy-CFSFDP method, we used the aggregation dataset. In aggregation dataset, some clusters are composition of different densities. The CFSFDP method tends to find the maximum dense point in each density and highlight it as a cluster center in decision graph, as shown in Fig. 2a. To select the exact number of cluster centers from decision graph in Fig. 2a, human should have the domain knowledge of underlying dataset. However, fuzzy-CFSFDP first finds the local cluster centers by utilizing Eq. 3 and then removes the noise points from expected cluster centers by using Eq. 4. After Eq. 4, the local clusters are shown in Fig. 2c. Ten local expected cluster centers are shown in Fig. 2b, where the \(\delta\) of non-cluster points are adjusted as zero to separate non-cluster points from cluster centers. After the identification of local expected cluster centers, the rest of points are assigned to the cluster centers in a single round. The next step is to merge the expected local clusters into global clusters if two clusters shared the border region and average density greater than \(d_c\) is found then they are merged into single cluster. The global clusters of aggregation dataset are shown in Fig. 2d.

In path-based spiral dataset, fuzzy-CFSFDP detects four local clusters as shown in Fig. 3a. In Fig. 3a, colors points presents the cluster centers and have higher value of \(\delta\). However, the \(\delta\) of non-center points are adjusted as zero to separate non-cluster points from cluster centers. The four local clusters of path-based spiral dataset are shown in Fig. 3b. In next step, fuzzy-CFSFDP merges only two local clusters into one cluster because they shared an average threshhold density at shared border region. The global clusters created by fuzzy-CFSFDP are shown in Fig. 3c.

Fig. 3
figure 3

Fuzzy-CFSFDP detection of cluster centers, expected and global created clusters. a Detected expected cluster centers in dataset, the colored points are expected local cluster centers and black point are cluster non-center points. b Presents local cluster of path-based spiral dataset. c Points out the global clusters of path-based spiral dataset after merging closed densities into single cluster (color figure online)

Fig. 4
figure 4

Identified cluster centers and created clusters in toys problem and R15 datasets by fuzzy-CFSFDP. a Two colored points presented as expected local cluster centers, which have higher \(\delta\) value. b Clusters of toys problem created by fuzzy-CFSFDP. c 15 clusters centers organized by fuzzy-CFSFDP in R15 dataset. d Shows the created cluster of R15 dataset by assigning points to cluster centers in a single round (color figure online)

We also use small datasets like toys problem and R15 to evaluate the robustness of proposed fuzzy-CFSFDP method. In both toys problem and R15 datasets, fuzzy-CFSFDP successfully identifies the exact cluster centers as shown in Fig. 4a, c. In both figures, the cluster centers are those points, which have higher value of \(\delta\). For non-cluster points \(\delta\) is adjusted as zero to separate non-clusters points from cluster points. The shared densities at border regions are not sufficient to merge the local clusters into global clusters. So in this case, merging process is skipped and local clusters are considered as global clusters. In these cases, the computational cost of fuzzy-CFSFDP and CFSFDP is almost same. The global clusters of these datasets are shown in Fig. 4b, d, respectively.

Fig. 5
figure 5

Detection of cluster centers in large datasets and effectively separation of clusters using proposed method. a Identified 31 cluster centers in D31 dataset by fuzzy-CFSFDP. Cluster centers are denoted with colored points, and non-center points are shown having 0 value of \(\delta\). b 31 organized clusters of D31 dataset by fuzzy-CFSFDP. c Fuzzy-CFSFDP discovered 20 cluster centers in A1 dataset, having nonzero value of \(\delta\). d Shows the 20 clusters of A1 dataset, created by fuzzy-CFSFDP (color figure online)

To benchmark our proposed fuzzy-CFSFDP method on large datasets, we use Dim-2, A1, D31, and S1 datasets. In all these datasets, fuzzy-CFSFDP successfully identifies exact number of clusters without further merging the local clusters. In these datasets, each cluster contains only a single density peak. If two density peaks do not share common boarder points with average density found in cluster, fuzzy-CFSFDP skips the merging process in such scenario (Fig. 5).

Fig. 6
figure 6

Comparison of proposed fuzzy-CFSFDP with state-of-the-art clustering algorithms. aThe ideally separated 2 clusters of flame dataset. b Clusters obtained by K-means, at k =  2. c Two clusters created by mean shift clustering method at optimal size of window. d 13 clusters created by affinity propagation clustering of flame dataset. e Two clusters created by using CFSFDP in flame dataset. f Ideal clusters created by fuzzy-CFSFDP of flame dataset

We utilized gene dataset (flame) to evaluate the effectiveness of fuzzy-CFSFDP. We also make comparison with the state-of-the-art methods to validate the robustness of fuzzy-CFSFDP method. On flame dataset, K-means creates spherical clusters because it is not sensitive to detect the connected densities, as shown in Fig. 6b. Affinity propagations create 13 clusters of flame dataset, as shown in Fig. 6d. Both the K-means and AP are partition-based clustering algorithms and make partitions of more likely spherical in shapes and hence could not find relations between connected densities. However, the results of mean shift are better than K-means and affinity propagation. Mean shift exactly finds two clusters at optimum value of window size. However, in mean shift it is harder to find the exact number of windows size. The number and shape of clusters depend upon the window size in mean shift. Mean shift miss-classifies only four data points on flame dataset, as shown in Fig. 6c. In CFSFDP, the performance of method highly depends upon the appropriate choice of\(\ d_c\) distance and selection of appropriate cluster centers over decision graph. The flame cluster results of CFSFDP , at \(\hbox {dc} = 0.710634\) (1 % whole dataset) are shown in Fig. 6e. Basically, CFSFDP detects all density peaks in dataset and assigns maximum \(\delta\) values to each density peak. However, in some datasets different density peaks exist in a single cluster. In this scenario, CFSFDP also assigns maximum distance to each peak in a single cluster that results in a potentially more cluster center. Therefore like flame dataset, CFSFDP could not find better clusters, as shown in Fig. 6e. However, fuzzy-CFSFDP is capable to detect exact number of clusters by utilizing the defined fuzzy rules and merge the local clusters into global clusters to refine the local clusters.

The proposed fuzzy-CFSFDP works in two steps: first it finds the local clusters and then merge the local clusters if different clusters shared a threshhold density at border region of the clusters. Table 2 shows the details of local and global clusters. In all tested datasets, only aggregation and spiral datasets contain connected densities and hence needed the merging process to merge local clusters into global clusters. In rest of other datasets, local clusters and global clusters are same. The fuzzy-CFSFDP does not perform the merging process on these datasets.

Table 2 The number of local clusters and global clusters in each dataset created by proposed fuzzy-CFSFDP

5 Conclusions

In this paper, a method for adaptively selection of cluster centers for CFSFDP, named as fuzzy-CFSFDP is proposed. Fuzzy-CFSFDP utilizes the fuzzy rules to select cluster centers for different density peaks and then merges density peaks in case of having similar intrinsic patterns. CFSFDP uses a heuristic approach, known as decision graph to select cluster centers manually. In fuzzy-CFSFDP, we have overcome the limitation of manual selection of cluster centers. Experiments on nine synthetic datasets present the robustness and effectiveness of our proposed fuzzy-CFSFDP method. We also compared the results with the state-of-the-art methods for validation of the effectiveness of our proposed fuzzy-CFSFDP method.

Fuzzy-CFSFDP provides robust performing in static data; however, nowadays more and more data are appearing in a dynamic manner. In future, we will try to make an incremental fuzzy-CFSFDP to deal with stream and big data.