
1 Introduction

Data clustering has wide application in pattern recognition, image analysis and fault diagnosis [14, 15]. In the past decades, much effort has been devoted to data clustering and various clustering algorithms have been proposed. Unfortunately, in applying these algorithms to real clustering tasks, there are still many problems to be solved. The k-means-like algorithms, NCuts [12] and the general spectral clustering algorithms [17] use the number of clusters as input, and their results rely heavily on this parameter. In addition, these algorithms tend to generate clusters of spherical shapes, and their results are also influenced by cluster center initialization. The DBSCAN [5], AP [1] and DSets [10] algorithms are able to determine the number of clusters automatically. However, all three algorithms have their own problems. The DBSCAN algorithm depends on two parameters, Eps and MinPts, for density estimation; in general, other density based clustering algorithms also depend on user-specified density parameters. The AP algorithm must be fed the preference value of each data point, which significantly impacts the clustering results. The DSets algorithm is built on the pairwise similarity matrix of the data to be clustered and is parameter independent in itself. However, when the data are represented as feature vectors, the estimation of data similarity usually introduces one or more similarity parameters, which have been found to influence the clustering results [7, 8]. Besides, both the AP and DSets algorithms tend to generate only spherical clusters. These observations show that it is still necessary to explore new clustering algorithms, although a vast number of clustering algorithms already exist in the literature.

Our work in this paper builds on the density peak (DP) based clustering algorithm proposed in [11]. The DP algorithm is based on the assumption that cluster centers are density peaks and are relatively far from each other. By representing each data point i with its local density \(\rho _i\) and the distance \(\delta _i\) to its nearest neighbor of higher density in a decision graph, it is found that cluster centers have both high \(\rho \) and high \(\delta \), whereas non-center data have either small \(\rho \) or small \(\delta \). As a result, the cluster centers are isolated from the non-center data and it is relatively easy to differentiate between the two kinds of data. By assuming that each data point has the same label as its nearest neighbor of higher density, all the non-center data can then be grouped into clusters sequentially.

Local density calculation is the key of the DP algorithm, as it determines \(\rho \), \(\delta \) and then the cluster centers. In [11] the authors use cutoff and Gaussian kernels, both of which involve a cutoff distance \(d_c\). Although [11] presents an empirical method to calculate the range of \(d_c\), we have found that the clustering results vary significantly with different \(d_c\)'s in this range. In order to solve this problem, in this paper we present a non-parametric density kernel based on the DSets algorithm. One important feature of the DSets algorithm is that each data point in a cluster is assigned a weight. Our study of the dominant set definition shows that this weight reflects the relationship of the data point with all the others, and can be viewed as a measure of the local density. By calculating the pairwise data similarity matrix properly, all the data can be included in one single cluster, and therefore the weights of all the data can be obtained with the DSets algorithm. We show that this process can be accomplished independently of user-specified parameters. The effectiveness of our algorithm is demonstrated in experiments and comparisons with other density kernels.

2 Density Peak Clustering

2.1 The DP Algorithm

The DP algorithm is proposed based on the following observations. First, cluster centers are usually the density peaks in their neighborhoods. This means that compared with non-center data, cluster centers have relatively large local density \(\rho \). Second, in practice few data points share the same local density, so the distance \(\delta \) from a non-center data point to its nearest neighbor of higher density is usually small. In contrast, cluster centers are surrounded by data of lower density, so their \(\delta \)'s are relatively large. In summary, cluster centers usually have both large \(\rho \)'s and large \(\delta \)'s, whereas non-center data have either small \(\rho \)'s or small \(\delta \)'s. This difference makes it possible to isolate and identify cluster centers from non-center data. Then, based on the assumption that each data point has the same label as its nearest neighbor of higher density, the non-center data can be grouped into clusters. Although this assumption has no theoretical grounds, it is consistent with human intuition and works well in experiments.

From the above description we see that the key of the DP algorithm is the calculation of local density. While the local density can be estimated in different ways, existing approaches usually involve user-specified parameters, which may influence the density values and in turn the clustering results. For example, the cutoff kernel and the Gaussian kernel are used in [11] to calculate the local density, and both kernels involve the cutoff distance \(d_c\). The cutoff kernel measures the density by the number of data points in the neighborhood of radius \(d_c\), and the Gaussian kernel uses \(d_c\) as the decay parameter. After the local densities \(\rho \) are calculated, the distance \(\delta _i\) is obtained by

$$\begin{aligned} \delta _i= \min _{j \in S,\rho _j>\rho _i} d_{ij}. \end{aligned}$$
(1)
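As a minimal sketch of these two kernels and of Eq. (1) (our own illustration with hypothetical function names, not code from [11]), the densities and \(\delta \)'s can be computed from the pairwise distance matrix as follows; for the globally densest point, whose set of higher-density neighbors is empty, we follow the common convention of assigning it the largest pairwise distance.

```python
import numpy as np

def local_density(dist, d_c, kernel="gaussian"):
    """Local density rho of every point from the (n, n) distance matrix dist."""
    if kernel == "cutoff":
        # number of points closer than d_c, excluding the point itself
        return (dist < d_c).sum(axis=1) - 1.0
    # Gaussian kernel with d_c as the decay parameter
    return np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0

def delta_from_rho(dist, rho):
    """Distance to the nearest neighbor of higher density, Eq. (1)."""
    n = dist.shape[0]
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:
            # globally densest point: conventionally assigned the largest distance
            delta[i] = dist[i].max()
        else:
            delta[i] = dist[i, higher].min()
    return delta
```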

With the \(\rho \)'s and \(\delta \)'s of all the data, we use a \(\rho -\delta \) decision graph to illustrate the relationship between the cluster centers and the non-center data in the \(\rho -\delta \) space. Taking the Spiral dataset [3] as an example, we show the \(\rho -\delta \) decision graph in Fig. 1(b). For space reasons, here we use only the Gaussian kernel, and \(d_c\) is calculated by including 1.6% of all the data in the neighborhood.

Fig. 1. The Spiral dataset and two decision graphs with the Gaussian kernel.

It is evident in Fig. 1(b) that there are three data points with both large \(\rho \)'s and large \(\delta \)'s, and they appear as outliers of the whole set. Obviously these three data points are the centers of the three clusters. Considering that identifying cluster centers with the \(\rho -\delta \) decision graph involves two thresholds, [11] further proposes to use \(\gamma =\rho \delta \) as a single criterion for cluster center selection. We sort the data in decreasing order of \(\gamma \) and obtain the \(\gamma \) decision graph in Fig. 1(c), where the three cluster centers with large \(\gamma \)'s can be recognized relatively easily.

In general, while the \(\rho -\delta \) decision graph and the \(\gamma \) decision graph are helpful in identifying cluster centers, it is still quite difficult to find the correct cluster centers fully automatically. In this paper we assume that the number of clusters, N, is determined beforehand, and use the data with the N largest \(\gamma \)'s as the cluster centers.
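This selection step and the subsequent label assignment, where every non-center point inherits the label of its nearest neighbor of higher density, can be sketched as follows (again our own illustration; `dist`, `rho` and `delta` are the quantities computed above):

```python
import numpy as np

def dp_assign(dist, rho, delta, n_clusters):
    """Pick the n_clusters points with largest gamma = rho * delta as centers,
    then propagate labels along nearest higher-density neighbors."""
    n = dist.shape[0]
    gamma = rho * delta
    centers = np.argsort(-gamma)[:n_clusters]

    labels = -np.ones(n, dtype=int)
    labels[centers] = np.arange(n_clusters)

    # process points from high to low density, so a point's nearest
    # higher-density neighbor is already labeled when the point is reached
    for i in np.argsort(-rho):
        if labels[i] >= 0:
            continue
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:
            # the globally densest point is normally a center already;
            # as a fallback, attach it to the nearest center
            labels[i] = labels[centers[np.argmin(dist[i, centers])]]
            continue
        nearest = higher[np.argmin(dist[i, higher])]
        labels[i] = labels[nearest]
    return labels
```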

2.2 The Problems

In both the cutoff and Gaussian kernels the parameter \(d_c\) needs to be specified, and it is suggested in [11] to determine \(d_c\) such that 1% to 2% of all data are included in the neighborhood on average. However, we have found that with both kernels, different values of \(d_c\) in this range cause significant variation in the clustering results. In addition, the best results may not be obtained with \(d_c\) in this range. We show how the clustering results vary with the percentage used to calculate \(d_c\) in Fig. 2, where the F-measure is used to evaluate the clustering results. Eight datasets, namely Aggregation [6], Compound [16], Spiral, R15 [13], Jain [9] and three UCI datasets Thyroid, Iris and Breast, are used in the experiments.

Fig. 2. The influence of the percentage in calculating \(d_c\) on the clustering results.

Figure 2 indicates that the percentage, and hence the parameter \(d_c\), has a significant influence on the clustering results. Unfortunately, we cannot draw any useful conclusion about the appropriate range of \(d_c\) from Fig. 2. In addition, it is not clear how the clustering results are correlated with \(d_c\). Consequently, it is very difficult to determine an appropriate \(d_c\).

3 Our Algorithm

In order to solve the parameter dependence problem of existing density kernels, in this paper we present a non-parametric density kernel by making use of the nice properties of the DSets algorithm. In this section we first introduce the dominant set definition, and then present in detail how the DSets algorithm can be used to calculate the local density.

3.1 Dominant Set

In order to derive the non-parametric density kernel based on the dominant set, we first briefly present the definition of a dominant set. The details can be found in [10].

We use S to denote the set of data for clustering, and \(A=(a_{ij})\) to represent the pairwise similarity matrix. With D as a non-empty subset of S and \(i \in D\), \(j \notin D\), we measure the relationship of j and i by

$$\begin{aligned} \phi _D(i,j)=a_{ij}- \frac{1}{|D|} \sum \limits _{k \in D} a_{ik}, \end{aligned}$$
(2)

with |D| denoting the number of data points in D. Then we define

$$\begin{aligned} w_D(i)= {\left\{ \begin{array}{ll} 1, &{}\text {if}\quad |D|=1,\\ \sum \limits _{l \in D \setminus \{i\}} \phi _{D \setminus \{i\}}(l,i)w_{D \setminus \{i\}}(l),&{}\text {otherwise}. \end{array}\right. } \end{aligned}$$
(3)

With this key variable and \(W(D)=\sum _{i \in D}w_D(i)\), the formal definition of a dominant set can be presented as follows. A subset D such that \(W(T) > 0\) for all non-empty \(T \subseteq D\) is called a dominant set if

  1. \(w_D(i)>0\), for all \(i \in D\);

  2. \(w_{D \bigcup \{i\}}(i)<0\), for all \(i \notin D\).
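To make the definition concrete, the following sketch (purely illustrative, with names of our own choosing, and only practical for very small D since the recursion enumerates subsets) computes \(\phi _D(i,j)\) and \(w_D(i)\) directly from Eqs. (2) and (3):

```python
from functools import lru_cache

def make_weight_fn(A):
    """Return a function w(D, i) implementing Eq. (3) for similarity matrix A.
    D is a frozenset of indices with i in D; A is any 2-D indexable array."""

    def phi(D, i, j):
        # Eq. (2): a_ij minus the average similarity of i to the members of D
        return A[i][j] - sum(A[i][k] for k in D) / len(D)

    @lru_cache(maxsize=None)
    def w(D, i):
        if len(D) == 1:
            return 1.0
        rest = D - {i}
        return sum(phi(rest, l, i) * w(rest, l) for l in rest)

    return w
```

For instance, `make_weight_fn(A)(frozenset(range(4)), 0)` evaluates \(w_D(0)\) for a four-point set; in practice these weights are obtained far more efficiently with the game dynamics discussed next.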

In [10] the authors show that a dominant set can be extracted with game dynamics, e.g., the replicator dynamics developed in evolutionary game theory. Specifically, we use \(x \in R^n\) to denote the weight vector of the data, which can be obtained with such dynamics; in this paper we adopt the more efficient dynamics proposed in [2]. It is shown that this weight vector corresponds to the weighted characteristic vector \(x^D\) of a dominant set D, which is defined as

$$\begin{aligned} x_i^D= {\left\{ \begin{array}{ll} \frac{w_D(i)}{W(D)},&{}\text {if}\quad i \in D,\\ 0,&{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

In other words, after we obtain the weight vector, the data with positive weights form a dominant set. In extracting a dominant set, the weights of all the data for clustering can be initialized to \(\frac{1}{n}\). The dominant sets can be obtained sequentially in a peeling-off manner [10].
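A minimal sketch of this extraction step, using the classical discrete-time replicator dynamics rather than the faster dynamics of [2] (the function name is ours, and the similarity matrix is assumed to have a zero diagonal):

```python
import numpy as np

def extract_weights(A, max_iter=10000, tol=1e-8):
    """Run replicator dynamics x_i <- x_i * (A x)_i / (x^T A x) from the
    uniform starting point; at convergence, points with positive weight
    form a dominant set and their weights are x^D of Eq. (4)."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x
```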

3.2 Non-parametric Density Kernel

From the last subsection we observe that in a dominant set, each data point i is assigned a weight equal to \(\frac{w_D(i)}{W(D)}\). On the other hand, Eq. (3) indicates that \(w_D(i)\) measures the similarity between i and the other data, and a large \(w_D(i)\) means that i has a high overall similarity with the other data. It is evident that if i is in the central area of a dominant set, then \(w_D(i)\) is likely to be large and i has a large weight. In contrast, a data point i in the border area of a dominant set tends to have a small weight. Since the weights of the data in a dominant set can be used to differentiate between data in central and border areas, they can be treated as the data density in the DP algorithm. In this sense, we can make use of the DSets algorithm to calculate the density, and therefore treat dominant set extraction as a density kernel.

However, in applying this density kernel to the DP algorithm, there are still two problems to be solved. First, while dominant set extraction uses only the pairwise similarity matrix as input and involves no parameters, the calculation of data similarity usually introduces parameters. For example, the commonly used similarity measure \(s(i,j)=exp(-d(i,j)/\sigma )\) introduces the parameter \(\sigma \). Second, when there is more than one cluster in the dataset and the dynamics proposed in [2] are used, there will be some data with zero weights. These data with identical density will negatively influence the clustering of the non-center data. By studying the definition of dominant set, in the following we show how to solve these two problems.

The definition of \(w_D(i)\) in Eq. (3) indicates that a large \(w_D(i)\) corresponds to large similarities between i and the other data. The dominant set definition then states that each data point in a dominant set has a positive \(w_D(i)\). This is equivalent to saying that each data point in a dominant set is similar to all the others. As a result, the dominant set definition imposes a high requirement on the internal similarity of a dominant set. For a fixed dataset, varying \(\sigma \) changes the similarity values. A small \(\sigma \) leads to small similarity values, which in turn result in a large number of small dominant sets. In contrast, a large \(\sigma \) corresponds to large similarity values and then a small number of large dominant sets. By adopting a sufficiently large \(\sigma \), we can group all the data into one dominant set, and therefore assign non-zero weights to all the data. Although \(\sigma \) influences the similarity values, it does not change their relative ordering, and therefore has no influence on the ordering of the data weights. Consequently, the value of \(\sigma \) does not impact the DP clustering results, as long as all data are assigned positive weights.

In practice, if \(\sigma \) is too large, many large similarity values may become numerically identical due to limited floating-point precision. Therefore we use the following procedure to determine \(\sigma \) and generate the density used in the DP algorithm. With \(\overline{d}\) denoting the average of the pairwise distances, we build a list \(\{\overline{d}, 10\overline{d}, 50\overline{d}, 100\overline{d}, 200\overline{d}, \cdots \}\). Given a dataset, we assign \(\sigma \) the values in this list from small to large, until all the resulting data weights are greater than zero.
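A sketch of this procedure, reusing the hypothetical `extract_weights` helper from Sect. 3.1; the multiplier list is truncated to the values stated above, and the positivity test uses a small numerical threshold since plain replicator dynamics only drives excluded weights towards zero asymptotically (the dynamics of [2] produce exact zeros):

```python
import numpy as np

def density_from_dsets(dist, multipliers=(1, 10, 50, 100, 200), eps=1e-12):
    """Increase sigma along the multiplier list until every point receives a
    positive weight; these weights then serve as the local densities rho."""
    d_mean = dist[np.triu_indices_from(dist, k=1)].mean()
    for m in multipliers:
        sigma = m * d_mean
        A = np.exp(-dist / sigma)
        np.fill_diagonal(A, 0.0)   # dominant set similarity has a zero diagonal
        rho = extract_weights(A)
        if np.all(rho > eps):
            return rho
    return rho                     # fall back to the last attempt (our assumption)
```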

4 Experiments

We test the proposed density kernel on the eight datasets, and compare the results with those from the cutoff kernel and the Gaussian kernel. In addition, we also compare with some other algorithms, including k-means, DBSCAN, NCuts, AP and SPRG [17]. With k-means, NCuts and SPRG, we set the required number of clusters to the ground truth and report the average results of 10 runs. With DBSCAN, MinPts is set to 2, manually selected from 1 to 10, and Eps is determined based on MinPts [4]. For AP, the required preference value is manually selected as \(p_{min}+9.2\, step\), where \(step=(p_{max}-p_{min})/10\) and \([p_{min},p_{max}]\) is the preference value range calculated with the method proposed by the authors of [1]. In the DP algorithm with the cutoff and Gaussian kernels, the percentage of data used to calculate \(d_c\) is set to 1.1% and 2.0%, respectively, both manually selected from 1, 1.1, 1.2, \(\cdots \), 2.0. The comparison of these algorithms is presented in Tables 1 and 2, where the F-measure and the Jaccard index are used to evaluate the clustering results. The comparison indicates that in terms of average clustering quality, our non-parametric kernel performs better than the cutoff and Gaussian kernels, and our algorithm also outperforms some other algorithms whose parameters are carefully selected.

Table 1. Clustering results (F-measure) comparison on eight datasets.
Table 2. Clustering results (Jaccard index) comparison on eight datasets.

5 Conclusions

In this paper we present a non-parametric density kernel to be used in the density peak based clustering algorithm. We study the dominant set definition and propose to treat dominant set extraction as a density kernel that is independent of parameters. We compare with the cutoff and Gaussian kernels in the DP algorithm and also with some other clustering algorithms to illustrate the effectiveness of the proposed density kernel. One problem with the proposed density kernel is the relatively high computational load involved in similarity and density calculation, which will be studied in our future work.