
1 Introduction

Clustering analysis is one of the main techniques of unsupervised analysis [1]. It divides data into clusters so as to maximize the similarity between data within a cluster and minimize the similarity between data in different clusters [2]. It has been widely used in many areas such as image processing, bioinformatics, deep learning, pattern recognition and so on. Clustering analysis can be classified into partition-based clustering, density-based clustering [3], grid-based clustering, hierarchical clustering [4] and so on [5, 6]. However, in many cases, the number of clusters must be assigned manually, and an inappropriate assignment affects the analysis results negatively. If the assigned number is much larger than the actual number of clusters, the resulting clustering is overly complicated and the characteristics of the data cannot be analyzed. If it is much smaller than the actual number, valuable information is lost in the clustering results and cannot be recovered in later data mining.

Therefore, many solutions have been proposed to determine the optimal number of clusters. Some utilize cluster validity indexes, e.g., the DB index [7], the I index [8] and the Xie-Beni index [9], to determine the optimal number. Others exploit heuristics to deduce the number. For example, the solution of Laio et al. [10] estimates the optimal number according to the density of each data point and the distance between data points. However, this heuristic still requires the number of clusters to be input manually, so clustering cannot be fully automatic. Recently, Gupta et al. [11] proposed a solution that identifies the optimal number according to the last leap and the last major leap of the minimal distances between cluster centers. However, when running on overlapping data sets, the accuracy of most of those solutions drops severely.

To address the poor estimation of the number of clusters on overlapping data sets, we propose a fast algorithm that focuses on cluster number estimation for such data sets. The solution selects cluster centers in a static way: it generates a score for each data point according to a density-distance model, and the score of a data point never changes once it is generated. The solution takes the top k data points with the highest scores as the centers of k clusters. It utilizes the significant change of the minimal distance between cluster centers to identify the optimal number of clusters in overlapping data sets.

The rest of this paper is organized as follows: we review the relevant published work in Sect. 2. After analyzing the estimation problem of the number of the clusters in overlapping data sets in Sect. 3, we elaborate our solution in Sect. 4. The experimental results are discussed in Sect. 5. Finally, the paper concludes in Sect. 6.

2 Related Work

It is very important to determine the number of clusters into which data points are grouped, especially for partition-based clustering solutions. However, it is not easy to estimate the optimal value of this number. Fortunately, cluster validity indexes [12,13,14,15,16] provide a useful tool for the estimation. Davies and Bouldin [7] proposed the DB index, based on inter-cluster similarities, to obtain the best cluster number. Xie and Beni [9] put forward the Xie-Beni index, based on intra-cluster compactness and inter-cluster separation, and utilized it to determine the optimal number. Bensaid [17] and Ren et al. [18] improved the solution of Xie and Beni to enhance its reliability and robustness, respectively. Some other validity indexes [19], e.g., the Bayesian Information Criterion (BIC) [20], diversity [21], and intra-cluster and inter-cluster coefficients [22], are also exploited to estimate the number of clusters. To obtain better estimations, some solutions even apply multiple indexes [5].

In addition to cluster validity indexes, some other factors are also utilized for the estimation. The solutions proposed by Wang et al. [23] and Laio et al. [10] perform the estimation according to factors related to density. Concretely, the solution of Laio et al. calculates the product of density and distance for each data point and estimates the number of clusters based on those products. Because the user is required to read the number of clusters off a visualization, data cannot be processed in batches. Recently, Gupta et al. [11] observed that the last significant drop of the distance between cluster centers indicates the natural number of clusters. Based on this observation, they proposed the Last Leap solution (LL) and the Last Major Leap solution (LML) to estimate the natural number of clusters.

Many algorithms already determine the number of clusters with high accuracy on simple data sets. However, when running on data sets with overlapping clusters, their accuracy drops severely. Here, an overlapping data set refers to a data set with no obvious boundary between clusters. For example, Fig. 1 shows a non-overlapping two-dimensional data set, and Fig. 2 shows an overlapping one. A detailed explanation of the notations used below is given in Table 1.

Fig. 1. Non-overlapping

Fig. 2. Overlapping

Table 1. Notations.

3 Motivation

Given a data set P, \(P=\{p_{1}, p_{2}, \ldots , p_{m}\}\), where \((\forall p_{i})\, {p_{i}}\in \mathbb {R}^{d}\), partition-based solutions try to divide P into k subsets noted as \(C_1\), \(C_2\), \(\ldots \), \(C_k\). Each of those subsets is known as a cluster and is identified by a cluster center. Partition-based solutions require users to provide the number of clusters, i.e., the value of k, which means that users are involved in the clustering procedure. To make clustering truly unsupervised, clustering solutions should be equipped with the ability to estimate the optimal number of clusters. The objective of this work is to search for the optimal number in a set \(K= \{k_{i} \mid {k_{i}\in \mathbb {N}^{*}} \; and \; {k_{i}} < {k_{max}}\}\), where \({k_{max}}= \sqrt{m}\).

Gupta et al. observed that the last significant drop of the minimal distances between cluster centers indicates the optimal number of clusters. Based on this observation, they proposed the Last Leap solution (LL) and the Last Major Leap solution (LML). LL and LML work well on data sets in which the clusters are well separated and have equal sizes and variances, and they even outperform most other solutions. However, they suffer severe accuracy degradation on overlapping data sets, as do many other solutions. In this work, we focus on this accuracy problem on overlapping data sets and try to find a solution for it.

4 A Fast Estimation Solution

In this section, we elaborate a fast solution for estimating the number of clusters in overlapping data sets. The minimum distance between cluster centers is used to measure the change in the degree of separation between clusters, and the solution exploits a density-distance model to select cluster centers in a static way. To determine the number of clusters automatically, a formula is needed to quantify how much this minimum distance changes. However, when the value of k exceeds the optimal value, the following situation is likely to occur: on the whole, the distance between cluster centers changes little, but locally the change appears large, which leads to misjudgment. Therefore, we use the closeness of the cluster centers to narrow the range of k and avoid the distance between cluster centers becoming too small. Finally, we take the number satisfying the constraint on the minimum distances as the optimal number of clusters.

4.1 Selecting Cluster Centers

Density-based clustering solutions have the ability to select globally optimal points as cluster centers without iterations. To improve the performance of clustering, Laio et al. proposed a fast clustering algorithm whose cluster centers are selected according to the product of the density and the distance of each data point. Here, the distance of a data point is the minimum distance between that point and any other point with a higher density. The algorithm can select proper cluster centers even in non-spherical and strongly overlapping data sets.

To avoid the influence of outliers on judging the change in the minimum distance between cluster centers, we introduce a density-distance model designed around a density weight. The density weight measures the importance of the density of a data point.

Definition 1

Density weight. \((\forall p_i)\, p_i\in P\), the density weight of \(p_i\) describes the importance of \(\rho (p_i )\) in deciding whether to accept \(p_i\) as a cluster center. It is calculated by Function 1, where \(\overline{\rho }\) denotes the average density of all points in P.

$$\begin{aligned} weight(p_i)=\frac{\rho (p_i )}{\overline{\rho }} \end{aligned}$$
(1)

Definition 2

Relative high-density point set. \((\forall p_i)\, p_i \in P\), the relative high-density point set of \(p_i\) consists of the points which have a higher density than \(p_i\). It is denoted as \(P_{p_i}^h\), and described as

        \(P_{p_i}^h=\{p_j \mid p_j \in P \; and \; \rho (p_j )>\rho (p_i )\}\).

Definition 3

Density-bound minimum distance. \((\forall p_i)\, p_i\in P\), the density-bound minimum distance represents the minimum distance from \(p_i\) to any point with a higher density. It is denoted as \(dist_{min\cdot \rho } (p_i)\) and calculated by Function 2, where \(dist(p_i, p_k )=(p_i-p_k)^2\) is the squared Euclidean distance between \(p_i\) and \(p_k\).

$$\begin{aligned} dist_{min\cdot \rho } (p_i )=min_{(p_k \in P_{p_i}^h )} dist(p_i,p_k) \end{aligned}$$
(2)

Definition 4

Density-distance score. \((\forall p_i)\, p_i \in P\), the density-distance score of \(p_i\) is defined to measure how suitable \(p_i\) is to be a cluster center. It is denoted as \(score_{d\cdot d}(p_i)\).

The density-distance score is calculated by the density-distance model according to Function 3. Substituting Function 1 into it, Function 3 can be transformed into Function 4.

$$\begin{aligned} score_{d\cdot d}(p_i)= \rho (p_i)\cdot weight(p_i)\cdot dist_{min\cdot \rho }(p_i) \end{aligned}$$
(3)
$$\begin{aligned} score_{d\cdot d}(p_i)=\frac{\rho (p_i)^2}{\overline{\rho }} \cdot dist_{min\cdot \rho } (p_i) \end{aligned}$$
(4)

Points with high density-distance scores are more suitable to be cluster centers than points with low scores. Since outliers receive low scores, their interference is reduced.

To select cluster centers quickly, our approach calculates a density-distance score for each data point and sorts all the data points in descending order of their scores. \((\forall k_i)\, k_i\in K\), it takes the top \(k_i\) data points as the centers of \(k_i\) clusters.
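To make the selection concrete, the following Python sketch computes the density-distance scores of Function 4 and takes the top-k points as cluster centers. Since this section does not fix a particular density estimator, the sketch assumes a Gaussian-kernel density with a cutoff distance d_c (in the spirit of Laio et al.); the density estimator, the d_c parameter and the function names are illustrative assumptions, not the paper's prescribed implementation.

```python
import numpy as np

def density_distance_scores(P, d_c=1.0):
    """Density-distance score (Function 4) for every point in P.

    P   : (m, d) array of data points.
    d_c : kernel cutoff distance -- an assumed parameter; the paper does not
          fix a particular density estimator in this section.
    Returns (scores, rho, sq_dist).
    """
    P = np.asarray(P, dtype=float)
    # Pairwise squared Euclidean distances, dist(p_i, p_k) = ||p_i - p_k||^2 (Definition 3).
    diff = P[:, None, :] - P[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)

    # Local density rho(p_i); a Gaussian kernel is one common choice (assumption).
    # Subtract 1 to drop each point's contribution to its own density.
    rho = np.exp(-sq_dist / d_c ** 2).sum(axis=1) - 1.0

    weight = rho / rho.mean()                       # density weight (Function 1)

    # Density-bound minimum distance (Function 2): distance to the nearest denser point.
    m = len(P)
    dist_min_rho = np.empty(m)
    for i in range(m):
        higher = sq_dist[i, rho > rho[i]]           # relative high-density point set (Definition 2)
        # The densest point has no denser neighbour; give it the largest distance instead.
        dist_min_rho[i] = higher.min() if higher.size else sq_dist[i].max()

    # Density-distance score (Functions 3 and 4): rho(p_i)^2 / mean(rho) * dist_min_rho(p_i).
    scores = rho * weight * dist_min_rho
    return scores, rho, sq_dist


def select_centers(scores, k):
    """Indices of the top-k points by density-distance score, used as the k cluster centers."""
    return np.argsort(scores)[::-1][:k]
```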

4.2 Adjusting the Search Space of the Optimal Number

The aim of our solution is to find the optimal number of clusters in an overlapping data set from a search space described by \(S=\{1, 2,\ldots ,k_{max}\}\). The size of the search space is decided by \(k_{max}\). If \(k_{max}\) is much larger than the optimal number of clusters, some relatively close data points are likely to be selected as cluster centers. Those data points lead to misjudgment of the significant changes of the minimum distances between cluster centers, and hence to erroneous estimation of the optimal number of clusters. Furthermore, a too large value of \(k_{max}\) incurs additional search cost.

To avoid the erroneous estimation and the additional search cost, we introduce the notion of cluster center closeness, which assists the proper assignment of \(k_{max}\). It describes the tightness among cluster centers: a lower value of the closeness indicates a sparser distribution of the cluster centers, and vice versa.

Definition 5

Cluster center closeness. The cluster center closeness of k clusters describes the adjacency of those centers. It is denoted as cls(k) and calculated by Function 5.

$$\begin{aligned} cls(k)=\frac{dist_{m\cdot f\cdot c}( \overline{dist},count_{\rho <\overline{\rho } }) }{\mathop {{min}}\limits _{c_{i}\in C_{k},c_j \in C_k \,and\, i\ne j}dist( c_i,c_j) } \end{aligned}$$
(5)

In Function 5, \(dist_{m\cdot f\cdot c} (\overline{dist},count_{\rho <\overline{\rho }})\) describes the minimum distance of the centers. It is calculated by Function 6, where \(count_{\rho <\overline{\rho }}\) denotes the total number of points whose densities are smaller than the average density.

$$\begin{aligned} dist_{m\cdot f\cdot c} (\overline{dist},count_{\rho<\overline{\rho }}) =\frac{\overline{dist} \cdot count_{\rho <\overline{\rho }}}{m} \end{aligned}$$
(6)

The case that \(cls(k_i)\) exceeds 1 indicates that some centers of the \(k_i\) clusters come from the same cluster, which means that \(k_i\) exceeds the optimal number of clusters. In this situation, it is unnecessary to search for the optimal number from \(k_i\) to \(k_{max}\). Here, we define the break value to describe the smallest such \(k_i\).

Definition 6

Break value. The break value is the minimum value of k which satisfies \(cls(k) > 1\). It is denoted as \(k_{break}\) and calculated by Function 7.

$$\begin{aligned} k_{break}=\mathop {{min}}\limits _{k_i \in K \; and \; cls(k_i )>1}k_i \end{aligned}$$
(7)

To improve performance and avoid misjudgment, we exclude the values from \(k_{break}\) to \(k_{max}\) from the search space of the optimal number of clusters. The new search space is described as \(S'=\{k_i \mid k_i \in \mathbb {N}^* \; and \; k_i < k_{break}\}\). Consequently, the cluster center set is adjusted to be consistent with the change of the search space: it is narrowed to the \((k_{break}-1)\) points with the highest density-distance scores.
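A sketch of the search-space adjustment is given below, in the same style as before. It assumes the quantities produced by the previous sketch (the squared-distance matrix, the densities, and the score ordering). Since the notation table (Table 1) is not reproduced here, \(\overline{dist}\) is interpreted as the mean pairwise distance between data points; this interpretation, like the helper names, is an assumption.

```python
import numpy as np

def closeness(sq_dist, rho, center_idx):
    """Cluster center closeness cls(k) of Function 5 for the given centers (k >= 2)."""
    m = len(rho)
    # Interpreting dist-bar as the mean pairwise distance over all points (assumption).
    mean_dist = sq_dist[np.triu_indices(m, k=1)].mean()
    count_low = np.sum(rho < rho.mean())            # points with below-average density
    dist_mfc = mean_dist * count_low / m            # Function 6

    sub = sq_dist[np.ix_(center_idx, center_idx)]
    min_center_dist = sub[np.triu_indices(len(center_idx), k=1)].min()
    return dist_mfc / min_center_dist               # Function 5


def break_value(sq_dist, rho, order, k_max):
    """Smallest k with cls(k) > 1 (Function 7); falls back to k_max if no such k exists.

    order : point indices sorted by descending density-distance score.
    """
    for k in range(2, k_max + 1):
        if closeness(sq_dist, rho, order[:k]) > 1:
            return k
    return k_max
```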

4.3 Identifying the Optimal Number of Clusters

Cluster centers are selected only according to density-distance scores. Concretely, the top k data points with the highest scores are taken as the centers of k clusters, and the top \(k+1\) data points with the highest scores are taken as the centers of \(k+1\) clusters. Therefore, cluster centers can be selected in advance. Correspondingly, the minimum distance between cluster centers is fixed for a given number of clusters.

Following the observation of Gupta et al., the significant change of the minimum distance between cluster centers is exploited to identify the optimal number of clusters. More precisely, the optimal number is identified by the last significant change. Here, we utilize Function 8 to discover significant changes.

$$\begin{aligned} f(C_k, C_{k+1})=\frac{\mathop {{min}}\limits _{c_{i}\in C_k, c_j \in C_k \; and \; i \ne j}dist(c_i, c_j)}{\mathop {{min}}\limits _{c_{i}\in C_{k+1}, c_j \in C_{k+1} \; and \; i \ne j}dist(c_i, c_j)} \end{aligned}$$
(8)

If \(f(C_k, C_{k+1})\) reaches or exceeds a predefined threshold, a significant change of the minimum distances occurs when k increases to \((k+1)\). All the significant changes of the minimum distances can be discovered by calculating \(f(C_k, C_{k+1})\) as the number of clusters increases from 1 to \(k_{break}\). The number related to the last significant change of the minimum distances is taken as the optimal number of clusters.
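The ratio of Function 8 can be evaluated directly from the precomputed distance matrix and score ordering, as in the sketch below (again reusing the conventions of the previous sketches; k must be at least 2 so that a minimum pairwise center distance exists).

```python
import numpy as np

def min_center_dist(sq_dist, center_idx):
    """Minimum pairwise distance among the given cluster centers."""
    sub = sq_dist[np.ix_(center_idx, center_idx)]
    return sub[np.triu_indices(len(center_idx), k=1)].min()


def change_ratio(sq_dist, order, k):
    """f(C_k, C_{k+1}) of Function 8: ratio of the minimum center distance with
    k centers to the minimum center distance with k + 1 centers (k >= 2)."""
    return min_center_dist(sq_dist, order[:k]) / min_center_dist(sq_dist, order[:k + 1])
```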

The framework for estimating the optimal number of clusters is described as follows (a code sketch of the complete procedure is given after the steps):

Step 1

\(k_{max}\leftarrow \sqrt{m}\);

Step 2

Calculate a density-distance score for each data point according to Function 4;

Step 3

Select the top \(k_{max}\) points with the highest density-distance scores as the centers of \(k_{max}\) clusters;

Step 4

Calculate cls(k) for each k from 1 to \(k_{max}\) according to Function 5;

Step 5

Calculate \(k_{break}\), and reconstruct the search space of the optimal number as \(S'\);

Step 6

Calculate all the significant changes of the minimum distances between cluster centers according to \(S'\) and Function 8;

Step 7

Take the element related to the last significant change of the minimum distances in \(S'\) as the optimal number.
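Putting the steps together, a minimal end-to-end sketch could look as follows, assuming the helper functions from the previous sketches (`density_distance_scores`, `break_value`, `change_ratio`) and a predefined significance threshold whose concrete value the paper does not fix.

```python
import numpy as np

def estimate_number_of_clusters(P, d_c=1.0, threshold=1.5):
    """End-to-end sketch of Steps 1-7; d_c and threshold are assumed parameters."""
    m = len(P)
    k_max = int(np.sqrt(m))                                    # Step 1

    scores, rho, sq_dist = density_distance_scores(P, d_c)     # Step 2
    order = np.argsort(scores)[::-1]                           # Step 3: candidate centers

    k_break = break_value(sq_dist, rho, order, k_max)          # Steps 4-5: narrow the space to S'

    best_k = 1
    for k in range(2, k_break):                                # Step 6: scan S'
        if change_ratio(sq_dist, order, k) >= threshold:
            # Step 7: keep the k associated with the last significant change
            # (keeping k itself rather than k + 1 follows Gupta et al.'s
            #  last-leap observation and is an interpretation, not a quote).
            best_k = k
    return best_k
```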

5 Experiments and Discussion

In this section, we present extensive experiments on artificial data sets and real data sets to evaluate our approach. In the experiments, we compare our approach with ten different solutions. Before going into details, we introduce the data sets and performance metrics used in the experiments.

5.1 Data Sets and Metrics

To evaluate our approach, we carried out extensive experiments on thirteen data sets: nine overlapping data sets and four non-overlapping data sets. The overlapping data sets are SM2, Flame, SM5, SM6, SM8, Wine, Seeds, Iris and Ionosphere; five of them are artificial data sets and the rest are real data sets. The four non-overlapping data sets are A-N-O-1, A-N-O-2, A-N-O-3 and A-N-O-4. The details of these data sets are described in Table 2, and the nine artificial data sets are shown in Fig. 3.

Fig. 3. Distribution of data sets

Table 2. Details of the data sets.

Accuracy and execution time are adopted as the two metrics to evaluate our approach: accuracy measures the effectiveness of our approach, while execution time measures its performance.

Table 3. Approaches to be compared with ours

In evaluation, our approach is compared with the ten approaches shown in Table 3. All the approaches are executed 20 times on each of the data sets, and all the results discussed in the rest of this section are the average results of the 20 executions.

Table 4. Estimation of the optimal number of clusters in overlapping data sets
Table 5. Estimation of the optimal number of clusters in non-overlapping data sets
Fig. 4. The number of correct estimations of each method on the overlapping and non-overlapping data sets.

5.2 Experiment Results and Analysis

We track the 20 executions of each approach on the overlapping data sets and record the estimated numbers of clusters in Table 4. If an approach obtains two different numbers of clusters on a data set during the 20 executions, the result is recorded as the two numbers separated by a slash. If an approach obtains more than two different numbers, the result is recorded as the minimum and maximum numbers connected by a dash.

Table 6. Execution time of approaches

According to Table 4, LML estimates the number of clusters correctly only on the Iris data set. BIC does better than LML and obtains correct estimations on the SM2 and SM8 data sets. Jump and I also exhibit relatively high accuracy, estimating the number of clusters correctly on four overlapping data sets. They are followed by CE and PC, which estimate the number of clusters correctly on three data sets. CH estimates the number of clusters correctly on five data sets. Among all those solutions, our approach does best: it estimates the number of clusters correctly on all the overlapping data sets.

We also evaluate the accuracy of our approach on non-overlapping data sets and describe the results in Table 5. According to the table, BIC and CH estimate the number of clusters correctly on two data sets, while LML and ZXF do better, estimating the number correctly on three data sets. PC, LL, I, FHV, CE, Jump and our approach do best: they obtain results consistent with the natural number of clusters on each of the non-overlapping data sets.

Figure 4 shows the accuracy of each method on the overlapping and non-overlapping data sets. Considering the estimations on both kinds of data sets, our approach exhibits higher accuracy than all of the other solutions.

In addition to effectiveness, we also evaluate the performance of our approach on all of the data sets. Concretely, we calculate the average execution time of all the approaches and report the results in Table 6. According to the table, our approach spends less time on each of the data sets than any of the other approaches. Our approach spends 0.015 s to 1.7 s on those data sets, while the other approaches take 1.32 s to 279.05 s.

Our approach estimates the number of clusters with higher performance than the other solutions do, and all the other solutions exhibit similar performance to one another on the same data set.

Our approach exhibits the highest performance among all the approaches for two reasons. The first is that it selects cluster centers in a static way: it calculates a score for each data point, and once a score is calculated it never changes; the top k data points with the highest scores are then taken as the centers of k clusters. The alternative is to obtain cluster centers through iteration, which stops only when certain conditions are met. The second reason is that our approach narrows the search space of the optimal number of clusters, and hence reduces search cost.

6 Conclusion

In this work, we focused on the problem of estimating the optimal number of clusters in overlapping data sets. To deal with this problem, we proposed a fast estimation approach. The approach selects cluster centers in a static way according to density and distance, and utilizes the significant change of the minimal distance between cluster centers to identify the optimal number of clusters. The experimental results demonstrate the usefulness and effectiveness of our approach. In the future, we will conduct research on estimating the optimal number of clusters in more complex data sets.