1 Introduction

Clustering is a fundamental approach used to organize data into distinct groups to find intrinsic hidden patterns in the data. Its applications include image processing (Pappas 1992; Leong and Ong 2017), bioinformatics (Shuji et al. 2015), social networks (Chang et al. 2014) and pattern recognition (Guo et al. 2009). Popular clustering algorithms include partition-based (MacQueen et al. 1967; Kaufman 2008), density-based (Kriegel et al. 2011; Rodriguez and Laio 2014) and hierarchical (Johnson 1967; Dasgupta 2002). Partition-based clustering algorithms typically include the k-means (MacQueen et al. 1967) and k-medoid algorithms (Kaufman 2008). They construct a single partition of the dataset based on the distance between instances. Among them, the k-medoid algorithm always requires that each cluster center is an existing instance. By contrast, the k-means algorithm uses the average value of the cluster to build the virtual center. Because an instance is always assigned to the nearest center, these approaches are unable to detect non-spherical clusters.

Density clustering (Ester et al. 1996; Rodriguez and Laio 2014) performs well on data with circular, arc and some other irregular shapes. CFDP (Rodriguez and Laio 2014) has attracted much attention because of its simplicity and good results, and it can automatically find the clustering centers. However, it is criticized because it is inefficient and applicable only to some types of data, and requires the manual setting of the key parameter. First, the time complexity of the algorithm is \(O(mn^2)\), where m and n are the number of attributes and instances, respectively. Hence, the algorithm is inapplicable to data with millions of instances. Second, for many datasets, the clustering results are often unsatisfactory in terms of purity, the Jaccard coefficient (JC), Fowlkes and Mallows index (FMI) and Rand index (RI). Third, the quality of the results depends substantially on cutoff distance threshold \(d_c\). It is difficult, if not impossible, for the user to make an optimal setting in practice.

One solution to these issues is to introduce the granular computing (Hu et al. 2016; Li et al. 2016; Qian et al. 2015; Yao and Yao 2002; Chen et al. 2015) methodology. This methodology is widely used to manage many machine learning tasks, such as classification (Wang and Musa 2014), clustering (Wilderjans and Cariou 2016; Sarma et al. 2013), recommendation (Zhang et al. 2019, 2017), active learning (Wang et al. 2017), attribute reduction (Min et al. 2011; Li et al. 2011) and three-way decisions (Huang et al. 2017; Zhao et al. 2016a, b; Li et al. 2017; Yu et al. 2016; Liu and Liang 2017). For the clustering task, each instance can be considered as the finest granule, the entire dataset can be considered as the coarsest granule, and the clustering result has the granule level between the finest and coarsest. We may construct other granule levels to address the aforementioned issues.

In this paper, we propose the two-stage density clustering (TSD) algorithm, which is highly efficient, adaptive to various types of data, and requires minimal parameter setting. Figure 1 illustrates our new algorithm through a running example. Figure 1a describes a dataset of 100 instances. The first stage is pre-clustering, as shown in Fig. 1b. We divide the dataset into ten blocks and obtain ten virtual centers \([c_1, c_2, \dots , c_{10}]\). Block size array \(\rho = [17, 11, 9, 6, 12, 11, 9, 9, 6, 10]\) is obtained for the next stage. Pre-clustering preserves the local distribution of the data and decreases the data size. The second stage is the density clustering of virtual centers, as shown in Fig. 1c. First, we obtain density \(\rho _i\) of each virtual center and calculate its minimum distance \(\delta _i\). Second, we construct the master tree according to (\(\rho _i\), \(\delta _i\)). Finally, we cluster the ten virtual centers into three groups and obtain their cluster indices \(cl = [1, 1, 1, 3, 3, 2, 2, 2, 3, 3]\). Figure 1d shows the clustering results. All instances in each block have the same cluster index as their virtual center.

Fig. 1 Running example of the TSD algorithm. a describes the input; b shows the first stage of TSD, namely pre-clustering, which obtains the virtual centers c and block density \(\rho \); c shows the second stage of TSD, namely the density clustering of virtual centers; d shows the cluster results

The granular computing methodology is used to design the TSD algorithm. In the pre-clustering stage, approximately \(\sqrt{n}\) small local granules are obtained using a two-round-means subroutine, where n is the number of instances. This stage does not change the distribution of the data. In the density clustering stage, both the inner-granule size and inter-granule distance are used to construct the master tree. Then, local granules are accumulated to form the final clusters.

The two-stage density clustering (TSD) algorithm has the following advantages. First, the number of instances involved in density clustering is reduced to \(\sqrt{n}\). Thus, the time complexity is \(O(mn^\frac{3}{2})\), which is much lower than the \(O(mn^2)\) of clustering by fast search and find of density peaks (CFDP). Second, the TSD algorithm combines the advantages of the k-means algorithm and CFDP and therefore has good adaptability: it achieves good clustering performance on datasets of various types. Third, the density \(\rho \) is set to the number of instances in each block. This method is simple and effective and avoids manually setting the cutoff distance \(d_c\). Therefore, the main problems addressed are the efficiency and the parameter setting of the CFDP algorithm: the new algorithm is \(\sqrt{n}\) times faster than CFDP and does not require the cutoff distance \(d_c\) to be set when computing \(\rho \).

Experiments were performed on 21 datasets to quantify the performance of the TSD algorithm. These datasets were chosen from different applications, such as botany, materials science and games, with different data distributions. The largest dataset, Poker (Cattral and Oppacher 2007), contains 1,025,009 instances. We compared the TSD algorithm with five types of clustering algorithms: partition clustering (MacQueen et al. 1967), peak density clustering (Rodriguez and Laio 2014; Xie et al. 2016; Liu et al. 2018; Xu et al. 2016), maximum margin clustering (MMC) (Li et al. 2009), spectral clustering (Wang et al. 2011) and balanced clustering (Liu et al. 2017a). Four external evaluation functions were used to evaluate the clustering results. The time complexity was verified on seven large datasets. The experimental results demonstrated that the TSD algorithm had good clustering performance on various types of datasets. It was two or more orders of magnitude faster than CFDP.

The remainder of this paper is organized as follows: In Sect. 2, we review five types of clustering algorithms. In Sect. 3, we present the TSD clustering algorithm. We describe experiments on 21 datasets in Sect. 4. Finally, we draw a conclusion in Sect. 5.

2 Related work

In this section, we review five types of clustering algorithms: partition-based clustering (MacQueen et al. 1967; Kaufman 2008), density-based clustering (Kriegel et al. 2011; Rodriguez and Laio 2014; Xie et al. 2016; Liu et al. 2018; Xu et al. 2016), maximum margin clustering (MMC) (Li et al. 2009), spectral clustering (Chen and Cai 2011; Wang et al. 2011) and balanced clustering (Liu et al. 2017a).

2.1 Partition-based clustering

Partition-based clustering algorithms, such as the k-means (MacQueen et al. 1967) and k-medoid (Kaufman 2008) algorithms, are classic and efficient. The k-means algorithm calculates cluster centers iteratively as follows (a minimal sketch is given after the steps):

  • Step 1. Initialize k centers \(c_1\) ...\(c_k\) using random sampling.

  • Step 2. Each instance belongs to the block of the nearest center.

  • Step 3. Each new center takes the mean values of all instances of its block.

  • Step 4. Repeat Steps 2 and 3 until the cluster centers do not change.
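The following minimal Python sketch follows Steps 1–4 literally. It is our own illustration using only NumPy, not the Weka implementation used in the experiments; the function name kmeans and all variable names are ours.

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Minimal k-means following Steps 1-4 above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]        # Step 1: random initial centers
    while True:
        # Step 2: assign each instance to the block of its nearest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: each new center is the mean of all instances of its block
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # Step 4: stop when centers no longer change
            return labels, centers
        centers = new_centers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    labels, centers = kmeans(X, k=2)
    print(centers)
```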

Because the Euclidean distance is typically used as a similarity measure, partition-based clustering algorithms cannot detect non-spherical clusters.

2.2 Peak density clustering

Density clustering (Ester et al. 1996; Rodriguez and Laio 2014) explores clusters with different shapes based on the data density. DBSCAN (Ester et al. 1996) can find clusters with various shapes and manage noise; it controls cluster growth using a density threshold. However, it does not perform well when cluster densities overlap.

Similar to DBSCAN, CFDP (Rodriguez and Laio 2014) aims to detect non-spherical clusters. Cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from instances with higher densities. For each instance i, CFDP computes two quantities: its density \(\rho _i\) and its minimum distance \(\delta _i\) from instances of a higher density. Density \(\rho _i\) of instance i is defined as

$$\begin{aligned} \rho _i = \sum \limits _j \chi (d_{ij} - d_c), \end{aligned}$$
(1)

where \(\chi (x) = 1\) if \(x < 0\) and \(\chi (x) = 0\) otherwise, and \(d_c\) is a cutoff distance. \(\delta _i\) is measured by computing the minimum distance between instance i and any other instance with a higher density:

$$\begin{aligned} {\delta _i} = \mathop {\min } \limits _{j:{\rho _j} > {\rho _i}} (d_{ij}). \end{aligned}$$
(2)

For the instance with the highest density, we conventionally take \({\delta _i} = \mathop {\max }\limits _{j} (d_{ij})\).
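For concreteness, Eqs. (1) and (2) can be computed directly from a pairwise distance matrix, as in the sketch below. This is our own illustration rather than the authors' code, and it uses the cutoff kernel of Eq. (1); the cutoff distance \(d_c\) must still be supplied by the user, which is exactly the drawback discussed next.

```python
import numpy as np

def cfdp_rho_delta(D, d_c):
    """D: n x n symmetric distance matrix; d_c: user-chosen cutoff distance."""
    n = D.shape[0]
    # Eq. (1): rho_i counts the neighbors closer than d_c (the instance itself is excluded)
    rho = (D < d_c).sum(axis=1) - 1
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size > 0:
            delta[i] = D[i, higher].min()   # Eq. (2): nearest instance of higher density
        else:
            delta[i] = D[i].max()           # convention for the highest-density instance
    return rho, delta

if __name__ == "__main__":
    X = np.random.default_rng(0).random((200, 2))
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    print(cfdp_rho_delta(D, d_c=0.1))
```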

CFDP detects non-spherical clusters and automatically finds the correct number of clusters. However, it encounters the following drawbacks in practice:

  1. High time complexity: The time complexity of the algorithm is \(O(mn^2)\), where m and n are the number of attributes and instances, respectively. This makes the algorithm impractical for large-scale data clustering.

  2. Difficulty in setting \(d_c\) and accurately selecting the cluster centers: The performance of the algorithm depends on the user-specified cutoff distance \(d_c\), and no specific method for choosing it was presented in Rodriguez and Laio (2014).

Liu et al. (2017b) also identified this problem and stated in the abstract of their paper: “However, the improper selection of its parameter cutoff distance \(d_c\) will lead to the wrong selection of initial cluster centers, but the CFDP cannot correct it in the subsequent assignment process. Furthermore, in some cases, even the proper value of \(d_c\) was set, initial cluster centers are still difficult to be selected from the decision graph.” Chen et al. (2016) described the same problem: “But it has two big challenges when selecting cluster centers. The first challenge is it needs to manually select cluster centers. Even so, on some datasets, the number of cluster centers it generates will be either more or less than the right number. The second one is it is unable to group data points correctly when a cluster has more than one centers.”

Various methods have been proposed to further improve the CFDP algorithm. Liu et al. (2017b, 2018), Chen et al. (2016), Xie et al. (2016) and Du et al. (2016) introduced k-nearest neighbor (KNN) ideas into the CFDP algorithm to improve its adaptability and performance. Xu et al. (2016), Liang and Chen (2016) and Lu and Zhu (2017) combined the idea of hierarchy with the CFDP algorithm to improve efficiency.

Xie et al. (2016) proposed a new robust fuzzy KNN density peak clustering (FKNN-DPC) algorithm. The proposed algorithm introduced a uniform metric to calculate the local density and developed two assignment strategies to detect the true distribution of a dataset. Liu et al. (2018) proposed a shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC) algorithm. They presented three new definitions: SNN similarity, the local density \(\rho \) and the distance from the nearest larger density point \(\delta \). These definitions take the information about nearest neighbors and shared neighbors into account. Xu et al. (2016) proposed a density peak-based hierarchical clustering method (DenPEHC) algorithm that directly generates clusters on each possible clustering layer and introduced a grid granulation framework to enable DenPEHC to cluster large-scale and high-dimensional datasets.

In contrast to the above methods, TSD uses two-stage clustering. The first stage uses a two-round-means algorithm to better handle spherical data, and the second stage uses an improved CFDP procedure to efficiently process non-spherical data. Thus, TSD effectively integrates the idea of hierarchical clustering, which improves the adaptability of the algorithm.

2.3 Maximum margin clustering

MMC algorithms (Li et al. 2009) aim to find an optimal (maximum) hyperplane in high-dimensional feature space. Specifically, the hyperplane and labeled sample can be obtained by optimizing the following objective function:

$$\begin{aligned} \begin{array}{ll} \mathop {\min }\limits _{y \in \{ \pm 1\} ^n} \mathop {\min }\limits _{\omega ,b,\xi } &{} \frac{1}{2}\left\| \omega \right\| ^2 + C\sum \limits _{i = 1}^n \xi _i ,\\ \mathrm{s.t.} &{} y_i(\omega ^T\phi (x_i) + b) \ge 1 - \xi _i,\ \forall i = 1, \dots , n,\\ &{} \xi _i \ge 0,\ \forall i = 1, \dots , n,\\ &{} - l \le \sum \limits _{i = 1}^n y_i \le l, \end{array} \end{aligned}$$
(3)

where \(\phi (\cdot )\) denotes a nonlinear mapping from the original space to a high-dimensional space, \(\xi _i \ge 0\) is the slack variable that corresponds to \(x_i\), l is a constant controlling the balance between classes, and \(\omega \) and b uniquely determine the hyperplane. The clustering labels y are optimized jointly with the hyperplane through this objective function.

MMC is limited to small to medium-sized datasets because of the semidefinite program. The LGMMC algorithm (Li et al. 2009) improves efficiency and scalability by maximizing the margin of opposite clusters using label generation.

2.4 Spectral clustering

Spectral clustering (Chen and Cai 2011; Wang et al. 2011) is derived from graph theory. It mainly includes three steps (a minimal sketch is given after the steps):

  • Step 1. Construct a new matrix to represent the original dataset.

  • Step 2. Compute the eigenvalues and eigenvectors of the matrix. Map each instance to a low-dimensional representation based on the eigenvectors.

  • Step 3. Assign cluster indices according to the new representation.
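A minimal sketch of these three steps is given below. It uses a Gaussian affinity, the unnormalized graph Laplacian and a few Lloyd iterations in the embedded space; these are common choices made for illustration only, and the cited algorithms differ in the matrices and mappings they construct.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, n_iter=50, seed=0):
    # Step 1: construct a matrix representing the original dataset (Gaussian affinity)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    W = np.exp(-(D ** 2) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W              # unnormalized graph Laplacian
    # Step 2: eigen-decompose and map each instance to a low-dimensional representation
    _, vecs = np.linalg.eigh(L)                 # eigenvalues returned in ascending order
    Y = vecs[:, :k]                             # the k eigenvectors with the smallest eigenvalues
    # Step 3: assign cluster indices in the new representation (Lloyd iterations)
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(Y[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([Y[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```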

Spectral clustering has advantages in managing sparse and high-dimensional datasets. Its disadvantages are high time complexity and the inability to manage intersecting clusters. Chen and Cai (2011) proposed the landmark-based spectral clustering algorithm to improve efficiency, and Wang et al. (2011) proposed the spectral multi-manifold clustering (SMMC) algorithm to manage intersections.

2.5 Balanced clustering

Balanced clustering (Liu et al. 2017a) is required in a variety of applications, such as photo query systems (Dengel et al. 2011) and wireless sensor networks (Chuang et al. 2009). These balanced algorithms can be categorized into two types: hard-balanced and soft-balanced. Liu et al. (2017a) proposed a soft-balanced BCLS algorithm based on least square linear regression. It considers a balance constraint to regularize the clustering model. The purpose is to minimize

$$\begin{aligned} \sum \limits _{k = 1}^c s_k^2 = \left\| s \right\| _2^2 = \left\| \mathbf{1}^T Y \right\| _2^2 = \mathrm{tr}(Y^T \mathbf{1}\mathbf{1}^T Y). \end{aligned}$$
(4)

The algorithm achieves balanced clustering by minimizing the sum of the squared cluster sizes.
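To make Eq. (4) concrete: assuming Y is an \(n \times c\) 0/1 cluster-indicator matrix (one column per cluster), \(s = \mathbf{1}^T Y\) is the vector of cluster sizes and the three expressions in Eq. (4) coincide. The following small numerical check is our own illustration:

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 2])        # toy assignment of 6 instances to 3 clusters
Y = np.eye(3)[labels]                        # n x c cluster-indicator matrix
ones = np.ones((len(labels), 1))
s = (ones.T @ Y).ravel()                     # s_k = size of cluster k -> [2. 3. 1.]
print(np.sum(s ** 2),                        # sum_k s_k^2     = 14
      np.linalg.norm(ones.T @ Y) ** 2,       # ||1^T Y||_2^2   = 14 (up to rounding)
      np.trace(Y.T @ ones @ ones.T @ Y))     # tr(Y^T 1 1^T Y) = 14
```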

3 Proposed algorithm

In this section, we present the TSD algorithm with time complexity analysis.

3.1 Algorithm description

Figure 2 shows our TSD clustering framework, and Table 1 lists the symbols and variables used in Fig. 2. Stage I is pre-clustering. The dataset is divided into \(e = \sqrt{n}\) blocks using the two-round-means algorithm (Algorithm 1). \(\sqrt{n}\) is an empirical value used in many studies, and Yu and Cheng (2001) provided a theoretical explanation for this rule. Therefore, we set \(\sqrt{n}\) as the block number in Algorithm 1. This stage computes block information \(b_{1 \times e}\) and determines virtual centers \(c_{1 \times e}\). Simultaneously, we obtain block size array \(\rho = [\left| b_1 \right| ,\left| b_2 \right| , \dots , \left| b_e \right| ]\) for the next stage. We do not use the k-means algorithm directly; instead, we fix the number of iterations at exactly two to save runtime, because additional iterations cost more time without substantially improving clustering performance. We discuss this issue in Sect. 4.

Fig. 2 TSD algorithm framework

Table 1 Notation and variables used in Fig. 2

Stage II is the density clustering of all virtual centers (Algorithm 2). This stage acquires cluster indices \(cl_{1 \times e}\). First, density \(\rho _i\) of each virtual center is obtained directly from the first stage. Second, we construct master tree \(ms_{1 \times e}\) based on density \(\rho \) and minimum distance \(\delta \). Finally, we cluster the virtual centers according to the master tree. Compared with CFDP, this algorithm has the following advantages. First, we only need to cluster e virtual centers; this is the major technique that reduces the time complexity. Second, density \(\rho _i\) is redefined as the size of \(b_i\), so cutoff distance \(d_c\) does not need to be set.

Finally, all remaining instances in each block receive the same cluster index as their virtual center. They are assigned cluster indices \(l = [l_1, l_2, \dots , l_n]\); that is, \(\forall x \in {b_i}\), \(l_x = cl_i\).
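This final assignment is a single array lookup. In the toy snippet below, block_of and cl are hypothetical names for the Stage I block indices and the Stage II cluster indices of the virtual centers, respectively:

```python
import numpy as np

block_of = np.array([0, 0, 1, 2, 2, 3, 3, 3])   # block index b_i of each of n = 8 instances
cl = np.array([1, 1, 2, 2])                      # cluster index of each of e = 4 virtual centers
l = cl[block_of]                                 # l_x = cl_i for every x in b_i
print(l)                                         # [1 1 1 2 2 2 2 2]
```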

3.1.1 Stage I: Two-round-means


Algorithm 1 lists the two-round-means algorithm. It mainly includes three steps:

  • Step 1. Initialization. Line 1 initializes \(b_1 = b_2 = \dots = b_e = \emptyset \). Line 2 initializes virtual centers \(c_{1 \times e}\) using random sampling.

  • Step 2. Compute block information. Lines 6–11 find the nearest center of each instance. Line 9 records the index of the nearest center \(c_j\) of \(x_i\) as m. Line 12 then adds instance \(x_i\) to block \(b_m\).

  • Step 3. Recalculate the virtual centers. The virtual centers need to be updated based on the generated block information. Line 14 recalculates the virtual center \(c_k\) of block k according to Eq. (5).

    $$\begin{aligned} {c_k} = \frac{{\sum \nolimits _{x_i \in {b_k}} x_i}}{{\left| {{b_k}} \right| }}. \end{aligned}$$
    (5)

The loop terminates after two iterations for the following reason: For the k-means algorithm, k is often a small integer that corresponds to the number of clusters. Hence, the algorithm may run many iterations to converge to k clusters. By contrast, in Algorithm 1, the number of blocks is e, which is quite large. These e local blocks influence, but do not determine, the final clusters.
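A minimal NumPy sketch of this subroutine is given below. It is our own reconstruction of the steps above, not the reference implementation, so the cited line numbers refer to the original Algorithm 1 listing rather than to this code; the function name two_round_means and the returned names are ours.

```python
import numpy as np

def two_round_means(X, e, seed=0):
    """Pre-clustering: split X into e blocks using exactly two assignment rounds.
    Returns the block index of each instance, the e virtual centers and the block sizes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), e, replace=False)]           # Step 1: random initial centers
    for _ in range(2):                                           # exactly two rounds
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        block_of = dist.argmin(axis=1)                           # Step 2: nearest-center assignment
        centers = np.array([X[block_of == j].mean(axis=0)        # Step 3: recompute centers, Eq. (5)
                            if np.any(block_of == j) else centers[j]
                            for j in range(e)])
    rho = np.bincount(block_of, minlength=e)                     # block sizes, the densities of Stage II
    return block_of, centers, rho

if __name__ == "__main__":
    X = np.random.default_rng(1).random((100, 2))
    block_of, centers, rho = two_round_means(X, e=int(np.sqrt(len(X))))
    print(rho, rho.sum())   # e block sizes that sum to n = 100
```

Each assignment round costs \(O(mne)\) time, so with \(e = \sqrt{n}\) the whole stage is \(O(mn^\frac{3}{2})\), consistent with Table 3.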

3.1.2 Stage II: Density clustering


Algorithm 2 lists the density clustering algorithm. It mainly includes the following two steps:

Step 1. Construct the master tree.

Lines 2–3 obtain density \(\rho \) and sort it. In CFDP, density \(\rho \) is computed using Eq. (1), which requires the manual setting of cutoff parameter \(d_c\). To exploit the local distribution characteristics and keep the method nonparametric, density \(\rho _i\) is redefined as

$$\begin{aligned} \rho _i = |b_i|, \end{aligned}$$
(6)

where \(|b_i|\) is the size of block i. It can be obtained directly from the first stage of pre-clustering.

Lines 4–12 calculate minimum distance \(\delta \) and construct the master tree. In lines 5–12, \(\delta \) is computed using Eq. (2): \(\delta _i\) is the minimum distance between instance i and any other instance with a higher density. Line 6 indicates that the search ranges from \(c_{q_1}\) to \(c_{q_{i-1}}\), where \(q = [q_1, \dots , q_e]\) is the index array according to \(\rho \) in descending order, \(\rho _{q_1} \ge \rho _{q_2} \ge \dots \ge \rho _{q_e}\). In line 8, the closest distance \(d(c_{q_i}, c_{q_j})\) is determined and recorded as \(\delta _{q_i}\). According to Definition 1, a master is the nearest neighbor with a higher density. Line 9 updates the master of \(q_i\) to \(q_j\). Thus, the master tree is built. When there are multiple masters, we choose the one with the smallest index.

Definition 1

(Wang et al. 2017) Let \(x_i, x_j \in U\) and \(d(x_i, x_j)\) be the distance between \(x_i\) and \(x_j\). \(x_j \in U\) is called a master of \(x_i\) iff

  1. \(\rho (x_j) > \rho (x_i)\); and

  2. \(\forall x_l \in U\), \(\rho (x_l) > \rho (x_i) \Rightarrow d(x_i, x_j) \le d(x_i, x_l)\).

Step 2. Cluster and assign cluster indices to all virtual centers.

The cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from instances with higher densities (Rodriguez and Laio 2014). To consider both density and distance, we introduce the importance measure:

$$\begin{aligned} \gamma = \rho \cdot \delta . \end{aligned}$$
(7)

Lines 13–14 compute \(\gamma \) and sort it. \(p = [p_1, \dots , p_e]\) is the index array according to \(\gamma \) in descending order, and \(\gamma _{p_1} \ge \gamma _{p_2} \ge \dots \ge \gamma _{p_e}\). Lines 15–17 select k centers in turn according to \(p = [p_1, \dots , p_e]\); k is given by the user and is usually set to the actual number of clusters. Line 16 assigns the cluster index i to the ith center. Lines 18–22 assign cluster indices to non-center instances. The cluster assignment is performed in a single step (line 20), in contrast to other clustering algorithms, in which an objective function is optimized iteratively (MacQueen et al. 1967; Kaufman 2008). We now prove that every block is assigned a valid cluster index.
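The sketch below reconstructs Stage II from this description and Eqs. (2), (6) and (7). It operates only on the e virtual centers and block sizes produced by pre-clustering; the function name, the tie-breaking by index and the 1-based cluster indices are our own choices, so this is an illustration of the procedure rather than the reference implementation.

```python
import numpy as np

def density_cluster_centers(centers, rho, k):
    """Cluster the e virtual centers into k clusters using density rho = block size (Eq. (6))."""
    e = len(centers)
    D = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    q = np.argsort(-rho, kind="stable")        # centers sorted by descending density
    delta = np.empty(e)
    master = np.full(e, -1)                    # master tree: nearest neighbor of higher density
    delta[q[0]] = D[q[0]].max()                # convention for the densest center
    for i in range(1, e):
        cand = q[:i]                           # denser centers (ties broken by position in q)
        j = cand[D[q[i], cand].argmin()]       # Eq. (2): nearest among them
        delta[q[i]] = D[q[i], j]
        master[q[i]] = j
    gamma = rho * delta                        # Eq. (7): importance measure
    p = np.argsort(-gamma, kind="stable")
    cl = np.full(e, -1)
    cl[p[:k]] = np.arange(1, k + 1)            # the k most important centers become cluster centers
    for i in q:                                # remaining centers inherit their master's index,
        if cl[i] == -1:                        # walking down the density ordering (cf. Property 1)
            cl[i] = cl[master[i]]
    return cl

if __name__ == "__main__":
    centers = np.random.default_rng(0).random((10, 2))
    rho = np.array([17, 11, 9, 6, 12, 11, 9, 9, 6, 10])   # block sizes as in Fig. 1
    print(density_cluster_centers(centers, rho, k=3))
```

Because only the \(e = \sqrt{n}\) virtual centers are involved, building the distance matrix D costs \(O(me^2) = O(mn)\) time, consistent with Table 3.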

Property 1

On the completion of Algorithm 2, \(\forall 1 \le i \le e\), \(cl_i \ne -1\).

Proof

We prove the property using mathematical induction.

(Basis) From Eq. (6) and line 3, we know that \(q_1\) corresponds to the instance with the maximal density. By the convention for the densest instance, \(\delta _{q_1} = \max _j (d_{q_1 j})\), and for any other instance i, \(\delta _i \le d_{i q_1} \le \delta _{q_1}\) because \(q_1\) has a higher density than i. Hence \(\gamma _{q_1}\) is maximal and, according to Eq. (7) and line 14, \(q_1 = p_1\).

Line 16 assigns \(cl_{p_1} = 1\), which also indicates that \(cl_{q_1} = 1 \ne -1\).

(Induction) Suppose that \(cl_{q_k} \ne -1\) for \(1 \le k \le i - 1\) while executing line 20.

Because \(ms_{q_i}\) is the master of \(q_i\), we have \(\rho _{ms_{q_i}} > \rho _{q_i}\). Moreover, blocks are sorted in line 3. Let \(q_j = ms_{q_i}\), then naturally \(j < i\). According to the hypothesis, \(cl_{q_j} \ne -1\), and \(cl_{q_i} \ne -1\) after executing line 20.

Combining the basis and induction, the property holds. This completes the proof. \(\square \)

3.2 Complexity analysis

We analyze the space and time complexities of the algorithm.

Table 2 Space complexity of the TSD algorithm
Table 3 Time complexity of the TSD algorithm

Proposition 1

Let m and n be the number of attributes and instances, respectively. For Algorithm TSD, the space and time complexity are O(n) and \(O(mn^\frac{3}{2})\), respectively.

Proof

Table 2 lists the space complexity. \(\rho \), \(\delta \) and \(\gamma \) require \(O(n^\frac{1}{2})\) of space. Virtual center c and its cluster index cl also require \(O(n^\frac{1}{2})\) of space. Block information b and cluster index l require O(n) of space. Thus, the total space complexity is

$$\begin{aligned} 3 O(n^\frac{1}{2}) + 2 O(n^\frac{1}{2}) + 2 O(n) = O(n). \end{aligned}$$
(8)

Table 3 lists the time complexity. Algorithm 1 takes \(O(mn^\frac{3}{2})\) of time. Algorithm 2 takes O(mn) of time. The final step for cluster index sharing takes O(n) of time. Therefore, the time complexity of Algorithm TSD is

$$\begin{aligned} O(mn^\frac{3}{2}) + O(mn) + O(n) = O(mn^\frac{3}{2}). \end{aligned}$$
(9)

This completes the proof. \(\square \)

4 Experiments

We conduct experiments to analyze the adaptability, clustering performance and efficiency of the TSD algorithm and to answer the following questions:

  1. Does the TSD algorithm have better clustering performance than classical and state-of-the-art clustering algorithms, such as k-means, CFDP, CFSFDP+A and SNN-DPC?

  2. Is the TSD algorithm efficient?

The computations are performed on a Windows 10 64-bit operating system with 8 GB RAM and an Intel(R) Core(TM) i5-8300H CPU @ 2.30 GHz, using Java and MATLAB. The TSD source code is available at www.fansmale.com/software.html and https://github.com/FanSmale/TSD.

4.1 Datasets

Table 4 Dataset information
Fig. 3 Two-dimensional visualization of six of the datasets

We chose different types, shapes and sizes of datasets for the experiments. These included six synthetic datasets obtained from the literature (Rodriguez and Laio 2014), 12 from the University of California at Irvine (UCI) ML repository (Blake and Merz 1998) and one from the literature (Stenger 2011). The six synthetic datasets had typical shape distributions. The number of instances ranged from 150 to 1,025,009, the number of attributes ranged from 2 to 40, and the number of classes ranged from 2 to 17. These datasets are listed in Table 4.

Note that the class attributes of the datasets were not used in the clustering processing. We first removed the labels for all instances, then predicted the cluster indices and finally measured the clustering performance according to the true label.

Figure 3 illustrates the six synthetic datasets with two-dimensional visualization graphs. The six synthetic datasets had different data distributions, shapes, cluster sizes and numbers of clusters. Figure 3a shows a typical non-spherical dataset, which was divided into three classes, thereby forming three semicircular rings. Table 4 (lines 7–11) lists the five selected typical small UCI datasets, which are among the best-known datasets in the clustering and classification literature (Chiroma et al. 2014). Table 4 (lines 12–21) lists ten large datasets of different types and from different domains. The Poker (Cattral and Oppacher 2007) dataset contained 1,025,009 instances and 10 attributes, and the DLA (Ugulino et al. 2012) dataset contained 165,633 instances and 17 attributes.

4.2 Evaluation measure

We use four external evaluation functions to evaluate the clustering algorithms, namely purity, JC, FMI and RI.

Assume that the clustering algorithm divides the dataset into clusters \(C = {(C_1, C_2, \dots , C_k)}\) and that the reference model divides it into \(C^* = {(C_1^*, C_2^*, \dots , C_s^*)}\). \(\lambda \) and \(\lambda ^*\) denote the cluster index vectors that correspond to C and \(C^*\), respectively. Accordingly, we compute the following four quantities:

$$\begin{aligned} a= & {} |SS|, SS = \{{(x_i,x_j) | \lambda _i = \lambda _j, \lambda ^*_i = \lambda ^*_j, i < j}\}, \end{aligned}$$
(10)
$$\begin{aligned} b= & {} |SD|, SD = \{{(x_i,x_j) | \lambda _i = \lambda _j, \lambda ^*_i \ne \lambda ^*_j, i < j}\}, \end{aligned}$$
(11)
$$\begin{aligned} c= & {} |DS|, DS = \{{(x_i,x_j) | \lambda _i \ne \lambda _j, \lambda ^*_i = \lambda ^*_j, i < j}\}, \end{aligned}$$
(12)
$$\begin{aligned} d= & {} |DD|, DD = \{{(x_i,x_j) | \lambda _i \ne \lambda _j, \lambda ^*_i \ne \lambda ^*_j, i < j}\}. \end{aligned}$$
(13)

According to Eqs. (10)–(13), we obtain the external evaluation functions as follows:

$$\begin{aligned} \mathrm{Purity}= & {} \frac{1}{n}\sum \limits _k {\mathop {\max }\limits _s} \left| {{C_k} \cap C_s^*} \right| , \end{aligned}$$
(14)
$$\begin{aligned} \mathrm{JC}= & {} \frac{a}{a+b+c}, \end{aligned}$$
(15)
$$\begin{aligned} \mathrm{FMI}= & {} \sqrt{{\frac{a}{a+b}}*{\frac{a}{a+c}}}, \end{aligned}$$
(16)
$$\begin{aligned} \mathrm{RI}= & {} \frac{2(a+d)}{n(n-1)}, \end{aligned}$$
(17)

where n is the number of instances. All four measures take values between zero and one, and larger values indicate better clustering.
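The four measures can be computed directly from the two index vectors \(\lambda \) and \(\lambda ^*\). The sketch below follows Eqs. (10)–(17) with a naive \(O(n^2)\) pair count, which is sufficient for evaluation purposes; the function name and the toy label vectors are our own illustration.

```python
import numpy as np
from itertools import combinations

def external_measures(pred, true):
    """Purity, JC, FMI and RI for predicted and reference cluster index vectors."""
    pred, true = np.asarray(pred), np.asarray(true)
    n = len(pred)
    a = b = c = d = 0
    for i, j in combinations(range(n), 2):        # Eqs. (10)-(13): classify every instance pair
        same_p, same_t = pred[i] == pred[j], true[i] == true[j]
        a += int(same_p and same_t)
        b += int(same_p and not same_t)
        c += int(not same_p and same_t)
        d += int(not same_p and not same_t)
    # Eq. (14): purity
    purity = sum(max(np.sum((pred == ck) & (true == cs)) for cs in np.unique(true))
                 for ck in np.unique(pred)) / n
    jc = a / (a + b + c)                          # Eq. (15)
    fmi = np.sqrt(a / (a + b) * a / (a + c))      # Eq. (16)
    ri = 2 * (a + d) / (n * (n - 1))              # Eq. (17)
    return purity, jc, fmi, ri

if __name__ == "__main__":
    print(external_measures([1, 1, 2, 2, 2, 3], [1, 1, 1, 2, 2, 2]))
```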

4.3 Algorithm adaptability

In this section, we design experiments to observe the effect of the preprocessing technique on datasets of different shapes and to elaborate the parameter settings.

4.3.1 Impact of parameter settings

In this subsection, we mainly answer the following two questions through experiments:

  1. Is it appropriate to set the number of iterations to two?

  2. Is it appropriate to divide the dataset into e clusters?

Fig. 4 Impact of parameter settings

We chose seven datasets of different sizes and types, including Iris and Seeds. Figure 4a shows the change in purity when the number of iterations r varied from 2 to 19. For Twonorm, Magic, Glass and Ring, the variation in purity was small, and increasing the number of iterations did not improve performance. For Seeds, the highest purity was achieved with exactly two iterations. For Jain and Iris, the gap to the optimal performance was less than 10%. Therefore, it is appropriate to set the number of iterations r to two: the efficiency of clustering is greatly improved while the clustering performance is preserved.

Figure 4b shows the change in purity when the number of clusters k varied from 0.5e to 1.5e. The performance is best when \(k = e\), and purity did not increase when the number of clusters was increased further. Therefore, it is appropriate to divide the dataset into e clusters.

4.3.2 Effect of the preprocessing technique for different shapes of datasets

Figure 5 shows the pre-clustering and sampling process for the original datasets. We chose synthetic datasets with typical shape distributions: R15, Jain, Aggregation and Compound. Figure 5a, d, g, j shows the original distributions of the datasets. Jain consists of two semicircles; Aggregation consists of seven clusters with different shapes and sizes; Compound contains petal and scatter patterns; R15 consists of three rings formed by 15 evenly distributed clusters.

Fig. 5 Effect of the preprocessing technique for different shapes of datasets

Figure 5b, e, h, k shows the e clusters formed by two-round-means. Figure 5c, f, i, l shows the e virtual centers. Figure 5 shows that local blocks b and virtual centers c maintained good data distribution characteristics. The pre-clustering process did not destroy the data structure.

4.4 Comparison with state-of-the-art clustering algorithms

Because a considerable amount of work has been conducted on clustering, it is meaningful to compare the proposed TSD algorithm with state-of-the-art methods. We compare the TSD algorithm with eight state-of-the-art clustering approaches: partition-based clustering (the k-means algorithm) (MacQueen et al. 1967), density-based clustering (CFDP) (Rodriguez and Laio 2014), balanced clustering (BCLS) (Liu et al. 2017a), SNN-DPC (Liu et al. 2018), FKNN-DPC (Xie et al. 2016), DenPEHC (Xu et al. 2016), CFSFDP+A and CFSFDP+DE (Bai et al. 2017).

k-means is tested using Weka’s built-in codes. CFDP, SNN-DPC, FKNN-DPC, DenPEHC, BCLS, CFSFDP+A and CFSFDP+DE use the source code provided by the algorithm authors. We list the URLs of the source code provided by the author. For SNN-DPC, we adopt the original results in the paper to avoid parameter adjustment. For datasets not available in the original paper, we choose k = 11 for testing. (For k values, the recommended range in the original paper is 5-30.) The CFDP algorithm uses the optimal parameter setting and the Gaussian distance described in the original paper. The TSD algorithm is run ten times, and the average purity is calculated. At the same time, we list the optimal results of the TSD algorithm. This result is for reference only and is referred to as TSD-H for short.

Comprehensive evaluations of clustering performance are performed on 21 datasets, which cover a wide range of properties. For CFDP, SNN-DPC, FKNN-DPC, DenPEHC and BCLS, the code runs on the MATLAB platform, and memory overflows when testing large datasets. Therefore, for the Penbased, Magic and Kr-vs-k datasets, we sample 10% of the instances of each class to form new datasets; for the ConfLongDemo, DLA, Poker and Skin datasets, we sample 1% of the instances of each class.

4.4.1 Comparison on synthetic datasets

Tables 5, 6, 7 and 8 compare the purity, JC, FMI and RI of TSD with those of the eight state-of-the-art clustering algorithms on the synthetic datasets. The mean ranks of the algorithms are listed from a statistical point of view. The best and second-best results are highlighted in boldface and italics, respectively. For TSD, the mean ranks are 5.83, 5.33, 6.42 and 4.67, respectively.

4.4.2 Comparison on benchmark datasets

Tables 9, 10, 11 and 12 compare the purity, JC, FMI and RI of TSD with those of the eight state-of-the-art clustering algorithms on the benchmark datasets. We perform a statistical analysis of the experiments using the methods described in Reyes et al. (2018) and Zhou (2016). The average ranks are obtained by applying a Friedman test (Reyes et al. 2018), the most widely used nonparametric test, which analyzes whether there are significant differences among the algorithms. TSD is generally superior to the eight clustering algorithms. For purity, its average rank is 3.30, whereas the average ranks of the other eight algorithms are 4.13, 4.37, 7.70, 3.33, 5.17, 8.17, 4.43 and 4.40, respectively. In particular, the TSD algorithm has the highest purity on three datasets.

Table 5 Purity of TSD and eight clustering algorithms on synthetic datasets
Table 6 JC of TSD and eight clustering algorithms on synthetic datasets
Table 7 FMI of TSD and eight clustering algorithms on synthetic datasets
Table 8 RI of TSD and eight clustering algorithms on synthetic datasets
Table 9 Purity of TSD and eight clustering algorithms on benchmark datasets
Table 10 JC of TSD and eight clustering algorithms on benchmark datasets
Table 11 FMI of TSD and eight clustering algorithms on benchmark datasets
Table 12 RI of TSD and eight clustering algorithms on benchmark datasets

4.4.3 Comparison on domain datasets

We adopt actual line loss data to further verify the performance of TSD. The line loss data contain 2585 instances, i.e., 2585 electrical transformer districts. The five attributes are wire size, wire length, active power, reactive power and energy indication. The decision attribute divides the districts into five types.

Figure 6a illustrates the purity of TSD and the eight clustering algorithms. Figure 6b–d illustrates the comparison of JC, FMI and RI, respectively. For TSD, purity, JC, FMI and RI are 0.9280, 0.8109, 0.8955 and 0.9058, respectively. TSD ranks first among all four indicators. The purity of the other eight clustering algorithms is 0.6572, 0.8300, 0.6112, 0.6274, 0.6112, 0.6112, 0.6758 and 0.6271. The JC of the other eight clustering algorithms is 0.5533, 0.3325, 0.4509, 0.1951, 0.2252, 0.4509, 0.4786 and 0.4478. The FMI of the other eight clustering algorithms is 0.7417, 0.5240, 0.6715, 0.3448, 0.3738, 0.6715, 0.6879 and 0.6516. The RI of the other eight clustering algorithms is 0.7973, 0.6538, 0.4509, 0.5404, 0.5175, 0.4509, 0.7623 and 0.7389. TSD has good performance in the actual line loss prediction.

Fig. 6 Comparison on the domain dataset

4.4.4 Comparison with CFDP optimization algorithm

Finally, we compare TSD with three CFDP optimization algorithms: DPCG (Xu et al. 2018a), GDPC (Xu et al. 2018b) and CDPC (Xu et al. 2018b). DPCG is an improved grid-based density peak clustering algorithm, whereas GDPC and CDPC are two improved density peak clustering algorithms that quickly find cluster centers. The authors of these three algorithms did not provide source code. To ensure optimal performance, we adopt the datasets and results reported in the corresponding references for comparison, so there is no need to run the programs or tune parameters.

Table 13 compares the purity of TSD and DPCG. Of the six datasets, TSD performed better than DPCG on four (e.g., Soybean and Ionosphere). The mean ranks of TSD and DPCG are 1.33 and 1.67, respectively. Table 14 compares the purity of TSD, GDPC and CDPC. Again, TSD performed better than GDPC and CDPC on four of the six datasets. The average ranks of GDPC, CDPC and TSD are 2.33, 2.00 and 1.67, respectively.

Table 13 Purity comparison of TSD versus DPCG
Table 14 Purity comparison of TSD versus GDPC, CDPC

4.5 Runtime comparison

First, we compare the TSD algorithm with several state-of-the-art density peak clustering algorithms on the Matlab platform, including SNN-DPC, FKNN-DPC, DenPEHC, CFSFDP+A and CFSFDP+DE. CFSFDP+A and CFSFDP+DE are two fast optimization algorithms for CFDP (Bai et al. 2017). For the SNN-DPC algorithm, memory overflow occurred when computing datasets with more than 30,000 instances. Therefore, we select three datasets, DLA0.2, Magic and Penbased, to quantify the efficiency of the TSD algorithm, where DLA0.2 is a 20% random sample of DLA.

Fig. 7 Runtime comparison on the MATLAB platform

Fig. 8 Runtime comparison on the Java platform

Figure 7 shows the relationship between the size n of the training dataset and the runtime. Note that we use logarithmic coordinates for the runtime. The results show that the TSD algorithm effectively improves efficiency and reduces running time. It is at least two orders of magnitude faster than SNN-DPC, DenPEHC and FKNN-DPC. For example, on the DLA0.2 dataset with \(n = 3.3 \times 10^4\), the runtime for TSD is 1891 ms, compared with \(1.7 \times 10^6\) ms for SNN-DPC, \(6.1 \times 10^5\) ms for FKNN-DPC, \(2.4 \times 10^7\) ms for DenPEHC, 21,561 ms for CFSFDP+A and 624 ms for CFSFDP+DE. That is, TSD is 12,728 times faster than DenPEHC, 923 times faster than SNN-DPC, 323 times faster than FKNN-DPC and 11 times faster than CFSFDP+A. TSD is slightly slower than the CFSFDP+DE algorithm.

Second, we use larger datasets to compare the TSD algorithm with the classical CFDP and k-means algorithms on the Java platform. Figure 8 shows the runtime comparison for four large datasets with more than 160,000 instances each: Poker0.5, Skin, DLA and ConfLongDemo, where Poker0.5 is a 50% random sample of Poker. The numbers of instances in Poker0.5, Skin, DLA and ConfLongDemo are 512,504, 245,057, 165,633 and 164,860, respectively. The TSD algorithm is two or more orders of magnitude faster than CFDP and is able to process datasets with millions of instances within an acceptable runtime. For example, on the Poker0.5 dataset, with \(n = 5.1 \times 10^5\), the runtimes for CFDP and TSD are \(2.2 \times 10^7\) ms and \(7.4 \times 10^3\) ms, respectively. That is, TSD is 2692 times faster than CFDP.

The experimental results also demonstrate that the TSD algorithm is slightly slower than the k-means algorithm. For example, on the Poker0.5 dataset, with \(n = 5.1 \times 10^5\), the runtimes for TSD and k-means are \(7.4 \times 10^3\) ms and \(2.9 \times 10^3\) ms, respectively; that is, k-means is about 3 times faster than TSD. Similarly, on the DLA dataset, with \(n = 1.6 \times 10^5\), k-means is 8 times faster than TSD.

Table 15 Runtime comparison (s)
Table 16 Runtime comparison (s)

According to Table 3, the time complexity of the TSD algorithm is \(O(mn^\frac{3}{2})\), while the time complexity for the k-means algorithm is O(mn). The experimental results demonstrate that the theoretical analysis is correct.

Finally, we compare the TSD algorithm on the MATLAB platform with the three CFDP optimization algorithms, adopting the original results reported in Xu et al. (2018a, b). Table 15 compares the running times of the TSD and DPCG algorithms: for the 14 datasets listed in Xu et al. (2018a), TSD is faster than DPCG. Table 16 compares the running times of TSD, GDPC and CDPC: similarly, for the 12 datasets, TSD is faster than both GDPC and CDPC.

4.6 Optimization of the TSD algorithm

Essentially, TSD is a two-stage hierarchical clustering algorithm. Therefore, we further optimize the TSD algorithm in two respects: time and performance. For time optimization, the key is to reduce the complexity of two-round-means. We adopt the idea of grid or wavelet clustering to reduce the complexity of two-round-means to O(n); in this way, the complexity of the optimized TSD algorithm is O(n).

For performance optimization, the selection of virtual centers is crucial: if the initial centers are well adapted to the distribution of the data, better performance is obtained. Therefore, we design a preprocessing module that makes a preliminary estimation on the dataset, and the two-stage clustering algorithms are selected based on this judgment. In this way, we obtain an optimized version of TSD, namely TSD-Improved (TSD-I).

Table 17 compares TSD-I with the TSD and CFDP algorithms on the synthetic datasets. TSD-I effectively improves the performance of TSD: for the six synthetic datasets, TSD-I achieved performance improvements on four. Table 18 compares TSD-I with the TSD and CFDP algorithms on the benchmark datasets: for the fifteen benchmark datasets, TSD-I achieved performance improvements on eleven.

Table 17 Comparison of TSD-I with the TSD and CFDP algorithms on synthetic datasets
Table 18 Comparison of TSD-I with the TSD and CFDP algorithms on benchmark datasets

4.7 Discussions

We are now able to answer the questions posed at the beginning of this work.

  1. The TSD algorithm is more adaptive than popular clustering algorithms. It is effective for datasets of different types, shapes and sizes.

  2. The TSD algorithm is more accurate than classical and state-of-the-art clustering algorithms on benchmark and domain data.

  3. The TSD algorithm is efficient and scalable.

5 Conclusions and future works

In this paper, we proposed the TSD clustering algorithm, which is highly efficient, adapts well to multiple types of datasets and requires minimal parameter setting. The new algorithm uses two-round-means for pre-clustering so that the time and space complexities are kept low. The time complexity of the algorithm is \(O(mn^\frac{3}{2})\), which is lower than the \(O(mn^2)\) of CFDP. Experiments on 21 datasets demonstrated that the new algorithm is more accurate than state-of-the-art algorithms and two or more orders of magnitude faster than CFDP.

From the viewpoint of algorithms and applications, the following research problems merit further investigation:

  1. Revising the TSD algorithm for active learning. The TSD algorithm greatly reduces the time complexity; naturally, it can be applied to clustering-based active learning algorithms to improve their efficiency.

  2. Nonparametric setting. The TSD algorithm requires the user to provide the final number of clusters. A better solution would avoid this parameter setting altogether without sacrificing clustering performance.

  3. Adopting other clustering algorithms. We proposed a specific algorithm under the new local–global granule idea. The two clustering stages can be replaced with other algorithms (e.g., DBSCAN (Ester et al. 1996), EM (Langford et al. 2010) and hierarchical clustering (HC) (Johnson 1967)) under this framework. A comparative study on this issue may be interesting.