1 Introduction

Clustering is an unsupervised learning method that tries to discover distribution structures and patterns in unlabeled data sets by maximizing the similarity of objects within a cluster and minimizing the similarity of objects in different clusters [1]. Clustering has been used in many areas such as machine learning, pattern recognition, image processing, marketing and customer analysis, agriculture, security and crime detection, information retrieval, and bioinformatics. Clustering is often an important step in the process of data analysis.

Clustering algorithms have been studied for decades. Hundreds of clustering algorithms have been proposed, but none of them is all-purpose, and almost all of them have flaws. Some clustering algorithms are suitable for certain types of data, and others are suitable for data with special distribution structures. Many real data sets have complex distributions, diverse types, large volume, noise, or isolated points, so there is a continuous demand for different kinds of clustering methods. To obtain better clustering results in real-world applications, where the amount of data is often very large and the types of data are diverse, researchers keep developing new efficient and effective clustering algorithms.

Traditional clustering algorithms are usually classified into partitioning methods [2, 3], hierarchical methods [4,5,6,7], density-based methods [8, 9], grid-based methods [10, 11], model-based methods [12] and graph-based methods [12,13,14]. More recent clustering methods include quantum clustering algorithms [15], spectral clustering algorithms [16, 17], kernel clustering algorithms [18], affinity propagation clustering algorithms [19], synchronization clustering algorithms [20,21,22,23,24,25,26,27], and so on. Partitioning methods are divided into hard clustering and fuzzy clustering; fuzzy clustering can cope with overlapping cluster boundaries. Partitioning methods are sensitive to the predefined number of clusters, the initial prototypes, noise and outliers, and their time complexity is linear in the number of data points. K-Means [2] and FCM [3] are two famous and frequently used partition-based clustering algorithms.

Recently, several original clustering algorithms, such as the Affinity Propagation (AP) algorithm [19], the Synchronization Clustering (SynC) algorithm [20] and the clustering by fast search and find of Density Peaks (DP) algorithm [28], were published. The AP algorithm is a new type of clustering algorithm published in Science; after it was published, clustering based on probabilistic graph models became a new research direction. As far as we know, SynC is the first synchronization clustering algorithm. After it was presented, synchronization clustering attracted further researchers, and several synchronization clustering methods [21,22,23,24,25,26,27] were published from different viewpoints. The DP algorithm is based on the assumption that “cluster centers can be characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities”. In the DP algorithm, the number of clusters can be obtained automatically, outliers can be identified easily, and even nonspherical clusters can be explored quickly. So we think the DP algorithm can lead to a new research direction in the clustering field.

Synchronization clustering is a novel clustering approach. The original synchronization clustering algorithm (named SynC) can find the intrinsic structure of a data set without any distribution assumptions and can handle outliers by dynamic synchronization [20]. This paper is inspired by several papers [20, 24, 25, 29,30,31] and the strategy of “divide and conquer”.

In the context of big data, general clustering algorithms cannot process the whole data set in main memory at one time. To overcome this problem, building on the SynC algorithm [20], the ESynC algorithm [25] and the SSynC algorithm [24], this paper develops a Multi-Level Synchronization Clustering (MLSynC) method based on the SSynC algorithm, a framework of “divide and collect” and a linear weighted Vicsek model. The process of MLSynC method differs from those of SynC, ESynC and SSynC: SynC [20] is based on an extensive Kuramoto model, ESynC [25] is based on a linear version of the Vicsek model, SSynC [24] is based on a linear weighted Vicsek model, and MLSynC uses the strategy of “divide and conquer” together with the linear weighted Vicsek model. Because the linear weighted Vicsek model has a good superposition property for clustering, MLSynC method can process big data effectively and efficiently.

The idea of “divide and collect” in MLSynC method is similar to the MapReduce framework [32] in some respects, although it was developed independently. MapReduce is a parallel and distributed programming model used to process very large data sets on a cluster. In the MapReduce framework, the Map and Reduce operations strongly affect the clustering result, so the framework cannot be directly extended to the clustering field. MLSynC method can be used for clustering on a cluster with a parallel programming model or on a personal computer with a serial program. If the partition of the data set is proper, MLSynC method is both efficient and effective.

The remainder of this paper is organized as follows. Section 2 lists some related works. Section 3 gives some basic knowledge. Section 4 introduces MLSynC method. Section 5 validates MLSynC method through simulated experiments. Conclusions and future work are presented in Section 6.

The symbols used in this paper are summarized in Table 1.

Table 1 Symbols used in this paper

2 Related works

2.1 Synchronization clustering methods

In 2010, Böhm et al. [20] presented a novel clustering approach, the SynC algorithm, inspired by the synchronization principle. SynC can find the intrinsic structure of a data set without any distribution assumptions and can handle outliers by dynamic synchronization. To implement automatic clustering, natural clusters can be discovered by using the Minimum Description Length (MDL) principle [33].

After SynC algorithm was presented, researchers published several synchronization clustering papers [21,22,23,24,25,26,27]. To find subspace clusters in high-dimensional sparse data sets, a novel effective and efficient subspace clustering algorithm, ORSC [21], was proposed. To find the intricate patterns of a complex graph, a novel and robust graph clustering algorithm, RSGC [22], was proposed by regarding graph clustering as a dynamic process towards synchronization. To explore meaningful levels of the hierarchical cluster structure, a novel dynamic hierarchical clustering algorithm, hSync [23], was presented based on synchronization and the MDL principle. Inspired by the work of [20] and the Vicsek model [29], Chen [24] presented a Shrinking Synchronization Clustering (SSynC) algorithm using a linear weighted Vicsek model. Inspired by the work of [20], Chen [25] proposed an Effective Synchronization Clustering (ESynC) algorithm based on a linear version of the Vicsek model; simulations validate that this linear version of the Vicsek model is an effective synchronization model for clustering. Based on the metaphor of gravitational kinematics and the central force optimization method, Hang et al. [27] presented a local synchronization clustering algorithm, which can find clusters of arbitrary size, shape and density and determine the number of clusters automatically. Chen [26] presented a Fast Synchronization Clustering (FSynC) method based on the work of [20] and a spatial index structure. FSynC, a parametric algorithm, improves SynC by combining a multidimensional grid partitioning method and a Red-Black tree structure to construct the near neighbor point sets of all points [26].

2.2 Competitive learning for clustering

Competitive learning is a type of unsupervised learning model. In simple competitive networks based on the winner-take-all activation rule, the neuron with the highest activity in response to the input wins the competition and inhibits the other neurons to quiescence. In self-organizing maps, the winner and its neighboring neurons learn by updating their weights. After the competition, the input space is separated into multiple subspaces, and each neuron outputs its maximum value for the inputs in its corresponding subspace.
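As a minimal illustration of the winner-take-all rule described above (not tied to any particular model cited in this section), one competitive learning step can be sketched as follows; the learning rate and the prototype initialization are arbitrary choices made only for the example.

```python
import numpy as np

def competitive_update(weights, x, lr=0.1):
    """One winner-take-all step: the neuron whose weight vector is closest
    to the input x wins and moves toward x; the other neurons are unchanged."""
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))  # index of the winning neuron
    weights[winner] += lr * (x - weights[winner])            # move the winner toward the input
    return winner

# toy usage: three neurons competing over a 2-D input space
rng = np.random.default_rng(0)
W = rng.uniform(0, 1, size=(3, 2))
for x in rng.uniform(0, 1, size=(100, 2)):
    competitive_update(W, x)
```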

Famous competitive learning models include Von der Malsburg’s model [34], Kohonen’s self-organizing map (SOM) [35, 36] and the biologically motivated adaptive resonance theory (ART) [37, 38]. SOM uses a set of heuristic procedures and does not minimize any objective function. It can often preserve the clustering topology well, but it is not suitable for non-vectorial data and has some shortcomings, such as forced termination, unguaranteed convergence, a non-optimized procedure and sensitivity to the order of the data. The ART family includes a series of unsupervised learning models [39], such as ART1, ART2, ART3, ARTMAP and fuzzy ARTMAP. The distinctive characteristics of ART are its code representation, long-term memory and corresponding geometric interpretation [40]. ART-based models have also been studied in supervised learning, semi-supervised learning, self-supervised learning and other machine learning fields. For example, the self-supervised ARTMAP [41] can preserve partial knowledge previously acquired from labeled data and further learn new features from unlabeled data; this model can integrate knowledge from a teacher, the environment and internal model activation. Recently, Seiffertt [42] extended the core ART theory and algorithms to mixed input domains by establishing a dynamic equation model of ART in the time scales calculus.

The root cores in SSynC algorithm play a role similar to the prototypes of the K-Means algorithm or the neurons of competitive networks. After the synchronization iterations of SSynC algorithm, the root cores can often be regarded as the prototypes of the clusters of the original data. In the model proposed in this paper, the location of a root core that represents many cores is regarded as a cluster center, and the location of a root core that represents only one or several cores is regarded as the final synchronization location of one or several isolates. Each neuron of a competitive network can represent one data subspace.

Learning vector quantization (LVQ) and SOM can represent the original data by a set of cluster centers, and the weight vectors of SOM often converge to the mean of the input vectors. ESynC algorithm [25] uses a linear version of the Vicsek model, and SSynC algorithm [24] uses a linear weighted Vicsek model. Representing a data subspace by the center or mean of its input vectors is the common ground among SOM, ESynC algorithm and SSynC algorithm.

2.3 The scalability of the clustering algorithm

Dealing with data sets with a very large sample size and (or) a very high number of dimensions has always been an important challenge in the clustering field because of the time-space complexity and the clustering accuracy. A famous scalable version of K-Means, scaleKM, was presented by Bradley et al. [43]. Although scaleKM is much faster than K-Means for big data sets, it still requires a predefined input parameter, the correct number of clusters, which is unknown for unlabeled data. Another scalable version of K-Means for large data sets, Mini-batch K-Means (MBKM), was proposed by Sculley [44]. MBKM uses the same objective function and iteration process as K-Means, but it improves the clustering speed by selecting small random batches from the original data set.

Dimension reduction is widely used for high-dimensional data sets. A simple and efficient dimension reduction method is random projection [45], but a single random projection is often unstable. Ensemble-based techniques [46] were therefore developed by combining multiple lower-dimensional clustering results.

To explore continuous data areas in massive high-dimensional data sets, an orthogonal partitioning clustering method, O-Cluster [47], was presented by combining an active random sampling strategy with an axis-parallel partitioning technique. O-Cluster can get high-quality clustering results with excellent scalability and robustness, although it needs a large buffer size and has a sensitive parameter. Recently, by using fast data-space reduction and an intelligent sampling strategy, Rathore et al. [48] presented a rapid hybrid, ensemble-based clustering algorithm named FensiVAT. FensiVAT can enhance scalability in both dimensionality and sample size.

3 An effective multi-level synchronization clustering method based on a framework of “divide and collect” and SSynC algorithm

Facing big data, general clustering methods cannot process all the data in main memory at one time. To overcome this problem, we present an effective Multi-Level Synchronization Clustering (MLSynC) method by using a framework of “divide and collect” and a linear weighted Vicsek model. The framework of “divide and collect” is an application of the strategy of “divide and conquer” in the clustering field. The linear weighted Vicsek model, described by Eq. (s5) of Sdefinition 5 of the Supplementary Material, is used in SSynC algorithm.

Although we use the Euclidean metric as the dissimilarity measure in this paper, the method is by no means restricted to this metric or to this kind of data space. If a proper dissimilarity measure can be constructed in a hybrid-attribute space, the method can still be used.
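As an illustration of such a hybrid-attribute dissimilarity measure (the attribute split and the equal weighting below are assumptions made only for the example, not a measure used in this paper), one could combine a range-normalized Euclidean distance on the numeric attributes with a simple matching distance on the categorical ones:

```python
import numpy as np

def hybrid_dissimilarity(x_num, y_num, x_cat, y_cat, num_ranges):
    """Illustrative mixed-attribute dissimilarity: range-normalized Euclidean
    distance on the numeric attributes plus simple-matching distance on the
    categorical attributes, combined with equal weight."""
    num_part = np.linalg.norm((np.asarray(x_num) - np.asarray(y_num)) / num_ranges)
    cat_part = sum(a != b for a, b in zip(x_cat, y_cat)) / max(len(x_cat), 1)
    return num_part + cat_part

# toy usage: two objects with two numeric and two categorical attributes
d = hybrid_dissimilarity([1.0, 250.0], [3.0, 200.0], ["red", "A"], ["blue", "A"],
                         num_ranges=np.array([10.0, 600.0]))
```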

3.1 The application condition of MLSynC method

Sfig. 1 of the Supplementary Material compares the synchronization clustering results of MLSynC method under two different partitioning methods, a random partitioning method and a direct partitioning method. From Sfig. 1, we observe that if the spatial distributions of the two partitioned data subsets differ greatly and the clustering structure of the original data set is severed by the partitioning, MLSynC method gets a clustering result different from that of SSynC algorithm. If the spatial distributions of the two partitioned data subsets deviate only slightly, or the partitioning is uniform, MLSynC method gets a clustering result similar to that of SSynC algorithm.

3.2 The description of a two-level framework algorithm of MLSynC method

The clustering process of MLSynC method differs from those of SynC algorithm [20], ESynC algorithm [25] and SSynC algorithm [24]. Figure 1 presents a two-level framework algorithm of MLSynC method, and Sfig. 2 of the Supplementary Material presents a three-level framework algorithm. In the two-level framework algorithm, the original data set, which is usually too large to be processed in main memory at one time, is partitioned into m subsets. Each subset is processed by a clustering module (or a clustering machine) based on SSynC algorithm. After collecting all root cores from the m clustering modules, a clustering module based on SSynC algorithm is applied again.

Fig. 1 A two-level framework algorithm of MLSynC method

In Fig. 1, if m is too large, then the two-level framework algorithm of MLSynC method should be replaced by a three-level (or higher-level) framework algorithm. A three-level framework algorithm of MLSynC method is presented in Online Resource 3 of the Supplementary Material of this paper.

From Fig. 1, we observe that MLSynC method has a natural framework based on SSynC algorithm that integrates the clustering results of all subsets. MLSynC method also supports incremental clustering naturally. For example, in Fig. 1, if every subset in {S2, ···, Sm} contains only one point, then the two-level framework algorithm becomes an incremental clustering algorithm.

Here, we present the description of the two-level framework algorithm of MLSynC method.

Algorithm 1 The two-level framework algorithm of MLSynC method (the pseudocode is given as a figure in the original article and is not reproduced here)

In Algorithm 1, steps 1 - 5 belong to the “divide stage”, and steps 6 and 7 belong to the “collect stage”.
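Since the pseudocode of Algorithm 1 appears only as a figure, the following Python sketch outlines the two-level framework under stated assumptions. Here `ssync` is a placeholder for the SSynC algorithm of [24]; it is assumed to take a point set, per-point weights and the parameters δ and ε, and to return root-core locations, root-core weights and a point-to-core assignment. This interface and the contiguous-block partition are illustrative choices, not the implementation used in the paper.

```python
import numpy as np

def mlsync_two_level(D, m, ssync, delta, eps):
    """Sketch of the two-level MLSynC framework (divide stage + collect stage)."""
    n = len(D)
    # divide stage: partition the data set into m subsets (contiguous blocks here)
    subsets = np.array_split(np.arange(n), m)
    all_cores, all_weights, per_subset = [], [], []
    for idx in subsets:
        cores, weights, assign = ssync(D[idx], np.ones(len(idx)), delta, eps)
        per_subset.append((idx, np.asarray(assign), len(all_cores)))
        all_cores.extend(np.asarray(cores))
        all_weights.extend(np.asarray(weights))
    # collect stage: cluster the collected root cores with SSynC again
    final_cores, _, core_to_final = ssync(np.asarray(all_cores),
                                          np.asarray(all_weights), delta, eps)
    core_to_final = np.asarray(core_to_final)
    # map every original point to its final cluster label
    labels = np.empty(n, dtype=int)
    for idx, assign, offset in per_subset:
        labels[idx] = core_to_final[assign + offset]
    return np.asarray(final_cores), labels
```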

3.3 The recursive algorithm of MLSynC method

Here, we present the description of the recursive algorithm of MLSynC method.

Algorithm 2 The recursive algorithm of MLSynC method (the pseudocode is given as a figure in the original article and is not reproduced here)
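Because the recursive pseudocode also appears only as a figure, the sketch below illustrates the idea under the same assumed `ssync` interface as in the two-level sketch above: a subset with at most FitNum points is clustered directly with SSynC, while a larger one is halved, clustered recursively, and the root cores of the two halves are merged by another SSynC run. This is an illustration of the recursion, not the original implementation.

```python
import numpy as np

def mlsync_recursive(points, weights, ssync, delta, eps, fit_num):
    """Sketch of the recursive MLSynC algorithm."""
    if len(points) <= fit_num:
        cores, core_weights, assign = ssync(points, weights, delta, eps)
        return np.asarray(cores), np.asarray(core_weights), np.asarray(assign)
    half = len(points) // 2
    c1, w1, a1 = mlsync_recursive(points[:half], weights[:half],
                                  ssync, delta, eps, fit_num)
    c2, w2, a2 = mlsync_recursive(points[half:], weights[half:],
                                  ssync, delta, eps, fit_num)
    # merge the root cores of both halves and synchronize them once more
    cores, core_weights, core_assign = ssync(np.vstack([c1, c2]),
                                             np.concatenate([w1, w2]),
                                             delta, eps)
    core_assign = np.asarray(core_assign)
    # relabel the points of both halves through the merged core assignment
    assign = np.concatenate([core_assign[a1], core_assign[a2 + len(c1)]])
    return np.asarray(cores), np.asarray(core_weights), assign
```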

3.4 Time and space complexity analysis of MLSynC method

From the time complexity analysis of SSynC algorithm presented in another paper [24], we know that SSynC algorithm needs Time = \( O\left(d\cdot \left({n}_{\left(t=0\right)}^2+\cdots +{n}_{\left(t=T-1\right)}^2\right)\right) < O\left(T\cdot d\cdot {n}^2\right) \), which is usually less than that of SynC algorithm and ESynC algorithm. Here T is the number of synchronization steps in SSynC algorithm, d is the data dimension, n is the number of points in the original data set D = {x1, ···, xn}, and n(t) is the number of active cores in the t-th synchronization step.

3.4.1 Time and space complexity analysis of the two-level framework algorithm of MLSynC method

In the two-level framework algorithm of MLSynC method, step 1 needs Time = O(n) and Space = O(n).

The time cost of step 2 is:

$$ \mathrm{Time}=O\left(d\cdot \sum_{i=1}^{m}\left(n_i^2(t=0)+\cdots +n_i^2(t=T_i-1)\right)\right)\approx O\left(d\cdot \frac{n_{\mathrm{ave}}^2(t=0)+\cdots +n_{\mathrm{ave}}^2(t=T_{\mathrm{max}}-1)}{m}\right) $$
(1)

In Eq. (1), \( T_i \) is the number of synchronization steps in the i-th clustering module based on SSynC algorithm, \( n_i(t=0) \) is the initial number of active cores in the i-th clustering module, \( n_{\mathrm{ave}}(t=0) \) is the initial average number of active cores over the m clustering modules, and \( T_{\mathrm{max}} \) is the maximum number of synchronization steps among the m clustering modules.

Suppose subset \( S_i \) (i = 1, ···, m) has \( K_i \) clusters or isolates; then step 3 needs Time = \( O\left(K_1+\cdots +K_m\right)=O\left(|C|\right) \). Here |C| is the number of elements in the core set C.

Step 4 needs Time = \( O\left(d\cdot \left({\left|C\right|}_{\left(t=0\right)}^2+\cdots +{\left|C\right|}_{\left(t=T-1\right)}^2\right)\right) \). Here T is the number of synchronization steps of SSynC algorithm in step 4, and |C|(t) is the number of active cores in the t-th synchronization step.

Step 5 needs Time = O(n) and Space = O(n).

3.4.2 Time and space complexity analysis of the recursive algorithm of MLSynC method

In the recursive algorithm of MLSynC method, step 1 needs Time = O(n) and Space = O(n).

In many cases, the time cost of step 2 is:

$$ \mathrm{Time}(n)=\begin{cases}O\left(d\cdot \left(n_{(t=0)}^2+\cdots +n_{(t=T-1)}^2\right)\right), & \text{if } n\le FitNum,\\[2pt] O(1)+2\cdot \mathrm{Time}\!\left(\frac{n}{2}\right), & \text{if } n> FitNum.\end{cases} $$
(2)

Step 3 needs Time = O(n) and Space = O(n).

According to our analysis, MLSynC method often needs less time than SSynC algorithm.
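For intuition, taking Eq. (2) at face value and assuming an even split in which every leaf subset holds about FitNum points (so there are roughly n/FitNum leaf calls to SSynC and about as many divide-and-merge steps), a rough expansion of the recurrence gives

$$ \mathrm{Time}(n)\le \frac{n}{FitNum}\cdot O\left(T\cdot d\cdot {FitNum}^2\right)+O\left(\frac{n}{FitNum}\right)=O\left(T\cdot d\cdot n\cdot FitNum\right), $$

which, when FitNum is much smaller than n, is much smaller than the \( O(T\cdot d\cdot n^2) \) bound of running SSynC algorithm on the whole data set. This is only an illustrative estimate under the stated assumptions, not a result established elsewhere in the paper.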

3.5 The setting of parameters δ, ε and m in MLSynC method

For each subset or collected core set in MLSynC method, SSynC algorithm is used for clustering with proper values of parameters δ and ε.

3.5.1 The setting of parameter δ in MLSynC method

Parameter δ affects the clustering result of MLSynC method. There are several methods to select a proper value for parameter δ. The first [20] optimizes parameter δ by the MDL principle. In Chen [49], two other estimation methods were presented. The third is the heuristic selection of parameter δ in Chen [25]. The fourth is a linear-search exploration of parameter δ described in Chen [26]. A proper value for parameter δ can also be selected according to Stheorem 1 and Sproperty 1 of the Supplementary Material.

3.5.2 The setting of parameter ε in MLSynC method

Parameter ε affects the time cost of MLSynC method only slightly. Usually, parameter ε has a long valid interval. For example, if parameter δ > 15, then the valid interval of parameter ε is about (0, 10]. In simulations, we obtain almost the same results for several different values of parameter ε (such as 0.00001, 0.0001, 0.001, 0.01, 0.1, 1 and 10).

3.5.3 The setting of parameter m in MLSynC method

In the two-level framework algorithm of MLSynC method, parameter m affects the whole time cost. If parameter m is set too large, then each part of the data set contains fewer points, so SSynC algorithm needs less clustering time for each subset, but the two-level framework algorithm needs more “divide and collect” time. Usually, there is a balance between the increase of the “divide and collect” time and the decrease of the clustering time as parameter m increases, so parameter m should not be set too large.

3.6 The improvement of MLSynC method

In MLSynC method, an improved version of SSynC algorithm can be obtained by combining a multidimensional grid partitioning method and a Red-Black tree structure to construct the near neighbor point sets of all active cores. This improvement, which can decrease the time cost, is introduced in Chen [26]. Generally, we first partition the data space of the data set D = {x1, ···, xn} by a multidimensional grid partitioning method. Then we design an effective index of all grid cells and construct the δ near neighbor grid cell set for each grid cell. If every grid cell uses a Red-Black tree to index its active cores in each synchronization step, then constructing the δ near neighbor point set of every active core becomes faster when the number of grid cells is proper.
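As an illustration of this grid idea, the following Python sketch assigns active cores to grid cells of side length δ and builds the δ near neighbor set of a core by scanning only the adjacent cells. A plain dictionary replaces the Red-Black tree index of [26] purely for brevity, so this is a sketch of the idea rather than the implementation used in the paper.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_grid(points, delta):
    """Assign every point to a grid cell of side length delta, so that all
    delta-near neighbors of a point lie in its own cell or an adjacent cell
    (a dict stands in for the Red-Black tree index of [26])."""
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple((p // delta).astype(int))].append(i)
    return cells

def delta_near_neighbors(i, points, cells, delta):
    """Collect the delta near neighbor set of point i by scanning only the
    3^d adjacent cells instead of the whole data set."""
    p = points[i]
    key = tuple((p // delta).astype(int))
    neighbors = []
    for offset in product((-1, 0, 1), repeat=len(key)):
        for j in cells.get(tuple(k + o for k, o in zip(key, offset)), []):
            if j != i and np.linalg.norm(points[j] - p) <= delta:
                neighbors.append(j)
    return neighbors
```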

Before the synchronization iterations, if we set a proper value for parameter δ to filter out isolates, these isolates can be marked as inactive cores that are not processed in the subsequent iterations. This implementation technique is often effective on some data sets.

3.7 The convergence of MLSynC method

In MLSynC method, SSynC algorithm is used to cluster each subset and the collected core set, so the convergence of MLSynC method depends entirely on the convergence of SSynC algorithm. According to the convergence analysis of SSynC algorithm and our simulations, MLSynC method is also convergent.

4 The comparison of the dynamic clustering processes among SynC algorithm, ESynC algorithm, SSynC algorithm and MLSynC method

SynC algorithm uses the extensive Kuramoto model described by Eq. (s2) of Sdefinition 2 of the Supplementary Material at each evolution step, which is a nonlinear update model. ESynC algorithm uses the linear version of the Vicsek model described by Eq. (s3) of Sdefinition 3 at each evolution step, which is a linear update model. SSynC algorithm uses the linear weighted Vicsek model described by Eq. (s5) of Sdefinition 5 at each evolution step, which is a linear weighted update model. MLSynC method uses a framework of “divide and collect” together with the linear weighted Vicsek model.
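To make the difference concrete, the sketch below shows one synchronization step of a linear weighted Vicsek update, in which each active core moves to the weight-weighted mean of itself and its δ near neighbors (the weight of a core counting how many original points it represents). This is only an illustrative reading of the model; Eq. (s5) of Sdefinition 5 in the Supplementary Material is the authoritative definition and may differ in detail.

```python
import numpy as np

def weighted_vicsek_step(cores, weights, delta):
    """One illustrative synchronization step of a linear weighted Vicsek model:
    every active core moves to the weighted mean location of itself and the
    cores within distance delta."""
    new_cores = np.empty_like(cores)
    for i, c in enumerate(cores):
        near = np.linalg.norm(cores - c, axis=1) <= delta  # includes core i itself
        w = weights[near]
        new_cores[i] = (w[:, None] * cores[near]).sum(axis=0) / w.sum()
    return new_cores
```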

Figure 2 compares the tracks of 2000 data points from DS0 during the clustering processes of SynC algorithm, ESynC algorithm, SSynC algorithm and MLSynC method. From Fig. 2, we observe that MLSynC method, ESynC algorithm and SSynC algorithm have a better local synchronization effect than SynC algorithm.

Fig. 2 The comparison of the dynamic clustering processes with time evolution among SynC algorithm, ESynC algorithm, SSynC algorithm and MLSynC method. In panels a to e, the data set is DS0 with 2000 points, parameter δ is set to 18 in the four algorithms, and parameter ε is set to 1 in SSynC algorithm and MLSynC method. In MLSynC method, parameter m is set to 10 and the two-level framework algorithm is used. Panels a*, b-4, c-4, d-4 and e-4 display the evolution in the “collect stage” of the two-level framework algorithm of MLSynC method

Figure 3a compares a measure of the clustering result, the cluster order parameter with t-step evolution (t: 0 - 49) [20], among SynC algorithm, ESynC algorithm, SSynC algorithm and MLSynC method. Figure 3b compares another measure, the t-step average length of edges (t: 0 - 49) [25], among the four algorithms. Figure 3c compares the relationship between the final number of clusters and parameter δ after clustering with each of the four algorithms.

Fig. 3 The comparison of SynC algorithm, ESynC algorithm, SSynC algorithm and MLSynC method in two measures of the clustering result and in the relation between the final number of clusters and parameter δ. In Fig. 3, the data set is DS0 with 2000 points, and parameter ε is set to 1 in SSynC algorithm and MLSynC method. In MLSynC method, parameter m is set to 10 and the two-level framework algorithm is used. In a and b, parameter δ is set to 18 in the four algorithms, and the two measures (the cluster order parameter with t-step evolution and the t-step average length of edges) for MLSynC method are computed in the “collect stage” of its two-level framework algorithm

From Fig. 3a and b, we observe that the t-step average length of edges is a better measure of the final synchronization result than the cluster order parameter with t-step evolution. From Fig. 3c, we observe that parameter δ has a long valid interval in ESynC algorithm, SSynC algorithm and MLSynC method, and that the smaller parameter δ is set in SynC, ESynC, SSynC and MLSynC, the larger the final number of clusters. On many data sets with obvious clusters, if we use a proper partitioning method to group the data set, MLSynC method can often get the correct number of clusters for any value of parameter δ in its valid interval; ESynC algorithm and SSynC algorithm can also often get the correct number of clusters for any value of parameter δ in their valid intervals. But the final number of clusters obtained by SynC algorithm is often much larger than the actual number of clusters for any value of parameter δ in a long interval.

5 Simulated experiments

5.1 Experimental design

Our experiments are performed on a personal computer. The experimental programs are developed in C / Python / Matlab under Windows 7.

To verify the improvement of MLSynC method in clustering effect and time cost, we conduct simulated experiments on several artificial data sets, eight UCI data sets [50] and three bmp pictures in the following sections.

Four kinds of artificial data sets (DS1 - DS4) are produced in a 2-D region [0, 600] × [0, 600] by a program presented in Online Resource 4 of the Supplementary Material of this paper. The other artificial data sets (DS5 - DS16) are produced in the interval [0, 600] in each dimension by a similar program, and DS0 is produced in a 2-D region [0, 200] × [0, 200] by a similar program. Iris and seven other UCI data sets [50] are used in our experiments. Three bmp pictures (named Picture1, Picture2 and Picture3) are obtained from the Internet. Stable 2 of the Supplementary Material describes the experimental data sets.

In our simulated experiments, the maximum number of synchronization iterations in the while loop of SynC algorithm, ESynC algorithm, SSynC algorithm and MLSynC method is set to 50. MLSynC method is implemented using its two-level framework algorithm.

The comparison results of these clustering algorithms are presented in eight figures (Figs. 4 - 5, Sfigs. 3 - 8 of the Supplementary Material) and eleven tables (Stables 3 - 13 of the Supplementary Material). Because of space limitations, we only include two figures from Sfigs. 3 - 8 in the manuscript; Sfigs. 3 - 8 and Stables 3 - 13 are presented in Online Resource 6 of the Supplementary Material. The performance of the algorithms is measured by time cost (in seconds). The clustering quality is measured by display figures of the clustering results and by three robust information-theoretic measures, Adjusted Mutual Information (AMI) [51], Normalized Mutual Information (NMI) [52] and Adjusted Variation Information (AVI) [51]. According to Vinh et al. [51], the higher the values of these three measures, the better the clustering quality. In simulations, we use a revised Matlab code based on [51] to compute the three clustering quality measures.
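For readers who prefer Python, AMI and NMI can also be computed with scikit-learn as sketched below; this is only an illustration of the measures, not the revised Matlab code of [51] used in our experiments, and AVI is not available in scikit-learn and is therefore omitted.

```python
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

# true_mem: ground-truth labels (e.g., class labels of a UCI data set)
# pred_mem: cluster labels produced by a clustering algorithm
true_mem = [0, 0, 1, 1, 2, 2]
pred_mem = [0, 0, 1, 2, 2, 2]
print("AMI =", adjusted_mutual_info_score(true_mem, pred_mem))
print("NMI =", normalized_mutual_info_score(true_mem, pred_mem))
```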

In Section 5.2, MLSynC method will be compared with SynC algorithm, ESynC algorithm, SSynC algorithm and some other classic clustering algorithms (K-Means [2], FCM [3], AP [19], DBSCAN [9], Mean Shift [53, 54], the spectral clustering algorithm [16, 17], the information-based clustering algorithm [55] and the DP clustering algorithm [28]) in clustering quality and time cost on some artificial data sets.

In Section 5.3, MLSynC method will be compared with SynC algorithm, ESynC algorithm, SSynC algorithm and some other classic clustering algorithms in clustering quality and time cost using eight UCI data sets.

In Section 5.4, MLSynC method will be compared with SynC algorithm, ESynC algorithm, SSynC algorithm and some other classic clustering algorithms in the compressed effect of clustering results, clustering quality and time cost using three bmp pictures.

In the experiments, parameter δ used in SynC algorithm, ESynC algorithm, SSynC algorithm, MLSynC method, DBSCAN algorithm and Mean Shift algorithm is the threshold of Sdefinition 1 of Supplementary Material. In DBSCAN algorithm, parameter MinPts = 4, and parameter Eps is the same as parameter δ.

A detailed discussion on how to select a proper value for parameter δ in SynC algorithm is given in [20]. ESynC algorithm, SSynC algorithm and MLSynC method use Stheorem 1 and Sproperty 1 to select a proper value for parameter δ. In SSynC algorithm and MLSynC method, parameter ε has a trivial effect on the time cost and the clustering results.

To simplify the simulation of the two-level framework algorithm of MLSynC method, we set the same value for parameters δ1, δ2, ···, δm and δmerge, so they are all denoted δ.

5.2 Experimental results of some artificial data sets (from DS1 - DS16)

5.2.1 The comparison of the clustering results among SynC algorithm, ESynC algorithm, SSynC algorithm and MLSynC method (from DS1- DS4)

Stable 3 of the Supplementary Material presents the comparison of four synchronization clustering algorithms (SynC, ESynC, SSynC and MLSynC) on four artificial data sets (DS1 - DS4). From Stable 3, comparing SynC, ESynC, SSynC and MLSynC, we observe that MLSynC is the fastest clustering algorithm. At the same time, MLSynC, SSynC and ESynC get better local synchronization results than SynC on the four data sets.

In the two-level framework of MLSynC method, parameter m is set to 10. If we use two different partitioning methods (a nearly uneven partitioning method and a nearly even partitioning method) on the four data sets, the two sequences of the numbers of clusters of the 10 subsets in each data set are different. The comparison results are presented in Stable 4 of the Supplementary Material. From Stable 4, we observe that the two partitioning methods result in different clustering distributions of the 10 subsets in each data set.

5.2.2 The comparison of the clustering results among SynC algorithm, ESynC algorithm, SSynC algorithm, MLSynC method and some classical clustering algorithms (from DS1 - DS8)

Fig. 4 and Sfigs. 4 - 6 of the Supplementary Material present the clustering results of several clustering algorithms by display figures that reflect the clustering quality. From Fig. 4 and Sfigs. 4 - 6, we observe that MLSynC, SSynC and ESynC get better clustering quality (obvious clusters or isolates displayed in the figures) than SynC, AP, K-Means and FCM on the four artificial data sets (DS1 - DS4). Mean Shift and DBSCAN obtain clustering quality (obvious clusters displayed in the figures) similar to MLSynC, SSynC and ESynC on these artificial data sets. In particular, MLSynC, SynC, ESynC and SSynC can all easily find some isolates if a proper value is set for parameter δ, and MLSynC gets the same clustering results (the same clusters displayed in the figures) as SSynC and ESynC on these data sets.

Fig. 4 The comparison of the clustering results of several algorithms (DS1, n = 400). In Fig. 4, parameter δ = 18 in SynC, ESynC, SSynC, DBSCAN, Mean Shift and MLSynC method; the number of data points n = 400; parameter ε = 1 in SSynC algorithm and MLSynC method. In MLSynC method, parameter m is set to 2 and the two-level framework algorithm is used

Stable 5 of the Supplementary Material presents the clustering quality of several clustering algorithms (SynC, ESynC, SSynC, MLSynC and some classical clustering algorithms) on six kinds of artificial data sets (DS2, DS4, DS5, DS6, DS7 and DS8). When computing AMI and NMI, the predefined cluster labels of the eight artificial data sets are used as the true_mem input of the MATLAB code from [51]. From Stable 5, comparing MLSynC, SynC, ESynC, SSynC and some classical clustering algorithms, we observe that MLSynC gets acceptable clustering results similar to those of SSynC and ESynC on the eight data sets if the partitioning of the data set in MLSynC method is nearly even. Because two data sets (DS4 (n = 1000) and DS5 (n = 12,000)) have two connected clusters, MLSynC, SSynC and ESynC do not get the largest values of AMI and NMI on them. We also observe that the partitioning method of the data sets can affect the clustering results of MLSynC method.

Stable 6 of Supplementary Material presents the comparison results of several clustering algorithms in time cost. In Stable 6, for DS1, parameter δ = 18 in SSynC, ESynC, SynC and DBSCAN; parameter δ = 25 in MLSynC. For DS2, parameter δ = 25 in MLSynC, SSynC, ESynC, SynC and DBSCAN. For DS3, parameter δ = 18 in MLSynC, SSynC, ESynC, SynC and DBSCAN. For DS4, parameter δ = 18 in SSynC, ESynC, SynC and DBSCAN; parameter δ = 20 in MLSynC. Parameter ε = 1 in SSynC and MLSynC. In MLSynC, parameter m is set as 10, and the two-level framework algorithm is used.

From Stable 6, comparing MLSynC, SynC, ESynC, SSynC, DBSCAN, FCM and K-Means, we observe that MLSynC is faster than SynC, ESynC, DBSCAN and SSynC, and K-Means is the fastest clustering algorithm. Mean Shift and AP cannot run normally on a personal computer because the number of data points is set to 40,000.

Sfigs. 7 - 11 and Stable 7 of the Supplementary Material present additional experimental results on the clustering quality of three famous clustering algorithms: the spectral clustering algorithm [16, 17], the information-based clustering algorithm [55] and the DP clustering algorithm [28]. Stable 7 presents the parameter settings of the three clustering algorithms on four artificial data sets (DS1 - DS4, n = 400). Sfig. 7 presents the clustering results and decision graphs of the DP clustering algorithm, and Sfigs. 8 - 11 compare the clustering results of the three algorithms on the four artificial data sets. On the four artificial data sets (DS1 - DS4, n = 400), DP clustering and MLSynC method can find the correct clusters, while spectral clustering and information-based clustering cannot obtain the actual clusters. DP clustering cannot explore the isolates of two data sets (DS1 and DS3, n = 400), whereas MLSynC method can.

5.2.3 The comparison of the valid interval of parameter δ among MLSynC method, SynC algorithm, ESynC algorithm, SSynC algorithm, DBSCAN algorithm and mean shift algorithm using some artificial data sets (from DS5 - DS16)

Here we compare the valid interval of parameter δ among MLSynC method, SynC algorithm, ESynC algorithm, SSynC algorithm, DBSCAN algorithm and Mean Shift algorithm.

Stable 8 of the Supplementary Material presents the comparison of the valid interval of parameter δ among MLSynC, SynC, ESynC, SSynC, DBSCAN and Mean Shift. Here, \( [e_k, e_{k+1}] \) can be obtained from Eq. (8) of [49]. From Stable 8, comparing MLSynC, SynC, ESynC, SSynC, DBSCAN and Mean Shift, we observe that although the valid interval of parameter δ in MLSynC is shorter than that in SSynC and ESynC, it is long enough on many kinds of data sets.

Stable 9 of the Supplementary Material compares the valid interval of parameter δ in MLSynC method for several different values of parameter ε on some artificial data sets with different dimensions. From Stable 9, comparing the different values of parameter ε, we observe that if parameter ε is less than parameter δ, the valid interval of parameter δ differs very little across the different values of parameter ε.

5.3 Experimental results of eight UCI data sets

Because we do not know the true dissimilarity measure of these UCI data sets, all points of these UCI data sets are standardized into the interval [0, 600] in each dimension in the experiments. When computing AMI and NMI, because we do not know the true cluster labels of these UCI data sets, their class labels are used as the true_mem input of the MATLAB code from [51].

5.3.1 The comparison of the clustering results among MLSynC method, SynC algorithm, ESynC algorithm and SSynC algorithm

Stable 10 of the Supplementary Material presents the comparison of four synchronization clustering algorithms (MLSynC method, SynC algorithm, ESynC algorithm and SSynC algorithm) on eight UCI data sets. From Stable 10, comparing the four algorithms, we observe that MLSynC method, ESynC algorithm and SSynC algorithm get better local synchronization results than SynC algorithm on the eight UCI data sets, and MLSynC method is the fastest algorithm. From this simulated experiment, we also find that if the number of points in the data set is small and an uneven partitioning method is used, the difference in clustering effect between MLSynC method and SSynC algorithm is large.

5.3.2 The comparison of the clustering quality among MLSynC method, SynC algorithm, ESynC algorithm, SSynC algorithm and some classical clustering algorithms

Stable 11 of the Supplementary Material presents the clustering quality of several clustering algorithms (MLSynC method, SynC algorithm, ESynC algorithm, SSynC algorithm and some classical clustering algorithms) on eight UCI data sets. From Stable 11, comparing these clustering algorithms, we observe that MLSynC method gets the largest values of AMI and NMI on five UCI data sets (Iris, Wdbc, Glass, Segmentation and Cloud) if it uses a sequential and uneven partitioning method. So MLSynC method often gets better clustering results than some other clustering algorithms on some UCI data sets. From the final numbers of clusters in Stable 11, we observe that MLSynC method, SSynC algorithm and ESynC algorithm get better local synchronization results than SynC algorithm.

Stable 12 presents the comparison of clustering quality among three famous clustering algorithms (the spectral clustering algorithm [16, 17], the information-based clustering algorithm [55] and the DP clustering algorithm [28]) on eight UCI data sets. Stable 13 presents the comparison of clustering quality among MLSynC method, the spectral clustering algorithm, the information-based clustering algorithm and the DP clustering algorithm. On the eight UCI data sets, spectral clustering and DP clustering get better clustering results than information-based clustering. MLSynC method obtains the best clustering quality on seven UCI data sets, and DP clustering obtains the best clustering quality on one UCI data set.

5.4 Experimental results of three bmp pictures

The values in the RGB (Red, Green and Blue) color space of the pixel points lie in the interval [0, 255] in each dimension. In Stable 14 and Sfig. 13 of the Supplementary Material, parameter ε is set to 1 in MLSynC method and SSynC algorithm. In MLSynC method, parameter m is set to 10 and the two-level framework algorithm is used.

5.4.1 The comparison of the clustering results among SynC algorithm, ESynC algorithm, SSynC algorithm and MLSynC method

Stable 14 presents the comparison of four synchronization clustering algorithms (SynC, ESynC, SSynC and MLSynC) on three pixel-point data sets from the RGB color space of three bmp pictures. From Stable 14, comparing SynC, ESynC, SSynC and MLSynC, we observe that MLSynC is the fastest clustering algorithm. At the same time, MLSynC, SSynC and ESynC get better local synchronization results than SynC on these pixel-point data sets.

5.4.2 The comparison of the clustering compressed effect among MLSynC method, SynC algorithm, ESynC algorithm, SSynC algorithm and some classical clustering algorithms

Sfig. 12 shows Picture3 and the RGB space distribution of its 200 * 200 pixel points. Sfig. 13 of the Supplementary Material shows the original picture and several compressed versions of Picture3. In Sfig. 13, the compressed pictures are drawn using the means of the clusters obtained by clustering the 200 * 200 pixel points of Picture3 in RGB color space with different algorithms. Because AP algorithm needs too much time and space for Picture3, this experiment does not use it. Although the queue of DBSCAN algorithm overflows, it still produces a compressed picture based on its clustering results. From Sfig. 13, we observe that MLSynC method, ESynC algorithm and SSynC algorithm can get multi-level compressed effects for different values of parameter δ.

5.5 Analysis and conclusions of experimental results

From the comparison results in these tables (Stable 3, Stable 6, Stable 10 and Stable 14 of the Supplementary Material), we observe that MLSynC method is faster than SSynC algorithm, ESynC algorithm and SynC algorithm. We think MLSynC method is superior to SSynC algorithm, ESynC algorithm and SynC algorithm in time cost because of its framework of “divide and collect”.

From the simulations on some artificial data sets (DS5 - DS16), we observe that the effective interval of parameter δ in MLSynC method is long enough, as it is in Mean Shift algorithm, DBSCAN algorithm, SSynC algorithm and ESynC algorithm. In some cases, the effective interval of parameter δ in MLSynC method is longer than that in DBSCAN algorithm.

From some display figures (Sfigs. 3 - 6 of the Supplementary Material), comparing MLSynC method, SSynC algorithm, ESynC algorithm and SynC algorithm, we observe that MLSynC method can explore the same clusters and isolates (displayed in the figures) as ESynC algorithm and SSynC algorithm if the partition in MLSynC method is nearly well-proportioned or the subsets are spatially separated (each subset is independent of the others in space). On many kinds of data sets, MLSynC method, SSynC algorithm and ESynC algorithm can explore obvious clusters or isolates if a proper value is selected for parameter δ, while SynC algorithm cannot explore obvious clusters on many data sets.

From the simulations on some data sets, we observe that the numbers of iterations of SynC algorithm, AP algorithm, K-Means algorithm and FCM algorithm are larger than those of MLSynC method, SSynC algorithm and ESynC algorithm. On many data sets, MLSynC method, ESynC algorithm, SSynC algorithm, DP algorithm, Mean Shift algorithm and DBSCAN algorithm are better at exploring clusters and isolates than SynC algorithm, K-Means algorithm, FCM algorithm, the spectral clustering algorithm, the information-based clustering algorithm and AP algorithm. In particular, AP algorithm needs the longest time. From the simulations, we also find that DP clustering algorithm cannot explore isolates or determine the number of clusters automatically.

MLSynC method is an improved clustering algorithm with a faster clustering speed than SSynC algorithm and ESynC algorithm in almost all cases. Usually, parameter ε has a long effective interval (for example, about (0, 10] if parameter δ > 15). In simulations, we observe that for different values of parameter ε in its effective interval, the clustering results of MLSynC method are almost the same except for the time cost.

Because of their values in RGB space, the pixel points of Picture3 are almost continuous and have no obvious clusters. In this case, MLSynC method, SSynC algorithm and ESynC algorithm can get more obvious multi-level compressed effects than some other clustering algorithms, such as K-Means algorithm and FCM algorithm. In simulations, we also observe that DBSCAN algorithm needs more space than MLSynC method, SSynC algorithm and ESynC algorithm because of its recursive process.

Because of the limited page space, we only report some typical data sets (sixteen kinds of artificial data sets, eight UCI data sets and three bmp pictures) used in our experiments. On all experimental data sets, we observe that MLSynC method improves on ESynC algorithm in time cost. On other data sets, we think MLSynC method is still superior to SynC algorithm in time cost. We believe that the selection of experimental data sets is not biased.

From Fig. 1 and some other simulated experimental results, we summarize the application conditions of MLSynC method. In either of the following two cases, MLSynC method can get a clustering effect similar to those of SSynC algorithm and ESynC algorithm.

  • Case 1: The spatial distribution of any partitioned data subset is almost the same as that of the original data set.

  • Case 2: No two partitioned data subsets intersect, are joined, or are very near each other (within distance δ). In this case, no complete cluster of the original data set will be split by the partition process of MLSynC method into multiple small clusters separated by more than parameter δ.

When the partition of the original data set satisfies neither of the above cases, MLSynC method will often get a clustering effect different from those of SSynC algorithm and ESynC algorithm.

6 Conclusions

This paper presents an improved synchronization clustering method, MLSynC, which often gets better clustering results than the original synchronization clustering algorithm, SynC. From the experimental results, we observe that MLSynC method often obtains fewer iterations, a faster clustering speed and better clustering quality than SynC algorithm on many kinds of data sets.

The major contributions of this paper can be summarized as follows:

  (1) It develops an effective framework of “divide and collect” in the clustering field by using a linear weighted Vicsek model.

  (2) It presents two concrete implementations of MLSynC method, a two-level framework algorithm and a recursive algorithm.

  (3) It validates the improvement of MLSynC method in time cost and clustering quality by simulated experiments.

MLSynC method is also robust to outliers and can find obvious clusters with different shapes, and the number of clusters does not have to be fixed before clustering. Usually, parameter δ has a valid interval that can be determined by the exploring method presented in [49], the heuristic method described by Stheorem 1 and Sproperty 1, or the MDL-based method presented in [20].

Usually, clustering algorithms suffer a large deviation when they are transplanted directly onto the MapReduce framework; only specially designed MapReduce-based clustering methods and ensemble clustering methods can obtain the same, similar, or better clustering results. MLSynC method has some similarities with the MapReduce framework, so it can be regarded as an application example of the MapReduce idea in the clustering field. In MLSynC method, the relation between the distribution deviation of the partitioned data subsets and the clustering results should be investigated, and how the partitioning strategy of the original data set affects the clustering results can also be further researched.

This work opens some possibilities for further improvement and investigation. First, the time cost of MLSynC method could be improved further; for example, designing a similarity-preserving hashing function with O(1) time complexity would be valuable for constructing the δ near neighbor point sets. Second, the applicability of our algorithms could be extended and their clustering effect explored on high-dimensional data. Third, MLSynC method could be implemented on a cluster with a parallel programming model or the MapReduce framework.