1 Introduction

Clustering is an unsupervised learning method that seeks natural clusters in unlabeled data by maximizing the similarity of objects within the same cluster and minimizing the similarity of objects in different clusters [20]. Clustering has been applied in many areas, such as pattern classification, image processing, marketing analysis, information retrieval, and bioinformatics.

Clustering algorithms have been studied for decades, yet almost every clustering algorithm has limitations. Some algorithms are suited to data of particular types, while others are suited to data with particular distribution structures. In real-world applications, data may have complex distributions, heterogeneous types, large volume, or many noise points and isolates, so there is a continuing demand for new kinds of clustering methods. To obtain better clustering results in real-world applications, where the amount of data is often very large and the data types are diverse, researchers keep developing new efficient and effective clustering algorithms.

Traditional clustering methods are usually classified into partitioning methods [3, 27], hierarchical methods [16, 23, 47], density-based methods [2, 11, 30, 31], grid-based methods [1, 45], model-based methods [41], and graph-based methods [21, 32]. More recent approaches include quantum clustering methods [17], spectral clustering methods [26, 33], and synchronization clustering methods [4, 18, 34-39].

Recently, several original clustering algorithms were published, such as the Affinity Propagation (AP) algorithm [13], the Synchronization Clustering (SynC) algorithm [4], and clustering by fast search and find of Density Peaks (DP algorithm) [30]. The AP algorithm, published in Science in 2007, is a new type of clustering algorithm; after it appeared, clustering based on probabilistic graphical models became a new research direction. To our knowledge, the SynC algorithm [4] is the first clustering algorithm based on a synchronization model. After SynC was presented, synchronization clustering attracted the interest of several researchers, and a number of synchronization clustering methods [18, 34-39] were published from different perspectives. The DP algorithm [30] is based on the assumption that “cluster centers can be characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities” [30]. In the DP algorithm, the number of clusters can be obtained automatically, outliers can be identified easily, and even nonspherical clusters can be explored quickly, so we believe the DP algorithm can also lead a new research direction in the clustering field.

Synchronization clustering is a novel clustering approach. The original synchronization clustering algorithm, SynC, presented in [4], was claimed to find the intrinsic structure of a data set without any distribution assumptions and to handle outliers by dynamic synchronization [4].

This paper reveals some fundamental differences among synchronization clustering based on an extensive Kuramoto model [4], synchronization clustering based on the original version of the Vicsek model, and synchronization clustering based on a linear version of the Vicsek model. After exploring the synchronization clustering method based on the linear version of the Vicsek model, we present an Effective Synchronization Clustering (ESynC) algorithm that uses this linear model. The ESynC algorithm is inspired by the SynC algorithm and the Vicsek model.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents some basic concepts. Section 4 introduces the ESynC algorithm and its improved version, the IESynC algorithm. Section 5 validates the two algorithms through simulated experiments. Conclusions and future work are presented in Section 6.

2 Related work

This paper is inspired by several papers [4, 19, 42, 44].

In 1995, Vicsek et al. [42] presented a basic model of multi-agent systems that includes noise effects; it can also be regarded as a special version of the Reynolds model [29]. Simulation results show that systems using the Vicsek model [42], or the one-dimensional model presented by Czirok et al. [10], can synchronize when the population density is large and the noise is small. Naturally, we expect that this kind of model can be used to explore clusters and noise points in data sets through local synchronization. In 2003, Jadbabaie et al. [19] analyzed a simplified Vicsek model without noise effects and provided a theoretical explanation of the nearest neighbor rule that causes all agents to eventually move in the same direction. In 2008, Liu et al. [25] established the synchronization property of the Vicsek model under given initial conditions and model parameters. In 2009, Wang et al. [44] studied the Vicsek model under noise disturbances and presented some theoretical results. In 2010, Nagy et al. [28] found a well-defined hierarchical leader-follower influence network in pigeon flocks and suggested that a hierarchical organization of group flight might be more efficient than an egalitarian one. Subsequently, further reports on the communication mechanisms of bird flocks were published in well-known journals such as Nature and its sister journals, PNAS, and PRL. In 2014, Zhang et al. [46] found that pigeon flocks adopt a mode that switches between hierarchical and egalitarian organization; they argued that this switching mechanism is promising for industrial applications such as multi-robot coordination and unmanned vehicle formation control. In 2015, Chen et al. [8] found that pigeon flocks adopt a simple two-level interaction network containing one leader and some followers, and concluded that “the two-level organization of group flight may be more efficient than a multilevel topology for small pigeon flocks” [8].

In 2010, Böhm et al. presented a novel clustering approach, the SynC algorithm [4], inspired by the synchronization principle. SynC can find the intrinsic structure of a data set without any distribution assumptions and handle outliers by dynamic synchronization [4]. To achieve automatic clustering, natural clusters can be discovered using the Minimum Description Length (MDL) principle [15]. After SynC was presented, Shao et al. published several synchronization clustering papers from different perspectives [34-39]. To detect outliers in a real complex data set more naturally, a novel outlier detection algorithm, “Out of Synchronization” [34], was presented from a new perspective. To find subspace clusters in high-dimensional sparse data sets, an effective and efficient subspace clustering algorithm, ORSC [35], was proposed. To find the intrinsic patterns of a complex graph, a robust graph clustering algorithm, RSGC [37], was proposed by regarding graph clustering as a dynamic process towards synchronization. To explore meaningful levels of a hierarchical cluster structure, a dynamic hierarchical clustering algorithm, hSync [38], was presented based on synchronization and the MDL principle. In 2013, Huang et al. [18] also presented a synchronization-based hierarchical clustering method based on the work of [4]. In 2014, Chen [6] presented a Fast Synchronization Clustering (FSynC) algorithm based on the work of [4]. FSynC, a parametric algorithm, improves SynC by combining a multidimensional grid partitioning method with a Red-Black tree structure to construct the near neighbor point sets of all points [6].

In recent years, some physicists have also studied the explosive synchronization of complex networks to uncover the underlying mechanisms of the synchronized state [22, 24, 48]. These papers explore the synchronization rules of several kinds of networks.

3 Some basic knowledge

Suppose there is a data set S = {X_1, X_2, …, X_n} in a d-dimensional Euclidean space. Naturally, we use the Euclidean metric as our dissimilarity measure dis(⋅,⋅). To describe our algorithms clearly, some concepts are presented first.

Definition 1

[5] The δ near neighbor point set δ(P) of point P is defined as:

$$ \delta (P) = \{X \mid 0 < \textit{dis}(X, P) \le \delta, X\in S, X \ne P\}, $$
(1)

where dis (X, P) is the dissimilarity measure between point X and point P in the data set S. Parameter δ is a predefined threshold.
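For illustration, Definition 1 translates directly into a few lines of NumPy. The sketch below uses our own function name and assumes the data set is stored as an (n, d) array; it is not part of the original paper.

```python
import numpy as np

def delta_neighbors(S, p_idx, delta):
    """Return the indices of the delta near neighbor point set of point S[p_idx].

    S is an (n, d) array of points; a point Y belongs to the set when
    0 < dis(Y, S[p_idx]) <= delta, using the Euclidean metric (Definition 1).
    """
    dists = np.linalg.norm(S - S[p_idx], axis=1)   # distances to every point
    mask = (dists > 0) & (dists <= delta)          # exclude the point itself
    return np.where(mask)[0]

# Small usage example with a toy 2-D data set.
S = np.array([[0.0, 0.0], [1.0, 0.5], [10.0, 10.0]])
print(delta_neighbors(S, 0, delta=2.0))            # -> [1]
```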

Definition 2

[4]. The extensive Kuramoto model for clustering is defined as:

Point X = (x_1, x_2, …, x_d) is a vector in d-dimensional Euclidean space. If point X is regarded as a phase oscillator according to the Kuramoto model, with interactions in its δ near neighbor point set δ(X), then the dynamics of the k-th dimension x_k (k = 1, 2, …, d) of point X over time is described by:

$$ x_{k} (t+1) = x_{k} (t) + \frac{1}{\vert \delta (X(t))\vert} \sum\limits_{Y\in \delta (X(t))} {\sin (y_{k} (t)-x_{k} (t))}, $$
(2)

where X(t = 0) = (x_1(0), x_2(0), …, x_d(0)) represents the original phase of point X, and x_k(t + 1) is the renewed phase value in the k-th dimension of point X at the t-th evolution step.
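A minimal sketch of one evolution step of (2), again under our own naming and assuming an (n, d) NumPy array; points with empty δ near neighbor sets are simply left in place.

```python
import numpy as np

def kuramoto_step(S, delta):
    """One evolution step of the extensive Kuramoto model, Eq. (2).

    Every coordinate of every point is moved by the mean of sin(y_k - x_k)
    over its delta near neighbors; points without neighbors stay unchanged.
    """
    S_next = S.copy()
    for i, x in enumerate(S):
        dists = np.linalg.norm(S - x, axis=1)
        nbrs = np.where((dists > 0) & (dists <= delta))[0]
        if len(nbrs) > 0:
            S_next[i] = x + np.mean(np.sin(S[nbrs] - x), axis=0)
    return S_next
```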

Definition 3

The t-step δ near neighbor undirected graph G_δ(t) of the data set S = {X_1, X_2, …, X_n} is defined as:

$$ G_{\delta} (t) = (V(t), E(t)), $$
(3)

where V(t = 0) = S = {X_1, X_2, …, X_n} is the original vertex set, E(t = 0) = {(X_i, X_j) | X_j ∈ δ(X_i), X_i ∈ S (i = 1, 2, …, n)} is the original edge set, V(t) = {X_1(t), X_2(t), …, X_n(t)} is the t-step vertex set of the data set S, E(t) = {(X_i(t), X_j(t)) | X_j(t) ∈ δ(X_i(t)), X_i(t) ∈ V(t) (i = 1, 2, …, n)} is the t-step edge set, and the weight of edge (X_i, X_j) is weight(X_i, X_j) = dis(X_i, X_j).

Definition 4

The t-step average length of edges, AveLen(t), in a t-step δ near neighbor undirected graph G δ (t) is defined as:

$$ AveLen(t) = \frac{1}{\vert E(t)\vert} \sum\limits_{e\in E(t)} {\left| e \right|} $$
(4)

where E(t) is the t-step edge set of G_δ(t), and |e| is the length (or weight) of edge e. As more δ near neighbor points synchronize together over time, the average length of edges in G_δ(t) decreases to its limit 0, that is, AveLen(t) → 0. In our two algorithms, AveLen(t) is used to characterize the degree of local synchronization.
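As an illustration, AveLen(t) of (4) can be computed directly from the current point locations; the sketch below is ours and uses a brute-force O(n²) scan over point pairs (the undirected mean equals the mean over directed pairs).

```python
import numpy as np

def ave_len(S, delta):
    """t-step average edge length of the delta near neighbor graph, Eq. (4).

    Every pair (X_i, X_j) with 0 < dis(X_i, X_j) <= delta contributes one
    edge whose weight is the distance itself; returns 0.0 if no edge exists,
    which is the limit reached once delta near neighbors have synchronized.
    """
    n = len(S)
    lengths = []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(S[i] - S[j])
            if 0 < d <= delta:
                lengths.append(d)
    return float(np.mean(lengths)) if lengths else 0.0
```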

Definition 5

[4]. The cluster order parameter r c characterizing the degree of local synchronization is defined as:

$$ r_{c} = \frac{1}{n}\sum\limits_{i=1}^{n} {\sum\limits_{Y\in \delta (X_{i})} {e^{-dis(X_{i},Y)}}}. $$
(5)

Definition 6

The Vicsek model [42] for clustering is defined as:

Point X = (x_1, x_2, …, x_d) is a vector in d-dimensional Euclidean space. If point X is regarded as an agent according to the Vicsek model, with interactions in its δ near neighbor point set δ(X), then the dynamics of point X over time is described by:

$$ X(t+1) = X(t) + \frac{X(t)+\sum\limits_{Y\in \delta (X(t))} Y} {\left\| {X(t)+\sum\limits_{Y\in \delta (X(t))} Y} \right\|}\cdot v(t)\cdot {\Delta} t, $$
(6)

where X(t = 0) = (x_1(0), x_2(0), …, x_d(0)) represents the original location of point X, X(t + 1) is the renewed location of point X at the t-th evolution step, v(t) is the moving speed at the t-th evolution step, and v(t)⋅Δt is the moving length at the t-th evolution step.

A special case of this original version of the Vicsek model is that if the δ near neighbor point set of a point is empty, the point moves along its own vector direction.

In multi-agent systems based on the Vicsek model, v(t) is usually a constant. If v(t) is always a constant, (6) may not be usable for clustering. In a simulation using the data set of Fig. 1, we find that the original version of the Vicsek model based on (6) does not work well for clustering when v(t) is a constant, so we present another, effective version of the Vicsek model for clustering.

Fig. 1

Comparison of the dynamic synchronization clustering processes over time among the Extensive Kuramoto model (EK model), the Linear version of the Vicsek model (LV model), and the Original version of the Vicsek model (OV model). In panels (b) to (e) of Fig. 1, parameter δ is set to 18 in all three models

Definition 7

A linear version of Vicsek model for clustering is defined as:

Point X = (x_1, x_2, …, x_d) is a vector in d-dimensional Euclidean space. If point X is regarded as an agent according to a linear version of the Vicsek model, with interactions in its δ near neighbor point set δ(X), then following Jadbabaie et al. [19] and Wang et al. [44], the dynamics of point X over time is described by:

$$ X(t+1) = \frac{1}{\left( {1+\left| {\delta (X(t))} \right|} \right)}\left( {X(t)+\sum\limits_{Y\in \delta (X(t))} Y} \right), $$
(7)

where X(t = 0) = (x_1(0), x_2(0), …, x_d(0)) represents the original location of point X, and X(t + 1) is the renewed location of point X at the t-th evolution step.

From (7), which is similar to the update equation for the next location in the Mean Shift clustering algorithm [9, 17], we can see that the renewed location of point X is the mean location of point X and the points in its δ near neighbor point set δ(X).

Equation (7) can also be rewritten by:

$$X(t+1) = X(t) + \sum\limits_{Y\in \delta (X(t))} {\left( {Y-X(t+1)} \right)} $$
$$ = X(t) + \frac{1}{\left( {1+\left| {\delta (X(t))} \right|} \right)}\left( {\sum\limits_{Y\in \delta (X(t))} {\left( {Y-X(t)} \right)}} \right). $$
(8)

Equation (8) is similar to (2) in form, but they differ essentially: the update rule of (2) is nonlinear, whereas the update rule of (7) and (8) is linear.

One special case of this linear Vicsek model is that of two points, each of which has a δ near neighbor point set containing only the other. After one synchronization step using (7), the two points move to their midpoint, so this linear Vicsek model for clustering is consistent with intuition. Another special case is a point whose δ near neighbor point set is empty; after one synchronization step using (7), it remains an isolate.
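A sketch of one synchronization step of the linear version of the Vicsek model, Eq. (7), might look as follows; the function name is our own choice, and isolates (points with empty neighbor sets) simply stay in place, matching the second special case above.

```python
import numpy as np

def linear_vicsek_step(S, delta):
    """One step of the linear version of the Vicsek model, Eq. (7).

    Each point moves to the mean of itself and its delta near neighbors;
    a point with an empty neighbor set stays where it is (an isolate).
    """
    S_next = S.copy()
    for i, x in enumerate(S):
        dists = np.linalg.norm(S - x, axis=1)
        nbrs = np.where((dists > 0) & (dists <= delta))[0]
        S_next[i] = (x + S[nbrs].sum(axis=0)) / (1 + len(nbrs))
    return S_next
```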

Definition 8

According to [5], the Grid Cell is defined as follows:

Grid cells of the d-dimensional Euclidean space of the data set S can be constructed by partitioning the multidimensional Euclidean space with a multidimensional grid partitioning method.

The data structure of grid cell g can be defined as:

$$\begin{array}{@{}rcl@{}} \text{DS(g)} = \left(\text{Grid}\_\text{Label}, \text{Grid}\_\text{Coordinates}, \text{Grid}\_\text{Location},\right. \\\qquad\quad\left.\text{Grid}\_\text{Range}, \text{Points}\_\text{Number}, \text{Points}\_\text{Set}\right).\end{array} $$
(9)

In (9),

Grid_Label is the key label of the grid cell.

Grid_Coordinates is the coordinate vector of the grid cell. It is a d-dimensional integer vector, I = (i_1, i_2, …, i_d), that helps to construct the δ near neighbor grid cell set more quickly.

Grid_Location is the center location of the grid cell. It is a d-dimensional vector, P = (p_1, p_2, …, p_d).

Grid_Range records the region of the grid cell. It is a d-dimensional interval vector expressed by:

$$ R = ([p_{1}-r_{1}/2, p_{1}+r_{1}/2), \ldots, [p_{d}-r_{d}/2, p_{d}+r_{d}/2)), $$
(10)

where r i (i = 1, 2, …, d) is the interval length in the i-th dimension of the grid cell.

Points_Number records the number of points in the grid cell.

Points_Set records the labels of all points in the grid cell.

In the FSynC algorithm [6] and the IESynC algorithm, we use a Red-Black tree, which supports efficient insertion and deletion, to record the labels of all points in a grid cell. In this paper, grid cells and Red-Black trees are used to decrease the time cost of the ESynC algorithm.

Definition 9

[5]. Suppose there is a set of N grid cells, Grid(S) = {g_1, g_2, …, g_N}, in the d-dimensional Euclidean space of the data set S. The δ near neighbor grid cell set δ(g_i) of grid cell g_i (i = 1, 2, …, N) is defined as:

$$\begin{array}{@{}rcl@{}} \delta (g_{i}) &=& \{g_{j} \mid (\exists P)(\exists Q)(0 < \textit{dis}(P,Q) \le \delta ),\\ &&P \in g_{i}, Q \in g_{j}, g_{j} \in \textit{Grid}(S), g_{j} \ne g_{i}\} \end{array} $$
(11)

The construction details of δ near neighbor grid cell set are described in [5].
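As a rough illustration only (the exact construction in [5] may differ), grid cells and a conservative candidate δ near neighbor grid cell set can be built as in the sketch below, assuming cells of equal side length r in every dimension; all names are ours.

```python
import numpy as np
from collections import defaultdict

def build_grid(S, r, delta):
    """Partition space into axis-aligned cells of side length r and return,
    for every non-empty cell, its point labels (Points_Set) and a candidate
    delta near neighbor cell set.  This is a simple sketch; the construction
    used in [5] may differ in detail.
    """
    coords = np.floor(S / r).astype(int)            # Grid_Coordinates per point
    cells = defaultdict(list)
    for label, c in enumerate(map(tuple, coords)):
        cells[c].append(label)

    # Conservative per-axis bound: cells farther apart than this cannot
    # contain a point pair within distance delta.
    reach = int(np.floor(delta / r)) + 1
    neighbor_cells = {}
    for c in cells:
        neighbor_cells[c] = [g for g in cells
                             if g != c and
                             all(abs(gi - ci) <= reach for gi, ci in zip(g, c))]
    return cells, neighbor_cells
```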

Definition 10

The data set S = {X_1, X_2, …, X_n}, clustered using the linear version of the Vicsek model described by (7), is said to achieve local synchronization if the final locations of all points satisfy:

$$ X_{i}(t = T) = \textit{SL}_{k}(T),\quad i = 1, 2, \ldots, n,\ k \in \{1, 2, \ldots, K\}. $$
(12)

In (12), T is the number of synchronization steps, K is the number of steady locations in the final synchronization step, and SL_k(T) is the k-th steady location in the final synchronization step.

Usually, the final location of point X_i (i = 1, 2, …, n) depends on the value of parameter δ, the original location of point X_i, and the original locations of the other points in the data set S.

Theorem 1

The data set S = {X_1, X_2, …, X_n}, clustered using the linear version of the Vicsek model described by (7), will achieve local synchronization if parameter δ satisfies:

$$ \delta_{\min} \le \delta \le \delta_{\max}. $$
(13)

Suppose e_min(MST(S)), which equals min{dis(X_i, X_j) | (X_i, X_j ∈ S) ∧ (X_i ≠ X_j)}, is the weight of the minimum edge in the Minimum Spanning Tree (MST) of the complete graph of the data set S, and e_max(MST(S)) is the weight of the maximum edge in that MST. Apparently, δ_min = e_min(MST(S)). If the data set S has no isolates, then usually e_max(MST(S)) ≤ δ_max < max{dis(X_i, X_j) | (X_i, X_j ∈ S) ∧ (X_i ≠ X_j)}. If the data set S has isolates, we should filter out all isolates first.

Proof

If δ < δ_min, then for any point X_i (i = 1, 2, …, n) we have δ(X_i) = Ø. In this case, no point in the data set S can synchronize with any other point, so synchronization will not happen.

In the other case, δ > δ_max, for any point X_i (i = 1, 2, …, n) we have δ(X_i(t)) = S − {X_i(t)}. According to (7), X_i(t+1) = mean(S), where mean(S) is the mean of all points in the data set S. Every point in S synchronizes with all other points, so global synchronization happens: after one synchronization step, all points in S move to their mean location.

Apparently, if δ minδδ max, local synchronization will happen. And the final result of synchronization is affected by the value of parameter δ and the original locations of all points in the data set S. □

Property 1

The data set S = {X_1, X_2, …, X_n}, clustered using the linear version of the Vicsek model described by (7), will obtain an effective local synchronization result with some obvious clusters or isolates if parameter δ satisfies:

$$\begin{array}{@{}rcl@{}} &&\max\{\textit{longestEdgeInMst}(\textit{cluster}_{k}) \mid k = 1, 2, \ldots, K\}\\ &&\le \delta < \min\{\textit{dis}(\textit{cluster}_{i}, \textit{cluster}_{j}) \mid i, j = 1, 2, \ldots, K,\ i \ne j\},\\ \end{array} $$
(14)

where longestEdgeInMst(cluster_k) is the weight of the longest edge in the minimum spanning tree of the k-th cluster, dis(cluster_i, cluster_j) is the weight of the minimum edge connecting the i-th cluster and the j-th cluster, and K is the number of clusters in the final synchronization step.

Proof

Suppose the data set S = {X_1, X_2, …, X_n} has K obvious clusters. If parameter δ is larger than or equal to max{longestEdgeInMst(cluster_k) | k = 1, 2, …, K}, then the data points within the same cluster will synchronize. If parameter δ is less than min{dis(cluster_i, cluster_j) | i, j = 1, 2, …, K, i ≠ j}, then data points in different clusters cannot synchronize. □

4 An effective synchronization clustering algorithm based on a linear version of Vicsek model

The ESynC algorithm follows almost the same procedure as the SynC algorithm, except that it uses a different dynamic clustering model, the linear version of the Vicsek model represented by (7).

Although we use the Euclidean metric as the dissimilarity measure in this paper, the algorithm is by no means restricted to this metric or this kind of data space. If a proper dissimilarity measure can be constructed in a hybrid-attribute space, the algorithm can also be applied there.

4.1 Comparing the extensive Kuramoto model with the linear version of the Vicsek model and the original version of the Vicsek model

Comparing (2) with (7), we can see that the update rule of the extensive Kuramoto model at each evolution step is nonlinear, whereas the update rule of the linear version of the Vicsek model is linear.

Figure 1 compares the tracks of 800 data points in the dynamic synchronization clustering processes of the Extensive Kuramoto model (EK model), the Linear version of the Vicsek model (LV model), and the Original version of the Vicsek model (OV model). Figure 2a compares the cluster order parameter over t-step evolution (t: 0 - 49) among the three models, Fig. 2b compares the t-step average length of edges (t: 0 - 49), and Fig. 2c compares the relation between the final number of clusters and the value of parameter δ.

Fig. 2

Comparison of the EK model, LV model, and OV model. The data set has 800 points. In Fig. 2a and b, parameter δ is set to 18 in the three models

From Fig. 1, we observe that the OV model cannot obtain a local synchronization effect, and the LV model achieves a better local synchronization effect than the EK model. From Fig. 2a and b, we observe that the t-step average length of edges measures the synchronization result better than the cluster order parameter. From Fig. 2c, we observe that the smaller parameter δ is in the LV and EK models, the larger the final number of clusters. For many data sets with obvious clusters, the LV model usually obtains the correct final number of clusters when parameter δ takes any value in its valid interval, whereas the final number of clusters in the EK model is much larger than the actual number of clusters over a long interval of δ values. In the OV model, the final number of clusters equals the number of data points over a long interval of δ values.

4.2 The description of SynC algorithm

The original synchronization clustering algorithm, SynC, was developed by Böhm et al. [4]. To show the difference between the SynC algorithm and the ESynC algorithm, we restate it below in our own notation, following the description in [4].

[Figure: pseudocode listing of the SynC algorithm]
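Since the listing is given as a figure, the following Python sketch only outlines SynC as we read it from [4] and Definitions 2 and 5: iterate the Kuramoto-style update (2) until the cluster order parameter r_c of (5) stabilizes, then group points that have synchronized to (almost) the same location. The stopping tolerance, the rounding, and the omission of the MDL-driven choice of δ are our simplifications, not the original algorithm.

```python
import numpy as np

def sync_outline(S, delta, max_steps=50, tol=1e-3):
    """Rough outline of SynC [4]: repeat the Kuramoto-style update (2) and
    stop once the cluster order parameter r_c of (5) stops changing, then
    read clusters off the (almost) coincident final locations."""
    S = S.astype(float).copy()
    r_prev = -1.0
    for _ in range(max_steps):
        S_next = S.copy()
        r_c = 0.0
        for i, x in enumerate(S):
            dists = np.linalg.norm(S - x, axis=1)
            nbrs = np.where((dists > 0) & (dists <= delta))[0]
            if len(nbrs) > 0:
                S_next[i] = x + np.mean(np.sin(S[nbrs] - x), axis=0)
                r_c += np.exp(-dists[nbrs]).sum()
        r_c /= len(S)
        S = S_next
        if abs(r_c - r_prev) < tol:      # order parameter has stabilized
            break
        r_prev = r_c
    # group points whose final locations coincide up to a small tolerance
    labels = np.unique(np.round(S, 3), axis=0, return_inverse=True)[1]
    return labels, S
```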

4.3 The description of ESynC algorithm

ESynC algorithm can be described as follows.

[Figure: pseudocode listing of the ESynC algorithm]
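Likewise, the ESynC listing itself is a figure; a minimal sketch of its main loop, combining the update (7) of Definition 7 with AveLen(t) of Definition 4 as the convergence indicator, could look like this. Names, the tolerance, and the rounding used to group steady locations are ours.

```python
import numpy as np

def esync_outline(S, delta, max_steps=50, eps=1e-6):
    """Sketch of ESynC: repeat the linear Vicsek update (7) until the t-step
    average edge length AveLen(t) of (4) drops (close) to zero, then read the
    clusters off the steady locations.  Isolates keep their own location."""
    S = S.astype(float).copy()
    for _ in range(max_steps):
        S_next = S.copy()
        edge_lengths = []
        for i, x in enumerate(S):
            dists = np.linalg.norm(S - x, axis=1)
            nbrs = np.where((dists > 0) & (dists <= delta))[0]
            S_next[i] = (x + S[nbrs].sum(axis=0)) / (1 + len(nbrs))
            edge_lengths.extend(dists[nbrs])
        S = S_next
        if not edge_lengths or np.mean(edge_lengths) < eps:   # AveLen(t) ~ 0
            break
    labels = np.unique(np.round(S, 3), axis=0, return_inverse=True)[1]
    return labels, S
```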

4.4 Complexity analysis of ESynC algorithm

Step1 of ESynC algorithm needs Time = O(n) and Space = O(n).

In Step2, constructing the δ near neighbor point sets for all points with a straightforward method needs Time = O(dn^2) and Space = O(nd). The time cost of constructing the δ near neighbor point sets can be decreased by trading space for time.

Step3 needs Time = O(n) and Space = O(n).

According to [4] and our analysis, the ESynC algorithm needs Time = O(Tdn^2), which is also the time complexity of the SynC algorithm, where T is the number of iterations of the while loop in Step2.

4.5 Setting parameter in ESynC algorithm

Parameter δ in the ESynC algorithm affects the resulting clusters and isolates. In [4], parameter δ is optimized by the MDL principle [15]. In [7], two other methods were presented to estimate parameter δ. Here, we can also select a proper value for parameter δ according to Theorem 1 and Property 1.

4.6 The convergence of ESynC algorithm

In all simulations, when the ESynC algorithm is used to synchronize the original data set S, the set S(T) = {X_1(T), X_2(T), …, X_n(T)} settles onto some steady locations after several iterations (many simulations need only 4 - 5). In the final convergent set S(T), a steady location that represents many points can be regarded as their cluster center, and a steady location that represents only one or a few points is regarded as the final synchronization location of one or a few isolates.

The update equation of the ESynC algorithm can be written in matrix form,

$$ \boldsymbol{S}(t + 1) = \boldsymbol{A}(t)\cdot \boldsymbol{S}(t). $$
(15)

In (15), S(t) = (X_1(t), X_2(t), …, X_n(t))^T is the t-step location vector of the n points {X_1, X_2, …, X_n}, and A(t) is an n×n matrix.

Suppose S(t = 0) is arranged in the final cluster order, meaning that points belonging to the same final cluster are contiguous in S(t = 0). In this case, A(T) is a block matrix. For example, if the data set S has K final clusters or isolates, then A(T) takes the block form shown in (16).

$$ A(T) = \begin{pmatrix}
\begin{matrix} 1/|c_{1}| & \cdots & 1/|c_{1}| \\ \vdots & \ddots & \vdots \\ 1/|c_{1}| & \cdots & 1/|c_{1}| \end{matrix} & 0 & \cdots & 0 \\
0 & \begin{matrix} 1/|c_{2}| & \cdots & 1/|c_{2}| \\ \vdots & \ddots & \vdots \\ 1/|c_{2}| & \cdots & 1/|c_{2}| \end{matrix} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \begin{matrix} 1/|c_{K}| & \cdots & 1/|c_{K}| \\ \vdots & \ddots & \vdots \\ 1/|c_{K}| & \cdots & 1/|c_{K}| \end{matrix}
\end{pmatrix} $$
(16)

In (16), |c_k| denotes the number of points in the k-th cluster or isolate group.

Analyzing the synchronization process based on the linear version of the Vicsek model theoretically is perhaps more complex than analyzing the Mean Shift clustering algorithm [7, 14] or multi-agent systems based on the Vicsek model.
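For illustration, the matrix A(t) of (15) can be assembled row by row from the current δ near neighbor sets; once every cluster has contracted so that its points are pairwise within δ (but not yet coincident) and no cross-cluster neighbors remain, the rows take exactly the constant block form of (16). The sketch and the toy example below are ours.

```python
import numpy as np

def update_matrix(S, delta):
    """Row-stochastic matrix A(t) such that S(t+1) = A(t) @ S(t), cf. (15).

    Row i puts weight 1/(1 + |delta(X_i)|) on X_i and on each of its delta
    near neighbors; for a cluster whose |c_k| points are pairwise within
    delta, every such row equals the constant row 1/|c_k| of (16).
    """
    n = len(S)
    A = np.zeros((n, n))
    for i, x in enumerate(S):
        dists = np.linalg.norm(S - x, axis=1)
        nbrs = np.where((dists > 0) & (dists <= delta))[0]
        w = 1.0 / (1 + len(nbrs))
        A[i, i] = w
        A[i, nbrs] = w
    return A

# One step of (15) on a toy data set: two clearly separated pairs of points.
S = np.array([[0.0, 0.0], [1.0, 0.0], [100.0, 0.0], [101.0, 0.0]])
A = update_matrix(S, delta=5.0)
print(A @ S)          # each pair collapses to its mean location
```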

4.7 The improvement of ESynC algorithm

An improved version of the ESynC algorithm, named the IESynC algorithm (its description is presented in Appendix 1 of the Supplementary material), is designed by combining a multidimensional grid partitioning method with a Red-Black tree structure to construct the near neighbor point sets of all points. The improvement, which decreases the time cost of constructing near neighbor point sets, is introduced in [6]; here we describe it as simply as possible. First, the data space of the data set S = {X_1, X_2, …, X_n} is partitioned with a multidimensional grid partitioning method, which yields the set of all grid cells. Then an effective index of all grid cells is designed, and the δ near neighbor grid cell set is constructed for each grid cell. If every grid cell uses a Red-Black tree to index its data points in each synchronization step, then constructing the δ near neighbor point set of every point becomes faster when the number of grid cells is suitable.

Another improved version of ESynC algorithm in both time cost and space cost is described in another paper.

5 Simulated experiments

5.1 Experimental design

Our experiments were run on a personal computer (Pentium(R) Dual-Core CPU E5400, 2.7 GHz, 2 GB memory). The experimental programs were developed in C and C++ under Visual C++ 6.0 on Windows XP.

To verify the improvements in clustering effect and time cost of the ESynC and IESynC algorithms, simulations on several artificial data sets, eight UCI data sets (Frank et al., 2010), and three bmp pictures are reported in Sections 5.2, 5.3, and 5.4.

Four kinds of artificial data sets (DS1 - DS4) were produced in a 2-D region [0, 600] × [0, 600] by a program presented in Appendix 2 of the Supplementary material. Eight further kinds of artificial data sets (DS5 - DS12) were produced in a range [0, 600] in each dimension by a similar program. Iris and seven other UCI data sets [12] are used in our experiments. Three bmp pictures (named Picture1, Picture2, and Picture3) were obtained from the Internet. The experimental data sets are described in Table 1 of Appendix 3 of the Supplementary material.

Table 1 Compare three synchronization clustering algorithms (SynC, ESynC, and IESynC) using four kinds of artificial data sets (DS1 - DS4). In Table 1, parameter δ = 18 and the number of data points n = 10000. In IESynC of Table 1, r_1 = r_2 = 10

In Section 5.2, ESynC algorithm and IESynC algorithm will be compared with SynC algorithm and some other classic clustering algorithms (K-Means [27], FCM [3], AP [13], DBSCAN [11], Mean Shift [9, 14]) in clustering quality and time cost using some artificial data sets.

In Section 5.3, ESynC algorithm will be compared with SynC algorithm and some other classic clustering algorithms in clustering quality and time cost using eight UCI data sets.

In Section 5.4, ESynC algorithm and IESynC algorithm will be compared with SynC algorithm and some other classical clustering algorithms in the compression effect of clustering results, clustering quality, and time cost using three bmp pictures.

In the experiments, parameter r_i (i = 1, 2, …, d) used in the IESynC algorithm is the interval length in the i-th dimension of a grid cell [5], and parameter δ used in SynC, ESynC, IESynC, DBSCAN, and Mean Shift is the threshold of Definition 1. In the DBSCAN algorithm, parameter MinPts = 4, and parameter Eps equals parameter δ.

The detailed discussion on how to construct grid cells and δ near neighbor point sets is given in [5]. How to select a proper value for parameter δ of the SynC algorithm is discussed in [4]; the ESynC algorithm can use the same method. In the IESynC algorithm, different values of parameter r_i (i = 1, 2, …, d) result in different numbers of grid cells and different time costs.

In our simulated experiments, the maximum number of synchronization steps in the while loop of the SynC, ESynC, and IESynC algorithms is set to 50.

The comparison results of these algorithms are presented in figures (Figs. 3, 4, 5 and Figs. 4 - 6 of Appendix 3 of the Supplementary material) and tables (Tables 1, 2, 3, 4, 5, 6). Performance is measured by time cost (in seconds). Clustering quality is measured by display figures and two robust information-theoretic measures, Adjusted Mutual Information (AMI) [43] and Normalized Mutual Information (NMI) [40]. According to [43], the higher the value of these two measures, the better the clustering quality. In the simulations, we use the MATLAB code from [43] to compute the two clustering quality measures.
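The paper uses the MATLAB code of [43]; for readers working in Python, equivalent measures are available in scikit-learn, as the short snippet below illustrates (the function names come from scikit-learn, not from [43]).

```python
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

# true_mem: ground-truth labels (predefined cluster or class labels);
# pred_mem: labels produced by a clustering algorithm.
true_mem = [0, 0, 0, 1, 1, 2, 2, 2]
pred_mem = [1, 1, 1, 0, 0, 2, 2, 1]

print("AMI =", adjusted_mutual_info_score(true_mem, pred_mem))
print("NMI =", normalized_mutual_info_score(true_mem, pred_mem))
```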

Fig. 3

Compare the clustering results of several algorithms (DS1, n = 400)

Fig. 4

Compare several algorithms in time cost using four artificial data sets (DS1 - DS4, n = 20000)

Fig. 5

Compare the original picture and several compressed pictures of Picture3. In Fig. 5, several compressed pictures based on different clustering algorithms are drawn by using the means of clusters that are obtained by clustering the 200 * 200 pixel points of Picture3 in RGB space

Table 2 Compare the clustering quality of several clustering algorithms (SynC, ESynC, IESynC, and some classical clustering algorithms) using six kinds of artificial data sets (DS2, DS4, DS5, DS6, DS7, and DS8). In Table 2, parameter δ = 18 in DS2, DS4, DS5, and DS6; parameter δ = 30 in DS7 and DS8
Table 3 Compare the valid interval of parameter δ among SynC, ESynC, IESynC, DBSCAN, and Mean Shift using four artificial data sets with different dimensions. In Table 3, n = 10000
Table 4 Compare two synchronization clustering algorithms (SynC and ESynC) using eight UCI data sets
Table 5 Compare the clustering quality of several clustering algorithms (SynC, ESynC, and some classical clustering algorithms) using eight UCI data sets
Table 6 Compare three synchronization algorithms (SynC, ESynC, and IESynC) using three picture data sets. In Table 6, parameter δ = 18 or 30 in SynC, ESynC, and IESynC; parameter r_i (i = 1, 2, 3) = 6 in IESynC

5.2 Experimental results of some artificial data sets (DS1 - DS12)

5.2.1 Comparison results among SynC algorithm, ESynC algorithm, and IESynC algorithm

Table 1 presents the comparison of the three synchronization clustering algorithms (SynC, ESynC, and IESynC) on four kinds of artificial data sets (DS1 - DS4). Comparing SynC, ESynC, and IESynC in Table 1, we observe that IESynC is the fastest of the three. At the same time, ESynC and IESynC obtain better local synchronization results than SynC on the four data sets.

5.2.2 Comparison results among SynC algorithm, ESynC algorithm, IESynC algorithm, and some classical clustering algorithms

Table 2 presents the clustering quality of several clustering algorithms (SynC, ESynC, IESynC, and some classical clustering algorithms) on six kinds of artificial data sets (DS2, DS4, DS5, DS6, DS7, and DS8). When computing the two information-theoretic measures, NMI and AMI, the predefined cluster labels of the eight artificial data set instances are used as true_mem, an input file of the MATLAB code [43]. Comparing SynC, ESynC, IESynC, and the classical clustering algorithms in Table 2, we observe that ESynC and IESynC obtain acceptable clustering results on these data sets. Because three of them (DS4 (n = 400), DS4 (n = 800), and DS5 (n = 10000)) contain two connected clusters, ESynC and IESynC do not obtain the largest values of NMI and AMI there.

Figure 3 and Figs. 4 - 6 of Appendix 3 of the Supplementary material present the clustering results of several clustering algorithms as display figures that reflect the clustering quality clearly. In these figures, parameter δ = 18 in SynC, ESynC, IESynC, DBSCAN, and Mean Shift, and the number of data points is n = 400.

From Fig. 3 and Figs. 4 - 6 of Appendix 3 of the Supplementary material, we observe that ESynC and IESynC obtain better clustering quality (obvious clusters or isolates displayed in the figures) than SynC, AP, K-Means, and FCM on the artificial data sets DS1 - DS4, while Mean Shift and DBSCAN obtain similar clustering quality to ESynC and IESynC on these data sets. In particular, SynC, ESynC, and IESynC can all easily find isolates if a proper value is set for parameter δ.

Figure 4 presents the time cost of several clustering algorithms. In Fig. 4, parameter δ = 18 in SynC, ESynC, IESynC, DBSCAN, and Mean Shift; the number of data points is n = 20000. In IESynC, r_1 = r_2 = 6.

In Fig. 4, comparing ESynC, IESynC, Mean Shift, DBSCAN, FCM, and K-Means, we observe that IESynC is faster than ESynC while producing the same clusters, and that K-Means is the fastest clustering algorithm.

5.2.3 Comparing the valid interval of parameter δ among SynC, ESynC, IESynC, DBSCAN, and Mean Shift using some artificial data sets (DS5 - DS8)

Here we compare the valid interval of parameter δ among SynC algorithm, ESynC algorithm, IESynC algorithm, DBSCAN algorithm, and Mean Shift algorithm.

Table 3 presents the comparison results of these clustering algorithms. Here, [e_k, e_{k+1}] can be obtained from (8) of [7]. Comparing SynC, ESynC, IESynC, DBSCAN, and Mean Shift in Table 3, we observe that the valid interval of parameter δ in ESynC and IESynC is longer than that in DBSCAN, that the valid interval of parameter δ in DBSCAN is consistent with [e_k, e_{k+1}], and that parameter δ in Mean Shift has the largest valid interval.

5.3 Experimental results of eight UCI data sets

Because we do not know the true dissimilarity measure of these UCI data sets, all points of the UCI data sets are standardized into the range [0, 600] in each dimension in this experiment. When computing the two information-theoretic measures (NMI and AMI), because we do not know the true cluster labels of these UCI data sets, their class labels are used as true_mem, an input file of the MATLAB code [43].

5.3.1 Comparison results between SynC algorithm and ESynC algorithm

Table 4 gives the comparison of the two synchronization clustering algorithms (SynC and ESynC) on eight UCI data sets. Comparing SynC with ESynC in Table 4, we observe that ESynC obtains better local synchronization results, fewer synchronization steps, and less clustering time than SynC on the eight UCI data sets.

5.3.2 Comparison results among SynC algorithm, ESynC algorithm, and some classical clustering algorithms

Table 5 gives the clustering quality of several clustering algorithms (SynC, ESynC, and some classical clustering algorithms) on eight UCI data sets. Comparing these algorithms in Table 5, we observe that the ESynC algorithm does not obtain the largest values of NMI and AMI except on the Cloud data set. We see three reasons for this. First, we use the Euclidean metric to compute the dissimilarity measure of the eight UCI data sets without any prior knowledge about them. Second, the largest values of NMI and AMI on some data sets do not necessarily indicate the best clustering quality. Third, the class labels of these UCI data sets, which are often not consistent with the actual distributions of clusters, are used as the benchmark of clusters in our simulations, because we have no better choice. Judging by the final number of clusters in Table 5, the ESynC algorithm obtains better clustering results than several other clustering algorithms on some UCI data sets.

5.4 Experimental results of three bmp pictures

The value in the RGB (Red, Green, Blue) color space of a pixel point lies in the range [0, 255] in each dimension. In the IESynC algorithm of Table 6 and Fig. 5, parameter r_i (i = 1, 2, 3) is set to 6.

5.4.1 Comparison results among SynC algorithm, ESynC algorithm, and IESynC algorithm

Table 6 presents the time cost and local synchronization results of the SynC, ESynC, and IESynC algorithms. The data sets in Table 6 are the pixel points of three bmp pictures. Comparing the three algorithms in Table 6, we observe that ESynC and IESynC are faster than SynC on these data sets, and that ESynC and IESynC obtain better local synchronization results than SynC.

5.4.2 Comparison results among SynC algorithm, ESynC algorithm, IESynC algorithm, and some classical clustering algorithms

Figure 5 shows the original picture and several compressed versions of Picture3. The compressed pictures are drawn using the means of the clusters obtained by clustering the 200 * 200 pixel points of Picture3 in RGB color space with different algorithms. Because the AP algorithm needs too much time and space for Picture3, it is not included in this experiment. From Fig. 5, we observe that the ESynC and IESynC algorithms can achieve multi-level compression effects for different values of parameter δ.

5.5 Analysis and conclusions of experimental results

From the comparison results in these figures and tables (Figs. 1, 2, 3, 4, 5, Figs. 4 - 6 of Appendix 3 of the Supplementary material, and Tables 1, 2, 3, 4, 5, 6), we observe that the ESynC algorithm, based on the linear version of the Vicsek model, not only achieves a better local synchronization effect but also needs fewer iterations and less time than the SynC algorithm, based on an extensive Kuramoto model, on almost all experimental data sets. We believe ESynC is superior to SynC for clustering because of its more suitable synchronization model.

From the simulations, we observe that the IESynC algorithm is faster than the ESynC algorithm in some cases. Usually, IESynC spends less time on certain kinds of data sets when proper values are selected for parameter r_i (i = 1, 2, …, d). The time cost of IESynC is sensitive to parameter r_i (i = 1, 2, …, d): if the number of grid cells is too small or too large, IESynC often obtains no obvious improvement in time cost. For this reason, we call IESynC a parametric algorithm.

From the simulations on four data sets (DS5 - DS8), we observe that the effective interval of parameter δ in the ESynC and IESynC algorithms is longer than that in the SynC and DBSCAN algorithms.

In the displayed figures, comparing SynC, ESynC, and IESynC, we observe that IESynC explores the same clusters and isolates as ESynC. On many kinds of data sets, ESynC and IESynC can explore obvious clusters or isolates if a proper value is selected for parameter δ, whereas SynC cannot explore obvious clusters on many data sets.

From simulations on several data sets, we observe that the number of iterations of SynC, AP, K-Means, and FCM is larger than that of ESynC and IESynC. On many data sets, ESynC, IESynC, and DBSCAN are better than SynC, K-Means, FCM, AP, and Mean Shift at exploring clusters and isolates. In particular, the AP algorithm has the largest time cost.

The values in RGB space of the pixel points of Picture3 are almost continuous and have no obvious clusters. In this case, the ESynC and IESynC algorithms achieve a more obvious multi-level compression effect than some other algorithms, such as K-Means and FCM. In the simulations, we also observe that the DBSCAN algorithm needs more space than the ESynC algorithm because of its recursive procedure.

Because of limited space, we present only some typical data sets (twelve kinds of artificial data sets, eight UCI data sets, and three bmp picture data sets) used in our experiments. On all experimental data sets, we observe that the ESynC algorithm improves on the SynC algorithm in local synchronization effect, number of iterations, and time cost, and we expect ESynC to remain superior to SynC for clustering on other data sets; we believe the selection of experimental data sets is not biased. Whether IESynC is faster than ESynC depends on the selected data, but IESynC explores the same clusters and isolates as ESynC in all cases.

6 Conclusions

This paper presents another synchronization clustering algorithm, ESynC, which obtains better clustering results than the original synchronization clustering algorithm, SynC. From the experimental results, we observe that the ESynC algorithm needs fewer iterations, clusters faster, and achieves better clustering quality than the SynC algorithm on many kinds of data sets. ESynC achieves a better local synchronization effect than SynC because of its more suitable synchronization model. ESynC can also obtain better or similar clustering quality, or faster clustering speed, than some classical clustering algorithms on some data sets.

To our knowledge, this paper is the first to use the linear version of the Vicsek model for clustering. The major contributions of this paper can be summarized as follows:

(1) It presents an Effective Synchronization Clustering algorithm (ESynC), which is an improved version of the SynC algorithm, using a linear version of the Vicsek model.

(2) It compares the linear version of the Vicsek model, the extensive Kuramoto model, and the original version of the Vicsek model for exploring clusters in the dynamic clustering process.

(3) It introduces an improved version of the ESynC algorithm in time cost, the IESynC algorithm, and validates that IESynC achieves some reduction in time cost on some data sets.

(4) It presents and validates a convergence criterion for dynamic clustering in synchronization clustering algorithms, the t-step average length of edges.

The ESynC and SynC algorithms use a global search strategy to construct the δ near neighbor point set of every point in each evolution step, so their time complexity is O(Tdn^2). The IESynC algorithm uses a local search strategy to construct the δ near neighbor point set of every point in each evolution step; in some cases its time complexity is O(Tdn log n), and in others it is near O(Tdn^2). Here n is the number of points, d is the number of dimensions, and T is the number of synchronization steps. For this reason, we call IESynC a parametric algorithm.

Like DBSCAN and SynC, the ESynC algorithm is robust to outliers and isolates. In the DBSCAN, Mean Shift, and ESynC algorithms, the number of clusters does not have to be fixed before clustering. Usually, parameter δ has a valid interval that can be determined by the two exploring methods presented in [7], the heuristic method described by Theorem 1 and Property 1, or the MDL-based method presented in [4]. Moreover, the valid interval of parameter δ in ESynC is often longer than that in DBSCAN. Compared with SynC, K-Means, FCM, and AP, the ESynC algorithm obtains better or similar clustering quality.

Although our algorithms show promising results, they still have some limitations. First, the time complexity of the ESynC algorithm is O(Tdn^2), which limits its applicability to big data. Second, IESynC is a parametric algorithm whose time cost is sensitive to parameter r_i (i = 1, 2, …, d). Third, like the DBSCAN algorithm and the CNNI algorithm [7], the ESynC algorithm is sensitive to parameter δ on some scattered data sets. Fourth, when there is much noise and few obvious clusters, the DBSCAN and ESynC algorithms cannot generate multi-level clusters because parameter δ is fixed before clustering.

This work opens some possibilities for further improvement and investigation. First, more comparison experiments can be done; for example, in constructing δ near neighbor point sets, comparing our improved method based on the δ near neighbor grid cell set [5] and the Red-Black tree with R-tree, SR-tree, and other spatial index structures would be valuable in practice. Second, the ESynC algorithm can be further improved in time cost; for example, designing a similarity-preserving hashing function with O(1) lookup would be valuable for constructing δ near neighbor point sets. Third, the applicability and clustering effect of our algorithms on high-dimensional data deserve further exploration. Fourth, simpler and more suitable methods for estimating parameter δ should be investigated. Fifth, the ESynC algorithm has some similarity to the Mean Shift algorithm, but they differ essentially: ESynC is a dynamic synchronization clustering algorithm, whereas Mean Shift is a clustering algorithm based on a non-parametric density model. It is therefore worth exploring the relation between the ESynC and Mean Shift algorithms further.