1 Introduction

Clustering is one of the most popular techniques in data mining, where the goal is to partition a set of unlabeled data objects into a number of groups of similar objects. Each group, called a cluster, contains objects that are similar to one another and different from the objects in other clusters. Clustering techniques have been used in many different applications, such as machine learning, image segmentation, web mining, bioinformatics and economics.

So far, many clustering techniques have been proposed [1]. In general, these techniques can be broadly classified into two main categories: hierarchical and non-hierarchical (or partitional) clustering techniques [2].

Hierarchical clustering techniques organize data into a hierarchical structure and do not need to determine the number of clusters in advance. These techniques are mainly classified as agglomerative and divisive techniques [3]. An agglomerative clustering technique is a “bottom up” approach which treats each data object as a singleton cluster in the beginning and then recursively merges two or more of the most appropriate clusters to form a larger cluster until an appropriate clustering result emerges. Divisive clustering is a “top down” approach and proceeds in the opposite way. In the beginning, all data objects belong to a single cluster, which is successively divided into smaller clusters until an appropriate clustering result is obtained.

On the other hand, partitional clustering techniques attempt to directly partition data objects into a set of disjoint clusters [4]. Partitional clustering is a combinatorial problem, which is a branch of discrete optimization problems. Also, in partitional clustering, the set of feasible solutions is finite and grows combinatorially with the problem size [5]. Several studies have used Particle Swarm Optimization (PSO) to solve Combinatorial Optimization Problems (COP) [6]. Clerc [6] provides several examples of PSO applied to combinatorial problems such as the knapsack, the traveling salesman, and the quadratic assignment problems. Recently, some studies have used combinatorial particle swarm optimization to solve partitional clustering problems. Jarboui et al. [7] proposed a clustering method based on Combinatorial Particle Swarm Optimization (CPSO) with a fixed number of clusters. In another study, Yucheng and Szu-Yuan [8] proposed a clustering method with a variable number of clusters, using the CPSO and K-means algorithms. A comprehensive review of evolutionary algorithms for the clustering problem can be found in [9].

Partitional clustering aims at optimizing the cluster centers and the number of clusters. One of the major drawbacks of the partitional approaches is the difficulty in determining the number of clusters [2]. In many clustering problems, the correct number of clusters is not known in advance and is difficult to estimate. Most clustering algorithms need the number of clusters to be determined in advance. A solution to this problem is to use dynamic clustering techniques, which have two general objectives: finding the optimal number of clusters and partitioning the data objects into clusters.

This paper proposes a combinatorial particle swarm optimization for dynamic data clustering. The proposed method improves the ideas presented by Jarboui et al. [7] and is called Combinatorial Particle Swarm Optimization II (CPSOII). It should be noted that CPSOII can be used for other COPs as well, with only some modifications. CPSOII automatically finds the best number of clusters and partitions the data objects into clusters effectively. Most existing methods for dynamic clustering, such as K-means and Combinatorial Particle Swarm Optimization (KCPSO) [8], Dynamic Clustering using Particle Swarm Optimization (DCPSO) [10] and Two-leveled Symbiotic Evolutionary Clustering Algorithm (TSECA) [11], divide the problem into two sub-problems: first, finding the number of clusters and, second, partitioning the data objects into clusters. They usually use the K-means clustering algorithm for the second objective; however, the initialization step of K-means may affect the clustering results [3] and degrade the general performance of the clustering method. CPSOII considers clustering as a single problem and simultaneously finds both the number of clusters and the corresponding clustering results.

Like most existing methods [7, 8, 12–16], CPSOII uses a label-based representation to encode each clustering solution. This representation is naturally redundant and can reduce the diversity of the population in swarm intelligence algorithms. Most existing methods, such as CPSO [7] and KCPSO [8], have not paid attention to this problem. In contrast, CPSOII proposes a novel renumbering procedure to solve it. The proposed renumbering procedure is used as a preprocessing step before applying the extended PSO operators, and has two distinct advantages. First, the CPSOII algorithm becomes independent of the different encodings of a clustering solution. Second, it increases the rate of distinct particles in the swarm; hence, the diversity of the population, the speed of convergence and the quality of solutions are increased.

The performance of CPSOII has been evaluated by both artificial and real-world data sets with respect to three clustering metrics, Sum of Squared Error (SSE), Variance Ratio Criterion (VRC) and Davies-Bouldin Index (DBI). The experimental results of CPSOII are compared with CPSO [7], Clustering Genetic Algorithm (CGA) [12] and K-means algorithms, when the number of clusters is known and with KCPSO [8] and CGA [12], when the number of clusters is unknown.

The experimental results show that CPSOII is very effective and robust and can solve clustering problems successfully with both known and unknown numbers of clusters. Comparing the obtained results of CPSOII with CPSO and other clustering techniques, such as KCPSO, CGA and K-means, reveals that CPSOII yields promising results, e.g., it improves the value of the DBI criterion by 9.26 % for the Hepato [17] data set. Additionally, the number of clusters found by the CPSOII algorithm is the closest to the number of classes in the different data sets. Furthermore, the rate of distinct particles in the swarm and the speed of convergence of CPSOII are higher than those of the other compared methods.

The rest of this paper is organized as follows. In Sect. 2, related works are briefly reviewed. In Sect. 3, clustering problem, particle swarm optimization and combinatorial PSO algorithm for partitional clustering problem are described as a background. In Sect. 4, CPSOII algorithm is described in detail. Section 5 presents the clustering criteria, implementations, benchmark data sets, experimental setups and the results obtained by CPSOII in comparison to the other clustering methods. Finally, conclusions are presented in Sect. 6.

2 Related works

Clustering techniques based on Evolutionary Computing (EC) and Swarm Intelligence (SI) algorithms have outperformed many classical clustering techniques [18]. There have been many studies in the literature on data clustering with evolutionary algorithms [9]. Although most of them assume a fixed number of clusters, there are methods with a variable number of clusters, in which the clustering algorithm attempts to find the optimal number of clusters. These works are briefly reviewed in this section.

Compared with other evolutionary algorithms, the Genetic Algorithm (GA) has been most frequently used for clustering problems. Hruschka et al. [12, 13] proposed a dynamic clustering algorithm based on GA. Their algorithm, called CGA, uses a label-based integer encoding, in which a chromosome is an integer vector of N positions. Each element of this vector is associated with a data object and takes a value (cluster label) over the alphabet {1,2,3,…,K}, where K is the number of clusters. In this clustering algorithm, a limited crossover and a proportional selection are used. Bandyopadhyay and Maulik [19] proposed a GA-based algorithm to solve dynamic clustering problems, called Genetic Clustering for Unknown K (GCUK). Liu et al. [20] developed a GA-based clustering method called Automatic Genetic Clustering for Unknown K (AGCUK). In this method, noisy selection and division–absorption mutation were designed to keep a balance between selection pressure and population diversity.

PSO-based clustering methods focus on modifying the PSO algorithm to solve clustering problems. Jarboui et al. [7] presented a discrete PSO algorithm called CPSO to solve partitional clustering problems. This method uses the label-based representation to encode clustering solutions, and the number of clusters is known or set in advance. Karthi et al. [21] proposed a Discrete Particle Swarm Optimization Algorithm (DPSOA) for data clustering. In DPSOA, particles are represented as groups of data objects called cluster blocks. Also, Omran et al. [10] proposed a dynamic clustering approach based on PSO. Their algorithm, called DCPSO, was applied to unsupervised image classification. In DCPSO, binary PSO is used to find the best number of clusters, and the centers of the chosen clusters are then refined via the K-means clustering algorithm. Yucheng and Szu-Yuan [8] used the CPSO [7] and K-means algorithms for dynamic data clustering. Their algorithm was called KCPSO. In each iteration of KCPSO, the CPSO algorithm is used to optimize the number of clusters, and then K-means is used to find the best clustering result. The authors compared KCPSO with DCPSO and GCUK, and the results revealed that in most cases KCPSO outperforms DCPSO and GCUK in both finding the best number of clusters and the clustering results. Latif et al. [22] presented a dynamic clustering method based on binary multi-objective PSO, called Dynamic Clustering using Binary Multi-Objective Particle Swarm Optimization (DCBMPSO). Paoli et al. [23] also used multi-objective PSO for dynamic clustering. This method was applied to hyper-spectral images as a case study. In this method, each particle position is represented as a vector of 2×K×D+D elements, where K is the number of clusters represented by a particle, and D is the number of dimensions of the data objects. The first K×D elements of the particle position are the cluster means, the next K×D elements are the cluster variances, and the last D elements determine the selected and unselected dimensions of the data objects for clustering.

Hybrid evolutionary algorithms have also been used for clustering. Niknam et al. [4] presented a hybrid evolutionary algorithm based on PSO and Simulated Annealing (SA) to find optimal cluster centers. In another study, Niknam and Amiri [24] proposed a cluster analysis optimization algorithm based on a combination of PSO, Ant Colony Optimization (ACO) and K-means. Siriporn and Kim [25] also proposed a combination of GA, ACO and fuzzy c-means for the partitional clustering problem.

3 Background

In this section, we describe the clustering problem, particle swarm optimization and combinatorial PSO algorithm for partitional clustering problems.

3.1 Clustering problem

The clustering problem is formally defined as follows:

Consider a set of N d-dimensional data objects O={O 1,O 2,…,O N}, where \(O_{i}=(o_{i1},o_{i2},\ldots,o_{id})\). Each o ij is called a feature (attribute, variable, or dimension) and represents the value of data object i at dimension j.

Given the set of data objects O, the goal of partitional clustering is to divide the data objects into K clusters {C 1,C 2,…,C K} that satisfy the following conditions:

  (a) \(C_{i}\neq\emptyset\), i=1,…,K

  (b) \(\bigcup_{i=1}^{K} C_{i}= O\)

  (c) \(C_{i}\cap C_{j}=\emptyset\), i,j=1,…,K and i≠j

In general, the data objects are assigned to clusters based on a similarity measure. In this paper, we use Euclidean distance [3] for this purpose.

3.1.1 Encoding schemes

Several encoding schemes for clustering solutions have been proposed in the literature. Hruschka et al. [9] categorized them into three types: binary, integer and real. In this paper, CPSOII uses a label-based representation. This representation is an important integer encoding, in which a solution is an integer vector of N labels; each label corresponds to a particular data object and determines the cluster number of that data object in the solution [9]. However, the label-based representation is naturally redundant and one-to-many, i.e., there are up to K! different encodings that represent the same clustering solution [9]. Figure 1 shows an example of the label-based representation with redundant solutions; a minimal code sketch follows the figure. The problem can be solved by applying a renumbering procedure [26].

Fig. 1
figure 1

An example of the label-based integer encoding and redundant solutions
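To make the redundancy concrete, the following minimal Python sketch (ours, not from the paper) shows two label vectors that encode the same partition and a simple first-appearance relabeling that maps both to one canonical form. Note that CPSOII's own renumbering procedure (Sect. 4.4.1) is different: it matches clusters by similarity to a reference particle rather than by order of appearance.

```python
# A minimal sketch of label-based redundancy: permuting cluster labels
# yields a different vector that represents the same partition.

def canonical_labels(labels):
    """Relabel clusters in order of first appearance, so that all
    encodings of the same partition map to one canonical vector."""
    mapping, out = {}, []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping) + 1  # cluster numbers start at 1
        out.append(mapping[lab])
    return out

a = [1, 1, 2, 2, 3]
b = [3, 3, 1, 1, 2]  # same partition as `a`, labels permuted
assert canonical_labels(a) == canonical_labels(b) == [1, 1, 2, 2, 3]
```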

3.2 Particle swarm optimization (PSO)

PSO is an evolutionary optimization technique introduced by Kennedy and Eberhart [27], which is inspired by the social behavior of bird flocking and fish schooling. PSO is a population-based search method, where the individuals, referred to as particles, are grouped into a swarm. Each particle represents a candidate solution to the optimization problem and is determined by its position and velocity. In addition, each particle has memory and retains its best experience (Pbest).

PSO combines self-experience with social experience. In this algorithm, a swarm of particles flies through the search space. However, this search process is not carried out entirely randomly; the position of a particle is influenced by the following factors: the best position visited by itself (i.e., its own best experience), the position of the best particle in its neighborhood (i.e., the best social experience), and its current velocity. When a particle takes the entire population as its neighbors, the best value is a global best (called Gbest), and when it takes a smaller group as its neighbors, the best value is a local best, called Lbest. The performance of each particle is measured according to a predefined fitness function.

During the search process, the particles are moved according to the following equations:

$$v_{i}^{t+1} = w\,v_{i}^{t} + c_{1} r_{1}\bigl(\mathit{Pbest}_{i}^{t} - x_{i}^{t}\bigr) + c_{2} r_{2}\bigl(\mathit{Gbest}^{t} - x_{i}^{t}\bigr) \quad (1)$$
$$x_{i}^{t+1} = x_{i}^{t} + v_{i}^{t+1} \quad (2)$$

where \(x_{i}^{t}\) and \(v_{i}^{t}\) are the current position and the velocity of the ith particle at iteration t, respectively. Gbest t and \(\mathit{Pbest}_{i}^{t}\) are the global best and the personal best position of the ith particle during iterations 1 to t, respectively. w is the inertia weight that controls the impact of the previous velocities on the current velocity. r 1 and r 2 are uniformly distributed random variables in the range [0,1] that provide stochastic weighting of the different components participating in the particle velocity definition. c 1 and c 2 are factors that determine the impact of the personal best and the global best, respectively. After updating the velocity and position of a particle, its Pbest is updated according to (3).

$$\mathit{Pbest}_{i}^{t+1} = \begin{cases} x_{i}^{t+1}, & \text{if } f\bigl(x_{i}^{t+1}\bigr) < f\bigl(\mathit{Pbest}_{i}^{t}\bigr)\\ \mathit{Pbest}_{i}^{t}, & \text{otherwise} \end{cases} \quad (3)$$

where \(f(x_{i}^{t+1})<f(\mathit{Pbest}_{i}^{t})\) means that the new position \(x_{i}^{t+1}\) is better than the current Pbest of the ith particle. After updating the velocity, position and Pbest of all particles, the particle with the best fitness is selected as Gbest t+1. These operations are repeated until a termination criterion is met (e.g., a maximum number of iterations is performed, or an adequate fitness is reached).
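For illustration, the following Python sketch implements one iteration of (1)-(3) for a minimizing swarm; the function names and the parameter values (w=0.7, c1=c2=1.5) are ours and merely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    """Example fitness to minimize: row-wise squared norm."""
    return (x ** 2).sum(axis=1)

def pso_step(x, v, pbest, gbest, f, w=0.7, c1=1.5, c2=1.5):
    """One iteration of (1)-(3); x, v and pbest are (M, d) arrays."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. (1)
    x = x + v                                                  # Eq. (2)
    better = f(x) < f(pbest)                                   # Eq. (3)
    pbest = np.where(better[:, None], x, pbest)
    gbest = pbest[np.argmin(f(pbest))].copy()
    return x, v, pbest, gbest

x = rng.uniform(-5, 5, size=(30, 2))
v = np.zeros_like(x)
pbest = x.copy()
gbest = pbest[np.argmin(sphere(pbest))].copy()
for _ in range(100):
    x, v, pbest, gbest = pso_step(x, v, pbest, gbest, sphere)
print(sphere(gbest[None, :])[0])  # best fitness found
```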

The main strength of PSO is its fast convergence, in contrast to many global optimization algorithms such as GA, SA and Tabu Search (TS) [28]. But to apply PSO successfully, one of the key issues is determining how to map the problem solution onto the PSO particle, which has a great impact on feasibility and performance [29].

3.3 Combinatorial PSO

Jarboui et al. [7] presented a new discrete PSO algorithm called CPSO to solve partitional clustering problems. In this paper, the CPSO algorithm is referred to as CPSOI. Like many clustering algorithms, CPSOI needs the number of clusters to be determined in advance.

It is worth remembering that the original PSO, as opposed to CPSOI, was basically developed for continuous optimization problems. CPSOI essentially differs from the original (or continuous) PSO in two characteristics: the particle and velocity definitions. In the original PSO, a real-valued representation is employed to encode each particle, whereas CPSOI uses the label-based representation to encode clustering solutions. In contrast to the velocity definition of the original PSO given in (1), CPSOI uses (4) to update the velocity of each particle.

$$v_{ij}^{t} = w\,v_{ij}^{t-1} + c_{1} r_{1}\bigl(-1 - Y_{ij}^{t-1}\bigr) + c_{2} r_{2}\bigl(1 - Y_{ij}^{t-1}\bigr) \quad (4)$$

where

$$Y_{ij}^{t-1} = \begin{cases} 1, & \text{if } x_{ij}^{t-1} = \mathit{Gbest}_{j}^{t-1}\\ -1, & \text{if } x_{ij}^{t-1} = \mathit{Pbest}_{ij}^{t-1}\\ 0, & \text{otherwise} \end{cases}$$

and \(Y_{i}^{t}\) is a dummy variable used to permit the transition from the combinatorial state to the continuous state and vice versa. The change of the velocity \(v_{ij}^{t}\) depends on the value of \(Y_{i}^{t-1}\). It should be noted that if the velocity of a particle is more than the maximum user-defined speed limit (V max ) or less than the minimum user-defined speed limit (V min ), it is set to V max or V min , respectively.

After updating the velocity, the position of each particle is updated according to (5).

$$x_{ij}^{t} = \begin{cases} \mathit{Gbest}_{j}^{t-1}, & \text{if } Y_{ij}^{t} = 1\\ \mathit{Pbest}_{ij}^{t-1}, & \text{if } Y_{ij}^{t} = -1\\ \text{a random cluster number}, & \text{otherwise} \end{cases} \quad (5)$$

where

$$Y_{ij}^{t} = \begin{cases} 1, & \text{if } \lambda_{ij}^{t} > \alpha\\ -1, & \text{if } \lambda_{ij}^{t} < -\alpha\\ 0, & \text{otherwise} \end{cases}, \qquad \lambda_{ij}^{t} = Y_{ij}^{t-1} + v_{ij}^{t}$$

and α is a parameter, manually determined by experts, for balancing intensification and diversification.
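The following Python sketch (ours) shows one CPSOI move for a single particle, following (4) and (5) as reconstructed above; the tie-breaking when an element equals both Pbest and Gbest, and the concrete parameter values, are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def cpsoi_move(x, v, pbest, gbest, K, w=0.7, c1=1.5, c2=1.5,
               alpha=0.5, v_min=-4.0, v_max=4.0):
    """One CPSOI move: x, pbest and gbest are integer label vectors of
    length N (NumPy arrays); v is the real-valued velocity vector."""
    y = np.zeros(len(x))
    y[x == pbest] = -1.0
    y[x == gbest] = 1.0                    # assumption: Gbest wins ties
    r1, r2 = rng.random(len(x)), rng.random(len(x))
    v = w * v + c1 * r1 * (-1.0 - y) + c2 * r2 * (1.0 - y)     # Eq. (4)
    v = np.clip(v, v_min, v_max)           # enforce the speed limits
    lam = y + v                            # dummy variable of Eq. (5)
    x_new = rng.integers(1, K + 1, size=len(x))  # otherwise: random label
    x_new[lam > alpha] = gbest[lam > alpha]      # pulled toward Gbest
    x_new[lam < -alpha] = pbest[lam < -alpha]    # pulled toward Pbest
    return x_new, v
```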

4 Proposed CPSOII

CPSOI has several shortcomings. First, it needs the number of clusters to be determined in advance. Second, it uses the label-based representation to encode clustering solutions; therefore, the particles of the swarm are prone to redundancy. Third, during the evolution process, it may generate invalid particles, because the number of clusters is fixed and it is possible to generate an empty cluster. CPSOII is an extension of the original PSO algorithm for solving partitional clustering problems; it automatically finds the best number of clusters and simultaneously categorizes the data objects. The general process of the CPSOII algorithm is given in Fig. 2.

Fig. 2
figure 2

General process of CPSOII algorithm

The CPSOII algorithm consists of seven steps. In the first step, algorithm parameters, such as the swarm size, the maximum number of iterations, and the parameters used in the velocity equation, are initialized. In the second step, the initial particles of the swarm are generated and the Pbest of each particle is initialized with the current position of the corresponding particle. This step includes two activities: generating particles using the K-means algorithm and generating particles randomly. After generating the initial particles, the fitness of each particle is calculated in the third step. Thereafter, in the fourth step, the particle with the best fitness is selected as the Gbest particle. In the fifth step, the position and velocity of each particle \(X_{i}^{t}\) are updated. This step has four sub-steps. First, the renumbering procedure is used to renumber the Pbest and Gbest clusters according to their similarity with the clusters of \(X_{i}^{t}\). Then, the velocity equation of CPSOII (see (6)) is used to compute the velocity of the particle, and the particle is moved to a new position. Finally, the validity of the particle is evaluated and, if it is invalid, the correction procedure is used to make it valid. After updating the velocity and position of each particle according to the fifth step, the fitness of all particles is computed again in the sixth step. Finally, in the seventh step, the Gbest and the Pbest of each particle are determined. Steps 1–4 are executed only once, whereas steps 5–7 are repeated until the termination condition is satisfied.

The main concepts of CPSOII and its steps are described in the following sections.

4.1 Cluster encoding in particles

CPSOII uses the label-based integer encoding to represent clustering solutions. With this scheme, the position X i of each particle is simply a vector of integers representing the cluster numbers of the data objects, and the velocity V i of each particle is a vector of integers representing the recent move. In this paper, we represent the swarm as a set of M particles. Each particle is determined by its position, velocity and Pbest. Each particle position X i and particle velocity V i , i=1,…,M, are characterized by N elements, x i1,…,x ij ,…,x iN and v i1,…,v ij ,…,v iN , where N is the number of data objects; x ij ∈{1,…,K i } represents the cluster number of the jth data object in the ith particle, and v ij ∈{0,…,K i } represents the recent move of the jth data object in the ith particle. A velocity value of 0 in an element means that the cluster number of the corresponding data object was not changed in the previous move. A representation of the particle position and velocity is shown in Fig. 3, and a code sketch of this structure follows the figure. It should be noted that K i is the number of clusters of particle i and is assumed to lie in the range [K min ,K max ], where the value of K min is 2 by default and the value of K max is \(\sqrt{N}+1\) by default [30], unless they are manually specified.

Fig. 3
figure 3

CPSOII particle position and velocity structure
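The sketch below (ours) builds one particle with exactly this structure; the defaults follow the text, with K max taken as \(\sqrt{N}+1\).

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def make_particle(n_objects, k_min=None, k_max=None):
    """One particle per Sect. 4.1: integer position labels in {1..K_i},
    zero velocity, and Pbest equal to the current position."""
    k_min = 2 if k_min is None else k_min
    k_max = int(math.sqrt(n_objects)) + 1 if k_max is None else k_max
    k = int(rng.integers(k_min, k_max + 1))      # K_i in [K_min, K_max]
    x = rng.integers(1, k + 1, size=n_objects)   # cluster label per object
    v = np.zeros(n_objects, dtype=int)           # 0 = "no recent move"
    return {"x": x, "v": v, "pbest": x.copy(), "k": k}

p = make_particle(100)  # K_max defaults to sqrt(100) + 1 = 11
```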

4.2 Swarm initialization

The most common technique in EC for initializing the population is random uniform initialization, in which each particle of the initial swarm and, consequently, the initial best positions are drawn by sampling a uniform distribution over the search space [31]. Random uniform initialization is an unguided initialization process for generating the initial swarm. If items of information are available regarding the location of the global minimizer in the search space, the swarm can instead be initialized around the global minimizer, generating better particles; this is called guided initialization.

The CPSOII algorithm uses both guided and unguided initialization processes to generate the initial swarm of particles. The flowchart of the swarm initialization is shown in Fig. 4. The guided initialization in the CPSOII algorithm is optional; if it is not used, only the unguided initialization is used to generate all particles of the swarm. When the guided initialization is used, the initial swarm of particles is partitioned into two groups: the first group is generated by the K-means algorithm (guided initialization) and the second one is randomly generated (unguided initialization). The maximum number of particles in the first group is L=K max −K min +1 (note that if the maximum swarm size (M) is smaller than L, then L=M). The K-means algorithm is run L times, each time with 5 iterations. Each run of the K-means algorithm generates one particle with K clusters, where the value of K for run i is K min +i−1. The initial centers of the K-means algorithm are randomly generated; thereafter, each data object is assigned to the closest cluster center, based on the Euclidean distance.

Fig. 4
figure 4

Flowchart of swarm initialization

The second group of particles is randomly generated. The number of particles in the second group is M−L. For particle i of the second group, the number of clusters K i is randomly generated in the range [K min ,K max ]; then, each data object of the data set is randomly assigned to one cluster.

After generating the initial particles of the swarm, the initial velocity of each particle is set to zero and the initial Pbest of each particle is set to its current position.
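A sketch of the initialization of Fig. 4, in which scikit-learn's KMeans (with max_iter=5 and random initial centers) stands in for the paper's own K-means runs; the helper name init_swarm is ours.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

def init_swarm(data, M, k_min, k_max, guided=True):
    """Sketch of Fig. 4: up to L = K_max - K_min + 1 particles from short
    K-means runs (guided); the remaining M - L particles random (unguided)."""
    swarm = []
    if guided:
        L = min(k_max - k_min + 1, M)
        for i in range(L):
            k = k_min + i
            km = KMeans(n_clusters=k, init="random", n_init=1,
                        max_iter=5).fit(data)
            swarm.append(km.labels_ + 1)         # shift labels to 1..K
    while len(swarm) < M:
        k = rng.integers(k_min, k_max + 1)       # random K_i, then random
        swarm.append(rng.integers(1, k + 1, size=len(data)))  # assignment
    return swarm
```

Per the text, the velocity of every generated particle would then be set to zero and its Pbest to its current position.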

4.3 Fitness evaluation

In this paper, three different criteria, SSE, VRC, and DBI, are separately used to compute the fitness of a particle. SSE is a simple criterion and considers only the within-cluster distance. This criterion is appropriate for algorithms with a fixed number of clusters. VRC and DBI consider both within-cluster and between-cluster distances and are suitable for dynamic clustering algorithms. However, the DBI criterion has been most frequently used in dynamic clustering. Hence, only the DBI criterion will be used to compare CPSOII with dynamic clustering algorithms, and all three aforementioned criteria will be used to compare algorithms with a fixed number of clusters.

4.4 Particles moving

In the CPSOII algorithm, the movement of particles to a new position is performed in four steps, as shown in Fig. 2. These steps are described in more detail below.

4.4.1 Cluster renumbering

The goal of the cluster renumbering step is to remove the redundant particles described in Sect. 3.1.1 and to prepare the Pbest i and Gbest particles for calculating their difference with particle X i in step 5.2. Figure 5 presents the pseudocode of the proposed renumbering procedure. In part 5 of this procedure, the similarity degree between the clusters of the two input particles is calculated using a similarity function; then, in part 6, the two clusters with the highest similarity are matched to each other. Part 6 of the procedure is repeated R times, where R=min{K i ,K Gbest } (or \(R = \mathit{min} \{K_{i}, K_{\mathit{Pbest}_{i}} \}\)); in each iteration, two clusters of the two input particles are matched to each other and are not considered in subsequent iterations. Thereafter, in part 9, the clusters of Gbest (or Pbest i ) are renumbered according to their matches with the clusters of X i . Because the number of clusters varies between particles, it is possible that some clusters of Gbest or Pbest i will not be matched to any cluster of X i , or vice versa. In this case, after renumbering the matched clusters, the unmatched clusters take the unused numbers in the cluster numbering (part 10). Table 1 shows the list of binary similarity functions used in CPSOII. Choi et al. [32] collected and analyzed 76 binary similarity functions and distance measures used over the last century.

Fig. 5
figure 5

Pseudocode of the renumbering procedure

Table 1 Used similarity measures for binary data

Example 1

Figure 6 illustrates an example of the cluster renumbering process in the CPSOII algorithm. The similarity degree of Cluster1 of Gbest with Cluster1 of X i , using the Jaccard similarity function (see Table 1), is \(\frac{3}{4}\), which is the highest similarity degree in this example. Therefore, these clusters are matched to each other. After that, Cluster4 of Gbest and Cluster2 of X i , whose similarity of \(\frac{2}{3}\) is then the highest, are matched to each other. Finally, Cluster2 of Gbest is matched to Cluster3 of X i . Figure 6(c) illustrates the result of the matching process. After matching, the clusters of Gbest are renumbered to the corresponding matched cluster numbers. Cluster3 of Gbest is an unmatched cluster, so it takes an unused cluster number, which is 4 in this example.

Fig. 6
figure 6

Example of the cluster renumbering process in CPSOII

The renumbering procedure makes CPSOII insensitive to different encodings of clustering solutions.

Example 2

Consider two particles with the following encodings: [2222333111] and [1112222333]; each one represents three clusters. Assume that the first particle is the better solution. If the second particle moves toward the first with the encodings as they stand, it is diverted away from the better solution, because the two encodings are very different even though the underlying partitions are similar. Using the renumbering procedure, the encoding of the first particle is changed to [1111222333]; with this encoding, the movement of the second particle toward the first is more deliberate and rational than in the previous case. This is an important benefit, and it guides the algorithm toward better solutions.
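A Python sketch (ours) of the greedy matching of Fig. 5, using the Jaccard function from Table 1; the tie-breaking order and the exact pool of unused numbers are our assumptions. Applied to Example 2, it reproduces the renumbered encoding [1111222333].

```python
def jaccard(a, b):
    """Jaccard similarity of two clusters given as sets of object indices."""
    return len(a & b) / len(a | b)

def renumber(target, reference):
    """Relabel `target` (e.g. Gbest or Pbest_i) so that its clusters take
    the numbers of the most similar clusters of `reference` (X_i);
    unmatched clusters then take the unused numbers."""
    t_clusters = {k: {i for i, l in enumerate(target) if l == k}
                  for k in set(target)}
    r_clusters = {k: {i for i, l in enumerate(reference) if l == k}
                  for k in set(reference)}
    mapping = {}
    while t_clusters and r_clusters:
        # match the currently most similar pair of clusters
        tk, rk = max(((tk, rk) for tk in t_clusters for rk in r_clusters),
                     key=lambda p: jaccard(t_clusters[p[0]], r_clusters[p[1]]))
        mapping[tk] = rk
        del t_clusters[tk], r_clusters[rk]
    pool = set(range(1, len(set(target)) + len(set(reference)) + 1))
    unused = iter(sorted(pool - set(mapping.values())))
    for tk in t_clusters:                # leftover, unmatched clusters
        mapping[tk] = next(unused)
    return [mapping[l] for l in target]

gbest = [2, 2, 2, 2, 3, 3, 3, 1, 1, 1]
x_i   = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
assert renumber(gbest, x_i) == [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]  # Example 2
```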

4.4.2 Velocity computation

In the PSO algorithm and its extensions, it is the velocity vector that drives the optimization process. In the CPSOII algorithm, the velocity of each particle is modified by (6).

$$V_{i}^{t+1} = \bigl(W \otimes V_{i}^{t}\bigr) \oplus \bigl(R_{1} \otimes \bigl(\mathit{Pbest}_{i}^{t} \ominus X_{i}^{t}\bigr)\bigr) \oplus \bigl(R_{2} \otimes \bigl(\mathit{Gbest}^{t} \ominus X_{i}^{t}\bigr)\bigr) \quad (6)$$

where \(X_{i}^{t}\) and \(V_{i}^{t}\) are the position and velocity of particle i at iteration t, respectively. \(\mathit{Pbest}_{i}^{t}\) and Gbest t are the best positions obtained by particle i and by the swarm of particles up to iteration t, respectively. W, R 1 and R 2 are vectors of size N comprising 0 or 1 elements, whose values are randomly generated with probability w, r 1 and r 2, respectively. The values of w, r 1 and r 2 have a major effect on the performance of CPSOII.

In (6), the Difference (⊖), Multiply (⊗) and Merge (⊕) operators are introduced. Although the Difference operator is approximately similar to the calculation of y in the CPSOI algorithm mentioned in (4), the Multiply and Merge operators are quite different from the operators used by CPSOI. In addition, the types of the velocity vector elements of CPSOI and CPSOII are different: in CPSOI, the velocity vector has N real elements, whereas CPSOII uses a velocity vector with N integer elements. The operators used in the body of (6) are defined as follows:

Difference operator (⊖)

This operator computes the difference between the current position of the ith particle, \(X_{i}^{t}\), and the personal-best position (or the global-best position). The ⊖ operator is defined in (7) and (8). We use \(\lambda P_{i}^{t}\) to denote the difference of particle \(X_{i}^{t}\) and its \(\mathit{Pbest}_{i}^{t}\) position, and \(\lambda G_{i}^{t}\) for \(X_{i}^{t}\) and the Gbest t position.

$$\lambda P_{i}^{t} = \mathit{Pbest}_{i}^{t} \ominus X_{i}^{t} \quad (7)$$

where

$$\lambda p_{ij}^{t} = \begin{cases} \mathit{Pbest}_{ij}^{t}, & \text{if } \mathit{Pbest}_{ij}^{t} \neq x_{ij}^{t}\\ 0, & \text{otherwise} \end{cases}$$

$$\lambda G_{i}^{t} = \mathit{Gbest}^{t} \ominus X_{i}^{t} \quad (8)$$

where

$$\lambda g_{ij}^{t} = \begin{cases} \mathit{Gbest}_{j}^{t}, & \text{if } \mathit{Gbest}_{j}^{t} \neq x_{ij}^{t}\\ 0, & \text{otherwise} \end{cases}$$
The difference between particle position \(X_{i}^{t}\) and \(\mathit{Pbest}_{i}^{t}\) (or Gbest t) is a vector of N elements. If each nonzero element of the difference is substituted for the corresponding element of \(X_{i}^{t}\), then the position of \(X_{i}^{t}\) is modified to the position of \(\mathit{Pbest}_{i}^{t}\) (or Gbest t). The CPSOII algorithm uses this property to move toward a desired position \(\mathit{Pbest}_{i}^{t}\) (or Gbest t). According to this property, the difference between two particles with the same position is a vector of N zero elements.

Multiply operator (⊗)

This operator manages the exploration and exploitation abilities of the swarm, using the W, R 1 and R 2 vectors. The values of these vectors are controlled by the w, r 1 and r 2 variables. Generally, the value of w controls the movement of particles in the previous direction, and the values of r 1 and r 2 control the movement of particles in the direction of Pbest and Gbest, respectively. The ⊗ operator is equivalent to the Hadamard product, which is defined in (9).

$$A \otimes B = (a_{1}b_{1}, a_{2}b_{2}, \ldots, a_{N}b_{N}) \quad (9)$$

where A and B are two vectors of the same size, and a i and b i are the ith elements of vectors A and B, respectively.

Merge operator (⊕)

This operator merges two vectors of N elements and is defined by (10).

$$C = A \oplus B \quad (10)$$

where

$$c_{j} = \begin{cases} a_{j}, & \text{if } a_{j} \neq 0\\ b_{j}, & \text{otherwise} \end{cases}$$
In the CPSOII algorithm, the Merge operator is used to merge the three components of the velocity equation (see (6)).
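The three operators translate directly into element-wise array operations. In the sketch below (ours), the Merge operator gives priority to its first operand and (6) is associated left to right; both details are assumptions, since the paper's exact conventions are not reproduced here. The default probabilities are taken from the tuning ranges reported in Sect. 5.4.

```python
import numpy as np

rng = np.random.default_rng(4)

def difference(best, x):
    """Eqs. (7)-(8): keep the element of `best` where it differs from x."""
    return np.where(best != x, best, 0)

def multiply(prob, vec):
    """Eq. (9): Hadamard product with a random 0/1 vector whose elements
    are 1 with probability `prob` (this realizes W, R1 and R2)."""
    mask = (rng.random(vec.shape) < prob).astype(int)
    return mask * vec

def merge(a, b):
    """Eq. (10): element-wise merge, first operand taking priority."""
    return np.where(a != 0, a, b)

def velocity(x, v, pbest, gbest, w=0.7, r1=0.45, r2=0.65):
    """Eq. (6): V = (W ⊗ V) ⊕ (R1 ⊗ (Pbest ⊖ X)) ⊕ (R2 ⊗ (Gbest ⊖ X))."""
    return merge(multiply(w, v),
                 merge(multiply(r1, difference(pbest, x)),
                       multiply(r2, difference(gbest, x))))
```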

4.4.3 Particle moving

After computing the velocity of a selected particle, the new position of the particle is generated based on (11).

$$x_{ij}^{t+1} = \begin{cases} v_{ij}^{t+1}, & \text{if } v_{ij}^{t+1} \neq 0\\ r, & \text{otherwise} \end{cases} \quad (11)$$

where r is an integer random value uniformly distributed in the range [1,K i +1] and K i +1<K max . With this range, the CPSOII algorithm acquires the ability to add a new cluster. However, the difference in the number of clusters of X i with Pbest i and Gbest can also cause some clusters in X i to be added or removed.

In the original PSO, after the velocity of each particle is computed, it is added to the current position of the particle to generate the new position. In CPSOII, the velocity vector contains the recent movements of each particle. This means that the non-zero elements of the velocity vector are the same as the corresponding elements of the new position vector. With this property of the velocity vector, we effectively fold the addition in (2) of the original PSO into the velocity equation (6) of CPSOII; therefore, there is no Merge operator in (11).

Example 3

Figure 7 illustrates an example of creating a new position and velocity of particle \(X_{i}^{t}\). Figure 7(b, c, d) illustrates how the velocity operators (⊖, ⊗ and ⊕) are performed. Figure 7(e) shows the particle movement and the generation of its new position.

Fig. 7
figure 7

An example of creating a new position and velocity of particle X i

4.4.4 Position validation

To avoid generating solutions with empty clusters, the new position of each particle is checked. Then, if there are empty clusters, they are removed by the renumbering process. In the label-based representation, a particle is invalid if the number of its clusters is smaller than the largest cluster number used in the cluster encoding. In each iteration of the correction procedure, the cluster with the largest cluster number is renumbered to the smallest unused cluster number.
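A minimal Python sketch (ours) of this correction loop:

```python
def correct(labels):
    """Remove empty clusters: while the largest cluster number exceeds the
    number of non-empty clusters, renumber the cluster with the largest
    number to the smallest unused number."""
    labels = list(labels)
    while max(labels) > len(set(labels)):
        unused = min(set(range(1, max(labels) + 1)) - set(labels))
        largest = max(labels)
        labels = [unused if l == largest else l for l in labels]
    return labels

# e.g. with clusters 1 and 3 empty:
# correct([6, 2, 4, 2, 5, 4]) == [1, 2, 4, 2, 3, 4]
```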

Example 4

Figure 8 illustrates, as an example of performing validation via the renumbering process, an invalid particle in which clusters 1 and 3 are empty. In this figure, particle \(X_{i}^{t}\) is invalid because it has just 4 clusters, while the largest cluster number is 6.

Fig. 8
figure 8

A sample of invalid solution and applying the correction process

5 Experimental results

In this section, the performance of five clustering algorithms, namely CPSOII, CPSOI [7], CGA [12], KCPSO [8] and K-means, is compared. The criteria used for comparing the results are SSE, VRC and DBI.

5.1 Clustering criteria

SSE [3] is one of the most common partitional clustering criteria and its general objective is to obtain a partition which minimizes the squared error. This criterion is defined in (12).

$$\mathit{SSE} = \sum_{i=1}^{K} \sum_{O \in C_{i}} \lVert O - c_{i} \rVert^{2} \quad (12)$$

where

$$c_{i} = \frac{1}{N_{i}} \sum_{O\in C_{i}}O$$

is the mean of cluster C i (the cluster centroid) and N i denotes the number of data objects belonging to cluster C i .

SSE is an appropriate measure when the number of clusters is known. But for dynamic clustering, or when the number of clusters is unknown, SSE cannot be used, because its value decreases as the number of clusters increases; if the number of clusters equals the number of objects, the value of SSE is zero.

VRC [33] is another criterion which can be used for cluster validation. This criterion considers both the within-cluster and between-cluster distances. The VRC is defined in (13).

$$\mathit{VRC} = \frac{\mathit{Inter}}{\mathit{Intra}} \times \frac{N-K}{K-1} \quad (13)$$

where

$$\mathit{Inter} = \sum_{i=1}^{K} N_{i}\, \lVert c_{i} - c \rVert^{2}, \qquad \mathit{Intra} = \mathit{SSE}$$

The ratio (N−K)/(K−1) is a normalization term that prevents VRC from increasing monotonically with the number of clusters. Inter is the between-cluster distance and c is the mean of all data objects (i.e., the data centroid). According to (13), a larger value of VRC indicates a better clustering result.

One of the popular criteria for the evaluation of clusters is DBI [34]. It combines inter (between-cluster separation) and intra (within-cluster scatter) cluster distances. The within-cluster scatter for cluster C i is defined in (14) and the between-cluster separation for two clusters C i and C j is defined in (15).

$$S_{i,q} = \biggl(\frac{1}{N_{i}} \sum_{O \in C_{i}} \lVert O - c_{i} \rVert^{q}\biggr)^{1/q} \quad (14)$$

$$d_{ij,t} = \lVert c_{i} - c_{j} \rVert_{t} = \biggl(\sum_{l=1}^{d} \lvert c_{il} - c_{jl} \rvert^{t}\biggr)^{1/t} \quad (15)$$

S i,q denotes the qth root of the qth moment of the objects belonging to cluster C i with respect to their mean, and d ij,t is the Minkowski distance of order t between the centroids of clusters C i and C j . In (14) and (15), q and t are integers (q,t≥1) that can be selected independently. Finally, DBI is defined as:

$$\mathit{DBI} = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt} \quad (16)$$

where

$$R_{i,qt} = \max_{j,j\neq i}\biggl\{\frac{S_{i,q}+ S_{j,q} }{d_{ij,t}} \biggr\}.$$

A smaller value of DBI indicates a better clustering result. The computational complexity of DBI is higher than that of SSE and VRC. However, it is an appropriate criterion for dynamic clustering and has recently been used for this purpose in [8, 11, 14, 20, 35].
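For concreteness, the three criteria can be computed as in the following sketch (ours), assuming data is an (N,d) NumPy array, labels is an integer vector, and taking q=t=2 in DBI (a common choice, not mandated by the text):

```python
import numpy as np

def sse(data, labels):
    """Eq. (12): sum of squared distances of objects to their cluster mean."""
    return sum(((data[labels == k] - data[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

def vrc(data, labels):
    """Eq. (13): variance ratio criterion (larger is better)."""
    n, ks = len(data), np.unique(labels)
    c = data.mean(axis=0)
    inter = sum(len(data[labels == k]) *
                ((data[labels == k].mean(axis=0) - c) ** 2).sum() for k in ks)
    return (inter / sse(data, labels)) * (n - len(ks)) / (len(ks) - 1)

def dbi(data, labels, q=2, t=2):
    """Eqs. (14)-(16): Davies-Bouldin index (smaller is better)."""
    ks = np.unique(labels)
    cents = [data[labels == k].mean(axis=0) for k in ks]
    S = [np.mean(np.linalg.norm(data[labels == k] - cents[i], axis=1) ** q)
         ** (1.0 / q) for i, k in enumerate(ks)]
    R = [max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j], ord=t)
             for j in range(len(ks)) if j != i) for i in range(len(ks))]
    return float(np.mean(R))
```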

5.2 Implementation

CPSOII is implemented in Microsoft Visual C#.Net 2010. All experiments conducted in this paper were run under Windows 7 on a desktop PC with an Intel Core i5 2.4 GHz processor and 4 GB of RAM. We also implemented the CPSOI (described in Sect. 3.3), KCPSO and CGA algorithms.

5.3 Experimental data

We evaluate CPSOII using artificial data sets (Table 2) and real-world data sets (Table 3). Note that (i) in Table 2, only Dataset 1 has non-overlapping clusters, and (ii) the data sets shown in Table 3 have been used as benchmarks for the evaluation of clustering methods in recent years.

Table 2 Description of the artificial data sets [36] used in [8, 19, 35]
Table 3 Description of the real-world data sets

5.4 Experiments setup

CPSOII has multiple parameters that must be set by the user, including the swarm size, the maximum number of iterations, w, r 1, r 2, K min , and K max . The values of the w, r 1 and r 2 parameters determine the exploration and exploitation abilities of the swarm. In the CPSOII algorithm, with high values of r 1 and r 2, particles tend to move toward Pbest and Gbest, respectively. With low values, particle motions depend on the value of w: if the value of w is small, the particle motions are more random, and if the value of w is large, the particle motions depend on the recent values of their velocities. To improve the efficiency and accelerate the search process, it is vital to determine the best value for each of these parameters. For this purpose, we ran the CPSOII algorithm with different parameter values (covering all values of these parameters in the range [0,1] with step 0.01) on all of the introduced data sets. Then, we selected the values with the best results. In this experiment, the other parameters were set as follows: the swarm comprised 150 particles, and the algorithm was run for 2500 iterations. Both K min and K max were set to the number of classes in the corresponding data sets. Figure 9 illustrates the impacts of the w, r 1 and r 2 parameters on the VRC criterion using the Hepato data set. The results show that a suitable value for r 1 is in the range [0.4,0.5] (Fig. 9(a)), for r 2 in the range [0.6,0.7] (Fig. 9(b)) and for w in the range [0.6,0.8] (Fig. 9(c)). We tested the other introduced data sets and obtained almost the same results. However, to obtain better results on other data sets, the values of w, r 1 and r 2 may be modified.

Fig. 9
figure 9

Effects of w, r 1 and r 2 parameters on the VRC criterion using Hepato data set (Average of 10 runs)

The parameter settings of CPSOI, KCPSO and CGA algorithms were determined by both referring to their original papers and performing empirical studies. Table 4 illustrates the parameter settings of CPSOI, CPSOII, KCPSO, CGA and K-means algorithms.

Table 4 Parameter settings of CPSOI, CPSOII, KCPSO, CGA and K-means algorithms

5.5 Experimentations and results

In this section, we describe the experiments carried out to test the performance of the CPSOII algorithm. For this purpose, we investigate the effectiveness of CPSOII in two cases: with an unknown and with a known number of clusters. In the first case, the performance of CPSOII is compared with KCPSO [8] and CGA [12]. In the second case, the performance of CPSOII is compared with the CPSOI [7], CGA [12] and K-means clustering algorithms.

Thereafter, we compare the speed of convergence of the CPSOII, CPSOI, CGA and KCPSO algorithms toward qualified solutions, both with and without using the K-means algorithm for the swarm initialization (Guided and Unguided initialization). Finally, we evaluate the effect of the renumbering procedure and of various similarity functions on the quality of solutions and the rate of distinct particles.

In all cases, each algorithm was run 10 times and the obtained results are reported. Note that the run-time of each algorithm is reported at the point when the fitness value of the evolutionary algorithms (CPSOII, CPSOI, KCPSO and CGA) has not changed for 50 iterations.

5.5.1 Experimental results with unknown number of clusters

CPSOII is a dynamic clustering method and automatically finds the best number of clusters during the clustering process. In this section, we compare CPSOII with KCPSO [8] and CGA [12] when the number of clusters is unknown. It should be mentioned that KCPSO uses K-means in its initialization phase; therefore, the Guided & Unguided initialization is not applicable to KCPSO. For this reason, we executed KCPSO without any modifications. The results provided by CPSOII and their comparison with the KCPSO and CGA algorithms are shown in Tables 5 and 6. For each data set, the best results obtained by the algorithms are highlighted in bold face. The differences between the number of classes and the \(\operatorname{Avg} K\) of CPSOII, KCPSO and CGA for the artificial and real-world data sets are shown in Figs. 10 and 11, respectively.

Fig. 10
figure 10

Differences between number of classes and Avg K of CPSOII, CGA and KCPSO for artificial data sets

Fig. 11
figure 11

Differences between number of classes and Avg K of CPSOII, CGA and KCPSO for real-world data sets

Table 5 Experimental results for artificial data sets with unknown number of clusters (Avg: Average, SD: Standard Deviation)
Table 6 Experimental results for real-world data sets with unknown number of clusters

The results indicate that CPSOII with the Guided & Unguided initialization achieves a better average DBI for the artificial data sets (Table 5). However, the average K obtained by CPSOII with only the Unguided initialization is the closest to the number of classes of the data sets. For the real-world data sets (Table 6), CPSOII with the Guided & Unguided initialization achieves better average DBI and K values for most data sets, although in some cases CPSOII with the Unguided initialization shows better results. Generally, the average number of clusters found by the CPSOII algorithm is the closest to the number of classes in the different data sets.

5.5.2 Experimental results with known number of clusters

For a fixed value of K and the parameter settings of Table 4, the results provided by the CPSOII algorithm are compared with the CPSOI, CGA and K-means algorithms in Tables 7, 8 and 9. In these tables, the Improvement Ratio (IR) is a simple metric that computes the extent of the improvement of CPSOII in comparison to the other methods. Equations (17), (18) and (19) define the IR metric for the DBI, VRC and SSE criteria, respectively.

$$\mathit{IR}_{\mathit{DBI}} = \frac{y - x}{y} \times 100 \quad (17)$$
$$\mathit{IR}_{\mathit{VRC}} = \frac{x - y}{y} \times 100 \quad (18)$$
$$\mathit{IR}_{\mathit{SSE}} = \frac{y - x}{y} \times 100 \quad (19)$$

where x is the result obtained by CPSOII and y is the best result obtained by the other methods (smaller values are better for DBI and SSE, while larger values are better for VRC).
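In code, the IR metrics as reconstructed above read as follows (a sketch; the sign conventions simply reflect which direction of each criterion is better):

```python
def ir_dbi(x, y):
    """Eq. (17): smaller DBI is better, so improvement is (y - x) / y."""
    return (y - x) / y * 100

def ir_vrc(x, y):
    """Eq. (18): larger VRC is better, so improvement is (x - y) / y."""
    return (x - y) / y * 100

def ir_sse(x, y):
    """Eq. (19): smaller SSE is better, same form as IR_DBI."""
    return (y - x) / y * 100

# Illustration only: ir_dbi(0.9074, 1.0) -> 9.26, i.e. a 9.26 % DBI
# improvement such as the one quoted for Hepato (the inputs are made up).
```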

Table 7 Experimental results for artificial data sets using the Unguided initialization and known number of clusters
Table 8 Experimental results for real-world data sets using the Unguided initialization and known number of clusters
Table 9 Experimental results for real-world data sets using the Guided & Unguided initialization and known number of clusters

Based on the results shown in Table 7, for the artificial data sets, the CPSOII, CPSOI and CGA algorithms obtain the same best value on the SSE and VRC criteria, but for the DBI criterion, the CPSOII algorithm obtains a better best value for Dataset 2 and Dataset 4. Also, the average values of the DBI criterion obtained by CPSOI and CGA for Dataset 2, Dataset 3 and Dataset 4 are worse than those of CPSOII. The results of the K-means algorithm for all data sets, except Dataset 1, are worse than those of the other algorithms.

Based on the results shown in Tables 8 and 9, for the real-world data sets, with both the Guided and Unguided initialization, CPSOII performed significantly better than CPSOI, CGA and K-means on all three criteria. For example, as shown in Table 8, the values of IR DBI for the Libras Movement and Hepato data sets are 8.44 % and 7.19 %, respectively. Also, this value for Hepato using the Guided & Unguided initialization (Table 9) is 9.26 %.

Figure 12 shows the IR values (IR DBI, IR VRC and IR SSE) of CPSOII for the best and average values of the three used criteria with both the Guided and Unguided initialization.

Fig. 12
figure 12

The IR values of CPSOII in the Best and Avg values of DBI, VRC and SSE criteria with both the Guided and Unguided initialization

5.5.3 Convergence test

The speed of convergence and the quality of solutions are two important issues in optimization algorithms. They depend on several factors, such as the algorithm type and the heuristics used in various steps of the algorithm. CPSOII, as an extension of the PSO algorithm, inherits the fast convergence property of PSO. Additionally, the renumbering procedure and the Guided initialization further increase the speed of convergence.

The experimental results show that the convergence of CPSOII toward qualified solutions, with both known and unknown numbers of clusters, is faster than that of the other compared algorithms. Figure 13 shows the convergence toward qualified solutions for the CPSOII, CPSOI and CGA algorithms using the Unguided initialization, based on the DBI criterion. Based on the curves shown in Fig. 13, the convergence of CPSOII toward qualified solutions on all six real-world data sets is faster than that of CPSOI and CGA. The difference in convergence is most pronounced on the Libras Movement data set. The Libras Movement data set is a high-dimensional one (with 90 features), which makes partitioning its data objects more complex. The results shown in Fig. 13(d) indicate that, for complex data sets, CPSOII achieves far better performance and convergence than CGA and CPSOI.

Fig. 13
figure 13

Comparing convergence of CPSOII, CPSOI and CGA algorithms toward qualified solutions using the Unguided initialization (using the DBI criterion)

The Guided initialization, as a heuristic, improves the speed of convergence and the quality of solutions. Figure 14 shows the effect of the Guided initialization on the convergence of CPSOII toward qualified solutions. The results show that by using the Guided initialization, the number of iterations needed to find better solutions is reduced and, therefore, the speed of the algorithm is increased. As mentioned above, KCPSO uses K-means in the swarm initialization phase, so its convergence is compared only with that of CPSOII using the Guided & Unguided initialization (Fig. 15).

Fig. 14
figure 14

The effect of the Guided initialization in the convergence of CPSOII toward qualified solutions (using the DBI criterion)

Fig. 15
figure 15

Comparing convergence of KCPSO and CPSOII with the Guided & Unguided initialization toward qualified solutions (using the DBI criterion)

5.5.4 Effects of renumbering procedure and various similarity functions

Population diversity is a way to monitor the degree of convergence or divergence in the PSO search process [38]. Several definitions of PSO population diversity measures have been proposed in the literature [38]. There are different ways of introducing diversity and controlling its degree: random restarts and controlling the swarm size are simple techniques for this purpose. Norouzzadeh et al. [39] used a random restart technique for injecting diversity into the swarm. Clerc [6] and Zhang et al. [40] dynamically changed the size of the swarm according to the performance of the algorithm. The size of the swarm is important, because too few particles cause the algorithm to converge prematurely to a local optimum, while too many particles slow down the algorithm [41]. As described in Sect. 3.1.1, the integer encoding scheme is naturally redundant. One disadvantage of redundant particles is that they effectively reduce the size of the swarm, because more than one particle represents the same solution. Thus, this problem can reduce the diversity of the PSO algorithm. In the CPSOII algorithm, the renumbering procedure helps to solve this problem. Hence, removing redundant particles in CPSOII appropriately increases the diversity of the population and the speed of convergence. Table 10 shows the effect of the similarity functions on the Rate of Distinct Particles (RDP) and the clustering results. In this table, RDP shows the average rate of distinct particles during the evolution process and is computed by (20) at the end of each iteration.

$$\mathit{RDP}^{t} = \frac{M - (\mathit{NOIdentical}^{t} + \mathit{NORedundant}^{t})}{M} \times 100 \quad (20)$$

where NOIdentical t and NORedundant t are the number of identical and redundant particles at iteration t, respectively, and M is the swarm size. As shown in Table 10, and as expected, when the renumbering procedure is used, the rate of distinct particles increases and the average DBI improves. Among the various similarity functions, the Sokal & Sneath2 similarity function obtains the best average DBI and RDP in most cases. In addition, for some data sets, the Sokal & Sneath1 and Rogers & Tanimoto similarity functions achieve suitable results. Based on these results, it is concluded that the similarity functions and the renumbering procedure play key roles in the evolutionary process; when they are used, movement in the correct direction is facilitated. Figure 16 illustrates the RDP of CPSOII, CPSOI and CGA during the evolution process. KCPSO uses a real encoding, so it has not been considered in this experiment.
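A sketch (ours) of the RDP computation per (20), treating two particles as non-distinct if their encodings are identical or redundant (i.e., they represent the same partition):

```python
def _canon(labels):
    """Canonical form so that redundant encodings compare equal."""
    m = {}
    return tuple(m.setdefault(l, len(m)) for l in labels)

def rdp(swarm):
    """Eq. (20) as reconstructed: percentage of distinct particles, where
    identical or redundant encodings count as the same particle."""
    return len({_canon(p) for p in swarm}) / len(swarm) * 100

print(rdp([[1, 1, 2], [2, 2, 1], [1, 2, 2]]))  # first two redundant -> 66.7
```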

Fig. 16
figure 16

Comparing the rate of distinct particles in CPSOII, CPSOI and CGA algorithms during the evolution process

Table 10 The effect of similarity functions on the RDP value and the clustering results of CPSOII

Based on the results shown in Table 10 and Fig. 16, the number of distinct particles in CPSOII with the renumbering procedure is higher than in CPSOI and CGA; thus, CPSOII is able to explore more regions. Moreover, it is concluded that the rate of distinct particles depends on the number of objects in the data sets: for data sets with more objects, the number of different permutations of objects is high and, thus, the probability of generating two particles with the same or redundant clustering solutions is low. On the other hand, converging to the global best solution decreases the value of RDP, because most of the particles move to the global best region, and the probability of generating two particles with the same or redundant clustering solutions in a small region is higher than in a large region. The RDP of CPSOII has been compared with those of the CPSOI and CGA algorithms, and the results are illustrated in Table 11. Based on these results, the CPSOII algorithm has a better average DBI, and its average RDP during the evolution process is higher than that of CPSOI and CGA.

Table 11 Comparing the rate of distinct particles and the average DBI in CPSOII, CPSOI and CGA algorithms

5.5.5 Scalability discussion

It should be noted that a major limitation of evolutionary clustering techniques in comparison to traditional clustering techniques, such as K-means, is that they are very time-consuming. There is a trade-off between time and quality in clustering techniques. Traditional clustering techniques sacrifice the quality of clustering in order to decrease the processing time, while evolutionary clustering techniques prefer to achieve better clustering results by consuming more time. To better illustrate this point, we used three large data sets (introduced in Table 12) and compared the CPSOII, CPSOI, CGA and K-means algorithms. The results of these experiments are shown in Table 13.

Table 12 Description of the used data sets for scalability test
Table 13 Experimental results for scalability test using the Unguided initialization and known number of clusters

The results indicate that although CPSOII, CPSOI and CGA consume more time, their results are considerably better than K-means.

Combining evolutionary clustering techniques with traditional clustering techniques is one way to reduce the runtime of evolutionary clustering techniques. The Guided & Unguided initialization is a simple example of this combination. Comparing the results of the Unguided initialization with those of the Guided & Unguided initialization in Tables 5, 6, 8 and 9 shows that the average time of the algorithms with the Guided & Unguided initialization is better than with the Unguided initialization alone.

6 Conclusions

Partitional clustering methods attempt to directly partition the data objects into a set of disjoint clusters. In this paper, we proposed a combinatorial particle swarm optimization for dynamic data clustering (CPSOII). CPSOII automatically finds the best number of clusters and partitions the data objects into clusters effectively. Compared with other PSO-based clustering algorithms, such as CPSOI, KCPSO and DCPSO, the operators of CPSOII are simple and effective. CPSOII uses both guided and unguided initialization to generate the initial swarm of particles. The experimental results (presented in Sect. 5.5.3) revealed that the guided initialization improved the speed of convergence and the quality of solutions in most cases. However, in some cases, the best value obtained with the guided initialization was worse than that obtained with the unguided initialization.

CPSOII uses the renumbering procedure as a preprocessing step before computing the velocity of each particle. This procedure removes redundant particles and consequently increases the rate of distinct particles and the diversity of the population. In addition, it makes the CPSOII algorithm insensitive to different encodings of clustering solutions. This is an important advantage and increases the speed of convergence and the quality of solutions, because particles are moved more deliberately in this case. In this paper, we used seven different binary similarity functions for the renumbering procedure. The effects of these functions were tested, and the results revealed that the Sokal & Sneath2 similarity function obtained the best average DBI and rate of distinct particles in most cases.

The performance of CPSOII was evaluated on both artificial and real-world data sets with respect to three clustering metrics, SSE, VRC and DBI. Comparing the obtained results of CPSOII with the CPSO [7], KCPSO [8], CGA [12] and K-means algorithms revealed that CPSOII yielded promising results; for example, it improved the value of the DBI criterion by 9.26 % for the Hepato data set. The results obtained for an unknown number of clusters showed that CPSOII with the guided initialization achieved a better average DBI for most data sets. Also, the average K (number of clusters) obtained by CPSOII is closer to the number of classes of the data sets than that of KCPSO and CGA.

In future work, we intend to apply Multi-Objective Particle Swarm Optimization (MOPSO) to dynamic data clustering to improve the performance of the proposed method. Moreover, we are going to use guided and unguided mutation and population topologies to improve the results.