
1 Introduction

Advances in technology have made information easy to capture and inexpensive to store; as a result, the amount of data held in various databases has increased dramatically. These data contain useful but hidden information that may be critical for the decision-making processes of enterprises. Data mining is the general name for the techniques used to extract information from very large amounts of data [11]. Clustering is a major technique in data mining; it refers to the process of dividing data into several subsets so that similarity among the data within the same cluster is maximized while similarity among different clusters is minimized. Its applications can be seen in customer segmentation, document clustering and information retrieval, web data analysis, image segmentation, anomaly detection, biology, medicine and many other areas. Clustering is an unsupervised process: the true class of each data object is not known to the clustering algorithm. If the true class labels are known to the algorithm and used in the analysis, the method is called classification.

Looking at the history of clustering techniques, we see that many unsupervised clustering algorithms have been developed. K-means is one of the best known among them. The K-means clustering algorithm is easy to implement and very efficient; however, it suffers from several drawbacks. The objective function of K-means is not convex, hence it may contain many local minima. Moreover, the outcome of the K-means algorithm is heavily dependent on the initial choice of the centroids [2]. To achieve better clustering performance, the fuzzy c-means (FCM) clustering algorithm was introduced by Bezdek [4].

Clustering is also an application field of mathematical optimization when it is performed by searching for the global minima of a clustering performance function. This view makes it possible to apply heuristic algorithms to clustering analysis. Particle swarm optimization (PSO) is a population-based heuristic algorithm that maintains a population of particles, where each particle represents a potential (candidate) solution to an optimization problem. Merwe and Engelbrecht used PSO in data clustering [22]. They also developed a hybrid approach that combines PSO and the K-means algorithm to achieve better clustering performance.

Merwe and Engelbrecht’s original PSO data clustering approach inspired many works. Ji et al. clustered mobile networks by applying PSO to a weighted clustering algorithm [12]. Correea et al. categorized sample types of biological databases with PSO [7]. Chen et al. tested the PSO clustering algorithm on four different datasets and analyzed the performance of the standard PSO clustering algorithm in their paper [6]. Cui et al. applied PSO to the document clustering problem [8], where the attributes of documents are defined as the dimensions of the particles. Omran et al. applied PSO to the image classification problem [18, 19]; their algorithm is a binary PSO model that dynamically adjusts the number of clusters. Kumar and Arasu proposed a particle swarm optimization based clustering method for medical databases [14]; their modified particle swarm optimization based adaptive fuzzy K-modes algorithm produces good results in terms of precision and accuracy. Rana et al. give a detailed literature review of PSO applications to data clustering [20]. Readers can also refer to [16] for a further literature survey on nature-inspired metaheuristic algorithms for data clustering.

Although each of these studies provides a number of improvements and innovations for clustering applications of PSO, all of them remain faithful to Merwe and Engelbrecht’s standard particle representation. This representation creates a disadvantage: the dimension of a particle grows to the number of features of a data vector times the number of desired clusters (Fig. 1). Most stochastic optimization algorithms, including particle swarm optimization, suffer from this ‘curse of dimensionality’, which, simply put, implies that their performance deteriorates as the dimensionality of the search space increases [23]. Bouveyron et al. advise dimension reduction or subspace clustering as the primary ways of avoiding the curse of dimensionality [5].

Fig. 1. Particle structure of the standard PSO. Each particle contains the centroids for all clusters.

The method proposed in this study, unlike the standard PSO approach, achieves high-quality clustering results without increasing the number of dimensions. To do so, instead of one particle representing a whole candidate solution (including the centroids of all clusters, as in Fig. 1), in the proposed method each particle represents only one centroid in the search space. Therefore, the number of dimensions of a particle equals the number of data vector features. Despite this major change in the particle representation, the proposed variant remains faithful to the standard PSO principles through changes made in the structure of the communication between particles.

One of the main configurational properties of PSO is the topology, i.e. the structure of the connections between particles. Several approaches have been developed to obtain good performance. In the gbest model, each particle is connected to all other particles (Fig. 2a). In the lbest model, each particle is connected to a predefined number of other particles (Fig. 2b). In the star topology, which is an lbest model, one of the particles in the swarm becomes the focal particle and all other particles are connected to this focal particle (Fig. 2c). Therefore, all communication in the swarm is transmitted through this focal particle.

The PSO variant proposed in this study is a new clustering method based on the star topology. In this method there are several focal particles in the swarm; the other particles are connected to their nearest focal particle, and all communication passes through these focal particles. There are several studies about focal particles in PSO [13, 21]. However, we could not find any study on multiple focal particles in a swarm with dynamically changing neighborhoods among particles.

Fig. 2. Swarm topologies: (a) gbest topology, where every particle is connected to every other particle; (b) lbest topology, where each particle is connected to a number of other particles; (c) star topology, where each particle is connected to a focal particle.

In this study, we aim to show that, by decreasing the number of dimensions with the help of this multiple focal particle topology, the proposed PSO variant achieves high-quality clustering results at less computational cost than other heuristics on high-dimensional datasets. In the following sections, data clustering is first defined as an optimization problem. Then, in the third section, the particle swarm optimization technique is introduced and the method of data clustering with particle swarm optimization is explained. In the fourth section, particle swarm optimization with the focal particles method is introduced. This method is applied to five datasets, and the results and conclusions are given at the end of the study.

2 Data Clustering

When the data clustering problem is treated as an optimization problem, the aim is to find optimal centroids of the clusters rather than to find an optimal partition of the data vectors [1]. The dataset to be clustered is represented as a set of vectors \(D=\{x_1, x_2, \ldots , x_m\}\) where m is the number of data objects \(x_i\). A data object can have any number of dimensions; these dimensions are called attributes or features. A cost function must be defined for the clustering optimization problem; in clustering analysis these cost functions are validity indexes. A comprehensive review of clustering methods can be found in [15, 16, 24].

2.1 Validity Indexes

Several validity indexes have been defined to assess the performance of clustering algorithms. In optimization-based data clustering, these validity indexes (or similarity indexes) are used to calculate the fitness of the current solution. The most basic validity index is the sum of the distances between the data vectors and their assigned cluster centroids in the vector space. This index is called the clustering error [1] and is given in Eq. (1).

$$\begin{aligned} J_e=\sum \limits _{j=1}^{N_c}[\sum \nolimits _{\forall {x_i\in {C_j}}}d(x_i,o_j)] \end{aligned}$$
(1)

where d is the distance of the data vector \(x_i\) to its assigned centroid. \(N_c\) denotes the number of clusters (provided by the user). \(o_j\) denotes the centroid vector of cluster \(C_j\).

Another validity index is quantization error (2) from [22].

$$\begin{aligned} J_q = \frac{ \sum \limits _{j=1}^{N_c} [ \frac{ \sum \nolimits _{ \forall {x_i\in {C_j}} } d(x_i,o_j)}{\mid {C_{j}\mid }}]}{N_c} \end{aligned}$$
(2)

Here \(\mid {C_j}\mid \) is the number of data vectors belonging to cluster \(C_j\). The quantization error is thus the average distance of the data vectors to their assigned centroids, averaged over the clusters. Note that Eq. (2) admits division by zero when a cluster is empty; in our study, whenever a division by zero was encountered, the fitness of the particle was set to infinity.
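A minimal Python sketch of this fitness computation (the function and variable names are illustrative, not taken from [22]) could look as follows, with the empty-cluster case handled as described above:

import numpy as np

def quantization_error(data, centroids, labels):
    # Quantization error J_q of Eq. (2): the per-cluster mean distance
    # to the centroid, averaged over all clusters. An empty cluster
    # would cause division by zero, so the fitness is set to infinity.
    total = 0.0
    for j in range(len(centroids)):
        members = data[labels == j]          # data vectors assigned to cluster j
        if len(members) == 0:                # empty cluster
            return float("inf")
        dists = np.linalg.norm(members - centroids[j], axis=1)  # Euclidean distance, Eq. (4)
        total += dists.mean()
    return total / len(centroids)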

Another well-known validity index is the silhouette value. The silhouette value of a point measures how similar that point is to the points in its own cluster compared with the points in other clusters. A higher silhouette value means a better assignment of data vectors to clusters. The formula for the silhouette value is given in (3).

$$\begin{aligned} S(x_i)=\frac{b(x_i)-a(x_i)}{\max \{a(x_i),b(x_i)\}} \end{aligned}$$
(3)

where \(a(x_i)\) is the average distance from the ith point to the other points in the same cluster, and \(b(x_i)\) is the smallest average distance from the ith point to the points of any other cluster, minimized over the clusters. The silhouette value lies between \(-1\) and \(+1\). There are several other validity indexes for data clustering; a brief list of them can be seen in [16].
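The following sketch computes the silhouette value of a single point directly from Eq. (3) (identifiers are our own, and returning 0 for a singleton cluster is an assumed convention, not part of the definition above):

import numpy as np

def silhouette_value(i, data, labels):
    # S(x_i) of Eq. (3): a is the mean distance to the other points of
    # x_i's own cluster, b is the smallest mean distance to the points
    # of any other cluster (at least two clusters are assumed).
    x, own = data[i], labels[i]
    same = [j for j in range(len(data)) if labels[j] == own and j != i]
    if not same:
        return 0.0                           # singleton cluster, by convention
    a = np.mean([np.linalg.norm(x - data[j]) for j in same])
    b = min(
        np.mean([np.linalg.norm(x - data[j])
                 for j in range(len(data)) if labels[j] == c])
        for c in set(labels.tolist()) if c != own
    )
    return (b - a) / max(a, b)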

The distance parameter in Eqs. (1) and (2) can be the Euclidean distance, the cosine distance or any other distance metric. In data clustering, the Euclidean distance, given in Eq. (4), is one of the most frequently used metrics; however, in some special cases such as document clustering, the cosine distance is more suitable [25].

$$\begin{aligned} d(x_i,o_j)=\sqrt{\sum \limits _{k=1}^{N_{d}} (x_{ik}-o_{jk})^2} \end{aligned}$$
(4)

Here \(N_d\) is the data dimension, i.e. the number of attributes of each data vector.

3 Particle Swarm Optimization

Swarm optimization algorithms are inspired by efforts to model the social systems of birds and bees. Particle swarm optimization was developed by Kennedy and Eberhart in 1995 [9]. In PSO, each particle represents a position in an \(N_d\) dimensional space. The PSO algorithm moves particles through this multi-dimensional search space in search of an optimal solution. A particle’s movement is affected by three factors: (1) the particle’s own velocity vector, \(\vec {v}_i\); (2) the best position found by the particle so far, \(\vec {p}_i\); and (3) the best position found by the particles in the neighborhood of that particle, \(\vec {y}_i\).

In the first step of the algorithm, the velocity of a particle is calculated as in (5), and this value is then added to the current position of the particle as given in (6). If \(\vec {x}_i\) is the current position of the particle, \(\vec {v}_i\) is the current velocity of the particle and \(\vec {p}_i\) is the personal best position of the particle, then the velocity of the particle for the next iteration is

$$\begin{aligned} \vec {v}_{i,k}(t+1) = w\vec {v}_{i,k}(t)+c_1r_{1,k}(t)(\vec {p}_{i,k}(t)-\vec {x}_{i,k}(t)) +c_2r_{2,k}(t)(\vec {y}_{i,k}(t)-\vec {x}_{i,k}(t)) \end{aligned}$$
(5)
$$\begin{aligned} \vec {x}_i(t+1)=\vec {x}_i(t)+\vec {v}_i(t+1) \end{aligned}$$
(6)

where w is the inertia weight and \(c_1\), \(c_2\) are positive constants, called the cognitive and social acceleration factors respectively. \(r_{1,k}(t)\), \(r_{2,k}(t)\) \(\sim {U(0,1)}\), and \(k=1,\ldots ,N_{d}\) [22].
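A minimal sketch of one update step, Eqs. (5) and (6), with the parameter values reported in Sect. 5 used as defaults (the function name is illustrative):

import numpy as np

rng = np.random.default_rng()

def pso_step(x, v, p, y, w=0.90, c1=2.05, c2=2.05):
    # x: current position, v: current velocity, p: personal best
    # position, y: neighborhood best position. r1 and r2 are drawn
    # component-wise from U(0, 1), as in Eq. (5).
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p - x) + c2 * r2 * (y - x)
    return x + v_new, v_new                  # Eq. (6)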

3.1 Data Clustering with Particle Swarm Optimization

In PSO, every particle represents a candidate (potential) solution, so the position encoded by a particle should, on its own, represent a complete solution to the problem. In Merwe and Engelbrecht’s [22] method, a particle is constructed as in (7).

$$\begin{aligned} \vec {x}_i=(o_{i1},o_{i2},...,o_{ij},...o_{iN_c}) \end{aligned}$$
(7)

where \(o_{ij}\) is the \(j\)th centroid represented by the \(i\)th particle. Thus, if a data vector consists of \(N_d\) dimensions, then a particle will have \(N_d~\times ~N_c\) dimensions.

The PSO algorithm tries to minimize an objective function iteration by iteration. In data clustering mode, this objective function should be chosen carefully so that a good clustering result is achieved at the end of the iterations or when a termination criterion for the PSO is reached. Merwe and Engelbrecht [22] chose the quantization error (2) as the fitness function.
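For concreteness, a sketch of the resulting fitness evaluation under the encoding (7), reusing the quantization_error sketch from Sect. 2.1 (the reshaping step and the names are our own):

import numpy as np

def fitness_standard_pso(particle, data, n_clusters):
    # The flat position vector of encoding (7) is reshaped into N_c
    # centroids of N_d dimensions each; every data vector is assigned
    # to its nearest centroid and the quantization error (2) is returned.
    centroids = particle.reshape(n_clusters, data.shape[1])
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)            # nearest-centroid assignment
    return quantization_error(data, centroids, labels)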

4 Particle Swarm Optimization with the Focal Particles

As explained before, in standard PSO a particle is a representation of a whole solution; thus a particle must have \(N_d \times N_c\) dimensions. This usually leads to the so-called ‘curse of dimensionality’ problem. To overcome this inefficiency, we have developed a new clustering approach, namely particle swarm optimization with the focal particles (PSOFP). In this new approach each particle represents only one centroid in the search space. If \(N_c\) is the number of clusters, then \(N_c\) particles are chosen as the final representatives of the clustering solution. These are the focal particles, and every other particle in the swarm is connected to its nearest focal particle. This neighborhood structure is illustrated in Fig. 3. The approach results in lower-dimensional particles and is therefore expected to have a lower computational cost than the standard approaches. In the next section we benchmark PSOFP’s performance against other clustering algorithms.

Fig. 3. \(f_0\) and \(f_1\) are the focal particles. There are 13 particles in total. This is an example for a two-cluster problem.

In PSOFP, a particle is constructed as in (8).

$$\begin{aligned} \vec {x}_i=(o_{i}) \end{aligned}$$
(8)

where \(\vec {x}_i\) is a centroid in the search space. Algorithm 1 displays the pseudo code of the PSOFP algorithm. To start PSOFP, a swarm of l particles is initialized with the particle formation given in (8); swarm initialization in PSOFP is similar to that of standard PSO. Then \(N_c\) randomly selected particles are labeled as focal particles (the swarm size must therefore be larger than \(N_c\)). At each iteration, the fitness value of each particle is calculated. To perform this calculation, the centroid locations represented by the focal particles are first combined into a candidate solution. Then, for each non-focal particle, the particle’s position vector (the centroid it represents) is written over the corresponding place in the candidate solution. This process is illustrated in Table 1.

Table 1. An illustrative example for the PSOFP fitness calculation process.

In this illustrative example, a swarm with 8 particles is initialized. We are trying to cluster our data vectors into three clusters. Thus, the first three particles are assigned as the focal particles. The data vectors are in two dimensions, therefore each particle has two dimensions. To calculate the fitness value of the fourth particle,

  • First, a candidate solution is built from the focal particles: {10; 18; 45; 26; 21; 34}. The first two elements of the candidate solution are the centroid of the first cluster, the third and fourth elements are the centroid of the second cluster, and the last two elements are the centroid of the third cluster.

  • Then, the focal particle nearest to the fourth particle is found using the selected distance metric. In this example it is the second focal particle.

  • In the candidate solution, the elements belonging to the second focal particle are replaced with the current particle’s position: {10; 18; 38; 30; 21; 34}. The fitness of the fourth particle is calculated using this final candidate solution, as in the sketch below.
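The following sketch reproduces this fitness calculation (our own reading of the steps above, with illustrative names), again reusing the quantization_error sketch from Sect. 2.1:

import numpy as np

def fitness_psofp(particle, focal_positions, data):
    # The focal particles' positions form the candidate solution; the
    # slot of the focal particle nearest to this particle (Euclidean
    # distance) is overwritten by the particle's own position.
    nearest = np.argmin(np.linalg.norm(focal_positions - particle, axis=1))
    centroids = focal_positions.copy()
    centroids[nearest] = particle
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return quantization_error(data, centroids, labels)

# Worked example from Table 1: the fourth particle at (38, 30) is
# nearest to the second focal particle (45, 26), so the candidate
# solution becomes {10; 18; 38; 30; 21; 34}.
focal = np.array([[10., 18.], [45., 26.], [21., 34.]])
p4 = np.array([38., 30.])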

Another difference from standard PSO is that the focal particles in PSOFP do not have an inertia weight component of their own: focal particles are affected only by their own personal best and by the best performances of the particles connected to them. At the end of each iteration, all particles, including the focal ones, move in the search space. Once these movements are finished, the neighborhood structure of the swarm is updated: the distances between the focal and non-focal particles are recalculated, and each non-focal particle is connected to its nearest focal particle.
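A sketch of the corresponding velocity update for a focal particle, i.e. Eq. (5) with the inertia term dropped; this formulation is our reading of the description above, and the names are illustrative:

import numpy as np

rng = np.random.default_rng()

def focal_step(x, v, p, y, c1=2.05, c2=2.05):
    # No w*v(t) inertia term for focal particles; y is the best
    # position among the non-focal particles connected to this focal.
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = c1 * r1 * (p - x) + c2 * r2 * (y - x)
    return x + v_new, v_new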

5 Application and Results

Table 2 shows the datasets used for benchmarking. IRIS, WINE, CMC and Gesture Data are from the UCI benchmark datasets. RAND1 is a randomly generated dataset consisting of 500 values drawn from U(0, 100), 1000 values from U(500, 1500) and 1000 values from U(2500, 3000).
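RAND1 can be regenerated along these lines (a sketch; we assume the values are one-dimensional, as the notation above suggests):

import numpy as np

rng = np.random.default_rng()
rand1 = np.concatenate([
    rng.uniform(0, 100, 500),       # 500 x U(0, 100)
    rng.uniform(500, 1500, 1000),   # 1000 x U(500, 1500)
    rng.uniform(2500, 3000, 1000),  # 1000 x U(2500, 3000)
]).reshape(-1, 1)                   # one feature per data vector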

Table 2. Benchmark datasets

The following methods are used for benchmarking:

  • Standard K-Means Clustering Algorithm

  • K-means++ Algorithm: Arthur and Vassilvitskii’s K-means++ algorithm [3] is an improvement to standard K-means that chooses better initial values and thereby avoids poor results.

  • Merwe’s [22] hybrid PSO data clustering method: In hybrid PSO, the result of K-means clustering is fed into PSO as a particle, i.e. the solution of the K-means algorithm is where the PSO starts.

  • CLARANS: Ng and Han [17] introduced the CLARANS algorithm (Clustering Large Applications based upon RANdomized Search) in the context of clustering in spatial databases. The authors considered a graph whose nodes are sets of k medoids, where an edge connects two nodes if they differ by exactly one medoid.

We ran the benchmarking tests in parallel on a 16-processor computer. Due to the random nature of the K-means and particle swarm algorithms, every method was run 160 times. The test computer had 16 Intel Xeon E5 2.90 GHz processors with 30 GB of RAM, and 8 parallel runs were executed at a time. We also tried parallelizing the fitness evaluation process within a single run; however, due to the high information overhead, parallel evaluation of the fitness functions in a single run was slower than serial evaluation. Our test computer was on the Amazon EC2 cloud computing servers. We refer to [10] for a discussion of parallelization in data mining applications.

The PSO and PSOFP algorithms are initialized with 100 particles. The permitted maximum iteration count is 4000, but the iterations stop when there has been less than 0.0001 improvement in the global best value during the last 250 iterations. Equation (5) is used for the velocity calculations, with w = 0.90 and \(c_1\) = \(c_2\) = 2.05. In standard PSO, the gbest model is chosen. Since the selection of the fitness function is an important step in heuristic optimization, we chose the quantization error (2) as the fitness function. The quantization error and silhouette values of each method are reported in Table 3. The CPU time column is the mean CPU time over 160 runs. The Mean and Min. columns of the quantization error represent the average and the best values obtained over 160 runs. The Max. column of the silhouette value represents the best value achieved among the 160 runs for the silhouette index. The S.D. column gives the standard deviation of the runs.
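One way to implement the stopping rule just described (the names and structure are our own):

def should_stop(history, window=250, tol=1e-4, max_iter=4000):
    # history holds the global best fitness after each iteration
    # (non-increasing under minimization). Stop at max_iter, or when
    # the improvement over the last `window` iterations is below tol.
    if len(history) >= max_iter:
        return True
    if len(history) > window:
        return history[-window - 1] - history[-1] < tol
    return False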

With respect to the quantization error, the proposed PSOFP algorithm outperforms all other algorithms on the benchmark datasets. The mean quantization error of PSOFP over the five datasets is 3.71 %, 4.16 %, 4.06 % and 1.65 % lower than that of the K-means, K-means++, hybrid PSO and CLARANS algorithms respectively. When we compare the best values achieved by each algorithm (the minimum values), PSOFP is 7.64 %, 7.59 %, 6.42 % and 7.78 % better than these algorithms. The standard deviation indicates how well the reported average errors represent the runs. On all datasets except RAND1, the standard deviation of PSOFP is lower than that of the benchmark algorithms, which shows that the proposed PSOFP is a robust clustering technique. The silhouette value is another useful index for analyzing clustering performance; values nearer to +1 are better. The silhouette values of PSOFP are equal to or slightly better than those of the benchmark algorithms; only on the RAND1 dataset is the CLARANS algorithm 2.23 % better than PSOFP on average.

As the CPU time column of Table 3 indicates, owing to the lower dimensionality of the search space in the PSOFP method, PSOFP is much faster than standard PSO while at the same time being more successful in terms of clustering validity. Its computational time is 45.03 %, 5.00 %, 39.66 %, 64.54 % and 9.25 % less than that of the standard PSO algorithm on the WINE, IRIS, CMC, Gesture and RAND1 datasets respectively. Although the CLARANS algorithm gives better results than PSOFP on the RAND1 dataset, its computational time on this dataset is 4.3 times higher than that of PSOFP.

Table 3. Benchmark results over 160 runs for each method.

6 Conclusions

In this study a new approach for clustering analysis is presented: particle swarm optimization with the focal particles. In standard PSO, each particle is a representation of the final solution; however, this increases the number of dimensions a particle has. In PSOFP, each particle represents only one point in the search space, so the number of dimensions is lower than in standard PSO. We analyzed the effect of this dimensionality reduction on clustering performance. We selected three small and two large datasets and benchmarked the proposed PSOFP algorithm against the standard K-means, K-means++, hybrid PSO and CLARANS algorithms. Each algorithm was run 160 times. The Amazon EC2 cloud computing platform was used, and 8 parallel runs were made at a time. We also tried parallelizing the objective function evaluation of particle swarm optimization; this approach did not accelerate the clustering analysis, due to the high information overhead among the parallel processes.

The quantization error and silhouette values were chosen as the performance criteria for the benchmark tests. The results indicate that, while maintaining clustering performance better than or equal to that of the benchmark algorithms, PSOFP was faster than the standard PSO algorithm. This shows that the dimensionality reduction approach of PSOFP is an efficient and robust strategy in heuristic-based data clustering analysis.

As future work, an improved, fully parallel approach for the focal particles can be studied. We employed the Euclidean distance as the distance metric in our calculations; however, the cosine metric is also known to be a good measure of similarity among data objects in high-dimensional spaces. The performances of the algorithms can also be compared using the cosine distance metric.