1 Introduction

Clustering is an unsupervised data mining technique based on similarity measures between data objects. Its aim is to partition the data into groups so that similar objects fall in the same cluster and dissimilar objects in different clusters, typically according to their distance from the cluster centers. K-means is one of the most widely used clustering techniques for large-scale non-convex optimization problems [1]. The method seeks to reduce the sum of intra-cluster distances, that is, the distances between the data points and their cluster centers. The algorithm follows the simple idea of classifying a data set into a number of clusters in a multidimensional space. Each object is represented by a data point in the feature space, and the clusters are separated according to a homogeneity condition. The number of clusters, k, is supplied as prior knowledge; it helps to group similar objects closely together and to keep them apart from dissimilar ones. Based on the distance from the centers, the data are repartitioned into k clusters in each iteration, and the newly formed cluster centers can be updated iteratively using various optimization techniques. Many researchers have shown interest in developing k-means algorithms for diverse application areas. A number of recently proposed k-means clustering algorithms and applications relevant to this article are reviewed below.

To find appropriate cluster centers in the feature space that optimize the similarity metrics, a number of GA-based clustering algorithms have been developed [2, 3]. Ahmadyfard and Modares [4] discussed a hybrid clustering method based on k-means and PSO for better convergence. A novel cat swarm optimized clustering algorithm was proposed by Santosa and Ningrum [5] for better accuracy compared to PSO. Kader [6] presented a hybrid two-phase GAI-PSO with k-means data clustering algorithm, which performs fast data clustering and can avoid premature convergence to local optima. An improved PSO-based k-means algorithm was developed by Zheng and Jia [7] to avoid the local optima problem of standard k-means clustering. Wang et al. [8] introduced a parallel MapReduce K-PSO by combining the traditional k-means and PSO algorithms. Naik et al. [9] proposed a hybrid PSO-K-means clustering algorithm to obtain optimal cluster centers for cluster analysis. An improved k-means with a hybridized PSO algorithm for web document clustering was introduced by Jaganathan and Jaiganesh [10]. By combining the k-means method with mathematical morphology, Yao et al. [11] developed an improved k-means method for fish image optimization. Monedero et al. [12] presented a modification of the k-means method for quasi-unsupervised learning in which the size of the cluster partitions is controlled and adjusted by means of the Levenberg–Marquardt algorithm. Shahbaba and Beheshti [13] introduced a novel minimum ACE k-means (MACE) clustering method, which is suitable for both synthetic and real data. Tzortzis and Likas [14] developed a minmax k-means algorithm in which the cluster weights are set according to their variance. To deal with distributed data and overcome the limitations of k-means, Naldi and Campello [15] proposed an evolutionary k-means algorithm for clustering.

Although k-means is a highly influential clustering algorithm used in many real-life applications, it still has major limitations, such as sensitivity to local optimal solutions, on which more work is needed. Motivated by this, an improved swarm-based k-means algorithm is proposed for more effective real-world data clustering. The remainder of the paper is organized as follows. Section 2 describes the preliminary concepts: k-means, PSO and IPSO. Section 3 presents the proposed method (IPSO-K-means). Section 4 presents the experimental setup and the results obtained. Section 5 concludes the paper.

2 Preliminaries

2.1 K-Means Algorithm

The k-means algorithm [16, 17] takes the number of clusters k as an input parameter and partitions a set of n objects in the feature space into k clusters. It starts by randomly selecting k objects, which serve as the initial cluster means. Each remaining object is assigned to the cluster whose mean is nearest under the chosen distance metric, and the cluster means are then recomputed. This process continues until the criterion function converges. In this way k-means arrives at a set of cluster centers for the data, although these may correspond to a local rather than a global optimum.

Steps of the k-means Algorithm

  1. Select a predefined number of cluster centers randomly from the dataset.

  2. Compute the Euclidean distance of each instance from the cluster centers.

  3. Assign a cluster number to each instance based on the Euclidean distance. An instance \( i_{j} \) is assigned to cluster \( c_{k} \) if the Euclidean distance between \( i_{j} \) and \( c_{k} \) is the minimum over all clusters.

  4. Find the new cluster centers by computing the mean of all instances in each cluster.

  5. If the new set of cluster centers is the same as the previous set, go to step 7.

  6. Otherwise, go to step 2.

  7. Exit.
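The steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library, not the implementation used in this paper; the function names, the `max_iter` safeguard, the fixed `seed`, the handling of empty clusters and the toy data are all assumptions introduced for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(data, k, max_iter=100, seed=0):
    """Plain k-means following steps 1-7; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct instances as the initial centers.
    centers = [list(p) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Steps 2-3: assign each instance to its nearest center.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centers[c]))
            for p in data
        ]
        # Step 4: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(p[d] for p in members) / len(members) for d in range(dim)]
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Steps 5-7: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data (hypothetical): two small groups of 2-D points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(data, k=2)
```

Because the initial centers are drawn at random in step 1, different seeds can lead to different final centers, which is the sensitivity to local optima discussed in the Introduction.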

2.2 PSO Algorithm

Particle Swarm Optimization (PSO) is a metaheuristic inspired by the flocking of birds, proposed by Kennedy and Eberhart [18], that starts from a randomly initialized population. Because it has few parameters to set, this population-based algorithm is less complex than many alternatives. The idea behind PSO is to treat each particle as a point with no mass or volume that flies like a bird through a multidimensional search space, adjusting its position and exchanging information about its current position according to its own earlier experience and that of its neighbors [19]. When swarms travel in a group in search of food or shelter [20], their members avoid colliding with one another while adjusting both their position and velocity. In this mechanism, the swarm members modify their positions and velocities after sharing group information about the best position found in the current movement of the swarm [21]. Through this interactive cooperation, the particles gradually move closer to the target and finally reach the optimal position [22]. Each particle maintains its own local best position lbest, and the swarm maintains the global best position gbest among all particles.

$$ V_{i}^{(t+1)} = V_{i}^{(t)} + c_{1} * \text{rand}(1) * \left( l_{best_{i}}^{(t)} - X_{i}^{(t)} \right) + c_{2} * \text{rand}(1) * \left( g_{best}^{(t)} - X_{i}^{(t)} \right) $$
(1)
$$ X_{i}^{(t+1)} = X_{i}^{(t)} + V_{i}^{(t+1)} $$
(2)

Equation (1) governs both the cognitive and the social behavior of the particles, and the next position of each particle is updated using Eq. (2). Here \( V_{i}^{(t)} \) and \( V_{i}^{(t+1)} \) are the velocities of the ith particle in the population at times t and t + 1, respectively; \( c_{1} \) and \( c_{2} \) are acceleration coefficients, normally set between 0 and 2 (they may be equal); \( X_{i}^{(t)} \) is the position of the ith particle; \( l_{best_{i}}^{(t)} \) and \( g_{best}^{(t)} \) denote the local best position of the ith particle and the global best among the local bests at time t; and rand(1) generates a random value between 0 and 1.
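As an illustration of Eqs. (1) and (2), the following Python sketch applies one update to every particle and runs a short loop on a toy objective. It is a minimal sketch under stated assumptions: the function name `pso_step`, the per-dimension random draws, the `sphere` test function, the swarm size and the iteration count are all hypothetical choices made for the example and are not taken from this paper.

```python
import random


def pso_step(positions, velocities, lbest, gbest, c1=1.4, c2=1.4, rng=random):
    """Apply Eq. (1) then Eq. (2) to every particle, in place."""
    for i in range(len(positions)):
        for d in range(len(positions[i])):
            # Assumption: a fresh random factor is drawn per dimension.
            r1, r2 = rng.random(), rng.random()
            velocities[i][d] = (
                velocities[i][d]
                + c1 * r1 * (lbest[i][d] - positions[i][d])
                + c2 * r2 * (gbest[d] - positions[i][d])
            )
            positions[i][d] += velocities[i][d]


def sphere(x):
    """Toy objective (hypothetical): sum of squares, minimum at the origin."""
    return sum(v * v for v in x)


rng = random.Random(1)
positions = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(10)]
velocities = [[0.0, 0.0] for _ in range(10)]
lbest = [p[:] for p in positions]
gbest = min(lbest, key=sphere)[:]
initial_best = sphere(gbest)

for _ in range(30):
    pso_step(positions, velocities, lbest, gbest, rng=rng)
    # Each particle keeps the best position it has visited so far.
    for i, p in enumerate(positions):
        if sphere(p) < sphere(lbest[i]):
            lbest[i] = p[:]
    # The global best is the best among the local bests.
    gbest = min(lbest, key=sphere)[:]
final_best = sphere(gbest)
```

Since lbest and gbest are replaced only when a better position is found, the best fitness never gets worse from one iteration to the next, even when individual particles overshoot.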

2.3 Improved PSO Algorithm

In traditional PSO, the three basic steps (calculation of the velocity, the position and the fitness value) are iterated until the required convergence criteria are met. The stopping criterion may be defined on the maximum change in the best fitness value. However, if the velocity of the swarm becomes zero or close to zero and the best position stops changing, PSO may become trapped in a local optimum. This happens because each particle relies only on its own experience of the current and global best positions. Instead, the search should be based on mutual cooperation among all the particles in a multidirectional manner [23].

So, in IPSO a new inertia weight factor \( \lambda \) is introduced to control both the local and global search behavior. The value of \( \lambda \) may be decreased quickly [24] during the initial iterations and slowly as the search approaches the optimum.

The new velocity and position updates are given by Eqs. (3) and (4).

$$ V_{i}^{(t+1)} = \lambda * V_{i}^{(t)} + c_{1} * \text{rand}(1) * \left( l_{best_{i}}^{(t)} - X_{i}^{(t)} \right) + c_{2} * \text{rand}(1) * \left( g_{best}^{(t)} - X_{i}^{(t)} \right) $$
(3)
$$ X_{i}^{(t+1)} = X_{i}^{(t)} + V_{i}^{(t+1)} $$
(4)
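The effect of the inertia weight in Eqs. (3) and (4) can be sketched as follows. The exact schedule used to decrease \( \lambda \) is not specified here, so the exponential decay in `inertia`, its start and end values and its `rate` parameter are purely hypothetical; they are chosen only to show a weight that falls quickly at first and slowly later. The function names are likewise assumptions.

```python
import math
import random


def inertia(t, t_max, lam_start=0.9, lam_end=0.4, rate=5.0):
    """Hypothetical schedule: fast early decay, slow late decay."""
    return lam_end + (lam_start - lam_end) * math.exp(-rate * t / t_max)


def ipso_velocity(v, x, lbest, gbest, lam, c1=1.4, c2=1.4, rng=random):
    """Eq. (3) for one particle; returns the new velocity vector."""
    return [
        lam * v[d]
        + c1 * rng.random() * (lbest[d] - x[d])
        + c2 * rng.random() * (gbest[d] - x[d])
        for d in range(len(x))
    ]


def ipso_position(x, v_new):
    """Eq. (4): move the particle by its updated velocity."""
    return [xd + vd for xd, vd in zip(x, v_new)]


# Inertia weight at a few iterations of a 100-iteration run.
weights = [inertia(t, 100) for t in (0, 10, 50, 100)]

# A particle already sitting on both best positions: only the inertia
# term contributes, so the previous velocity is simply scaled by lam.
v_new = ipso_velocity([1.0, -2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], lam=0.5)
x_new = ipso_position([0.0, 0.0], v_new)
```

With \( \lambda = 1 \) the update reduces to Eq. (1), so the inertia weight is the only difference between the two velocity rules.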

3 Proposed Algorithm (IPSO-K-Means)

The proposed IPSO-K-means algorithm is a hybrid algorithm that combines improved PSO with the K-means algorithm for real-world data clustering. Basic PSO converges slowly, whereas K-means is easily trapped in a local optimal solution. Hybridizing improved PSO with K-means is intended to improve the convergence speed and to help find the global optimal solution. The advantages of both algorithms are thus used in this paper, which may yield better results than either algorithm used alone.

Pseudo Code of IPSO-K-Means Algorithm

Initialize the position P and velocity V of particles randomly. Each particle is a potential solution for the clustering problem. A single particle represents the centroids of clusters. Hence the population of particles is initialized as follows (Eq. 5):

$$ P = \left\{ {X_{1} , X_{2} \ldots X_{n} } \right\} $$
(5)

where \( X_{i} \) is a single candidate solution (particle) in the search space. It holds the centroids of the clusters, as given in Eq. (6).

$$ X_{i} = \left\{ {C_{i,1} , C_{i, 2} \ldots C_{i, m} } \right\} $$
(6)

where \( C_{i,j} \) represents the jth cluster center among the m clusters in the dataset.
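The particle encoding of Eqs. (5) and (6) can be illustrated with a short Python sketch. It is only one possible representation: seeding the centers from randomly drawn data points, the helper names `init_swarm`, `flatten` and `unflatten`, and the toy data are assumptions made for the example rather than details given in this paper.

```python
import random


def init_swarm(data, n_particles, m_clusters, seed=0):
    """Build P = {X_1..X_n}; each X_i holds m cluster centers (Eqs. 5-6)."""
    rng = random.Random(seed)
    swarm = []
    for _ in range(n_particles):
        # Assumption: centers are seeded from randomly drawn data points.
        centers = [list(p) for p in rng.sample(data, m_clusters)]
        swarm.append(centers)
    return swarm


def flatten(particle):
    """Concatenate the m centers into one vector of length m * dim."""
    return [coord for center in particle for coord in center]


def unflatten(vector, dim):
    """Split a flat vector back into centers of dimension dim."""
    return [vector[i:i + dim] for i in range(0, len(vector), dim)]


# Toy data (hypothetical): five 2-D points, four particles, two clusters.
data = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9), (4.0, 4.5)]
swarm = init_swarm(data, n_particles=4, m_clusters=2)
vector = flatten(swarm[0])
```

Flattening a particle into a single vector lets the velocity and position updates treat all m centers of a candidate solution as one point in the search space.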

4 Experimental Setup and Result Analysis

In this section, the proposed IPSO-K-Means is compared with alternative methods (K-Means, GA-K-Means, PSO-K-Means). All the clustering methods are tested on multidimensional real-world datasets (Table 1) from the UCI repository [25] and compared in terms of the fitness value of the cluster centers computed from Eq. (7). The comparison of the clustering methods is listed in Table 2. The proposed method has been implemented using MATLAB 9.0 on a system with an Intel Core Duo CPU T5800, 2 GHz processor, 2 GB RAM and Microsoft Windows-2007 OS.

Table 1 Dataset information
Table 2 Performance Comparison of IPSO-K-means with the other clustering methods

In Eq. (7), k and d are the parameters used to calculate the fitness of the clustering methods, including the proposed method. The simulation has been carried out with k = 50 and d = 0.1, and the proposed clustering model was found to perform better than all the existing methods compared. The acceleration coefficients c1 and c2 are set to 1.4 for early convergence during the IPSO iterations. The inertia weight is set between 1.8 and 2 for early convergence. The proposed improved PSO-based technique is able to produce good cluster centers for the objects. However, the number of iterations needed to meet the convergence criterion cannot be fixed in advance. As the number of iterations increases, each initially chosen cluster center is attracted towards its corresponding group of similar objects, which leads to the final cluster center with the best fitness value. Changes in the local and global best solutions drive the updates of the position and velocity of the cluster centers.

5 Conclusion

In this paper, a hybrid improved swarm-based K-means algorithm has been designed for real-world data clustering. The datasets are taken from the UCI machine learning repository and have been tested with several clustering methods: K-means, GA-K-means and PSO-K-means. The fitness values of the clusters obtained by the proposed method indicate closer and more nearly optimal cluster centers. The proposed method not only produces good fitness values but also improves the cluster accuracy. The procedure used in this paper to find the optimal cluster centers differs from that of the existing methods. The results in Table 2 for the selected datasets indicate that the IPSO-K-means technique is able to find the global optimum solution with small standard deviations compared to the other methods. Future work may address the optimization of the initial cluster centers of the k-means algorithm using other hybrid techniques.