Keywords

1 Introduction

Clustering is an unsupervised classification technique in which patterns or feature vectors or data items are organized in clusters/groups based on similarity, such that the data within the same cluster have high degree of similarity and patterns belonging to different clusters have high degree of dissimilarity. Jain et al. [1] provides a survey on clustering algorithms and Jain and Gajbhiye [2] provides a comparative performance analysis on different clustering algorithms. However, the clustering techniques can be broadly classified into two types: hierarchical and partitional. Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters into smaller ones and generates a hierarchy of partitions in which each partition is nested within the partition at the next level in the hierarchy. On the other hand, partitional clustering attempts to directly decompose the data set into a set of disjoint clusters.

K-Means clustering algorithm [2, 3] is a well-known partitioning based technique which partitions n-patterns into k-clusters with each pattern belongs to the cluster with the nearest mean. For improving the performance and efficiency of K-Means clustering, various and numerous methods have been proposed. Dash et al. [4] proposed a clustering approach by hybridizing K-Means algorithm with principal component analysis (PCA) to reduce dimensionality of data set. Additionally employed a new method obtain the initial centroid values by finding the mean of all the data sets divided into k different sets in ascending order. This approach has the demerit of taking large amount of time and it may eliminate some of the features which are also important. However, Yedla et al. [5] proposed an enhanced K-Means algorithm with improved initial center using the distance from the origin to improve the performance of K-Means algorithm. Although, this approach seems to solve the initialization problem but does not give any guarantee better time complexity. For reducing the computational time Fahim et al. [6] proposed a clustering method by considering initial centroids randomly. However, this method does not provide unique result due to very sensitive to randomly selected initial center points. Zhang and Xia [7] proposed the initial centroids algorithm based on K-Means that have avoided alternate randomness of initial centroids. Despite of several years of research on improving the performance of K-Means algorithm, the major drawback is its chances to trapping to local optima because it is an iterative and hill climbing method. Therefore in order to achieve diversification property hybridization with some global optimization technique is required.

Obtaining the optimal centroids for the clusters can be considered as a multimodal search problem. Therefore, any global optimization like technique differential evolution (DE) [8], genetic algorithm (GA) [9], particle swarm optimization (PSO) [10], ant colony optimization (ACO) [11], a bee colony optimization (BCO) [12], an evolutionary strategy (ES) [13], quantum inspired algorithms (QEA) [14], chemical reaction optimization (CRO) [1517] etc. can be used for clustering. Motivated by this few authors have applied global optimization techniques like genetic algorithm [1820], differential evolution [21], Teaching-Learning-Based Optimization [22, 23] algorithms to clustering problems and have shown better performance. However, the major drawback of these approaches is that obtaining a trade-off between diversification and intensification. Motivated by this, a hybridization of K-Means algorithm (used to intensify some of the solutions) and CRO algorithm (diversifies the solution space) is used.

The rest of this paper is organized as follows. Section 2 gives a brief overview of K-Means, DE-based and CRO-based clustering algorithms. Section 3 describes the proposed Hybrid CRO-K-Means Clustering algorithm, Sect. 4 gives the experimental set up and comparison between the proposed algorithm and some of the existing algorithms, and Sect. 5 concludes the paper.

2 Clustering Algorithms

2.1 K-Means Clustering Algorithm

K-Means is one of the simplest centroid based unsupervised learning algorithms that use K number of clusters (fixed a priori) to discriminate the data. Consider a data matrix X with n-data point patterns. The main goal is to find optimal positions of K-centroids (one for each cluster and represented by Ci) so as to partition n-data points into K-disjoint subsets by minimizing the sum of squared error. The sum squared error (E) is calculated as in Eq. 1.

$$ E = \sum\limits_{i = 1}^{K} {\sum\limits_{j = 1}^{n} {\left\| {X_{j} - C_{i} } \right\|} }^2 $$
(1)

where ||.|| is the Euclidean distance.

Algorithm:

  1. 1.

    Randomly select K centers or centroids.

  2. 2.

    Calculate the distance between each data point and cluster centers and assign the data point to the cluster center whose distance from the cluster center is minimum of all the cluster centers.

  3. 3.

    For each cluster, recompute the cluster centers.

  4. 4.

    Repeat step 2 and 3, until no cluster center changes.

2.2 Differential Evolution Based Clustering Algorithm

The differential evolution (DE) algorithm is a simple and efficient stochastic direct search method for global optimization. It was introduced several years ago (1997) [8]. It has several advantages such as: ability to find the global minimum of a non-differentiable, nonlinear and multimodal function irrespective of initial values of its parameters, parallel implementabilty, ease of use and good convergence properties. Therefore, a wide variants of it has been developed [18]. The major difference between the DE variants lies in the mutation and crossover strategies. The conventions used above is DE/a/b/c, where ‘DE’ stands for ‘differential evolution’, ‘a’ represents the base vector to be perturbed, ‘b’ represents number of difference vectors used for perturbation of ‘a’ and ‘c’ represents the type of crossover used (bin: binary, exp: exponential). Interested reader may go through [8, 18] to have a detail description regarding DE algorithm and its variants.

DE Clustering Algorithm:

  1. 1.

    Initialize each individual solution to contain K randomly chosen points from the dataset.

  2. 2.

    Calculate fitness function for each individual.

  3. 3.

    Perform Crossover and Mutation upon solutions based on the scheme DE/a/b/c.

  4. 4.

    Replace the existing solutions with the new ones if and only if the new solutions have better fitness value.

  5. 5.

    Repeat step 3 and 4, until some termination criteria is satisfied.

Select the best chromosome in the population which will act as the resultant clusters.

2.3 Chemical Reaction Optimization Based Clustering Algorithm

Chemical reaction optimization (CRO) algorithm is a recently established population based metaheuristics optimization technique [15, 16] which is inspired by the nature of chemical reactions. It does not attempt to capture every detail of chemical reaction rather loosely couples chemical reaction with optimization. A chemical reaction starts with some initial species/reactants/molecules in the unstable states and undergoes a sequence of collisions and become the final products in stable states [16]. During the chemical reaction the intra-molecular structure of a reactant changes. Most of the reactions are reversible in nature i.e. a reaction may be turned back due to instability. It enjoys the advantages of both Simulated Annealing (SA) and GA [16]. The major difference between CRO and other evolutionary techniques is that, the population size (that is the number of reactants) may vary from one iteration to the other where as in evolutionary techniques the population size remains fixed. But few authors [25] have proposed fixed population sized CRO algorithms and shown that fixed population sized CRO not only performs better but also easier to implement. The basic unit in CRO is a molecule (chromosome) which is considered as a single possible solution in the population. Each molecule has potential energy (PE) which refers to fitness value, to characterize the molecule. A chemical change of a molecule is triggered by a collision which may be uni-molecular collision (one molecule taking part in chemical reaction) or inter-molecular collision (two or more molecule collide with each other). The corresponding reaction change is termed as Elementary Reaction.

The chemical reaction optimization based Clustering algorithm consists of following steps:

  1. 1.

    Initialize each individual solution to contain K randomly chosen points from the dataset.

  2. 2.

    Evaluation of Potential Energy (PE) of each Reactant representing a set of cluster centers.

  3. 3.

    Applying Chemical reactions to generate new reactants.

  4. 4.

    Reactants update for better potential energy.

  5. 5.

    Check termination criteria. if satisfied go to step-6 otherwise go to step-3

  6. 6.

    Use the reactant having best Potential Energy as the cluster center.

3 Proposed Methodology

In this paper the hybridized CRO-K-Means algorithm uses four kinds of elementary reactions (two uni-molecular and two bi-molecular): on-wall ineffective collision, decomposition, inter-molecular ineffective collision, and synthesis. K-Means algorithm is hybridized as an on-wall ineffective collision of the CRO algorithm. K-Means algorithm can able to find the local minima and the use of other three reactions may assist in trapping from local optima. Therefore chances of obtaining near optimal solution in lesser number of iterations can be guaranteed.

The pseudo-code of clustering using CRO is shown in Algorithm 1 where ReacNum is the size of the population ‘P’ best_mole represents the molecule having the highest potential energy (PE) in current iteration, and the elementary reactions are shown in subsequent algorithms.

3.1 On-wall Ineffective Collision Reaction

In this reaction K-Means algorithm is applied on the reactant considering the values of the reactants as the initial cluster centers. The pseudo-code of the decomposition reaction is described in Algorithm 2.

3.2 Decomposition Reaction

In this reaction the value of a randomly selected atom of the reactant is changed in order to perform more local search. Consider a reactant Rj = {Wj,1, Wj,2 …, Wj,D} with Wj,x (x ∈ [1, n]) be an atom of the reactant-j. The rate of change of the atom of the reactant depends on the rate of reaction (λ). The rate of reaction is a random number generated from a Cauchy distribution because it diversifies the solution more as compared to traditional uniform distribution. The pseudo-code is described in Algorithm 3.

Here two reactants Rj = {Wj,1,…, Wj,D} and Rk = {Wk,1, Wk,2…, Wk,D} will take part in the reaction. These reactions help in diversification of the solution by generating a new solution that is significantly different from the current solution. These reactions occur with a probability of 50 %. Following types of bimolecular reactions are used.

3.3 Synthesis Reaction

In this reaction one reactant is produced due to reaction between a reactant Rj = {Wj,1,…, Wj,D} and another randomly reactant and Rrand = {Wk,1, Wk,2…, Wk,D}; of the iteration. Here, instead of traditional normal or uniform distribution; the rate of reaction (λ) is generated from a Cauchy distribution with a location parameter 0.7 and scale parameter 0.1. The pseudo-code of this reaction is described in Algorithm 4.

3.4 Inter-molecular Ineffective Collision Reaction

In this reaction two solutions R1 and R2 are obtained from reaction between two reactants Rj and Rk. The pseudo-code of this reaction is described in Algorithm 5.

3.5 Reactant Update

Algorithm 6 describes the pseudo-code of the reversible reaction. Every reaction is followed by a reversible reaction to update the reactants (cluster centers) for next iteration i.e. reactants having better fitness (PE) survive for the next iteration. In this reaction, a reaction producing a single product replaces the replaces the used reactant for better PE; and for reactions producing two products, best product replace the used reactant for better PE. Thus the number of reactants of the population remains same throughout the reaction process.

4 Experimental Set up and Simulation Results

For comparative performance analysis of proposed hybrid CRO-K-Means algorithm with K-Means, DE/rand/1/bin, and CRO [26] based clustering algorithms, four real world datasets such as: Lung cancer, Iris, Breast cancer Wisconsin and Haberman’s survival from UCI machine learning data repository are considered. Lung cancer dataset has 32 instances with predictive attributes which are nominal, and takes integer values from 0 to 3. Iris dataset has 150 patterns which are random samples of three species of the iris flower such as: setosa, versicolor, and virginica. Breast cancer Wisconsin dataset consists of 699 patterns which were collected periodically as Dr. Wolberg reports his clinical cases. Haberman’s survival dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer. Number of data points, number of attributes and number of clusters of all four datasets are listed in Table 1.

Table 1 Properties of the data sets considered

4.1 Performance Measure

In order to evaluate the effectiveness of proposed methodology with the other algorithms considered following three performance measures are considered.

Clustering metric: The clustering metric ‘M’ for K clusters C1, C2,…, CK is calculated as in Eq. 2.

$$ M(C_{1} ,C_{2} , \ldots ,C_{K} ) = \sum\limits_{i = 1}^{K} {\sum\limits_{{x_{j} \in C_{i} }}^{{}} {||{\text{x}}_{j} - C_{i} ||} } $$
(2)

The objective is to minimize ‘M’.

Note that for the evolutionary algorithms DE, CRO and Proposed methodology clustering metric is used as the fitness function or PE.

Intra-cluster Distance: The intra-cluster distance is mean of maximum distance between two data vectors within a set of clusters and is calculated as in Eq. 3.

$$ \frac{1}{K}\sum\limits_{i = 1}^{K} {\max_{ZpZq \in Ci} d(Zp,Zq)} $$
(3)

The objective is to minimize the intra cluster distance.

Inter-cluster distance: It is the minimum distance between the centroids of the clusters.

The objective is to maximize the inter-cluster distance.

4.2 Simulation Results

For all the evolutionary algorithms the population size or Reactant number is fixed to 10 and run up to a maximum of 100 iterations or 90 % of chromosomes or reactants converge to same point. As the evolutionary methods are stochastic in nature, ten independent simulations were performed for each data set and for each algorithm. The mean and standard deviation of ten iterations for each algorithm for Lung cancer, Iris, Breast cancer Wisconsin, and Haberman’s survival datasets are shown in Tables 2, 3, 4 and 5 respectively. Note that for a particular problem and for a particular run the initial value of cluster centers for the three evolutionary methods are kept same. For Differential evolution DE/rand/1/bin strategy with F = 0.5 and Cr = 0.6 is used.

Table 2 Comparison of the performance of different clustering algorithms for lung cancer dataset
Table 3 Comparison of the performance of different clustering algorithms for iris dataset
Table 4 Comparison of the performance of different clustering algorithms for breast cancer Wisconsin
Table 5 Comparison of the performance of different clustering algorithms for Haberman’s survival dataset

It can be observed from the simulation results that the proposed hybrid clustering method performs better than K-Means, DE-Clustering and CRO-Clustering algorithms for the Lung cancer, Iris, Breast cancer Wisconsin, and Haberman’s survival dataset considering Clustering Metric, Intra-Cluster Distance and Inter-Cluster Distance as performance measure.

5 Conclusion

In this paper, a hybrid of chemical reaction optimization and K-Means algorithm for data clustering is proposed. In this hybrid method K-Means algorithm is used as a chemical reaction of the CRO algorithm. The K-Means algorithm intensifies the solution up to local optima meanwhile other chemical reactions assist in overcoming the local optima by exploring the solution space; thereby a trade between intensification and diversification is implicitly maintained and chances of reaching a global optima increases. In order to evaluate the effectiveness of the proposed method four real world datasets have been considered. Results revealed that the proposed method performs better than the K-Means, DE based Clustering and CRO based clustering algorithms for the aforementioned datasets. However, the major drawback of the proposed method is that it takes more amount of time than other algorithms since it uses K-Means algorithm as one of the reactions.