1 Introduction

Data mining techniques extract knowledge from large amounts of data. These techniques include classification, clustering, association rule mining, etc. Cluster analysis is an unsupervised technique that groups data without knowledge of class labels. Clustering is applied in many application areas such as biology, security, business intelligence and web search [1]. Clustering can be divided into two categories: hard and soft clustering. In hard clustering, an object can belong to only a single cluster; in soft clustering, the same object can belong to more than one cluster.

Clustering algorithms are classified into two categories: partitional and hierarchical. Partitional clustering algorithms form the clusters by partitioning the data objects into groups, whereas hierarchical clustering algorithms form the clusters by hierarchical decomposition of the data objects. The K-Means algorithm is a partitional clustering algorithm and is the most widely used owing to its simplicity and efficiency. It chooses the initial centroids randomly from the data objects and uses the Euclidean distance to measure the distance between each data object and its cluster centroid. The K-Means algorithm may converge to a local optimum because of this random selection of initial centroids.

A number of optimization algorithms have been developed to provide a global optimum solution. Optimization algorithms are categorized into heuristic and metaheuristic. Heuristic means ‘to find’ or ‘to discover by trial and error’, and ‘meta’ means ‘beyond’ or ‘higher level’ [2]. Some of the nature-inspired metaheuristic optimization algorithms are the Genetic Algorithm [3, 4], Ant Colony Optimization [5], Simulated Annealing (SA) [6], Particle Swarm Optimization [7, 8], Tabu Search [9, 10], Cat Swarm Optimization [11], Artificial Bee Colony [12,13,14], Cuckoo Search Algorithm [15, 16], Gravitational Search Algorithm [17], Firefly Algorithm [18], Bat Algorithm [19], Wolf Search Algorithm [20] and Krill Herd [21].

The Crow Search Algorithm (CSA) is a population-based metaheuristic optimization algorithm introduced by Alireza Askarzadeh [22]. The algorithm simulates the intelligent behaviour of crows, which are considered among the world’s most intelligent birds. It is based on how crows find the hidden storage positions of excess food. Finding a food source hidden by another crow is not an easy task, because if a crow notices that it is being followed, it tries to fool the follower by moving to another position.

To overcome the local optimum problem of K-Means, this paper proposes a new clustering algorithm, called CSAK Means, that hybridizes the Crow Search Algorithm with the K-Means clustering algorithm.

The organization of this paper is as follows. Section 2 describes related research in the literature. Section 3 describes the K-Means clustering algorithm and section 4 discusses the CSA. Section 5 describes the proposed CSAK Means clustering algorithm. The experimental analysis is discussed in section 6. Conclusions and future work are provided in section 7.

2 Related works

In this section, some optimization-algorithm-based approaches to clustering problems and hybridizations of optimization algorithms with K-Means are discussed.

An Ant Colony Optimization approach for the clustering problem was given in [23]. An SA approach for the clustering problem was proposed in [24]. A Particle Swarm Optimization approach was given in [25]. A Tabu Search approach was proposed in [26]. Artificial Bee Colony Optimization approaches were given in [27, 28]. A Cat Swarm Optimization algorithm for clustering was proposed in [29].

A Genetic Algorithm combined with K-Means was developed in [30]. A hybrid clustering algorithm based on K-Means and the ant colony algorithm was proposed in [31]. Cluster analysis with K-Means and SA was introduced in [32]. A K-Means clustering algorithm based on Particle Swarm Optimization was proposed in [33, 34]. A Tabu-Search-based K-Means was developed in [35]. An Artificial-Bee-Colony-based K-Means algorithm was proposed in [36]. A combination of the Gravitational Search Algorithm with K-Means was introduced in [37]. The Firefly Algorithm combined with K-Means was proposed in [38], and the Bat Algorithm combined with K-Means was proposed in [39]. The Wolf Search Algorithm, Cuckoo Search, Bat Algorithm, Firefly Algorithm and Ant Colony Optimization integrated with K-Means were introduced in [40].

These algorithms attempt to solve the local optimum problem of K-Means, but they suffer from low-quality results, low convergence speed, complicated operators, complex structures and parameter-setting issues.

3 K-Means Clustering Algorithm

K-Means is the most widely used and easiest-to-implement clustering algorithm. It partitions the data objects into a predefined number K of groups by assigning each data object to its closest centroid. The main objective of K-Means clustering is to minimize the total intra-cluster distance, i.e., the squared error function, which is calculated using Eq. (1):

$$\begin{aligned} \sum \limits _{j=1}^K\sum \limits _{i=1}^N \left\| x_i^{(j)}-c_j \right\| ^2. \end{aligned}$$
(1)

A dataset consists of N objects \(x_i\), \(i=1, 2, \ldots , N\), each with D features.

The K-Means clustering algorithm is described as follows:

  1. Input the number of clusters K.

  2. Randomly select the K initial centroids \(c_j\), \(j=1, 2, \ldots , K\), from the data objects.

  3. Find the distance between each of the K cluster centroids and the data objects using Eq. (2):

     $$\begin{aligned} dis(x_{i},c_{j})=\sqrt{\sum \limits _{l=1}^{D}(x_{il}-c_{jl})^2}. \end{aligned}$$
     (2)

  4. Find the minimum distance and assign each data object to its nearest cluster.

  5. Update the centroids using Eq. (3), i.e., calculate the mean of all data objects assigned to each cluster:

     $$\begin{aligned} c_j=\frac{1}{N_j}\sum \limits _{{x_i}\in {s_j}}x_i \end{aligned}$$
     (3)

     where \(s_j\) is the set of objects assigned to cluster j and \(N_j\) is its size.

The K-Means algorithm is terminated when one of the following conditions is satisfied: (i) the average change in the centroids falls below a threshold, (ii) the maximum number of iterations is reached or (iii) there is no change in the cluster membership of the objects.

The main features of K-Means clustering are as follows: (i) it is simple and easy to implement and (ii) it can handle large numbers of data objects efficiently. Its main issues are as follows: (i) it needs the number of clusters in advance, (ii) it handles numeric data only and (iii) it can produce local optimum solutions.
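For concreteness, a minimal NumPy sketch of the procedure above is given below. The experiments in section 6 were run in Matlab; this Python version is illustrative only, and its termination test checks only the change in cluster membership.

```python
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    """Minimal K-Means: X is an (N, D) data matrix, K the number of clusters."""
    rng = np.random.default_rng(rng)
    # Step 2: choose K initial centroids randomly from the data objects.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 3-4: Euclidean distance to each centroid (Eq. 2),
        # then assign every object to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # no change in membership
        labels = new_labels
        # Step 5: recompute each centroid as the mean of its members (Eq. 3).
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(K)])
    sse = (dists.min(axis=1) ** 2).sum()           # squared error, Eq. (1)
    return labels, centroids, sse
```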

4 CSA

The principles of CSA are as follows: (i) crows live in groups, (ii) they memorize the positions of their food-hiding places, (iii) they follow each other to steal food and (iv) they protect their food sources.

The flock size is N in a D-dimensional environment, and the position of crow i in the search space at iteration iter is specified as \(x^{i,iter}\), \(i=1, 2, \ldots , N\); \({\textit{iter}}=1, 2, \ldots , {\textit{itermax}}\), where itermax is the maximum number of iterations. Each crow has a memory in which it remembers the position of its hiding place: at each iteration, the hiding place of crow i is specified by \(m^{i,iter}\), the best position it has obtained so far. Metaheuristic algorithms should provide a good balance between diversification and intensification; in CSA, this balance is controlled by the Awareness Probability (AP) parameter.

The CSA is described as follows (a minimal code sketch is given after the list):

  1. Initialize the parameters: flock size N, maximum number of iterations itermax, Flight Length FL and Awareness Probability AP.

  2. Initialize the positions of the crows randomly in the D-dimensional search space.

  3. Initialize the memory of each crow with its initial position.

  4. Evaluate the position of each crow.

  5. While iter < itermax:

    (a) for each crow i:

      i. randomly choose one of the crows to follow (say crow j);

      ii. if crow j does not know that crow i is following it, the new position of crow i is obtained using Eq. (4); if crow j does know that crow i is following it, it fools crow i and the new position of crow i is a random position:

        $$\begin{aligned} x^{i,iter+1}={\left\{ \begin{array}{ll} x^{i,iter} + r_i \times FL^{i,iter} \times (m^{j,iter}-x^{i,iter}), & r_j \geqslant AP^{j,iter} \\ \text{a random position}, & \text{otherwise} \end{array}\right. } \end{aligned}$$
        (4)

        where \(r_i\) and \(r_j\) are random numbers uniformly distributed in [0, 1];

      iii. check the feasibility of the new position; if the new position of the crow is feasible, its position is updated; otherwise, the crow stays in its current position;

      iv. evaluate the new position of each crow using the fitness function (Eq. (1) in this work);

      v. update the memory of each crow using Eq. (5):

        $$\begin{aligned} m^{i,iter+1}={\left\{ \begin{array}{ll} x^{i,iter+1}, & f(x^{i,iter+1}) \text{ is better than } f(m^{i,iter}) \\ m^{i,iter}, & \text{otherwise.} \end{array}\right. } \end{aligned}$$
        (5)

  6. End while.
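A minimal Python sketch of this loop for a generic minimization problem follows; the box bounds and parameter defaults here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def crow_search(fitness, dim, n_crows=20, itermax=100, fl=2.0, ap=0.1,
                lo=0.0, hi=1.0, rng=None):
    """Minimal CSA sketch that minimizes `fitness` over the box [lo, hi]^dim."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(lo, hi, (n_crows, dim))          # step 2: random positions
    mem = x.copy()                                   # step 3: memory = positions
    mem_fit = np.array([fitness(m) for m in mem])    # step 4: evaluate
    for _ in range(itermax):                         # step 5
        for i in range(n_crows):
            j = rng.integers(n_crows)                # (a)i: crow j to follow
            if rng.random() >= ap:                   # crow j unaware: Eq. (4)
                new = x[i] + rng.random() * fl * (mem[j] - x[i])
            else:                                    # crow j aware: random move
                new = rng.uniform(lo, hi, dim)
            if np.all((new >= lo) & (new <= hi)):    # (a)iii: feasibility check
                x[i] = new
            f = fitness(x[i])                        # (a)iv: evaluate
            if f < mem_fit[i]:                       # (a)v: memory update, Eq. (5)
                mem[i], mem_fit[i] = x[i].copy(), f
    best = mem_fit.argmin()
    return mem[best], mem_fit[best]
```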

5 Proposed algorithm

The K-Means clustering algorithm is easy to implement and handles large datasets efficiently; its main drawback is that it can produce local optimum solutions. To obtain a global optimum solution, K-Means can be combined with a global optimization algorithm. CSA is a metaheuristic global optimization algorithm, and in this section it is combined with the K-Means algorithm to form the proposed CSAK Means algorithm.

The proposed CSAK Means algorithm is described as follows (a code sketch is given after figure 1):

  1. Input the number of clusters K, the flock size N, the maximum number of iterations maxiter, the flight length FL and the awareness probability AP.

  2. Initialize the positions of the N crows and the memory M of the crows.

  3. For each crow, generate a matrix of size K×D with random numbers, where D is the number of features in the dataset; the maximum value of the random numbers is the total number of instances in the dataset.

  4. Encode the random numbers with the data objects. Each row then specifies the K cluster centres for the clustering algorithm. For example, if \(K=3\) and \(D=4\), a single row looks as shown in figure 1.

  5. Initialize the memory of each crow with its initial position, because initially the crows hide their food at their initial positions.

  6. Evaluate the fitness of the initial position of each crow using Eq. (1).

  7. Initialize the fitness of the memory of each crow with the fitness of its position.

  8. Update the positions of the crows:

    (a) while iteration \(\le \) maxiter:

      i. for each crow i:

        A. randomly choose one of the crows to follow (say crow j);

        B. if crow j does not know that crow i is following it, the new position of crow i is obtained using Eq. (4);

        C. if crow j does know that crow i is following it, the new position of crow i is a random position;

        D. check the feasibility of the new position; if the new position of the crow is feasible, its position is updated; otherwise, the crow stays in its current position;

      ii. evaluate the fitness of the new positions of the crows using Eq. (1);

      iii. update the memory of the crows using Eq. (5);

    (b) end while.

  9. Calculate the Euclidean distance from each data object to the centroids of the best solution obtained by CSA and assign each object to the nearest centroid.

Figure 1. Encoding.
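To make the encoding and the hybrid loop concrete, a NumPy sketch of the proposed procedure is given below. It follows the steps above, but the mapping of continuous CSA moves back to valid object indices (rounding and clipping) is our assumption, since the paper does not spell it out; parameter defaults are likewise illustrative.

```python
import numpy as np

def sse(X, centroids):
    """Fitness, Eq. (1): total squared distance to the nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

def csak_means(X, K, n_crows=20, maxiter=100, fl=2.0, ap=0.1, rng=None):
    rng = np.random.default_rng(rng)
    N, _ = X.shape
    # Steps 2-4: each crow is a vector of K random object indices; decoding
    # replaces every index with that object's feature values (figure 1),
    # giving a K x D matrix of candidate cluster centres.
    decode = lambda p: X[p]
    pos = rng.integers(N, size=(n_crows, K))
    mem = pos.copy()                                       # step 5
    mem_fit = np.array([sse(X, decode(p)) for p in mem])   # steps 6-7
    for _ in range(maxiter):                               # step 8
        for i in range(n_crows):
            j = rng.integers(n_crows)                      # A: crow to follow
            if rng.random() >= ap:                         # B: Eq. (4) move
                new = pos[i] + rng.random() * fl * (mem[j] - pos[i])
            else:                                          # C: random position
                new = rng.integers(N, size=K)
            # D: keep the move feasible by rounding back to valid indices
            # (this mapping is an assumption, not specified in the paper).
            pos[i] = np.clip(np.rint(new), 0, N - 1).astype(int)
            f = sse(X, decode(pos[i]))
            if f < mem_fit[i]:                             # Eq. (5)
                mem[i], mem_fit[i] = pos[i].copy(), f
    # Step 9: assign every object to the nearest centroid of the best solution.
    best = decode(mem[mem_fit.argmin()])
    d = np.linalg.norm(X[:, None, :] - best[None, :, :], axis=2)
    return d.argmin(axis=1), best
```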

6 Experimental results

6.1 Datasets

To evaluate the performance of the proposed CSAK Means algorithm, six benchmark datasets, Iris, Wine, Glass, Breast Cancer, Contraceptive Method Choice (CMC) and Haberman’s Survival, are used. For each dataset, the number of instances and the number of classes are specified in table 1. These datasets were collected from the UCI Machine Learning Repository [41].

Table 1 Dataset details.

Iris: This dataset contains 150 samples of iris flowers from 3 different species: Setosa, Versicolour and Virginica, with 50 observations per species. The attributes of each sample are sepal length, sepal width, petal length and petal width.

Wine: This dataset contains the chemical analysis of wines grown in the same region but derived from three different cultivars. There are 13 quantities found in each of the three types of wines.

Glass: This dataset contains types of glass, motivated by criminological investigation: glass left at the scene of a crime can be used as evidence if it is correctly identified. There are 10 quantities found in each of the six types of glass.

Wisconsin Breast Cancer: This dataset contains samples used to identify the type of breast cancer. The type is identified using 9 quantities found in each of the two classes of breast cancer.

CMC: This dataset contains samples of married women who were either not pregnant or did not know it at the time of interview. The problem is to predict the current contraceptive method choice (no use, long-term methods or short-term methods) of a woman based on her demographic and socio-economic characteristics. There are 9 quantities found for each of the three choices.

Haberman’s Survival: This dataset contains the cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer. There are 3 quantities found in each of the two survival statuses.

6.2 Measures

The performance of CSAK Means is evaluated with internal and external measures. The internal measure used is the Silhouette, and the external measures used are Purity, Normalized Mutual Information, Rand Index and FMeasure. The convergence time and the time taken for each iteration are also compared across the algorithms, and ANOVA tests for statistical significance are performed for all algorithms.


6.2a Purity: Purity is an external evaluation measure of the quality of a clustering. It is calculated as the number of correct predictions divided by the total number of data objects, using Eq. (6):

$$\begin{aligned} purity(C,T)=\dfrac{1}{N}\sum \limits _{i=1}^K \max _j|{c_i\cap t_j}| \end{aligned}$$
(6)

where N is the total number of objects, K is the number of clusters, \(c_i\) is the set of objects in cluster i of the predicted clustering C and \(t_j\) is the set of objects with actual class label j.
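A direct NumPy implementation of Eq. (6), assuming integer-coded class labels, might look as follows:

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Eq. (6): count each cluster's best-matching class, divide by N."""
    correct = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]    # true labels inside cluster c
        correct += np.bincount(members).max()      # max_j |c_i ∩ t_j|
    return correct / len(labels_true)
```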


6.2b Normalized Mutual Information: Normalized Mutual Information (NMI) is an external measure to validate the quality of a clustering. It is an information-theoretic measure of how much information the predicted clusters and the actual classes share, normalized by their entropies. It is calculated using Eq. (7):

$$\begin{aligned} NMI(X,Y)=\dfrac{2I(X;Y)}{H(X)+H(Y)}. \end{aligned}$$
(7)

X is the actual class label, Y is the label predicted by the algorithm, H is the entropy and I(X;Y) is the mutual information between X and Y.
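In practice this can be computed with scikit-learn; with arithmetic averaging, its NMI matches Eq. (7). The `labels_true` and `labels_pred` arrays below are assumed to hold the actual and predicted labels:

```python
from sklearn.metrics import normalized_mutual_info_score

# With arithmetic averaging, scikit-learn's NMI equals Eq. (7):
# 2 * I(X; Y) / (H(X) + H(Y)).
nmi = normalized_mutual_info_score(labels_true, labels_pred,
                                   average_method='arithmetic')
```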


6.2c Rand Index: Rand Index is an external measure of the similarity between the actual labels and the predicted labels. The measure takes a value between 0 and 1: 0 indicates that the two clusterings do not agree on any pair of points and 1 indicates that they are exactly the same. The Rand Index is calculated using Eq. (8):

$$\begin{aligned} Rand Index=\frac{TP+TN}{TP+FP+FN+TN}. \end{aligned}$$
(8)

TP (True Positive) is the number of pairs of objects of the same class placed in the same cluster. TN (True Negative) is the number of pairs of objects of different classes placed in different clusters. FP (False Positive) is the number of pairs of objects of different classes placed in the same cluster. FN (False Negative) is the number of pairs of objects of the same class placed in different clusters.
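A straightforward, if quadratic, pair-counting sketch of Eq. (8) follows; recent versions of scikit-learn also provide `metrics.rand_score`, which computes the same quantity:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Eq. (8): fraction of object pairs on which both labelings agree."""
    agree, total = 0, 0
    for a, b in combinations(range(len(labels_true)), 2):
        same_class = labels_true[a] == labels_true[b]
        same_cluster = labels_pred[a] == labels_pred[b]
        agree += (same_class == same_cluster)   # counts both TP and TN pairs
        total += 1
    return agree / total
```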


6.2d FMeasure: FMeasure is an external measure of the accuracy of the clustering results. It is the harmonic mean of precision and recall and can be computed using Eq. (9):

$$\begin{aligned} FMeasure=2\times \frac{precision\times recall}{precision+recall}. \end{aligned}$$
(9)

Precision is the number of correct positive predictions divided by the total number of positive predictions, i.e., TP divided by the sum of TP and FP; the best precision is 1, whereas the worst is 0. It is calculated using Eq. (10):

$$\begin{aligned} precision=\frac{TP}{TP+FP}. \end{aligned}$$
(10)

Recall is the number of correct positive predictions divided by the total number of actual positives; the best recall is 1.0, whereas the worst is 0.0. It is calculated using Eq. (11):

$$\begin{aligned} recall=\frac{TP}{TP+FN}. \end{aligned}$$
(11)
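Given the pair counts TP, FP and FN defined above for the Rand Index, Eqs. (9)–(11) reduce to a few lines:

```python
def f_measure(tp, fp, fn):
    """Eqs. (9)-(11) over the pair counts from the Rand-index contingency."""
    precision = tp / (tp + fp)                             # Eq. (10)
    recall = tp / (tp + fn)                                # Eq. (11)
    return 2 * precision * recall / (precision + recall)   # Eq. (9)
```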

6.2e Silhouette: The silhouette is an internal measure of how similar an object is to its own cluster compared with other clusters; it combines both cohesion and separation. It is calculated using Eq. (12):

$$\begin{aligned} sil(i)=\frac{b_i-a_i}{\max (a_i,b_i)} \end{aligned}$$
(12)

where \(a_i\) is the average dissimilarity of object i to all other objects within its own cluster and \(b_i\) is the lowest average dissimilarity of object i to the objects of any other cluster (its neighbouring cluster).
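The mean silhouette over all objects can be obtained from scikit-learn; `X` and `labels_pred` below are assumed from the earlier sketches:

```python
from sklearn.metrics import silhouette_score

# Mean of sil(i) over all objects (Eq. 12); X is the data matrix and
# labels_pred the cluster assignment produced by the algorithm.
score = silhouette_score(X, labels_pred, metric='euclidean')
```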


6.2f ANOVA: “Analysis of Variance” is a statistical test that determines whether there is any statistically significant difference between the means of two or more groups. A one-way ANOVA is used to find out whether the group means differ significantly from one another or whether the groups are relatively the same.

The one-way ANOVA table has six columns: (i) source of variability, (ii) sum of squares (ss) of each source, (iii) degrees of freedom (df) of each source, (iv) mean square (MS) for each source, (v) F-statistic, the ratio of the MSs and (vi) probability, the corresponding p-value of F.
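As a sketch, a one-way ANOVA over the per-run fitness values can be run with SciPy; the `runs_*` arrays are hypothetical placeholders for each algorithm's 10 run results:

```python
from scipy import stats

# One-way ANOVA over per-run fitness values; each runs_* array is assumed
# to hold the 10 fitness values obtained by one algorithm.
f_stat, p_value = stats.f_oneway(runs_kmeans, runs_kmeanspp,
                                 runs_genetic_km, runs_pso_km, runs_csak)
print(f_stat, p_value)   # a p-value below 0.05 rejects equal means
```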

6.3 Results

The algorithms are implemented in Matlab R2012a on an Intel i5 at 2.30 GHz with 4 GB RAM. The K-Means, K-Means++, Genetic K-Means, PSOK Means and CSAK Means algorithms are executed in 10 distinct runs with the parameters specified in table 2. The parameter values for the Particle Swarm Optimization algorithm are those suggested in [42], and the values for the CSA are those suggested in [22].

Table 2 Algorithm-specific parameters.

The fitness values of K-Means, K-Means++, Genetic K-Means, PSOK Means and CSAK Means for all datasets are shown in tables 3–8. The ANOVA statistical test results are shown in tables 9–14. Figures 2–7 compare the convergence behaviour of all algorithms on each dataset. The boxplots of the silhouette values are shown in figures 8–13.

Table 3 Fitness, measures and computation time values of Iris Dataset.
Table 4 Fitness, measures and computation time values of Wine Dataset.
Table 5 Fitness, measures and computation time values of Glass Dataset.
Table 6 Fitness, measures and computation time values of Cancer Dataset.
Table 7 Fitness, measures and computation time values of CMC Dataset.
Table 8 Fitness, measures and computation time values of Survival Dataset.
Table 9 ANOVA test results of Iris Dataset.
Table 10 ANOVA test results of Wine Dataset.
Table 11 ANOVA test results of Glass Dataset.
Table 12 ANOVA test results of Cancer Dataset.
Table 13 ANOVA test results of CMC Dataset.
Table 14 ANOVA test results of Survival Dataset.

6.4 Discussion

Table 3 shows the fitness, measures and computation time values for the Iris Dataset. For the Iris Dataset, CSAK Means provides the best solution, and its standard deviation is also smaller than those of the other algorithms. The internal and external index values of CSAK Means are better than those of the other algorithms. The convergence time and the time per iteration of CSAK Means are higher than those of the other algorithms.

Table 4 shows the fitness, measures and computation time values for the Wine Dataset. For the Wine Dataset, CSAK Means provides the best solution, and its standard deviation is also smaller than those of the other algorithms. The internal and external index values of CSAK Means are better than those of the other algorithms. The convergence time and the time per iteration of CSAK Means are, respectively, lower and higher than those of the other algorithms except PSOK Means.

Table 5 shows the fitness, measures and computation time values for the Glass Dataset. For the Glass Dataset, K-Means++ provides the best solution. The internal measure (silhouette) of Genetic K-Means is better than those of the other algorithms, while the external index values of CSAK Means are better than those of the other algorithms. The convergence time and the time per iteration of CSAK Means are, respectively, lower and higher than those of the other algorithms.

Table 6 shows the fitness, measures and computation time values for the Cancer Dataset. For the Cancer Dataset, CSAK Means provides the best solution. The internal and external index values of CSAK Means are better than those of the other algorithms. The convergence time and the time per iteration of CSAK Means are, respectively, lower and higher than those of the other algorithms.

Table 7 shows the fitness, measures and computation time values for the CMC Dataset. For the CMC Dataset, K-Means++ provides the best solution, but the worst, average and standard deviation values of CSAK Means are better than those of the other algorithms. The internal measure (silhouette) of CSAK Means is better than those of the other algorithms. The external index values of CSAK Means and Genetic K-Means are the same, and these values are better than those of the other algorithms. The convergence time of CSAK Means is higher than those of the other algorithms except PSOK Means, and its time per iteration is higher than those of all the other algorithms.

Table 8 shows the fitness, measures and computation time values for the Survival Dataset. For the Survival Dataset, CSAK Means provides the best solution. The internal and external measure values of CSAK Means are better than those of the other algorithms. The convergence time and the time per iteration of CSAK Means are higher than those of the other algorithms.

Tables 9–14 show the ANOVA test results. The purpose of the ANOVA test is to check whether there is any significant difference between the accuracies of the algorithms. The null hypothesis for an ANOVA is that there are no significant differences among the groups, and the alternative hypothesis is that there is a significant difference. Here, in all cases the p-value (the Prob>F column) is small enough that the null hypothesis is rejected and the alternative hypothesis is accepted; this implies that the accuracies of the algorithms are not all equal.

Figure 2. Fitness values of Iris Dataset.

Figure 3. Fitness values of Wine Dataset.

Figure 4. Fitness values of Glass Dataset.

Figure 5. Fitness values of Cancer Dataset.

Figure 6. Fitness values of CMC Dataset.

Figure 7. Fitness values of Survival Dataset.

Figure 8. Boxplot view of Iris Dataset.

Figure 9. Boxplot view of Wine Dataset.

Figure 10. Boxplot view of Glass Dataset.

Figure 11. Boxplot view of Cancer Dataset.

Figure 12. Boxplot view of CMC Dataset.

Figure 13. Boxplot view of Survival Dataset.

7 Conclusion and future work

In this paper, a hybridization of the CSA and the K-Means clustering algorithm is proposed, and this new algorithm is called CSAK Means. The results of the proposed algorithm are compared with those of the K-Means, K-Means++, Genetic K-Means and PSOK Means algorithms. To evaluate the CSAK Means algorithm, the fitness function used is the mean square error criterion. The afore-mentioned experimental results show that CSAK Means outperforms the K-Means, K-Means++, Genetic K-Means and PSOK Means algorithms. In the Genetic Algorithm, three operators, namely selection, crossover and mutation, need to be applied. PSO needs four parameters, namely inertia weight, individual learning factor, social learning factor and maximum velocity. CSA needs only two parameters, AP and FL. Each optimization algorithm has its own parameters, and it is tedious to fix the optimum values for each of them. In future, this work will be extended to determine the number of clusters dynamically.