1 Introduction

Clustering, an important task in data mining, has been approached by many disciplines, e.g., biology, information retrieval, business, medicine, social science, and earth science, because of its wide applications. In a clustering problem, the objects of a data set are partitioned into an appropriate number of clusters based on some similarity function. Consequently, objects sharing the same cluster are more similar to each other than to objects in distinct clusters (Michalski and Stepp 1983; Prakash and Singh 2012). Clustering quality is often measured by an internal validity criterion, which may be based on different features of the clusters, e.g., compactness, isolation, or connectedness. Conventionally, clustering algorithms are broadly classified into partitional, hierarchical, and density based algorithms (Jain et al. 1999). Hierarchical methods organize the objects as a hierarchical tree structure in which each level represents a partition of the data set. These methods require neither initialization of solutions nor prior knowledge of the number of clusters. However, setting a stopping criterion for them is very difficult, and an object assigned to a cluster cannot move to another cluster at a later stage (Murtagh 1983). Density based clustering methods (Ester et al. 1996) implement the key idea that each point of a cluster must have at least a threshold number of points within a neighborhood of a defined radius. Though these methods are usually well suited to identifying outliers, they cannot handle varying densities and are sensitive to their parameters. Partitional clustering methods, on the other hand, are simple and directly decompose the data set into a number of clusters. Partitional clustering can be fuzzy or hard (Jain and Dubes 1988; Prakash and Singh 2014). In fuzzy partitional clustering (Bezdek 1981), each object belongs to every cluster with some membership weight; it is better suited to data sets having overlapping clusters. Hard partitional clustering, in contrast, creates non-overlapping clusters by assigning each object to exactly one cluster.

Based on the descriptions in Hansen and Jaumard (1997) and Kishor et al. (2016), hard partitional clustering can be mathematically represented as

$$\begin{aligned} C_p \ne \emptyset \quad p = 1,\ldots ,k \end{aligned}$$
(1)
$$\begin{aligned} C_p \cap C_q = \emptyset \quad p, q = 1,\ldots , k \quad \text {and} \quad p \ne q \end{aligned}$$
(2)

where \(C=\{C_1, C_2, \ldots, C_k \}\) is a set of k nonempty, non-overlapping clusters whose union covers the entire data set. Many traditional/conventional clustering algorithms, e.g., K-means (Jain and Dubes 1988) and K-Medoids (Kaufman and Rousseeuw 1987), have been proposed to solve hard partitional clustering problems. These algorithms are prominent as they are simple and easy to implement. Nonetheless, they suffer from several drawbacks, e.g., the final solution depends on the initial solution, and they are easily trapped in a local optimum (Jain et al. 1999).
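To make the constraints concrete, the following minimal Python sketch (an illustration, not part of the original study) checks whether a grouping of object indices satisfies Eqs. 1 and 2; the covering condition, that every object is assigned to some cluster, is also checked, as hard partitioning requires it.

```python
def is_hard_partition(clusters, n_objects):
    """Check the hard-partition constraints for a list of k sets of object indices."""
    if any(len(c) == 0 for c in clusters):      # Eq. 1: every C_p is nonempty
        return False
    seen = set()
    for c in clusters:
        if seen & c:                            # Eq. 2: clusters are pairwise disjoint
            return False
        seen |= c
    return seen == set(range(n_objects))        # covering: every object is assigned

# Example: a valid 3-cluster partition of six objects, and an overlapping one.
print(is_hard_partition([{0, 1}, {2, 3}, {4, 5}], 6))     # True
print(is_hard_partition([{0, 1}, {1, 2}, {3, 4, 5}], 6))  # False
```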

A large number of nature-inspired, population based global search metaheuristics, e.g., swarm intelligence methods (Abraham et al. 2008) and evolutionary methods (Hruschka et al. 2009), have been used to overcome these deficiencies and enhance the quality of clustering solutions. They possess several desirable characteristics, e.g., the ability to find a (near-)optimal solution by iteratively upgrading candidate solutions in the search space, parallel capabilities, flexibility, robustness, little or no need for domain knowledge, and self-organizing behavior (Prakash and Singh 2014; Bansal et al. 2013; Jadon et al. 2014). Evolutionary methods, e.g., the genetic algorithm (GA) (Holland 1975) and differential evolution (DE) (Storn and Price 1997), are generic population-based metaheuristics that mimic processes of natural evolution, where the key concept is survival of the fittest. Swarm intelligence methods, e.g., Particle Swarm Optimization (PSO) (Kennedy and Eberhart 1995), Ant Colony Optimization (ACO) (Dorigo and Stützle 2003), and Artificial Bee Colony (ABC) (Karaboga 2005), are modelled on the foraging behavior of social organisms such as birds, ants, and bees. In evolutionary and swarm intelligence methods, exploration and exploitation are the two key aspects of searching for a good solution in the search space. Exploration is responsible for diversifying solutions in the search space. In contrast, exploitation searches for a new solution in the neighborhood of a good solution to obtain a locally optimal solution (Sharma et al. 2013). ABC, a relatively recent swarm intelligence algorithm introduced by Karaboga (2005), is modelled on the intelligent food foraging behavior of real honey bees, where potential solutions are the food sources of the honey bees. ABC has three phases, of which the employed bees phase and the onlooker bees phase are responsible for exploitation, while the scout bees phase serves exploration in the search space.

Though ABC has shown good performance on benchmark problems (Karaboga and Ozturk 2011) and real-world problems (Chang et al. 2015; Gao et al. 2016), it suffers from two major problems. First, it becomes slower in convergence as the number of dimensions of the problem increases, because information exchange between bees is limited to one randomly selected dimension only (Karaboga and Basturk 2007; Zhu and Kwong 2010). Second, it lacks a proper balance of exploitation and exploration in the search space, owing to the weak update policy of the scout bees phase, which fails to increase diversity and mitigate stagnation (Liu et al. 2015), as well as the limited information exchange among bees in the employed bees phase and the onlooker bees phase (Yan et al. 2012).

In this paper, we propose the Hybrid Gbest-guided Artificial Bee Colony algorithm (HGABC) to tackle the above problems and improve the efficiency of clustering results. We incorporate a gbest-guided search procedure, similar to that of PSO, in the employed bees phase and the onlooker bees phase for fast convergence. Further, we introduce the crossover operator of GA into ABC to avoid being trapped in local optima and to improve information exchange (social learning) among the bees, thereby enhancing diversity. In this way, HGABC not only achieves a better balance between exploration and exploitation in the global search space but also converges faster. Though Zhu and Kwong (2010) incorporate a gbest-guided search equation in ABC for fast convergence, it may get stuck in a local optimum as solutions are influenced by the best solution in the swarm. Yan et al. (2012) improve ABC using the crossover operator of GA; however, the convergence speed of their algorithm deteriorates further as the dimension of the problem increases. To the best of our knowledge, there has been no attempt in the literature to address the above two issues simultaneously to achieve quality solutions with faster convergence. HGABC is compared with ABC (Karaboga 2005), two recent variants of ABC named Gbest-guided Artificial Bee Colony (GABC) (Zhu and Kwong 2010) and Hybrid Artificial Bee Colony Algorithm (HABC) (Yan et al. 2012), Standard Particle Swarm Optimization (PSO-2011) (Clerc 2012), a recent variant of particle swarm optimization named Accelerated Chaotic Particle Swarm Optimization (ACPSO) (Chuang et al. 2011), a genetic algorithm based on the K-means algorithm (KGA) (Bandyopadhyay and Maulik 2002), and a recent swarm intelligence based method, Spider Monkey Optimization (SMO) (Bansal et al. 2014). Experimental results show an encouraging performance of HGABC in terms of convergence speed, quality of solutions, and robustness over the competing algorithms on ten real and two synthetic data sets.

The rest of the paper is organized as follows. Section 2 discusses recent research related to our work. Section 3 presents a brief introduction to ABC and GA. Our proposed method, HGABC, is detailed in Sect. 4. Section 5 presents a comparative study and a discussion of the results. Finally, Sect. 6 concludes and remarks upon possible future research directions.

2 Related work

In recent years, swarm intelligence and evolutionary optimization methods have been widely applied to optimization problems. ABC is a recent and popular swarm intelligence method. However, researchers (Liu et al. 2015; Jadon et al. 2017) observe that ABC requires improvement to solve real-world problems in a global search space. In the last few years, researchers have made various efforts to improve ABC.

To overcome the slow convergence of ABC, Zhu and Kwong (2010) propose the gbest-guided ABC (GABC), which incorporates the global best solution into its search equation. GABC outperforms ABC on most of the tested benchmark functions. Xiang and An (2013) propose an efficient and robust ABC (ERABC) by modifying ABC in the following aspects. First, chaotic map based initialization is used to generate the initial population. Second, the search equation is modified based on the best-so-far solution to accelerate the search process in the onlooker bees phase. Third, to avoid local minima, a chaotic search method is performed in the scout bees phase. Additionally, reverse selection based on roulette-wheel is employed to maintain population diversity. Experimental results show good performance of ERABC on 23 benchmark functions. Karaboga and Gorkemli (2014) propose a quick ABC by introducing a new definition of the onlooker bees phase to improve local search ability. In this phase, an onlooker bee finds a new candidate solution based on the best solution within a neighbourhood radius. The main drawback of this algorithm is its computational overhead. Kıran and Fındık (2015) propose a directed artificial bee colony optimization to accelerate convergence, since the undirected search procedure of ABC converges slowly; directional information is added for each dimension of each food source as a control parameter. Although the effectiveness of the algorithm has been shown on nine numerical benchmark functions, its convergence speed is slow. We observe that researchers have attempted to achieve high quality solutions using ABC. However, faster convergence is still a challenge, especially when the number of dimensions of the problem increases.

To improve exploration and exploitation in the search space, Gao et al. (2013) incorporate Powell's method (Powell 1977) as a local search to improve the exploitation capability of ABC and call the result PABC. PABC performs well among its competitors; it achieves quality solutions and shows faster global convergence and robustness for almost all the considered unconstrained and constrained benchmark functions. In Liu et al. (2015), the authors propose a gbest- and pbest-guided ABC algorithm with asynchronous scaling factors (GPSABC), where two adaptive scaling factors are introduced. Further, they enhance the update policy of the scout bees to improve diversity and mitigate stagnation. The performance of the algorithms is tested on 23 benchmark functions. The experimental results demonstrate that GPSABC outperforms its competitors.

Various studies in the literature improve clustering performance using metaheuristic clustering methods when k is known a priori. Bandyopadhyay and Maulik (2002) propose KGA (a genetic algorithm based on the K-means algorithm) for clustering. Here, the K-means algorithm is applied to refine the cluster centres of each chromosome (solution) in the population. However, as the greedy behavior of K-means may force it to converge to a local optimum, the authors apply GA operators, e.g., crossover and mutation, to the refined solutions. Experimental results demonstrate that KGA outperforms its competitors on the selected data sets. Chuang et al. (2011) improve PSO for clustering; PSO is a popular swarm intelligence based search and optimization method modelled on the social behavior of birds within a flock, where each particle in the swarm is a potential solution (Kennedy and Eberhart 1995). As PSO usually converges to a local optimum, the authors generate the random parameters of its cognitive and social components using a logistic map to provide good diversity in the search space. The empirical results demonstrate that the proposed algorithm outperforms its competitors on the selected data sets. In Jensi and Jiji (2016), the authors present an improved krill herd algorithm (IKH) to solve clustering problems, as the original krill herd (KH) algorithm is quickly trapped in local minima. A genetic crossover operator is incorporated into KH to overcome this problem. The authors test the performance of IKH on six real data sets from the UCI machine learning repository and empirically show its superiority over its competitors.

In Bahrololoum et al. (2015), the authors use the Gravitational Search Algorithm (GSA), a metaheuristic based on the Newtonian law of gravity, to solve the clustering problem. The experiments and results on seven data sets from the UCI machine learning repository suggest that GSA is a good alternative for data clustering. In Pakrashi and Chaudhuri (2016), the authors propose HKA-K by improving the Heuristic Kalman Algorithm (HKA), incorporating K-means for fast convergence. The empirical analysis performed on several data sets shows that HKA-K is comparable or superior to its competitors. Karaboga and Ozturk (2011) apply ABC to the clustering of thirteen benchmark test data sets from the UCI machine learning repository and compare it with PSO and nine other methods. They show that ABC can be effectively used for data clustering, as it achieves competitive results. Zhang et al. (2010) incorporate Deb's constraint handling method (Goldberg and Deb 1991) into ABC to solve the clustering problem. This method is tested on three data sets and shows encouraging results over the other competing methods. However, the authors do not raise or handle the issue of the slow convergence of ABC. Yan et al. (2012) propose the hybrid artificial bee colony algorithm (HABC) for data clustering and function optimization by improving the information exchange ability of the bees in ABC. In HABC, the crossover operator of GA is introduced into ABC to improve information exchange between the bees; it diversifies solutions in the search space. However, this algorithm does not address the slow search of the algorithm when the dimension of the problem increases.

Based on this investigation, we observe that though the gbest-guided search equation of ABC in Zhu and Kwong (2010) is responsible for fast convergence, it may get stuck in a local optimum as solutions are influenced by the best solution in the swarm. In another method (Yan et al. 2012), though the crossover operator of GA improves information exchange between the bees, the convergence speed of the algorithm deteriorates, especially when the dimension of the problem increases. In this work, we improve ABC by taking advantage of both of the above approaches to achieve quality solutions with fast convergence. First, we incorporate a gbest-guided search procedure similar to that of PSO in the employed bees phase and the onlooker bees phase, which makes convergence faster. However, to avoid probable premature convergence, we further integrate the crossover operator of GA into ABC to diversify the solutions, which not only helps avoid premature convergence but also scatters the search over the whole search space and helps achieve a (near-)optimal solution. To the best of our knowledge, combining these two concepts, though each has been implemented separately, has not been attempted in the literature previously; the combination is very effective in balancing exploration and exploitation in the search space. Additionally, it helps the swarm converge faster to the final solution. The proposed algorithm HGABC is tested on ten real and two synthetic data sets for clustering. The experimental results are encouraging; it outperforms its competitors.

3 Algorithms background

3.1 Artificial Bee Colony (ABC)

The ABC algorithm is a relatively recent swarm intelligence, population based optimization algorithm introduced by Karaboga (2005), where potential solutions are the food sources of honey bees. It is inspired by the intelligent food foraging behavior of real honey bees. The fitness (quality) of a solution is evaluated based on the nectar amount of its food source.

In a natural bee swarm, there are three types of honey bees: employed bees, onlooker bees, and scout bees. The employed bees search for new food sources with better nectar amounts around the food sources in their memory. The onlooker bees gather information from the employed bees to select good food sources and search further for better food sources in the neighborhood of the selected ones. The scout bees randomly discover new food sources to replace exhausted ones. A simulated general structure of ABC may be described by the following three phases:

  • Employed Bees Phase: An employed bee searches for a food source having a higher nectar value in its neighborhood. The position of the new food source with respect to the \(i^{th}\) food source is obtained using Eq. 3.

    $$\begin{aligned} x_{ij}' = x_{ij} + \phi _{ij} (x_{ij}-x_{pj}) \end{aligned}$$
    (3)

    where \(p\in \{1,2,\ldots,N\}\) is the index of a randomly selected solution such that \(p\ne i\), N denotes the number of food sources, \(j\in \{1,2,\ldots,d\}\) is a randomly selected dimension index, and \(\phi _{ij}\) is a random number in [−1, 1]. If the newly obtained food source in the neighborhood is better than the food source in her memory, the bee updates her position with the new one. The number of employed bees is equal to the number of food sources.

  • Onlooker Bees Phase: All the employed bees provide information about the fitness (nectar amount) of their solutions (food sources) to the onlooker bees. The onlooker bees then choose food sources for further search based on the probability \(prob_i\), which is related to the fitness of the solutions obtained in the employed bees phase. One approach to evaluating \(prob_i\) is described in Eq. 4.

    $$\begin{aligned} prob_i = \left( 0.9\times \frac{fit_i}{\max _{i=1}^{N}{fit_i}}\right) +0.1 \end{aligned}$$
    (4)

    where \(fit_i = 1/( FFV_i +1)\) is the derived fitness value of food source i having fitness function value FFV\(_i\); here, N represents the number of food sources (solutions). Each onlooker bee modifies the solution in her memory, similar to an employed bee, using Eq. 3. If the obtained solution is better than the previous one, the bee updates her position with the new one.

  • Scout Bees Phase: It may happen that the positions of some food sources (solutions) are not updated in the previous two phases within a predetermined number of cycles known as the limit. Such food sources are called abandoned solutions. At this point, a scout bee searches randomly for a new solution \(x_i\) to replace each abandoned solution, as shown in Eq. 5.

    $$\begin{aligned} x_{ij} = x_{min,j} + rand(0,1) (x_{max,j}- x_{min,j}) \end{aligned}$$
    (5)

    where j denotes the jth dimension, and \(x_{min,j}\) and \(x_{max,j}\) are the minimum and maximum values of that dimension in the data set.

Algorithm 1: Pseudo code of the ABC algorithm

The pseudo code of the ABC algorithm is shown in Algorithm 1. In the food search process, a food source exchanges only a single dimension with a randomly selected food source in both the employed bees phase and the onlooker bees phase. This information sharing mechanism among the bees in the swarm is minimal, which makes ABC slower in convergence, especially when the number of dimensions of the problem increases. ABC also shows poor performance in balancing exploration and exploitation in the search space. In addition, information sharing among the bees is an uninformed decision, as a bee selects the other food source randomly, i.e., without any consideration of its quality, to update its position (see Eq. 3).
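For concreteness, the following minimal Python sketch implements one cycle of the three phases described above (Eqs. 3–5) for a generic minimization problem. It is an illustrative reconstruction, not the authors' MATLAB implementation; in particular, the onlooker selection shown here is one common way of realizing the probabilities of Eq. 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def neighbour(X, i):
    """Eq. 3: perturb one randomly chosen dimension of solution i
    using a randomly chosen partner p != i."""
    N, d = X.shape
    p = rng.choice([q for q in range(N) if q != i])
    j = rng.integers(d)
    x_new = X[i].copy()
    x_new[j] = X[i, j] + rng.uniform(-1.0, 1.0) * (X[i, j] - X[p, j])
    return x_new

def abc_cycle(X, ffv, trials, fitness_fn, x_min, x_max, limit):
    """One ABC cycle. X: (N, d) food sources; ffv: their fitness function
    values (lower is better); trials: counters of unsuccessful updates."""
    N, d = X.shape
    # Employed bees phase: greedy replacement by a neighbour solution.
    for i in range(N):
        x_new = np.clip(neighbour(X, i), x_min, x_max)
        f_new = fitness_fn(x_new)
        if f_new < ffv[i]:
            X[i], ffv[i], trials[i] = x_new, f_new, 0
        else:
            trials[i] += 1
    # Onlooker bees phase: solutions are probed in proportion to prob_i (Eq. 4).
    fit = 1.0 / (ffv + 1.0)                 # fit_i = 1 / (FFV_i + 1)
    prob = 0.9 * fit / fit.max() + 0.1
    for _ in range(N):
        i = rng.integers(N)
        if rng.uniform() < prob[i]:
            x_new = np.clip(neighbour(X, i), x_min, x_max)
            f_new = fitness_fn(x_new)
            if f_new < ffv[i]:
                X[i], ffv[i], trials[i] = x_new, f_new, 0
            else:
                trials[i] += 1
    # Scout bees phase: re-initialize abandoned solutions (Eq. 5).
    for i in range(N):
        if trials[i] > limit:
            X[i] = x_min + rng.uniform(size=d) * (x_max - x_min)
            ffv[i], trials[i] = fitness_fn(X[i]), 0
    return X, ffv, trials
```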

3.2 Genetic algorithm (GA)

GA is a population based optimization method that mimics the procedure of natural evolution (Goldberg 1989). Three genetic operators, crossover, mutation, and selection, are applied to the chromosomes of the current generation to produce the population of the next generation.

  • Crossover: In the crossover operation, a pair of chromosomes is selected from the parent population with a selection strategy; with some crossover probability \(P_c\), the pair produces a pair of new chromosomes known as offspring. This operator is responsible for exploration of the search space.

  • Mutation: The mutation operator alters dimension(s) of a solution at random to exploit another solution in its neighborhood with a very small mutation probability \(P_m\). This operator is responsible for exploitation of neighborhood in the search space.

  • Selection: Selection in GA is performed in two ways. The first is parent selection, performed before the crossover operation based on some criterion. The second is survivor selection, where good solutions are selected after the crossover and/or mutation operations to create the new population (next generation), either from the offspring alone or from the union of the parent and offspring populations, based on some selection strategy.

The pseudo code of a simple GA is shown in Algorithm 2.

Algorithm 2: Pseudo code of a simple GA

As the selection of parents for reproduction is based on some quality measure, it is an informed decision. Therefore, the crossover operation enhances the information sharing mechanism among the solutions in the population and has a better chance of creating good quality offspring through an informed search. Hence, the genetic crossover operator may be introduced into a population/swarm to improve exploration and to achieve good quality solutions in the global search space.
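The following Python sketch illustrates the two ingredients reused later in HGABC: binary tournament selection, which makes parent selection an informed decision, and single point crossover. It is a generic illustration; solutions are flat lists and lower fitness is better, as with the SSE used later.

```python
import random

def binary_tournament(population, fitness):
    """Pick two distinct candidates at random and return the fitter one."""
    a, b = random.sample(range(len(population)), 2)
    return population[a] if fitness[a] < fitness[b] else population[b]

def single_point_crossover(parent1, parent2):
    """Exchange segments beyond a random crossover point in [1, len - 1]."""
    point = random.randint(1, len(parent1) - 1)
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

# Example: build a mating pool of parents, then recombine two of them.
population = [[random.random() for _ in range(4)] for _ in range(10)]
fitness = [sum(s) for s in population]      # a toy fitness (lower is better)
pool = [binary_tournament(population, fitness) for _ in range(10)]
child1, child2 = single_point_crossover(pool[0], pool[1])
```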

4 Proposed method

In this section, we present a description of the proposed algorithm HGABC.

4.1 Solution representation and initialization

We follow a centroid based representation for a food source (candidate solution) for clustering. Every candidate solution consists of \(k * d\) dimensions, where k denotes the number of clusters, which is known a priori, and d indicates the number of dimensions of the data points. Each dimension of a cluster centroid is assigned a random number between the maximum \((x_{max})\) and minimum \((x_{min})\) values of the corresponding dimension in the data set. Algorithm 3 describes the steps of solution initialization. Here, FoodSource(n, q) represents the \(q^{th}\) dimension of the \(n^{th}\) cluster and rand is a uniformly distributed random number in the range [0, 1]. As the paper deals with the clustering problem where k is known a priori, a solution must contain k clusters. However, it is possible that a generated solution does not contain k valid clusters. Such an invalid solution is replaced by a randomly regenerated solution that is valid. While generating a solution, the dimensions of the solution must satisfy the dimensional boundary conditions of the search space: if the value of a dimension of a solution crosses the upper bound of the corresponding dimension in the search space, that value is replaced by the value of the upper bound, and similarly for the lower bound.

Algorithm 3: Solution initialization
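A Python sketch of the initialization described above follows; it is our reading of the text, with the validity check implemented by nearest-centroid assignment (an assumption, since the paper does not spell out how an invalid solution is detected).

```python
import numpy as np

rng = np.random.default_rng(0)

def init_food_source(data, k):
    """One candidate solution: k*d values, each drawn uniformly between the
    minimum and maximum of the corresponding dimension of the data set."""
    x_min, x_max = data.min(axis=0), data.max(axis=0)
    centroids = x_min + rng.uniform(size=(k, data.shape[1])) * (x_max - x_min)
    return centroids.ravel()

def is_valid(solution, data, k):
    """Valid only if every one of the k clusters receives at least one point
    under nearest-centroid assignment."""
    centroids = solution.reshape(k, -1)
    d2 = ((data[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    return len(np.unique(d2.argmin(axis=1))) == k

def init_swarm(data, k, N):
    """Generate N valid food sources; invalid ones are simply regenerated."""
    swarm = []
    while len(swarm) < N:
        s = init_food_source(data, k)
        if is_valid(s, data, k):
            swarm.append(s)
    return np.stack(swarm)
```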

4.2 The fitness function

The fitness function indicates the quality of a solution. As the Sum of Squared Error (SSE) works well with isolated and compact clusters (Jain et al. 1999), we select SSE, which rewards lower intra-cluster distances, as the fitness function; a lower value of SSE indicates a better quality solution. It is expressed in Eq. 6.

$$\begin{aligned} SSE=\sum _{i=1}^k \sum _{\forall x_p \in C_i} \Vert x_p-m_i\Vert ^2 \end{aligned}$$
(6)

Here, \(x_p\) denotes the pth data point in the data set, \(m_i\) denotes the centroid of the ith cluster, and \(C_i\) denotes the ith cluster.
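A direct Python transcription of Eq. 6 for the centroid-encoded solutions above might look as follows; each point contributes the squared distance to its nearest centroid.

```python
import numpy as np

def sse(solution, data, k):
    """Eq. 6: sum of squared distances from each point to its nearest centroid."""
    centroids = solution.reshape(k, -1)                       # (k, d) cluster centres
    d2 = ((data[:, None, :] - centroids[None]) ** 2).sum(-1)  # (n, k) squared distances
    return d2.min(axis=1).sum()
```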

4.3 HGABC

To apply HGABC to the clustering problem, the representation and encoding scheme of the food sources (solutions) are as detailed in Sect. 4.1. Each food source (solution) represents a set of cluster centroids \(C=\{C_1, C_2, \ldots C_k\}\), where k is the number of clusters. A set of solutions of size N is initially generated as shown in Algorithm 3. The number of food sources N equals half of the colony size (CS) of the bees in the swarm. It is an established fact that a proper balance of exploration and exploitation capabilities is of the utmost importance for a robust search process in population based swarm intelligence and evolutionary algorithms (Karaboga and Ozturk 2011). Exploration diversifies solutions in the global search space, whereas exploitation searches for improved solutions in the local search space (neighborhood) of an obtained solution. In HGABC, we improve the exploration and exploitation capabilities along with the convergence speed of ABC to obtain good clustering results across the range of data sets.

In the literature, it has been observed that the search equation of ABC guided by the global best solution for fast convergence (Zhu and Kwong 2010) may get stuck in local optima, as each solution is influenced by the best solution in the swarm. In another study (Yan et al. 2012), the crossover operator of GA improves the exploration of solutions in the swarm, but the convergence speed of the algorithm deteriorates further when the dimension of the problem increases. Addressing these issues specifically, we enhance ABC, inspired by the strengths of the above two approaches, as follows. In ABC, the employed bees phase and the onlooker bees phase are responsible for searching solutions in the neighbourhood of current solutions, as shown in Algorithm 1. In this process, a solution in the swarm exchanges only a single dimension with a randomly selected solution in the swarm. This information sharing mechanism among the bees is minimal, which makes ABC slower in convergence; the problem is further aggravated when the dimension of the problem increases. Therefore, inspired by Zhu and Kwong (2010), we replace the solution search equation shown in Eq. 3 in the employed bees phase and the onlooker bees phase by the search equation shown in Eq. 7 to speed up convergence. This new search equation contains the global best solution, which guides the candidate solution to converge quickly to the final solution in the global search space.

$$\begin{aligned} x_{ij}' = x_{ij} + \phi _{ij}(x_{ij}-x_{pj})+\psi _{ij}(y_j - x_{ij}) \end{aligned}$$
(7)

Here, \(y_j\) is the \(j^{th}\) dimension of the global best solution y; \(\phi _{ij}\) is a uniform random number in [−1, 1], as in Eq. 3; \(\psi _{ij}\) is a uniform random number in [0, B], where B is a nonnegative constant; and \(x_{pj}\) is the \(j^{th}\) dimension of a randomly selected solution p in the swarm. The above variant of ABC is known as the Gbest-guided Artificial Bee Colony (GABC) (Zhu and Kwong 2010). Further, in ABC, to search for a solution in the neighborhood of the current solution in the employed bees phase and the onlooker bees phase, the current solution selects another solution randomly instead of using a quality based metric or a heuristic, i.e., this selection is an uninformed decision. Consequently, it reduces the chance of producing a better solution. This inability, i.e., the poor exploration of ABC, has also been demonstrated in Karaboga and Akay (2009).
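In code, the gbest-guided move of Eq. 7 differs from Eq. 3 only by the additional term pulling the chosen dimension towards the global best solution y. A minimal sketch, assuming B = 1.5 as recommended for GABC in Zhu and Kwong (2010):

```python
import numpy as np

rng = np.random.default_rng(0)

def gbest_neighbour(X, i, y, B=1.5):
    """Eq. 7: Eq. 3 plus a pull of the chosen dimension towards gbest y."""
    N, d = X.shape
    p = rng.choice([q for q in range(N) if q != i])   # random partner, p != i
    j = rng.integers(d)                               # single random dimension
    phi = rng.uniform(-1.0, 1.0)                      # as in Eq. 3
    psi = rng.uniform(0.0, B)                         # gbest weight in [0, B]
    x_new = X[i].copy()
    x_new[j] = X[i, j] + phi * (X[i, j] - X[p, j]) + psi * (y[j] - X[i, j])
    return x_new
```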

To improve information sharing among the bees through an informed search mechanism, and to explore solutions in the global search space, we introduce a genetic crossover operation on the swarm \(C_s\) between the employed bees phase and the onlooker bees phase. For this purpose, a mating pool \(M_p\) of size N is created by selecting good solutions from the swarm using binary tournament selection without replacement. Hence, the crossover performed on the solutions in the mating pool has a better chance of producing good quality offspring.

Algorithm 4: Pseudo code of HGABC

Our crossover and replacement schemes, inspired by the conclusions drawn in Yan et al. (2013), are as follows:

  • Crossover scheme: The solutions generated by the employed bees phase in the swarm \(C_s\) are sorted in descending order of their fitness. Then, \(N * P_c\) solutions are selected from the middle of the sorted \(C_s\), where the parameter \(P_c\) is the crossover probability. These selected solutions compete with the offspring. For each selected solution i, two parents are selected randomly from the mating pool \(M_p\) to perform crossover. A single point crossover (Srinivas and Patnaik 1994) generates offspring by exchanging the segments of the selected parents beyond a crossover point, a random integer in the range 1 to \((k * d-1)\), where k denotes the number of cluster centroids and d indicates the number of dimensions of a centroid (see the sketch after this list).

  • Replacement scheme: If a generated offspring is better than the selected solution i, it replaces solution i.
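The sketch below renders the two schemes in Python under our stated assumptions: the swarm is sorted best first by SSE, the \(N * P_c\) competing solutions are taken as the centred middle slice of the sorted order (the paper says "from the middle" without giving an exact rule), and each offspring replaces its target only if it has a lower SSE.

```python
import random
import numpy as np

def crossover_step(swarm, ffv, mating_pool, Pc, fitness_fn):
    """swarm: (N, k*d) solutions; ffv: their SSE values (lower is better);
    mating_pool: list of N parent vectors chosen by binary tournament."""
    N, dim = swarm.shape
    order = np.argsort(ffv)                  # best (lowest SSE) first
    n_sel = int(N * Pc)
    start = (N - n_sel) // 2                 # centred middle of the sorted swarm
    for idx in order[start:start + n_sel]:
        p1, p2 = random.sample(mating_pool, 2)
        point = random.randint(1, dim - 1)   # crossover point in [1, k*d - 1]
        for child in (np.concatenate([p1[:point], p2[point:]]),
                      np.concatenate([p2[:point], p1[point:]])):
            f_child = fitness_fn(child)
            if f_child < ffv[idx]:           # replacement scheme
                swarm[idx], ffv[idx] = child, f_child
    return swarm, ffv
```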

Fig. 1 Flow chart of HGABC

In this process, the crossover improves information sharing among the bees, and the replacement of a number of selected solutions in the swarm by better offspring improves the quality of solutions. The ABC with the improved search equation and the embedded genetic crossover operator is named the Hybrid Gbest-guided Artificial Bee Colony (HGABC). Note that in HGABC, the onlooker bees phase and the scout bees phase are similar to those of ABC, with the exception that the new search equation is introduced in the onlooker bees phase, as mentioned above. Finally, HGABC is designed to explore and exploit the solutions in the global search space robustly and to converge quickly to the final clustering solution. We select the maximum number of fitness function evaluations (\(Max\_NFFE\)) as the stopping criterion to ensure a fair comparison of all algorithms. The pseudo code of HGABC is shown in Algorithm 4 and the flowchart is presented in Fig. 1.

4.4 Computational complexity of HGABC

In this section, we analyze the computational complexity of the proposed method HGABC. Here, complexity is defined as the total number of searches carried out in a run, i.e., the number of function value comparisons. The worst case computational complexities of the basic operations involved in HGABC are as follows:

  1. To refine the centroids of the solutions in the swarm: O(M * Max_NFFE), where M is the computational complexity of evaluating the SSE.

  2. To evaluate the fitness function (SSE) for each solution: O(n * k * d)

Thus, the overall computational complexity of HGABC is O(Max_NFFE * n * k * d), which is equivalent to that of the other competing algorithms. It is linear and proportional to Max_NFFE.

5 Experimental results and discussion

This section presents the experimental setup, a description of the data sets, the clustering results, and a comprehensive discussion. The experiments have been performed on a system with a Core i5 processor and 2 GB RAM in a Windows 7 environment, using programs written in MATLAB R2012a.

Fig. 2 Convergence speed of the algorithms for the data sets: a Iris, b Glass, c Vowel, d WBC, e Wine, f Zoo, g Dermatology, h Yeast, i CMC, j LD, k 10d4c, l 2d10c

5.1 Parameters setting

As the results of nature-inspired algorithms are influenced by their control parameters, these values should be chosen carefully. We choose a colony size (CS) of 80 for the algorithms, as suggested in Zhu and Kwong (2010). Therefore, the number of food sources (solutions) equals 40, i.e., half of the colony size. The parameter settings of the competing algorithms are as suggested by the researchers in the respective research papers. The stopping criterion, i.e., the maximum number of fitness function evaluations (Max_NFFE), is fixed at 10000, as further processing is merely a computational overhead owing to negligible improvement. We select Max_NFFE, instead of the number of iterations, as the stopping criterion to show the convergence speed of the algorithms, because the number of iterations does not fairly represent the computational cost of achieving the final solution. The crossover rate in HGABC is selected as suggested for HABC, which reports better performance at this value. The experimental settings for the proposed algorithm and the competing algorithms are shown in Table 1. The experiments are performed over 30 independent runs to compare the results. We use the symbol “–” to indicate that a parameter is not applicable to the respective algorithm.

Table 1 Parameter settings

5.2 Data sets descriptions

The data sets are matrices of size \(n* d\) with real-valued elements, which are to be partitioned into k non-overlapping clusters. We consider ten real data sets, Iris, Glass, Wisconsin Breast Cancer (WBC), Vowel, Wine, Zoo, Dermatology, Yeast, Contraceptive Method Choice (CMC), and Liver Disorders (LD), from the UCI machine learning repository, and two synthetic data sets, 10d4c and 2d10c. A brief summary of these data sets is presented in Table 2. As WBC contains 16 samples with missing features, we remove them and use only 683 of the original 699 samples.

Table 2 Data sets description

5.3 F-measure and Rand index

We use the F-measure (FM) and the Rand Index (RI) to judge the accuracy of the obtained clusters. The F-measure (Dash and Liu 1997; Prakash and Singh 2015) is a balanced measure, evaluated as the harmonic mean of precision and recall. Precision is the fraction of retrieved objects that are relevant, and recall is the fraction of relevant objects that are retrieved. The F-measure of a cluster with respect to a known class can be mathematically expressed as Eq. 8.

$$\begin{aligned} F-measure = 2\times \frac{precision \times recall}{precision + recall} \end{aligned}$$
(8)

where \(precision= m_{pq}/m_p\); \(m_{pq}\) is the number of objects that belong to both cluster p and class q, and \(m_{p}\) is the total number of objects in cluster p; \(recall= m_{pq}/m_q\), where \(m_{q}\) is the total number of objects in class q. The optimum value of the F-measure is 1.

The Rand index (Rand 1971) is a measure of similarity between the obtained clusters (p) and the known classes (q). It can be mathematically represented as shown in Eq. 9.

$$\begin{aligned} RI=(n_{11}+n_{00})/(n_{00}+n_{01}+n_{10}+n_{11}) \end{aligned}$$
(9)

where \(n_{11}\) is the number of pairs of objects that are in the same cluster in p and in the same class in q; \(n_{10}\) is the number of pairs that are in the same cluster in p but not in the same class in q; \(n_{01}\) is the number of pairs that are in the same class in q but not in the same cluster in p; and \(n_{00}\) is the number of pairs that are assigned to different clusters in p and different classes in q. The optimum value of RI is 1. Higher values of FM and RI indicate that the objects are associated with more relevant clusters.
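The following Python sketches compute the two measures. For the F-measure, the paper defines the per-cluster/per-class value (Eq. 8) but does not spell out how values are aggregated into a single score; the class-size-weighted best-match aggregation used below is one common choice and is our assumption. The Rand index follows Eq. 9 directly.

```python
import numpy as np
from itertools import combinations

def f_measure(m):
    """m[p, q]: number of objects in cluster p and class q. For each class,
    take the best F over clusters (Eq. 8), weighted by class size
    (assumed aggregation)."""
    n = m.sum()
    total = 0.0
    for q in range(m.shape[1]):
        best = 0.0
        for p in range(m.shape[0]):
            prec = m[p, q] / m[p].sum() if m[p].sum() else 0.0
            rec = m[p, q] / m[:, q].sum()
            if prec + rec > 0:
                best = max(best, 2 * prec * rec / (prec + rec))
        total += m[:, q].sum() / n * best
    return total

def rand_index(labels_pred, labels_true):
    """Eq. 9: fraction of object pairs on which clusters and classes agree."""
    n = len(labels_true)
    agree = sum((labels_pred[a] == labels_pred[b]) == (labels_true[a] == labels_true[b])
                for a, b in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)
```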

5.4 Comparison of results

The performance of HGABC and its competing algorithms is measured by the fitness value of the solution, the convergence speed to reach the best solution, and the clustering accuracy of the solution. The fitness function value (FFV) of a solution is its SSE, which indicates the compactness of the clusters. A lower FFV indicates a better solution, and a lower standard deviation of the FFV across runs represents the robustness of the algorithm.

Since HGABC is a variant of ABC, we show its effectiveness over ABC and the competing ABC variants separately in terms of the achieved FFV. Table 3 reports the quality of the solutions of HGABC against ABC, GABC, and HABC in terms of the best solution, i.e., the solution with the lowest FFV, along with the mean and standard deviation (SD) of the FFVs obtained over all independent runs. Similarly, Table 4 reports its performance with respect to the other recent competing algorithms.

Table 3 The best, mean and standard deviation (SD) of the FFVs of the final solutions obtained by HGABC, ABC, GABC and HABC within 10000 fitness function evaluations over 20 independent runs

The convergence behavior of HGABC with respect to ABC, GABC, and HABC is shown in Fig. 2a–l for the best solution obtained over all runs. We observe in Table 3 that the solutions obtained by ABC are inferior to those obtained by its variants GABC and HGABC across the range of data sets. In other words, the experimental results in Table 3 point out the inability of ABC to explore solutions in the search space and its slow convergence towards the global optimal solution. Figure 2a–l also confirms its slow convergence to the final solutions on the data sets. This behavior of ABC can be explained by analyzing the search procedure of the employed bees phase and the onlooker bees phase for generating a new solution: a solution in the swarm exchanges only a single dimension with a randomly selected solution, without knowledge of the quality of the selected solution. This information sharing mechanism among the bees reduces the convergence speed and optimization ability of the algorithm. It is also observed that GABC converges faster (see Fig. 2a–l) and achieves better solutions (see Table 3) than ABC within the predetermined number of fitness function evaluations for all the data sets. This reveals that replacing the basic search equation of ABC with a gbest-guided search equation, similar to the PSO position update equation, in the employed bees phase and the onlooker bees phase of GABC is effective. In GABC, the solutions in the swarm are guided by the best solution, which speeds up convergence towards the global optimal solution in the search space; the gbest-guided search equation thus overcomes the slow convergence of ABC.

However, GABC does not compete with HGABC in terms of clustering quality and performance, as it does not properly explore the global search space: its solutions are influenced by the best solution in the swarm. This indicates that incorporating a gbest-guided search equation alone is not sufficient to achieve better clustering results. On the other hand, though the competing algorithm HABC also incorporates the genetic crossover scheme into ABC, it does not compete with HGABC within the predetermined number of fitness function evaluations on these data sets; it does not adequately handle the slow search of the algorithm when the dimensions of the problem increase. This means that incorporating only the crossover into ABC is not sufficient either. In contrast, Table 3 reports that HGABC achieves better solutions than its competitors with respect to the SSE consistently across the range of data sets. In addition, HGABC reports a lower mean FFV of the final solutions over all independent runs than the competing algorithms. Moreover, its convergence rate (see Fig. 2a–l) is also the most promising. These experimental outcomes indicate that the proposed HGABC algorithm is highly competitive in obtaining better clustering results over the range of data sets: the incorporated genetic crossover, based on an informed search mechanism, improves the diversity of solutions in the search space and replaces a portion of the worse solutions in the swarm, leading to a better final solution.

Table 4 The best, mean and standard deviation (SD) of the FFVs of the final solutions obtained by HGABC, PSO-2011, ACPSO, KGA, and SMO within 10000 fitness function evaluations over 20 independent runs

In Table 4, HGABC is also compared with two recent PSO variants, PSO-2011 and ACPSO, which contain a gbest-guided search equation. It can be observed that PSO-2011 converges prematurely on the Glass, Wine, and Yeast data sets and does not reach optimal solutions on the rest of the data sets within the predetermined number of iterations. This indicates that PSO-2011 does not diversify the solutions properly in the global search space; it also confirms that a gbest-guided search equation alone is not sufficient to achieve a good solution. ACPSO, another recent variant of PSO, makes a good attempt to balance exploration and exploitation and outperforms PSO-2011 on most of the data sets; however, it is unable to achieve final solutions as good as those of HGABC. An exception is the Dermatology data set, where ACPSO converges faster and achieves a better final solution in the best run, though it performs worse in terms of the mean and standard deviation of the final solutions over all runs. It can also be observed in Table 4 that KGA, a GA variant incorporating genetic crossover, performs worse than HGABC. HGABC also outperforms SMO, a recent swarm intelligence method. Hence, these competing algorithms require a better balance of exploration and exploitation, as they do not sufficiently explore the search space to achieve a better solution. Experimentally, the convergence speed of HGABC was also more promising than those of PSO-2011, ACPSO, KGA, and SMO, just as it was against ABC and its variants; we omit the pictorial representations of their convergence behavior to keep the article short. Consequently, HGABC exhibits a better balance between exploration and exploitation in the search space and obtains high quality clusters with fast convergence. Furthermore, it achieves the minimum standard deviation (SD) on all data sets, which indicates its robustness over the competitors. On the whole, since HGABC contains the gbest-guided search equation and incorporates the genetic crossover procedure, it outperforms its competitors on the considered data sets and is robust.

Table 5 presents the FM and RI values of the optimal solutions reported in Table 3. It can be seen that the best solutions obtained by HGABC achieve better classification accuracy on most of the data sets. However, on the Glass, Vowel, Dermatology, and Yeast data sets, it does not achieve better classification accuracy even though it obtains a better solution. This means that the FFV does not map exactly to the FM and RI values on these data sets, because the SSE has no absolute correlation with FM and RI when the actual data distribution is not regular. On the Dermatology data set, based on the F-measure values, all the adopted algorithms perform highly unsatisfactorily, as it is a high dimensional data set containing 34 dimensions. Even on this data set, the proposed method outperforms the other methods.

Table 5 The F-measure and Rand Index of the best solution (as per FFV) over 20 runs

To demonstrate the statistical significance of HGABC, the FFVs of its best solutions over all runs are compared with those of the competing algorithms. The Friedman test (Friedman 1937), a non-parametric, multiple comparison statistical test, is used for this analysis. The average rankings of the clustering algorithms using Friedman's method are reported in Table 6, where the lowest ranked algorithm is considered the best.

Mathematically, the Friedman test statistic is defined as in Eq. 10.

$$\begin{aligned} T_f= \dfrac{12n}{k(k+1)}\left[ \sum _{j=1}^{k}( R_j)^{2} -\dfrac{k(k+1)^{2}}{4} \right] \end{aligned}$$
(10)

where n is the number of data sets, k is the number of participating algorithms, \((k - 1)\) is the degrees of freedom, and \(R_j\) is the average rank of the \(j^{th}\) algorithm. The significance level is set to 0.05 and the degrees of freedom is 7. The null hypothesis is that all the tested algorithms have equal FFVs; the alternative hypothesis is that they do not.

If the computed value of \(T_f\) is greater than the critical value (14.07), we reject the null hypothesis that all the algorithms have the same accuracy at a significance level of 0.05. It is seen that HGABC is ranked highest (best) on all the data sets except Glass and WBC.
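A small Python sketch of Eq. 10, assuming a results matrix of FFVs with one row per data set and one column per algorithm (a lower FFV receives the better, i.e., lower, rank); the computed \(T_f\) is then compared against the chi-square critical value, 14.07 for 7 degrees of freedom at \(\alpha = 0.05\).

```python
import numpy as np
from scipy.stats import rankdata

def friedman_statistic(results):
    """results: (n, k) FFVs for n data sets and k algorithms.
    R_j is the average rank of algorithm j; ties get average ranks."""
    n, k = results.shape
    ranks = np.apply_along_axis(rankdata, 1, results)  # rank within each data set
    R = ranks.mean(axis=0)                             # average rank per algorithm
    return 12.0 * n / (k * (k + 1)) * ((R ** 2).sum() - k * (k + 1) ** 2 / 4.0)

# Example: reject the null hypothesis when T_f exceeds the critical value.
# T_f = friedman_statistic(ffv_table);  reject = T_f > 14.07  # df = 7, alpha = 0.05
```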

Table 6 Average ranking of clustering algorithms based on FFVs
Table 7 Friedman's statistics and critical value for \(\alpha = 0.05\)

Table 7 also shows that the Friedman statistics (\(T_f\)) recorded on these data sets are greater than the critical value. Therefore, we reject the hypothesis that the FFVs of the algorithms are equal on these data sets. Hence, HGABC produces statistically significant accuracy on these data sets.

Overall, HGABC is highly competitive with its competitors across the range of data sets in terms of convergence speed, achieving high quality clusters, robustness, and a better balance between exploration and exploitation in the search space.

6 Conclusion

Nature-inspired algorithms are global search optimization methods that search for near-optimal solutions within a reasonable time. In this paper, we address two major problems of ABC: its slow convergence when the number of dimensions of the problem increases, and its lack of balance between exploration and exploitation in the search space. We propose HGABC, which deals with both problems simultaneously to improve clustering results. For this purpose, HGABC modifies ABC in two aspects. First, we incorporate a gbest-guided search procedure similar to that of PSO in the employed bees phase and the onlooker bees phase to speed up convergence. However, this alone may lead to premature convergence. Therefore, we further integrate the crossover operator of GA into ABC, which scatters the solutions over the whole search space and helps avoid premature convergence. The incorporation of these two concepts makes the search very effective and also helps the swarm converge faster to the final solution. To evaluate the performance of the proposed algorithm, it is compared with well-known evolutionary and swarm intelligence methods in the clustering domain over ten real and two synthetic data sets. The results demonstrate that HGABC outperforms ABC and its variants in terms of convergence speed, quality of clusters, and robustness across the range of real and synthetic data sets. In addition, it also outperforms the other competing algorithms.

Therefore, we conclude that HGABC is a simple and effective algorithm for hard partitional clustering with the following features.

  1. Simple: It is a simple extension of ABC to solve clustering problems.

  2. Easy to understand and implement: As it does not introduce any complex operator into ABC, it is easy to understand and implement.

  3. Quality solutions with faster convergence: It achieves a better balance between exploration and exploitation in the search space with fast convergence.

  4. Computationally efficient: Its computational complexity is equivalent to that of the other competing algorithms.

The authors aim to extend this study to data sets with an unknown number of clusters as future work.