Abstract
Clustering is an unsupervised data analysis technique that determines the similarity between data objects and groups similar data objects into one cluster. The similarity among data objects is measured through a distance function. Clustering has gained wide popularity owing to its unsupervised nature and is used in diverse research fields such as image segmentation, data analytics, and outlier detection. This work focuses on data clustering problems and proposes a new clustering algorithm based on the behavior of micro-bats. The proposed bat algorithm determines optimal cluster centers for data clustering problems. Several shortcomings are associated with the bat algorithm, such as a slow convergence rate, local optima, and the trade-off among search mechanisms. The slow-convergence issue is addressed through an elitist mechanism, while an enhanced cooperative method is introduced for handling population initialization. A Q-learning based neighbourhood search mechanism is also developed to effectively overcome the local optima issue. Several benchmark non-healthcare and healthcare datasets are selected for evaluating the performance of the proposed bat algorithm. The simulation results are evaluated using intracluster distance, standard deviation, accuracy, and rand index, and compared with nineteen existing meta-heuristic algorithms. The proposed bat algorithm obtains significant results on these datasets.
1 Introduction
In the data analytics field, clustering is a well-known data analysis method for determining similar data objects and grouping them into one cluster [1, 2]. A cluster consists of similar data objects and is dissimilar to the objects present in other clusters [3]. The clustering task can be described through partitional, hierarchical, grid-based, density-based, and model-based clustering [4,5,6]. Partitional clustering aims to divide the set of data objects into distinct clusters based on dissimilarity measures. Hierarchical clustering corresponds to a tree structure of clusters in which each node of the tree acts as a cluster. It consists of two approaches: agglomerative (bottom up) and divisive (top down). In the agglomerative approach, each data point initially belongs to a separate cluster; as the process goes on, data points with similar characteristics are merged into one cluster. In the divisive approach, all data points initially belong to one cluster and are repeatedly divided into smaller clusters based on dissimilarity criteria. Grid-based clustering quantizes the data space into a finite number of cells and generates a grid-like structure. In density-based clustering, the clusters are designed on the basis of data compactness, computed as the number of data points present within a given radius; the density of data points is then used to construct the clusters. Model-based clustering considers the generation of clusters using a probability distribution function, where each component represents a cluster. Clustering techniques have proved their potential in various fields such as image segmentation, stock-market analysis, pattern recognition, outlier detection, feature extraction, and medical data analysis [7,8,9].
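For instance, the core partitional step, assigning each data object to its nearest cluster centre under a Euclidean distance function, can be sketched as follows (a minimal illustration; the data points and centres are made up for the example):

```python
import numpy as np

def assign_clusters(X, centers):
    # Euclidean distance of every data object to every cluster centre.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Each object joins its nearest centre.
    return d.argmin(axis=1)

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_clusters(X, centers)
print(labels)  # → [0 0 1 1]
```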
Recently, meta-heuristic algorithms have been widely adopted in the field of data clustering for obtaining optimum clustering solutions [10]. These algorithms are inspired by swarm intelligence and insect behaviour, like particle swarm optimization (PSO) [11, 12], artificial bee colony (ABC) [13,14,15], and ant colony optimization (ACO) [16, 17]; well-known physics laws, like the magnetic optimization algorithm (MOA) [1], charged system search (CSS) [18], black hole (BH) [19], and big bang–big crunch (BB-BC) [20, 21]; chemical processes, like artificial chemical reaction optimization (ACRO) [22]; evolutionary algorithms [23], like the genetic algorithm (GA), genetic programming (GP), and the biogeography-based algorithm (BBA) [24]; animal behaviour, like grey wolf optimization (GWO) [25], elephant herd optimization [26], cat swarm optimization [27], and the lion optimizer [28]; and population-based algorithms, like the sine cosine algorithm [29], the stochastic fractal search algorithm [30], and the thermal exchange optimization algorithm [31]. These algorithms differ from each other in terms of the local and global search mechanisms adopted for computing local and global optimum solutions. A few algorithms have strong local search ability, for example CSO, BB-BC, SCA, and BH, while the rest have strong global search ability, like PSO, ABC, BBA, and GWO [10]. However, it is observed that to obtain the optimal solution, local and global search abilities should be balanced [12]. Further, Abraham et al. [32] stated that data clustering should minimize the dissimilarity between data points within a cluster and maximize the dissimilarity between data points of different clusters. Several clustering algorithms have been designed, such as k-means, c-means, tabu search, and simulated annealing. However, these algorithms are sensitive to initial solutions and thus are easily trapped in local optima.
For example, Choudhury et al. [33] designed an entropy-based method to determine initial solutions for the k-means algorithm, with the aim of overcoming the dependency of k-means on initial solutions. Moreover, Torrente and Romo [34] also considered the initialization issue of k-means and developed a new initialization method, based on the concepts of bootstrap and data depth, for computing optimal initial solutions. It is also noticed that traditional clustering algorithms face difficulty with complex and large datasets; this issue is effectively addressed through metaheuristic algorithms. For example, Ahmadi et al. [35] designed a clustering algorithm based on the grey wolf optimization method to tackle data clustering problems, especially with large datasets. Several modifications, like local search and a balancing factor, were incorporated into grey wolf optimization to effectively handle the data clustering problem. Ghany et al. [36] developed a hybrid clustering algorithm based on the whale optimization algorithm (WOA) and tabu search (TS) for solving data clustering. The reason for hybridization is to overcome local optima and to improve the quality of clustering solutions. The results stated that the hybridization of WOA and TS successfully handles the aforementioned issues.
Sorensen presented a critical evaluation of various well-known metaheuristic algorithms [37], stating that researchers should focus on the actual mechanisms behind the underlying concepts, and on promising research directions in the field, rather than on developing yet more new metaheuristic algorithms. Keeping this in mind, instead of designing a new metaheuristic algorithm, this work considers an existing one, the bat algorithm, for solving data clustering problems. Recently, the bat algorithm has become popular in the research community and provides optimal solutions for various optimization problems [38,39,40]. The bat algorithm was developed by Yang et al. [38] by modelling the behavior of micro-bats, especially their echolocation feature. Micro-bats use echolocation to detect prey (food) and avoid obstacles: to detect prey, a micro-bat emits a short pulse whose echo lets it recognize the shape and size of the prey. It is seen that several performance issues are associated with the bat algorithm [40,41,42], outlined as convergence rate, local optima, population initialization, and the trade-off factor among local and global searches. In turn, the bat algorithm converges to a near-optimal solution instead of the optimal solution. The issues related to the performance of the bat algorithm are summarized as follows:
- Population Initialization: The initial population has a significant impact on the success of clustering algorithms [42, 43]. If the initial population is not selected in an effective manner, premature convergence can occur, as meta-heuristic algorithms select the initial population using a random function.
- Local optima: It is noticed that, sometimes, the population of the bat algorithm is not updated in an effective manner [39, 44]. In turn, the objective function returns the same value in successive iterations and the algorithm converges to the same solution, which is not the optimal one. This situation is called local optima, and it occurs due to the lack of an appropriate mechanism to update the population of micro-bats.
- Convergence Rate: The convergence rate of an algorithm depends on the optimization process and the exploration of the search space [45, 46]. The convergence rate can also be affected by a lack of coordination between the exploration (global search) and exploitation (local search) processes.
The contributions of this work are given as follows:
1. To develop an enhanced cooperative co-evolution method to handle the population initialization issue.
2. To develop an elitist strategy for improving the convergence rate.
3. To incorporate a limit operator to check for the local optima situation in the algorithm.
4. To develop a neighbourhood search mechanism for exploring optimal candidate solutions in the exploration process.
5. To apply the proposed bat algorithm to solve clustering problems.
2 Related works
The recent works reported on partitional clustering algorithms are summarized in this section. Over the past few decades, a number of clustering algorithms have been developed for obtaining optimum partitional clustering results. A few of them are discussed below.
To determine the best initial population and the number of clusters automatically, Rahman and Islam [47] designed a hybrid algorithm based on K-means (KM) and the genetic algorithm (GA). The genetic algorithm was applied to determine optimized initial cluster centres for KM, and the fuzzy c-means algorithm was adopted to obtain optimum clustering results. The performance of the proposed algorithm was assessed on twenty datasets, and the results were compared with well-known clustering techniques. It was claimed that fuzzy c-means with GA gives better clustering results.
Liu et al. [48] presented a clone selection algorithm for addressing automatic clustering, in which the number of clusters is detected automatically. Hence, in this work, the authors introduced a genetic operator for detecting the number of clusters. Twenty-three well-known datasets were selected for measuring the performance of the clone selection algorithm, and the results were compared with the ILS, ACDE, VGA, and DCPSO algorithms. The authors claimed that the proposed algorithm provides better results without prior knowledge of the number of clusters.
A two-step artificial bee colony algorithm was reported for obtaining optimal clustering results [49]. Prior to implementation, three improvements were incorporated into the ABC algorithm to make it more robust and efficient: initial cluster centre locations, an updated search mechanism and equations, and abandoned food source handling. The initial cluster centre locations are determined through a one-step KM method. A PSO-based search mechanism is used for exploring the promising search space, and the Hooke–Jeeves concept is considered for evaluating abandoned food source locations. The performance of the proposed two-step ABC algorithm was tested on both artificial and benchmark datasets and compared with well-known clustering algorithms. It was observed from the results that the proposed algorithm significantly improves the performance of the conventional ABC algorithm.
Cao et al. [50] developed a new initialization method based on a neighbourhood rough set model. The intra-cluster and inter-cluster similarities of an object were represented in terms of cohesion and coupling degrees. Furthermore, the method was integrated with the KM algorithm for improving clustering results. The efficacy of the proposed algorithm was tested over three datasets and compared with two other initialization algorithms. The proposed initialization method provides superior results compared to traditional methods.
Han et al. [51] adopted a new diversity mechanism in the gravitational search algorithm to handle clustering problems. The collective response of birds is used to design the diversity mechanism, implemented through three simple steps: (i) initialization, (ii) identification (of nearest neighbours), and (iii) orientation alteration. The candidate population is generated in the initialization step, as the first step of the algorithm. The second step evaluates the nearest neighbours through a neighbourhood strategy. The third step changes the current location of a candidate solution based on its nearest neighbour. Thirteen datasets were chosen for evaluating the performance of the algorithm, and the simulation results were compared with well-known clustering algorithms. The authors claimed that the proposed algorithm achieves superior clustering results.
Senthilnath et al. [52] introduced a two-phase firefly algorithm (FA) for the clustering task. This algorithm simulates the flashing pattern and social insect behaviour of fireflies. The first phase of the algorithm measures the variation of light intensity, while the second phase governs the movement of the fireflies. The efficiency of the firefly algorithm was assessed on thirteen standard datasets and compared with ABC and PSO. The simulation results favour the use of the FA algorithm in the clustering field.
To handle the initialization issue of the K-means algorithm, Erisoglu et al. [53] developed a new initialization method based on a bi-dimensional mapping of features. Initially, two features are chosen: the first feature is the attribute with the maximum value of the variation coefficient, called the main axis. The second feature is determined using the correlation values between the main axis (first variable) and the rest of the attributes; hence, the second feature is the attribute with the minimum correlation. Several benchmark datasets were used to evaluate the performance of the proposed algorithm. Simulation results proved that the proposed method performs significantly better than the KM algorithm.
Kumar and Sahoo [54] hybridized the magnetic charge system search (MCSS) algorithm with PSO. The personal best mechanism of the PSO algorithm was added to the magnetic charge system search algorithm. Further, a neighbourhood strategy was also introduced to avoid the local optima situation. Ten datasets were selected for evaluating the performance of the MCSS–PSO algorithm, and the results were compared with a wide range of clustering algorithms. The authors claimed that better quality results are achieved by the MCSS–PSO algorithm.
Zhou et al. [55] introduced a simplex-method-based SSO algorithm for solving the clustering task. In this work, the simplex method is incorporated into the SSO algorithm to enhance local search ability and improve convergence speed. Eleven datasets were considered for evaluating the simulation results of the proposed algorithm, which were compared with well-known clustering algorithms. The proposed SSO algorithm performs well in terms of accuracy, robustness, and convergence speed.
Boushaki et al. [56] designed a new quantum chaotic cuckoo search (CS) algorithm for the clustering task. To extend the global search ability of the quantum chaotic cuckoo search algorithm, a nonhomogeneous update mechanism was employed, and chaotic maps were incorporated to improve convergence speed. The performance of the algorithm was compared with different variants of the CS algorithm and hybrid variants of clustering algorithms. The authors claimed that the proposed CS algorithm provides more compact clusters than the other algorithms.
A combination of GA and a message-based similarity (MBS) measure, called GAMBS, was presented for effective cluster analysis by Chand et al. [57]. The MBS measure consists of two types of messages, responsibility and availability, which are exchanged among data points and cluster centres. Responsibility measures the evidence regarding cluster centres, while availability corresponds to the appropriateness of a data point with respect to clusters. Further, GAMBS consists of a variable-length real-valued chromosome representation and evolutionary operators. Artificial and real-life datasets were adopted for measuring the performance of the GAMBS algorithm, and the simulation results showed that the algorithm obtains significant clustering results.
Hatamlou [23] developed a new clustering algorithm inspired by the black hole (BH) phenomenon. Like other clustering algorithms, the BH algorithm starts with initial population selection and objective function evaluation. The performance of the proposed algorithm was tested on six benchmark datasets, and it was stated that the black hole clustering algorithm provides better clustering results.
Zhang et al. [58] presented an ABC algorithm for data clustering. In ABC, onlooker and employed bees are responsible for local (exploitation) search, while scout bees are responsible for global (exploration) search. Further, Deb's rule is incorporated to redirect the search in the solution space. The performance was tested on three real-life datasets and compared with other clustering algorithms. The results revealed that the proposed algorithm provides good quality results.
Taherdangkoo et al. [59] reported a new blind naked mole rats algorithm in the clustering field. This algorithm models the food search capability and colony protection characteristics of mole rats. The algorithm starts by initializing the population of mole rats and searches the entire space for the optimal solution in a random fashion. In subsequent iterations, employed mole rats start moving toward the target food source and their neighbours. The performance of the proposed algorithm was tested on six standard datasets and compared with other well-known clustering algorithms. The results revealed that the blind naked mole rats algorithm provides higher accuracy with faster convergence speed.
Hatamlou [60] considered the slow convergence rate of the binary search algorithm and designed a new algorithm for cluster analysis. This algorithm chooses initial cluster points from different locations. Further, the search direction is based on successive objective function values: if the current objective function value is better than the previous one, the search proceeds in the same direction, otherwise in the opposite direction. Six benchmark datasets were chosen for evaluating the efficacy of the proposed algorithm, and the results were compared with the KM, GA, SA, TS, ACO, HBMO, and PSO algorithms. The proposed algorithm provides superior clustering results.
Bijari et al. [61] presented a memory-enriched BB–BC algorithm for clustering. It works in two phases: the BB phase generates random points near the initial seed points, while the BC phase optimizes these generated points. Since the original BB–BC algorithm is memoryless, a memory concept was integrated into it for memorizing the best locations and for maintaining the balance between exploration and exploitation. The performance of the algorithm was tested on six datasets and compared with well-known algorithms like GA, PSO, GWO, and the original BB–BC. The results stated that the clustering results are improved significantly.
Abualigah et al. [62] combined the krill herd (KH) optimization algorithm with harmony search (HS) to overcome the local optima problem in clustering. A global exploration operator and a reproduction procedure were integrated into the krill herd algorithm. Seven standard datasets were selected for measuring the performance of the proposed algorithm, and the results were compared with the GA, PSO, HS, KHA, H-GA, and H-PSO algorithms. The authors claimed that the proposed combination (KH + HS) achieves more accurate clustering results.
Pakrashi and Chaudhuri [63] hybridized the Kalman filtering algorithm with the KM algorithm. In this work, the authors considered the slow convergence rate of the KM algorithm and improved it with the help of Kalman filtering. Further, a conditional restart mechanism was also incorporated into K-means to handle the local optima situation. Seven benchmark datasets were taken for evaluating the performance of the proposed algorithm, and the results were compared with the HKA, KGA, GAC, ABCC, and PSO algorithms. It was noticed that the Kalman filtering algorithm successfully overcomes the deficiencies of the KM algorithm.
Kang et al. [64] hybridized KM and the mussels wandering optimization algorithm, called K-MWO. The proposed algorithm comprises the local search ability of KM, while global search is accomplished through mussels wandering optimization. The performance was tested on nine datasets, and the results were compared with the K-M and K-PSO algorithms. The authors claimed that K-MWO is an effective clustering algorithm.
To solve clustering search space problems, Wang et al. [65] presented a hybrid version of the flower pollination algorithm (FPA) and the bee pollinator algorithm (BPA). The discard pollen operator of ABC is used for enhancing the global search ability of the flower pollination algorithm. Further, the local search mechanism is improved through elite mutation and crossover operators. Several artificial and benchmark datasets were selected for measuring the performance of the proposed algorithm, and the simulation results were compared with the KM, FPA, CS, PSO, ABC, and DE algorithms. The results proved that the combination of FPA and BPA provides more optimal results than the others.
Hatamlou and Hatamlou [66] designed a two-stage clustering approach to overcome the drawbacks of particle swarm optimization, like local optima and slow convergence speed. In the first stage, the PSO algorithm is adopted for generating the initial candidate solutions; in the second stage, the HS algorithm is used for improving the quality of the solutions. Seven datasets were chosen for measuring the performance of the proposed algorithm, and the results were compared with the KM, PSO, GSA, and BB-BC methods. It is seen that the proposed algorithm determines good quality clusters.
A hybrid version of the ABC algorithm with the genetic algorithm was presented by Yan et al. [67] for enhancing the information exchange mechanism among bees, and applied to solving data clustering problems. The information exchange mechanism is enhanced with the help of a crossover operator. Six standard datasets were adopted for evaluating the simulation results of the proposed ABC algorithm, and the results were compared with other ABC, CABC, PSO, CPSO, and GA clustering algorithms. The proposed ABC algorithm has better clustering results than the others.
To perform efficient clustering, Kwedlo [68] combined the differential evolution (DE) algorithm with KM. The KM algorithm is used to tune candidate solutions generated through the mutation and crossover operators of DE. Additionally, a reordering procedure is also introduced to handle redundant solutions. The performance of the proposed algorithm was compared with five other well-known clustering algorithms. It was noticed that the DE-KM algorithm gives state-of-the-art clustering results.
Yin et al. [69] presented a hybridized version of an improved GSA with KHM for solving clustering problems. This work considers the convergence rate of KHM and the diversity mechanism of GSA to develop the new algorithm. The performance of the proposed algorithm was tested on seven benchmark datasets and compared with other well-known clustering algorithms. The authors claimed that the KHM–GSA combination achieves better convergence.
A hybrid version of the ant algorithm was presented for handling clustering problems [70]. The KHM algorithm was used to hybridize the ant algorithm, giving KHM-Ant. The proposed algorithm contains the merits of both: the initialization characteristic of KHM and the local optima avoidance of the ant algorithm. Five benchmark datasets were considered for measuring the performance of the KHM-Ant algorithm, and the simulation results were compared with KHM and ACA. The authors claimed that more optimal results are achieved by the KHM-Ant algorithm.
Xiao et al. [71] developed a quantum-inspired GA (QGA) for partitional clustering. In this work, a Q-bit based representation and the rotation operation of quantum gates are applied for achieving better search mechanisms. Several standard and simulated datasets were selected for evaluating the performance of the QGA. The QGA is able to find optimal clusters without prior knowledge of the number of cluster centres.
Aljarah et al. [72] hybridized the grey wolf optimizer (GWO) with tabu search (TS) for cluster analysis. TS is incorporated as an operator in GWO for searching the neighbourhood, which helps in balancing the exploration and exploitation of GWO. The proposed GWOTS was tested on thirteen real datasets, and the results were compared with other popular metaheuristics. The experimental results show that GWOTS is superior in terms of convergence behaviour and optimality of results.
A PSO-based clustering algorithm was presented in [73]. The concept of cooperative co-evolution is incorporated into PSO for improving convergence rate and diversity: the cooperative co-evolution method works as a decomposer and the PSO algorithm as an optimizer. Standard and simulated datasets were selected for measuring the performance of the algorithm, which was compared with the SRPSO, ACO, ABC, DE, and KM algorithms. The concept of cooperative co-evolution improves the performance of PSO in a significant manner.
To solve clustering problems effectively, an improved CSO algorithm was reported in [74]. Several modifications, described in terms of search equations, were incorporated into the CSO algorithm to make it effective. Further, a local search method was also developed for handling the local optima problem. The performance was evaluated on five datasets and compared with several known clustering algorithms. Simulation results showed that the improved CSO obtains effective clustering results.
A classroom-teaching based meta-heuristic algorithm was also presented for handling clustering problems [75]. The properties of the K-means algorithm were also investigated for effective clustering results [76]; six benchmark datasets were used to evaluate the performance of the aforementioned algorithm, measured in terms of overlapping, number of clusters, dimensionality, and cluster size. An intelligent system for spam detection was presented in [77], in which more relevant features were identified using evolutionary random weight networks. Table 1 gives a summary of the various studies in the literature.
3 Bat algorithm
Bat algorithm is based on the echolocation behaviour of micro-bats, especially prey detection and obstacle avoidance [38]. In search of prey, a micro-bat emits short pulses and uses the echoes from nearby objects to determine their shape and size. The loudness, emission rate, and random variable of the bat algorithm are updated using Eqs. (1–2).

$${\mathrm{A}}_{\mathrm{i}}^{\mathrm{t}+1}=\alpha {\mathrm{A}}_{\mathrm{i}}^{\mathrm{t}}\kern2.5em \left(1\right)$$
$${\mathrm{r}}_{\mathrm{i}}^{\mathrm{t}+1}={\mathrm{r}}_{\mathrm{i}}^{0}\left[1-\exp \left(-\gamma \mathrm{t}\right)\right]\kern2.5em \left(2\right)$$

Where, \({\mathrm{A}}_{\mathrm{i}}^{\mathrm{t}}\) is the loudness, \({\mathrm{r}}_{\mathrm{i}}^{\mathrm{t}}\) is the pulse emission rate, and (α) and (γ) are user-specified constants, with α taking values between [0–1]. The frequency and velocity of bats are computed using Eqs. (3–4).

$${\mathrm{f}}_{\mathrm{i}}^{\mathrm{t}}={\mathrm{f}}_{\mathrm{min}}^{\mathrm{t}}+\left({\mathrm{f}}_{\mathrm{max}}^{\mathrm{t}}-{\mathrm{f}}_{\mathrm{min}}^{\mathrm{t}}\right)\times \mathrm{rand}\left(\right)\kern2.5em \left(3\right)$$
$${\mathrm{v}}_{\mathrm{i}}^{\mathrm{t}}={\mathrm{v}}_{\mathrm{i}}^{\mathrm{t}-1}+\left({\mathrm{x}}_{\mathrm{i}}^{\mathrm{t}}-{\mathrm{x}}^{\ast}\right){\mathrm{f}}_{\mathrm{i}}^{\mathrm{t}}\kern2.5em \left(4\right)$$

Where, \({\mathrm{f}}_{\mathrm{min}}^{\mathrm{t}}\) and \({\mathrm{f}}_{\mathrm{max}}^{\mathrm{t}}\) denote the lowest and highest frequencies at time stamp (t), rand() is a random function having values between [0–1], \({\mathrm{v}}_{\mathrm{i}}^{\mathrm{t}-1}\) is the previous velocity, \({\mathrm{x}}_{\mathrm{i}}^{\mathrm{t}}\) is the current position, and x∗ is the current best position. The positions of bats are updated using Eqs. (5–6).

$${\mathrm{x}}_{\mathrm{i}}^{\mathrm{t}+1}={\mathrm{x}}_{\mathrm{i}}^{\mathrm{t}}+{\mathrm{v}}_{\mathrm{i}}^{\mathrm{t}}\kern2.5em \left(5\right)$$
$${\mathrm{X}}_{\mathrm{new}}={\mathrm{X}}_{\mathrm{old}}+\varepsilon {\mathrm{A}}^{\mathrm{t}}\kern2.5em \left(6\right)$$

Where, Xnew is the new position generated by a local random walk around the selected best solution, ε is a random number in [−1, 1], and \({\mathrm{x}}_{\mathrm{i}}^{\mathrm{t}+1}\) is the final updated position.
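The frequency, velocity, and position updates can be sketched in code as follows (a minimal sketch assuming the standard forms of Eqs. (3)–(5) of the bat algorithm; the frequency bounds and starting values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def bat_step(x, v, x_best, f_min=0.0, f_max=2.0):
    # Eq. (3): frequency drawn with rand() in [0, 1]
    f = f_min + (f_max - f_min) * rng.random(x.shape)
    # Eq. (4): velocity pulled toward the current best position
    v_new = v + (x - x_best) * f
    # Eq. (5): position update
    return x + v_new, v_new

x = np.array([1.0, 2.0])
v = np.zeros(2)
x_best = np.array([0.5, 1.0])
x_new, v_new = bat_step(x, v, x_best)
```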
4 Improved Bat algorithm for cluster analysis
This section presents the proposed improvements in the bat algorithm. These improvements are (i) an enhanced co-operative co-evolution method for population initialization, (ii) an elitist strategy for a better convergence rate as well as a trade-off between local and global solutions, and (iii) a neighborhood strategy to avoid local optima and explore good candidate solutions.
4.1 Enhanced co-operative co-evolution method
It is observed that the efficiency of a clustering algorithm also depends on the initial cluster points [42, 44]. Several initialization methods have been reported to address the initial cluster selection problem [47,48,49,50]. For improving the performance of the bat algorithm, an enhanced co-operative co-evolution framework is introduced to select the initial cluster centers. The co-operative co-evolution method works on the divide-and-conquer paradigm: the problem is divided into subproblems, the subproblems are solved individually, and the final solution is obtained by combining the subproblem solutions. Hence, in this work, a co-operative co-evolution method with a centroid selection mechanism is proposed. This method considers the number of partitions, the size of the partitions, and the selection criteria for population initialization.
4.1.1 Population partitions and size description
This subsection describes the number of partitions and their sizes for implementing the co-operative co-evolution method. The first task is to divide the population (data instances) into several predefined partitions. The number of partitions is equal to the number of clusters (K) for a given dataset, as mentioned in Eq. 7.

$${\mathrm{p}}_{\mathrm{n}}\propto \mathrm{K}\kern2.5em \left(7\right)$$

Where, pn denotes the number of partitions and K denotes the number of clusters. The size of each subpopulation is obtained through Eq. 8.

$${\mathrm{p}}_{\mathrm{s}}=\mathrm{T}/\mathrm{K}\kern2.5em \left(8\right)$$

Where, T is the total population size, i.e., the number of data instances in a dataset, K is the number of clusters, and ps denotes the size of the subpopulations. The subpopulations are composed through Eq. 9.

$${\mathrm{p}}_{{\mathrm{s}}_1}=\left(1:\mathrm{UL}\right),{\mathrm{p}}_{{\mathrm{s}}_2}=\left(\mathrm{UL}+1:2\mathrm{UL}\right),\dots, {\mathrm{p}}_{{\mathrm{s}}_{\mathrm{n}}}=\left(\left(\mathrm{n}-1\right)\mathrm{UL}+1:\mathrm{nUL}\right)\kern2.5em \left(9\right)$$

In Eq. 9, ps1, ps2, …, psn denote the subpopulations and UL represents the upper limit (index) of a subpopulation. Further, from each subpopulation an appropriate centroid is selected using Eq. 10.

$${\mathrm{C}}_{\mathrm{k}}=\min \left({\mathrm{p}}_{{\mathrm{s}}_{\mathrm{n}}}\right)+\left(\max \left({\mathrm{p}}_{{\mathrm{s}}_{\mathrm{n}}}\right)-\min \left({\mathrm{p}}_{{\mathrm{s}}_{\mathrm{n}}}\right)\right)\times \mathrm{rand}\left(\right)\kern2.5em \left(10\right)$$

Where, Ck denotes the kth cluster center, min(psn) and max(psn) are the minimum and maximum values of the nth subpopulation (psn), and rand() is a random number.
The working of the co-operative co-evolution method is illustrated using the iris dataset. This dataset contains one hundred fifty data instances and four attributes, divided into three classes; hence, the number of clusters (K) considered for the iris dataset is three. The co-operative co-evolution method uses Eqs. (7–10) to determine the initial population, in terms of cluster centers, for the clustering algorithm.
a) The first step is to divide the population into subpopulations, using Eq. 7.
$$\begin{array}{cc}{\mathrm{p}}_{\mathrm{n}}\propto \mathrm{K},\mathrm{For}\ \mathrm{iris}\ \mathrm{dataset},\mathrm{K}=3; \\ {\mathrm{p}}_{\mathrm{n}}=3\end{array}$$
b) In the second step, the size of the subpopulations is computed through Eq. 8.
$$\begin{array}{cc}{\mathrm{p}}_{\mathrm{s}}=\mathrm{T}/\mathrm{K},\mathrm{where}\ \mathrm{T}=150\kern0.5em \mathrm{and}\ \mathrm{K}=3;\\ {\mathrm{p}}_{\mathrm{s}}=150/3=50.\end{array}$$
The size of each subpopulation (ps) is 50. Further, the subpopulations are determined using Eq. 9.
c) In the third step, Eq. 10 is used for computing the initial cluster centers for the clustering algorithm.
When n = 1: ck1 = min(1 : 50) + (max(1 : 50) − min(1 : 50)) ∗ rand() = {5.5221, 4.0109, 1.7333, 0.5074}

When n = 2: ck2 = min(51 : 100) + (max(51 : 100) − min(51 : 100)) ∗ rand() = {6.8022, 3.2681, 4.9022, 1.7246}

When n = 3: ck3 = min(101 : 150) + (max(101 : 150) − min(101 : 150)) ∗ rand() = {5.2810, 2.4032, 4.8048, 1.5397}
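The three steps above can be sketched as follows (a minimal sketch of Eqs. (7)–(10); random data stands in for the 150 × 4 iris matrix, so the drawn centres are illustrative rather than the values listed above):

```python
import numpy as np

def init_centers(X, K, rng):
    # Eqs. (7)-(8): K partitions, each of size T / K.
    ps = len(X) // K
    centers = []
    for n in range(K):
        part = X[n * ps:(n + 1) * ps]        # Eq. (9): nth subpopulation
        lo, hi = part.min(axis=0), part.max(axis=0)
        # Eq. (10): centre drawn between the partition's min and max.
        centers.append(lo + (hi - lo) * rng.random(X.shape[1]))
    return np.array(centers)

rng = np.random.default_rng(1)
X = rng.random((150, 4))   # stand-in for the iris data matrix
C = init_centers(X, 3, rng)
```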
4.2 Elitist strategy
This subsection discusses another important aspect of clustering: convergence speed. The convergence rate depends on the searching pattern of the algorithm. To improve convergence speed, an improved elitist strategy is incorporated in the bat algorithm. According to the elitist strategy, the best positions are carried over from the previous iteration to the next. In this work, the elitist strategy is implemented in two phases: an evaluation phase and an updating phase.
4.2.1 Evaluation phase
In this phase, the personal best and global best positions are computed using Eqs. (11–12). A comparison operator is used to calculate the global best position XGbest and the personal best position XPbest.
The personal best (XPbest) is obtained using the fitness function described in Eq. 13, where SSE denotes the sum of squared errors and CK represents the Kth centroid object. After evaluation of the fitness function, the minimum value is selected as the personal best. In the next step, the global best (XGbest) is evaluated using Eq. 13 as the minimum value of the distance (objective) function.
4.2.2 Updating phase
In this phase, the personal best and global best positions are compared with the values from the previous iteration. If the current values are better than the previous ones, the positions are updated using Eqs. (14–15); otherwise, the previous values are retained.
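The two phases can be sketched as follows, assuming a population stored as a NumPy array and a minimization fitness; the helper name `elitist_update` and the toy fitness values are hypothetical.

```python
import numpy as np

def elitist_update(positions, fitness, pbest, pbest_fit):
    """Elitist strategy sketch (Eqs. 11-15): each personal best is
    replaced only when the current fitness improves on it; the global
    best is the minimum-fitness personal best."""
    improved = fitness < pbest_fit
    pbest = np.where(improved[:, None], positions, pbest)   # updating phase
    pbest_fit = np.where(improved, fitness, pbest_fit)
    g = int(np.argmin(pbest_fit))                           # evaluation phase
    return pbest, pbest_fit, pbest[g], pbest_fit[g]

# Toy run: 4 candidate solutions in 2-D with previous bests at 2.5
rng = np.random.default_rng(0)
pos = rng.random((4, 2))
fit = np.array([3.0, 1.0, 4.0, 2.0])
pb, pbf, gb, gbf = elitist_update(pos, fit, pos.copy(), np.full(4, 2.5))
print(gbf)  # 1.0
```

Positions whose fitness worsened keep their previous-iteration values, which is the elitist retention the text describes.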
To achieve an optimum trade-off among search mechanisms, the basic frequency, velocity, and search equations of the bat algorithm are modified using Eqs. (16–19).
where \(f_i^t\) represents the frequency of the ith bat, \(v_i^t\) denotes the velocity of the ith bat, and \(X_i^t\) represents the position of the ith bat; \(\min(X_{Gbest}^t)\) and \(\max(X_{Gbest}^t)\) denote the minimum and maximum values of the sum function associated with the \(X_{Gbest}^t\) position; \(\max(X_{Pbest}^t)\) represents the maximum value of the fitness function associated with the personal best position; and β denotes a random value in [0, 1].
4.3 Q-learning based neighborhood search mechanism
This subsection describes the Q-learning based neighborhood search for handling the local optima problem of the bat algorithm. The performance of clustering algorithms can degrade due to the local optima issue [78], and various strategies have been developed in the literature for avoiding it [79, 80]. This work presents a Q-learning based neighborhood search mechanism to effectively handle the local optima issue of clustering algorithms. The proposed Q-learning based concept works in two steps: an identification step and an evaluation step. The first step determines the neighborhood boundary and the neighboring data objects, whereas the second step evaluates the updated positions of the initial cluster points through the Q-learning concept. Fig. 1(a-c) illustrates the Q-learning based neighborhood search mechanism.
4.3.1 Identification step
This step determines the neighboring data points of the initial cluster centers, as shown in Fig. 1(a). The Euclidean distance measure is used for evaluating the neighboring data points. In this work, the number of neighboring data objects is set to 5; hence, the five data objects with the minimum Euclidean distance are selected as the neighboring data points of a given cluster center, as shown in Fig. 1(b). Let Xi represent the ith cluster center and Xi,neigh the set of neighboring data points of the ith cluster center, described as Xi,neigh = {Xi,1, Xi,2, …, Xi,5}, where neigh = 1 to 5.
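The identification step amounts to a 5-nearest-neighbour query per cluster center; a minimal sketch (the function name `neighbors` and the toy data are assumptions, and the center value is chosen to avoid distance ties):

```python
import numpy as np

def neighbors(center, data, n_neigh=5):
    """Identification step sketch: the n_neigh data objects with the
    smallest Euclidean distance to a cluster center form its
    neighborhood X_{i,neigh}."""
    d = np.linalg.norm(data - center, axis=1)  # Euclidean distances
    idx = np.argsort(d)[:n_neigh]              # n_neigh closest objects
    return data[idx], idx

data = np.arange(20, dtype=float).reshape(10, 2)  # 10 points in 2-D
neigh, idx = neighbors(np.array([7.0, 7.9]), data)
print(sorted(idx.tolist()))  # [1, 2, 3, 4, 5]
```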
4.3.2 Evaluation step
This step determines the updated positions of the initial cluster points, as shown in Fig. 1(c). Here, Q-learning [81] is used instead of the arithmetic mean to compute the position of the neighborhood data items. The Q-learning algorithm follows a simple procedure until a good-quality solution is obtained: initializing the Q-table, choosing an action, performing the action, measuring the reward, and updating the Q-table. The Q[s, a] of the neighbouring data points is updated using Eq. 20:
where Q " (s, a) represents the new Q-value for state (s) and action (a), Q(s, a) gives the current value, α is the learning rate, R(s, a) represents the reward for taking action for a state, γ is the discount rate, and maxQ′(s′ + a′) represents the maximum expected future reward.
The algorithmic steps of the improved bat algorithm for cluster analysis are given in Algorithm 1. The working of the proposed approach is divided into three broad steps: (i) initialization, (ii) evaluation and assignment, and (iii) updation. Fig. 2 illustrates the flow chart of the aforementioned approach.
Initialization Step: This step initializes the algorithmic parameters of the improved bat algorithm. Further, details of the datasets, such as dimension and number of objects, are also described. The co-operative co-evolution method (described in subsection 4.1) is used to select the initial centroids. The other parameters, such as loudness, emission rate, and the random variable, are specified.
Evaluation and Assignment Step: This step evaluates the objective function and allocates objects to the nearest clusters. The Euclidean distance acts as the objective function, and objects are allocated to the nearest clusters by minimizing the Euclidean distance. Moreover, the Q-learning based neighborhood search mechanism is incorporated to overcome the local optima situation. A limit operator is applied to detect local optima: if the candidate solution does not improve within the predefined limit range, the algorithm is assumed to be trapped in a local optimum and the neighborhood mechanism is invoked.
Updation Step: This step updates the positions of the bats through the search mechanism. The emission rate of the bat algorithm is compared with a random function. If the random function yields a value less than the emission rate, the neighborhood value is accepted; otherwise, a new value is calculated by varying the loudness and emission rate parameters. If the termination criterion is met, the algorithm stops and the final solution is obtained; otherwise, steps 2–3 are repeated.
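The three steps above can be sketched as a compact loop. This follows the standard bat-algorithm updates only; the paper's modified Eqs. (16–19), the cooperative initialization, and the Q-learning neighborhood search are omitted for brevity, and all names and constants are illustrative assumptions.

```python
import numpy as np

def ibat_cluster(data, k, n_bats=10, max_iter=50, alpha=0.5, gamma=0.9, seed=0):
    """High-level sketch of initialization, evaluation/assignment, and
    updation; fitness is the total intra-cluster distance."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    # 1. Initialization: each candidate solution is a flattened k x d center set
    x = rng.uniform(data.min(), data.max(), size=(n_bats, k * d))
    v = np.full_like(x, 0.1)        # initial velocity = 0.1
    loud = np.full(n_bats, 0.9)     # loudness A_i
    rate = np.full(n_bats, 0.1)     # pulse emission rate r_i

    def fitness(sol):
        c = sol.reshape(k, d)
        dist = np.linalg.norm(data[:, None, :] - c[None, :, :], axis=2)
        return dist.min(axis=1).sum()   # 2. assign each object to nearest center

    fit = np.array([fitness(s) for s in x])
    best, best_fit = x[np.argmin(fit)].copy(), fit.min()
    for t in range(max_iter):
        for i in range(n_bats):
            f = rng.random()                       # frequency draw
            v[i] += (x[i] - best) * f
            cand = x[i] + v[i]
            if rng.random() > rate[i]:             # local walk near the best
                cand = best + 0.01 * rng.standard_normal(k * d)
            fc = fitness(cand)
            # 3. Updation: accept via loudness, then adapt A_i and r_i
            if fc < fit[i] and rng.random() < loud[i]:
                x[i], fit[i] = cand, fc
                loud[i] *= alpha
                rate[i] = 0.1 * (1 - np.exp(-gamma * t))
            if fit[i] < best_fit:
                best, best_fit = x[i].copy(), fit[i]
    return best.reshape(k, d), best_fit

centers, fbest = ibat_cluster(np.random.default_rng(1).random((60, 2)),
                              k=3, n_bats=6, max_iter=20)
print(centers.shape)  # (3, 2)
```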
4.4 Time complexity
The working of the proposed algorithm is summarized in three steps: initialization, evaluation and assignment, and updation. The complexity of each step is computed and combined to obtain the final complexity of the algorithm.
Step 1: The initialization step starts with the selection of initial centroid positions, which are evaluated using Eqs. (7–10). In turn, the dataset is divided into M distinct parts, where M can be interpreted as the number of clusters (K), i.e., M = K. The population of the bat algorithm is described in terms of the number of bats. Hence, the time required by the initialization phase of the improved bat algorithm is O(number of bats × number of attributes), where the number of bats corresponds to the number of clusters (K) and the number of attributes to the dimension (d), i.e., the length of a cluster centre: O(K × d).
Step 2: The evaluation and assignment phase comprises the objective function, the allocation of data objects to their respective clusters, and the fitness function.

a. Computing the objective function and allocating data to the respective clusters requires O(n² × K × d) time.

b. In the worst-case scenario, the neighbourhood strategy requires O(n × K × d) time.

c. The fitness function evaluation requires O(n × K × d) time.

Hence, the total complexity of Step 2 is O((n² × K × d) + (n × K × d) + (n × K × d)) = O(n² × K × d).
Step 3: In the update phase, the position of each individual bat, i.e., a cluster centre, is updated, which requires O(K × d) time. Further, the above process is repeated until the termination condition, i.e., max_iter. Hence, the total time required in this phase is O(K × d × max_iter).
The overall complexity of the improved bat algorithm is the combination of the above steps and is given as O(n² × K × d × max_iter), whereas the space complexity of the proposed algorithm is O(number of attributes × number of bats), i.e., O(d × K).
5 Experimental setup and results
This section presents the simulation results of the proposed BAT algorithm for solving clustering problems. Several non-healthcare and healthcare datasets are chosen for evaluating the performance of the proposed algorithm; Table 2 summarizes their description. These datasets are (i) Iris, (ii) Glass, (iii) Wine, (iv) Ionosphere, (v) Control, (vi) Vowel, (vii) Balance, (viii) Crude Oil, (ix) CMC, (x) Liver Disorder (LD), (xi) Wisconsin Breast Cancer (WBC), and (xii) Thyroid. They are downloaded from the UCI repository and are freely available for assessing the efficiency of newly proposed clustering algorithms. The user-defined parameters of the proposed bat algorithm are population = K × d, α = 0.5, loudness (Ai) ∈ (0.1, 0.9), initial velocity = 0.1, number of partitions = K, and max_iteration = 200. The average intra-cluster distance (IntraCD), standard deviation (SD), accuracy, and rand index parameters are used for evaluating the simulation results, which are compared with existing state-of-the-art meta-heuristic algorithms. In this work, a total of nineteen (19) meta-heuristic algorithms are chosen for comparison, divided into three groups: (1) standard and well-known clustering algorithms, (2) hybrid meta-heuristic clustering algorithms, and (3) recently developed clustering algorithms. In the first group, eight standard meta-heuristic algorithms that are highly cited in the clustering literature are selected for comparing the clustering results of the proposed BAT algorithm.
These algorithms are (i) particle swarm optimization (PSO) [11], (ii) ant colony optimization (ACO) [70], (iii) artificial bee colony (ABC) [49], (iv) differential evolution (DE) [68], (v) genetic algorithm (GA) [78], (vi) big bang–big crunch (BB-BC) [61], (vii) Bat [42], and (viii) K-means [33]. The second group consists of six popular hybrid meta-heuristic algorithms that provide state-of-the-art clustering results in the corresponding literature and specifically address the local optima and convergence issues of clustering problems: (i) MEBBC [61], (ii) H-KHA [62], (iii) IKH [80], (iv) ICMPKHM [82], (v) PSO-BB-BC [83], and (vi) CBPSO [73]. The third group consists of recently reported meta-heuristic clustering algorithms, chosen to compare the performance of the proposed IBAT algorithm with recent work: (i) VS [84], (ii) MBOA [85], (iii) WOA [86], (iv) ICSO [74], and (v) Chaotic TLBO [75]. The selection of these hybrid and recent clustering algorithms allows a fair comparison of the performance of IBAT. The parameter settings of the aforementioned algorithms are taken as reported in the corresponding literature. The population size of all algorithms is K × d, and the maximum iteration count is set to 200. The results are evaluated using intra-cluster distance, accuracy, rand index, and rank measures.
5.1 Experiment 1: benchmark clustering datasets (non-healthcare datasets)
This subsection presents the simulation results of the proposed BAT algorithm for non-healthcare datasets. The results are compared with three categories of clustering algorithms: standard/well-known clustering algorithms, hybrid clustering algorithms, and recently reported clustering algorithms.
5.1.1 Comparison of simulation results of proposed BAT and standard/well-known clustering algorithms
This subsection discusses the simulation results of the proposed BAT and standard/well-known clustering algorithms on non-healthcare datasets. The results are evaluated using intra-cluster distance (intra), standard deviation (SD), accuracy, rand index, and rank measures. Table 3 presents the performance comparison of the proposed BAT algorithm with well-known clustering algorithms such as K-means, PSO, ACO, ABC, DE, GA, BB-BC, and BAT in terms of average intra-cluster distance (intra), standard deviation (SD), and rank. The results show that the proposed BAT algorithm gives the minimum intra-cluster distance values for the Iris (9.16E+01), Glass (1.96E+02), Ionosphere (8.01E+02), Vowel (1.49E+05), and Balance (5.01E+04) datasets, while DE achieves the minimum intra value (1.58E+04) for the wine dataset, BB-BC outperforms the others for the control dataset with an intra value of 2.38E+04, and ACO gives the minimum intra value (2.47E+02) for the crude oil dataset. The standard deviation, used to assess the efficiency of the algorithms, represents the dispersion of data objects within a cluster; a minimum value indicates that the data objects are tightly bound to their clusters. From the simulation results, it is also noted that in most cases the standard deviation of the proposed BAT algorithm is lower than that of the other algorithms. The accuracy results of the proposed BAT algorithm and the other clustering algorithms, measured as average-case values, are illustrated in Table 4. The proposed BAT algorithm gives higher accuracy values for iris (93.00), glass (69.17), wine (76.01), ionosphere (71.94), control (75.30), and crude oil (76.64), while GA has a higher accuracy value (84.7) for the vowel dataset and PSO a higher accuracy value (89.76) for the balance dataset. Hence, it is concluded that the proposed BAT algorithm provides more accurate results in the clustering field.
The simulation results of the proposed BAT algorithm and the other well-known clustering algorithms for non-healthcare datasets using the rand index measure are given in Table 5. From the results, it is observed that the proposed BAT gives better results for the iris (0.72), glass (0.427), control (0.799), and crude oil (0.074) datasets, and promising results for the other datasets when compared to the well-known clustering algorithms. Hence, the proposed BAT algorithm is a competent algorithm for cluster analysis.
The convergence behaviour of the proposed BAT, BAT, BB-BC, GA, DE, ABC, ACO, PSO, and K-means clustering algorithms is shown in Fig. 3(a-h). In this graphical illustration, the X-axis shows the number of iterations and the Y-axis the intra-cluster distance. It is observed that the proposed BAT algorithm converges to minimum values except for the balance and control datasets. In most cases, the proposed algorithm provides a better convergence rate. Hence, it is stated that the proposed BAT outperforms the other well-known clustering algorithms.
5.1.2 Comparison of simulation results of proposed BAT and existing hybrid clustering algorithms
This subsection discusses the simulation results of the proposed BAT for benchmark clustering datasets compared with existing hybridized clustering algorithms. The performance of the proposed algorithm is compared with six hybridized clustering algorithms. Table 6 demonstrates the simulation results of H-KHA, MEBBC, IKH, ICMPKHM, PSO-BB-BC, CBPSO, and the proposed BAT algorithm using average intra-cluster distance (intra) and standard deviation (SD). It is observed that the proposed BAT algorithm obtains the minimum intra-cluster distance for iris (9.16E+01), glass (1.96E+02), wine (1.61E+04), ionosphere (8.01E+02), control (2.39E+04), balance (5.01E+04), and crude oil (2.51E+02), while for the vowel dataset ICMPKHM has a lower intra-cluster value (1.47E+05) than the proposed algorithm. The standard deviation values of the proposed BAT algorithm are the minimum for the iris (2.12E+01) and wine (3.54E+01) datasets, and it gives favourable results in most other cases compared to the other hybridized clustering algorithms. The accuracy results of the proposed BAT algorithm and the other hybridized clustering algorithms are demonstrated in Table 7. It is noticed that the proposed BAT algorithm provides more accurate results for iris (93.00), wine (76.01), ionosphere (71.94), control (75.30), vowel (67.11), and crude oil (76.64). For the glass and balance datasets, PSO-BB-BC performs better with accuracy values of 69.52 and 89.21, followed by the proposed BAT with 69.17 and 88.92 respectively. Moreover, the rand index is also computed to prove its effectiveness in the clustering field. Table 8 demonstrates the simulation results of the proposed BAT algorithm and the other hybridized clustering algorithms using the rand index parameter for benchmark clustering datasets.
The proposed BAT algorithm obtains better rand index results for the wine (0.374), ionosphere (0.319), glass (0.427), control (0.799), balance (0.574), and crude oil (0.074) datasets as compared to the hybridized variants of clustering algorithms, while H-KHA achieves a better rand index (0.734) for iris and PSO-BB-BC (0.852) for the vowel dataset. From the results, it can be stated that the proposed BAT algorithm is competitive with other hybridized variants of clustering techniques over benchmark clustering datasets.
5.1.3 Comparison of simulation results of proposed BAT and recently reported clustering algorithms
The performance of the proposed BAT algorithm is also compared with recent clustering algorithms. Table 9 demonstrates the simulation results of VS, MBOA, WOA, ICSO, Chaotic TLBO, and the proposed BAT algorithm using the average intra-cluster distance parameter and standard deviation. It is observed that the proposed BAT algorithm obtains the minimum intra-cluster distance for iris (9.16E+01), glass (1.96E+02), wine (1.61E+04), ionosphere (8.01E+02), control (2.39E+04), balance (5.01E+04), and crude oil (2.51E+02). The standard deviation values of the proposed BAT algorithm are the minimum for the iris (2.12E+01), glass (1.98E+00), ionosphere (1.53E+01), control (3.62E+01), vowel (1.15E+02), and crude oil (1.06E+02) datasets, whereas for the wine (3.54E+01) and balance (3.59E+02) datasets the proposed algorithm is second after Chaotic TLBO. From the results, it is concluded that the proposed BAT is competent and outperforms the other algorithms in most cases. Table 10 demonstrates the accuracy results of the proposed BAT algorithm and the recent clustering algorithms for benchmark clustering datasets. It is noticed that the proposed BAT algorithm provides more accurate results for wine (76.01), ionosphere (71.94), control (75.30), vowel (67.11), balance (88.92), and crude oil (76.64); for the iris and glass datasets, Chaotic TLBO attains accuracy values of 91.19 and 69.52, against 93.00 and 69.17 for the proposed BAT. Additionally, the rand index is calculated to verify its efficacy in the clustering field. Table 11 illustrates the simulation results of the proposed BAT algorithm and the recent clustering algorithms using the rand index parameter for benchmark clustering datasets. The proposed BAT algorithm obtains better rand index results for most datasets, i.e., iris (0.72), glass (0.427), wine (0.374), control (0.799), vowel (0.846), and balance (0.534), as compared to the recent clustering algorithms.
For ionosphere, ICSO achieves a better rand index (0.319), and MBOA achieves a higher rand index (0.078) for the crude oil dataset. The results show that the proposed BAT is proficient compared to the recent clustering algorithms.
5.2 Experiment 2: healthcare datasets
This subsection presents the simulation results of proposed BAT clustering algorithm for healthcare datasets.
5.2.1 Comparison of simulation results of proposed BAT and standard/well-known clustering algorithms
The performance comparison of the proposed BAT algorithm with the K-means, PSO, ACO, ABC, DE, GA, BB-BC, and BAT algorithms is presented in Table 12. The results are evaluated in terms of average intra-cluster distance (intra), standard deviation (SD), and rank. Four healthcare datasets are considered to test and compare the performance of the proposed BAT with well-known clustering algorithms. The simulation results show that the proposed BAT algorithm obtains the minimum intra-cluster distance values for all the considered healthcare datasets, i.e., CMC (5.52E+03), LD (2.31E+02), WBC (2.89E+03), and thyroid (2.51E+02), as compared to the well-known clustering algorithms. The standard deviation, which represents the dispersion of data objects within a cluster, is computed to assess the efficiency of the algorithms; in most cases it is lower for the proposed BAT algorithm than for the rest. The accuracy results of the proposed BAT algorithm and the other well-known clustering algorithms are illustrated in Table 13. The results show that the proposed BAT algorithm gives higher accuracy for CMC (48.21), WBC (96.61), and thyroid (71.98). In the case of the LD dataset, PSO gives better results with an accuracy of 54.05; even then, the proposed algorithm, with an accuracy of 54.02, outperforms the rest of the well-known clustering algorithms on this dataset. So, it is concluded that the proposed BAT algorithm gives more accurate results for the considered healthcare datasets. Table 14 presents the simulation results of the proposed BAT algorithm and the other well-known clustering algorithms using the rand index measure for healthcare datasets; the proposed BAT algorithm obtains better results than the other clustering algorithms for the CMC, LD, WBC, and thyroid datasets, with values of 0.28, 0.492, 0.276, and 0.383 respectively.
Hence, the proposed BAT algorithm is considered one of the proficient algorithms for cluster analysis.
Figure 4(a-d) shows the convergence behavior of IBAT and the well-known clustering algorithms (BAT, BB-BC, GA, DE, ABC, ACO, PSO, and K-means). In the graphical illustration, the X-axis shows the number of iterations and the Y-axis the intra-cluster distance. It is observed that the IBAT algorithm converges to minimum values for all the considered healthcare datasets and provides a better convergence rate in most cases. It is stated that IBAT outperforms the other clustering algorithms for the considered healthcare datasets.
5.2.2 Comparison of simulation results of proposed BAT and existing hybrid clustering algorithms
This subsection compares the performance of the proposed algorithm with six hybridized clustering algorithms. Table 15 demonstrates the simulation results of H-KHA, MEBBC, IKH, ICMPKHM, PSO-BB-BC, CBPSO, and the proposed BAT algorithm using average intra-cluster distance (intra), standard deviation (SD), and rank measures. From the results, it is observed that the proposed BAT algorithm attains the minimum intra-cluster distance for all the considered healthcare datasets, with values of 5.52E+03 for CMC, 2.31E+02 for LD, 2.89E+03 for WBC, and 2.51E+02 for thyroid, as compared to the other hybridized clustering algorithms. The proposed BAT algorithm also gives the minimum standard deviation in most cases on the considered healthcare datasets. Table 16 demonstrates the average-case accuracy of the proposed BAT algorithm and the other hybridized clustering algorithms for the healthcare datasets. The proposed BAT gives higher accuracy values for CMC (48.21) and LD (54.02), while for WBC (IBAT = 96.61, CBPSO = 96.89) and thyroid (IBAT = 71.98, CBPSO = 72.21) there is only a marginal difference between the proposed BAT and CBPSO. The rand index is also computed for the healthcare datasets to prove its effectiveness in clustering; Table 17 shows the simulation results of the proposed BAT algorithm and the other hybridized clustering algorithms using this measure.
It is noticed that the proposed BAT algorithm gives better rand index results on CMC (0.280), WBC (0.257), and thyroid (0.383), while it gives a rand index value for LD (0.496) identical to MEBBC. Thus, the proposed BAT obtains better results than the other hybridized variants of clustering algorithms for the considered healthcare datasets.
5.2.3 Comparison of simulation results of proposed BAT and recently reported clustering algorithms
This subsection presents the performance comparison of the proposed algorithm with recent clustering algorithms. Table 18 demonstrates the simulation results of VS, MBOA, WOA, ICSO, Chaotic TLBO, and the proposed BAT algorithm using average intra-cluster distance (intra), standard deviation (SD), and rank measures for healthcare datasets. From the results, it is seen that the proposed BAT algorithm attains the minimum intra-cluster distance for LD (2.31E+02), WBC (2.89E+03), and thyroid (2.51E+02) compared to the recent clustering algorithms, while for the CMC dataset MBOA gives a lower intra-cluster distance value (5.21E+03) than the proposed algorithm. The proposed BAT algorithm also gives the minimum standard deviation for almost all the considered healthcare datasets, except thyroid, for which ICSO gives the minimum standard deviation (1.16E+01), followed by the proposed BAT (1.32E+01). The average-case accuracy results of the proposed BAT algorithm and the recent clustering algorithms for healthcare datasets are illustrated in Table 19. The proposed BAT gives higher accuracy values for CMC (48.21), LD (54.02), WBC (96.61), and thyroid (71.98), indicating that IBAT provides more accurate results than the recent clustering algorithms. To prove the effectiveness of the proposed algorithm for healthcare datasets, the rand index is also computed; the simulation results are shown in Table 20. The proposed BAT algorithm gives better rand index results on WBC (0.276) and thyroid (0.383) as compared to the recent clustering algorithms.
For the CMC dataset, ICSO gives a better rand index value (0.283), followed by the proposed BAT (0.280), and Chaotic TLBO obtains a better rand index value of 0.498. Thus, the proposed BAT is competent and obtains better results for most of the considered healthcare datasets.
5.3 Statistical test
This subsection describes the statistical test conducted to determine the best performing algorithm among the proposed IBAT and the existing clustering algorithms. Statistical tests are used to establish a new algorithm and to determine whether a significant difference occurs between the performance of the new algorithm and that of existing algorithms [87,88,89]. In this work, the Friedman statistical test is used to identify the best performing algorithm. Two hypotheses (H0 and H1) are designed at the 0.05 significance level: H0 corresponds to no significant difference between the performance of the new algorithm and the rest of the algorithms, while H1 corresponds to a significant difference. If no significant difference occurs, H0 is not rejected and the proposed algorithm (IBAT) performs similarly to the other algorithms; otherwise, H0 is rejected, H1 holds, and a significant difference exists between the performance of the newly proposed algorithm and the rest. In the first step of the Friedman test, a rank is assigned to each algorithm on each dataset, and then the average ranking is computed for all algorithms over all datasets. The ranking of each algorithm on each dataset, computed using the accuracy measure, is reported in Table 21 together with the average ranking of each technique.
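The ranking step of the Friedman test (per-dataset ranks followed by per-algorithm averaging) can be sketched as follows; the accuracy values below are hypothetical, and ties are broken ordinally for brevity rather than by average ranks.

```python
import numpy as np

def friedman_ranks(scores):
    """First step of the Friedman test: rank the algorithms on each
    dataset (rank 1 = highest accuracy), then average the ranks per
    algorithm across datasets."""
    scores = np.asarray(scores, dtype=float)  # rows: datasets, cols: algorithms
    order = np.argsort(-scores, axis=1)       # higher accuracy ranks first
    ranks = np.empty_like(scores)
    for r, idx in enumerate(order.T, start=1):
        ranks[np.arange(scores.shape[0]), idx] = r
    return ranks.mean(axis=0)                 # average rank per algorithm

# Hypothetical accuracies: 3 datasets x 3 algorithms
acc = [[93.0, 91.2, 88.0],
       [69.2, 69.5, 60.0],
       [76.0, 74.0, 70.0]]
avg = friedman_ranks(acc)
print(avg.tolist())  # approx. [1.33, 1.67, 3.0]
```

A lower average rank indicates a better algorithm, which is why the 2.1 average rank reported for IBAT below marks it as the best performer.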
It is seen that the proposed BAT (IBAT) algorithm obtains an average rank of 2.1, the best among all algorithms, while the ACO algorithm obtains the lowest rank (18) and Chaotic TLBO achieves the second best rank (5.1). The statistical results of the Friedman test are shown in Table 22. The Friedman test statistic is 63.6531, the degrees of freedom are 19, and the critical value is 30.1435 at the 0.05 significance level. The p value computed for the Friedman test is 1.01E-06. Since the test statistic exceeds the critical value and the p value is considerably less than the significance level, hypothesis H0 is rejected, and a significant difference exists between the performance of the proposed BAT (IBAT) and the other existing algorithms. These results certify that the proposed algorithm (IBAT) performs better than the other existing algorithms and validate its performance against the existing clustering algorithms. Moreover, a posthoc test is conducted to determine possible groupings of similar algorithms; its results are presented in Table 23. The symbol "+" indicates a significant difference between the performances of algorithms, while "-" indicates no significant difference. On the analysis of the posthoc test, several algorithms exhibit similar performance and can be clubbed into a single group; in turn, twelve groups with similar performance are determined. These groups are listed as follows. Group 1 consists of the K-means, PSO, ABC, DE, GA, BB-BC, BAT, MEBBC, IKH, ICMPKHM, CBPSO, VS, MBOA, and WOA algorithms. Group 2 consists of the K-means, PSO, GA, BB-BC, BAT, H-KHA, MEBBC, IKH, ICMPKHM, PSO-BB-BC, CBPSO, VS, MBOA, WOA, ICSO, and Chaotic TLBO algorithms. The algorithms in Group 3 are ACO, ABC, DE, GA, and BAT.
Group 4 contains the K-means, ACO, ABC, DE, GA, BB-BC, BAT, MEBBC, IKH, VS, MBOA, and WOA algorithms. Group 5 consists of the K-means, ACO, ABC, DE, GA, BB-BC, BAT, MEBBC, IKH, VS, MBOA, CBPSO, and WOA algorithms. Group 6 consists of the K-means, PSO, ACO, ABC, DE, GA, BB-BC, BAT, MEBBC, IKH, VS, MBOA, CBPSO, and WOA algorithms. Group 7 consists of the PSO, H-KHA, MEBBC, IKH, ICMPKHM, PSO-BB-BC, CBPSO, VS, MBOA, WOA, ICSO, and Chaotic TLBO algorithms. Group 8 consists of the K-means, PSO, ABC, DE, GA, BB-BC, BAT, HKA, MEBBC, IKH, VS, MBOA, CBPSO, and WOA algorithms. Group 9 contains the K-means, PSO, BB-BC, H-KHA, MEBBC, IKH, ICMPKHM, PSO-BB-BC, CBPSO, VS, MBOA, WOA, ICSO, and Chaotic TLBO algorithms. Group 10 consists of the K-means, PSO, DE, GA, BB-BC, H-KHA, MEBBC, IKH, ICMPKHM, PSO-BB-BC, CBPSO, VS, MBOA, WOA, ICSO, and Chaotic TLBO algorithms. Group 11 consists of the PSO, H-KHA, ICMPKHM, PSO-BB-BC, CBPSO, ICSO, and Chaotic TLBO algorithms. Group 12 contains only a single algorithm, i.e., IBAT. The posthoc test reveals that algorithms within a group exhibit similar performance. It is also seen that the proposed BAT (IBAT) algorithm is not present in any group except Group 12, of which it is the sole member. Hence, it is concluded that the proposed BAT algorithm performs significantly differently from the other existing algorithms and is the best performing algorithm among them.
6 Conclusion
An improved variant of the bat algorithm is proposed for the data clustering task. The proposed algorithm efficiently deals with the initial population selection, convergence rate, and local optima issues of the bat algorithm. To resolve the initial population selection problem, an enhanced cooperative co-evolution method is developed and integrated into the bat algorithm. A neighborhood search-based mechanism is designed to handle the local optima condition. Furthermore, an elitist strategy is incorporated into the bat algorithm for achieving an optimal trade-off among search mechanisms. The solution search equations of the bat algorithm are also improved for determining the optimum solution in the later iterations. The performance of the proposed BAT algorithm is tested using eight benchmark (non-healthcare) and four healthcare clustering datasets. The simulation results of the IBAT algorithm are compared with nineteen meta-heuristic algorithms: K-means, PSO, ACO, ABC, DE, GA, BB-BC, and BAT; the hybridized H-KHA, MEBBC, IKH, ICMPKHM, PSO-BB-BC, and CBPSO clustering algorithms; and recent clustering algorithms such as VS, MBOA, WOA, ICSO, and Chaotic TLBO. The proposed algorithm achieves better quality clustering results in terms of intra-cluster distance, accuracy, and rand index on most of the datasets. Furthermore, the Friedman statistical test is applied to determine the best performing algorithm. The statistical results show that the hypothesis (H0) is rejected at a confidence level of 0.05; in turn, a significant difference exists between the performance of the proposed IBAT and the other clustering algorithms. Hence, it is stated that the proposed IBAT algorithm is the best performing among the clustering algorithms compared. Finally, it is concluded that the proposed IBAT algorithm is a robust and effective algorithm for handling the data clustering task.
Abbreviations
- ABC: Artificial Bee Colony
- ACA: Ant Clustering Algorithm
- ACDE: Automatic Clustering Differential Evolution
- ACO: Ant Colony Optimization
- BATC: Bat Algorithm based Clustering
- BB-BC: Big Bang–Big Crunch
- CABC: Cooperative Artificial Bee Colony
- CCSSA: Chaotic Charge System Search Algorithm
- CPSO: Cooperative Particle Swarm Optimization
- CS: Cuckoo Search
- CSO: Cat Swarm Optimization
- CSS: Charge System Search
- DCPSO: Dynamic Clustering Particle Swarm Optimization
- DE: Differential Evolution
- FA: Firefly Algorithm
- FPAC: Flower Pollination Algorithm based Clustering
- GA: Genetic Algorithm
- GAMS: Genetic Algorithm with Message-based Similarity
- GTCSA: Gene Transposon based Clone Selection Algorithm
- GWA: Grey Wolf Algorithm
- GWO: Grey Wolf Optimizer
- HABC: Hybrid Artificial Bee Colony
- HBMO: Honey Bee Mating Optimization
- KH: Krill Herd
- KHM: K-harmonic Means
- K-MWO: K-means and Mussels Wandering Optimization
- HS: Harmony Search
- IBAT: Improved Bat
- ICSO: Improved Cat Swarm Optimization
- ILS: Iterated Local Search
- MCSS: Magnetic Charge System Search
- MO: Magnetic Optimization
- PSO: Particle Swarm Optimization
- SA: Simulated Annealing
- TLBO: Teaching Learning Based Optimization
- TS: Tabu Search
- VGA: Variable-string-length Genetic Algorithm
- MBOA: Modified Butterfly Optimization Algorithm
- WOA: Whale Optimization Algorithm
- Chaotic TLBO: Chaotic Teaching Learning Based Optimization
- VS: Vortex Search
References
Kushwaha N, Pant M, Kant S, Jain VK (2018) Magnetic optimization algorithm for data clustering. Pattern Recogn Lett 115:59–65
Kant S, Ansari IA (2016) An improved K means clustering with Atkinson index to classify liver patient dataset. International Journal of System Assurance Engineering and Management 7(1):222–228
Aggarwal CC, Reddy CK (2014) Data clustering. Algorithms and applications. Chapman & Hall/CRC Data mining and Knowledge Discovery series, Londra
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Chang D-X, Zhang X-D, Zheng C-W (2009) A genetic algorithm with gene rearrangement for K-means clustering. Pattern Recogn 42(7):1210–1222
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd (Vol. 96, No. 34, pp. 226-231)
Scheunders P (1997) A genetic c-means clustering algorithm applied to color image quantization. Pattern Recogn 30(6):859–866
Gomez-Muñoz VM, Porta-Gándara MA (2002) Local wind patterns for modeling renewable energy systems by means of cluster analysis techniques. Renew Energy 2:171–182
Mitra S, Banka H (2006) Multi-objective evolutionary bi clustering of gene expression data. Pattern Recogn 39:2464–2477
Nanda SJ, Panda G (2014) A survey on nature inspired metaheuristic algorithms for partitional clustering. Swarm and Evolutionary computation 16:1–18
Cura T (2012) A particle swarm optimization approach to clustering. Expert Syst Appl 39(1):1582–1588
Jordehi AR (2015) Enhanced leader PSO (ELPSO): a new PSO variant for solving global optimisation problems. Appl Soft Comput 26:401–417
Karaboga, D. (2005) An idea based on honey bee swarm for numerical optimization, Erciyes University, Kayseri, Turkey, Technical Report-TR06
Karaboga D, Basturk B (2007) A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm. J Glob Optim 39(3):459–471
Karaboga D, Akay B (2009) A comparative study of artificial bee colony algorithm. Appl Math Comput 214(1):108–132
Dorigo M, Birattari M, Stutzle T (2006) Artificial ants as a computational intelligence technique. IEEE Comput Intell Mag 1:28–39
Dorigo M, Maniezzo V, Colorni A (1996) Ant system: optimization by a colony of cooperating agents. IEEE Transactions on Systems, man, and cybernetics, Part B: Cybernetics 26(1):29–41
Kumar Y, Sahoo G (2014) A charged system search approach for data clustering. Progress in Artificial Intelligence 2(2–3):153–166
Hatamlou A (2013) Black hole: A new heuristic optimization approach for data clustering. Inf Sci 222:175–184
Erol OK, Eksin I (2006) A new optimization method: big bang–big crunch. Adv Eng Softw 37(2):106–111
Jordehi AR (2014) A chaotic-based big bang–big crunch algorithm for solving global optimisation problems. Neural Comput & Applic 25(6):1329–1335
Alatas B (2011) ACROA: artificial chemical reaction optimization algorithm for global optimization. Expert Syst Appl 38(10):13170–13180
Maulik U, Bandyopadhyay S (2000) Genetic algorithm-based clustering technique. Pattern Recogn 33(9):1455–1465
Ergezer M, Simon D, Du D (2009) Oppositional biogeography-based optimization. In: 2009 IEEE International Conference on Systems, Man and Cybernetics. IEEE
Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61
Wang GG, Deb S, Coelho LDS (2015) Elephant herding optimization. In 2015 3rd International Symposium on Computational and Business Intelligence (ISCBI) (pp. 1–5). IEEE
Chu SC, Tsai PW, Pan JS (2006) Cat swarm optimization. In: Pacific Rim international conference on artificial intelligence. Springer, Berlin, Heidelberg, pp 854–858
Yazdani M, Jolai F (2016) Lion optimization algorithm (LOA): a nature-inspired metaheuristic algorithm. Journal of computational design and engineering 3(1):24–36
Mirjalili S (2016) SCA: a sine cosine algorithm for solving optimization problems. Knowl-Based Syst 96:120–133
Salimi H (2015) Stochastic fractal search: a powerful metaheuristic algorithm. Knowl-Based Syst 75:1–18
Kaveh A, Dadras A (2017) A novel meta-heuristic optimization algorithm: thermal exchange optimization. Adv Eng Softw 110:69–84
Abraham A, Das S, Roy S (2008) Swarm intelligence algorithms for data clustering. In: Soft computing for knowledge discovery and data mining. Springer, Boston, pp 279–313
Chowdhury K, Chaudhuri D, Pal AK (2021) An entropy-based initialization method of K-means clustering on the optimal number of clusters. Neural Comput & Applic 33(12):6965–6982
Torrente A, Romo J (2021) Initializing k-means clustering by bootstrap and data depth. Journal of Classification 38(2):232–256
Ahmadi R, Ekbatanifard G, Bayat P (2021) A Modified Grey Wolf Optimizer Based Data Clustering Algorithm. Appl Artif Intell 35(1):63–79
Ghany KKA, AbdelAziz AM, Soliman THA, Sewisy AAEM (2020) A hybrid modified step whale optimization algorithm with tabu search for data clustering. Journal of King Saud University-Computer and Information Sciences
Sörensen K (2015) Metaheuristics—the metaphor exposed. Int Trans Oper Res 22(1):3–18
Yang X-S (2010) A new metaheuristic bat-inspired algorithm. In: Nature Inspired Cooperative Strategies for Optimization (NICSO 2010). Springer, Berlin, Heidelberg, pp 65–74
Ashish T, Kapil S, Manju B (2018) Parallel bat algorithm-based clustering using mapreduce. In: Networking Communication and Data Knowledge Engineering. Springer, Singapore, pp 73–82
Fister I Jr, Fister D, Yang XS (2013) A hybrid bat algorithm. Elektrotehniški Vestnik 80(1–2):1–7
Yilmaz S, Kucuksille EU (2013) Improved bat algorithm (IBA) on continuous optimization problems. Lecture Notes on Software Engineering 1(3):279
Senthilnath J, Kulkarni S, Benediktsson JA, Yang XS (2016) A novel approach for multispectral satellite image classification based on the bat algorithm. IEEE Geosci Remote Sens Lett 13(4):599–603
Neelima S, Satyanarayana N, Murthy PK (2018) Minimizing Frequent Itemsets Using Hybrid ABCBAT Algorithm. In: Data Engineering and Intelligent Computing. Springer, Singapore, pp 91–97
Aboubi Y, Drias H, Kamel N (2016) BAT-CLARA: BAT-inspired algorithm for Clustering LARge Applications. IFAC-PapersOnLine. 49(12):243–248
Fister I, Fong S, Brest J (2014) A novel hybrid self-adaptive bat algorithm. Sci World J 2014:70973
Zhao D, He Y (2015) Chaotic binary bat algorithm for analog test point selection. Analog Integr Circ Sig Process 84(2):201–214
Rahman MA, Islam MZ (2014) A hybrid clustering technique combining a novel genetic algorithm with K-Means. Knowl-Based Syst 71:345–365
Liu R et al (2012) Gene transposon based clone selection algorithm for automatic clustering. Inf Sci 204:1–22
Kumar Y, Sahoo G (2017) A two-step artificial bee colony algorithm for clustering. Neural Comput & Applic 28(3):537–551
Cao F, Liang J, Jiang G (2009) An initialization method for the K-Means algorithm using neighborhood model. Computers & Mathematics with Applications 58(3):474–483
Han XH et al (2017) A novel data clustering algorithm based on modified gravitational search algorithm. Eng Appl Artif Intell 61:1–7
Senthilnath J, Omkar SN, Mani V (2011) Clustering using firefly algorithm: performance study. Swarm and Evolutionary Computation 1(3):164–171
Erisoglu M, Calis N, Sakallioglu S (2011) A new algorithm for initial cluster centers in k-means algorithm. Pattern Recogn Lett 32(14):1701–1705
Kumar Y, Sahoo G (2015) Hybridization of magnetic charge system search and particle swarm optimization for efficient data clustering using neighborhood search strategy. Soft Comput 19(12):3621–3645
Zhou Y et al (2017) A simplex method-based social spider optimization algorithm for clustering analysis. Eng Appl Artif Intell 64:67–82
Boushaki SI, Kamel N, Bendjeghaba O (2018) A new quantum chaotic cuckoo search algorithm for data clustering. Expert Syst Appl 96:358–372
Chang D et al (2012) A genetic clustering algorithm using a message-based similarity measure. Expert Syst Appl 39(2):2194–2202
Zhang C, Ouyang D, Ning J (2010) An artificial bee colony approach for clustering. Expert Syst Appl 37(7):4761–4767
Taherdangkoo M et al (2013) A robust clustering method based on blind, naked mole-rats (BNMR) algorithm. Swarm and Evolutionary Computation 10:1–11
Hatamlou A (2012) In search of optimal centroids on data clustering using a binary search algorithm. Pattern Recogn Lett 33(13):1756–1760
Bijari K et al (2018) Memory-enriched big bang–big crunch optimization algorithm for data clustering. Neural Comput & Applic 29(6):111–121
Abualigah LM et al (2017) A novel hybridization strategy for krill herd algorithm applied to clustering techniques. Appl Soft Comput 60:423–435
Pakrashi A, Chaudhuri BB (2016) A Kalman filtering induced heuristic optimization based partitional data clustering. Inf Sci 369:704–717
Kang Q et al (2016) A weight-incorporated similarity-based clustering ensemble method based on swarm intelligence. Knowl-Based Syst 104:156–164
Wang R et al (2016) Flower pollination algorithm with bee pollinator for cluster analysis. Inf Process Lett 116.1:1–14
Hatamlou A, Hatamlou M (2013) PSOHS: an efficient two-stage approach for data clustering. Memetic Computing 5(2):155–161
Yan X et al (2012) A new approach for data clustering using hybrid artificial bee colony algorithm. Neurocomputing 97:241–250
Kwedlo W (2011) A clustering method combining differential evolution with the K-means algorithm. Pattern Recogn Lett 32(12):1613–1621
Yin M et al (2011) A novel hybrid K-harmonic means and gravitational search algorithm approach for clustering. Expert Syst Appl 38(8):9319–9324
Jiang H et al (2010) Ant clustering algorithm with K-harmonic means clustering. Expert Syst Appl 37(12):8679–8684
Xiao J et al (2010) A quantum-inspired genetic algorithm for k-means clustering. Expert Syst Appl 37(7):4966–4973
Žalik KR (2008) An efficient k′-means clustering algorithm. Pattern Recogn Lett 29(9):1385–1391
Jiang B, Wang N (2014) Cooperative bare-bone particle swarm optimization for data clustering. Soft Comput 18(6):1079–1091
Kumar Y, Singh PK (2018) Improved cat swarm optimization algorithm for solving global optimization problems and its application to clustering. Appl Intell 48(9):2681–2697
Kumar Y, Singh PK (2019) A chaotic teaching learning based optimization algorithm for clustering problems. Appl Intell 49(3):1036–1062
Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12):4743–4759
Faris H, Ala’M AZ, Heidari AA, Aljarah I, Mafarja M, Hassonah MA, Fujita H (2019) An intelligent system for spam detection and identification of the most relevant features based on evolutionary random weight networks. Information Fusion 48:67–83
Maulik U, Bandyopadhyay S (2000) Genetic algorithm-based clustering technique. Pattern Recognition 33.9:1455–1465
Kumar Y, Sahoo G (2017) An Improved Cat Swarm Optimization Algorithm Based on Opposition-Based Learning and Cauchy Operator for Clustering. JIPS 13(4):1000–1013
Jensi R, Wiselin Jiji G (2016) An improved krill herd algorithm with global exploration capability for solving numerical function optimization problems and its application to data clustering. Appl Soft Comput 46:230–245
Watkins CJ, Dayan P (1992) Q-learning. Mach Learn 8(3–4):279–292
Bouyer A, Hatamlou A (2018) An efficient hybrid clustering method based on improved cuckoo optimization and modified particle swarm optimization algorithms. Appl Soft Comput 67:172–182
Hatamlou A (2017) A hybrid bio-inspired algorithm and its application. Appl Intell 47(4):1059–1067
Doğan B, Ölmez T (2015) A new metaheuristic for numerical function optimization: vortex search algorithm. Inf Sci 293:125–145
Mirjalili S, Lewis A (2016) The whale optimization algorithm. Adv Eng Softw 95:51–67
Wang G-G, Deb S, Cui Z (2019) Monarch butterfly optimization. Neural Comput & Applic 31(7):1995–2014
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7:1–30
Derrac J, García S, Molina D, Herrera F (2011) A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm and Evolutionary Computation 1(1):3–18
García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf Sci 180(10):2044–2064
Kaur, A., Kumar, Y. Neighborhood search based improved bat algorithm for data clustering. Appl Intell 52, 10541–10575 (2022). https://doi.org/10.1007/s10489-021-02934-x