1 Introduction

Over the past few decades, the amount of data produced daily by electronic devices, such as smartphones, tablets, computers, cars, GPS receivers, smart TVs, Internet of Things applications, and so on, has increased exponentially, so that automatic and scalable computational systems are increasingly required. To extract useful information from the large data sets such systems rely on, it is impossible to depend on human analysis alone, since precise and reliable information must be obtained in a short period of time (Naldi and Campello 2014).

Data clustering is one of the most important and primitive activities in pattern recognition, and an important mechanism for exploratory data analysis. Clustering is an unsupervised attempt to categorize a set of data patterns into clusters, in such a way that observations belonging to the same cluster are more closely related (according to their feature set) than observations from different clusters. In clustering, no prior knowledge about the data set at hand is required, so clustering models decide how to group the observations taking into consideration only the inner properties of those observations. Given their flexibility and adaptability, clustering algorithms have been successfully employed in many real-world applications, in fields such as Agriculture (Rahamathunnisa et al. 2020), Data Mining (Sapkota et al. 2019), Ecology (Sreepathi et al. 2017), Image Understanding (Saraswathi and Allirani 2013; Wan et al. 2017; Diderot et al. 2019; Wei et al. 2019), Medicine (Li et al. 2016; Bruse et al. 2017; Premalatha and Subasree 2017; Beddad et al. 2019), Text Mining (Liu and Xiong 2011), Web Services (Parimalam and Sundaram 2017), Wireless Sensor Networks (WSN) (Misra and Kumar 2016; Dhiviya et al. 2017; Masoud et al. 2019), and so on. An interesting survey on data clustering algorithms can be found in (Xu and Tian 2015).

From an optimization perspective, clustering is considered a particular kind of NP-hard grouping problem. The most popular clustering approaches are partitional clustering algorithms, which provide a partition of the data set into a prefixed number of clusters (an input parameter of the models). Each cluster is represented by its centroid vector, and the clustering process iteratively optimizes a criterion function by updating these cluster centroids, seeking to improve the quality of the final solution (final partition) provided by the algorithm.

One of the most popular partitional clustering methods is the K-Means algorithm (MacQueen et al. 1967). Although traditional partitional clustering models achieve fast convergence, these methods are known for their sensitivity to the initial centroid positions, which may lead to weak solutions (i.e., the model may be trapped in a local minimum) if the algorithm starts in a poor region of the problem space. Standard partitional clustering approaches lack mechanisms to escape from local optima, since they perform only local searches on the problem space.

Evolutionary Algorithms (EAs) have been increasingly applied to solve a great variety of difficult problems, such as hard function optimization (He et al. 2009; Oliveira et al. 2013; Pacifico and Ludermir 2014; Jain et al. 2019), weight and architecture optimization in Neural Networks (Silva et al. 2011; Idrissi et al. 2016; Darwish et al. 2020), feature weighting and selection (Ramos and Vellasco 2018; Taşci et al. 2018; Xu et al. 2020), and so on, given their global search capability and their mechanisms to avoid local optima. In EAs, a set (population) of candidate solutions (individuals) for the problem at hand is kept and evolved through a generational process, seeking the optimization of a criterion function (the fitness function). In EAs such as the Genetic Algorithm (GA) (Holland 1992), Evolutionary Programming (EP) (Fogel et al. 1966; Fogel 2009), Evolution Strategies (ES) (Rechenberg 1973; Schwefel 1993) and Differential Evolution (DE) (Storn and Price 1995, 1997), the search is driven by operators that simulate biological processes such as mutation, recombination and selection. In this context, Swarm Intelligence (SI) methods are extensions of EAs that execute their search by simulating the self-organizing collective behavior of social animals, like swarming, flocking and herding (Bonabeau et al. 1999). Examples of SI algorithms are Ant Colony Optimization (ACO) (Dorigo et al. 1996), Particle Swarm Optimization (PSO) (Kennedy and Eberhart 1995; Kennedy et al. 2001) and Group Search Optimization (GSO) (He et al. 2006, 2009). Table 1 presents some of the most recent and popular algorithms from the Evolutionary Computing literature. Interesting texts on Evolutionary Computing can be found in (Kennedy 2006; Simon 2013; Eiben and Smith 2015).

Table 1 List of some nature-inspired population-based metaheuristics

EAs have been increasingly applied in clustering applications (Hruschka et al. 2009; Inkaya et al. 2016; Canuto et al. 2018; Figueiredo et al. 2019), but, given their stochastic nature, such techniques may be too slow to converge in comparison to standard partitional clustering algorithms. It is also known that, in general, the local search mechanisms of evolutionary algorithms are not very effective (Ren et al. 2014).

To overcome this drawback, Hybrid Intelligent Systems (HISs) and Memetic Algorithms (MAs) combining evolutionary approaches and K-Means have been proposed in the literature, such that global searches performed by EAs are complemented by local searches performed by K-Means. Some examples of hybrid evolutionary systems that use K-Means as a local searcher are found in (Ahmadyfard and Modares 2008; Abdel-Kader 2010; Bhavani et al. 2011; Pacifico and Ludermir 2018, 2019). However, most works in the data clustering literature only employ K-Means after the generational process of the selected EA is concluded, which means that the best solution found by the EA may already be trapped in a local minimum, making the application of K-Means sub-optimal (it will only exploit the problem region that contains that local minimum). To avoid this problem, in this work three hybrid memetic algorithms are proposed, named FMKGSO, MKGSO and TMKGSO, which combine the global search capabilities of GSO with the local search speed of K-Means, using K-Means (in different ways) to improve the population of GSO during the generational process. GSO was proposed as a method for continuous optimization problems based on the behavior of social animals searching for food resources, and, in the past few years, GSO has been successfully applied to many real-world applications (Chen et al. 2014b; Krishnaprabha and Aloor 2014; Li et al. 2015; Pacifico et al. 2018), presenting competitive optimization performance in comparison to other evolutionary algorithms, such as GA, PSO, EP and ES.

The proposed approaches are implemented in an attempt to understand the influence of K-Means on the behavior of GSO when dealing with the data clustering task. For that, a test bed consisting of 20 (twenty) real-world problems (obtained from the UCI Machine Learning Repository - Asuncion and Newman 2007) and 10 synthetic data sets is proposed. Five evaluation metrics from the clustering literature are employed to assess the quality of the solutions achieved by the proposed models, and five partitional evolutionary algorithms are selected for comparison purposes.

This work is organized as follows. First, some preliminary concepts are presented (Sect. 2), namely K-Means (Sect. 2.1), partitional evolutionary algorithms (Sect. 2.2) and GSO (Sect. 2.3). After that, the proposed hybrid memetic GSO approaches are explained (Sect. 3), followed by the experimental evaluation (Sect. 4). Finally, some conclusions and interesting directions for future work are presented (Sect. 5).

2 Preliminaries

The baseline models adopted in this work are presented in the following sections (Sects. 2.1, 2.2 and 2.3).

2.1 K-Means

K-Means (MacQueen et al. 1967) was proposed as a partitional clustering algorithm for continuous data, which groups real-valued data vectors into a predefined number of clusters. Consider a partition \(P_{C}\) of a data set with N data patterns (each data pattern is represented by a vector \({\mathbf{x} }_j\in {\mathfrak {R}^{m}}\), where \(j = 1, 2, \ldots , N\)) into C clusters, where C is a required input parameter of the algorithm. Each cluster is represented by its centroid vector \({\mathbf{g} }_{c}\in {\mathfrak {R}^{m}}\) (where \(c = 1, 2, \ldots , C\)).

In K-Means, clusters are formed based on a dissimilarity measure, the Euclidean distance [Eq. (1)]. At each iteration (until a maximum number of iterations \(t_{maxkmeans}\) is reached or another stopping criterion is satisfied), a new centroid vector is calculated for each cluster as the mean of its current data vectors [Eq. (2)], i.e., the data patterns currently assigned to that cluster. After that, the new partition is formed by assigning each pattern to the cluster with the nearest centroid.

$$\begin{aligned} d({\mathbf{x} }_j, {\mathbf{g} }_c) = \sqrt{\sum _{k=1}^{m}{(x_{jk}-g_{ck})^2}} \end{aligned}$$
(1)

where

$$\begin{aligned} {\mathbf{g} }_c = \frac{1}{N_c}\sum _{\forall {\mathbf{x} }_l \in c}{{\mathbf{x} }_l} \end{aligned}$$
(2)

where \(N_c\) is the number of patterns assigned to cluster c. The criterion function of K-Means is the Within-Cluster Sum of Squares, given in Eq. (3).

$$\begin{aligned} J(P_C) = \sum _{c = 1}^C{\sum _{\forall {\mathbf{x }}_j \in c}{d({\mathbf{x} }_j, {\mathbf{g} }_c)}} \end{aligned}$$
(3)

The K-Means algorithm is presented in Algorithm 1.

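To make the procedure concrete, a minimal sketch of the loop described above (random initialization with C data patterns, assignment by Eq. (1), centroid update by Eq. (2), and the criterion of Eq. (3)) is given below. It is an illustrative sketch of ours, not the pseudocode of Algorithm 1, and the parameter name t_maxkmeans follows the text.

```python
import numpy as np

def kmeans(X, C, t_maxkmeans=100, seed=None):
    """Minimal K-Means sketch: X is an (N, m) data matrix, C is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with C randomly chosen data patterns (Sect. 2.1).
    g = X[rng.choice(len(X), size=C, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(t_maxkmeans):
        # Assignment step: each pattern goes to the cluster with the nearest centroid [Eq. (1)].
        dists = np.linalg.norm(X[:, None, :] - g[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no reassignment: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned patterns [Eq. (2)].
        for c in range(C):
            if np.any(labels == c):
                g[c] = X[labels == c].mean(axis=0)
    # Within-Cluster Sum of Squares criterion [Eq. (3)].
    J = sum(np.linalg.norm(X[labels == c] - g[c], axis=1).sum() for c in range(C))
    return g, labels, J
```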

2.2 Evolutionary algorithms for data clustering

In this section, we describe the most commonly adopted representation scheme for evolutionary algorithms when adapted as partitional clustering methods (Chen and Ye 2004).

Once more, consider a partition \(P_{C}\) of a data set with N patterns \({\mathbf{x} }_j\in {\mathfrak {R}^{m}}\) (\(j = 1, 2,\ldots , N\)) into C clusters. Each cluster is represented by its centroid vector \({\mathbf{g} }_{c}\in {\mathfrak {R}^{m}}\) (\(c = 1, 2, \ldots , C\)). Each individual \({\mathbf{X} }_i\in {\mathfrak {R}^n}\) (where \(n = m \times C\)) in population G represents C cluster centroids at the same time, one for each cluster (Chen and Ye 2004). For instance, if \(m = 3\) and \(C = 5\), each individual will be a vector \({\mathbf{X} }_i\in {\mathfrak {R}^{15}}\), where the first three features represent the centroid \({\mathbf{g} }_1\), the 4th to 6th features represent the centroid \({\mathbf{g} }_2\), the 7th to 9th features represent the centroid \({\mathbf{g} }_3\), the 10th to 12th features represent the centroid \({\mathbf{g} }_4\), and the last three features (the 13th to 15th) represent the centroid \({\mathbf{g} }_5\), as illustrated in Fig. 1.

Fig. 1 Individual representation: \({\mathbf{g} }_1\), \({\mathbf{g} }_2\), \({\mathbf{g} }_3\), \({\mathbf{g} }_4\) and \({\mathbf{g }}_5\) are 3-dimensional cluster centroids
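As an illustration of this encoding, the small sketch below (with helper names of our own) decodes a flat individual into its C centroids and evaluates the WCSS fitness of Eq. (3), assuming the row-major layout of Fig. 1.

```python
import numpy as np

def decode_individual(X_i, C, m):
    """Split a flat individual of length n = m * C into C centroid vectors (Fig. 1)."""
    return X_i.reshape(C, m)

def fitness_wcss(X_i, data, C, m):
    """Assign each pattern to its nearest encoded centroid and return the WCSS [Eq. (3)]."""
    g = decode_individual(X_i, C, m)
    dists = np.linalg.norm(data[:, None, :] - g[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    J = np.take_along_axis(dists, labels[:, None], axis=1).sum()
    return J, labels
```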

The population of evolutionary algorithms is generally initialized by a random process, but in the context of partitional data clustering, initializing each individual by randomly choosing C patterns from the data set under analysis to compose its initial cluster centroids (just like in K-Means - see Sect. 2.1) leads to a faster exploration of the problem search space.

As the fitness function, many works adopt the Within-Cluster Sum of Squares [Eq. (3)] or some alternative function that takes this criterion as its main component, as in (Chen and Ye 2004; Ahmadyfard and Modares 2008; Hruschka et al. 2009; Prabha and Visalakshi 2014; Pacifico and Ludermir 2019, 2019b). Sometimes, however, different functions are adopted as the fitness function (Das et al. 2007; Liu et al. 2011; Wong et al. 2011; He and Tan 2012; José-García and Gómez-Flores 2016; Pacifico and Ludermir 2016).

Once the initial population is obtained and the fitness value of each individual X\(_i^{(0)}\) in population G is computed, the evolutionary operators of the selected evolutionary algorithm are applied to the population, evolving the cluster centroids represented by each individual through a generational process until a termination condition is reached. The best individual found by the EA is provided as the clustering solution. A generic evolutionary algorithm for partitional data clustering is presented in Algorithm 2.


Although EAs are able to find global solutions even when dealing with complex optimization problems, the search performed by such strategies may be too slow, and in many applications (e.g., Cui et al. 2005; Abdel-Kader 2010; Ahmadi et al. 2010; Chen et al. 2014a; Pacifico and Ludermir 2018, 2019) the EAs are hybridized with K-Means in such a way that the EA is used to find a better set of starting points (i.e., cluster centroids) for K-Means. Algorithm 3 illustrates this process.

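A compact sketch of this non-memetic hybrid scheme is given below, assuming a generic run_ea routine that returns the best flat individual found; the function names are placeholders of ours, not the paper's implementation. The kmeans_refine helper is reused by the memetic sketches in Sect. 3.

```python
import numpy as np

def kmeans_refine(data, g, steps):
    """Run a few K-Means steps starting from the given centroids g (shape (C, m))."""
    for _ in range(steps):
        d = np.linalg.norm(data[:, None, :] - g[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(len(g)):
            if np.any(labels == c):
                g[c] = data[labels == c].mean(axis=0)
    return g

def hybrid_ea_kmeans(data, C, run_ea, t_max, t_maxkmeans):
    """Sketch of the non-memetic hybrid: run the EA first, then refine only its best solution."""
    m = data.shape[1]
    best = run_ea(data, C, t_max)                         # global search (any EA from Table 1)
    g = kmeans_refine(data, best.reshape(C, m).copy(), t_maxkmeans)  # local refinement of the best centroids
    return g
```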

2.3 Group search optimization

Group search optimization is inspired by animal social searching behavior and group living theory. GSO employs the Producer-Scrounger (PS) model as a framework. The PS model was first proposed by Barnard and Sibly (1981) to analyze the social foraging strategies of group-living animals. The PS model assumes that there are two foraging strategies within groups: producing (e.g., searching for food) and joining (scrounging, e.g., joining resources uncovered by others). Foragers are assumed to use producing or joining strategies exclusively. Under this framework, concepts of resource searching based on the animal visual scanning mechanism are used to design optimal searching strategies in the GSO algorithm (He et al. 2009).

In GSO, the population G of S individuals is called a group, and each individual is called a member. In an n-dimensional search space, the i-th member at the t-th searching iteration (generation) has a current position X\(_i^t\) \(\in {\mathfrak {R}^n}\) and a head angle \(\alpha\)\(_i^t\in {\mathfrak {R}^{n-1}}\). The search direction of the i-th member, which is a vector D\(_i^t\)(\(\alpha\)\(_i^t) = (d_{i1}^t, \ldots , d_{in}^t)\), can be calculated from \(\alpha\)\(_i^t\) via a polar to Cartesian coordinate transformation:

$$\begin{aligned} d_{i1}^t= & {} \prod _{q=1}^{n-1}{\cos (\alpha _{iq}^t)}, \nonumber \\ d_{ij}^t= & {} \sin (\alpha _{i(j-1)}^t)\prod _{q=j}^{n-1}{\cos (\alpha _{iq}^t)}\quad (j = 2, \ldots , n-1), \nonumber \\ d_{in}^t= & {} \sin (\alpha _{i(n-1)}^t) \end{aligned}$$
(4)
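For illustration, the transformation of Eq. (4) can be implemented as follows (a helper of our own, reused in the sketches further below):

```python
import numpy as np

def search_direction(alpha):
    """Map a head angle alpha (length n-1) to a search direction d (length n), as in Eq. (4)."""
    n = len(alpha) + 1
    d = np.empty(n)
    d[0] = np.prod(np.cos(alpha))                        # d_{i1}
    for j in range(1, n - 1):                            # d_{ij}, j = 2, ..., n-1 (0-based loop index)
        d[j] = np.sin(alpha[j - 1]) * np.prod(np.cos(alpha[j:]))
    d[n - 1] = np.sin(alpha[-1])                         # d_{in}
    return d
```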

A group in GSO consists of three types of members: producers, scroungers and dispersed members (or rangers) (He et al. 2009). The rangers are introduced by the GSO model, extending the standard PS framework.

During each GSO search iteration, the group member with the best fitness value found so far (i.e., located in the most promising area of the problem search space) is chosen as the producer (X\(_p\)) (Couzin et al. 2005), and the remaining members are scroungers or rangers. Standard GSO admits only one producer per iteration, but some GSO variants use multiple producers at the same time (Junaed et al. 2013; Pacifico and Ludermir 2013).

The producer employs a scanning strategy (producing) based on its vision field, generalized to an n-dimensional space, which is characterized by the maximum pursuit angle \({\theta _{max}} \in {\mathfrak {R}^{n-1}}\) and the maximum pursuit distance \(l_{max}\in {\mathfrak {R}}\), given by Eq. (5).

$$\begin{aligned} l_{max} = \left\| {\mathbf{U} }-{\mathbf{L }}\right\| = \sqrt{\sum _{k=1}^{n}{(U_{k}-L_{k})^2}} \end{aligned}$$
(5)

where \(U_{k}\) and \(L_{k}\) denote the upper bound and lower bound of the k-th dimension from the problem space, respectively.

In GSO, at the t-th iteration the producer X\(_p^t\) scans laterally by randomly sampling three points in its scanning field: one at zero degrees [Eq. (6)], one in the right-hand-side hypercube [Eq. (7)] and one in the left-hand-side hypercube [Eq. (8)].

$$\begin{aligned}&{\mathbf{X} }_{z} = {\mathbf{X} }_p^t + r_1l_{max}{\mathbf{D} }_p^t(\alpha _p^t) \end{aligned}$$
(6)
$$\begin{aligned}&{\mathbf{X }}_{r} = {\mathbf{X }}_p^t + r_1l_{max}{\mathbf{D }}_p^t(\alpha _p^t+\frac{{\mathbf{r }}_2\theta _{max}}{2}) \end{aligned}$$
(7)
$$\begin{aligned}&{\mathbf{X }}_{l} = {\mathbf{X }}_p^t + r_1l_{max}{\mathbf{D }}_p^t(\alpha _p^t-\frac{{\mathbf{r }}_2\theta _{max}}{2}) \end{aligned}$$
(8)

where \(r_1\in {\mathfrak {R}}\) is a normally distributed random number (mean 0 and standard deviation 1) and r\(_2\in {\mathfrak {R}^{n-1}}\) is a uniformly distributed random sequence in the range (0, 1).

If the producer is able to find a better resource than its current position, it will fly to this point; if no better point is found, the producer will stay in its current position and turn its head to a new, randomly generated angle [Eq. (9)].

$$\begin{aligned} \alpha _p^{t+1} = \alpha _p^t+{\mathbf{r} }_2\alpha _{max} \end{aligned}$$
(9)

where \(\alpha _{max}\in {\mathfrak {R}}\) is the maximum turning angle.

If after \(a\in {\mathfrak {R}}\) iterations the producer cannot find a better area, it will turn its head back to zero degrees [Eq. (10)].

$$\begin{aligned} \alpha _p^{t+a} = \alpha _p^t \end{aligned}$$
(10)
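A sketch of the producing behavior [Eqs. (6)-(9)] is shown below, reusing the search_direction helper above; the function and parameter names are our own, and the fitness f is assumed to be minimized (as with the WCSS).

```python
import numpy as np

def producing(f, X_p, alpha_p, l_max, theta_max, alpha_max, rng):
    """One producing step: sample the zero-degree, right and left points and move if one improves f."""
    r1 = rng.standard_normal()                    # normally distributed scalar (mean 0, std 1)
    r2 = rng.random(len(alpha_p))                 # uniform sequence in (0, 1)
    candidates = [
        X_p + r1 * l_max * search_direction(alpha_p),                         # zero degrees, Eq. (6)
        X_p + r1 * l_max * search_direction(alpha_p + r2 * theta_max / 2.0),  # right hypercube, Eq. (7)
        X_p + r1 * l_max * search_direction(alpha_p - r2 * theta_max / 2.0),  # left hypercube, Eq. (8)
    ]
    best = min(candidates, key=f)
    if f(best) < f(X_p):
        return best, alpha_p                      # fly to the better resource
    new_alpha = alpha_p + rng.random(len(alpha_p)) * alpha_max                # turn the head, Eq. (9)
    return X_p, new_alpha
```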

All scroungers will join the resource found by the producer, performing the scrounging strategy according to Eq. (11).

$$\begin{aligned} {\mathbf{X} }_i^{t+1} = {\mathbf{X} }_i^t+{\mathbf{r} }_3\circ ({\mathbf{X} }_p^t-{\mathbf{X} }_i^t) \end{aligned}$$
(11)

where r\(_3\in {\mathfrak {R}^n}\) is a uniform random sequence in the range (0, 1) and \(\circ\) is the Hadamard product or the Schur product, which calculates the entrywise product of two vectors.

The rangers will perform random walks through the problem space (Higgins and Strauss 2004), according to Eq. (12).

$$\begin{aligned} {\mathbf{X}}_i^{t+1} = {\mathbf{X}}_i^t+l_i{\mathbf{D}}_i^t(\alpha _i^{t+1}) \end{aligned}$$
(12)

where

$$\begin{aligned} l_i = ar_1l_{max} \end{aligned}$$
(13)
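Scrounging [Eq. (11)] and ranging [Eqs. (12)-(13)] can be sketched in the same spirit; here we assume the ranger turns its head with the same rule as Eq. (9), and all names are again our own.

```python
import numpy as np

def scrounging(X_i, X_p, rng):
    """Scrounging step [Eq. (11)]: move towards the producer with an entrywise random step."""
    r3 = rng.random(len(X_i))
    return X_i + r3 * (X_p - X_i)                 # r3 ∘ (X_p - X_i), Hadamard product

def ranging(X_i, alpha_i, a, l_max, alpha_max, rng):
    """Ranging step [Eqs. (12)-(13)]: a random walk along a randomly turned head angle."""
    new_alpha = alpha_i + rng.random(len(alpha_i)) * alpha_max   # assumed to follow Eq. (9)
    l_i = a * rng.standard_normal() * l_max                      # walk length, Eq. (13)
    return X_i + l_i * search_direction(new_alpha), new_alpha
```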

In GSO, when a member moves outside the search space bounds, it is turned back to its previous position inside the search space (Dixon 1959). Some studies considering alternative treatments for out-of-bounds population individuals can be found in (Xu and Shu 2006; Silva et al. 2011; Pacifico et al. 2018).

The GSO algorithm is presented in Algorithm 4.


The GSO scrounging operator focuses the search performed by the group on the most promising areas of the problem space, corresponding to the main exploration/exploitation strategy employed by many EAs (like the crossover strategy in Genetic Algorithms and the particle movement in Particle Swarm Optimization).

Producing and ranging are the main mechanisms employed by GSO for escaping local minima. When the producer of one generation is trapped in a local minimum (a situation in which all scroungers would also be trapped by following the producer, resulting in premature convergence of the group), the producing operator may find a better point in the search space, escaping from that local minimum. Even if that does not happen easily, the rangers keep performing random walks that do not depend on the results found by the producer, which may lead to more promising areas than the ones found so far by the whole group, evading local minima.

Some works have already adapted GSO to the context of partitional clustering, just like in (Pacifico and Ludermir 2014a, 2016, 2019b).

3 Proposed approaches

This section presents the proposed hybrid memetic GSO and K-Means models, designed to test the influence of K-Means when employed as a local searcher to improve the capabilities of the partitional Group Search Optimizer. Three memetic algorithms are presented: FMKGSO, MKGSO and TMKGSO. In this work, we opted to combine GSO and K-Means due to some advantages of these algorithms, such as:

  1. GSO has been proven to be a good global optimizer in many real-world applications;

  2. GSO presents good mechanisms to escape from local optima, represented by the producing behavior of the producers and by the ranging behavior of the dispersed members;

  3. Ranging and producing behaviors are also employed to prevent the premature convergence of the group, a well-known and hard-to-handle problem that affects some global search nature-inspired metaheuristics, such as Particle Swarm Optimization;

  4. Standard GSO implements an effective mechanism to prevent out-of-bounds members, avoiding individuals that would not codify an actual solution to the problem at hand by extrapolating the problem search space boundaries;

  5. K-Means is a good local searcher, performing fast exploitation of problem space regions.

The proposed models are described as follows.

First, as in the generic partitional framework for EAs (see Sect. 2.2), for all three memetic algorithms each group member is represented as a continuous vector X\(_i \in {\mathfrak {R}^n}\) (with \(n = m \times C\)), codifying C cluster centroids at the same time, and at the t-th generation X\(_i^t\) determines its own partition X\(_i^t.P_C\).

For each algorithm, the initial population is obtained by the random selection of C patterns from the data set under analysis to compose each member X\(_i^{(0)}\). After random initialization, each member X\(_i^{(0)}\) has its fitness value computed to determine the quality of the partition X\(_i^{(0)}.P_C\) it represents, obtained by associating each data pattern x\(_j\) to its closest centroid vector X\(_i^{(0)}\).g\(_c\) in X\(_i^{(0)}\).

After that, the GSO generational process starts. Each memetic algorithm employs K-Means to some degree during its generational process to improve the quality of the group.

As in standard GSO, for all three memetic approaches the generational process starts, in each generation t, with the choice of the best member found so far (according to the selected fitness function) as the producer X\(_p^t\). The producer performs the producing operator as in standard GSO, by evaluating three random points from its visual scanning field [see Eqs. (6), (7) and (8)]. If a better point is found, X\(_p^t\) will migrate to that point in the next generation.

Scrounging and ranging in the memetic approaches are executed following Eqs. (11) and (12), respectively. After executing all standard GSO operators, each memetic model adopts K-Means to improve the GSO group in a different way, as described below.

In FMKGSO, the algorithm keeps track of the quality of each member separately. For each new member X\(_i^{t+1}\) obtained in the t-th generation (after the execution of the producing, scrounging and ranging operators), its partition X\(_i^{t+1}.P_C\) is determined by associating each data pattern x\(_j\) to its closest centroid vector X\(_i^{t+1}\).g\(_c\), and its fitness is computed. If X\(_i^{t+1}\) has not improved (in terms of fitness value) for a predefined number of consecutive generations (\(t_{maxkmeans}\)), its cluster centroids are refined by the application of \(t_{maxkmeans}\) K-Means steps. FMKGSO therefore only applies K-Means to the members that are unable to improve their partition for a period of time. This combination seeks to speed up the search performed by each member, since GSO operators may present slow convergence rates, as in all EAs; the K-Means execution can considerably speed up the convergence of GSO members, promoting better exploration and exploitation of the problem search space. The generational process runs until a termination condition is met, and the final best improved solution X\(_p^{t_{max}}\) is returned. The FMKGSO algorithm is presented in Algorithm 5.

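A minimal sketch of the FMKGSO refinement rule is given below, assuming each member carries a per-member stagnation counter and reusing the kmeans_refine helper from the Sect. 2.2 sketch; the data structure and names are ours, not the paper's Algorithm 5.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Member:
    position: np.ndarray      # flat vector of length m * C (Sect. 2.2 encoding)
    fitness: float
    stagnation: int = 0       # consecutive generations without fitness improvement

def fmkgso_refinement(member, data, C, m, t_maxkmeans, fitness_fn):
    """FMKGSO rule (sketch): refine a member with t_maxkmeans K-Means steps only when it
    has failed to improve for t_maxkmeans consecutive generations."""
    if member.stagnation >= t_maxkmeans:
        g = kmeans_refine(data, member.position.reshape(C, m).copy(), t_maxkmeans)
        member.position = g.reshape(-1)
        member.fitness = fitness_fn(member.position)
        member.stagnation = 0                      # counter restarts after the local refinement
    return member
```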

In MKGSO, after the determination of the new group by the execution of the producing, scrounging and ranging operators, each member X\(_i^{t+1}\) is improved by the execution of exactly one K-Means step in each generation, before its fitness evaluation. All members are refined in this way, regardless of whether the GSO operators have been able to improve them or not. This operation seeks to speed up the exploitation performed by each GSO member by smoothly aggregating local information into the GSO generational process. After that, the partition represented by each member is determined and its fitness value is computed. The generational process runs until a termination condition is met, and the final best improved solution X\(_p^{t_{max}}\) is returned. The MKGSO algorithm is presented in Algorithm 6.


The last proposed approach, TMKGSO, applies K-Means to each member periodically, for a limited number of steps. After each prefixed number of generations (\(t_{maxkmeans}\), an input parameter of the algorithm), each new group member \({\mathbf{X}}_i^{t+1}\) (obtained by the application of the producing, scrounging or ranging operator) is refined by the application of \(t_{maxkmeans}\) K-Means steps. By alternating executions of GSO and K-Means, the new groups are improved by the interchange of global and local information: while GSO is executing, its operators prevent the group from getting trapped in local optima regions, promoting global exploration of the problem space, whereas during the K-Means steps the group is guided through fast local exploitation of the region of the problem space where each member is placed. Each algorithm complements the search performed by the other. The generational process runs until a termination condition is met, and the final best improved solution X\(_p^{t_{max}}\) is returned. The TMKGSO algorithm is presented in Algorithm 7.

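For contrast, the remaining two injection strategies can be sketched as follows; the scheduling follows the descriptions above, while the helper names and member structure are the same assumptions used in the FMKGSO sketch.

```python
def mkgso_step(members, data, C, m):
    """MKGSO (sketch): every member receives exactly one K-Means step in every generation."""
    for member in members:
        g = kmeans_refine(data, member.position.reshape(C, m).copy(), 1)
        member.position = g.reshape(-1)
    return members

def tmkgso_step(members, data, C, m, t, t_maxkmeans):
    """TMKGSO (sketch): every t_maxkmeans generations, every member receives t_maxkmeans K-Means steps."""
    if (t + 1) % t_maxkmeans == 0:
        for member in members:
            g = kmeans_refine(data, member.position.reshape(C, m).copy(), t_maxkmeans)
            member.position = g.reshape(-1)
    return members
```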

By employing three different strategies to use K-Means during the GSO generational process, we can evaluate the actual influence of this technique as a local searcher. In that sense, in this work the total number of intended clusters C is assumed to be known a priori (Niu et al. 2017), so that no other factors influence the behavior of the proposed models. The automatic determination of the optimal number of clusters for each problem is ongoing research, and that issue will be addressed in future work. For more information on the automatic clustering issue, please refer to Das et al. (2007), José-García and Gómez-Flores (2016), Elaziz et al. (2019).

4 Experimental results

In this section, we test the clustering capabilities of the proposed approaches, in comparison to other partitional evolutionary algorithms from the literature, by means of 20 (twenty) real-world and 10 synthetic data sets. All real-world data sets are benchmark classification and clustering problems acquired from the UCI Machine Learning Repository (Asuncion and Newman 2007). The features of the selected real data sets are shown in Table 2; these data sets present different degrees of difficulty, such as unbalanced and overlapping classes, different numbers of classes and features, and so on.

Table 2 Real-world data set features

The proposed synthetic data sets are split into two main groups: the Disp group (containing five data sets), where the classes vary in their degree of dispersion, from a well-separated scenario (Disp01) to a scenario where all classes present some degree of overlapping (Disp05); and the Prox group (containing five data sets), where the overlapping is obtained by the approximation of the class centers, from a scenario where all class centers are well separated in the problem search space (Prox01) to a scenario where all class centers are close to each other (Prox05). The configurations of the proposed synthetic data sets are presented in Table 3 and illustrated in Fig. 2.

Table 3 Disp and Prox synthetic data sets description: each Disp and Prox data set consists of 1000 3-dimensional patterns, equally distributed among 5 classes
Fig. 2 Synthetic data sets representation: a-e Disp01 to Disp05, respectively; f-j Prox01 to Prox05, respectively
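As an illustration of how such data could be generated, the sketch below draws 1000 3-dimensional patterns equally distributed among 5 Gaussian classes (as in Table 3); the dispersion and center-spacing values are illustrative placeholders of ours, not the actual Disp/Prox configurations of Table 3.

```python
import numpy as np

def make_synthetic(n_classes=5, n_per_class=200, dim=3, spread=1.0, center_scale=10.0, seed=0):
    """Generate Gaussian classes: `spread` mimics the Disp scenarios (class dispersion),
    `center_scale` mimics the Prox scenarios (distance between class centers)."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-center_scale, center_scale, size=(n_classes, dim))
    X = np.vstack([rng.normal(c, spread, size=(n_per_class, dim)) for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

# e.g., a well-separated scenario vs. a heavily overlapping one (illustrative values only):
X_easy, y_easy = make_synthetic(spread=0.5, center_scale=10.0)
X_hard, y_hard = make_synthetic(spread=2.5, center_scale=3.0)
```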

For comparison purposes, five clustering measures are employed: the Within-Cluster Sum of Squares [Eq. (3)], the Intra-Cluster Distance [\(D_{max}\), Eq. (14)], the Inter-Cluster Separation [\(D_{min}\), Eq. (15)] (Wong et al. 2011), the Weighted Quantization Error [\(J_{e_{2}}\), Eq. (16)] (Esmin et al. 2008) and the Corrected Rand Index [CR, Eq. (17)] (Hubert and Arabie 1985).

$$\begin{aligned}&{\textit{D}}_{max}(P_C) = \max _{c = 1, \ldots , C}\{\sum _{\forall {\mathbf{x}}_{j} \in c}{d({\mathbf{x}}_j, {\mathbf{g}}_{c})}/|N_{c}|\} \end{aligned}$$
(14)
$$\begin{aligned}&{\textit{D}}_{min}(P_C) = \min _{\forall c_{1}, c_{2}, c_{1}\ne c_{2}}\{d({\mathbf{g}}_{c_{1}}, {\mathbf{g}}_{c_{2}})\} \end{aligned}$$
(15)
$$\begin{aligned}&J_{e_{2}}(P_C) = \sum _{c = 1}^C{[(\sum _{\forall {\mathbf{x}}_{j} \in c}{d({\mathbf{x}}_j, {\mathbf{g}}_{c})}/|N_{c}|)\times (|N_{c}|/N)]} \end{aligned}$$
(16)

The CR assesses the degree of similarity between an a priori partition and a partition provided by the clustering algorithm. Given that all data sets adopted in this work are real-valued classification problems (i.e., all data patterns are labeled), CR represents a robust comparison metric for clustering studies, since its analysis takes into consideration only the relationships among the data patterns from the a priori and a posteriori partitions, without taking into consideration the categories themselves.

Considering a partition \(U_R = \{u_1, u_2, \ldots , u_R\}\) provided by a clustering algorithm and an a priori partition defined by the classification \(V_C = \{v_1, v_2, \ldots , v_C\}\), CR is defined by Eq. (17).

$$\begin{aligned} CR = \frac{a - b}{c - d} \end{aligned}$$
(17)

where

$$\begin{aligned}&a = \sum _{i=1}^R\sum _{j=1}^C{\binom{n_{ij}}{2}} \end{aligned}$$
(18)
$$\begin{aligned}&b = \binom{n}{2}^{-1}\sum _{i=1}^R{\binom{n_{i.}}{2}}\sum _{j=1}^C{\binom{n_{.j}}{2}} \end{aligned}$$
(19)
$$\begin{aligned}&c = \frac{1}{2}\left[ \sum _{i=1}^R{\binom{n_{i.}}{2}}+\sum _{j=1}^C{\binom{n_{.j}}{2}}\right] \end{aligned}$$
(20)
$$\begin{aligned}&d = \binom{n}{2}^{-1}\sum _{i=1}^R{\binom{n_{i.}}{2}}\sum _{j=1}^C{\binom{n_{.j}}{2}} \end{aligned}$$
(21)

where \(n_{ij}\) represents the number of objects that are in both clusters \(u_i\) and \(v_j\); \(n_{i.}\) indicates the number of objects in cluster \(u_i\); \(n_{.j}\) indicates the number of objects in cluster \(v_j\); and n is the total number of objects. CR takes values in the interval [-1, 1], in which the value 1 indicates perfect agreement between partitions, whereas values near 0 (or negative) correspond to cluster agreement found by chance (Arabie et al. 1996).
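For reference, the five metrics can be computed directly from Eqs. (3) and (14)-(21); the sketch below is our own and assumes hard cluster assignments.

```python
import numpy as np
from math import comb

def cluster_metrics(X, labels, centroids):
    """J [Eq. (3)], D_max [Eq. (14)], D_min [Eq. (15)] and J_e2 [Eq. (16)] for a partition."""
    C = len(centroids)
    per_cluster = [np.linalg.norm(X[labels == c] - centroids[c], axis=1) for c in range(C)]
    J = sum(d.sum() for d in per_cluster)
    D_max = max(d.sum() / len(d) for d in per_cluster if len(d) > 0)
    D_min = min(np.linalg.norm(centroids[c1] - centroids[c2])
                for c1 in range(C) for c2 in range(C) if c1 != c2)
    J_e2 = sum((d.sum() / len(d)) * (len(d) / len(X)) for d in per_cluster if len(d) > 0)
    return J, D_max, D_min, J_e2

def corrected_rand(labels_pred, labels_true):
    """Corrected Rand index [Eqs. (17)-(21)] computed from the contingency table."""
    clusters, classes = np.unique(labels_pred), np.unique(labels_true)
    n_ij = np.array([[np.sum((labels_pred == u) & (labels_true == v)) for v in classes]
                     for u in clusters])
    a = sum(comb(int(x), 2) for x in n_ij.ravel())
    t1 = sum(comb(int(x), 2) for x in n_ij.sum(axis=1))    # row sums n_i.
    t2 = sum(comb(int(x), 2) for x in n_ij.sum(axis=0))    # column sums n_.j
    expected = t1 * t2 / comb(int(n_ij.sum()), 2)           # b = d in Eqs. (19) and (21)
    return (a - expected) / (0.5 * (t1 + t2) - expected)
```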

The evaluation criterion includes a rank system obtained through the application of the Friedman hypothesis test (Friedman 1937, 1940) to all the comparison clustering measures. The Friedman test is a non-parametric hypothesis test that ranks all algorithms for each data set separately. If the null hypothesis (that all ranks are not significantly different) is rejected, the Nemenyi test (Nemenyi 1962) is adopted as the post-hoc test. According to the Nemenyi test, the performances of two algorithms are considered significantly different if the corresponding average ranks differ by at least the critical difference

$$\begin{aligned} CD = q_a\sqrt{\frac{n_{alg}(n_{alg} + 1)}{6n_{data}}} \end{aligned}$$
(22)

where \(n_{data}\) represents the number of data sets, \(n_{alg}\) represents the number of compared algorithms and \(q_a\) are critical values based on the Studentized range statistic divided by \(\sqrt{2}\) (Demšar 2006). Since J, \(J_{e_{2}}\) and \(D_{max}\) are minimization metrics (indicated by \(\downarrow\)), the best methods obtain lower ranks in the Friedman-Nemenyi test, while for \(D_{min}\) and CR (maximization metrics, indicated by \(\uparrow\)) the best methods obtain higher average ranks.
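A small sketch of this rank-based evaluation is shown below (average Friedman ranks plus the critical difference of Eq. (22)); the q_a value is only a placeholder, since the proper value must be taken from a Studentized-range table for the chosen number of algorithms and significance level.

```python
import numpy as np
from scipy.stats import rankdata

def friedman_nemenyi(scores, minimize=True, q_a=3.031):
    """scores: (n_data, n_alg) matrix of one clustering metric (one row per data set).
    Returns each algorithm's average Friedman rank and the Nemenyi critical difference [Eq. (22)].
    q_a = 3.031 is a placeholder; the tabulated value depends on n_alg and the significance level."""
    n_data, n_alg = scores.shape
    ordered = scores if minimize else -scores
    ranks = np.vstack([rankdata(row) for row in ordered])   # rank per data set, ties averaged
    avg_ranks = ranks.mean(axis=0)
    cd = q_a * np.sqrt(n_alg * (n_alg + 1) / (6.0 * n_data))
    return avg_ranks, cd
```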

The proposed hybrid memetic GSO approaches are compared to five other EAs: Genetic Algorithm (GA), Differential Evolution (DE), Particle Swarm Optimization (PSO), standard Group Search Optimization (GSO) and Backtracking Search Optimization (BSA). The selected approaches are state-of-the-art models from the evolutionary computing and data clustering literature that have been successfully applied in many recent applications: GA (Shi and Xu 2018; Akbari et al. 2019; Islam et al. 2019; Mortezanezhad and Daneshifar 2019; Toman et al. 2020), DE (Wang 2018; Li and Dong 2019; Cho and Nyunt 2020; Zhang and Cao 2020; Zhu et al. 2020), PSO (Souza et al. 2018; Wang et al. 2018; Li et al. 2019; Pacifico and Ludermir 2019; Miranda and Prudêncio 2018), GSO (Pacifico and Ludermir 2018; Lin and Huang 2019; Pacifico and Ludermir 2019b; Taj and Basu 2019; Abualigah 2020), and BSA (Latiff et al. 2016; Günen et al. 2017; Li et al. 2019; Toz et al. 2019; Hassan and Rashid 2020).

All selected algorithms have been adapted as hybrid partitional clustering models according to the scheme presented in Algorithm 3. In this work, all algorithms use the Within-Cluster Sum of Squares [Eq. (3)] as the fitness function, for simplicity, since our main objective is to evaluate the impact of K-Means on the behavior of memetic GSO, not the influence of the fitness function on the behavior of EAs. For the latter issue, please refer to Wong et al. (2011), Pacifico and Ludermir (2016).

All algorithms were run in a MATLAB 7.6 environment. Thirty independent runs were executed for each data set, and all evolutionary methods started with the same initial population in each run, obtained by a random process (as explained in Sect. 2), as a manner to guarantee a fair evaluation and comparison among the selected techniques. Also, since the computational costs were evaluated in terms of the average execution time of each model on each data set, each algorithm was run and tested on a computer with an i7-7700K CPU, an NVIDIA GeForce GTX 1060 6GB GPU and 32 GB of RAM, independently (one algorithm at a time), and no programs other than the Operating System were executed during the tests, granting the same environmental conditions to each method. For all tests, the adopted number of clusters C is equal to the number of classes of each data set.

The hyperparameters of each evolutionary algorithm are presented in Table 4 and have been obtained from the literature: GA (Abdel-Kader 2010), DE (Das et al. 2007), PSO (Abdel-Kader 2010; Pacifico and Ludermir 2019), GSO (He et al. 2009; Pacifico and Ludermir 2014a, 2018), and BSA (Civicioglu 2013). The population size (S) and the maximum number of generations (\(t_{max}\)) have been chosen after a trial-and-error process, using values from Pacifico and Ludermir (2018), Pacifico and Ludermir (2019) as the starting point. Since the selected approaches did not improve significantly with larger population sizes or higher numbers of generations, we kept the same parameter values as presented in Pacifico and Ludermir (2018), Pacifico and Ludermir (2019), given that the computational costs associated with such parameters may grow very fast for high parameter values. The maximum number of K-Means executions (\(t_{maxkmeans}\)) has been obtained from Pacifico and Ludermir (2018).

Table 4 Hyperparameters for each EA

4.1 Discussion

The discussion is divided into three parts: first, we evaluate the experimental results for the real-world data sets (Sect. 4.1.1), followed by the evaluation of the experiments using the synthetic data sets (Sect. 4.1.2); an overall evaluation is also performed, based on the Friedman-Nemenyi hypothesis tests and the computational costs (in terms of the average execution times) obtained by each model (Sect. 4.1.3).

4.1.1 Real-world data sets

The experimental results for the real-world data sets are presented in Tables 5, 6, 7, 8 and 9. The best results for each metric in each data set are boldfaced.

Table 5 Experimental results for real-world data sets: within-cluster sum of squares (\(J^{\downarrow }\))
Table 6 Experimental results for real-world data sets: corrected Rand index (\(CR^{\uparrow }\))
Table 7 Experimental results for real-world data sets: weighted quantization error (\(J_{e_{2}}^{\downarrow }\))
Table 8 Experimental results for real-world data sets: intra-cluster distance (\(D_{max}^{\downarrow }\))
Table 9 Experimental results for real-world data sets: inter-cluster separation (\(D_{min}^{\uparrow }\))
Fig. 3 Average convergence graph of the best solution found so far, X\(_p^t\), for: a Abalone, b E. coli, c Glass, d Image segmentation, e Page blocks classification, f Yeast

In an empirical analysis, we can observe that the proposed hybrid memetic GSO approaches find better values of the fitness function (J), in most cases, than hybrid partitional models that only use K-Means to refine the best solution found by the generational process of the EA. As illustrated in Fig. 3, the comparison hybrid partitional approaches are sometimes so slow in performing the global search that the only significant improvement is obtained after the generational process, when their final solution is refined by the K-Means method (which presents fast local search capabilities). Fig. 3 also shows that even the use of just one K-Means step per member during the generational process (the MKGSO approach) is enough to significantly improve the convergence speed of GSO. The empirical analysis also showed that the hybrid memetic GSO models are more stable than the comparison hybrid partitional EA and K-Means algorithms.

The overall evaluation on the real-world problems obtained by the Friedman-Nemenyi hypothesis tests (Table 10) showed that the proposed memetic approaches are able to find clusters that are, on average, better represented by their final centroids (according to the J and \(J_{e_{2}}\) indices), more similar to the actual classes (CR index), and well separated in the problem space (according to \(D_{min}\)). The hypothesis tests also showed that, although MKGSO and TMKGSO use about the same number of K-Means steps per member in the current experimentation, MKGSO presented smoother convergence rates than TMKGSO, which may help MKGSO to avoid local minima; nevertheless, according to the current set of experiments, the performance of TMKGSO has been considered significantly better than the average results of MKGSO (in terms of the optimization of the fitness function, J).

Table 10 Overall evaluation on real-world data sets: average Friedman–Nemenyi ranks (and position) for each metric. The best results are bold faced

According to the Friedman-Nemenyi tests, considering all five clustering metrics, the best results have been found by TMKGSO, followed by MKGSO and FMKGSO (second and third places, respectively).

When considering the average execution time obtained by each method on each data set (Table 11), we can see that FMKGSO and TMKGSO (which apply K-Means refinements only under certain conditions) presented average times compatible with the other state-of-the-art EAs from the literature. On the other hand, MKGSO presented average execution times about four times higher than the other evaluated approaches, which may limit its application when the data set is large (in terms of the number of data patterns). Such a limitation may have occurred because this method calls K-Means in each generation for each individual of the population, requiring many allocations and deallocations of computational resources.

Table 11 Average execution times for the real-world data sets (in seconds)

4.1.2 Synthetic data sets

The results for both groups of synthetic data sets are presented in Tables 12, 13, 14, 15 and 16. The best results for each metric in each data set are boldfaced.

Table 12 Experimental results for synthetic data sets: within-cluster sum of squares (\(J^{\downarrow }\))
Table 13 Experimental results for synthetic data sets: corrected Rand index (\(CR^{\uparrow }\))
Table 14 Experimental results for synthetic data sets: weighted quantization error (\(J_{e_{2}}^{\downarrow }\))
Table 15 Experimental results for synthetic data sets: intra-cluster distance (\(D_{max}^{\downarrow }\))
Table 16 Experimental results for synthetic data sets: inter-cluster separation (\(D_{min}^{\uparrow }\))

The empirical analysis considering both the Disp and Prox groups showed that all algorithms were significantly affected when the degree of class overlapping increased, especially when the class overlapping was obtained by approximating the class centers of distribution (the Prox data sets). In spite of that, the memetic approaches were able, once again, to achieve the best average performances and degrees of stability among all approaches. In particular, FMKGSO found the best values in almost all tests, considering the five evaluation metrics, showing its capabilities and competitiveness. Fig. 4 illustrates the partitions obtained by each algorithm.

Fig. 4 Results for the Prox05 data set: a original data set, b GA-k-means, c DE-k-means, d PSO-k-means, e GSO-k-means, f BSA-k-means, g FMKGSO, h MKGSO, i TMKGSO

Considering the ranks provided by the Friedman-Nemenyi hypothesis test (Table 17), the memetic approaches found the best values for all evaluated indices. Although, in an overall analysis of the synthetic data sets, FMKGSO presented ranks slightly better than those of MKGSO, the Friedman-Nemenyi tests showed that there is no statistically significant difference between the performance of FMKGSO and that of MKGSO (their differences are not greater than the critical difference), so both memetic algorithms are considered to have achieved the best results together. TMKGSO reached third place in the evaluation on the synthetic data sets, but, for all five clustering metrics, this method achieved better performances than all the selected state-of-the-art comparison approaches, reinforcing the advantages of memetic models.

Table 17 Overall evaluation on synthetic data sets: average Friedman-Nemenyi ranks (and position) for each metric. The critical distance is CD = 3.3202. The best results are bold faced

The evaluation of the average execution times of each model (Table 18) showed that, once more, the computational costs of FMKGSO and TMKGSO, in terms of time, are close to those of the state-of-the-art algorithms, while MKGSO, although presenting better performances than the non-memetic comparison approaches when considering the clustering metrics, is still somewhat slower in execution.

Table 18 Average execution times for the synthetic data sets (in seconds)

4.1.3 Overall evaluation

At this point, an overall evaluation considering both the real-world and synthetic data sets is provided. For that, an evaluation based on the Friedman-Nemenyi hypothesis tests considering all data sets (real and synthetic) is performed (Table 19). By considering both groups of data sets, we can gain a better understanding of the behavior of all selected algorithms (hybrid partitional approaches and hybrid memetic partitional approaches) when dealing with both controlled and uncontrolled testing scenarios, showing how robust such approaches actually are.

Table 19 Overall evaluation on all data sets: average Friedman-Nemenyi ranks (and position) for each metric. The critical distance is CD = 1.9169. The best results are bold faced

The overall analysis ranked TMKGSO as the best approach among all tested algorithms, considering the rank values for all metrics, followed by FMKGSO and MKGSO, respectively. The overall evaluation showed that including some K-Means steps during the generational process of EAs is quite advantageous (the memetic approach) and is preferable to using K-Means only to refine the best solution found by the EAs. Considering the five selected clustering metrics, we can conclude that, for the current test bed, the memetic approaches are able to find final solutions with clusters that are, on average, better represented by their centroid vectors (according to the J and \(J_{e_{2}}\) indices), more similar to the actual classes (CR index), more compact (according to the \(D_{max}\) index), and better separated from each other in the problem space (according to \(D_{min}\)). The proposed memetic approaches have also shown the best stability throughout the experimentation, reinforcing their reliability.

For the current test bed, the best hybrid model has been the TMKGSO approach, which executes some K-Means steps after a prefixed number of generations for each group member, and which both achieved good performances according to the clustering indices and executed with a computational cost, in terms of average execution time, close to what is expected from any state-of-the-art Evolutionary Algorithm. On the other hand, although able to find good average performances, the MKGSO approach has been quite slow in its execution, which may represent a limitation. However, if time is not the main concern when a new data clustering system is designed, MKGSO would still be a better choice than the non-memetic approaches, since it preserves GSO's good global search capabilities and K-Means' fast convergence speed; the designer of such a system should keep this trade-off in mind when making the decision.

5 Conclusions

This paper presented an evaluation of the behavior of partitional Group Search Optimization when employing the K-Means algorithm as a local search operator to deal with the data clustering problem. In that sense, three hybrid memetic algorithms were proposed (FMKGSO, MKGSO and TMKGSO), which use different ways to combine GSO and K-Means in an attempt to improve the GSO population during the generational process of the method. Although K-Means is easily trapped in local optima, it is known to be a good local searcher when combined with global search metaheuristics, like evolutionary algorithms. Also, GSO, which is known to present good global search capabilities and mechanisms to escape from local optima, benefits considerably from adopting K-Means to speed up its convergence when dealing with data clustering problems.

To evaluate the proposed methods, five state-of-the-art partitional clustering evolutionary algorithms were tested: GA, DE, PSO, standard GSO and BSA. Following current and usual practice in the data clustering literature, the comparison approaches were hybridized with K-Means in such a way that their best final solutions are refined by K-Means after the generational process, so we could assess the actual influence of such a technique on the behavior of EAs.

The experimental test bed adopted in this work included 20 real-world and 10 synthetic data sets. To assess the potential of the proposed memetic models, five clustering metrics were employed (within-cluster sum of squares, corrected Rand index, weighted quantization error, intra-cluster distance and inter-cluster separation), and the experimental analysis was carried out by means of both an empirical analysis and an overall evaluation obtained through the application of a rank system based on the Friedman-Nemenyi hypothesis tests for each clustering metric.

The experimental results showed that, when K-Means is executed within the generational process of GSO for at least a few steps, it is able to aggregate more local information than when it is adopted only to improve the final solution found by an EA, speeding up the exploration/exploitation promoted by the memetic system. EAs are known to present slow convergence speeds, so when K-Means is only employed after the generational process of the EA, given K-Means' lack of mechanisms to recover from local optima, the final solution found by the hybrid model may still lie in a region of the problem space that does not contain the global optimum, and K-Means will remain trapped in that region.

All memetic approaches were able to improve on the comparison approaches, and the best evaluated memetic combination of K-Means and GSO (according to the Friedman-Nemenyi ranks and the evaluation of the average execution times) was achieved when K-Means executes for a limited number of steps, for each population individual, after a prefixed number of GSO generations (the TMKGSO algorithm), followed by FMKGSO (where individuals are improved by K-Means only if they fail to improve for a prefixed number of consecutive generations). This means that the interchange between the global search offered by EA operators and the local search promoted by K-Means (i.e., the memetic framework) may represent a robust approach to explore and exploit the problem search space, leading to better optimization performances. On the other hand, the use of just one K-Means step for each individual of the population in each generation (the MKGSO approach) represented the worst combination among the memetic models in terms of computational cost (measured by the average execution time), but this combination still kept the good performances of the other evaluated memetic approaches (FMKGSO and TMKGSO), showing its potential in terms of performance (considering all five selected clustering metrics) when dealing with data clustering problems, in comparison to all other state-of-the-art non-memetic hybrid algorithms. In a future investigation, some mechanisms, such as a selection process over the population individuals to be improved by K-Means in each GSO generation, will be studied in an attempt to reduce MKGSO's computational costs.

As future work, we intend to evaluate the influence of the adopted distance function on the behavior of the proposed memetic algorithms, so that the proposed approaches become more robust when dealing with clusters of different formats and shapes, overcoming some limitations of the standard Euclidean distance. Also, new fitness functions will be introduced and tested, using a larger test bed that will be obtained by the development of alternative synthetic data set configurations and scenarios. Finally, mechanisms for the automatic determination of the best number of clusters will be implemented, in such a way that this parameter will be estimated by the memetic evolutionary model itself, instead of being provided as an a priori input parameter to the algorithm.