1 Introduction

In the field of optimization, there are many problems where a single objective is considered. However, most real-world problems involve more than one objective to be optimized while satisfying a set of constraints [1]. The presence of multiple objectives is natural and makes such problems interesting and complex to solve. Data clustering is precisely an example of a problem where good solutions are better described by considering different criteria, i.e., more than one objective to optimize [2].

On the other hand, multi-objective evolutionary algorithms (MOEAs) are highly competitive when solving complex multi-objective optimization problems [3].

Several approaches have been proposed to solve the clustering problem using evolutionary algorithms. In this section, we briefly mention some of those studies.

Different evolutionary algorithms have been designed to optimize partitions for fixed (user-defined) values of the number of clusters (k). However, in practice, clustering algorithms that optimize both the number of clusters and the corresponding partitions are often preferred [27].

Cole [4] proposed a label-based integer encoding in which a genotype is an integer vector of N positions, where each position is associated with a data object and takes a value (cluster label) over the alphabet {1, 2, 3, …, k}. In this case, k is interpreted as the maximum number of clusters represented in each individual. Ma et al. [5] proposed an evolutionary algorithm for clustering named EvoCluster, which encodes a partition so that each gene represents one cluster and contains the labels of the objects grouped into it. Thus, a genotype encoding k clusters \(\left( {C_{1} ,C_{2} , \ldots ,C_{k} } \right)\) of a data set with N objects is formed by k genes, each of which stores \(n_{i}\) labels \(\left( {n_{1} + n_{2} + \cdots + n_{k} = N} \right)\). In [28], Bandyopadhyay and Maulik present a real encoding where genotypes are made up of real numbers that represent the coordinates of the cluster centers. If genotype i encodes k clusters in the n-dimensional space \(\mathbb{R}^{n}\), then its length is n·k. An important advantage of this encoding scheme is that it leads to variable-length genotypes, since k is no longer assumed to be constant. Pan and Cheng [29] adopted a binary encoding scheme based on a maximum number of clusters determined a priori. Each position of the string corresponds to an initial cluster center, so the total number of 1s in the string corresponds to the number of clusters encoded into the genotype. Casillas et al. [6] adopted a binary vector of N − 1 elements representing the N − 1 edges of a minimum spanning tree (MST) [7], whose nodes represent the N data set objects and whose edges correspond to proximity indexes between objects. In this representation, the value “0” means that the corresponding edge remains, whereas the value “1” means that the corresponding edge is eliminated.

The multi-objective clustering concept was first introduced in 1992 [8]. More than a decade later, in 2004, Handl and Knowles proposed a multi-objective clustering algorithm called VIENNA [9], which they later improved to obtain the MOCK algorithm [10, 11]. This algorithm is the main reference of this work, where the focus of the study is on the multi-objective evolutionary algorithm adopted, called PESA-II [12, 13]. The two objectives used in that work were the total intra-cluster variation (computed over all clusters) and the number of clusters. Both objectives should be minimized, but they conflict with each other, because improving the value of one objective worsens the other, i.e., reducing the intra-cluster variation requires increasing the number of clusters. Hence, by using the concept of Pareto dominance, the algorithm seeks a set of different non-dominated clustering solutions with a good trade-off between a small number of clusters and a small total intra-cluster variation.

Mukhopadhyay et al. [14] used two objectives, where the first was also a measure of total intra-cluster variation and the second measured the clusters’ separation (essentially, the summation of distances between all pairs of clusters). Ripon et al. [15] also proposed a multi-objective evolutionary clustering algorithm with two objectives. The first was an average intra-cluster variation computed over all clusters; this value was normalized by the number of clusters, which varied across individuals in the evolutionary algorithm’s population. The second objective was an inter-cluster distance measuring the average distance separating pairs of clusters.

The background described above shows that evolutionary algorithms have been successfully applied to data clustering. However, to the best of our knowledge, the MOEA adopted has not been further studied as a source of improvement. Therefore, in this work, three state-of-the-art MOEAs are added to MOCK and their performance is analyzed in two ways: (1) based on performance metrics used in evolutionary multi-objective optimization and (2) based on clustering metrics, so as to analyze the influence of the search algorithm on the overall performance of this particular clustering technique.

The rest of this paper is organized as follows. Some important concepts on evolutionary multi-objective optimization are given in Sect. 2. A detailed description of MOCK and the proposed approach are explained in Sect. 3. Section 4 presents the methodology, experiments and the comparison of results. Finally, Sect. 5 provides the concluding remarks and future work.

2 Basic concepts

A multi-objective optimization problem (MOP) can be mathematically defined as follows [30]: Find \(\vec{x}\) which minimizes \(\vec{f}\left( {\vec{x}} \right)\text{ := }\left[ {f_{1} \left( {\vec{x}} \right), f_{2} \left( {\vec{x}} \right), \ldots , f_{k} \left( {\vec{x}} \right)} \right]\), subject to m inequality constraints \(g_{i} \left( {\vec{x}} \right) \le 0, i = 1, 2, \ldots ,m\), and p equality constraints \(h_{j} \left( {\vec{x}} \right) = 0, \, j = 1,2, \ldots ,p\), where \(\vec{x} = \left[ {x_{1} ,x_{2} , \ldots ,x_{n} } \right]^{T}\) is the vector of decision variables and \(\vec{f}\) is the set of objective functions to be minimized. The functions \(g_{i}\) and \(h_{j}\) represent the set of constraints that defines the feasible region \({\mathcal{F}}\) of the search space. Any vector of variables \(\vec{x}\) that satisfies all the constraints is considered a feasible solution. Regarding optimal solutions in MOPs, the following definitions are provided:

Definition 1

A vector of decision variables \(\vec{x}_{1}\) dominates another vector of decision variables \(\vec{x}_{2}\) (denoted by \(\vec{x}_{1} \prec \vec{x}_{2}\)) if and only if \(\vec{x}_{1}\) is partially less than \(\vec{x}_{2}\), i.e., \(\forall i \in \left\{ {1, \ldots ,k} \right\}:f_{i} \left( {\vec{x}_{1} } \right) \le f_{i} \left( {\vec{x}_{2} } \right) \wedge \exists i \in \left\{ {1, \ldots ,k} \right\}:f_{i} \left( {\vec{x}_{1} } \right) < f_{i} \left( {\vec{x}_{2} } \right).\)

Definition 2

A vector of decision variables \(\vec{x} \in {\mathcal{X}}\) is non-dominated with respect to \({\mathcal{X}}\), if there does not exist another \(\vec{x}^{'} \in {\mathcal{X}}\) such that \(\vec{f}\left( {\vec{x}^{'} } \right) \prec \vec{f}\left( {\vec{x}} \right)\).

Definition 3

A vector of decision variables \(\vec{x}^{*} \in { \mathcal{F}}\) is Pareto optimal if it is non-dominated with respect to \({\mathcal{F}}\).

Definition 4

The Pareto optimal set \({\mathcal{P}}^{*}\) is defined by

$${\mathcal{P}}^{*} = \left\{ {\vec{x} \in { \mathcal{F}} |\vec{x}\,\,\, {\text{is}}\,\,{\text{Pareto}} - {\text{optimal}}} \right\}$$

Definition 5

The Pareto front \({\mathcal{P}\mathcal{F}}^{*}\) is defined by

$${\mathcal{P}\mathcal{F}}^{*} = \left\{ {\vec{f}\left( {\vec{x}} \right) |\vec{x} \in {\mathcal{P}}^{*} } \right\}$$

The goal in a MOP consists of determining the Pareto optimal set from the set \({\mathcal{F}}\) of all the decision variable vectors that satisfy \(g_{i}\) and \(h_{j}\).
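The dominance relation of Definition 1 and the non-dominated filtering of Definition 2 can be sketched in a few lines of Python. This is an illustrative helper assuming minimization; the function names are ours:

```python
def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (minimization)."""
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def non_dominated(front):
    """Keep only the objective vectors not dominated by any other in the list."""
    return [f for f in front if not any(dominates(g, f) for g in front if g != f)]

pts = [(1, 5), (2, 3), (3, 4), (4, 1)]
# (3, 4) is dominated by (2, 3); the remaining points are mutually non-dominated.
```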

3 Improved MOCK

In multi-objective clustering, each objective is evaluated simultaneously for each potential grouping solution [8, 11]. The final result is a set of trade-off grouping solutions, where each solution has a different balance between the objectives that have been optimized. As an example, Fig. 1b shows the Pareto set obtained by applying multi-objective clustering to the data in Fig. 1a, where the intra-cluster and inter-cluster distances are optimized.

Fig. 1
figure 1

a Elements to group, b Pareto solutions obtained by the multi-objective clustering

3.1 MOCK elements

As mentioned in Sect. 1, MOCK is an algorithm designed for multi-objective clustering. Two objectives are considered: connectivity and compactness of the groups formed (see Sect. 3.1.3). The multi-objective optimization employed by MOCK allows it to obtain a set of optimal solutions, each of which represents a possible grouping of the data with a different value of k (number of groups). The automatic determination of the number of clusters k is one of the most important advantages of the approach, because most traditional clustering techniques require this value to be specified [16, 17]. The optimization in MOCK is carried out by a particular MOEA called “Pareto envelope-based selection algorithm II” (PESA-II). To apply PESA-II, it is necessary to define a suitable genetic encoding that represents a possible grouping, one or more genetic variation operators (mutation and crossover) and two or more objectives to optimize. Figure 2 presents the general structure of the MOCK algorithm. The input of the algorithm is the data set to be grouped; then, the minimum spanning tree (MST) is calculated to initialize the population of the MOEA. Once the population is generated, a MOEA (PESA-II in the original version) is used to optimize the solutions and obtain a set of optimal solutions represented in a Pareto front, where every solution can be regarded as a good one and represents one possible clustering. The following sections describe in more detail the operation of MOCK, its solution representation, the variation operators and the objective functions adopted, following the technical report of the MOCK authors [10].

Fig. 2
figure 2

MOCK’s general algorithm

3.1.1 Initialization and representation

The individual encoding employs a representation where each individual g consists of N genes \(g_{1} ,g_{2} , \ldots ,g_{N}\), where N is the size of the data set to be clustered, and each gene \(g_{i}\) can take allele values j in the range {1, …, N}. Hence, a value of j assigned to the ith gene is interpreted as a link between data items i and j: in clustering terms, this means that data items i and j will be in the same cluster (see Fig. 3a, b). Decoding this representation requires identifying all subgraphs; all data items belonging to the same subgraph are then assigned to one cluster. The solution decoding inside MOCK can be done in linear time, because a backtracking scheme is used, as indicated in [10].

Fig. 3
figure 3

a Initialization of the individuals with minimum spanning tree, first individual, b subgraphs obtained by MOCK, third individual [10]
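One possible way to decode this locus-based adjacency representation is to compute the connected components of the induced graph, e.g., with a union-find structure. This is an illustrative sketch (0-based indices, names are ours); the original implementation instead uses a backtracking scheme [10]:

```python
def decode_locus_based(genotype):
    """Decode a locus-based adjacency genotype into clusters.

    genotype[i] = j means data items i and j are linked; the clusters are the
    connected components of the resulting graph.
    """
    n = len(genotype)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, j in enumerate(genotype):       # union every link i -> genotype[i]
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    clusters = {}
    for i in range(n):                     # group items by component root
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Items 0-1-2 form one subgraph, items 3-4 another (item 3 links to itself):
# decode_locus_based([1, 2, 2, 3, 3]) yields [[0, 1, 2], [3, 4]]
```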

For the initialization, given a data set, the MST is calculated using Prim’s algorithm [18]. The first individual in the population is the complete minimum spanning tree itself. The ith individual of the initial population is then formed by eliminating the (i − 1) largest links, i.e., the edges in the tree with the largest distance between the nodes (elements to be clustered). When a link is removed, the corresponding gene in the genotype is changed to link to itself; Fig. 3b shows the third initial member of the population. This initialization scheme provides different solutions in the initial population, each with a different k value, to begin the search.
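The initialization above can be sketched as follows. This is a simplified, illustrative implementation of Prim's algorithm and of the link-removal step (function names and example data are ours, not the authors' code):

```python
import math

def prim_mst(points):
    """Prim's algorithm: return an MST genotype g (g[i] links i to its parent)
    together with the (weight, child) list of tree edges."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = {0}
    genotype = list(range(n))          # every node initially links to itself
    edges = []                         # (weight, child) pairs, for later pruning
    best = {i: (dist(0, i), 0) for i in range(1, n)}
    while len(in_tree) < n:
        i = min(best, key=lambda k: best[k][0])   # cheapest node to attach
        w, parent = best.pop(i)
        genotype[i] = parent
        edges.append((w, i))
        in_tree.add(i)
        for j in best:                 # relax distances through the new node
            d = dist(i, j)
            if d < best[j][0]:
                best[j] = (d, i)
    return genotype, edges

def initial_individual(genotype, edges, i):
    """The ith initial individual removes the (i - 1) largest MST links."""
    g = list(genotype)
    for _, child in sorted(edges, reverse=True)[: i - 1]:
        g[child] = child               # removed link: gene points to itself
    return g
```

For instance, with two well-separated groups of points, the second individual removes the single largest MST edge, splitting the tree into two subgraphs (two clusters).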

3.1.2 Variation operators

Uniform crossover is used as the recombination operator in MOCK because it can generate any combination of alleles from the two parents (in a single crossover event). Mutation is a single event where the allele value at one position in an individual is changed. MOCK uses a special mutation operator called the “nearest-neighbor mutation operator,” where each data item can only be linked to one of its L nearest neighbors. Hence, \(g_{i} \in \left\{ {nn_{i1} , \ldots ,nn_{iL} } \right\}\), where \(nn_{il}\) denotes the lth nearest neighbor of data item i. This reduces the extent of the search space to just \(L^{N}\). To apply this operator, the nearest-neighbor lists must be pre-computed in the initialization phase of the algorithm; the number of neighbors L is a user-defined parameter.
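A minimal sketch of the nearest-neighbor mutation operator, assuming the neighbor lists have already been pre-computed (the function name is ours):

```python
import random

def nearest_neighbor_mutation(genotype, nn_lists, p_m):
    """Mutate each gene with probability p_m to link one of its L nearest neighbors.

    nn_lists[i] is the precomputed list [nn_i1, ..., nn_iL] for data item i.
    """
    return [random.choice(nn_lists[i]) if random.random() < p_m else g
            for i, g in enumerate(genotype)]
```

With p_m = 0 the genotype is returned unchanged; with p_m = 1 every gene is redrawn from its neighbor list.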

3.1.3 Objective functions

MOCK deals with two objectives that must be minimized, one based on the compactness and the other on the connectedness of clusters. To express cluster compactness, the overall deviation of a partitioning is calculated by summing the distances between each data item and the centroid of its cluster. This objective can be mathematically expressed as follows:

$${\text{Dev}}\left( C \right) = \mathop \sum \limits_{{C_{k} \in C}} \mathop \sum \limits_{{i \in C_{k} }} \delta \left( {i,\mu_{k} } \right)$$

where C is the set of all clusters, \(\mu_{k}\) is the centroid of cluster C k and \(\delta \left( {.,.} \right)\) is a distance measure (Euclidean distance). As an objective, overall deviation should be minimized. This criterion is similar to the intra-cluster variance.

For the connectivity among groups, a measure is computed that evaluates the degree to which the neighbors of a data item have been placed in the same cluster as that item. In this objective function, the Euclidean distance is also used to measure the proximity between a data item and its nearest neighbors. Mathematically, it can be expressed as follows:

$${\text{Conn}}\left( C \right) = \mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j = 1}^{L} x_{{i,nn_{i} \left( j \right)}} ,\quad {\text{where}}\quad x_{r,s} = \left\{ {\begin{array}{*{20}c} {\frac{1}{j}} & {{\text{if}}\; {\nexists }C_{k} :r,s \in C_{k} } \\ 0 & {\text{otherwise}} \\ \end{array} } \right.$$

\(nn_{i} \left( j \right)\) is the jth nearest neighbor of datum i and L is a parameter determining the number of neighbors that contribute to the connectivity measure. This objective function should be minimized too.
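Both objective functions can be sketched directly from the formulas above. These are illustrative helpers (names are ours) assuming Euclidean distance and 0-based item indices; in Conn(C), the 1/j penalty uses the 1-based neighbor rank:

```python
import math

def overall_deviation(clusters, data):
    """Dev(C): sum of distances from each item to its cluster centroid."""
    total = 0.0
    dim = len(data[0])
    for members in clusters:
        centroid = [sum(data[i][d] for i in members) / len(members)
                    for d in range(dim)]
        total += sum(math.dist(data[i], centroid) for i in members)
    return total

def connectivity(clusters, nn_lists):
    """Conn(C): penalty 1/j when item i and its jth nearest neighbor are split."""
    label = {}
    for c, members in enumerate(clusters):
        for i in members:
            label[i] = c
    return sum(1.0 / (j + 1)                     # j + 1 = 1-based neighbor rank
               for i, nns in enumerate(nn_lists)
               for j, nb in enumerate(nns)
               if label[i] != label[nb])
```

A compact partitioning minimizes Dev(C); a partitioning that keeps neighbors together minimizes Conn(C), which is why the two objectives pull toward many small clusters and few large clusters, respectively.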

3.2 Proposed approach

As aforementioned, PESA-II is MOCK’s MOEA (upper right-hand side in Fig. 2). However, this work aims to evaluate MOCK’s improvement when using other MOEAs while keeping all its other elements. The MOEAs selected for the comparison are NSGA-II, SPEA-2 and MOEA/D. These three algorithms were selected because they are highly competitive when solving bi-objective optimization problems. Furthermore, NSGA-II and SPEA-2 are state-of-the-art algorithms based on Pareto dominance, while MOEA/D represents a state-of-the-art MOEA based on decomposition. Such a set of algorithms provides evidence about the performance of these two ways of solving the MOP of interest. A brief description of each MOEA compared in this work is presented below.

Pareto envelope-based selection algorithm (PESA): This algorithm was proposed by Corne et al. [19]. It uses two populations of solutions: a small internal population (IP) and a larger external population (EP). The purpose of EP is to exploit good solutions, while IP explores new solutions through the standard EA processes of reproduction and variation. The solutions in EP are stored in “niches,” because PESA uses a hypergrid division of the objective space to maintain diversity. The selection mechanism is based on the crowding measure derived from the hypergrid, which is also used to decide which solutions to introduce into the EP. The second version of PESA, called PESA-II [12], behaves essentially like PESA, except that region-based selection is used. In region-based selection, the unit of selection is a hyperbox rather than an individual: a hyperbox is selected first, and then an individual is randomly selected within it. Algorithm 1 shows PESA-II’s pseudocode; for further details, see [12].

figure a

Non-dominated sorting genetic algorithm II (NSGA-II): This is an improved version of the NSGA algorithm [20]. The algorithm builds a population of competing individuals, ranks and sorts them according to their non-domination level, applies the evolutionary operators to create the offspring, and then combines parents and offspring before partitioning the new combined pool into fronts. The algorithm conducts niching by assigning a crowding distance to each member, which is used in its selection operator to keep a diverse front by ensuring that members stay a crowding distance apart. This keeps the population diverse and helps the algorithm to explore the fitness landscape. The pseudocode of NSGA-II is shown in Algorithm 2.

figure b
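The crowding-distance niching mechanism mentioned above can be sketched as follows. This is a simplified illustration of the NSGA-II density estimator (not the authors' implementation); boundary solutions receive an infinite distance so they are always preserved:

```python
def crowding_distance(front):
    """Crowding distance for a list of objective vectors (NSGA-II style)."""
    n = len(front)
    dist = [0.0] * n
    for obj in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][obj])
        lo, hi = front[order[0]][obj], front[order[-1]][obj]
        dist[order[0]] = dist[order[-1]] = float("inf")  # keep extremes
        if hi == lo:
            continue
        for rank in range(1, n - 1):          # interior points: normalized gap
            i = order[rank]
            dist[i] += (front[order[rank + 1]][obj]
                        - front[order[rank - 1]][obj]) / (hi - lo)
    return dist
```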

Strength Pareto evolutionary algorithm 2 (SPEA-2): This is a revised version of SPEA, whose pseudocode is shown in Algorithm 3. The algorithm uses an external archive containing the non-dominated solutions previously found. At each generation, non-dominated individuals are copied to the external non-dominated set, and a strength value is computed for each individual in it. SPEA-2 has three main differences with respect to its predecessor [21]: (1) It incorporates a fine-grained fitness assignment strategy which takes into account, for each individual, how many individuals it dominates (i.e., its strength) and how many individuals dominate it; (2) it uses a nearest-neighbor density estimation technique which guides the search more efficiently; and (3) it has an enhanced archive truncation method that guarantees the preservation of boundary solutions.

figure c

Multi-objective evolutionary algorithm based on decomposition (MOEA/D): This algorithm relies on a decomposition approach that converts the problem of approximating the PF into a number of single-objective optimization subproblems. In principle, different decomposition approaches can be used for this purpose; however, the most traditional one is the Tchebycheff approach [31].

According to the Tchebycheff approach, let \(\lambda^{1} , \ldots ,\lambda^{N}\) be a set of evenly spread weight vectors and \(z^{*}\) a reference point. The problem of approximating the PF can be decomposed into \(N\) scalar optimization subproblems, where the objective function of the jth subproblem is:

$$g^{te} \left( {x |\lambda^{j} ,z^{*} } \right) = \mathop {\hbox{max} }\limits_{1 \le i \le m} \left\{ {\lambda_{i}^{j} \left| {f_{i} \left( x \right) - z_{i}^{*} } \right|} \right\}$$

where \(\lambda^{j} = \left( {\lambda_{1}^{j} , \ldots ,\lambda_{m}^{j} } \right)^{T}\). In this case, \(g^{te}\) is continuous in \(\lambda\), and the optimal solution of \(g^{te} \left( {x |\lambda^{i} ,z^{*} } \right)\) should be close to that of \(g^{te} \left( {x |\lambda^{j} ,z^{*} } \right)\) if \(\lambda^{i}\) and \(\lambda^{j}\) are close to each other. Therefore, any information about the \(g^{te}\)’s with weight vectors close to \(\lambda^{i}\) should be helpful for optimizing \(g^{te} \left( {x |\lambda^{i} ,z^{*} } \right)\). In MOEA/D, a neighborhood of weight vector \(\lambda^{i}\) is defined as the set of its several closest weight vectors in \(\left\{ {\lambda^{1} , \ldots ,\lambda^{N} } \right\}\). The neighborhood of the ith subproblem consists of all the subproblems whose weight vectors belong to the neighborhood of \(\lambda^{i}\). The population is composed of the best solution found so far for each subproblem, and only the current solutions of its neighboring subproblems are exploited when optimizing a subproblem in MOEA/D. In Algorithm 4, the pseudocode of MOEA/D is presented.
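The Tchebycheff scalarization and the weight-vector neighborhood can be sketched as follows (illustrative helpers; the names are ours):

```python
def tchebycheff(f, weights, z_star):
    """g^te(x | lambda, z*) = max_i lambda_i * |f_i(x) - z*_i|."""
    return max(w * abs(fi - zi) for w, fi, zi in zip(weights, f, z_star))

def weight_neighborhood(weights, i, T):
    """Indices of the T weight vectors closest (Euclidean) to weights[i]."""
    order = sorted(range(len(weights)),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(weights[i], weights[j])))
    return order[:T]
```

During the search, the solution of subproblem i is only compared and updated against the current solutions of the subproblems whose indices are returned by `weight_neighborhood`.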

For this comparison, the MOCK implementation was based on [10]. As in the original MOCK algorithm, an integer encoding was adopted. Regarding population handling, the initialization first generates a population of N individuals, where N is the total number of data items to be grouped, i.e., the number of instances of each data set. However, the population size must then be reduced, since the value assigned to this parameter depends on the number of instances, in the same way as in the original MOCK implementation (see Sect. 4.3). Therefore, a certain number of individuals of such a population are randomly selected so as to decrease its size.

figure d

To generate the three new MOCK versions with NSGA-II, SPEA-2 and MOEA/D, the same variation operators used by PESA-II are adopted to promote a fair comparison among the algorithms. Two constraints were considered: the first refers to the number of clusters in a solution, which must be between 1 and 25, and the second states that a group must have at least two elements. The constraint-handling technique employed was a combination of the feasibility rules proposed by Deb [22] with Pareto dominance for non-dominated sorting [20]. Therefore, the criteria used in pairwise comparisons are the following:

  1. (a)

    Between two feasible solutions, the one which dominates the other is preferred. If both solutions are feasible and non-dominated with respect to each other, one is chosen at random.

  2. (b)

    Between one feasible solution and one infeasible solution, the feasible one is preferred.

  3. (c)

    Between two infeasible solutions, the one with the lowest sum of constraint violation is preferred.
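The pairwise comparison criteria above can be sketched as follows. This is an illustrative helper (names are ours) where each solution is represented as a pair of its objective vector and its total constraint violation (0.0 when feasible):

```python
import random

def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (minimization)."""
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def better(sol_a, sol_b):
    """Feasibility rules combined with Pareto dominance; returns the preferred
    of two (objectives, violation) pairs."""
    (fa, va), (fb, vb) = sol_a, sol_b
    if va == 0.0 and vb == 0.0:               # (a) both feasible
        if dominates(fa, fb):
            return sol_a
        if dominates(fb, fa):
            return sol_b
        return random.choice([sol_a, sol_b])  # non-dominated: pick at random
    if va == 0.0 or vb == 0.0:                # (b) only one feasible
        return sol_a if va == 0.0 else sol_b
    return sol_a if va <= vb else sol_b       # (c) lower total violation wins
```

In this setting, the violation value would accumulate how far a solution is from the cluster-count bounds (1 to 25) and from the minimum group size of two elements.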

4 Methodology, experiments and comparison of results

Four algorithms, MOCKPESA-II (the original MOCK), MOCKNSGA-II, MOCKSPEA-2 and MOCKMOEA/D, are evaluated on seven real data sets from the machine learning repository [23]. The characteristics of each of them are summarized in Table 1, where N is the total number of data items in the data set (instances), N i is the number of items in cluster i, D is the dimensionality (attributes) and k represents the optimal number of clusters in the data set. The first five data sets in Table 1 were chosen because they are the same data sets solved by the original MOCK algorithm [10, 11], and a fair comparison is required to evaluate the performance of the three MOEAs. Seeds and User Knowledge Modeling are data sets commonly used for the clustering task, i.e., for unsupervised learning where the class attribute is unknown. Although the first five data sets are typically used for classification because the class attribute is known, in the experiments carried out in this research the class attribute is not considered in the executions of the algorithms, i.e., all the data sets are treated as data sets for unsupervised learning.

Table 1 Summary of the real data sets adopted in the experiments

Two experiments were designed to evaluate the performance of the four algorithms:

  1. (a)

    A comparison of the original MOCK and the three proposed versions (with NSGA-II, SPEA-2 and MOEA/D) from the evolutionary multi-objective optimization standpoint, taking into account two metrics to determine the performance of the optimization process.

  2. (b)

    A comparison of the most competitive MOCK version against classical clustering algorithms from the machine learning standpoint, taking into account two metrics.

4.1 Evolutionary multi-objective optimization metrics

For comparison purposes, 10 independent runs were carried out by each algorithm on each data set in Table 1. The Pareto fronts generated in the 10 runs are used to compute the values of two performance metrics, which are described next:

  • Hypervolume (HV): Given a Pareto front approximation PFknown and a reference point z ref in objective space, this performance measure estimates the hypervolume attained by it [1]. The hypervolume corresponds to the non-overlapping volume of all the hypercubes formed by the reference point (z ref) and every vector in the Pareto set approximation. This is formally defined as:

    $${\text{HV}} = \mathop {\bigcup }\limits_{i} \left\{ {{\text{vol}}_{i} \left| {{\text{vec}}_{i} \in {\text{PF}}_{\text{known}} } \right.} \right\}$$

    where vec i is a non-dominated vector from the suboptimal Pareto set and vol i is the volume for the hypercube formed by the reference point and the non-dominated vector vec i .

  • Two-set coverage (C-metric): This binary performance measure estimates the coverage proportion, in terms of the percentage of dominated solutions, between two Pareto set approximations [1]. Given two sets X′ and X″, both containing only feasible non-dominated solutions, the C-metric is formally defined as:

    $$C\left( {X^{\prime},X^{\prime\prime}} \right) = \frac{{\left| {\left\{ {a^{\prime\prime} \in X^{\prime\prime};\exists a^{\prime} \in X^{\prime}:a{ \succcurlyeq }a^{\prime\prime}} \right\}} \right|}}{{\left| {X^{\prime\prime}} \right|}}$$

    If all the points in X′ dominate or are equal to all points in X″, then by definition C = 1; C = 0 when no solution in X″ is dominated by a solution in X′. The C-metric value represents the portion of solutions in X″ that are dominated by at least one solution in X′.
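The C-metric can be sketched directly from its definition. This is an illustrative helper assuming minimization and using weak dominance, i.e., "dominates or is equal to":

```python
def weakly_dominates(f1, f2):
    """True if f1 is at least as good as f2 in every objective (minimization)."""
    return all(a <= b for a, b in zip(f1, f2))

def c_metric(X1, X2):
    """C(X', X''): fraction of X'' weakly dominated by some member of X'."""
    covered = sum(1 for b in X2 if any(weakly_dominates(a, b) for a in X1))
    return covered / len(X2)
```

Note that the measure is not symmetric: C(X′, X″) and C(X″, X′) must both be computed to compare two fronts.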

4.2 Clustering metrics

Regarding the comparison of the four versions of MOCK (MOCKPESA-II, MOCKNSGA-II, MOCKSPEA-2 and MOCKMOEA/D) against clustering algorithms (an agglomerative hierarchical clustering algorithm, the k-means algorithm and a meta-clustering algorithm), two different metrics were used to evaluate the performance of algorithms: F-measure and silhouette coefficient.

  • F-measure: This measure requires knowledge of the correct class labels of each instance of a given data set, which are compared with the model generated by a given clustering algorithm. Let \(n_{i,j}\) be the number of elements of class i within cluster j. For each class i and cluster j, precision and recall are defined as \(p\left( {i,j} \right) = \frac{{n_{i,j} }}{{n_{j} }}\) and \(r\left( {i,j} \right) = \frac{{n_{i,j} }}{{n_{i} }}\), respectively, and the corresponding value of the F-measure is:

    $$F\left( {i,j} \right) = \frac{{\left( {b^{2} + 1} \right)*p\left( {i,j} \right)*r\left( {i,j} \right)}}{{b^{2} *p\left( {i,j} \right) + r\left( {i,j} \right)}}$$

    where b = 1, to obtain equal weighting for p(i,j) and r(i,j). The overall F-measure for a partitioning is computed as:

    $$F = \mathop \sum \limits_{i} \frac{{n_{i} }}{n}\mathop {\hbox{max} }\limits_{j} \left\{ {F\left( {i,j} \right)} \right\}$$

    where n is the total size of the data set. F is limited to the interval [0,1], and it should be maximized. Therefore, a value closer to 1 indicates a better result.

  • Silhouette coefficient: This cluster quality measure is based on the idea that a “good” clustering should consist of well-separated, cohesive clusters, and it is defined as follows. Given a partitioning P of a set of N objects into k clusters, consider any fixed object i and let C i denote the set of indices of all objects clustered together with object i. As a cohesion measure for cluster C i, take the average dissimilarity between object i and all other objects in C i :

    $$a\left( i \right) = \frac{1}{{n_{i} }} \mathop \sum \limits_{{j \in C_{i} }} d_{ij}$$

    where \(d_{ij}\) is the dissimilarity between objects i and j and \(n_{i}\) is the number of objects in cluster C i . To characterize the separation between clusters, let K l denote the lth neighboring cluster, distinct from C i , for \(l = 1, 2, \ldots ,k - 1\). Define \(b\left( i \right)\) as the average dissimilarity between object i in cluster C i and the objects in the closest neighboring cluster.

    $$b\left( i \right) = \mathop {\hbox{min} }\limits_{l} \frac{1}{{n_{l} }}\mathop \sum \limits_{{j \in K_{l} }} d_{ij}$$

    The silhouette coefficient s(i) for object i is defined as the normalized difference:

    $$s\left( i \right) = \frac{b\left( i \right) - a\left( i \right)}{{{ \hbox{max} }\left\{ {a\left( i \right),b\left( i \right)} \right\}}}$$

    For the silhouette coefficient \(s\left( i \right)\) to be well defined, P must contain at least two clusters and every cluster in P must contain at least two objects.

    According to the authors of this quality measure, the silhouette coefficient should be interpreted as follows [32]: values between 0.70 and 1.00 indicate a strong structure; values between 0.51 and 0.70, a reasonable structure; values between 0.26 and 0.50, a weak structure (other methods should be tried); and values of 0.25 or below indicate that no substantial structure was found.
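The silhouette coefficient of a partitioning can be sketched from the definitions above. This is an illustrative helper (names are ours) assuming Euclidean dissimilarity; it returns the mean s(i) over all objects:

```python
import math

def silhouette(data, clusters):
    """Mean silhouette coefficient of a partitioning (Euclidean dissimilarity)."""
    def avg_dist(i, members):
        others = [j for j in members if j != i]
        return (sum(math.dist(data[i], data[j]) for j in others) / len(others)
                if others else 0.0)

    scores = []
    for c, members in enumerate(clusters):
        for i in members:
            a = avg_dist(i, members)                     # cohesion a(i)
            b = min(avg_dist(i, other)                   # separation b(i)
                    for k, other in enumerate(clusters) if k != c)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Well-separated, cohesive clusters yield values close to 1, while a partitioning that splits natural groups pushes the mean toward 0 or below.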

For this experiment, 10 independent runs were also carried out by each algorithm. In this case, it is worth considering that multi-objective clustering algorithms (MOCKPESA-II, MOCKNSGA-II, MOCKSPEA-2 and MOCKMOEA/D) return many clustering solutions, so that for each independent execution a solution of the Pareto front must be selected. The solution chosen for this experiment is the solution with the highest value of F-measure, and therefore the results represent the best model obtained. This experiment is designed to report the statistical results of F-measure and silhouette coefficient for each algorithm. The value of the silhouette coefficient can vary between −1 and 1.

4.3 Parameter tuning

To obtain suitable parameters for each MOCK version, the Irace tool (iterated racing algorithm for automatic configuration) was used [24, 25], with the hypervolume value as the performance measure for the tuning process. Three of MOCK’s parameters were calibrated: the number of generations, L (the number of nearest neighbors) and the crossover probability. The rest of the parameters were not tuned; instead, the values proposed in [10, 11] were adopted, since they depend on the number of instances of each data set.

The two configurations of parameter values obtained after the tuning process with Irace, as well as the values of the remaining parameters taken from the literature, are given in Table 2. Note that the value of the external population size differs among MOCKPESA-II, MOCKSPEA-2 and MOCKMOEA/D; in MOCKNSGA-II, this parameter is not needed.

Table 2 Parameter values; those obtained by the Irace tool are shown in bold

4.4 Results

The results shown below are distributed as follows:

  • Results of the first experiment: For each parameter configuration in Table 2, the Pareto fronts obtained by each algorithm are shown, as well as the statistical values of the hypervolume and two-set coverage metrics. Finally, the results of the Wilcoxon test on those results are provided to get confidence on the differences observed in the samples of runs.

  • Results of the second experiment: Similar to the first experiment, the results with the two parameter configurations are presented, but in this case results of the F-measure and silhouette coefficient are provided.

4.4.1 Results of the first experiment

The Pareto fronts obtained by each algorithm on the run located in the median value of the hypervolume metric are presented in Figs. 4 and 5 for each one of the two parameter configurations in Table 2.

Fig. 4
figure 4

Pareto fronts obtained by the four compared MOCK versions on each data set by using the first parameter configuration

Fig. 5
figure 5

Pareto fronts obtained by the four compared MOCK versions on each data set by using the second parameter configuration

From the plots in Figs. 4 and 5, it can be seen that the three new versions proposed in this work (MOCKNSGA-II, MOCKSPEA-2 and MOCKMOEA/D) obtained better overall results than the original algorithm (MOCKPESA-II). In these plots, it can also be observed that the distributions of solutions of MOCKSPEA-2 and MOCKMOEA/D are similar, while MOCKNSGA-II presents the best distribution along the Pareto front. However, such a visual comparison must be complemented with the metric values.

The results for the hypervolume metric on 10 independent runs are summarized in Tables 3 and 4, where the best result is marked in bold. The reference point used to compute this metric was (0, 0). Under this setting, a lower hypervolume value indicates a better result.
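Since both objectives are minimized and (0, 0) acts as the reference (ideal) point, the hypervolume can be read as the area enclosed between the front and the origin, which explains why lower values are better here. The following is a minimal 2-D sketch of that computation, assuming a non-dominated front of objective pairs; the function name and point format are illustrative, not taken from the paper:

```python
def hypervolume_to_ideal(front):
    """Area of the union of the rectangles spanned by each point and the
    reference point (0, 0). Both objectives are minimized, so a smaller
    area means the front lies closer to the ideal point."""
    # Sort by the first objective; on a non-dominated front for
    # minimization, the second objective then decreases monotonically.
    pts = sorted(front)
    area, prev_x = 0.0, 0.0
    for x, y in pts:
        # Each vertical strip between consecutive x-values is capped by
        # the current point's second objective.
        area += (x - prev_x) * y
        prev_x = x
    return area
```

For example, the front {(1, 3), (2, 2), (3, 1)} yields an area of 6.0, while a front closer to the origin yields a smaller value.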

Table 3 Statistical results of the hypervolume metric for each data set using the first parameter configuration
Table 4 Statistical results of the hypervolume metric for each data set using the second parameter configuration

The results in Tables 3 and 4 suggest that MOCKNSGA-II outperformed the other three algorithms, obtaining better hypervolume values. On the other hand, the hypervolume results of MOCKMOEA/D are slightly better than those of MOCKSPEA-2 in most data sets. The first configuration obtained better values in the Iris, Wisconsin and User Knowledge Modeling data sets, while the second configuration obtained better results in the Dermatology, Wine and Seeds data sets. The results of the 95 % confidence Wilcoxon test [26] are presented in Tables 5 and 6.

Table 5 Wilcoxon test results for the hypervolume metric using the first parameter configuration
Table 6 Wilcoxon test results for the hypervolume metric using the second parameter configuration

According to the Wilcoxon test results for the hypervolume metric, the differences observed in most samples of runs are significant. Therefore, MOCKNSGA-II outperformed the other three algorithms based on this metric. However, in the case of the first configuration, for the Wisconsin data set, the statistical test shows that there are no significant differences between MOCKSPEA-2 and MOCKMOEA/D; it can thus be concluded that these two algorithms behave similarly in this case.

To complement the results obtained with the hypervolume metric, the statistical results obtained by the C-metric are summarized in Tables 7 and 8. These results are based on all pairwise combinations of the 10 independent runs executed by each algorithm (each one of the ten Pareto fronts of one algorithm was compared with each one of the ten fronts obtained by the other algorithm). The 95 % confidence Wilcoxon test was used to validate the differences observed in the samples of runs.

Table 7 Statistical results of the two-set coverage metric for each data set using the first parameter configuration; SD means standard deviation; the symbol ✓ means that there is significant difference
Table 8 Statistical results of the two-set coverage metric for each data set using the second parameter configuration; SD means standard deviation; the symbol ✓ means that there is significant difference

For this metric, C(A, B) takes values in [0, 1]: it equals 1 when every solution of algorithm B is dominated by or equal to some solution of algorithm A, and 0 when no solution of B is covered by A. Hence, to conclude that one algorithm is better than another, values of C close to 1 are preferable.
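As a sketch, the two-set coverage value C(A, B) for a minimization problem can be computed as the fraction of solutions in B that are dominated by, or equal to, at least one solution in A; the function and variable names below are illustrative, and each front is given as a list of objective tuples:

```python
def dominates(a, b):
    """True if solution a dominates solution b in a minimization problem:
    a is no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def coverage(A, B):
    """C(A, B): fraction of solutions in B dominated by or equal to some
    solution in A. 1 means A covers all of B; 0 means it covers none."""
    covered = sum(1 for b in B if any(dominates(a, b) or a == b for a in A))
    return covered / len(B)
```

Note that C is not symmetric: C(A, B) and C(B, A) must both be reported, which is why the tables list all pairwise comparisons.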

The results in Tables 7 and 8 show that in all cases the algorithms proposed in this research (MOCKNSGA-II, MOCKSPEA-2, MOCKMOEA/D) outperformed the original algorithm (MOCKPESA-II) according to the C-metric values. Considering only the proposed algorithms, MOCKNSGA-II performs slightly better than MOCKSPEA-2 and MOCKMOEA/D. The Wilcoxon test indicates that in all comparisons of the second configuration the differences observed are significant. Regarding the first configuration, in the case of the Wine data set, MOCKSPEA-2 and MOCKMOEA/D behave similarly according to the C-metric, i.e., the solutions of both algorithms have a similar distribution along the Pareto front.

4.4.2 Results of the second experiment

In order to evaluate the performance of the proposed algorithms in terms of the quality of the clustering obtained, a comparison was made against different clustering algorithms using the F-measure and silhouette coefficient. For the three methods used in the comparison, the number of clusters (k) must be specified a priori; the k values used are given in Table 1. This experiment is interesting because it provides information on the ability of the algorithms to find high-quality solutions, with the additional advantage that, unlike classical clustering algorithms, the expert is given the opportunity to examine a number of alternative solutions.

The mean values of the F-measure and silhouette coefficient over 10 independent runs are summarized in Tables 9, 10, 11 and 12. The best result is shown in bold.
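As a reference, a common way to compute an overall F-measure for a partition against the class attribute is to take, for each true class, the best F1 score over all clusters and weight it by class size. The paper does not spell out its exact variant, so the sketch below is one standard formulation with illustrative names:

```python
from collections import Counter

def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure of a partition against
    the true class labels (one common external validity definition)."""
    n = len(true_labels)
    classes = Counter(true_labels)                  # class -> size
    clusters = Counter(cluster_labels)              # cluster -> size
    joint = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_cls in classes.items():
        best = 0.0
        for clu, n_clu in clusters.items():
            nij = joint[(cls, clu)]                 # overlap of class and cluster
            if nij == 0:
                continue
            prec = nij / n_clu                      # precision of cluster w.r.t. class
            rec = nij / n_cls                       # recall of cluster w.r.t. class
            best = max(best, 2 * prec * rec / (prec + rec))
        total += (n_cls / n) * best                 # weight by class size
    return total
```

A perfect partition (each cluster coinciding with one class) yields 1.0, which is why the measure requires the class attribute to be available.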

Table 9 Average F-measure value on ten independent runs, using the first parameter configuration
Table 10 Average silhouette coefficient value on ten independent runs, using the first parameter configuration
Table 11 Average F-measure value on ten independent runs, using the second parameter configuration
Table 12 Average silhouette coefficient value on ten independent runs, using the second parameter configuration

Based on the results observed in Tables 9 and 11, the four versions of MOCK outperformed the other clustering algorithms. According to the F-measure, MOCKNSGA-II had a better performance with respect to MOCKSPEA-2, MOCKMOEA/D and the original algorithm in all data sets.

The results obtained in this second experiment agree with those of the first experiment, where the evolutionary multi-objective performance metrics suggested that the three new MOCK versions outperformed the original one. Moreover, based on the F-measure, the three new versions also provided highly competitive results with respect to the other clustering algorithms while keeping the advantages of MOCK. The F-measure results for the Seeds and User Knowledge Modeling data sets were not considered, because this metric requires the class attribute to compute its value and these data sets are intended for unsupervised learning.

The poor F-measure results of the traditional methods reflect their known failure modes on certain data sets: for instance, with clusters of varied shapes and dimensions, k-means and agglomerative hierarchical clustering fail on data sets with elongated cluster shapes. In the case of the Iris, Dermatology and Yeast data sets, the F-measure values obtained by the classical clustering algorithms are significantly lower. This is because in these data sets the actual cluster borders (according to the true class labels) are not discernible from the structure, making it difficult to identify the correct number of clusters. The opposite happens in the case of the Wine and Wisconsin data sets, where the cluster structure is clear.

Regarding the silhouette coefficient results in Tables 10 and 12, it can be observed that, for the two parameter configurations obtained by the Irace tool, all MOCK versions find clusters with a strong structure in most data sets. However, in the case of the Yeast data set, the original version (MOCKPESA-II) finds clusters with only a reasonable structure. In contrast, the traditional techniques (hierarchical clustering, k-means and meta-clustering) perform poorly compared with the evolutionary approaches. Finally, it is worth remarking that, based on the silhouette measure, the most competitive MOCK version is MOCKNSGA-II.
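For context, the silhouette coefficient behind these "strong" versus "reasonable" structure judgments compares, for every object, its mean intra-cluster distance a with the mean distance b to the nearest other cluster, averaging (b - a)/max(a, b) over all objects. A minimal pure-Python sketch (Euclidean distance; function and argument names are illustrative):

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette(points, labels):
    """Mean silhouette coefficient over all points. Values close to 1
    indicate a strong cluster structure; values around 0.5 only a
    reasonable one."""
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, lj) in enumerate(zip(points, labels))
               if lj == li and j != i]
        if not own:                  # singleton cluster: s = 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)      # mean intra-cluster distance
        b = min(                     # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lj in zip(points, labels) if lj == c) /
            sum(1 for lj in labels if lj == c)
            for c in set(labels) if c != li)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Unlike the F-measure, this index needs no class labels, which is why it can be reported for all data sets.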

Regarding the performance of MOCK's self-determination of k, and considering the competitive F-measure values obtained, the four MOCK versions, including the original one, found the appropriate k value according to Table 1.

5 Conclusions and future work

In this work, an improvement of the MOCK algorithm based on the use of different evolutionary multi-objective algorithms was presented. Three new versions of MOCK based on NSGA-II, SPEA-2 and MOEA/D were proposed and analyzed. A comparative study based on two standpoints (evolutionary multi-objective optimization metrics and clustering metrics) was carried out on data sets with different features. The results indicate that the three new MOCK versions are viable alternatives, since their performance is highly competitive with respect to the original MOCK according to the hypervolume and C-metric. Furthermore, the three proposed versions outperformed other clustering algorithms based on the F-measure and the silhouette coefficient. These results also confirm that the evolutionary approach can obtain favorable results in the clustering task with more advantages than traditional clustering algorithms, such as the fact that the value of k (the number of groups) does not need to be specified a priori. An interesting finding of this research is that NSGA-II, based on Pareto dominance, worked better in this particular MOP than a decomposition-based algorithm like MOEA/D. Finally, SPEA-2 (also Pareto based) was competitive as well, but was unable to outperform NSGA-II.

Part of the future work lies in the study of different variation operators within the improved MOCK, in this case using other operators for integer encoding. Other recent MOEAs, e.g., NSGA-III, DEMO and SMS-EMOA, will also be incorporated to analyze whether they further improve MOCK. Furthermore, the algorithms will be tested on other data sets, and the three-objective case, now also maximizing the distances among clusters, will be solved and studied.