1 Introduction

Given a finite set of objects, cluster analysis aims to group them such that the most similar ones belong to the same group or cluster, and dissimilar objects are assigned to different clusters. In the scientific literature, the objects are also called entities or patterns and are usually represented as a vector of measurements or a point in a multidimensional space. Clearly, the greater the similarity within a cluster and the greater the dissimilarity between clusters, the better the clustering task has been performed.

Starting from the 1990s, cluster analysis has emerged as an important interdisciplinary field, involving different scientific research communities, including mathematics, theoretical and applied statistics, genetics, biology, biochemistry, computer science, and engineering. It has been applied to several domains with numerous applications, ranging from social sciences to biology. Starting from one of the pioneering papers, by Rao in 1971 [60], more recent surveys on clustering algorithms and their applications can be found in [8–11, 21, 22, 40, 41, 49, 53, 77], and very recently in [29, 45, 71]. Nice books and edited books are [12, 56, 57, 76].

In cluster analysis, the criterion for evaluating the quality of a clustering strongly depends upon the specific application in which it is to be used. The clustering task can be mathematically formulated as a constrained fractional nonlinear 0–1 programming problem, and there are no computationally efficient procedures for solving such a problem, which becomes computationally tractable only under restrictive hypotheses.

This paper surveys the main types of clustering and criteria for homogeneity or separation, with special emphasis on the optimization and operations research perspective. In fact, first a few mathematical models of the problem are reported under several different criteria adopted to classify the data, and some classical state-of-the-art exact methods that use these mathematical models are described. Then, the most famous and widely applied state-of-the-art clustering algorithms are reviewed, underlining and comparing their main ingredients.

The remainder of this paper is organized as follows. The next section lists some among the most useful applications of the problem. In Sect. 3, the cluster analysis task is formally stated and the most used distance measures between the various entities are described. State-of-the-art mathematical formulations of the problem along with some classical state-of-the-art exact methods are described in Sect. 4. In Sect. 5, properties and state-of-the-art solution approaches are discussed. Concluding remarks are given in the last section.

2 Applications

Cluster analysis applies in several heterogeneous domains, and the number of its applications has grown steadily in recent years. As increasingly sophisticated technology allows the storage of ever larger amounts of data, the availability of efficient techniques for generating new information by examining the resulting large databases becomes ever more urgent.

Some applications of cluster analysis are listed in the following.

  • Social sciences: in this context, clustering helps in understanding how people analyze and catalogue life experiences.

  • Information retrieval: in this context, clustering is used to create groups of documents with the goal of improving the efficiency and effectiveness of the retrieval, such as in the case of thousands of Web pages retrieved as a result of a query to a search engine (see, e.g., [16–18, 74]).

  • Natural language processing: typically, large vocabularies of words of a given natural language must be clustered with respect to very large corpora [70].

  • Business: in this application, one can be interested in analyzing information about potential customers, in order to cluster them for some sort of marketing activities [58].

  • Galaxy formation: in this context, a study has been conducted on the formation of galaxies by gas condensation with massive dark halos [75].

  • Image segmentation: in this case, the segmentation is achieved by searching for closed contours of the elements in the image [73].

  • Biological data: one among the most prominent interests of biologists is the analysis of huge data sets containing genetic information, such as finding groups of genes having similar functions (see among others [3, 20, 28, 30, 50, 54, 55]).

3 Problem Definition and Distance Measures

In cluster analysis, we are given

  • a set of N objects (entities, patterns) \(O =\{ o_{1},\ldots,o_{N}\}\);

  • a set of M pre-assigned clusters \(S =\{ S_{1},\ldots,S_{M}\}\);

  • a function d: O × O → R that assigns to each pair \(o_{i},o_{j} \in O\) a “metric distance” or “similarity” d ij  ∈ R (usually, d ij  ≥ 0, d ii  = 0, \(d_{ij} = d_{ji}\), for i, j = 1, …, N)

and the objective is to assign the objects in O to clusters in S while optimizing some distance criterion, in such a way that the greater the similarity (or proximity, homogeneity) within a cluster and the greater the difference between clusters, the better or more distinct the clustering [38, 40].

According to [38] and [40], the problem involves the following five steps:

  1. pattern representation (optionally including feature extraction and/or selection);

  2. definition of a pattern proximity (or similarity) measure appropriate to the data domain;

  3. clustering or grouping;

  4. data abstraction (if needed), and

  5. assessment of output (if needed).

In more detail, the first step, usually referred to as pattern representation, refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Typical of this first step are the process of feature selection, i.e., the identification of the most effective subset of the original features to use in clustering, and the process of feature extraction, i.e., the transformations of the input features to produce new salient features.

Pattern proximity (similarity) is usually measured by a distance function defined on pairs of patterns. In the scientific literature, the patterns or data objects are usually represented as a vector of measurements or a point in a multidimensional space.

Formally, a data object o i , i = 1, …, N, can be represented as the following numerical vector

$$\displaystyle{\overrightarrow{A}_{i} =\{ a_{ij}\ \vert \ 1 \leq j \leq L\},}$$

where

  • a ij is the value of the jth feature for the ith data object and

  • L is the number of features.

Then, the proximity d ij between two objects o i and o j is measured by a proximity function d of corresponding vectors \(\overrightarrow{A}_{i}\) and \(\overrightarrow{A}_{j}\).

Several different scientific communities have used and discussed a variety of distance measures (see, for example, [4, 37, 38, 41]). Some of them are listed in the following.

3.1 Euclidean Distance

Given two objects o i and o j  ∈ O, their Euclidean distance in L-dimensional space is defined as follows:

$$\displaystyle{ d_{ij} = \sqrt{\sum _{k=1 }^{L }(a_{ik } - a_{jk } )^{2}} =\| \mathbf{a}_{i} -\mathbf{a}_{j}\|_{2}. }$$
(1)

Euclidean distance has an intuitive meaning and it is usually used to evaluate the proximity of objects in two- or three-dimensional space. In general, it works well when the data set has “compact” or “isolated” clusters [51].

3.2 Pearson’s Correlation Coefficient

The Pearson’s correlation coefficient measures the similarity between the shapes of two patterns, also known as profiles.

Given two objects o i and o j  ∈ O, their Pearson’s correlation coefficient is defined as follows:

$$\displaystyle{ d_{ij} = \frac{\sum _{k=1}^{L}(a_{ik} -\mu _{o_{i}}) \cdot (a_{jk} -\mu _{o_{j}})} {\sqrt{\sum _{k=1}^{L}(a_{ik} -\mu _{o_{i}})^{2}} \cdot \sqrt{\sum _{k=1}^{L}(a_{jk} -\mu _{o_{j}})^{2}}}, }$$
(2)

where \(\mu _{o_{i}}\) and \(\mu _{o_{j}}\) are the mean value for \(\overrightarrow{A}_{i}\) and \(\overrightarrow{A}_{j}\), respectively.

This correlation coefficient views each object as a random variable with L observations and measures the similarity between two objects by calculating the linear relationship between the distributions of the two corresponding random variables. One drawback of the Pearson’s correlation coefficient is that it assumes an approximate Gaussian distribution of the patterns and may not be robust for non-Gaussian distributions, as experimentally shown by Bickel in 2001 [7].

3.3 City-Block or Manhattan

City-block or Manhattan distance simulates the distance between points in a city road grid. It sums the absolute differences between the attributes of the two objects.

Given two objects o i and o j  ∈ O, their City-block or Manhattan distance is defined as follows:

$$\displaystyle{ d_{ij} = \sum _{k=1}^{L}\vert a_{ ik} - a_{jk}\vert. }$$
(3)

3.4 Cosine or Uncentered Correlation

Cosine or uncentered correlation is a geometric correlation defined by the angle between two objects.

Given two objects o i and o j  ∈ O, their cosine or uncentered correlation is defined as follows:

$$\displaystyle{ D_{ij} = \frac{\sum _{k=1}^{L}a_{ik} \cdot a_{jk}} {\sqrt{\sum _{k=1}^{L}a_{ik}^{2}} \cdot \sqrt{\sum _{k=1}^{L}a_{jk}^{2}}}. }$$
(4)

Note that,

  • the larger the value of D ij , the smaller the angle between the objects;

  • D ij  ∈ [−1, 1]: D ij  = 0 implies that the angle between the vectors representing o i and o j is a right angle; \(D_{ij} = -1\) implies that the two vectors point in opposite directions; while D ij  = 1 implies that the angle between o i and o j  is 0;

  • \(d_{ij} = 1 -\vert D_{ij}\vert\).
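To make the four proximity measures above concrete, the following is a minimal NumPy sketch (function names are illustrative, not taken from any particular library) computing Eqs. (1)–(4) for two feature vectors of length L.

```python
import numpy as np

def euclidean(a_i, a_j):
    # Eq. (1): L2 norm of the difference
    return np.linalg.norm(a_i - a_j)

def pearson(a_i, a_j):
    # Eq. (2): correlation of the mean-centered vectors
    ci, cj = a_i - a_i.mean(), a_j - a_j.mean()
    return (ci @ cj) / (np.linalg.norm(ci) * np.linalg.norm(cj))

def manhattan(a_i, a_j):
    # Eq. (3): sum of absolute coordinate differences
    return np.abs(a_i - a_j).sum()

def cosine(a_i, a_j):
    # Eq. (4): cosine of the angle between the two vectors
    return (a_i @ a_j) / (np.linalg.norm(a_i) * np.linalg.norm(a_j))

# Example: two objects described by L = 4 features.
a1, a2 = np.array([1.0, 2.0, 0.5, 3.0]), np.array([0.8, 2.2, 0.4, 2.9])
print(euclidean(a1, a2), pearson(a1, a2), manhattan(a1, a2), cosine(a1, a2))
```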

4 Mathematical Formulations of the Problem

In 1971, Rao [60] presented several mathematical formulations of the clustering problem, depending on several different criteria adopted to classify the data. Some of them are reported in the following.

4.1 Minimize (Maximize) the Within (Between)-Clusters Sum of Squares

Defining a set of N × M Boolean decision variables x ik  ∈ { 0, 1}, i = 1, …, N, k = 1, …, M, such that in a solution

$$\displaystyle{x_{ik} = \left \{\begin{array}{ll} 1,&\text{if}\ o_{i} \in O\ \text{is in cluster}\ S_{k}; \\ 0,&\text{otherwise}, \end{array} \right.}$$

the problem admits the following fractional nonlinear mathematical formulation:

$$\displaystyle{\begin{array}{llll} \text{(DC-1)} &\min &\sum _{k=1}^{M}\left \{\frac{\sum _{i=1}^{N-1}\ \sum _{ j=i+1}^{N}d_{ ij}^{2}\ x_{ ik}\ x_{jk}} {\sum _{i=1}^{N}x_{ik}} \right \} \\ &\text{s.t.} \\ (\text{DC-1.1})& &\sum _{k=1}^{M}x_{ ik} = 1, &i = 1,\ldots,N\\ \\ (\text{DC-1.2})& &x_{ik} \geq 0\ \text{and integer}, &i = 1,\ldots,N,\ k = 1,\ldots,M.\end{array} }$$

Constraints (DC-1.1) assure that each o i , i = 1, …, N, belongs to only one cluster.
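As an illustration, the (DC-1) objective can be evaluated for a fixed assignment in a few lines of Python; the sketch below assumes d is an N × N symmetric distance matrix and encodes the x variables by an array cluster with cluster[i] = k iff x ik  = 1 (an encoding choice of ours, not part of the model itself).

```python
def dc1_objective(d, cluster, M):
    """Within-cluster sum of squared distances, each divided by the cluster size,
    i.e., the (DC-1) objective for the assignment encoded in `cluster`."""
    total = 0.0
    for k in range(M):
        members = [i for i in range(len(d)) if cluster[i] == k]
        if not members:
            continue
        within = sum(d[i][j] ** 2 for a, i in enumerate(members) for j in members[a + 1:])
        total += within / len(members)
    return total
```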

Since (DC-1) is a fractional nonlinear 0–1 programming problem, it is difficult to solve. Exceptions are the two cases described in the following two paragraphs.

4.1.1 Cardinality of Each Cluster A Priori Known

A special case of problem (DC-1) occurs when the cardinality of each cluster is a priori known, i.e., when

$$\displaystyle{\vert S_{k}\vert = n_{k},\ \text{s.t.}\quad \sum _{k=1}^{M}n_{ k} = N.}$$

In this case, mathematical formulation (DC-1) can be slightly modified to take into account this further information as follows:

$$\displaystyle{\begin{array}{llll} \text{(DC-2)} &\min &\sum _{k=1}^{M}\ \frac{1} {n_{k}}\left \{\sum _{i=1}^{N-1}\ \sum _{ j=i+1}^{N}d_{ ij}^{2}\ x_{ ik}\ x_{jk}\right \} \\ &\text{s.t.} \\ (\text{DC-2.1})& &\sum _{k=1}^{M}x_{ ik} = 1, &i = 1,\ldots,N\end{array} }$$
$$\displaystyle{\begin{array}{llll} (\text{DC-2.2})&&\sum _{i=1}^{N}x_{ ik} = n_{k}, &\ \ \ \ k = 1,\ldots,M\\ \\ (\text{DC-2.3})&&x_{ik} \geq 0\ \text{and integer},&\ \ \ \ i = 1,\ldots,N,\ k = 1,\ldots,M. \end{array} }$$

In formulation (DC-2), the objective remains to minimize (maximize) the within (between)-clusters sum of squares. As in (DC-1), constraints (DC-2.1) guarantee that each object o i , i = 1, …, N, belongs to only one cluster. The additional set of constraints (DC-2.2) imposes that for each cluster S k , k = 1, …, M, the number of objects in S k is equal to n k .

(DC-2) is still a nonlinear 0–1 formulation, but its objective function has lost its fractional character, since here \(\{n_{k}\}_{k=1}^{M}\) are known in advance. Among the first approaches to solve this problem are the Boolean methods proposed by Hammer and Rudeanu in [35], to be applied once the clustering problem is interpreted as a constrained nonlinear Boolean programming problem. Another pioneering approach has been described by Rao in [60]. It first linearizes the objective function by adding a further set of constraints. Then, it solves the resulting problem by applying any known method for linear integer problems. The resulting linearized 0–1 problem admits the following formulation:

$$\displaystyle{\begin{array}{llll} \text{(DC}'\text{-2)} &\min &\sum _{k=1}^{M}\ \frac{1} {n_{k}}\left \{\sum _{i=1}^{N-1}\ \sum _{ j=i+1}^{N}d_{ ij}^{2}\ y_{ ij}^{k}\right \} \\ &\text{s.t.} \\ (\text{DC}'\text{-2.1})& &x_{ik} + x_{jk} - y_{ij}^{k} \leq 1, &\ \ i = 1,\ldots,N - 1,\ j = i + 1,\ldots,N,\end{array} }$$
$$\displaystyle{\begin{array}{llll} && &\ \ \ k = 1,\ldots,M \\ (\text{DC}'\text{-2.2})&&\sum _{k=1}^{M}x_{ ik} = 1, &\ \ \ i = 1,\ldots,N\\ \\ (\text{DC}'\text{-2.3})&&\sum _{i=1}^{N}x_{ ik} = n_{k}, &\ \ \ k = 1,\ldots,M\\ \\ (\text{DC}'\text{-2.4})&&x_{ik},\ y_{ij}^{k} \geq 0\ \text{and integer},&\ \ \ i,j = 1,\ldots,N,\ k = 1,\ldots,M.\end{array} }$$

It must be underlined that, although easier to solve, formulation (DC\(^{{\prime}}\)-2) has a number of constraints that grows rapidly with N and M and can therefore only be used for small instances of the problem.

4.1.2 Bipartition of the Patterns

Another special case occurs when M = 2, i.e., when there are only two pre-assigned clusters S 1 and S 2. In this particular scenario, introducing N Boolean decision variables x i  ∈ { 0, 1}, i = 1, …, N, such that in a solution

$$\displaystyle{x_{i} = \left \{\begin{array}{ll} 1,&\text{if}\ o_{i} \in O\ \text{is in cluster}\ S_{1}; \\ 0,&\text{if}\ o_{i} \in O\ \text{is in cluster}\ S_{2}, \end{array} \right.}$$

the clustering task can be modeled as the following fractional nonlinear 0–1 programming problem with no additional constraints:

$$\displaystyle{\begin{array}{llll} \text{(DC-3)} &\min &\left \{\frac{\sum _{i=1}^{N-1}\ \sum _{ j=i+1}^{N}d_{ ij}^{2}\ x_{ i}\ x_{j}} {\sum _{i=1}^{N}x_{ i}} + \frac{\sum _{i=1}^{N-1}\ \sum _{ j=i+1}^{N}d_{ ij}^{2}\ (1 - x_{ i})\ (1 - x_{j})} {N-\sum _{i=1}^{N}x_{ i}} \right \} \\ &\text{s.t.} \\ (\text{DC-3.1})& &x_{i} \in \{ 0,1\},\ i = 1,\ldots,N.\end{array} }$$

Based on the above mathematical formulation, Rao [60] designed an exact method that he presented as a Branch and Bound algorithm, but that is essentially a method for exploring the feasible region by making use of appropriate lower bounds obtained by means of some relaxation. The root of the decision tree corresponds to the empty solution, in which no object has been assigned yet. An object o i is selected and a branch is emanated: for convenience, the assignment of o i to cluster S 1 corresponds to the right branch and that to cluster S 2 to the left branch. At a generic iteration, each current leaf of the tree corresponds to a partial solution characterized by some objects assigned to S 1, some assigned to S 2, while others are yet to be assigned.

Generally speaking, in order to design any Branch and Bound approach, it is necessary to specify the following key issues:

  • the branching rule;

  • the bounding strategy;

  • the fathoming criterion.

About the branching rule, let us suppose that at a generic iteration t the algorithm is visiting node t of the tree, which corresponds to a partial solution where we have a set (possibly empty) of objects in S 1 and a set (possibly empty) of objects in S 2. Let \(\overline{O} = O\setminus \{S_{1} \cup S_{2}\}\) be the set of objects that are still to be assigned and let x best be the incumbent solution, i.e., the best feasible solution found so far (at the beginning, x best either is a known feasible solution or it is empty). If \(\overline{O} =\emptyset\), then a complete solution has been found and x best is updated if the new solution improves it. Otherwise, a branching operation should be performed. Before performing a further branching, the so-called bounding criterion is tested, i.e., a lower bound for the objective function is computed (see hereafter how to compute a valid lower bound): if the lower bound is greater than or equal to the objective function value corresponding to x best, then no branching is performed, since any feasible solution that can be generated from node t will be no better than the incumbent solution itself. Conversely, if the bounding criterion fails, a branching operation is performed by assigning to cluster S 1 an object \(o_{k}\), \(k \in \overline{O}\), such that

$$\displaystyle{o_{k} = \text{arg}\min _{j\in \overline{O}}\sum _{\hat{i}\in S_{1}}d_{\hat{i}j},}$$

i.e., o k is a not yet assigned object whose sum of distances to the objects already assigned to cluster S 1 is minimum.

Generally speaking, the strategy for selecting the next sub-problem to be investigated determines how the Branch and Bound algorithm proceeds through the search tree and can have a significant effect on the behavior of the algorithm (fathoming rule). In [60], the proposed approach adopts a depth first search (DFS, for short) strategy, where the node with the largest level in the search tree is chosen for exploration. The algorithm continues branching until the end of the tree is reached or the bounding criterion indicates that no further branching is needed. When this happens, the algorithm back-tracks along the tree until a node is reached with an unexplored branch to the left. A graphical representation of the branching rule is depicted in Fig. 1.

Fig. 1

Graphical representation of the branching rule

To obtain a lower bound, Rao [60] proposed several different strategies, based on the following further definitions and notation.

At a given node t of the branching tree, let x be a Boolean partial solution defined as in formulation (DC-3) and \(\overline{O} = O\setminus \{S_{1} \cup S_{2}\}\). Moreover, let

  • \(K_{1} = \sum _{i,\,j\in S_{1}}d_{ij}\);

  • \(K_{2} = \sum _{i,\,j\in S_{2}}d_{ij}\);

  • D be the N × N symmetric distance matrix;

  • F be a \(\vert \overline{O}\vert \times \vert S_{1}\vert\) sub-matrix of D containing the distances between each currently not yet assigned object \(k \in \overline{O}\) and the objects in S 1;

  • F i , \(i = 1,\ldots,\vert \overline{O}\vert\), be the sum of the distances in row i of matrix F;

  • H be a \(\vert \overline{O}\vert \times \vert S_{2}\vert\) sub-matrix of D containing the distances between each currently not yet assigned object \(k \in \overline{O}\) and the objects in S 2;

  • H i , \(i = 1,\ldots,\vert \overline{O}\vert\), be the sum of the distances in row i of matrix H;

  • C be a \(\vert \overline{O}\vert \times \vert \overline{O}\vert\) sub-matrix of D containing the distance between each pair of unassigned objects, whose diagonal elements are assigned a very high value (\(+\infty\)).

Without loss of generality, matrices F and H are assumed arranged in such a way that

$$\displaystyle{\begin{array}{ll} F_{i} <F_{j}, &\text{if}\ i <j; \\ H_{i} <H_{j},&\text{if}\ i <j. \end{array} }$$

The objective function can be rewritten as follows:

$$\displaystyle\begin{array}{rcl} Z& =& \min \left \{\frac{K_{1} + \sum _{i\in S_{ 1},\,j\in \overline{O}}d_{ij}\ x_{i}\ x_{j} +\sum _{i,\,j\in \overline{O}}d_{ij}\ x_{i}\ x_{j}} {\vert S_{1}\vert + \sum _{i\in \overline{O}}x_{i}} \right. \\ & & +\left.\frac{K_{2} + \sum _{i\in S_{ 2},\,j\in \overline{O}}d_{ij}(1 - x_{i})(1 - x_{j}) + \sum _{i,\,j\in \overline{O}}d_{ij}(1 - x_{i})(1 - x_{j})} {\vert S_{2}\vert + \left (\vert \overline{O}\vert -\sum _{i\in \overline{O}}x_{i}\right )} \right \},{}\end{array}$$
(5)

where x i  ∈ { 0, 1}.

For any fixed value for \(\overline{n} = \sum _{i\in \overline{O}}x_{i}\), by setting \(n^{{\prime}} = \vert \overline{O}\vert -\overline{n}\), \(p = \vert S_{1}\vert + \overline{n}\), and \(t = \vert S_{2}\vert + n^{{\prime}}\), Z in (5) can be rewritten as follows:

$$\displaystyle\begin{array}{rcl} Z^{{\prime}}& =& \min t \cdot \left [\sum _{ i\in S_{1},\,j\in \overline{O}}d_{ij}x_{i}x_{j} + \sum _{i,\,j\in \overline{O}}d_{ij}x_{i}x_{j}\right ] \\ & & +\,p \cdot \left [\sum _{i\in S_{ 2},\,j\in \overline{O}}d_{ij}(1 - x_{i})(1 - x_{j}) + \sum _{i,\,j\in \overline{O}}d_{ij}(1 - x_{i})(1 - x_{j})\right ],{}\end{array}$$
(6)

given that the denominator (equal to pt), K 1 t, and K 2 p are constant.

By varying \(\overline{n}\) in its range from 0 to \(\vert \overline{O}\vert\), \(\vert \overline{O}\vert + 1\) lower bounds for \(Z^{{\prime}}\) (6) are computed and the minimum among them is kept as lower bound for Z (5). One way to obtain such a lower bound is the following:

$$\displaystyle{Z^{{\prime}}\geq u_{ 1} + u_{2} + u_{3},}$$

where

$$\displaystyle{\begin{array}{lll} u_{1} & =&t \cdot \min \sum _{i\in S_{ 1},\,j\in \overline{O}}d_{ij}x_{i}x_{j} = t \cdot \sum _{i=1}^{\overline{n}}F_{ i}; \\ u_{2} & =&p \cdot \min \sum _{i\in S_{ 2},\,j\in \overline{O}}d_{ij}(1 - x_{i})(1 - x_{j}) = p \cdot \sum _{i=1}^{n^{{\prime}} }H_{i}; \\ u_{3} & =&\left [t \cdot \min \sum _{i,\,j\in \overline{O}}d_{ij}x_{i}x_{j}\right ] + \left [p \cdot \min \sum _{i,\,j\in \overline{O}}d_{ij}(1 - x_{i})(1 - x_{j})\right ]. \end{array} }$$

Note that the number of terms in the summation used to compute u 3 is \(v = \frac{\overline{n}\cdot (\overline{n}-1)+n^{{\prime}}(n^{{\prime}}-1)} {2}\), lying either in the upper or in the lower triangular part of the symmetric matrix C. Consequently, a lower bound for u 3 can be obtained by multiplying \(\min \{t,p\}\) by the sum of the v smallest elements of one of the triangles of matrix C. The reader interested in further techniques to compute a lower bound can refer to Rao’s paper [60].
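The following sketch spells out the u 1 + u 2 + u 3 bound on Z′ (6) for a fixed value of \(\overline{n}\), following one possible reading of the construction above (it is our transcription, not Rao's original code); S1, S2, and Obar are lists of object indices and D is the symmetric distance matrix.

```python
import numpy as np

def lower_bound_zprime(D, S1, S2, Obar, n_bar):
    n_prime = len(Obar) - n_bar
    p = len(S1) + n_bar
    t = len(S2) + n_prime

    # F_i: row sums of distances from each unassigned object to the objects in S1
    F = np.sort([sum(D[i][j] for j in S1) for i in Obar])
    u1 = t * F[:n_bar].sum()

    # H_i: row sums of distances from each unassigned object to the objects in S2
    H = np.sort([sum(D[i][j] for j in S2) for i in Obar])
    u2 = p * H[:n_prime].sum()

    # u3: the v smallest upper-triangular entries among unassigned pairs, weighted by min(t, p)
    v = (n_bar * (n_bar - 1) + n_prime * (n_prime - 1)) // 2
    pair_d = sorted(D[i][j] for a, i in enumerate(Obar) for j in Obar[a + 1:])
    u3 = min(t, p) * sum(pair_d[:v])

    return u1 + u2 + u3
```

Taking the minimum of these bounds over \(\overline{n} = 0,\ldots,\vert \overline{O}\vert\) then gives the value used at the node, as described above.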

4.2 Optimizing the Within Clusters Distance

If one is interested in minimizing the total within clusters distance, the clustering task objective can be formulated as follows:

$$\displaystyle{\begin{array}{llll} \text{(DC-4)} &\min &\sum _{k=1}^{M}\ \left \{\sum _{ i=1}^{N-1}\ \sum _{ j=i+1}^{N}d_{ ij}\ x_{ik}\ x_{jk}\right \} \\ &\text{s.t.} \\ (\text{DC-4.1})& &\sum _{k=1}^{M}x_{ ik} = 1, &i = 1,\ldots,N\\ \\ (\text{DC-4.2})& &x_{ik} \geq 0\ \text{and integer}, &i = 1,\ldots,N,\ k = 1,\ldots,M.\end{array} }$$

Note that the objective function of (DC-4) is similar to that of (DC-2), the only differences being that here the factors \(\frac{1} {n_{k}}\), k = 1, …, M, do not appear and, consistently, constraints (DC-2.2) need not be imposed.

A further useful target of a clustering task is to minimize the maximum within-cluster distance. In this case, the problem can be modeled as a linear integer programming problem as follows:

$$\displaystyle{\begin{array}{llll} \text{(DC-5)} &\min &Z \\ &\text{s.t.} \\ (\text{DC-5.1})& &d_{ij}\ x_{ik} + d_{ij}\ x_{jk} - Z \leq d_{ij},&i = 1,\ldots,N - 1 \\ & & &j = i + 1,\ldots,N \\ & & &k = 1,\ldots,M\\ \\ (\text{DC-5.2})& &\sum _{k=1}^{M}x_{ ik} = 1, &i = 1,\ldots,N\\ \\ (\text{DC-5.3})& &x_{ik} \geq 0\ \text{and integer}, &i = 1,\ldots,N,\ k = 1,\ldots,M\\ \\ (\text{DC-5.4})& &Z \geq 0\ \text{and integer}.\end{array} }$$

In the case of minimizing the distance between the objects inside the same cluster, in 2010 Nascimento et al. [55] slightly modified Rao’s model (DC-4) as follows:

$$\displaystyle{\begin{array}{llll} \text{(DC-6)} &\min &\sum _{i=1}^{N-1}\sum _{ j=i+1}^{N}d_{ ij}\sum _{k=1}^{M}x_{ ik} \cdot x_{jk} \\ &\text{s.t.} \\ (\text{DC-6.1})& &\sum _{k=1}^{M}x_{ ik} = 1, &i = 1,\ldots,N\\ \\ (\text{DC-6.2})& &\sum _{i=1}^{N}x_{ ik} \geq 1, &k = 1,\ldots,M\\ \\ (\text{DC-6.3})& &x_{ik} \in \{ 0,1\}, &i = 1,\ldots,N,\ k = 1,\ldots,M.\end{array} }$$

Note that a set of additional constraints (DC-6.2) is imposed to guarantee that each cluster S k , k = 1, …, M, contains at least one object. To remedy the nonlinearity of the objective function, Nascimento et al. proposed a linearization of model (DC-6). In more detail, by defining a further decision vector \(y \in \mathcal{R}^{N\times N}\), the linearized version of formulation (DC-6) is the following:

$$\displaystyle{\begin{array}{llll} \text{(LDC-6)} &\min &\sum _{i=1}^{N-1}\sum _{ j=i+1}^{N}d_{ ij} \cdot y_{ij} \\ &\text{s.t.} \\ (\text{LDC-6.1})& &\sum _{k=1}^{M}x_{ ik} = 1, &i = 1,\ldots,N\\ \\ (\text{LDC-6.2})& &\sum _{i=1}^{N}x_{ ik} \geq 1, &k = 1,\ldots,M\\ \\ (\text{LDC-6.3})& &x_{ik} \in \{ 0,1\}, &i = 1,\ldots,N,\ k = 1,\ldots,M\\ \\ (\text{LDC-6.4})& &y_{ij} \geq x_{ik}+x_{jk}-1, &i = 1,\ldots,N,\,j = i + 1,\ldots,N,\,k = 1,\ldots,M\\ \\ (\text{LDC-6.5})& &y_{ij} \geq 0, &i = 1,\ldots,N,\,j = i + 1,\ldots,N.\end{array} }$$

Constraints (LDC-6.4) and (LDC-6.5) guarantee that y ij  = 1 if \(x_{ik} = x_{jk} = 1\), i.e., if objects \(o_{i},o_{j} \in O\) are in the same cluster. Therefore, it can be easily seen that the objective function of model (LDC-6) aims at minimizing the distance between objects in the same cluster. Note that model (LDC-6) has \(\frac{N\cdot (N-1)\cdot (M+1)} {2}\) more constraints than (DC-6) but it is linear and therefore “easier” to solve.
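Because (LDC-6) is linear, it can be handed directly to a general-purpose MIP solver. The sketch below uses the PuLP modeling library as one possible choice; the function name and the use of the default solver are our assumptions, not part of [55].

```python
import pulp

def solve_ldc6(d, M):
    """Build and solve (LDC-6) for an N x N symmetric distance matrix d."""
    N = len(d)
    prob = pulp.LpProblem("LDC6", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", [(i, k) for i in range(N) for k in range(M)], cat="Binary")
    y = pulp.LpVariable.dicts("y", [(i, j) for i in range(N) for j in range(i + 1, N)], lowBound=0)

    # objective: total distance between objects placed in the same cluster
    prob += pulp.lpSum(d[i][j] * y[i, j] for i in range(N) for j in range(i + 1, N))

    for i in range(N):                       # (LDC-6.1): each object in exactly one cluster
        prob += pulp.lpSum(x[i, k] for k in range(M)) == 1
    for k in range(M):                       # (LDC-6.2): no empty cluster
        prob += pulp.lpSum(x[i, k] for i in range(N)) >= 1
    for i in range(N):                       # (LDC-6.4): y_ij >= x_ik + x_jk - 1
        for j in range(i + 1, N):
            for k in range(M):
                prob += y[i, j] >= x[i, k] + x[j, k] - 1

    prob.solve()                             # default solver shipped with PuLP
    return [[i for i in range(N) if pulp.value(x[i, k]) > 0.5] for k in range(M)]
```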

Exact methods [35, 60] and the above-described mathematical formulations of the problem can be used only for small-sized instances, since the number of constraints characterizing the models increases very rapidly with both the number of objects N and the number of pre-assigned clusters M.

5 A Review of the Most Popular Clustering Techniques

According to Jain et al. [40] (see the taxonomic representation of clustering methods in Fig. 2), state-of-the-art clustering algorithms can be mainly divided into two families: partitioning and hierarchical algorithms.

Fig. 2

A taxonomy of clustering methods

A partitioning method partitions the set of data objects into non-overlapping clusters such that each data object belongs to exactly one cluster. Instead, in a hierarchical approach a cluster is permitted to have subclusters and the result of the clustering task is a set of nested clusters that can be organized in a tree. Each node of the tree corresponds to a cluster and it is the union of its children (subclusters). Clearly, the leaves have no subclusters and the root node represents the cluster containing all the objects.

Besides exclusive/non-overlapping versus overlapping clustering, in the literature several different further types of clusterings can be found, such as fuzzy clustering and probabilistic clustering.

In a fuzzy clustering, clusters are viewed as fuzzy sets and a membership weight function W: O × S ↦ [0, 1] is defined such that for each object o i  ∈ O, i = 1, …, N, and for each cluster S k , k = 1, …, M, W ik measures a “level” of membership of o i to cluster S k . Clearly, W ik  = 0 corresponds to “absolutely \(o_{i}\notin S_{k}\)”, while W ik  = 1 corresponds to “absolutely \(o_{i} \in S_{k}\)”.

Similarly to fuzzy clustering, a probabilistic clustering requires the computation of the probability p ik with which each object o i  ∈ O, i = 1, …, N, belongs to each cluster S k , k = 1, …, M. Clearly, for each object these probabilities must sum to 1. In general, a probabilistic clustering is converted to an exclusive/non-overlapping clustering by assigning each object to the cluster for which its probability is highest.
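A tiny NumPy example of this conversion (the matrix below is made-up toy data): each row of p holds the membership probabilities of one object, and the hard assignment simply picks the most probable cluster.

```python
import numpy as np

# N = 3 objects, M = 2 clusters: each row sums to 1.
p = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.2, 0.8]])
hard_assignment = np.argmax(p, axis=1)   # -> array([0, 1, 1]): o_i goes to its most likely cluster
```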

5.1 Hierarchical Clustering Algorithms

The most popular hierarchical clustering algorithms are the single-link algorithm [65], complete-link [46], and minimum-variance algorithms [72]. The single-link and the complete-link approaches differ in how they define the similarity between a pair of clusters: in the single-link approach, this distance is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second); in a complete-link algorithm, this distance is the maximum of all pairwise distances between patterns in the two clusters. In either case, two clusters are merged to form a larger cluster based on minimum distance criteria.
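The two inter-cluster distances can be written down directly; in the sketch below (names are ours), d is the distance matrix and A and B are the index sets of two clusters, and an agglomerative step would merge the pair of clusters with the smallest such distance.

```python
def single_link(d, A, B):
    # single link: minimum pairwise distance between the two clusters
    return min(d[i][j] for i in A for j in B)

def complete_link(d, A, B):
    # complete link: maximum pairwise distance between the two clusters
    return max(d[i][j] for i in A for j in B)
```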

5.2 Partitioning Clustering Algorithms

The most popular partitioning clustering algorithms are the squared error algorithms (among them the most famous are the k-means method [52] and the k-medoid method [43, 44]), graph-theoretic algorithms [78], and mixture-resolving and mode-seeking algorithms [38].

5.2.1 Squared Error Algorithms and the k-Means/k-Medoid Algorithms

The squared error criterion is the most intuitive and most widely used criterion among partitioning clustering techniques. Like the Euclidean distance, it tends to work well with “isolated” and “compact” clusters.

Given a clustering S of a set of patterns O containing M clusters, the squared error for S, also known as scatter, is defined as

$$\displaystyle{E^{2}(O,S) = \sum _{ j=1}^{M}\sum _{ i=1}^{n_{j} }\|\mathbf{a}_{i}^{(j)} -\mathbf{c}_{ j}\|^{2},}$$

where

  • n j is the number of objects in cluster j ∈ { 1, …, M};

  • a i (j) is the ith pattern belonging to cluster j ∈ { 1, …, M};

  • c j is the centroid of cluster j ∈ { 1, …, M}.

The framework of a generic squared error algorithm is described in Fig. 3.

Fig. 3

Framework of a general squared error algorithm
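A direct NumPy transcription of the scatter E 2(O, S) defined above is the following; X is the N × L data matrix and cluster is an integer array with the cluster index of each object (variable names are illustrative).

```python
import numpy as np

def squared_error(X, cluster, M):
    total = 0.0
    for j in range(M):
        members = X[cluster == j]                    # the patterns a_i^(j) of cluster j
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)              # c_j
        total += ((members - centroid) ** 2).sum()   # sum of ||a_i^(j) - c_j||^2
    return total
```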

The k-means algorithm [52] and the k-medoid algorithm [43] are the simplest and maybe most famous approaches that use a squared error criterion. k-means relies on the definition of a centroid, usually as the mean of a group of objects, and it is typically applied to objects in a continuous n-dimensional space. Conversely, the k-medoid algorithm relies on the definition of a medoid as the most representative object of a group of objects and in principle can be used in a wider range of contexts than k-means, since it only requires a suitable definition of a proximity measure among objects. It should furthermore be underlined that in a k-medoid algorithm the medoid is by definition an object of the given set of objects O, while in a k-means approach this property is not necessarily satisfied by a centroid.

A typical k-means algorithm is described in Fig. 4. It starts with a random initial partition of the data objects and keeps reassigning objects to “close” clusters until a convergence criterion is met. The most used stopping criteria are no (or minimal) reassignment of patterns to new cluster centers or a minimal decrease in squared error. In the latter case, when the stopping criterion is related to the squared error, a clustering is generally evaluated by an objective function based on the squared error: among different clusterings of the same set of objects, the one with the minimum squared error is preferred, since this means that its centroids better represent the objects in their own clusters.

Fig. 4

Framework of the k-means algorithm
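A minimal sketch of the framework in Fig. 4 is given below, assuming Euclidean distances and using "no reassignment" as the stopping criterion; the random initialization and the empty-cluster repair are implementation choices of ours.

```python
import numpy as np

def k_means(X, M, rng=np.random.default_rng(0)):
    N = len(X)
    cluster = rng.integers(0, M, size=N)          # random initial partition
    while True:
        # recompute each centroid as the mean of its current members
        centroids = np.array([X[cluster == k].mean(axis=0) if np.any(cluster == k)
                              else X[rng.integers(N)] for k in range(M)])
        # reassign every object to its closest centroid
        dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_cluster = np.argmin(dist2, axis=1)
        if np.array_equal(new_cluster, cluster):  # stop when no reassignment occurs
            return cluster, centroids
        cluster = new_cluster
```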

The main drawback of the k-means algorithm is that the quality of the solution it returns strongly depends upon the selection of the initial partition. If the initial partition is not properly chosen, the approach may converge to a local minimum of the criterion function. In the attempt to overcome this important drawback of the k-means approach, a variant of the method has been proposed, called bisection k-means [67, 68]. Bisection k-means initially splits the objects into two clusters, then further splits one of the just created clusters, and so on, iteratively, until M clusters are obtained (yielding, as a side effect, hierarchical clusters).

A typical bisection k-means approach is described in Fig. 5. In line 1, the algorithm initializes the set of clusters to contain only one cluster S 1 containing all objects. Then, in the loop at lines 2–3, until a set S of M clusters is attained, the algorithm iteratively selects and removes from the current set S a cluster S t and, for a number n of trials given in input, it bisects S t into two clusters, retaining the best bisection S t b1 and S t b2 (corresponding, for example, to the lowest squared error) and adding it to the set S under construction.

Fig. 5

Framework of the bisection k-means algorithm
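Reusing the k_means and squared_error sketches above, the bisection strategy of Fig. 5 can be outlined as follows; choosing the largest cluster to split is our own simplification, since the framework leaves the selection rule open.

```python
import numpy as np

def bisection_k_means(X, M, n_trials=5):
    clusters = [list(range(len(X)))]             # one cluster containing all objects
    while len(clusters) < M:
        target = max(clusters, key=len)          # select and remove a cluster to split
        clusters.remove(target)
        sub = X[target]
        best = None
        for _ in range(n_trials):                # n trial bisections, keep the best one
            labels, _ = k_means(sub, 2, rng=np.random.default_rng())
            err = squared_error(sub, labels, 2)
            if best is None or err < best[0]:
                best = (err, labels)
        labels = best[1]
        # degenerate (one-sided) splits are not handled in this sketch
        clusters.append([target[i] for i in range(len(target)) if labels[i] == 0])
        clusters.append([target[i] for i in range(len(target)) if labels[i] == 1])
    return clusters
```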

The study and design of efficient initialization methods for the k-means technique is still a research topic that attracts the efforts of various scientific communities. A recent survey and comparative study of the existing methods has been conducted by Celebi et al. [15], who cited as particularly interesting two hierarchical initialization methods named Var-Part and PCA-Part, proposed in 2007 by Su and Dy [69]. These two methods are linear, deterministic, and order-invariant. Building on these nice characteristics, in a recent paper Celebi and Kingravi [13] proposed a discriminant analysis-based approach that addresses a common deficiency of these two methods. A thorough experimental analysis showed that the two methods are highly competitive with the best random initialization methods to date and that the proposed approach significantly improves the performance of both hierarchical methods. Finally, in [14], Celebi and Kingravi presented an in-depth comparison of six linear, deterministic, and order-invariant initialization methods.

5.2.2 Graph-Theoretic Algorithms

Most of the graph-theoretic algorithms are divisive approaches, i.e., they start with one cluster that contains all the objects and then at each step they split a cluster. Conversely, agglomerative algorithms start with each pattern in a distinct (singleton) cluster, and successively merge clusters together until a stopping criterion is satisfied.

The graph-theoretic algorithms use a graph representation of the data set. Formally, given the set of objects \(O =\{ o_{1},\ldots,o_{N}\}\) and the distance function d: O × O → R, a weighted undirected graph G = (V, E, w) can be defined such that

  • V = O;

  • edges in E indicate the relationship between objects;

  • \(w_{ij} = d_{ij}\), \(\forall \ i,j \in V\) (i.e., \(o_{i},o_{j} \in O\)).

The graph G is usually called proximity graph. It is very easy to see that each cluster can correspond to a connected component of G, i.e., a subset of nodes/objects that are connected to one another [63]. A stronger family of graph-theoretic approaches looks for cliques in G, i.e., sets of nodes/objects in the graph that are completely connected to each other [6, 55].

The easiest and maybe most famous graph-theoretic divisive clustering algorithm is based on the construction of a minimal spanning tree (MST) of the graph G [78]. Once an MST is obtained, it generates clusters by deleting from the MST the edges with the largest weights.
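A compact sketch of this MST-based scheme is given below using the networkx library (a convenience choice of ours; the algorithm in [78] is of course not tied to any library): build the proximity graph, take a minimum spanning tree, drop the M − 1 heaviest tree edges, and read the M clusters off as connected components.

```python
import networkx as nx

def mst_clustering(d, M):
    N = len(d)
    G = nx.Graph()
    G.add_weighted_edges_from((i, j, d[i][j]) for i in range(N) for j in range(i + 1, N))
    T = nx.minimum_spanning_tree(G)                        # MST of the proximity graph
    heaviest = sorted(T.edges(data="weight"), key=lambda e: e[2], reverse=True)[:M - 1]
    T.remove_edges_from((u, v) for u, v, _w in heaviest)   # cut the M-1 largest-weight edges
    return [sorted(c) for c in nx.connected_components(T)]
```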

5.2.3 Mixture-Resolving Algorithms

The idea behind this family of clustering algorithms is that the objects in O are drawn from one of several distributions (usually, Gaussian) and the goal is to identify the parameters of each distribution (e.g., a maximum likelihood estimate). Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities.

Among these techniques, the Expectation Maximization (EM) algorithm [19] deserves mention: it is a general-purpose maximum likelihood algorithm for missing-data problems and has been applied to the problem of parameter estimation.
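As a hedged illustration, fitting a Gaussian mixture by EM is available off the shelf, e.g., in scikit-learn; the snippet below uses synthetic toy data and is just one common way to apply the idea, not the specific procedure of [19].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(100, 2))             # toy data: 100 objects, L = 2 features
gm = GaussianMixture(n_components=3, random_state=0).fit(X)    # component parameters estimated by EM
labels = gm.predict(X)                                         # most likely component of each object
probs = gm.predict_proba(X)                                    # soft (probabilistic) memberships
```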

5.3 Efficient Metaheuristic Approaches

Metaheuristics for data clustering have been designed only in the last 20 years. They include artificial neural networks [39, 64], evolutionary approaches such as genetic algorithms [42, 59], simulated annealing [47], and tabu search [2, 31].

Starting from 2010, GRASP (Greedy Randomized Adaptive Search Procedure) algorithms [24, 30, 55] have also been proposed that model the problem of pattern clustering as a combinatorial optimization problem defined on the weighted graph G = (V, E, w) representing the data set, as also used by graph-theoretic algorithms. GRASP, originally proposed by Feo and Resende [23] for set covering, has been applied in a wide range of problem areas [25–27, 61, 62]. It is a multi-start process, where at each iteration a greedy randomized solution is constructed to be used as a starting solution for local search. Local search repeatedly substitutes the current solution by a better solution in the neighborhood of the current solution. If there is no better solution in the neighborhood, the current solution is returned as a local minimum and the search stops. The best local minimum found over all GRASP iterations is output as the final solution.
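In skeleton form, and independently of the clustering-specific details, a GRASP can be sketched as below; greedy_randomized and local_search are placeholders for the problem-dependent construction and neighborhood search (e.g., the edge-elimination construction of [55]).

```python
def grasp(cost, greedy_randomized, local_search, iterations=100):
    best, best_cost = None, float("inf")
    for _ in range(iterations):
        S = greedy_randomized()        # build a greedy randomized starting solution
        S = local_search(S)            # improve it until a local minimum is reached
        if cost(S) < best_cost:
            best, best_cost = S, cost(S)
    return best
```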

In 2010, Nascimento et al. [55] proposed a GRASP algorithm to cluster biological data sets. The greedy randomized solution is iteratively built, starting from a complete graph \(\left (\vert E\vert = \frac{N\cdot (N-1)} {2} \right )\), indicating that the data set forms a unique cluster. Then, at each iteration of the construction procedure, edges are gradually eliminated from the graph, creating unconnected full subgraphs (cliques), each representing a cluster. The edge elimination follows a greedy randomized criterion that selects at random one edge in a subset (RCL, Restricted Candidate List) of the highest weighted edges. Once a greedy randomized clustering \(S =\{ S_{1},\ldots,S_{M}\}\) has been obtained, a local search procedure is applied starting from S in the attempt to improve it. The local search works in an iterative fashion, until no better solution is found in the neighborhood. At each iteration, it replaces the current solution S by a better solution in the neighborhood N(S) obtained by moving an object of S from one cluster to another.

In 2011, Frinhani et al. [30] described several hybrid GRASP with path-relinking heuristics [32, 48, 62] for data clustering. In a GRASP with path-relinking [1, 48, 62], an elite set of good-quality solutions is stored and possibly updated at each GRASP iteration. The locally optimal current solution is combined with a randomly selected solution from the elite set using the path-relinking operator. The best of the combined solutions is a candidate for inclusion in the elite set and is added to the elite set if it meets quality and diversity criteria. The path-relinking procedure proposed in [30] applies to a pair of solutions S s (starting solution) and S t (target solution). Initially, the procedure computes the symmetric difference \(\varDelta (S_{s},S_{t})\), i.e., the set of moves needed to reach S t (target solution) from S s (initial solution). A path of solutions in the solution space is generated linking S s and S t and the best solution found along this path is returned by the algorithm. At each step, the procedure examines all moves m ∈ Δ(S, S t ) from the current solution S and selects the one corresponding to the least cost solution, i.e., the one that minimizes the objective function evaluated at the solution resulting from applying move m to solution S. The best move is made, producing the new current solution. The set of available moves is updated and, if necessary, the best solution found along the path is updated. The procedure terminates when S t is reached, i.e., when Δ(S, S t ) = ∅.
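A sketch of path-relinking between two cluster assignments, as we read the description above, is given below; S_s and S_t are lists with S[i] equal to the cluster index of o i , and cost is the clustering objective to be minimized (this representation is an assumption of ours).

```python
def path_relinking(S_s, S_t, cost):
    S = list(S_s)
    best, best_cost = list(S), cost(S)
    delta = [i for i in range(len(S)) if S[i] != S_t[i]]   # symmetric difference Delta(S, S_t)
    while delta:
        # evaluate every remaining move (set S[i] to its value in the target) and keep the cheapest
        i_star = min(delta, key=lambda i: cost(S[:i] + [S_t[i]] + S[i + 1:]))
        S[i_star] = S_t[i_star]
        delta.remove(i_star)
        if cost(S) < best_cost:                            # keep the best solution along the path
            best, best_cost = list(S), cost(S)
    return best
```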

In 2013 [28], a Biased Random-Key Genetic Algorithm (BRKGA) was proposed for data clustering. It is well known that, in the attempt to find good-quality solutions for a combinatorial optimization problem, Genetic Algorithms (GAs) [33, 36] implement the concept of survival of the fittest, making an analogy between a solution and an individual in a population. In particular, each individual of the current population represents a feasible solution, encoded by a corresponding chromosome that consists of a string of genes. Each gene can take on a value, called an allele, from some alphabet. For each chromosome it is possible to evaluate its fitness level, which is clearly correlated to the objective function value of the solution it encodes. GAs proceed over a number of iterations, called generations, evolving a population of chromosomes. This evolution is implemented by simulating the process of natural selection through mating and mutation.

Genetic algorithms with random keys, or random-key genetic algorithms (RKGA), were introduced by Bean [5]. In a RKGA, chromosomes are represented as vectors of randomly generated real numbers in the interval (0, 1]. A deterministic algorithm, called a decoder, takes as input a solution vector and associates with it a solution of the combinatorial optimization problem for which an objective value or fitness can be computed.

A RKGA evolves a population of random-key vectors over a number of generations. The initial population is made up of p > 0 vectors of random keys. Each component of the solution vector is generated independently at random in the real interval (0, 1]. After the fitness of each individual is computed by the decoder in generation t, the population is partitioned into two groups of individuals: a small group of p e elite individuals, i.e., those with the best fitness values, and the remaining set of p − p e non-elite individuals. To evolve the population, a new generation of individuals must be produced. All elite individuals of the population of generation t are copied without modification to the population of generation t + 1. RKGAs implement mutation by introducing mutants into the population. A mutant is simply a vector of random keys generated in the same way that an element of the initial population is generated. At each generation (see Fig. 6), a small number (p m ) of mutants is introduced into the population. With the p e elite individuals and the p m mutants accounted for in the population of generation t + 1, \(p - p_{e} - p_{m}\) additional individuals need to be produced to complete the p individuals that make up the new population. This is done by producing \(p - p_{e} - p_{m}\) offspring through the process of mating or crossover.

Fig. 6

BRKGA: generation t + 1 from generation t

Bean [5] selects two parents at random from the entire population to implement mating in a RKGA. A biased random-key genetic algorithm, or BRKGA [34], differs from a RKGA in the way parents are selected for mating. In a BRKGA, each element is generated combining one element selected at random from the elite partition in the current population and one from the non-elite partition. Repetition in the selection of a mate is allowed and therefore an individual can produce more than one offspring in the same generation. As in RKGA, parametrized uniform crossover [66] is used to implement mating in BRKGAs. Let ρ e  > 0.5 be the probability that an offspring inherits the vector component of its elite parent. Let m denote the number of components in the solution vector of an individual. For i = 1, …, m, the ith component C i of the offspring vector C takes on the value of the ith component E i of the elite parent E with probability ρ e and the value of the ith component \(\bar{E}_{i}\) of the non-elite parent \(\bar{E}\) with probability 1 −ρ e .
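The parametrized uniform crossover translates into a few lines of NumPy (a sketch with illustrative names): each offspring key is copied from the elite parent with probability ρ e and from the non-elite parent otherwise.

```python
import numpy as np

def crossover(elite, non_elite, rho_e=0.7, rng=np.random.default_rng()):
    mask = rng.random(len(elite)) < rho_e          # True -> inherit the key from the elite parent
    return np.where(mask, elite, non_elite)
```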

When the next population is complete, i.e., when it has p individuals, fitness values are computed by the decoder for all of the newly created random-key vectors and the population is partitioned into elite and non-elite individuals to start a new generation.

A BRKGA searches the solution space of the combinatorial optimization problem indirectly by searching the continuous m-dimensional hypercube, using the decoder to map solutions in the hypercube to solutions in the solution space of the combinatorial optimization problem, where the fitness is evaluated. As underlined in [34], any BRKGA heuristic is thus based on a general-purpose metaheuristic framework (see Fig. 7), in which there is a portion totally independent of the specific problem to be solved. The only connection to the combinatorial optimization problem being solved is the problem-dependent portion of the algorithm, where the decoder produces solutions from the vectors of random keys and computes the fitness of these solutions. Therefore, to describe a BRKGA for a specific combinatorial optimization problem, one needs only to show how solutions are encoded as vectors of random keys and how these vectors are decoded to feasible solutions of the optimization problem. We report in the following the encoding and decoding schemes proposed in [28] for data clustering problems.

Fig. 7

BRKGA: flow chart

A solution is denoted as \(x = (x_{1},\ldots,x_{N})\), such that for all i = 1, …, N, x i  ∈ { 1, …, M}, i.e.,

$$\displaystyle{\forall \ i = 1,\ldots,N,\quad x_{i} = k,\ \text{iff}\ o_{i} \in S_{k},\ k \in \{ 1,\ldots,M\}.}$$

5.4 Encoding

A solution is encoded as a random-key vector \(X = (X_{1},X_{2},\ldots,X_{N})\), where for i = 1, …, N, component X i is a real number in the interval (0, 1].

5.5 Decoding

Given an encoded solution \(X = (X_{1},X_{2},\ldots,X_{N})\) and representing the set S as an array of | S | = M elements, decoding consists of three steps and produces a string \(x = (x_{1},x_{2},\ldots,x_{N})\).

In the first step, the string x is computed such that, for each i = 1, …, N,

$$\displaystyle{x_{i} = \left \lceil X_{i} \cdot \frac{1} {\varDelta } \right \rceil,\qquad \varDelta = \frac{1} {M}.}$$

In the second step, starting from x, a locally optimal solution \(\hat{x} = (\hat{x}_{1},\hat{x}_{2},\ldots,\hat{x}_{N})\) with respect to the 2-swap neighborhood is computed.

Finally, in the third step, a chromosome adjustment is made to the encoded solution such that the resulting chromosome \(\hat{X}\) decodes directly to \(\hat{x}\) using only the first step of the decoding process.

For \(j = 1,\ldots,\vert S\vert = M\), let

$$\displaystyle{\begin{array}{llll} l_{j} & =&(j - 1)\cdot \varDelta; \\ u_{j}& =&j \cdot \varDelta. \end{array} }$$

Then, to complete the adjustment, for each i = 1, …, N, one simply computes

$$\displaystyle{\hat{X}_{i} = l_{j} + X_{i}\cdot \varDelta,}$$

where j is the index of the subset in S to which object o i belongs according to solution \(\hat{x}\).
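The three decoding steps translate into the sketch below; the 2-swap local search of the second step is left as a placeholder, and the function name is ours.

```python
import numpy as np

def decode(X_keys, M, local_search=lambda x: x):
    delta = 1.0 / M
    # Step 1: a key in ((k-1)*delta, k*delta] is mapped to cluster k (1-based)
    x = np.ceil(X_keys / delta).astype(int)
    # Step 2: refine x with a 2-swap local search (placeholder here)
    x_hat = local_search(x)
    # Step 3: adjust the chromosome so that Step 1 alone reproduces x_hat
    X_hat = (x_hat - 1) * delta + X_keys * delta
    return x_hat, X_hat

# Example: N = 5 objects, M = 3 clusters.
keys = np.array([0.12, 0.55, 0.98, 0.34, 0.70])
print(decode(keys, 3))
```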

6 Concluding Remarks

The scope of this paper is to provide an overview of the main types of clustering and criteria for homogeneity or separation, with special emphasis on the optimization and operations research perspective, providing a few mathematical models of the problem under several different criteria adopted to classify the data.

Besides mathematical models, which can be efficiently used to find exact solutions only in the case of small-sized instances, there are hundreds of approximate techniques, proposed by researchers from several heterogeneous communities and more or less efficient depending on the input data and the type of information they contain.

The conclusion is that unfortunately there is no single best approach that wins in every aspect. Given the intrinsic difficulties to be faced when clustering data, a person interested in performing this task should select a group of algorithms that seem to be the most appropriate, taking into account also the specific data set under analysis.