
1 Introduction

Machine Learning (ML) is a subfield of Artificial Intelligence (AI) focused mainly on investigating and proposing new formalisms and computational algorithms, aimed at providing theoretical support for, as well as implementing, automatic learning by computers. Over the past few decades, many ideas on how to enable automatic learning have been presented, discussed and implemented. Among the many ways ML can be implemented, the so-called clustering algorithms are a particular group of (unsupervised) algorithms which aim at organizing sets of data points into groups, with exploratory data processes in mind. In the literature one can find numerous works proposing new clustering algorithms as well as different clustering taxonomies. Although taxonomies aim at organizing such algorithms into categories, since they usually adopt different grouping criteria, they are not necessarily compatible with each other (see, for instance, those suggested in [1,2,3,4]).

As pointed out in [5], efficient clustering techniques are considered a challenge, mainly because there is no external supervision, which implies knowing nothing about the internal structure of the point sets (such as spatial distribution, volume, density, geometric shapes of clusters, etc.). In such a scenario, automatic learning becomes an exploratory task, aiming at identifying which groups of data points are statistically separable (or not), which clusters are the most obvious ones and how they relate to what one aims to discriminate, in an attempt to expose the underlying structure of the data, based only on their descriptions, generally given as vectors of attribute values.

The main focus of this work is an empirical research and evaluation of the AGNES (AGglomerative NESting) algorithm [3], taking into account data point sets with different characteristics, cluster sizes and inter-cluster distances. For the experiments a collection of data points was artificially created. Seven data point sets were used for evaluating the performance of AGNES, having the K-means algorithm [6] as baseline. The experiments were organized into three case studies, depending on the characteristics of the data point sets.

The remainder of this paper has four more sections. Section 2 comments on the main characteristics of the agglomerative approach and gives a high-level pseudocode of the AGNES algorithm, which has been implemented and used in the experiments. Section 3 describes the data used in the experiments, organized as three case studies. Section 4 briefly introduces the validation indices Dunn, Davies-Bouldin and Rand, used for evaluating the experiment results, followed by the set of experiments related to each case study, a discussion of their results and some comparative analysis. Section 5 summarizes the work done and highlights some conclusions.

2 AGNES (AGglomerative NESting) Algorithm

For a given data set containing N data points to be clustered, agglomerative hierarchical clustering algorithms usually start with N clusters (each single data point is a cluster of its own); the algorithm proceeds by merging two individual clusters into a larger cluster, until a single cluster, containing all the N data points, is obtained. Obviously, the algorithm can have other stopping criteria, such as ending when a clustering containing a user-defined number of clusters (k) is obtained.

Figure 1 presents a high-level pseudocode of the AGNES algorithm which, at each iteration, chooses the two clusters to be merged, based on the shortest Euclidean distance between the clusters formed so far. The many agglomerative clustering algorithms found in the literature can be organized according to the way the inter-cluster distance is defined, i.e., which definition is used to compute the shortest distance among all pairs of clusters. Among the various ways, three are particularly popular and are defined next. Given two clusters X and Y, let d(x,y) denote the distance between two data points.

  1. Single Linkage: the distance between two clusters X and Y is the shortest distance between two data points x ∈ X and y ∈ Y, formally represented by Eq. (1).

     $$ d_{SL}(X, Y) = \min_{x \in X,\, y \in Y} d(x, y) $$
     (1)

  2. Complete Linkage: the distance between two clusters X and Y is the farthest distance between two data points x ∈ X and y ∈ Y, formally represented by Eq. (2).

     $$ d_{CL}(X, Y) = \max_{x \in X,\, y \in Y} d(x, y) $$
     (2)

  3. Average Linkage (UPGMA): the distance between two clusters X and Y is the mean of the distances between every data point in one cluster and every data point in the other, formally represented by Eq. (3).

     $$ d_{AL}(X, Y) = \frac{1}{|X| \times |Y|} \sum_{x \in X} \sum_{y \in Y} d(x, y) $$
     (3)
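For concreteness, the three linkage criteria above translate directly into code. The minimal Python sketch below is only an illustration of Eqs. (1)-(3) (the experiments in Sect. 4 use a C# implementation), assuming each cluster is given as a list of NumPy vectors.

```python
import numpy as np

def single_linkage(X, Y):
    """Eq. (1): shortest pairwise distance between clusters X and Y."""
    return min(np.linalg.norm(x - y) for x in X for y in Y)

def complete_linkage(X, Y):
    """Eq. (2): farthest pairwise distance between clusters X and Y."""
    return max(np.linalg.norm(x - y) for x in X for y in Y)

def average_linkage(X, Y):
    """Eq. (3): mean of all pairwise distances between X and Y (UPGMA)."""
    return sum(np.linalg.norm(x - y) for x in X for y in Y) / (len(X) * len(Y))
```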
Fig. 1. A customized version of the AGNES pseudocode, based on [1, 3].

For the experiments described in Sect. 4, AGNES was implemented using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) as the inter-cluster distance, in which all pairwise distances contribute equally.
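As an illustration of the procedure in Fig. 1, the sketch below gives a deliberately naive (O(N³)) Python rendering of the AGNES loop with UPGMA as the inter-cluster distance, stopping when a user-defined number of clusters k is reached; it is a minimal sketch of the algorithm described above, not the C# implementation used in the experiments.

```python
import numpy as np

def agnes_upgma(points, k):
    """Naive agglomerative clustering with UPGMA (average linkage).

    points: (N, d) array of data points; k: desired number of clusters
    (the stopping criterion). Returns a list of k clusters, each one a
    list of point indices.
    """
    # start with N singleton clusters, one per data point
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return float(np.linalg.norm(points[a] - points[b]))

    def upgma(ci, cj):
        # Eq. (3): mean of all pairwise distances between the two clusters
        return sum(dist(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

    while len(clusters) > k:
        # choose the pair of clusters with the smallest average-linkage distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda pair: upgma(clusters[pair[0]], clusters[pair[1]]))
        clusters[i].extend(clusters[j])   # merge the chosen pair
        del clusters[j]                   # j > i, so index i is unaffected
    return clusters
```

Recomputing every pairwise linkage at each iteration keeps the sketch short; a practical implementation would cache a distance matrix and update it after each merge (e.g., with the Lance-Williams recurrence).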

3 Data Description

The experiments described in Sect. 4 had three different focuses of empirical investigation: (1) sizes of clusters; (2) inter-cluster distances; and (3) diameters of clusters. For addressing each focus, a corresponding case study was conducted, with two-dimensional point sets prepared according to the particular focus intended. For all the experiments, a total of 7 synthetic sets of data points were created.

Case Study I (CS-I:Squares) uses three point sets, (a) Square1, (b) Square3 and (c) Square5, created so that their clusters lie at different distances from one another. The only difference between the three point sets is the degree of overlap between the four clusters that define each point set. In Square1, the clusters touch each other but hardly overlap, whereas in Square5 the overlap is such that there is little density difference when moving from one cluster to the next, as can be seen in Fig. 2. Point sets Square1, Square3 and Square5 can be described as square arrangements of four clusters of equal size and spread, each cluster being a Gaussian distribution around a central point; they differ from each other only in relation to the inter-cluster distances. The number of clusters, the size of the clusters and the mean and standard deviation vectors of each cluster were defined beforehand for both groups of point sets: Square (CS-I) and Sizes (CS-II). Square and Sizes were created based on the ones used in [7], with the intent of investigating the sensitivity of an agglomerative clustering algorithm to the inter-cluster distance as well as to the increase of the overlap between clusters.
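For reference, point sets with the structure of Square1, Square3 and Square5 can be reproduced by sampling four 2-D Gaussian clusters centered on the corners of a square; the sketch below is a hypothetical generator whose center offset, spread and cluster size are illustrative assumptions, not the exact parameters used to build the Square point sets.

```python
import numpy as np

def make_square(offset, n_per_cluster=250, sigma=2.0, seed=0):
    """Four equal-sized Gaussian clusters on the corners of a square.

    offset controls the inter-cluster distance: decreasing it increases
    the overlap between neighboring clusters (Square1 -> Square5).
    All parameter values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    centers = np.array([[0, 0], [offset, 0], [0, offset], [offset, offset]])
    points, labels = [], []
    for label, center in enumerate(centers):
        points.append(rng.normal(loc=center, scale=sigma, size=(n_per_cluster, 2)))
        labels.append(np.full(n_per_cluster, label))
    return np.vstack(points), np.concatenate(labels)
```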

Fig. 2. Case Study I - Point sets formed by clusters with different inter-cluster distances: (a) Square1, (b) Square3 and (c) Square5.

Case Study II (CS-II:Sizes) also uses three point sets, each created with four clusters. The differences among the three point sets, (a) Sizes1, (b) Sizes3 and (c) Sizes5, lie in the sizes (number of data points) of their corresponding clusters, as shown in Fig. 3. The three point sets were created based on Square1 (Fig. 2(a)), with changes in the relative cluster sizes such that the ratio of the smaller cluster to its immediately larger one was 2, 6 and 10, respectively. It is important to notice, though, that the spread of the clusters was kept constant. Sizes was created for investigating the sensitivity of the algorithm to clusters of different sizes.

Fig. 3. Case Study II - Point sets used in the experiments, formed by clusters of different sizes (number of points): (a) Sizes1, (b) Sizes3 and (c) Sizes5.

Case Study III (CS-III:Aggregation). The seventh point set, named Aggregation (shown in Fig. 4), consists of seven clusters: six spherical-shaped clusters (labeled 1, 3, 4, 5, 6 and 7 in Fig. 4) and one non-spherical cluster (labeled 2 in Fig. 4). The point set was created based on the one used in [8] for experiments related to the clustering aggregation problem. This set of points has characteristics that are known to create difficulties for many agglomerative algorithms, such as narrow “bridges” between clusters, uneven-sized clusters, clusters with different diameters, etc. On the one hand, agglomerative clustering algorithms based on the nearest neighbor, such as single linkage, tend to join groups that touch each other, such as clusters 3 and 6 as well as 4 and 7 in Fig. 4. On the other hand, agglomerative clustering algorithms based on the furthest neighbor, such as complete linkage, tend to break large clusters (such as cluster 4 in Fig. 4), based on the diameters of the small ones, such as clusters 1, 5 and 7. Table 1 presents a summary of the point sets involved in the experiments, describing their basic characteristics.

Fig. 4. Case Study III - Point set (Aggregation) formed by clusters with different diameters. Each cluster has been labeled for reference.

Table 1. Summary of the seven synthetic point sets. #NP: no. of points (size), #NC: no. of clusters. (*)No. of points taking into account the label ordering in the Aggregation data.

4 Validation Indices, Experiments, Results and Analysis

This section presents the empirical results obtained by the AGNES algorithm [3], considering the three case studies described in the previous section, involving seven data point sets. For the experiments described in this section, AGNES and K-Means [6] were implemented in C# and run under a Microsoft Windows platform. The K-Means algorithm was used for comparative purposes only. The results from running both algorithms on the same input data are presented in the following three tables, where the best results are bold faced. To make a fair comparison, the number of clusters (K) supplied by the user was the same for both algorithms. It is important to mention that all instances in the seven point sets were previously labeled (i.e., each has an associated class attribute) in order to allow for external validation. The inherent separation easily perceived between the clusters in the point sets was employed for class assignment.

The results obtained were evaluated using two internal validation indices: the Dunn index (D) [9] and the Davies-Bouldin index (DB) [10]. To quantify the number of data points incorrectly assigned (using the original classes previously assigned as part of the description of each data point in all the seven point sets), the external validation Rand index (R) [11] was also used.

Consider the following notation: S, the point set, clustered into clusters C1, C2, …, CNC; |S|, the number of data points in S; Ci, the i-th cluster of the clustering; ni, the number of data points in Ci; Ex, Ey, two data points, whose distance is represented as dist(Ex, Ey). The Dunn index (D) is defined by Eq. (4).

$$ D = \min_{i} \left\{ \min_{j \ne i} \frac{\min_{E_x \in C_i,\, E_y \in C_j} \operatorname{dist}(E_x, E_y)}{\max_{k}\, \max_{E_x, E_y \in C_k} \operatorname{dist}(E_x, E_y)} \right\} $$
(4)
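A minimal Python rendering of Eq. (4) is sketched below, as an illustration only (not the implementation used in the experiments), assuming the point set is given as a NumPy array and the cluster assignment as an integer label vector.

```python
import numpy as np

def dunn_index(points, labels):
    """Dunn index (Eq. 4): smallest inter-cluster distance divided by the
    largest cluster diameter. Higher values indicate better clusterings."""
    clusters = [points[labels == c] for c in np.unique(labels)]
    # denominator B: largest within-cluster pairwise distance (cluster diameter)
    max_diameter = max(
        np.max(np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1))
        for c in clusters)
    # numerator A: smallest distance between points of two different clusters
    min_separation = min(
        np.min(np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=-1))
        for a, ci in enumerate(clusters) for cj in clusters[a + 1:])
    return min_separation / max_diameter
```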

For defining the Davies-Bouldin index (DB) as in [10], first let si be a measure of the dispersion of a cluster Ci (i.e., a measure of its spread around its mean vector) and let d(Ci, Cj) = dij be the dissimilarity between two clusters, using an appropriate dissimilarity measure (e.g., distance). The similarity index Rij between Ci and Cj is given by Rij = (si + sj)/dij, provided that dij is symmetric. Let Ri be defined as Ri = max{Rij : j = 1, …, NC, j ≠ i}, for i = 1, …, NC. Then, the Davies-Bouldin index DB is defined by Eq. (5).

$$ DB = \frac{1}{NC} \sum_{i = 1}^{NC} R_i $$
(5)
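The sketch below illustrates Eq. (5) under two common choices, assumed here because the definition above leaves them open: the dispersion si is taken as the mean distance of the points of Ci to its centroid, and dij as the Euclidean distance between centroids.

```python
import numpy as np

def davies_bouldin(points, labels):
    """Davies-Bouldin index (Eq. 5); lower values indicate better clusterings.

    Assumed choices: s_i = mean distance of the points of C_i to its centroid,
    d_ij = Euclidean distance between the centroids of C_i and C_j.
    """
    clusters = [points[labels == c] for c in np.unique(labels)]
    centroids = [c.mean(axis=0) for c in clusters]
    s = [np.mean(np.linalg.norm(c - m, axis=1)) for c, m in zip(clusters, centroids)]
    nc = len(clusters)
    r = [max((s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
             for j in range(nc) if j != i)
         for i in range(nc)]
    return sum(r) / nc   # Eq. (5): average of the R_i values
```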

In data clustering, the value of the Rand index [11] can be seen as a measure of the similarity between two data clusterings. For the experiments presented next, one of the clusterings will be the one induced by the clustering algorithm and the other, the one provided externally, by previously assigning a class to each data point in each of the seven point sets considered. So, in a general setup, given the notation previously introduced and considering that one of the clusterings of the point set S is given as O = {O1, O2, …, ONO} and the other as C = {C1, C2, …, CNC}, the Rand index is composed of the following counts:

  (i) a: number of pairs of points in S that are in the same set in O and in the same set in C;

  (ii) b: number of pairs of points in S that are in different sets in O and in different sets in C;

  (iii) c: number of pairs of points in S that are in the same set in O and in different sets in C; and

  (iv) d: number of pairs of points in S that are in different sets in O and in the same set in C.

So the Rand index (R) is given by Eq. (6).

$$ R = \frac{a + b}{a + b + c + d} $$
(6)

Intuitively (a + b) can be thought of as the number of agreements between O and C and (c + d) as the number of disagreements between O and C.
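The counts a, b, c and d translate directly into code; the minimal sketch below computes Eq. (6) from two label vectors, one for the externally provided clustering O and one for the induced clustering C (an illustration only, not the implementation used in the experiments).

```python
from itertools import combinations

def rand_index(labels_o, labels_c):
    """Rand index (Eq. 6): fraction of point pairs on which the clusterings
    O (labels_o) and C (labels_c) agree."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_o)), 2):
        same_o = labels_o[i] == labels_o[j]
        same_c = labels_c[i] == labels_c[j]
        if same_o and same_c:
            a += 1          # same set in O and same set in C
        elif not same_o and not same_c:
            b += 1          # different sets in O and in C
        elif same_o:
            c += 1          # same set in O, different sets in C
        else:
            d += 1          # different sets in O, same set in C
    return (a + b) / (a + b + c + d)
```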

In the following tables, (+) and (−) report the best and the worst results of K-Means, respectively. As far as the point sets of CS-I:Squares are concerned, the experiment results shown in Table 2 indicate that both algorithms, AGNES and K-Means, had similar performance. It is important to mention, however, that AGNES performed slightly better than K-Means when taking into account the K-Means worst-case results. Nevertheless, considering the CS-I results, AGNES and K-Means share similar performances when clustering point sets whose clusters differ in their inter-cluster distances.

Table 2. Clustering results for case study CS-I:Squares – Point set in Fig. 2.

The numbers obtained from the experiments in CS-II:Sizes, shown in Table 3, suggest that, as the sizes (i.e., number of points) of the four clusters change, the performances of AGNES and K-Means also change.

Table 3. Clustering results for case study CS-II:Sizes – Point set in Fig. 3.

The three validation indices used to evaluate the quality of the clusterings induced by AGNES suggest different outcomes. The values of the D index imply that AGNES achieved its best performance on the Sizes5 data point set; however, the DB and the R indices suggest that the best results were obtained when running AGNES on Sizes1. Taking into account only the clustering results from AGNES, its best performance was achieved with Sizes1 as input. Comparing the AGNES and K-Means results obtained in CS-II:Sizes, it is clear that K-Means had the best performance as far as the values of the DB and R indices are concerned.

Case Study III (CS-III:Aggregation) focuses on clustering experiments having, as input, the Aggregation point set, shown in Fig. 4. The values of the three validation indexes applied to the clustering results obtained from several experiments using Aggregation as input are presented in Table 4.

Table 4. Clustering results for case study CS-III:Aggregation – Point set in Fig. 4.

The experiments were conducted following a different methodology from the one used in the previous two case studies. It was decided to initially consider Aggregation as having only clusters 1 and 6 and then to gradually grow the point set by adding some of its clusters to it, until reaching the 7 clusters it effectively has. By doing so, it was expected that the experiment would make it possible to investigate which point set configurations would have more impact on AGNES performance.

The Aggregation point set is initially formed by clusters 1 and 6; the two clusters differ with regard to their sizes, diameters and shapes, but are well separated, and their inter-cluster distance promotes a good performance of both AGNES and K-Means. The values of the Rand index in Table 4 suggest that both algorithms correctly identified clusters 1 and 6. Next, the clustering results shown in Table 4 are from running AGNES and K-Means on the previous Aggregation point set with cluster 3 added. Such an addition did not decrease AGNES performance, in spite of the “bridge” between clusters 3 and 6. K-Means, however, failed to identify some points from cluster 3. In sequence, Aggregation was modified again, by replacing clusters 3 and 6 with cluster 2. In spite of clusters 1 and 2 having different shapes, which could eventually create difficulties for some clustering algorithms, their differences in shape were not substantial enough and AGNES identified both correctly. K-Means, however, could not fully identify points from cluster 2 due to their distance from the cluster’s centroid.

The next modification of Aggregation, at this point having clusters 1 and 2, was the addition of cluster 6. AGNES was able to correctly detect the three clusters (1, 2 and 6), while K-Means failed again to identify points in cluster 2. Aggregation received another addition, this time cluster 3. With this configuration neither algorithm, AGNES nor K-Means, identified the four clusters correctly; yet, AGNES was the one that came closer. Next, by adding cluster 4 to the previous Aggregation, K-Means still had trouble fully detecting all clusters, while AGNES recovered its optimal performance, since with the addition of cluster 4 the average distance between the five clusters changed. Moreover, it can be easily noticed that, by adding cluster 5 to Aggregation, the average cluster distances computed by AGNES were negatively affected. In spite of that, AGNES still had a much better performance than K-Means. Finally, Aggregation was restored to its original seven clusters, as in Fig. 4. Notice that the addition of cluster 7 did not decrease AGNES performance and the clustering it induced was very close to the optimal result, as confirmed by its Rand index of 0.998.

5 Conclusions

This paper addresses the use of the agglomerative hierarchical clustering algorithm AGNES in unsupervised tasks involving 7 point sets, grouped into three case studies: CS-I:Squares, which focuses on inter-cluster distance; CS-II:Sizes, with focus on clusters of different sizes; and CS-III:Aggregation, involving mainly different shapes and diameters. As far as the case studies CS-I:Squares and CS-II:Sizes are concerned, AGNES and K-Means results were similarly evaluated; however, in CS-I:Squares AGNES had a slightly better performance than K-Means when taking into account the K-Means worst-case results. In the third case study, CS-III:Aggregation, in most experiments the best results were obtained with AGNES. In spite of AGNES being more strongly affected by the inter-cluster distance than by any of the other chosen characteristics, such as size, diameter or shape (except for elongated clusters, where single-linkage-based algorithms usually perform better), the algorithm was still very robust considering a combination of all the chosen characteristics.