
1 Introduction

Clustering is an unsupervised machine learning task that is considered one of the most important data analysis techniques in data mining. In a clustering problem, unlabeled data objects are to be partitioned into a certain number of groups (also called clusters) based on their attribute values. The objective is that objects in a cluster are more similar to each other than to objects in other clusters [1,2,3,4]. In geometrical terms, the objects can be viewed as points in an a-dimensional space, where a is the number of attributes. Clustering partitions these points into groups, so that points in a group are located near one another in space.

There are several categories of clustering algorithms. The most traditional are the algorithms based on partition, hierarchy, fuzzy theory, distribution, density, graph theory, grid, fractal theory and model. The basic and core ideas of a variety of commonly used clustering algorithms, a comprehensive comparison and an analysis of their advantages and disadvantages are summarized in an interesting review of clustering algorithms [5]. The most used are the partitioning clustering methods, K-means clustering being the most popular [6]. K-means clustering (and other K-means combinations) subdivides the data set into K clusters, where K is specified in advance. Each cluster is represented by the centroid (mean) of the data points belonging to that cluster, and the clustering is based on the minimization of the overall sum of the squared errors between the points and their corresponding cluster centers. While partition clustering constructs various partitions of the data set and uses some criterion to evaluate them, hierarchical clustering creates a hierarchy of clusters by combining the data points into clusters, then combining these clusters into larger clusters, and so on. It does not require the number of clusters in advance. However, once a merge or a split decision has been made, pure hierarchical clustering does not make adjustments. To overcome this limitation, hierarchical clustering can be integrated with other techniques for multiple-phase clustering. Linkage algorithms are agglomerative hierarchical methods that merge clusters based on the distance between them. For example, single-link is a type of linkage algorithm that merges the two clusters with the smallest minimum pairwise distance [7]. K-means clustering easily fails when the cluster shapes depart from the hyperspherical. On the other hand, the single-linkage clustering method is affected by the presence of outliers and by differences in the density of clusters, although it is not sensitive to the shape and size of clusters. Model-based clustering assumes that the data set comes from a distribution that is a mixture of two or more clusters [8]. Unlike K-means, model-based clustering uses a soft assignment, where each point has a probability of belonging to each cluster. Since clustering can be seen as an optimization problem, well-known optimization algorithms may be applied in the cluster analysis, and a variety of them have been combined with K-means clustering, e.g. [9,10,11].

Unlike K-means and pure hierarchical clustering, the herein proposed partitioning clustering algorithm has mechanisms that dynamically adjust the number of clusters. These mechanisms are able to add, remove and merge clusters depending on some pre-defined conditions. Although an initial number of clusters is required to start the point assignment process, the proposed clustering algorithm is able to create new clusters: points that were not assigned to a cluster (using a weighted expected distance from a particular center to the data points) are gathered in a new cluster. The algorithm is also able to merge two nearby clusters and to remove clusters that have very few points. Although these mechanisms are in a certain way similar to others in the literature, this paper gives a relevant contribution to the object similarity issue. Similarity is usually tackled indirectly by using a distance measure to quantify the degree of dissimilarity among objects, so that more similar objects have lower dissimilarity values. A probabilistic approach is proposed to define the weighted expected distances from the cluster centers to the data points. These weighted distances rely on the average variability of the distances of the points to each center with respect to their estimated expected distances: the larger the variability, the smaller the weighted distance. The weighted expected distances are then used to assign points to clusters (which are represented by their centers).

The paper is organized as follows. Section 2 briefly describes the ideas of a cluster analysis and Sect. 3 presents the details of the proposed weighted distances clustering algorithm. In Sect. 4, the results of clustering four sets of data points with two attributes, one set with four attributes, and another with thirteen attributes are shown. Finally, Sect. 5 contains the conclusions of this work.

2 Cluster Analysis

Assume that n objects, each with a attributes, are given. These objects can also be represented by a data matrix \(\mathbf{X} \) with n vectors/points of dimension a. Each element \(X_{i,j}\) corresponds to the jth attribute of the ith point. Thus, given \(\mathbf{X} \), a partitioning clustering algorithm tries to find a partition \(\mathcal {C} = \{C_1, C_2, \ldots , C_K\}\) of K clusters (or groups), so that the similarity of the points in the same cluster is maximum and points from different clusters differ as much as possible. The partition is required to satisfy three conditions:

  1. each cluster should have at least one point, i.e., \(|C_k| \ne 0,\, k=1,\ldots ,K\);

  2. a point should not belong to two different clusters, i.e., \(C_k \bigcap C_l = \emptyset \), for \(k,l=1,\ldots ,K, \, k\ne l\);

  3. each point should belong to a cluster, i.e., \(\sum _{k=1}^K |C_k| = n\);

where \(|C_k|\) is the number of points in cluster \(C_k\). Since there are a number of ways to partition the points while maintaining these properties, a fitness function should be provided so that the adequacy of the partitioning can be evaluated. Therefore, the clustering problem can be stated as finding an optimal solution, i.e., a partition \(\mathcal {C}^*\) that gives the optimal (or near-optimal) adequacy when compared to all other feasible solutions.
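As a concrete illustration (not part of the paper), the MATLAB sketch below checks the three conditions for a clustering encoded as a label vector, where labels(i) = k means that the ith point belongs to \(C_k\); the function name and argument conventions are ours.

```matlab
% Sketch (ours, not the authors' code): verify the three partition conditions
% for a clustering encoded as a label vector (labels(i) = k means X_i is in C_k).
% With this encoding, conditions 2 and 3 hold as long as every entry is an
% integer in 1..K, so the remaining check is that no cluster is empty.
function ok = is_valid_partition(labels, K, n)
  ok = numel(labels) == n && all(labels >= 1 & labels <= K);   % conditions 2-3
  if ok
    counts = accumarray(labels(:), 1, [K, 1]);                 % |C_k|, k = 1..K
    ok = all(counts >= 1);                                     % condition 1
  end
end
```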

3 Weighted Distances Clustering Algorithm

Distance based clustering is a very popular technique to partition data points into clusters, and can be used in a large variety of applications. In this section, a probabilistic approach is described that relies on the weighted distances (WD) of each data point \(X_i, \, i=1,\ldots ,n\) relative to the cluster centers \(m_k, \, k=1,\ldots ,K\). These WD are used to assign the data points to clusters.

3.1 Weighted Expected Distances

To compute the WD between the data point \(X_i\) and the center \(m_k\), we resort to the distance vectors \(DV_k, \, k=1,\ldots ,K\), and to the variations \(VAR_k, \, k=1,\ldots ,K\), which measure the variability of the distances to the cluster center \(m_k\) over all data points. Thus, let \(DV_k = \left( DV_{k,1}, DV_{k,2}, \ldots , DV_{k,n} \right) \) be an n-dimensional distance vector that contains the distances of the data points \(X_1, X_2, \ldots , X_n\) (a-dimensional vectors) to the cluster center \(m_k \in \mathbb {R}^a\), where \(DV_{k,i} = \Vert X_i - m_k \Vert _2\) (\(i=1,\ldots ,n\)). The componentwise total of the distance vectors, relative to a particular data point \(X_i\), is given by:

$$\begin{aligned} T_i = \sum _{k=1}^K DV_{k,i}. \end{aligned}$$

The probability of \(m_k\) being surrounded and having a good centrality relative to all points in the data set is given by

$$\begin{aligned} p_k = \dfrac{\sum _{i=1}^n DV_{k,i}}{\sum _{i=1}^n T_i}. \end{aligned}$$
(1)

Therefore, the expected value of the distance vector component i (relative to point \(X_i\)) of the vector \(DV_k\) is \(\mathrm {E}[DV_{k,i}] = p_k T_i\). Then, the WD is defined for each component i of the vector \(DV_{k}\), denoted WD(k,i), using

$$\begin{aligned} WD(k,i) = DV_{k,i} / VAR_k \end{aligned}$$
(2)

(corresponding to the weighted distance assigned to data point \(X_i\) by the cluster center \(m_k\)), where the variation \(VAR_k\) is the average amount of variability of the distances of the points to the center \(m_k\), relative to the expected distances:

$$\begin{aligned} VAR_k = \left( \frac{1}{n} \sum _{i=1}^n \left( DV_{k,i} - \mathrm {E}[DV_{k,i}] \right) ^2 \right) ^{1/2}. \end{aligned}$$

The larger the \(VAR_k\), the lower the weighted distances \(WD(k,i), i=1,\ldots ,n\), of the points relative to that center \(m_k\), when compared to the other centers. A larger \(VAR_k\) means that, on average, the difference between the estimated expected distances and the real distances to that center \(m_k\) is larger than for the other centers.
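For concreteness, the MATLAB sketch below computes the matrix of weighted distances directly from the definitions above (the function name and the conventions of X as an n-by-a data matrix and m as a K-by-a matrix of centers are our own, not from the paper).

```matlab
% Sketch of Sect. 3.1: distance vectors DV, probabilities p_k of Eq. (1),
% variations VAR_k and weighted distances WD(k,i) of Eq. (2).
function WD = weighted_distances(X, m)
  n = size(X, 1);
  K = size(m, 1);
  DV = zeros(K, n);                      % DV(k,i) = ||X_i - m_k||_2
  for k = 1:K
    DV(k, :) = sqrt(sum((X - m(k, :)).^2, 2))';
  end
  T   = sum(DV, 1);                      % T_i = sum_k DV(k,i)
  p   = sum(DV, 2) / sum(T);             % Eq. (1): centrality probability p_k
  E   = p * T;                           % E[DV(k,i)] = p_k * T_i  (K-by-n)
  VAR = sqrt(mean((DV - E).^2, 2));      % average variability per center
  WD  = DV ./ VAR;                       % Eq. (2): WD(k,i) = DV(k,i) / VAR_k
end
```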

3.2 The WD-Based Algorithm

Algorithm 1 presents the main steps of the proposed weighted expected distances clustering algorithm. For the initialization, a problem-dependent number of clusters is provided, e.g. K, so that the algorithm randomly selects K points from the data set to initialize the cluster centers. To determine which data points are assigned to a cluster, the algorithm uses the WD as a fitness function. Each point \(X_i\) has a WD relative to a cluster center \(m_k\), the above defined WD(k,i); the lower the value, the better. The minimization of WD(k,i) with respect to k takes care of the local property of the algorithm, assigning \(X_i\) to the cluster \(C_k\).

[Algorithm 1 (pseudocode figure)]

Thus, to assign a point \(X_i\) to a cluster, the algorithm identifies the center \(m_k\) that has the lowest weighted distance value to that point, and if WD(k,i) is less than the quantity \(1.5 \eta _k\), where \(\eta _k\) is a threshold value for the center \(m_k\), the point is assigned to that cluster. On the other hand, if WD(k,i) is between \(1.5 \eta _k\) and \(2.5 \eta _k\), a new cluster \(C_{K+1}\) is created and the point is assigned to \(C_{K+1}\); otherwise, the WD(k,i) value exceeds \(2.5 \eta _k\) and the point is considered ‘noise’. Thus, all points whose minimum weighted distance to a specific center \(m_k\) lies between \(1.5 \eta _k\) and \(2.5 \eta _k\) are assigned to the same cluster \(C_{K+1}\), and if the minimum weighted distances exceed \(2.5 \eta _k\), the points are ‘noise’. The goal of the threshold \(\eta _k\) is to define a bound on the proximity to a cluster center: the limit \(1.5 \eta _k\) defines the region of a good neighborhood, and beyond \(2.5 \eta _k\) lies the region of ‘noise’. Similarity in the weighted distances and proximity to a cluster center \(m_k\) are decided using the threshold \(\eta _k\). For each cluster k, \(\eta _k\) is obtained by averaging the WD(k, .) of the points which are in the neighborhood of \(m_k\) and have magnitudes similar to each other. The similarity between the magnitudes of the WD is measured by sorting their values and checking whether the consecutive differences remain lower than a threshold \(\varDelta _{WD}\). This value depends on the number of data points and on the median of the WD matrix values (see Algorithm 2 for the details):

$$\begin{aligned} \varDelta _{WD} = \frac{n-1}{n} median[WD(k,i)]_{k=1,\ldots ,K, \,i=1,\ldots ,n}. \end{aligned}$$
(3)
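The MATLAB sketch below illustrates one possible reading of this step (it is our interpretation of Algorithm 2, not the authors' code): \(\eta _k\) averages the leading run of the sorted WD(k, .) values whose consecutive differences stay below \(\varDelta _{WD}\) of Eq. (3), and the points are then assigned with the \(1.5\eta _k\)/\(2.5\eta _k\) rule, using label \(K+1\) for the new cluster and 0 for ‘noise’ (our encoding).

```matlab
% Sketch (our reading of Algorithm 2): thresholds eta_k from the WD matrix,
% followed by the 1.5*eta_k / 2.5*eta_k assignment rule.
function eta = cluster_thresholds(WD)
  [K, n] = size(WD);
  Delta = (n - 1) / n * median(WD(:));       % Eq. (3)
  eta = zeros(K, 1);
  for k = 1:K
    s = sort(WD(k, :));                      % weighted distances in increasing order
    run = 1;                                 % at least the nearest point
    while run < n && (s(run + 1) - s(run)) < Delta
      run = run + 1;                         % consecutive differences stay below Delta
    end
    eta(k) = mean(s(1:run));                 % average of the similar WD values
  end
end

function labels = assign_points(WD, eta)
  [K, n] = size(WD);
  labels = zeros(n, 1);
  for i = 1:n
    [w, k] = min(WD(:, i));                  % closest center in the WD sense
    if w < 1.5 * eta(k)
      labels(i) = k;                         % good neighborhood of m_k
    elseif w < 2.5 * eta(k)
      labels(i) = K + 1;                     % gathered into the new cluster C_{K+1}
    else
      labels(i) = 0;                         % 'noise'
    end
  end
end
```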

During the iterative process, clusters that are close to each other may be merged if the distance between their centers is below a threshold, herein denoted as \(\eta _D\). This parameter value is defined as a function of the search region of the data points, and it also depends on the current number of clusters and on the number of attributes of the data set:

$$\begin{aligned} \eta _D = \dfrac{\min _{j=1,\ldots ,a} \{\max _{i} X _{i, j} - \min _{i} X _{i, j}\}}{K (a-1)}. \end{aligned}$$
(4)

Furthermore, all clusters that have a relatively small number of points, i.e., all clusters that satisfy \(|C_{k}| < N_{\min }\), where \(N_{\min } = \min \{50, \max \{2, \lceil 0.05 n\rceil \}\}\), are removed, their centers are deleted, and their points are marked as ‘noise’ (for the current iteration), see Algorithm 3.
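A hedged MATLAB sketch of these merge and removal tests follows (our interpretation of Algorithm 3; in particular, replacing two merged centers by their midpoint is an assumption, since the paper does not state how the merged center is formed).

```matlab
% Sketch (our reading of Algorithm 3): merge clusters whose centers are closer
% than eta_D (Eq. (4)) and remove clusters with fewer than N_min points,
% marking their points as 'noise' (label 0). m is K-by-a, X is n-by-a,
% labels is n-by-1 (our conventions).
function [m, labels] = merge_and_remove(m, X, labels)
  [n, a] = size(X);
  K = size(m, 1);
  eta_D = min(max(X, [], 1) - min(X, [], 1)) / (K * (a - 1));   % Eq. (4)
  for k = 1:K-1
    for l = k+1:K
      if norm(m(k, :) - m(l, :)) < eta_D
        m(k, :) = (m(k, :) + m(l, :)) / 2;    % assumption: merged center = midpoint
        labels(labels == l) = k;              % move the points of C_l into C_k
        m(l, :) = [];
        labels(labels > l) = labels(labels > l) - 1;
        [m, labels] = merge_and_remove(m, X, labels);   % re-check after a merge
        return
      end
    end
  end
  N_min = min(50, max(2, ceil(0.05 * n)));
  for k = size(m, 1):-1:1
    if nnz(labels == k) < N_min
      labels(labels == k) = 0;                % points become 'noise' this iteration
      m(k, :) = [];
      labels(labels > k) = labels(labels > k) - 1;
    end
  end
end
```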

[Algorithm 2 (pseudocode figure)]

Summarizing, to achieve the optimal number of clusters, Algorithm 1 proceeds iteratively by:

  • adding a new cluster whose center is represented by the centroid of all points that were not assigned to any cluster according to the rules previously defined;

  • marking points as ‘noise’ if they are not in an appropriate neighborhood of the center that is closest to them, or if they belong to a cluster with very few points;

  • merging clusters that are sufficiently close to each other, as well as removing clusters with very few points.

In addition, the position of each cluster center can be iteratively improved by using the centroid of the data points that have been assigned to that cluster together with the previous cluster center [12, 13], as shown in Algorithm 4.
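As an illustration only, the sketch below shows one plausible form of this update; the equal-weight blend of the cluster centroid with the previous center is our assumption, since the paper only states that both quantities are used [12, 13].

```matlab
% Hedged sketch of the center update of Algorithm 4. The equal-weight blend of
% the centroid with the previous center is an assumption, not necessarily the
% exact rule used by the authors.
function m = update_centers(m, X, labels)
  for k = 1:size(m, 1)
    pts = X(labels == k, :);                  % points currently assigned to C_k
    if ~isempty(pts)
      m(k, :) = (mean(pts, 1) + m(k, :)) / 2; % centroid blended with old center
    end
  end
end
```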

[Algorithm 3 and Algorithm 4 (pseudocode figures)]

4 Computational Results

In this section, some preliminary results are shown. Four sets of data points with two attributes, one set with four attributes (known as ‘Iris’) and one set with thirteen attributes (known as ‘Wine’) are used to compute and visualize the partitioning clustering. The algorithms are coded in MATLAB®. The performance of Algorithm 1, which relies on Algorithm 2, Algorithm 3 and Algorithm 4, depends on some parameter values that are problem dependent and dynamically defined: the initial \(K = \min \{10,\max \{2, \lceil 0.01 n\rceil \}\}\), \(N_{\min } = \min \{50,\max \{2, \lceil 0.05 n\rceil \}\}\), \(\varDelta _{WD}> 0\), \(\eta _k >0 , k=1,\ldots ,K\), and \(\eta _D >0\). The clustering results were obtained with \(It_{\max } =5\), except for ‘Iris’ and ‘Wine’, where \(It_{\max } =20\) was used.
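Written out as MATLAB expressions (directly from the formulas above, with n denoting the number of data points), these dynamically defined sizes are:

```matlab
K0    = min(10, max(2, ceil(0.01 * n)));   % initial number of clusters
N_min = min(50, max(2, ceil(0.05 * n)));   % minimum admissible cluster size
% e.g., for the 'Iris' data set (n = 150): K0 = 2 and N_min = 8
```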

4.1 Generated Data Sets

The generator code for the first four data sets is the following:

Problem 1

‘mydata’ with 300 data points and two attributes.

Problem 2

2000 data points and two attributes.

[generator code figure]

Problem 3

200 data points and two attributes.

[generator code figure]

Problem 4

300 data points and two attributes.

[generator code figure]

In Figs. 1, 2, 3 and 4, the results obtained for Problems 1–4 are depicted:

  • the plot (a) is the result of assigning the points (applying Algorithm 2) to the clusters based on the initialized centers (randomly selected K points of the data set);

  • the plot (a) may contain a point, represented by the symbol ‘\(\diamond \)’, which identifies the position of the center of the cluster that is added in Algorithm 2 with the points that were not assigned (according to the previously referred rule) to a cluster;

  • plots (b) and (c) are obtained after the application of Algorithm 4 to update the centers that result from the clustering obtained by Algorithm 2 and Algorithm 3, at an iteration;

  • the final clustering is visualized in plot (d);

  • plots (a), (b), (c) or (d) may contain ‘noise’ points, identified by the symbol ‘\(\times \)’;

  • plot (e) shows the values of the Davies-Bouldin (DB) index, at each iteration, obtained after assigning the points to clusters (according to the previously referred rule) and after the update of the cluster centers;

  • plot (f) contains the K-means clustering when \(K=K^*\) is provided to the algorithm.

The code that implements the K-means clustering [14] is used for comparative purposes, although with K-means the number of clusters K had to be specified in advance. The performances of the tested clustering algorithms are measured in terms of a cluster validity measure, the DB index [15]. The DB index aims to evaluate intra-cluster similarity and inter-cluster differences by computing

$$\begin{aligned} DB = \frac{1}{K} \sum _{k=1}^K \max _{l=1,\ldots ,K, \, l\ne k} \left\{ \dfrac{S_{k} + S_{l}}{d_{k, l}} \right\} \end{aligned}$$
(5)

where \(S_{k}\) (resp. \(S_{l}\)) represents the average of all the distances between the center \(m_k\) (resp. \(m_l\)) and the points in cluster \(C_k\) (resp. \(C_l\)) and \(d_{k, l}\) is the distance between \(m_k\) and \(m_l\). The smaller the DB index, the better the partition.
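A straightforward MATLAB sketch of Eq. (5) follows, under our own naming conventions (m is the K-by-a matrix of centers, X is the n-by-a data matrix, and labels assigns each point to a cluster in 1..K).

```matlab
% Sketch of the Davies-Bouldin index of Eq. (5); clusters are assumed non-empty.
function db = davies_bouldin(m, X, labels)
  K = size(m, 1);
  S = zeros(K, 1);
  for k = 1:K
    pts = X(labels == k, :);
    S(k) = mean(sqrt(sum((pts - m(k, :)).^2, 2)));   % average distance to m_k
  end
  db = 0;
  for k = 1:K
    R = -inf;
    for l = 1:K
      if l ~= k
        R = max(R, (S(k) + S(l)) / norm(m(k, :) - m(l, :)));
      end
    end
    db = db + R;                                     % worst-case ratio for cluster k
  end
  db = db / K;
end
```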

The computation of the DB index assumes that a clustering has been done. The values presented in plots (a) (on the x-axis), (b), (c) and (d) (on the y-axis), and (e) are obtained from the herein proposed clustering (recall Algorithm 2, Algorithm 3 and Algorithm 4). The values of the ‘DB index(m,X)’ and ‘CS index(m,X)’ (on the y-axis) of plot (a) come from assigning the points to clusters based on the usual minimum distances of points to centers. The term CS index refers to the CS validity measure and is a function of the ratio of the sum of within-cluster scatter to between-cluster separation [13]. The quantity ‘Sum of Within-Cluster Distance (WCD)’ (in plots (b), (c) and (d), on the x-axis) was obtained from our proposed clustering process.

In conclusion, the proposed clustering is effective, efficient and robust. As can be seen, the DB index values that result from our clustering are slightly lower than (or equal to) those registered by the K-means clustering.

Fig. 1. Clustering process of Problem 1

Fig. 2. Clustering process of Problem 2

Fig. 3. Clustering process of Problem 3

Fig. 4. Clustering process of Problem 4

4.2 ‘Iris’ and ‘Wine’ Data Sets

The results of our clustering algorithm when solving Problems 5 and 6 are now shown. To compare performance, our algorithm was run 30 times for each data set.

Problem 5

‘Iris’ with 150 data points. It contains three categories (types of iris plant) with 4 attributes (sepal length, sepal width, petal length and petal width) [16].

Problem 6

‘Wine’ with 178 data points. It contains chemical analysis of 178 wines derived from 3 different regions, with 13 attributes [16].

The results are compared to those of [17], which uses a particle swarm optimization (PSO) approach to clustering (see Table 1). When solving the ‘Iris’ problem, our algorithm finds the 3 clusters in 77\(\%\) of the runs (23 successful runs out of 30). When solving the ‘Wine’ problem, 27 out of 30 runs identified the 3 clusters.

Table 1 shows the final value of a fitness function, known as Sum of Within-Cluster Distance (WCD) (the best, the average (avg.), the worst and the standard deviation (St. Dev.) values over the successful runs). The third column of the table has the WCD that results from our clustering, i.e., after assigning the points to clusters (according to the above described rules based on the WD – Algorithm 2), merging and removing clusters if appropriate (Algorithm 3) and updating the centers (Algorithm 4), ‘WCD\(_{WD}\)’. The number of ‘noise’ points at the final clustering (of the best run, i.e., the run with the minimum value of ‘WCD\(_{WD}\)’), ‘# noise’, the number of iterations, It, the time in seconds, ‘time’, and the percentage of successful runs, ‘suc.’, are also shown in the table. The column identified with ‘WCD’ contains values of WCD (at the final clustering) using \(\mathbf{X} \) and the obtained final optimal cluster centers to assign the points to the clusters/centers based on the usual minimum distances from each point to the centers.
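For reference, the WCD fitness value can be computed as in the MATLAB sketch below; reading the Sum of Within-Cluster Distance as the total distance of the points to their assigned cluster centers is our interpretation, since the paper does not spell out the formula.

```matlab
% Sketch: WCD as the total distance of the points to their assigned centers
% (our reading of the fitness reported in Tables 1 and 2).
function wcd = sum_within_cluster_distance(m, X, labels)
  wcd = 0;
  for k = 1:size(m, 1)
    pts = X(labels == k, :);
    wcd = wcd + sum(sqrt(sum((pts - m(k, :)).^2, 2)));
  end
end
```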

Table 1. Clustering results for Problems 5 and 6
Table 2. Results comparison for Problems 5 and 6

For the same optimal clustering, ‘WCD’ always has a lower value than ‘WCD\(_{WD}\)’. When comparing the ‘WCD’ of our clustering with the ‘WCD’ registered in [17], our values are higher, in particular the average and the worst values, except for the best ‘WCD’ result for the ‘Wine’ problem. The variability of the results registered in [17] and [18] (see also Table 2) is rather small, but this is due to the type of clustering based on the PSO algorithm. The clustering algorithm K-NM-PSO [18] is a combination of K-means, the local optimization method Nelder-Mead (NM) and PSO. In contrast, our algorithm finds the optimal clustering much faster than the PSO clustering approach in [17]. This is a point in favor of our algorithm and an essential requirement when solving large data sets.

The results of our algorithm shown in Table 2 (third column, identified with ‘WCD\(_{D}\)’) are based on the previously described Algorithms 1, 2, 3 and 4, but using the usual distances between points and centers for the assignment process (over the iterative process). These values are slightly better than those obtained by K-means (available in [18]), but not as good as those obtained by K-NM-PSO (reported in [18]).

5 Conclusions

A probabilistic approach to define the weighted expected distances from the cluster centers to the data points is proposed. These weighted distances are used to assign points to clusters, which are represented by their centers. The new clustering algorithm relies on mechanisms to create new clusters: points that are not assigned to a cluster, based on the weighted expected distances, are gathered in a new cluster. The algorithm is also able to merge two nearby clusters and to remove clusters that have very few points. Points are also identified as ‘noise’ when they are beyond the limit of a reasonable neighborhood (as far as weighted distances are concerned) or when they belong to a cluster with very few points. The preliminary clustering results, and the comparisons with K-means and other K-means combinations with stochastic algorithms, show that the proposed algorithm is efficient and effective.

In the future, data sets with non-convex clusters and clusters of different shapes and sizes, in particular non-spherical blobs, will be addressed. Our proposal is to integrate a kernel function into the clustering approach. We also aim to further investigate the dependence of the algorithm's parameter values on the number of attributes, in particular when high-dimensional data sets are to be partitioned. The set of test problems will also be enlarged to include large data sets with non-convex clusters of different shapes.