1 Introduction

Clustering aims at grouping data into homogeneous subsets (called clusters), within which elements are similar to each other and dissimilar to elements of other clusters. The purpose of clustering is to discover natural structures existing in a data set. These techniques are widely used in various fields such as pattern recognition, image processing and data exploration. It should be noted that the large variety of data sets has given rise to many different clustering algorithms and configurations. Generally, two major categories of clustering methods are distinguished: partitioning and hierarchical clustering. Partitioning clustering relocates elements of a data set between clusters iteratively until a given clustering criterion is satisfied. Well-known algorithms of this type include K-means and its variations [5, 24] or Expectation Maximization (EM) [19]. On the other hand, hierarchical clustering is based on an agglomerative or divisive approach. Agglomerative hierarchical clustering starts from many clusters, which are then merged into larger ones until only one cluster remains. Conversely, divisive clustering methods start from a single cluster, which includes all elements of the data set, and split it into smaller clusters. Well-known agglomerative hierarchical clustering methods include Single-linkage, Complete-linkage and Average-linkage [16, 20, 25]. Nowadays, a large number of new clustering algorithms appear, e.g., [13, 14]. However, no single clustering algorithm produces optimal partitions for a wide variety of data sets. Moreover, the same clustering algorithm can create different partition schemes of the data depending on the choice of input parameters. Thus, the question of how to find the best fit of a partition scheme to a data set is still very relevant.

The process of evaluating partitioned data is a very difficult task known as cluster validation. In this evaluation process, the occurrence of the right clusters is most frequently assessed by validity indices. In the literature on the subject, cluster validation techniques are often classified into three groups: external, internal and relative validation [16, 31]. External validation techniques are based on prior knowledge about the data. Internal methods, on the other hand, use only the intrinsic properties of the data set. Relative techniques compare partition schemes of a data set created by changing the values of the input parameters of a clustering algorithm. The key parameter for many clustering methods is the number of clusters, and this is the one most frequently varied. The partitions are then compared, i.e. depending on the approach used, the maximum or the minimum value of a validity index determines the best fit of a partition scheme to the data set. So far, a number of authors have proposed different validity indices or modifications of existing indices, e.g., [1, 11, 12, 15, 17, 18, 32, 36], and new interesting solutions for cluster evaluation are constantly being suggested. For example, a stability index based on the variation of some information measures over partitions generated by a clustering model is presented in [23], and a new measure of distances between clusters is proposed in [30]. Papers [33, 37] present indices which use knee-point detection. It should be noted that cluster validity indices such as the Dunn [10], Davies-Bouldin (DB) [8], PBM [21] or Silhouette (SIL) [26] indices are very frequently used to evaluate the efficacy of newly proposed validity approaches in detecting the right data partitioning. It is important to note that clustering algorithms in conjunction with cluster validity indices can also be used in the process of designing various neural networks [2,3,4] and neuro-fuzzy structures [6, 7, 9, 27,28,29].

In this paper, an improvement of the Silhouette index is described. For this purpose, two new versions of this cluster validity index, called SILA and SILAv1, are presented. The first version, SILA, was described in paper [34]. The next version, SILAv1, uses an exponent defined by (9). A detailed explanation of the modifications involving the use of this component is presented in Sect. 2. In order to demonstrate the effectiveness of the validity indices, several experiments were performed on various data sets.

This paper is organized as follows: Sect. 2 presents a detailed description of the Silhouette index and an explanation of the proposed modifications of this index. Section 3 illustrates experimental results on data sets. Finally, Sect. 4 presents conclusions.

2 Improvement of the Silhouette Index

First, in this section the Silhouette index is described in more detail. Next, a modification of the index and an explanation of this change are presented.

2.1 A Detailed Description of the Silhouette Index

Let us denote a K-partition scheme of a data set X by C = \({\{C_1, C_2, ..., C_K\}}\), where \({C_k}\) indicates the \(k\)th cluster, \({k=1,...,K}\). Cluster compactness is measured based on the mean of within-cluster distances. The average distance \(a(\mathbf{x})\) between element \(\mathbf{x}\) and the other elements \(\mathbf{x}_k\) belonging to the same cluster is defined as:

$$\begin{aligned} a(\mathbf{x}) = \frac{1}{{n_k - 1}}\sum \limits _{\mathbf{x}_k \in C_k } {d\left( {\mathbf{x},\mathbf{x}_k } \right) } \end{aligned}$$
(1)

where \({n_k}\) is the number of elements in \({C_k}\) and \({d\left( {\mathbf{x},\mathbf{x}_k }\right) }\) is a function of the distance between \(\mathbf{x}\) and \(\mathbf{x}_k\).

Furthermore, the mean distance between \(\mathbf{x}\) and the elements \(\mathbf{x}_l\) belonging to another cluster \({C}_l\), where \(l = 1, ..., K\) and \(l \ne k\), can be written as:

$$\begin{aligned} \delta (\mathbf{x},\mathbf{x}_l) = \frac{1}{{n_l}}\sum \limits _{\mathbf{x}_l \in C_l } {d\left( {\mathbf{x} ,\mathbf{x}_l } \right) } \end{aligned}$$
(2)

where \({n_l}\) is the number of elements in \({C_l}\). Thus, the smallest of these mean distances \(\delta (\mathbf{x},\mathbf{x}_l)\) can be defined as:

$$\begin{aligned} b(\mathbf{x}) = \min \limits _{\substack{l = 1,\ldots ,K \\ l \ne k}} \delta (\mathbf{x},\mathbf{x}_l ) \end{aligned}$$
(3)

The so-called silhouette width of element \({\mathbf{x}}\) can be expressed as follows:

$$\begin{aligned} S({\mathbf{x}}) = {\frac{{b(\mathbf{x}) - a(\mathbf{x})}}{{\max \left( {a(\mathbf{x}),b(\mathbf{x})} \right) }}} \end{aligned}$$
(4)

Finally, the \({ Silhouette\ (SIL)}\)  index is defined as:

$$\begin{aligned} SIL = \frac{1}{n}\sum \limits _{\mathbf{x} \in X} {\frac{{b(\mathbf{x}) - a(\mathbf{x})}}{{\max (a(\mathbf{x}),b(\mathbf{x}))}}} \end{aligned}$$
(5)

where n is the number of elements in the data set X.

The value of the index ranges from −1 to 1, and a maximum value (close to 1) indicates the right partition scheme. Unfortunately, the index can indicate an incorrect data partition when the differences between cluster distances are large [34].
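To make the computation concrete, below is a minimal Python sketch that evaluates Eqs. (1)-(5) directly from a data matrix and a label vector. The function name silhouette_index and the convention of skipping singleton clusters (i.e. treating their silhouette width as 0) are our assumptions, not part of the original definition in [26].

```python
import numpy as np

def silhouette_index(X, labels):
    """Silhouette (SIL) index of a partition, following Eqs. (1)-(5)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    # Pairwise Euclidean distances d(x, x').
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    total = 0.0
    for i in range(n):
        same = labels == labels[i]
        n_k = same.sum()
        if n_k == 1:
            continue  # singleton cluster: silhouette width taken as 0 (assumption)
        # Eq. (1): mean within-cluster distance; d(x, x) = 0 drops out of the sum.
        a = D[i, same].sum() / (n_k - 1)
        # Eqs. (2)-(3): smallest mean distance to any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        # Eq. (4): silhouette width of element x.
        total += (b - a) / max(a, b)
    return total / n  # Eq. (5)
```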

2.2 Modification of the Silhouette Index

As mentioned above, the modification of the Silhouette index proposed in paper [34] involves using an additional component \(A(\mathbf{x})\), which corrects the values of the index. Thus, the new index, called the SILA index, was defined in that paper as follows:

$$\begin{aligned} SILA = \frac{1}{n}\left( {\sum \limits _{\mathbf{x} \in X} {\left( {S(\mathbf{x}) \cdot A(\mathbf{x})} \right) } } \right) \end{aligned}$$
(6)

where \(S(\mathbf{x})\) is the \({silhouette\ width}\) (Eq. (4)), whereas the additional component \(A(\mathbf{x})\) was expressed as:

$$\begin{aligned} A(\mathbf{x}) = \frac{1}{{\left( {1 + a(\mathbf{x})} \right) }} \end{aligned}$$
(7)

or it can be written in the more general form:

$$\begin{aligned} A(\mathbf{x}) = \frac{1}{{\left( {1 + a(\mathbf{x})} \right) ^{q} }} \end{aligned}$$
(8)

where the exponent \(q=1\).

Note that the value of the exponent \(q=1\) can be insufficient for large differences of distances between clusters. Hence, the new modification of the index includes the component \(A(\mathbf{x})\), in which the exponent q is defined as below:

$$\begin{aligned} q = 2 + \frac{{K^2 }}{n} \end{aligned}$$
(9)

where n is the number of elements in a data set. Thus, the new version of the index, called SILAv1, can be presented in the following way:

$$\begin{aligned} SILAv1 = \frac{1}{n}\left( {\sum \limits _{\mathbf{x} \in X} {\left( {\frac{{b(\mathbf{x}) - a(\mathbf{x})}}{{\max \left( {a(\mathbf{x}),b(\mathbf{x})} \right) }} \cdot \frac{1}{{\left( {1 + a(\mathbf{x})} \right) ^{q} }} } \right) } } \right) \end{aligned}$$
(10)

where q is expressed by (9).

This approach can ensure a better performance than the previous version of the index, the SILA, and its efficiency was demonstrated by the experiments carried out on different data sets. In the next subsection a detailed explanation of the modifications involving the use of the additional component is presented.
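For illustration, a minimal sketch of the modified index is given below; it extends the silhouette computation with the correcting component of Eq. (8) and sets the exponent from Eq. (9) by default. The function name silav1_index and the singleton-cluster convention are again assumptions of this sketch, not part of the original papers.

```python
import numpy as np

def silav1_index(X, labels, q=None):
    """SILAv1 index, Eq. (10); passing q=1 recovers the SILA index, Eqs. (6)-(8)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    K = len(clusters)
    if q is None:
        q = 2.0 + K ** 2 / n  # Eq. (9)
    total = 0.0
    for i in range(n):
        same = labels == labels[i]
        n_k = same.sum()
        if n_k == 1:
            continue  # singleton cluster: width taken as 0 (assumption)
        a = D[i, same].sum() / (n_k - 1)          # within-cluster mean, Eq. (1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s = (b - a) / max(a, b)                   # silhouette width, Eq. (4)
        A = 1.0 / (1.0 + a) ** q                  # correcting component, Eq. (8)
        total += s * A
    return total / n
```

Calling silav1_index(X, labels, q=1.0) reproduces the SILA index defined by Eqs. (6)-(8).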

2.3 Remarks

As mentioned above, the Silhouette index takes values between −1 and 1. Appropriate data partitioning is identified by a maximum value of the index, which can be close to 1. Notice that the definition of the \({silhouette\;width}\) can also be expressed as follows [26]:

$$\begin{aligned} S(\mathbf{x}) = \left\{ {\begin{array}{*{20}c} {1 - \frac{{a(\mathbf{x})}}{{b(\mathbf{x})}}} &{} {\quad if\quad b(\mathbf{x}) > a(\mathbf{x})} \\ 0 &{} {\quad if\quad b(\mathbf{x}) = a(\mathbf{x})} \\ {\frac{{b(\mathbf{x})}}{{a(\mathbf{x})}} - 1} &{} {\quad if\quad b(\mathbf{x}) < a(\mathbf{x})} \\ \end{array}} \right. \end{aligned}$$
(11)

Here, it is clear that when \(b(\mathbf{x})\) is much greater than \(a(\mathbf{x})\), the ratio of \(a(\mathbf{x})\) to \(b(\mathbf{x})\) is very small, and \(S(\mathbf{x})\) is close to 1. But in the modified versions of the index, the SILAv1 (or SILA), the additional component \(A(\mathbf{x})\) makes it possible to correct the value of the \({silhouette\;width}\). In \(A(\mathbf{x})\) the measure of cluster compactness \(a(\mathbf{x})\) plays a very important role. For instance, when a clustering algorithm greatly increases the sizes of clusters, the factor \(a(\mathbf{x})\) also increases and the ratio \(1/(1+a(\mathbf{x}))^q\) decreases significantly. This decreases the value of the index, and thus large differences of distances between clusters do not affect the final result so much. The modified \({silhouette\;width}\) can be expressed as follows:

$$\begin{aligned} S_m(\mathbf{x})=\left\{ {\begin{array}{*{20}l} {\left( 1 - \frac{{a(\mathbf{x})}}{{b(\mathbf{x})}} \right) \cdot \frac{{1}}{{\left( {1 + a(\mathbf{x})} \right) ^q}}} &{} {\quad if\quad b(\mathbf{x}) > a(\mathbf{x})} \\ 0 &{} {\quad if\quad b(\mathbf{x}) = a(\mathbf{x})} \\ {\left( \frac{{b(\mathbf{x})}}{{a(\mathbf{x})}} - 1 \right) \cdot \frac{{1}}{{\left( {1 + a(\mathbf{x})} \right) ^q}}} &{} {\quad if\quad b(\mathbf{x}) < a(\mathbf{x})} \\ \end{array}} \right. \end{aligned}$$
(12)

Let us look at the first situation. When \(b(\mathbf{x})\) is greater than \(a(\mathbf{x})\), the ratio of \(a(\mathbf{x})\) to \(b(\mathbf{x})\) is less than 1 and the value of \(S_m(\mathbf{x})\) is positive (see Eq. (12)). Notice that when the number of clusters K decreases from \(K_{max}\) towards the correct number of clusters \(c^*\), the clusters newly created by a clustering algorithm become larger and the value of \(a(\mathbf{x})\) increases. However, the value of \(a(\mathbf{x})\) is still not very large, so the factor \(A(\mathbf{x})\) does not decrease much. Thus, the value of \(S_m(\mathbf{x})\) increases and is only slightly reduced by \(A(\mathbf{x})\). Generally, when compact clusters that have been subdivided into smaller ones are merged back into larger clusters, the changes of their compactness and separability are not very significant. On the other hand, when the number of clusters K equals the right number \(c^*\), the separability of clusters increases abruptly due to the relatively large distances between clusters, and \(b(\mathbf{x})\) becomes much larger than \(a(\mathbf{x})\). Hence, when \(K=c^*\), the component \(S_m(\mathbf{x})\) increases considerably. Notice that \(A(\mathbf{x})\) does not change significantly, since the changes of cluster compactness are still small and \(a(\mathbf{x})\) does not increase much. Thus, the value of \(S_m(\mathbf{x})\) is not considerably reduced by \(A(\mathbf{x})\). In turn, when the number of clusters \(K < c^*\), cluster sizes can become really large and the factor \(a(\mathbf{x})\) increases strongly. Consequently, \(A(\mathbf{x})\) decreases significantly and reduces the value of the index. This overcomes the problem of very large differences of distances between clusters and allows the validity index to indicate the appropriate data partitioning.

The second situation takes place when \(a(\mathbf{x})\) and \(b(\mathbf{x})\) are equal. This means that it is not clear to which cluster the element should belong. In this case, the SILAv1 index (like the SILA and Silhouette indices) equals 0 (see Eqs. (11) and (12)). The last situation occurs when the factor \(a(\mathbf{x})\) is larger than \(b(\mathbf{x})\). In this case, the values of \(S_m(\mathbf{x})\) (or \(S(\mathbf{x})\)) are negative and \(\mathbf{x}\) should be assigned to another cluster. Notice that when \(b(\mathbf{x})\) is equal to 0, then \(S_m(\mathbf{x}) = -1/(1+a(\mathbf{x}))^q\).

As mentioned above, the SILA index uses \(q=1\). However, such a value of q can make \(A(\mathbf{x})\) too weak to appropriately correct the silhouette width. On the other hand, when q is too large, the influence of \(A(\mathbf{x})\) can be very strong and the value of the index decreases greatly. Hence, the choice of the exponent q for \(A(\mathbf{x})\) is a very significant problem. The new version of the index, called SILAv1, contains a formula for the change of the exponent q depending on the number of clusters, expressed by (9). It should be noted that in this definition an important role is played by the ratio \(K^2/n\), which makes the value of q greater than 2 (close to 3) for large K and close to 2 for small K. This prevents the index from taking too large values for high K (where q is close to 3). This is very important because for small K, when the values of \(a(\mathbf{x})\) are large, the index is still strongly decreased by the component \(A(\mathbf{x})\) with q close to 2. Thus, the component \(A(\mathbf{x})\) now has a suitable influence on the index and makes it possible to overcome the drawback of the Silhouette index, where large differences of distances between clusters can produce incorrect results. It should be observed that the new index can take values between 1 and \(-1/(1+a(\mathbf{x}))^q\).
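As a brief worked example (with n chosen here only for illustration): for a data set of \(n=900\) elements, Eq. (9) gives \(q = 2 + 2^2/900 \approx 2.004\) at \(K=2\) and \(q = 2 + 30^2/900 = 3\) at \(K=K_{max}=\sqrt{n}=30\), so q moves smoothly from about 2 towards 3 as K grows.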

In the next section the results of the experimental studies are presented to confirm the effectiveness of this approach.

3 Experimental Results

In this section several experiments are described, which were carried out to verify the effectiveness of the new index in detecting the correct clusters. The experiments were conducted on different data sets using hierarchical clustering. It should be noted that proper clustering of data is not possible without knowledge of the right number of clusters occurring in the given data set. Thus, this parameter is very important for most clustering algorithms, but it is usually not known in advance. Cluster validity indices are often used to determine this parameter.

The experiments relate to determining the number of clusters in data sets when Complete-linkage hierarchical clustering is applied as the underlying clustering method. In each step this algorithm merges the two clusters with the smallest maximum pairwise distance. Furthermore, three validity indices, i.e. the Silhouette (SIL), \( SILA\) and SILAv1, are used to indicate the right number of clusters. Note that the recommended range of the number of clusters for data clustering analysis is from \({K_{min}=}\) 2 to \({K_{max}=}\) \({\sqrt{n}}\) [22]. However, for hierarchical clustering the number varies from \({K_{max}=n}\) to \({K_{min}=}\) 2. To show the efficacy of the proposed approaches, the values of the validity indices are also presented in plots, where the number of clusters ranges from \({K_{max}=}\) \({\sqrt{n}}\) to \({K_{min}=}\) 2. Moreover, it is assumed that the values of the validity indices are equal to 0 for \(K=1\).

In all the experiments the Weka machine learning toolkit [35] has been used, and the Euclidean distance and min-max data normalization have also been applied.
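For readers who wish to reproduce the protocol outside Weka, a possible sketch in Python with SciPy is given below. The helper name select_number_of_clusters is ours, and index_fn stands for any index implementation, such as the sketches in Sect. 2.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def select_number_of_clusters(X, index_fn):
    # Min-max normalization of each feature to [0, 1], as in the Weka setup.
    X = np.asarray(X, dtype=float)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Complete-linkage dendrogram over Euclidean distances.
    Z = linkage(X, method='complete', metric='euclidean')
    n = len(X)
    scores = {}
    for K in range(2, int(np.sqrt(n)) + 1):
        labels = fcluster(Z, t=K, criterion='maxclust')  # cut tree into K clusters
        scores[K] = index_fn(X, labels)
    # The selected number of clusters maximizes the validity index.
    return max(scores, key=scores.get), scores
```

For example, best_K, scores = select_number_of_clusters(X, silav1_index) returns the number of clusters that maximizes the SILAv1 index together with the score for each examined K.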

3.1 Data Sets

Eight artificially generated data sets are used in the experiments. These data sets have various cluster structures, densities and dimensions. For instance, the first four of them, called Data 1, Data 2, Data 3 and Data 4, are 2-dimensional with 3, 5, 8 and 15 clusters, respectively. Scatter plots of these data sets are presented in Figs. 1 and 2. Additionally, Table 1 gives a detailed description of all the artificial data sets. As can be observed in the plots, the distances between clusters vary greatly: the clusters are located in groups, with some of them very close to each other and others quite far apart. Moreover, the sizes of the clusters differ and they contain various numbers of elements. Hence, many cluster validity indices can provide incorrect partitioning schemes.

Table 1. Detailed description of the artificial data sets
Fig. 1. 2-dimensional artificial data sets: (a) \({Data\;\mathrm 1}\), (b) \({Data\;\mathrm 2}\), (c) \({Data\;\mathrm 3}\), and (d) \({Data\;\mathrm 4}\)

Fig. 2. 3-dimensional artificial data sets: (a) \({Data\;\mathrm 5}\), (b) \({Data\;\mathrm 6}\), (c) \({Data\;\mathrm 7}\), and (d) \({Data\;\mathrm 8}\)

Fig. 3. Variations of the Silhouette, SILA and SILAv1 indices with respect to the number of clusters for 2-dimensional data sets: (a) \({ Data\; \mathrm 1}\), (b) \({ Data\; \mathrm 2}\), (c) \({ Data\; \mathrm 3}\), and (d) \({ Data\; \mathrm 4}\) partitioned by the Complete-linkage method.

Fig. 4. Variations of the Silhouette, SILA and SILAv1 indices with respect to the number of clusters for 3-dimensional data sets: (a) \({ Data\; \mathrm 5}\), (b) \({ Data\; \mathrm 6}\), (c) \({ Data\; \mathrm 7}\), and (d) \({ Data\; \mathrm 8}\) partitioned by the Complete-linkage method.

Experiments. The hierarchical Complete-linkage method was used as the underlying clustering method for partitioning these data. Figures 3 and 4 present a comparison of the variations of the Silhouette, SILA and SILAv1 indices with respect to the number of clusters. As mentioned above, in the plots the maximal number of clusters \(K_{max}\) is equal to \({\sqrt{n}}\) and the values of the validity indices are equal to 0 for \(K=1\). It can be seen that the SILAv1 index provides the correct number of clusters for all the data sets. However, the previous index, SILA, indicates incorrect partition schemes for two sets, i.e., Data 3 and Data 6. In contrast, the Silhouette index selects incorrect partitioning schemes for all the data sets, mostly producing high, distinct peaks at \(K=2\). This means that when the clustering method merges clusters into larger ones and the differences of distances between them are large, the influence of the separability measure is significant and, consequently, this index provides incorrect results. On the other hand, despite the large differences of distances between clusters, the SILAv1 index provides the correct partitioning for all these data sets. Notice that the component \(A(\mathbf{x})\) (in the SILAv1 or SILA indices) only slightly reduces the values of these indices when the number of clusters \(K>c^*\), because the clusters are then not so large and have a compact structure.

4 Conclusions

As mentioned above, the Silhouette index can indicate an incorrect partitioning scheme when there are large differences of distances between clusters in a data set. Consequently, to improve the performance of the index and to overcome this drawback, a modification of the index has been proposed. It is based on the use of an additional component, which contains a measure of cluster compactness. The value of this measure increases when a cluster size increases considerably. Hence, the additional component decreases and reduces the high values of the index caused by large differences between clusters. Complete-linkage was selected as the underlying clustering algorithm to investigate the behaviour of the proposed validity indices. The conducted tests have demonstrated the advantages of the proposed SILA and SILAv1 indices compared to the Silhouette index. In these experiments, several data sets were used and the number of clusters varied within a wide range. All the presented results confirm the high efficiency of the SILAv1 index.