
1 Introduction

Clustering partitions data into homogeneous subsets (called clusters), inside which elements are similar to each other while being different from items in other groups. It is also called unsupervised learning or unsupervised classification. Nowadays, a large number of clustering algorithms exist and have found use in various fields such as data mining, bioinformatics, data exploration, etc. Clustering methods can also be applied to designing neural networks and neuro-fuzzy systems [2-10, 16-18, 30-32, 37, 44]. However, the results of clustering algorithms depend strongly on the right choice of input parameters: for the same data but different input parameters, a clustering algorithm can produce different results. It should be noted that the number of clusters is a significant input parameter of many clustering algorithms, and it is often selected in advance. Thus, the key issue is how to properly evaluate the results of data clustering. In the literature on the subject, three main techniques are used to evaluate partitionings of data sets: external, internal and relative approaches [13, 38]. The relative methods are very popular and widely used by researchers. In this approach, a clustering algorithm provides data partitionings for different values of the input parameters, and the resulting partitioning schemes are then compared to find the best result. For this purpose, cluster validity indices are used. A great number of such indices have been introduced so far, e.g., [1, 11, 12, 14, 22, 39, 40, 45-47].

In this paper, a cluster validity index called the SILA index, which is a modification of the Silhouette index, is proposed. This modification improves the performance of the index. Notice that the Silhouette index is often used by researchers to evaluate clustering results; unfortunately, in some cases it fails to detect the correct partitioning of data sets. A detailed explanation of this problem is presented in Sect. 2. The proposed SILA index contains a component which corrects the index value when changes of cluster separability are considerable during a partitioning process (see Eq. (10)). In order to demonstrate the effectiveness of the new validity index, several experiments were performed on various data sets. This paper is organized as follows: Sect. 2 presents the Silhouette index and a detailed description of its properties. Section 3 describes the new validity index, which is a modification of the Silhouette index. Section 4 illustrates experimental results on artificial and real-life data sets. Finally, Sect. 5 presents conclusions.

2 Description of the Silhouette Index

Let us denote a K-partition scheme of a data set X by C = \({\{C_1, C_2, ..., C_K\}}\), where \({C_k}\) indicates the \(k\)th cluster, \({k=1,...,K}\). Moreover, the mean of within-cluster distances, denoted \(a(\mathbf{x})\), is defined as the average distance between a pattern \(\mathbf{x}\) belonging to \({C_k}\) and the remaining patterns \(\mathbf{x}_k\) of this cluster:

$$\begin{aligned} a(\mathbf{x}) = \frac{1}{{n_k - 1}}\sum \limits _{\mathbf{x}_k \in C_k } {d\left( {\mathbf{x},\mathbf{x}_k } \right) } \end{aligned}$$
(1)

where \({n_k}\) is the number of patterns in \({C_k}\) and \({d\left( {\mathbf{x},\mathbf{x}_k }\right) }\) is a function of the distance between \(\mathbf{x}\) and \(\mathbf{x}_k\). Furthermore, the mean of distances of \(\mathbf{x}\) to the other patterns \(\mathbf{x}_l\) belonging to the cluster \({ C}_l\), where \(l = 1, ..., K\) and \(l \ne k\), can be written as:

$$\begin{aligned} \delta (\mathbf{x},\mathbf{x}_l) = \frac{1}{{n_l}}\sum \limits _{\mathbf{x}_l \in C_l } {d\left( {\mathbf{x}, \mathbf{x}_l } \right) } \end{aligned}$$
(2)

where \({n_l}\) is the number of patterns in \({C_l}\). Thus, the smallest of the mean distances \(\delta (\mathbf{x},\mathbf{x}_l)\), denoted \(b(\mathbf{x})\), is defined as:

$$\begin{aligned} b(\mathbf{x}) = \mathop {\min }\limits _{\scriptstyle l = 1,...,K \atop \scriptstyle l \ne k } \delta (\mathbf{x},\mathbf{x}_l ) \end{aligned}$$
(3)

The so-called silhouette width of the pattern \({\mathbf{x}}\) can be expressed as follows:

$$\begin{aligned} S({\mathbf{x}}) = {\frac{{b(\mathbf{x}) - a(\mathbf{x})}}{{\max \left( {a(\mathbf{x}),b(\mathbf{x})} \right) }}} \end{aligned}$$
(4)

Finally, the Silhouette index is defined as:

$$\begin{aligned} SIL = \frac{1}{{n}}\sum \limits _{\mathbf{x} \in X } {S(\mathbf{x})} \end{aligned}$$
(5)

where n is the number of patterns in the data set X. Thus, this index can also be represented as:

$$\begin{aligned} SIL = \frac{\mathrm{1}}{\mathrm{n}}\sum \limits _{\mathbf{x} \in X} {\frac{{b(\mathbf{x}) - a(\mathbf{x})}}{{\max (a(\mathbf{x}), b(\mathbf{x}))}}}. \end{aligned}$$
(6)
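Equations (1)-(6) translate directly into code. The following Python sketch (our illustration; the function name and use of NumPy are assumptions, not the authors' implementation) computes the SIL index from a data matrix and a label vector:

```python
import numpy as np

def silhouette_index(X, labels):
    """Silhouette (SIL) index, Eq. (6): the mean silhouette width S(x)
    over all patterns. Requires at least two clusters, since b(x) is
    undefined for K = 1."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all patterns.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    widths = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_k = same.sum()
        if n_k <= 1:
            widths.append(0.0)  # a common convention for singleton clusters
            continue
        # a(x): mean distance to the other patterns of the same cluster, Eq. (1).
        a = D[i, same].sum() / (n_k - 1)
        # b(x): smallest mean distance to the patterns of another cluster, Eqs. (2)-(3).
        b = min(D[i, labels == l].mean() for l in set(labels) if l != labels[i])
        widths.append((b - a) / max(a, b))  # silhouette width, Eq. (4)
    return float(np.mean(widths))
```

For two tight, well-separated clusters the value approaches 1.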

The Silhouette index is also called the SIL index. Unlike most validity indices, the SIL index can be used for clusters of arbitrary shapes. It should be noted that the index is based on two components, i.e., \(b(\mathbf x)\) and \(a(\mathbf x)\). As given above, the first component is the smallest of the mean distances of \(\mathbf{x}\) to the patterns belonging to the other clusters, while \(a(\mathbf x)\) is the average distance between \(\mathbf x\) and the rest of the patterns belonging to the same cluster. Notice that \(a(\mathbf x)\) can also be considered a measure of cluster compactness, whereas the numerator of \(S(\mathbf x)\), i.e., the difference between \(b(\mathbf x)\) and \(a(\mathbf x)\), can be considered a measure of cluster separability (see Eq. (4)). The value of the silhouette width lies in the interval [\(-1, 1\)]: the element \(\mathbf x\) is assigned to the right cluster when \(S(\mathbf x)\) is close to 1, whereas when it is nearly \(-1\), \(\mathbf x\) is located in a wrong cluster. Hence, the maximum value of the Silhouette index indicates the right partition scheme. Moreover, it should be observed that the measure of cluster separability (the numerator in Eq. (6)) essentially influences the results of this index, and in some cases the index can fail to detect the correct data partitioning, for example when the differences of distances between clusters are large. Figure 1 presents an example of a 2-dimensional data set containing three clusters labelled 1, 2 and 3, each with only several elements per class. Notice that the distances between the clusters are very different: the distance between clusters 1 and 2 is about d1, between clusters 2 and 3 it is d2, and between clusters 3 and 1 it is d3, where d1 (and d3) is much larger than d2.
Let us denote by \({c^*}\) the correct number of clusters in the data set; here \({c^*=3}\). When the number of clusters K is larger than \({c^*}\), the naturally existing compact clusters are subdivided into smaller ones by a clustering algorithm. In this case, the minimum distance between clusters is small, which also makes the index value small (see Eq. (4)). However, when \({K=c^*}\), the value of \(b(\mathbf x)\) is about d1 for \(\mathbf x\) belonging to cluster 1, whereas \(b(\mathbf x)\) is about d2 for \(\mathbf x\) belonging to cluster 2 (or 3). Consequently, the large distance between clusters 1 and 2 (or 1 and 3) means that the value of \(b(\mathbf x)\) calculated for cluster 1 is much higher than \(a(\mathbf x)\), and the Silhouette index is high (see Eq. (6)). But when \(K<c^*\), the value of the index can be even higher than for \(K=c^*\). This is because clusters 2 and 3 are merged, and the two resulting clusters are also far from each other. This means that \(b(\mathbf x)\) for both clusters is large in comparison to \(a(\mathbf x)\), which does not actually increase so much. Consequently, the sum of the silhouette widths can be higher for \(K<c^*\) than for \(K=c^*\). Thus, due to large differences between cluster distances, the index can indicate an incorrect number of clusters. In the next section, a modification of the index is proposed to overcome this drawback.
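The failure mode described above can be reproduced numerically. In the sketch below (our own illustration, not taken from the paper), three tiny 1-dimensional clusters are placed so that one inter-cluster gap is much smaller than the others; the SIL value for the merged partition (K = 2) then exceeds the value for the correct K = 3:

```python
import numpy as np

def sil(X, labels):
    # Plain Silhouette index, Eq. (6), on Euclidean distances.
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    widths = []
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == l].mean() for l in set(labels) if l != labels[i])
        widths.append((b - a) / max(a, b))
    return float(np.mean(widths))

# Clusters 1, 2 and 3 around positions 0, 100 and 104: the 2-3 gap
# (about 3) is far smaller than the 1-2 gap (about 99).
X = [[0.0], [1.0], [100.0], [101.0], [104.0], [105.0]]
k3 = sil(X, [1, 1, 2, 2, 3, 3])  # correct partition, K = 3
k2 = sil(X, [1, 1, 2, 2, 2, 2])  # clusters 2 and 3 merged, K = 2
# SIL prefers the merged (wrong) partition: k2 > k3.
```

Both partitions score high, but the incorrect one scores higher, which is exactly the drawback the SILA modification targets.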

Fig. 1.

An example of a data set consisting of three clusters

3 Modification of the Silhouette Index

The modification involves an additional component which corrects values of the index. Thus, the new index, called the SILA index, is defined as follows:

$$\begin{aligned} SILA = \frac{1}{n}\left( {\sum \limits _{\mathbf{x} \in X} {\left( {S(\mathbf{x}) \cdot A(\mathbf{x})} \right) } } \right) \end{aligned}$$
(7)

where \(S(\mathbf{x})\) is the \({silhouette\ width}\) (Eq. (4)) and the additional component \(A(\mathbf{x})\) is expressed as:

$$\begin{aligned} A(\mathbf{x}) = \frac{1}{{\left( {1 + a(\mathbf{x})} \right) }} \end{aligned}$$
(8)

Thus, the new index can be represented in the following way:

$$\begin{aligned} SILA = \frac{1}{n}\left( {\sum \limits _{\mathbf{x} \in X} {\left( {\frac{{b(\mathbf{x}) - a(\mathbf{x})}}{{\max \left( {a(\mathbf{x}), b(\mathbf{x})} \right) }} \cdot A(\mathbf x)} \right) } } \right) \end{aligned}$$
(9)

or

$$\begin{aligned} SILA = \frac{1}{n}\left( {\sum \limits _{\mathbf{x} \in X} {\left( {\frac{{b(\mathbf{x}) - a(\mathbf{x})}}{{\max \left( {a(\mathbf{x}),b(\mathbf{x})} \right) }} \cdot \frac{1}{{\left( {1 + a(\mathbf{x})} \right) }} } \right) } } \right) \end{aligned}$$
(10)
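Equation (10) differs from the plain Silhouette index only in the weight \(A(\mathbf{x})\). A self-contained Python sketch (illustrative; the function name and NumPy usage are our assumptions, not the authors' code):

```python
import numpy as np

def sila_index(X, labels):
    """SILA index, Eq. (10): each silhouette width S(x) is weighted by
    A(x) = 1 / (1 + a(x)), which penalizes loose (large-a) clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all patterns.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    total = 0.0
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() <= 1:
            continue  # singleton cluster: silhouette width taken as 0
        a = D[i, same].sum() / (same.sum() - 1)      # Eq. (1)
        b = min(D[i, labels == l].mean()             # Eqs. (2)-(3)
                for l in set(labels) if l != labels[i])
        total += (b - a) / max(a, b) / (1.0 + a)     # S(x) * A(x)
    return total / len(X)
```

Note that on min-max-normalized data \(a(\mathbf{x})\) is bounded, so \(A(\mathbf{x})\) stays close to 1 for compact clusters and shrinks only when clusters grow loose.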

In the next section the results of the experimental studies are presented to confirm the effectiveness of this approach.

4 Experimental Results

Several experiments were carried out to verify the effectiveness of the new index. They concern determining the number of clusters for artificial and real-life data sets when the Complete-linkage algorithm is applied as the underlying clustering method. It should be noted that in all the experiments the Euclidean distance and min-max data normalization were used. This approach is often applied, e.g., in the Weka machine learning toolkit [43].
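The min-max normalization used in the experiments rescales every feature to the unit interval. A minimal sketch (the zero-span guard for constant features is our own convention, not specified in the paper):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each feature (column) of X to [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant features
    return (X - lo) / span
```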

4.1 Data Sets

Figures 2 and 3 show the randomly generated artificial data sets which were used in the experiments, and Table 1 presents their detailed description. These data sets consist of various numbers of clusters and elements per class. For instance, the first three, called Data 1, Data 2 and Data 3, are 2-dimensional with 3, 5 and 8 clusters, respectively. The next three, called Data 4, Data 5 and Data 6, are 3-dimensional with 4, 7 and 9 clusters, respectively. As can be observed in Figs. 2 and 3, the clusters are mostly circular and located at various distances from each other, with some of them being quite close. For example, in Fig. 2 the cluster sizes and the distances between clusters are very different, and the clusters generally form two groups. On the other hand, Fig. 3 presents various large clusters of the 3-dimensional data sets; here, the distances between clusters are also very different and the clusters form several groups. The real-life data sets were drawn from the UCI repository [20], and their detailed description is presented in Table 2. In the experiments, the Complete-linkage method was used as the underlying clustering algorithm for partitioning the data. The number of clusters K was varied from \({K_{max}=}\) \({\sqrt{n}}\) down to \({K_{min}=}\) 1; this range is an accepted rule in the clustering literature [23]. Moreover, Figs. 4, 5 and 6 present a comparison of the variations of the Silhouette and SILA indices with respect to the number of clusters. It can be seen that the SILA index provides the correct number of clusters for all the data sets. On the contrary, the Silhouette index selects incorrect partitioning schemes and mainly provides high, distinct peaks at \(K=2\).
This means that when the clustering algorithm merges clusters into larger ones and the distances between them are large, the influence of the separability measure is significant and, consequently, the Silhouette index provides incorrect results. On the other hand, despite the large differences of distances between clusters, the SILA index generates clear peaks which correspond to the correct partitioning of these data. For the real-life data sets, both indices found the right number of clusters for the Iris data. However, for the Ecoli and Glass data the Silhouette index indicates \(K=2\) clusters, while the SILA index provides a better result for the Glass data, i.e., \(K=5\). Thus, for these sets, the number of clusters is determined more precisely by the SILA index. Notice that when the number of clusters \(K>c^*\), the component \(A(\mathbf{x})\) reduces the values of the index only slightly, because the cluster sizes are not so large.
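The experimental procedure above (normalize, cluster with Complete-linkage, score each K, keep the maximizer) can be sketched as follows. SciPy's `linkage`/`fcluster` stand in for the authors' implementation, the compact `sila` helper re-implements Eq. (10), and K = 1 is skipped because \(b(\mathbf{x})\) is undefined for a single cluster:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def sila(X, labels):
    # Compact SILA index, Eq. (10), on Euclidean distances.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    total = 0.0
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() <= 1:
            continue
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == l].mean() for l in set(labels) if l != labels[i])
        total += (b - a) / max(a, b) / (1.0 + a)
    return total / len(X)

def choose_k(X):
    """Cut a Complete-linkage dendrogram at K = 2 .. sqrt(n) and
    return the K that maximizes the SILA index."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # min-max normalization
    Z = linkage(X, method='complete')
    k_max = max(int(np.sqrt(len(X))), 2)
    best_k, best_v = 2, -np.inf
    for k in range(2, k_max + 1):
        labels = fcluster(Z, t=k, criterion='maxclust')
        v = sila(X, labels)
        if v > best_v:
            best_k, best_v = k, v
    return best_k
```

On three well-separated synthetic blobs this procedure recovers the expected K = 3.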

Fig. 2.

2-dimensional artificial data sets: (a) \({Data\;\mathrm 1}\), (b) \({Data\;\mathrm 2}\) and (c) \({Data\;\mathrm 3}\)

Fig. 3.

3-dimensional artificial data sets: (a) \({Data\;\mathrm 4}\), (b) \({Data\;\mathrm 5}\) and (c) \({Data\;\mathrm 6}\)

Table 1. Detailed description of the artificial data sets
Table 2. Detailed description of the real-life data sets
Fig. 4.

Variations of the Silhouette and SILA indices with respect to the number of clusters for 2-dimensional data sets: (a) \({ Data\; \mathrm 1}\), (b) \({ Data\; \mathrm 2}\) and (c) \({ Data\; \mathrm 3}\)

Fig. 5.

Variations of the Silhouette and SILA indices with respect to the number of clusters for 3-dimensional data sets: (a) \({ Data\; \mathrm 4}\), (b) \({ Data\; \mathrm 5}\) and (c) \({ Data\; \mathrm 6}\)

Fig. 6.

Variations of the Silhouette and SILA indices with respect to the number of clusters for real-life data sets: (a) \({ Glass}\), (b) \({ Ecoli}\) and (c) \({ Iris}\)

5 Conclusions

In this paper, a new cluster validity index called the SILA index was proposed. This new index is a modification of the Silhouette index, which is very often used by researchers to evaluate partitionings of data. Furthermore, unlike most other indices, the SILA index (like the Silhouette index) can be used for arbitrarily shaped clusters. As mentioned above, the Silhouette index can indicate an incorrect partitioning scheme when there are large differences of distances between clusters in a data set. Consequently, the new index contains an additional component which improves its performance and overcomes this drawback. This component uses a measure of cluster compactness which increases when the cluster size increases considerably, and it reduces the high values of the index caused by large differences between clusters. To investigate the behaviour of the proposed validity index, Complete-linkage was used as the underlying clustering algorithm. All the presented results confirm the high efficiency of the SILA index. It should also be noticed that cluster validity indices can be used during the process of designing various neuro-fuzzy structures [15, 19, 21, 24-29, 41, 42] and stream data mining algorithms [33-36].