Abstract
Clustering is an important technique for identifying groups of similar data objects within a data set. Since problems during the data collection and data preprocessing steps often lead to missing values in the data sets, there is a need for clustering methods that can deal with such imperfect data. Approaches proposed in the literature for adapting the fuzzy c-means algorithm to incomplete data work well on data sets with equally sized and shaped clusters. In this paper we present an approach for adapting the fuzzy c-means algorithm to incomplete data that uses the dimension-wise fuzzy variances of clusters for imputation of missing values. In experiments on incomplete real and synthetic data sets with differently sized and shaped clusters, we demonstrate the benefit over the basic approach in terms of the assignment of data objects to clusters and the cluster prototype computation.
1 Introduction
Clustering is one of the most important and most widely used techniques for the automatic extraction of knowledge from large amounts of data. Its aim is to identify groups, so-called clusters, of similar objects within a data set. Data clustering is used in many fields, including database marketing, image processing, bioinformatics, text mining, and many others. The quality of the data plays an important role in the clustering process and can affect the clustering results. However, problems during data collection and data preprocessing are often inevitable and can lead to uncertain, erroneous, or missing values in the data sets. Since completing or correcting the data is often expensive or even impossible, there is a need for clustering methods that can deal with such imperfect data.
In the literature, several approaches for adapting the fuzzy c-means (FCM) algorithm to incomplete data have been proposed [1–3]. Like the basic FCM algorithm, these methods assume that the clusters are equally sized and shaped. Even though it uses the Euclidean distance function as a similarity criterion, in practice FCM often reliably identifies differently shaped and sized clusters in complete data sets. The FCM algorithm is therefore widely used both as a stand-alone clustering method and within hybrid hierarchical clustering methods [4]. However, the experiments conducted in [5] showed that FCM adapted to incomplete data produces less accurate clustering results on incomplete data with differently sized and scattered clusters than on clusters with equal sizes and volumes. In [6], we proposed an approach for adapting FCM to incomplete data that takes the cluster scatters into account during the imputation of missing values. Since this method computes the cluster dispersions on the basis of completely available features, it cannot be applied to data sets where missing values occur in all features. Moreover, the experiments showed that this approach produces less accurate cluster prototypes than the basic approach. We assume the reason lies in the calculation and the use of the cluster dispersions: they are computed on the basis of completely available features, but they are used for the imputation of missing values in the other features. Therefore, if clusters have different extents in different dimensions, using the cluster dispersions for the imputation of missing values in the way our previous approach does can do more harm than good. In this paper, we present an approach for adapting the fuzzy c-means algorithm to incomplete data that uses the dimension-wise fuzzy variances of clusters for the imputation of missing values.
Our approach involves the cluster shapes and volumes for missing value imputation in a simple and computationally inexpensive way. In experiments on real and synthetic data sets with differently sized and shaped convex clusters, we demonstrate the capabilities of our new approach and show the benefit over the basic approach in terms of the assignment of data objects to clusters and the cluster prototype computation.
The remainder of the paper is organized as follows. In the next section we give a short overview of the basic fuzzy c-means algorithm and the methods for adapting FCM to incomplete data. In Sect. 3 we describe our idea for missing value imputation using the dimension-wise fuzzy variances of clusters and present the modified FCM algorithm. The evaluation results of our method and the comparison with the basic approach are presented in Sect. 4. In Sect. 5 we close the paper with a short summary and a discussion of future research.
2 Approaches for Fuzzy Clustering of Incomplete Data
2.1 Fuzzy c-Means Algorithm (FCM)
The fuzzy c-means algorithm (FCM) is a well-known clustering algorithm that partitions a given data set \(X = \{x_{1}, ..., x_{n}\}\) in a d-dimensional metric data space into c clusters that are represented by their cluster prototypes \(V = \{v_{1}, ..., v_{c}\}\). Unlike the k-means algorithm [7], which assigns each data object to exactly one cluster, the fuzzy c-means algorithm assigns data items to clusters with membership degrees [8]. The membership degree \(u_{ik} \in [0,1]\) expresses the relative degree to which the data point \(x_{k}\), \(1\le k\le n\), belongs to the cluster \(C_{i}\), \(1\le i\le c\). The objective function of the fuzzy c-means algorithm is defined as follows:
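In its standard form [8], the objective function reads:

\[
J_{m}(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, d^{2}(v_{i}, x_{k}), \qquad \text{subject to} \quad \sum_{i=1}^{c} u_{ik} = 1 \text{ for all } k. \qquad (1)
\]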
The similarity between the data items and the cluster prototypes is expressed by the squared distance function. The parameter m, \(m > 1\), is the fuzzification parameter, which determines the vagueness of the resulting partitioning. The objective function of the fuzzy c-means algorithm is minimized using an alternating optimization (AO) scheme [8]: the objective function is alternately optimized over the membership degrees and the cluster prototypes in an iterative process.
The algorithm begins with the initialization of the cluster prototypes \(v_{i}\), which can be either the first c data items of the data set, c randomly chosen data items, or c randomly chosen points in the data space. Alternatively, the membership degrees can be initialized. In the first iteration step the membership degrees of each data item to each cluster are updated according to Formula (2).
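In its standard form [8], including the handling of data points that coincide with a prototype, this update reads:

\[
u_{ik} = \left( \sum_{l=1}^{c} \left( \frac{d^{2}(v_{i}, x_{k})}{d^{2}(v_{l}, x_{k})} \right)^{\frac{1}{m-1}} \right)^{-1} \; \text{if } I_{x_{k}} = \emptyset, \qquad
u_{ik} = \begin{cases} \frac{1}{|I_{x_{k}}|} & \text{if } v_{i} \in I_{x_{k}} \\ 0 & \text{otherwise,} \end{cases} \qquad (2)
\]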
where \(I_{x_{k}} = \{v_i \mid d^{2}(v_{i}, x_{k}) = 0\}\). In the second iteration step the new cluster prototypes are calculated based on all data items depending on their membership degrees to the cluster (see Formula (3)).
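In the standard form [8], this prototype update is:

\[
v_{i} = \frac{\sum_{k=1}^{n} u_{ik}^{m} \, x_{k}}{\sum_{k=1}^{n} u_{ik}^{m}}, \qquad 1 \le i \le c. \qquad (3)
\]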
The iterative process continues as long as the cluster prototypes change by more than a threshold \(\epsilon \). Although the fuzzy c-means algorithm is known as a stable and robust clustering algorithm that rarely gets stuck in a local optimum [9, 10], it is advisable to run the algorithm with different initializations to achieve optimal partitioning results.
2.2 Different Approaches for Fuzzy Clustering of Incomplete Data
In the literature, several approaches for adapting fuzzy clustering algorithms to incomplete data have been proposed. Some of them, such as the whole-data strategy FCM (WDSFCM) and the partial distance strategy FCM (PDSFCM) [1, 2], carry out the analysis only on the basis of available values. Other methods, like the optimal completion strategy FCM (OCSFCM) [1, 2], the nearest prototype strategy FCM (NPSFCM) [1, 2], and the distance estimation strategy FCM (DESFCM) [3], impute missing feature values or distances in an additional iteration step of the fuzzy c-means algorithm. In experiments described in [1, 5], the lowest misclassification errors were obtained by PDSFCM, OCSFCM, and NPSFCM. OCSFCM and NPSFCM tend to strengthen the clustering structure of incomplete data, which turned out to be beneficial, e.g., for determining the optimal number of clusters using cluster validity indexes (CVIs). In the following we focus on the description of OCSFCM because it provides the basis for our approach.
Optimal Completion Strategy FCM (OCSFCM). The idea of the optimal completion strategy (OCS) is to iteratively compute the missing values as the additional variables over which the objective function is minimized [1, 2]. The fuzzy c-means algorithm is modified by adding an additional iteration step where the missing values are updated according to Formula (4).
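In the form given in [1, 2], each missing feature value \(x_{kj}\) is updated as:

\[
x_{kj} = \frac{\sum_{i=1}^{c} u_{ik}^{m} \, v_{ij}}{\sum_{i=1}^{c} u_{ik}^{m}} \qquad (4)
\]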
In this way, the missing values are imputed by the weighted means of all cluster centers in each iteration step.
The advantage of this approach is that missing values are imputed during the clustering process. However, the drawback of OCSFCM is that the calculation of the cluster prototypes and the imputation of the missing values influence each other because the algorithm does not distinguish between the available and the imputed feature values. In [11] the author proposed to diminish the influence of imputed values on the calculation of the cluster prototypes by reducing the membership degrees of incomplete data items depending on the number of missing values. The resulting algorithm loses the property of a probabilistic fuzzy clustering algorithm, though. In our approach we tackle this problem by computing the cluster prototypes only on the basis of the available feature values.
3 FCM Clustering of Incomplete Data Using Dimension-Wise Fuzzy Variances of Clusters
In [6], we proposed an approach for adapting FCM to incomplete data which can be regarded as an extension of OCSFCM that takes the dispersions of clusters into account for the imputation of missing values. The experimental results showed the benefits of this approach over the basic OCSFCM algorithm. However, since the cluster dispersions are computed on the basis of completely available features but used for the imputation of missing values in the other features, this approach is restricted to data sets with equally shaped clusters, i.e. clusters that have the same extents in different dimensions. Therefore, in our new approach we impute missing values taking the dimension-wise fuzzy variances of clusters into account. The idea of our approach is to impute missing feature values of incomplete data items depending on both the distances between the cluster prototypes and the incomplete data items and the extents of the clusters in the corresponding dimensions.
3.1 A New Membership Degree Using Dimension-Wise Fuzzy Variances of Clusters
For each cluster \(C_{i}\), \(1 \le i \le c\), we calculate its fuzzy variance \(w_{ij}\) in each dimension j, \(1 \le j \le d\), as an averaged squared distance of data items to their cluster prototypes according to Formula (5).
where
We calculate the dimension-wise fuzzy variances of clusters using only the available feature values. In this way, we avoid the influence of the imputed feature values on the calculation of the cluster variances. This is important because missing values are imputed by values close to the corresponding feature values of the cluster prototypes; taking both the available and the imputed feature values into account for the calculation of the cluster variances would therefore underestimate the real variances of the clusters. Furthermore, using all available feature values for the calculation of the dimension-wise fuzzy variances makes our approach applicable to incomplete data sets where missing values occur in all features and data items. Calculating the fuzzy variances of clusters for each feature separately prevents a distorted variance estimate in a single dimension, caused by a large amount of missing values in that feature, from affecting the variance estimates of the cluster in the other dimensions.
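The computation described above can be sketched as follows. Since Formula (5) is not reproduced here, the exact weighting is an assumption in this sketch: we take the fuzzy variance of cluster \(i\) in dimension \(j\) as the membership-weighted average of the squared deviations of the available values from the prototype. The function name `fuzzy_variances` is a hypothetical helper name introduced for illustration.

```python
import numpy as np

def fuzzy_variances(X, U, V, avail, m=2.0):
    """Dimension-wise fuzzy variances w_ij of c clusters (a sketch of
    Formula (5); the exact weighting in the paper may differ).

    X:     (n, d) data matrix, arbitrary values where entries are missing
    U:     (c, n) membership degrees
    V:     (c, d) cluster prototypes
    avail: (n, d) boolean mask, True where a value is available
    """
    c, d = V.shape
    W = np.zeros((c, d))
    Um = U ** m
    for i in range(c):
        for j in range(d):
            mask = avail[:, j]  # use available values in dimension j only
            num = np.sum(Um[i, mask] * (X[mask, j] - V[i, j]) ** 2)
            den = np.sum(Um[i, mask])
            W[i, j] = num / den if den > 0 else 0.0
    return W
```

Restricting the sums to the available entries realizes the point made above: imputed values, which lie close to the prototypes, never shrink the variance estimates.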
We integrate the fuzzy variances of clusters in the new membership degree update formula for the imputation of missing values as follows:
\(1\le k\le n\), \(1\le i\le c\), and \(1\le j\le d\). Note that we compute the membership degrees for the imputation of missing values for each dimension separately. That implies that a missing value of an incomplete data item in a particular feature is updated depending on the distances between the corresponding values of the cluster prototypes and the last imputed value, and on the extents of the clusters in that feature. The larger the variance of a cluster and the smaller the distance between the imputed data item and the cluster center in that particular dimension, the higher the new membership degree. If all clusters are equally sized and shaped, the membership degree \(u^{w}_{(ik)j}\) depends only on the distances between the data items and the cluster prototypes. As in the basic OCS approach, the imputation of missing values in our approach depends on the values imputed in the previous iteration.
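One imputation pass over the missing entries can be sketched as follows. Since the exact form of Formula (7) is not reproduced here, the concrete membership used below, a per-dimension squared distance scaled by the cluster's fuzzy variance \(w_{ij}\) and normalized FCM-style, is an assumed concretization of the description above, not the verbatim formula; the function name `impute_missing` is likewise introduced for illustration.

```python
import numpy as np

def impute_missing(X, V, W, avail, m=2.0, eps=1e-12):
    """One imputation pass (a sketch of Formulas (7) and (9)).

    The membership form below -- squared per-dimension distance divided
    by the cluster's fuzzy variance w_ij in that dimension -- is an
    assumed concretization of the paper's description.
    """
    Xn = X.copy()
    for k, j in zip(*np.where(~avail)):  # every missing entry (k, j)
        # variance-scaled squared distances to all prototypes in dim j;
        # X[k, j] holds the value imputed in the previous iteration
        dist = (X[k, j] - V[:, j]) ** 2 / (W[:, j] + eps) + eps
        ratios = (dist[:, None] / dist[None, :]) ** (1.0 / (m - 1.0))
        uw = 1.0 / ratios.sum(axis=1)    # u^w_(ik)j for all clusters i
        # weighted mean of the prototype values, as in the OCS update
        Xn[k, j] = np.sum(uw ** m * V[:, j]) / np.sum(uw ** m)
    return Xn
```

Dividing the squared distance by \(w_{ij}\) gives the behavior stated above: a larger cluster variance or a smaller distance in that dimension yields a higher membership degree, and for equally sized clusters the scaling cancels out.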
3.2 FCM for Incomplete Data Using Dimension-Wise Fuzzy Variances of Clusters (FCMDFVC)
We integrate the new membership degree for the imputation of missing values into the basic OCSFCM algorithm and refer to the resulting algorithm as the Fuzzy c-Means Algorithm for Incomplete Data Using Dimension-Wise Fuzzy Variances of Clusters (FCMDFVC). The working principle of FCMDFVC is depicted in Algorithm 1 and is basically the same as that of OCSFCM.
The algorithm begins with the initialization of the cluster prototypes and the missing values. The membership degrees are updated in the first iteration step in the same way as in the basic FCM and OCSFCM; the available and the imputed feature values are not distinguished. Since we want to avoid the influence of the imputed values on the calculation of the cluster prototypes, we compute the cluster prototypes only on the basis of the available feature values, as in PDSFCM, according to Formula (8).
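Following the partial distance strategy [1], this prototype update restricted to the available values is:

\[
v_{ij} = \frac{\sum_{k=1}^{n} u_{ik}^{m} \, I_{kj} \, x_{kj}}{\sum_{k=1}^{n} u_{ik}^{m} \, I_{kj}}, \qquad I_{kj} = \begin{cases} 1 & \text{if } x_{kj} \text{ is available} \\ 0 & \text{otherwise.} \end{cases} \qquad (8)
\]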
If the termination condition is not reached, in the third iteration step, missing values are imputed depending on the distances to the cluster prototypes and the fuzzy variances of clusters according to Formula (9).
Basically, the new membership degree for the imputation of missing values can also be integrated into NPSFCM. In this case, the computation of the cluster prototypes using only the available feature values is essential because the computation of the cluster prototypes and the imputation of the missing values influence each other even more strongly than in OCSFCM.
4 Data Experiments
In this section we compare our approach with the basic OCSFCM algorithm in terms of the assignment of data objects to clusters and the cluster prototype computation. In order to assess the impact of the single modifications of the basic method, in our experiments we also tested the OCSFCM variant that computes the cluster prototypes using only the available feature values and the FCMDFVC variant that computes the cluster prototypes using both available and imputed feature values.
4.1 Test Data and Experimental Setup
We tested the four above-mentioned approaches on different real and synthetic data sets. For the sake of brevity, here we only report the results obtained on the wine data set from the UCI Machine Learning Repository [12] and on three synthetic data sets with different properties. The wine data set consists of 178 data items, each with 13 features representing the results of a chemical analysis of three wine types. Corresponding to the wine types, the data items are distributed over three classes with 59, 71, and 48 instances. The synthetic data sets are depicted in Fig. 1. Each data set consists of 2000 data items generated from a mixture of three and five 3-dimensional Gaussian distributions, respectively. The data items are distributed over three clusters with 400, 700, and 900 instances in the data set 3D-3 and over five clusters with 200, 350, 450, 700, and 300 instances in the data sets 3D-5 and 3D-5-h. All clusters have different magnitudes and partly overlap. The data set 3D-5-h was generated from the data set 3D-5 by moving the clusters so that two groups of two and three differently sized clusters form a hierarchical structure in the resulting data set. Using the data set 3D-5-h, we wanted to find out whether the arrangement of clusters in a data set affects the performance of the clustering algorithms adapted to incomplete data.
In our experiments we first clustered the complete data sets with the basic FCM algorithm and used the resulting clusterings as a baseline for the comparison. Then we generated incomplete data sets by removing values in all features with different probabilities according to the general missing-data pattern [13]. The percentage of missing values was calculated in relation to all feature values in the data sets. In the resulting incomplete data sets, missing values were equally distributed over all features. The general missing-data pattern is common but challenging for clustering methods because missing values occur in many data items. Since we did not adapt the clustering algorithms to incomplete data with conditionally missing values, we deleted the values from the test data sets according to the common missing-data mechanism MCAR [13]. We clustered the incomplete data sets with the four above-mentioned clustering algorithms. At the beginning of each algorithm, we initialized the cluster prototypes with random values in the data space to make the test conditions as realistic as possible. We used the Frobenius norm distance for the stopping criterion \(\Vert V - V'\Vert < \epsilon \) defined in Formula (10).
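The Frobenius norm distance between the prototype matrices of two consecutive iterations is:

\[
\Vert V - V' \Vert = \sqrt{\sum_{i=1}^{c} \sum_{j=1}^{d} \left( v_{ij} - v'_{ij} \right)^{2}} \qquad (10)
\]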
In all our experiments we set the value \(\epsilon \) to 0.0001.
4.2 Experimental Results
The performance results of the fuzzy c-means clustering algorithms adapted to incomplete data are shown in Figs. 2, 3, 4, and 5. For the sake of uniformity, we marked the clustering algorithms that only use the available feature items for the computation of cluster prototypes with an asterisk in the legend. That means that our new approach described in Sect. 3 is listed as FCMDFVC\(^{*}\) in the legend.
We evaluated the partitioning results produced by the clustering approaches using different crisp and fuzzy indexes. Here, we only report the results for the subset similarity index [14] for comparing the fuzzy partitions produced by the clustering algorithms. Although in previous publications the partitioning results produced by fuzzy c-means algorithms for incomplete data were evaluated using crisp similarity indexes, in our opinion it makes more sense to use fuzzy indexes to compare the resulting membership degrees. To assess the accuracy of the computed cluster prototypes, we used the Frobenius norm distance between the terminal prototypes produced by FCM on the complete data and the terminal cluster prototypes produced by the four fuzzy clustering approaches on the incomplete data. Since there were significant variations in the results from trial to trial, the figures below show the averaged results obtained over 100 trials. For the sake of clarity, we omitted the standard deviations in the diagrams.
Figure 2 presents the performance results for the basic OCSFCM and our approach on the incomplete wine data set. According to these results, the computation of the cluster prototypes using only the available feature values seems to be the only factor that improved the partitioning results on incomplete data. Our proposal to impute missing values using the dimension-wise fuzzy variances of clusters seems to worsen the partitioning results: in almost all experiments this approach produced the least accurate partitioning results and the highest prototype error among all approaches. As we mentioned above, the reason is that the imputation of missing values depends on the imputation in the previous iteration step and influences the computation of the cluster prototypes. Therefore, it is essential in our approach to compute the cluster prototypes only on the basis of the available feature values; with this modification, our approach produced results similar to those of the OCSFCM variant that computes the cluster prototypes from the available feature values.
Since the wine data set has 13 attributes, we cannot visualize it and cannot comment further on the obtained results. Figures 3 and 4 present the performance results produced by the four approaches on the 3-dimensional data sets with three and five clusters. As the diagrams show, our new approach produced slightly more accurate partitioning results and much more accurate terminal cluster prototypes. While the performance results of OCSFCM and OCSFCM\(^{*}\) were similar on the data set 3D-3, the OCSFCM variant using only the available feature values for the computation of the cluster prototypes produced, on average, much less accurate partitioning results on the data set 3D-5. This is due to the fact that, unlike the other methods, this approach produced comparably good clustering results in only a few of the 100 trials and performed poorly on average.
Figure 5 shows the performance results of the four approaches on the incomplete 3D-5-h data set, in which the clusters were moved together to build two groups. All clustering approaches produced unstable results on this data set, even though the basic FCM algorithm produced comparably accurate results on the complete 3D-5-h data set: we obtained 0.9949 for the crisp subset similarity averaged over 100 trials. This shows that the spatial arrangement of the clusters in a data set has a strong influence on the clustering results produced on incomplete data.
5 Conclusions and Future Work
Approaches proposed in the literature for adapting the fuzzy c-means (FCM) algorithm to incomplete data assume that the clusters are equally sized and shaped. They produce less accurate clustering results on incomplete data with differently sized and scattered clusters than on data sets with clusters of equal sizes and volumes. Our previous approach [6] for adapting FCM to incomplete data takes the cluster scatters into account during the imputation of missing values; however, it computes the cluster dispersions on the basis of completely available features but uses them for the imputation of missing values in the other features. Therefore, if clusters have different extents in different dimensions, it fails to work. In this paper, we presented a new approach for adapting the fuzzy c-means algorithm to incomplete data that uses the dimension-wise fuzzy variances of clusters for the imputation of missing values. Our approach involves the cluster shapes and volumes in the missing value imputation and computes the cluster prototypes using only the available feature values. In experiments on real and synthetic data sets with differently sized and shaped clusters, we demonstrated that our new approach produces slightly more accurate partitioning results and much more accurate terminal cluster prototypes than the basic OCSFCM method.
In all our experiments we used incomplete data with missing values MCAR because we did not adapt our approach to incomplete data with a conditional absence of values. In the future we plan to apply the idea of using class-specific probabilities for the imputation of missing values, as in the approach presented in [15], in order to improve the performance of our approach on data sets with missing values MAR and NMAR. Furthermore, in our experiments we assumed the optimal number of clusters to be known. Determining the optimal number of clusters on incomplete data is still a challenging problem. Therefore, we want to analyze whether the partitioning results produced by our approach on incomplete data with differently sized and scattered clusters are better suited than the partitionings produced by the basic OCSFCM for determining the optimal number of clusters using cluster validity indexes.
References
Hathaway, R.J., Bezdek, J.C.: Fuzzy \(c\)-means clustering of incomplete data. IEEE Trans. Syst. Man Cybern. Part B 31(5), 735–744 (2001)
Timm, H., Döring, C., Kruse, R.: Fuzzy cluster analysis of partially missing datasets. In: Proceedings of the European Symposium on Intelligent Technologies, Hybrid Systems and Their Implementation on Smart Adaptive Systems (EUNITE 2002), pp. 426–431 (2002)
Sarkar, M., Leong, T.-Y.: Fuzzy K-means clustering with missing values. In: Proceedings of the American Medical Informatics Association Annual Symposium, pp. 588–592 (2001)
van der Laan, M.J., Pollard, K.S.: A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. J. Stat. Plann. Infer. 117(2), 275–303 (2003)
Himmelspach, L., Conrad, S.: Clustering approaches for data with missing values: comparison and evaluation. In: Proceedings of the Fifth IEEE International Conference on Digital Information Management (ICDIM 2010), pp. 19–28 (2010)
Himmelspach, L., Conrad, S.: Fuzzy clustering of incomplete data based on cluster dispersion. In: Hüllermeier, E., Kruse, R., Hoffmann, F. (eds.) IPMU 2010. LNCS, vol. 6178, pp. 59–68. Springer, Heidelberg (2010)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281–297 (1967)
Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell (1981)
Kruse, R., Döring, C., Lesot, M.-J.: Fundamentals of fuzzy clustering. In: Advances in Fuzzy Clustering and Its Applications, pp. 1–30 (2007)
Klawonn, F., Kruse, R., Winkler, R.: Fuzzy clustering: more than just fuzzification. Fuzzy Sets Syst. 281, 272–279 (2015)
Timm, H.: Fuzzy-Clusteranalyse: Methoden zur Exploration von Daten mit fehlenden Werten sowie klassifizierten Daten. Ph.D. thesis, Germany (2002)
Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, 2nd edn. Wiley, New York (2002)
Runkler, T.A.: Comparing partitions by subset similarities. In: Hüllermeier, E., Kruse, R., Hoffmann, F. (eds.) IPMU 2010. LNCS, vol. 6178, pp. 29–38. Springer, Heidelberg (2010)
Timm, H., Döring, C., Kruse, R.: Different approaches to fuzzy clustering of incomplete datasets. Int. J. Approximate Reasoning 35(3), 239–249 (2004)
© 2016 Springer International Publishing Switzerland
Himmelspach, L., Conrad, S. (2016). Fuzzy c-Means Clustering of Incomplete Data Using Dimension-Wise Fuzzy Variances of Clusters. In: Carvalho, J., Lesot, MJ., Kaymak, U., Vieira, S., Bouchon-Meunier, B., Yager, R. (eds) Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2016. Communications in Computer and Information Science, vol 610. Springer, Cham. https://doi.org/10.1007/978-3-319-40596-4_58
Print ISBN: 978-3-319-40595-7
Online ISBN: 978-3-319-40596-4