Abstract
Subspace clustering finds sets of objects that are homogeneous in subspaces of high-dimensional datasets, and has been successfully applied in many domains. In recent years, a new breed of subspace clustering algorithms, which we denote as enhanced subspace clustering algorithms, have been proposed to (1) handle the increasing abundance and complexity of data and to (2) improve the clustering results. In this survey, we present these enhanced approaches to subspace clustering by discussing the problems they are solving, their cluster definitions and algorithms. Besides enhanced subspace clustering, we also present the basic subspace clustering and the related works in high-dimensional clustering.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Achtert E, Böhm C, Kriegel HP, Kröger P, Müller-Gorman I, Zimek A (2006a) Finding hierarchies of subspace clusters. In: Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases (PKDD), pp 446–453
Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A (2006b) Deriving quantitative models for correlation clusters. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (KDD), pp 4–13
Achtert E, Böhm C, Kriegel HP, Kröger P, Müller-Gorman I, Zimek A (2007) Detection and visualization of subspace cluster hierarchies. In: Proceedings of the 12th international conference on database systems for advanced applications (DASFAA), pp 152–163
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of 20th international conference on very large data bases (VLDB), pp 487–499
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 94–105
Aggarwal CC, Wolf JL, Yu PS, Procopiuc C, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 61–72
Aggarwal CC, Hinneburg A, Keim D (2001) On the surprising behavior of distance metrics in high dimensional space. In: Proceedings of the 8th international conference on database theory (ICDT), pp 420–434
Aggarwal CC, Han J, Wang J, Yu PS (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of 30th international conference on very large data bases (VLDB), pp 852–863
Assent I, Krieger R, Müller E, Seidl T (2007) DUSC: dimensionality unbiased subspace clustering. In: Proceedings of the 7th IEEE international conference on data mining (ICDM), pp 409–414
Assent I, Krieger R, Müller E, Seidl T (2008a) EDSC: efficient density-based subspace clustering. In: Proceedings of the 17th ACM conference on information and knowledge management (CIKM), pp 1093–1102
Assent I, Krieger R, Müller E, Seidl T (2008b) INSCY: indexing subspace clusters with in-process-removal of redundancy. In: Proceedings of the 8th IEEE international conference on data mining (ICDM), pp 719–724
Avis D, Fukuda K (1996) Reverse search for enumeration. Discr Appl Math 65(1-3): 21–46
Bennett KP, Fayyad U, Geiger D (1999) Density-based indexing for approximate nearest-neighbor queries. In: Proceedings of the 5th ACM international conference on knowledge discovery and data mining KDD, pp 233–243
Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping multidimensional data, chap 2. Springer, New York, pp 25–71
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful?. In: Proceedings of the 7th international conference on database theory (ICDT), pp 217–235
Böhm C, Kailing K, Kröger P, Zimek A (2004) Computing clusters of correlation connected objects. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 455–466
Breiman L (2001) Statistical modeling: the two cultures. Stat Sci 16(3): 199–231
Cerf L, Besson J, Robardet C, Boulicaut JF (2008) Data peeler: contraint-based closed pattern mining in n-ary relations. In: Proceedings of the 8th SIAM international conference on data mining (SDM), pp 37–48
Cerf L, Besson J, Robardet C, Boulicaut JF (2009) Closed patterns meet n-ary relations. Trans Knowl Discov Data 3(1): 1–36
Chan EY, Ching WK, Ng MK, Huang JZ (2004) An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recog 37(5): 943–952
Cheng CH, Fu AW, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the 5th ACM international conference on knowledge discovery and data mining (KDD), pp 84–93
Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the 18th international conference on intelligent systems for molecular biology (ISMB), pp 93–103
Chiaravalloti AD, Greco G, Guzzo A, Pontieri L (2006) An information-theoretic framework for process structure and data mining. In: Proceedings of the 8th international conference on data warehousing and knowledge discovery (DaWaK), pp 248–259
Dai W, Yang Q, Xue GR, Yu Y (2008) Self-taught clustering. In: Proceedings of the 25th international conference on machine learning (ICML), pp 200–207
Dash M, Choi K, Scheuermann P, Liu H (2002) Feature selection for clustering - a filter solution. In: Proceedings of the 2nd IEEE international conference on data mining (ICDM), pp 115–122
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th ACM international conference on knowledge discovery and data mining (KDD), pp 269–274
Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the 9th ACM international conference on knowledge discovery and data mining (KDD), pp 89–98
Ding CHQ, He X, Zha H, Simon HD (2002) Adaptive dimension reduction for clustering high dimensional data. In: Proceedings of the 2nd IEEE international conference on data mining (ICDM), pp 147–154
Domeniconi C, Papadopoulos D, Gunopulos D, Ma S (2004) Subspace clustering of high dimensional data. In: Proceedings of the 4th SIAM international conference on data mining (SDM), pp 517–521
Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, New York
Faloutsos C, Megalooikonomou V (2007) On data mining, compression, and kolmogorov complexity. Data Mining Knowl Discov 15(1): 3–20
Färber I, Günnemann S, Kriegel HP, Kröger P, Müller E, Schubert E, Seidl T, Zimek A (2010) On using class-labels in evaluation of clusterings. In: Proceedings of the 1st international workshop on discovering, summarizing and using multiple clusterings (MultiClust) held in conjunction with KDD 2010
Francois D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7): 873–886
Fromont É, Prado A, Robardet C (2009) Constraint-based subspace clustering. In: Proceedings of the 9th SIAM international conference on data mining (SDM), pp 26–37
Fu Q, Banerjee A (2009) Bayesian overlapping subspace clustering. In: Proceedings of the 9th IEEE international conference on data mining (ICDM), pp 776–781
Gao B, Liu TY, Ma WY (2006) Star-structured high-order heterogeneous data co-clustering based on consistent information theory. In: Proceedings of the 6th IEEE international conference on data mining (ICDM), pp 880–884
Georgii E, Tsuda K, Schölkopf B (2010) Multi-way set enumeration in weight tensors. Mach Learn 82(2): 123–155
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the 15th international conference on data engineering (ICDE), pp 512–521
Günnemann S, Müller E, Färber I, Seidl T (2009) Detection of orthogonal concepts in subspaces of high dimensional data. In: Proceedings of the 18th ACM conference on information and knowledge management (CIKM), pp 1317–1326
Günnemann S, Färber I, Boden B, Seidl T (2010a) Subspace clustering meets dense subgraph mining: a synthesis of two paradigms. In: Proceedings of the 10th IEEE international conference on data mining (ICDM), pp 845–850
Günnemann S, Färber I, Müller E, Seidl T (2010b) ASCLU: alternative subspace clustering. In: Proceedings of the 1st international workshop on discovering, summarizing and using multiple clusterings (MultiClust) held in conjunction with KDD 2010
Günnemann S, Kremer H, Seidl T (2010c) Subspace clustering for uncertain data. In: Proceedings of the 10th SIAM international conference on data mining (SDM), pp 385–396
Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces?. In: Proceedings of the 26th international conference on very large data bases (VLDB), pp 506–515
Houle ME, Kriegel HP, Kröger P, Schubert E, Zimek A (2010) Can shared-neighbor distances defeat the curse of dimensionality?. In: Proceedings of the 22nd international conference on scientific and statistical database management (SSDBM)
Hsu CM, Chen MS (2004) Subspace clustering of high dimensional spatial data with noises. In: Proceedings of the 8th Pacific-Asia conference advances in knowledge discovery and data mining (PAKDD), pp 31–40
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3): 264–323
Jaschke R, Hotho A, Schmitz C, Ganter B, Stumme G (2006) TRIAS–an algorithm for mining iceberg tri-lattices. In: Proceedings of the 6th IEEE international conference on data mining (ICDM), pp 907–911
Ji L, Tan KL, Tung AKH (2006) Mining frequent closed cubes in 3D datasets. In: Proceedings of the 32nd international conference on very large data bases (VLDB), pp 811–822
Jiang D, Pei J, Ramanathan M, Tang C, Zhang A (2004a) Mining coherent gene clusters from gene-sample-time microarray data. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (KDD), pp 430–439
Jiang D, Tang C, Zhang A (2004b) Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng 16(11): 1370–1386
Jing L, Ng MK, Huang JZ (2007) An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans Knowl Data Eng 19(8): 1026–1041
Kailing K, Kriegel HP, Kröger P, Wanka S (2003) Ranking interesting subspaces for clustering high dimensional data. In: Proceedings of the 7th European conference on principles and practice of knowledge discovery in databases (PKDD), pp 241–252
Kailing K, Kröger P, Kriegel HP (2004) Density-connected subspace clustering for high-dimensional data. In: Proceedings of the 4th SIAM international conference on data mining (SDM), pp 246–257
Ke Y, Cheng J, Ng W (2006) Mining quantitative correlated patterns using an information-theoretic approach. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (KDD), pp 227–236
Keogh EJ, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (KDD), pp 206–215
Kleinberg J, Papadimitriou C, Raghavan P (1998) A microeconomic view of data mining. Data Mining Knowl Discov 2(4): 311–324
Kohavi R, John G (1997) Wrappers for feature subset selection. Artif Intell 97(1-2): 273–324
Kontaki M, Papadopoulos AN, Manolopoulos Y (2008) Continuous subspace clustering in streaming time series. Inf Syst 33(2): 240–260
Kriegel HP, Zimek A (2010) Subspace clustering, ensemble clustering, alternative clustering, multiview clustering: what can we learn from each other? In: Proceedings of the 1st international workshop on discovering, summarizing and using multiple clusterings (MultiClust) held in conjunction with KDD 2010
Kriegel HP, Kröger P, Renz M, Wurst S (2005) A generic framework for efficient subspace clustering of high-dimensional data. In: Proceedings of the 5th IEEE international conference on data mining (ICDM), pp 250–257
Kriegel HP, Borgwardt KM, Kröger P, Pryakhin A, Schubert M, Zimek A (2007) Future trends in data mining. Data Mining Knowl Discov 15(1): 87–97
Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1): 1–58
Kriegel HP, Kröger P, Ntoutsi I, Zimek A (2011) Density based subspace clustering over dynamic data. In: Proceedings of the 23rd international conference on scientific and statistical database management (SSDBM), pp 387–404
Li T, Ma S, Ogihara M (2004) Document clustering via adaptive subspace iteration. In: Proceedings of the 27th ACM international conference on research and development in information retrieval (SIGIR), ACM, pp 218–225
Li J, Li H, Soh D, Wong L (2005) A correspondence between maximal complete bipartite subgraphs and closed patterns. In: Proceedings of the 9th European conference on principles and practice of knowledge discovery in databases (PKDD), pp 146–156
Li J, Sim K, Liu G, Wong L (2008) Maximal quasi-bicliques with balanced noise tolerance: concepts and co-clustering applications. In: Proceedings of the 8th SIAM international conference on data mining (SDM), pp 72–83
Liu G, Sim K, Li J (2006) Efficient mining of large maximal bicliques. In: Proceedings of the 8th international conference on data warehousing and knowledge discovery (DaWak), pp 437–448
Liu G, Sim K, Li J, Wong L (2009) Efficient mining of distance-based subspace clusters. Stat Anal Data Mining 2(5-6): 427–444
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1(1): 24–45
Mishra N, Ron D, Swaminathan R (2005) A new conceptual clustering framework. Mach Learn 56(1-3): 115–151
Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (KDD), pp 533–541
Moise G, Zimek A, Kröger P, Kriegel HP, Sander J (2009) Subspace and projected clustering: experimental evaluation and analysis. Knowl Inf Syst 21(3): 299–326
Müller E, Assent I, Krieger R, Jansen T, Seidl T (2008) Morpheus: interactive exploration of subspace clustering. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (KDD), pp 1089–1092
Müller E, Assent I, Günnemann S, Krieger R, Seidl T (2009a) Relevant subspace clustering: mining the most interesting non-redundant concepts in high dimensional data. In: Proceedings of the 9th IEEE international conference on data mining (ICDM), pp 377–386
Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009b) DensEst: density estimation for data mining in high dimensional spaces. In: Proceedings of the 9th SIAM international conference on data mining (SDM), pp 173–184
Müller E, Assent I, Seidl T (2009c) HSM: heterogeneous subspace mining in high dimensional. In: Proceedings of the 21st international conference on scientific and statistical database management (SSDBM), pp 497–516
Müller E, Günnemann S, Assent I, Seidl T (2009d) Evaluating clustering in subspace projections of high dimensional data. Proc VLDB Endow 2(1): 1270–1281
Nagesh H, Goil S, Choudhary A (2001) Adaptive grids for clustering massive data sets. In: Proceedings of the 1st SIAM international conference on data mining (SDM)
Nocedal J, Wright SJ (2006) Numerical optimization. Springer, New York, pp 497–528
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10): 1345–1359
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1): 90–105
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the 7th international conference on database theory (ICDT), pp 398–416
Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Trans Knowl Data Eng 18(7): 902–916
Pensa R, Boulicaut J (2008) Constrained co-clustering of gene expression data. In: Proceedings of the 8th SIAM international conference on data mining (SDM), pp 25–36
Rege M, Dong M, Fotouhi F (2006) Co-clustering documents and words using bipartite isoperimetric graph partitioning. In: Proceedings of the 6th IEEE international conference on data mining (ICDM), pp 532–541
Rymon R (1992) Search through systematic set enumeration. In: Proceedings of the 8th international conference on principles and knowledge representation and reasoning (KR), pp 539–550
Sequeira K, Zaki MJ (2004) SCHISM: a new approach for interesting subspace mining. In: Proceedings of the 4th IEEE international conference on data mining (ICDM), pp 186–193
Silverman BW (1986) Density estimation for statistics and data analysis (Chapman and Hall/CRC monographs on statistics and applied probability), 1st edn. Chapman and Hall/CRC, London
Sim K, Li J, Gopalkrishnan V, Liu G (2006) Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment. In: Proceedings of the 6th IEEE international conference on data mining (ICDM), pp 1059–1063
Sim K, Gopalkrishnan V, Chua HN, Ng SK (2009a) MACs: multi-attribute co-clusters with high correlation information. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), pp 398–413
Sim K, Li J, Gopalkrishnan V, Liu G (2009b) Mining maximal quasi-bicliques: novel algorithm and applications in the stock market and protein networks. Stat Anal Data Mining 2(4): 255–273
Sim K, Aung A, Vivekanand G (2010a) Discovering correlated subspace clusters in 3D continuous-valued data. In: Proceedings of the 10th IEEE international conference on data mining (ICDM), pp 471–480
Sim K, Poernomo AK, Gopalkrishnan V (2010b) Mining actionable subspace clusters in sequential data. In: Proceedings of the 10th SIAM international conference on data mining (SDM), pp 442–453
Sim K, Liu G, Gopalkrishna V, Li J (2011) A case study on financial ratios via cross-graph quasi-bicliques. Inf Sci 181(1): 201–216
Snedecor GW, Cochran WG (1989) Statistical methods, 8th edn. Iowa State University Press, Ames
Srikant R, Agrawal R (1996) Mining quantitative association rules in large relational tables. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 1–12
Sun J, Faloutsos C, Papadimitriou S, Yu PS (2007) Graphscope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (KDD), pp 687–696
Tanay A, Sharan R, Shamir R (2004) Biclustering algorithms: a survey. Handbook of computational molecular biology. Chapman & Hall/CRC, London
Tomita E, Tanaka A, Takahashi H (2004) The worst-case time complexity for generating all maximal cliques. In: Proceedings of the 10th international computing and combinatorics conference (COCOON), pp 161–170
Uno T, Kiyomi M, Arimura H (2004) LCM ver. 2: efficient mining algorithms for frequent/closed/maximal itemsets. In: Proceedings of the 2nd international workshop on frequent itemset mining implementations (FIMI) held in conjuction with ICDM 2004
Vreeken J, Zimek A (2011) When pattern met subspace cluster—a relationship story. In: Proceedings of the 2nd international workshop on discovering, summarizing and using multiple clusterings (MultiClust) held in conjunction with ECML PKDD 2011, pp 7–18
Wagstaff K, Cardie C, Rogers S, Schrödl S (2001) Constrained k-means clustering with background knowledge. In: Proceedings of the 18th international conference on machine learning (ICML), pp 577–584
Wang H, Wang W, Yang J, Yu PS (2002) Clustering by pattern similarity in large data sets. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 394–405
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3): 645–678
Xu X, Lu Y, Tung AKH, Wang W (2006) Mining shifting-and-scaling co-regulation patterns on gene expression profiles. In: Proceedings of the 22nd international conference on data engineering (ICDE), p 89
Xu X, Lu Y, Tan KL, Tung AKH (2009) Finding time-lagged 3D clusters. In: Proceedings of the 25th international conference on data engineering (ICDE), pp 445–456
Yan C, Burleigh JG, Eulenstein O (2005) Identifying optimal incomplete phylogenetic data sets from sequence databases. Mol Phylogenet Evol 35: 528–535
Yang J, Wang W, Wang H, Yu P (2002) δ-clusters: capturing subspace correlation in a large data set. In: Proceedings of the 19th international conference on data engineering (ICDE), pp 517–528
Zaki MJ, Peters M, Assent I, Seidl T (2005) CLICKS: an effective algorithm for mining subspace clusters in categorical datasets. In: Proceedings of the 11th ACM international conference on knowledge discovery and data mining (KDD), pp 736–742
Zhang X, Wang W (2007) An efficient algorithm for mining coherent patterns from heterogeneous microarrays. In: Proceedings of the 19th international conference on scientific and statistical database management (SSDBM), p 32
Zhang Q, Liu J, Wang W (2007) Incremental subspace clustering over multiple data streams. In: Proceedings of the 7th IEEE international conference on data mining (ICDM), pp 727–732
Zhao L, Zaki MJ (2005) TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data. In: Proceedings of the 25th ACM international conference on management of data (SIGMOD), pp 694–705
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Charu Aggarwal.
Rights and permissions
About this article
Cite this article
Sim, K., Gopalkrishnan, V., Zimek, A. et al. A survey on enhanced subspace clustering. Data Min Knowl Disc 26, 332–397 (2013). https://doi.org/10.1007/s10618-012-0258-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-012-0258-x