Abstract
To detect clusters in high-dimensional databases, researchers have proposed several approaches that detect clusters in different subspaces. These approaches are called subspace clustering approaches. However, most of them produce redundant results, which leads to low clustering quality. Recently, several non-redundant subspace clustering approaches have been proposed, such as OSCLU (Günnemann et al., Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1317–1326. ACM, 2009). For the sake of efficient computation, these non-redundant approaches, which are mainly based on an NP-hard optimization model, introduce many parameters. Our experiments show that the existing non-redundant subspace clustering approaches are sensitive to parameter settings.
To solve the parameter setting problem of existing non-redundant subspace clustering, in this paper we propose a subspace clustering approach based on a two-step clustering model. Our experiments show that our approach outperforms OSCLU by reducing both the number of parameters and their sensitivity.
1 Introduction
Knowledge discovery in databases provides data owners with new information and patterns in their data. Clustering is a traditional data mining task for automatically grouping objects [3]. Sometimes, clusters may be hidden in different views of the data. In these situations, researchers have to apply a multiple clustering method. Generally, multiple clustering provides multiple sets of clusters and thus yields more insight than a single solution. Multiple clustering has three goals: first, each object should be grouped in multiple clusters, representing different perspectives on the data. Second, the result should consist of many alternative solutions, of which users can choose one or use several. Third, solutions should differ from each other to a high extent, so that each of them provides additional knowledge.
Some recent research on multiple clustering focuses on subspace (or projected) clustering approaches, which detect clusters in a subset of the whole attribute set. They group subspaces and objects to deal with the curse of dimensionality [6]. A potential problem is that subspace clustering approaches introduce redundancy among the multiple views.
To deal with the redundancy problem, several non-redundant subspace clustering approaches have been proposed. These approaches are based on the idea that a pair of subspace clusters which share more than a certain amount of overlapping subspaces and objects should be regarded as redundant to each other. With this redundancy definition, researchers proposed an algorithm which gradually detects all the clusters in different subspaces and only adds non-redundant clusters to the result set. However, this problem is proved to be NP-hard, so they proposed OSCLU, an approximation algorithm. For efficient computation, OSCLU introduces many parameters, and our experiments show that OSCLU is sensitive to parameter settings. Besides, the model which adds clusters to the result set gradually prevents OSCLU from judging clusters' redundancy with a global view, which considerably affects redundancy removal quality.
In this paper, we propose a non-redundant subspace clustering approach. Entirely different from the existing approaches, we generate a non-redundant subspace clustering result in two steps. In the first step, we generate a redundant subspace clustering result with a traditional subspace clustering approach, e.g., SUBCLU [13]. In the second step, we treat each subspace cluster as an object, with its subspace and grouped objects as attributes, and use a traditional clustering approach, e.g., K-Means [12], to group the clusters generated in the first step. We consider the clusters grouped together to be similar and thus to provide redundant information, so we pick one cluster to represent each group. All these representative clusters compose the result set. Apart from the parameters for density-based clustering, we introduce only one parameter K, which determines the number of result clusters and can easily be set heuristically. Compared to the newly proposed non-redundant multiple clustering approach OSCLU, our approach TSCLU has three main advantages:
1. TSCLU introduces fewer parameters and is less sensitive to parameter settings.
2. TSCLU achieves a lower redundancy value than OSCLU and similar clustering quality as measured by traditional evaluation criteria.
3. TSCLU removes redundant clusters with a global view, which is more reasonable than OSCLU's gradual approach.
To prove that TSCLU can outperform OSCLU in these respects, we conduct a series of experiments comparing TSCLU with OSCLU under different clustering quality evaluation criteria such as F1 [5, 15], Entropy [4, 19], Accuracy [8, 20], and Coverage. Since there is no existing redundancy evaluation measure, we also propose a redundancy evaluation criterion.
This paper is structured as follows: in Sect. 2 we review existing multiple clustering approaches and existing subspace clustering evaluation measures. Section 3 introduces our novel non-redundant subspace clustering algorithm. In Sect. 4 we present a series of experiments demonstrating that our approach outperforms OSCLU with fewer parameters and a lower redundancy value. Finally, we conclude and present our future work in Sect. 5.
2 Related Work
2.1 Subspace Clustering Approaches
Recent research on clustering high-dimensional data has introduced a number of different approaches, summarized in [14, 17]. One approach is to detect clusters in arbitrary subspace projections of the data. Each cluster is associated with a set of relevant dimensions in which the pattern has been discovered. With respect to the redundancy of the result set, we divide subspace clustering approaches into two categories: redundant approaches and non-redundant approaches.
2.1.1 Redundant Approaches
We can identify three major paradigms of redundant subspace clustering approaches by the underlying cluster definition and parameterization of the resulting clustering:
Cell-Based Approaches: Cell-based approaches consider clusters as sets of adjacent fixed or variable grid cells that contain more than a certain threshold of objects. A famous cell-based approach is CLIQUE [2], which uses a fixed threshold and a fixed grid and prunes subspaces using a monotonicity property. Based on this idea, researchers also proposed DOC [7], MINECLUS [21], and SCHISM [19].
Density-Based Approaches: Density-based approaches define clusters as dense regions separated by sparse regions. With this definition, we can detect arbitrarily shaped clusters in the relevant dimensions. The first density-based clustering algorithm was DBSCAN [9]; SUBCLU [13] applies DBSCAN to subspaces. Density-based subspace clustering approaches all suffer from the parameter setting problem and have to deal with the dimensionality bias of density while detecting clusters in subspaces.
Clustering-Oriented Approaches: In contrast to the previous approaches, clustering-oriented approaches focus on the clustering result R by directly specifying objective properties such as the number of clusters to be detected or the average dimensionality of the clusters, as in PROCLUS [1]. As clustering-oriented approaches directly control the resulting clustering rather than individual clusters, their clustering quality is affected by noisy data.
2.1.2 Non-redundant Approaches
Traditional redundant subspace clustering approaches tend to generate a very large number of clusters, and the result contains many redundant clusters. OSCLU is a recently proposed non-redundant subspace clustering approach. It is based on the idea that a pair of subspace clusters which share more than a certain amount of overlapping subspaces and objects should be regarded as redundant to each other. ASCLU [10] applies this idea to alternative clustering.
2.2 Subspace Clustering Evaluating Framework
In this section, we review some existing evaluation measures in the subspace clustering area. Müller et al. [16] introduced a systematic framework to evaluate subspace clustering results. We divide the existing measures into two categories: object-based measures and object- and subspace-based measures.
We use classic measures from traditional clustering evaluation, such as Entropy, F1, and Coverage, to evaluate the clustering result from the object view.
As we are doing subspace clustering, we should also take subspace selection into consideration. Therefore, there are evaluation measures that split each object into its individual dimensions and evaluate the result based on these divided objects, e.g., RNIA [18] and CE [18].
As reviewed, we already have many measures describing the quality of a clustering, but an important aspect of subspace clustering is missing: the redundancy of the results. Many clustering approaches devote themselves to redundancy removal; however, they concentrate only on the algorithms and rely on informal observation to show that the result is not redundant. We therefore still lack a redundancy evaluation criterion to measure whether an algorithm does well at redundancy removal.
3 TSCLU
Notations: For consistency in the following sections, we define some notations here. Every cluster C in a subspace projection is defined by a set of objects O that is a subset of the database \(\mathit{DB} =\{ O_{1},O_{2},\ldots ,O_{n}\}\) and by a set of relevant dimensions S out of the set of all dimensions \(D =\{ S_{1},S_{2},\ldots ,S_{m}\}\).
Definition 1.
A cluster C in a subspace projection S is a pair \(C =\{ O,S\}\) with a set of grouped objects \(O \subseteq \mathit{DB}\) and a set of relevant dimensions \(S \subseteq D\).
A clustering result is a set of found clusters in the respective subspace projections.
Definition 2.
A clustering result of k clusters is a set of clusters \(\mathit{RES} =\{ C_{1},C_{2},\ldots ,C_{k}\}\).
Let All be the set of all possible subspace clusters. Apparently, All contains a large number of subspace clusters and can be quite redundant. We need to find an optimal subset of All that minimizes redundancy while keeping a certain clustering quality. We do this in two steps. In the first step, we generate as many clusters as possible in all subspaces. In the second step, we remove redundancy from the result by grouping the clusters generated in the first step. We consider the clusters within a group to be similar and thus redundant to each other, so we retain the most valuable cluster of each group and remove all the others. Because of the two-step clustering, we call our approach TSCLU (two-step subspace clustering).
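The two-step model can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes the first step has already produced a list of subspace clusters, each a pair of an object set and a dimension set, encodes every cluster as a binary vector, groups the vectors with a small K-Means, and keeps the most interesting cluster per group. The `interest` function and the deterministic K-Means initialization are assumptions made for the sketch.

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Minimal Lloyd's K-Means; initialised with k distinct rows for
    # determinism (assumes at least k distinct cluster vectors).
    centers = np.unique(X, axis=0)[:k].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def tsclu(clusters, n_objects, n_dims, k, interest):
    """clusters: list of (object_set, dim_set) pairs from the first step."""
    # Encode each subspace cluster as its object membership concatenated
    # with its dimension membership (cf. Definitions 3 and 4 below).
    X = np.zeros((len(clusters), n_objects + n_dims))
    for i, (objs, dims) in enumerate(clusters):
        X[i, list(objs)] = 1.0
        X[i, [n_objects + d for d in dims]] = 1.0
    labels = kmeans(X, k)
    # Second step: keep one representative, the most interesting, per group.
    result = []
    for g in sorted(set(labels.tolist())):
        group = [c for c, lab in zip(clusters, labels) if lab == g]
        result.append(max(group, key=interest))
    return result
```

Any density-based algorithm can supply the first-step clusters; the sketch only depends on the object/dimension sets of each cluster.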
3.1 First-Step Clustering
In the first-step clustering, we focus not on the redundancy of the result set but on its coverage. For comparability with OSCLU, we also choose a density-based subspace clustering approach. We choose SUBCLU for the first-step clustering because it can detect subspace clusters of arbitrary shape in all subspaces.
3.2 Second-Step Clustering
To remove redundancy, we adopt the idea from OSCLU that a pair of subspace clusters which share more than a certain amount of overlapping subspaces and objects should be regarded as redundant to each other. However, instead of setting two subjective overlap rate thresholds, we use a traditional clustering algorithm to group similar subspace clusters together. In this way, we can group similar subspace clusters with a global view. After that, we pick the most interesting subspace cluster from each group and add it to the result set.
3.2.1 Second-Step Clustering Algorithm
To reduce the time consumption and the number of parameters of the second-step clustering, we choose K-Means to group the subspace clusters generated in the first step. We apply a vector-building algorithm to convert a subspace cluster C into two vectors, Vo(C) and Vs(C).
Definition 3.
Considering a subspace cluster \(C =\{ O,S\}\), we convert it to a vector in the object perspective: \(V _{o}(C) = (v_{1},\ldots ,v_{n})\) with \(v_{i} = 1\) if \(O_{i} \in O\) and \(v_{i} = 0\) otherwise.
Definition 4.
Considering a subspace cluster \(C =\{ O,S\}\), we convert it to a vector in the subspace perspective: \(V _{s}(C) = (w_{1},\ldots ,w_{m})\) with \(w_{j} = 1\) if \(S_{j} \in S\) and \(w_{j} = 0\) otherwise.
To calculate the distance between two subspace clusters, we take both the grouped objects and the relevant dimensions into consideration. We therefore sum the cosine distances of the object vectors and of the subspace vectors to obtain the subspace cluster distance.
Definition 5.
Given two subspace clusters \(C_{i} =\{ O_{i},S_{i}\}\) and \(C_{j} =\{ O_{j},S_{j}\}\), we define the distance between \(C_{i}\) and \(C_{j}\) as \(\mathit{Dis}(C_{i},C_{j}) = \mathit{CosDis}(V _{o}(C_{i}),V _{o}(C_{j})) + \mathit{CosDis}(V _{s}(C_{i}),V _{s}(C_{j}))\), where CosDis denotes the cosine distance.
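Under a binary-vector reading of Definitions 3 and 4 (an assumption, since the original displayed formulas are not reproduced here), Definition 5 can be computed as below; the helper names are ours:

```python
import numpy as np

def object_vector(cluster, n_objects):
    # Definition 3: membership indicator over the database objects.
    v = np.zeros(n_objects)
    v[list(cluster[0])] = 1.0
    return v

def subspace_vector(cluster, n_dims):
    # Definition 4: membership indicator over the full dimension set.
    v = np.zeros(n_dims)
    v[list(cluster[1])] = 1.0
    return v

def cos_dis(a, b):
    # Cosine distance: 1 minus the cosine similarity.
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dis(ci, cj, n_objects, n_dims):
    # Definition 5: sum of the cosine distances in both perspectives.
    return (cos_dis(object_vector(ci, n_objects), object_vector(cj, n_objects))
            + cos_dis(subspace_vector(ci, n_dims), subspace_vector(cj, n_dims)))
```

Identical clusters get distance 0, while clusters sharing no objects and no dimensions get the maximum distance of 2.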
3.2.2 Cluster Picking
After grouping similar subspace clusters together, we choose a representative subspace cluster from each group. Since the clusters in one group provide similar information, we want to pick the most interesting one. We make use of the local interestingness definition in [16] to define our interest value:
Definition 6.
Given a subspace cluster C={O, S}, we define the Interest value of C as below:
\(\mathit{density}_{S}(p)\) stands for the average density of cluster C, as defined in [16]. By picking the cluster with the highest interest value to represent each cluster group and adding it to the result set, we obtain the final non-redundant subspace clustering result.
4 Experiments
4.1 Redundancy Evaluation
As stated in Sect. 2, many clustering approaches devote themselves to non-redundant results. However, they concentrate only on the algorithms and rely on informal observation to show that the result is not redundant, so we still lack a redundancy evaluation criterion to measure whether an algorithm does well at redundancy removal. In this section we propose such a criterion and use it to evaluate the redundancy of the clustering results in the following experiments.
4.1.1 Redundancy Definition
Based on the ideas in [10] and [11], we consider subspace cluster pairs that share few common dimensions or grouped objects to be dissimilar. We define the redundancy value between two subspace clusters as the product of the subspace overlap rate and the object overlap rate.
Definition 7.
[Mutual redundancy value] Given two subspace clusters \(C_{i} =\{ O_{i},S_{i}\}\) and \(C_{j} =\{ O_{j},S_{j}\}\), the mutual redundancy value of \(C_{i}\) and \(C_{j}\) is \(\mathit{Red}(C_{i},C_{j}) = \frac{\vert S_{i}\cap S_{j}\vert }{\vert S_{i}\cup S_{j}\vert } \cdot \frac{\vert O_{i}\cap O_{j}\vert }{\vert O_{i}\cup O_{j}\vert }\).
Based on the mutual redundancy value above, we can define the total redundancy value of a result set RES. A straightforward way is to sum up all the mutual redundancy values.
Definition 8 (Total redundancy value).
Given a set of subspace clusters \(\mathit{RES} =\{ C_{1},C_{2},C_{3},\ldots ,C_{n}\}\), the total redundancy value is \(\mathit{TRed}(\mathit{RES}) =\sum _{1\leq i<j\leq n}\mathit{Red}(C_{i},C_{j})\).
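Definitions 7 and 8 translate directly into code. The sketch below assumes a Jaccard-style overlap rate (intersection over union), which is one natural reading of the overlap rates described in the text:

```python
from itertools import combinations

def mutual_redundancy(ci, cj):
    # Definition 7: product of the subspace overlap rate and the object
    # overlap rate (Jaccard-style rates assumed).
    oi, si = ci
    oj, sj = cj
    return (len(si & sj) / len(si | sj)) * (len(oi & oj) / len(oi | oj))

def total_redundancy(res):
    # Definition 8: sum of the mutual redundancy over all cluster pairs.
    return sum(mutual_redundancy(ci, cj) for ci, cj in combinations(res, 2))
```

Duplicated clusters contribute 1 per pair, while clusters with disjoint objects or disjoint subspaces contribute 0, so results with many near-identical clusters accumulate large totals.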
4.1.2 Reasonableness
To show that our redundancy evaluation criterion is reasonable, we run experiments on the real-world dataset glass, which contains 214 objects in 9 dimensions. We apply three subspace clustering approaches: two redundant approaches, CLIQUE and SUBCLU, and one non-redundant approach, OSCLU. The parameters are tuned to obtain the best clustering result, after which we evaluate the redundancy of the results. All experiments are based on the open-source subspace clustering framework OpenSubspace (http://dme.rwth-aachen.de/en/OpenSubspace). We implemented OSCLU on this framework and compared its clustering result to CLIQUE and SUBCLU, which are already implemented there. The experiment results are listed below:
Table 1 shows the total redundancy and cluster number of three different approaches.
Apparently, OSCLU generated fewer clusters and achieved a significant reduction of the total redundancy value. Although the redundancy value seems to be biased toward cluster sets with a smaller number of clusters, we have pointed out that a large cluster number can itself be regarded as a kind of redundancy. So the criterion is reasonable, and we apply it in the following experiments. However, we do admit that the cluster number may play too important a role in evaluating the redundancy of a clustering result. It would be better to normalize the redundancy value to be bounded in [0, 1], e.g., by taking the ground truth's information into consideration. We will work on this in the future.
4.2 Comparison Between OSCLU and TSCLU
Generally speaking, OSCLU suffers from three main drawbacks: (1) OSCLU introduces too many parameters, which makes the algorithm hard to implement. (2) OSCLU is quite sensitive to parameter settings, and different parameter values may lead to significant differences in result quality. (3) OSCLU considers the redundancy of each newly found cluster gradually instead of considering redundancy with a global view, which may make the result suboptimal.
On the other hand, the parameters of TSCLU are quite simple, and the two-step clustering model can remove redundancy with a global view. To prove that TSCLU outperforms OSCLU in the respects above, we run a series of experiments on real-world data. All experiments are based on the OpenSubspace framework.
4.2.1 Parameter Sensitivity
As in the experiments before, we use the glass dataset, which contains 214 objects and 9 dimensions. First, we test the clustering quality of TSCLU with different values of the parameter K. We set Epsilon = 8.0 and minPt = 32 for the first-step SUBCLU clustering.
From Table 2 and Figs. 1, 3, and 5, we can see that TSCLU is not sensitive to the value of K. As K grows, the redundancy grows linearly. Coverage grows fast before K = 6; after that, coverage does not change much. Other quality measures, such as F1, Entropy, and time consumption, are not sensitive to K. So, when implementing TSCLU, we can simply set \(K =\log (n)\), where n is the size of the database.
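The K = log(n) heuristic can be read as the natural logarithm rounded to an integer. This reading is our assumption, but for glass with n = 214 it gives K = 5, consistent with the values used in the quality comparison below:

```python
import math

def heuristic_k(n_objects):
    # K = log(n) heuristic; natural log rounded to the nearest integer
    # is assumed, with a floor of 2 so K-Means stays meaningful.
    return max(2, round(math.log(n_objects)))
```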
To test OSCLU’s parameter sensitivity, we compare its clustering quality under different α and β. In fact, to implement OSCLU we also have to subjectively set the orthogonality estimation threshold, the quality upper and lower bounds, and the seed number. To simplify the experiment, we fix these at their best values and compare quality only under different α and β.
From Table 3 and Figs. 2, 4, and 6, we can see that OSCLU is quite sensitive to the values of α and β. As α and β grow, both redundancy and runtime grow exponentially, and the clustering quality also changes a lot. Due to space limitations, we do not show OSCLU's results under changes of the other parameters here.
4.2.2 Clustering Quality
For a fair comparison, we ran a large number of experiments with various parameter settings and selected the ones with the best quality. We set k = 5 and 6 for TSCLU, and \(\alpha = 0.3,\ \beta = 0.3\) and \(\alpha = 0.4,\ \beta = 0.4\) for OSCLU.
From Table 4 we can see that TSCLU achieved roughly the same clustering quality as OSCLU while obtaining a significantly smaller redundancy value. As mentioned in Sect. 4.1, the redundancy value is quite sensitive to the cluster number; for a fair comparison, we therefore also ran an experiment comparing TSCLU and OSCLU at similar cluster numbers. Compared to OSCLU, TSCLU still achieved a significantly smaller redundancy value with similar clustering quality.
The experiments on real-world data show that TSCLU achieves a significantly smaller redundancy value than OSCLU with similar clustering quality. Besides, we find that TSCLU is not sensitive to the parameter K, while OSCLU has many parameters and is quite sensitive to them.
On the other hand, TSCLU suffers from a relatively long running time. Since K-Means is not a time-consuming algorithm, the long running time is mainly caused by SUBCLU. In the future, we will try other redundant subspace clustering approaches for the first-step clustering of TSCLU.
5 Conclusion
In this paper, we proposed a two-step non-redundant subspace clustering approach: TSCLU. The approach is based on a two-step clustering model which has not been proposed before. Unlike the traditional non-redundant subspace clustering approaches OSCLU and ASCLU, which measure the redundancy of a cluster while adding it to the result set, TSCLU first applies a redundant subspace clustering approach that ensures no clusters are missed, and then groups the clusters to remove redundancy in the second step.
With a series of experiments, we find that TSCLU achieves similar clustering quality to OSCLU while obtaining a significantly smaller redundancy value. Besides, TSCLU needs fewer parameters than OSCLU and is not sensitive to parameter settings, which makes TSCLU much easier to implement. However, we have to point out that TSCLU's running time is much longer than OSCLU's; it is dominated by the redundant subspace clustering algorithm SUBCLU used in the first step. The experiments show that the two-step clustering model is reasonable and easy to implement. In the future we will try different redundant subspace clustering algorithms for the first-step clustering to reduce the running time, and we will propose a more reasonable redundancy evaluation criterion.
References
Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. ACM SIGMOD Rec. 28(2), 61–72 (1999)
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, vol. 27. ACM, New York (1998)
Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)
Assent, I., Krieger, R., Muller, E., Seidl, T.: Dusc: Dimensionality unbiased subspace clustering. In: ICDM 2007. Seventh IEEE International Conference on Data Mining, 2007, pp. 409–414. IEEE, (2007)
Assent, I., Krieger, R., Muller, E., Seidl, T.: Inscy: Indexing subspace clusters with in-process-removal of redundancy. In: ICDM 2008. Eighth IEEE International Conference on Data Mining, 2008 pp. 719–724. IEEE, (2008)
Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is nearest neighbor meaningful? In: Database Theory ICDT’99, pp. 217–235. Springer, (1999)
Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A Monte Carlo algorithm for fast projective clustering. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 418–427. ACM (2002)
Bringmann, B., Zimmermann, A.: The chosen few: On identifying valuable patterns. In: ICDM 2007. Seventh IEEE International Conference on Data Mining, 2007, pp. 63–72. IEEE, 2007
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data mining, vol. 1996, pp. 226–231. AAAI Press, (1996)
Günnemann, S., Färber, I., Müller, E., Seidl, T.: Asclu: Alternative subspace clustering. In: In MultiClust at KDD. Citeseer, (2010)
Günnemann, S., Müller, E., Färber, I., Seidl, T.: Detection of orthogonal concepts in subspaces of high dimensional data. In: Proceeding of the 18th ACM Conference on Information and Knowledge Management, pp. 1317–1326. ACM, New York (2009)
Hartigan, J.A., Wong, M.A.: Algorithm as 136: A k-means clustering algorithm. J. R. Stat. Soc. C (Applied Statistics) 28(1), 100–108 (1979)
Kailing, K., Kriegel, H.P., Kröger, P.: Density-connected subspace clustering for high-dimensional data. In: Proc. SDM, vol. 4, 2004
Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data 3(1), 1 (2009)
Müller, E., Assent, I., Krieger, R., Günnemann, S., Seidl, T.: Densest: Density estimation for data mining in high dimensional spaces. In: Proc. SIAM SDM, pp. 173–184, 2009
Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace projections of high dimensional data. Proc. VLDB Endowment 2(1), 1270–1281 (2009)
Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter 6(1), 90–105 (2004)
Patrikainen, A., Meila, M.: Comparing subspace clusterings. IEEE Trans. Knowl. Data Eng. 18(7), 902–916 (2006)
Sequeira, K., Zaki, M.: Schism: A new approach for interesting subspace mining. In: ICDM’04. Fourth IEEE International Conference on Data Mining, 2004, pp. 186–193. IEEE, (2004)
Ultsch, A.: Maps for the visualization of high-dimensional data spaces. In: Proc. Workshop on Self organizing Maps, pp. 225–230, 2003
Yiu, M.L., Mamoulis, N.: Frequent-pattern based iterative projected clustering. In: ICDM 2003. Third IEEE International Conference on Data Mining, 2003, pp. 689–692. IEEE, (2003)
Acknowledgements
This work is supported by the National Basic Research Program of China (973 Program, Grant No. 2012CB315803), the National High-tech R&D Program of China 863 Program (Grant No. 2011AA01A101), the National Natural Science Foundation of China (Grant No. 61003100 and No.60972011), and Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20100002120018 and No.20100002110033).
Copyright information
© 2013 Springer Science+Business Media New York
Cite this paper
Zheng, HT., Chen, H., Jiang, Y., Xia, ST., Li, H. (2013). A Two-Step Non-redundant Subspace Clustering Approach. In: Li, J., Qi, G., Zhao, D., Nejdl, W., Zheng, HT. (eds) Semantic Web and Web Science. Springer Proceedings in Complexity. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6880-6_18
Print ISBN: 978-1-4614-6879-0
Online ISBN: 978-1-4614-6880-6