
1 Introduction

Knowledge discovery in databases provides data owners with new information and patterns in their data. Clustering is a traditional data mining task for automatically grouping objects [3]. Sometimes, clusters are hidden in different views of the data. In these situations, researchers have to apply a multiple clustering method. In general, multiple clustering provides several sets of clusters and thus yields more insight than a single solution. Multiple clustering has three goals: first, each object may be grouped into several clusters, each representing a different perspective on the data. Second, the result should consist of many alternative solutions, so that users can choose one of them or use several. Finally, the solutions should differ from each other to a high extent, so that each solution contributes additional knowledge.

Some recent research on multiple clustering focuses on subspace (or projection) clustering approaches, which detect clusters in subsets of the full attribute set. These approaches group subspaces and objects together to cope with the curse of dimensionality [6]. A potential problem is that subspace clustering introduces redundancy into the multiple views it produces.

To deal with the redundancy problem, several non-redundant subspace clustering approaches have been proposed. These approaches are based on the idea that a pair of subspace clusters that overlap in more than a certain fraction of their subspaces and objects should be regarded as redundant to each other. With this redundancy definition, researchers proposed an algorithm that gradually detects all clusters in different subspaces and adds only non-redundant clusters to the result set. However, the underlying problem is proved to be NP-hard, so they proposed OSCLU, an approximation algorithm. For efficient computation, OSCLU introduces many parameters, and our experiments show that OSCLU is sensitive to their settings. Moreover, the model of adding clusters to the result set incrementally prevents OSCLU from judging cluster redundancy from a global view, which considerably degrades the quality of redundancy removal.

In this paper, we propose a non-redundant subspace clustering approach. In contrast to existing approaches, we generate a non-redundant subspace clustering result in two steps. In the first step, we generate a redundant subspace clustering result with a traditional subspace clustering approach, e.g., SUBCLU [13]. In the second step, we treat each subspace cluster as an object, with its subspace and grouped objects as attributes, and use a traditional clustering approach, e.g., K-Means [12], to group the clusters generated in the first step. We consider the clusters assigned to one group to be similar and thus to provide redundant information, so we pick one cluster to represent the group. All these representative clusters compose the result set. Apart from the parameters of the density-based clustering, we introduce only one parameter, K, which determines the number of result clusters and can easily be set heuristically. Compared with the recently proposed non-redundant multiple clustering approach OSCLU, our approach TSCLU has three main advantages:

  1. TSCLU introduces fewer parameters and is less sensitive to parameter settings.

  2. TSCLU achieves a lower redundancy value than OSCLU and a similar clustering quality as measured by traditional evaluation criteria.

  3. TSCLU removes redundant clusters with a global view, which is more reasonable than OSCLU's incremental scheme.

To show that TSCLU outperforms OSCLU in these respects, we conduct a series of experiments comparing TSCLU with OSCLU under different clustering quality evaluation criteria such as F1 [5, 15], Entropy [4, 19], Accuracy [8, 20], and Coverage. Since there is no existing redundancy evaluation measure, we also propose a redundancy evaluation criterion.

This paper is structured as follows: in Sect. 2 we review existing multiple clustering approaches and subspace clustering evaluation measures. Section 3 introduces our novel non-redundant subspace clustering algorithm. In Sect. 4 we present a series of experiments showing that our approach outperforms OSCLU with fewer parameters and a lower redundancy value. Finally, we conclude and present future work in Sect. 5.

2 Related Work

2.1 Subspace Clustering Approaches

Recent research on clustering in high-dimensional data has introduced a number of different approaches, summarized in [14, 17]. One approach is to detect clusters in arbitrary subspace projections of the data, where each cluster is associated with the set of relevant dimensions in which the pattern has been discovered. With respect to the redundancy of the result set, we divide subspace clustering approaches into two categories: redundant approaches and non-redundant approaches.

2.1.1 Redundant Approaches

We can identify three major paradigms of redundant subspace clustering approaches by the underlying cluster definition and parameterization of the resulting clustering:

Cell-Based Approaches: Cell-based approaches consider clusters as sets of adjacent fixed or variable grid cells that contain more objects than a certain threshold. A well-known cell-based approach is CLIQUE [2], which uses a fixed threshold and a fixed grid and prunes subspaces via a monotonicity property. Based on this idea, researchers also proposed DOC [7], MINECLUS [21], and SCHISM [19].

Density-Based Approaches: Density-based approaches define clusters as dense regions separated by sparse regions. With this definition, we can detect arbitrarily shaped clusters in the relevant dimensions. The first density-based clustering algorithm was DBSCAN [9]; SUBCLU [13] applies DBSCAN to subspaces. Density-based subspace clustering approaches all suffer from the parameter-setting problem and have to deal with the dimensionality bias of density when detecting clusters in subspaces.

Clustering-Oriented Approaches: In contrast to the previous approaches, clustering-oriented approaches focus on the clustering result R by directly specifying objective functions such as the number of clusters to be detected or the average dimensionality of the clusters, as in PROCLUS [1]. As clustering-oriented approaches directly control the resulting clustering rather than individual clusters, their clustering quality is affected by noisy data.

2.1.2 Non-redundant Approaches

Traditional redundant subspace clustering approaches tend to generate a very large number of clusters, so the result contains many redundant clusters. OSCLU is a recently proposed non-redundant subspace clustering approach. It is based on the idea that a pair of subspace clusters that overlap in more than a certain fraction of their subspaces and objects should be regarded as redundant to each other. ASCLU [10] applies this idea in an alternative clustering setting.

2.2 Subspace Clustering Evaluating Framework

In this section, we review existing evaluation measures in the subspace clustering area. Müller et al. [16] introduced a systematic framework for evaluating subspace clustering results. We divide the existing measures into two categories: object-based measures and object- and subspace-based measures.

Classic measures from traditional clustering evaluation, such as Entropy, F1, and Coverage, evaluate the clustering result from the object view.

Since we are doing subspace clustering, subspace selection should also be taken into consideration. Hence there are evaluation measures that split each object into its individual dimensions and evaluate the result on these object-dimension pairs, e.g., RNIA [18] and CE [18].

As reviewed above, many measures describe the quality of a clustering, but they miss an important aspect of subspace clustering: the redundancy of the result. Many clustering approaches devote themselves to redundancy removal; however, they concentrate on the algorithms and merely show by inspection that the result is not redundant. We therefore still lack a redundancy evaluation criterion to measure how well an algorithm removes redundancy.

3 TSCLU

Notations: For consistency in the following sections, we define some notation here. Every cluster C in a subspace projection is defined by a set of objects O that is a subset of the database \(\mathit{DB} =\{ O_{1},O_{2},\ldots ,O_{n}\}\) and by a set of relevant dimensions S out of the set of all dimensions \(D =\{ S_{1},S_{2},\ldots ,S_{m}\}\).

Definition 1.

A cluster C in a subspace projection S is

$$\displaystyle{ C = (O,S)\ \mathit{with}\ O \subseteq \mathit{DB},S \subseteq D }$$
(1)

A clustering result is a set of found clusters in the respective subspace projections.

Definition 2.

A clustering result RES of k clusters is a set of clusters

$$\displaystyle{ \mathit{RES} =\{ C_{1},\ldots ,C_{k}\},C_{i}\ =\ (O_{i},S_{i})\ \mathit{for}\ i = 1\ldots k }$$
(2)
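To make this notation concrete, the following minimal sketch (our illustration, not part of the original paper) shows one way to represent a subspace cluster and a clustering result in Python:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubspaceCluster:
    """A subspace cluster C = (O, S): a set of object indices O drawn from
    DB = {O_1, ..., O_n} and a set of relevant dimension indices S drawn
    from D = {S_1, ..., S_m}."""
    objects: frozenset   # O, as indices into the database
    subspace: frozenset  # S, as dimension indices

# A clustering result RES = {C_1, ..., C_k} is simply a collection of clusters.
res = [
    SubspaceCluster(objects=frozenset({0, 1, 2}), subspace=frozenset({0, 3})),
    SubspaceCluster(objects=frozenset({1, 2, 3}), subspace=frozenset({0, 3, 4})),
]
```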

Let All be the set of all possible subspace clusters. Clearly, All contains a huge number of subspace clusters and can be quite redundant. We need to find an optimal subset of All that minimizes redundancy while keeping a certain clustering quality. We do this in two steps. In the first step, we generate as many clusters as possible in all subspaces. In the second step, we remove redundancy from the result by grouping the clusters generated in the first step: the clusters within a group are considered similar and thus redundant to each other, so we retain the most valuable cluster of each group and remove all the others. Because of this two-step clustering, we call our approach TSCLU (two-step subspace clustering).

3.1 First-Step Clustering

In the first-step clustering, we do not put importance on the redundancy of the result set but on its coverage. To allow a comparison with OSCLU, we also choose a density-based subspace clustering approach. We choose SUBCLU for the first step because it can detect subspace clusters of arbitrary shape in all subspaces.

3.2 Second-Step Clustering

To remove redundancy, we adopt the idea from OSCLU that a pair of subspace clusters that overlap in more than a certain fraction of their subspaces and objects should be regarded as redundant to each other. Instead of setting two subjective overlap-rate thresholds, however, we use a traditional clustering algorithm to group similar subspace clusters together. In this way, similar subspace clusters are grouped with a global view. Afterwards, we pick the most interesting subspace cluster from each group and add it to the result set.

3.2.1 Second-Step Clustering Algorithm

To reduce the time consumption and the number of parameters of the second-step clustering, we choose K-Means to group the subspace clusters generated in the first step. We apply a vector-building procedure that converts a subspace cluster C into two vectors, Vo(C) and Vs(C).

Definition 3.

Considering a subspace cluster C = (O, S), we can convert it to a vector Vo(C) from the object perspective.

$$\displaystyle{ V_{o}(C) = (V(O_{1}),V(O_{2}),\ldots ,V(O_{n})),\quad V(O_{i}) = \begin{cases} 1 & O_{i} \in O\\ 0 & \text{otherwise} \end{cases} }$$
(3)

Definition 4.

Considering a subspace cluster C = (O, S), we can convert it to a vector Vs(C) from the subspace perspective.

$$\displaystyle{ V_{s}(C) = (V(S_{1}),V(S_{2}),\ldots ,V(S_{m})),\quad V(S_{i}) = \begin{cases} 1 & S_{i} \in S\\ 0 & \text{otherwise} \end{cases} }$$
(4)

To calculate the distance between two subspace clusters, we should take both the grouped objects and the relevant dimensions into consideration. We therefore define the subspace cluster distance as the sum of the cosine distances of the object vectors and of the subspace vectors.

Definition 5.

Given two subspace clusters \(C_{i} = (O_{i},S_{i})\) and \(C_{j} = (O_{j},S_{j})\), we define the distance between \(C_{i}\) and \(C_{j}\) as \(\mathit{Dis}(C_{i},C_{j})\):

$$\displaystyle{ \mathit{Dis}(C_{i},C_{j}) = \mathit{Cosine}(\mathit{Vo}(C_{i}),\mathit{Vo}(C_{j})) + \mathit{Cosine}(\mathit{Vs}(C_{i}),\mathit{Vs}(C_{j})) }$$
(5)
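As an illustrative sketch (reusing the `SubspaceCluster` representation above, and not the authors' reference implementation), Definitions 3-5 could be implemented as follows. We read `Cosine` in Eq. (5) as the cosine distance, i.e., one minus the cosine similarity; this is our assumption, since the text calls Dis a distance:

```python
import numpy as np

def object_vector(c: SubspaceCluster, n: int) -> np.ndarray:
    """V_o(C) of Definition 3: binary indicator vector over the n objects."""
    v = np.zeros(n)
    v[list(c.objects)] = 1.0
    return v

def subspace_vector(c: SubspaceCluster, m: int) -> np.ndarray:
    """V_s(C) of Definition 4: binary indicator vector over the m dimensions."""
    v = np.zeros(m)
    v[list(c.subspace)] = 1.0
    return v

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity; well defined here since clusters are non-empty."""
    return 1.0 - float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

def dis(ci: SubspaceCluster, cj: SubspaceCluster, n: int, m: int) -> float:
    """Dis(C_i, C_j) of Definition 5: sum of the object-view and
    subspace-view cosine distances."""
    return (cosine_distance(object_vector(ci, n), object_vector(cj, n))
            + cosine_distance(subspace_vector(ci, m), subspace_vector(cj, m)))
```

Note that off-the-shelf K-Means minimizes squared Euclidean distance rather than Dis. One way to reconcile the two is to run K-Means on the concatenation of the L2-normalized object and subspace vectors: for unit vectors, squared Euclidean distance equals twice the cosine distance, so the squared Euclidean distance between the concatenated vectors equals exactly 2 times Dis.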

3.2.2 Cluster Picking

After grouping similar subspace clusters together, we choose a representative subspace cluster from each group. Since the clusters in one group provide similar information, we want to pick the most interesting one. We make use of the local interestingness definition in [16] to define our interestingness value:

Definition 6.

Given a subspace cluster C = (O, S), we define the interestingness value of C as follows:

$$\displaystyle{ \mathit{Interest}(C) = \vert S\vert^{a} \cdot \vert O\vert^{b} \cdot \left( \frac{1}{\vert O\vert }\sum _{p\in O}\mathit{density}^{S}(p)\right) }$$
(6)

Here \(\mathit{density}^{S}(p)\) denotes the density around object p in subspace S, as defined in [16], so the term in parentheses is the average density of cluster C. By picking the cluster with the highest interestingness value to represent each cluster group and adding it to the result set, we obtain the final non-redundant subspace clustering result.
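A sketch of the picking step, under the assumption that the average density of each cluster (the parenthesized term of Eq. (6), which requires the density model of [16]) has been precomputed, and with illustrative default exponents a and b (the paper does not fix their values):

```python
def interest(c: SubspaceCluster, avg_density: float,
             a: float = 1.0, b: float = 1.0) -> float:
    """Interest(C) of Definition 6: |S|^a * |O|^b * average density.
    avg_density must be (1/|O|) * sum of density^S(p) over p in O,
    computed with the density definition of [16]."""
    return (len(c.subspace) ** a) * (len(c.objects) ** b) * avg_density

def pick_representative(group: list, avg_density: dict) -> SubspaceCluster:
    """Return the most interesting cluster of one K-Means group; all other
    clusters of the group are dropped as redundant."""
    return max(group, key=lambda c: interest(c, avg_density[c]))
```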

4 Experiments

4.1 Redundancy Evaluation

As stated in Sect. 2, many clustering approaches devote themselves to non-redundant results. However, they concentrate on the algorithms and merely show by inspection that the result is not redundant, so we still lack a redundancy evaluation criterion to measure how well an algorithm removes redundancy. In this section, we propose such a criterion and use it to evaluate the redundancy of the clustering results in the following experiments.

4.1.1 Redundancy Definition

Based on the ideas in [10] and [11], we consider subspace cluster pairs that share few common subspace dimensions or grouped objects to be dissimilar. We define the redundancy value between two subspace clusters as the product of the subspace overlap rate and the object overlap rate.

Definition 7.

[Mutual redundancy value] Given two subspace clusters \(C_{i} = (O_{i},S_{i})\) and \(C_{j} = (O_{j},S_{j})\), the mutual redundancy value of \(C_{i}\) and \(C_{j}\) is:

$$\displaystyle{ \mathit{Red}(C_{i},C_{j}) = \frac{\vert O_{i} \cap O_{j}\vert }{\vert O_{j}\vert } \times \frac{\vert S_{i} \cap S_{j}\vert }{\vert S_{j}\vert } }$$
(7)

Based on the mutual redundancy formula above, we can define the total redundancy value of a result set RES. An obvious way is to sum up all the mutual redundancy values.

Definition 8 (Total redundancy value). 

Given a set of subspace clusters \(\mathit{RES} =\{ C_{1},C_{2},C_{3},\ldots ,C_{n}\}\), the total redundancy value \(\mathit{TRed}(\mathit{RES})\) is:

$$\displaystyle{ \mathit{TRed}(\mathit{RES}) =\sum _{i=1}^{n}\sum _{j=1}^{n}\mathit{Red}(C_{i},C_{j}) \times \vert \mathit{sign}(i - j)\vert }$$
(8)
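A direct transcription of Definitions 7 and 8 (our sketch, reusing the `SubspaceCluster` representation from Sect. 3); the \(\vert \mathit{sign}(i - j)\vert\) factor in Eq. (8) simply excludes the i = j terms:

```python
def red(ci: SubspaceCluster, cj: SubspaceCluster) -> float:
    """Mutual redundancy Red(C_i, C_j) of Definition 7: product of the
    object and subspace overlap rates. Note it is asymmetric, since both
    rates are measured relative to C_j."""
    return (len(ci.objects & cj.objects) / len(cj.objects)
            * len(ci.subspace & cj.subspace) / len(cj.subspace))

def total_redundancy(res: list) -> float:
    """TRed(RES) of Definition 8: sum of Red over all ordered pairs with
    i != j (the |sign(i - j)| factor zeroes out the diagonal)."""
    return sum(red(ci, cj)
               for i, ci in enumerate(res)
               for j, cj in enumerate(res)
               if i != j)
```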

4.1.2 Reasonableness

To show that our redundancy evaluation criterion is reasonable, we run experiments on the real-world dataset Glass, which contains 214 objects in 9 dimensions. We apply three subspace clustering approaches: two redundant approaches, CLIQUE and SUBCLU, and one non-redundant approach, OSCLU. The parameters are tuned to obtain the best clustering result, after which we evaluate the redundancy of the results. All experiments are based on the open-source subspace clustering framework OpenSubspace (http://dme.rwth-aachen.de/en/OpenSubspace). We implemented OSCLU on this framework and compared its clustering results to those of CLIQUE and SUBCLU, which are already implemented in the framework. The experimental results are listed below:

Table 1 shows the total redundancy value and the number of clusters for the three approaches.

Table 1 Redundancy value of different approaches

Apparently, OSCLU generated fewer clusters and achieved a significant reduction in the total redundancy value. Although the redundancy value seems quite biased toward cluster sets with few clusters, we have pointed out that a large number of clusters can itself be regarded as a kind of redundancy. The criterion is therefore reasonable, and we apply it in the following experiments. However, we do admit that the cluster number may play too large a role in evaluating the redundancy of a clustering result. It would be better to normalize the redundancy value to the range [0, 1], e.g., by taking the ground truth's information into consideration; we will work on this in the future.

4.2 Comparison Between OSCLU and TSCLU

Generally speaking, OSCLU suffers from three main drawbacks: (1) it introduces too many parameters, which makes the algorithm hard to implement; (2) it is quite sensitive to the parameter settings, and different parameter values may lead to significant differences in result quality; (3) it judges the redundancy of each newly arriving cluster incrementally instead of considering redundancy from a global view, which may make the result suboptimal.

On the other hand, the parameters of TSCLU are quite simple, and the two-step clustering model removes redundancy with a global view. To show that TSCLU outperforms OSCLU in the respects above, we conduct a series of experiments on real-world data. All experiments are based on the OpenSubspace framework.

4.2.1 Parameter Sensitivity

As in the previous experiments, we use the Glass dataset, which contains 214 objects in 9 dimensions. First, we test the clustering quality of TSCLU with different values of K. We set Epsilon = 8.0 and minPts = 32 for the first-step SUBCLU clustering.

Fig. 1 Running time vs. K

Fig. 2 Running time vs. α, β

Fig. 3 Redundancy vs. K

Fig. 4 Redundancy vs. α, β

From Table 2 and Figs. 1, 3, and 5, we can see that TSCLU is not sensitive to the value of K. As K grows, the redundancy grows linearly. Coverage grows quickly up to K = 6, after which it changes little. The other quality measures, such as F1 and Entropy, as well as the time consumption, are not sensitive to K. Hence, when implementing TSCLU, we can simply set K = log(n), where n is the size of the database.
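As a small sketch of this heuristic (our illustration; the paper does not state the base of the logarithm, so we assume the natural logarithm, which matches the K values used in Sect. 4.2.2):

```python
import math

def default_k(n: int) -> int:
    """Heuristic default for TSCLU's only extra parameter: K = log(n).
    Assuming the natural logarithm; clamped to at least 2 clusters."""
    return max(2, round(math.log(n)))

print(default_k(214))  # Glass dataset: K = 5
```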

Table 2 TSCLU clustering quality with different K (number of clusters, from 2 to 12)

Fig. 5 Clustering quality vs. K

To test OSCLU's parameter sensitivity, we compare its clustering quality under different values of α and β. In fact, implementing OSCLU also requires subjectively setting an orthogonality estimation threshold, a quality upper bound, a quality lower bound, and a seed number. To simplify the experiment, we fix these at their best values and compare the quality only under different α and β.

From Table 3 and Figs. 2, 4, and 6, we can see that OSCLU is quite sensitive to the values of α and β. As α and β grow, both redundancy and runtime grow exponentially. The clustering quality also changes considerably with α and β. Due to space limitations, we do not show OSCLU's results under changes of the other parameters here.

Table 3 OSCLU clustering quality with different α and β

Fig. 6 Clustering quality vs. α, β

4.2.2 Clustering Quality

For a fair comparison, we ran extensive experiments with various parameter settings and selected those with the best quality. We set K = 5 and K = 6 for TSCLU, and \(\alpha = 0.3,\ \beta = 0.3\) and \(\alpha = 0.4,\ \beta = 0.4\) for OSCLU.

From Table 4 we can see that TSCLU achieved roughly the same clustering quality as OSCLU while obtaining a significantly smaller redundancy value. As mentioned in Sect. 4.1, the redundancy value is strongly related to the cluster number; for a fair comparison, we therefore also ran an experiment comparing TSCLU and OSCLU at similar cluster numbers (Table 5). Compared to OSCLU, TSCLU still obtained a significantly smaller redundancy value with similar clustering quality.

Table 4 Clustering result of TSCLU and OSCLU (1)

These experiments on real-world data show that TSCLU achieves a significantly smaller redundancy value than OSCLU with similar clustering quality. Moreover, TSCLU is not sensitive to its parameter K, while OSCLU has many parameters and is quite sensitive to them.

Table 5 Clustering result of TSCLU and OSCLU (2)

On the other hand, TSCLU suffers from a relatively long running time. Since K-Means is not a time-consuming algorithm, the long running time is mainly caused by SUBCLU. In the future, we will try other redundant subspace clustering approaches for the first-step clustering of TSCLU.

5 Conclusion

In this paper, we proposed a two-step non-redundant subspace clustering approach, TSCLU, which is based on a two-step clustering model that has not been proposed before. Unlike the traditional non-redundant subspace clustering approaches OSCLU and ASCLU, which measure the redundancy of a cluster while adding it to the result set, TSCLU first applies a redundant subspace clustering approach, which ensures that no clusters are missed, and then groups the clusters to remove redundancy in a second step.

In a series of experiments, we found that TSCLU achieves a clustering quality similar to OSCLU's while obtaining a significantly smaller redundancy value. Moreover, TSCLU needs fewer parameters than OSCLU and is not sensitive to its parameter settings, which makes it much easier to implement. However, we have to point out that TSCLU's running time is much longer than OSCLU's; the long running time is caused by the redundant subspace clustering algorithm SUBCLU used in the first step. The experiments show that the two-step clustering model is reasonable and easy to implement. In the future, we will try different redundant subspace clustering algorithms for the first-step clustering to reduce the running time, and propose a more reasonable redundancy evaluation criterion.