
1 Introduction

Knowledge discovery in databases provides data owners with new information and patterns in their data. Clustering is a traditional data mining task for automatically grouping objects [3]. Sometimes, clusters are hidden in different views of the data. In these situations, researchers have to apply a multiple clustering method. In general, multiple clustering provides several sets of clusters and thus yields more insight than a single solution. Multiple clustering has three goals: first, each object may be grouped into several clusters, each representing a different perspective on the data. Second, the result should consist of many alternative solutions, so that users can choose one of them or use several. Finally, the solutions should differ from each other to a high extent, so that each solution contributes additional knowledge.

Some recent research on multiple clustering focuses on subspace (or projection) clustering approaches, which detect clusters in subsets of the full attribute set. These approaches group subspaces and objects together to cope with the curse of dimensionality [6]. A potential problem is that subspace clustering introduces redundancy into the multiple views it produces.

To deal with the redundancy problem, several non-redundant subspace clustering approaches have been proposed. These approaches are based on the idea that a pair of subspace clusters that overlap in more than a certain fraction of their subspaces and objects should be regarded as redundant to each other. With this redundancy definition, researchers proposed an algorithm that gradually detects all clusters in different subspaces and adds only non-redundant clusters to the result set. However, the underlying problem is proved to be NP-hard, so they proposed OSCLU, an approximation algorithm. For efficient computation, OSCLU introduces many parameters, and our experiments show that OSCLU is sensitive to their settings. Moreover, the model of adding clusters to the result set incrementally prevents OSCLU from judging cluster redundancy from a global view, which considerably degrades the quality of redundancy removal.

In this paper, we propose a non-redundant subspace clustering approach. In contrast to existing approaches, we generate a non-redundant subspace clustering result in two steps. In the first step, we generate a redundant subspace clustering result with a traditional subspace clustering approach, e.g., SUBCLU [13]. In the second step, we treat each subspace cluster as an object, with its subspace and grouped objects as attributes, and use a traditional clustering approach, e.g., K-Means [12], to group the clusters generated in the first step. We consider the clusters assigned to one group to be similar and thus to provide redundant information, so we pick one cluster to represent the group. All these representative clusters compose the result set. Apart from the parameters of the density-based clustering, we introduce only one parameter, K, which determines the number of result clusters and can easily be set heuristically. Compared with the recently proposed non-redundant multiple clustering approach OSCLU, our approach TSCLU has three main advantages:

  1. TSCLU introduces fewer parameters and is less sensitive to parameter settings.

  2. TSCLU achieves a lower redundancy value than OSCLU and a similar clustering quality as measured by traditional evaluation criteria.

  3. TSCLU removes redundant clusters with a global view, which is more reasonable than OSCLU's incremental scheme.

To show that TSCLU outperforms OSCLU in these respects, we conduct a series of experiments comparing TSCLU with OSCLU under different clustering quality evaluation criteria such as F1 [5, 15], Entropy [4, 19], Accuracy [8, 20], and Coverage. Since there is no existing redundancy evaluation measure, we also propose a redundancy evaluation criterion.

This paper is structured as follows: in Sect. 2 we review existing multiple clustering approaches and subspace clustering evaluation measures. Section 3 introduces our novel non-redundant subspace clustering algorithm. In Sect. 4 we present a series of experiments showing that our approach outperforms OSCLU with fewer parameters and a lower redundancy value. Finally, we conclude and present future work in Sect. 5.

2 Related Work

2.1 Subspace Clustering Approaches

Recent research on clustering in high-dimensional data has introduced a number of different approaches, summarized in [14, 17]. One approach is to detect clusters in arbitrary subspace projections of the data, where each cluster is associated with the set of relevant dimensions in which the pattern has been discovered. With respect to the redundancy of the result set, we divide subspace clustering approaches into two categories: redundant approaches and non-redundant approaches.

2.1.1 Redundant Approaches

We can identify three major paradigms of redundant subspace clustering approaches by the underlying cluster definition and parameterization of the resulting clustering:

Cell-Based Approaches: Cell-based approaches consider clusters as sets of adjacent fixed or variable grid cells that contain more objects than a certain threshold. A well-known cell-based approach is CLIQUE [2], which uses a fixed threshold and a fixed grid and prunes subspaces via a monotonicity property. Based on this idea, researchers also proposed DOC [7], MINECLUS [21], and SCHISM [19].

Density-Based Approaches: Density-based approaches define clusters as dense regions separated by sparse regions. With this definition, we can detect arbitrarily shaped clusters in the relevant dimensions. The first density-based clustering algorithm was DBSCAN [9]; SUBCLU [13] applies DBSCAN to subspaces. Density-based subspace clustering approaches all suffer from the parameter-setting problem and have to deal with the dimensionality bias of density when detecting clusters in subspaces.

Clustering-Oriented Approaches: In contrast to the previous approaches, clustering-oriented approaches focus on the clustering result R by directly specifying objective functions such as the number of clusters to be detected or the average dimensionality of the clusters, as in PROCLUS [1]. As clustering-oriented approaches directly control the resulting clustering rather than individual clusters, their clustering quality is affected by noisy data.

2.1.2 Non-redundant Approaches

Traditional redundant subspace clustering approaches tend to generate a very large number of clusters, so the result contains many redundant clusters. OSCLU is a recently proposed non-redundant subspace clustering approach. It is based on the idea that a pair of subspace clusters that overlap in more than a certain fraction of their subspaces and objects should be regarded as redundant to each other. ASCLU [10] applies this idea in an alternative clustering setting.

2.2 Subspace Clustering Evaluating Framework

In this section, we review existing evaluation measures in the subspace clustering area. Müller et al. [16] introduced a systematic framework for evaluating subspace clustering results. We divide the existing measures into two categories: object-based measures and object- and subspace-based measures.

Classic measures from traditional clustering evaluation, such as Entropy, F1, and Coverage, evaluate the clustering result from the object view.

Since we are doing subspace clustering, subspace selection should also be taken into consideration. Hence there are evaluation measures that split each object into its individual dimensions and evaluate the result on these object-dimension pairs, e.g., RNIA [18] and CE [18].

As reviewed above, many measures describe the quality of a clustering, but they miss an important aspect of subspace clustering: the redundancy of the result. Many clustering approaches devote themselves to redundancy removal; however, they concentrate on the algorithms and merely show by inspection that the result is not redundant. We therefore still lack a redundancy evaluation criterion to measure how well an algorithm removes redundancy.

3 TSCLU

Notations: For consistency in the following sections, we define some notation here. Every cluster C in a subspace projection is defined by a set of objects O that is a subset of the database \(\mathit{DB} =\{ O_{1},O_{2},\ldots ,O_{n}\}\) and by a set of relevant dimensions S out of the set of all dimensions \(D =\{ S_{1},S_{2},\ldots ,S_{m}\}\).

Definition 1.

A cluster C in a subspace projection S is

$$\displaystyle{ C = (O,S)\ \mathit{with}\ O \subseteq \mathit{DB},S \subseteq D }$$
(1)

A clustering result is a set of found clusters in the respective subspace projections.

Definition 2.

A clustering result RES of k clusters is a set of clusters

$$\displaystyle{ \mathit{RES} =\{ C_{1},\ldots ,C_{k}\},C_{i}\ =\ (O_{i},S_{i})\ \mathit{for}\ i = 1\ldots k }$$
(2)
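To make this notation concrete, the following minimal sketch (our illustration, not part of the original paper) shows one way to represent a subspace cluster and a clustering result in Python:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubspaceCluster:
    """A subspace cluster C = (O, S): a set of object indices O drawn from
    DB = {O_1, ..., O_n} and a set of relevant dimension indices S drawn
    from D = {S_1, ..., S_m}."""
    objects: frozenset   # O, as indices into the database
    subspace: frozenset  # S, as dimension indices

# A clustering result RES = {C_1, ..., C_k} is simply a collection of clusters.
res = [
    SubspaceCluster(objects=frozenset({0, 1, 2}), subspace=frozenset({0, 3})),
    SubspaceCluster(objects=frozenset({1, 2, 3}), subspace=frozenset({0, 3, 4})),
]
```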

Let All be the set of all possible subspace clusters. Clearly, All contains a huge number of subspace clusters and can be quite redundant. We need to find an optimal subset of All that minimizes redundancy while keeping a certain clustering quality. We do this in two steps. In the first step, we generate as many clusters as possible in all subspaces. In the second step, we remove redundancy from the result by grouping the clusters generated in the first step: the clusters within a group are considered similar and thus redundant to each other, so we retain the most valuable cluster of each group and remove all the others. Because of this two-step clustering, we call our approach TSCLU (two-step subspace clustering).

3.1 First-Step Clustering

In the first-step clustering, we do not put importance on the redundancy of the result set but on its coverage. To allow a comparison with OSCLU, we also choose a density-based subspace clustering approach. We choose SUBCLU for the first step because it can detect subspace clusters of arbitrary shape in all subspaces.

3.2 Second-Step Clustering

To remove redundancy, we adopt the idea from OSCLU that a pair of subspace clusters that overlap in more than a certain fraction of their subspaces and objects should be regarded as redundant to each other. Instead of setting two subjective overlap-rate thresholds, however, we use a traditional clustering algorithm to group similar subspace clusters together. In this way, similar subspace clusters are grouped with a global view. Afterwards, we pick the most interesting subspace cluster from each group and add it to the result set.

3.2.1 Second-Step Clustering Algorithm

To reduce the time consumption and the number of parameters of the second-step clustering, we choose K-Means to group the subspace clusters generated in the first step. We apply a vector-building procedure that converts a subspace cluster C into two vectors, Vo(C) and Vs(C).

Definition 3.

Considering a subspace cluster C = (O, S), we can convert it to a vector Vo(C) from the object perspective.

$$\displaystyle{ V_{o}(C) = (V(O_{1}),V(O_{2}),\ldots ,V(O_{n})),\quad V(O_{i}) = \begin{cases} 1 & O_{i} \in O\\ 0 & \text{otherwise} \end{cases} }$$
(3)

Definition 4.

Considering a subspace cluster C = (O, S), we can convert it to a vector Vs(C) from the subspace perspective.

$$\displaystyle{ V_{s}(C) = (V(S_{1}),V(S_{2}),\ldots ,V(S_{m})),\quad V(S_{i}) = \begin{cases} 1 & S_{i} \in S\\ 0 & \text{otherwise} \end{cases} }$$
(4)

To calculate the distance between two subspace clusters, we should take both the grouped objects and the relevant dimensions into consideration. We therefore define the subspace cluster distance as the sum of the cosine distances of the object vectors and of the subspace vectors.

Definition 5.

Given two subspace clusters \(C_{i} = (O_{i},S_{i})\) and \(C_{j} = (O_{j},S_{j})\), we define the distance between \(C_{i}\) and \(C_{j}\) as \(\mathit{Dis}(C_{i},C_{j})\):

$$\displaystyle{ \mathit{Dis}(C_{i},C_{j}) = \mathit{Cosine}(\mathit{Vo}(C_{i}),\mathit{Vo}(C_{j})) + \mathit{Cosine}(\mathit{Vs}(C_{i}),\mathit{Vs}(C_{j})) }$$
(5)
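As an illustrative sketch (reusing the `SubspaceCluster` representation above, and not the authors' reference implementation), Definitions 3-5 could be implemented as follows. We read `Cosine` in Eq. (5) as the cosine distance, i.e., one minus the cosine similarity; this is our assumption, since the text calls Dis a distance:

```python
import numpy as np

def object_vector(c: SubspaceCluster, n: int) -> np.ndarray:
    """V_o(C) of Definition 3: binary indicator vector over the n objects."""
    v = np.zeros(n)
    v[list(c.objects)] = 1.0
    return v

def subspace_vector(c: SubspaceCluster, m: int) -> np.ndarray:
    """V_s(C) of Definition 4: binary indicator vector over the m dimensions."""
    v = np.zeros(m)
    v[list(c.subspace)] = 1.0
    return v

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity; well defined here since clusters are non-empty."""
    return 1.0 - float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

def dis(ci: SubspaceCluster, cj: SubspaceCluster, n: int, m: int) -> float:
    """Dis(C_i, C_j) of Definition 5: sum of the object-view and
    subspace-view cosine distances."""
    return (cosine_distance(object_vector(ci, n), object_vector(cj, n))
            + cosine_distance(subspace_vector(ci, m), subspace_vector(cj, m)))
```

Note that off-the-shelf K-Means minimizes squared Euclidean distance rather than Dis. One way to reconcile the two is to run K-Means on the concatenation of the L2-normalized object and subspace vectors: for unit vectors, squared Euclidean distance equals twice the cosine distance, so the squared Euclidean distance between the concatenated vectors equals exactly 2 times Dis.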

3.2.2 Cluster Picking

After grouping similar subspace clusters together, we choose a representative subspace cluster from each group. Since the clusters in one group provide similar information, we want to pick the most interesting one. We make use of the local interestingness definition in [16] to define our interestingness value:

Definition 6.

Given a subspace cluster C = (O, S), we define the interestingness value of C as follows:

$$\displaystyle{ \mathit{Interest}(C) = \vert S\vert^{a} \cdot \vert O\vert^{b} \cdot \left( \frac{1}{\vert O\vert }\sum _{p\in O}\mathit{density}^{S}(p)\right) }$$
(6)

Here \(\mathit{density}^{S}(p)\) denotes the density around object p in subspace S, as defined in [16], so the term in parentheses is the average density of cluster C. By picking the cluster with the highest interestingness value to represent each cluster group and adding it to the result set, we obtain the final non-redundant subspace clustering result.
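A sketch of the picking step, under the assumption that the average density of each cluster (the parenthesized term of Eq. (6), which requires the density model of [16]) has been precomputed, and with illustrative default exponents a and b (the paper does not fix their values):

```python
def interest(c: SubspaceCluster, avg_density: float,
             a: float = 1.0, b: float = 1.0) -> float:
    """Interest(C) of Definition 6: |S|^a * |O|^b * average density.
    avg_density must be (1/|O|) * sum of density^S(p) over p in O,
    computed with the density definition of [16]."""
    return (len(c.subspace) ** a) * (len(c.objects) ** b) * avg_density

def pick_representative(group: list, avg_density: dict) -> SubspaceCluster:
    """Return the most interesting cluster of one K-Means group; all other
    clusters of the group are dropped as redundant."""
    return max(group, key=lambda c: interest(c, avg_density[c]))
```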

4 Experiments

4.1 Redundancy Evaluation

As stated in Sect. 2, many clustering approaches devote themselves to non-redundant results. However, they concentrate on the algorithms and merely show by inspection that the result is not redundant, so we still lack a redundancy evaluation criterion to measure how well an algorithm removes redundancy. In this section, we propose such a criterion and use it to evaluate the redundancy of the clustering results in the following experiments.

4.1.1 Redundancy Definition

Based on the ideas in [10] and [11], we consider subspace cluster pairs that share few common subspace dimensions or grouped objects to be dissimilar. We define the redundancy value between two subspace clusters as the product of the subspace overlap rate and the object overlap rate.

Definition 7.

[Mutual redundancy value] Given two subspace clusters \(C_{i} = (O_{i},S_{i})\) and \(C_{j} = (O_{j},S_{j})\), the mutual redundancy value of \(C_{i}\) and \(C_{j}\) is:

$$\displaystyle{ \mathit{Red}(C_{i},C_{j}) = \frac{\vert O_{i} \cap O_{j}\vert }{\vert O_{j}\vert } \times \frac{\vert S_{i} \cap S_{j}\vert }{\vert S_{j}\vert } }$$
(7)

Based on the mutual redundancy formula above, we can define the total redundancy value of a result set RES. An obvious way is to sum up all the mutual redundancy values.

Definition 8 (Total redundancy value). 

Given a set of subspace clusters \(\mathit{RES} =\{ C_{1},C_{2},C_{3},\ldots ,C_{n}\}\), the total redundancy value \(\mathit{TRed}(\mathit{RES})\) is:

$$\displaystyle{ \mathit{TRed}(\mathit{RES}) =\sum _{i=1}^{n}\sum _{j=1}^{n}\mathit{Red}(C_{i},C_{j}) \times \vert \mathit{sign}(i - j)\vert }$$
(8)
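A direct transcription of Definitions 7 and 8 (our sketch, reusing the `SubspaceCluster` representation from Sect. 3); the \(\vert \mathit{sign}(i - j)\vert\) factor in Eq. (8) simply excludes the i = j terms:

```python
def red(ci: SubspaceCluster, cj: SubspaceCluster) -> float:
    """Mutual redundancy Red(C_i, C_j) of Definition 7: product of the
    object and subspace overlap rates. Note it is asymmetric, since both
    rates are measured relative to C_j."""
    return (len(ci.objects & cj.objects) / len(cj.objects)
            * len(ci.subspace & cj.subspace) / len(cj.subspace))

def total_redundancy(res: list) -> float:
    """TRed(RES) of Definition 8: sum of Red over all ordered pairs with
    i != j (the |sign(i - j)| factor zeroes out the diagonal)."""
    return sum(red(ci, cj)
               for i, ci in enumerate(res)
               for j, cj in enumerate(res)
               if i != j)
```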

4.1.2 Reasonableness

To show that our redundancy evaluation criterion is reasonable, we run experiments on the real-world dataset Glass, which contains 214 objects in 9 dimensions. We apply three subspace clustering approaches: two redundant approaches, CLIQUE and SUBCLU, and one non-redundant approach, OSCLU. The parameters are tuned to obtain the best clustering result, after which we evaluate the redundancy of the results. All experiments are based on the open-source subspace clustering framework OpenSubspace (http://dme.rwth-aachen.de/en/OpenSubspace). We implemented OSCLU on this framework and compared its clustering results to those of CLIQUE and SUBCLU, which are already implemented in the framework. The experimental results are listed below:

Table 1 shows the total redundancy value and the number of clusters for the three approaches.

Table 1 Redundancy value of different approaches

Apparently, OSCLU generated fewer clusters and achieved a significant reduction in the total redundancy value. Although the redundancy value seems quite biased toward cluster sets with few clusters, we have pointed out that a large number of clusters can itself be regarded as a kind of redundancy. The criterion is therefore reasonable, and we apply it in the following experiments. However, we do admit that the cluster number may play too large a role in evaluating the redundancy of a clustering result. It would be better to normalize the redundancy value to the range [0, 1], e.g., by taking the ground truth's information into consideration; we will work on this in the future.

4.2 Comparison Between OSCLU and TSCLU

Generally speaking, OSCLU suffers from three main drawbacks: (1) it introduces too many parameters, which makes the algorithm hard to implement; (2) it is quite sensitive to the parameter settings, and different parameter values may lead to significant differences in result quality; (3) it judges the redundancy of each newly arriving cluster incrementally instead of considering redundancy from a global view, which may make the result suboptimal.

On the other hand, the parameters of TSCLU are quite simple, and the two-step clustering model removes redundancy with a global view. To show that TSCLU outperforms OSCLU in the respects above, we conduct a series of experiments on real-world data. All experiments are based on the OpenSubspace framework.

4.2.1 Parameter Sensitivity

As in the previous experiments, we use the Glass dataset, which contains 214 objects in 9 dimensions. First, we test the clustering quality of TSCLU with different values of K. We set Epsilon = 8.0 and minPts = 32 for the first-step SUBCLU clustering.

Fig. 1 Running time vs. K

Fig. 2 Running time vs. α, β

Fig. 3 Redundancy vs. K

Fig. 4 Redundancy vs. α, β

From Table 2 and Figs. 1, 3, and 5, we can see that TSCLU is not sensitive to the value of K. As K grows, the redundancy grows linearly. Coverage grows quickly up to K = 6, after which it changes little. The other quality measures, such as F1 and Entropy, as well as the time consumption, are not sensitive to K. Hence, when implementing TSCLU, we can simply set K = log(n), where n is the size of the database.
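As a small sketch of this heuristic (our illustration; the paper does not state the base of the logarithm, so we assume the natural logarithm, which matches the K values used in Sect. 4.2.2):

```python
import math

def default_k(n: int) -> int:
    """Heuristic default for TSCLU's only extra parameter: K = log(n).
    Assuming the natural logarithm; clamped to at least 2 clusters."""
    return max(2, round(math.log(n)))

print(default_k(214))  # Glass dataset: K = 5
```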

Table 2 TSCLU clustering quality with different K (number of clusters, from 2 to 12)

Fig. 5 Clustering quality vs. K

To test OSCLU's parameter sensitivity, we compare its clustering quality under different values of α and β. In fact, implementing OSCLU also requires subjectively setting an orthogonality estimation threshold, a quality upper bound, a quality lower bound, and a seed number. To simplify the experiment, we fix these at their best values and compare the quality only under different α and β.

From Table 3 and Figs. 2, 4, and 6, we can see that OSCLU is quite sensitive to the values of α and β. As α and β grow, both redundancy and runtime grow exponentially. The clustering quality also changes considerably with α and β. Due to space limitations, we do not show OSCLU's results under changes of the other parameters here.

Table 3 OSCLU clustering quality with different α and β

Fig. 6 Clustering quality vs. α, β

4.2.2 Clustering Quality

For a fair comparison, we ran extensive experiments with various parameter settings and selected those with the best quality. We set K = 5 and K = 6 for TSCLU, and \(\alpha = 0.3,\ \beta = 0.3\) and \(\alpha = 0.4,\ \beta = 0.4\) for OSCLU.

From Table 4 we can see that TSCLU achieved roughly the same clustering quality as OSCLU while obtaining a significantly smaller redundancy value. As mentioned in Sect. 4.1, the redundancy value is strongly related to the cluster number; for a fair comparison, we therefore also ran an experiment comparing TSCLU and OSCLU at similar cluster numbers (Table 5). Compared to OSCLU, TSCLU still obtained a significantly smaller redundancy value with similar clustering quality.

Table 4 Clustering result of TSCLU and OSCLU (1)

These experiments on real-world data show that TSCLU achieves a significantly smaller redundancy value than OSCLU with similar clustering quality. Moreover, TSCLU is not sensitive to its parameter K, while OSCLU has many parameters and is quite sensitive to them.

Table 5 Clustering result of TSCLU and OSCLU (2)

On the other hand, TSCLU suffers from a relatively long running time. Since K-Means is not a time-consuming algorithm, the long running time is mainly caused by SUBCLU. In the future, we will try other redundant subspace clustering approaches for the first-step clustering of TSCLU.

5 Conclusion

In this paper, we proposed a two-step non-redundant subspace clustering approach, TSCLU, which is based on a two-step clustering model that has not been proposed before. Unlike the traditional non-redundant subspace clustering approaches OSCLU and ASCLU, which measure the redundancy of a cluster while adding it to the result set, TSCLU first applies a redundant subspace clustering approach, which ensures that no clusters are missed, and then groups the clusters to remove redundancy in a second step.

In a series of experiments, we found that TSCLU achieves a clustering quality similar to OSCLU's while obtaining a significantly smaller redundancy value. Moreover, TSCLU needs fewer parameters than OSCLU and is not sensitive to its parameter settings, which makes it much easier to implement. However, we have to point out that TSCLU's running time is much longer than OSCLU's; the long running time is caused by the redundant subspace clustering algorithm SUBCLU used in the first step. The experiments show that the two-step clustering model is reasonable and easy to implement. In the future, we will try different redundant subspace clustering algorithms for the first-step clustering to reduce the running time, and propose a more reasonable redundancy evaluation criterion.