
1 Introduction

Clustering methods try to find similar documents and group them together in clusters. Documents are generally represented in a Vector Space Model, where each distinct term is treated as a feature, so the feature space becomes very large. Traditional clustering methods such as k-means consider all features at the same time to cluster the data and are only suitable for data with a small number of features.

Subspace clustering methods are widely applied when the number of features is very large. They try to group similar objects using a subset of features (i.e. a subspace) instead of all features. In subspace clustering, each cluster represents a set of objects clustered according to a subspace of features. The problem of subspace clustering is often divided into two sub-problems: determining the subspaces and clustering the data. Based on how these problems are addressed, there are two main categories of subspace clustering methods: hard subspace clustering and soft subspace clustering. In hard subspace clustering, a feature in a subspace is either present or not present (1 or 0), whereas in soft subspace clustering, a feature in a subspace is determined by its degree of presence (i.e. a weight between 0 and 1). A feature is considered relevant (i.e. present) in a subspace if its weight is high and irrelevant if its weight is low.

Fig. 1. Differences between subspace clustering approaches and our new approach

In text datasets, some features can be considered to be partially present in subspaces. Therefore, soft subspace clustering methods, which assign weights to features instead of determining the exact presence of features in a subspace, are becoming more popular in text clustering.

The most popular soft subspace clustering methods are FWKM [20], EWKM [19] and FGKM [9]. These methods use a modified version of k-means to cluster the data in different subspaces according to feature weights, and mainly differ in how they compute the feature weights. The main issue with these methods is that they ignore the semantic information of the documents, which might be helpful in improving the clustering process.

Latent Dirichlet Allocation (LDA) is a popular topic modeling method which can be used to extract semantic information from a collection of documents. LDA is based on a generative model, where a document is assumed to be generated from distributions of terms that form particular themes or topics. The main idea of our method is to treat the topics generated by the LDA model as subspaces, because each topic specifies a soft subset of related terms (features). The subspaces generated by LDA are used to initialize the clusters in our method.

We use the LDA model to compute the probability that a term is relevant in a subspace (a topic, i.e. a subset of terms). These probabilities capture semantic information and are used as term (feature) weights in our soft subspace clustering to improve the clustering process. Figure 1 shows the difference between existing clustering methods and our new method. Common soft subspace clustering methods initialize the feature weights randomly and randomly assign objects to clusters; the feature weights and clusters are then refined iteratively. In our method, we first use LDA to assign the feature weights and to assign objects to the initial clusters, and then iteratively refine the clusters according to the feature weights.

The main contribution of this paper is a new soft subspace clustering algorithm for documents using semantically weighted terms for different subspaces that are derived from the LDA model. The main novelty of the method is the development of a new weighted distance measure from the LDA probability matrices to compute the distances between the documents in different subspaces.

The paper is organized as follows: Sect. 2 discusses the related work; Sect. 3 describes our proposed method; Sect. 4 explains the experimental design; Sect. 5 presents the results along with a discussion; and Sect. 6 concludes the paper and outlines future directions.

2 Related Work

2.1 Hard Subspace Clustering

Hard subspace clustering methods divide the feature space into different subspaces where each feature is either present or absent in a subspace. They can be further categorized by their search approach, i.e. bottom-up or top-down. Examples of bottom-up hard subspace clustering methods are CLIQUE [3], ENCLUS [10], MAFIA [18] and FINDIT [29]. Examples of top-down hard subspace clustering methods are PROCLUS [1], ORCLUS [2] and \(\delta \)-Clusters [30]. Our method differs from these methods because it is a soft subspace clustering method.

2.2 Soft Subspace Clustering

In soft subspace clustering, each feature is assigned different weights for different subspaces; hence some proportion of a feature is present in every subspace. In the clustering process, the features with higher weights in a subspace contribute more to forming a cluster than the features with lower weights. Generally, soft subspace clustering methods employ a variable weighting scheme and iteratively update the feature weights during clustering.

Variable weighting schemes are widely applied in data mining [11–13, 21, 22]. Some of the variable weighting methods can be extended, especially k-means type variable weighting, to develop soft subspace clustering algorithms [7, 14–17, 20].

Recent approaches such as FWKM [20], EWKM [19] and FGKM [8, 9] use k-means type variable weighting algorithms and formulate data clustering as a minimization problem. FWKM uses a Lagrange multiplier and a polynomial weighting formula to compute the feature weights, and iteratively refines the clusters using the following objective function (a short computational sketch is given after the term definitions).

$$\begin{aligned} \min J(U, W, \varLambda ) = \sum _{i=1}^{k}\sum _{j=1}^{n}u_{ij}\sum _{t=1}^{m} \lambda _{it} [(\mu _{it} - d_{jt})^2 + \sigma ] \end{aligned}$$
(1)

where

  • u is a \(k\times n\) binary matrix representing the assignment of objects to clusters. \(u_{ij} = 1\) iff object j is in cluster i, \(u_{ij} = 0\) otherwise.

  • \(\lambda \) is a \(k\times m\) feature weight matrix. It represents k subspaces in rows and m features in columns. The value in a cell is the weight of the feature in its corresponding subspace and lies between 0 and 1. The weights of all features in a subspace sum to 1, i.e. \(\sum _{t=1}^{m} \lambda _{it} = 1, 1 \le i \le k, 0 < \lambda _{it} < 1\)

  • \(\mu \) is a \(k\times m\) matrix representing the mean value of a feature in a cluster.

  • \(d_{jt}\) represents feature t of the \(j^{th}\) object.

  • \(\sigma \) is the average spread (variance) of all the features in the dataset.
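To make Eq. 1 concrete, the following is a minimal numerical sketch, not the authors' implementation, of evaluating the FWKM objective with NumPy; the array names u, lam, mu, d and sigma mirror the definitions above, and the toy data are purely illustrative.

```python
import numpy as np

def fwkm_objective(u, lam, mu, d, sigma):
    """Eq. 1: sum over clusters i, objects j and features t of
    u[i, j] * lam[i, t] * ((mu[i, t] - d[j, t])**2 + sigma)."""
    sq = (mu[:, None, :] - d[None, :, :]) ** 2 + sigma      # shape (k, n, m)
    return np.sum(u[:, :, None] * lam[:, None, :] * sq)

# toy example: k = 2 clusters, n = 3 objects, m = 4 features
d = np.random.rand(3, 4)                             # data matrix (objects x features)
u = np.array([[1, 0, 1], [0, 1, 0]])                 # hard assignment of objects to clusters
lam = np.full((2, 4), 0.25)                          # uniform weights, each row sums to 1
mu = np.array([u_i @ d / u_i.sum() for u_i in u])    # per-cluster mean of each feature
sigma = d.var(axis=0).mean()                         # average spread of the features
print(fwkm_objective(u, lam, mu, d, sigma))
```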

EWKM clusters the data in a similar fashion but uses an exponential weighting formula to compute the feature weights. Its objective function is similar to Eq. 1, but instead of using \(\sigma \), it uses Shannon entropy to control the weights. FGKM takes a slightly different approach: it uses not only individual feature weights but also a feature group weighting scheme, in which features are combined into groups and weights are assigned to those groups.

The above soft subspace clustering methods ignore the semantic information of the documents in the clustering process. The main motivation of our work is to investigate the use of semantic information (e.g. topics) of documents in the soft subspace clustering process.

2.3 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) [6] extracts topics/themes, which carry semantic information, from documents. It is widely used in domains such as topic modeling [5] and Entity Resolution [4]. The topics generated by LDA can be considered as subspaces, and for each subspace LDA provides a way to compute term weights. Our soft subspace clustering method is related to FWKM and EWKM; however, our method uses an LDA based weighting scheme to exploit the semantic information of the documents.

LDA is a probabilistic model with an assumption that a document is a random mixture over latent topics and each topic is a distribution over terms. The two main parameters in this model are topic-document distributions \(\theta \) and topic-term distributions \(\phi \).

Fig. 2. A common LDA graphical model using plate notation.

Figure 2 shows the graphical model for LDA. Arrows represent conditional dependencies between variables, and plates (rectangles) represent repetition of the variable indicated in the corner of the plate. The shaded circle represents the observed variable, while unshaded circles represent unobserved variables. The hyperparameter \(\alpha \) is a prior on the topic distributions: a high value of \(\alpha \) favors topic distributions with many topics, while a low value (\(<\)1) favors topic distributions with a few topics. The hyperparameter \(\beta \) is a prior on the term distribution of every topic, which controls the number of times terms are sampled from a topic. The LDA model infers three latent variables, \(\theta \), \(\phi \) and z (topics), while observing t (terms) in a document set D.

In Fig. 2, the inner plate (around z and t) denotes the repeated sampling of topics and terms until \(N_d\) terms are generated for document d. The outer plate (surrounding \(\theta \)) denotes the sampling of a topic distribution for each document d in the document set D. The plate surrounding \(\phi \) denotes the sampling of a term distribution for each topic z until a total of Z topics are generated. More details of LDA can be found in [5].
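As an illustration only, the generative process implied by Fig. 2 can be sketched in a few lines of NumPy. The symmetric Dirichlet priors and the Poisson document lengths below are our own simplifying assumptions and are not part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
Z, m, D = 3, 50, 10           # number of topics, vocabulary size, number of documents
alpha, beta = 0.1, 0.01       # Dirichlet hyperparameters

phi = rng.dirichlet([beta] * m, size=Z)       # one term distribution per topic
docs = []
for _ in range(D):
    theta_d = rng.dirichlet([alpha] * Z)      # topic distribution for this document
    N_d = rng.poisson(30)                     # document length (illustrative choice)
    z = rng.choice(Z, size=N_d, p=theta_d)    # sample a topic for every token
    docs.append([rng.choice(m, p=phi[zi]) for zi in z])   # sample a term per token
```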

To the best of our knowledge, our work is the first attempt to apply LDA to assign weights and to use them in soft subspace clustering of text.

3 Our LDA Weighted K-Means Model

This section presents our new subspace clustering method, which builds on LDA for document clustering. Figure 3 shows the overall design of our method. The documents are pre-processed by stop-word filtering, low-frequency word filtering and WordNet lemmatization. We then use LDA with Gibbs sampling to generate two matrices: the topic-document matrix \(\theta \) and the topic-term matrix \(\phi \). \(\theta \) is used to initialize the clusters and \(\phi \) is used as feature weights for refining the clusters.

3.1 Gibbs Sampling

We implemented the LDA model in an unsupervised way (without training data) using the Gibbs sampling algorithm described in [24]. Gibbs sampling iteratively computes the conditional probability of assigning each occurrence of a term (token) to each topic. The common Gibbs sampling method provides estimates of the posterior distribution over z (topics) but does not directly provide \(\theta \) and \(\phi \). However, we can use the Gibbs sampling estimates of z to approximate \(\theta \) and \(\phi \).

For each token i (an occurrence of a term) in the document collection, let \(v_i\), \(d_i\) and \(z_i\) denote the term, the document and the topic of the token respectively. Gibbs sampling iteratively processes each token and estimates the conditional probability of assigning it to each topic, based on the topic assignments of all other tokens. The conditional distribution is formalized as:

$$\begin{aligned} Prb(z_i = r|\mathbf z _{-i},...) \end{aligned}$$
(2)

where \(z_i = r\) denotes the assignment of the \(i^{th}\) token to topic r and \(\mathbf z _{-i}\) denotes the topic assignments of all tokens excluding the \(i^{th}\) token. The other variables in Eq. 2, represented by (...), are \(v_{i}\), \(d_{i}\), \(\mathbf v _{-i}\), \(\mathbf d _{-i}\), \(\alpha \) and \(\beta \), where \(\mathbf v _{-i}\) represents the terms of all tokens except the \(i^{th}\) and \(\mathbf d _{-i}\) represents the documents of all tokens except the \(i^{th}\). Griffiths and Steyvers [24] provide a simple way to compute Eq. 2:

$$\begin{aligned} Prb(z_i = r|\mathbf z _{-i},...) \propto \frac{\mathcal {C}_{rv_i}^{(1)} + \beta }{\sum _{l=1}^{m}\mathcal {C}_{rl}^{(1)}+m\beta } \frac{\mathcal {C}_{rd_i}^{(2)} + \alpha }{\sum _{z=1}^{Z}\mathcal {C}_{zd_i}^{(2)}+Z\alpha } \end{aligned}$$
(3)

where \(\mathcal {C}^{(1)}\) and \(\mathcal {C}^{(2)}\) are \(Z\times m\) and \(Z\times D\) count matrices respectively, and Z, m and D are the numbers of topics, terms and documents respectively. The cells of these matrices hold the number of term/document tokens assigned to the corresponding topics. \(\mathcal {C}_{rv_i}^{(1)}\) denotes the number of times the term \(v_i\) is assigned to topic r excluding the \(i^{th}\) instance, and \(\mathcal {C}_{rd_i}^{(2)}\) denotes the number of times a term token in document \(d_i\) is assigned to topic r excluding the \(i^{th}\) instance.
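A minimal sketch of one collapsed Gibbs update following Eq. 3 is given below. The count matrices C1 (\(Z\times m\)) and C2 (\(Z\times D\)) correspond to \(\mathcal {C}^{(1)}\) and \(\mathcal {C}^{(2)}\); the function and variable names are ours, not from [24].

```python
import numpy as np

def gibbs_step(i, v, d, z, C1, C2, alpha, beta, rng):
    """Resample the topic of token i according to Eq. 3.
    v, d and z give the term, document and current topic of every token;
    C1 is the Z x m topic-term count matrix, C2 the Z x D topic-document count matrix."""
    Z, m = C1.shape
    # exclude the i-th token from the counts
    C1[z[i], v[i]] -= 1
    C2[z[i], d[i]] -= 1
    # Eq. 3: term part times document part, evaluated for every topic r
    p = ((C1[:, v[i]] + beta) / (C1.sum(axis=1) + m * beta)) * \
        ((C2[:, d[i]] + alpha) / (C2[:, d[i]].sum() + Z * alpha))
    z[i] = rng.choice(Z, p=p / p.sum())        # sample the new topic assignment
    # add the token back under its new topic
    C1[z[i], v[i]] += 1
    C2[z[i], d[i]] += 1
```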

3.2 Generating \(\theta \) and \(\phi \)

After applying the Gibbs sampling algorithm, we create two matrices: (1) the topic-term matrix \(\phi \) and (2) the topic-document matrix \(\theta \). These matrices are estimated from the two count matrices \(\mathcal {C}^{(1)}\) and \(\mathcal {C}^{(2)}\) following [24]:

$$\begin{aligned} \phi _{rt} = \frac{\mathcal {C}_{rt}^{(1)} + \beta }{\sum _{l=1}^{m} \mathcal {C}_{rl}^{(1)}+m\beta } \text {, } \theta _{rj} = \frac{\mathcal {C}_{rj}^{(2)} + \alpha }{\sum _{z=1}^{Z}\mathcal {C}_{zj}^{(2)}+Z\alpha } \end{aligned}$$
(4)

\(\phi _{rt}\) is the probability that term t is assigned to topic r and \(\theta _{rj}\) is the probability that document j is assigned to topic r.
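Equation 4 amounts to normalising the smoothed count matrices. The following is a minimal sketch of that step, with our own variable names:

```python
import numpy as np

def estimate_phi_theta(C1, C2, alpha, beta):
    """Eq. 4: C1 is the Z x m topic-term count matrix, C2 the Z x D topic-document
    count matrix; returns the topic-term matrix phi and topic-document matrix theta."""
    Z, m = C1.shape
    phi = (C1 + beta) / (C1.sum(axis=1, keepdims=True) + m * beta)       # each row sums to 1
    theta = (C2 + alpha) / (C2.sum(axis=0, keepdims=True) + Z * alpha)   # each column sums to 1
    return phi, theta
```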

The rows of the topic-document matrix \(\theta \) represent topics and the columns represent documents; each cell gives the probability that the document belongs to the corresponding topic. We use this matrix to form the initial clusters. Note that LDA thus naturally provides a simple way of clustering the documents; however, this clustering is not soft subspace clustering. In the following, we improve the clusters generated from LDA by utilizing this information to form a soft subspace clustering method.

In the LDA model, each term is a feature and each topic corresponds to a subspace. The topic-term matrix \(\phi \) can therefore be considered as a feature weight matrix for the different subspaces, where each feature (term) has a degree of presence in every subspace (topic). We use the values of \(\phi \) to determine the relevant subspaces and develop a new weighted distance measure that finds similar documents in the relevant subspaces.

Fig. 3. System diagram of our new method. \(\theta \) and \(\phi \) are the topic-document and topic-term matrices respectively.

3.3 Objective Function

We perform clustering by formulating it as a minimization problem: the objective is to minimize the sum of squared distances between documents and their nearest cluster centers, weighted in different subspaces. The objective function is similar to those of FWKM and EWKM (Eq. 1); however, we do not include \(\sigma \) or Shannon entropy, because the feature weights are already controlled by the two LDA hyperparameters \(\alpha \) and \(\beta \). Moreover, the objective function uses the previously computed LDA based feature weights instead of computing the feature weights iteratively.

Let D = {\(d_1, d_2, d_3, ..., d_n\)} be a set of n documents and let T = {\(t_1, t_2, t_3, ..., t_m\)} represent the m terms in the documents. The objective function for clustering the n documents into k clusters can then be defined as follows (a short computational sketch is given after the term definitions):

$$\begin{aligned} \sum _{i=1}^{k}\left( \sum _{j=1}^{n}\sum _{t=1}^{m} \delta _{ij} \phi _{it} (\mu _{it} - d_{jt})^2 \right) \end{aligned}$$
(5)

where

  • \(\delta \) is a \(k\times n\) binary matrix representing the assignment of documents to clusters. \(\delta _{ij} = 1\) iff document j is in cluster i, \(\delta _{ij} = 0\) otherwise.

  • \(\phi \) is the \(k\times m\) topic-term matrix generated from the LDA model. It represents k subspaces in rows and m terms in columns. The value in a cell is the weight of the term in its corresponding subspace and lies between 0 and 1. The weights of all terms in a subspace sum to 1, i.e. \(\sum _{t=1}^{m} \phi _{it} = 1, 1 \le i \le k, 0 < \phi _{it} < 1\)

  • \(\mu \) is a \(k\times m\) matrix representing the mean value of a term in a cluster. It is calculated as:

    $$\begin{aligned} \mu _{it} = \frac{\sum _{j=1}^{n}\delta _{ij}d_{jt}}{\sum _{j=1}^{n}\delta _{ij}} \end{aligned}$$
    (6)
  • \(d_{jt}\) represents a term t (a feature) of the \(j^{th}\) document, which is the term-frequency of the term in the document.
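As a concrete reading of Eqs. 5 and 6, the following sketch (our own illustration, assuming the documents are stored as an \(n\times m\) term-frequency matrix) computes the cluster means and the weighted objective:

```python
import numpy as np

def cluster_means(delta, d):
    """Eq. 6: mean term frequency of each term within each cluster."""
    # delta is k x n, d is n x m; each row of delta selects the documents of one cluster
    return (delta @ d) / delta.sum(axis=1, keepdims=True)

def dwkm_objective(delta, phi, d):
    """Eq. 5: phi-weighted within-cluster sum of squared distances."""
    mu = cluster_means(delta, d)                    # k x m cluster centres
    sq = (mu[:, None, :] - d[None, :, :]) ** 2      # k x n x m squared differences
    return np.sum(delta[:, :, None] * phi[:, None, :] * sq)
```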

We iteratively assign documents to their nearest cluster centers until the algorithm converges. We minimize the objective function by updating \(\delta \) using the following:

$$\begin{aligned} \delta = \left\{ \begin{array}{lr} \delta _{ij} = 1, \text { if } i = \mathop {{\text {arg}}\!\min }\nolimits _x dist(\mu _x, d_j)\\ \delta _{ij} = 0, \text { otherwise} \end{array}\right. \end{aligned}$$
(7)

where \(dist(\mu _x,d_j)\) is defined as

$$\begin{aligned} dist(\mu _x,d_j) = \sum _{t=1}^{m} \phi _{xt} (\mu _{xt} - d_{jt})^2 \end{aligned}$$
(8)

Equation 8 defines our distance measure. Unlike k-means, it computes the distance of a document from a cluster center using the LDA parameter \(\phi \), which provides semantics based feature weights for the different subspaces. A higher probability that a term is assigned to a topic indicates a higher degree of presence of that term in the corresponding subspace; therefore the squared difference between the document value and the cluster mean for that term contributes more to the distance. The use of LDA differentiates our method from other soft subspace clustering methods.
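The assignment step defined by Eqs. 7 and 8 can be sketched as follows; this is an illustration under the same array conventions as above, not the authors' code:

```python
import numpy as np

def weighted_distance(mu_x, d_j, phi_x):
    """Eq. 8: phi-weighted squared distance between centre mu_x and document d_j."""
    return np.sum(phi_x * (mu_x - d_j) ** 2)

def assign_documents(mu, d, phi):
    """Eq. 7: assign every document to the cluster with the nearest centre under Eq. 8."""
    k = mu.shape[0]
    n = d.shape[0]
    delta = np.zeros((k, n), dtype=int)
    for j in range(n):
        dists = [weighted_distance(mu[x], d[j], phi[x]) for x in range(k)]
        delta[int(np.argmin(dists)), j] = 1
    return delta
```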

3.4 Our Algorithm: DWKM

Our Dirichlet Weighted K-Means (DWKM) algorithm is a modified version of the k-means algorithm. The details are shown in Algorithm 1.

Algorithm 1. DWKM

Algorithm 1 takes two arguments, a document set and the number of clusters, and outputs the clustering solution. The algorithm performs a preprocessing step on the documents, which includes stop word removal, lemmatization and tokenization. It then randomly assigns all term tokens to Z topics and performs Gibbs sampling. Once the \(\phi \) and \(\theta \) matrices are generated, line 4 of the algorithm groups the documents into clusters according to their highest probability in \(\theta \). The algorithm then fine-tunes the clusters by repeating the update and assignment steps of Eqs. 6 and 7 until the convergence criterion is met: the loop terminates when no more documents are relocated to another cluster or the number of iterations exceeds a predefined limit.
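For completeness, the refinement part of Algorithm 1 (after LDA has produced \(\phi \) and \(\theta \)) can be sketched as below. This is a simplified reconstruction from the description in the text; the fallback for empty clusters is our own assumption and is not specified in the paper.

```python
import numpy as np

def dwkm_refine(d, phi, theta, max_iter=100):
    """Refine the initial LDA clusters (Algorithm 1, line 4 onwards).
    d is the n x m term-frequency matrix, phi the k x m topic-term matrix,
    theta the k x n topic-document matrix."""
    k, n = theta.shape
    labels = theta.argmax(axis=0)                  # initial clusters: highest topic probability
    for _ in range(max_iter):
        # Eq. 6: cluster means (fall back to the global mean for an empty cluster)
        mu = np.vstack([d[labels == i].mean(axis=0) if np.any(labels == i)
                        else d.mean(axis=0) for i in range(k)])
        # Eq. 8: phi-weighted squared distance of every document to every centre
        dist = np.einsum('km,knm->kn', phi, (mu[:, None, :] - d[None, :, :]) ** 2)
        new_labels = dist.argmin(axis=0)           # Eq. 7: nearest centre
        if np.array_equal(new_labels, labels):
            break                                  # no document relocated: converged
        labels = new_labels
    return labels
```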

4 Experimental Setup

Our experiments are designed based on two recent papers [9, 19]. Our method DWKM was evaluated on four synthetic and six real-world datasets, and compared with five clustering methods using different cluster quality measures. The four synthetic datasets were generated following the process described in [9] and the six real-world datasets were created as described in [19].

4.1 Datasets

The synthetic datasets SD1, SD2, SD3 and SD4 were generated according to [9]. Each consists of 6000 objects, 200 features, three subspaces and three clusters. The noise levels in SD1, SD2, SD3 and SD4 are 0, 0.2, 0 and 0.2 respectively, and the percentages of missing values are 0, 0, 0.12 and 0.12 respectively (as described in [9]). Detailed information on reproducing the synthetic datasets can be found in [9].

The six real-world datasets with two or more clusters from 20-Newsgroup are the same as those in [19]. Table 1 shows the details of these six datasets. The datasets D1, D2 and D3 are easier than D4, D5 and D6: D1 and D2 have semantically different clusters whereas D4 and D5 have semantically related clusters, and D3 and D6 have unbalanced clusters (as shown in Table 1).

Table 1. Six real world datasets created from 20-Newsgroup dataset

4.2 Evaluation Measures

To compare our method with others, we used two evaluation measures, Cluster Accuracy [23] and F-measure [19, 25–27], for the synthetic datasets, and three evaluation measures, F-measure, Normalized Mutual Information (NMI) [32] and Entropy [31], for the real-world datasets. These measures were chosen based on [19] and [9]. A lower Entropy value indicates a better clustering solution, whereas higher values of all other measures indicate better cluster quality.

The evaluation measures can be computed as follows:

$$\begin{aligned} Cluster Accuracy = \frac{\sum _{i=1}^k d_i}{n} \end{aligned}$$
(9)
$$\begin{aligned} \text {F-measure} = \sum _{i=1}^k \frac{n_i}{n} \cdot \max _{1 \le j \le k} \left\{ \frac{2 \cdot \frac{n_{ij}}{n_i} \cdot \frac{n_{ij}}{n_j}}{\frac{n_{ij}}{n_i} + \frac{n_{ij}}{n_j}} \right\} \end{aligned}$$
(10)
$$\begin{aligned} \text {NMI} = \frac{\sum _{i=1,j=1}^k n_{ij} \log \left( \frac{n \cdot n_{ij}}{n_i \cdot n_j} \right) }{\sqrt{(\sum _{i=1}^k n_i \log \frac{n_i}{n})(\sum _{j=1}^k n_j \log \frac{n_j}{n}) }} \end{aligned}$$
(11)
$$\begin{aligned} \text {Entropy} = \sum _{j=1}^k \frac{n_j}{n} \left( - \frac{1}{\log k} \sum _{i=1}^k \frac{n_{ij}}{n_j} \cdot \log \frac{n_{ij}}{n_j} \right) \end{aligned}$$
(12)

where \(d_i\) is the number of correctly identified documents in cluster i, k is the total number of clusters and n is the total number of documents in the dataset. \(n_i\) and \(n_j\) denote the number of documents in class i of the original dataset and in cluster j of our clustering solution respectively, and \(n_{ij}\) denotes the number of documents common to both class i and cluster j.
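For reference, Eqs. 10-12 can be computed directly from the class-cluster contingency matrix \(n_{ij}\). The sketch below is our own illustration of these formulas; Cluster Accuracy (Eq. 9) additionally requires a class-to-cluster matching and is omitted.

```python
import numpy as np

def evaluation_measures(nij):
    """F-measure (Eq. 10), NMI (Eq. 11) and Entropy (Eq. 12) from the contingency
    matrix nij, where nij[i, j] is the number of documents in class i and cluster j."""
    n = nij.sum()
    ni = nij.sum(axis=1)                 # documents per class
    nj = nij.sum(axis=0)                 # documents per cluster
    k = nij.shape[1]                     # number of clusters
    with np.errstate(divide='ignore', invalid='ignore'):
        recall = nij / ni[:, None]
        precision = nij / nj[None, :]
        f = np.nan_to_num(2 * recall * precision / (recall + precision))
        f_measure = np.sum(ni / n * f.max(axis=1))
        mi = np.nansum(nij * np.log(n * nij / np.outer(ni, nj)))
        nmi = mi / np.sqrt(np.sum(ni * np.log(ni / n)) * np.sum(nj * np.log(nj / n)))
        h = np.nan_to_num(-(nij / nj[None, :]) * np.log(nij / nj[None, :]))
        entropy = np.sum(nj / n * h.sum(axis=0) / np.log(k))
    return f_measure, nmi, entropy
```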

Table 2. Comparison of clustering methods on the synthetic datasets using Accuracy (AC) and F-measure (FM). The values on the left are the means of 100 runs and the values in parentheses are the standard deviations of 100 runs.

5 Results

We compared our method DWKM with k-means, LDA based simple clustering, FWKM [20], EWKM [19] and FGKM [9].

5.1 Comparison

K-means and the LDA based simple clustering algorithm were implemented in LingPipe. We provided the predefined number of clusters as a parameter to both algorithms. The simple LDA clustering algorithm uses the same initial steps as our method but without the cluster refinement step: the initial clusters are treated as final clusters and the loop that refines the clusters using feature weights is skipped. The LDA parameters are: number of topics = number of clusters in the ground truth, number of clusters = number of clusters in the ground truth, \(\alpha \) = 0.1 and \(\beta \) = 0.01. We tuned the parameters \(\alpha \) and \(\beta \) for the best performance. The FWKM, EWKM and FGKM clustering algorithms were implemented in Weka and we used the standard parameters described by their authors.

Table 3. A comparison of clustering methods in terms of F-measure, NMI and Entropy on six real-world datasets created from 20-Newsgroup dataset. The values listed in the table are the mean values of 100 runs of five clustering methods on six real-world datasets
Table 4. Percentage improvement of DWKM over FGKM in terms of Accuracy (AC) and F-measure (FM) on synthetic datasets
Table 5. Percentage improvement of DWKM over FGKM in terms of F-measure (FM), NMI and Entropy (EN) on real datasets

The performance of all six clustering algorithms on the synthetic datasets is shown in Table 2 and on the real-world datasets in Table 3.

Table 2 shows the comparison of clustering methods in terms of Accuracy and F-measure on the four synthetic datasets. The values in bold represent the best results. In general, DWKM performs better than the other clustering methods in terms of both Accuracy and F-measure on the synthetic datasets. The gap between DWKM and FGKM in Accuracy and F-measure is large on datasets SD1 and SD2, whereas the differences on SD3 and SD4 are relatively small. The LDA based simple clustering performed better than standard k-means, but worse than the soft subspace clustering algorithms.

Table 3 shows the mean F-measure, NMI and Entropy values of the clustering methods on the six real-world datasets. In general, DWKM performed better than the other clustering methods in terms of F-measure, NMI and Entropy. The D1 dataset is the easiest: k-means, EWKM, FGKM and DWKM all achieve an F-measure of 0.96 on D1, which means these methods produced equally good clustering solutions. However, considering the NMI and Entropy values along with the F-measure on D1, DWKM performed slightly better than the other clustering methods. The LDA based simple clustering followed the same trend as on the synthetic datasets: it performed better than standard k-means but worse than the soft subspace clustering algorithms.

It was also observed that DWKM performed well on data with different levels of difficulty (without noise, with noise, with balanced clusters and with unbalanced clusters). This shows that our semantic weighting of subspaces derived from LDA is reasonably effective for finding clusters in different types of data. Moreover, the LDA based simple clustering algorithm performed much better than k-means when the datasets had semantically related clusters (results on D4 and D5). It was also noted that the cluster refinement step based on the LDA feature weights boosted the performance of the clustering solution: without the refinement step, DWKM performed better than k-means but slightly worse than the other clustering methods.

Tables 4 and 5 provide percentage improvement of DWKM over FGKM on synthetic datasets and real datasets respectively. The results in all tables suggest that DWKM is a better clustering method. We further investigate the performance of all clustering methods by conducting a statistical analysis.

5.2 Statistical Analysis

We performed two types of statistical tests, (1) an unpaired t-test and (2) a paired Wilcoxon statistical significance test [28], with DWKM as the control group. The unpaired t-test was performed using the means and standard deviations of the evaluation measures listed in Table 2. In general, the results of the unpaired t-test show that DWKM achieved a statistically significant improvement over k-means, FWKM and EWKM on all synthetic datasets, with p-values less than 0.05. The p-values of the unpaired t-test against FGKM on the synthetic datasets SD1 and SD2 are also less than 0.05, indicating that DWKM achieves a statistically significant improvement over FGKM on SD1 and SD2. The performance of our method on the other synthetic datasets, SD3 and SD4, was comparable to FGKM.

For the six real-world datasets we used the paired Wilcoxon statistical significance test. The p-values for FGKM in terms of F-measure, NMI and Entropy were 0.0305, 0.0028 and 0.0228 respectively. In general, the p-values for all five compared clustering methods were less than 0.05, which suggests that DWKM achieves a significant improvement over the five clustering methods (Table 6).
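As an illustration of how such tests can be run, the snippet below uses SciPy; the numerical inputs are placeholders only (the actual means, standard deviations and per-dataset scores come from Tables 2 and 3).

```python
import numpy as np
from scipy import stats

# Unpaired t-test from summary statistics (mean, std, n = 100 runs per method),
# as reported in Table 2; the numbers below are placeholders, not the table values.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=0.90, std1=0.02, nobs1=100,    # DWKM (placeholder)
    mean2=0.85, std2=0.03, nobs2=100)    # FGKM (placeholder)

# Paired Wilcoxon test on per-dataset scores over the six real-world datasets;
# again, the two score vectors are placeholders.
dwkm_scores = np.array([0.96, 0.90, 0.88, 0.75, 0.72, 0.80])
fgkm_scores = np.array([0.94, 0.88, 0.85, 0.70, 0.69, 0.78])
w_stat, p_wilcoxon = stats.wilcoxon(dwkm_scores, fgkm_scores)
```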

Table 6. P-values of the unpaired t-test between DWKM and FGKM on the synthetic datasets

6 Conclusion

In this paper, we introduced a new soft subspace clustering method that uses the LDA model to weight the features in the subspaces for clustering documents. The LDA model was implemented using a standard Gibbs sampling algorithm and generates two matrices: topic-term and topic-document. We used the topic-term matrix to develop a new weighted distance measure in which topics are used as subspaces, and built a k-means based soft subspace clustering method on this distance measure. The algorithm is initialized using the topic-document matrix, where topics are considered as initial clusters.

Our new method, DWKM, was found to achieve a statistically significant improvement over recently developed soft subspace clustering methods on synthetic and real-world datasets.

Currently the method requires users to input the number of topics to initialize the LDA model. In future work we will address this by investigating non-parametric LDA models and will try to reduce the computational complexity of the overall method. Another direction for future work is to investigate the use of LDA to generate different candidate clustering solutions for clustering ensemble methods.