
1 Introduction

A document clustering algorithm finds groups of documents that share a common pattern [15]. It is an unsupervised technique: clusters are discovered automatically in a collection without any user supervision. The main goal of clustering is to find meaningful groups so that analyzing the documents within each cluster is much easier than viewing the collection as a whole.

The Vector Space Model (VSM) represents a document as a vector over the T unique terms of a collection (a T-dimensional vector). Each term in the vector is associated with a weight (term weight) [6, 7]. The term weight is based on the frequency with which the term appears in that document. A term weighting scheme measures the importance of a term with respect to a document and a collection; a term with a higher weight is more important than a term with a lower weight. Each document can thus be located as a point in a T-dimensional Euclidean space, where T is the number of unique terms in the collection. With documents represented as points in this space, we can compare any two documents by measuring the distance between them. In the same way, a user-supplied query can be represented as a vector and mapped into the same space. To find a set of documents relevant to a query, we can retrieve the documents that are closest to the query in this space. A document ranking model computes the similarities between documents and a query: the more relevant a document is to the query, the higher its rank. VSM and term weighting schemes are widely used in many research areas such as document clustering, classification, information retrieval and document ranking.
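As an illustration of query-document matching in the VSM, the following is a minimal Python sketch (not from the original work; the term weights are made up) that compares two sparse term-weight vectors using cosine similarity, one common choice of closeness measure in the vector space.

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Cosine similarity between two sparse term-weight vectors (dicts of term -> weight)."""
    shared = set(doc_vec) & set(query_vec)
    dot = sum(doc_vec[t] * query_vec[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy example with made-up term weights for one document and one query.
doc = {"cluster": 0.8, "document": 0.5, "retrieval": 0.2}
query = {"document": 1.0, "retrieval": 1.0}
print(cosine_similarity(doc, query))
```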

2 Cluster-Based Retrieval

Cluster-based retrieval uses the cluster hypothesis to retrieve a set of documents relevant to a query [8]. Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs [9]. If a document in a cluster is relevant to a query, then the rest of the documents in that cluster are potentially relevant to the same query. There are two approaches to cluster-based retrieval. The first approach retrieves one or more clusters relevant to a query instead of retrieving individual documents. In other words, this approach retrieves and ranks relevant clusters instead of relevant documents. Based on the cluster hypothesis, documents from highly ranked clusters are more relevant to a query than documents from clusters with lower ranking. The main motivation of this approach is higher efficiency and faster search.

The second approach uses the generated clusters as a reference for improving the list of retrieved relevant documents. In this approach, the document collection is clustered beforehand (static clustering). When a set of documents is retrieved for a query, the generated (static) clusters of the collection are used as a reference to update the retrieved relevant document list. The main goal of this approach is to improve precision-oriented searches.
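The first, cluster-ranking approach can be sketched as follows. This is an illustrative Python fragment rather than the implementation used here; it assumes each cluster is summarized by a centroid vector and that a similarity function, such as the cosine similarity sketched above, is supplied.

```python
def rank_clusters(clusters, query_vec, similarity):
    """Rank clusters by the similarity of their centroid to the query.

    clusters: list of (centroid_vec, [doc_ids]) pairs; similarity: e.g. cosine.
    """
    scored = [(similarity(centroid, query_vec), doc_ids)
              for centroid, doc_ids in clusters]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

def cluster_based_retrieval(clusters, query_vec, similarity, top_clusters=2):
    """Return documents drawn only from the highest-ranked clusters."""
    ranked = rank_clusters(clusters, query_vec, similarity)
    results = []
    for _, doc_ids in ranked[:top_clusters]:
        results.extend(doc_ids)
    return results
```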

3 Term Weighting Scheme

The Boolean retrieval model assigns 1 or 0 based on the presence or absence of a term in a document. This model performs poorly when querying for documents because it cannot rank the matching documents. Later, the VSM was introduced for ranked retrieval [10]. It is widely used in document querying, clustering, classification and other information retrieval applications because it is simple and easy to understand. It uses a bag-of-words approach: each document \( {\text{d}}_{\text{i}} \) in the collection ζ is represented as a vector of weighted terms [11, 12].

One of the most commonly used term weighting schemes, TF-IDF, assigns a weight to each term using the term frequency (tf) and the inverse document frequency (idf). The term frequency of \( {\text{term}}_{\text{t}} \) is the number of times \( {\text{term}}_{\text{t}} \) occurs in a document. The inverse document frequency of \( {\text{term}}_{\text{t}} \) is the logarithm of the total number of documents in the collection (N) divided by the number of documents that contain \( {\text{term}}_{\text{t}} \). The document vector \( {\text{d}}_{\text{i}} \) is then represented as:

$$ {\text{d}}_{\text{i}} = \left\{ {\text{term}}_{1} :{\text{tf}}_{{\text{i}}1} \cdot \log \frac{\text{N}}{{{\text{N}}_{1} }};\;{\text{term}}_{2} :{\text{tf}}_{{\text{i}}2} \cdot \log \frac{\text{N}}{{{\text{N}}_{2} }};\; \ldots ;\;{\text{term}}_{\text{T}} :{\text{tf}}_{\text{iT}} \cdot \log \frac{\text{N}}{{{\text{N}}_{\text{T}} }} \right\} $$

The term weight \( {\text{w}}_{\text{it}} \) determines whether the \( {\text{term}}_{\text{t}} \) will be included in the further steps. Only certain terms extracted from a document can be used for identifying and scoring a document within the collection. The term weighting schemes are used to assign weight to each term in a document. These term weights represent the importance of a term with respect to a collection. Document clustering uses these term weights to compare two documents for their similarity.
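For concreteness, the following small Python sketch (illustrative only, not from the original work) builds the TF-IDF document vectors of the equation above from tokenized documents, using \( {\text{w}}_{\text{it}} = {\text{tf}}_{\text{it}} \cdot \log ({\text{N}}/{\text{N}}_{\text{t}}) \).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for tokenized documents: w_it = tf_it * log(N / N_t)."""
    N = len(docs)
    doc_freq = Counter()          # N_t: number of documents containing term t
    for doc in docs:
        doc_freq.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(N / doc_freq[t]) for t in tf})
    return vectors

# Toy collection of three tokenized documents.
docs = [["cluster", "document", "cluster"],
        ["document", "retrieval"],
        ["cluster", "retrieval", "query"]]
print(tfidf_vectors(docs))
```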

Table 1 shows some of the commonly used term weighting schemes. Here, TF is the Term Frequency, IDF is the Inverse Document Frequency and ICF is the Inverse Cluster Frequency.

Table 1 Term weighting schemes
  • \( {\text{w}}_{\text{itj}} \) is the weight of the term \( {\text{term}}_{\text{t}} \) in the document \( {\text{d}}_{\text{i}} \) of the cluster \( {\text{C}}_{\text{j}} \).

  • \( {\text{tf}}_{\text{it}} = {\text{f}}_{\text{it}} \) is the term frequency of the term \( {\text{term}}_{\text{t}} \) in the document \( {\text{d}}_{\text{i}} \).

  • \( {\text{idf}}_{\text{t}} = \log \frac{\text{N}}{{{\text{N}}_{\text{t}} }} \) is the inverse document frequency for the term \( {\text{term}}_{\text{t}} \) in the collection

where N is the total number of documents in the collection and \( {\text{N}}_{\text{t}} \) is the number of documents that contain the term \( {\text{term}}_{\text{t}} \).

  • \( {\text{icf}}_{\text{t}} = \log \frac{\text{K}}{{{\text{K}}_{\text{t}} }} \) is the inverse cluster frequency of the term \( {\text{term}}_{\text{t}} \) in the collection ζ, where K is the total number of clusters in the collection and \( {\text{K}}_{\text{t}} \) is the number of clusters that contain the term \( {\text{term}}_{\text{t}} \) (see the sketch below).
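A minimal sketch of how \( {\text{idf}}_{\text{t}} \) and \( {\text{icf}}_{\text{t}} \) can be computed is given below; the function names are illustrative, and documents and clusters are assumed to be represented as token lists.

```python
import math

def inverse_document_frequency(docs, term):
    """idf_t = log(N / N_t), where N_t is the number of documents containing the term."""
    N = len(docs)
    N_t = sum(1 for doc in docs if term in doc)
    return math.log(N / N_t) if N_t else 0.0

def inverse_cluster_frequency(clusters, term):
    """icf_t = log(K / K_t), where K_t is the number of clusters containing the term."""
    K = len(clusters)
    K_t = sum(1 for cluster in clusters if any(term in doc for doc in cluster))
    return math.log(K / K_t) if K_t else 0.0
```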

We present a new term weighting method based on the traditional TF-IDF term weighting scheme. Our motivation is the idea that the terms of the documents in the same cluster have similar importance compared to the terms of the documents in a different cluster. We concentrate on the terms that are important within a cluster and treat the other terms as irrelevant or redundant. We realize this idea by giving more weight to terms that are common within a cluster but uncommon in other clusters.

In our new term weighting scheme, we used unsupervised partitional (K-means) clustering algorithms [2, 13, 14]. First, we ran the K-means algorithm with the four term weighting schemes, to show that the CBT term weighting scheme improves the quality of the clusters generated by a partitional clustering algorithm.

From our experiments, we found that the new cluster-based term weighting scheme gives better results than the other well-known term weighting schemes traditionally used, for both partitional and hierarchical clustering algorithms.

3.1 The Proposed Term Weighting Scheme

We introduce our new term weighting scheme. For the term \( {\text{term}}_{\text{t}} \), document \( {\text{d}}_{\text{i}} \) and cluster \( {\text{C}}_{\text{j}} \), CBT is given as:

$$ \begin{aligned} {\text{w}}_{\text{itj}} &= {\text{tf}}_{\text{it}} \cdot {\text{idf}}_{\text{t}} \cdot {\text{df}}_{\text{tj}} \cdot {\text{icf}}_{\text{t}} \\ &= {\text{f}}_{\text{it}} \cdot \log \frac{\text{N}}{{{\text{N}}_{\text{t}} }} \cdot \frac{{{\text{df}}_{\text{j}} }}{{|{\text{C}}_{\text{j}} |}} \cdot \log \frac{\text{K}}{{{\text{K}}_{\text{t}} }} \\ \end{aligned} $$

Here, \( {\text{df}}_{\text{tj}} = \frac{{{\text{df}}_{\text{j}} }}{{|{\text{C}}_{\text{j}} |}} \) is the document frequency of the term \( {\text{term}}_{\text{t}} \) within the cluster \( {\text{C}}_{\text{j}} \), where \( {\text{df}}_{\text{j}} \) is the number of documents in the cluster \( {\text{C}}_{\text{j}} \) that contain the term \( {\text{term}}_{\text{t}} \), and \( |{\text{C}}_{\text{j}} | \) is the total number of documents in the cluster \( {\text{C}}_{\text{j}} \).
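Putting the four factors together, a small Python sketch of the CBT weight follows. It is illustrative only: it recomputes the collection-level and cluster-level statistics on the fly and assumes documents are token lists.

```python
import math

def cbt_weight(term, doc, cluster, docs, clusters):
    """CBT weight: w_itj = tf_it * idf_t * (df_j / |C_j|) * icf_t.

    doc and the members of cluster are token lists; docs is the whole
    collection; clusters is a list of clusters (each a list of token lists).
    """
    tf_it = doc.count(term)                                   # term frequency in d_i
    N, N_t = len(docs), sum(1 for d in docs if term in d)     # collection-level counts
    K, K_t = len(clusters), sum(1 for c in clusters if any(term in d for d in c))
    df_j = sum(1 for d in cluster if term in d)               # docs in C_j containing the term
    idf_t = math.log(N / N_t) if N_t else 0.0
    icf_t = math.log(K / K_t) if K_t else 0.0
    df_tj = df_j / len(cluster) if cluster else 0.0
    return tf_it * idf_t * df_tj * icf_t
```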

Our new term weighting scheme has four components. The first two are the term weighting components discussed in [15]; the last two are the cluster components, as shown in Table 2. In other words, CBT assigns a weight to a term that is:

Table 2 List of components in CBT term weighting scheme
  • Highest when the term occurs frequently in the documents of a cluster and is uncommon in other clusters.

  • Higher when the term occurs less frequently in the documents of a cluster and is uncommon in other clusters.

  • Lower when the term occurs often in a few clusters.

  • Lowest when the term occurs in most of the documents in the collection.

4 K-Means Algorithm with CBT

Initially, the K-means algorithm has no information about the cluster components, so we start the algorithm by setting \( {\text{df}}_{\text{tj}} \) and \( {\text{icf}}_{\text{t}} \) to 1 and update the inter- and intra-cluster components on each iteration. If a document contains terms that do not belong to its current cluster, their weights are reduced, so the document tends to move to another cluster on a later iteration. This is repeated until the document settles in a suitable cluster. The pseudocode is given below, and a Python sketch of the same loop follows it.

  • Require: An integer K ≥ 1, Document Collection \( \upzeta \)

  • 1: if K = 1 then

  • 2:  return ζ

  • 3: else

  • 4:  Initialize l = 0

  • 5:  \( \{ {\text{C}}_{1}^{\left( 0 \right)} , \ldots ,{\text{C}}_{\text{K}}^{\left( 0 \right)} \} \leftarrow {\text{RANDOM}}\;{\text{CLUSTERS}}(\upzeta,{\text{K}}) \)

  • 6:  repeat

  • 7:   for all \( {\text{d}}_{\text{i}} \in \upzeta ,{\text{i}}:1 \ldots {\text{N}} \) do

  • 8:    \( {\text{m}} = \arg \min_{\text{j}} |{\text{c}}_{\text{j}} - {\text{d}}_{\text{i}} | \)

  • 9:    \( {\text{C}}_{\text{m}}^{{\left( {{\text{l}} + 1} \right)}} \leftarrow {\text{C}}_{\text{m}}^{{\left( {{\text{l}} + 1} \right)}} \cup \{ {\text{d}}_{\text{i}} \} \)

  • 10:   end for

  • 11:   l ← l + 1

  • 12:   \( {\text{w}}_{\text{itj}} = {\text{tf}}_{\text{it}} \cdot {\text{idf}}_{\text{t}} \cdot {\text{df}}_{\text{tj}} \cdot {\text{icf}}_{\text{t}} \) for each term \( {\text{term}}_{\text{t}} \) in a document \( {\text{d}}_{\text{i}} \) of a cluster \( {\text{C}}_{\text{j}}^{{({\text{l}})}} ,\;{\text{t}}:1 \ldots {\text{T}},\;{\text{i}}:1 \ldots {\text{N}},\;{\text{j}}:1 \ldots {\text{K}} \)

  • 13:  for j = 1 to K do

  • 14:   \( {\text{c}}_{\text{j}} \leftarrow \frac{1}{{|{\text{C}}_{\text{j}}^{{({\text{l}})}} |}}\mathop \sum \limits_{{{\text{d}}_{\text{i}} \in {\text{C}}_{\text{j}}^{{({\text{l}})}} }} {\text{d}}_{\text{i}} \)

  • 15:  end for

  • 16: until No change in the K centroids

  • 17: return \( \left\{ {{\text{C}}_{1}^{{\left( {\text{l}} \right)}} , \ldots ,{\text{C}}_{\text{K}}^{{\left( {\text{l}} \right)}} } \right\} \)

  • 18: end if
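A compact Python sketch of this loop is given below. It is a simplified illustration rather than the exact implementation: documents are token lists, distances are Euclidean over sparse weight dictionaries, convergence is checked on the assignments, and the first pass uses plain TF-IDF by fixing \( {\text{df}}_{\text{tj}} \) and \( {\text{icf}}_{\text{t}} \) to 1, as described above. All function names are illustrative.

```python
import math
import random
from collections import Counter, defaultdict

def weight_vectors(docs, assignment, K, use_cluster_components):
    """Re-weight every document with w_itj = tf * idf * df_tj * icf_t.

    docs: list of token lists; assignment: cluster index per document.
    On the first pass the cluster components are fixed to 1 (plain TF-IDF).
    """
    N = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    cluster_docs = defaultdict(list)                 # members of each cluster
    for i, j in enumerate(assignment):
        cluster_docs[j].append(docs[i])
    cluster_term_df = {j: Counter() for j in cluster_docs}
    for j, members in cluster_docs.items():          # per-cluster document frequencies
        for doc in members:
            cluster_term_df[j].update(set(doc))
    vectors = []
    for i, doc in enumerate(docs):
        j, tf, vec = assignment[i], Counter(doc), {}
        for t, f in tf.items():
            idf = math.log(N / doc_freq[t])
            if use_cluster_components:
                df_tj = cluster_term_df[j][t] / len(cluster_docs[j])
                K_t = sum(1 for c in cluster_term_df.values() if c[t] > 0)
                icf = math.log(K / K_t) if K_t else 0.0
            else:
                df_tj, icf = 1.0, 1.0                # initial pass: plain TF-IDF
            vec[t] = f * idf * df_tj * icf
        vectors.append(vec)
    return vectors

def euclidean(u, v):
    """Euclidean distance between two sparse weight vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def centroid(vectors):
    """Mean of a list of sparse weight vectors."""
    c, n = defaultdict(float), len(vectors)
    for v in vectors:
        for t, w in v.items():
            c[t] += w / n
    return dict(c)

def kmeans_cbt(docs, K, max_iter=20, seed=0):
    """K-means driven by CBT weights: re-weight the documents after every reassignment."""
    random.seed(seed)
    assignment = [random.randrange(K) for _ in docs]     # random initial clusters
    vectors = weight_vectors(docs, assignment, K, use_cluster_components=False)
    for _ in range(max_iter):
        centroids = []
        for jj in range(K):
            members = [vectors[i] for i, j in enumerate(assignment) if j == jj]
            centroids.append(centroid(members) if members else {})
        new_assignment = [min(range(K), key=lambda jj: euclidean(centroids[jj], v))
                          for v in vectors]
        if new_assignment == assignment:                 # centroids stable
            break
        assignment = new_assignment
        vectors = weight_vectors(docs, assignment, K, use_cluster_components=True)
    return assignment
```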

4.1 Data Collections for CBT

We used the TREC [16], 20 Newsgroups and Reuters-21578 [17] data collections for our experiments. The TR11, TR12, TR23, TR31 and TR45 collections are taken from TREC-5, TREC-6 and TREC-7. 20 NG S1–S5 are five randomly chosen subsets of the 20 Newsgroups documents [18]. The RE S1 and RE S2 data sets are from the Reuters-21578 collection: RE S1 consists of the 4645 documents that belong to only one category, and RE S2 is the Reuters transcribed subset [19]. For all the data sets shown in Table 3, we removed the stop words and stemmed the remaining terms using the Porter stemming algorithm [20].
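The preprocessing step can be sketched as follows. Using NLTK for the stop-word list and the Porter stemmer is our assumption here; the text specifies only stop-word removal and Porter stemming [20].

```python
# Minimal preprocessing sketch; NLTK is an assumption, not prescribed by the text.
import re
from nltk.corpus import stopwords        # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and Porter-stem the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]
```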

Table 3 Data sets

5 Conclusion

We evaluated the importance of using the inter- and intra-cluster components in the term weights using the average entropy measure. Since the K-means clustering algorithm is unstable and sensitive to the initial centroids, we ran the algorithm 10 times, with a different random seed for the initial centroids on each run. We repeated this experiment for the four term weighting schemes on the data collections listed in Table 3.

We calculated the entropy for each term weighting scheme, as given in Eq. (2.13), for each run after the algorithm converged, and then computed the average of the entropies over the runs. Similarly, we computed the average F-measure and average purity. Table 4 shows the average entropy calculated for each data set. Table 5 shows the average entropy, average F-measure and average purity for the TF-IDF and CBT term weighting schemes with the K-means clustering algorithm. Both experiments show that the K-means clustering algorithm with the CBT term weighting scheme gives better results than the other term weighting schemes on each data set. According to the cluster-based term weighting scheme, a term is important to a cluster if it is unique to that cluster and occurs frequently within the documents of that cluster. The inter- and intra-cluster components try to identify these important terms by analyzing the term frequency distribution at three levels: document, cluster and collection. Our experimental results show that adding these cluster components to the term weighting scheme significantly improves the results on each data set. We believe that some of the deviations in the results are due to the clustering algorithms' inability to handle noise in the data collection.

Table 4 K-means clustering algorithm—avg. entropy measured for the norm TF, CBT, TF-IDF-ICF and TF-IDF term weighting schemes
Table 5 Bisecting K-means clustering algorithm—avg. entropy, avg. F-measure and avg. purity measured for the TF-IDF and CBT term weighting schemes
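For reference, the average entropy and purity reported above can be computed as in the following sketch, which follows the standard definitions (the exact form of Eq. (2.13) is defined earlier in the work and is not reproduced here); each cluster is represented by the gold category labels of its documents.

```python
import math
from collections import Counter

def cluster_entropy(clusters):
    """Weighted average entropy over clusters; each cluster is a list of class labels."""
    total = sum(len(c) for c in clusters)
    entropy = 0.0
    for c in clusters:
        counts = Counter(c)
        h = -sum((n / len(c)) * math.log(n / len(c), 2) for n in counts.values())
        entropy += (len(c) / total) * h
    return entropy

def cluster_purity(clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

# Toy example: two clusters with gold category labels per document.
clusters = [["sports", "sports", "politics"], ["politics", "politics"]]
print(cluster_entropy(clusters), cluster_purity(clusters))
```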