Hierarchical document clustering using local patterns

Malik, Hassan H.; Kender, John R.; Fradkin, Dmitriy; Moerchen, Fabian

doi:10.1007/s10618-010-0172-z

Published: 03 April 2010

Data Mining and Knowledge Discovery, volume 21, pages 153–185 (2010)

Abstract

The global pattern mining step in existing pattern-based hierarchical clustering algorithms may result in an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC first discovers locally promising patterns by allowing each instance to “vote” for its representative size-2 patterns in a way that ensures an effective balance between local pattern frequency and pattern significance in the dataset. The cluster hierarchy (i.e., the global model) is then directly constructed using these locally promising patterns as features. Each pattern forms an initial (possibly overlapping) cluster, and the rest of the cluster hierarchy is obtained by following a unique iterative cluster refinement process. By effectively utilizing instance-to-cluster relationships, this process directly identifies clusters for each level in the hierarchy, and efficiently prunes duplicate clusters. Furthermore, IDHC produces cluster labels that are more descriptive (patterns are not artificially restricted), and adopts a soft clustering scheme that allows instances to exist in suitable nodes at various levels in the cluster hierarchy. We present results of experiments performed on 16 standard text datasets, and show that IDHC outperforms state-of-the-art hierarchical clustering algorithms in terms of average entropy and FScore measures.
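
The voting step and the initial clusters described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring formula, so the local weight (the smaller of the two in-document term counts), the global significance measure (lift), the parameter top_k, and the function names vote_for_patterns and initial_clusters are all assumptions chosen for illustration. Documents are assumed to be dictionaries mapping terms to in-document counts.

```python
from collections import Counter
from itertools import combinations


def vote_for_patterns(docs, top_k=2):
    """Let each document vote for its top_k size-2 patterns (term pairs)."""
    n = len(docs)
    # Document frequency of single terms and of term pairs across the dataset.
    term_df = Counter(t for d in docs for t in d)
    pair_df = Counter(p for d in docs for p in combinations(sorted(d), 2))

    def significance(pair):
        # Assumed global measure: lift = P(a, b) / (P(a) * P(b)).
        a, b = pair
        return (pair_df[pair] * n) / (term_df[a] * term_df[b])

    selected = set()
    for d in docs:
        # Assumed local weight: the smaller of the two in-document term counts.
        scored = {
            p: min(d[p[0]], d[p[1]]) * significance(p)
            for p in combinations(sorted(d), 2)
        }
        # Highest score first; ties broken by the pattern itself.
        ranked = sorted(scored, key=lambda p: (-scored[p], p))
        selected.update(ranked[:top_k])
    return selected


def initial_clusters(docs, patterns):
    """Map each selected pattern to the set of documents containing both terms."""
    return {
        p: {i for i, d in enumerate(docs) if p[0] in d and p[1] in d}
        for p in patterns
    }
```

Because a document can contain several selected patterns, the resulting clusters may overlap, which matches the "possibly overlapping" initial clusters mentioned above. The iterative refinement that builds the rest of the hierarchy is not sketched here, since the abstract does not describe it in enough detail.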

Author information

Authors and Affiliations

Thomson Reuters, 195 Broadway, New York, NY, 10007, USA
Hassan H. Malik
Columbia University, 1214 Amsterdam, MC 0401, New York, NY, 10027, USA
John R. Kender
Siemens Corporate Research, 755 College Road East, Princeton, NJ, 08540, USA
Dmitriy Fradkin & Fabian Moerchen

Corresponding author

Correspondence to Hassan H. Malik.

Additional information

Responsible editors: Johannes Fürnkranz and Arno Knobbe.

About this article

Cite this article

Malik, H.H., Kender, J.R., Fradkin, D. et al. Hierarchical document clustering using local patterns. Data Min Knowl Disc 21, 153–185 (2010). https://doi.org/10.1007/s10618-010-0172-z

Received: 21 March 2009
Accepted: 06 March 2010
Published: 03 April 2010
Issue Date: July 2010
DOI: https://doi.org/10.1007/s10618-010-0172-z
