Abstract
The global pattern mining step in existing pattern-based hierarchical clustering algorithms may result in an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC first discovers locally promising patterns by allowing each instance to “vote” for its representative size-2 patterns in a way that balances local pattern frequency against pattern significance in the dataset. The cluster hierarchy (i.e., the global model) is then constructed directly, using these locally promising patterns as features. Each pattern forms an initial (possibly overlapping) cluster, and the rest of the cluster hierarchy is obtained through an iterative cluster refinement process. By exploiting instance-to-cluster relationships, this process directly identifies the clusters for each level of the hierarchy and efficiently prunes duplicate clusters. Furthermore, IDHC produces more descriptive cluster labels (patterns are not artificially restricted) and adopts a soft clustering scheme that allows instances to exist in suitable nodes at various levels of the cluster hierarchy. We present results of experiments performed on 16 standard text datasets, and show that IDHC outperforms state-of-the-art hierarchical clustering algorithms in terms of average entropy and FScore measures.
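The voting step and the formation of initial clusters described in the abstract can be illustrated with a minimal sketch. The abstract does not state the scoring formula, so everything specific here is an assumption: the local weight is taken as the product of the two term frequencies in a document, global significance is taken as the lift of the term pair, and the names `vote_for_patterns`, `initial_clusters`, and `top_k` are hypothetical. It is not the authors' implementation, and it omits the iterative refinement that builds the rest of the hierarchy.

```python
from collections import Counter, defaultdict
from itertools import combinations


def vote_for_patterns(docs, top_k=2):
    """Let each document vote for its top_k size-2 patterns (term pairs).

    Assumed scoring (not taken from the paper):
      local weight  = product of the two term frequencies in the document
      significance  = lift of the pair over the whole collection
    """
    n = len(docs)
    bags = [Counter(d) for d in docs]
    # Document frequencies of single terms and of term pairs.
    term_df = Counter(t for bag in bags for t in bag)
    pair_df = Counter(
        pair for bag in bags for pair in combinations(sorted(bag), 2)
    )

    selected = set()
    for bag in bags:
        scored = []
        for a, b in combinations(sorted(bag), 2):
            local = bag[a] * bag[b]
            lift = pair_df[(a, b)] * n / (term_df[a] * term_df[b])
            scored.append((local * lift, (a, b)))
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda s: (-s[0], s[1]))
        selected.update(pair for _, pair in scored[:top_k])
    return selected


def initial_clusters(docs, patterns):
    """Map each selected pattern to the set of documents containing it.

    A document may appear under several patterns, so clusters can overlap.
    """
    clusters = defaultdict(set)
    for i, d in enumerate(docs):
        terms = set(d)
        for a, b in patterns:
            if a in terms and b in terms:
                clusters[(a, b)].add(i)
    return dict(clusters)
```

Because every document contributes at most `top_k` patterns, the number of selected patterns in this sketch is bounded by `top_k` times the number of documents, which mirrors the abstract's point about avoiding an unpredictable number of globally mined patterns.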
Additional information
Responsible editor: Johannes Fürnkranz and Arno Knobbe.
About this article
Cite this article
Malik, H.H., Kender, J.R., Fradkin, D. et al. Hierarchical document clustering using local patterns. Data Min Knowl Disc 21, 153–185 (2010). https://doi.org/10.1007/s10618-010-0172-z
DOI: https://doi.org/10.1007/s10618-010-0172-z