HE-Tree: a framework for detecting changes in clustering structure for categorical data streams

Chen, Keke; Liu, Ling

doi:10.1007/s00778-009-0134-5

HE-Tree: a framework for detecting changes in clustering structure for categorical data streams

Regular Paper
Published: 28 February 2009

Volume 18, pages 1241–1260, (2009)
Cite this article

Download PDF

Access provided by CONRICYT-eBooks

The VLDB Journal Aims and scope Submit manuscript

HE-Tree: a framework for detecting changes in clustering structure for categorical data streams

Download PDF

Keke Chen¹ &
Ling Liu²

196 Accesses
21 Citations
3 Altmetric
Explore all metrics

Abstract

Analyzing clustering structures in data streams can provide critical information for real-time decision making. Most research in this area has focused on clustering algorithms for numerical data streams, and very few have proposed to monitor the change of clustering structure. Most surprisingly, to our knowledge, no work has been proposed on monitoring clustering structure for categorical data streams. In this paper, we present a framework for detecting the change of primary clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework uses a Hierarchical Entropy Tree structure (HE-Tree) to capture the entropy characteristics of clusters in a data stream, and detects the change of Best K by combining our previously developed BKPlot method. The HE-Tree can efficiently summarize the entropy property of a categorical data stream and allow us to draw precise clustering information from the data stream for generating high-quality BKPlots. We also develop the time-decaying HE-Tree structure to make the monitoring more sensitive to recent changes of clustering structure. The experimental result shows that with the combination of the HE-Tree and the BKPlot method we are able to promptly and precisely detect the change of clustering structure in categorical data streams.

References

Aggarwal, C.C.: On change diagnosis in evolving data streams. IEEE Trans. Knowl. Data Eng. 17, 5 (2005)
Article Google Scholar
Aggarwal, C.C., Han, J., Wang, J., Yu, P.: A framework for projected clustering of high dimensional data streams. In: Proceedings of Very Large Databases Conference (VLDB) (2004)
Andritsos, P., Tsaparas, P., Miller, R.J., Sevcik, K.C.: Limbo: scalable clustering of categorical data. In: Proceedings of Intternational Conference on Extending Database Technology (EDBT) (2004)
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of ACM Conference on Principles of Database Systems (PODS) (2002)
Barbara, D., Li, Y., Couto, J.: Coolcat: an entropy-based algorithm for categorical clustering. In: Proceedings of ACM Conference on Information and Knowledge Management (CIKM) (2002)
Ben-David, S., Gehrke, J., Kifer, D.: Detecting change in data stream. In: Proceedings of Very Large Databases Conference (VLDB) (2004)
Bock, H.: Probabilistic aspects in cluster analysis. In: Conceptual and Numerical Analysis of Data. Springer, Berlin (1989)
Brand, M.: An entropic estimator for structure discovery. In: Proceedings Of Neural Information Processing Systems (NIPS), pp. 723–729 (1998)
Celeux, G., Govaert, G.: Clustering criteria for discrete data and latent class models. J. Classif. (1991)
Chakrabarti, A., Cormode, G., McGregor, A.: A near-optimal algorithm for computing the entropy of a stream. In: Annual ACM-SIAM Symposium on Discrete Algorithms (2007)
Chakrabarti, D., Papadimitriou, S., Modha, D.S., Faloutsos, C.: Fully automatic cross-associations. In: Proceedings of ACM SIGKDD Conference (2004)
Chen K., Liu L.: VISTA: Validating and refining clusters via visualization. Inf. Vis. 3 4, 257–270 (2004)
Article Google Scholar
Chen, K., Liu, L.: The “best k” for entropy-based categorical clustering. In: Proceedings of International Conference on Scientific and Statistical Database Management (SSDBM), pp.253–262 (2005)
Chen, K., Liu, L.: Detecting the change of clustering structure in categorical data streams. In: SIAM Data Mining Conference (2006)
Chen, K., Liu, L.: iVIBRATE: Interactive visualization based framework for clustering large datasets. ACM Trans. Inf. Syst. 24, 2 (2006)
Google Scholar
Chen, Y., Dong, G., Han, J., Wah, B.W., Wang, J.: Multidimensional regression analysis of time-series data streams. In : Proceedings of Very Large Databases Conference (VLDB) (2002)
Cheng, C.H., Fu, A. W.-C., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proceedings of ACM SIGKDD Conference (1999)
Cover T., Thomas J.: Elements of Information Theory. Wiley, NY (1991)
Book MATH Google Scholar
Das, A., Gehrke, J., Riedewald, M.: Approximate join processing over data streams. In: Proceedings of ACM SIGMOD Conference (2003)
Dhillon, I.S., Mellela, S., Modha, D.S.: Information-theoretic co-clustering. In: Proceedings of ACM SIGKDD Conference (2003)
Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of ACM SIGKDD Conference, pp. 71–80 (2000)
Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS-clustering categorical data using summaries. In: Proceedings of ACM SIGKDD Conference (1999)
Ganti V., Gehrke J., Ramakrishnan R.: Demon: Mining and monitoring evolving data. IEEE Trans. Knowl. Data Eng. 13, 1 (2001)
Article Google Scholar
Ganti V., Gehrke J., Ramakrishnan R., Loh W.: A framework for measuring differences in data characteristics. J. Comput. Syst. Sci. 64, 3 (2002)
MathSciNet Google Scholar
Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.-Y.: BOAT— optimistic decision tree construction. In: Proceedings of ACM SIGMOD Conference, pp. 169–180 (1999)
Gibson, D., Kleinberg, J., Raghavan, P.: Clustering categorical data: An approach based on dynamical systems. In: Proceedings of Very Large Databases Conference (VLDB), pp. 222–236 (2000)
Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng. 15 (2003)
Guha S., Rastogi R., Shim K.: ROCK: A robust clustering algorithm for categorical attributes. Inf. Syst. 25 5, 345–366 (2000)
Article Google Scholar
Halkidi M., Batistakis Y., Vazirgiannis M.: Cluster validity methods: Part I and II. SIGMOD Record 31 2, 40–45 (2002)
Article Google Scholar
Huang, Z.: A fast clustering algorithm to cluster very large categorical data sets in data mining. In: Workshop on Research Issues on Data Mining and Knowledge Discovery (1997)
Jain A., Murty M., , : Data clustering: A review. ACM Comput. Surv. 31, 264–323 (1999)
Article Google Scholar
Jain A.K., Dubes R.C.: Algorithms for Clustering Data. Prentice hall, New York (1988)
MATH Google Scholar
Keogh, E., Lin, J., Truppel, W.: Clustering of time series subsequences is meaningless: Implications for previous and future research. In: Proceedings of International Conference on Data Mining (ICDM) (2003)
Li, T., Ma, S., Ogihara, M.: Entropy-based criterion in categorical clustering. In: Proceedings of International Conference on Machine Learning (ICML) (2004)
Meek C., Thiesson B., Heckerman D.: The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res. 2, 397–418 (2002)
Article MATH MathSciNet Google Scholar
Singh, S., Mayfield, C., Prabhakar, S., Shah, R., Hambrusch, S.: Indexing uncertain categorical data. In: Proceedings of IEEE International Conference on Data Engineering (ICDE) (2007)
Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. In: Proceedingd of the 37-th Annual Allerton Conference on Communication, Control and Computing (1999)
Utgoff P.E.: Incremental induction of decision trees. Mach. Learn. 4, 161–186 (1989)
Article Google Scholar
Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept drifting data streams using ensemble classifiers. In: Proceedings of ACM SIGKDD Conference (2003)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Fisher, D.H. (ed.) Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, US, 1997), pp. 412–420. Morgan Kaufmann, San Francisco (1997)

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Wright State University, Dayton, USA
Keke Chen
College of Computing, Georgia Institute of Technology, Atlanta, USA
Ling Liu

Authors

Keke Chen
View author publications
You can also search for this author in PubMed Google Scholar
Ling Liu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Keke Chen.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Chen, K., Liu, L. HE-Tree: a framework for detecting changes in clustering structure for categorical data streams. The VLDB Journal 18, 1241–1260 (2009). https://doi.org/10.1007/s00778-009-0134-5

Download citation

Received: 28 June 2007
Revised: 29 January 2009
Accepted: 29 January 2009
Published: 28 February 2009
Issue Date: December 2009
DOI: https://doi.org/10.1007/s00778-009-0134-5

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

HE-Tree: a framework for detecting changes in clustering structure for categorical data streams

Abstract

Article PDF

Similar content being viewed by others

Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

A Framework for Monitoring Clustering Stability Over Time

Clustering from Data Streams

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

HE-Tree: a framework for detecting changes in clustering structure for categorical data streams

Abstract

Article PDF

Similar content being viewed by others

Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

A Framework for Monitoring Clustering Stability Over Time

Clustering from Data Streams

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation