Abstract
Analyzing clustering structures in data streams can provide critical information for real-time decision making. Most research in this area has focused on clustering algorithms for numerical data streams, and very few have proposed to monitor the change of clustering structure. Most surprisingly, to our knowledge, no work has been proposed on monitoring clustering structure for categorical data streams. In this paper, we present a framework for detecting the change of primary clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework uses a Hierarchical Entropy Tree structure (HE-Tree) to capture the entropy characteristics of clusters in a data stream, and detects the change of Best K by combining our previously developed BKPlot method. The HE-Tree can efficiently summarize the entropy property of a categorical data stream and allow us to draw precise clustering information from the data stream for generating high-quality BKPlots. We also develop the time-decaying HE-Tree structure to make the monitoring more sensitive to recent changes of clustering structure. The experimental result shows that with the combination of the HE-Tree and the BKPlot method we are able to promptly and precisely detect the change of clustering structure in categorical data streams.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Aggarwal, C.C.: On change diagnosis in evolving data streams. IEEE Trans. Knowl. Data Eng. 17, 5 (2005)
Aggarwal, C.C., Han, J., Wang, J., Yu, P.: A framework for projected clustering of high dimensional data streams. In: Proceedings of Very Large Databases Conference (VLDB) (2004)
Andritsos, P., Tsaparas, P., Miller, R.J., Sevcik, K.C.: Limbo: scalable clustering of categorical data. In: Proceedings of Intternational Conference on Extending Database Technology (EDBT) (2004)
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of ACM Conference on Principles of Database Systems (PODS) (2002)
Barbara, D., Li, Y., Couto, J.: Coolcat: an entropy-based algorithm for categorical clustering. In: Proceedings of ACM Conference on Information and Knowledge Management (CIKM) (2002)
Ben-David, S., Gehrke, J., Kifer, D.: Detecting change in data stream. In: Proceedings of Very Large Databases Conference (VLDB) (2004)
Bock, H.: Probabilistic aspects in cluster analysis. In: Conceptual and Numerical Analysis of Data. Springer, Berlin (1989)
Brand, M.: An entropic estimator for structure discovery. In: Proceedings Of Neural Information Processing Systems (NIPS), pp. 723–729 (1998)
Celeux, G., Govaert, G.: Clustering criteria for discrete data and latent class models. J. Classif. (1991)
Chakrabarti, A., Cormode, G., McGregor, A.: A near-optimal algorithm for computing the entropy of a stream. In: Annual ACM-SIAM Symposium on Discrete Algorithms (2007)
Chakrabarti, D., Papadimitriou, S., Modha, D.S., Faloutsos, C.: Fully automatic cross-associations. In: Proceedings of ACM SIGKDD Conference (2004)
Chen K., Liu L.: VISTA: Validating and refining clusters via visualization. Inf. Vis. 3 4, 257–270 (2004)
Chen, K., Liu, L.: The “best k” for entropy-based categorical clustering. In: Proceedings of International Conference on Scientific and Statistical Database Management (SSDBM), pp.253–262 (2005)
Chen, K., Liu, L.: Detecting the change of clustering structure in categorical data streams. In: SIAM Data Mining Conference (2006)
Chen, K., Liu, L.: iVIBRATE: Interactive visualization based framework for clustering large datasets. ACM Trans. Inf. Syst. 24, 2 (2006)
Chen, Y., Dong, G., Han, J., Wah, B.W., Wang, J.: Multidimensional regression analysis of time-series data streams. In : Proceedings of Very Large Databases Conference (VLDB) (2002)
Cheng, C.H., Fu, A. W.-C., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proceedings of ACM SIGKDD Conference (1999)
Cover T., Thomas J.: Elements of Information Theory. Wiley, NY (1991)
Das, A., Gehrke, J., Riedewald, M.: Approximate join processing over data streams. In: Proceedings of ACM SIGMOD Conference (2003)
Dhillon, I.S., Mellela, S., Modha, D.S.: Information-theoretic co-clustering. In: Proceedings of ACM SIGKDD Conference (2003)
Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of ACM SIGKDD Conference, pp. 71–80 (2000)
Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS-clustering categorical data using summaries. In: Proceedings of ACM SIGKDD Conference (1999)
Ganti V., Gehrke J., Ramakrishnan R.: Demon: Mining and monitoring evolving data. IEEE Trans. Knowl. Data Eng. 13, 1 (2001)
Ganti V., Gehrke J., Ramakrishnan R., Loh W.: A framework for measuring differences in data characteristics. J. Comput. Syst. Sci. 64, 3 (2002)
Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.-Y.: BOAT— optimistic decision tree construction. In: Proceedings of ACM SIGMOD Conference, pp. 169–180 (1999)
Gibson, D., Kleinberg, J., Raghavan, P.: Clustering categorical data: An approach based on dynamical systems. In: Proceedings of Very Large Databases Conference (VLDB), pp. 222–236 (2000)
Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng. 15 (2003)
Guha S., Rastogi R., Shim K.: ROCK: A robust clustering algorithm for categorical attributes. Inf. Syst. 25 5, 345–366 (2000)
Halkidi M., Batistakis Y., Vazirgiannis M.: Cluster validity methods: Part I and II. SIGMOD Record 31 2, 40–45 (2002)
Huang, Z.: A fast clustering algorithm to cluster very large categorical data sets in data mining. In: Workshop on Research Issues on Data Mining and Knowledge Discovery (1997)
Jain A., Murty M., , : Data clustering: A review. ACM Comput. Surv. 31, 264–323 (1999)
Jain A.K., Dubes R.C.: Algorithms for Clustering Data. Prentice hall, New York (1988)
Keogh, E., Lin, J., Truppel, W.: Clustering of time series subsequences is meaningless: Implications for previous and future research. In: Proceedings of International Conference on Data Mining (ICDM) (2003)
Li, T., Ma, S., Ogihara, M.: Entropy-based criterion in categorical clustering. In: Proceedings of International Conference on Machine Learning (ICML) (2004)
Meek C., Thiesson B., Heckerman D.: The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res. 2, 397–418 (2002)
Singh, S., Mayfield, C., Prabhakar, S., Shah, R., Hambrusch, S.: Indexing uncertain categorical data. In: Proceedings of IEEE International Conference on Data Engineering (ICDE) (2007)
Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. In: Proceedingd of the 37-th Annual Allerton Conference on Communication, Control and Computing (1999)
Utgoff P.E.: Incremental induction of decision trees. Mach. Learn. 4, 161–186 (1989)
Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept drifting data streams using ensemble classifiers. In: Proceedings of ACM SIGKDD Conference (2003)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Fisher, D.H. (ed.) Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, US, 1997), pp. 412–420. Morgan Kaufmann, San Francisco (1997)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Chen, K., Liu, L. HE-Tree: a framework for detecting changes in clustering structure for categorical data streams. The VLDB Journal 18, 1241–1260 (2009). https://doi.org/10.1007/s00778-009-0134-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-009-0134-5