Abstract
Mining strong correlations from transactional databases often leads to more meaningful results than mining association rules. In such mining, null (transaction)-invariance is an important property of the correlation measures. Unfortunately, some useful null-invariant measures such as Kulczynski and Cosine, which can discover correlations even for the very unbalanced cases, lack the (anti)-monotonicity property. Thus, they could only be applied to frequent itemsets as the post-evaluation step. For large datasets and for low supports, this approach is computationally prohibitive. This paper presents new properties for all known null-invariant measures. Based on these properties, we develop efficient pruning techniques and design the Apriori-like algorithm NICoMiner for mining strongly correlated patterns directly. We develop both the threshold-bounded and the top-k variations of the algorithm, where top-k is used when the optimal correlation threshold is not known in advance and to give user control over the output size. We test NICoMiner on real-life datasets from different application domains, using Cosine as an example of the null-invariant correlation measure. We show that NICoMiner outperforms support-based approach more than an order of magnitude, and that it is very useful for discovering top correlations in itemsets with low support.
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Dataset: Dblp (2006), http://www.informatik.uni-trier.de/~ley/db/
Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. In: SIGMOD (1993)
Brin, S., Motwani, R., Silverstein, C.: Beyond market baskets: generalizing association rules to correlations. In: SIGMOD (1997)
Campagna, A., Pagh, R.: Finding associations and computing similarity via biased pair sampling. In: Perner, P. (ed.) ICDM 2009. LNCS, vol. 5633, Springer, Heidelberg (2009)
Cohen, J., West, S.G., Cohen, P., Aiken, L.: Applied Multiple Regression Correlation Analysis for the Behavioral Sciences, 3rd edn. Lawrence Erlbaum Assoc Inc., Mahwah (2002)
Hahsler, M.: Groceries dataset (2007), http://rss.acs.unt.edu/Rdoc/library/arules/data/
Hahsler, M., Hornik, K., Reutterer, T.: Implications of probabilistic data modeling for mining association rules. In: Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation (2006)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)
Kuo, W.P., Jenssen, T.-K., Butte, A.J., Ohno-Machado, L., Kohane, I.S.: Analysis of matched mrna measurements from two different microarray technologies. Bioinformatics 18(3), 405–412 (2002)
Lee, Y.-K., Kim, W.-Y., Cai, Y.D., Han, J.: Comine: Efficient mining of correlated patterns. In: ICDM (2003)
Omiecinski, E.R.: Alternative interest measures for mining associations in databases. IEEE Trans. on Knowl. and Data Eng. 15, 57–69 (2003)
Redmond, M.: Communities and crime dataset (2009), http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
Redmond, M.A., Baveja, A.: A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research 141, 660–678 (2002)
Srikant, R., Agrawal, R.: Mining generalized association rules. In: VLDB (1995)
Sun, Y., Han, J., Gao, J., Yu, Y.: iTopicModel: Information network-integrated topic modeling. In: Perner, P. (ed.) ICDM 2009. LNCS, vol. 5633, Springer, Heidelberg (2009)
Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for association patterns. In: KDD (2002)
Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining, 1st edn. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (2005)
von Storch, H., Zwiers, F.W.: Statistical analysis in climate research. Cambridge University Press, Cambridge (2002)
Wu, T., Chen, Y., Han, J.: Re-examination of interestingness measures in pattern mining: a unified framework. Data Min. Knowl. Discov. 21(3), 371–397 (2010)
Xin, D., Han, J., Yan, X., Cheng, H.: Mining compressed frequent-pattern sets. In: VLDB (2005)
Xiong, H., He, X., Ding, C.H.Q., Zhang, Y., Kumar, V., Holbrook, S.R.: Identification of functional modules in protein complexes via hyperclique pattern discovery. In: Pacific Symposium on Biocomputing (2005)
Xiong, H., Zhou, W., Brodie, M., Ma, S.: Top-k φ correlation computation. INFORMS Journal on Computing 20(4), 539–552 (2008)
Zhu, S., Wu, J., Xiong, H., Xia, G.: Scaling up top-k cosine similarity search. Data Knowl. Eng. 70, 60–83 (2011)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, S., Barsky, M., Han, J. (2011). Efficient Mining of Top Correlated Patterns Based on Null-Invariant Measures. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science(), vol 6912. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23783-6_12
Download citation
DOI: https://doi.org/10.1007/978-3-642-23783-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23782-9
Online ISBN: 978-3-642-23783-6
eBook Packages: Computer ScienceComputer Science (R0)