Abstract
We consider the problem of defining the significance of an itemset. We say that the itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the real value and the estimate. For the estimation we use Maximum Entropy and for measuring the deviation we use Kullback–Leibler divergence. A major advantage compared to the previous methods is that we are able to use richer models whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong but applying more flexible models leads to good results.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Aggarwal CC, Yu PS (1998) A new framework for itemset generation. In: PODS ’98: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. ACM Press, New York, pp 18–24
Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: Buneman P, Jajodia S (eds) Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington, D.C., pp 207–216
Agrawal R, Mannila H, Srikant R, Toivonen H and Verkamo AI (1996). Fast discovery of association rules. In: Fayyad, U, Piatetsky-Shapiro, G, Smyth, P, and Uthurusamy, R (eds) Advances in Knowledge Discovery and Data Mining., pp 307–328. AAAI Press/The MIT Press, Cambridge
Boulicaut J-F, Bykowski A, Rigotti C (2000) Approximation of frequency queries by means of free-sets. In: Principles of Data Mining and Knowledge Discovery, pp 75–85
Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Knowledge Discovery and Data Mining. ACM, New York, pp 254–260
Brin S, Motwani R and Silverstein C (1997). Beyond market baskets: Generalizing association rules to correlations. In: Peckham, J (eds) SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pp 265–276. ACM Press, New York
Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data. pp 255–264
Calders T, Goethals B (2002) Mining all non-derivable frequent itemsets. In: Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases
Chow C and Liu C (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14(3): 462–467
Cooper G (1990). The computational complexity of probabilistic inference using bayesian belief networks. Artif Intell 42(2–3): 393–405
Csiszár I (1975). I-divergence geometry of probability distributions and minimization problems. Ann Prob 3(1): 146–158
Darroch J and Ratchli D (1972). Generalized iterative scaling for log-linear models. Ann Math Stat 43(5): 1470–1480
Dong G, Li J (1999) Efficient mining of emerging patterns: Discovering trends and differences. In: Knowledge Discovery and Data Mining, pp 43–52
DuMouchel W, Pregibon D (2001) Empirical bayes screening for multi-item associations. In: Knowledge Discovery and Data Mining, pp 67–76
Fortelius M (2005) Neogene of the old world database of fossil mammals (NOW). University of Helsinki, http://www.helsinki.fi/science/now/
Fortelius M, Gionis A, Jernvall J and Mannila H (2006). Spectral ordering and biochronology of european fossil mammals paleobiology. Paleobiology 32(2): 206–214
Gallo A, Bie TD, Christianini N (2007) Mini: Mining informative non-redundant itemsets. In: 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp 438–445
Geurts K, Wets G, Brijs T, Vanhoof K (2003) Profiling high frequency accident locations using association rules. In: Proceedings of the 82nd Annual Transportation Research Board, Washington DC. (USA), January 12–16
Heikinheimo H, Hinkkanen E, Mannila H, Mielikäinen T, Seppänen JK (2007) Finding low-entropy sets and trees from binary data. In: Knowledge Discovery and Data Mining
Jaroszewicz S, Simovici DA (2002) Pruning redundant association rules using maximum entropy principle. In: Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD’02, pp 135–147
Jiroušek R and Přeušil S (1995). On the effective implementation of the iterative proportional fitting procedure. Comput Stat Data Anal 19: 177–189
Kohavi R, Brodley C, Frasca B, Mason L and Zheng Z (2000). KDD-Cup 2000 organizers report: Peeling the onion. SIGKDD Explorat 2(2): 86–98
Kullback S (1968). Information Theory and Statistics. Dover Publications, Inc., New York
Mannila H, Mielikäinen T (2003) The pattern ordering problem. In: Principles of Data Mining and Knowledge Discovery, pp 327–338
Norén GN, Bate A and Edwards IR (2007). Extending the methods used to screen the who drug safety database towards analysis of complex associations and improved accuracy for rare events. Stat Med 25: 3740–3757
Omiecinski ER (2003). Alternative interest measures for mining associations in databases. IEEE Trans Knowl Data Eng 15(1): 57–69
Pasquier N, Bastide Y, Taouil R and Lakhal L (1999). Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science 1540: 398–416
Pavlov D, Mannila H and Smyth P (2003). Beyond independence: Probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15(6): 1409–1421
Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases. AAAI/MIT Press, New York, pp 229–248
Tatti N (2006a) Computational complexity of queries based on itemsets. Inf Process Lett pp 183–187
Tatti N (2006). Safe projections of binary data sets. Acta Inf 42(8–9): 617–638
Tatti N (2007) Maximum entropy based significance of itemsets. In: Proceedings of Seventh IEEE International Conference on Data Mining (ICDM 2007), pp 312–321
van der Vaart AW (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge
Webb GI (2006) Discovering significant rules. In: Knowledge discovery and data mining, pp 434–443
Author information
Authors and Affiliations
Corresponding author
Additional information
A preliminary version appeared as “Maximum Entropy Based Significance of Itemsets”, In Proceedings of Seventh IEEE International Conference on Data Mining (ICDM 2007), pp 312–321, 2006 [32].
Rights and permissions
About this article
Cite this article
Tatti, N. Maximum entropy based significance of itemsets. Knowl Inf Syst 17, 57–77 (2008). https://doi.org/10.1007/s10115-008-0128-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-008-0128-4