Abstract
In this paper, we propose two parallel algorithms for mining maximal frequent itemsets from databases. A frequent itemset is maximal if none of its supersets is frequent. One parallel algorithm is named distributed max-miner (DMM), and it requires very low communication and synchronization overhead in distributed computing systems. DMM has the local mining phase and the global mining phase. During the local mining phase, each node mines the local database to discover the local maximal frequent itemsets, then they form a set of maximal candidate itemsets for the top-down search in the subsequent global mining phase. A new prefix tree data structure is developed to facilitate the storage and counting of the global candidate itemsets of different sizes. This global mining phase using the prefix tree can work with any local mining algorithm. Another parallel algorithm, named parallel max-miner (PMM), is a parallel version of the sequential max-miner algorithm (Proc of ACM SIGMOD Int Conf on Management of Data, 1998, pp 85–93). Most of existing mining algorithms discover the frequent k-itemsets on the kth pass over the databases, and then generate the candidate (k + 1)-itemsets for the next pass. Compared to those level-wise algorithms, PMM looks ahead at each pass and prunes more candidate itemsets by checking the frequencies of their supersets. Both DMM and PMM were implemented on a cluster of workstations, and their performance was evaluated for various cases. They demonstrate very good performance and scalability even when there are large maximal frequent itemsets (i.e., long patterns) in databases.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Agarwal RC, Aggarwal CC, Prasad VVV (2000) Depth first generation of long patterns. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 108–118
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th VLDB conference, pp 487–499
Agrawal R and Shafer JC (1996). Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6): 962–969
Bayardo RJ (1998) Efficient mining long patterns from databases. In: Proceedings of ACM SIGMOD international conference on management of data, pp 85–93
Brin S, Motwani R, Ullman J, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: Proceedings of ACM SIGMOD international conference on management of data, pp 255–264
Burdick D, Calimlim M, Gehrke J (2001) MAFIA: a maximal frequent itemset algorithm for transaction databases. In: Proceedings of international conference on data engineering, pp 443–452
Cheung DW, Han J, Ng V, Fu AW, Fu Y (1996) A fast distributed algorithm for mining association rules. In: Proceedings of the 4th international conference on parallel and distributed information systems, pp 31–43
Cheung DW, Lee SD and Xiao Y (2002). Effect of data skewness and workload balance in parallel data mining. IEEE Trans Knowl Data Eng 14(3): 498–514
Chung SM and Yang J (1996). A parallel distributive join algorithm for cube-connected multiprocessors. IEEE Trans Parallel Distrib Systems 7(2): 127–137
Frequent Itemset Mining Implementations (FIMI) Repository (2004) http://fimi.cs.helsinki.fi/fimi04/
Geurts K, Wets G, Brijs T, Vanhoof K (2003) Profiling high frequency accident locations using association rules. In: Proceedings of the 82nd annual meeting of the transportation research board, Washington, p 18
Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proceedings of the 1st IEEE international conference on data mining, pp 163–170
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of international conference on data engineering, pp 512–521
Han EH, Karypis G and Kumar V (2000). Scalable parallel data mining for association rules. IEEE Trans Knowl Data Eng 12(3): 337–352
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of ACM SIGMOD international conference on management of data, pp 1–12
Holt JD and Chung SM (2001). Multipass algorithms for mining association rules in text databases. Knowl Inf Systems 3(2): 168–183
Holt JD and Chung SM (2002). Mining association rules using inverted hashing and pruning. Inf Process Lett 83(4): 211–220
Lin D and Kedem ZM (2002). Pincer-Search: an efficient algorithm for discovering the maximal frequent set. IEEE Trans Knowl Data Eng 14(3): 553–566
Snir M, Otto S, Huss-Lederman S, Walker D and Dongarra J (1996). MPI: the complete reference. MIT Press, Cambridge
Park JS, Chen MS, Yu PS (1995) Efficient parallel mining for association rules. In: Proceedings of international conference on information and knowledge management, pp 31–36
Park JS, Chen MS and Yu PS (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Trans Knowl Data Eng 9(5): 813–825
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of international conference on database theory, LNCS, vol 1540, pp 398–416
Savasere A, Omiecinski E, Navathe S (1995) An efficient algorithm for mining association rules in large databases. In: Proceedings of the 21st VLDB conference, pp 432–444
Thakur R, Rabenseifner R and Gropp W (2000). Optimization of collective communication operations in MPICH. Int J High Perform Comput Appl 19(1): 49–66
Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: Proceedings of international conference on information and knowledge management, pp 483–490
Yang Y, Guan X, You J (2002) CLOPE: a fast and effective clustering algorithm for transactional data. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 682–687
Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) New algorithms for fast discovery of association rules. Computer Science Department Technical Rep. #651, University of Rochester
Zaki MJ, Parthasarathy S, Ogihara M and Li W (1997). Parallel algorithms for fast discovery of association rules. Data Mining Knowl Discovery 1(4): 343–373
Zaki MJ (2000). Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3): 372–390
Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 34–43
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Chung, S.M., Luo, C. Efficient mining of maximal frequent itemsets from databases on a cluster of workstations. Knowl Inf Syst 16, 359–391 (2008). https://doi.org/10.1007/s10115-007-0115-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-007-0115-1