Efficient mining of maximal frequent itemsets from databases on a cluster of workstations

Chung, Soon M.; Luo, Congnan

doi:10.1007/s10115-007-0115-1

Efficient mining of maximal frequent itemsets from databases on a cluster of workstations

Regular Paper
Published: 12 December 2007

Volume 16, pages 359–391, (2008)
Cite this article

Download PDF

Access provided by CONRICYT-eBooks

Knowledge and Information Systems Aims and scope Submit manuscript

Efficient mining of maximal frequent itemsets from databases on a cluster of workstations

Download PDF

Soon M. Chung¹ &
Congnan Luo¹

161 Accesses
11 Citations
Explore all metrics

Abstract

In this paper, we propose two parallel algorithms for mining maximal frequent itemsets from databases. A frequent itemset is maximal if none of its supersets is frequent. One parallel algorithm is named distributed max-miner (DMM), and it requires very low communication and synchronization overhead in distributed computing systems. DMM has the local mining phase and the global mining phase. During the local mining phase, each node mines the local database to discover the local maximal frequent itemsets, then they form a set of maximal candidate itemsets for the top-down search in the subsequent global mining phase. A new prefix tree data structure is developed to facilitate the storage and counting of the global candidate itemsets of different sizes. This global mining phase using the prefix tree can work with any local mining algorithm. Another parallel algorithm, named parallel max-miner (PMM), is a parallel version of the sequential max-miner algorithm (Proc of ACM SIGMOD Int Conf on Management of Data, 1998, pp 85–93). Most of existing mining algorithms discover the frequent k-itemsets on the kth pass over the databases, and then generate the candidate (k + 1)-itemsets for the next pass. Compared to those level-wise algorithms, PMM looks ahead at each pass and prunes more candidate itemsets by checking the frequencies of their supersets. Both DMM and PMM were implemented on a cluster of workstations, and their performance was evaluated for various cases. They demonstrate very good performance and scalability even when there are large maximal frequent itemsets (i.e., long patterns) in databases.

Avoid common mistakes on your manuscript.

References

Agarwal RC, Aggarwal CC, Prasad VVV (2000) Depth first generation of long patterns. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 108–118
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th VLDB conference, pp 487–499
Agrawal R and Shafer JC (1996). Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6): 962–969
Article Google Scholar
Bayardo RJ (1998) Efficient mining long patterns from databases. In: Proceedings of ACM SIGMOD international conference on management of data, pp 85–93
Brin S, Motwani R, Ullman J, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: Proceedings of ACM SIGMOD international conference on management of data, pp 255–264
Burdick D, Calimlim M, Gehrke J (2001) MAFIA: a maximal frequent itemset algorithm for transaction databases. In: Proceedings of international conference on data engineering, pp 443–452
Cheung DW, Han J, Ng V, Fu AW, Fu Y (1996) A fast distributed algorithm for mining association rules. In: Proceedings of the 4th international conference on parallel and distributed information systems, pp 31–43
Cheung DW, Lee SD and Xiao Y (2002). Effect of data skewness and workload balance in parallel data mining. IEEE Trans Knowl Data Eng 14(3): 498–514
Article Google Scholar
Chung SM and Yang J (1996). A parallel distributive join algorithm for cube-connected multiprocessors. IEEE Trans Parallel Distrib Systems 7(2): 127–137
Article Google Scholar
Frequent Itemset Mining Implementations (FIMI) Repository (2004) http://fimi.cs.helsinki.fi/fimi04/
Geurts K, Wets G, Brijs T, Vanhoof K (2003) Profiling high frequency accident locations using association rules. In: Proceedings of the 82nd annual meeting of the transportation research board, Washington, p 18
Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proceedings of the 1st IEEE international conference on data mining, pp 163–170
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of international conference on data engineering, pp 512–521
Han EH, Karypis G and Kumar V (2000). Scalable parallel data mining for association rules. IEEE Trans Knowl Data Eng 12(3): 337–352
Article Google Scholar
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of ACM SIGMOD international conference on management of data, pp 1–12
Holt JD and Chung SM (2001). Multipass algorithms for mining association rules in text databases. Knowl Inf Systems 3(2): 168–183
Article MATH Google Scholar
Holt JD and Chung SM (2002). Mining association rules using inverted hashing and pruning. Inf Process Lett 83(4): 211–220
Article MATH MathSciNet Google Scholar
Lin D and Kedem ZM (2002). Pincer-Search: an efficient algorithm for discovering the maximal frequent set. IEEE Trans Knowl Data Eng 14(3): 553–566
Article Google Scholar
Snir M, Otto S, Huss-Lederman S, Walker D and Dongarra J (1996). MPI: the complete reference. MIT Press, Cambridge
Google Scholar
Park JS, Chen MS, Yu PS (1995) Efficient parallel mining for association rules. In: Proceedings of international conference on information and knowledge management, pp 31–36
Park JS, Chen MS and Yu PS (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Trans Knowl Data Eng 9(5): 813–825
Article Google Scholar
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of international conference on database theory, LNCS, vol 1540, pp 398–416
Savasere A, Omiecinski E, Navathe S (1995) An efficient algorithm for mining association rules in large databases. In: Proceedings of the 21st VLDB conference, pp 432–444
Thakur R, Rabenseifner R and Gropp W (2000). Optimization of collective communication operations in MPICH. Int J High Perform Comput Appl 19(1): 49–66
Article Google Scholar
Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: Proceedings of international conference on information and knowledge management, pp 483–490
Yang Y, Guan X, You J (2002) CLOPE: a fast and effective clustering algorithm for transactional data. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 682–687
Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) New algorithms for fast discovery of association rules. Computer Science Department Technical Rep. #651, University of Rochester
Zaki MJ, Parthasarathy S, Ogihara M and Li W (1997). Parallel algorithms for fast discovery of association rules. Data Mining Knowl Discovery 1(4): 343–373
Article Google Scholar
Zaki MJ (2000). Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3): 372–390
Article MathSciNet Google Scholar
Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 34–43

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Wright State University, Dayton, OH, USA
Soon M. Chung & Congnan Luo

Authors

Soon M. Chung
View author publications
You can also search for this author in PubMed Google Scholar
Congnan Luo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Soon M. Chung.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Chung, S.M., Luo, C. Efficient mining of maximal frequent itemsets from databases on a cluster of workstations. Knowl Inf Syst 16, 359–391 (2008). https://doi.org/10.1007/s10115-007-0115-1

Download citation

Received: 15 April 2005
Revised: 31 August 2007
Accepted: 08 September 2007
Published: 12 December 2007
Issue Date: September 2008
DOI: https://doi.org/10.1007/s10115-007-0115-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Efficient mining of maximal frequent itemsets from databases on a cluster of workstations

Abstract

Article PDF

Similar content being viewed by others

Data placement in massively distributed environments for fast parallel mining of frequent itemsets

Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

A highly scalable parallel algorithm for maximally informative k-itemset mining

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Efficient mining of maximal frequent itemsets from databases on a cluster of workstations

Abstract

Article PDF

Similar content being viewed by others

Data placement in massively distributed environments for fast parallel mining of frequent itemsets

Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

A highly scalable parallel algorithm for maximally informative k-itemset mining

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation