Abstract
In this article, we focus on distributed Apriori-based frequent itemsets mining. We present a new distributed approach which takes into account inherent characteristics of this algorithm. We study the distribution aspect of this algorithm and give a comparison of the proposed approach with a classical Apriori-like distributed algorithm, using both analytical and experimental studies. We find that under a wide range of conditions and datasets, the performance of a distributed Apriori-like algorithm is not related to global strategies of pruning since the performance of the local Apriori generation is usually characterized by relatively high success rates of candidate sets frequency at low levels which switch to very low rates at some stage, and often drops to zero. This means that the intermediate communication steps and remote support counts computation and collection in classical distributed schemes are computationally inefficient locally, and then constrains the global performance. Our performance evaluation is done on a large cluster of workstations using the Condor system and its workflow manager DAGMan. The results show that the presented approach greatly enhances the performance and achieves good scalability compared to a typical distributed Apriori founded algorithm.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceeding of the 20th international conference on very large data bases, VLDB, pp 487–499
Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: SIGMOD 1997, Proceedings ACM SIGMOD international conference on management of data, May 13–15
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Park JS, Chen M-S, Yu PS (1995) An effective hash-based algorithm for mining association rules. In: Proceedings of the 1995 ACM SIGMOD international conference on management of data, SIGMOD ’95
Savasere A, Omiecinski E, Navathe SB (1995) An efficient algorithm for mining association rules in large databases. VLDB J :432–444
Cheung DW, Han J, Ng VT, Fu AW, Fu Y (1996) A fast distributed algorithm for mining association rules. In: PDIS: International conference on parallel and distributed information systems
Cheung DW, Ng VT, Fu AW, Fu Y (1996) Efficient mining of association rules in distributed databases. IEEE Trans Knowl Data Eng 8(6): 911–922
Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng 8: 962–969
Ashrafi MZ, Taniar D, Smith KA (2004) ODAM: an optimized distributed association rule mining algorithm. IEEE Distrib Syst Online 5(3)
Zaki M, Ogihara M, Parthasarathy S, Li W (1996) Parallel data mining for association rules on shared memory multi-processors. Proceedings of the 1996 IEEE/ACM conference on supercomputing
Parthasarathy S, Zaki MJ, Ogihara M, Li W (2001) Parallel data mining for association rules on shared memory systems. Knowl Inf Syst 3(1): 1–29
Pramudiono I, Kitsuregawa M (2003) Parallel FP-Growth on PC cluster. In: Proceedings of the 7th Pacific–Asia conference of knowledge discovery and data mining (PAKDD03)
Schuster A, Wolff R (2004) Communication-efficient distributed mining of association rules. Data Min Knowl Disc 8(2): 171–196
Toivonen H (1996) Sampling large databases for association rules. In: VLDB ’96: Proceedings of the 22nd international conference on very large data bases
Assaf Schuster, Ran Wolff, Dan Trock (2005) A high-performance distributed algorithm for mining association rules. Knowl Inf Syst 7(4): 458–475
Chung SM, Congnan L (2008) Efficient mining of maximal frequent itemsets from databases on a cluster of workstations. Knowl Inf Syst 16(3): 359–391
Congnan L, Chung SM (2008) A scalable algorithm for mining maximal frequent sequences using a sample. Knowl Inf Syst 15(2): 149–179
Chung SM, Luo C (2004) Distributed mining of maximal frequent itemsets from databases on a cluster of workstations. In: CCGRID ’04: Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid
Wolff R, Schuster A (2003) Association rule mining in peer-to-peer systems. In: Proceeding of the IEEE conference on data mining ICDM 2003
Zaki M (1999) Parallel and distributed association mining: a survey. IEEE Concurr 7(4): 14–25
Zaki M (2000) Parallel and distributed data mining: an introduction. In: Workshop on Large-Scale Parallel KDD Systems, SIGKDD
Park B, Kargupta H (2002) Distributed data mining: algorithms, systems, and applications. Data Mining Handbook
Le-Khac N-A, Aouad LM, Kechadi MT (2007) Knowledge map: toward a new approach supporting the knowledge management in distributed data mining. In: The third international conference on autonomic and autonomous systems, ICAS
Le-Khac N-A, Kechadi MT, Carthy J (2006) ADMIRE framework: distributed data mining on data grid platforms. First international conference on software and data technologies, ICSOFT
Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P (2004) Distributed data mining on grids: services, tools, and applications. IEEE Trans Syst Man Cybern 34(6): 2451–2465
Ghanem VM, Kohler YM, Sayed AJ, Wendel P (2002) Discovery net: towards a grid of knowledge discovery. In: Eighth international conference on knowledge discovery and data mining
Kickinger G, Hofer J, Brezany P, Tjoa AM (2004) Grid knowledge discovery processes and an architecture for their composition. Parallel Distrib Comput Netw. Proceedings of parallel and distributed computing and networks
Berman, Fran (2001) Viewpoint: from TeraGrid to knowledge grid. Commun ACM 44(11): 27–28
Srikant R, Agrawal R (1997) Mining generalized association rules. Future Gener Comput Syst 13(2–3): 161–180
Geerts F, Goethals B, Van Den Bussche J (2005) Tight upper bounds on the number of candidate patterns. ACM Trans Database Syst 30(2): 333–363
Thain D, Tannenbaum T, Livny M (2004) Distributed computing in practice: the condor experience. Concurr Comput Pract Exp 17(2–4): 323–356
Fahringer T, Jugravu A, Pllana S, Prodan R, Seragiotto C Jr, Truong H-L (2005) ASKALON: a tool set for cluster and grid computing: research articles. Concurr Comput Pract Exp 17(2–4): 143–169
von Laszewski G, Foster I, Gawor J, Lane P (2001) A Java commodity grid kit. Concurr Comput Pract Exp 13(8–9): 645–662
Delannoy O, Petiton S (2004) A Peer to peer computing framework: design and performance evaluation of YML. HeteroPar 2004, the third international workshop on algorithms, models and tools for parallel computing on heterogeneous networks
Amin K, von Laszewski G, Hategan M, Zaluzec NJ, Hampton S, Rossi A (2004) GridAnt: a client- controllable grid workflow system. In: Proceedings of the 37th annual Hawaii international conference on system science
Cappello F et al (2005) Grid’5000: a large scale, reconfigurable, controlable and monitorable grid platform. In: Sixth IEEE/ACM international workshop on grid computing
Purdom PW, Van Gucht D, Groth DP (2004) Average-case performance of the Apriori algorithm. SIAM J Comput 33(5): 1223–1260
Aouad LM, Le-Khac N-A, Kechadi TM (2007) Grid-based approaches for distributed data mining applications. In: DCABES 2007, the sixth international conference on distributed computing and applications for business, engineering and sciences
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Aouad, L.M., Le-Khac, NA. & Kechadi, T.M. Performance study of distributed Apriori-like frequent itemsets mining. Knowl Inf Syst 23, 55–72 (2010). https://doi.org/10.1007/s10115-009-0205-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-009-0205-3