Abstract
Cloud computing is large in scale and highly scalable, which makes data mining on cloud platforms an important research field. This paper proposes MFIM, an algorithm for mining frequent itemsets based on MapReduce. MFIM distributes data by the horizontal projection method. Each node computes its local frequent itemsets with an FP-tree and MapReduce; the center node then exchanges data with the other nodes and combines the results; finally, the global frequent itemsets are obtained by MapReduce. Theoretical analysis and experimental results suggest that MFIM is fast and effective.
1 Introduction
The key to mining association rules is finding frequent itemsets [1]. Various serial algorithms exist for mining association rules, such as Apriori [2]. However, because the databases involved are generally large, traditional serial algorithms are time-consuming. To improve efficiency, several parallel mining algorithms have been proposed, including PDM [3], CD [4], and FDM [5]. Most of them divide the global transaction database horizontally into n equal parts. In addition, most parallel mining algorithms adopt an Apriori-like approach, so they generate many candidate itemsets and scan the database repeatedly. Cloud computing is large in scale and highly scalable, which makes data mining on cloud platforms an important field. This paper therefore proposes MFIM, an algorithm for mining frequent itemsets based on MapReduce.
2 Related Description
Let DB be the global transaction database and M its total number of tuples. Let \( P_{1}, P_{2}, \ldots, P_{n} \) be n computing nodes (nodes for short). Each \( DB_{i} \) (i = 1, 2, …, n) is the part of DB stored on \( P_{i} \) and contains \( M_{i} \) tuples, so that \( DB = \bigcup\nolimits_{i = 1}^{n} {DB_{i} } ,\;M = \sum\nolimits_{i = 1}^{n} {M_{i} } \). Mining association rules can then be described as follows: each node \( P_{i} \) processes its local database \( DB_{i} \) and communicates with the other nodes; finally, the global frequent itemsets of the global transaction database are obtained by MapReduce.
Definition 1
For an itemset X, the number of tuples in the local database \( DB_{i} \) (i = 1, 2, …, n) that contain X is defined as the local frequency of X, denoted \( X.s_{i} \).
Definition 2
For an itemset X, the number of tuples in the global database that contain X is defined as the global frequency of X, denoted \( X.s \).
Definition 3
For an itemset X, if \( X.s_{i} \ge min\_sup \times M_{i} \) (i = 1, 2, …, n), then X is a local frequent itemset of \( DB_{i} \); the set of local frequent itemsets of \( DB_{i} \) is denoted \( F_{i} \). Here min_sup is the minimum support threshold.
Definition 4
For an itemset X, if \( X.s \ge min\_sup \times M \), then X is a global frequent itemset; the set of global frequent itemsets is denoted F. The global frequent itemsets with |X| = k are denoted \( F_{k} \).
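Definitions 1 to 4 can be illustrated with a small sketch. It assumes transactions are represented as Python frozensets held in memory; the helper names `frequency` and `is_frequent` and the toy data are hypothetical and do not come from the paper.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]
Database = List[Itemset]


def frequency(itemset: Itemset, db: Database) -> int:
    """Number of tuples in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def is_frequent(itemset: Itemset, db: Database, min_sup: float) -> bool:
    """Definitions 3 and 4: frequency >= min_sup * (number of tuples in db)."""
    return frequency(itemset, db) >= min_sup * len(db)


# Toy data (assumed for illustration): two local databases DB_1 and DB_2.
db1 = [frozenset("ab"), frozenset("abc"), frozenset("c")]
db2 = [frozenset("ab"), frozenset("bc"), frozenset("c")]
local_dbs = [db1, db2]
global_db = db1 + db2            # DB is the union of the local databases

x = frozenset("ab")
local_freqs = [frequency(x, db) for db in local_dbs]   # X.s_i for each node
global_freq = sum(local_freqs)                         # X.s

print(local_freqs, global_freq)
print([is_frequent(x, db, 0.5) for db in local_dbs], is_frequent(x, global_db, 0.5))
```

With min_sup = 0.5, the itemset {a, b} in this toy data is globally frequent and locally frequent in the first database but not in the second, which is the situation Theorem 2 addresses.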
Theorem 1
If an itemset X is a local frequent itemset of \( DB_{i} \), then every nonempty subset of X is also a local frequent itemset of \( DB_{i} \).
Theorem 2
If an itemset X is a global frequent itemset, then X and each of its nonempty subsets are local frequent itemsets of at least one local database.
Theorem 3
If an itemset X is a global frequent itemset, then every nonempty subset of X is also a global frequent itemset.
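The paper states Theorem 2 without proof. One way to see it is a short counting argument by contradiction, sketched here from Definitions 1 to 4 and the assumption that the local databases partition DB, so that \( X.s = \sum\nolimits_{i = 1}^{n} {X.s_{i} } \). Suppose X were not locally frequent on any node:

```latex
\[
X.s_{i} < min\_sup \times M_{i} \quad (i = 1, 2, \ldots, n)
\;\Longrightarrow\;
X.s = \sum_{i=1}^{n} X.s_{i}
    < min\_sup \times \sum_{i=1}^{n} M_{i}
    = min\_sup \times M .
\]
```

This contradicts \( X.s \ge min\_sup \times M \), so X must be locally frequent on at least one node. By Theorem 3 every nonempty subset of X is also globally frequent, so the same argument applies to each subset.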
3 MFIM Algorithm
MFIM distributes data by the horizontal projection method, which divides the M tuples of the global transaction database into groups of \( M_{1}, M_{2}, \ldots, M_{n} \) tuples (\( \sum\nolimits_{i = 1}^{n} {M_{i} } = M \)). The set of \( M_{i} \) tuples on the ith node is \( \left\{ {T_{i}^{j} |T_{i}^{j} = O_{q} \;{\text{and}}\;q = n \times \left( {j - 1} \right) + i} \right\} \), where \( T_{i}^{j} \) is the jth tuple of the ith node and \( O_{q} \) is the qth tuple of the global transaction database DB. DB is thus divided into n local databases \( DB_{1}, DB_{2}, \ldots, DB_{n} \), each of size about \( \left\lfloor \frac{M}{n} \right\rfloor \), with \( DB = \bigcup\nolimits_{i = 1}^{n} {DB_{i} } \). Because each \( DB_{i} \) takes its tuples from DB at a regular interval, and the global transaction database is divided evenly among the n local databases, MFIM reduces data skew.
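The index rule \( q = n \times (j - 1) + i \) amounts to dealing tuples to the nodes in round-robin order. A minimal sketch, assuming the global database fits in a Python list; the function name `horizontal_projection` is hypothetical:

```python
from typing import List, TypeVar

T = TypeVar("T")


def horizontal_projection(db: List[T], n: int) -> List[List[T]]:
    """Split db into n local databases.

    With 1-based indices, the j-th tuple of node i is O_q with
    q = n * (j - 1) + i, i.e. every n-th tuple starting at position i.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of nodes")
    return [db[i::n] for i in range(n)]


# Toy example (assumed): tuples are stood in for by their global index q.
global_db = list(range(1, 11))          # O_1 ... O_10, so M = 10
parts = horizontal_projection(global_db, 3)
print(parts)                            # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
print([len(p) for p in parts])          # [4, 3, 3]
```

When n does not divide M, as here, the local sizes differ by at most one tuple.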
MFIM designates one node \( P_{0} \) as the center node. Every other node \( P_{i} \) sends its local frequent itemsets \( F_{i} \) to \( P_{0} \), which forms the union \( F^{\prime } = \bigcup\nolimits_{i = 1}^{n} {F_{i} } \) and prunes it with the top–down strategy. \( P_{0} \) then sends the remaining itemsets of F′ to the other nodes. For each itemset d remaining in F′, \( P_{0} \) collects the local frequency \( d.s_{i} \) from each node and computes the global frequency \( d.s = \sum\nolimits_{i = 1}^{n} {d.s_{i} } \). The global frequent itemsets are then obtained by MapReduce.
Pruning F′ with the top–down strategy reduces communication traffic.
The top–down strategy is described as follows.
(1) Determine the largest size k of the itemsets in F′.
(2) Collect from the other nodes \( P_{i} \) the frequencies of all local frequent k-itemsets in F′ to obtain their global frequencies.
(3) Check each local frequent k-itemset Q in F′: if Q is not a global frequent itemset, delete Q from F′; otherwise go to (4).
(4) By Theorem 3, add Q and every nonempty subset of Q to the global frequent itemsets F, and delete Q and every nonempty subset of Q from F′.
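The four steps above can be sketched as a single-process simulation. Two assumptions are ours rather than the paper's: the steps are repeated until F′ is empty, and the "nodes" are plain in-memory lists instead of remote machines. The names `top_down_prune`, `nonempty_subsets`, and `global_frequency` are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple

Itemset = FrozenSet[str]
Database = List[Itemset]


def nonempty_subsets(itemset: Itemset) -> Set[Itemset]:
    """All nonempty subsets of itemset, including itemset itself."""
    items = sorted(itemset)
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)}


def global_frequency(itemset: Itemset, local_dbs: List[Database]) -> int:
    """Sum of the local frequencies X.s_i over all nodes."""
    return sum(1 for db in local_dbs for t in db if itemset <= t)


def top_down_prune(f_prime: Set[Itemset], local_dbs: List[Database],
                   min_sup: float) -> Tuple[Set[Itemset], int]:
    """Return (global frequent itemsets F, number of frequency requests)."""
    total = sum(len(db) for db in local_dbs)           # M
    remaining, frequent, requests = set(f_prime), set(), 0
    while remaining:                                   # assumed outer loop
        k = max(len(x) for x in remaining)             # step (1)
        for q in [x for x in remaining if len(x) == k]:
            requests += 1                              # step (2)
            if global_frequency(q, local_dbs) >= min_sup * total:
                subsets = nonempty_subsets(q)          # step (4), Theorem 3
                frequent |= subsets
                remaining -= subsets
            else:
                remaining.discard(q)                   # step (3)
    return frequent, requests


# Toy data (assumed for illustration).
db1 = [frozenset("abc"), frozenset("abc"), frozenset("d")]
db2 = [frozenset("ab"), frozenset("d"), frozenset("d")]
f_prime = {frozenset(s) for s in ("abc", "ab", "ac", "bc", "a", "b", "c", "d")}
F, requests = top_down_prune(f_prime, [db1, db2], 0.5)
print(sorted("".join(sorted(x)) for x in F), requests)
```

In this toy run, accepting {a, b} also settles {a} and {b} without asking the nodes for their frequencies, so only six of the eight candidates are counted. This is the sense in which pruning reduces communication traffic.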
The pseudocode of MFIM is as follows:
Algorithm MFIM
Input: The local transaction databases \( DB_{i} \), each with \( M_{i} \) tuples, where \( M = \sum\nolimits_{i = 1}^{n} {M_{i} } \); the n nodes \( P_{i} \) (i = 1, 2, …, n); the center node \( P_{0} \); and the minimum support threshold min_sup.
Output: The global frequent itemsets F.
Method: Perform the following steps:
Step 1: /* distribute the data by the horizontal projection method */
Step 2: /* each node uses the FP-growth algorithm to produce its local frequent itemsets with an FP-tree and MapReduce */
Step 3: /* \( P_{0} \) forms the union of all local frequent itemsets and prunes it */
Step 4: /* compute the global frequency of the itemsets */
Step 5: /* obtain the global frequent itemsets by MapReduce */
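Steps 4 and 5 can be read as a map phase that emits each candidate's local frequency per node, followed by a reduce phase that sums them and applies the threshold. The paper does not give its MapReduce code, so the following is only a plain-Python simulation of that pattern, not a Hadoop job; `map_phase`, `reduce_phase`, and `global_frequent` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Itemset = FrozenSet[str]
Database = List[Itemset]


def map_phase(node_db: Database,
              candidates: Iterable[Itemset]) -> List[Tuple[Itemset, int]]:
    """On one node: emit (itemset, local frequency X.s_i) pairs."""
    return [(x, sum(1 for t in node_db if x <= t)) for x in candidates]


def reduce_phase(pairs: Iterable[Tuple[Itemset, int]]) -> Dict[Itemset, int]:
    """Sum the local frequencies of each itemset to get X.s."""
    totals: Dict[Itemset, int] = defaultdict(int)
    for itemset, count in pairs:
        totals[itemset] += count
    return dict(totals)


def global_frequent(local_dbs: List[Database], candidates: List[Itemset],
                    min_sup: float) -> Set[Itemset]:
    """Keep the candidates with X.s >= min_sup * M."""
    total = sum(len(db) for db in local_dbs)                       # M
    pairs = [p for db in local_dbs for p in map_phase(db, candidates)]
    totals = reduce_phase(pairs)
    return {x for x, s in totals.items() if s >= min_sup * total}


# Toy data (assumed for illustration).
db1 = [frozenset("abc"), frozenset("abc"), frozenset("d")]
db2 = [frozenset("ab"), frozenset("d"), frozenset("d")]
candidates = [frozenset("ab"), frozenset("c"), frozenset("d")]
totals = reduce_phase(p for db in (db1, db2) for p in map_phase(db, candidates))
result = global_frequent([db1, db2], candidates, 0.5)
print({"".join(sorted(k)): v for k, v in totals.items()})
print(sorted("".join(sorted(x)) for x in result))
```

In a real deployment the map calls would run on the nodes that hold each \( DB_{i} \), and only the (itemset, count) pairs would travel over the network.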
4 Experiments of MFIM
This paper compares MFIM with the classical parallel algorithms CD and FDM, which were implemented in VC++ 6.0, in terms of communication traffic and runtime. The experiments use five nodes in addition to the center node. The experimental data are the June 2012 sales records of a supermarket. The results are reported in Figs. 1 and 2.
The comparison shows that, under the same minimum support threshold, MFIM has lower communication traffic and a shorter runtime than CD and FDM.
5 Conclusions
MFIM has each node compute its local frequent itemsets independently with the FP-growth algorithm and MapReduce; the center node then exchanges data with the other nodes and combines the results using the top–down strategy. In the experiments reported here, this reduced communication traffic and runtime relative to CD and FDM, improving the efficiency of data mining.
References
1. Chen, Z.B., Han, H., Wang, J.X.: Data Warehouse and Data Mining. Tsinghua University Press, Beijing (2009)
2. Agrawal, R., Srikant, R.: Fast algorithms for mining frequent itemsets. In: Proceedings of the 20th International Conference Very Large Data Base, Santiago, pp. 487–499 (1994)
3. Park, J.S., Chen, M.S., Yu, P.S.: Efficient distributed data mining for frequent itemsets. In: Proceedings of the 4th International Conference on Information and Knowledge Management, Baltimore, pp. 31–36 (1995)
4. Agrawal, R., Shafer, J.C.: Distributed mining of frequent itemsets. IEEE Trans. Knowl. Data Eng. 8(6), 962–969 (1996)
5. Cheung, D.W., Han, J.W., Ng, W.T., Tu, Y.J.: A fast distributed algorithm for mining association rules. In: Proceedings of IEEE 4th International Conference on Management of Data, Miami Beach, pp. 31–34 (1996)
6. He, B.: Fast mining of global maximum frequent itemsets in distributed database. Control Decis. 26(8), 1214–1218 (2011). (in Chinese with English abstract)
Acknowledgments
This research is supported by the Fundamental and Advanced Research Projects of Chongqing under grant No. CSTC2013JCYJA40039 and the Science and Technology Research Projects of Chongqing Board of Education under grant No. KJ130825. It is also supported by the fund of the State Key Laboratory for Novel Software Technology at Nanjing University under grant No. KFKT2013B23 and by the Shenzhen Key Laboratory for High-Performance Data Mining with the Shenzhen New Industry Development Fund under grant No. CXB201005250021A.
© 2014 Springer India
About this paper
Cite this paper
He, B. (2014). The Algorithm of Mining Frequent Itemsets Based on MapReduce. In: Patnaik, S., Li, X. (eds) Proceedings of International Conference on Soft Computing Techniques and Engineering Application. Advances in Intelligent Systems and Computing, vol 250. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1695-7_62
DOI: https://doi.org/10.1007/978-81-322-1695-7_62
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-1694-0
Online ISBN: 978-81-322-1695-7
eBook Packages: Engineering, Engineering (R0)