1 Introduction

The key to mining association rules is finding frequent itemsets [1]. There are various serial algorithms for mining association rules, such as Apriori [2]. However, the databases mined for association rules are generally large, so traditional serial algorithms are time-consuming. To improve efficiency, several parallel mining algorithms have been proposed, including PDM [3], CD [4], and FDM [5]. Most of them divide the global transaction database horizontally into n equal partitions. In addition, most parallel mining algorithms adopt Apriori-like candidate generation, so many candidate itemsets are produced and the database is scanned repeatedly. Cloud computing is large-scale and highly scalable, and data mining based on cloud computing has become an important field. This paper therefore proposes MFIM, an algorithm for mining frequent itemsets based on MapReduce.

2 Related Description

The global transaction database is DB, and its total number of tuples is M. Suppose P 1, P 2, …, P n are n nodes (node for short). If DB i (i = 1, 2, …, n) is a part of DB with M i tuples stored at node P i , then \( DB = \bigcup\nolimits_{i = 1}^{n} {DB_{i} } ,\;M = \sum\nolimits_{i = 1}^{n} {M_{i} } \). Mining association rules can then be described as follows: each node P i processes its local database DB i and communicates with the other nodes; finally, the global frequent itemsets of the global transaction database are obtained by MapReduce.

Definition 1

For an itemset X, the number of tuples that contain X in the local database DB i (i = 1, 2, …, n) is defined as the local frequency of X, symbolized as X.s i .

Definition 2

For an itemset X, the number of tuples that contain X in the global database is defined as the global frequency of X, symbolized as X.s .

Definition 3

For an itemset X, if X.s i ≥ min_sup*M i (i = 1, 2, …, n), then X is a local frequent itemset of DB i ; the set of local frequent itemsets of DB i is symbolized as F i . Here min_sup is the minimum support threshold.

Definition 4

For an itemset X, if X.s ≥ min_sup*M, then X is a global frequent itemset; the set of global frequent itemsets is symbolized as F, and the set of global frequent k-itemsets (|X| = k) as F k .
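Definitions 1–4 can be expressed directly as a minimal Python sketch. The function names and the representation of a database as a list of `frozenset` tuples are illustrative choices, not part of the paper:

```python
def local_frequency(x, db_i):
    """Definition 1: number of tuples in local database DB_i that contain X."""
    return sum(1 for t in db_i if x <= t)

def global_frequency(x, partitions):
    """Definition 2: X.s is the sum of the local frequencies X.s_i."""
    return sum(local_frequency(x, db_i) for db_i in partitions)

def is_locally_frequent(x, db_i, min_sup):
    """Definition 3: X.s_i >= min_sup * M_i."""
    return local_frequency(x, db_i) >= min_sup * len(db_i)

def is_globally_frequent(x, partitions, min_sup):
    """Definition 4: X.s >= min_sup * M, where M = sum of all M_i."""
    m = sum(len(db_i) for db_i in partitions)
    return global_frequency(x, partitions) >= min_sup * m
```

Note that the global frequency is simply the sum over partitions, which is what lets MFIM distribute counting across nodes.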

Theorem 1

If an itemset X is a local frequent itemset of DB i , then any nonempty subset of X is also a local frequent itemset of DB i .

Theorem 2

If an itemset X is a global frequent itemset, then X and every nonempty subset of X are local frequent itemsets of at least one local database.

Theorem 3

If an itemset X is a global frequent itemset, then any nonempty subset of X is also a global frequent itemset.
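Theorem 3 is what step (4) of the pruning strategy below exploits: once an itemset is known to be globally frequent, all of its nonempty subsets can be marked frequent without further counting. A small helper (name is illustrative) enumerates those subsets:

```python
from itertools import combinations

def nonempty_subsets(x):
    """All nonempty subsets of itemset X; by Theorem 3, if X is globally
    frequent, every one of these is globally frequent as well."""
    items = sorted(x)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]
```

An itemset of size k yields 2^k − 1 nonempty subsets, so pruning a single large frequent itemset can settle many smaller itemsets at once.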

3 MFIM Algorithm

MFIM distributes data by the horizontal projection method, which divides the M tuples of the global transaction database into M 1, M 2, …, M n (\( \sum\nolimits_{i = 1}^{n} {M_{i} } = M \)). The set of M i tuples at the ith node is \( \left\{ {T_{i}^{j} |T_{i}^{j} = O_{q} \;{\text{and}}\;q = n \times \left( {j - 1} \right) + i} \right\} \), where \( T_{i}^{j} \) is the jth tuple of the ith node and O q is the qth tuple of the global transaction database DB. DB is thus divided into n local databases DB 1, DB 2, …, DB n , each of size about \( \left\lfloor \frac{M}{n} \right\rfloor \), namely \( DB = \bigcup\nolimits_{i = 1}^{n} {DB_{i} } \). Because DB i takes tuples from DB at a regular separation distance, the global transaction database is divided evenly into the n local databases, and MFIM reduces data skew.
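The assignment q = n × (j − 1) + i is a round-robin split: node i receives every nth tuple starting from the ith. With 0-indexed Python lists this is a one-line slice (the function name is illustrative):

```python
def distribute(db, n):
    """Horizontal round-robin projection: node i (1-indexed) receives the
    tuples O_q with q = n*(j-1) + i, i.e. every n-th tuple starting at i.
    With 0-indexed lists, node index i0 gets db[i0::n]."""
    return [db[i0::n] for i0 in range(n)]
```

Each partition ends up with either floor(M/n) or ceil(M/n) tuples, which is why this scheme balances the load across nodes.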

MFIM designates one node P 0 as the center node; every other node P i sends its local frequent itemsets F i to P 0. P 0 forms the union F′ (\( F^{\prime } = \bigcup\nolimits_{i = 1}^{n} {F_{i} } \)), which is pruned by the top–down strategy. P 0 then sends the remainder of F′ to the other nodes. For each local frequent itemset d in the remainder of F′, P 0 collects the local frequency d.s i of d from each node and obtains the global frequency d.s. The global frequent itemsets are obtained by MapReduce.
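The center node's aggregation step can be sketched as a reduce-style merge. This is a simplified sketch under assumptions: each node reports its F i as a set and its local frequencies as a dict, the pruning round is omitted, and all names are illustrative:

```python
def merge_local_results(local_frequent_sets, local_counts, m_total, min_sup):
    """P_0 unions the local frequent itemsets F_i into F', then sums the
    local frequencies d.s_i reported by each node into the global frequency
    d.s, keeping the itemsets with d.s >= min_sup * M."""
    f_prime = set().union(*local_frequent_sets)     # F' = union of all F_i
    global_counts = {d: 0 for d in f_prime}
    for counts in local_counts:                     # one dict per node P_i
        for d in f_prime:
            global_counts[d] += counts.get(d, 0)    # d.s += d.s_i
    return {d for d in f_prime if global_counts[d] >= min_sup * m_total}
```

Note the use of `counts.get(d, 0)`: an itemset may be locally frequent at one node but absent from another node's report, in which case that node still has to be asked for its count; here we conservatively treat a missing count as zero.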

F′ is pruned by the top–down strategy, which lessens communication traffic.

The top–down strategy is described as follows:

  1. Determine the largest size k of the itemsets in F′.

  2. Collect from the other nodes P i the global frequency of every local frequent k-itemset in F′.

  3. Examine each local frequent k-itemset Q in F′: if Q is not a global frequent itemset, delete Q from F′; otherwise go to (4).

  4. By Theorem 3, add Q and every nonempty subset of Q to the global frequent itemsets F, and delete Q and every nonempty subset of Q from F′.
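The four steps above can be sketched as a loop over decreasing itemset sizes. This is an assumption-laden sketch: `global_count` is a stand-in callable for the frequency-collection round with the other nodes, and all names are illustrative:

```python
from itertools import combinations

def nonempty_subsets(x):
    """All nonempty subsets of itemset X (Theorem 3)."""
    items = sorted(x)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def top_down_prune(f_prime, global_count, m, min_sup):
    """Top-down strategy: repeatedly take the largest itemsets left in F'.
    Non-frequent ones are deleted; for each globally frequent Q, Q and all
    its nonempty subsets move into F without being counted again."""
    f_prime, f = set(f_prime), set()
    while f_prime:
        k = max(len(x) for x in f_prime)            # (1) largest size k in F'
        for q in [x for x in f_prime if len(x) == k]:
            if global_count(q) >= min_sup * m:      # (2)-(3) frequency test
                for s in nonempty_subsets(q):       # (4) Theorem 3 closure
                    f.add(s)
                    f_prime.discard(s)              # subsets need no counting
            else:
                f_prime.discard(q)                  # (3) delete non-frequent Q
    return f
```

The saving in communication traffic comes from step (4): subsets of a confirmed frequent itemset never trigger a frequency-collection round of their own.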

The pseudo code of MFIM is described as follows:

Algorithm MFIM

Input: The local transaction databases DB i with M i tuples each, where \( M = \sum\nolimits_{i = 1}^{n} {M_{i} } \); n nodes P i (i = 1, 2, …, n); the center node P 0 ; the minimum support threshold min_sup.

Output: The global frequent itemsets F.

Method: Proceed according to the following steps:

Step 1: /* distributing data according to horizontal projection method*/

Step 2: /*each node adopts FP-growth algorithm to produce local frequent itemsets by FP-tree and mapReduce*/

Step 3: /* P 0 gets the union of all local frequent itemsets and prunes*/

Step 4: /*computing global frequency of itemsets*/

Step 5: /*getting global frequent itemsets by mapReduce*/
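Steps 4 and 5 amount to a MapReduce-style frequency count over the pruned candidate set. A minimal sketch, assuming each node runs the map phase on its own DB i and P 0 runs the reduce phase; the function names and data shapes are illustrative, and the FP-growth-based local mining of Step 2 is not shown:

```python
def map_phase(db_i, candidates):
    """Map (Step 4): a node emits (itemset, local count) pairs for the
    candidate itemsets it receives from P_0."""
    return [(x, sum(1 for t in db_i if x <= t)) for x in candidates]

def reduce_phase(mapped, m, min_sup):
    """Reduce (Step 5): sum the local counts into global frequencies and
    keep the globally frequent itemsets."""
    totals = {}
    for pairs in mapped:                  # one list of pairs per node
        for x, count in pairs:
            totals[x] = totals.get(x, 0) + count
    return {x for x, count in totals.items() if count >= min_sup * m}
```

In a real Hadoop-style deployment the pairs would be shuffled by key between the two phases; here the grouping is done in a single dict for clarity.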

4 Experiments of MFIM

This paper compares MFIM with the classical parallel algorithms CD and FDM, which were implemented in VC++ 6.0 for the comparison. MFIM is compared with CD and FDM in terms of communication traffic and runtime. In the experiments, five nodes were tested in addition to the center node. The experimental data comes from the sales data of a supermarket in June 2012. The results are reported in Figs. 1 and 2.

Fig. 1. Comparison of communication traffic

Fig. 2. Comparison of runtime

The experimental results indicate that, under the same minimum support threshold, MFIM has lower communication traffic and runtime than both CD and FDM.

5 Conclusions

MFIM lets each node compute its local frequent itemsets independently with the FP-growth algorithm and MapReduce; the center node then exchanges data with the other nodes and combines the results by the top–down strategy. This greatly improves the efficiency of data mining.