1 Introduction

The key to mining association rules is finding frequent itemsets [1]. There are various serial algorithms for mining association rules, such as Apriori [2]. However, the databases mined for association rules are generally large, so traditional serial algorithms are time-consuming. To improve efficiency, several parallel mining algorithms have been proposed, including PDM [3], CD [4], and FDM [5]. Most of them divide the global transaction database horizontally into n equal partitions. In addition, most parallel mining algorithms adopt Apriori-like candidate generation, so many candidate itemsets are produced and the database is scanned repeatedly. Cloud computing is large-scale and highly scalable, and data mining based on cloud computing has become an important field. This paper therefore proposes MFIM, an algorithm for mining frequent itemsets based on MapReduce.

2 Related Description

The global transaction database is DB, and its total number of tuples is M. Suppose P 1, P 2, …, P n are n nodes (node for short). If DB i (i = 1, 2, …, n) is a part of DB with M i tuples stored at node P i , then \( DB = \bigcup\nolimits_{i = 1}^{n} {DB_{i} } ,\;M = \sum\nolimits_{i = 1}^{n} {M_{i} } \). Mining association rules can then be described as follows: each node P i processes its local database DB i and communicates with the other nodes; finally, the global frequent itemsets of the global transaction database are obtained by MapReduce.

Definition 1

For an itemset X, the number of tuples that contain X in the local database DB i (i = 1, 2, …, n) is defined as the local frequency of X, symbolized as X.s i .

Definition 2

For an itemset X, the number of tuples that contain X in the global database is defined as the global frequency of X, symbolized as X.s .

Definition 3

For an itemset X, if X.s i ≥ min_sup*M i (i = 1, 2, …, n), then X is a local frequent itemset of DB i ; the set of local frequent itemsets of DB i is symbolized as F i . Here min_sup is the minimum support threshold.

Definition 4

For an itemset X, if X.s ≥ min_sup*M, then X is a global frequent itemset; the set of global frequent itemsets is symbolized as F, and the set of global frequent k-itemsets (|X| = k) as F k .
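Definitions 1–4 can be expressed directly as a minimal Python sketch. The function names and the representation of a database as a list of `frozenset` tuples are illustrative choices, not part of the paper:

```python
def local_frequency(x, db_i):
    """Definition 1: number of tuples in local database DB_i that contain X."""
    return sum(1 for t in db_i if x <= t)

def global_frequency(x, partitions):
    """Definition 2: X.s is the sum of the local frequencies X.s_i."""
    return sum(local_frequency(x, db_i) for db_i in partitions)

def is_locally_frequent(x, db_i, min_sup):
    """Definition 3: X.s_i >= min_sup * M_i."""
    return local_frequency(x, db_i) >= min_sup * len(db_i)

def is_globally_frequent(x, partitions, min_sup):
    """Definition 4: X.s >= min_sup * M, where M = sum of all M_i."""
    m = sum(len(db_i) for db_i in partitions)
    return global_frequency(x, partitions) >= min_sup * m
```

Note that the global frequency is simply the sum over partitions, which is what lets MFIM distribute counting across nodes.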

Theorem 1

If an itemset X is a local frequent itemset of DB i , then any nonempty subset of X is also a local frequent itemset of DB i .

Theorem 2

If an itemset X is a global frequent itemset, then X and every nonempty subset of X are local frequent itemsets of at least one local database.

Theorem 3

If an itemset X is a global frequent itemset, then any nonempty subset of X is also a global frequent itemset.
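Theorem 3 is what step (4) of the pruning strategy below exploits: once an itemset is known to be globally frequent, all of its nonempty subsets can be marked frequent without further counting. A small helper (name is illustrative) enumerates those subsets:

```python
from itertools import combinations

def nonempty_subsets(x):
    """All nonempty subsets of itemset X; by Theorem 3, if X is globally
    frequent, every one of these is globally frequent as well."""
    items = sorted(x)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]
```

An itemset of size k yields 2^k − 1 nonempty subsets, so pruning a single large frequent itemset can settle many smaller itemsets at once.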

3 MFIM Algorithm

MFIM distributes data by the horizontal projection method, which divides the M tuples of the global transaction database into M 1, M 2, …, M n (\( \sum\nolimits_{i = 1}^{n} {M_{i} } = M \)). The set of M i tuples at the ith node is \( \left\{ {T_{i}^{j} |T_{i}^{j} = O_{q} \;{\text{and}}\;q = n \times \left( {j - 1} \right) + i} \right\} \), where \( T_{i}^{j} \) is the jth tuple of the ith node and O q is the qth tuple of the global transaction database DB. DB is thus divided into n local databases DB 1, DB 2, …, DB n , each of size about \( \left\lfloor \frac{M}{n} \right\rfloor \), namely \( DB = \bigcup\nolimits_{i = 1}^{n} {DB_{i} } \). Because DB i takes tuples from DB at a regular separation distance, the global transaction database is divided evenly into the n local databases, and MFIM reduces data skew.
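The assignment q = n × (j − 1) + i is a round-robin split: node i receives every nth tuple starting from the ith. With 0-indexed Python lists this is a one-line slice (the function name is illustrative):

```python
def distribute(db, n):
    """Horizontal round-robin projection: node i (1-indexed) receives the
    tuples O_q with q = n*(j-1) + i, i.e. every n-th tuple starting at i.
    With 0-indexed lists, node index i0 gets db[i0::n]."""
    return [db[i0::n] for i0 in range(n)]
```

Each partition ends up with either floor(M/n) or ceil(M/n) tuples, which is why this scheme balances the load across nodes.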

MFIM designates one node P 0 as the center node; every other node P i sends its local frequent itemsets F i to P 0. P 0 forms the union F′ (\( F^{\prime } = \bigcup\nolimits_{i = 1}^{n} {F_{i} } \)), which is pruned by the top–down strategy. P 0 then sends the remainder of F′ to the other nodes. For each local frequent itemset d in the remainder of F′, P 0 collects the local frequency d.s i of d from each node and obtains the global frequency d.s. The global frequent itemsets are obtained by MapReduce.
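The center node's aggregation step can be sketched as a reduce-style merge. This is a simplified sketch under assumptions: each node reports its F i as a set and its local frequencies as a dict, the pruning round is omitted, and all names are illustrative:

```python
def merge_local_results(local_frequent_sets, local_counts, m_total, min_sup):
    """P_0 unions the local frequent itemsets F_i into F', then sums the
    local frequencies d.s_i reported by each node into the global frequency
    d.s, keeping the itemsets with d.s >= min_sup * M."""
    f_prime = set().union(*local_frequent_sets)     # F' = union of all F_i
    global_counts = {d: 0 for d in f_prime}
    for counts in local_counts:                     # one dict per node P_i
        for d in f_prime:
            global_counts[d] += counts.get(d, 0)    # d.s += d.s_i
    return {d for d in f_prime if global_counts[d] >= min_sup * m_total}
```

Note the use of `counts.get(d, 0)`: an itemset may be locally frequent at one node but absent from another node's report, in which case that node still has to be asked for its count; here we conservatively treat a missing count as zero.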

F′ is pruned by the top–down strategy, which lessens communication traffic.

The top–down strategy is described as follows:

  1. Determine the largest size k of the itemsets in F′.

  2. Collect from the other nodes P i the global frequency of every local frequent k-itemset in F′.

  3. Examine each local frequent k-itemset Q in F′: if Q is not a global frequent itemset, delete Q from F′; otherwise go to (4).

  4. By Theorem 3, add Q and every nonempty subset of Q to the global frequent itemsets F, and delete Q and every nonempty subset of Q from F′.
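The four steps above can be sketched as a loop over decreasing itemset sizes. This is an assumption-laden sketch: `global_count` is a stand-in callable for the frequency-collection round with the other nodes, and all names are illustrative:

```python
from itertools import combinations

def nonempty_subsets(x):
    """All nonempty subsets of itemset X (Theorem 3)."""
    items = sorted(x)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def top_down_prune(f_prime, global_count, m, min_sup):
    """Top-down strategy: repeatedly take the largest itemsets left in F'.
    Non-frequent ones are deleted; for each globally frequent Q, Q and all
    its nonempty subsets move into F without being counted again."""
    f_prime, f = set(f_prime), set()
    while f_prime:
        k = max(len(x) for x in f_prime)            # (1) largest size k in F'
        for q in [x for x in f_prime if len(x) == k]:
            if global_count(q) >= min_sup * m:      # (2)-(3) frequency test
                for s in nonempty_subsets(q):       # (4) Theorem 3 closure
                    f.add(s)
                    f_prime.discard(s)              # subsets need no counting
            else:
                f_prime.discard(q)                  # (3) delete non-frequent Q
    return f
```

The saving in communication traffic comes from step (4): subsets of a confirmed frequent itemset never trigger a frequency-collection round of their own.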

The pseudo code of MFIM is described as follows:

Algorithm MFIM

Input: The local transaction databases DB i with M i tuples each, where \( M = \sum\nolimits_{i = 1}^{n} {M_{i} } \); n nodes P i (i = 1, 2, …, n); the center node P 0 ; the minimum support threshold min_sup.

Output: The global frequent itemsets F.

Method: Proceed according to the following steps:

Step 1: /* distributing data according to horizontal projection method*/

Step 2: /*each node adopts FP-growth algorithm to produce local frequent itemsets by FP-tree and mapReduce*/

Step 3: /* P 0 gets the union of all local frequent itemsets and prunes*/

Step 4: /*computing global frequency of itemsets*/

Step 5: /*getting global frequent itemsets by mapReduce*/
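Steps 4 and 5 amount to a MapReduce-style frequency count over the pruned candidate set. A minimal sketch, assuming each node runs the map phase on its own DB i and P 0 runs the reduce phase; the function names and data shapes are illustrative, and the FP-growth-based local mining of Step 2 is not shown:

```python
def map_phase(db_i, candidates):
    """Map (Step 4): a node emits (itemset, local count) pairs for the
    candidate itemsets it receives from P_0."""
    return [(x, sum(1 for t in db_i if x <= t)) for x in candidates]

def reduce_phase(mapped, m, min_sup):
    """Reduce (Step 5): sum the local counts into global frequencies and
    keep the globally frequent itemsets."""
    totals = {}
    for pairs in mapped:                  # one list of pairs per node
        for x, count in pairs:
            totals[x] = totals.get(x, 0) + count
    return {x for x, count in totals.items() if count >= min_sup * m}
```

In a real Hadoop-style deployment the pairs would be shuffled by key between the two phases; here the grouping is done in a single dict for clarity.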

4 Experiments of MFIM

This paper compares MFIM with the classical parallel algorithms CD and FDM, which were implemented in VC++ 6.0 for the comparison. MFIM is compared with CD and FDM in terms of communication traffic and runtime. In the experiments, five nodes were tested in addition to the center node. The experimental data comes from the sales data of a supermarket in June 2012. The results are reported in Figs. 1 and 2.

Fig. 1. Comparison of communication traffic

Fig. 2. Comparison of runtime

The experimental results indicate that, under the same minimum support threshold, MFIM has lower communication traffic and runtime than both CD and FDM.

5 Conclusions

MFIM lets each node compute its local frequent itemsets independently with the FP-growth algorithm and MapReduce; the center node then exchanges data with the other nodes and combines the results by the top–down strategy. This greatly improves the efficiency of data mining.