1 Introduction

To solve the problem of association rule mining in the era of big data, many researchers have proposed algorithms that use parallel computing technologies. Yang X. Y. [4] proposed a distributed Apriori algorithm using Hadoop’s MapReduce framework; however, this algorithm causes a large number of I/O operations. Qiu H. [3] proposed YAFIM (Yet Another Frequent Itemset Mining), a parallel frequent itemset mining algorithm on Spark. Experiments show that, compared with algorithms implemented on Hadoop, YAFIM achieves an 18\(\times \) speedup on average across various benchmarks [3]. However, YAFIM still takes many hours to process GB-scale data. In 2015, Zhang F. proposed DPBM, a distributed matrix-based pruning algorithm on Spark [1]. Experiments show that it is faster than YAFIM, but it has a critical weakness: when the number of transactions grows large enough, its computation becomes slow and inefficient.

To solve these problems, this paper proposes FISM, a new distributed frequent itemset mining algorithm based on a sparse boolean matrix on Spark. First, we replace the dense boolean matrix of DPBM with a sparse matrix. Second, we improve the generation of candidate frequent itemsets. The experiments show that FISM outperforms DPBM, YAFIM, Parallel FP-Growth and other algorithms in both speed and scalability.

2 Distributed Frequent Itemsets Mining Algorithm by Using Sparse Boolean Matrix on Spark (FISM)

Assume that the dataset D contains n transactions, denoted T = \(T_1\), \(T_2\), \(\cdots \), \(T_n\), and m items, denoted I = \(I_1\), \(I_2\), \(\cdots \), \(I_m\). The main steps of FISM are as follows:

First, scan the dataset once and, for every item \(I_i\) (i = \(1,2,3,\cdots \), m), build the vector \(V_i\) = (\(b_{i1}\), \(b_{i2}\), \(b_{i3}\), \(\cdots \), \(b_{in}\)), where

$$ b_{ij}= \left\{ \begin{aligned} 1&I_i\in T_j \\ 0&I_i\notin T_j \end{aligned} \right. $$
Fig. 1. Sparse Boolean matrix B

(j = \(1,2,3,\cdots \), n; i = \(1,2,3,\cdots \), m); that is, \(b_{ij}\) = 0 means that transaction \(T_j\) does not contain item \(I_i\). These m vectors make up the boolean matrix B, which has m rows and n columns. Because the number of transactions can be very large, B is implemented as a sparse matrix. Figure 1 shows the actual data structure of B: each row stores only the IDs of the transactions that contain the item. For example, if the i-th row is the list (\(4,7,68,6457,\cdots \)), then transactions 4, 7, 68 and 6457 contain item \(I_i\) and all other transactions do not. From B, the support count of item \(I_i\) is simply the length of its row.
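A minimal sketch of this construction in Spark (Scala) is shown below. The input path, the one-transaction-per-line format with space-separated items, and all identifiers are assumptions made for illustration, not details specified by the paper:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BuildSparseMatrix {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FISM-matrix"))

    // One transaction per line, items separated by spaces (assumed format).
    val transactions = sc.textFile("hdfs:///data/transactions.txt")

    // Pair every item with the ID of each transaction that contains it,
    // then collect the sorted transaction-ID list per item: one row of B.
    val matrixB = transactions
      .zipWithIndex()                                  // (line, transaction ID)
      .flatMap { case (line, tid) =>
        line.split(" ").distinct.map(item => (item, tid))
      }
      .groupByKey()
      .mapValues(_.toArray.sorted)

    // The support count of an item is simply the length of its row.
    val supportCounts = matrixB.mapValues(_.length)
  }
}
```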

Definition 1

The “AND” result of an itemset E is defined as follows:

Suppose that E consists of h items \(I_1\), \(I_2\), \(I_3\), \(\ldots \), \(I_h\), and that their corresponding rows in the sparse boolean matrix B are \(V_1\), \(V_2\), \(V_3\), \(\ldots \), \(V_h\). Let \(V_{result}\) = \(V_1\) \(\&\) \(V_2\) \(\&\) \(V_3\) \(\&\) \(\cdots \) \(\&\) \(V_h\). Then \(V_{result}\) is the “AND” result of E, where the symbol “\(\&\)” denotes the “AND” operation between two vectors.
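Because each row of B is stored as a sorted list of transaction IDs, the “\(\&\)” operation reduces to a sorted-list intersection. A small illustrative sketch in Scala (the function names are ours, not the paper's):

```scala
// Intersect two rows of B (sorted transaction-ID arrays): the result
// contains exactly the transactions that contain both operands.
def andVectors(a: Array[Long], b: Array[Long]): Array[Long] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[Long]
  var i = 0; var j = 0
  while (i < a.length && j < b.length) {
    if (a(i) == b(j)) { out += a(i); i += 1; j += 1 }
    else if (a(i) < b(j)) i += 1
    else j += 1
  }
  out.toArray
}

// The "AND" result of an itemset E = {I1, ..., Ih} folds its rows:
def andResult(rows: Seq[Array[Long]]): Array[Long] =
  rows.reduce(andVectors)
```

The support count of E is then just the length of `andResult(rows)`.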

It is easy to generate (k + 1)-candidate frequent itemsets from the k-frequent itemsets. After we obtain all k-frequent itemsets, we store the “AND” result (see Definition 1) of every k-frequent itemset in a list. Every (k + 1)-candidate frequent itemset consists of a k-frequent itemset, called \(itemset_{old}\), plus one new item \(I_{new}\); the “AND” result of \(itemset_{old}\) is called \(V_{old}\). \(V_{old}\) is then reused in the “AND” operation between \(V_{old}\) and \(I_{new}\)’s corresponding row of the sparse boolean matrix B. In this way, we do not need to perform k “AND” operations to confirm each (k + 1)-candidate itemset; a single “AND” between \(V_{old}\) and \(I_{new}\)’s boolean vector suffices. In addition, we implemented the “AND” operation with multithreading.
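A sketch of this reuse, assuming (with names of our own choosing) that `frequentK` maps each k-frequent itemset to its cached “AND” result and `matrixB` maps each item to its row of B; it uses the `andVectors` helper from the previous sketch:

```scala
// Confirm (k+1)-candidates with one intersection each: reuse the cached
// AND result vOld of the k-frequent itemset instead of recomputing it
// from the k individual item rows.
def extend(
    frequentK:  Map[Set[String], Array[Long]],  // itemset -> cached V_old
    matrixB:    Map[String, Array[Long]],       // item    -> row of B
    minSupport: Int): Map[Set[String], Array[Long]] =
  (for {
    (itemsetOld, vOld) <- frequentK.toSeq
    (iNew, rowNew)     <- matrixB if !itemsetOld.contains(iNew)
    vNew = andVectors(vOld, rowNew)             // the single AND operation
    if vNew.length >= minSupport                // support check
  } yield (itemsetOld + iNew, vNew)).toMap      // toMap drops duplicate candidates
```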

In general, traditional frequent itemset mining algorithms scan the dataset many times, which causes a large number of I/O operations, while FISM scans the dataset only once and uses the sparse matrix to accelerate the discovery of frequent itemsets. In theory, FISM outperforms the traditional Apriori algorithm by up to one order of magnitude, and the experiments confirm this.

3 Experimental Results

In this section, we compare FISM with MRApriori [4], YAFIM [3], DPBM [1] and PFP-growth (the parallel FP-Growth [2] algorithm implemented by the Spark team) to evaluate its performance. We implemented all the algorithms on Spark 1.5.0. All the datasets are stored in HDFS, and the cluster consists of 4 nodes, each with 4 Intel Xeon cores at 2.60 GHz, 22.5 GB of memory and a 1 TB disk. The operating system is CentOS 6.5 and the JDK version is 1.7. Last but not least, all the algorithms mentioned above produce exactly the same results. Table 1 lists the characteristics of the datasets; Table 2 shows the experimental results.

Table 1. Detailed properties of the datasets
Table 2. Experimental results

We can see that for every repetition and every dataset, FISM outperforms all the other algorithms. Across all the datasets, FISM is 1.8\(\times \) faster than PFP-growth, 20\(\times \) faster than YAFIM and 10\(\times \) faster than DPBM.

Figure 2 shows the sizeup performance of all algorithms. The x-axis shows how many times the dataset is replicated. FISM is always faster than all the other algorithms. MRApriori and YAFIM are not included in Fig. 2 because they take more than 24 h. We can also see that as the dataset grows, FISM’s running time increases in a nearly linear way.

Fig. 2. The sizeup performance of each algorithm

4 Conclusions

To accelerate frequent itemset mining in the big data era, this paper proposed FISM, a new distributed frequent itemset mining algorithm based on a sparse boolean matrix on Spark. FISM generates a sparse boolean matrix that records whether each item is contained in each transaction, and then uses the matrix to find all frequent itemsets. In this way, the dataset only needs to be scanned once. The experiments show that FISM is 1.8\(\times \) faster than PFP-growth, 20\(\times \) faster than YAFIM and 10\(\times \) faster than DPBM. In addition, FISM also scales better.