1 Introduction

To solve the problem of association rule mining in the era of big data, many researchers have proposed algorithms that use parallel computing technologies. Yang X. Y. [4] proposed a distributed Apriori algorithm using Hadoop’s MapReduce framework; however, this algorithm causes a large number of I/O operations. Qiu H. [3] proposed YAFIM (Yet Another Frequent Itemset Mining), a parallel frequent itemset mining algorithm on Spark. Experiments show that, compared with algorithms implemented on Hadoop, YAFIM achieves an 18\(\times \) speedup on average across various benchmarks [3]. However, YAFIM still takes many hours to process GB-scale data. In 2015, Zhang F. proposed DPBM, a distributed matrix-based pruning algorithm on Spark [1]. Experiments show that it is faster than YAFIM, but it has a critical weakness: when the number of transactions grows large enough, its computation becomes slow and inefficient.

To solve these problems, this paper proposes FISM, a new distributed frequent itemset mining algorithm based on a sparse boolean matrix on Spark. First, we replace the dense boolean matrix of DPBM with a sparse matrix. Second, we improve the generation of candidate frequent itemsets. The experiments show that FISM outperforms DPBM, YAFIM, Parallel FP-Growth and other algorithms in both speed and scalability.

2 Distributed Frequent Itemsets Mining Algorithm by Using Sparse Boolean Matrix on Spark (FISM)

Assume that the dataset D contains n transactions, denoted T = \(T_1\), \(T_2\), \(\cdots \), \(T_n\), and m items, denoted I = \(I_1\), \(I_2\), \(\cdots \), \(I_m\). The main steps of FISM are as follows:

First, scan the dataset once and, for every item \(I_i\) (i = \(1,2,3,\cdots \), m), build the vector \(V_i\) = (\(b_{i1}\), \(b_{i2}\), \(b_{i3}\), \(\cdots \), \(b_{in}\)), where

$$ b_{ij}= \left\{ \begin{aligned} 1&I_i\in T_j \\ 0&I_i\notin T_j \end{aligned} \right. $$
Fig. 1. Sparse Boolean matrix B

(j = \(1,2,3,\cdots \), n; i = \(1,2,3,\cdots \), m); that is, \(b_{ij}\) = 0 means that transaction \(T_j\) does not contain item \(I_i\). These m vectors make up the boolean matrix B, which has m rows and n columns. Because the number of transactions can be very large, B is implemented as a sparse matrix. Figure 1 shows the actual data structure of B: each row stores only the IDs of the transactions that contain the item. For example, if the i-th row is the list (\(4,7,68,6457,\cdots \)), then transactions 4, 7, 68 and 6457 contain item \(I_i\) and all other transactions do not. From B, the support count of item \(I_i\) is simply the length of its row.
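A minimal sketch of this construction in Spark (Scala) is shown below. The input path, the one-transaction-per-line format with space-separated items, and all identifiers are assumptions made for illustration, not details specified by the paper:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BuildSparseMatrix {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FISM-matrix"))

    // One transaction per line, items separated by spaces (assumed format).
    val transactions = sc.textFile("hdfs:///data/transactions.txt")

    // Pair every item with the ID of each transaction that contains it,
    // then collect the sorted transaction-ID list per item: one row of B.
    val matrixB = transactions
      .zipWithIndex()                                  // (line, transaction ID)
      .flatMap { case (line, tid) =>
        line.split(" ").distinct.map(item => (item, tid))
      }
      .groupByKey()
      .mapValues(_.toArray.sorted)

    // The support count of an item is simply the length of its row.
    val supportCounts = matrixB.mapValues(_.length)
  }
}
```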

Definition 1

The “AND” result of an itemset E is defined as follows:

Suppose that E consists of h items \(I_1\), \(I_2\), \(I_3\), \(\ldots \), \(I_h\), and that their corresponding rows in the sparse boolean matrix B are \(V_1\), \(V_2\), \(V_3\), \(\ldots \), \(V_h\). Let \(V_{result}\) = \(V_1\) \(\&\) \(V_2\) \(\&\) \(V_3\) \(\&\) \(\cdots \) \(\&\) \(V_h\). Then \(V_{result}\) is the “AND” result of E, where the symbol “\(\&\)” denotes the “AND” operation between two vectors.
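Because each row of B is stored as a sorted list of transaction IDs, the “\(\&\)” operation reduces to a sorted-list intersection. A small illustrative sketch in Scala (the function names are ours, not the paper's):

```scala
// Intersect two rows of B (sorted transaction-ID arrays): the result
// contains exactly the transactions that contain both operands.
def andVectors(a: Array[Long], b: Array[Long]): Array[Long] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[Long]
  var i = 0; var j = 0
  while (i < a.length && j < b.length) {
    if (a(i) == b(j)) { out += a(i); i += 1; j += 1 }
    else if (a(i) < b(j)) i += 1
    else j += 1
  }
  out.toArray
}

// The "AND" result of an itemset E = {I1, ..., Ih} folds its rows:
def andResult(rows: Seq[Array[Long]]): Array[Long] =
  rows.reduce(andVectors)
```

The support count of E is then just the length of `andResult(rows)`.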

It is easy to generate (k + 1)-candidate frequent itemsets from the k-frequent itemsets. After we obtain all k-frequent itemsets, we store the “AND” result (see Definition 1) of every k-frequent itemset in a list. Every (k + 1)-candidate frequent itemset consists of a k-frequent itemset, called \(itemset_{old}\), plus one new item \(I_{new}\); the “AND” result of \(itemset_{old}\) is called \(V_{old}\). \(V_{old}\) is then reused in the “AND” operation between \(V_{old}\) and \(I_{new}\)’s corresponding row of the sparse boolean matrix B. In this way, we do not need to perform k “AND” operations to confirm each (k + 1)-candidate itemset; a single “AND” between \(V_{old}\) and \(I_{new}\)’s boolean vector suffices. In addition, we implemented the “AND” operation with multithreading.
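A sketch of this reuse, assuming (with names of our own choosing) that `frequentK` maps each k-frequent itemset to its cached “AND” result and `matrixB` maps each item to its row of B; it uses the `andVectors` helper from the previous sketch:

```scala
// Confirm (k+1)-candidates with one intersection each: reuse the cached
// AND result vOld of the k-frequent itemset instead of recomputing it
// from the k individual item rows.
def extend(
    frequentK:  Map[Set[String], Array[Long]],  // itemset -> cached V_old
    matrixB:    Map[String, Array[Long]],       // item    -> row of B
    minSupport: Int): Map[Set[String], Array[Long]] =
  (for {
    (itemsetOld, vOld) <- frequentK.toSeq
    (iNew, rowNew)     <- matrixB if !itemsetOld.contains(iNew)
    vNew = andVectors(vOld, rowNew)             // the single AND operation
    if vNew.length >= minSupport                // support check
  } yield (itemsetOld + iNew, vNew)).toMap      // toMap drops duplicate candidates
```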

In general, traditional frequent itemset mining algorithms scan the dataset many times, which causes a large number of I/O operations, while FISM scans the dataset only once and uses the sparse matrix to accelerate the discovery of frequent itemsets. In theory, FISM outperforms the traditional Apriori algorithm by up to one order of magnitude, and the experiments confirm this.

3 Experimental Results

In this section, we compare FISM with MRApriori [4], YAFIM [3], DPBM [1] and PFP-growth (the parallel FP-Growth [2] algorithm implemented by the Spark team) to evaluate its performance. We implemented all the algorithms on Spark 1.5.0. All the datasets are stored in HDFS, and the cluster consists of 4 nodes, each with 4 Intel Xeon cores at 2.60 GHz, 22.5 GB of memory and a 1 TB disk. The operating system is CentOS 6.5 and the JDK version is 1.7. Last but not least, all the algorithms mentioned above produce exactly the same results. Table 1 lists the characteristics of the datasets; Table 2 shows the experimental results.

Table 1. Detailed properties of the datasets
Table 2. Experimental results

We can see that for every repetition and every dataset, FISM outperforms all the other algorithms. Across all the datasets, FISM is 1.8\(\times \) faster than PFP-growth, 20\(\times \) faster than YAFIM and 10\(\times \) faster than DPBM.

Figure 2 shows the sizeup performance of all algorithms. The x-axis shows how many times the dataset is replicated. FISM is always faster than all the other algorithms. MRApriori and YAFIM are not included in Fig. 2 because they take more than 24 h. We can also see that as the dataset grows, FISM’s running time increases in a nearly linear way.

Fig. 2. The sizeup performance of each algorithm

4 Conclusions

To accelerate frequent itemset mining in the big data era, this paper proposed FISM, a new distributed frequent itemset mining algorithm based on a sparse boolean matrix on Spark. FISM generates a sparse boolean matrix that records whether each item is contained in each transaction, and then uses the matrix to find all frequent itemsets. In this way, the dataset only needs to be scanned once. The experiments show that FISM is 1.8\(\times \) faster than PFP-growth, 20\(\times \) faster than YAFIM and 10\(\times \) faster than DPBM. In addition, FISM also scales better.