1 Introduction

We are living in the Big Data era, where data grows exponentially with time and data sizes are moving from terabytes to petabytes [1]. This trend creates challenges in storing such vast amounts of data effectively and demands suitable analytical technology. Analysis of Big Data helps organisations as well as governments in decision-making and in setting policies to provide better services to people. Various data mining tools have been available for decades to extract useful information, but they fail to process large data sets because of their time and space complexity. The association rule mining (ARM) technique [2] is used to find interesting patterns, sequences or itemsets in large databases [3]. The Apriori algorithm is commonly used to implement ARM, but its effectiveness declines as the size of the data set increases, because its iterative way of working further increases the time complexity. Much work has been done to parallelise the Apriori algorithm in order to reduce the time complexity of the traditional Apriori originally proposed by R. Agrawal. As a result, several parallel Apriori algorithms have come into existence, such as count distribution (CD), candidate distribution (CaD) and data distribution (DD). These algorithms provide some key features such as dynamic itemset counting [4] and data and task parallelism [5]. However, these algorithms come with some major weaknesses: data synchronisation and communication issues due to the message passing interface (MPI) framework, which mainly supports homogeneous rather than heterogeneous environments and works only with low-level languages such as C and FORTRAN [6, 7]. Further, workload balancing [8] and fault-tolerance issues make them incapable of handling Big Data in distributed environments.

These problems led to the development of the MapReduce programming model, introduced by Google [9] for processing large databases, which enables the programmer to write code using map and reduce functions to run parallel applications. Google’s MapReduce framework [10] is one of the current approaches available to process Big Data using commodity machines or nodes in a distributed computational environment. Hadoop provides a platform to run the MapReduce programming model [11, 12] and enables developers to code analytical applications under the strong fault-tolerance guarantees offered by Hadoop. Despite the various advantages of the MapReduce model, it has also been criticised for its limitations and complexity [13]. This has led to extensive research on MapReduce characteristics to identify various issues concerning the performance and complexity of the model and its current implementations [14, 15, 16]. To overcome these difficulties, various extensions have been proposed, each of which fixes one or more limitations and drawbacks of the MapReduce framework. The scope of this paper is strictly limited to open issues of the MapReduce model and extensions that enhance it; generalised data-flow systems such as Spark, Dryad and Stratosphere are not discussed.

This paper is organised as follows. Section 2 presents an overview of Big Data and of MapReduce as a programming model as background study. Section 3 presents the parallel Apriori algorithm and its implementation on the MapReduce framework. Section 4 presents the open issues as limitations of the MapReduce model and various extensions of MapReduce that improve it. We conclude in Sect. 5 with possible future research directions.

2 Background Study

2.1 Big Data and Its Characteristics

Generally, the term Big Data is used to describe data that is very large in size and yet growing exponentially with time. It can be characterised by the following four parameters, commonly known as the “4 V” parameters: (i) volume: refers to the size of the data, (ii) velocity: refers to the speed at which data is generated, (iii) variety: refers to the nature of the data, whether structured or unstructured, and (iv) variability: refers to inconsistency in the data. In the current scenario, Big Data and its analysis are at the centre of science and business.

2.2 MapReduce as a Programming Model

MapReduce is intended to perform flexible information processing in the cloud [9]. Many programming models have been proposed under the name of processing models, such as the generic processing model, the graph processing model and the stream processing model, to solve domain-specific applications. These models are used to improve the performance of NoSQL databases. The MapReduce programming model comes under the generic processing model, which is used to address general application problems. A MapReduce program can be seen in two phases, a map phase and a reduce phase, which consist of the map function and the reduce function, respectively; the input to each function is a set of key-value pairs, and the canonical type signatures are sketched below. MapReduce algorithms can be categorised into four classes, as shown in Table 1.
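For reference, and independently of any particular implementation, the two functions can be summarised by the canonical type signatures from Google’s original formulation, where the concrete key and value types are application-specific:

```latex
% Canonical MapReduce type signatures; k_i and v_i denote application-specific key and value types.
\[
  \mathit{map}    \colon (k_1, v_1) \longrightarrow \mathrm{list}(k_2, v_2)
\]
\[
  \mathit{reduce} \colon (k_2, \mathrm{list}(v_2)) \longrightarrow \mathrm{list}(v_2)
\]
```

In the Apriori setting discussed in Sect. 3, for example, the intermediate key is an itemset and the intermediate value is its occurrence count.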

Table 1 Classification of MapReduce algorithms

3 Parallel Apriori Algorithm

3.1 Parallel Apriori Algorithm on MapReduce

First and foremost, the parallel Apriori algorithm must be written in terms of map and reduce functions to run the application on the MapReduce model. These two main functions of the MapReduce model take their input as key-value pairs and also generate their output in key-value form. The key step in the parallel Apriori algorithm is finding the frequent itemsets. Figure 1 shows the workflow for generating the frequent 1-itemsets.

Fig. 1 Finding of frequent 1-itemsets

First, HDFS divides the transactional database into data chunks (the default chunk size is 64 MB) and distributes them among different machines in key-value form, where the key represents the transaction ID (TID) and the value denotes the list of items. Each mapper, running on a different machine, is fed these key-value pairs and generates output (key-value) pairs after reading one transaction at a time, where the key is refined to represent each item and the value is the occurrence count of that item. These outputs of the mapper functions are also known as intermediate values, because they are fed to a combiner before being submitted to the reducers. The combiner shuffles and exchanges the values using the shuffle-and-sort step and consequently prepares a list of the values linked with the same key. Here, the key represents the item and the value represents the support count (which must be ≥ the minimum support) of that item.

The main task of the reducer function is to aggregate all key-value pairs and generate the final output [17]. Here, the frequent 1-itemsets are written to HDFS (the storage unit) as output at the end. Frequent k-itemsets are generated by each mapper after reading the frequent itemsets from the previous iteration and generating candidate itemsets on that basis. This process is performed in an iterative fashion to obtain the frequent k-itemsets, where each iterative step is the same as the generation of the frequent 1-itemsets [7, 18]. A minimal sketch of the first pass is given below.
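As a concrete illustration, the following is a minimal sketch of this first pass written against the Hadoop MapReduce API; it is not taken from the surveyed papers, and the class names and the fixed minimum-support threshold are illustrative assumptions. A combiner performing local summation (without the minimum-support filter, which is only valid globally) can be inserted between the mapper and the reducer as described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the first MapReduce pass: counting item supports and keeping frequent 1-itemsets.
public class FrequentOneItemsets {

  // Mapper: input key = byte offset of the transaction line (stands in for the TID),
  // input value = the transaction's item list; emits (item, 1) for every occurrence.
  public static class ItemMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text item = new Text();

    @Override
    protected void map(LongWritable tid, Text transaction, Context context)
        throws IOException, InterruptedException {
      StringTokenizer items = new StringTokenizer(transaction.toString());
      while (items.hasMoreTokens()) {
        item.set(items.nextToken());
        context.write(item, ONE);
      }
    }
  }

  // Reducer: sums the counts per item and keeps only items whose support meets the
  // minimum-support threshold. A combiner could do the local summation, but must not
  // apply the threshold, since partial counts may fall below it on a single node.
  public static class SupportReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int MIN_SUPPORT = 2; // illustrative threshold

    @Override
    protected void reduce(Text item, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int support = 0;
      for (IntWritable c : counts) {
        support += c.get();
      }
      if (support >= MIN_SUPPORT) {
        context.write(item, new IntWritable(support)); // frequent 1-itemset
      }
    }
  }
}
```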

3.2 Various Proposed Implementations of Parallel Apriori Algorithm on MapReduce

To reduce the time and space complexity of the parallel Apriori algorithm, various Apriori-like algorithms have been proposed that execute on the MapReduce framework. Broadly, these algorithms can be classified by the approach used to develop them: the 1-phase of MapReduce (with combiner) approach and the k-phase of MapReduce approach. Algorithms following the 1-phase approach execute a single MapReduce job to extract all frequent itemsets. On the other hand, algorithms following the k-phase approach execute multiple iterations of the MapReduce job [19]; the loop structure of such an approach is sketched at the end of this subsection. As a result of continuous research, an improved Apriori algorithm [20] has come into existence which further reduces the time complexity of the parallel Apriori algorithm from O(|Lk|²) to O(|Vkey|²/q), where Lk is the set of large k-itemsets, Vkey is the value list of the ith key and q is the number of reducers. Further, the pruning step of this algorithm has been improved, leading to Improved Pruning Apriori (IP-Apriori) [21].
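To make the distinction concrete, the following is a hedged sketch of the loop structure of the k-phase approach: one Hadoop job is launched per pass over the database. For brevity it reuses the pass-1 classes from the sketch in Sect. 3.1 as stand-ins (assumed to be on the classpath); a real pass k > 1 would instead use a mapper that builds candidate k-itemsets from the frequent (k-1)-itemsets of the previous pass and would stop as soon as a pass yields no frequent itemsets. The configuration keys and output paths are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a k-phase driver: one MapReduce job per pass over the transactional database.
public class KPhaseAprioriDriver {

  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);          // transactional database in HDFS
    int passes = Integer.parseInt(args[1]);  // fixed number of passes for this sketch

    for (int k = 1; k <= passes; k++) {
      Configuration conf = new Configuration();
      conf.setInt("apriori.pass", k);        // a real pass-k mapper would use this to locate L(k-1)
      conf.set("apriori.previous.output", "frequent-" + (k - 1));

      Job job = Job.getInstance(conf, "apriori-pass-" + k);
      job.setJarByClass(KPhaseAprioriDriver.class);
      // Stand-ins from the pass-1 sketch; a real pass k > 1 substitutes a
      // candidate-generation mapper built from the frequent (k-1)-itemsets.
      job.setMapperClass(FrequentOneItemsets.ItemMapper.class);
      job.setReducerClass(FrequentOneItemsets.SupportReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, new Path("frequent-" + k));

      if (!job.waitForCompletion(true)) {
        System.exit(1);                      // abort the chain if a pass fails
      }
      // Real implementations stop early once a pass produces no frequent k-itemsets.
    }
  }
}
```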

4 MapReduce Open Issues and Extensions

4.1 Performance Issues

The MapReduce platform provides some key features, such as scalability and fault tolerance, to handle data at large scale, but the overall performance of this platform highly depends on the nature of the application executed in the distributed computational environment. To make the MapReduce framework more suitable for Big Data handling and to improve performance, various Hadoop extensions have been suggested over time, such as index creation [22], data co-location, reuse of previously computed results and mechanisms for dealing with computational skew.

4.2 Programming Model and Query Processing Issues

Coding MapReduce applications requires both an understanding of the system architecture and programming skills. The MapReduce programming model is limited by its “batch” nature, where data needs to be uploaded into the file system even when the same data set has to be analysed many times. This programming model is also inappropriate for many classes of algorithms where the results of one MapReduce job serve as the input for the next, as in complex query-analysis processes. Consequently, a set of domain-specific systems has emerged to extend the MapReduce programming model, in which high-level languages such as Java, Ruby and Python and various abstractions have been built to support the MapReduce application development environment. Researchers have also proposed models to implement iterative algorithms on the MapReduce framework, such as Hadoop, iHadoop [23], iMapReduce [24], Twister [25] and CloudClustering [26]. Apart from that, users have to spend more time writing programs in the absence of expressiveness like that of SQL. Therefore, it is necessary to enhance MapReduce query capabilities [27].

4.3 MapReduce Extensions

To eliminate the limitations of the MapReduce framework, researchers have tried to integrate the key features of parallel databases and database systems into the MapReduce programming model, which results in MapReduce extensions. Various MapReduce extensions and their key advantages are listed in Table 2.

Table 2 MapReduce extensions and advantages

5 Conclusion and Future Research Direction

Based on our survey, both the traditional Apriori and the parallel Apriori algorithm versions suffer from the problem of scanning the database multiple times, especially those based on the k-phase of MapReduce approach, which incurs a high processing cost, and from the generation of candidate itemsets, which needs more memory space. We also focused on MapReduce capabilities, its limitations as open issues and the various proposed extensions. The open issues have led to various extensions or enhancements; the major enhancements are the result of integrating databases with MapReduce, integrating indexing capabilities into MapReduce, integrating MapReduce with data warehouse capabilities and adding skew management to MapReduce.

Future research can be carried out in two dimensions to enhance the performance of the parallel Apriori algorithm. One dimension is the modification of the joining and pruning steps of the existing algorithm to enable it to support pipelining, or the use of alternative Apriori-like algorithms that are free from the problem of scanning the database multiple times. The second dimension is the use of advanced MapReduce frameworks, such as the i2MapReduce model, which supports incremental problem-based algorithms, or of hybrid algorithms, to enhance the overall throughput of the system.