
1 Introduction

Big Data is not a precise term with a single, proper definition. Instead, it refers to collections of large amounts of various kinds of data, most of it unstructured. This makes it very difficult for relational database management systems to analyze and process such data, given its large volume and unstructured nature. Businesses and communities need to be capable of finding useful information in this vast amount of data, which helps them grow their existing business, discover new requirements of their users, and provide better customer experiences in the future. Big Data analytics techniques provide more precise results and solutions [1].

Researchers and analysts use advanced analytics techniques such as text and predictive analytics, natural language processing (NLP), data mining, and machine learning to extract useful insights and information that help an enterprise make the right decisions. Open-source Big Data analytics technologies such as Hadoop MapReduce and Spark provide adequate solutions. In this paper, we compare the Spark and MapReduce frameworks on Hadoop [2].

Section 2 covers Apache Hadoop and its core modules, including the Hadoop Distributed File System (HDFS), which stores data in a distributed manner across different nodes. Section 3 describes the Hadoop MapReduce framework for the storage and processing of large datasets. Section 4 discusses data processing in Apache Spark. Section 5 presents the comparison of MapReduce and Spark on the PageRank algorithm, with discussion. Section 6 compares MapReduce and Spark on sort. Section 7 presents the experimental studies and discussion based on executing a word count program in both the Spark and MapReduce programming frameworks. Section 8 describes the conclusion.

2 Apache Hadoop

Apache Hadoop is an open-source framework to process and manage Big Data, that is, large amounts of data that can be in any form, for example structured, unstructured, or semi-structured [3]. Hadoop provides a distributed file system known as HDFS (Hadoop Distributed File System), and there are a number of tools and frameworks related to Hadoop that we can use to perform operations on data stored in HDFS. In Hadoop, data is replicated on multiple nodes, so if one data node fails, we can recover our data from the other nodes [4].

Apache Hadoop includes the following modules:

(a) Hadoop Common: the common utility libraries that help Hadoop components interact with each other.

(b) Hadoop Distributed File System (HDFS): a distributed file system for parallel processing that provides fault tolerance and high throughput, helping to store and process data on multiple nodes in a distributed manner.

(c) Hadoop YARN: YARN supervises cluster resource management and job scheduling during the execution of jobs [5].

3 Hadoop MapReduce

Hadoop MapReduce is a programming framework for processing immense datasets in a parallel, distributed manner across large clusters of commodity hardware [6]. It is a fault-tolerant and reliable framework. MapReduce programming follows the map and reduce paradigm [7]. The map phase splits the input dataset into multiple chunks that are processed in a fully parallel manner, and the output of this stage is the input for the next phase, known as the reduce task; both the input and the output of a job are stored in HDFS [8]. The MapReduce framework works on <key, value> pairs, that is, it splits the input dataset into <key, value> pairs and produces the output of the job in a similar form [9]. Figure 1 represents the MapReduce workflow.

Fig. 1 MapReduce workflow
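As a hedged illustration of the <key, value> paradigm described above, the following minimal Python sketch emulates the map and reduce phases of a word count job locally; on a real cluster, Hadoop handles the input splitting, the shuffle/sort between phases, and all HDFS I/O. The sample input is illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a <word, 1> pair for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: after the shuffle/sort groups pairs by key, sum the counts."""
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    for word, group in grouped:
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["big data spark hadoop", "spark hadoop spark"]
    for word, count in reduce_phase(map_phase(sample)):
        print(word, count)
```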

4 Apache Spark

Apache Spark is a framework for analyzing Big Data [10] that can process and analyze massive amounts of data in a distributed manner. Apache Spark uses Resilient Distributed Datasets, known as RDDs [11]: an RDD is an immutable, fault-tolerant, distributed collection of elements that can be operated on in parallel [12].

In MapReduce, both the map and reduce phases perform disk read/write operations multiple times during the execution of a job; these read/write operations cause delays and increase the execution time [13].

Apache Spark provides a programming interface in which we can write a program and execute it in a distributed manner [14], and RDDs keep the data in RAM while it is processed. In this way, Spark utilizes its in-memory cache efficiently. During data processing, intermediate data is placed on the local file system rather than in HDFS, which reduces disk read/write operations in Spark [15]. Figure 2 describes the Apache Spark stack.

Fig. 2 Apache Spark stack
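For contrast with the MapReduce sketch in Section 3, here is a minimal PySpark word count over RDDs; the input path is a hypothetical placeholder, and cache() is what keeps the intermediate result in memory rather than on disk.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///input/sample.txt")  # hypothetical input path
            .flatMap(lambda line: line.split())    # map: one word per record
            .map(lambda word: (word, 1))           # map: <word, 1> pairs
            .reduceByKey(lambda a, b: a + b)       # reduce: sum counts per key
            .cache())                              # keep the result in RAM

for word, count in counts.take(10):
    print(word, count)

sc.stop()
```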

5 PageRank

PageRank is one of the foundational algorithms of Google's indexing process: it ranks every webpage on the internet by the number and quality of its links. It is a procedure for measuring the value and importance of a webpage: PageRank counts the number and quality of links pointing to a webpage to estimate that page's weight [16].
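As a hedged illustration, the following minimal PySpark sketch implements the classic iterative PageRank computation over a hypothetical link structure; the damping factor of 0.85 and the fixed iteration count are standard textbook choices, not values taken from the experiments cited below. Because the links RDD is reused in every iteration, caching it in memory is exactly where Spark gains over disk-based MapReduce.

```python
from pyspark import SparkContext

sc = SparkContext(appName="PageRank")

# Hypothetical link structure: (page, [pages it links to])
links = sc.parallelize([
    ("a", ["b", "c"]),
    ("b", ["c"]),
    ("c", ["a"]),
]).cache()  # reused in every iteration, so keep it in memory

# Start every page with an equal rank of 1.0
ranks = links.mapValues(lambda _: 1.0)

for _ in range(10):  # fixed number of iterations for simplicity
    # Each page sends rank / out-degree to every page it links to
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]]
    )
    # Standard damping factor of 0.85
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda r: 0.15 + 0.85 * r)

print(sorted(ranks.collect()))
sc.stop()
```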

The underlying assumption is that a webpage which receives links from other webpages is important. Different authors have compared the PageRank algorithm on Hadoop and Spark: according to [2, 17, 18], Spark performs better than Hadoop, but according to [19], Hadoop performs better than Spark. Table 1 describes the comparison of the PageRank algorithm on Hadoop and Spark.

Table 1 Comparison of MapReduce and Spark on PageRank

6 Sort

Sorting is one of the most demanding operations because the data is not reduced as it moves through the pipeline. At the core of sorting is the shuffle operation, which moves data across all the machines; to sort 1 TB of data, all of that data must be shuffled over the network [21]. Different authors have compared the sort algorithm on Hadoop and Spark: according to [22], Spark performs better than Hadoop, but according to [19], Hadoop performs better than Spark. Table 2 describes the comparison of the sort algorithm on Hadoop and Spark.

Table 2 Comparison of MapReduce and Spark on sort
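To make the role of the shuffle concrete, here is a minimal, hedged PySpark sketch of a distributed sort; sortByKey range-partitions the records and triggers exactly the kind of full network shuffle described above. The sample records are illustrative stand-ins for a large benchmark input.

```python
from pyspark import SparkContext

sc = SparkContext(appName="Sort")

# Illustrative (key, value) records; a real benchmark would read 1 TB from HDFS
data = sc.parallelize([(5, "e"), (1, "a"), (3, "c"), (2, "b")])

# sortByKey range-partitions the data and shuffles it across the cluster,
# which is the dominant cost of a distributed sort
sorted_rdd = data.sortByKey()

print(sorted_rdd.collect())  # [(1, 'a'), (2, 'b'), (3, 'c'), (5, 'e')]
sc.stop()
```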

7 Experimental Studies

We evaluated the efficiency of MapReduce and Spark by implementing the word count algorithm.

For these experiments, we implemented the word count (WC) program in both MapReduce and Spark and ran it on five datasets of different sizes on both frameworks to measure the execution time [23]. A two-node Hadoop cluster was used. Figure 3 describes the comparison of MapReduce and Spark when the word count algorithm was executed on both frameworks [15, 24]. Table 3 describes the comparison of MapReduce and Spark on word count for the different datasets (Fig. 3).

Table 3 Comparison of MapReduce and Spark on word count
Fig. 3 Comparative analysis of Hadoop
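For reference, here is a minimal sketch of how the per-dataset execution time could be measured on the Spark side; the dataset paths are hypothetical placeholders, and the actual experiments reported above may have been timed differently.

```python
import time
from pyspark import SparkContext

sc = SparkContext(appName="WCTiming")

# Hypothetical dataset paths standing in for the five differently sized inputs
for path in ["hdfs:///data/ds1.txt", "hdfs:///data/ds2.txt"]:
    start = time.time()
    (sc.textFile(path)
       .flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
       .count())  # an action forces the whole job to execute
    print(path, "took", time.time() - start, "seconds")

sc.stop()
```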

8 Conclusion

Both Spark and MapReduce are popular distributed computing frameworks that provide effective solutions for handling the large amounts of data known as Big Data. Today there are many misconceptions about how the two frameworks compare; many researchers and bloggers claim that Spark is 10-100 times faster than the MapReduce framework. In this paper, we performed a comparative analysis of both programming frameworks in terms of execution time. We ran a word count program on both frameworks with datasets of different sizes and found that Spark is 10 times faster than Hadoop.