1 Introduction

The rapid development of the Internet has made massive volumes and a wide variety of data available, and the rate at which data are generated is increasing exponentially [1]. These characteristics are acknowledged as essential features of Big Data [2]. The computing capabilities of multi-core computers have become remarkably sophisticated, but inevitable bottlenecks in their performance and scalability still limit the possibilities for handling large-scale data in a centralized context. It has therefore been widely acknowledged that the “scale-up” optimization strategy, in which computations are performed by a single machine, should give way to a distributed computing context with a “scale-out” feature. It has now become particularly important to determine effective ways of achieving efficient and scalable storage and computation of big data, and this work represents a considerable challenge [3].

In recent years, cloud computing technologies have received a great deal of attention from researchers in the information technology industry and academia. The US National Institute of Standards and Technology has defined cloud computing as “a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [4]. The development of cloud computing has been evolutionary rather than revolutionary, via a merging of several conventional technologies, including virtualization, grid computing, utility computing, and autonomic computing [5].

Of the cloud computing technologies that are currently in use, Google’s MapReduce paradigm [6], which is built on the Google File System (GFS) [7], has become the prominent parallel programming model in the cloud computing community because of its simplicity, scalability, and fault tolerance [8]. Several open-source frameworks have been released to make MapReduce available to the public. One of these, Hadoop MapReduce [9], considerably improves the scalability of large-scale data processing. It also enables programmers to focus on computational logic rather than on low-level programming details such as data partitioning, task scheduling, load balancing, and fault tolerance. Nowadays, MapReduce is extensively used to solve data-intensive problems in various fields, and many algorithms that were originally designed for a single machine have been parallelized to suit MapReduce. For example, Yahoo uses MapReduce to meet its big data analysis requirements [10]. The Apache Mahout project [11] provides several MapReduce-based scalable machine learning libraries. Urbani et al. [12] developed a MapReduce-based inference engine, WebPIE, to implement scalable ontology reasoning for the Semantic Web [13]. The work in [14] described efficient Skyline query processing of massive data, using a method that was also based on MapReduce. Figure 1 shows the number of indexed papers related to the term “MapReduce” in the Web of Science and EI Compendex databases, where the X-axis denotes the publication year from 2007 to 2014 and the Y-axis represents the corresponding number of indexed papers.

Fig. 1 The number of indexed papers related to MapReduce from 2007 to 2014

However, several inherent limitations seriously affect the efficiency of the original MapReduce. For instance, the fundamental computing unit in MapReduce, a job, is performed in two phases, map and reduce, so it is difficult to implement multi-iteration algorithms in a single job. Hadoop provides a job chain mechanism, but running multiple MapReduce jobs is still computationally expensive. In addition, the batch-processing nature of the original MapReduce makes it difficult to handle large-scale data in real-time and interactive contexts. Numerous optimization approaches have been proposed in recent years to tackle these challenges. Therefore, a systematic review of these novel MapReduce optimization solutions is much in demand.

Several relevant surveys of MapReduce have already been presented. For example, Doulkeridis et al. [15] reviewed the state of the art in improving the performance of parallel query processing using MapReduce, although some important optimizations, such as hardware acceleration and performance tuning of MapReduce, were not covered in detail. Li et al. [16] surveyed distributed data management and processing approaches using MapReduce. However, their work focused on reviewing the high-level languages and database-related operators for MapReduce, and some extensions of the MapReduce programming model were not discussed. In 2011, Lee et al. [17] gave a survey of MapReduce data processing, but some novel improvements and features of MapReduce runtimes, such as the YARN resource management framework, were not described.

Unlike the works above, we focus on the efficiency and flexibility improvements of MapReduce, and provide a more comprehensive review of state-of-the-art methods for optimizing the MapReduce programming model and its runtime system. We first identify the major drawbacks of the original MapReduce paradigm. We then provide an in-depth analysis of the optimization approaches with respect to their different objectives and capabilities, such as job scheduling optimization, programming model extension, and hardware-based acceleration.

The remainder of this paper is organized as follows. Section 2 gives an overview of the basics of Google’s MapReduce paradigm, its high-level abstractions, and some well-known open-source implementations. Section 3 discusses and analyses the major drawbacks of MapReduce that affect its performance and efficiency. Based on that, Sect. 4 contains a detailed review of the state-of-the-art improvements to MapReduce that address these drawbacks. In Sect. 5, we conclude the paper and suggest the future work that is required.

2 Basics of MapReduce Paradigm

As the basis of other runtime implementations, Google’s MapReduce programming model was designed to process huge amounts of data in a cluster of commodity machines [18]. The key ideas behind the design of this model are the principle of “divide and conquer”, and the strategy of “moving computing instead of moving data”. To help programmers focus on implementing computational logic in a large distributed cluster, researchers also developed distributed and parallel computing functions in their MapReduce runtime that hide the programming details of load balancing, network communication and fault tolerance. However, the source code of their runtime system is not available to the public because of Google’s privacy policy.

In this section, we first describe the basics of Google’s MapReduce programming model and introduce some existing open-source implementations. We then present a short review of some well-known high-level abstractions of MapReduce. The characteristics of these MapReduce systems are also compared to help users choose the appropriate tool for their particular requirements.

2.1 MapReduce Programming Model

The basic design idea of Google’s MapReduce is inspired by the Map and Reduce functions in classical functional programming languages like Lisp [19]. In its master–slave architecture, data stored in the GFS are converted into key–value pairs and processed in a job consisting of a map phase and a reduce phase. A batch of jobs can also be formed into a chain to cope with complex computing tasks. To be suitable for MapReduce computations, data must meet one basic requirement: the datasets should be decomposable into many small independent sub-datasets, each of which can be processed in a particular task.

In a MapReduce job, the master node first partitions the input data into M independent chunks (where M is the number of Map tasks) and passes them to the mapper nodes. Each map task is independently executed in a mapper node. In the map phase, each mapper accepts data chunks and generates a series of intermediate key–value pairs according to a user-defined Map function. The MapReduce runtime system then automatically sorts and merges these intermediate key–value pairs by key. The intermediate data are divided into R segments (where R is the number of reducer nodes) by applying a hash function to the key, so that pairs with the same key fall into the same segment. Finally, in the reduce phase, after being notified of the location of the intermediate data, each reducer accepts a set of intermediate key–value pairs, merges all the data with the same key, and generates a series of key–value pairs according to a user-defined Reduce function. Figure 2 illustrates the processes involved in one MapReduce job.

Fig. 2 Google’s MapReduce paradigm

The formal expressions describing the Map and Reduce functions are given below, where [\(v_{i}\)] represents a list of values with respect to \(k_{2}\), \(i\ge 0\). Programmers can follow the pseudo-code of the Map and Reduce functions shown in Table 1 to count the number of occurrences of words in a collection of files using MapReduce.

$$\begin{aligned}&\hbox {Map}: \langle k_{1}, v_{1}\rangle \rightarrow [\langle k_{2}, v_{2}\rangle ]\\&\hbox {Reduce}: \langle k_{2}, [v_{i}]\rangle \rightarrow [\langle k_{3}, v_{3}\rangle ] \end{aligned}$$
Table 1 Sample of word count pseudo-code that can be used in MapReduce

As shown in Table 1, the Map function running in one map task takes a particular file or a portion of multiple files as input. Each time the map task encounters a word w, it produces an intermediate key–value pair in which w is the key and the value (set to “1” in Table 1) is the number of occurrences of w observed at that point. The MapReduce runtime system then automatically divides these intermediate data into several partitions according to the number of reduce tasks, and each partition is assigned to the appropriate reduce task after the data are sorted to group all of the key–value pairs with the same key. Finally, the Reduce function takes the sorted data as its input and, for each word w, sums all the counts v produced for that word.
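The word count flow described above can be sketched as a small single-process simulation. This is only an illustration under stated assumptions, not the pseudo-code of Table 1: the names `map_fn`, `reduce_fn`, `partition` and `run_job` are hypothetical, the CRC32-based partitioner stands in for whatever hash function a real runtime uses, and everything runs in memory rather than on a cluster.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def map_fn(filename, text):
    """Map: <k1, v1> -> [<k2, v2>]; emit (word, 1) for every word seen."""
    return [(word, 1) for word in text.lower().split()]


def reduce_fn(word, counts):
    """Reduce: <k2, [v_i]> -> [<k3, v3>]; sum the counts of one word."""
    return [(word, sum(counts))]


def partition(key, num_reducers):
    """Choose a reduce partition from a deterministic hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(files, num_reducers=2):
    """Simulate one job: map, partition by key hash, sort, group, reduce."""
    partitions = defaultdict(list)
    # Map phase: one map call per input file (hypothetical input layout).
    for filename, text in files.items():
        for key, value in map_fn(filename, text):
            partitions[partition(key, num_reducers)].append((key, value))

    result = {}
    # Reduce phase: sort each partition so equal keys become adjacent,
    # group them, and call the Reduce function once per distinct key.
    for pairs in partitions.values():
        pairs.sort(key=itemgetter(0))
        for word, group in groupby(pairs, key=itemgetter(0)):
            for out_key, out_value in reduce_fn(word, [v for _, v in group]):
                result[out_key] = out_value
    return result
```

Calling `run_job({"a.txt": "the quick brown fox", "b.txt": "the lazy dog the end"})` returns a dictionary of word counts. Because the partition is chosen from the key alone, every occurrence of a word reaches the same reduce call, whichever file it came from.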

2.2 Open-Source Implementations of MapReduce

To date, several open-source runtimes that implement Google’s MapReduce model have been presented to solve data-intensive computing problems in different usage scenarios or deployment environments.

QtConcurrent [20] is a C++ library for multi-threaded applications within the Qt project [109]. It provides functional-programming-style APIs for parallel processing, including a MapReduce implementation for shared-memory systems. Unlike Google’s MapReduce, QtConcurrent can only be executed in a non-distributed environment. The number of threads used in a Qt program is automatically adjusted depending on the number of processor cores available.

Phoenix [21] implements MapReduce for multi-core chips and shared-memory multi-processors. It comprises a parallel programming API and an efficient runtime that automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across all of the processors. A C++ re-implementation of Phoenix, called Phoenix++ [22], has recently been released.

Disco [23], which is written in the Erlang language, is a lightweight, open-source framework for distributed computing that is also built on Google’s MapReduce paradigm. With the help of the Disco Distributed File System, it supports profiling and debugging of MapReduce jobs as well as random access to petabyte-scale data and auxiliary results. Many companies, including Nokia Research Center, employ Disco for large-scale log analysis, probabilistic modeling, data mining, and full-text indexing [23].

Skynet [24] is a Ruby-based open-source implementation of Google’s MapReduce framework. It is an adaptive, self-upgrading, fault-tolerant, and fully distributed system with no single point of failure. Unlike the master–slave architecture used in other systems, Skynet has no special master server; all the workers use a “peer recovery” strategy and a message queue architecture to monitor the progress of the other workers.

GridGain [25] provides Java-based middleware for the in-memory processing of big data in a distributed context. GridGain allows real-time in-memory processing of both transactional and non-transactional live data with very low latencies. Several novel features have been integrated into GridGain, such as distributed task sessions, checkpoints for long-running tasks, early and late load balancing, and affinity co-location with data grids. In GridGain, a task is split into multiple sub-tasks that are assigned to the available cluster nodes. The results of the sub-tasks are then aggregated and sent back to the user.

Another open-source MapReduce runtime, Twister [26], aims to provide efficient iterative computation in a distributed in-memory environment. Twister uses a publish/subscribe messaging infrastructure to handle communications, including transferring intermediate data between map and reduce tasks in distributed memory. It also uses a configuration phase for long-running map and reduce tasks, which eliminates the need to reload static data in each iteration. A combine operation produces a collective output from all the reduce tasks. Moreover, Twister contains a task pool so that tasks do not have to be initialized repeatedly.

Distinctively, Misco [27] implements a Python-based MapReduce framework for mobile systems. It has a polling-based task assignment mechanism and uses the HTTP protocol to transfer data between the Misco server and the workers. An Earliest Deadline First Scheduler and a Sequential Scheduler are used to schedule its applications and tasks, respectively. Furthermore, a UDP Server module contained in the Misco Server is used to monitor incoming worker logs using the UDP transport protocol.

Apache Hadoop [9] is the dominant open-source MapReduce framework, and allows distributed processing of large datasets across clusters of commodity computers. The core components of the Hadoop ecosystem include Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. Hadoop Common provides the functional infrastructure for the Hadoop framework. HDFS is an open-source implementation of GFS, and provides scalable distributed file system support. Hadoop YARN is a novel general framework for job scheduling and cluster resource management. YARN can serve as a runtime environment not only for MapReduce but also for Spark [65]. MapReduce running on the Hadoop YARN framework is generally called MRv2.

In YARN, although the MapReduce programming model is used unchanged, the responsibilities of the JobTracker and TaskTracker in Hadoop 1.0 are divided between two novel components. These are the Resource Manager, which allocates global resources for all of the applications, and the Application Master, which manages the job execution process for one particular application. Figure 3 illustrates the overall architecture and workflow of Hadoop MRv2. First, when YARN receives the submitted MapReduce application, the Application Manager component in the Resource Manager calls the corresponding Node Manager to start the Application Master (denoted as MRAppMaster in Fig. 3) in a Container. Then, the Application Master requests computational resources from the Resource Scheduler component in the Resource Manager, and asks the corresponding Node Managers to start the tasks for this application. During execution, the task in each Container sends its running status to the Application Master via RPC protocols.

Fig. 3 Architecture of YARN-based Hadoop MRv2

2.3 High-Level Abstractions of MapReduce Model

When implementing complicated data analysis tasks in the MapReduce frameworks mentioned above, programmers have to design several Map and Reduce functions separately, and determine how to group these functions into a series of MapReduce jobs. To further hide the programming details of MapReduce, researchers have proposed several high-level abstractions, such as Sawzall [105], Apache Pig [102], Cascading [103] and Scalding [104].

Sawzall [105] is a type-safe scripting language designed for analysing massive collections of individual log records in Google’s MapReduce clusters. A general Sawzall script can be abstracted into two phases: a filtering phase and an aggregation phase. After the Sawzall interpreter is instantiated for each piece of input data, the filtering phase evaluates the analysis on each input record individually and emits the intermediate results to the aggregation phase. Then, based on a set of predefined aggregators (also called tables), the aggregation phase collates and reduces the intermediate results to create the final results. With Sawzall, the resulting code is much simpler and shorter, by a factor of ten or more, than the corresponding C++ code in MapReduce. In 2010, the Sawzall runtime, which runs a Sawzall script once over a single input, was open-sourced.

Motivated by Sawzall, Apache developed Pig [102], an open-source platform for large-scale data analysis that adopts a high-level SQL-like language, Pig Latin. Pig provides a default MapReduce-based deployment model that executes ad-hoc computing tasks on a Hadoop cluster and HDFS installation after the Pig Latin programs have been compiled into a series of MapReduce jobs. Furthermore, Pig Latin can handle nested data models and supports operations that are commonly found in databases, such as Group By, Join, Filter, Union and Foreach.

Table 2 presents the Pig Latin code of the word count sample. First, the Pig interpreter parses the Pig Latin commands and verifies that the input is valid. Second, the Pig compiler builds a logical plan for the computational logic that the programmer defined. The Pig compiler then converts the logical plan into a physical plan, which determines the MapReduce jobs, following a lazy execution strategy. Finally, the corresponding jobs are executed automatically in the Hadoop context. The computation flow of Pig’s word count program is illustrated in Fig. 4.

Table 2 Sample of word count pseudo-code that can be used in Apache Pig
Fig. 4 The computation flow of word count in Apache Pig

Cascading [103] is another notable high-level abstraction framework for building data applications on top of Hadoop MapReduce. Cascading hides the topological structure and configuration of MapReduce from programmers to make the development of complex business logic more efficient. Cascading follows the “pipe and filter” strategy to define a computational workflow, which is then compiled into Hadoop MapReduce jobs and executed.

Twitter’s Scalding API [104] extends the Java-based Cascading to enable MapReduce application development in the Scala language, which combines the features of functional programming and object-oriented programming. Recently, Twitter also open-sourced the Summingbird [106] library, which allows programmers to write MapReduce programs that look like native Scala or Java collection transformations and execute them on the Scalding platform.

The main characteristics of these open-source MapReduce runtimes and high-level abstractions mentioned above are summarized in Table 3. The characteristics include the development and programming language, the supported operating system, and the deployment environment.

Table 3 Features of open-source implementations and abstractions of MapReduce

3 Shortcomings of Original MapReduce

As mentioned, MapReduce has been widely used to solve data-intensive problems in various fields because of its simplicity, fault tolerance and scalability. However, there is still some debate amongst academic researchers about its efficiency; MapReduce has even been criticised as a “major step backwards” in parallel data processing compared with traditional RDBMSs [28]. Dean et al. addressed misconceptions about Google’s MapReduce and discussed ways of handling the pitfalls related to, for example, heterogeneous systems, indices, and structured data and schemas [18]. Nevertheless, it is obvious that several pitfalls of MapReduce need to be addressed, as described below.

  1. An efficient scheduling mechanism is critical to a MapReduce runtime system because it often runs multiple jobs in a distributed context. However, the default scheduler in the original Hadoop MapReduce assumes that the computing nodes in the cluster are homogeneous and that the estimated straggler tasks can be speculatively copied and re-executed by other idle nodes [29]. This means that a heavy I/O transfer load is inevitable, particularly in heterogeneous clusters. In addition, within a single MapReduce job, the reduce phase does not begin executing until all the map tasks are completed. This synchronization barrier seriously affects performance. In a cloud computing data center, it is quite common for the computing resources available to MapReduce to be shared among multiple users. However, Hadoop MapReduce assumes that all the computing resources are assigned to a single user, so the original job scheduling mechanism can scarcely adapt to shared MapReduce environments.

  2. The simplicity of the MapReduce programming model makes it relatively easy for programmers to use. However, many complex computational tasks, especially the iterative algorithms used in data mining and graph analysis, are difficult to implement in a single MapReduce job, and running multiple jobs in Hadoop is computationally expensive [30]. Therefore, an extension of the MapReduce programming model and runtime is in great demand.

  3. With the development of the Internet of Things, large-scale stream data will be generated rapidly by a range of sensors. Real-time computing requirements for high-speed data streams pose significant challenges. The original MapReduce was designed for batch-oriented offline processing, which means that all of the data must be copied to the distributed file system or to distributed databases at the beginning of or during the computing process. Therefore, the original MapReduce model can barely satisfy upcoming real-time or interactive processing requirements.

  4. Almost all commodity computers now have multi-core CPUs or high-performance GPUs. However, all map and reduce tasks in the original MapReduce are executed linearly, so the hardware computing capability of each node is not fully used. Moreover, most MapReduce implementations are designed to be executed in a single cluster, whereas many cloud computing data centers have been established around the world. Research is therefore required on how the original MapReduce paradigm can be extended to make it suitable for processing large-scale distributed data across multiple clusters.

  5. MapReduce applications are typically used in a cluster environment consisting of large numbers of commodity computers. The complex configuration parameters and cluster setup details involved pose challenges to MapReduce users. Therefore, optimizations such as tools that simulate MapReduce contexts are in great demand, so that the performance of MapReduce can be tuned and its dynamic running behaviour analysed.

  6. In practice, MapReduce clusters are mostly deployed in cloud computing data centers in which data and computing resources are shared by multiple tenants. However, the original MapReduce runtime only provides token-based and Kerberos-based authentication mechanisms. Therefore, how to further protect large-scale sensitive data and provide stronger authentication and authorization should be considered seriously. In addition, energy consumption often accounts for a large fraction of the total cost of MapReduce clusters. This raises the critical and challenging issue of how to decrease power usage while maintaining the capability of MapReduce.

4 Improvements of MapReduce

In this section, we review the state-of-the-art optimization approaches that have been proposed to address the aforementioned issues of the MapReduce programming model and its runtime systems. We classify these approaches into the following six aspects according to their different characteristics.

4.1 Optimizations of Job Scheduling

Although the MapReduce model provides a simple programming interface for users, it does not contain any execution plan that specifies how jobs are executed on the nodes. The job scheduling mechanism of a particular MapReduce runtime system strongly affects the efficiency of MapReduce, since functions such as task partitioning and resource allocation are all integrated in it. The default FIFO job scheduler in the Hadoop MapReduce runtime assumes that submitted jobs are executed sequentially on a homogeneous cluster. However, MapReduce is very commonly deployed in heterogeneous environments in which the computing and data resources are shared by multiple users and applications. Recently, researchers have made considerable progress in optimizing MapReduce job scheduling for these two scenarios: shared and heterogeneous MapReduce environments.

4.1.1 Shared MapReduce Environment

In a shared MapReduce environment, the primary concern is how to allocate limited data and computational resources. The straightforward way is to assign these resources equally to each user or application. However, this kind of fairness-based method does not take sufficient account of the diversity of jobs or of features such as data locality. Therefore, Zaharia et al. [35] proposed a delay scheduling algorithm to address the conflict between data locality and fairness in a shared MapReduce cluster. The basic idea of the algorithm is that a job that should be scheduled next according to fairness is delayed for a small amount of time when it cannot be executed locally, letting other jobs launch instead. The delay scheduling algorithm is used in the Hadoop Fair Scheduler.
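The idea of delay scheduling can be illustrated with a short sketch. It is a simplification under stated assumptions, not the scheduler of [35] or the Hadoop Fair Scheduler code: the `Job` class and `assign_task` function are hypothetical, the delay is counted in skipped scheduling opportunities (`max_skips`) rather than wall-clock time, and the list of jobs is assumed to be already sorted by the fairness policy.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    # Maps each pending task id to the set of nodes that hold its input data.
    pending: dict = field(default_factory=dict)
    skip_count: int = 0  # how many times this job has been passed over


def assign_task(node, jobs_in_fair_order, max_skips):
    """Return (job name, task id, is_local) for a free slot on `node`, or None."""
    for job in jobs_in_fair_order:
        if not job.pending:
            continue
        local = [t for t, nodes in job.pending.items() if node in nodes]
        if local:
            # A data-local task exists: launch it and reset the delay counter.
            task = local[0]
            del job.pending[task]
            job.skip_count = 0
            return job.name, task, True
        if job.skip_count >= max_skips:
            # The job has been delayed long enough: accept a non-local task.
            task = next(iter(job.pending))
            del job.pending[task]
            return job.name, task, False
        # Delay this job and let the next job in fairness order try the slot.
        job.skip_count += 1
    return None
```

With `max_skips=0` the sketch degenerates to plain fair scheduling without any locality preference; larger values trade a little fairness for more data-local launches.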

Besides taking advantage of data locality, the High Performance MapReduce Engine (HPMR) proposed by Seo et al. [32] also reduces the amount of intermediate output that multiple jobs have to shuffle. HPMR contains two optimization schemes, pre-fetching and pre-shuffling, which improve the degree of data locality and the efficiency of data shuffling, respectively, in shared MapReduce environments. The pre-fetching scheme has two types: intra-block pre-fetching and inter-block pre-fetching. In intra-block pre-fetching, only an input split or an intermediate output is pre-fetched, while the whole candidate data block is pre-fetched in inter-block pre-fetching. During pre-shuffling, HPMR looks over an input split before the map phase begins and predicts the target reducer to which the key–value pairs will be partitioned.

Meanwhile, some works improve the efficiency of job scheduling by estimating the execution status of multiple MapReduce applications. Sandholm et al. [31] introduced a total system efficiency metric for shared clusters on a Xen-virtualized infrastructure. This work dynamically adjusts the allocation of resources based on the average actual application performance ratio in the shared environment. Sandholm et al. also incorporated three user-assigned prioritization strategies into Hadoop MapReduce: one that prioritizes entire workflows, one that prioritizes different stages of a single workflow, and one that detects and prioritizes bottleneck components within a workflow stage. As a result, users can change the priority of an application over time and assign different priorities to different components, within the limit of their total priority.

For a similar purpose, Polo et al. [33] presented a performance-driven co-scheduling mechanism for shared MapReduce environments that dynamically predicts the performance of parallel MapReduce jobs and adjusts the resource allocation accordingly. They proposed a job performance estimation algorithm that dynamically estimates the completion time of a MapReduce job. The allocation algorithm used in the task scheduler assigns free slots to each job depending on the results of the estimation, and organizes multiple jobs into an ordered queue based on the proposed scheduling policy. The authors also presented an application-centric MapReduce multi-job task scheduler that meets user-defined high-level performance goals by exploiting the capabilities of a hybrid system [34]. In a prototype developed on a cluster of Cell/BE blades, this scheduler dynamically allocates resources to co-located MapReduce jobs based on their completion time goals.

LsPS [36] is an adaptive scheduling algorithm that uses knowledge of workload characteristics to tune the scheduling schemes for heterogeneous MapReduce clusters. A lightweight history information collector monitors and gathers important statistical information on each user’s job workloads, including the average task execution time, the average job size, and the coefficient of variation of job sizes. When scheduling for multiple users, LsPS allocates slots according to their workload characteristics, and when scheduling for a single user, LsPS tunes the scheduling schemes for jobs based on that user’s job size distribution.
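The three statistics named above are simple to compute from a user’s job history. The sketch below only illustrates those quantities and is not the LsPS collector itself: the function name `workload_profile` and its input layout are hypothetical, and the population standard deviation is an assumed choice for the coefficient of variation.

```python
from statistics import mean, pstdev


def workload_profile(job_sizes, task_times):
    """Summarise one user's history with the three statistics listed above.

    job_sizes  -- size of each completed job (e.g. number of tasks)
    task_times -- execution time of each completed task, in seconds
    """
    avg_size = mean(job_sizes)
    return {
        "avg_task_time": mean(task_times),
        "avg_job_size": avg_size,
        # Coefficient of variation = standard deviation / mean.
        "cv_job_size": pstdev(job_sizes) / avg_size,
    }
```

A coefficient of variation near zero means the user’s jobs are of similar size, while a large value indicates a mix of small and large jobs, which is the kind of signal a scheduler could use to choose between scheduling schemes.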

4.1.2 Heterogeneous Environment

Slow tasks, known as stragglers, often prolong the execution time of MapReduce jobs. The straightforward way to decrease response time is to identify stragglers effectively in the heterogeneous cluster and speculatively schedule copies of them on other idle nodes. Several works, such as [37–39] and [29], improve the efficiency of job scheduling in heterogeneous contexts by making good use of this speculation strategy.

The Longest Approximate Time to End (LATE) scheduling algorithm [37] defines static parameters such as SpeculativeCap, which limits the number of speculative tasks that can be running at one time, and SlowTaskThreshold, which determines whether a task is slow enough to be speculated and thus prevents needless speculation. The algorithm estimates the progress rate and the time to completion of each task from these static parameters and the amount of time that the task has been running. Under LATE, when a node asks for a new task and the number of running speculative tasks is less than SpeculativeCap, the request is ignored if the node’s total progress is below a further parameter, SlowNodeThreshold. Otherwise, the tasks that are currently running but are not being speculated are ranked by their estimated time to completion, and a copy of the highest-ranked task with a progress rate below SlowTaskThreshold is launched.
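The decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the LATE implementation of [37]: the `Task` class and `pick_speculative_task` function are hypothetical, the progress rate is taken as progress divided by running time, the time to end as the remaining progress divided by that rate, and the three thresholds are treated as plain numbers supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    progress: float          # progress score in [0, 1]
    elapsed: float           # seconds the task has been running
    speculated: bool = False  # True if a backup copy already exists

    @property
    def progress_rate(self):
        return self.progress / self.elapsed

    @property
    def time_left(self):
        # Estimated time to end = remaining progress / progress rate.
        return (1.0 - self.progress) / self.progress_rate


def pick_speculative_task(node_progress, running, speculative_running,
                          speculative_cap, slow_node_threshold,
                          slow_task_threshold):
    """Return the task to back up on the requesting node, or None."""
    if speculative_running >= speculative_cap:
        return None  # too many backup copies are already running
    if node_progress < slow_node_threshold:
        return None  # ignore requests from slow nodes
    # Only tasks with measurable progress can be ranked (assumption).
    candidates = [t for t in running
                  if not t.speculated and t.progress > 0 and t.elapsed > 0]
    # Rank by longest approximate time to end first.
    candidates.sort(key=lambda t: t.time_left, reverse=True)
    for task in candidates:
        if task.progress_rate < slow_task_threshold:
            return task
    return None
```

Ranking by estimated time left, rather than by progress alone, favours the task whose backup copy is most likely to shorten the overall job response time.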

Building on an idea similar to the LATE algorithm, Chen et al. developed a Self-Adaptive MapReduce scheduling algorithm (SAMR) [38] and a History-based Auto-Tuning (HAT) MapReduce scheduler [39] for heterogeneous environments. Instead of using static parameters to find stragglers (as LATE does), Chen et al. used historical information, updated after every execution, to adjust the time weight of each stage of the Map and Reduce tasks when estimating task execution times. In addition, SAMR dynamically identifies slow nodes and classifies them into sets of map slow nodes and reduce slow nodes. As a result, backup map tasks are launched only on nodes that are fast nodes or reduce slow nodes.
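To make the role of stage weights concrete, the following sketch estimates the remaining time of a task from per-stage time weights and blends stored weights with newly observed ones. It is an illustration under stated assumptions, not the algorithm of [38] or [39]: the function names, the linear extrapolation of remaining time, and the `history_share` blending factor are all hypothetical choices.

```python
def estimate_time_left(stage_weights, finished_stages, sub_progress, elapsed):
    """Estimate the remaining time of a task from per-stage time weights.

    stage_weights   -- fraction of total time each stage took in the past
                       (assumed to sum to 1), e.g. [0.7, 0.3]
    finished_stages -- number of stages that are already complete
    sub_progress    -- progress inside the current stage, in [0, 1]
    elapsed         -- seconds since the task started
    """
    progress = sum(stage_weights[:finished_stages])
    if finished_stages < len(stage_weights):
        progress += stage_weights[finished_stages] * sub_progress
    if progress <= 0:
        return float("inf")  # no progress yet, so no finite estimate
    # Linear extrapolation: time already spent scales with weighted progress.
    return elapsed * (1.0 - progress) / progress


def update_weights(old_weights, observed_weights, history_share=0.5):
    """Blend stored stage weights with those observed in the latest run."""
    return [history_share * old + (1.0 - history_share) * new
            for old, new in zip(old_weights, observed_weights)]
```

For example, a two-stage task with weights `[0.7, 0.3]` that has finished the first stage and half of the second after 85 s has a weighted progress of 0.85, which gives an estimate of about 15 s remaining.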

As the SAMR algorithm does not take into account the fact that different types of jobs may have different map and reduce stage weights, Sun et al. [29] developed an Enhanced Self-Adaptive MapReduce scheduling algorithm (ESAMR) to improve the speculative re-execution of slow tasks. ESAMR records historical stage weight information separately on each node and divides it into k clusters using a k-means clustering algorithm. ESAMR then estimates the execution time of each running task using the weights of the cluster into which the task is classified.

In addition, because of the limited network bandwidth in clusters, some studies tried to optimize job scheduling by enhancing data locality for heterogeneous MapReduce clusters. For example, Zhang et al. [40] developed a data-locality-aware scheduling method that addresses the data locality problem that occurs in heterogeneous MapReduce environments. After receiving a request from a requesting node, this method preferentially schedules a task with input data stored on the requesting node. If no such task exists, the method selects a task with input data stored nearest to the requesting node, and then decides whether to schedule the task to the requesting node or to reserve the task for the node storing the input data by making a trade-off between waiting time and transmission time at runtime.
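The decision procedure described above can be sketched as follows. This is a hedged illustration rather than the method of [40]: the function `choose_task` and its callback parameters are hypothetical, and the trade-off is simplified to "schedule remotely only if the estimated transmission time is shorter than the estimated waiting time of the node that stores the data".

```python
def choose_task(requesting_node, pending, distance, wait_time, transfer_time):
    """Pick a task for requesting_node, or None to reserve it for the data node.

    pending is a list of (task_id, data_node) pairs; the three callbacks are
    assumed estimators supplied by the scheduler.
    """
    # 1. Prefer a task whose input data is stored on the requesting node.
    for task_id, data_node in pending:
        if data_node == requesting_node:
            return task_id
    if not pending:
        return None
    # 2. Otherwise take the task whose input data is nearest to the requester.
    task_id, data_node = min(pending,
                             key=lambda p: distance(requesting_node, p[1]))
    # 3. Trade-off: move the data only if that is cheaper than waiting
    #    for the node that already holds it.
    if transfer_time(task_id, data_node, requesting_node) < wait_time(data_node):
        return task_id
    return None
```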

MapReduce with Access Patterns [43] is a combination of data access semantics and the MapReduce programming framework that has been used in High Performance Computing (HPC) analytics applications. It contains a data-centric scheduler to increase the performance of the HPC analytics MapReduce programs by maintaining data locality. In this scheduling algorithm, a virtual split-based approach is used to assign all the independent data chunks on a data node to local tasks and to avoid data transfers completely. A weighted-set cover-based approach was also designed to select data nodes for scheduling Map tasks with multiple dependent chunks.

Bu et al. [45] developed a task scheduling system to mitigate interference while preserving task data locality in a virtual MapReduce cluster. To estimate the task slowdown caused by interference in a virtual MapReduce cluster, they presented an exponential interference prediction model and used it in their Dynamic Threshold policy. In addition, Bu et al. [35] improved the Delay Scheduling algorithm by adjusting the delay intervals of ready-to-run jobs in proportion to the input size.

To take full advantage of the feature of data replication in HDFS, Yang et al. [42] developed a Data-replica scheduler that improves the performance of MapReduce task scheduling and data allocation in heterogeneous clusters. In this approach, after detecting the location of a free slot, the scheduler determines whether the data node is fast or slow according to the proposed fast-datanode queue. If the node is considered to be fast, an unprocessed block is read, otherwise the scheduler reads the data block from the head of the proposed undo-blockInfo table. The fast-datanode queue, the undo-blockInfo table, and the slow-data-etime queue (which is ordered by the predicted completion time for the processing tasks) are updated after the free slot has been processed.

Ahmad et al. [41] analysed the key reasons for MapReduce’s poor performance on heterogeneous clusters and proposed a load balance optimization approach called Tarazu which consists of three components: Communication-Aware Load Balancing (CALB), Communication-Aware Scheduling (CAS), and Predictive Load Balancing (PLB). CALB regulates the use of remote Map tasks depending on whether Map or Shuffle is likely to be in the critical path, CAS determines how many remote tasks are needed and when to execute them in the task-steal mode, and PLB achieves better load balance in the Reduce phase by skewing the intermediate key distribution among the Reduce tasks depending on the types of nodes on which the reduce tasks run.

Table 4 summarizes the above job scheduling optimizations according to their scenarios and the optimization strategies used. We observe that data locality, speculative execution, and dynamic performance estimation are the most commonly used job scheduling optimization strategies for both shared and heterogeneous MapReduce clusters.

Table 4 Summarization of job scheduling optimizations for shared and heterogeneous MapReduce

4.2 Optimizations of MapReduce Programming Model

The simplicity of Google’s MapReduce programming model makes it quite prevalent in the area of big data processing. However, because of the limited operators and fixed workflow of the original MapReduce, programmers have to define a series of Map and Reduce functions to implement complex computational logic. In this section, we review flexibility optimizations that extend the MapReduce programming model and discuss workflow improvements for iterative computing.

4.2.1 Extension of MapReduce Programming Model

The Map–Reduce–Merge framework [46] extends the original MapReduce model with a Merge phase to support relational algebra and to implement several join algorithms for multiple heterogeneous datasets. This framework uses a coordinator to manage two sets of mappers and reducers. After these tasks are completed, the coordinator launches a set of mergers that read the outputs from selected reducers and merge them with user-defined logic. The programming model used in the Map–Reduce–Merge framework is shown below, in which \(\upalpha \), \(\upbeta \), and \(\upgamma \) are dataset lineages, k is a key, and v represents the value entities.

$$\begin{aligned}&\hbox {Map}: \langle k_{1}, v_{1}\rangle _{\upalpha }\rightarrow [\langle k_{2}, v_{2}\rangle ]_{\upalpha }\\&\hbox {Reduce}: \langle k_{2}, [v_{2}]\rangle _{\upalpha }\rightarrow \langle k_{2}, [v_{3}]\rangle _{\upalpha }\\&\hbox {Merge}: \langle \langle k_{2}, [v_{3}]\rangle _{ \upalpha }, \langle k_{3}, [v_{4}]\rangle _{ \upbeta }\rangle \rightarrow [\langle k_{4}, v_{5}\rangle ]_{\upgamma } \end{aligned}$$
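To make the three signatures concrete, the following Python sketch runs Map and Reduce separately on two lineages and then merges the reduced outputs. It is only an illustration: the helper names `run_map_reduce` and `merge` are hypothetical, and the nested loop over all pairs of reduced records stands in for the reducer selection and matching logic that the framework in [46] would provide.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Map: (k1, v1) -> [(k2, v2)]; Reduce: (k2, [v2]) -> (k2, [v3])."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def merge(alpha_out, beta_out, merge_fn):
    """Merge: ((k2, [v3]), (k3, [v4])) -> [(k4, v5)]."""
    result = []
    for k2, v3s in alpha_out.items():
        for k3, v4s in beta_out.items():
            result.extend(merge_fn((k2, v3s), (k3, v4s)))
    return result


# Lineage alpha: (employee id, (department, bonus)); lineage beta: (department, name).
employees = [(1, ("d1", 10)), (2, ("d1", 5)), (3, ("d2", 7))]
departments = [("d1", "Sales"), ("d2", "Ops")]

alpha = run_map_reduce(employees,
                       lambda k, v: [(v[0], v[1])],
                       lambda k, vs: [sum(vs)])
beta = run_map_reduce(departments,
                      lambda k, v: [(k, v)],
                      lambda k, vs: vs)


def join_on_department(left, right):
    # User-defined merge logic: an equi-join on the department key.
    (k2, totals), (k3, names) = left, right
    return [(name, total) for name in names for total in totals] if k2 == k3 else []


bonus_by_department = merge(alpha, beta, join_on_department)
```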

Map–Join–Reduce [47] extends the original MapReduce model to meet the demand for efficiently joining multiple datasets. In addition to the mapper and reducer, the authors of [47] developed a novel filtering-join-aggregation model that adds a third function, the joiner, to a MapReduce job. They also introduced a one-to-many shuffling strategy that efficiently shuffles each intermediate key–value pair generated by the mapper to many joiners at one time.

Tuple MapReduce [48] is a model that extends MapReduce to improve parallel data processing tasks that use compound records, optional in-reduce ordering, or inter-source datatype joins. Instead of processing key–value pairs as the original MapReduce does, Tuple MapReduce processes raw n-sized tuples and includes a group-by clause before the reduce phase. The Map function in the Tuple MapReduce model takes as its input a tuple of types \((i_{1}, \ldots , i_{m})\) and produces as its output a list of other tuples, list\((v_{1}, \ldots , v_{n})\), where the first g fields of each tuple are used for grouping and the first s fields are used for sorting. The Reduce function then takes as its input a tuple of size g, \((v_{1}, \ldots , v_{g})\), and a list of tuples list\((v_{1}, \ldots , v_{n})\). Finally, a list of tuples list\((k_{1}, \ldots , k_{l})\) is produced by the Reduce function.
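The grouping and sorting semantics can be illustrated with a short Python sketch. The function `tuple_map_reduce` is a hypothetical name, and the sketch assumes that the group-by fields are a prefix of the sort-by fields (s ≥ g), so that sorting on the first s fields leaves each group contiguous and internally ordered.

```python
from itertools import groupby


def tuple_map_reduce(inputs, map_fn, reduce_fn, g, s):
    """Group map output by its first g fields, sorted by its first s fields."""
    assert s >= g, "assumption of this sketch: group-by is a prefix of sort-by"
    tuples = [t for record in inputs for t in map_fn(record)]
    tuples.sort(key=lambda t: t[:s])
    output = []
    for group_key, group in groupby(tuples, key=lambda t: t[:g]):
        output.extend(reduce_fn(group_key, list(group)))
    return output


# (user, timestamp, url) visits: group by user, order each group by timestamp,
# and emit the first URL that each user visited.
visits = [("u2", 5, "/b"), ("u1", 9, "/c"), ("u1", 3, "/a")]
first_visit = tuple_map_reduce(
    visits,
    map_fn=lambda record: [record],
    reduce_fn=lambda key, group: [(key[0], group[0][2])],
    g=1, s=2,
)
```

With plain key–value pairs, the same job would need a composite key and a custom partitioner to obtain the secondary sort, which is the boilerplate the tuple model aims to remove.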

Vu et al. [49] proposed cHadoop, which supports continuous MapReduce jobs. A carry operation is added to the reduce phase to re-inject the data generated as the output of the map phase to the next execution. In the cHadoop framework, two components, named the Continuous Job Tracker and the Continuous NameNode, were extended from the Job Tracker and the Name Node in Hadoop MapReduce, respectively, to manage and execute continuous jobs.

Another MapReduce model extension is XMR [50], which features a hierarchical reduce phase. In order to tune the performance, the authors of [50] studied the parameters used in MapReduce and set minimum, maximum, and average values for each parameter. They also developed a data redistribution algorithm to identify the high-performing nodes and to reorganize the HDFS file fragments according to the computing ratios. In this method, the number of map and reduce slots is effectively managed to decrease the execution time. A shuffle phase with a hybrid routing schedule was also developed to define the scheduler tasks and to decrease the level of memory management required.

4.2.2 Improvements of Iterative Computing

As mentioned above, the fixed workflow and HDFS-based intermediate data storage mechanism of the original Hadoop MapReduce cause serious performance degradation when a series of jobs is executed iteratively. The studies described below attempt to remove these bottlenecks of iterative computing.

HaLoop [51] is a modified version of Hadoop MapReduce designed to support iterative computing. In addition to several novel programming interfaces, HaLoop contains a new loop control module in the master node, which repeatedly starts MapReduce jobs until a fixed point has been reached. It also contains a novel task scheduler that takes advantage of data locality and provides a data caching and indexing mechanism to improve the iterative computing performance.
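The loop control idea can be sketched in a few lines of Python. This is not HaLoop's interface: `iterate_to_fixed_point` and `propagate_min_labels` are hypothetical names, one call of `step` stands in for one MapReduce job, and the loop-invariant edge list held in the closure plays the role of the cached data.

```python
def iterate_to_fixed_point(step, state, max_iters=100):
    """Re-run step until the state stops changing; return (state, iterations)."""
    for iteration in range(1, max_iters + 1):
        new_state = step(state)
        if new_state == state:           # fixed point reached
            return state, iteration
        state = new_state
    return state, max_iters


def propagate_min_labels(edges):
    """Build one label-propagation step; edges are loop-invariant."""
    def step(labels):
        new_labels = dict(labels)
        for a, b in edges:
            smallest = min(labels[a], labels[b])
            new_labels[a] = min(new_labels[a], smallest)
            new_labels[b] = min(new_labels[b], smallest)
        return new_labels
    return step


edges = [(1, 2), (2, 3), (4, 5)]
initial = {node: node for edge in edges for node in edge}
components, iterations = iterate_to_fixed_point(propagate_min_labels(edges), initial)
```

Here the labels converge to the smallest node identifier in each connected component, and the driver stops as soon as an iteration produces no change.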

In order to overcome global synchronization overheads in large-scale iterative computing contexts, Kambatla et al. [52] extended the MapReduce model by adopting locality-enhanced partitioning, partial synchronization, and eager scheduling optimization techniques.

iHadoop [53] schedules iterations asynchronously and connects the output of one iteration to the next, allowing both to process data concurrently. In addition, iHadoop checks the termination condition in the background, concurrently with the asynchronous iterations.

Pipelined-MapReduce [54] improves the MapReduce programming model by allowing direct data transfer through a pipeline between Map and Reduce tasks. Instead of determining intermediate data transfers through Task Trackers, the mappers in Pipelined-MapReduce determine which reduce tasks the intermediate data should be sent to, and send the data through a TCP socket directly.

MapCombine [55] contains a novel controller component to schedule the iterations efficiently and avoid re-initializing the runtime environment. The combine phase of the original MapReduce model is modified so that the static data can be cached to balance the workload of a computing node and solve the problem of data reloading. Moreover, MapCombine contains an interaction layer that is responsible for fault tolerance, downtime recovery, and communication between the controller and the combiners.

The iMapReduce framework [56] is another distributed computing framework for implementing iterative algorithms. This system follows a persistent task strategy, and the entire iterative iMapReduce process is implemented in one single job, with all Map and Reduce tasks continuing to be executed until the master has checked that the termination conditions are satisfied. iMapReduce uses a one-to-one socket connection to directly pass status data from the Reduce task to the Map task when starting the next iteration, to avoid the unnecessary shuffling of data in the iterative computing system. Map tasks in iMapReduce can also be asynchronously executed because the Reduce tasks immediately send the data they have produced through the persistent socket connection.

Twister4Azure [58] is a distributed decentralized iterative MapReduce runtime for the Windows Azure Cloud. It adds a Merge task after the reduce phase, which takes all of the Reduce outputs and the broadcast data for the current iteration as inputs and determines whether the loop terminates. To avoid unnecessary static data reloads and transfers between iterations, Twister4Azure supports three types of data caching approaches: instance-storage-based caching, direct in-memory caching, and memory-mapped-file-based caching. The authors also developed a cache-aware scheduling algorithm to schedule new MapReduce jobs through Azure queues.

It is noteworthy that Spark [65], an emerging distributed in-memory computing framework, improves the efficiency of iterative computing by building on a high-level abstraction called the Resilient Distributed Dataset (RDD) [57]. Moreover, Spark provides more than 80 operators, such as map, filter, reduceByKey, join, reduce, and collect, and classifies them as Transformations or Actions. Instead of storing intermediate data in distributed file systems, the Spark runtime efficiently manages the lineage of read-only RDDs in memory and, following a lazy evaluation strategy, allows several transformations to be pipelined.
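The distinction between Transformations and Actions can be illustrated without a Spark installation. The class below is a toy, single-machine stand-in written for this illustration, not Spark's API: `LazyDataset` is a hypothetical name, transformations only extend a recorded lineage, and the `collect` action applies the whole lineage to each element in one pipelined pass.

```python
class LazyDataset:
    """Immutable dataset description: a source plus a lineage of transformations."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = tuple(lineage)

    def map(self, fn):
        # Transformation: nothing is computed, the lineage just grows.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: every element flows through all recorded steps in one pass,
        # so no intermediate dataset is materialized.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []


def square(x):
    calls.append(x)      # records when the function actually runs
    return x * x


even_squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = list(calls)
collected = even_squares.collect()
```

Because the lineage is retained, a lost result could be recomputed from the source by replaying the same steps, which is the intuition behind lineage-based recovery.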

Spark has since become a top-level Apache project. Programmers can write Spark applications in Java, Scala, or Python and deploy them on Hadoop, Mesos, or in the cloud to access diverse data sources such as HDFS, Cassandra, HBase, and S3. Spark also provides high-level tools such as Spark SQL, GraphX, and Spark Streaming for SQL query integration, graph processing, and stream processing of big data, respectively.

Table 5 summarizes the MapReduce programming model extensions and the iterative computing optimizations.

Table 5 Summarization of optimizations of MapReduce Programming Model

4.3 Real-Time Support

Since the original MapReduce model is coupled with a distributed file system, such as GFS or HDFS, it lacks an efficient support mechanism for real-time processing. In recent years, several solutions that take high-speed streams as their input data source have been released to address the real-time processing limitations of the original MapReduce.

The Hadoop Online Prototype (HOP) [59] extended the original MapReduce to support long-running jobs through online aggregation of, and continuous queries on, streaming data. Within a job, HOP directly pipelines the output of the Map function to the Reduce task using a TCP socket when the intermediate data buffer of the Map function reaches a threshold size. Similarly, the output of a job in HOP can be directly pipelined to the Map task of the next job. Using this pipelining strategy and the job progress metric developed, HOP gives an approximate answer in online aggregation and continuous query situations.
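The threshold-triggered pipelining can be sketched as follows. This is a simplified, single-process illustration: `PipelinedMapOutput` is a hypothetical class, a plain callback replaces the TCP socket, and the running word count on the receiving side plays the role of the approximate answer available before the map side has finished.

```python
from collections import Counter


class PipelinedMapOutput:
    """Buffers map output and pushes it downstream once a threshold is reached."""

    def __init__(self, threshold, send):
        self.threshold = threshold
        self.send = send              # stand-in for a socket to the reducer
        self.buffer = []

    def emit(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()


counts = Counter()
batch_sizes = []


def receive(batch):
    # Reducer side: fold each pipelined batch into a running aggregate.
    batch_sizes.append(len(batch))
    for key, value in batch:
        counts[key] += value


output = PipelinedMapOutput(threshold=2, send=receive)
for word in "a b a c a".split():
    output.emit(word, 1)

snapshot = dict(counts)   # early estimate: the last record is still buffered
output.flush()            # end of map task: push the remainder
```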

Böse et al. [60] developed an online MapReduce framework using shared-memory architecture, to allow the efficient mining of large data streams. In this framework, the Map tasks read input data from a file or from data streams and generate output pairs periodically and as events. A generated pair is assigned to the input queue of the corresponding reducer using a hash map mechanism. A collector component is utilized to compute the preliminary or final result used in convergence estimations and there is a visualization option.

Through an experimental evaluation of Hadoop MapReduce in the Amazon EC2 cloud, Phan et al. [61] identified key factors that affect real-time scheduling in MapReduce and formulated the problem as a constraint satisfaction model. Based on this model, the authors of [61] developed an enhanced MapReduce execution model and a range of heuristic techniques for online scheduling using hierarchical scheduling, real-time virtual machines, and probabilistic models.

MiscoRT [62] is a mobile MapReduce framework that was developed to support the execution of distributed applications with real-time response requirements on smart phone networks. It extends the Misco system with two levels of schedulers, called the Application Scheduler and the Task Scheduler. The Application Scheduler determines the order in which the applications are run, based on the urgencies and time constraints that are applicable, and it estimates the execution times using an analytical model. The Task Scheduler schedules tasks dynamically and uses the measured laxity values of the tasks to adjust the scheduling order. Additionally, node failures are also considered in both components.

Peng et al. [63] extended the original MapReduce model to support real-time analytics processing. In their proposed framework, the shuffle and sort phases of MapReduce are removed. A timestamp is integrated into the key part of the key–value input model to meet the requirements of real-time data streams. The authors used a Chord based on the JOL rule language to manage the input data stream. Cassandra was chosen for persistent key–value storage in the system.

RTMR [64] is a MapReduce-based large-scale data processing approach for high-speed data streams. RTMR pre-processes historical data to generate intermediate results that are distributed (cached) to the local disk of each working node according to the hash results for the key, to decrease the overheads involved in repeated data loading and computing. Based on this, the map phase in RTMR distributes the input data stream to the appropriate worker node and uses a local pipeline strategy, which is controlled by the system parameters and thread pools to optimize the CPU utilization rate, to transfer the intermediate data asynchronously between the Map and Reduce tasks. The authors also modified the in-memory and disk storage data structures and improved the intermediate results read/write strategy using the overhead estimation and replacement algorithm they developed, to optimize the access performance of local intermediate data.

On the other hand, to optimize the efficiency of incremental processing, that is, maintaining a very large repository of Web documents while processing small updates concurrently, Peng et al. [44] from Google designed the Percolator system, which provides two main abstractions: ACID transactions over a random-access repository, and observers, a way to organize the workflow of incremental computation. The structure of the Percolator system in the cluster consists of three components: a Percolator worker, a BigTable tablet server, and a GFS chunk server. During computing, all observers run in the Percolator worker, which scans BigTable for changed columns and invokes the corresponding observers as function calls in the worker process to perform transactions by sending read/write RPCs to BigTable tablet servers. Moreover, two services, the timestamp oracle and the lightweight lock service, are integrated in Percolator to implement snapshot isolation and to improve the efficiency of notifications.

Similarly, in order to provide a low-latency querying service, as Google’s Zeitgeist system does, Akidau et al. [107] designed another programming model and distributed framework called MillWheel, which is capable of scalable and fault-tolerant stream processing. The basic workflow of MillWheel can be seen as a directed compute graph consisting of several user-defined transformations on input data. Each of these transformations, which are also called computations, can be parallelized across an arbitrary number of machines based on (key, value, timestamp) triples. Specifically, since data arrival time does not strictly correspond to its generation time in a distributed system with inputs from all over the world, MillWheel includes a timestamp-based Low Watermark mechanism to distinguish the latency and completeness of incoming data.
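One way to picture a low watermark is as a bound on the timestamps that a computation may still receive. The sketch below assumes a recursive definition that is not spelled out in the text above: the low watermark of a computation is the minimum of the timestamp of its own oldest pending work and the low watermarks of all computations that feed it. The function name and the example graph are hypothetical.

```python
def low_watermark(node, oldest_work, upstream, _memo=None):
    """Assumed definition: min(own oldest pending timestamp, producers' watermarks).

    oldest_work maps each computation to the timestamp of its oldest
    unfinished record; upstream maps a computation to its producers.
    The compute graph is assumed to be acyclic.
    """
    memo = {} if _memo is None else _memo
    if node not in memo:
        candidates = [oldest_work[node]]
        candidates += [low_watermark(producer, oldest_work, upstream, memo)
                       for producer in upstream.get(node, [])]
        memo[node] = min(candidates)
    return memo[node]


# ingest -> parse -> window
upstream = {"parse": ["ingest"], "window": ["parse"]}
oldest_work = {"ingest": 105, "parse": 100, "window": 120}
window_watermark = low_watermark("window", oldest_work, upstream)
```

In this example the windowing computation cannot treat data older than timestamp 100 as complete, because the parser still holds a record with that timestamp.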

4.4 Hardware Acceleration

As discussed in Sect. 3, the original MapReduce is designed for distributed computing in a cluster of commodity computers, so the computing capabilities of hardware such as multi-core CPUs and GPUs are not adequately utilized. This section reviews optimization approaches that extend the original MapReduce to various hardware architectures. We classify them into two categories according to their characteristics: processor-level and cluster-based hardware acceleration.

4.4.1 Processor-Level Hardware Acceleration

Employing the computing capabilities of multi-core CPUs is the most common approach to parallel computing. Some works, such as [21, 67] and [73], implemented Google’s MapReduce programming model in a multi-core CPU context.

In addition to the features offered by Phoenix [21], MATE [67] provides a generalized reduction API for the MapReduce model. Both Map and Reduce phases were combined into a Reduction function that processes split data, defined by the splitter function, and updates the intermediate results to give reduction objects. A Combination function was also developed to combine the final results from multiple copies of the reduction objects into one object. The MATE runtime, which has the same scheduling strategy and fault tolerance mechanism as Phoenix, has two types of temporary buffers, a reduction-object buffer and a combination buffer, which allocate each thread using different splits and store the intermediate output results from each stage.
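The generalized reduction pattern can be sketched in Python as follows. The function names mirror the roles named above (splitter, Reduction, Combination) but are otherwise hypothetical, a `Counter` stands in for the reduction object, and a thread pool replaces the MATE runtime, so this shows only the data flow and not the performance characteristics.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def splitter(data, parts):
    """Cut the input into at most `parts` contiguous splits."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def reduction(split):
    """Fused map and reduce: update a private reduction object in place."""
    reduction_object = Counter()
    for word in split:
        reduction_object[word] += 1      # no intermediate (key, value) list
    return reduction_object


def combination(reduction_objects):
    """Merge the per-thread copies of the reduction object into one."""
    final = Counter()
    for obj in reduction_objects:
        final.update(obj)
    return final


def run(data, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return combination(pool.map(reduction, splitter(data, workers)))


word_counts = run("a b a c b a".split(), workers=3)
```

Because each worker updates its own copy, no locking is needed until the final combination step.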

Tiled-MapReduce [73] is an extended MapReduce model for shared-memory multi-core platforms that is based on a tiling strategy. It decomposes a large MapReduce job into a number of independent small subjobs that are then processed iteratively, one at a time. In so doing, the execution workflow of the general MapReduce model is modified to give four phases: Map, Combine, Reduce, and Merge. The Map and Combine phases within a subjob are iteratively executed to generate partial results. The Reduce phase is then used to process the partial results of the iterations rather than the intermediate data. An Iteration Window approach is used in the runtime of the Tiled-MapReduce model to exploit the cache hierarchy in a multi-core platform efficiently. Optimized systems were also developed to improve the memory, cache, and CPU efficiency.
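The tiling strategy can be sketched as a loop over fixed-size tiles. This is a sequential illustration with hypothetical names: each tile runs Map and Combine and keeps only one partial result per key, and Reduce later sees those partial results instead of the raw intermediate pairs. The final Merge phase is reduced here to building the output dictionary.

```python
from collections import defaultdict


def tiled_map_reduce(data, tile_size, map_fn, combine_fn, reduce_fn):
    partials = defaultdict(list)
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]          # one subjob
        local = defaultdict(list)
        for record in tile:                           # Map within the tile
            for key, value in map_fn(record):
                local[key].append(value)
        for key, values in local.items():             # Combine within the tile
            partials[key].append(combine_fn(key, values))
    # Reduce over the per-tile partial results only.
    return {key: reduce_fn(key, values) for key, values in partials.items()}


lines = ["a b", "a", "b c", "a"]
tiled_counts = tiled_map_reduce(
    lines,
    tile_size=2,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    combine_fn=lambda key, values: sum(values),
    reduce_fn=lambda key, values: sum(values),
)
```

The intermediate data held at any time is bounded by the tile size, which is what allows a tile's working set to stay within the cache.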

Compared with multi-core CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth. Several GPU-based MapReduce implementations and optimizations have been proposed to take full advantage of the performance of GPUs.

For example, Stuart et al. [68] developed a multi-GPU MapReduce implementation for efficient volume rendering. The workflow in this system takes advantage of NVIDIA GPUs and consists of Map, Partition, Sort, and Reduce stages, with the partial reduce and combine phases omitted. To overcome the I/O transfer bottleneck, the runtime handles volume data in a streaming manner rather than by storing the intermediate key–value pairs and final values on a disk.

StreamMR [70] is an OpenCL-based MapReduce framework for AMD GPUs. Since atomic operations can cause severe performance degradation in AMD GPUs, StreamMR uses an atomic-free mechanism that maintains a separate output buffer for each workgroup in global memory to allow efficient output and pre-processing to be achieved. StreamMR also maintains a hash table for each wavefront, to group the intermediate results.

Chen et al. [71] developed an accelerated GPU-based MapReduce implementation. To avoid storing intermediate data in the memory in the device, they used a reduction-based approach that encapsulates intermediate key–value pairs into reduction objects and performs the reduction process in the shared GPU memory. The size of the shared memory space is often insufficient for this to be achieved, so a memory hierarchy was developed for storing the reduction objects on both the shared memory and the memory of the device. To balance the memory overhead and the locking costs, the authors further developed the multi-group scheme that partitions each thread block into multiple groups, so that each group has its own copy of the reduction object.

Grex [72] is a MapReduce framework designed to allow general purpose GPUs to be used for parallel data processing. Its workflow consists of five execution stages, parallel split, map, boundary merge, group, and reduce. In the first parallel split stage, instead of splitting the input data sequentially, Grex assigns one unit of data to each thread and uses the parallel prefix sum algorithm to compute the token for each key in parallel, based on the temporary data structure KeyRecord developed. To overcome problems with data skewing and load balancing, the split data are distributed to the subsequent Map tasks in parallel, and each working thread computes a unique address to write the intermediate key–value pairs to, so that the use of locks and atomic operations to synchronize the thread with other threads can be avoided. Grex assigns an equal number of intermediate data to reduce tasks, again to overcome problems with data skewing and load balancing. It uses a lazy emit strategy to decrease the overhead involved in handling intermediate data in terms of computation and global memory usage, and uses shared memory, texture cache, and constant memory in the GPU memory hierarchy to decrease delays accessing memory.

Besides the above approaches, some works implement MapReduce model in coupled CPU and GPU architecture. For instance, MapCG [66] allows source code level portability between a CPU and a GPU. It consists of a MapReduce-based high-level programming language and a runtime for specific hardware architectures. Intermediate key–value pairs in MapCG are copied values and are never sorted, unlike what happens in Phoenix [21]. Two lightweight memory allocators on the CPU and GPU were developed to deal with the performance bottleneck in Phoenix caused by massive requests for malloc() functions. A hash table structure in MapCG groups the key–value pairs on the GPU.

Mars [69] is a MapReduce framework that can run on NVIDIA GPUs, AMD GPUs, multi-core CPUs, and Hadoop-based distributed systems. The Mars workflow consists of three loosely coupled stages: Map, Group, and Reduce. The Group stage is designed to group the Map outputs by key. The Mars scheduler, which runs on the CPU, schedules Map and Reduce tasks to GPU threads. Because the sizes of the outputs from the Map and Reduce stages are unknown, conflicts may occur when multiple threads each try to write their results to the shared output array, so a lock-free scheme was developed and applied to both the Map and Reduce stages. The Map–Reduce–Count step in this scheme computes the number of intermediate results, the total sizes of the intermediate keys, and the corresponding values. Based on this, a prefix sum and an array of writing locations are generated to represent the location in the output array for each thread. A Rapid Group and a skew handling scheme were also developed, to optimize the sort operation in the Group stage and the workload distribution in the Reduce stage, respectively. Furthermore, Mars enables GPU-accelerated code to be integrated into distributed environments, such as Hadoop, with little effort.
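The two-pass, lock-free output scheme can be sketched as follows. This is a CPU-side illustration in Python with hypothetical names: a sequential loop stands in for the GPU threads, and only the number of output records is counted, whereas the scheme described above also tracks key and value sizes.

```python
from itertools import accumulate


def lock_free_map(inputs, count_fn, map_fn):
    """Count pass, exclusive prefix sum, then a write pass into disjoint slices."""
    # Pass 1 (count): each "thread" reports how many records it will emit.
    counts = [count_fn(item) for item in inputs]
    # Exclusive prefix sum: the starting offset of each thread's output slice.
    offsets = list(accumulate(counts, initial=0))[:-1]
    output = [None] * sum(counts)
    # Pass 2 (write): slices never overlap, so no locks or atomics are needed.
    for index, item in enumerate(inputs):
        for position, record in enumerate(map_fn(item)):
            output[offsets[index] + position] = record
    return output, offsets


documents = ["a b", "", "c d e"]
pairs, write_offsets = lock_free_map(
    documents,
    count_fn=lambda text: len(text.split()),
    map_fn=lambda text: [(word, 1) for word in text.split()],
)
```

The cost of this design is that the map logic effectively runs twice, once to count and once to write, in exchange for contention-free writes to a single preallocated array.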

The Cell processor is another hardware architecture used for MapReduce optimization. A MapReduce runtime for the Cell/BE architecture was developed in [74]. Unlike the original MapReduce model, this runtime consists of five stages: Map, Partition, Quick-sort, Merge–sort, and Reduce. The Map stage executes the user-specified Map function in the SPEs, and produces a set of keys containing pointers to the corresponding values. The Partition, Quick-sort, and Merge–sort stages then group identical keys together to produce a set of partitions sorted by the keys, on both the PPE and the SPEs. Finally, the Reduce stage applies the user-defined Reduce function on the SPEs to produce one logical output array of key–value pairs. In this implementation, the whole computational procedure in the runtime is designed to execute in 20 threads, with one PPE main thread spawning all of the other threads, eight SPE worker threads performing the five stages described above, eight PPE scheduler threads notifying the appropriate SPE when a work unit is ready for prefetching, two PPE worker threads performing the input data partitioning and final merge–sort processes, and one PPE event thread responding to output buffer memory allocation requests and controlling the execution flow.

Rafique et al. [75] developed a MapReduce system for asymmetric clusters of asymmetric multi-core processors and general-purpose processors. A general-purpose server with multi-core x86 processors and a large amount of memory was designed to be the manager of their proposed architecture, and this server is responsible for the dynamic scheduling of jobs, the distribution of data, and allocating work to the Cell-based drivers or compute nodes, such as SONY PS3 and IBM QS20 systems equipped with the cell processors. The authors also developed a streaming approach for splitting data into work units, to fit the in-cores in the computing nodes, using various optimization techniques, such as prefetching, double buffering, and asynchronous I/O.

The work in [76] proposed another MapReduce runtime system based on Cell architecture, and this was derived from the system developed by [74]. The modified execution scheme consists of four stages, Map, quick-sort, merge–sort, and Reduce, with partitioning being performed implicitly in the Map stage. The modified runtime uses the main thread and the scheduler thread in an event-driven controller to achieve task initiation, scheduling, and workflow control on the PPE instead of using 12 threads. The performance and scalability of the original system were also improved in several additional ways, including adding implicit partitioning, quick sorting, memory management, and execution schemes.

In order to take advantage of the computational performance of the Intel Xeon Phi coprocessor, Lu et al. [108] presented a MapReduce optimization named MRPhi, which shares a similar idea with Phoenix++. Within this system, two techniques from Phoenix++ are adopted: efficient combiners and different container structures. As the Xeon Phi features wide 512-bit VPUs on each core, which double the vector width compared with other Intel Xeon CPUs, MRPhi implements the map phase in a vectorization-friendly way to assist auto-vectorization and take advantage of the VPUs in the Xeon Phi. In addition, since auto-vectorization often fails because of complex logic, Lu et al. employ SIMD parallelism to implement hash computation manually. To better utilize the hyper-threading capability of the Xeon Phi and improve resource utilization, the user-defined map and reduce phases are pipelined using MIMD threads. Furthermore, to deal with the issue of large local arrays, Lu et al. employ low-overhead atomic operations on the global array, and the cache efficiency can be improved as well because of the coherent L2 caches with ring interconnection.

4.4.2 Clusters-Based Hardware Acceleration

Hadoop On the Grid (HOG) [77] extended Hadoop MapReduce to execute on the Open Science Grid in the United States. HOG consists of a grid submission and execution component, an across-grid HDFS, and an across-grid MapReduce. The grid submission and execution component, which is based on Condor and GlideinWMS, is used to manage the submission and execution of the Hadoop worker nodes and to allocate the nodes at remote sites. The master server in the HOG system resides on a stable central server. Failures on the grid are common, so the time between heartbeat messages is shorter in this system than in other systems, and the rack awareness strategy of HDFS was extended to become site awareness in HOG. The MapReduce job tracker is deployed on the central server, and its communication with task trackers is based on HTTP over the WAN, to make it suitable for the grid context.

Heintz et al. [78] developed a cross-phase MapReduce system to overcome the limitations of Hadoop MapReduce when processing large-scale distributed data and using large amounts of computational resources. They proposed a Map-aware Push technique to hide latency and enable dynamic feedback between the push and map phases. The better performing nodes and those with faster links can process more data in the runtime using this method than using other methods. To eliminate the impacts of mapper–reducer link bottlenecks, the authors also developed a Shuffle-aware Map approach that includes a shuffle-aware scheduler to feed back the cost of a downstream shuffle into the map process, to allow the map phase to be altered in an appropriate way.

G-Hadoop is another MapReduce framework that allows large-scale distributed computing across multiple clusters [79]. This system replaces the native HDFS with the Gfarm globally distributed file system, and uses the Torque Resource Manager as the distributed resource management system for each High End Computing cluster, to allow datasets to be shared across multiple sites. The G-Hadoop architecture is also based on the master–slave model. The master node, which is installed at a central organization, is responsible for job splitting, task assignment, and metadata management, and consists of a Metadata server from the Gfarm system and a modified JobTracker based on the data-aware scheduling system. The slave node, which is sent to each participating cluster, is designed to perform the TaskTracker, JobTracker, I/O Server, and Network share functions. A novel job execution flow, based on the system just described, was also developed for G-Hadoop.

To implement efficient data-intensive computing in distributed clusters, Mantha et al. [80] proposed Pilot-MapReduce, an extension of the MapReduce runtime framework that is based on the Pilot abstraction model and separates resource management from the MapReduce application in general-purpose distributed infrastructures. The architecture of the Pilot-MapReduce framework consists of the MapReduce-Manager, the Pilot API, and several Data Pilots and Compute Pilots on different resources. The MapReduce-Manager is responsible for orchestrating the resource pool and managing the entire MapReduce computing process. The Pilot API serves as an abstraction for compute and data resources and manages the Data Units and Compute Units in which the Map and Reduce tasks run. In addition, Pilot-MapReduce supports three distributed MapReduce topologies: local, distributed, and hierarchical.

Instead of adopting the master–slave architecture of the original MapReduce, P2P-MapReduce [81] is a peer-to-peer MapReduce framework that provides a more reliable solution for managing node churn, master failure, and job recovery in dynamic cloud infrastructures. Master nodes and slave nodes in P2P-MapReduce form logical peer-to-peer networks in which master nodes are responsible for receiving job requests from user nodes, managing and executing recovery jobs, and coordinating masters and slaves. Using the proposed Task Managers, each slave executes the tasks assigned to it according to the lowest workload strategy.
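The lowest workload strategy can be illustrated with a minimal sketch. The workload metric (a count of pending tasks), the node names, and the function `assign_task` are assumptions made for illustration and are not taken from P2P-MapReduce itself.

```python
def assign_task(workloads, task_cost=1):
    """Pick the slave with the lowest current workload and charge it the task cost.

    `workloads` maps a (hypothetical) slave name to its pending workload.
    Ties are broken by dictionary order.
    """
    slave = min(workloads, key=workloads.get)
    workloads[slave] += task_cost
    return slave


# Hypothetical cluster state: pending tasks per slave.
workloads = {"slave-a": 3, "slave-b": 1, "slave-c": 2}
assignments = [assign_task(workloads) for _ in range(3)]
print(assignments, workloads)
```

After three assignments the pending work is evened out across the three slaves, which is the intended effect of a least-loaded policy.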

Table 6 shows the corresponding hardware architecture used for improving the efficiency of original MapReduce.

Table 6 Summarization of hardware-based acceleration for MapReduce

4.5 Performance Tuning of MapReduce

Performance tuning of practical MapReduce clusters is challenging because of the large number of computing nodes and the complexity of the configuration parameters. Hence, several works have built simulation environments for MapReduce performance modelling and optimization.

MRPerf [82] is a sub-phase level simulation tool for modeling the performance of MapReduce applications on large clusters. Taking a user-defined cluster topology specification as its input, MRPerf simulates the inter- and intra-rack network communications performed by MapReduce, relying on the ns-2 network simulator. Similarly, from a user-defined application job specification, MRPerf derives the computation time of each data-size-dependent sub-phase within a Map task using cycles-per-byte parameters. MRPerf also simulates disk I/O time using a disk simulator and the data layout files.
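A compute cost expressed in CPU cycles per byte converts to time as bytes × cycles per byte ÷ clock frequency. The sketch below applies that arithmetic to a Map task split into sub-phases. The sub-phase names, the cost values, and the function names are hypothetical and are not MRPerf's actual parameters or interface.

```python
def subphase_seconds(input_bytes, cycles_per_byte, cpu_hz):
    """Time for one data-size-dependent sub-phase on a core clocked at cpu_hz."""
    return input_bytes * cycles_per_byte / cpu_hz


def map_task_seconds(input_bytes, subphases, cpu_hz):
    """Sum the sub-phase times, assuming they run back to back on one core."""
    return sum(subphase_seconds(input_bytes, cpb, cpu_hz)
               for cpb in subphases.values())


# Hypothetical cycles-per-byte costs for three sub-phases of a Map task.
subphases = {"read": 10, "map": 40, "sort": 30}
# One 64 MiB input split processed on a 2 GHz core.
t = map_task_seconds(64 * 2**20, subphases, 2e9)
print(round(t, 3), "seconds")
```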

MRSim [83] uses GridSim to simulate the network topology and communications, and uses the SimJava discrete event engine to model the other components of the system. MRSim takes a cluster topology file and a job specification file as its input, and provides simulation services for shared multi-core CPUs, hard disk drives, network traffic and memory buffers, merge parameters, parallel copy and sort parameters, combiners, and other components.

Kolberg et al. [88] developed the MRSG simulator for the MapReduce environment on top of the SimGrid simulation toolkit. MRSG provides a speculative execution mechanism and a data replication function, as well as several API functions that allow a user to define task costs and intermediate data.

Liu et al. [89] developed HSim, another MapReduce simulator for Hadoop applications. HSim models several groups of parameters (node parameters, cluster parameters, Hadoop system parameters, and HSim simulator parameters) to provide accurate simulation of dynamic Hadoop behaviour. In the HSim architecture, the Cluster Reader component reads the cluster parameters from the Cluster Spec to create a simulated Hadoop cluster environment. The Job Spec is then processed by the Job Reader component, and the jobs are submitted to HSim to be simulated. However, neither the combiner nor the load balancing mechanism is considered in the current versions of HSim.

SimMapReduce [85] is a GridSim-based simulator designed to manage resources and to evaluate the scheduling performance of MapReduce. Its architecture has four layers: SimJava, GridSim, SimMapReduce, and User Definition. The discrete event processes are modelled in SimJava, the basic system components are provided by GridSim, the different cluster configurations are simulated in SimMapReduce, and the multi-layer scheduling algorithms are supported using predefined XML configuration files in the User Definition layer.

On the other hand, some studies improve the performance of MapReduce clusters by tracing the runtime or optimizing the configuration parameters.

MR-Scope [84] is a real-time tracing tool for MapReduce. It is composed of three layers: the Hadoop layer, the RPC layer, and the Client layer. The Hadoop layer dynamically traces the node status in Hadoop using the heartbeat and observation point features. The RPC layer establishes communication between the client and Hadoop through a collection of protocols, and the internal RPC mechanism of Hadoop is modified to be non-blocking. The Client layer visualizes the distribution of the HDFS blocks and their replicas, and displays the ongoing processes in each running task from three different but interconnected perspectives: the HDFS perspective, the MapReduce task scheduling perspective, and the ongoing MapReduce distributed perspective.

Predator [86] is an experience-guided configuration optimizer for MapReduce. It is based on a Hadoop configuration model in which the parameters are divided into four groups according to the general configuration experience, information on the cluster CPUs and memory, the input information for a job, and the results of the Grid Hill Climbing (GHC) algorithm developed for this system. The GHC algorithm, which is based on an objective function developed to estimate the job execution time, divides each parameter range into equal subspaces, randomly generates sampling points within them, and searches for the optimized configuration parameters.
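The general idea of combining grid-based random sampling with hill climbing can be sketched as follows. This is a simplified illustration under stated assumptions and not Predator's GHC algorithm: the parameter names, the cost function `estimated_seconds`, the step size, and the choice of one sample per equal-width slice are all hypothetical.

```python
import random


def grid_hill_climb(cost, bounds, cells=4, step_frac=0.05, iters=200, seed=0):
    """Sample one random point per equal-width slice, then hill-climb from the best."""
    rng = random.Random(seed)
    starts = []
    for cell_idx in range(cells):
        point = {}
        for name, (lo, hi) in bounds.items():
            width = (hi - lo) / cells
            # Random value inside the cell_idx-th equal slice of this range.
            point[name] = rng.uniform(lo + cell_idx * width,
                                      lo + (cell_idx + 1) * width)
        starts.append(point)
    best = min(starts, key=cost)
    for _ in range(iters):
        name = rng.choice(list(bounds))
        lo, hi = bounds[name]
        candidate = dict(best)
        delta = rng.uniform(-1, 1) * step_frac * (hi - lo)
        candidate[name] = min(hi, max(lo, best[name] + delta))
        if cost(candidate) < cost(best):  # accept only improving moves
            best = candidate
    return best


# Hypothetical parameters and a hypothetical execution-time estimate.
bounds = {"sort_buffer_mb": (50.0, 400.0), "reduce_tasks": (1.0, 64.0)}


def estimated_seconds(cfg):
    return ((cfg["sort_buffer_mb"] - 200.0) ** 2 / 100.0
            + (cfg["reduce_tasks"] - 16.0) ** 2)


best = grid_hill_climb(estimated_seconds, bounds)
print(best, estimated_seconds(best))
```

Starting from several slices of the parameter space reduces the chance that the local search begins far from a good region, at the cost of a few extra evaluations of the objective function.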

To identify the relationships among workload characteristics, Hadoop configurations, and workload performance, and to predict the performance of a Hadoop cluster accurately, Yang et al. [87] proposed a statistical analysis approach for Hadoop MapReduce. Using principal component analysis, they first transform the critical metrics of the Hadoop workflow and the framework configuration parameters into a small set of independent principal components. They then use cluster analysis to identify groups of workloads with similar behaviour. Finally, they propose a regression model to predict the relative performance of workloads under different Hadoop configurations.

Vianna et al. [90] developed an analytical model for estimating MapReduce workload performance, focusing in particular on the intra-job pipeline parallelism between the Map and Reduce tasks. Based on a reference model designed to predict the performance of parallel computations, the authors represent the intra-job pipeline as a precedence tree. Approximate Mean Value Analysis is then applied to predict the mean job response time, throughput, and resource utilization of Hadoop.
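For readers unfamiliar with approximate Mean Value Analysis, the sketch below shows the standard Bard-Schweitzer fixed-point iteration for a flat closed queueing network. It is a generic textbook form, not the model of [90]: it omits the precedence tree entirely, and the service demands and station roles are hypothetical.

```python
def approximate_mva(demands, customers, think_time=0.0, tol=1e-9, max_iter=10000):
    """Bard-Schweitzer approximate MVA for a closed network of queueing stations.

    demands[k] is the total service demand (seconds) of one job at station k.
    Returns (throughput, mean response time, per-station utilization).
    """
    k = len(demands)
    queue = [customers / k] * k  # initial guess: jobs spread evenly
    for _ in range(max_iter):
        # Residence time: an arriving job sees roughly (N-1)/N of the mean queue.
        resid = [d * (1 + (customers - 1) / customers * q)
                 for d, q in zip(demands, queue)]
        throughput = customers / (think_time + sum(resid))
        new_queue = [throughput * r for r in resid]  # Little's law per station
        converged = max(abs(a - b) for a, b in zip(new_queue, queue)) < tol
        queue = new_queue
        if converged:
            break
    utilization = [throughput * d for d in demands]
    return throughput, sum(resid), utilization


# Hypothetical demands: 0.2 s of CPU and 0.3 s of disk per job, 20 concurrent jobs.
x, r, u = approximate_mva([0.2, 0.3], customers=20)
print(x, r, u)
```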

4.6 Security and Energy Optimization of MapReduce

Security and power saving are two key factors that owners of cloud computing centers must consider seriously. Many research works have been proposed to improve security and energy efficiency in distributed environments ranging from grid computing to cloud computing data centers. We review the typical solutions specific to MapReduce below.

4.6.1 Security Optimizations of MapReduce

In the cloud computing context, users often submit individual data to the clusters, but they do not know where and how the data are stored and processed, because the operational details inside the cloud are invisible to data contributors. Several studies have been proposed to enhance data privacy. Airavat [96] is a MapReduce-based distributed system that provides end-to-end confidentiality, integrity, and privacy guarantees for sensitive data in the cloud computing context. To prevent information leaks through system resources, it adds SELinux-like mandatory access control to the MapReduce distributed file system. Moreover, Airavat adopts a differential privacy strategy to ensure that the output of aggregate computations does not violate the privacy of individual inputs.
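A common way to obtain differential privacy for an aggregate is the Laplace mechanism: each input is clamped to a declared range and noise scaled to that range is added to the result. The following minimal sketch shows the mechanism for a sum. It is an illustration of the general technique only; the function names, the range, and the value of epsilon are assumptions and do not reproduce Airavat's implementation.

```python
import random


def clamp(value, lo, hi):
    """Force a value into the declared range [lo, hi]."""
    return max(lo, min(hi, value))


def private_sum(values, lo, hi, epsilon, rng):
    """Sum of clamped values plus Laplace noise calibrated to the range."""
    clamped_total = sum(clamp(v, lo, hi) for v in values)
    # Adding or removing one record changes the sum by at most this much.
    sensitivity = max(abs(lo), abs(hi))
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return clamped_total + noise


rng = random.Random(1)
values = [10, 50, 150]  # hypothetical per-user inputs; 150 is clamped to 100
noisy = private_sum(values, lo=0, hi=100, epsilon=1.0, rng=rng)
print(noisy)
```

A smaller epsilon gives stronger privacy and noisier answers, so the range declared for the inputs directly controls how much accuracy is lost.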

For a similar purpose, Guo et al. [99] extended MapReduce with a novel integration of access control via attribute-based encryption and privacy-preserving aggregate computation via homomorphic encryption. Within the proposed approach of [98], a transform module is integrated before the reduce phase to match identical keys in the ciphertext of the encrypted intermediate data.

Apart from the above solutions, several works extend the MapReduce model and runtime to improve security in various usage scenarios. For example, to protect private data efficiently in hybrid clouds, Han et al. [100] presented HMR, a multi-cluster MapReduce programming model based on a hierarchical control architecture. Within HMR, the control center isolates and places data among the private cloud and public clouds according to the privacy characteristics of the data. A Map-Reduce-GlobalReduce scheduling process is also designed to perform the corresponding distributed parallel computation correctly in multi-cluster mode.

To support secure computing with mixed-sensitivity data on hybrid clouds, Zhang et al. [101] proposed tagged-MapReduce, which extends the original key-value model with a sensitivity tag. They also designed several scheduling strategies that exploit properties of the map and reduce functions to rearrange the computation for greater efficiency under these constraints while maintaining MapReduce correctness.
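The idea of extending key-value pairs with a sensitivity tag can be illustrated with a small sketch. The record layout, the tag values, and the site names are hypothetical, and the sketch shows only one simple placement policy (sensitive data never leaves the private side), not the scheduling strategies of [101].

```python
from collections import defaultdict

SENSITIVE, PUBLIC = "sensitive", "public"


def tagged_map(record):
    """Emit (key, value, tag) triples; the tag follows the record's sensitivity."""
    tag = SENSITIVE if record["sensitive"] else PUBLIC
    yield record["key"], record["value"], tag


def run(records):
    # Partial reduce per site: sensitive triples are grouped on the private site.
    groups = defaultdict(list)
    for record in records:
        for key, value, tag in tagged_map(record):
            site = "private" if tag == SENSITIVE else "public"
            groups[(site, key)].append(value)
    partial = {site_key: sum(vals) for site_key, vals in groups.items()}
    # Final merge on the private side, which is allowed to see all partial sums.
    final = defaultdict(int)
    for (_, key), subtotal in partial.items():
        final[key] += subtotal
    return partial, dict(final)


records = [
    {"key": "flu", "value": 2, "sensitive": True},
    {"key": "flu", "value": 5, "sensitive": False},
    {"key": "cold", "value": 1, "sensitive": False},
]
partial, final = run(records)
print(partial, final)
```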

SecureMR [97] is a decentralized replication-based integrity verification scheme for MapReduce. It enhances Hadoop MapReduce with five logical security components: Secure Manager, Secure Scheduler, Secure Task Executor, Secure Committer, and Secure Verifier. Secure Manager and Secure Scheduler are deployed in the master node for task duplication, secure task assignment, and commitment-based consistency checking. Secure Task Executor, which runs in both mappers and reducers, prevents DoS and replay attacks that exploit fake or old task assignments. Secure Committer, deployed in mappers, generates commitments for the intermediate data and sends them to Secure Manager to complete the commitment-based consistency checking. Secure Verifier, running in a reducer, collaborates with Secure Manager to verify the intermediate results from mappers. During this process, two protocols, the Commitment protocol and the Verification protocol, secure the communication among these components.
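Commitment-based consistency checking over replicated tasks can be sketched as follows: each replica of a map task publishes a digest of its intermediate data, and the digests are compared. This is a minimal illustration that assumes an unsigned SHA-256 digest as the commitment; the function names are hypothetical and the sketch omits the signatures and protocols that SecureMR defines.

```python
import hashlib


def commitment(intermediate_pairs):
    """Order-independent SHA-256 digest of a mapper's intermediate key-value pairs."""
    digest = hashlib.sha256()
    for key, value in sorted(intermediate_pairs):
        digest.update(repr((key, value)).encode("utf-8"))
    return digest.hexdigest()


def consistent(commitments):
    """Replicas of the same task agree if all their commitments are identical."""
    return len(set(commitments.values())) == 1


honest = [("a", 1), ("b", 2)]
tampered = [("a", 1), ("b", 3)]

# Two honest replicas emit the same pairs, possibly in a different order.
agreeing = {"mapper-1": commitment(honest),
            "mapper-2": commitment(list(reversed(honest)))}
# One replica returns altered intermediate data.
disagreeing = {"mapper-1": commitment(honest),
               "mapper-2": commitment(tampered)}
print(consistent(agreeing), consistent(disagreeing))
```

Comparing short digests rather than the intermediate data itself keeps the checking cost on the master small, which is the motivation for a commitment-based design.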

Unlike the above solutions, Yoon et al. [95] designed a black-box approach, which neither modifies the original MapReduce operations nor introduces extra operations, to detect attacks launched by malicious or misconfigured nodes that may tamper with the ordinary functions of the MapReduce framework. To achieve this goal, the authors collected low-level system calls and traces of running Hadoop logs through dynamic instrumentation. They then identified a set of invariants that characterize the basic execution behaviour of both the Hadoop framework and the applications, and detected malicious nodes by matching the correlated Hadoop logs and system calls against these invariants.

4.6.2 Energy Optimizations for MapReduce

An effective way to reduce the energy consumption of MapReduce clusters is to selectively power down idle nodes. Following this idea, the covering subset scheme [91] keeps one replica of each block within a small subset of machines that remain fully powered to preserve data availability, while the other nodes are powered down during periods of low utilization. In contrast, the All-In strategy [92] uses all nodes to run the workload at full capacity and powers down the entire cluster during periods of inactivity.
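The defining property of a covering subset is that every block has at least one replica on a node that stays powered. The sketch below selects such a subset greedily from an existing replica placement. This is an assumption made for illustration: the scheme in [91] arranges block placement so that a covering subset exists, whereas the sketch computes one after the fact, and the node and block names are hypothetical.

```python
def covering_subset(block_locations):
    """Greedily pick nodes until every block has a replica on a chosen node.

    block_locations maps a block id to the list of nodes holding a replica.
    """
    uncovered = set(block_locations)
    nodes = {n for replicas in block_locations.values() for n in replicas}
    chosen = []
    while uncovered:
        def gain(node):
            # Number of still-uncovered blocks that this node would cover.
            return sum(node in block_locations[b] for b in uncovered)

        best = max(sorted(nodes), key=gain)
        if gain(best) == 0:
            raise ValueError("some block has no replica on any node")
        chosen.append(best)
        uncovered = {b for b in uncovered if best not in block_locations[b]}
    return chosen


# Hypothetical placement: three blocks, two replicas each, four nodes.
placement = {"b1": ["n1", "n2"], "b2": ["n2", "n3"], "b3": ["n3", "n4"]}
keep_on = covering_subset(placement)
print(keep_on)
```

Nodes outside the returned list can be powered down without making any block unavailable.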

Green HDFS [93] takes a different approach and logically divides the HDFS cluster into disjoint hot and cold zones. Frequently accessed data are placed in the hot zone, which is always powered. Green HDFS fills the cold zone using one powered-on machine at a time.

To improve energy efficiency for time-sensitive, interactive data analysis workflows, Chen et al. [94] developed a workload manager called Berkeley Energy Efficient MapReduce (BEEMR). BEEMR first classifies each MapReduce job into one of three classes, interactive, batch-able, or interruptible, based on several empirical parameters. Interactive jobs are then serviced in the proposed interactive zone, which is kept in a full-power ready state with priority, because jobs of this kind require low latency. Batch-able and interruptible jobs, for which latency is not a concern, are placed in the wait queue of the batch zone. Energy savings come from aggregating jobs in the batch zone to achieve high utilization, executing them in regular batches, and transitioning the machines in the batch zone to a low-power state when the jobs complete.
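The classification step can be illustrated with a small rule-based sketch. The two thresholds, the job attributes, and the function names are hypothetical placeholders for the empirical parameters mentioned above and are not the values used by BEEMR.

```python
GIB = 2**30


def classify_job(input_bytes, expected_seconds,
                 interactive_max_bytes=10 * GIB,
                 interruptible_min_seconds=6 * 3600):
    """Classify a job as interactive, interruptible, or batch (thresholds assumed)."""
    if input_bytes <= interactive_max_bytes:
        return "interactive"
    if expected_seconds >= interruptible_min_seconds:
        return "interruptible"
    return "batch"


def zone_for(job_class):
    """Interactive jobs go to the always-on zone; the rest wait in the batch zone."""
    return "interactive-zone" if job_class == "interactive" else "batch-zone-queue"


jobs = {
    "ad-hoc-query": (2 * GIB, 40),
    "nightly-etl": (500 * GIB, 3600),
    "model-rebuild": (900 * GIB, 12 * 3600),
}
placement = {name: zone_for(classify_job(size, secs))
             for name, (size, secs) in jobs.items()}
print(placement)
```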

5 Conclusions and Future Work

Continuous growth in the amount of data produced has led to the emergence of many scalability and performance challenges for traditional single-machine-based computing environments. The MapReduce paradigm has become the de facto standard for data-intensive computing in the cloud computing field because of its simplicity, scalability, and fault tolerance. Several approaches to optimize MapReduce have been developed recently, and these are aimed at improving the programming model and decreasing the occurrence of bottlenecks.

In this paper, we first described the basic idea behind the MapReduce model and discussed some of the shortcomings of the original model. We then assessed the methods designed to address these shortcomings in terms of their ability to improve the efficiency and flexibility of different aspects of MapReduce, such as optimizing job scheduling, extending the programming model, hardware-based acceleration, tuning the runtime, and optimizing security and energy use. Each of these methods focuses on a specific aspect of the model; together they improve the performance of MapReduce and make it more suitable for complex computational logic. Research on the aspects mentioned above should be continued and combined. We also suggest some further problems that need to be studied, which are listed below.

  1. Optimize the MapReduce programming model for use in Hadoop YARN.

The novel runtime environment Hadoop YARN is currently considered to be the next generation of the MapReduce open source framework. However, the programming model used in MRv2 has not been substantially improved, and most of the MapReduce optimizations being developed by researchers in both industry and academia focus on improving the performance of the original MapReduce model. More research is required to determine how to integrate currently available MapReduce optimizations (e.g., iterative MapReduce models, more efficient scheduling algorithms, and hardware acceleration solutions) into YARN-based MRv2. In addition, how to integrate the features of other YARN-supported programming models, such as Spark, into MapReduce should be investigated further.

  2. Research into the parallelization of classical algorithms based on the optimized MapReduce.

In the context of big data processing, classical data mining algorithms, such as the k-means, SVM, and Naive Bayes algorithms, are frequently used to extract useful information. However, most of the current parallelization approaches for these algorithms are based on the original MapReduce programming model. The optimized MapReduce models described above, which perform better and are more flexible than the original MapReduce model, are not often used. Research is therefore required to determine how these classical algorithms can be modified to make them more suitable for use in optimized MapReduce models.

  3. Combining MapReduce with the Internet of Things.

The Internet of Things is developing rapidly, and many smart devices have been deployed. However, most of these are only used to gather and transfer data to a cloud computing center through the Internet, which means that the computing and storage capabilities of the devices are not exploited well. A considerable amount of research is therefore required to determine how the MapReduce model can be extended or restructured to meet the demands of the Internet of Things.