1 Introduction

The objective of data analytics is to inspect, cleanse, transform, and model data in order to extract useful information, draw meaningful conclusions, and support decision making [1]. Data analysis has many facets and approaches, and it goes by different names in business, science, and social science domains [2].

Big data analytics is a specific form of data analysis that focuses on very large data sets originating from data-intensive computing centers in various fields [3]. Big data typically comprises data sets whose volume is beyond the ability of traditional software tools to analyze, manage, and process [4].

Programs written in the functional MapReduce style are automatically parallelized and executed on a large cluster of commodity machines [5, 6]. During program execution, the runtime system takes care of the input splits and of scheduling the work, which covers execution across a set of machines, handling failures, and managing inter-machine communication [7]. Hadoop nevertheless exhibits significant drawbacks that affect cluster performance.

The main drawbacks of Hadoop are outlined below:

  1. Distinct phases are coupled into a single task: the reduce function is both CPU-intensive and memory-intensive, because it has to organize the map task data and produce the final result.

  2. Random I/O requests affect the shuffle phase: the task tracker receives a large number of I/O read requests. Each request triggers many read operations at different offsets on the task tracker.

In this paper, we attempt to extract the shuffle phase from the reduce task and implement it as a standard service. We integrate the shuffle service with a sequential read policy and handle partitioning skew in the reduce task to manage stragglers. Section 2 presents the background and the Hadoop MapReduce programming model, and Sect. 3 describes the problem statement. Section 4 discusses the design process, and Sect. 5 analyzes the improvement of the map phase. Section 6 covers the evaluation of the algorithm, and Sect. 7 reviews the results. Finally, Sect. 8 concludes.

2 Background

Recent efforts to improve the performance of Hadoop MapReduce include the following:

  1. A model comprising a map step, a reduce step, and a sort and merge step, building on the Google MapReduce model, was implemented by Hung-chih Yang, Ali Dasdan et al.

  2. HadoopDB, an architectural hybrid of MapReduce and database technologies, was developed for analytical workloads.

  3. B. Nicolae, G. Antoniu et al. proposed replacing the HDFS layer of Hadoop MapReduce with a concurrency-optimized data storage layer, which improves the efficiency of concurrent data access.

  4. N. Conway, T. Condie et al. proposed a pipelined architecture that supports online streaming for many networking sites.

  5. Apache YARN separates resource management and scheduling into distinct components in order to remove the job tracker bottleneck.

2.1 MapReduce

MapReduce is a data flow model that is widely used for parallelizing data processing in various applications [8]. It is a simple and open data flow programming model that is often preferred over conventional high-level database approaches. This programming model is used for processing large-scale data sets on computer clusters by means of two functions, Map( ) and Reduce( ), which are defined as follows:

$${\text{Map}}\left( {{\text{K}}1,\;{\text{V}}1} \right) \to {\text{list}}\left( {{\text{K}}2,\;{\text{V}}2} \right)\quad {\text{Reduce}}\left( {{\text{K}}2,\;{\text{list}}\left( {{\text{V}}2} \right)} \right) \to {\text{list}}\left( {{\text{V}}2} \right)$$

The Map( ) function takes a key/value pair as input and generates intermediate key/value pairs. These intermediate pairs are the input to the Reduce( ) function, which produces the final output [9].
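The two signatures can be illustrated with a minimal in-memory sketch. It is not Hadoop code: the word count task, the function names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the sample records are assumptions chosen only to show how intermediate pairs are grouped by key before the reduce step.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_fn(key: str, value: str) -> List[Tuple[str, int]]:
    """Map(K1, V1) -> list(K2, V2): emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key: str, values: List[int]) -> List[int]:
    """Reduce(K2, list(V2)) -> list(V2): sum the counts collected for one word."""
    return [sum(values)]


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], List[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], List[int]],
) -> Dict[str, List[int]]:
    """Run map, group-by-key, and reduce sequentially in one process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            # Collect every intermediate value under its intermediate key.
            grouped[k2].append(v2)
    # Apply the reducer once per intermediate key, in sorted key order.
    return {k2: reducer(k2, values) for k2, values in sorted(grouped.items())}


# Hypothetical input: (line id, line text) pairs.
records = [("line1", "big data big cluster"), ("line2", "data cluster data")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

In a real cluster the grouping step is what the shuffle performs across machines; here it is a single dictionary.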

2.2 Hadoop

Hadoop executes shuffle as part of the reduce task, which leads to high bandwidth utilization in the cluster, low processor usage, and poor performance [10, 11].

The Hadoop Distributed File System (HDFS) provides high-throughput access to application data and is suited to applications with large data sets, although such workloads involve resource allocation tasks in the cluster and a large number of random disk I/O requests [12].

As shown in Fig. 1, data in a Hadoop cluster is broken down into smaller pieces and distributed throughout the cluster, where a job tracker keeps track of jobs in both parent and child segments. The map and reduce functions can be executed on smaller subsets of the larger data set, which provides the scalability needed for data processing [13].

Fig. 1 MapReduce model

3 Problem Statement

The MapReduce programming model is very simple, but in practice several problems arise in the Map( ) function. A Map( ) function is assigned to each split; if any one split runs into a problem or fails, the result of the map phase cannot be computed. Because the combined results of the individual map functions form the input to the reduce function, the map function must perform reliably [14].

The improved reduce phase involves two tasks: a shuffle part and a reduce part. The initial shuffle part fetches the intermediate output of the map phase. This requires more buffer space for operations such as sorting and mapping in order to produce output.

Numerous disk I/O requests from the shuffle phase result in inefficient resource usage. These factors lead to cluster performance problems. The analysis shows an improvement in certain phases of Hadoop MapReduce, specifically in terms of execution [15, 16].

4 Design Process

Shuffle and reduce as separate task stages: the first step is to remove the copy and merge operations of shuffle from the reduce task so that they form a separate entity.

4.1 Coupling of Distinct Phases into a Single Task

The shuffle phase fetches the intermediate output from every map task, whereas the reduce function cannot start processing until the shuffle phase delivers its output. This wastes CPU time and reduces the effective network bandwidth.

4.2 Random I/O Requests of the Shuffle Task

Each map task needs to read data from disk and transfer the response to the designated reduce task immediately. This results in a large number of random disk I/O operations, which in turn reduces performance.

4.3 Design

Our design mainly involves the following stages to improve performance. Shuffle can process small amounts of data at a time, which improves resource utilization within the same amount of time, but the number of disk I/O requests increases progressively.

Figure 2 shows the stages of the improved MapReduce architecture, which handles disjoint map splits that contain skew. These splits are handled separately by the task tracker on the slave node. The generate function improves the shuffle phase and passes the processed data to the reduce task.

Fig. 2 Improved shuffle phase and handling straggler

Shuffle as a service: Implementing shuffle as a service increases resource utilization, because a lightweight common service fetches the data on demand for the reduce task [17].

Overcoming stragglers: To avoid blocking of slots, skewed tasks are identified and handled under the same task master. Partitioning skew is detected before the shuffling of data begins by monitoring the sizes of the data produced by the map tasks, and it is handled by dynamically creating multiple reduce tasks per skewed partition [18].

Managing disk I/O requests in the map phase: The I/O requests for different disk drives are collected within a certain time interval. These requests are sorted and grouped into a sequential list and forwarded to the respective output files of the map tasks [19]. The task tracker builds the responses by reading the data from disk and sends them to the reduce tasks in order.
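The skew detection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the threshold rule (a partition is skewed when it exceeds `skew_factor` times the mean partition size), the names `partition_sizes` and `plan_reduce_tasks`, and the sample sizes are all hypothetical.

```python
import math
from typing import Dict, List


def partition_sizes(map_outputs: List[Dict[int, int]]) -> Dict[int, int]:
    """Sum the bytes that each map task reported for every partition."""
    totals: Dict[int, int] = {}
    for output in map_outputs:
        for partition, size in output.items():
            totals[partition] = totals.get(partition, 0) + size
    return totals


def plan_reduce_tasks(totals: Dict[int, int], skew_factor: float = 2.0) -> Dict[int, int]:
    """Return how many reduce tasks to create for each partition.

    Assumption: a partition is skewed when it is larger than skew_factor
    times the mean partition size. A skewed partition is divided so that
    each reduce task receives roughly a mean-sized share.
    """
    if not totals:
        return {}
    mean = sum(totals.values()) / len(totals)
    plan: Dict[int, int] = {}
    for partition, size in totals.items():
        if size > skew_factor * mean:
            plan[partition] = math.ceil(size / mean)
        else:
            plan[partition] = 1
    return plan


# Hypothetical per-partition output sizes (bytes) reported by two map tasks.
map_outputs = [
    {0: 100, 1: 900, 2: 100, 3: 100},
    {0: 100, 1: 700, 2: 100, 3: 100},
]
totals = partition_sizes(map_outputs)
plan = plan_reduce_tasks(totals)
print(plan)
```

Because the sizes are known as soon as the map tasks report them, the plan can be computed before any data is shuffled, which is the point of detecting skew early.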
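The sort-and-group step for read requests can be sketched as below. The `ReadRequest` record, its fields, the file names, and the function `to_sequential_batches` are assumptions made for illustration; the sketch only shows how requests collected in one time window can be ordered by offset per map output file so that the disk is read sequentially.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class ReadRequest(NamedTuple):
    """Hypothetical shuffle read request issued on behalf of a reduce task."""
    reducer_id: int
    map_file: str
    offset: int
    length: int


def to_sequential_batches(requests: List[ReadRequest]) -> Dict[str, List[ReadRequest]]:
    """Group the requests of one time window by file and sort them by offset."""
    by_file: Dict[str, List[ReadRequest]] = defaultdict(list)
    for req in requests:
        by_file[req.map_file].append(req)
    # Within each file, ascending offsets turn random reads into one forward pass.
    return {name: sorted(reqs, key=lambda r: r.offset) for name, reqs in by_file.items()}


# Requests as they might arrive, in arbitrary order, within one window.
window = [
    ReadRequest(2, "map_0.out", 4096, 1024),
    ReadRequest(0, "map_1.out", 0, 2048),
    ReadRequest(1, "map_0.out", 0, 4096),
    ReadRequest(0, "map_0.out", 5120, 512),
]
batches = to_sequential_batches(window)
offsets = [r.offset for r in batches["map_0.out"]]
print(offsets)
```

The window length trades latency for sequentiality: a longer window collects more requests per file but delays the first response.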

5 Improving the Map Phase

As discussed in the problem statement, the Map( ) function needs to be improved. The function needs checkpoints that monitor it regularly and attempt to resolve the situation when a split signals an interrupt.

The generate function monitors the map phase at regular checkpoints and inspects the status of each map split. These checkpoints are arranged dynamically and assess the needs of the splits. A distributed storage structure shares information among different tasks. This procedure corresponds to the design in Fig. 2, which elaborates the phases for handling stragglers and for using resources efficiently.

The map jobs are scheduled in a queue, and the reduce jobs use a priority queue. In this way, the Map( ) function produces intermediate key/value pairs [20]. These pairs are given as input to the reduce function, which generates the final output. The generate function is also used in the reduce phase. Dynamically arranged checkpoints monitor the reduce phase and divide a split into smaller splits when an interrupt occurs. Finally, the outputs of all splits are combined to obtain the final result.
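The idea of dividing a split when an interrupt occurs and then combining the partial outputs can be sketched as follows. The paper does not give the generate function in code, so everything here is an assumption: the `SplitInterrupted` exception stands in for the interrupt signal, halving is one possible re-splitting rule, and `flaky_sum` is a stand-in workload that fails on splits with more than two records.

```python
from typing import Callable, List, Sequence


class SplitInterrupted(Exception):
    """Hypothetical signal raised when a split cannot be processed as a whole."""


def process_with_resplit(
    split: Sequence[int],
    work: Callable[[Sequence[int]], int],
    min_size: int = 1,
) -> List[int]:
    """Process a split; on interrupt, halve it and process both halves.

    Returns the list of partial outputs, one per split that finished.
    """
    try:
        return [work(split)]
    except SplitInterrupted:
        if len(split) <= min_size:
            # The split cannot be divided further, so the failure is reported.
            raise
        mid = len(split) // 2
        return (process_with_resplit(split[:mid], work, min_size)
                + process_with_resplit(split[mid:], work, min_size))


def flaky_sum(split: Sequence[int]) -> int:
    """Stand-in workload: interrupted on splits with more than two records."""
    if len(split) > 2:
        raise SplitInterrupted
    return sum(split)


partials = process_with_resplit([1, 2, 3, 4, 5], flaky_sum)
total = sum(partials)  # combine the outputs of all splits
print(partials, total)
```

Combining by summation works here because the stand-in workload is associative; a real reduce function would need a combine step with the same property.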

6 Evaluation

The performance and resource utilization of MapReduce jobs that use the shuffle service are analyzed below:

6.1 Simulation Experiments

Complex methods, function calls, resource requirements, protocols, and exchanges in the Hadoop cluster influence the ratio of disk read/write operations. Because of these random requests, the sequential read ratio on disk decreases, which in turn increases the reduce task time.

6.2 MapReduce Job Experiment

The Pi estimator is dominated by CPU computation and is therefore a CPU-intensive task, whereas word count and TeraSort are resource-oriented. The resources in question are memory and bandwidth.

6.3 Straggler Handling

These techniques distribute the map outputs evenly across the reducers running in parallel, so that no skew arises.

6.4 Settings

Our Hadoop setup (version 0.24) uses a 12-node cluster, in which one node is a master and ten are slaves. Every node in the cluster has a 2 GHz core processor, 4 GB of RAM, and a 500 GB disk drive.

Read total: Sum of read operations for each reduce task.

File size: Volume of data produced by each map task in core state.

Read time: Mean time between issuing a transfer request and obtaining the data for processing.

Read ratio: Average ratio of read total to read time.

Local read total: Word count of reduce task with comparison of job time.

Reduce_skew: Number of stragglers in the reduce task.
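The read ratio metric can be made concrete with a short sketch. The definition above does not fix the units or the averaging order, so this example assumes read totals in MB, read times in seconds, and a mean taken over per-task ratios; the sample values and the helper `percent_increase` are hypothetical and are not measurements from this paper.

```python
from statistics import mean
from typing import List, Tuple


def read_ratio(samples: List[Tuple[float, float]]) -> float:
    """Average of read_total / read_time over all sampled reduce tasks."""
    return mean(total / seconds for total, seconds in samples)


def percent_increase(baseline: float, improved: float) -> float:
    """Relative increase of improved over baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# Hypothetical (read total in MB, read time in s) samples, one per reduce task.
samples = [(128.0, 4.0), (256.0, 8.0), (512.0, 8.0)]
ratio = read_ratio(samples)
gain = percent_increase(32.0, 64.0)
print(ratio, gain)
```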

7 Results

Read performance was measured with the improved fetch phase for file sizes of 128, 256, and 512 MB and compared with the earlier fetch task. With the sequential read strategy, the mean increase in read ratio is 94.17%; with the concurrent strategy, it is 62.81%.

Figure 3 shows the data read in local mode at varying access speeds. The data read is measured at 128, 256, and 512 MB per second while avoiding stragglers. The word count results show that the time share of the reduce phase decreases from 8.94% to 6.32%.

Fig. 3 Read total (MB) local read

The resource utilization of the reduce phase and the word count performance are low because the map phase has to release its entire output first. The best word count performance is observed in Fig. 4, which visualizes the read and write ratios on disk. After the modifications to the shuffle phase, the graph shows a 7% increase in resource utilization.

Fig. 4 Word count comparison of reduce task

8 Conclusion

The MapReduce programming model requires improvement in the map phase as well as in the shuffle phase. Although the model is simple, complications arise in the map phase during implementation. If one map fails, the output cannot be computed, because the result of the map phase is the input to the reduce phase.

The reduce phase adds a scheduler for every node. The generate function, which dynamically monitors the reduce phase, also addresses the basic problem in the map phase. Cluster resources are well utilized when large volumes of intermediate data have to be processed, because shuffle is provided as a service that takes only a small amount of time.

The word count and TeraSort jobs on Hadoop MapReduce show the performance improvement across different data structures. Better resource utilization and lower overall execution time are the improvements observed with the skew handling technique.