
1 Introduction

In the computational market, the term “big data” is used very frequently. In the data world, Big Data means everyone’s data: the knowledge or relevant facts gathered by a particular organization and processed with modern methods or procedures to generate the best result in a precise manner.

As we start learning about Big Data, data analytics is commonly described in terms of “the three V’s” (volume, velocity and variety), the points that identify the challenges of the data analytics world. In brief, it is a large volume of data processed quickly and in various forms. This can include users’ transaction records, an organization’s production databases, the influx of web log files, live video files, user conversations on current social media and much more. Mark van Rijmenam, in “Why the 3 V’s Are Not Sufficient to Describe Big Data,” argues that the three V’s alone are not enough and adds “veracity, variability, visualization, and value” to the definition. Rijmenam also states that “90% of all data ever created, was created in the past two years. From now on, the amount of data in the world will double every two years” [1, 4].

Hadoop MapReduce (Hadoop Map/Reduce) is known as “a software framework for distributed processing of large data sets on compute clusters of commodity hardware” [2]. It is a part of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. “According to The Apache Software Foundation, the primary objective of Map/Reduce is to split the input data set into independent chunks that are processed in a completely parallel manner” [1, 15]. The Hadoop MapReduce framework sorts the output of the map class, which then becomes the input of the reduce class. Typically, both the input and the output of the tasks are stored in a file system.

2 Hadoop Features and Characteristics

Apache Hadoop is a very powerful and reliable big data tool. “Hadoop provides the most reliable storage layer – HDFS, a batch processing engine – MapReduce and a Resource Management Layer – YARN” [4]. Some important Hadoop features are stated below:

  • Open source: Apache Hadoop is an open source project, so its code can be modified according to business requirements.

  • Distributed processing: data is processed in parallel on a cluster of nodes, because it is stored in a distributed manner in the Hadoop Distributed File System (HDFS) across that cluster [7].

  • Fault Tolerance: by default, Hadoop stores three replicas of each block across the cluster, and this number can be modified as per requirement. So if any node goes down, data can easily be recovered from other nodes [2, 11]. The framework recovers data automatically when node or task failures occur, which makes this one of the most important features of Hadoop (a configuration sketch is given after this list).

  • Reliability: thanks to replication, data is stored reliably on the cluster nodes even in the case of machine failures [15]. If a machine fails, your data will still be stored safely.

  • High Availability: because several copies of the data exist, the data remains available despite hardware failure. If the hardware of one machine crashes, the data can still be accessed through another path [15].

  • Scalability: Hadoop is highly scalable because new hardware can easily be added to the existing nodes. It also provides horizontal scalability, since extra nodes can be added on the fly without any downtime [7, 15].

  • Economy: Apache Hadoop is not very expensive, as it runs on a cluster of commodity hardware. The cluster does not need any specialized machines. Hadoop therefore brings huge cost savings, and it is very simple to add more machines to the cluster [9]. So whenever requirements grow over time, you can add machines without major alteration and without much pre-planning.

  • Data Locality: Hadoop works on the fundamental principle of data locality: “move computation to data instead of data to computation” [8]. Whenever a user submits a MapReduce job, the computation is moved to the data in the cluster instead of the data being transferred to the place where the job was submitted.
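To illustrate the replication point from the fault tolerance feature above, here is a minimal sketch, assuming a Java client and a hypothetical file path, of how the replication factor could be adjusted; dfs.replication is the standard HDFS property, and the cluster-wide default normally lives in hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Override the default replication factor (normally 3) for files
        // created by this client.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "2");

        // The replication factor of an existing file can also be changed;
        // the path below is only a placeholder.
        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/user/example/data.txt"), (short) 2);
        fs.close();
    }
}
```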

3 Functionality of Map-Reduce

MapReduce can be called the heart of Apache Hadoop. It is the programming paradigm that allows scaling out across a large number of clustered servers. The MapReduce paradigm is comparatively easy to grasp for users who are already familiar with clustered scale-out processing.

For new users, the topic can be difficult to understand, because it is not a concept most people have been exposed to before. If you are new to Hadoop MapReduce jobs, do not hesitate; we will try to present it so that you can pick it up quickly.

The term MapReduce refers to two separate and distinct tasks that Hadoop programs perform [9]. As shown in Fig. 1, the map job is the initial task: it takes a set of data and converts it into another set of data, in which individual elements are broken down into key/value pairs. The output of the map function then goes to the reduce job as input, which combines those key/value pairs into a smaller set of tuples. As the name MapReduce implies, the reduce job is always performed after the map job has completed.

Fig. 1. Basics of Map-Reduce

The fundamentals of Hadoop state that “the MapReduce framework works mainly on <key, value> pairs, that is, the framework views the input to the task as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the task, conceivably of different types” [15].

The key and value classes are serialized by the Hadoop framework and therefore need to implement the Writable interface. In addition, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Here is the input and output flow of a MapReduce job:

“(input) <key1, value1> -> map -> <key2, value2> -> combine -> <key2, value2> -> reduce -> <key3, value3> (output)” [4].
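A minimal Java sketch of these two roles is shown below (the class names and the word-count style aggregation are illustrative choices, not taken from this paper); note that Hadoop’s built-in Text and IntWritable types already implement the Writable/WritableComparable interfaces mentioned above:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<key1, value1, key2, value2>: reads the input split record by record
// and emits intermediate <key2, value2> pairs.
class SketchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit one <token, 1> pair per whitespace-separated token of the line.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), new IntWritable(1));
            }
        }
    }
}

// Reducer<key2, value2, key3, value3>: receives all values grouped under the
// same key2 and emits one final <key3, value3> pair.
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();  // aggregate all values that share this key
        }
        context.write(key, new IntWritable(sum));
    }
}
```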

Generally, the MapReduce approach is about sending the computation to the node where the data resides. A MapReduce job is processed in three stages: mapping, shuffling and reducing.

Mapping Stage:

The map or mapper’s task is to process the input data. The input data, in the form of files or directories, is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes this data and creates several small chunks of data.

Reducing Stage:

“This stage is the combination of the two stages: the Shuffle stage and the Reduce stage” [9]. The reducer’s main task is to take the data coming from the mapper stage and process it. After this processing is completed, it produces a new set of output, which is stored in the Hadoop Distributed File System (HDFS).

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as assigning tasks to different nodes, verifying whether a task has completed and copying data around the cluster between nodes. Most of the computing takes place on nodes with the data on local disks, which reduces the network traffic between nodes. After completion of the given tasks, the cluster collects and reduces the data into an appropriate result and sends it back to the server.
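As a hedged sketch of how such a job is typically wired together and submitted (SketchMapper and SketchReducer refer to the illustrative classes from the previous sketch, and the HDFS paths are placeholders), a driver could look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SketchDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sketch job");
        job.setJarByClass(SketchDriver.class);

        // The framework ships these classes to the nodes that hold the data
        // (data locality) instead of moving the data to the client.
        job.setMapperClass(SketchMapper.class);
        job.setCombinerClass(SketchReducer.class); // optional map-side pre-aggregation
        job.setReducerClass(SketchReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Both the input and the final output live in HDFS.
        FileInputFormat.addInputPath(job, new Path("/user/example/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```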

4 Proposed Algorithm

The Map and Reduce functions of the MapReduce paradigm are both defined with respect to data structured as (key, value) pairs. The Map task takes one pair of data in one data domain and returns a list of (key, value) pairs in a different data domain:

$$ \text{Map}\,\left( {\text{key}1,\,\text{value}1} \right) \to \text{list}\,\left( {\text{key}2,\,\text{value}2} \right) $$

The Map function is applied in parallel to every (key1, value1) pair in the input dataset. This produces a list of (key2, value2) pairs for each call. After that, the framework collects all pairs with the same key2 from all lists and groups them together, creating one group for every key.

The Reduce function is then applied in parallel to each group, which in turn produces the collection of values in the same domain:

$$ {\text{Reduce}}\,\left( {\text{key}2,\,\text{list}\left( {\text{value}2} \right)} \right) \to \text{list}\left( {\text{value}3} \right) $$

Though one call is allowed to return more than one value, each single Reduce call typically produces either one value v3 or an empty return. The returns of all calls are collected as the desired result list.
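As a small worked instance (anticipating the city/temperature data of the example in Sect. 4.1), one input record and the values grouped for a single city pass through these two functions as follows:

$$ \text{Map}\left( {\text{offset},\,\text{``Ahmedabad, 21''}} \right) \to \text{list}\left( {\left( {\text{Ahmedabad},\,21} \right)} \right) $$

$$ \text{Reduce}\left( {\text{Ahmedabad},\,\text{list}\left( {21,\,18,\,32,\,22,\,31} \right)} \right) \to \text{list}\left( {32} \right) $$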

The proposed algorithm is as below.

Figure a. Proposed algorithm.

4.1 Example

Let us try to understand this topic with an example. Suppose a user has five files of data, and each file has two columns (“a key and a value in Hadoop terms”) that represent a city name and the corresponding temperature recorded in that city on different days. The data is kept simple because it is only an example; a real-world application will not be this simple, since big data may contain enormous numbers of rows and may not be preprocessed. However, no matter what amount of data is analyzed, the principles highlighted here remain the same. For this example, the city is the key and the temperature is the value.

Ahmedabad, 21

Vadodara, 26

Rajkot, 23

Surat, 33

Out of all the data we have collected, we want to find the maximum temperature for each city. Using the MapReduce framework, we can break this down into five map tasks, where each map task works on one of the five files; each map task goes through its data and returns the maximum temperature for every city it sees. The result produced by the map task for the file above looks like this:

$$ \left( {\text{Ahmedabad},\,21} \right)\left( {\text{Vadodara},\,26} \right)\left( {\text{Rajkot},\,23} \right)\left( {\text{Surat},\,33} \right) $$

Assume that the other four mapper tasks, working on the other four files, produce the following intermediate results:

(Ahmedabad, 18) (Vadodara, 27)

(Rajkot, 32) (Surat, 37)

(Ahmedabad, 32) (Vadodara, 20)

(Rajkot, 33) (Surat, 38)

(Ahmedabad, 22) (Vadodara, 19)

(Rajkot, 20) (Surat, 31)

(Ahmedabad, 31) (Vadodara, 22)

(Rajkot, 19) (Surat, 30)

The output streams of all these map tasks are fed into the reduce task, which combines the input results and outputs a single value for every city, producing the final result set shown below:

$$ \left( {\text{Ahmedabad},\,32} \right)\left( {\text{Vadodara},\,27} \right)\left( {\text{Rajkot},\,33} \right)\left( {\text{Surat},\,38} \right) $$
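A hedged Java sketch of this example is given below (input lines are assumed to have the form “City, temperature”, matching the sample records above; the class names are illustrative). The mapper emits one <city, temperature> pair per record and the reducer keeps only the maximum value per city; run against the five sample files, it would produce exactly the result set above:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse "City, temperature" lines and emit <city, temperature> pairs.
class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");
        if (parts.length == 2) {
            String city = parts[0].trim();
            int temperature = Integer.parseInt(parts[1].trim());
            context.write(new Text(city), new IntWritable(temperature));
        }
    }
}

// Reduce: for each city, keep the highest temperature seen across all mappers.
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(city, new IntWritable(max));
    }
}
```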

Similarly, consider the example of a “census conducted in Rome, where the census bureau would dispatch its people to each city in the empire” as map and reduce tasks: “each census taker in each city would be tasked to count the number of people in that city and return the results to the capital city.”

“Thus, the results from each city will be reduced to a single count (sum of all cities) to calculate the overall population of the empire. This mapping of the people of the cities in parallel, and then combining the results, is more efficient than sending a lone person to count every person in the empire in a serial fashion” [16].

In Table 1, some basic Hadoop commands are listed for implementing this basic map-reduce functionality.

Table 1. Hadoop basic commands [8]

5 Conclusion

When it comes to processing large data sets, the Hadoop MapReduce paradigm is available for scheduling such large amounts of data in a safe and efficient way. Hadoop is a highly scalable platform, and the main reason behind this scalability is its ability to store and distribute huge amounts of data across many servers. Each addition of data servers further increases the processing throughput.