
1 Introduction

The size of data, storage capacity, processing power and availability of data are rising day by day. Traditional data warehousing and management tools are not capable of dealing with such huge data. The analytic methods used to deal with huge data are referred to as Big Data analytics. Doug Laney was the first to discuss the V's of Big Data management [1]; the five V's are volume, variety, velocity, variability and veracity.

Some of the technologies to process big data are Hadoop HDFS, MapReduce, Hive, HBase, Pig, Flume, Apache Mahout, Windows Azure, etc. [1]. MapReduce was first developed by Google and is now incorporated in Apache Hadoop. It processes huge amounts of data in parallel on large clusters of hardware. It has two functions, Map() and Reduce(): Map() is used for filtering and sorting the data and Reduce() is used for summarizing the data [2] (Fig. 1).

Fig. 1. MapReduce process

The MapReduce process is depicted in Fig. 1. Here the input is application specific, while the output is a set of \(\langle key, value \rangle \) pairs produced by the Map function. In the mapping process, each single input key-value pair is mapped into several key-value pairs \([(l_1, x_1),\ldots ,(l_1, x_r)]\) with the same key but different values. These key-value pairs are then passed to the Reduce function.
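To make the \(\langle key, value \rangle \) flow concrete, the following is a minimal in-memory sketch in Python, assuming a simple word-count job; it illustrates the Map/Reduce contract only, and is not the paper's implementation or the Hadoop API.

from collections import defaultdict

# Minimal in-memory simulation of the MapReduce flow of Fig. 1.
# map_fn turns one input record into a list of (key, value) pairs;
# reduce_fn summarizes all values sharing a key.

def map_fn(record):
    # Emit (word, 1) for every word in the record.
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Summarize the values collected for one key.
    return (key, sum(values))

def map_reduce(records):
    groups = defaultdict(list)
    for record in records:                     # Map phase
        for key, value in map_fn(record):
            groups[key].append(value)
    # Grouping by key above plays the role of the shuffle/sort step.
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]  # Reduce phase

print(map_reduce(["bus car bus", "train bus"]))  # [('bus', 3), ('car', 1), ('train', 1)]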

Data mining techniques are applied on raw data to extract useful information. A classification algorithm finds a model that describes the data classes [3].

Before the classification algorithm is applied, preprocessing steps such as categorization and feature selection are carried out. This process helps to obtain improved prediction accuracy. Categorization is the process of converting the data into a categorical format: based on given conditions, the data is mapped into a standard set of categories. Feature selection is used to select the important features of the data and to remove the irrelevant attributes. The MapReduce concept is used in both the categorization and feature selection processes.

A classification algorithm involves two phases, namely training and testing. In the training phase, a model is built from the training set; in the testing phase, the testing set is applied to the classification model to check its accuracy [4].

A decision tree is a classification model, mainly used to assign an object to a predetermined class. CHAID, CART, ID3, C4.5 and C5.0 are decision tree algorithms, and C5.0 is a widely used one.

The C5.0 algorithm handles both continuous and categorical values. Feature selection is the basic step in constructing a decision tree. The decision tree algorithm C5.0 is used to assess the data and is faster than ID3 and C4.5 [5]. The MapReduce process is used to evaluate the data.

In this paper we discuss the time complexity of a MapReduce-based C5.0 algorithm. The MapReduce technique links the training model with high performance when processing huge data sets. MapReduce runs on a large cluster of commodity machines and is highly scalable [6].

Matei Zaharia et al. proposed an improved scheduling algorithm that decreases the Hadoop response time and improves performance by using MapReduce [7].

2 Implementation Scrutiny

MapReduce combined with data mining algorithms is used to process terabytes of data [8]. Using the categorization algorithm, attribute values can be grouped. From the categorized data, the relevant attributes have to be picked out; for this purpose, a feature selection algorithm can be used. These two algorithms are used for the prediction of class labels. We fix one example to illustrate all the concepts presented here.

Example 1

The training data set and the testing data set for predicting the mode of transport are given in Tables 1 and 2, respectively.

Table 1. Training dataset
Table 2. Testing dataset

2.1 Categorization

Categorization is the process of grouping objects into categories, usually for some specific purpose. Generally, a categorization algorithm has two parts: the Map part checks the conditions on the attribute values, and the Reduce part changes the numerical values into categorized values [9].

The attribute values are changed into categorical format based on the conditions. Finally, the categorical values are stored in a new file.

For the case study given in Tables 1 and 2, the travel cost and income level attributes are used in the categorization process, with the following conditions.

Condition for travel cost:

if Travel cost == 500 then
    standard
else if Travel cost < 500 then
    cheap
else
    expensive
end

Condition for income level:

if Income == 100000 then
    high
else if Income < 50000 then
    low
else
    medium
end
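As an illustration, the two conditions above can be transcribed into the Map part of the categorization step as in the following Python sketch; the record layout (a dictionary with travel_cost and income keys) is an assumption made for the example, not the paper's actual schema.

def categorize_travel_cost(travel_cost):
    # Transcription of the travel cost condition above.
    if travel_cost == 500:
        return "standard"
    elif travel_cost < 500:
        return "cheap"
    else:
        return "expensive"

def categorize_income(income):
    # Transcription of the income level condition above.
    if income == 100000:
        return "high"
    elif income < 50000:
        return "low"
    else:
        return "medium"

def categorization_map(record):
    # Map part: replace the numerical values by their categorical labels.
    record = dict(record)
    record["travel_cost"] = categorize_travel_cost(record["travel_cost"])
    record["income"] = categorize_income(record["income"])
    return record

print(categorization_map({"gender": "female", "travel_cost": 300, "income": 40000}))
# {'gender': 'female', 'travel_cost': 'cheap', 'income': 'low'}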

2.2 Feature Selection

Feature selection is also called attribute selection or variable selection. After categorization, the feature selection process is used to select a subset of relevant attributes [10]. Chi-square \((\chi ^2)\) feature selection based on the MapReduce concept is used for finding the relevant attributes; it is one of the popular feature selection methods. The statistical test decides whether the observed frequencies differ significantly from the expected frequencies [11]. The \(\chi ^2\)-filter is defined by

$$\begin{aligned} \chi ^{2} = \sum \frac{(O-E)^{2}}{E} \end{aligned}$$
(1)

where O is the observed frequency and E is the expected frequency, and the sum is taken over the value and class combinations of the attribute.

The weight of each attribute is calculated using (1). Attributes whose weight is greater than a chosen threshold value are taken for further processing.
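A minimal Python sketch of the \(\chi ^2\)-filter in (1) for one categorical attribute against the class label is given below; the expected frequency is computed under independence of attribute and class, and the threshold value and toy data are illustrative, not taken from the paper.

from collections import Counter

def chi_square_weight(attr_values, labels):
    # Weight of one attribute by (1): sum of (O - E)^2 / E over all
    # (attribute value, class label) combinations.
    n = len(labels)
    attr_counts = Counter(attr_values)
    label_counts = Counter(labels)
    observed = Counter(zip(attr_values, labels))
    weight = 0.0
    for a, ca in attr_counts.items():
        for lab, cl in label_counts.items():
            O = observed[(a, lab)]
            E = ca * cl / n      # expected frequency under independence
            weight += (O - E) ** 2 / E
    return weight

# Keep attributes whose weight exceeds a chosen threshold (illustrative data).
THRESHOLD = 3.0
attrs = {"travel_cost": ["cheap", "cheap", "expensive", "standard"],
         "gender": ["male", "female", "male", "female"]}
labels = ["bus", "bus", "car", "train"]
print([a for a, col in attrs.items()
       if chi_square_weight(col, labels) > THRESHOLD])  # ['travel_cost']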

2.3 C5.0 Classifier

Classification is one of the major components of data mining, and C5.0 is a decision tree based classification algorithm. It can handle both continuous and categorical attributes. Let \(\mathcal {C}\) be a categorization of a set S of objects into categories \(C_1,C_2,\dots ,C_r\) and let \(p_i\) be the probability that an object in S is in category \(C_i\). The entropy of S is defined as

$$\begin{aligned} \mathrm{Entropy} (S)=-\sum \limits _{i=1}^r p_i\log _2p_i. \end{aligned}$$

The entropy measures the homogeneity of the objects: it is zero when all objects fall in a single category and maximal when they are spread evenly over the categories.

To determine the best attribute to be chosen for a node in a decision tree, we use the concept of information gain. For any given attribute A, consider the set V(A) of possible values of A. For any \(v\in V(A)\), let \(S_v\) be the set of all elements of S having value v for the attribute A. The information gain of A with respect to S is given by

$$\begin{aligned} \mathrm{Inf.Gain}\ (S,A)=\ \mathrm{Entropy}\ (S)-\sum \limits _{v\in V(A)}\frac{|S_v|}{|S|}\ \mathrm{Entropy}\ (S_v). \end{aligned}$$

Thus the information gain measures the expected reduction in the entropy.

The attribute with the highest information gain is taken as the root node of the decision tree, and the values of the attribute are its children. Then, among the remaining attributes, the one with the highest information gain is taken as a child node and the process is repeated. If \(S_v\ne \emptyset \), the tree constructed from \(S_v\) is added as a new branch at the child node for the value v. Each leaf node of the final decision tree gives a rule for predicting the class label.

For the example given in Table 1, the probabilities of an individual travelling by bus, car and train are 0.4, 0.3 and 0.3, respectively. Hence the entropy value is \(-(0.4\log _2 (0.4)+0.3\log _2(0.3)+0.3\log _2(0.3))=1.571\).
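The following Python sketch implements the entropy and information gain formulas above and reproduces this value; the representation of objects as dictionaries of attribute values is an assumption made for illustration.

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum of p_i * log2(p_i) over the categories of S.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(objects, attribute, labels):
    # Inf.Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v).
    n = len(labels)
    partitions = {}
    for obj, label in zip(objects, labels):
        partitions.setdefault(obj[attribute], []).append(label)
    return entropy(labels) - sum(
        (len(subset) / n) * entropy(subset) for subset in partitions.values())

# 4 bus, 3 car and 3 train objects give probabilities 0.4, 0.3 and 0.3.
print(round(entropy(["bus"] * 4 + ["car"] * 3 + ["train"] * 3), 3))  # 1.571

# Toy illustration: an attribute that separates the classes perfectly
# has information gain equal to the entropy of the class labels.
toy = [{"travel_cost": "cheap"}, {"travel_cost": "cheap"}, {"travel_cost": "expensive"}]
print(round(information_gain(toy, "travel_cost", ["bus", "bus", "car"]), 3))  # 0.918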

For each of the attributes gender, car ownership, travel cost and income, the information gain is computed; the results are given in Table 3.

Table 3. Information gain result

The attribute with the highest gain value is travel cost, which is taken as the root node; its branches are its values, namely standard, expensive and cheap.

In the next step, the information gain of the remaining three attributes is computed for each branch. Table 4 gives the results for the branch “cheap”.

Table 4. Information gain for the node “cheap”

In this way the process is continued, yielding the decision tree. From the decision tree, the transportation mode can be predicted in terms of the attributes. For example, for an object with attribute values female, car owner and cheap, the predicted transportation mode is bus.

3 Conclusion

The input for the MapReduce model is a set S of objects with \(|S|=n\), attributes \(A_1, A_2,\dots , A_k\), where attribute \(A_i\) takes \(\alpha _i\) values, and rules for categorization. The complexity of the categorization process is O(n), the complexity of feature selection is O(k) and the complexity of the C5.0 classifier is \(O(n^2)\). Thus the overall complexity of the algorithm is \(O(n^2)\).