1 Introduction

Big data is an evolving term that describes combinations of larger amount of structured, unstructured and semi structured data that cannot be processed using conventional computing techniques [1]. Facebook, YouTube or social network have enormous amount of data that falls under the category of bigdata and it requires it to be collect and manage on a daily basis. The volume, velocity, variability of data sets makes them difficult to be captured, managed, and processed by traditional method and tools. Hadoop Map Reduce programming model is now being used for processing big data while Map Task performs the input data by partitioning into fixed blocks and generate the output as a collection of<num, data> pairs [2]. After completion of Map task these pairs are shuffled across different reduced tasks based on<num, data> pairs and each reduced task accepts only one num key at a time. The process data for that num key produces the result as<num, data> pairs. The Hadoop Map Reduce architecture [3] consists of one Job tracker performed at a time but many Task Trackers work simultaneously. Hadoop Map Reduce is the modified version of online Map Reduce which supports online aggregation [4] and reduces response time. Conventional Map Reduce method materializes the intermediated results of mapper and it does not allow pipelining between Map and Reduce tasks. This method has the advantage of simple recovery in case of failures. Once the mappers have finished the tasks, the reducers start executing the tasks.

Fig. 1
figure 1

Block diagram of the proposed layers 3 traffic aware clustering algorithm

Partitioning clustering allocate documents into a fixed number of clusters and the well known partitioning methods are k-means clustering [5] and its variants .The basic K-means clustering method [6] initially allocates a set of objects to a number of cluster randomly from which the mean of each cluster is calculated for each iteration and each object is reallocated to the nearest mean. In the Map phase, the mapper takes a single (num, data) pair as input and generate any number of (num,data) pairs as output. The map operates as stateless in which the logic operates on a single pair at a time even if in practice several input pairs are delivered to the same mapper. To recapitulate for the map stage, the user simply designs a map function that maps an input<num, data> pair to any number of output pairs. Frequently the map phase is used to specify the desired location of the input by changing its key. The shuffle stage is automatically handled by the Map Reduce technique and the engineer has nothing to do for at this stage. The system implementing Map Reduce routes all values that are in aggregate with an individual key to the same reducer. In the reduce stage, the reducer takes all of the values associated with a single num n and outputs any number of<num, data> pairs. One of the sequential aspects of Map Reduce computation is that it reduces all of the maps before the reduce stage can begin. The reducer has access all the values with the same key and it can perform sequential computations. The parallelism is exploited by observing that reducers operating on different num keys can be executed concurrently. Document clustering [7] performs collection of similar documents into classes where similarity is some function on a document and it does not need separate training process, where the documents in the same clusters are similar and the documents in the different clusters are not relevant.

In this paper, a new proposal algorithm is introduced by layers 3 traffic aware clustering programming model. Figure 1 illustrates the block diagram of the proposed system, the design, and implementation of the massive datasets, the processing model based on Map Reduce and Hadoop.

2 Problem identification

Big data is a complex enormous amount of datasets which has become a buzz in the market, due to various challenges and major issues to be handled in the big data. Map Reduce [8, 9] is the prominent technique for processing the enormous amount of data by designing and develop the new method for clustering. In the map phase, it performs the task of converting the input into the form of<num, data> after which the output is sorted and shuffle in the shuffle phase of second layer in which is again sorted at the aggregator stage [10]. The third layer is the reducing phase, which performs the task fetching from its own share of data partitioned from all map tasks to produce the final solution. Here the clustering technique for the map phase is used where big data is analyzed and cluster of similar objects are formed. In this three layer method, the data produced by the map phase are ordered, partitioned and aggregated to the appropriate machines which executes the reduce phase. The network traffic [11] from all the three layers can cause a great volume of network traffic, which causes a serious constraint on the efficiency of data analytic applications. One of the problems in the network traffic is difficult in the process of data with given time.

3 Methodology

3.1 Layers 3 traffic aware clustering method

This proposed method is a layers 3 traffic aware clustering algorithm based on parallel K-means [3, 12] and distance metric [13]. Traffic aware clustering combines the benefits of both the K-means and distance metric algorithms. The benefit of parallel K-means is that it is not complex and forms clusters using the distance metrics and traffic aware and so it has less execution time. This proposed system works in 3 layers.

Layer 1: In this layer, the data is partitioned into different clusters and different parts to minimize into the boundary points [14].

Layer 2: This layer performs shuffling and sorting the clusters based on traffic aware.

Layer 3: This performs sorting that is the input is fetched to reduce the clusters. The clustering is a function of grouping the set of objects that are combined into a single cluster and this cluster contains only similar objects. The dissimilar objects are formed into another cluster. The goal of the clustering is to partition unstructured data objects into a structured data objects. K-means clustering algorithm is used to partition the different sets of data into clusters and it is n- objects based on attributes into k-partitions where k>n to form a vector space. The work begin with a decision on a value of k = number of clusters. First the data is partitioned into k clusters and then each data point is taken in sequence and the distance from the centroid of the clusters is calculated. If the data points which are not closer to the centroid are present in the cluster then those data points are switch that clusters. This process is repeated until the centre does not change. The system processes the given input data, where the data is shuffled and reduced into small required data. To reduce the network traffic within a Map Reduce task [11, 15], the aggregate data is considered the aggregate data with similar keys before sending the mapper output to the reducer stage. The data aggregation creates opportunities for multiple tasks on different machines thus goal of minimize the network traffic [16, 17] by data partition and aggregation for Map Reduce task. Layers 3 traffic aware clustering algorithm is proposed for big data applications and the final result demonstrates the reduction in the network traffic cost.

3.2 Layers 3 traffic aware clustering algorithm

Step 1: Input D: data set having n data points. A= {A\(_{1}\), A\(_{2}\), A\(_3\ldots \) A\(_{n}\)} be the set of data points and V= {V\(_{1}\), V\(_{2}\), V\(_{3}\), V\(_{4}\)...V\(_{n}\)} be the set of centers.

Step 2: Map Reduce function executes in three stages ie. Map stage, shuffle stage- traffic aware and reduce stage.

Step 3: Iteratively improves partitioning of data into k document clusters.

Step 4: Mapper Phase

  1. (i)

    Input in map function paradigm is based on sending to where the data resides and it is in form of file or directory and is stored in the Hadoop distributed file system (HDFS).

  2. (ii)

    Calculate the distance between each data point and cluster centre using the distance metrics as follows.

    $$\begin{aligned} D_{XY} =\sqrt{\mathop \sum \nolimits _{k=1}^m \left( {X_{ik} -X_{jk} } \right) ^{2}} \end{aligned}$$
    (1)
  1. (iii)

    Data point is assigned to the cluster centre whose distance from the cluster centre is minimum for all the cluster centers.

  2. (iv)

    The input passes to mapper function line by line and in the form is <num, data> pair as cluster centre and data points.

  3. (v)

    The mapper function finds the nearest centre among k centers for the input point and process the data and creates small chucks of data.

figure c
Fig. 2
figure 2

Proposed layers 3 traffic aware clustering system flow chart

Step 5: Shuffle stage- traffic aware

figure d

Step 6: Reducer Phase

  1. (i)

    Reducer phase is the combination of shuffle and reduce and the input of reducer’s job is to process the data and produces the output of map function that is all data point creates a new set of output.

  2. (ii)

    New cluster centre is calculated using

    $$\begin{aligned} V_i =\left( {\frac{1}{C_i }} \right) \mathop \sum \nolimits _{j=1}^{C_l } X_j \end{aligned}$$
    (2)

    C\(_i\) = No. of data points in the ith cluster

  1. (iii)

    The reducer phase calculates the new centre using data points that are stored in HDFS.

figure e

Figure 2 shows the proposed method flow chart. In this system the input data is processed and partitioned to calculate the distance between the data points. The next step is the formation of clusters from which the cluster centre is calculated. The shuffling and sorting is also done at this stage which performs multiple tasks on different machine to check whether the same numbers are combined together before sending them to reduce tasks. In the last stage the sorted output is given to the input of the reduce function where computation of centre co-ordinate new cluster is done before the process is ended.

3.3 Data partitioning

The data partitioning [18] is done in the Map task function. The proposed system includes the logical block addressing clusters in Map function at this stage because multiple tasks content for the same slot wastes the time for re allocating the data. The sorted list in each list contains the block of the same partition and they are split into different array list then a new a new cluster repeated data into the partition. Figure 3 shows the data partitioning.

Fig. 3
figure 3

Data partitioning

4 Implementation results

The system is developed in JAVA and deployed in Hadoop framework. The results obtained from the implementation of layers 3 traffic aware clustering algorithm are compared with bisecting K-means algorithm and K-Means parallel algorithm using the same data sets. The data sets are “Reuters 21578” [19], “Online 20 Newsgroup” [20], “Web Ace” [21], “TREC” [22]. The algorithm is performed on the basis of two parameters namely network traffic related execution time and performance accuracy. Table 1. Illustrates the comparison of execution time of various methods using different data sets.

Table 1 Comparison of various methods with four different data sets execution time

Figure 4 shows the performance using execution time of the proposed algorithm compared with four other algorithms namely Basic K-means, K-Means Parallel, Bisecting K-Means and DB-scan with the same data sets. The proposed algorithm formulates the network traffic minimization and facilitates to construct three layer clustering algorithm. In the intermediate layer, a potential aggregator is created at each machine so that aggregation of data can be done from all mappers. Thus the proposed system efficiently reduces the network traffic cost.

Fig. 4
figure 4

Performance of the algorithms using execution Time

Fig. 5
figure 5

Comparisons of various techniques with accuracy

Figure 5 shows the system accuracy of the proposed system tested using the data in Table 2 and compared with various algorithms namely Bisecting K-Means, K-Means Parallel, Basic K-Means, DB-Scan on the same data sets. Accuracy is calculated using this formula

$$\begin{aligned} \mathrm{Accuracy }= \mathop \sum \nolimits _i P_i \left( {max_j \left( {\frac{P_{ij} }{P_j }} \right) } \right) \end{aligned}$$
(3)

Accuracy is mainly calculated for the quality of clusters and Fig. 5 shows the comparison of the various algorithms. The performance graph shown in Figs. 4 and 5 clearly demonstrates that the proposed algorithm better in performance than other algorithms.

Table 2 Comparison of various techniques accuracy %

5 Conclusion

In this system, layers 3 traffic aware clustering algorithm is proposed as an efficient traffic aware partition and aggregate to minimize the network traffic cost. Big data provides background of various clustering technique which can be used to analyze data and design three layered structure to solve the problem. The comparison the various algorithms like Bisecting K-Means, K-Means Parallel, Basic K-Means, DB-Scan with the proposed system were done on the same data sets to calculate their execution time and accuracy. The implementation results had been tested using four data sets and the obtained results demonstrates that the proposed system is more accurate. Same data groups from multiple tasks on different machines were combined before sending them to reduce tasks. The usage of the proposed algorithm shows better performance by reducing the network traffic using partition, aggregation and reduces the network traffic.