
1 Introduction

Clustering is the process of partitioning the data points of a given dataset into groups (clusters), such that data points in the same group are more similar to each other than to data points in other groups. Cluster analysis plays an important role in the Big Data problem. For example, it has been used to analyse gene expression data and, in image segmentation, to locate object borders in an image.

K-Means [1] is one of the most popular and widely used clustering algorithms. It has been extensively studied and improved to cope with the rapid, exponential increase in the size of datasets. One obvious solution is to parallelise K-Means, and it has been parallelised in different environments such as the Message Passing Interface (MPI) [2] and MapReduce [3].

For a given number of iterations, the computational complexity of K-Means is dominated by the distance computations required to determine the nearest centre for each data point. These operations consume most of the algorithm’s run-time because, in each iteration, the distance from each data point to each centre has to be calculated. Various optimisation approaches have been introduced to tackle this issue. Elkan [4] applied the triangle inequality property to eliminate unnecessary distance computations on high-dimensional datasets. An optimisation technique based on multidimensional trees (KD-Trees) [5] was proposed by Pelleg and Moore [6] to accelerate K-Means. Judd et al. [7] presented a parallel K-Means formulation for MPI and used two approaches to prune unnecessary distance calculations. Pettinger and Di Fatta [8, 9] proposed a parallel KD-Tree K-Means algorithm for MPI, which overcomes the load imbalance problem generated by KD-Trees in distributed computing systems. Other approaches have been proposed to improve K-Means efficiency on MapReduce by reducing the number of iterations. In contrast, we intend to accelerate K-Means on MapReduce by reducing the number of distance computations per iteration.

This paper describes the implementation of K-Means on MapReduce with a mapper-combiner-reducer approach and how the iterative procedure is accomplished on MapReduce. In addition, it presents some preliminary results on the effect of distance calculations on the performance of K-Means on MapReduce. Finally, two approaches are suggested to improve the efficiency of K-Means on MapReduce.

The rest of the paper is organised as follows: Sect. 2 briefly introduces K-Means and MapReduce, and presents a detailed description of Parallel K-Means on MapReduce. Section 3 reports the experimental results. Section 4 presents the work in progress. Finally, Sect. 5 concludes the paper.

2 Parallel K-Means on MapReduce

2.1 K-Means

Given a set \( X \) of \( n \) data points in a \( d \)-dimensional space \( {\mathbb{R}}^{d} \), and an integer \( k \) that represents the number of clusters, K-Means partitions \( X \) into \( k \) clusters by assigning each \( x_{i} \in X \) to its nearest cluster centre, or centroid, \( c_{j} \in C \), where \( C \) is the set of \( k \) centroids. Given a set of initial centroids, data points are assigned to clusters and cluster centroids are recalculated: this process is repeated until the algorithm converges or meets an early termination criterion. The goal of K-Means is to minimise the objective function known as the Sum of Squared Error, \( SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left\| x_{i} - c_{j} \right\|^{2} \), where \( x_{i} \) is the \( i^{th} \) data point in the \( j^{th} \) cluster and \( n_{j} \) is the number of data points in the \( j^{th} \) cluster. The time complexity of K-Means is \( O(nkd) \) per iteration.
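
As a concrete illustration of the objective, the following sketch computes the SSE for a given assignment of data points to centroids; the array-based layout and the method name are assumptions made for illustration, not part of the implementation discussed in this paper.

```java
// Minimal sketch of the SSE computation; inputs are illustrative assumptions.
static double sumOfSquaredError(double[][] points, double[][] centroids, int[] assignment) {
    double sse = 0.0;
    for (int i = 0; i < points.length; i++) {
        double[] c = centroids[assignment[i]];   // centroid c_j of the cluster x_i is assigned to
        for (int d = 0; d < points[i].length; d++) {
            double diff = points[i][d] - c[d];
            sse += diff * diff;                  // accumulate ||x_i - c_j||^2
        }
    }
    return sse;
}
```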

2.2 MapReduce

MapReduce [3] is a programming paradigm designed to store and process large-scale datasets efficiently and reliably on large clusters of commodity machines.

In this paradigm, the input data is partitioned and stored as blocks (or input-splits) on a distributed file system such as the Google File System (GFS) [10] or the Hadoop Distributed File System (HDFS) [11]. The main phases in MapReduce are Map, Shuffle, and Reduce. In addition, there is an optional optimisation phase called Combine. The MapReduce phases are explained as follows:

In the Map phase, the user implements a map function that takes as input the records inside each input-split in the form of key1-value1 pairs. Each map function processes one pair at a time. Once processed, a new set of intermediate key2-value2 pairs is output by the mapper. Next, this output is spilled to the disk of the local file system of the computing machine. In the Shuffle phase, the mappers’ output is sorted, grouped by key (key2) and shuffled to the reducers. Once the mappers’ outputs are transferred across the network, the Reduce phase proceeds: the reducers receive their input as key2-list(value2) pairs, and each reducer processes the list of values associated with each unique key2. Each reducer then produces results as key3-value3 pairs, which are written to the distributed file system. The Combine phase is an optional optimisation on MapReduce. Combiners minimise the amount of intermediate data transferred from mappers to reducers across the network by performing a local aggregation over the intermediate data.
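
In Hadoop’s Java API, these phases correspond to the class skeletons sketched below, shown with the key-value types used by the K-Means job detailed in Sect. 2.3; PointSumWritable is an assumed custom Writable (a d-dimensional vector plus a count), and the three classes are shown together for brevity although each would normally live in its own file.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Type flow for the K-Means job: (offset, line) -> (c_id, (vector, count)) -> (c_id, centroid).
// A Combiner is a Reducer whose input and output types match the Mapper's output types.
class KMeansMapper   extends Mapper<LongWritable, Text, IntWritable, PointSumWritable> { /* Map phase */ }
class KMeansCombiner extends Reducer<IntWritable, PointSumWritable, IntWritable, PointSumWritable> { /* Combine phase */ }
class KMeansReducer  extends Reducer<IntWritable, PointSumWritable, IntWritable, Text> { /* Reduce phase */ }
```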

2.3 Parallel K-Means on MapReduce Implementation

Parallel K-Means on MapReduce (PKMMR) has been discussed in several papers (e.g., [12, 13]). However, in this paper we explain in detail how counters are used to control the iterative procedure. Moreover, we show the percentage of the average time consumed by distance computations. PKMMR with a combiner consists of a Mapper, a Combiner, a Reducer, and a user program called Driver that controls the iterative process. In the following sections, a data point is denoted as \( dp \), a cluster identifier as \( c\_id \), and the combiner’s partial sum and partial count as \( p\_sum \) and \( p\_count \), respectively.

Driver Algorithm.

The Driver is a process that controls the execution of each K-Means iteration in MapReduce and determines whether the algorithm has converged or met another early termination criterion. The pseudocode is described in Algorithm-1. The Driver controls the iterative process through a user-defined counter called \( global\_counter \) (line 2). The global_counter is used as the termination condition of the while loop. The counter is incremented in the Reducer if the algorithm has not converged and no early termination condition has been met; otherwise, the counter remains zero and the while loop terminates. Besides configuring, setting up, and submitting the MapReduce job, the Driver also merges the multiple reducers’ outputs into one file that contains all updated centroids.
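
A hedged sketch of such a Driver loop, using Hadoop’s Job API and a user-defined counter, is shown below; the enum, the configuration wiring and the helper mergeCentroidFiles are illustrative assumptions, not the authors’ actual code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative sketch of the Driver loop of Algorithm-1.
public class KMeansDriver {
    public enum IterationCounter { GLOBAL_COUNTER }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int maxIterations = 10;          // early termination criterion
        long globalCounter = 1;          // force at least one iteration
        int iteration = 0;

        while (globalCounter > 0 && iteration < maxIterations) {
            Job job = Job.getInstance(conf, "k-means iteration " + iteration);
            job.setJarByClass(KMeansDriver.class);
            job.setMapperClass(KMeansMapper.class);
            job.setCombinerClass(KMeansCombiner.class);
            job.setReducerClass(KMeansReducer.class);
            // ... set key/value classes and input/output paths for this iteration ...
            job.waitForCompletion(true);

            // The Reducer increments GLOBAL_COUNTER only if the algorithm has not converged,
            // so a value of zero terminates the while loop.
            globalCounter = job.getCounters()
                               .findCounter(IterationCounter.GLOBAL_COUNTER).getValue();

            mergeCentroidFiles(conf, iteration);   // merge the reducers' outputs into one centroid file (assumed helper)
            iteration++;
        }
    }

    private static void mergeCentroidFiles(Configuration conf, int iteration) {
        // ... copy/merge the part-r-* files into a single file read by the next iteration's mappers ...
    }
}
```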

Mapper Algorithm.

Each Mapper processes an individual input-split received from HDFS. Each Mapper contains three methods: setup, map and cleanup. While the map method is invoked for each key-value pair in the input-split, the setup and cleanup methods are executed only once in each run of the Mapper. As shown in Algorithm-2, setup loads the centroids into c_list. The map method takes as input the offset of the dp and the dp itself as a key-value pair. In lines 4−10, where the most expensive operation in the algorithm occurs, the loop iterates over the c_list and assigns the dp to its closest centroid. Finally, the mapper outputs the c_id and an object consisting of the dp and the integer 1. Because Hadoop does not guarantee that the Combiner will run, the Mapper and Reducer must be implemented such that they produce the same results with and without a Combiner. For this reason, the integer 1 is sent with the dp (line 11) to represent p_count in case the Combiner is not executed.
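
A minimal sketch of this Mapper is given below, assuming one comma-separated data point per input record and an assumed PointSumWritable custom Writable that carries a vector together with a count; the configuration key used to locate the centroid file is likewise an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch of Algorithm-2.
public class KMeansMapper
        extends Mapper<LongWritable, Text, IntWritable, PointSumWritable> {

    private final List<double[]> cList = new ArrayList<>();   // current centroids (c_list)

    @Override
    protected void setup(Context context) throws IOException {
        String path = context.getConfiguration().get("kmeans.centroids.path");
        // ... read the centroid file at 'path' from HDFS and fill cList ...
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        double[] dp = parse(value.toString());          // one data point per input record
        int nearest = 0;
        double best = Double.MAX_VALUE;
        // Lines 4-10 of Algorithm-2: the most expensive part, one distance per centroid.
        for (int j = 0; j < cList.size(); j++) {
            double dist = squaredDistance(dp, cList.get(j));
            if (dist < best) { best = dist; nearest = j; }
        }
        // Line 11: emit (c_id, (dp, 1)); the count 1 stands in for p_count if no Combiner runs.
        context.write(new IntWritable(nearest), new PointSumWritable(dp, 1));
    }

    private static double[] parse(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
        return v;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
        return s;
    }
}
```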

Combiner Algorithm.

As shown in Algorithm-3, the Combiner receives from the Mapper (key, list(values)) pairs, where key is the c_id and list(values) is the list of dps assigned to this c_id, each along with the integer 1. In lines 2−6, the Combiner performs a local aggregation in which it calculates the p_sum and p_count of the dps in list(values) for each c_id. Next, in line 7, it outputs key-value pairs where key is the c_id and value is an object composed of the p_sum and p_count.
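
A corresponding Combiner sketch is shown below; because Hadoop may run the Combiner zero or more times, it consumes and emits the same (c_id, PointSumWritable) pairs as the Mapper, with the count field now holding p_count instead of 1. As before, PointSumWritable is an assumed helper class.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch of Algorithm-3: local aggregation of the Mapper's output.
public class KMeansCombiner
        extends Reducer<IntWritable, PointSumWritable, IntWritable, PointSumWritable> {

    @Override
    protected void reduce(IntWritable cId, Iterable<PointSumWritable> values, Context context)
            throws IOException, InterruptedException {
        double[] pSum = null;
        long pCount = 0;
        // Lines 2-6 of Algorithm-3: aggregate the points assigned to this c_id.
        for (PointSumWritable v : values) {
            double[] dp = v.getVector();
            if (pSum == null) pSum = new double[dp.length];
            for (int d = 0; d < dp.length; d++) pSum[d] += dp[d];
            pCount += v.getCount();                    // the 1s attached by the Mapper
        }
        // Line 7: emit the partial sum and partial count for this c_id.
        context.write(cId, new PointSumWritable(pSum, pCount));
    }
}
```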

Reducer Algorithm.

After the execution of the Combiner, the Reducer receives (key, list(values)) pairs, where key is the c_id and each value is composed of a p_sum and a p_count. In lines 2−6 of Algorithm-4, instead of iterating over all the dps that belong to a certain c_id, the p_sum and p_count values are accumulated and stored in total_sum and total_count, respectively. Next, the new centroid is calculated and added to new_c_list. In lines 9−11, the convergence criterion is tested: if the algorithm has not yet converged, the global_counter is incremented by one; otherwise, the global_counter stays zero and the algorithm is terminated by the Driver.
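
The Reducer sketch below follows the same structure; the specific convergence test (squared centroid movement above a small threshold) and the oldCentroids map loaded in setup are illustrative assumptions, since the paper only states that a convergence criterion is tested.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch of Algorithm-4.
public class KMeansReducer
        extends Reducer<IntWritable, PointSumWritable, IntWritable, Text> {

    private static final double EPSILON = 1e-6;                     // assumed convergence threshold
    private final Map<Integer, double[]> oldCentroids = new HashMap<>();

    @Override
    protected void setup(Context context) {
        // ... load the previous iteration's centroids into oldCentroids, as in the Mapper ...
    }

    @Override
    protected void reduce(IntWritable cId, Iterable<PointSumWritable> values, Context context)
            throws IOException, InterruptedException {
        double[] totalSum = null;
        long totalCount = 0;
        // Lines 2-6: accumulate the partial sums and counts instead of iterating over all dps.
        for (PointSumWritable v : values) {
            double[] s = v.getVector();
            if (totalSum == null) totalSum = new double[s.length];
            for (int d = 0; d < s.length; d++) totalSum[d] += s[d];
            totalCount += v.getCount();
        }
        // Compute the new centroid for this c_id.
        double[] newCentroid = new double[totalSum.length];
        for (int d = 0; d < totalSum.length; d++) newCentroid[d] = totalSum[d] / totalCount;

        // Lines 9-11: request another iteration only if this centroid has moved noticeably.
        double[] old = oldCentroids.get(cId.get());
        if (old == null || squaredDistance(newCentroid, old) > EPSILON) {
            context.getCounter(KMeansDriver.IterationCounter.GLOBAL_COUNTER).increment(1);
        }
        context.write(cId, new Text(toCsv(newCentroid)));
    }

    private static double squaredDistance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
        return s;
    }

    private static String toCsv(double[] v) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < v.length; i++) { if (i > 0) sb.append(','); sb.append(v[i]); }
        return sb.toString();
    }
}
```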

3 Experimental Results

To evaluate PKMMR, we run the algorithm on a Hadoop [14] 2.2.0 cluster of 1 master node and 16 worker nodes. The master node has 2 AMD CPUs running at 3.1 GHz with 8 cores each, 8 × 8 GB of DDR3 RAM, and 6 × 3 TB Near Line SAS disks running at 7200 rpm. Each worker node has 1 Intel CPU running at 3.1 GHz with 4 cores, 4 × 4 GB of DDR3 RAM, and 1 × 1 TB SATA disk running at 7200 rpm.

The datasets used in the experiments are artificially generated, with randomly distributed data points. The initial cluster centroids are randomly picked from the dataset [1]. The number of iterations is fixed at 10 in all experiments.

To show the effect of distance calculations on the performance of PKMMR, we run the algorithm with different numbers of data points n, dimensions d and clusters k. The percentage of the average time consumed by distance calculations in each iteration is represented by the grey area of each bar in Figs. 1-(a), (b), and (c). The white dotted area represents the percentage of the average time consumed by other MapReduce operations per iteration, including job configuration and distribution, map tasks (excluding distance calculations) and reduce tasks.

Fig. 1. Percentage of the average time consumed by distance calculations per iteration with variable numbers of d, k and n.

In each run, we compute the average run-time of one iteration by dividing the total run-time by the number of iterations. Then, the average run-time consumed by distance calculations per iteration is computed.

We run PKMMR with a variable number of dimensions d, while n is fixed at 1,000,000 and k is fixed at 128. Figure 1-(a) shows that 39 % (d = 4) to 63 % (d = 128) of the average iteration time is consumed by distance calculations.

PKMMR is also run with a variable number of clusters k, while n is set to 1,000,000 and d is set to 128. In Fig. 1-(b), a tremendous increase in the percentage of time consumed by distance calculations per iteration can be clearly seen, from 11 % (k = 8) to 79 % (k = 512). In this experiment, distance calculations become a performance bottleneck as the number of clusters increases, which is more likely to occur while processing large-scale datasets.

Figure 1-(c) illustrates the percentage of the average time spent on distance calculations when running PKMMR with a variable number of data points n, while d = 128 and k = 128. As can be observed, distance calculations consume most of the iteration time: about 65 % of the iteration time is spent on distance calculations when n = 1,250,000. Therefore, reducing the number of required distance calculations will most likely accelerate the iteration run-time and, consequently, improve the overall run-time of PKMMR.

4 Work in Progress

We intend to accelerate K-Means on MapReduce by applying two methods to reduce the number of distance computations in each iteration. Firstly, triangle inequality optimisation techniques will be implemented and tested with high-dimensional datasets. However, such techniques usually require extra information to be stored and transferred from one iteration to the next. As a consequence, large I/O and communication overheads may hinder the effectiveness of this approach if not taken into careful consideration. Secondly, efficient data structures, such as KD-Trees or other space-partitioning data structures [15], will be adapted to MapReduce and used with K-Means. Two issues will be investigated in this approach: first, the inefficient performance with high-dimensional datasets reported in [6]; second, the load imbalance addressed in [8, 9].
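
To illustrate the first direction, the sketch below shows the kind of triangle-inequality test (in the spirit of Elkan [4]) that could prune distance computations inside the assignment loop; the pairwise centre distances in interDist and the helper names are assumptions, and how such information would be stored and shipped between MapReduce iterations is exactly the open issue noted above.

```java
// Illustrative pruning of the nearest-centroid search using the triangle inequality.
// interDist[i][j] is the Euclidean distance between centroids i and j, computed once per iteration.
static int nearestCentroid(double[] dp, double[][] centroids, double[][] interDist) {
    int nearest = 0;
    double best = euclidean(dp, centroids[0]);
    for (int j = 1; j < centroids.length; j++) {
        // If d(c_nearest, c_j) >= 2 * d(dp, c_nearest), then d(dp, c_j) >= d(dp, c_nearest),
        // so centroid j cannot be closer and its distance computation is skipped.
        if (interDist[nearest][j] >= 2.0 * best) continue;
        double dist = euclidean(dp, centroids[j]);
        if (dist < best) { best = dist; nearest = j; }
    }
    return nearest;
}

static double euclidean(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
    return Math.sqrt(s);
}
```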

5 Conclusions

In this paper we have described the implementation of parallel K-Means on the MapReduce framework, together with a detailed explanation of the steps used to control the iterative procedure in MapReduce. Moreover, a detailed analysis of the average time consumed by distance calculations per iteration has been presented. The preliminary results clearly show that most of the iteration time is consumed by distance calculations. Hence, reducing this time might contribute to accelerating K-Means on the MapReduce framework. Two approaches are under investigation, based respectively on the triangle inequality property and on space-partitioning data structures.