
1 Introduction

Recently, clustering methods have attracted much attention as effective tools in theoretical and applied problems of machine learning, since they allow detecting patterns in raw or poorly structured data. Another motivation is the need to process large data sets in order to obtain a natural grouping of the data. Therefore, one of the main requirements for clustering methods is their scalability [9, 15].

There are a number of studies aimed at improving clustering quality using methods with high cost and time complexity [15, 16, 17, 19]. Such methods usually have a significant drawback: it is practically intractable to cluster medium and large data sets (approximately \(10^5\)–\(10^7\) objects or more) with them, because their computational complexity in time and space (memory) grows (polynomially) very rapidly. Therefore, it makes sense to look for algorithms with a reasonable trade-off between effective scalability and clustering quality [7, 8, 10, 11].

One of the well-known methods of data clustering is the k-means algorithm, which is widely used due to its simplicity and good characteristics [18, 20]. A number of algorithms and technologies have improved this method by clustering the input data objects in portions. There are algorithms that use data stream clustering or a decomposition approach, for example, mini-batch k-means [3, 8, 13]. A weighted version of the k-means algorithm also exists and has many applications [10].

Meta-heuristics can be of great help when the exact solution is difficult or expensive to obtain in terms of computation time and space. Several heuristics that accelerate k-means have been developed and implemented: some of them speed up the convergence of the method, others discard redundant or insignificant intermediate computations. The following meta-heuristics have shown their effectiveness in clustering big data:

  • deletion, at each iteration, of data patterns that are unlikely to change their membership in a particular cluster, as in [11, 12];

  • using the triangle inequality in [14];

  • combinations of various techniques [1, 4].

For many machine learning algorithms, processing big data is problematic and severely limits their usability. Our approach aims to turn this drawback into an advantage: the more data is given, the better the estimates that can be obtained. Since k-means is one of the fastest algorithms, we use it as the underlying basis of our approach. In this paper, we use the k-means++ modification to build an algorithmic meta-heuristic that uses subsets of the entire dataset at each step. We note that the ++ version of k-means uses a special initialization of the centroids [6].

Formally, let \(X = \{x_1, ..., x_N\}\) be a set of objects in Euclidean space to be clustered, with a set of corresponding weights \(\{ w_1, ..., w_N \}\). A collection \(\{ C_1, ..., C_k \}\) is a partition of X into k clusters if it satisfies (i) \(C_i \ne \emptyset ,\) (ii) \(C_i \cap C_j = \emptyset , i\ne j,\) \(i,j = 1,2,...,k,\) and (iii) \(\bigcup ~C_i = X.\) Then the minimum sum-of-squared distance (MSSD) clustering problem is defined as follows:

$$\begin{aligned} MSSD = \min _{C_1, ..., C_k} { \sum _{l=1}^{N} w_l \min _{j = 1,...,k} { || x_l - c_j || ^2}}, \end{aligned}$$

where the centroids are \(c_j = {\sum _{l \in arg(C_j)}^{} w_l x_l }/{\sum _{l \in arg(C_j)} w_l }.\) Correspondingly, the SSD criterion gives an estimate of a particular clustering partition:

$$\begin{aligned} SSD(C_1, ..., C_k) = { \sum _{l=1}^{N} w_l \min _{j = 1,...,k} { || x_l - c_j || ^2}}. \end{aligned}$$

If the weights of the objects are not specified, then \(w_l = 1\) for \(l \in 1, ..., N.\)
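For concreteness, the SSD criterion above can be computed directly; the following is a minimal NumPy sketch (the array names X, centroids, and weights are ours, and for very large N the pairwise distances would be computed block-wise rather than all at once).

```python
import numpy as np

def weighted_ssd(X, centroids, weights=None):
    """Weighted sum of squared distances from each object to its nearest centroid (SSD)."""
    if weights is None:
        weights = np.ones(len(X))                # unspecified weights default to 1
    # squared Euclidean distances from every object to every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((weights * d2.min(axis=1)).sum())
```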

Decomposing the dataset can technically be realized by stream-like methods. Stream processing of a window may be viewed as searching for dependencies between the essential information obtained from it and the information gathered previously by the computational model. The principal goal of this study is to investigate methods of dataset decomposition in a stream-like fashion for computing the k-means centroid initialization, producing convincing results with respect to the MSSD (Minimum Sum-of-Squared Distance) criterion [5]. In short, we look for methods that find a close-to-optimal k-means initialization while being computationally fast.

The idea of merging clusters obtained by partial clusterings is known in the literature: there are formal approximations that guarantee certain estimates of performance and quality [9]. The known clustering algorithm STREAM with the k-median l1-metric in [13] weights each center by the number of points assigned to it. Stream clustering usually assumes processing the input data in sequential order. Unlike this, we use a decomposition that allows essential parallelization of the clustering of the input dataset portions. In our algorithm, we add heuristic SSD estimates to the weighting. The algorithm can be used in cases where the dataset is replenished dynamically, on the fly, and possibly in real time. Clustering additional portions refines the clustering structure.

The goal of this work is to create a decomposition method for the k-means algorithm on large-scale datasets that initializes the centroids so as to obtain qualitative results with respect to the MSSD criterion. In other words, we seek an initialization of k-means that is close to optimal while being fast to compute. Different types of meta-heuristics are used in the k-means clustering task by processing the obtained data in a secondary (high-level) clustering procedure. A further goal of this work is to study the influence of the meta-parameters on the algorithm's behaviour with respect to computation time and the minimization of the SSD (Sum-of-Squared Distance) criterion, and to determine the bounds of the algorithm's efficiency.

2 Algorithm

The idea of the algorithm is to partition the dataset into smaller portions, find the clustering structure of each portion, and replace it by single centroid points. At this stage we obtain a compact representation of the initial dataset that preserves its most essential structural information. Then these centroids are aggregated and clustered in different possible ways, yielding new heuristics for generalized centroids. In short: transform the Big Data into Small Data, cluster it, and use the obtained centroids to initialize clustering of the original Big Data.

More formally, in this approach we first decompose the entire dataset, with its entries shuffled randomly, into subsets of fixed size (taking either all elements or a representative portion of them). The next step is to run k-means clustering on some of these subsets (batches/windows). We use the term 'window' (alongside the term 'batch' from the literature) to stress that the data subsets are taken in sizes proportional to the entire dataset.

(Meta-) parameters of the algorithm:

  • k is the number of required clusters.

  • N is the number of objects in the entire data set.

  • d is the window size (the number of objects in one window). The window sizes are chosen in proportion to the entire dataset; e.g., a decomposition into 5 windows means that the window size equals [N/5].

  • n is the number of windows used for independent initialization of k-means during Phase 1, see next section.

  • \(m (\ge n)\) is the total number of windows used for the clustering. The union of m windows may or may not cover the entire dataset.

Using the SSD estimates of the corresponding clusterings, we build heuristics for a better initialization of the algorithm on the entire dataset.

We considered the following two modes of window (win) generation:

  1. The entire data set is segmented into windows: a random permutation of the objects in the data set is created, and the data set is segmented into successive windows of size d. We refer to this as the uniform window decomposition mode.

  2. For each window, d random objects are selected from the entire set. By repeating this, the required number of windows is generated (objects may be picked repeatedly in different windows). We refer to this as the random window generation mode.
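The two window-generation modes may be sketched as follows (an illustration only; the function names and the use of NumPy's random generator are our own choices).

```python
import numpy as np

def uniform_windows(X, d, rng):
    """Mode 1: shuffle the dataset once, then cut it into successive windows of size d."""
    perm = rng.permutation(len(X))
    return [X[perm[i:i + d]] for i in range(0, len(X) - d + 1, d)]

def random_windows(X, d, m, rng):
    """Mode 2: each of the m windows is d objects drawn from the entire set;
    the same object may appear in different windows."""
    return [X[rng.choice(len(X), size=d, replace=False)] for _ in range(m)]

# e.g., windows of size [N/20] from a dataset X:
# rng = np.random.default_rng(0); wins = uniform_windows(X, len(X) // 20, rng)
```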

To simplify the description, we distinguish centroids according to the algorithmic step at which they appear:

  • centroids that result from k-means++ on separate windows and are used for subsequent initializations are called local centroids;

  • the set of generalized centroids is obtained by gathering (uniting) the local centroids resulting from the clusterings on windows;

  • basis centroids are obtained by k-means++ clustering of the set of generalized centroids (considered as a small dataset giving a higher-level representation of the windows);

  • final centroids are obtained by running k-means on the entire dataset, initialized by the basis centroids.

2.1 Phase 1: Aggregation of Centroids

Phase 1 is an independent application of the k-means++ algorithm on a fixed number n of windows in order to obtain local centroids, followed by their aggregation into the generalized set of centroids. The scheme of the algorithm is shown in Fig. 1. These centroids are considered as a higher-level representation of the clustered windows. Each object of the generalized set of centroids is assigned a weight corresponding to the normalized SSD value of the window in which it was calculated as a centroid. The weight of the i-th object is calculated as follows:

$$\begin{aligned} w_i = 1 - (SSD_i-SSD_{min})/(SSD_{max}-SSD_{min}), \end{aligned}$$
(1)

where \(SSD_i\) is the SSD value of the window from which the i-th centroid is taken as an object. Then, using k-means, the new dataset of generalized centroids is divided into k clusters, taking into account the weights \(w_i\) of the objects. In the case of degeneration, k-means is reinitialized (a code sketch of this phase is given after the list below). The resulting (basis) centroids are used for:

  1. initialization of k-means on the Input Dataset in order to obtain final centroids;

  2. evaluation of the SSD on the Input Dataset;

  3. initialization of Phase 2 of the algorithm, described in the following sections.
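A minimal sketch of Phase 1 along the lines described above, using scikit-learn's KMeans, could look as follows; the helper name phase1 and the list windows are our own, and the re-initialization in the degenerate case is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def phase1(windows, k, seed=0):
    """Cluster n windows independently with k-means++, aggregate the local centroids,
    weight them by Eq. (1), and cluster the weighted set to obtain basis centroids."""
    local_centroids, ssds = [], []
    for win in windows:                                   # windows 1, ..., n
        km = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=seed).fit(win)
        local_centroids.append(km.cluster_centers_)
        ssds.append(km.inertia_)                          # SSD of this window
    generalized = np.vstack(local_centroids)              # generalized set of centroids
    ssd_per_centroid = np.repeat(ssds, k)                 # one SSD value per local centroid
    # Eq. (1); the small epsilon only guards the degenerate case of equal SSDs
    w = 1.0 - (ssd_per_centroid - ssd_per_centroid.min()) / \
        (ssd_per_centroid.max() - ssd_per_centroid.min() + 1e-12)
    km = KMeans(n_clusters=k, init='k-means++', n_init=10,
                random_state=seed).fit(generalized, sample_weight=w)
    return km.cluster_centers_                            # basis centroids
```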

Fig. 1.

Scheme of the decomposition/aggregation clustering method. In Phase 1, k-means++ initialization is performed independently on windows 1, ..., n. The resulting final centroids are used for initialization during Phase 2 on windows \(n+1, ..., m\).

During the processing of the subsequent windows \(n+1\), \(n+2\), ..., m, we have considered the following alternative options for Phase 2: the parallel option, the straightforward option, and the sequential option.

2.2 Phase 2: Parallel Option

The centroids obtained in Phase 1 are used to initialize k-means on each subsequent window \(n+1\), \(n+2\), ..., m. The resulting (local) centroids and SSD estimates are stored if there is no centroid degeneration. The stopping condition is a specified limit either on the computation time or on the number of windows being processed. Similarly to Phase 1, we then cluster the generalized set of centroids. The subsequent use of its results is similar to items 1 and 2 of Phase 1. Both Phase 1 and the parallel option of Phase 2 are depicted together in Fig. 1.
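Since the windows in this option do not depend on each other, a possible (non-prescribed) realization is to distribute them over workers, e.g., with joblib; the function names below are our own assumptions.

```python
from joblib import Parallel, delayed
from sklearn.cluster import KMeans

def cluster_window(win, init_centroids, k):
    """k-means on one window, initialized with the centroids obtained in Phase 1."""
    km = KMeans(n_clusters=k, init=init_centroids, n_init=1).fit(win)
    return km.cluster_centers_, km.inertia_

def phase2_parallel(windows, init_centroids, k, n_jobs=-1):
    """Windows n+1, ..., m are independent, so they can be clustered in parallel;
    degenerate results would be filtered out afterwards, as described in the text."""
    return Parallel(n_jobs=n_jobs)(
        delayed(cluster_window)(w, init_centroids, k) for w in windows)
```

The aggregation of the stored local centroids and SSD estimates then proceeds as in Phase 1.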

2.3 Phase 2: Straightforward Option

Fig. 2.

Direct subsequent use of the best obtained SSD

An alternative heuristic for splitting the entire dataset and an alternative way of choosing centroids for the clustering initialization are used (see Fig. 2).

The idea of this heuristic is to evaluate the obtained centroids with respect to the SSD criterion and to use the best of them for the initialization of k-means on the subsequent window. While each window is clustered, the best obtained centroids are accumulated in order to process them in a final clustering, as in Phase 1; an illustrative sketch in code is given after the steps below.

Algorithm Sketch:

  1. Make a decomposition of the dataset into subsets \(win_0, win_1, ..., win_l, ...\) of equal size.

  2. Obtain the list of initial centroids \(cent_0\) either by k-means++ on the first window \(win_0\) or by Phase 1. Assign \(AC \leftarrow [cent_0]\), \(c \leftarrow cent_0\), \(BestSSD \leftarrow SSD_0\), \(l = 1.\)

  3. (Start iteration) Use the centroids c to initialize k-means on the next window \(win_l\). Calculate \(cent_l\) and \(SSD_l\).

  4. If degeneracy is present in the clustering (i.e., the number of obtained non-trivial clusters is less than k, \(|cent_l| < k\)), then withdraw \(win_l\) and continue from step 3 with the next \(l \leftarrow l+1\).

  5. If the clustering SSD is within the previously obtained or best SSD, then \(AC \leftarrow AC \cup cent_l\).

  6. If the clustering SSD gives a better score, then mark it as the best and use it for the following initializations, i.e., \(BestSSD \leftarrow SSD_l\), \(c \leftarrow cent_l\).

  7. Repeat from step 3 until all windows have been processed or the time-bounding condition is satisfied.

  8. The accumulated centroids AC are considered as elements for an additional clustering, while their SSD values are used to calculate the corresponding weights, as in Phase 1.

  9. The obtained centroids AC are used for the clustering (as in the previous part), and the final SSD value is compared with the SSD of k-means++ on the entire dataset.
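The steps above could be realized, for instance, as in the following sketch (our reading of step 5 is that the window SSD must not exceed the best SSD so far; the degeneracy test via cluster labels is also our own choice).

```python
import numpy as np
from sklearn.cluster import KMeans

def phase2_straightforward(windows, cent0, ssd0, k):
    """Steps 2-8: accumulate centroids whose window SSD is within the best SSD so far,
    always initializing the next window with the best centroids found."""
    AC, AC_ssd = [cent0], [ssd0]            # accumulated centroids and their window SSDs
    c, best_ssd = cent0, ssd0
    for win in windows:                                    # win_1, win_2, ...
        km = KMeans(n_clusters=k, init=c, n_init=1).fit(win)
        cent_l, ssd_l = km.cluster_centers_, km.inertia_
        if len(np.unique(km.labels_)) < k:                 # step 4: degeneracy -> withdraw window
            continue
        if ssd_l <= best_ssd:                              # step 5
            AC.append(cent_l); AC_ssd.append(ssd_l)
        if ssd_l < best_ssd:                               # step 6: new best initialization
            best_ssd, c = ssd_l, cent_l
    # step 8: elements and SSD values for the additional weighted clustering
    return np.vstack(AC), np.repeat(AC_ssd, k)
```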

2.4 Phase 2: Sequential Option

The following is the sequential version of the algorithm. It is schematically represented in Fig. 3, and an illustrative code sketch is given after the steps below.

Algorithm Sketch:

  1. \(l = 1\), \(init \leftarrow \) the centroids from Phase 1, m is the fixed parameter.

  2. Perform k-means clustering on the window \(m + l\) with initialization init.

  3. If there is no degeneration during clustering, then memorize the resulting (local) centroids and the corresponding SSD values.

  4. In order to obtain new centroids, carry out the clustering with weights on the united set of centroids (similarly to Phase 1).

  5. If the time limit has not been exhausted, then \(init \leftarrow \) the centroids obtained in step 4, \(l = l+1\), and go to step 2; otherwise go to step 6.

  6. The subsequent usage of the obtained centroids is similar to clauses 1 and 2 of Phase 1.
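A sketch of the sequential option under the above description (the time budget, the helper names, and the small epsilon guarding Eq. (1) against equal SSD values are our assumptions):

```python
import time
import numpy as np
from sklearn.cluster import KMeans

def phase2_sequential(windows, init, k, time_limit_s):
    """After each non-degenerate window, re-cluster the united set of local centroids
    with SSD-based weights (as in Phase 1) and use the result as the next initialization."""
    start = time.time()
    cents, ssds = [], []
    for win in windows:                                        # windows m+1, m+2, ...
        km = KMeans(n_clusters=k, init=init, n_init=1).fit(win)     # step 2
        if len(np.unique(km.labels_)) == k:                    # step 3: skip degenerate results
            cents.append(km.cluster_centers_)
            ssds.append(km.inertia_)
            united = np.vstack(cents)
            s = np.repeat(ssds, k)
            w = 1.0 - (s - s.min()) / (s.max() - s.min() + 1e-12)    # Eq. (1)
            init = KMeans(n_clusters=k, init='k-means++', n_init=10  # step 4
                          ).fit(united, sample_weight=w).cluster_centers_
        if time.time() - start > time_limit_s:                 # step 5: time limit
            break
    return init                      # used as in clauses 1 and 2 of Phase 1 (step 6)
```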

3 Computational Experiments

In this section we present the test results of our algorithm from Sect. 2.4 on three datasets. We only present the results computed by the sequential version described in Phase 2, with the initial centroids precomputed (sequentially) according to Phase 1. We do not include the parallel version of Phase 2, since it differs in the way the windows are clustered and would require additional effort to compare the computation times (taking parallelism into account). The straightforward case can be seen as a particular case of window aggregation.

Fig. 3.

Sequential aggregation (accumulation) of the heuristically optimal SSD and centroids

Table 1 summarizes the clustering estimates of the used datasets. The results of computations for the various meta-parameter sets from Table 2 are compared, in terms of SSD and time estimates, with those obtained by k-means++ and are summarized in Table 3. Each line of Table 3 corresponds to a unique meta-parameter set and includes two SSD estimates, the average time per clustering, and the ratio between the computation time of the proposed decomposition algorithm and that of k-means++. The first SSD estimate is obtained as follows: we cluster the corresponding entire dataset by k-means++ with the corresponding parameters and take the obtained SSD criterion as the baseline value. Then we run independent clusterings with our algorithm over the ranges of experiments to calculate the basis centroids and check whether the SSD values obtained on the basis centroids improve the baseline values. The rates are presented for the cases where our algorithm finds a better solution. We present this in order to show what approximation our algorithm gives when the entire dataset is not involved. We note that in order to obtain the basis centroids we only need to process separate windows. The second SSD estimate is calculated in the same way with one additional step: k-means is run on the entire dataset, initialized by the basis centroids. Comparing these two columns of SSD estimates in Table 3 over various parameters and datasets allows us to consider the obtained basis centroids as a reasonable approximation to the MSSD on the entire dataset.
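The two SSD estimates could be computed along the following lines (a sketch; the dataset X, the number of clusters k, and the basis centroids are assumed to be given, and the function name is our own).

```python
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def compare_with_baseline(X, basis_centroids, k):
    """Baseline SSD of k-means++ vs. the two SSD estimates of the proposed algorithm."""
    baseline_ssd = KMeans(n_clusters=k, init='k-means++', n_init=10).fit(X).inertia_
    # estimate 1: SSD of the basis centroids, without clustering the entire dataset
    _, dists = pairwise_distances_argmin_min(X, basis_centroids)
    ssd_basis = float((dists ** 2).sum())
    # estimate 2: one k-means pass on the entire dataset, initialized by the basis centroids
    ssd_final = KMeans(n_clusters=k, init=basis_centroids, n_init=1).fit(X).inertia_
    return baseline_ssd, ssd_basis, ssd_final
```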

Table 1. SSDs and computation times for datasets DS1, DS2, DS3. k-means++ is performed with the default parameters of the programming library sklearn [1], i.e., 10 separate initializations are executed and the best result is presented.
Table 2. Meta-parameters of the experiments from Table 3. Window sizes are taken in the ranges (N/20, N/25, ..., N/100), (N/100, N/110, ..., N/250) and (N/10, N/20, ..., N/150). 'Allow repeats in windows' refers to the mode of window generation: the random window generation mode allows repeated data objects (TRUE), while the uniform window decomposition mode does not (FALSE).
Table 3. Experiments on the different parameter sets from Table 2. SSD criteria and computation times are presented. Two SSD estimates are considered: 1. the rate at which our algorithm improves* the computed SSD with respect to k-means++ (baseline), where the centroids result from the aggregation step, i.e., the criterion is estimated on the basis centroids; 2. the rate at which our algorithm improves** the baseline with the final centroids, i.e., after the additional step with k-means initialized by the basis centroids. The improvement rates are given with respect to k-means++ on the windows from Table 1. Average times per clustering procedure are given.

Datasets Description:

We used three datasets of real-valued data: DS1, DS2, and DS3.

  • DS1 contains \(4\times 10^6\) objects, and the number of attributes (features) is 25. The structure of the data: 50 synthetic blobs with Gaussian distribution, each having the same number of elements and the same standard deviation value. There are no overlaps between the separate blobs. (A sketch for generating similar data is given after this list.)

  • DS2 contains \(4\times 10^6\) objects, and the number of attributes (features) is 20. The structure of the data: 20 synthetic blobs with Gaussian distribution; each blob has a variable number of objects (from \(10^4\) to \(40 \times 10^4\)) and a variable standard deviation value (distributed randomly in the range from 0.5 to 1.5).

  • DS3 is the SUSY dataset from the open UCI database [2]. The number of attributes is 18 and the number of objects is \(5 \times 10^6\). In our study we do not take into account the true labelling provided by the database, i.e., the given predictions for the two known classes. The purpose of using this dataset is to search for internal structure in the data. The dataset is preprocessed by normalization prior to clustering.
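For reference, DS1-like synthetic data can be produced with scikit-learn's make_blobs; the sketch below uses our own choices of cluster standard deviation and random seed, and these defaults do not guarantee the non-overlapping blobs of the actual DS1.

```python
from sklearn.datasets import make_blobs

# DS1-like data: 4*10^6 objects, 25 features, 50 Gaussian blobs of (approximately) equal size;
# cluster_std and random_state are illustrative choices, not the paper's parameters
X_ds1, _ = make_blobs(n_samples=4_000_000, n_features=25, centers=50,
                      cluster_std=1.0, random_state=0)
```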

4 Conclusions

In this paper we show that it is possible to achieve better results, in the sense of the SSD criterion, by iteratively applying the clustering procedure to subsets of the dataset. The obtained centroids are processed again by a (meta-)clustering step, leading to the final solution.

Some observations:

  • One promising result is that the centroids calculated by the proposed method on large datasets provide reasonably good SSD values even without clustering the whole dataset. Step 6 in Part 2.2 and step 9 in Part 2.3 may in many cases be omitted, giving an essential advantage in computational speed.

  • It was observed that there is no sense in splitting the dataset into a huge number of windows, since the number of degenerate clusters grows as well.

  • A slight improvement is detected on normalized data and for a small number of clusters.

  • Our experiments mostly support the idea that the quality and precision of the clustering results depend strongly on the dataset size and its internal structure, while they do not strongly depend on the clustering window/batch size, as long as the majority of windows represent the clustering structure of the entire dataset.