Introduction

The development of technologies such as smart cities and the Internet of Things (IoT) has led to a rapid increase in data. Consequently, machine learning faces a new challenge: solving problems on data that are large in volume, variety, velocity, and veracity. Many methods have been proposed over the years to deal with big data. One of the simplest is to rely on infrastructure and hardware, but few can afford this. Another option is to find algorithms that reduce the computational complexity arising from input sizes of millions or billions of data points. The idea of extracting a relevant subset from the original data to decrease the computational cost led to the concept of the coreset, which was first applied in geometric approximation by Agarwal et al. [1, 2]. The problem of coreset construction for k-median and k-means clustering was then stated and investigated by Har-Peled et al. [17, 18, 19].

In recent years, many coreset construction algorithms have been proposed for a wide variety of clustering problems [10, 12, 13, 22]. These studies aim to find algorithms that create samples that are more accurate or faster to compute. Among the many lines of investigation, two approaches are of particular interest to us: sampling-based methods and farthest-first-traversal-based algorithms. Within each technique there are various studies. In this paper, we focus on four state-of-the-art methods:

  • The first one is the ProTraS algorithm, proposed in 2018 by Ros and Guillaume [29]. In this paper, we use the improved version proposed in [31].

  • The second coreset method is based on the native Farthest-First-Traversal algorithm in [32].

  • The third coreset construction is Adaptive Sampling by Feldman et al. [14].

  • The last one is the Lightweight Coreset by Bachem et al. [7]. The lightweight coreset defines a new, lighter type of coreset, but the samples created for it can still be considered coresets that fit the full data set well.

  • In addition, we use Uniform (Naive) Sampling as the baseline for comparison.

The remainder of this paper is organized as follows. In “Background and Related Works”, we discuss related work and the background used in this paper. In “Coreset Construction Methods Used for Comparisons”, we briefly introduce the four coreset construction methods. In “Comparative Results”, we run experiments and compare the methods using relative error, and then discuss the advantages and disadvantages of each method. We end with the conclusion in “Conclusions”.

Background and Related Works

k-Median and k-Means Clustering

  • k-means clustering is a popular method, originally from signal processing, for cluster analysis in data mining. The standard algorithm was first proposed by Lloyd of Bell Labs in 1957 and published later in [21]. The algorithm was then developed further by Inaba et al. [24], Vega et al. [33], and Matousek [25], and extended as k-means++ by Arthur and Vassilvitskii [4].

  • k-median clustering is a variation of k-means in which the median, rather than the mean, is calculated to determine the center of each cluster. There is also plenty of research on this algorithm, such as Arora [3] and Charikar et al. [9].

In this paper, we use k-clustering to refer to both k-median and k-means. The k-clustering problems can be stated as follows:

Let \(X \subset \mathbb {R}^d\). The k-clustering problems are to find \(Q \subset \mathbb {R}^d\) with \(|Q| = k\) such that the following functions are minimized:

$$\begin{aligned} \phi _X(Q)&= \sum _{x \in X} d(x, Q) = \sum _{x \in X} \min _{q \in Q} ||x-q|| \end{aligned}$$
(1)
$$\begin{aligned} \phi _X(Q)&= \sum _{x \in X} d(x, Q)^2 = \sum _{x \in X} \min _{q \in Q} ||x-q||^2 \end{aligned}$$
(2)

Equations (1) and (2) are for k-median and k-means, respectively.
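For illustration, the following is a minimal sketch of how the two cost functions above can be evaluated for a finite data set X and a candidate center set Q (a NumPy-based illustration of ours, not part of the compared algorithms):

```python
import numpy as np

def clustering_cost(X, Q, squared=False):
    """Evaluate phi_X(Q): the k-median cost (Eq. 1), or the
    k-means cost (Eq. 2) when squared=True."""
    # Pairwise Euclidean distances between every point and every center
    dists = np.linalg.norm(X[:, None, :] - Q[None, :, :], axis=2)
    closest = dists.min(axis=1)              # d(x, Q) for each x in X
    return np.sum(closest ** 2) if squared else np.sum(closest)

# Example: 1,000 random 2-D points and k = 3 candidate centers
X = np.random.rand(1000, 2)
Q = np.random.rand(3, 2)
print(clustering_cost(X, Q), clustering_cost(X, Q, squared=True))
```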

Coresets for k-Median and k-Means Clustering

If the data set X mentioned above is large enough, solving these k-median and k-means problems exactly becomes hard and expensive. Therefore, instead of solving the problem on X, a classical technique is to extract a small amount of information from the given data and perform the computation on this extracted subset. However, in many circumstances it is not easy to find the most relevant subset. Consequently, attention has shifted to developing approximation algorithms. The goal is then to compute a \((1+\varepsilon)\)-approximation subset, for some \(0< \varepsilon < 1\). The framework of coresets has recently emerged as a general approach to achieving this goal [2].

The definition of coresets for k-median and k-means can be stated as:

Definition 1

Coresets for k-median and k-means clustering. Let \(\varepsilon > 0\). The weighted set C is a \((k,\varepsilon)\)-coreset of X if, for any \(Q \subset \mathbb {R}^d\) of cardinality at most k,

$$\begin{aligned} | \phi _X(Q) - \phi _C(Q)| \le \varepsilon \phi _X(Q) ; \end{aligned}$$
(3)

which is equivalent to

$$\begin{aligned} (1-\varepsilon )\phi _X(Q) \le \phi _C(Q) \le (1+\varepsilon )\phi _X(Q). \end{aligned}$$
(4)
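To make the role of the weights in C concrete, the following is a minimal sketch (ours, for illustration) of the weighted cost \(\phi_C(Q)\) and a check of inequality (4) for one candidate center set; a true \((k,\varepsilon)\)-coreset must satisfy this check for every Q of cardinality at most k, and the weighting convention assumed here (each coreset point standing in for the points it represents) should be checked against the construction at hand.

```python
import numpy as np

def weighted_cost(P, w, Q, squared=True):
    """phi_P(Q) for a weighted point set P with weights w
    (squared distances for k-means, plain distances for k-median)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2).min(axis=1)
    return np.sum(w * (d ** 2 if squared else d))

def satisfies_coreset_bound(X, C, w, Q, eps):
    """Check inequality (4) for a single center set Q."""
    phi_X = weighted_cost(X, np.ones(len(X)), Q)
    phi_C = weighted_cost(C, w, Q)
    return (1 - eps) * phi_X <= phi_C <= (1 + eps) * phi_X
```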

Coreset Construction Methods Used for Comparisons

Many coreset construction algorithms have been proposed in recent years for clustering problems. As a result, various approaches have been investigated, such as exponential grids [11], bounding points [19], or dimension reduction [15]. In this paper, we focus on two other techniques that have been used in recent research: farthest-first-traversal-based algorithms and sampling-based methods.

Farthest-First-Traversal (FFT)-Based Algorithm

In computational geometry, the FFT of a metric space is a set of points selected sequentially: after the first point is chosen arbitrarily, each successive point is the one farthest from the set of previously selected points. The first use of the FFT was by Rosenkrantz et al. [30] in connection with heuristics for the traveling salesman problem. Then Gonzalez [16] used it as part of a greedy approximation algorithm for finding k clusters that minimize the maximum cluster diameter. Later, Arthur and Vassilvitskii [4] used an FFT-like procedure to propose the k-means++ algorithm.

[Algorithm 1]
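As a point of reference for the FFT-based constructions below, here is a minimal sketch of the basic farthest-first traversal on a Euclidean data set (an illustration of the generic procedure, not the listing of Algorithm 1):

```python
import numpy as np

def farthest_first_traversal(X, m, seed=0):
    """Select m points from X by farthest-first traversal: the first point
    is chosen arbitrarily (here at random); each subsequent point is the
    one farthest from the already selected set."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # Distance of every point to its nearest selected point so far
    dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[selected]
```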

Ros and Guillaume proposed DENDIS [27], DIDES [28], and, in 2018, ProTraS [29], all based on the FFT algorithm. These are iterative algorithms based on the hybridization of distance and density concepts. They differ in the priority given to distance or density, and in the stopping criterion defined accordingly. Following these proposals, many further FFT-based improvements to coreset construction have appeared. Two state-of-the-art methods are proposed in [31] and [32]; we use these two for comparisons in the next section.

ProTraS Post-Processing Improvement

The authors of [31] proposed an improvement to ProTraS. The idea is to post-process the sample obtained by ProTraS, replacing each representative with the center of the group it represents. Thereby, objects located at the boundaries of clusters are replaced by interior ones. The resulting sample thus has better-separated clusters, which improves the quality of the subsequent clustering process. This algorithm is described in Algorithm 2.

[Algorithm 2]
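A minimal sketch of this post-processing idea follows; taking the mean of each represented group as its new center is our simplifying assumption for illustration, and the exact replacement rule is given in [31] and Algorithm 2:

```python
import numpy as np

def postprocess_representatives(X, reps):
    """Replace each ProTraS representative by the center of the group
    of points it represents, pulling boundary representatives inward."""
    # Assign every point of X to its nearest representative
    assign = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2).argmin(axis=1)
    new_reps = reps.astype(float).copy()
    for j in range(len(reps)):
        group = X[assign == j]
        if len(group) > 0:
            new_reps[j] = group.mean(axis=0)  # center of the represented group
    return new_reps
```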

FFT-Based with Pre-processing Improvement

ProTraS provides a good method for building coresets, but it still has some drawbacks. To overcome them, the authors of [32] use a native FFT with two pre-processing strategies to create a coreset. The first is to select a specific initial point instead of a random one, and the second is a technique that reduces the computational complexity to make the construction much faster. This algorithm is described in Algorithm 3.

[Algorithm 3]

Sampling-Based Methods

Sampling is a concept from statistics: the selection of a subset of individuals from a population or original set according to a specific probability distribution. Finding coresets based on sampling is quite simple and fast; however, it requires substantial mathematical proofs and theorems behind it. As with other approaches, there is much research on sampling-based coreset construction. In this paper, along with naive (uniform) sampling as the baseline for comparisons, we use two of the most effective and well-proven methods: Adaptive Sampling [14] and the Lightweight Coreset [7].

Adaptive Sampling

This algorithm was proposed in [14]. The key idea is to build an approximate solution (a sample set C) and to use it to bias the random sampling. The first step is an iterative procedure that samples a small number of points and removes the half of the remaining data set closest to the sampled points, repeating this until few points remain. In the second step, the sampling is biased with probabilities, for each point in X, that are roughly proportional to its squared distance to C. This algorithm is widely used in the literature and is described in Algorithm 4.

[Algorithm 4]
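The following is a rough sketch of this two-step idea, based only on the description above; the per-round sample size is a placeholder of ours and the weighting of the final coreset points is omitted, so the actual procedure in [14] and Algorithm 4 should be consulted:

```python
import numpy as np

def adaptive_sampling(X, m, per_round=10, seed=0):
    """Sketch: build a rough solution C by repeated sample-and-halve,
    then draw m points with probability proportional to squared distance to C."""
    rng = np.random.default_rng(seed)
    remaining, C = X.copy(), []
    # Step 1: sample a few points, drop the half of the data closest to them
    while len(remaining) > per_round:
        idx = rng.choice(len(remaining), size=per_round, replace=False)
        C.append(remaining[idx])
        d = np.linalg.norm(remaining[:, None] - remaining[idx][None], axis=2).min(axis=1)
        remaining = remaining[np.argsort(d)[len(remaining) // 2:]]  # keep the farther half
    C.append(remaining)
    C = np.vstack(C)
    # Step 2: importance sampling biased by squared distance to C
    d2 = np.linalg.norm(X[:, None] - C[None], axis=2).min(axis=1) ** 2
    idx = rng.choice(len(X), size=m, replace=True, p=d2 / d2.sum())
    return X[idx]
```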

Lightweight Coresets

In inequality (3) of the coreset definition, the right-hand term \(\varepsilon \phi _X(Q)\) allows the approximation error to scale with the quantization error and covers both additive and multiplicative errors. Bachem et al. [5, 6, 7, 8] interpret and split these errors, which leads to the definition of lightweight coresets as follows.

Definition 2

Lightweight Coresets for k-clustering. Let \(\varepsilon > 0\) and \(k \in \mathbb {N}\). Let \(X \subset \mathbb {R}^d\) be a set of points with mean \(\mu (X)\). The weighted set C is an \((\varepsilon , k)\)-lightweight coreset of X if for any set \(Q \subset \mathbb {R}^d\) of cardinality at most k

$$\begin{aligned} \vert \phi _X(Q) - \phi _C(Q) \vert \le \frac{\varepsilon }{2} \phi _X(Q) + \frac{\varepsilon }{2} \phi _X(\{\mu (X)\}) \end{aligned}$$
(5)

In inequality (5), the \(\frac{\varepsilon }{2} \phi _X(Q)\) term allows the approximation error to scale with the quantization error and constitutes the multiplicative part, while the \(\frac{\varepsilon }{2} \phi _X(\{\mu (X)\})\) term scales with the variance of the data and corresponds to the additive approximation error, which does not depend on Q.

Even though there are differences in definitions between Coresets and Lightweight Coresets, Bachem et al. [7] have shown that as we decrease \(\varepsilon\), the true cost of the optimal solution obtained on the lightweight coreset approaches the true cost of the optimal solution on the full data set in an additive manner.

Construction of Lightweight Coresets The construction is based on importance sampling. Let q(x) be any probability distribution on X and Q any set of k centers in \(\mathbb {R}^d\). The quantization error can be approximated by sampling m points from X using q(x) and assigning them weights inversely proportional to q(x). The distribution q(x) is defined as follows:

$$\begin{aligned} q(x) = \frac{1}{2} \frac{1}{\vert X \vert } + \frac{1}{2} \frac{d(x,\mu (X))^2}{\sum _{x' \in X}{d(x',\mu (X))^2}} \end{aligned}$$
(6)

The lightweight coreset construction is described in Algorithm 5.

[Algorithm 5]
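Because q(x) in Eq. (6) depends only on the distances to the data mean, the construction can be sketched in a few lines. The weight normalization \(1/(m\,q(x))\) below follows the usual importance-sampling convention and is our reading of “weights inversely proportional to q(x)”; Algorithm 5 and [7] give the exact procedure:

```python
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Sketch of the lightweight coreset construction: sample m points from
    the distribution q(x) of Eq. (6) and weight them by 1 / (m * q(x))."""
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    d2 = np.sum((X - mu) ** 2, axis=1)           # d(x, mu(X))^2
    q = 0.5 / len(X) + 0.5 * d2 / d2.sum()       # Eq. (6)
    idx = rng.choice(len(X), size=m, replace=True, p=q)
    return X[idx], 1.0 / (m * q[idx])            # points and importance weights
```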

Comparative Results

In this section, we compare the following five methods:

  • Uniform Sampling: a naive approach to coreset construction based on uniform sub-sampling of the data. This may be regarded as the baseline since it is commonly used in practice.

  • ProTraS with post-processing improvement: described in Algorithm 2. The idea is ProTraS followed by post-processing to improve accuracy [31].

  • FFT-based with pre-processing improvement in Algorithm 3 [32].

  • Adaptive Sampling: a coreset construction method originally developed for training Gaussian mixture models, described in Algorithm 4 [14, 23].

  • Lightweight Coreset: the method in Algorithm 5. The idea is importance sampling in which each data point is sampled, with probability \(\frac{1}{2}\), uniformly at random or, with probability \(\frac{1}{2}\), proportionally to its squared distance to the mean of the data [7].

Data for Experiment

We use 15 datasets from the clustering data repository of the School of Computing, University of Eastern Finland [34], and from a GitHub clustering benchmark [35]. These datasets are described in Table 1, and some of them are displayed in Fig. 1.

Table 1 Datasets for experiments
Fig. 1: Some datasets for experiments

Experiment Setup

The five algorithms need different input parameters: ProTraS and Algorithm 2 need the value of \(\varepsilon\), while the others need the sample size as input. Therefore, we first run ProTraS with the post-processing of Algorithm 2 for \(\varepsilon = 0.1\) and \(\varepsilon = 0.2\), and then use the resulting sample size as the input parameter for Uniform Sampling, the FFT-based construction with pre-processing improvement in Algorithm 3 [32], Adaptive Sampling in Algorithm 4 [14], and the Lightweight Coreset in Algorithm 5 [7].

The experiment for each data set is described as follows:

  1. Step 1. Use k-means++ [4] to cluster the full data set.

  2. Step 2. Generate a coreset by the improved ProTraS:

     (a) Step 2.1. Apply the ProTraS algorithm to the full data set.

     (b) Step 2.2. Apply Algorithm 2 to the sample from Step 2.1.

     (c) Denote by m the sample size of the coreset obtained from Step 2.2.

  3. Step 3. Generate samples by the FFT-based construction in Algorithm 3.

  4. Step 4. Generate samples by Uniform Sampling with size m.

  5. Step 5. Generate samples by Adaptive Sampling (Algorithm 4) with size m.

  6. Step 6. Generate samples for the Lightweight Coreset (Algorithm 5) with size m.

  7. Step 7. Use k-means++ to solve the k-means clustering problem on each subsample.

  8. Step 8. Measure the elapsed time and compute the relative error of each method and subsample size against the full solution from Step 1 (a minimal sketch of this measurement follows the list).
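As a rough sketch of Steps 7 and 8, we interpret the relative error as the gap between the cost of the subsample-based solution, evaluated on the full data set, and the cost of the full solution from Step 1; this interpretation and the scikit-learn calls below are ours, for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(X, centers):
    """phi_X(Q) with squared distances, as in Eq. (2)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    return float(np.sum(d ** 2))

def relative_error(X, subsample, k, full_cost, weights=None, seed=0):
    """Solve k-means on the (possibly weighted) subsample with k-means++ and
    compare the solution, evaluated on the full data, with the full solution."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
    km.fit(subsample, sample_weight=weights)
    return abs(kmeans_cost(X, km.cluster_centers_) - full_cost) / full_cost

# Step 1 reference, computed once per data set:
#   km_full = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(X)
#   full_cost = kmeans_cost(X, km_full.cluster_centers_)
```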

Since the experiments of uniform sampling and lightweight coresets are randomized, we run them 20 times with different random seeds and compute sample averages.

All experiments were implemented in Python and run on an Intel Core i7 machine with eight 2.8 GHz cores and 16 GB of memory.

Table 2 Experiment results—relative error values with \(\varepsilon = 0.1\)
Table 3 Experiment results—relative error values with \(\varepsilon = 0.2\)
Table 4 Experiment results—runtime (in seconds) of each sample with \(\varepsilon = 0.1\)
Table 5 Experiment results—runtime (in seconds) of each sample with \(\varepsilon = 0.2\)
Fig. 2: Relative error in relation to subsample size

Fig. 3: Relative error in relation to subsample size

Fig. 4: Relative error in relation to subsample size

Results and Discussion

In the experiments, we compare five coreset construction methods:

  • Uniform sampling as the baseline, denoted as “Uniform”

  • ProTraS with the post-processing improvements, denoted as “iProTraS”

  • FFT-based coreset with pre-processing improvement, denoted as “sizeFFT”

  • Adaptive Sampling, denoted as “Adaptive”

  • Lightweight Coreset, denoted as “lwCoreset”

We use relative error as the measure of correctness, together with a runtime comparison. The results are as follows:

  • The relative errors are shown in Table 2 for \(\varepsilon = 0.1\) and in Table 3 for \(\varepsilon = 0.2\). For this type of measurement, smaller means better.

  • The runtime comparison is shown in Table 4 for \(\varepsilon = 0.1\) and in Table 5 for \(\varepsilon = 0.2\). Runtimes are reported in seconds (s); here, smaller means faster.

  • Figures 2, 3 and 4 show the relations between relative error and subsample sizes.

Discussions

  • Figures 2, 3, and 4 all show that the relative errors of all methods decrease as the sample size increases: the more points we keep, the more accurate the coreset.

  • In most cases, Uniform Sampling produces the highest error values, which makes it the worst coreset construction. This is understandable, since Uniform Sampling is the simplest and most naive method. However, it is also the fastest.

  • Adaptive Sampling creates coresets with low error in most cases, especially when the clusters are well separated. When clusters nearly overlap (D11, D12, D13), this method creates worse coresets.

  • The Lightweight Coreset performs well in most cases and is also very fast; in fact, it is slower only than Uniform Sampling and faster than the other methods. However, the lightweight coreset rarely produces a sample with very low error. On well-separated clusters, this method is not as good as Adaptive Sampling, but it is clearly better in some other cases.

  • The improved ProTraS is obviously much slower than the others. ProTraS is built on the farthest-first-traversal algorithm, in which points are selected sequentially, while uniform sampling, adaptive sampling, and the lightweight coreset are all based on sampling, which is extremely fast. However, this method creates coresets with very low errors.

  • The FFT-based algorithm creates the best coresets in most cases, but it is also among the slowest. Its runtime is very long, nearly the same as that of the improved ProTraS.

The three sampling-based methods (uniform, adaptive, and lightweight coreset) sometimes create coresets with very high errors and sometimes with very low errors. However, their average errors seem to be good enough for practical use.

In most cases, the improved ProTraS and the FFT-based algorithm have similarly low relative errors and similarly slow runtimes. Unlike the sampling-class methods, which yield results very fast, the FFT-based algorithms (improved ProTraS and the native FFT with pre-processing) build coresets by examining the full data set point by point and must calculate many distances at runtime. They take a lot of time, but the results are extremely useful, since their coresets have the lowest errors in most cases.

Conclusions

In this paper, we introduce and compare four state-of-the-art coreset constructions: the ProTraS algorithm [29] with post-processing improvement [31], the FFT-based coreset with pre-processing improvement [32], Adaptive Sampling [14], and the Lightweight Coreset [7]. We use relative error to compare these methods, along with uniform sampling as the baseline.

Even though the FFT-based methods and their improvements outperform the other methods in the experiments, speed is a major concern when comparing them with the sampling-based ones. They require a lot of computation to create coresets, which is why they are very slow.

On the other hand, the sampling-based constructions complete all experiments almost instantly; however, the accuracy of the created sample remains a concern. Since these methods are randomized, one needs to check whether the result is good enough.

Finally, each method mentioned in this paper has its own advantages and disadvantages. The trade-off between 'slow but more accurate' and 'fast but less accurate' should be weighed before applying any of these algorithms in practice.