
1 Introduction

Gaussian mixture modelling is an important task, e.g., in the field of cluster analysis. A common approach is the method of maximum likelihood for which the Expectation-Maximization (EM) algorithm [13] can be applied. The EM algorithm iteratively tries to improve a given initial mixture model and converges to a stationary point of the likelihood function. Unfortunately, the likelihood function is generally non-convex, possessing many stationary points [23]. The initial model determines to which of these points the EM algorithm converges [5].

1.1 Maximum Likelihood Estimation for Gaussian Mixtures

A Gaussian mixture model (K-GMM) over \(\mathbb {R}^D\) can be described by a parameter \(\theta =\{(w_k,\mu _k,\varSigma _k)\}_{k=1,\ldots ,K}\), where \(w_k\in \mathbb {R}\) is the mixing weight (\(\sum _{k=1}^K w_k=1\)), \(\mu _k\in \mathbb {R}^D\) is the mean, and \(\varSigma _k\in \mathbb {R}^{D\times D}\) is the covariance matrix of the k-th mixture component. Its probability density function is given by \( \mathcal {N}(x|\theta ) = \sum _{k=1}^K w_k \mathcal {N}(x|\mu _k,\varSigma _k)\), where we denote the D-variate Gaussian distribution by \(\mathcal {N}(\cdot |\mu ,\varSigma )\). Given a data set \(X\subset \mathbb {R}^D\), the Maximum Likelihood Estimation (MLE) problem is to find a K-GMM \(\theta \) that maximizes the likelihood \(\mathcal {L}(\theta |X) = \prod _{x\in X} \mathcal {N}(x|\theta )\). For \(K=1\), there is a closed-form solution [8]. For \(K>1\), the EM algorithm, whose outcome heavily depends on the initial model, can be applied.
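For concreteness, a minimal NumPy sketch of evaluating the mixture density and the log-likelihood \(\log \mathcal {L}(\theta |X)\) is given below; the function names are ours and not part of the paper.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of the D-variate Gaussian N(. | mu, Sigma) at each row of X."""
    D = mu.shape[0]
    diff = X - mu                                          # (N, D)
    inv = np.linalg.inv(Sigma)
    maha = np.einsum('nd,de,ne->n', diff, inv, diff)       # squared Mahalanobis distances
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

def log_likelihood(X, theta):
    """log L(theta | X) for theta = [(w_k, mu_k, Sigma_k)] and data X (N x D)."""
    density = np.zeros(X.shape[0])
    for w, mu, Sigma in theta:
        density += w * gaussian_pdf(X, mu, Sigma)
    return np.sum(np.log(density))
```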

1.2 Related Work

A common way to initialize the EM algorithm is to first draw means uniformly at random from the input set and then to approximate covariances and weights [7, 21, 24, 25]. To compensate for the random choice of initial means, several candidate solutions are created and the one with the largest likelihood is chosen. Often, a few steps of the EM, Classification EM, or Stochastic EM algorithm are applied to the candidates. Similarly, the K-means algorithm may be used [8, p. 427]. Due to the random choice of the initial means, all these methods are better suited for spherical and well-separated clusters. Furthermore, testing several candidates is computationally expensive.

Other popular initializations are based on hierarchical agglomerative clustering (HAC). For instance, in [21, 24, 25], HAC (with different distance measures) is used to obtain mean vectors. Since HAC is generally very slow, it is usually executed only on a random sample [24]. However, the size of any reasonable sample depends on the size of the smallest optimal component. Moreover, this approach is often outperformed by other methods (e.g. [24, 25]). Another approach using HAC is presented in [21]. It aims at finding the best local modes of the data set in a reduced \(m^*\)-dimensional space and applies HAC only to these modes. However, this method is time-consuming and the choice of \(m^*\) is crucial [21, pp. 5, 13]. Moreover, in [22] it is outperformed by simple random methods.

In [25], a density-based approach is presented which determines not only an initial solution but also the number of components. It initializes the means by points that have a “high concentration” of neighbors. To this end, the size m of the neighborhood of a point (i.e., the minimum number of points in a cluster) has to be fixed in advance. In our experiments, we found that the performance crucially depends on the choice of m. Hence, we do not consider this method further in this paper.

In [27], a greedy algorithm is presented which constructs a sequence of mixture models with 1 through K components. Given a model \(\theta _k\) with k components, it constructs several new candidates with \(k+1\) components. Each candidate is constructed by adding a new component to \(\theta _k\) and executing the EM algorithm. Hence, this method is only useful if several values of K need to be considered.

In [20] a modification of the Gonzalez algorithm for GMMs is presented. Furthermore, there are some practical applications using the K-means++ algorithm for the initialization of GMMs (e.g., in [19] GMMs are used for speech recognition). Additional initialization methods can be found e.g. in [6, 15, 21, 26].

1.3 Our Contribution

Clearly, there is no single initialization algorithm that outperforms all others on all instances. The performance of an initialization depends on the given data and the allowed computational cost. Nonetheless, the initializations presented so far (except the simple random initializations) face mainly two problems: Firstly, they are rather complex and time-consuming. Secondly, the choice of hyperparameters is crucial for the outcome.

In this paper, we present new methods that are fast and do not require choosing sensitive hyperparameters. These methods can be seen as adaptations of the K-means++ algorithm [3] and the Gonzalez algorithm [17] and as an extension of the initial work in [19, 20]. We present experiments indicating the superiority of our methods compared to a large number of alternative methods.

2 Baseline Algorithms

The most widely used initializations start by choosing K data points:

  • Unif draws K points independently and uniformly at random from X.

  • HAC computes a uniform sample S of size \(s\cdot |X|\) of the input set X and executes hierarchical clustering with average linkage cost on S.

  • G executes the algorithm given in [17], which yields a 2-approximation for the discrete radius K-clustering problem. Iteratively, it chooses the point whose minimum Euclidean distance to the already chosen points is largest.

  • KM++ executes the K-means++ algorithm [3], which has been designed for the K-means problem. In each round, KM++ samples a data point (i.e. the next mean) from the given data set X with probability proportional to its K-means cost with respect to the points chosen so far (a seeding sketch follows this list). In expectation, the resulting K-means cost is in \(\mathcal {O}(\log (K) \cdot \text {opt})\). KM++ is particularly interesting since the K-means algorithm is a special case of the EM algorithm [8].
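For illustration, a minimal sketch of this seeding step (often called \(D^2\)-sampling) is given below; the function name and NumPy realization are ours, not the reference implementation from [3].

```python
import numpy as np

def kmeanspp_seeds(X, K, rng=None):
    """K-means++ seeding: draw each next mean with probability proportional
    to its squared distance to the nearest mean chosen so far."""
    rng = rng or np.random.default_rng()
    seeds = [X[rng.integers(len(X))]]                  # first mean: uniform at random
    d2 = np.sum((X - seeds[0]) ** 2, axis=1)           # current K-means cost per point
    for _ in range(K - 1):
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        d2 = np.minimum(d2, np.sum((X - seeds[-1]) ** 2, axis=1))
    return np.array(seeds)
```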

Given the K data points, Algorithm 1 is used to create a GMM, which serves as the initial solution fed to the EM algorithm.

Algorithm 1 (Means2GMM): constructs a K-GMM from K given data points by estimating weights, means, and full covariance matrices (pseudocode omitted; Line 3 is the covariance update).
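Since the pseudocode is not reproduced here, the following sketch shows one plausible reading of Means2GMM, assuming that the chosen points serve as means, the data set is partitioned by nearest mean, and weights and full covariances are taken as the per-cluster maximum likelihood estimates; the small regularization term is our addition.

```python
import numpy as np

def means_to_gmm(X, means, reg=1e-6):
    """Sketch of Means2GMM: partition X by nearest chosen point and use the
    per-cluster ML estimates as initial weights, means, and covariances."""
    N, D = X.shape
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    labels = d2.argmin(axis=1)
    theta = []
    for k in range(len(means)):
        Xk = X[labels == k]
        w = len(Xk) / N
        mu = Xk.mean(axis=0) if len(Xk) else means[k]
        Sigma = (np.cov(Xk.T, bias=True) + reg * np.eye(D)
                 if len(Xk) > 1 else np.eye(D))
        theta.append((w, mu, Sigma))
    return theta
```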

A popular alternative is to apply the K-means algorithm, initialized with the chosen data points, before executing Algorithm 1. The main idea behind this is that starting the EM algorithm with a coarse initial solution (where, e.g., not all clusters are covered well) might impose a high risk of getting stuck at a poor local minimum. To avoid this problem, one first runs a different algorithm that optimizes a function similar to the likelihood, i.e. the K-means costs (cf. [8, p. 427, 443]). We refer to the K-means algorithm as an intermediate algorithm and indicate its use by the postfix “\(_{km}\)”.
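As one hypothetical end-to-end realization of this pipeline, scikit-learn's KMeans and GaussianMixture could be combined roughly as follows; this is only a sketch under our own simplifying assumptions (e.g., no empty clusters), not the implementation from [9].

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def init_with_intermediate_kmeans(X, seeds, km_rounds=25, em_rounds=50, reg=1e-6):
    """Seeds -> K-means (intermediate) -> GMM from the partition -> EM."""
    K, D = len(seeds), X.shape[1]
    km = KMeans(n_clusters=K, init=np.asarray(seeds), n_init=1,
                max_iter=km_rounds).fit(X)
    # Build the initial GMM from the K-means partition (in the spirit of Algorithm 1).
    weights = np.bincount(km.labels_, minlength=K) / len(X)
    covs = np.array([np.cov(X[km.labels_ == k].T, bias=True) + reg * np.eye(D)
                     for k in range(K)])               # assumes no empty clusters
    gmm = GaussianMixture(n_components=K, max_iter=em_rounds,
                          weights_init=weights, means_init=km.cluster_centers_,
                          precisions_init=np.linalg.inv(covs))
    return gmm.fit(X)
```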

3 Adaptive Seeding for GMMs

Our new adaptive methods construct a sequence of models with \(k=1\) through \(k=K\) components. Given a \((k-1)\)-GMM \(\theta _{k-1}\), our methods try to choose a point from the data set that is not described well by \(\theta _{k-1}\) and which is hopefully a good representative of a component of an optimal k-GMM. The idea behind this is that such a point can lead to a significant refinement of \(\theta _{k-1}\).

Choosing a Point. The negative log-likelihood of a point \(x\in \mathbb {R}^D\), given the GMM \(\theta _{k-1}\), measures how well x is described by \(\theta _{k-1}\). Unfortunately, it may take negative values and does not scale with the data set and the GMM. This also applies to the minimum component-wise negative log-likelihood

$$\begin{aligned} \textstyle \min \left\{ \log \left( (2\pi )^{D/2}\vert \varSigma _l\vert ^{1/2}\right) + \frac{1}{2}(x-\mu _l)^T\varSigma _l^{-1}(x-\mu _l)\,\mid \, (w_l,\mu _l,\varSigma _l)\in \theta _{k-1} \right\} , \end{aligned}$$

due to the first summand. Hence, we use the minimum (squared) Mahalanobis distance

$$\begin{aligned} m(x| \theta _{k-1}) := \min \left\{ (x-\mu _l)^{T}\varSigma _l^{-1}(x-\mu _l)\,\mid \, (w_l,\mu _l,\varSigma _l)\in \theta _{k-1} \right\} . \end{aligned}$$

Our first method chooses the point \(x\in X\) maximizing \(m(x|\theta _{k-1})\). Since these points are more likely to be outliers, we also consider choosing a point only from a uniform sample of X, which is chosen in advance (cf. Algorithm 2).
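A minimal sketch of this maximization-based choice is given below; for brevity, the uniform sample is drawn inside the function (whereas Algorithm 2 draws it once in advance), and the function names are ours.

```python
import numpy as np

def min_mahalanobis(X, theta):
    """m(x | theta): minimum squared Mahalanobis distance of each row of X
    to the components of theta = [(w_l, mu_l, Sigma_l)]."""
    dists = [np.einsum('nd,de,ne->n', X - mu, np.linalg.inv(Sigma), X - mu)
             for _, mu, Sigma in theta]
    return np.min(dists, axis=0)

def choose_point_max(X, theta, s=1.0, rng=None):
    """Gonzalez-style choice: the point of a uniform s-sample of X that
    maximizes m(x | theta)."""
    rng = rng or np.random.default_rng()
    if s < 1.0:
        idx = rng.choice(len(X), size=max(1, int(s * len(X))), replace=False)
        X = X[idx]
    return X[np.argmax(min_mahalanobis(X, theta))]
```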

Our second method chooses a point \(x\in X\) with probability proportional to \(m(x|\theta _{k-1})\) (cf. Algorithm 3). In order to reduce the probability of choosing an outlier, we also consider mixing in a \((1-\alpha )\) portion of the uniform distribution, i.e. drawing x with probability

$$\begin{aligned} \textstyle m_{{\alpha }}(x|\theta _{k-1}) := \alpha \cdot {m(x|\theta _{k-1})}/{\sum _{y\in X} m(y|\theta _{k-1})} + (1-\alpha )\cdot {1}/{\vert X\vert }. \end{aligned}$$
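The corresponding sampling step can be sketched as follows; the function name is ours, and a plain matrix inverse is used instead of a numerically more careful solve.

```python
import numpy as np

def choose_point_adaptive(X, theta, alpha=1.0, rng=None):
    """Draw x from X with probability
    alpha * m(x|theta) / sum_y m(y|theta) + (1 - alpha) / |X|,
    where m is the minimum squared Mahalanobis distance to the components."""
    rng = rng or np.random.default_rng()
    m = np.min([np.einsum('nd,de,ne->n', X - mu, np.linalg.inv(Sigma), X - mu)
                for _, mu, Sigma in theta], axis=0)
    probs = alpha * m / m.sum() + (1.0 - alpha) / len(X)
    return X[rng.choice(len(X), p=probs)]
```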

Constructing a GMM. Then, given a point \(x\in X\) and the means of \(\theta _{k-1}\), we construct a k-GMM. In our first experiments, we used Algorithm 1 to construct a k-GMM. However, it turned out that estimating only spherical covariance matrices (with variable variances) yields a better performance than estimating full covariance matrices. We assume that this is due to the fact that \(\theta _{k-1}\) is only a very coarse estimate of \((k-1)\) components of an optimal k-GMM. Formally, we replace the covariance update in Line 3 of Algorithm 1 by a spherical covariance estimate. We denote this version of Algorithm 1 as Means2SphGMM. Given the resulting k-GMM, our methods then again choose a new point from X as described above.
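Since the exact update is not reproduced here, the following sketch shows one plausible spherical estimate, namely the maximum likelihood variance of a spherical Gaussian fitted to the points assigned to the component; the regularization constant is our addition.

```python
import numpy as np

def spherical_covariance(Xk, mu, reg=1e-6):
    """Spherical estimate Sigma_k = sigma_k^2 * I_D, where sigma_k^2 is the
    average squared distance of the assigned points Xk to mu, divided by D."""
    D = mu.shape[0]
    sigma2 = np.mean(np.sum((Xk - mu) ** 2, axis=1)) / D if len(Xk) else 1.0
    return (sigma2 + reg) * np.eye(D)
```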

Intermediate Algorithm. Recall that some baselines use the K-means algorithm as an intermediate algorithm (cf. Sect. 2). Since we construct not only means but GMMs, we apply a hard-clustering variant of the EM algorithm, i.e. the Classification EM algorithm (CEM) [10], and let it estimate only spherical covariances. We indicate its use by the postfix “\(_{cem}\)”.
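For illustration, one round of such a spherical CEM step could look as follows; this is our own simplified sketch, not the implementation used in the experiments.

```python
import numpy as np

def cem_step_spherical(X, theta):
    """One Classification-EM round with spherical covariances: classify each
    point to its most likely component, then re-estimate weights, means, and
    spherical variances from the hard assignment."""
    N, D = X.shape
    # E/C-step: hard assignment by log of weighted Gaussian density.
    scores = []
    for w, mu, Sigma in theta:
        diff = X - mu
        maha = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
        logdet = np.linalg.slogdet(Sigma)[1]
        scores.append(np.log(w) - 0.5 * (maha + logdet + D * np.log(2 * np.pi)))
    labels = np.argmax(scores, axis=0)
    # M-step: ML estimates per cluster, covariances restricted to sigma^2 * I.
    new_theta = []
    for k, (w, mu, Sigma) in enumerate(theta):
        Xk = X[labels == k]
        if len(Xk) == 0:
            new_theta.append((w, mu, Sigma))           # keep empty components unchanged
            continue
        mu_k = Xk.mean(axis=0)
        sigma2 = np.mean(np.sum((Xk - mu_k) ** 2, axis=1)) / D
        new_theta.append((len(Xk) / N, mu_k, (sigma2 + 1e-6) * np.eye(D)))
    return new_theta
```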

Algorithms 2 and 3 summarize our methods. Note that we do not optimize the hyperparameters \(\alpha \) and s in our experiments.

Comparison to Baselines. Our adaptive initializations can be seen as adaptations of the Gonzalez and K-means++ algorithms. Roughly speaking, these methods assume that each component is represented by a Gaussian with the same fixed spherical covariance matrix and fixed uniform weights. In contrast, our goal is to also estimate the covariance matrices adaptively. Furthermore, in [20] another adaptation of the Gonzalez algorithm is presented, which we denote by Kwedlo's Gonzalez (KG). Unlike our method, it chooses weights and covariance matrices randomly and independently of the means (and of each other).

Algorithms 2 and 3: the adaptive seeding methods SG (maximization over a uniform sample, parameter s) and Ad (adaptive sampling, parameter \(\alpha \)); pseudocode omitted.

4 Experiments

We evaluated all presented methods with respect to artificial as well as real world data sets. Our implementation as well as the complete results are available at [9]. Due to space limitations, we omit the results of those algorithms that are consistently outperformed by others. These results are available at [9] as well.

Quality Measure. Recall that the goal of our paper (and of the EM algorithm) is to find a maximum likelihood estimate (MLE). Thus, the likelihood is not only the most common but also the most appropriate measure for evaluating our methods.

Other measures need to be treated with caution: Some authors evaluate their methods only with respect to specific tasks in which fitting a GMM to the data is just one part of a larger framework. Hence, any observed effects might have several causes (i.e. correlations). In particular, GMMs are often compared with respect to certain classifications. As already pointed out by [14], the class labels of real world data sets do not necessarily correspond to the structure of an MLE. The same holds for data sets and classifications generated according to some GMM. Moreover, cross-validation, which examines whether methods over-fit models to training data, is not appropriate, since our methods explicitly aim at fitting the given data set as well as possible. Finally, one should not generate data sets according to some “ground truth” GMM \(\theta _{gt}\) and compare GMMs with \(\theta _{gt}\), because in many cases (e.g. small |X|) one cannot expect \(\theta _{gt}\) to be a good surrogate of the MLE.

Setup. Recall that in Algorithms 2 and 3 hyperparameters \(\alpha \) and s are used. We do not optimize them, but test reasonable values, i.e. \(\alpha \in \{0.5,1\}\) and \(s\in \{0.1,1\}\). We execute each method with 30 different seeds. On the basis of some initial experiments, we decided to execute the intermediate algorithms for 25 rounds and the EM algorithm for 50 rounds. If only the EM algorithm is applied, then we execute it for 75 rounds.

4.1 Artificial Data Sets

Data Generation. We create 192 test sets, each containing 30 data sets that share certain characteristics [9]. For each data set, we first create a GMM at random but control the following properties: First, the components of a GMM can be either spherical or elliptical. We describe the eccentricity of \(\varSigma _k\) by \(e_k = \frac{\max _d \lambda _{kd}}{\min _d \lambda _{kd}}\), where \(\lambda _{kd}^2\) denotes the d-th eigenvalue of \(\varSigma _k\). Second, components can have different sizes, in terms of the smallest eigenvalue of the corresponding covariance matrices. Third, components can have different weights.

Fig. 1 (omitted). Examples for different separation parameters; the panels show orthogonal projections onto a random plane. Data sets in (a)–(c) have \(D=3\); those in (d)–(f) have \(D=10\).

Fourth, components can overlap more or less. Following [12], we define the separation parameter \( c_\theta = \min _{l\ne k} {\Vert \mu _l - \mu _k \Vert }/{\sqrt{\max \left\{ {\mathrm{{trace}}}(\varSigma _l),\mathrm{{trace}}(\varSigma _k)\right\} }}\). In high dimension \(D\gg 1\), \(c_\theta =2\) corresponds to almost completely separated clusters (i.e. sets of points generated by the same component), while \(c_\theta \in \{0.5,1\}\) indicates a slight but still negligible overlap [11]. In low dimension, however, \(c_\theta \in \{0.5,1\}\) corresponds to significant overlap between clusters, while \(c_\theta =2\) implies rather well-separated clusters (cf. Fig. 1).
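For reference, \(c_\theta \) can be computed directly from this definition; the following helper is our own.

```python
import numpy as np

def separation(theta):
    """c_theta = min over pairs (l, k) of
    ||mu_l - mu_k|| / sqrt(max(trace(Sigma_l), trace(Sigma_k)))."""
    c = np.inf
    for l in range(len(theta)):
        for k in range(l + 1, len(theta)):
            _, mu_l, S_l = theta[l]
            _, mu_k, S_k = theta[k]
            denom = np.sqrt(max(np.trace(S_l), np.trace(S_k)))
            c = min(c, np.linalg.norm(mu_l - mu_k) / denom)
    return c
```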

We generate random GMMs as follows. Initially, we draw the means uniformly at random from a cube with a fixed side length. For the weights, we fix some \(c_w\ge 0\), construct a set of weights \(\{ 2^{c_w\cdot i}/\sum _{j=1}^K 2^{c_w\cdot j} \}_{i=1,\ldots ,K}\), and assign these weights randomly. To control the sizes and the eccentricity, we fix the minimum and maximum eigenvalue and draw the remaining values uniformly at random from this interval. Then, we set \(\varSigma _k = Q^T\text{ diag }(\lambda _{k1}^2,\ldots ,\lambda _{kD}^2)Q\) for a random \(Q\in \text {SO}(D)\). Finally, the means are scaled so as to match the predefined separation parameter. Given the resulting GMM \(\theta \), we first draw some points according to \(\theta \). Then, we construct a bounding box of these points, enlarge its side lengths by a factor of 1.2, and draw noise points uniformly at random from the resized box.
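A compact sketch of this generator is given below; the rescaling of the means to the target separation and the noise points are omitted, and details such as how the rotation is drawn are our assumptions.

```python
import numpy as np

def random_gmm(K, D, c_w=1.0, lam_min=1.0, lam_max=10.0, side=100.0, rng=None):
    """Sketch of the generator: random means in a cube, geometric weights,
    covariances Q^T diag(lambda_1^2, ..., lambda_D^2) Q with a random rotation Q.
    (Rescaling the means to a target separation c_theta is omitted here.)"""
    rng = rng or np.random.default_rng()
    means = rng.uniform(0.0, side, size=(K, D))
    weights = 2.0 ** (c_w * np.arange(1, K + 1))
    weights = rng.permutation(weights / weights.sum())     # assign weights randomly
    theta = []
    for k in range(K):
        lams = np.sort(rng.uniform(lam_min, lam_max, size=D))
        lams[0], lams[-1] = lam_min, lam_max               # fix min and max eigenvalue
        Q, _ = np.linalg.qr(rng.normal(size=(D, D)))       # random orthogonal matrix
        Sigma = Q.T @ np.diag(lams ** 2) @ Q
        theta.append((weights[k], means[k], Sigma))
    return theta
```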

We created a test set (i.e. 30 data sets) for each combination of the following parameters: \(K=20\), \(|X|\in \{1000,\ 5000\}\), \(D\in \{3,10\}\), \(c_\theta \in \{0.5,\ 1,\ 2\}\), \(c_w\in \{0.1, 1\}\), different combinations of size and eccentricity (i.e., equal size and \(e_k=10\), equal size and \(e_k\in [1,10]\), different size and \(e_k=1\), different size and \(e_k \in [1,10]\)), and with or without 10 % noise points.

Evaluation Method. We consider the initial solutions produced by the initialization (possibly followed by an intermediate algorithm) and the final solutions obtained by running the EM algorithm afterwards. For each data set, we compute the average log-likelihood of the initial and the final solution, respectively. Based on these averages, we create rankings of the algorithms. Then, we compute the average rank (and the standard deviation of the rank) of each algorithm over all data sets matching certain properties.
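For concreteness, the ranking step might be computed as follows; this is our own sketch, and the tie-handling convention is an assumption.

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(avg_loglik):
    """avg_loglik: (n_datasets, n_algorithms) array of per-data-set average
    log-likelihoods. Returns mean and std of each algorithm's rank, where
    rank 1 corresponds to the largest average log-likelihood."""
    ranks = np.vstack([rankdata(-row) for row in avg_loglik])   # ties get average ranks
    return ranks.mean(axis=0), ranks.std(axis=0)
```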

Results. In general, we observe that one should use an intermediate algorithm before applying the EM algorithm. Hence, we omit the results of the methods run without an intermediate algorithm; they are available at [9].

Data without Noise. For these rather simple data sets, there is no method that consistently outperforms all others. Nonetheless, it is always one of our adaptive initializations or the G \(_{km}\) initialization that performs best.

Table 1. Average ranks (\(\pm \) std.dev.) for generated data with \(K=20\), \(|X|=1000\), \(D=10\), different weights, and without noise.

The results depicted in Tables 1 and 2 suggest that, regardless of the weights, the performance is determined by the separation. Furthermore, a good initial solution does not imply a good final solution. Given overlap (\(c_{\theta }=0.5\)) or moderate separation (\(c_{\theta }=1\)), SG \(_{cem}(s=1)\) and Ad \(_{cem}(\alpha =1)\) work best, even though their initial solutions are ranked worse than those of KM++ \(_{km}\). Given higher separation (\(c_{\theta }=2\)), we expect it to be easier to identify clusters, and skewed covariance matrices should not matter much if the means are assigned properly in the first place. Indeed, the simple G \(_{km}\) and KG do the trick.

Table 2. Average ranks (\(\pm \) std.dev.) for generated data with \(K=20\), \(|X|=1000\), dimension \(D=10\), equal weights, and without noise.

Table 3 shows that Ad \(_{cem}(\alpha =1)\) works well for elliptical data, while G \(_{km}\) should be chosen for spherical data. Recall that there are no noise points yet. We expect that the performance of G \(_{km}\) degenerates in the presence of noise since it is prone to choose outliers. Overall, given data sets without noise, Ad \(_{cem}(\alpha =1)\) performs best.

Table 3. Average ranks (\(\pm \) std.dev.) for generated data with \(K=20\), \(|X|=1000\), dimension \(D=10\), and without noise. Only final solutions.

Noisy Data. When introducing noise, our adaptive methods are still among the best methods, while the performance of some others degenerates significantly. Tables 4 and 5 show that SG \(_{cem}\) and Ad \(_{cem}\) still work well for \(c_\theta \le 1\) and, in contrast to the case without noise, also for separated instances (\(c_\theta =2\)). KG and G \(_{km}\) are now among the methods with the worst average ranks. This is not a surprise since our noise contains outliers. From the results depicted in Table 6 one can draw the same conclusion, i.e. KG and G \(_{km}\) cannot handle noisy data. For noisy data, our Ad \(_{cem}\) methods outperform the others.

Table 4. Average ranks (\(\pm \) std.dev.) for generated data with \(K=20\), \(|X|=1000\), dimension \(D=10\), different weights, and 10 % noise.
Table 5. Average ranks (\(\pm \) std.dev.) for generated data sets with \(K=20\), \(|X|=1000\), dimension \(D=10\), equal weights, and 10 % noise.
Table 6. Average ranks (\(\pm \) std.dev.) for generated data sets with \(K=20\), \(|X|=1000\), dimension \(D=10\), and 10 % noise. Only final solutions.

Low Dimensional or High Sample Size Data. We expect that, if the dimension is low or the sample size is large enough, it is generally easier to identify clusters. Indeed, the results differ significantly from our previous results. In general, KM++ \(_{km}\) and Unif \(_{km}\) perform best. For data sets with \(D=3\) and \(|X|=1000\), Table 7 shows that the KM++ \(_{km}\) method works well even in the presence of noise. However, given noise and small separation, the simple Unif \(_{km}\) does well. We also increased the sample size to \(|X|=5000\) and the dimension to \(D=10\), expecting that the higher sample size can make up for the higher dimension (results available at [9]). Indeed, for data sets without noise, where clusters can presumably be identified more easily, KM++ \(_{km}\) still suffices. However, given noise or too small a separation, our Ad \(_{cem}\) methods and the simple Unif \(_{km}\) work better.

Table 7. Average ranks (\(\pm \) std.dev.) for generated data (\(K=20\), \(|X|=1000\), \(D=3\)).

4.2 Real World Data Sets

We use four publicly available data sets: Covertype (\(|X|=581\,012\), \(D=10\) real-valued features) [4]; two Aloi data sets (\(|X|=110\,250\), \(D\in \{27,64\}\)) based on color histograms in HSV color space [18] from data provided by the ELKI project [2] and the Amsterdam Library of Object Images [16]; Cities (\(|X|=135\,082\), \(D=2\)) is a projection of the coordinates of cities with a population of at least 1000 [1]; Spambase (\(|X|=4601\), \(D=10\) real-valued features) [4].

Fig. 2 (omitted). Results for the real world data sets depicted as box plots (final solutions only).

The results are depicted in Fig. 2: For Aloi (\(D=27\)) and Spambase (\(K=3\)), SG \(_{cem}(s=1)\) is considerably better than the other methods. For Cities and Spambase (\(K=10\)), SG \((s=1)\) does better (without running the CEM). For Aloi (\(D=64\)) and Covertype, Ad \(_{cem}(\alpha =1)\) works better than the others.

4.3 Time Measurement

The run times of the compared methods match our expectation (cf. Table 8). First, (intermediate) steps of the CEM algorithm are faster than (more) steps of the EM algorithm. Second, sampling and running methods on a random subset of the data should in general reduce the run time.

Table 8. Average run times (in seconds) over 12 data sets with \(|X| = 10^3\) and \(D=10\) (several runs per data set), measured on an Intel Core i7-3770 CPU (3.40 GHz, 8 GB RAM).

5 Conclusion and Future Work

If you need a fast and simple method, then we suggest using one of the following: Given a data set with a large number of points per cluster or low dimension, the K-means++ initialization followed by the K-means algorithm and Means2GMM should do well. Otherwise, we recommend our new methods Ad and SG followed by the spherical CEM algorithm, especially if your data is likely to be noisy. Last but not least, whichever initialization you prefer, we suggest trying intermediate steps of the spherical CEM or K-means algorithm.

For the K-means++ algorithm and the Gonzalez algorithm there are provable guarantees. We hope that our results are a good starting point for a theoretical analysis that transfers these guarantees to the MLE problem for GMMs.