1 Introduction

Model-based clustering is an increasingly popular method for unsupervised learning. In contrast to classical heuristic methods, such as the original k-means proposal and hierarchical clustering, model-based clustering methods rely on a probabilistic assumption about the data distribution. According to the main underlying assumption, data are generated from a mixture distribution, where each cluster is described by one or more mixture components. Maximum likelihood estimation of the parameters is usually carried out via the EM algorithm (Dempster et al. 1977). The EM algorithm is an iterative, strictly hill-climbing procedure whose performance can be very sensitive to the starting point, because the likelihood surface tends to have multiple modes; nevertheless, it usually produces sensible results when started from reasonable values. Thus, good initialisation is crucial for finding MLEs, although no method suggested in the literature uniformly outperforms the others.

In the case of Gaussian model-based clustering (Fraley and Raftery 2002), several approaches are available, both stochastic and deterministic, for selecting an initial partition of the observations, or an initial estimate of the parameters. In the mclust R package (Fraley et al. 2012, 2015), the EM algorithm is initialised using the partitions obtained from model-based agglomerative hierarchical clustering. Efficient numerical algorithms exist for approximately maximising the classification likelihood with multivariate normal models. However, in certain circumstances, poor initial partitions may cause the EM algorithm to converge to a local maximum of the likelihood function.

In this contribution we discuss cases where an initial partition may lead to sub-optimal maximum likelihood estimates when applied to coarse data with ties (e.g. discrete data or continuous data that are rounded in some form when measured), and we present some possible refinements to improve the fitting of such finite mixture models.

The outline of this article is as follows. Section 2 gives a brief review of background material on model-based clustering, with special attention devoted to some of the proposals available in the literature for the initialisation of the EM algorithm. Section 3 discusses the model-based hierarchical agglomerative clustering method used for starting the EM algorithm. This is a very convenient and efficient algorithm, but in certain circumstances it presents a serious drawback. Section 4 contains some simple transformation-based methods to refine the EM initialisation step derived from model-based agglomerative hierarchical clustering. The behaviour of these methods is illustrated through real data examples in Sect. 5. The final section provides some concluding remarks.

2 Background

2.1 Model-based clustering overview

Let \(\varvec{x}_1,\varvec{x}_2,\ldots ,\varvec{x}_n\) be a sample of n independent identically distributed observations. The distribution of every observation is specified by a probability mass or density function through a finite mixture model of G components, which takes the following form

$$\begin{aligned} f(\varvec{x}; \varvec{\Psi }) = \sum _{k=1}^G \pi _k f_k(\varvec{x}; \varvec{\theta }_k), \end{aligned}$$
(1)

where \(\varvec{\Psi }= \{\pi _1, \ldots , \pi _{G-1}, \varvec{\theta }_1, \ldots , \varvec{\theta }_G\}\) are the parameters of the mixture model, \(f_k(\varvec{x}; \varvec{\theta }_k)\) is the kth component density at \(\varvec{x}\) with parameter(s) \(\varvec{\theta }_k\), \((\pi _1,\ldots ,\pi _{G-1})\) are the mixing weights or probabilities (such that \(\pi _k > 0\), \(\sum _{k=1}^G\pi _k = 1\)), and G is the number of mixture components.

Assuming G fixed, mixture model parameters \(\varvec{\Psi }\) are usually unknown and must be estimated. The log-likelihood function corresponding to Eq. (1) is given by \(\ell (\varvec{\Psi }; \varvec{x}_1, \ldots , \varvec{x}_n) = \sum _{i=1}^{n} \log (f(\varvec{x}_i; \varvec{\Psi }))\). Direct maximisation of the log-likelihood function is often complicated, so MLE of finite mixture models is usually carried out via the EM algorithm (McLachlan and Peel 2000).
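As a concrete illustration, the following R sketch (with made-up parameter values and simulated data, purely for exposition) evaluates the mixture density in Eq. (1) and the corresponding log-likelihood for a univariate Gaussian mixture:

```r
# Mixture density and log-likelihood for a univariate Gaussian mixture,
# as in Eq. (1); the parameter values below are purely illustrative.
dmix <- function(x, pro, mean, sd) {
  rowSums(sapply(seq_along(pro),
                 function(k) pro[k] * dnorm(x, mean[k], sd[k])))
}
loglik <- function(x, pro, mean, sd) sum(log(dmix(x, pro, mean, sd)))

set.seed(1)
x <- c(rnorm(70, mean = 0, sd = 1), rnorm(30, mean = 4, sd = 0.5))  # simulated sample
loglik(x, pro = c(0.7, 0.3), mean = c(0, 4), sd = c(1, 0.5))
```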

In the model-based approach to clustering, each component of a finite mixture of density functions belonging to a given parametric class is associated with a group or cluster. Most applications assume that all component densities arise from the same parametric distribution family, although this need not be the case in general. A popular model assumes a Gaussian distribution for each component, i.e. \(f_k(\varvec{x}; \varvec{\theta }_k) \sim {{\mathrm{\mathcal {N}}}}(\varvec{\mu }_k, \varvec{\Sigma }_k)\). Thus, clusters are ellipsoidal, centred at the mean vector \(\varvec{\mu }_k\), and with other geometric features, such as volume, shape and orientation, determined by \(\varvec{\Sigma }_k\). Parsimonious parameterisations of the covariance matrices can be defined by means of the eigen-decomposition \(\varvec{\Sigma }_k = \gamma _k \varvec{O}_k \varvec{A}_k \varvec{O}{}^{\top }_k\), where \(\gamma _k\) is a scalar controlling the volume of the ellipsoid, \(\varvec{A}_k\) is a diagonal matrix specifying the shape of the density contours, and \(\varvec{O}_k\) is an orthogonal matrix which determines the orientation of the corresponding ellipsoid (Banfield and Raftery 1993; Celeux and Govaert 1995). Fraley et al. (2012, Table 1) summarise the parameterisations of the within-group covariance matrices available in the mclust software, together with the corresponding geometric characteristics.
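To make the eigen-decomposition parameterisation concrete, the following R sketch builds a bivariate component covariance matrix from illustrative volume, shape and orientation factors (all numeric values are made up):

```r
# Sigma_k = gamma_k * O_k A_k O_k' : volume, shape and orientation
gamma_k <- 2                                       # volume (scalar)
A_k     <- diag(c(4, 1) / sqrt(4 * 1))             # shape (diagonal, determinant 1)
theta   <- pi / 6                                  # rotation angle
O_k     <- matrix(c(cos(theta), sin(theta),
                   -sin(theta), cos(theta)), 2, 2) # orientation (orthogonal matrix)
Sigma_k <- gamma_k * O_k %*% A_k %*% t(O_k)
eigen(Sigma_k)$values                              # gamma_k times the diagonal of A_k
```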

The number of mixture components and the parameterisation of the component covariance matrices can be selected on the basis of model selection criteria, such as the Bayesian information criterion (BIC; Schwarz 1978; Fraley and Raftery 1998) or the integrated complete-data likelihood criterion (ICL; Biernacki et al. 2000).
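In mclust this selection can be carried out with mclustBIC(), which evaluates the BIC over a grid of covariance parameterisations and numbers of components; a brief sketch on simulated data (illustrative only):

```r
library(mclust)

set.seed(1)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))  # two simulated clusters

BIC <- mclustBIC(X, G = 1:5)   # BIC over all covariance models and G = 1,...,5
summary(BIC)                   # best models according to BIC
fit <- Mclust(X, x = BIC)      # refit the best (model, G) combination
fit$modelName; fit$G
```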

2.2 Initialisation of EM algorithm

The EM algorithm is an easy-to-implement, numerically stable algorithm which has, under fairly general conditions, reliable global convergence. However, it may converge slowly and, like any other Newton-type method, does not guarantee convergence to the global maximum when there are multiple maxima (McLachlan and Krishnan 2008, p. 29). Further, in the case of finite mixture modelling, the estimates obtained depend on the starting values. Thus, initialisation of the EM algorithm is crucial because the likelihood surface tends to have multiple modes; nevertheless, EM usually produces sensible results when started from reasonable values (Wu 1983; Everitt et al. 2011, p. 150).

Several approaches are available, both stochastic and deterministic, for initialising the EM algorithm. Broadly speaking, there are two general approaches. The first one starts from some initial values for the parameters to be estimated. A simple strategy is based on generating several candidates by drawing parameter values uniformly at random over the feasible parameter region. Since the random-starts strategy has a fair chance of not providing good initial starting values, a common suggestion to alleviate this problem is to run the EM algorithm with several random starts and to choose the best solution. However, such a strategy can be quite time consuming and is not always practical, especially for high-dimensional datasets.
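A minimal sketch of the random-starts strategy is given below, using mclust's me() function to run EM from the M-step; for simplicity the starting values here are random soft assignments rather than random parameter values, and the number of starts is arbitrary:

```r
library(mclust)

# Best-of-several random starts: run EM to convergence from each random
# soft assignment and keep the solution with the highest log-likelihood.
random_starts_EM <- function(X, G = 3, modelName = "VVV", nstart = 25) {
  best <- NULL
  for (s in seq_len(nstart)) {
    z0 <- matrix(runif(nrow(X) * G), ncol = G)
    z0 <- z0 / rowSums(z0)                   # rows sum to one
    fit <- me(data = X, modelName = modelName, z = z0)
    if (is.finite(fit$loglik) && (is.null(best) || fit$loglik > best$loglik))
      best <- fit
  }
  best
}
```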

Two other stochastic initialisation schemes are the so-called emEM and rndEM. The former, proposed by Biernacki et al. (2003), uses several short runs of EM started from valid random parameter values, until an overall number of total iterations is exhausted. Then, the solution with the highest log-likelihood is chosen as the initialiser for the long EM run, which proceeds until the usual strict convergence criteria are met. This approach is computationally intensive, and the comments made above about random starts apply to it as well.
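A schematic version of emEM along the same lines, again built from mclust's me() and emControl() (with random soft assignments in place of random parameter draws; the numbers of short runs and of iterations per short run are arbitrary):

```r
library(mclust)

# Schematic emEM: several short EM runs from random soft assignments,
# then one long EM run from the best short-run solution.
emEM <- function(X, G = 3, modelName = "VVV", nshort = 50, short.itmax = 5) {
  short.best <- NULL
  for (s in seq_len(nshort)) {
    z0 <- matrix(runif(nrow(X) * G), ncol = G)
    z0 <- z0 / rowSums(z0)                              # random soft assignments
    fit <- me(data = X, modelName = modelName, z = z0,
              control = emControl(itmax = short.itmax)) # short EM run
    if (is.finite(fit$loglik) &&
        (is.null(short.best) || fit$loglik > short.best$loglik))
      short.best <- fit
  }
  # long EM run, with the default strict convergence criteria
  me(data = X, modelName = modelName, z = short.best$z)
}
```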

Two related approaches were proposed by Maitra (2009). The first, rndEM, replaces the short EM stage by evaluating the log-likelihood at multiple random starting points without running any EM iterations; the best solution obtained then serves as the initialiser for the long EM stage. The second is a staged approach based on finding a large number of local modes of the dataset and then choosing representatives from the most widely separated ones. The latter approach is reported to be very time-consuming for high-dimensional data, and Melnykov and Maitra (2010) found it to be outperformed by emEM and rndEM. Recently, Melnykov and Melnykov (2012) proposed a strategy for initialising the mean vectors by choosing points with higher concentrations of neighbours, and using a truncated normal distribution for the preliminary estimation of the dispersion matrices.

A second kind of approach for initialising the EM algorithm is based on the partition obtained from another clustering algorithm, e.g. k-means or hierarchical agglomerative clustering (HAC). In this case, the resulting classification is used to start the EM algorithm from the M-step. Unfortunately, most of these partitioning algorithms have several drawbacks, such as the need to be properly initialised themselves or the tendency to impose specific shapes and patterns on clusters. In the popular mclust package for R, the EM algorithm is initialised using the partitions obtained from model-based hierarchical agglomerative clustering (MBHAC). In this approach, k clusters are obtained from a large number of smaller clusters by recursively merging the two clusters that have the smallest dissimilarity in a model-based sense, i.e. the dissimilarity used for agglomeration is derived from a probabilistic model. Banfield and Raftery (1993) proposed a dissimilarity based on a Gaussian mixture model, equal to the decrease in likelihood resulting from merging two clusters. Fraley (1998) showed how the structure of some specific Gaussian models can be exploited to yield efficient algorithms for agglomerative hierarchical clustering. More details about this approach are discussed in the following section.
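As an illustration of the first option mentioned above (starting EM from the M-step given a partition from another algorithm), the following sketch converts a k-means partition into an indicator matrix with mclust's unmap() and passes it to me(); the simulated data are illustrative only:

```r
library(mclust)

set.seed(1)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))   # simulated data, two groups

cl  <- kmeans(X, centers = 2, nstart = 50)$cluster   # hard partition from k-means
z0  <- unmap(cl)                                     # n x G indicator matrix
fit <- me(data = X, modelName = "VVV", z = z0)       # EM started from the M-step
fit$loglik
```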

3 Model-based hierarchical agglomerative clustering

The objective of MBHAC is to obtain a hierarchical ordering of clusters of n objects on the basis of some measure of similarity among them. The result is a tree-like structure, which proceeds from n clusters, each containing one object, to one cluster containing all n objects by successively merging objects and clusters.

Given n objects or observations \((\varvec{x}_1, \ldots , \varvec{x}_n)\), let \((z_1, \ldots , z_n){}^{\top }\) denote the classification labels, i.e. \(z_i=k\) if \(\varvec{x}_i\) is assigned to cluster k. The unknown parameters are obtained by maximising the classification likelihood

$$\begin{aligned} \mathcal {L}_C (\varvec{\theta }_1, \ldots , \varvec{\theta }_G, z_1, \ldots , z_n|\varvec{x}_1, \ldots , \varvec{x}_n) = \prod _{i=1}^n f_{z_i}(\varvec{x}_i|\varvec{\theta }_{z_i}) . \end{aligned}$$

Assuming the multivariate Gaussian distribution for \(f_{k}(\varvec{x}_i|\varvec{\theta }_{k})\), the parameters are the mean vector \(\varvec{\mu }_k\) and the covariance matrix \(\varvec{\Sigma }_k\). By imposing different covariance structures on \(\varvec{\Sigma }_k\), different criteria can be derived (see Fraley 1998). For instance, when the covariance is allowed to differ among clusters, the criterion to be minimised at each hierarchical stage is

$$\begin{aligned} \mathcal {W}_G = \sum _{k=1}^G n_k \log \left| \frac{\varvec{W}_k}{n_k}\right| , \end{aligned}$$
(2)

where \(n_k\) is the number of observations in group k and \(\varvec{W}_k\) is the sample cross-product matrix for the kth group (\(k=1,\ldots ,G\)). Efficient numerical algorithms for approximately maximising the classification likelihood (or, equivalently, minimising the corresponding criterion) with multivariate Gaussian models are discussed by Fraley (1998).

As mentioned, the above MBHAC approach is used for starting the EM algorithm in the mclust R package. This is particularly convenient because the underlying probabilistic model can be shared by the initialisation step and the model fitting step. MBHAC is also computationally convenient because a single run provides the basis for initialising the EM algorithm for any number of mixture components and parameterisations of the component covariance matrices.
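A sketch of this workflow using mclust's hc() and hclass() functions (the iris data are used purely for illustration): a single MBHAC run yields a tree from which partitions for any number of groups can be extracted and passed to the M-step of EM.

```r
library(mclust)

X <- as.matrix(iris[, 1:4])                   # illustrative dataset

hcTree <- hc(data = X, modelName = "VVV")     # a single MBHAC run
for (G in 2:4) {
  cl  <- as.vector(hclass(hcTree, G))         # partition into G groups from the same tree
  fit <- me(data = X, modelName = "VVV", z = unmap(cl))
  cat("G =", G, " log-likelihood =", fit$loglik, "\n")
}
```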

However, a serious problem for the MBHAC approach may arise when, at any stage of the agglomeration, two pairs of objects attain the same minimum value of the criterion in (2). In the presence of coarse data, resulting from the discrete nature of the data or from continuous data that are rounded in some way when measured, ties must be broken by choosing which pair of entities to merge. This is often done at random but, regardless of which method is adopted for breaking ties, the choice can have important consequences because it changes the clustering of the remaining observations. As a result, the final EM solution may depend on the ordering of the variables and, to a lesser extent, on the permutation of the observations (the latter case is not studied further in this paper).

This difficulty is known as the ties in proximity problem in the hierarchical clustering literature (see, for example, Jain and Dubes 1988, Sec. 3.2.6). This problem can also arise in other contexts, such as k-means clustering (Gordon 1999, p. 42) or partitioning around medoids (PAM; Kaufman and Rousseeuw 1990, p. 104).

4 Transformation-based approaches for obtaining starting partitions in model-based hierarchical agglomerative clustering

In this section we describe some simple proposals for starting the EM algorithm using the partitions obtained with MBHAC. Ideally, we would like to retain the positive aspects of this approach while, at the same time, reducing the chance that a poor initial partition causes the EM algorithm to converge to a local maximum of the likelihood function.

The idea is to project the data through a suitable transformation before applying the MBHAC at the initialisation step. Once a reasonable hierarchical partition is obtained, the EM algorithm is run using the data on the original scale.

Let \(\varvec{X}\) be the \((n \times p)\) data matrix, and \(\mathbb {X}= (\varvec{X}- \varvec{1}_n\bar{\varvec{x}}{}^{\top })\) be the corresponding centred matrix, where \(\bar{\varvec{x}}=(\bar{x}_1,\ldots ,\bar{x}_p){}^{\top }\) is the vector of sample means, and \(\varvec{1}_n\) is the unit vector of length n. Let \(\widehat{\varvec{\Sigma }} = \{s_{ij}\} = \mathbb {X}{}^{\top }\mathbb {X}/n\) be the \((p \times p)\) sample covariance matrix. Consider the singular value decomposition (SVD) of the centred data matrix,

$$\begin{aligned} \mathbb {X}= \varvec{U}\varvec{D}\varvec{V}{}^{\top }= \sum _{i=1}^r \lambda _i\varvec{u}_i\varvec{v}{}^{\top }_i, \end{aligned}$$

where \(\varvec{u}_i\) are the left singular vectors, \(\varvec{v}_i\) the right singular vectors, \(\lambda _1 \ge \lambda _2 \ge \ldots \ge \lambda _r > 0\) the singular values, and r the rank of the matrix \(\mathbb {X}\). Similarly, the centred and scaled data matrix can be decomposed as

$$\begin{aligned} \mathbb {X}\varvec{S}^{-1/2} = \varvec{U}^* \varvec{D}^* \varvec{V}^*{}^{\top }= \sum _{i=1}^r \lambda ^*_i\varvec{u}^*_i\varvec{v}^*_i{}^{\top }, \end{aligned}$$

where \(\varvec{S}= {{\mathrm{\mathrm {diag}}}}(s^2_1, \ldots , s^2_p)\) is the diagonal matrix of sample variances. In both decompositions, r is the rank of the data matrix, i.e. \(r \le \min (n,p)\), with equality when there are no singularities.
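These quantities can be computed directly in R; the following sketch (with simulated data and variable names chosen to mirror the notation above) is reused in the snippets accompanying Sects. 4.1–4.4:

```r
set.seed(1)
X  <- matrix(rnorm(100 * 4), ncol = 4)          # illustrative (n x p) data matrix
n  <- nrow(X)
Xc <- scale(X, center = TRUE, scale = FALSE)    # centred data matrix
S  <- diag(colMeans(Xc^2))                      # diagonal matrix of sample variances (X'X/n scale)

sv  <- svd(Xc)                                  # Xc          = U  D  V'
svs <- svd(Xc %*% diag(1 / sqrt(diag(S))))      # Xc S^(-1/2) = U* D* V*'
```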

We now provide the details of the transformations investigated and some remarks.

4.1 Data sphering

Sphering (or whitening) the data (SPH) is obtained by applying the following transformation:

$$\begin{aligned} \varvec{Z}_{\text {SPH}}\;\leftarrow \; \mathbb {X}\varvec{V}\varvec{D}^{-1}\sqrt{n} = \varvec{U}\sqrt{n}, \end{aligned}$$

where \(\varvec{V}\) and \(\varvec{D}^{-1}\sqrt{n} = {{\mathrm{\mathrm {diag}}}}(\sqrt{n}/\lambda _i)\) are, respectively, the matrix of eigenvectors and the diagonal matrix of the inverse square roots of the eigenvalues from the spectral decomposition of the sample marginal covariance matrix, \(\widehat{\varvec{\Sigma }}\). For this transformation, \({{\mathrm{\mathrm {E}}}}(\varvec{Z}_{\text {SPH}}) = \varvec{0}\) and \({{\mathrm{\mathrm {Var}}}}(\varvec{Z}_{\text {SPH}}) = \varvec{I}\), so the features are centred at zero, have unit variances and are uncorrelated. Thus, this transformation converts an elliptically shaped symmetric cloud of points into a spherically shaped cloud.
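Continuing the sketch given at the start of this section (reusing X, Xc, S, n and sv defined there), the sphering transformation and a check of its stated properties:

```r
# Continues the sketch above (X, Xc, S, n, sv as defined there)
Z_SPH <- sqrt(n) * sv$u                 # equivalently Xc %*% sv$v %*% diag(1 / sv$d) * sqrt(n)
round(colMeans(Z_SPH), 8)               # ~ 0: centred
round(crossprod(Z_SPH) / n, 8)          # ~ identity: unit variances, uncorrelated
```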

4.2 PCA scores from covariance matrix

The principal component transformation from the covariance matrix (PCS) is obtained as:

$$\begin{aligned} \varvec{Z}_{\text {PCS}}\;\leftarrow \; \mathbb {X}\varvec{V}= \varvec{U}\varvec{D}, \end{aligned}$$

for which \({{\mathrm{\mathrm {E}}}}(\varvec{Z}_{\text {PCS}}) = \varvec{0}\) and \({{\mathrm{\mathrm {Var}}}}(\varvec{Z}_{\text {PCS}}) = \varvec{D}^2/n = {{\mathrm{\mathrm {diag}}}}(\lambda _i^2/n)\), so the features are centred, uncorrelated and with decreasing variances equal to the eigenvalues of \(\widehat{\varvec{\Sigma }}\). Usually the first few components account for most of the dispersion in the data. A similar idea was proposed by McLachlan (1988), who discussed the use of principal component analysis in a preliminary exploratory analysis for the selection of suitable starting values.
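In the same running sketch, the PCS features and their covariance:

```r
# Continues the sketch above
Z_PCS <- sv$u %*% diag(sv$d)            # equivalently Xc %*% sv$v
round(crossprod(Z_PCS) / n, 8)          # diagonal with entries lambda_i^2 / n
eigen(crossprod(Xc) / n)$values         # eigenvalues of the sample covariance: the same values
```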

4.3 PCA scores from correlation matrix

The principal component transformation from the correlation matrix (PCR) is defined as:

$$\begin{aligned} \varvec{Z}_{\text {PCR}}\;\leftarrow \; \mathbb {X}\varvec{S}^{-1/2} \varvec{V}^* = \varvec{U}^* \varvec{D}^*, \end{aligned}$$

for which \({{\mathrm{\mathrm {E}}}}(\varvec{Z}_{\text {PCR}}) = \varvec{0}\) and \({{\mathrm{\mathrm {Var}}}}(\varvec{Z}_{\text {PCR}}) = {\varvec{D}^*}^2/n = {{\mathrm{\mathrm {diag}}}}({\lambda _i^*}^2/n)\), so the features are centred, uncorrelated and with decreasing variances equal to the eigenvalues of the marginal sample correlation matrix, \(\varvec{S}^{-1/2}\widehat{\varvec{\Sigma }}\varvec{S}^{-1/2}\).
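Again continuing the running sketch (svs is the SVD of the centred and scaled matrix):

```r
# Continues the sketch above
Z_PCR <- svs$u %*% diag(svs$d)          # equivalently Xc %*% diag(1 / sqrt(diag(S))) %*% svs$v
round(crossprod(Z_PCR) / n, 8)          # diagonal with entries equal to the eigenvalues
                                        # of the sample correlation matrix
```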

4.4 Scaled SVD projection

The scaled SVD transformation (SVD) is computed as:

$$\begin{aligned} \varvec{Z}_{\text {SVD}}\;\leftarrow \; \mathbb {X}\varvec{S}^{-1/2} \varvec{V}^* {\varvec{D}^*}^{-1/2} = \varvec{U}^* {\varvec{D}^*}^{1/2}, \end{aligned}$$

for which \({{\mathrm{\mathrm {E}}}}(\varvec{Z}_{\text {SVD}}) = \varvec{0}\) and \({{\mathrm{\mathrm {Var}}}}(\varvec{Z}_{\text {SVD}}) = \varvec{D}^*/n = {{\mathrm{\mathrm {diag}}}}(\lambda _i^*/n)\). Again the features are centred and uncorrelated, with decreasing variances now proportional to the square roots of the eigenvalues of the marginal sample correlation matrix, \(\varvec{S}^{-1/2}\widehat{\varvec{\Sigma }}\varvec{S}^{-1/2}\). In this case the features’ dispersion declines more gently than in the PCR case.
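And the scaled SVD features, again continuing the same sketch:

```r
# Continues the sketch above
Z_SVD <- svs$u %*% diag(sqrt(svs$d))    # = U* (D*)^(1/2)
round(crossprod(Z_SVD) / n, 8)          # diagonal with entries lambda_i^* / n:
                                        # a gentler decline than for Z_PCR
```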

4.5 Remarks

All the above transformations allow one to remove the “ordering” effect, so that even in the presence of ties the partitions obtained by applying MBHAC are invariant to permutations of the input variables. Furthermore, as shown in the next section through examples with real data, most of them allow one to achieve better clustering results when used for initialising the EM algorithm in Gaussian model-based clustering.
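In recent versions of mclust, the transformation applied to the data before MBHAC can be selected through the hcUse option (or, equivalently, the use argument of hc()); the sketch below assumes this interface and uses the iris data purely for illustration:

```r
library(mclust)

X <- as.matrix(iris[, 1:4])               # illustrative data

mclust.options(hcUse = "SVD")             # MBHAC on the scaled SVD features
fitSVD <- Mclust(X)                       # EM still runs on the original data

mclust.options(hcUse = "VARS")            # MBHAC on the raw, untransformed variables
fitRAW <- Mclust(X)

c(SVD = fitSVD$bic, RAW = fitRAW$bic)     # compare the resulting BIC values
```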

Fig. 1 Scatterplot matrix for the Crabs data: lower panels show scatterplots for pairs of variables on the original scale; upper panels show the features obtained by applying the scaled SVD transformation. In all panels, points are marked according to the true classes.

To get an idea of this, consider Fig. 1, which shows a scatterplot matrix for the Crabs dataset (to be discussed in Sect. 5.1), with the original variables in the lower panels and the scaled SVD-transformed features in the upper panels, and with points marked according to the true classes. In the plots of the variables on the original scale it is hard to detect any clustering structure, whereas the first three SVD features show a certain degree of separation among the classes. Applying MBHAC in this feature space should provide better starting points for the EM algorithm.

5 Data analyses

In this section we present some examples using real data. We compare the behaviour of the proposed transformations against the usual MBHAC for starting the EM algorithm, and against two common strategies for GMM initialisation. The first strategy starts from the best partition obtained out of 50 runs of the k-means algorithm. The second is the emEM strategy as implemented in the mixmod software (Biernacki et al. 2006; Auder et al. 2014), which, by default, uses 50 short runs of EM, each made of 5 iterations, followed by a long run of EM from the solution maximising the log-likelihood. The tolerance used for assessing the relative log-likelihood convergence of the EM algorithm is set to \(10^{-5}\).

The comparison among the different initialising strategies is based on two measures, the BIC and the adjusted Rand index (ARI). The former is used to select the best Gaussian finite mixture model with respect to both the number of components and the component-covariances decomposition. The BIC for model \(\mathcal {M}\) with k components has the following form

$$\begin{aligned} \mathrm {BIC}_{\mathcal {M},k} = 2\ell (\widehat{\varvec{\Psi }}; \varvec{x}) - \nu _{\mathcal {M},k} \log (n), \end{aligned}$$

where \(\ell (\widehat{\varvec{\Psi }}; \varvec{x})\) is the maximised log-likelihood, \(\nu _{\mathcal {M},k}\) is the number of independent parameters to be estimated in model \(\mathcal {M}\), and k is the number of mixture components. This criterion depends on the starting partition through the log-likelihood at the MLEs \(\widehat{\varvec{\Psi }}\), penalised by the complexity of the model. It is the default criterion used in mclust for selecting a model: the larger the value of the BIC, the stronger the evidence for the corresponding model and number of components.
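For a fitted model the criterion can be recomputed directly from the maximised log-likelihood and the parameter count; a sketch using mclust (assuming the helper nMclustParams() for counting parameters, and the iris data for illustration):

```r
library(mclust)

X   <- as.matrix(iris[, 1:4])
fit <- Mclust(X, G = 3, modelNames = "VVV")

df  <- nMclustParams("VVV", d = ncol(X), G = 3)   # number of estimated parameters
bic <- 2 * fit$loglik - df * log(nrow(X))
c(manual = bic, mclust = fit$bic)                 # the two values should agree
```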

The ARI (Hubert and Arabie 1985) is used for evaluating the clustering obtained with a given mixture model. This is a measure of agreement between two partitions, one estimated by a statistical procedure independent of the labelling of the groups, and one being the true classification. The ARI has zero expected value in the case of a random partition, and it is bounded above by 1, with higher values representing better partition accuracy. Furthermore, it can be applied to compare partitions having different numbers of parts. The ARI is the index recommended by Milligan and Cooper (1986) for measuring the agreement between an external reference partition and a clustering partition. In the following data analysis examples, we take advantage of the knowledge of the true classes for measuring clustering accuracy, but not for model fitting.
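The ARI can be computed with mclust's adjustedRandIndex(); a toy example:

```r
library(mclust)

set.seed(1)
truth  <- rep(1:3, each = 10)                 # reference partition
random <- sample(1:3, 30, replace = TRUE)     # a completely random partition
adjustedRandIndex(truth, truth)               # 1: perfect agreement
adjustedRandIndex(truth, random)              # close to 0 for a random partition
```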

5.1 Crabs data

We now consider a dataset of five morphological measurements on 200 Leptograpsus crabs, with 50 crabs for each combination of two colour forms (blue and orange) and two sexes. Figure 2 shows the marginal distribution of each variable. In each stripchart the data points are stacked to avoid overlapping, so the presence of several ties in the data is evident.

Fig. 2 Stripcharts with stacked data points for the Crabs data, showing the marginal distribution of each variable. The presence of several ties is clearly visible.

Overall, there are \(5!=120\) possible orderings of the variables, of which 105 lead to selection of the (EEE,9) model and 15 to the (EEV,3) model. Table 1 shows the corresponding BIC and ARI obtained for these models and for the models initialised using the transformations discussed in Sect. 4. As can be seen, initialisation with the scaled SVD transformation returns the model with both the largest BIC and the most accurate partition (ARI \(=\) 0.7938). It is the only MBHAC-based initialisation that selects the correct number of mixture components. Initialisation based on k-means also yields the right number of components, but its fit and accuracy are substantially worse. The emEM initialisation strategy yields a better fit, but it selects the wrong number of clusters.
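The Crabs data are available as crabs in the MASS R package; a sketch of the scaled-SVD-initialised analysis, using the hcUse option introduced above (the exact figures reported in Table 1 may depend on the mclust version used):

```r
library(MASS)      # provides the crabs data frame
library(mclust)

X     <- as.matrix(crabs[, c("FL", "RW", "CL", "CW", "BD")])  # 5 morphological measurements
class <- with(crabs, paste(sp, sex))                          # true classes (colour form x sex)

mclust.options(hcUse = "SVD")                  # scaled SVD transformation for MBHAC
fit <- Mclust(X)
fit$G; fit$modelName                           # selected number of components and model
adjustedRandIndex(class, fit$classification)   # clustering accuracy against the true classes
```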

Table 1 Results from different initialisation strategies for the Crabs data

Raftery and Dean (2006) selected the variables (FL, RW, CW, BD), in this order, as the optimal subset for clustering purposes. For this subset too, the ordering of the variables matters: 7 out of the 24 possible arrangements give different results. Table 2 shows the clustering results obtained with different initialisation strategies using the above-mentioned optimal subset of variables. Again, initialisation with the scaled SVD transformation yields the largest BIC and the highest ARI among the considered strategies, with a slight improvement over the default initialisation with the identified ordering of the variables. Note that emEM gives results analogous to the best default MBHAC initialisation, whereas the k-means results are markedly worse.

Table 2 Results from different initialisation strategies for the Crabs data using the optimal subset of variables identified by Raftery and Dean (2006)

Finally, note that, for both the full set of variables and the optimal subset, the analysis initialised with the scaled SVD transformation required the least computing time. This may appear counterintuitive at first sight, because computing the SVD takes time; however, this cost is more than offset by the fact that good starting values allow the EM algorithm to converge in fewer iterations.

5.2 Female voles data

Flury (1997) reported data for a sample of 86 female voles from two species, 41 from Microtus californicus and 45 from Microtus ochrogaster, with seven variables describing various body measurements in units of 0.1 mm; as a result, the data contain many ties.

There are \(7!=5040\) possible orderings of the variables. When the default initialisation is used, there is considerable variation among the resulting solutions: 41 final models with different BIC values (up to 5 significant digits) are estimated, leading to 38 different partitions. From Table 3, about 67 % of the solutions give a single-component model, about 20 % have three components, and only about 7.5 % of the solutions identify the correct number of clusters. In contrast, three of the transformation-based initialisation strategies achieve the best fit in terms of BIC, which in turn yields the largest ARI. The same optimal solution is also achieved by the emEM strategy, whereas the results from the k-means initialisation are inferior. Note that in this case too the scaled SVD transformation is among the best strategies, with no increase in computing time.

Table 3 Results from different initialisation strategies for the female voles data

5.3 Italian wines data

Forina et al. (1986) reported data on several chemical and physical properties of 178 wines grown in the same region in Italy but derived from three different cultivars (Barolo, Grignolino, Barbera). We consider 27 of the original 28 variables, those described in the Forina et al. paper and available in the pgmm R package (McNicholas et al. 2015). In what follows, the data are analysed on the standardised scale and, given the large number of features, only the common full covariance matrix model (EEE) is examined. Table 4 shows the results obtained using different MBHAC initial partitions and the k-means and emEM initialisation strategies. With one exception, the MBHAC initialisations achieve good clustering accuracy. In particular, the scaled SVD transformation has the highest BIC and attains a clustering that perfectly matches the true classes. By comparison, the models initialised by k-means and emEM show somewhat worse values of both BIC and ARI.

Table 4 Results from different initialisation strategies for the Italian wines data

6 Final comments

The mixture model-based approach to clustering provides a firm statistical framework for unsupervised learning. The EM algorithm is typically used for parameter estimation. However, because of the many local maxima of the likelihood function, an important problem for getting sensible maximum likelihood estimates is to obtain reasonable starting points. A computationally efficient approach to EM initialisation is based on the partitions obtained from agglomerative hierarchical clustering.

In this paper we have presented and experimented with simple transformation-based methods to refine the EM initialisation step derived from model-based agglomerative hierarchical clustering. The proposed transformation-based strategies allow one to remove the dependence on the ordering of the variables when selecting starting partitions. They also often lead to improved model fitting and more accurate clustering results. Among the investigated transformations, the scaled SVD transformation performed the best in our experiments. The proposed approach may be applicable to other mixture modelling contexts, but we have not explored this possibility yet. Future studies will be devoted to this aspect.

As mentioned by Biernacki et al. (2003, p. 567) and Melnykov and Maitra (2010), we cannot expect an initialisation strategy to work uniformly well in all cases. Therefore, it is important to explore different strategies and to choose the solution with the highest log-likelihood value, but also to consider sub-optimal solutions in the light of subject-specific considerations.

The transformation-based initialisation strategies discussed in this paper are available in the R package mclust (version \(>=\) 4.4), which can be downloaded from CRAN at http://cran.r-project.org/web/packages/mclust/index.html.