1 Introduction

Most early clustering algorithms were based on heuristic approaches, and some such methods, including hierarchical agglomerative clustering and k-means clustering (MacQueen 1967; Hartigan and Wong 1979), are still widely used. The use of mixture models to account for population heterogeneity has been very well established for over a century (e.g., Pearson 1894), but it was the 1960s before mixture models were used for clustering (Wolfe 1965; Hasselblad 1966; Day 1969). Because of the lack of suitable computing equipment, it was much later before the use of mixture models (e.g., Banfield and Raftery 1993; Celeux and Govaert 1995) and, more generally, probability models (e.g., Bock 1996, 1998a, 1998b) for clustering became commonplace. Since the turn of the century, the use of mixture models for clustering has burgeoned into a popular subfield of cluster analysis and recent examples include Franczak et al. (2014), Vrbik and McNicholas (2014), Murray et al. (2014a, b), Lee and McLachlan (2014), Lin et al. (2014), Subedi et al. (2015), Morris and McNicholas (2016), O’Hagan et al. (2016), Dang et al. (2015), Lin et al. (2016), Lee and McLachlan (2016), Dang et al. (2017), Cheam et al. (2017), Melnykov and Zhu (2018), Zhu and Melnykov (2018), Gallaugher and McNicholas (2019b), Tortora et al. (2019), Biernacki and Lourme (2019), Murray et al. (2019), Morris et al. (2019), and Punzo et al. (2020). The reader may consult Bouveyron and Brunet-Saumard (2014) and McNicholas (2016b) for relatively recent reviews of model-based clustering work.

A d-dimensional random vector Y is said to arise from a parametric finite mixture distribution if, for all y in the support of Y, we can write its density as

$$ f(\mathbf{y}\mid\boldsymbol{\vartheta})= \sum\limits_{g=1}^{G} \rho_{g} p_{g}(\mathbf{y}\mid\boldsymbol{\theta}_{g}), $$

where ρg > 0, with \({\sum }_{g=1}^{G} \rho _{g} =1\), are the mixing proportions, pg(y ∣ 𝜃g) are the component densities, and 𝜗 = (ρ1,…,ρG,𝜃1,…,𝜃G) is the vector of parameters. When the component parameters 𝜃1,…,𝜃G are decomposed and constraints are imposed on the resulting decompositions, the result is a family of mixture models. Typically, each component probability density is of the same type and, when they are Gaussian, the mixture density function is

$$ f(\mathbf{y}\mid\boldsymbol{\vartheta})= \sum\limits_{g=1}^{G} \rho_{g}\phi_{d}(\mathbf{y}\mid\boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g}), $$

where ϕd(y ∣ μg,Σg) is the d-dimensional Gaussian density with mean μg and covariance matrix Σg, and the likelihood is

$$ \mathcal{L}(\boldsymbol{\vartheta} \mid \mathbf{y}_{1}, \ldots, \mathbf{y}_{n}) = \prod\limits_{i=1}^{n} \sum\limits_{g=1}^{G}\rho_{g}\phi_{d}(\mathbf{y}_{i}\mid \boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g}), $$

where 𝜗 denotes the model parameters. In Gaussian families, it is usually the component covariance matrices Σ1,…,ΣG that are decomposed (see Section 2).

The expectation-maximization (EM) algorithm (Dempster et al. 1977) is often used for mixture model parameter estimation but its efficacy is questionable. As discussed by Titterington et al. (1985) and others, the nature of the mixture likelihood surface leaves the EM algorithm open to failure. Although this weakness can be mitigated by using multiple restarts, there is no way to overcome it completely. Besides its heavy reliance on starting values, convergence of the EM algorithm can be very slow. When families of mixture models are used, the EM algorithm must be paired with a model selection criterion to select the member of the family and, in many cases, the number of components. There are many model selection criteria to choose from, such as the Bayesian information criterion (BIC; Schwarz 1978), the integrated completed likelihood (ICL; Biernacki et al. 2000), and the Akaike information criterion (AIC; Akaike 1974). All of these model selection criteria have some merit and various shortcomings, but the BIC remains by far the most popular (McNicholas 2016a, Ch. 2). There has been interest in the use of Bayesian approaches to mixture model parameter estimation via Markov chain Monte Carlo (MCMC) methods (e.g., Diebolt and Robert 1994; Richardson and Green 1997; Bensmail et al. 1997; Stephens 1997, 2000; Casella et al. 2002); however, difficulties have been encountered with, inter alia, computational overhead and convergence (see Celeux et al. 2000; Jasra et al. 2005). Variational Bayes approximations present an alternative to MCMC algorithms for mixture model parameter estimation and are gaining popularity due to their fast and deterministic nature (see Jordan et al. 1999; Corduneanu and Bishop 2001; Ueda and Ghahramani 2002; McGrory and Titterington 2007, 2009; McGrory et al. 2009; Subedi and McNicholas 2014).

With the use of a computationally convenient approximating density in place of a more complex “true” posterior density, the variational algorithm overcomes the hurdles of MCMC sampling. For observed data y, the joint conditional distribution of parameters 𝜃 and missing data z is approximated by using another computationally convenient distribution q(𝜃,z). This distribution q(𝜃,z) is obtained by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities, where

$$ \text{KL}\big(q(\boldsymbol{\theta},\mathbf{z}) \mid p(\boldsymbol{\theta},\mathbf{z}\mid \mathbf{y})\big) = \int_{\boldsymbol{\Theta}} \sum\limits_{\mathbf{z}} q(\boldsymbol{\theta},\mathbf{z}) \log\left\{\frac{q(\boldsymbol{\theta},\mathbf{z})}{p(\boldsymbol{\theta},\mathbf{z}\mid \mathbf{y})}\right\} d\boldsymbol{\theta}. $$

The approximating density is restricted to have a factorized form for computational convenience, so that q(𝜃,z) = q𝜃(𝜃)qz(z). Upon choosing a conjugate prior, the appropriate hyper-parameters of the approximating density q𝜃(𝜃) can be obtained by solving a set of coupled non-linear equations.
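Note that this minimization is equivalent to maximizing a lower bound on the marginal log-likelihood: by the standard decomposition,

$$ \log p(\mathbf{y}) = \mathcal{F}(q) + \text{KL}\big(q(\boldsymbol{\theta},\mathbf{z}) \mid p(\boldsymbol{\theta},\mathbf{z}\mid \mathbf{y})\big), \qquad \mathcal{F}(q) = \int_{\boldsymbol{\Theta}} \sum\limits_{\mathbf{z}} q(\boldsymbol{\theta},\mathbf{z}) \log\left\{\frac{p(\mathbf{y},\boldsymbol{\theta},\mathbf{z})}{q(\boldsymbol{\theta},\mathbf{z})}\right\} d\boldsymbol{\theta}, $$

and, because log p(y) does not depend on q, minimizing the KL divergence amounts to maximizing the lower bound \(\mathcal{F}(q)\).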

The variational Bayes algorithm is initialized with more components than are expected to be needed. As the algorithm iterates, if two components have similar parameters, then one component dominates the other, driving the dominated component’s weight towards zero. If a component’s weight becomes sufficiently small, i.e., corresponding to two or fewer observations in our analyses, the component is removed from consideration. Therefore, the variational Bayes approach allows for simultaneous parameter estimation and selection of the number of components.
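For illustration, this pruning rule amounts to dropping any component whose expected number of observations, \({\sum }_{i=1}^{n}\hat {z}_{ig}\), is at most two, and then renormalizing the remaining responsibilities; a minimal sketch with a toy responsibility matrix (the values are purely illustrative) is:

    set.seed(1)
    zhat <- matrix(runif(300 * 6), ncol = 6)      # toy responsibilities for n = 300, G = 6
    zhat[, 6] <- zhat[, 6] * 1e-4                 # make one component negligible
    zhat <- zhat / rowSums(zhat)                  # rows sum to one
    n_g <- colSums(zhat)                          # expected number of observations per component
    keep <- n_g > 2                               # drop components supported by two or fewer observations
    zhat <- zhat[, keep, drop = FALSE]
    zhat <- zhat / rowSums(zhat)                  # renormalize over the retained components
    G <- ncol(zhat)                               # updated number of components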

2 Methodology

2.1 Introducing Parsimony

If d-dimensional data y1,…,yn arise from a finite mixture of Gaussian distributions, then the log-likelihood is

$$ \log p(\mathbf{y}_{1},\ldots,\mathbf{y}_{n} \mid \boldsymbol{\theta}) = \sum\limits_{i=1}^{n} \log\left[ \sum\limits_{g=1}^{G} \rho_{g}\frac{|\boldsymbol{\Sigma}^{-1}_{g}|^{1/2}}{(2\pi)^{{d}/{2}}}\exp\left\{-\frac{1}{2}(\mathbf{y}_{i}-\boldsymbol{\mu}_{g})^{\prime}\boldsymbol{\Sigma}^{-1}_{g}(\mathbf{y}_{i}-\boldsymbol{\mu}_{g})\right\}\right]. $$

The number of free parameters in the component covariance matrices is Gd(d + 1)/2, which is quadratic in d. When dealing with real data, the number of free parameters to be estimated can very easily exceed the sample size n by an order of magnitude. Hence, the introduction of parsimony through the imposition of additional structure on the covariance matrices is desirable. Banfield and Raftery (1993) exploited geometrical constraints on the covariance matrices of a mixture of Gaussian distributions using the eigen-decomposition of the covariance matrices, such that \(\boldsymbol {\Sigma }_{g} = \lambda _{g} \mathbf {D}_{g}\mathbf {A}_{g}\mathbf {D}_{g}^{\prime }\), where Dg is the orthogonal matrix of eigenvectors of Σg, Ag is a diagonal matrix proportional to the eigenvalues of Σg with |Ag| = 1, and λg is the associated constant of proportionality. This decomposition has an advantage in terms of interpretation: λg controls the cluster volume, Ag controls the cluster shape, and Dg controls the cluster orientation. Imposing constraints on these elements, each with a geometrical interpretation, gives rise to a family of 14 models known as the Gaussian parsimonious clustering models (GPCM; Celeux and Govaert 1995); see Table 1.

Table 1 Nomenclature, interpretation, and covariance structure for each member of the GPCM family
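To make the roles of λg, Ag, and Dg concrete, the following sketch recovers the three elements of the decomposition in R for an arbitrary, illustrative covariance matrix:

    Sigma <- matrix(c(2.0, 0.6,
                      0.6, 1.0), nrow = 2, byrow = TRUE)   # an arbitrary covariance matrix
    ed <- eigen(Sigma)
    D <- ed$vectors                                        # orientation: eigenvectors of Sigma
    lambda <- prod(ed$values)^(1 / nrow(Sigma))            # volume: |Sigma|^(1/d)
    A <- diag(ed$values / lambda)                          # shape: diagonal matrix with |A| = 1
    all.equal(Sigma, lambda * D %*% A %*% t(D))            # reconstructs Sigma
    det(A)                                                 # equals 1 up to rounding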

The mclust package (Scrucca et al. 2016) for R (R Core Team 2018) implements 12 of these 14 GPCM models in an EM framework, with the MM framework of Browne and McNicholas (2014) used for the other two models (EVE and VVE). Bensmail et al. (1997) used Gibbs sampling to carry out Bayesian inference for eight of the GPCM models. Bayesian regularization of some of the GPCM models has been considered by Fraley and Raftery (2007). After assigning a highly dispersed conjugate prior, they replace the maximum likelihood estimator of the group membership obtained using the EM algorithm by a maximum a posteriori probability (MAP) estimator. Note that \(\text {MAP} (\hat {z}_{ig}) = 1\) if \(g=\arg \max \limits _{h}({\hat {z}_{ih}})\) and \(\text {MAP}({\hat {z}_{ig}}) = 0\) otherwise, where \(\hat {z}_{ig}\) denotes the a posteriori expected value of Zig and

$$ z_{ig}=\left\{\begin{array}{ll} 1 & \text{if } \mathbf{y}_{i} \text{ belongs to component } g, \text{ and}\\ 0 & \text{otherwise}. \end{array}\right. $$

A modified BIC based on the maximum a posteriori probability is then used for model selection. Herein, we implement 12 of the 14 GPCM models using variational Bayes approximations; conjugate priors are not available for the EVE and VVE models.
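In R, the MAP rule above reduces to a row-wise argmax over the matrix of \(\hat {z}_{ig}\) values; a minimal sketch with toy probabilities (purely for illustration) is:

    zhat <- matrix(c(0.90, 0.10,
                     0.30, 0.70,
                     0.45, 0.55), ncol = 2, byrow = TRUE)  # toy posterior probabilities
    map_labels <- apply(zhat, 1, which.max)                # MAP classification: argmax over h
    map_labels                                             # 1 2 2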

2.2 Priors and Approximating Densities

As suggested by McGrory and Titterington (2007), the Dirichlet distribution is used as the conjugate prior for the mixing proportions, such that

$$ p(\boldsymbol{\rho})=\text{Dir}(\boldsymbol{\rho}; \alpha_{1}^{(0)},\ldots,\alpha_{G}^{(0)}), $$

where ρ = (ρ1,…,ρG) are the mixing proportions and \(\alpha _{1}^{(0)},\ldots ,\alpha _{G}^{(0)}\) are the hyperparameters. Conditional on the precision matrix Tg, independent Gaussian distributions are used as the conjugate priors for the component means, such that

$$ p(\boldsymbol{\mu}_{1},\ldots,\boldsymbol{\mu}_{G} \mid\mathbf{T}_{1},\ldots,\mathbf{T}_{G})=\prod\limits_{g=1}^{G}\phi_{d}(\boldsymbol{\mu}_{g};\mathbf{m}^{(0)}_{g},(\beta_{g}^{(0)}\mathbf{T}_{g})^{-1}), $$

where \(\{\mathbf {m}_{g}^{(0)}, \beta _{g}^{(0)}\}_{g=1}^{G}\) are the hyper-parameters.

Fraley and Raftery (2007) assigned priors to the parameters of the covariance matrix and its components in a Bayesian regularization application. Here, however, we assign priors to the precision matrix, with the hyperparameters given in Table 2. Note that it was not possible to place a suitable (i.e., determinant one) prior on the matrix Ag for the models EVI and VVI, or on A for the models VEV and VEI; accordingly, we instead place a prior on \(c_{g}\mathbf {A}_{g}^{-1}\) or \(c\mathbf{A}^{-1}\), respectively, where cg or c is a positive constant. Using the expected value of \(c_{g}\mathbf {A}_{g}^{-1}\) (or \(c\mathbf{A}^{-1}\)), the expected value of \(\mathbf {A}_{g}^{-1}\) (or \(\mathbf{A}^{-1}\)) is determined so that the constraint that the determinant is 1 is satisfied. Because Dg is an orthogonal matrix of eigenvectors, the Bingham matrix distribution is used as the conjugate prior for Dg. The Bingham distribution, introduced by Bingham (1974), is a probability distribution on unit vectors \(\{\mathbf {u}: \mathbf {u}^{\prime }\mathbf {u}=1\}\); it has antipodal symmetry, making it well suited to modelling random axes.

Table 2 The precision parameter upon which a prior is placed, as well as the corresponding prior distribution and hyperparameters, for 12 of the 14 members of the GPCM family

The Bingham matrix distribution (Gupta and Nagar 2000) is the matrix analogue, on the Stiefel manifold, of the Bingham distribution and has been used in multivariate analysis and matrix decomposition methods (Hoff 2009). The density of the Bingham matrix distribution, as defined by Gupta and Nagar (2000), is

$$ p(\mathbf{D}) = b(\mathbf{A},\mathbf{B})\exp(\text{tr}\{\mathbf{B}\mathbf{D}\mathbf{A}\mathbf{D}^{\prime}\}) [d\mathbf{D}], $$

for D ∈ O(n,d), where O(n,d) is the Stiefel manifold of n × d matrices with orthonormal columns, [dD] is the unit invariant measure on O(n,d), and A and B are symmetric and diagonal matrices, respectively. Samples from the Bingham matrix distribution can be obtained using the Gibbs sampling algorithm implemented in the R package rstiefel (Hoff 2012).
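For illustration, a draw can be obtained along the following lines; this is a minimal sketch assuming rstiefel's rbing.matrix.gibbs and rustiefel functions, with arbitrary parameter values:

    library(rstiefel)
    set.seed(1)
    d <- 3
    A <- crossprod(matrix(rnorm(d * d), d, d))          # symmetric parameter matrix
    B <- diag(sort(rexp(d), decreasing = TRUE))         # diagonal parameter matrix
    X <- rustiefel(d, d)                                # initial point on the Stiefel manifold
    for (s in 1:100) X <- rbing.matrix.gibbs(A, B, X)   # Gibbs updates (see ?rbing.matrix.gibbs for the parameterization)
    round(crossprod(X), 10)                             # columns remain orthonormal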

The approximating densities that minimize the KL divergence are as follows. For the mixing proportions, qρ(ρ) = Dir(ρ; α1,…,αG), where \(\alpha _{g}=\alpha _{g}^{(0)}+{\sum }_{i=1}^{n}\hat {z}_{ig}\). For the means,

$$ q_{\boldsymbol{\mu}}(\boldsymbol{\mu} \mid\mathbf{T}_{1},\ldots,\mathbf{T}_{G})=\prod\limits_{g=1}^{G}\phi_{d}(\boldsymbol{\mu}_{g};\mathbf{m}_{g},(\beta_{g}\mathbf{T}_{g})^{-1}), $$

where \(\beta _{g} = \beta _{g}^{(0)}+{\sum }_{i=1}^{n}\hat {z}_{ig}\) and

$$ \mathbf{m}_{g} =\frac{1}{\beta_{g}}\left( {\beta_{g}^{(0)} \mathbf{m}_{g}^{(0)}+\sum\limits_{i=1}^{n}\hat{z}_{ig}\mathbf{y}_{i}}\right). $$
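A minimal R sketch of these updates, with toy data and responsibilities and with common hyperparameters \(\alpha _{g}^{(0)}=\beta _{g}^{(0)}=1\) and \(\mathbf {m}_{g}^{(0)}=\mathbf{0}\) assumed purely for illustration, is:

    set.seed(2)
    n <- 10; d <- 2; G <- 2
    y <- matrix(rnorm(n * d), ncol = d)              # toy data
    zhat <- matrix(runif(n * G), ncol = G)
    zhat <- zhat / rowSums(zhat)                     # current responsibilities
    alpha0 <- 1; beta0 <- 1; m0 <- rep(0, d)         # assumed hyperparameters
    n_g <- colSums(zhat)
    alpha <- alpha0 + n_g                            # Dirichlet update
    beta <- beta0 + n_g                              # precision-scaling update
    m <- sapply(seq_len(G), function(g)              # d x G matrix of posterior means m_g
      (beta0 * m0 + colSums(zhat[, g] * y)) / beta[g])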

The probability that the i th observation belongs to group g is then given by

$$ \hat{z}_{ig} =\frac{\varphi_{ig}}{{\sum}_{j=1}^{G}\varphi_{ij}}, $$

where

$$ \begin{array}{@{}rcl@{}} \varphi_{ig}&=&\exp\left( \mathbb{E}[\log \rho_{g}]+\frac{1}{2}\mathbb{E}[\log|\mathbf{T}_{g}|]-\frac{1}{2}\text{tr}\left\{\vphantom{\frac{1}{2}}\mathbb{E}[\mathbf{T}_{g}](\mathbf{y}_{i}-\mathbb{E}[\boldsymbol{\mu}_{g}])\right.\right.\\ &&\qquad\qquad\qquad\quad\left.\left.\times (\mathbf{y}_{i}-\mathbb{E}[\boldsymbol{\mu}_{g}])^{\prime}+\frac{1}{\beta_{g}}\mathbf{I}_{d}\right\}\right),\\ \mathbb{E}[\log \rho_{g}] &=&{\Psi}(\alpha_{g})-{\Psi}\left( \sum\limits_{h=1}^{G}{\alpha_{h}}\right), \end{array} $$

\(\mathbb {E}[\boldsymbol {\mu }_{g}] =\mathbf {m}_{g}\), and Ψ(⋅) is the digamma function. The values of \( \mathbb {E}[\mathbf {T}_{g}]\) and \(\mathbb {E}[\log |\mathbf {T}_{g}|]\) vary depending on the model (see Table 6 in Appendix A for details). The posterior distributions of the parameters \(\lambda _{g}^{-1}\) and Ag are gamma distributions and, therefore, \( \mathbb {E}[\lambda _{g}^{-1}]\), \(\mathbb {E}[\log \lambda _{g}^{-1}]\), \( \mathbb {E}[\mathbf {A}_{g}]\), and \(\mathbb {E}[\log |\mathbf {A}_{g}|]\) all have closed forms. The posterior distribution of \(\mathbf {D}_{g}\mathbf {A}_{g}\mathbf {D}_{g}^{\prime }\) is a Wishart distribution, so \(\mathbb {E}[\mathbf {D}_{g}\mathbf {A}_{g}\mathbf {D}_{g}^{\prime }]\) and \(\mathbb {E}[\log |\mathbf {D}_{g}\mathbf {A}_{g}\mathbf {D}_{g}^{\prime }|]\) also have closed forms. The posterior distribution of Dg is a Bingham matrix distribution (see Appendix C for details) and, hence, Monte Carlo integration is used to find \(\mathbb {E}[\mathbf {T}_{g}]\) and \(\mathbb {E}[\log |\mathbf {T}_{g}|]\). The estimated model parameters maximize the lower bound of the marginal log-likelihood.

2.3 Convergence

The posterior log-likelihood of the observed data obtained using the posterior expected values of the parameters is

$$ \log p(\mathbf{y}_{1},\ldots,\mathbf{y}_{n} \mid \tilde{\boldsymbol{\theta}}) = \sum\limits_{i=1}^{n} \log \left[ \sum\limits_{g=1}^{G} \frac{\tilde{\rho}_{g}|\tilde{\mathbf{T}}_{g}|^{1/2}}{(2\pi)^{{d}/{2}}}\exp \left\{-\frac{1}{2}(\mathbf{y}_{i}-\tilde{\boldsymbol{\mu}}_{g})^{\prime}\tilde{\mathbf{T}}_{g}(\mathbf{y}_{i}-\tilde{\boldsymbol{\mu}}_{g})\right\}\right], $$

where \(\tilde {\boldsymbol {\mu }}_{g}=\mathbf {m}_{g}\) and

$$\tilde{\rho}_{g} =\frac{\alpha_{g}}{{\sum}_{j=1}^{G}\alpha_{j}}.$$

The expected precision matrix \(\tilde {\mathbf {T}}_{g}\) varies according to the model. Convergence of the algorithm for these models is determined using a modified Aitken acceleration criterion. The Aitken acceleration (Aitken 1926) is given by

$$ a^{(m)}=\frac{l^{(m+1)}-l^{(m)}}{l^{(m)}-l^{(m-1)}}, $$

where l(m) is the value of the posterior log-likelihood at iteration m. Convergence can be considered to have been achieved when

$$ \left|l_{\infty}^{(m+1)}-l_{\infty}^{(m)}\right| < \epsilon, $$

where \(l_{\infty }^{(m+1)}\) is an asymptotic estimate of the log-likelihood given by

$$ l_{\infty}^{(m+1)} = l^{(m)}+\frac{1}{1-a^{(m)}}\left(l^{(m+1)}-l^{(m)}\right) $$

(Böhning et al. 1994).
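A minimal R sketch of this stopping rule, where ll is a vector of posterior log-likelihood values accumulated so far (the numbers in the example are purely illustrative), is:

    aitken_converged <- function(ll, eps = 1e-5) {
      m <- length(ll)
      if (m < 4) return(FALSE)                               # need enough iterations for two asymptotic estimates
      a_curr <- (ll[m] - ll[m - 1]) / (ll[m - 1] - ll[m - 2])
      a_prev <- (ll[m - 1] - ll[m - 2]) / (ll[m - 2] - ll[m - 3])
      l_inf_curr <- ll[m - 1] + (ll[m] - ll[m - 1]) / (1 - a_curr)       # l_infinity at the current iteration
      l_inf_prev <- ll[m - 2] + (ll[m - 1] - ll[m - 2]) / (1 - a_prev)   # l_infinity at the previous iteration
      abs(l_inf_curr - l_inf_prev) < eps
    }
    aitken_converged(c(-512.4, -498.7, -495.2, -494.3, -494.1))   # FALSE: not yet converged for these values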

The VEV and EEV models utilize Gibbs sampling and Monte Carlo integration to find both the expected value of Tg and the expectations of functions of Tg. As the Gibbs sampling chain approaches the stationary posterior distribution, the posterior log-likelihood oscillates rather than increasing monotonically at each iteration. Hence, an alternative convergence criterion is used for these models: convergence is assumed when the relative change in the parameter estimates between successive iterations is small. Specifically, for the VEV and EEV models, the algorithm is stopped when

$$ \max_{i} \left\{\frac{\big| \psi_{i}^{(m+1)}-\psi_{i}^{(m)}\big|}{\big| \psi_{i}^{(m)}\big| + \delta_{1}}\right\} < \delta_{2}, $$
(1)

where δ1 and δ2 are predetermined constants, \(\psi _{i}^{(m)}\) is the estimate of the i th parameter at the m th iteration, and i indexes over every parameter in the model. Note that, for matrix- or vector-valued parameters, \(\psi _{i}^{(m)}\) corresponds to an individual element, so that i indexes over all parameter elements and the comparison in (1) is element-wise. In the analyses herein, we require (1) to hold for three consecutive iterations, with δ1 = 0.001 and δ2 = 0.05. A detailed discussion of the convergence of the Monte Carlo EM algorithm is provided in Neath et al. (2013).
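The rule in (1) is straightforward to implement; the sketch below checks a single iteration with arbitrary, illustrative values, and in practice the check must succeed for three consecutive iterations:

    rel_change_small <- function(psi_new, psi_old, delta1 = 0.001, delta2 = 0.05) {
      # element-wise relative change over all parameter elements, as in (1)
      max(abs(psi_new - psi_old) / (abs(psi_old) + delta1)) < delta2
    }
    rel_change_small(c(1.02, 0.49, -2.01), c(1.00, 0.50, -2.00))   # TRUE for these values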

2.4 Model Selection

Although the variational Bayes approach simultaneously provides parameter estimates and the number of components, a model selection criterion is still needed to determine the covariance structure. For the selection of the best-fitting covariance structure, the deviance information criterion (DIC; Spiegelhalter et al. 2002) is used, as suggested by McGrory and Titterington (2007). The DIC is given by

$$ \text{DIC} = -2 \log p(\mathbf{y}_{1},\ldots,\mathbf{y}_{n} \mid \tilde{\boldsymbol{\theta}})+2p_{D}, $$

where

$$ 2p_{D} \approx -2 \int q_{\theta} (\boldsymbol{\theta}) \log \left\{ \frac{q_{\theta} (\boldsymbol{\theta})}{p(\boldsymbol{\theta})}\right\} d\boldsymbol{\theta} + 2 \log \left\{ \frac{q_{\theta} (\tilde{\boldsymbol{\theta}})}{p(\tilde{\boldsymbol{\theta}})}\right\} $$

and \(\log p(\mathbf {y}_{1},\ldots ,\mathbf {y}_{n} \mid \tilde {\boldsymbol {\theta }})\) is the posterior log-likelihood of the data.

Hereafter, the variational Bayes approach that uses the variational Bayes algorithms introduced herein together with the DIC to select the model (i.e., covariance structure) will be referred to as the VB-DIC approach.

2.5 Performance Assessment

The adjusted Rand index (ARI; Hubert and Arabie 1985) is used to assess the performance of the clustering techniques applied in Section 3. The Rand index (Rand 1971) is based on the pairwise agreement between two partitions, e.g., predicted and true classifications. The ARI corrects the Rand index to account for agreement by chance: a value of 1 indicates perfect agreement, the expected value under random class assignment is 0, and negative values indicate a classification that is worse than would be expected by guessing.
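For illustration, the ARI can be computed with the adjustedRandIndex function in the mclust package; the label vectors below are toy values:

    library(mclust)
    truth <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
    pred  <- c(1, 1, 2, 2, 2, 2, 3, 3, 1)
    adjustedRandIndex(pred, truth)      # imperfect agreement
    adjustedRandIndex(truth, truth)     # perfect agreement gives 1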

3 Results

3.1 Simulation Study 1

The VB-DIC approach is run on 50 simulated two-dimensional Gaussian datasets with three components and known means and covariance structures Σg = λgId (VII; see Table 3 for the λg values). For each dataset, we use five random starts for each of the 12 members of the GPCM family, and we set the maximum number of components to ten each time. For each dataset, the model with the smallest DIC is selected as the final model. A G = 3 component model is selected on 46 of the 50 occasions, with an ARI of 1 each time, while a G = 4 component model is selected on the other four occasions, with an average ARI of 0.96 (standard deviation 0.044).
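A sketch of this data-generating scheme is given below; the component means and λg values shown are placeholders for illustration only (the values actually used are those reported in Table 3):

    set.seed(3)
    n_g    <- c(100, 100, 100)                      # observations per component (illustrative)
    mu     <- list(c(0, 0), c(4, 4), c(-4, 4))      # placeholder component means
    lambda <- c(0.5, 1.0, 2.0)                      # placeholder volumes, Sigma_g = lambda_g * I_2
    y <- do.call(rbind, lapply(1:3, function(g)
      matrix(rnorm(2 * n_g[g], sd = sqrt(lambda[g])), ncol = 2) +
        matrix(mu[[g]], n_g[g], 2, byrow = TRUE)))
    labels <- rep(1:3, n_g)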

Table 3 Summary of the average and standard errors of the estimated parameters from the cases where a G = 3 component VII model is selected in Simulation Study 1

A VII model is selected on 47 of the 50 occasions, with VEE and VVV models selected twice and once, respectively. When VEE or VVV is selected, the average difference between the DIC value of the selected model and that of VII is 1.553, with a standard deviation of 1.984 (the differences range from 0.470 to 3.049). Thus, although a model other than VII is selected in these cases, the chosen model has a DIC value similar to that of the VII model in each case. In all, there are 43 cases where a G = 3 component VII model is selected, and the true and average estimated values (with standard deviations) for μg and λg in these cases are given in Table 3; in all cases, the estimates are very close to the true values.

One advantage of using a variational Bayes approach is that, at every iteration, the hyperparameters of the variational posterior are updated to further minimize the Kullback-Leibler divergence between the approximate variational posterior density and the true posterior density. Hence, 95% credible intervals can be constructed from the variational posterior distributions of the component means μg and the component precision parameters \({1}/{{\sigma _{g}^{2}}}\) for each run (see Fig. 1). A Bayesian credible interval provides an interval within which the unobserved parameter value falls with a given probability. Similar to Wang et al. (2005), we also evaluated the frequentist coverage probability of the intervals, i.e., the proportion of runs in which the true value of the parameter is contained within the credible interval. Across the nine parameters, the mean coverage probability was 0.927 (range 0.860–0.976), which is slightly below the nominal 0.95. Blei et al. (2017) point out that variational inference tends to underestimate the variance of the posterior density.

Fig. 1: 95% credible intervals for the component means μg (top two rows) and the component precision parameter 1/σ2 (bottom row) for the 43 runs where a G = 3 component VII model is selected in Simulation Study 1; vertical lines denote the parameter values used to generate the data.
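For a one-dimensional component, such intervals are simply quantiles of the variational posteriors; a minimal sketch is given below, with arbitrary, purely illustrative hyperparameter values, assuming a gamma variational posterior (shape a, rate b) for the precision and plugging its posterior mean into the scale of the mean's Gaussian posterior:

    m_g <- 1.2; beta_g <- 150                       # Gaussian variational posterior for mu_g
    a_g <- 75;  b_g <- 40                           # gamma variational posterior for 1/sigma_g^2
    tau_hat <- a_g / b_g                            # posterior mean of the precision
    qnorm(c(0.025, 0.975), mean = m_g, sd = sqrt(1 / (beta_g * tau_hat)))  # 95% interval for mu_g
    qgamma(c(0.025, 0.975), shape = a_g, rate = b_g)                       # 95% interval for 1/sigma_g^2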

For completeness, the EM algorithm together with the BIC to select the model (i.e., covariance structure and G), referred to hereafter as the EM-BIC framework, was also applied to these data using the mclust package for R. In all 50 cases, a G = 3 component VII model is chosen, giving perfect classification.

3.2 Simulation Study 2

We ran another simulation study with 50 datasets, each generated from a three-component, three-dimensional Gaussian mixture with known means and common covariance structure \(\boldsymbol {\Sigma }_{g}=\boldsymbol {\Sigma }=\lambda \mathbf {D} \mathbf {A} \mathbf {D}^{\prime }\). Again, five runs with different random starts are used and the maximum number of components is set to ten. For 41 of the 50 datasets, a three-component model is selected by the VB-DIC approach. In these 41 cases, an EEE model is selected 39 times and an EEV model is selected twice, with an average ARI of 1.000 (sd 0.001). In the two cases where an EEV model is selected, the differences in DIC between the EEV and EEE models are 3.256 and 8.095, respectively, indicating that the two models are close in their fits. Four- and five-component models are selected for eight datasets and one dataset, respectively, with an average ARI of 0.923 (sd 0.097). The true and estimated mean parameters using VB-DIC for the EEE model are given in Table 4, and the true and estimated covariance parameters using VB-DIC for the EEE model are:

$$ \boldsymbol{\Sigma} = \left[\begin{array}{lll} 0.50& 0.35& 0.25 \\ 0.35& 1.00 &0.45\\ 0.25 &0.45& 1.20 \end{array}\right], \quad \hat{\boldsymbol{\Sigma}} = \left[\begin{array}{lll} 0.494 ~(\text{sd}~0.049)& 0.346~(\text{sd}~0.044)& 0.235~(\text{sd}~0.046)\\ 0.346 ~(\text{sd}~0.044)& 0.995~(\text{sd}~0.076)& 0.445~(\text{sd}~0.069)\\ 0.235 ~(\text{sd}~0.046)&0.445~(\text{sd}~0.069)& 1.204~(\text{sd}~0.099) \end{array}\right]. $$
Table 4 Summary of the average and standard errors of the estimated parameters from the 39 of the 50 three-dimensional simulated datasets where an EEE model was selected, along with the true parameters used to generate the data
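A sketch of generating one such dataset with the common covariance matrix Σ shown above is given below; the component means are placeholders, since the true means are those reported in Table 4:

    library(MASS)
    set.seed(4)
    Sigma <- matrix(c(0.50, 0.35, 0.25,
                      0.35, 1.00, 0.45,
                      0.25, 0.45, 1.20), nrow = 3, byrow = TRUE)   # common (EEE) covariance
    mu <- list(c(0, 0, 0), c(3, 3, 3), c(-3, 3, 0))                # placeholder component means
    y <- do.call(rbind, lapply(mu, function(m) mvrnorm(100, m, Sigma)))
    labels <- rep(1:3, each = 100)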

The EM-BIC framework, via mclust, was also used for these data. An EEE model was chosen for all 50 datasets with an average ARI of 1.0 (sd 0.001).

3.3 Clustering of Benchmark Datasets

To demonstrate the performance of the VB-DIC approach, we applied our algorithm to several benchmark datasets and compared its performance with that of the widely used EM-BIC framework, via the mclust package.
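For reference, an EM-BIC fit of the kind used for comparison can be obtained along the following lines (shown here for the Iris data described below; this is only a sketch, as the exact settings behind the reported results are not restated here):

    library(mclust)
    X <- iris[, 1:4]                                 # the four Iris measurements
    fit <- Mclust(X, G = 1:10)                       # BIC selects the covariance structure and G
    summary(fit)
    adjustedRandIndex(fit$classification, iris$Species)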

Crabs Data

The Leptograpsus crabs data set, publicly available in the package MASS (Venables and Ripley 2002) for R, consists of biological measurements on 200 crabs from two species (orange and blue), with 50 males and 50 females of each species. The measurements (in millimeters) are frontal lobe size, rear width, carapace length, carapace width, and body depth. Although this data set has been analyzed quite often in the literature using several different clustering approaches, the correlation among the variables makes it difficult to cluster (Fig. 2). Due to this known issue, we preprocess the data using principal component analysis to convert the correlated variables into uncorrelated principal components (Fig. 2). The VB-DIC approach was then run on these principal components with a maximum of G = 10 components.

Fig. 2: Scatterplot matrices showing the relationships among the variables in the Leptograpsus crabs dataset (left) and among the uncorrelated principal components (right), where the colors/symbols represent the different groups.
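The preprocessing step can be sketched as follows (whether the variables are scaled before the principal component analysis is an assumption here, not something stated above):

    library(MASS)
    data(crabs)
    X <- crabs[, c("FL", "RW", "CL", "CW", "BD")]    # the five morphological measurements
    pc <- prcomp(X, scale. = TRUE)                   # principal components (scaling assumed)
    scores <- pc$x                                   # uncorrelated scores used for clustering
    round(cor(scores), 3)                            # approximately a diagonal matrix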

SRBCT Data

The SRBCT dataset, available in the R package plsgenomics (Boulesteix et al. 2018), contains gene expression data from microarray experiments on small round blue cell tumors (SRBCT) of childhood. It contains measurements on 2,308 genes from 83 samples, comprising 29 cases of Ewing sarcoma (EWS), 11 cases of Burkitt lymphoma (BL), 18 cases of neuroblastoma (NB), and 25 cases of rhabdomyosarcoma (RMS). Note that our proposed variational Bayes algorithm is not designed for high-dimensional, low sample size (i.e., large p, small N) problems. Dang et al. (2015) performed a differential expression analysis on these data using an ANOVA across the known groups and selected the top ten genes, ranked using the resulting p-values, to represent a potential set of measurements that contain information on group identification. Hence, we preprocessed the SRBCT dataset in a similar manner to Dang et al. (2015) and implemented the VB-DIC approach with a maximum of G = 10 components.
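A sketch of this screening step is given below, using a stand-in expression matrix because the exact structure of the plsgenomics data object is not described above; genes are ranked by one-way ANOVA p-values across the known classes and the ten smallest are retained:

    set.seed(5)
    X <- matrix(rnorm(83 * 200), nrow = 83)            # stand-in for the 83 x 2308 expression matrix
    cls <- factor(sample(1:4, 83, replace = TRUE))     # stand-in for the four tumor classes
    pvals <- apply(X, 2, function(gene) anova(lm(gene ~ cls))[["Pr(>F)"]][1])
    top10 <- order(pvals)[1:10]                        # the ten most differentially expressed genes
    X_reduced <- X[, top10]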

Iris Data

The Iris data set available in the R datasets package contains measurements in centimeters of the variables sepal length, sepal width, petal length, and petal width of 50 flowers from each of the three species of Iris: Iris setosa, Iris versicolor, and Iris virginica.

Diabetes Data

The diabetes dataset available in the R package mclust contains measurements on three variables on 145 non-obese adult patients classified into three groups (Normal, Overt and Chemical):

  • glucose: Area under plasma glucose curve after a three hour oral glucose tolerance test.

  • insulin: Area under plasma insulin curve after a three hour oral glucose tolerance test.

  • sspg: Steady state plasma glucose.

Banknote Data

The banknote dataset, available in the R package mclust, contains six measurements of 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes. Measurements are available for the following variables:

  • Length: Length of the bill in mm.

  • Left: Width of left edge in mm.

  • Right: Width of right edge in mm.

  • Bottom: Bottom margin width in mm.

  • Top: Top margin width in mm.

  • Diagonal: Length of diagonal in mm.

A summary of the performance of the VB-DIC approach and the EM-BIC approach is given in Table 5, where the ARI of the approach that gives the better performance is in italics. For three of the five benchmark datasets, our VB-DIC approach outperforms the EM-BIC framework as implemented via mclust. For one of the five datasets, the VB-DIC approach gives the same ARI as the EM-BIC framework and, for the remaining dataset, the EM-BIC framework yields a slightly larger ARI than the VB-DIC approach.

Table 5 Summary of the performance of VB-DIC approach on the benchmark datasets along with the performance using the EM-BIC framework

4 Discussion

A variational Bayes approach to parameter estimation for the well-known GPCM family has been proposed. As stated before, an advantage of a variational Bayes algorithm is that, because the hyperparameters of the approximating posterior densities are updated at every iteration, we are updating the approximating variational posterior density of each parameter rather than a point estimate of each parameter, as in an EM framework. This leads to a natural framework for extracting interval estimates (i.e., credible intervals) on every run, similar to a fully Bayesian approach, but without the need to construct confidence intervals via bootstrapping as in an EM framework. We also preserve the monotonicity property of the log-likelihood function, as in an EM algorithm, which is lost in a fully Bayesian MCMC-based approach. Additionally, the variational Bayes approach allows parameter estimates and the number of components to be obtained simultaneously. However, a model selection criterion still needs to be used to select the covariance structure. Herein, we used the DIC for the selection of the covariance structure, and so the resulting variational Bayes approach was called the VB-DIC approach. As can be seen from the simulation studies, the correct covariance structure is often selected using the DIC. That said, it may well be the case that another criterion is more suitable for selecting the model (i.e., the covariance structure). Notably, starting values play a different role for variational Bayes than for the EM algorithm: because the former gradually reduces G as the algorithm iterates, the “starting values” for all but the initial G are not the values used to actually start the algorithm. Accordingly, direct comparison of the VB-DIC and EM-BIC approaches is not entirely straightforward.

In the simulation studies, the parameters estimated using variational Bayes approximations were very close to the true parameters (when the correct model was chosen), and excellent classification was obtained using the model selected by the DIC. In many of the simulated and real data analyses, the performance of the VB-DIC approach was very similar to, or the same as, that of the EM-BIC approach. This is not surprising. As noted by McLachlan and Krishnan (2008) and Gelman et al. (2013), the EM algorithm can be thought of as a special case of variational Bayes in which the parameters are partitioned into two parts, ϕ and γ, the approximating distribution of ϕ is required to be a point mass, and the approximating distribution of γ is unconstrained conditional on the last update of ϕ. Across the simulations, the EM-BIC framework outperformed the VB-DIC approach; however, the VB-DIC approach outperformed the EM-BIC framework on three of the five benchmark real datasets considered.

In summary, we have explored a Bayesian alternative for parameter estimation for the most widely used family of Gaussian mixture models, i.e., the GPCM family. The use of variational Bayes in conjunction with the DIC for a family of mixture models is a novel idea and lends itself nicely to further research. Moreover, the DIC provides an alternative model selection criterion to the almost ubiquitous BIC. There are several possible avenues for further research, one of which is extension to the semi-supervised (e.g., McNicholas 2010) or, more generally, fractionally supervised paradigm (see Vrbik and McNicholas 2015; Gallaugher and McNicholas 2019b). Another avenue is extension to other families of Gaussian mixture models (e.g., the PGMM family of McNicholas and Murphy 2008, 2010) and to non-Gaussian families of mixture models (e.g., Vrbik and McNicholas 2014; Lin et al. 2014). Further consideration is needed vis-à-vis the approach used to select the model (i.e., covariance structure) in the variational Bayes approach, e.g., one could conduct a detailed comparison of VB-DIC and, inter alia, VB-BIC. Finally, an analogous variational Bayes approach could be taken to parameter estimation for mixtures of matrix variate distributions (see, e.g., Viroli 2011; Gallaugher and McNicholas 2018a, b, 2019a).