1 Introduction

Most early clustering algorithms were based on heuristic approaches, and some such methods, including hierarchical agglomerative clustering and k-means clustering (MacQueen 1967; Hartigan and Wong 1979), are still widely used. The use of mixture models to account for population heterogeneity has been very well established for over a century (e.g., Pearson 1894), but it was the 1960s before mixture models were used for clustering (Wolfe 1965; Hasselblad 1966; Day 1969). Because of the lack of suitable computing equipment, it was much later before the use of mixture models (e.g., Banfield and Raftery 1993; Celeux and Govaert 1995) and, more generally, probability models (e.g., Bock 1996, 1998a, 1998b) for clustering became commonplace. Since the turn of the century, the use of mixture models for clustering has burgeoned into a popular subfield of cluster analysis and recent examples include Franczak et al. (2014), Vrbik and McNicholas (2014), Murray et al. (2014a, b), Lee and McLachlan (2014), Lin et al. (2014), Subedi et al. (2015), Morris and McNicholas (2016), O’Hagan et al. (2016), Dang et al. (2015), Lin et al. (2016), Lee and McLachlan (2016), Dang et al. (2017), Cheam et al. (2017), Melnykov and Zhu (2018), Zhu and Melnykov (2018), Gallaugher and McNicholas (2019b), Tortora et al. (2019), Biernacki and Lourme (2019), Murray et al. (2019), Morris et al. (2019), and Punzo et al. (2020). The reader may consult Bouveyron and Brunet-Saumard (2014) and McNicholas (2016b) for relatively recent reviews of model-based clustering work.

A d-dimensional random vector Y is said to arise from a parametric finite mixture distribution if, for all y in the support of Y, we can write its density as

$$ f(\mathbf{y}\mid\boldsymbol{\vartheta})= \sum\limits_{g=1}^{G} \rho_{g} p_{g}(\mathbf{y}\mid\boldsymbol{\theta}_{g}), $$

where ρg > 0, with \({\sum }_{g=1}^{G} \rho _{g} =1\), are the mixing proportions, pg(y ∣ 𝜃g) are the component densities, and 𝜗 = (ρ1,…,ρG,𝜃1,…,𝜃G) is the vector of parameters. When the component parameters 𝜃1,…,𝜃G are decomposed and constraints are imposed on the resulting decompositions, the result is a family of mixture models. Typically, each component probability density is of the same type and, when they are Gaussian, the mixture density function is

$$ f(\mathbf{y}\mid\boldsymbol{\vartheta})= \sum\limits_{g=1}^{G} \rho_{g}\phi_{d}(\mathbf{y}\mid\boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g}), $$

where ϕd(y ∣ μg,Σg) is the d-dimensional Gaussian density with mean μg and covariance matrix Σg, and the likelihood is

$$ \mathcal{L}(\boldsymbol{\vartheta} \mid \mathbf{y}_{1}, \ldots, \mathbf{y}_{n}) = \prod\limits_{i=1}^{n} \sum\limits_{g=1}^{G}\rho_{g}\phi_{d}(\mathbf{y}_{i}\mid \boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g}), $$

where 𝜗 denotes the model parameters. In Gaussian families, it is usually the component covariance matrices Σ1,…,ΣG that are decomposed (see Section 2).

The expectation-maximization (EM) algorithm (Dempster et al. 1977) is often used for mixture model parameter estimation but its efficacy is questionable. As discussed by Titterington et al. (1985) and others, the nature of the mixture likelihood surface leaves the EM algorithm open to failure. Although this weakness can be mitigated by using multiple restarts, there is no way to overcome it completely. Besides its heavy reliance on starting values, convergence of the EM algorithm can be very slow. When families of mixture models are used, the EM algorithm must be paired with a model selection criterion to select the member of the family and, in many cases, the number of components. There are many model selection criteria to choose from, such as the Bayesian information criterion (BIC; Schwarz 1978), the integrated completed likelihood (ICL; Biernacki et al. 2000), and the Akaike information criterion (AIC; Akaike 1974). All of these model selection criteria have some merit and various shortcomings, but the BIC remains by far the most popular (McNicholas 2016a, Ch. 2). There has been interest in the use of Bayesian approaches to mixture model parameter estimation via Markov chain Monte Carlo (MCMC) methods (e.g., Diebolt and Robert 1994; Richardson and Green 1997; Bensmail et al. 1997; Stephens 1997, 2000; Casella et al. 2002); however, difficulties have been encountered with, inter alia, computational overhead and convergence (see Celeux et al. 2000; Jasra et al. 2005). Variational Bayes approximations present an alternative to MCMC algorithms for mixture model parameter estimation and are gaining popularity due to their fast and deterministic nature (see Jordan et al. 1999; Corduneanu and Bishop 2001; Ueda and Ghahramani 2002; McGrory and Titterington 2007, 2009; McGrory et al. 2009; Subedi and McNicholas 2014).

With the use of a computationally convenient approximating density in place of a more complex “true” posterior density, the variational algorithm overcomes the hurdles of MCMC sampling. For observed data y, the joint conditional distribution of parameters 𝜃 and missing data z is approximated by using another computationally convenient distribution q(𝜃,z). This distribution q(𝜃,z) is obtained by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities, where

$$ \text{KL}\big(q(\boldsymbol{\theta},\mathbf{z}) \mid p(\boldsymbol{\theta},\mathbf{z}\mid \mathbf{y})\big) = \int_{\boldsymbol{\Theta}} \sum\limits_{\mathbf{z}} q(\boldsymbol{\theta},\mathbf{z}) \log\left\{\frac{q(\boldsymbol{\theta},\mathbf{z})}{p(\boldsymbol{\theta},\mathbf{z}\mid \mathbf{y})}\right\} d\boldsymbol{\theta}. $$

The approximating density is restricted to have a factorized form for computational convenience, so that q(𝜃,z) = q𝜃(𝜃)qz(z). Upon choosing a conjugate prior, the appropriate hyper-parameters of the approximating density q𝜃(𝜃) can be obtained by solving a set of coupled non-linear equations.
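Note that this minimization is equivalent to maximizing a lower bound on the marginal log-likelihood: by the standard decomposition,

$$ \log p(\mathbf{y}) = \mathcal{F}(q) + \text{KL}\big(q(\boldsymbol{\theta},\mathbf{z}) \mid p(\boldsymbol{\theta},\mathbf{z}\mid \mathbf{y})\big), \qquad \mathcal{F}(q) = \int_{\boldsymbol{\Theta}} \sum\limits_{\mathbf{z}} q(\boldsymbol{\theta},\mathbf{z}) \log\left\{\frac{p(\mathbf{y},\boldsymbol{\theta},\mathbf{z})}{q(\boldsymbol{\theta},\mathbf{z})}\right\} d\boldsymbol{\theta}, $$

and, because log p(y) does not depend on q, minimizing the KL divergence amounts to maximizing the lower bound \(\mathcal{F}(q)\).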

The variational Bayes algorithm is initialized with more components than are expected to be needed. As the algorithm iterates, if two components have similar parameters, then one component dominates the other, driving the dominated component’s weight towards zero. If a component’s weight becomes sufficiently small, i.e., corresponding to two or fewer observations in our analyses, the component is removed from consideration. Therefore, the variational Bayes approach allows for simultaneous parameter estimation and selection of the number of components.
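For illustration, this pruning rule amounts to dropping any component whose expected number of observations, \({\sum }_{i=1}^{n}\hat {z}_{ig}\), is at most two, and then renormalizing the remaining responsibilities; a minimal sketch with a toy responsibility matrix (the values are purely illustrative) is:

    set.seed(1)
    zhat <- matrix(runif(300 * 6), ncol = 6)      # toy responsibilities for n = 300, G = 6
    zhat[, 6] <- zhat[, 6] * 1e-4                 # make one component negligible
    zhat <- zhat / rowSums(zhat)                  # rows sum to one
    n_g <- colSums(zhat)                          # expected number of observations per component
    keep <- n_g > 2                               # drop components supported by two or fewer observations
    zhat <- zhat[, keep, drop = FALSE]
    zhat <- zhat / rowSums(zhat)                  # renormalize over the retained components
    G <- ncol(zhat)                               # updated number of components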

2 Methodology

2.1 Introducing Parsimony

If d-dimensional data y1,…,yn arise from a finite mixture of Gaussian distributions, then the log-likelihood is

$$ \log p(\mathbf{y}_{1},\ldots,\mathbf{y}_{n} \mid \boldsymbol{\theta}) = \sum\limits_{i=1}^{n} \log\left[ \sum\limits_{g=1}^{G} \rho_{g}\frac{|\boldsymbol{\Sigma}^{-1}_{g}|^{1/2}}{(2\pi)^{{d}/{2}}}\exp\left\{-\frac{1}{2}(\mathbf{y}_{i}-\boldsymbol{\mu}_{g})^{\prime}\boldsymbol{\Sigma}^{-1}_{g}(\mathbf{y}_{i}-\boldsymbol{\mu}_{g})\right\}\right]. $$

The number of free parameters in the component covariance matrices is Gd(d + 1)/2, which is quadratic in d. When dealing with real data, the number of free parameters to be estimated can very easily exceed the sample size n by an order of magnitude. Hence, the introduction of parsimony through the imposition of additional structure on the covariance matrices is desirable. Banfield and Raftery (1993) exploited geometrical constraints on the covariance matrices of a mixture of Gaussian distributions using the eigen-decomposition of the covariance matrices, such that \(\boldsymbol {\Sigma }_{g} = \lambda _{g} \mathbf {D}_{g}\mathbf {A}_{g}\mathbf {D}_{g}^{\prime }\), where Dg is the orthogonal matrix of eigenvectors of Σg, Ag is a diagonal matrix proportional to the eigenvalues of Σg with |Ag| = 1, and λg is the associated constant of proportionality. This decomposition has an advantage in terms of interpretation: λg controls the cluster volume, Ag controls the cluster shape, and Dg controls the cluster orientation. Imposing constraints on these elements, each with a geometrical interpretation, gives rise to a family of 14 models known as the Gaussian parsimonious clustering models (GPCM; Celeux and Govaert 1995); see Table 1.

Table 1 Nomenclature, interpretation, and covariance structure for each member of the GPCM family
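To make the roles of λg, Ag, and Dg concrete, the following sketch recovers the three elements of the decomposition in R for an arbitrary, illustrative covariance matrix:

    Sigma <- matrix(c(2.0, 0.6,
                      0.6, 1.0), nrow = 2, byrow = TRUE)   # an arbitrary covariance matrix
    ed <- eigen(Sigma)
    D <- ed$vectors                                        # orientation: eigenvectors of Sigma
    lambda <- prod(ed$values)^(1 / nrow(Sigma))            # volume: |Sigma|^(1/d)
    A <- diag(ed$values / lambda)                          # shape: diagonal matrix with |A| = 1
    all.equal(Sigma, lambda * D %*% A %*% t(D))            # reconstructs Sigma
    det(A)                                                 # equals 1 up to rounding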

The mclust package (Scrucca et al. 2016) for R (R Core Team 2018) implements 12 of these 14 GPCM models in an EM framework, with the MM framework of Browne and McNicholas (2014) used for the other two models (EVE and VVE). Bensmail et al. (1997) used Gibbs sampling to carry out Bayesian inference for eight of the GPCM models. Bayesian regularization of some of the GPCM models has been considered by Fraley and Raftery (2007). After assigning a highly dispersed conjugate prior, they replace the maximum likelihood estimator of the group membership obtained using the EM algorithm by a maximum a posteriori probability (MAP) estimator. Note that \(\text {MAP} (\hat {z}_{ig}) = 1\) if \(g=\arg \max \limits _{h}({\hat {z}_{ih}})\) and \(\text {MAP}({\hat {z}_{ig}}) = 0\) otherwise, where \(\hat {z}_{ig}\) denotes the a posteriori expected value of Zig and

$$ z_{ig}=\left\{\begin{array}{ll} 1 & \text{if } \mathbf{y}_{i} \text{ belongs to component } g, \text{ and}\\ 0 & \text{otherwise}. \end{array}\right. $$

A modified BIC based on the maximum a posteriori probability is then used for model selection. Herein, we implement 12 of the 14 GPCM models using variational Bayes approximations; conjugate priors are not available for the EVE and VVE models.
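In R, the MAP rule above reduces to a row-wise argmax over the matrix of \(\hat {z}_{ig}\) values; a minimal sketch with toy probabilities (purely for illustration) is:

    zhat <- matrix(c(0.90, 0.10,
                     0.30, 0.70,
                     0.45, 0.55), ncol = 2, byrow = TRUE)  # toy posterior probabilities
    map_labels <- apply(zhat, 1, which.max)                # MAP classification: argmax over h
    map_labels                                             # 1 2 2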

2.2 Priors and Approximating Densities

As suggested by McGrory and Titterington (2007), the Dirichlet distribution is used as the conjugate prior for the mixing proportions, such that

$$ p(\boldsymbol{\rho})=\text{Dir}(\boldsymbol{\rho}; \alpha_{1}^{(0)},\ldots,\alpha_{G}^{(0)}), $$

where ρ = (ρ1,…,ρG) are the mixing proportions and \(\alpha _{1}^{(0)},\ldots ,\alpha _{G}^{(0)}\) are the hyperparameters. Conditional on the precision matrix Tg, independent Gaussian distributions are used as the conjugate priors for the component means, such that

$$ p(\boldsymbol{\mu}_{1},\ldots,\boldsymbol{\mu}_{G} \mid\mathbf{T}_{1},\ldots,\mathbf{T}_{G})=\prod\limits_{g=1}^{G}\phi_{d}(\boldsymbol{\mu}_{g};\mathbf{m}^{(0)}_{g},(\beta_{g}^{(0)}\mathbf{T}_{g})^{-1}), $$

where \(\{\mathbf {m}_{g}^{(0)}, \beta _{g}^{(0)}\}_{g=1}^{G}\) are the hyper-parameters.

Fraley and Raftery (2007) assigned priors to the parameters of the covariance matrix and its components in a Bayesian regularization application. Here, however, we assign priors to the precision matrix, with the hyperparameters given in Table 2. Note that it was not possible to place a suitable (i.e., determinant one) prior on the matrix Ag for the models EVI and VVI, or on A for the models VEV and VEI; accordingly, we instead place a prior on \(c_{g}\mathbf {A}_{g}^{-1}\) or \(c\mathbf{A}^{-1}\), respectively, where cg or c is a positive constant. Using the expected value of \(c_{g}\mathbf {A}_{g}^{-1}\) (or \(c\mathbf{A}^{-1}\)), the expected value of \(\mathbf {A}_{g}^{-1}\) (or \(\mathbf{A}^{-1}\)) is determined so that the constraint that the determinant is 1 is satisfied. Because Dg is an orthogonal matrix of eigenvectors, the Bingham matrix distribution is used as the conjugate prior for Dg. The Bingham distribution, introduced by Bingham (1974), is a probability distribution on unit vectors \(\{\mathbf {u}: \mathbf {u}^{\prime }\mathbf {u}=1\}\); it has antipodal symmetry, making it well suited to modelling random axes.

Table 2 The precision parameter upon which a prior is placed, as well as the corresponding prior distribution and hyperparameters, for 12 of the 14 members of the GPCM family

The Bingham matrix distribution (Gupta and Nagar 2000) is the matrix analogue, on the Stiefel manifold, of the Bingham distribution and has been used in multivariate analysis and matrix decomposition methods (Hoff 2009). The density of the Bingham matrix distribution, as defined by Gupta and Nagar (2000), is

$$ p(\mathbf{D}) = b(\mathbf{A},\mathbf{B})\exp(\text{tr}\{\mathbf{B}\mathbf{D}\mathbf{A}\mathbf{D}^{\prime}\}) [d\mathbf{D}], $$

for D ∈ O(n,d), where O(n,d) is the Stiefel manifold of n × d matrices with orthonormal columns, [dD] is the unit invariant measure on O(n,d), and A and B are symmetric and diagonal matrices, respectively. Samples from the Bingham matrix distribution can be obtained using the Gibbs sampling algorithm implemented in the R package rstiefel (Hoff 2012).
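For illustration, a draw can be obtained along the following lines; this is a minimal sketch assuming rstiefel's rbing.matrix.gibbs and rustiefel functions, with arbitrary parameter values:

    library(rstiefel)
    set.seed(1)
    d <- 3
    A <- crossprod(matrix(rnorm(d * d), d, d))          # symmetric parameter matrix
    B <- diag(sort(rexp(d), decreasing = TRUE))         # diagonal parameter matrix
    X <- rustiefel(d, d)                                # initial point on the Stiefel manifold
    for (s in 1:100) X <- rbing.matrix.gibbs(A, B, X)   # Gibbs updates (see ?rbing.matrix.gibbs for the parameterization)
    round(crossprod(X), 10)                             # columns remain orthonormal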

The approximating densities that minimize the KL divergence are as follows. For the mixing proportions, qρ(ρ) = Dir(ρ; α1,…,αG), where \(\alpha _{g}=\alpha _{g}^{(0)}+{\sum }_{i=1}^{n}\hat {z}_{ig}\). For the means,

$$ q_{\boldsymbol{\mu}}(\boldsymbol{\mu} \mid\mathbf{T}_{1},\ldots,\mathbf{T}_{G})=\prod\limits_{g=1}^{G}\phi_{d}(\boldsymbol{\mu}_{g};\mathbf{m}_{g},(\beta_{g}\mathbf{T}_{g})^{-1}), $$

where \(\beta _{g} = \beta _{g}^{(0)}+{\sum }_{i=1}^{n}\hat {z}_{ig}\) and

$$ \mathbf{m}_{g} =\frac{1}{\beta_{g}}\left( {\beta_{g}^{(0)} \mathbf{m}_{g}^{(0)}+\sum\limits_{i=1}^{n}\hat{z}_{ig}\mathbf{y}_{i}}\right). $$
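A minimal R sketch of these updates, with toy data and responsibilities and with common hyperparameters \(\alpha _{g}^{(0)}=\beta _{g}^{(0)}=1\) and \(\mathbf {m}_{g}^{(0)}=\mathbf{0}\) assumed purely for illustration, is:

    set.seed(2)
    n <- 10; d <- 2; G <- 2
    y <- matrix(rnorm(n * d), ncol = d)              # toy data
    zhat <- matrix(runif(n * G), ncol = G)
    zhat <- zhat / rowSums(zhat)                     # current responsibilities
    alpha0 <- 1; beta0 <- 1; m0 <- rep(0, d)         # assumed hyperparameters
    n_g <- colSums(zhat)
    alpha <- alpha0 + n_g                            # Dirichlet update
    beta <- beta0 + n_g                              # precision-scaling update
    m <- sapply(seq_len(G), function(g)              # d x G matrix of posterior means m_g
      (beta0 * m0 + colSums(zhat[, g] * y)) / beta[g])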

The probability that the i th observation belongs to group g is then given by

$$ \hat{z}_{ig} =\frac{\varphi_{ig}}{{\sum}_{j=1}^{G}\varphi_{ij}}, $$

where

$$ \begin{array}{@{}rcl@{}} \varphi_{ig}&=&\exp\left( \mathbb{E}[\log \rho_{g}]+\frac{1}{2}\mathbb{E}[\log|\mathbf{T}_{g}|]-\frac{1}{2}\text{tr}\left\{\vphantom{\frac{1}{2}}\mathbb{E}[\mathbf{T}_{g}](\mathbf{y}_{i}-\mathbb{E}[\boldsymbol{\mu}_{g}])\right.\right.\\ &&\qquad\qquad\qquad\quad\left.\left.\times (\mathbf{y}_{i}-\mathbb{E}[\boldsymbol{\mu}_{g}])^{\prime}+\frac{1}{\beta_{g}}\mathbf{I}_{d}\right\}\right),\\ \mathbb{E}[\log \rho_{g}] &=&{\Psi}(\alpha_{g})-{\Psi}\left( \sum\limits_{h=1}^{G}{\alpha_{h}}\right), \end{array} $$

\(\mathbb {E}[\boldsymbol {\mu }_{g}] =\mathbf {m}_{g}\), and Ψ(⋅) is the digamma function. The values of \( \mathbb {E}[\mathbf {T}_{g}]\) and \(\mathbb {E}[\log |\mathbf {T}_{g}|]\) vary depending on the model (see Table 6 in Appendix A for details). The posterior distributions of the parameters \(\lambda _{g}^{-1}\) and Ag are gamma distributions and, therefore, \( \mathbb {E}[\lambda _{g}^{-1}]\), \(\mathbb {E}[\log \lambda _{g}^{-1}]\), \( \mathbb {E}[\mathbf {A}_{g}]\), and \(\mathbb {E}[\log |\mathbf {A}_{g}|]\) all have closed forms. The posterior distribution of \(\mathbf {D}_{g}\mathbf {A}_{g}\mathbf {D}_{g}^{\prime }\) is a Wishart distribution, so \(\mathbb {E}[\mathbf {D}_{g}\mathbf {A}_{g}\mathbf {D}_{g}^{\prime }]\) and \(\mathbb {E}[\log |\mathbf {D}_{g}\mathbf {A}_{g}\mathbf {D}_{g}^{\prime }|]\) also have closed forms. The posterior distribution of Dg is a Bingham matrix distribution (see Appendix C for details) and, hence, Monte Carlo integration is used to find \(\mathbb {E}[\mathbf {T}_{g}]\) and \(\mathbb {E}[\log |\mathbf {T}_{g}|]\). The estimated model parameters maximize the lower bound of the marginal log-likelihood.

2.3 Convergence

The posterior log-likelihood of the observed data obtained using the posterior expected values of the parameters is

$$ \log p(\mathbf{y}_{1},\ldots,\mathbf{y}_{n} \mid \tilde{\boldsymbol{\theta}}) = \sum\limits_{i=1}^{n} \log \left[ \sum\limits_{g=1}^{G} \frac{\tilde{\rho}_{g}|\tilde{\mathbf{T}}_{g}|^{1/2}}{(2\pi)^{{d}/{2}}}\exp \left\{-\frac{1}{2}(\mathbf{y}_{i}-\tilde{\boldsymbol{\mu}}_{g})^{\prime}\tilde{\mathbf{T}}_{g}(\mathbf{y}_{i}-\tilde{\boldsymbol{\mu}}_{g})\right\}\right], $$

where \(\tilde {\boldsymbol {\mu }}_{g}=\mathbf {m}_{g}\) and

$$\tilde{\rho}_{g} =\frac{\alpha_{g}}{{\sum}_{j=1}^{G}\alpha_{j}}.$$

The expected precision matrix \(\tilde {\mathbf {T}}_{g}\) varies according to the model. Convergence of the algorithm for these models is determined using a modified Aitken acceleration criterion. The Aitken acceleration (Aitken 1926) is given by

$$ a^{(m)}=\frac{l^{(m+1)}-l^{(m)}}{l^{(m)}-l^{(m-1)}}, $$

where l(m) is the value of the posterior log-likelihood at iteration m. Convergence can be considered to have been achieved when

$$ \left|l_{\infty}^{(m+1)}-l_{\infty}^{(m)}\right| < \epsilon, $$

where \(l_{\infty }^{(m+1)}\) is an asymptotic estimate of the log-likelihood given by

$$ l_{\infty}^{(m+1)} = l^{(m)}+\frac{1}{1-a^{(m)}}\left(l^{(m+1)}-l^{(m)}\right) $$

(Böhning et al. 1994).
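A minimal R sketch of this stopping rule, where ll is a vector of posterior log-likelihood values accumulated so far (the numbers in the example are purely illustrative), is:

    aitken_converged <- function(ll, eps = 1e-5) {
      m <- length(ll)
      if (m < 4) return(FALSE)                               # need enough iterations for two asymptotic estimates
      a_curr <- (ll[m] - ll[m - 1]) / (ll[m - 1] - ll[m - 2])
      a_prev <- (ll[m - 1] - ll[m - 2]) / (ll[m - 2] - ll[m - 3])
      l_inf_curr <- ll[m - 1] + (ll[m] - ll[m - 1]) / (1 - a_curr)       # l_infinity at the current iteration
      l_inf_prev <- ll[m - 2] + (ll[m - 1] - ll[m - 2]) / (1 - a_prev)   # l_infinity at the previous iteration
      abs(l_inf_curr - l_inf_prev) < eps
    }
    aitken_converged(c(-512.4, -498.7, -495.2, -494.3, -494.1))   # FALSE: not yet converged for these values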

The VEV and EEV models utilize Gibbs sampling and Monte Carlo integration to find both the expected value of Tg and the expectations of functions of Tg. As the Gibbs sampling chain approaches the stationary posterior distribution, the posterior log-likelihood oscillates rather than increasing monotonically at each iteration. Hence, an alternative convergence criterion is used for these models: convergence is assumed when the relative change in the parameter estimates between successive iterations is small. Specifically, for the VEV and EEV models, the algorithm is stopped when

$$ \max_{i} \left\{\frac{\big| \psi_{i}^{(m+1)}-\psi_{i}^{(m)}\big|}{\big| \psi_{i}^{(m)}\big| + \delta_{1}}\right\} < \delta_{2}, $$
(1)

where δ1 and δ2 are predetermined constants, \(\psi _{i}^{(m)}\) is the estimate of the i th parameter at the m th iteration, and i indexes over every parameter in the model. Note that, for matrix- or vector-valued parameters, \(\psi _{i}^{(m)}\) corresponds to an individual element, so that i indexes over all parameter elements and the comparison in (1) is element-wise. In the analyses herein, we require (1) to hold for three consecutive iterations, with δ1 = 0.001 and δ2 = 0.05. A detailed discussion of the convergence of the Monte Carlo EM algorithm is provided in Neath et al. (2013).
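The rule in (1) is straightforward to implement; the sketch below checks a single iteration with arbitrary, illustrative values, and in practice the check must succeed for three consecutive iterations:

    rel_change_small <- function(psi_new, psi_old, delta1 = 0.001, delta2 = 0.05) {
      # element-wise relative change over all parameter elements, as in (1)
      max(abs(psi_new - psi_old) / (abs(psi_old) + delta1)) < delta2
    }
    rel_change_small(c(1.02, 0.49, -2.01), c(1.00, 0.50, -2.00))   # TRUE for these values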

2.4 Model Selection

Although the variational Bayes approach simultaneously provides parameter estimates and the number of components, a model selection criterion is still needed to determine the covariance structure. For the selection of the best-fitting covariance structure, the deviance information criterion (DIC; Spiegelhalter et al. 2002) is used, as suggested by McGrory and Titterington (2007). The DIC is given by

$$ \text{DIC} = -2 \log p(\mathbf{y}_{1},\ldots,\mathbf{y}_{n} \mid \tilde{\boldsymbol{\theta}})+2p_{D}, $$

where

$$ 2p_{D} \approx -2 \int q_{\theta} (\boldsymbol{\theta}) \log \left\{ \frac{q_{\theta} (\boldsymbol{\theta})}{p(\boldsymbol{\theta})}\right\} d\boldsymbol{\theta} + 2 \log \left\{ \frac{q_{\theta} (\tilde{\boldsymbol{\theta}})}{p(\tilde{\boldsymbol{\theta}})}\right\} $$

and \(\log p(\mathbf {y}_{1},\ldots ,\mathbf {y}_{n} \mid \tilde {\boldsymbol {\theta }})\) is the posterior log-likelihood of the data.

Hereafter, the variational Bayes approach that uses the variational Bayes algorithms introduced herein together with the DIC to select the model (i.e., covariance structure) will be referred to as the VB-DIC approach.

2.5 Performance Assessment

The adjusted Rand index (ARI; Hubert and Arabie 1985) is used to assess the performance of the clustering techniques applied in Section 3. The Rand index (Rand 1971) is based on the pairwise agreement between two partitions, e.g., predicted and true classifications. The ARI corrects the Rand index to account for agreement by chance: a value of 1 indicates perfect agreement, the expected value under random class assignment is 0, and negative values indicate a classification that is worse than would be expected by guessing.
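For illustration, the ARI can be computed with the adjustedRandIndex function in the mclust package; the label vectors below are toy values:

    library(mclust)
    truth <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
    pred  <- c(1, 1, 2, 2, 2, 2, 3, 3, 1)
    adjustedRandIndex(pred, truth)      # imperfect agreement
    adjustedRandIndex(truth, truth)     # perfect agreement gives 1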

3 Results

3.1 Simulation Study 1

The VB-DIC approach is run on 50 simulated two-dimensional Gaussian datasets with three components and known means and covariance structures Σg = λgId (VII; see Table 3 for the λg values). For each dataset, we use five random starts for each of the 12 members of the GPCM family, and we set the maximum number of components to ten each time. For each dataset, the model with the smallest DIC is selected as the final model. A G = 3 component model is selected on 46 of the 50 occasions, with an ARI of 1 each time, while a G = 4 component model is selected on the other four occasions, with an average ARI of 0.96 (standard deviation 0.044).
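A sketch of this data-generating scheme is given below; the component means and λg values shown are placeholders for illustration only (the values actually used are those reported in Table 3):

    set.seed(3)
    n_g    <- c(100, 100, 100)                      # observations per component (illustrative)
    mu     <- list(c(0, 0), c(4, 4), c(-4, 4))      # placeholder component means
    lambda <- c(0.5, 1.0, 2.0)                      # placeholder volumes, Sigma_g = lambda_g * I_2
    y <- do.call(rbind, lapply(1:3, function(g)
      matrix(rnorm(2 * n_g[g], sd = sqrt(lambda[g])), ncol = 2) +
        matrix(mu[[g]], n_g[g], 2, byrow = TRUE)))
    labels <- rep(1:3, n_g)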

Table 3 Summary of the average and standard errors of the estimated parameters from the cases where a G = 3 component VII model is selected in Simulation Study 1

A VII model is selected on 47 of the 50 occasions, with VEE and VVV models selected twice and once, respectively. When VEE or VVV is selected, the average difference between the DIC value of the selected model and that of VII is 1.553, with a standard deviation of 1.984 (the differences range from 0.470 to 3.049). Thus, although a model other than VII is selected in these cases, the chosen model has a DIC value similar to that of the VII model in each case. In all, there are 43 cases where a G = 3 component VII model is selected, and the true and average estimated values (with standard deviations) for μg and λg in these cases are given in Table 3; in all cases, the estimates are very close to the true values.

One advantage of using a variational Bayes approach is that, at every iteration, the hyperparameters of the variational posterior are updated to further minimize the Kullback-Leibler divergence between the approximate variational posterior density and the true posterior density. Hence, 95% credible intervals can be constructed from the variational posterior distributions of the component means μg and the component precision parameters \({1}/{{\sigma _{g}^{2}}}\) for each run (see Fig. 1). A Bayesian credible interval provides an interval within which the unobserved parameter value falls with a given probability. Similar to Wang et al. (2005), we also evaluated the frequentist coverage probability of the intervals, i.e., the proportion of runs in which the true value of the parameter is contained within the credible interval. Across the nine parameters, the mean coverage probability was 0.927 (range 0.860–0.976), which is slightly below the nominal 0.95. Blei et al. (2017) point out that variational inference tends to underestimate the variance of the posterior density.

Fig. 1: 95% credible intervals for the component means μg (top two rows) and the component precision parameter 1/σ2 (bottom row) for the 43 runs where a G = 3 component VII model is selected in Simulation Study 1; vertical lines denote the parameter values used to generate the data.
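For a one-dimensional component, such intervals are simply quantiles of the variational posteriors; a minimal sketch is given below, with arbitrary, purely illustrative hyperparameter values, assuming a gamma variational posterior (shape a, rate b) for the precision and plugging its posterior mean into the scale of the mean's Gaussian posterior:

    m_g <- 1.2; beta_g <- 150                       # Gaussian variational posterior for mu_g
    a_g <- 75;  b_g <- 40                           # gamma variational posterior for 1/sigma_g^2
    tau_hat <- a_g / b_g                            # posterior mean of the precision
    qnorm(c(0.025, 0.975), mean = m_g, sd = sqrt(1 / (beta_g * tau_hat)))  # 95% interval for mu_g
    qgamma(c(0.025, 0.975), shape = a_g, rate = b_g)                       # 95% interval for 1/sigma_g^2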

For completeness, the EM algorithm together with the BIC to select the model (i.e., covariance structure and G), referred to hereafter as the EM-BIC framework, was also applied to these data using the mclust package for R. In all 50 cases, a G = 3 component VII model is chosen, giving perfect classification.

3.2 Simulation Study 2

We ran another simulation study with 50 datasets, each generated from a three-component, three-dimensional Gaussian mixture with known means and common covariance structure \(\boldsymbol {\Sigma }_{g}=\boldsymbol {\Sigma }=\lambda \mathbf {D} \mathbf {A} \mathbf {D}^{\prime }\). Again, five runs with different random starts are used and the maximum number of components is set to ten. For 41 of the 50 datasets, a three-component model is selected by the VB-DIC approach. In these 41 cases, an EEE model is selected 39 times and an EEV model is selected twice, with an average ARI of 1.000 (sd 0.001). In the two cases where an EEV model is selected, the differences in DIC between the EEV and EEE models are 3.256 and 8.095, respectively, indicating that the two models are close in their fits. Four- and five-component models are selected for eight datasets and one dataset, respectively, with an average ARI of 0.923 (sd 0.097). The true and estimated mean parameters using VB-DIC for the EEE model are given in Table 4, and the true and estimated covariance parameters using VB-DIC for the EEE model are:

$$ \boldsymbol{\Sigma} = \left[\begin{array}{lll} 0.50& 0.35& 0.25 \\ 0.35& 1.00 &0.45\\ 0.25 &0.45& 1.20 \end{array}\right], \quad \hat{\boldsymbol{\Sigma}} = \left[\begin{array}{lll} 0.494 ~(\text{sd}~0.049)& 0.346~(\text{sd}~0.044)& 0.235~(\text{sd}~0.046)\\ 0.346 ~(\text{sd}~0.044)& 0.995~(\text{sd}~0.076)& 0.445~(\text{sd}~0.069)\\ 0.235 ~(\text{sd}~0.046)&0.445~(\text{sd}~0.069)& 1.204~(\text{sd}~0.099) \end{array}\right]. $$
Table 4 Summary of the average and standard errors of the estimated parameters from the 39 of the 50 three-dimensional simulated datasets where an EEE model was selected, along with the true parameters used to generate the data
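A sketch of generating one such dataset with the common covariance matrix Σ shown above is given below; the component means are placeholders, since the true means are those reported in Table 4:

    library(MASS)
    set.seed(4)
    Sigma <- matrix(c(0.50, 0.35, 0.25,
                      0.35, 1.00, 0.45,
                      0.25, 0.45, 1.20), nrow = 3, byrow = TRUE)   # common (EEE) covariance
    mu <- list(c(0, 0, 0), c(3, 3, 3), c(-3, 3, 0))                # placeholder component means
    y <- do.call(rbind, lapply(mu, function(m) mvrnorm(100, m, Sigma)))
    labels <- rep(1:3, each = 100)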

The EM-BIC framework, via mclust, was also used for these data. An EEE model was chosen for all 50 datasets with an average ARI of 1.0 (sd 0.001).

3.3 Clustering of Benchmark Datasets

To demonstrate the performance of the VB-DIC approach, we applied our algorithm to several benchmark datasets and compared its performance with that of the widely used EM-BIC framework, via the mclust package.
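For reference, an EM-BIC fit of the kind used for comparison can be obtained along the following lines (shown here for the Iris data described below; this is only a sketch, as the exact settings behind the reported results are not restated here):

    library(mclust)
    X <- iris[, 1:4]                                 # the four Iris measurements
    fit <- Mclust(X, G = 1:10)                       # BIC selects the covariance structure and G
    summary(fit)
    adjustedRandIndex(fit$classification, iris$Species)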

Crabs Data

The Leptograpsus crabs data set, publicly available in the package MASS (Venables and Ripley 2002) for R, consists of biological measurements on 200 crabs from two species (orange and blue), with 50 males and 50 females of each species. The measurements (in millimeters) are frontal lobe size, rear width, carapace length, carapace width, and body depth. Although this data set has been analyzed quite often in the literature using several different clustering approaches, the correlation among the variables makes it difficult to cluster (Fig. 2). Due to this known issue, we preprocess the data using principal component analysis to convert the correlated variables into uncorrelated principal components (Fig. 2). The VB-DIC approach was then run on these principal components with a maximum of G = 10 components.

Fig. 2: Scatterplot matrices showing the relationships among the variables in the Leptograpsus crabs dataset (left) and among the uncorrelated principal components (right), where the colors/symbols represent the different groups.
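The preprocessing step can be sketched as follows (whether the variables are scaled before the principal component analysis is an assumption here, not something stated above):

    library(MASS)
    data(crabs)
    X <- crabs[, c("FL", "RW", "CL", "CW", "BD")]    # the five morphological measurements
    pc <- prcomp(X, scale. = TRUE)                   # principal components (scaling assumed)
    scores <- pc$x                                   # uncorrelated scores used for clustering
    round(cor(scores), 3)                            # approximately a diagonal matrix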

SRBCT Data

The SRBCT dataset, available in the R package plsgenomics (Boulesteix et al. 2018), contains gene expression data from microarray experiments on small round blue cell tumors (SRBCT) of childhood. It contains measurements on 2,308 genes from 83 samples, comprising 29 cases of Ewing sarcoma (EWS), 11 cases of Burkitt lymphoma (BL), 18 cases of neuroblastoma (NB), and 25 cases of rhabdomyosarcoma (RMS). Note that our proposed variational Bayes algorithm is not designed for high-dimensional, low sample size (i.e., large p, small N) problems. Dang et al. (2015) performed a differential expression analysis on these data using an ANOVA across the known groups and selected the top ten genes, ranked using the resulting p-values, to represent a potential set of measurements that contain information on group identification. Hence, we preprocessed the SRBCT dataset in a similar manner to Dang et al. (2015) and implemented the VB-DIC approach with a maximum of G = 10 components.
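A sketch of this screening step is given below, using a stand-in expression matrix because the exact structure of the plsgenomics data object is not described above; genes are ranked by one-way ANOVA p-values across the known classes and the ten smallest are retained:

    set.seed(5)
    X <- matrix(rnorm(83 * 200), nrow = 83)            # stand-in for the 83 x 2308 expression matrix
    cls <- factor(sample(1:4, 83, replace = TRUE))     # stand-in for the four tumor classes
    pvals <- apply(X, 2, function(gene) anova(lm(gene ~ cls))[["Pr(>F)"]][1])
    top10 <- order(pvals)[1:10]                        # the ten most differentially expressed genes
    X_reduced <- X[, top10]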

Iris Data

The Iris data set available in the R datasets package contains measurements in centimeters of the variables sepal length, sepal width, petal length, and petal width of 50 flowers from each of the three species of Iris: Iris setosa, Iris versicolor, and Iris virginica.

Diabetes Data

The diabetes dataset available in the R package mclust contains measurements on three variables on 145 non-obese adult patients classified into three groups (Normal, Overt and Chemical):

  • glucose: Area under plasma glucose curve after a three hour oral glucose tolerance test.

  • insulin: Area under plasma insulin curve after a three hour oral glucose tolerance test.

  • sspg: Steady state plasma glucose.

Banknote Data

The banknote dataset, available in the R package mclust, contains six measurements of 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes. Measurements are available for the following variables:

  • Length: Length of the bill in mm.

  • Left: Width of left edge in mm.

  • Right: Width of right edge in mm.

  • Bottom: Bottom margin width in mm.

  • Top: Top margin width in mm.

  • Diagonal: Length of diagonal in mm.

A summary of the performance of the VB-DIC approach and the EM-BIC approach is given in Table 5, where the ARI of the approach that gives the better performance is in italics. For three of the five benchmark datasets, our VB-DIC approach outperforms the EM-BIC framework as implemented via mclust. For one of the five datasets, the VB-DIC approach gives the same ARI as the EM-BIC framework and, for the remaining dataset, the EM-BIC framework yields a slightly larger ARI than the VB-DIC approach.

Table 5 Summary of the performance of VB-DIC approach on the benchmark datasets along with the performance using the EM-BIC framework

4 Discussion

A variational Bayes approach to parameter estimation for the well-known GPCM family has been proposed. As stated before, an advantage of a variational Bayes algorithm is that, because the hyperparameters of the approximating posterior densities are updated at every iteration, we are updating the approximating variational posterior density of each parameter rather than a point estimate of each parameter, as in an EM framework. This leads to a natural framework for extracting interval estimates (i.e., credible intervals) on every run, similar to a fully Bayesian approach, but without the need to construct confidence intervals via bootstrapping as in an EM framework. We also preserve the monotonicity property of the log-likelihood function, as in an EM algorithm, which is lost in a fully Bayesian MCMC-based approach. Additionally, the variational Bayes approach allows parameter estimates and the number of components to be obtained simultaneously. However, a model selection criterion still needs to be used to select the covariance structure. Herein, we used the DIC for the selection of the covariance structure, and so the resulting variational Bayes approach was called the VB-DIC approach. As can be seen from the simulation studies, the correct covariance structure is often selected using the DIC. That said, it may well be the case that another criterion is more suitable for selecting the model (i.e., the covariance structure). Notably, starting values play a different role for variational Bayes than for the EM algorithm: because the former gradually reduces G as the algorithm iterates, the “starting values” for all but the initial G are not the values used to actually start the algorithm. Accordingly, direct comparison of the VB-DIC and EM-BIC approaches is not entirely straightforward.

In the simulation studies, the parameters estimated using variational Bayes approximations were very close to the true parameters (when the correct model was chosen), and excellent classification was obtained using the model selected by the DIC. In many of the simulated and real data analyses, the performance of the VB-DIC approach was very similar to, or the same as, that of the EM-BIC approach. This is not surprising. As noted by McLachlan and Krishnan (2008) and Gelman et al. (2013), the EM algorithm can be thought of as a special case of variational Bayes in which the parameters are partitioned into two parts, ϕ and γ, the approximating distribution of ϕ is required to be a point mass, and the approximating distribution of γ is unconstrained conditional on the last update of ϕ. Across the simulations, the EM-BIC framework outperformed the VB-DIC approach; however, the VB-DIC approach outperformed the EM-BIC framework on three of the five benchmark real datasets considered.

In summary, we have explored a Bayesian alternative for parameter estimation for the most widely used family of Gaussian mixture models, i.e., the GPCM family. The use of variational Bayes in conjunction with the DIC for a family of mixture models is a novel idea and lends itself nicely to further research. Moreover, the DIC provides an alternative model selection criterion to the almost ubiquitous BIC. There are several possible avenues for further research, one of which is extension to the semi-supervised (e.g., McNicholas 2010) or, more generally, fractionally supervised paradigm (see Vrbik and McNicholas 2015; Gallaugher and McNicholas 2019b). Another avenue is extension to other families of Gaussian mixture models (e.g., the PGMM family of McNicholas and Murphy 2008, 2010) and to non-Gaussian families of mixture models (e.g., Vrbik and McNicholas 2014; Lin et al. 2014). Further consideration is needed vis-à-vis the approach used to select the model (i.e., covariance structure) in the variational Bayes approach, e.g., one could conduct a detailed comparison of VB-DIC and, inter alia, VB-BIC. Finally, an analogous variational Bayes approach could be taken to parameter estimation for mixtures of matrix variate distributions (see, e.g., Viroli 2011; Gallaugher and McNicholas 2018a, b, 2019a).