1 Introduction

Factor analysis (FA) explains relationships among a set of observed variables using a set of latent variables. This is typically achieved by expressing the observed multivariate data as a linear combination of a smaller set of unobserved and uncorrelated variables known as factors. Let \({\varvec{x}} = ({\varvec{x}}_1,\ldots ,{\varvec{x}}_n)\) denote a random sample of p-dimensional observations with \({\varvec{x}}_i\in {\mathbb {R}}^{p}\); \(i= 1,\ldots ,n\). Let \({\mathcal {N}}_p({\varvec{\mu }},{\varvec{{\varSigma }}})\) denote the p-dimensional normal distribution with mean \({\varvec{\mu }}\) and covariance matrix \({\varvec{{\varSigma }}}\) and also denote by \(\mathbf{I }_p\) the \(p\times p\) identity matrix. The following equations summarize the typical FA model.

$$\begin{aligned}&{\varvec{x}}_i = {\varvec{\mu }} + {\varvec{{\varLambda }}} {\varvec{y}}_i + {\varvec{\varepsilon }}_i,\quad i = 1,\ldots ,n \end{aligned}$$
(1)
$$\begin{aligned}&({\varvec{y}}_i, {\varvec{\varepsilon }}_i) \sim {\mathcal {N}}_q({\varvec{0}},\mathbf{I }_q){\mathcal {N}}_p({\varvec{0}},{\varvec{{\varSigma }}}),\quad \text{ iid } \text{ for } i = 1,\ldots ,n \end{aligned}$$
(2)
$$\begin{aligned}&{\varvec{{\varSigma }}} = \hbox {diag}(\sigma _1^2, \ldots ,\sigma _p^2) \end{aligned}$$
(3)
$$\begin{aligned}&{\varvec{x}}_i|{\varvec{y}}_i \sim {\mathcal {N}}_p({\varvec{\mu }}+{\varvec{{\varLambda }}} {\varvec{y}}_i,{\varvec{{\varSigma }}}),\quad \text{ ind. } \text{ for } i = 1,\ldots ,n \end{aligned}$$
(4)

Before proceeding, note that we do not differentiate the notation between random variables and their corresponding realizations. Bold uppercase letters are used for matrices, bold lowercase letters for vectors and normal typeface for scalars.

In Eq. (1), we assume that \({\varvec{x}}_i\) is expressed as a linear combination of a latent vector of factors \({\varvec{y}}_i\in {\mathbb {R}}^{q}\). The \(p\times q\) dimensional matrix \({\varvec{{\varLambda }}} = (\lambda _{rj})\) contains the factor loadings, while \({\varvec{\mu }} = (\mu _1,\ldots ,\mu _p)\) contains the marginal mean of \({\varvec{x}}_i\). The unobserved vector \({\varvec{y}}_i\) lies in a lower-dimensional space, that is, \(q < p\), and it consists of uncorrelated features \(y_{i1}, \ldots ,y_{iq}\) as shown in Eq. (2), where \({\varvec{0}}\) denotes a vector of zeros. Note that the error terms \({\varvec{\varepsilon }}_i\) are independent of \({\varvec{y}}_i\). Furthermore, the errors consist of independent random variables \(\varepsilon _{i1},\ldots ,\varepsilon _{ip}\), as implied by the diagonal covariance matrix \({\varvec{{\varSigma }}}\) in Eq. (3). As shown in Eq. (4), knowledge of the missing data (\({\varvec{y}}_i\)) implies that the conditional distribution of \({\varvec{x}}_i\) has a diagonal covariance matrix. The previous assumptions lead to

$$\begin{aligned} {\varvec{x}}_i \sim {\mathcal {N}}_p({\varvec{\mu }},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}}), \quad \text{ iid } \text{ for } i = 1,\ldots ,n. \end{aligned}$$
(5)

According to Eq. (5), the covariance matrix of the marginal distribution of \({\varvec{x}}_i\) is equal to \({\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}}\). This is the crucial characteristic of factor analytic models: they aim to explain high-dimensional dependencies using a set of lower-dimensional uncorrelated factors (Kim and Mueller 1978; Bartholomew et al. 2011).
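As a concrete illustration of Eqs. (1)–(5), the following R snippet (not part of fabMix; all values are arbitrary) simulates a single factor analyzer and checks that the sample covariance of the simulated data approaches \({\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}}\).

```r
## Illustrative simulation from the FA model in Eqs. (1)-(3) (arbitrary values):
## the sample covariance of x should approach Lambda Lambda' + Sigma, cf. Eq. (5).
set.seed(1)
n <- 10000; p <- 5; q <- 2
mu     <- rnorm(p)                                   # marginal mean
Lambda <- matrix(rnorm(p * q), nrow = p, ncol = q)   # factor loadings
Sigma  <- diag(runif(p, 0.5, 1.5))                   # diagonal error covariance, Eq. (3)
y   <- matrix(rnorm(n * q), nrow = n)                             # y_i ~ N_q(0, I_q)
eps <- matrix(rnorm(n * p), nrow = n) %*% diag(sqrt(diag(Sigma))) # eps_i ~ N_p(0, Sigma)
x   <- sweep(y %*% t(Lambda) + eps, 2, mu, "+")                   # Eq. (1)
max(abs(cov(x) - (Lambda %*% t(Lambda) + Sigma)))    # close to zero for large n
```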

Mixtures of Factor Analyzers (MFA) generalize the typical FA model by assuming that Eq. (5) becomes

$$\begin{aligned} {\varvec{x}}_i \sim \sum _{k = 1}^{K}w_k {\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + {\varvec{{\varSigma }}}_{k}), \text{ iid } i = 1,\ldots ,n \end{aligned}$$
(6)

where K denotes the number of mixture components. The vector of mixing proportions \({\varvec{w}} := (w_1,\ldots ,w_K)\) contains the weight of each component, with \(0\leqslant w_k\leqslant 1\); \(k = 1,\ldots ,K\) and \(\sum _{k=1}^{K}w_k = 1\). Note that the mixture components are characterized by different parameters \({\varvec{\mu }}_k,{\varvec{{\varLambda }}}_k,{\varvec{{\varSigma }}}_k\), \(k = 1,\ldots ,K\). Thus, MFAs are particularly useful when the observed data exhibit heterogeneity: the approach aims to capture the behavior of each cluster by a separate component of the mixture model. A comprehensive perspective on the history and development of MFA models is given in Chapter 3 of the monograph by McNicholas (2016).

Early works applying the expectation–maximization (EM) algorithm (Dempster et al. 1977) to estimate MFA models include Ghahramani et al. (1996), Tipping and Bishop (1999) and McLachlan and Peel (2000). McNicholas and Murphy (2008), McNicholas and Murphy (2010) introduced the family of parsimonious Gaussian mixture models (PGMMs) by considering the case where the factor loadings and/or error variance may or may not be shared between the mixture components. These models are estimated by the alternating expectation–conditional maximization algorithm (Meng and Van Dyk 1997) and have superior performance compared to other approaches (McNicholas and Murphy 2008). Under a Bayesian setup, Fokoué and Titterington (2003) estimate the number of mixture components and factors by simulating a continuous-time stochastic birth–death point process using a birth–death MCMC algorithm (Stephens 2000). More recently, Papastamoulis (2018b) estimated Bayesian MFA models with an unknown number of components using overfitting mixtures.

In recent years, there has been growing progress in the use of overfitting mixture models in Bayesian analysis (Rousseau and Mengersen 2011; van Havre et al. 2015; Malsiner Walli et al. 2016, 2017; Frühwirth-Schnatter and Malsiner-Walli 2019). An overfitting mixture model consists of a number of components which is much larger than its true (and unknown) value. Under suitable prior assumptions (see “Appendix A”) introduced by Rousseau and Mengersen (2011), it has been shown that asymptotically the redundant components receive zero posterior weight, forcing the posterior distribution to put all its mass on the sparsest way to approximate the true density. Therefore, inference on the number of mixture components can be based on the posterior distribution of the “alive” components of the overfitted model, that is, the components which contain at least one allocated observation.

Other Bayesian approaches to estimate the number of components in a mixture model include the reversible jump MCMC (RJMCMC) (Green 1995; Richardson and Green 1997; Dellaportas and Papageorgiou 2006; Papastamoulis and Iliopoulos 2009), birth–death MCMC (BDMCMC) (Stephens 2000) and allocation sampling (Nobile and Fearnside 2007; Papastamoulis and Rattray 2017) algorithms. However, overfitting mixture models are straightforward to implement, while the other approaches require either careful design of various move types that bridge models with different numbers of clusters, or analytical integration of parameters.

The overall message is that there is a need for developing an efficient Bayesian method that will combine the previously mentioned frequentist advances on parsimonious representations of MFAs and the flexibility provided by the Bayesian viewpoint. This study aims at filling this gap by extending the Bayesian method of Papastamoulis (2018b) to the family of parsimonious Gaussian mixtures of McNicholas and Murphy (2008). Furthermore, we illustrate the proposed method using the R (Ihaka and Gentleman 1996; R Core Team 2016) package fabMix (Papastamoulis 2018a) available as a contributed package from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=fabMix. The proposed method efficiently deals with many inferential problems (see, e.g., Celeux et al. (2000a)) related to mixture posterior distributions, such as (i) inferring the number of non-empty clusters using overfitting models, (ii) efficient exploration of the posterior surface by running parallel heated chains and (iii) incorporating advanced techniques that successfully deal with the label switching issue (Papastamoulis 2016).

The rest of the paper is organized as follows: Section 2 reviews the basic concepts of parsimonious MFAs. Identifiability problems and corresponding treatments are detailed in Sect. 2.1. The Bayesian model is introduced in Sect. 2.2. Section 3 presents the full conditional posterior distributions of the model. The MCMC algorithm is described in Sect. 3.2. A detailed presentation of the main function of the contributed R package is given in Sect. 4. Our method is illustrated and compared to similar models estimated by the EM algorithm in Sects. 5.1 and 5.2 using an extended simulation study and four publicly available datasets, respectively. We conclude in Sect. 6 with a summary of our findings and directions for further research. The Appendix contains further discussion of overfitting mixture models (“Appendix A”), details of the MCMC sampler (“Appendix B”) and additional simulation results (“Appendix C”).

2 Parsimonious mixtures of factor analyzers

Consider the latent allocation variables \(z_i\) which assign observation \({\varvec{x}}_i\) to a component \(k =1,\ldots ,K\) for \(i = 1,\ldots ,n\). A priori each observation is generated from component k with probability equal to \(w_k\), that is,

$$\begin{aligned} \mathrm {P}(z_i = k) = w_k,\quad k = 1,\ldots ,K, \end{aligned}$$
(7)

independent for \(i = 1,\ldots ,n\). Note that the allocation vector \({\varvec{z}} := (z_1,\ldots ,z_n)\) is not observed, so it should be treated as missing data. We assume that \(z_i\) and \({\varvec{y}}_i\) are independent; thus, Eq. (2) is now written as:

$$\begin{aligned} ({\varvec{y}}_i, {\varvec{\varepsilon }}_i|z_i = k)\sim {\mathcal {N}}_q({\varvec{0}},\mathbf{I }_q){\mathcal {N}}_p({\varvec{0}},{\varvec{{\varSigma }}}_{k}), \end{aligned}$$
(8)

and conditional on the cluster membership and latent factors, we obtain that

$$\begin{aligned} ({\varvec{x}}_i|z_i = k,{\varvec{y}}_i) \sim {\mathcal {N}}_p({\varvec{\mu }}_{k}+{\varvec{{\varLambda }}}_{k} {\varvec{y}}_i,{\varvec{{\varSigma }}}_{k}). \end{aligned}$$
(9)

Consequently,

$$\begin{aligned} ({\varvec{x}}_i|z_i = k) \sim {\mathcal {N}}_p({\varvec{\mu }}_{k}, {\varvec{{\varLambda }}}_{k} {\varvec{{\varLambda }}}_{k}^{T}+ {\varvec{{\varSigma }}}_{k}), \end{aligned}$$
(10)

independent for \(i=1,\ldots ,n\). From Eqs. (7) and (10), we derive that the marginal distribution of \({\varvec{x}}_i\) is the finite mixture model in Eq. (6).

Following McNicholas and Murphy (2008), the factor loadings and/or the error variance may be common to all K components in Eq. (6). If the factor loadings are constrained, then:

$$\begin{aligned} {\varvec{{\varLambda }}}_1=\cdots ={\varvec{{\varLambda }}}_K = {\varvec{{\varLambda }}}. \end{aligned}$$
(11)

If the error variance is constrained, then:

$$\begin{aligned} {\varvec{{\varSigma }}}_1=\cdots ={\varvec{{\varSigma }}}_K = {\varvec{{\varSigma }}}. \end{aligned}$$
(12)

Furthermore, the error variance may be isotropic (i.e., proportional to the identity matrix) or not; in the isotropic case, and depending on whether constraint (12) is disabled or enabled, we have:

$$\begin{aligned} {\varvec{{\varSigma }}}_k&= \sigma ^2_k \mathbf{I }_p; \quad k = 1,\ldots ,K \quad \text{ or } \end{aligned}$$
(13)
$$\begin{aligned} {\varvec{{\varSigma }}}_k&= \sigma ^2 \mathbf{I }_p; \quad k = 1,\ldots ,K . \end{aligned}$$
(14)

We note that under constraint (13), the model is referred to as a mixture of probabilistic principal component analyzers (Tipping and Bishop 1999).

Depending on whether a particular constraint is present or not, the following set of eight parameterizations arises.

$$\begin{aligned} \text{ UUU: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + {\varvec{{\varSigma }}}_{k})\\ \text{ UCU: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + {\varvec{{\varSigma }}})\\ \text{ UUC: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + \sigma ^2_k\mathbf{I }_p)\\ \text{ UCC: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + \sigma ^2\mathbf{I }_p)\\ \text{ CUU: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}}_{k})\\ \text{ CCU: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}})\\ \text{ CUC: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + \sigma ^2_k\mathbf{I }_p)\\ \text{ CCC: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + \sigma ^2\mathbf{I }_p) \end{aligned}$$

independent for \(i = 1,\ldots ,n\). Following the pgmm nomenclature (McNicholas and Murphy 2008): the first, second and third letters denote whether \({\varvec{{\varLambda }}}_k\), \({\varvec{{\varSigma }}}_k = {\mathrm {diag}}(\sigma ^2_{k1}, \ldots ,\sigma ^2_{kp})\) and \(\sigma ^2_{kj}\), \(k=1,\ldots ,K\); \(j=1,\ldots ,p\), are constrained (C) or unconstrained (U), respectively. A novelty of the present study is to offer a Bayesian framework for estimating the whole family of the previous parameterizations (note that Papastamoulis (2018b) estimated the UUU and UCU parameterizations).
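The following R sketch makes this naming convention concrete; it is purely illustrative and does not correspond to the fabMix or pgmm implementation. Given a three-letter code, it builds the K component covariance matrices \({\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + {\varvec{{\varSigma }}}_{k}\) from unconstrained building blocks.

```r
## Illustrative helper (not the fabMix/pgmm implementation) building the component
## covariance matrices implied by a three-letter code such as "CUC".
component_covariances <- function(code, Lambda, sigma2) {
  ## Lambda: list of K (p x q) loading matrices; sigma2: K x p matrix of error variances
  K <- length(Lambda); p <- ncol(sigma2)
  for (k in seq_len(K)) {
    if (substr(code, 1, 1) == "C") Lambda[[k]] <- Lambda[[1]]  # common loadings, Eq. (11)
    if (substr(code, 2, 2) == "C") sigma2[k, ] <- sigma2[1, ]  # common error variances, Eq. (12)
    if (substr(code, 3, 3) == "C") sigma2[k, ] <- sigma2[k, 1] # isotropic errors, Eqs. (13)-(14)
  }
  lapply(seq_len(K), function(k) Lambda[[k]] %*% t(Lambda[[k]]) + diag(sigma2[k, ], nrow = p))
}
```

For instance, under the code "CCC" all K returned matrices coincide and have the form \({\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + \sigma ^2\mathbf{I }_p\), whereas under "UUU" every component has its own loadings and error variances.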

2.1 Label switching and other identifiability problems

Let \(L({\varvec{w}}, {\varvec{\theta }}, {\varvec{\phi }}|{\varvec{x}}) = \prod _{i=1}^{n}\sum _{k=1}^{K}w_kf(x_i|\theta _k, {\varvec{\phi }})\), \(({\varvec{w}}, {\varvec{\theta }},{\varvec{\phi }})\in {\mathcal {P}}_{K-1}\times {\varTheta }^{K}\times {\varPhi }\) denote the likelihood function of a mixture of K densities, where \({\mathcal {P}}_{K-1}\) denotes the parameter space of the mixing proportions \({\varvec{w}}\), \({\varvec{\theta }} = ({\varvec{\theta }}_1,\ldots ,{\varvec{\theta }}_K)\) are the component-specific parameters and \({\varvec{\phi }}\) denotes a (possibly empty) collection of parameters that are common between all components. For instance, consider the UCU parameterization where \({\varvec{\theta }}_k = ({\varvec{\mu }}_k, {\varvec{{\varLambda }}}_k)\) for \(k = 1,\ldots ,K\) and \({\varvec{\phi }} = {\varvec{{\varSigma }}}\). For any permutation \(\tau =(\tau _1,\ldots ,\tau _K)\) of the set \(\{1,\ldots ,K\}\), the likelihood of mixture models is invariant to permutations of the component labels: \(L({\varvec{w}}, {\varvec{\theta }}, {\varvec{\phi }}|{\varvec{x}}) = L(\tau {\varvec{w}}, \tau {\varvec{\theta }}, {\varvec{\phi }}|{\varvec{x}})\). Thus, the likelihood surface of a mixture model with K components will exhibit K! symmetric areas. If \(({\varvec{w}}^{*},{\varvec{\theta }}^{*},{\varvec{\phi }}^{*})\) corresponds to a mode of the likelihood, the same will hold for any permutation \((\tau {\varvec{w}}^{*}, \tau {\varvec{\theta }}^{*},{\varvec{\phi }}^{*})\).

Label switching (Redner and Walker 1984) is the term commonly used to describe this phenomenon. Under a Bayesian point of view, in the case that the prior distribution is also invariant to permutations (which is typically the case, see, e.g., Marin et al. (2005), Papastamoulis and Iliopoulos (2013)), the same invariance property will also hold for the posterior distribution \(f({\varvec{w}}, {\varvec{\theta }}, {\varvec{\phi }}|{\varvec{x}})\). Consequently, the marginal posterior distributions of the mixing proportions and of the component-specific parameters will coincide, i.e., \(f(w_1|{\varvec{x}}) = \cdots = f(w_K|{\varvec{x}})\) and \(f(\theta _1|{\varvec{x}}) = \cdots = f(\theta _K|{\varvec{x}})\). Thus, when approximating the posterior distribution via MCMC sampling, the standard practice of using ergodic averages to estimate quantities of interest (such as the mean of the marginal posterior distribution of each parameter) becomes meaningless. In order to deal with this identifiability problem, we post-process the simulated MCMC output using a deterministic relabeling algorithm, namely the Equivalence Classes Representatives (ECR) algorithm (Papastamoulis and Iliopoulos 2010; Papastamoulis 2014), as implemented in the R package label.switching (Papastamoulis 2016).
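A small numerical illustration of this invariance is given below (for a univariate two-component Gaussian mixture, chosen only for simplicity): permuting the component labels leaves the log-likelihood unchanged.

```r
## Numerical check of the label switching invariance described above; the values
## are arbitrary and the example is purely illustrative.
loglik <- function(x, w, mu, sd) {
  sum(log(sapply(x, function(xi) sum(w * dnorm(xi, mu, sd)))))
}
set.seed(2)
x   <- c(rnorm(50, -2), rnorm(50, 2))
w   <- c(0.3, 0.7); mu <- c(-2, 2); sd <- c(1, 1.5)
tau <- c(2, 1)                                               # a permutation of the labels {1, 2}
loglik(x, w, mu, sd) - loglik(x, w[tau], mu[tau], sd[tau])   # exactly zero
```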

A second source of identifiability problems is related to orthogonal transformations of the matrix of factor loadings. A popular practice (Geweke and Zhou 1996; Fokoué and Titterington 2003; Mavridis and Ntzoufras 2014; Papastamoulis 2018b) to overcome this issue is to pre-assign values to some entries of \({\varvec{{\varLambda }}}\); in particular, we set the entries above the main diagonal of the first \(q\times q\) block of \({\varvec{{\varLambda }}}\) equal to zero:

$$\begin{aligned} {\varvec{{\varLambda }}} = \begin{pmatrix} \lambda _{11} &{} 0 &{} \cdots &{} 0\\ \lambda _{21} &{} \lambda _{22} &{} \cdots &{} 0\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \lambda _{q1} &{} \lambda _{q2} &{} \cdots &{} \lambda _{qq}\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \lambda _{p1} &{} \lambda _{p2} &{} \cdots &{} \lambda _{pq} \end{pmatrix}. \end{aligned}$$
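For illustration, this constraint can be imposed on a generic \(p\times q\) matrix as follows (arbitrary values, not taken from the package):

```r
## Zero the entries above the diagonal of the first q x q block of a generic
## p x q loadings matrix.
p <- 6; q <- 3
Lambda <- matrix(rnorm(p * q), nrow = p, ncol = q)
for (r in 1:(q - 1)) Lambda[r, (r + 1):q] <- 0
round(Lambda, 2)
```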

Another problem is related to the so-called sign switching phenomenon, see, e.g., Conti et al. (2014). Simultaneously switching the signs of a given column \(\ell \) of \({\varvec{{\varLambda }}}\); \(\ell = 1,\ldots ,q\) and of the corresponding coordinate \(y_{i\ell }\) of \({\varvec{y}}_i\); \(i = 1,\ldots ,n\) does not alter the likelihood. Thus, \({\varvec{{\varLambda }}}\) and \({\varvec{y}}_i\); \(i = 1, \ldots ,n\) are not marginally identifiable due to sign switching across the MCMC trace. However, this is not a problem in our implementation, since all parameters of the marginal density of \({\varvec{x}}_i\) in (6) are identified (see also the discussion of sign-invariant parametric functions in Papastamoulis (2018b)).

Parameter-expanded approaches are preferred in the recent literature (Bhattacharya and Dunson 2011; McParland et al. 2017), because the mixing of the MCMC sampler is improved. In our implementation, we are able to obtain excellent mixing using the popular approach of restricting elements of \({\varvec{{\varLambda }}}\): the reader is referred to Figure 2 of Papastamoulis (2018b), where it is obvious that our MCMC sampler has the ability to rapidly move between the multiple modes of the target posterior distribution of \({\varvec{{\varLambda }}}\) (more details on convergence diagnostics are also presented in “Appendix A.4” of Papastamoulis (2018b)).

2.2 Prior assumptions

We assume that the number of mixture components (K) has a sufficiently large value so that it overestimates the “true” number of clusters. Unless otherwise stated, the default choice is \(K = 20\). All prior assumptions of the overfitting mixture models are discussed in detail in Papastamoulis (2018b). For ease of presentation, we repeat them in this section. Let \( {\mathcal {D}}(\cdots )\) denote the Dirichlet distribution, and \( {\mathcal {G}}(\alpha ,\beta )\) denote the Gamma distribution with mean \(\alpha /\beta \). Let also \({\varvec{{\varLambda }}}_{kr\cdot }\) denote the r-th row of the matrix of factor loadings \({\varvec{{\varLambda }}}_k\); \(k = 1,\ldots ,K\); \(r = 1,\ldots ,p\). The following prior assumptions are imposed on the model parameters:

$$\begin{aligned}&{\varvec{w}} \sim {\mathcal {D}}\left( \gamma ,\ldots ,\gamma \right) , \quad \gamma = \frac{1}{K} \end{aligned}$$
(15)
$$\begin{aligned}&{\varvec{\mu }}_k \sim {\mathcal {N}}_p({\varvec{\xi }}, {\varvec{{\varPsi }}}), \quad \text{ iid } \text{ for } k = 1,\ldots ,K \end{aligned}$$
(16)
$$\begin{aligned}&{\varvec{{\varLambda }}}_{kr\cdot } \sim {\mathcal {N}}_{\nu _r}({\varvec{0}},{\varvec{{\varOmega }}}), \quad \text{ iid } \text{ for } k = 1,\ldots ,K; r = 1,\ldots ,p \end{aligned}$$
(17)
$$\begin{aligned}&\sigma _{kr}^{-2} \sim {\mathcal {G}}(\alpha ,\beta ), \quad \text{ iid } \text{ for } k = 1,\ldots ,K; r = 1,\ldots ,p \end{aligned}$$
(18)
$$\begin{aligned}&\omega _{\ell }^{-2} \sim {\mathcal {G}}(g,h), \quad \text{ iid } \text{ for } \ell = 1,\ldots ,q \end{aligned}$$
(19)

where all variables are assumed mutually independent and \(\nu _r =\min \{r,q\}\); \(r=1,\ldots ,p\); \(\ell = 1,\ldots ,q\); \(k=1,\ldots ,K\). In Eq. (17), \({\varvec{{\varOmega }}} = \hbox {diag}(\omega _1^2,\ldots ,\omega _q^2)\) denotes a \(q\times q\) diagonal matrix, where the diagonal entries are distributed independently according to Eq. (19). A graphical representation of the hierarchical model is given in Figure 1 of Papastamoulis (2018b). The default values of the remaining fixed hyper-parameters are given in “Appendix B”.

The previous assumptions refer to the case of the unconstrained parameter space, that is, the UUU parameterization. Clearly, they should be modified accordingly when a constrained model is used. Under constraint (11), the prior distribution in Eq. (17) becomes \({\varvec{{\varLambda }}}_{r\cdot } \sim {\mathcal {N}}_{\nu _r}({\varvec{0}},{\varvec{{\varOmega }}})\), independent for \(r = 1,\ldots ,p\). Under constraint (12), the prior distribution in Eq. (18) becomes \(\sigma _{r}^{-2} \sim {\mathcal {G}}(\alpha ,\beta )\), independent for \(r = 1,\ldots ,p\), while under constraint (13) it becomes \(\sigma _{k}^{-2} \sim {\mathcal {G}}(\alpha ,\beta )\), independent for \(k = 1,\ldots ,K\). Finally, under constraint (14), the prior distribution in Eq. (18) becomes \(\sigma ^{-2} \sim {\mathcal {G}}(\alpha ,\beta )\).
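The sketch below draws one set of parameters from the priors (15)–(19) for the unconstrained case. The values of \({\varvec{\xi }}\), \({\varvec{{\varPsi }}}\) and the remaining hyper-parameters are assumptions for illustration (the package defaults are given in “Appendix B”), and when \(\nu _r < q\) the covariance of row r is taken to be the leading \(\nu _r\times \nu _r\) block of \({\varvec{{\varOmega }}}\).

```r
## Illustrative draw from the priors (15)-(19), UUU case. Hyper-parameter values are
## assumptions for this sketch: alpha = beta = g = h = 0.5, gamma = 1/K, xi = 0, Psi = I_p.
set.seed(3)
K <- 20; p <- 10; q <- 2
gamma <- 1 / K
xi <- rep(0, p); Psi <- diag(p)
w      <- rgamma(K, shape = gamma); w <- w / sum(w)   # Dirichlet(gamma,...,gamma) via normalized Gammas
omega2 <- 1 / rgamma(q, shape = 0.5, rate = 0.5)      # Eq. (19): omega_l^{-2} ~ Gamma(g, h)
sigma2 <- matrix(1 / rgamma(K * p, 0.5, 0.5), K, p)   # Eq. (18): sigma_{kr}^{-2} ~ Gamma(alpha, beta)
mu     <- MASS::mvrnorm(K, mu = xi, Sigma = Psi)      # Eq. (16): mu_k ~ N_p(xi, Psi)
Lambda <- lapply(1:K, function(k) {                   # Eq. (17): row r ~ N_{nu_r}(0, Omega), nu_r = min(r, q)
  L <- matrix(0, p, q)
  for (r in 1:p) {
    nu <- min(r, q)
    L[r, 1:nu] <- rnorm(nu, mean = 0, sd = sqrt(omega2[1:nu]))
  }
  L
})
```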

3 Inference

This section describes the full conditional posterior distributions of model parameters and the corresponding MCMC sampler. Due to conjugacy, all full conditional posterior distributions are available in closed forms.

3.1 Full conditional posterior distributions

Let us define the following quantities:

$$\begin{aligned}&n_k = \sum _{i=1}^{n}I(z_i = k)\\&{\varvec{A}}_k = n_k{\varvec{{\varSigma }}}_k^{-1} + {\varvec{{\varPsi }}}^{-1} \\&{\varvec{B}}_k = {\varvec{{\varSigma }}}_k^{-1}\sum _{i=1}^{n}I(z_i=k)\left( {\varvec{x}}_i - {\varvec{{\varLambda }}}_k{\varvec{y}}_i\right) + {\varvec{{\varPsi }}}^{-1}{\varvec{\xi }}\\&{\varvec{\tau }}_{kr} = \frac{\sum _{i=1}^{n}I(z_i = k)(x_{ir} - \mu _{kr}){\varvec{y}}_i}{\sigma _{kr}^2}\\&{\varvec{{\varDelta }}}_{kr} = \frac{\sum _{i=1}^{n}I(z_i = k){\varvec{y}}_i{\varvec{y}}_i^{T}}{\sigma _{kr}^2}\\&s_{kr} = \sum _{i=1}^{n}I(z_i = k)\left( x_{ir} - \mu _{kr} - {\varvec{{\varLambda }}}_{kr\cdot }{\varvec{y}}_i\right) ^2\\&{\varvec{T}} = \sum _{k=1}^{K}\sum _{r=1}^{p}{\varvec{{\varLambda }}}_{kr\cdot }{\varvec{{\varLambda }}}_{kr\cdot }^{T}\\&{\varvec{M}}_k = \mathbf{I }_q + {\varvec{{\varLambda }}}^{T}_k{\varvec{{\varSigma }}}^{-1}_k{\varvec{{\varLambda }}}_k \end{aligned}$$

for \(k = 1,\ldots ,K\); \(r = 1,\ldots ,p\). For a generic sequence of the form \(\{G_{rc}; r \in {\mathcal {R}}, c\in {\mathcal {C}}\}\), we also define \(G_{\bullet c}=\sum _{r}G_{rc}\) and \(G_{r\bullet }=\sum _{c}G_{rc}\). Finally, \((x|\cdots )\) denotes the conditional distribution of x given the value of all remaining variables.

From Eqs. (6) and (7), it immediately follows that for \(k=1,\ldots ,K\)

$$\begin{aligned} \mathrm {P}(z_i = k|\cdots ) \propto w_kf\left( {\varvec{x}}_i;{\varvec{\mu }}_k,{\varvec{{\varLambda }}}_k{\varvec{{\varLambda }}}_k^{T} + {\varvec{{\varSigma }}}_k\right) , \end{aligned}$$
(20)

independent for \(i =1,\ldots ,n\), where \(f(\cdot ;{\varvec{\mu }},{\varvec{{\varSigma }}})\) denotes the probability density function of the multivariate normal distribution with mean \({\varvec{\mu }}\) and covariance matrix \({\varvec{{\varSigma }}}\). Note that in order to compute the right-hand side of the last equation, inversion of the \(p\times p\) matrix \({\varvec{{\varLambda }}}_k{\varvec{{\varLambda }}}_k^{T} + {\varvec{{\varSigma }}}_k\) is required. Using the Sherman–Morrison–Woodbury formula (see, e.g., Hager (1989)), the inverse matrix is equal to \({\varvec{{\varSigma }}}_k^{-1} - {\varvec{{\varSigma }}}_k^{-1}{\varvec{{\varLambda }}}_k {\varvec{M}}_{k}^{-1}{\varvec{{\varLambda }}}_k^T{\varvec{{\varSigma }}}^{-1}_k\), for \(k = 1,\ldots ,K\). The full conditional posterior distribution of mixing proportions is a Dirichlet distribution with parameters

$$\begin{aligned} {\varvec{w}}|\cdots \sim {\mathcal {D}}(\gamma + n_1,\ldots ,\gamma +n_K). \end{aligned}$$
(21)

The full conditional posterior distribution of the marginal mean per component is

$$\begin{aligned} {\varvec{\mu }}_k|\cdots \sim {\mathcal {N}}_p\left( {\varvec{A}}_k^{-1}{\varvec{B}}_k, {\varvec{A}}_k^{-1} \right) , \end{aligned}$$
(22)

independent for \(k = 1,\ldots ,K\).
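As an illustration of Eq. (22), the following R sketch (not the fabMix implementation) draws \({\varvec{\mu }}_k\) for a single component from its full conditional, using the quantities \({\varvec{A}}_k\) and \({\varvec{B}}_k\) defined above.

```r
## Sketch of the draw in Eq. (22) for one component k; x is n x p, y is n x q and
## z is the allocation vector. Sigma_k_inv and Psi_inv are (p x p) precision matrices.
draw_mu_k <- function(x, y, z, k, Lambda_k, Sigma_k_inv, xi, Psi_inv) {
  idx <- which(z == k)
  n_k <- length(idx)
  A_k <- n_k * Sigma_k_inv + Psi_inv
  resid_sum <- colSums(x[idx, , drop = FALSE] - y[idx, , drop = FALSE] %*% t(Lambda_k))
  B_k <- Sigma_k_inv %*% resid_sum + Psi_inv %*% xi
  A_inv <- solve(A_k)
  MASS::mvrnorm(1, mu = as.vector(A_inv %*% B_k), Sigma = A_inv)
}
```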

The full conditional posterior distribution of the factor loadings without any restriction is

$$\begin{aligned} {\varvec{{\varLambda }}}_{kr\cdot }|\cdots \sim {\mathcal {N}}_{\nu _r}\left( \left[ {\varvec{{\varOmega }}}^{-1} + {\varvec{{\varDelta }}}_{kr}\right] ^{-1}{\varvec{\tau }}_{kr},\left[ {\varvec{{\varOmega }}}^{-1} + {\varvec{{\varDelta }}}_{kr}\right] ^{-1}\right) , \end{aligned}$$
(23)

independent for \(k = 1,\ldots ,K\); \(r = 1,\ldots ,p\). Under constraint (11), we obtain that

$$\begin{aligned} {\varvec{{\varLambda }}}_{r\cdot }|\cdots \sim {\mathcal {N}}_{\nu _r}\left( \left[ {\varvec{{\varOmega }}}^{-1} + {\varvec{{\varDelta }}}_{\bullet r}\right] ^{-1}{\varvec{\tau }}_{\bullet r},\left[ {\varvec{{\varOmega }}}^{-1} + {\varvec{{\varDelta }}}_{\bullet r}\right] ^{-1}\right) , \end{aligned}$$
(24)

independent for \(r = 1,\ldots ,p\).

The full conditional distribution of error variance without any restriction is

$$\begin{aligned} \sigma _{kr}^{-2}|\cdots \sim {\mathcal {G}}\left( \alpha + n_k/2,\beta + s_{kr}/2\right) , \end{aligned}$$
(25)

independent for \(k = 1,\ldots ,K\); \(r = 1,\ldots ,p\). Under constraint (12), we obtain that

$$\begin{aligned} \sigma _{r}^{-2}|\cdots \sim {\mathcal {G}}(\alpha + n/2,\beta + s_{\bullet r}/2), \end{aligned}$$
(26)

independent for \(r = 1,\ldots ,p\). Under constraint (13), we obtain that

$$\begin{aligned} \sigma _{k}^{-2}|\cdots \sim {\mathcal {G}}(\alpha + n_kp/2,\beta + s_{k\bullet }/2), \end{aligned}$$
(27)

independent for \(k = 1,\ldots ,K\). Under constraint (14), we obtain that

$$\begin{aligned} \sigma ^{-2}|\cdots \sim {\mathcal {G}}(\alpha + np/2,\beta + s_{\bullet \bullet }/2). \end{aligned}$$
(28)

The full conditional distribution of latent factors is given by

$$\begin{aligned} {\varvec{y}}_i|\cdots \sim {\mathcal {N}}_q\left( {\varvec{M}}_{z_i}^{-1}{\varvec{{\varLambda }}}_{z_i}^{T}{\varvec{{\varSigma }}}^{-1}_{z_i}({\varvec{x}}_i - {\varvec{\mu }}_{z_i}), {\varvec{M}}_{z_i}^{-1} \right) , \end{aligned}$$
(29)

independent for \(i = 1,\ldots ,n\). Finally, the full conditional distribution for \(\omega _\ell \) is

$$\begin{aligned} \omega _\ell ^{-2}|\cdots \sim {\mathcal {G}}\left( g + Kp/2, h + T_{\ell \ell }/2\right) , \end{aligned}$$
(30)

while under constraint (11) we obtain that

$$\begin{aligned} \omega _\ell ^{-2}|\cdots \sim {\mathcal {G}}\left( g + p/2, h + T_{\ell \ell }/(2K)\right) , \end{aligned}$$
(31)

independent for \(\ell = 1,\ldots ,q\).
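The sketch below (illustrative values only, not the package code) verifies the Woodbury inversion used in Eq. (20) and then draws a latent factor vector from Eq. (29); both computations only require inverting the \(q\times q\) matrix \({\varvec{M}}_k\) rather than the \(p\times p\) covariance matrix.

```r
## Sketch linking Eqs. (20) and (29): invert Lambda Lambda' + Sigma via the
## Sherman-Morrison-Woodbury formula and draw a latent factor vector.
set.seed(4)
p <- 30; q <- 3
Lambda    <- matrix(rnorm(p * q), p, q)
Sigma_inv <- diag(1 / runif(p, 0.5, 2))              # Sigma is diagonal, hence trivial to invert
M <- diag(q) + t(Lambda) %*% Sigma_inv %*% Lambda    # M_k: q x q
woodbury_inv <- Sigma_inv -
  Sigma_inv %*% Lambda %*% solve(M) %*% t(Lambda) %*% Sigma_inv
max(abs(woodbury_inv %*% (Lambda %*% t(Lambda) + solve(Sigma_inv)) - diag(p)))  # ~ 0

mu <- rnorm(p); x_i <- rnorm(p, mean = mu)           # a single artificial observation
M_inv  <- solve(M)
y_mean <- M_inv %*% t(Lambda) %*% Sigma_inv %*% (x_i - mu)
y_i    <- MASS::mvrnorm(1, mu = as.vector(y_mean), Sigma = M_inv)   # Eq. (29)
```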

3.2 MCMC sampler

Given the number of factors (q) and a model parameterization, a Gibbs sampler (Geman and Geman 1984; Gelfand and Smith 1990) coupled with a prior parallel tempering scheme (Geyer 1991; Geyer and Thompson 1995; Altekar et al. 2004) is used in order to produce an MCMC sample from the joint posterior distribution. Each heated chain (\(j = 1,\ldots , {\texttt {nChains}}\)) corresponds to a model with the same likelihood as the original, but with a different prior distribution. Although the prior tempering can be imposed on any subset of parameters, it is only applied to the Dirichlet prior distribution of the mixing proportions (van Havre et al. 2015). The inference is based on the output of the first chain (\(j = 1\)) of the prior parallel tempering scheme (van Havre et al. 2015). The number of factors and the model parameterization are selected according to the Bayesian information criterion (BIC) (Schwarz 1978), conditional on the most probable number of alive clusters per model (see Papastamoulis (2018b) for a detailed comparison of BIC with other alternatives).

Let \({\mathcal {M}}\) and \({\mathcal {Q}}\) denote the sets of model parameterizations and numbers of factors, respectively. In the following pseudocode, \(x\leftarrow [y|z]\) denotes that x is updated by a draw from the distribution f(y|z) and \(\theta _j^{(t)}\) denotes the value of \(\theta \) at the t-th iteration of the j-th chain.

  1. For \((m,q) \in {\mathcal {M}}\times {\mathcal {Q}}\)

     (a) Obtain initial values \(\left( {\varvec{{\varOmega }}}_j^{(0)}, {\varvec{{\varLambda }}}_{m;j}^{(0)}, {\varvec{\mu }}^{(0)}_j, {\varvec{z}}^{(0)}_j, {\varvec{{\varSigma }}}_{m;j}^{(0)}, {\varvec{w}}^{(0)}_j, {\varvec{y}}^{(0)}_j\right) \) by running the overfitting initialization scheme, for \(j = 1,\ldots , {\texttt {nChains}}\).

     (b) For MCMC iteration \(t = 1,2,\ldots \) update

         i. For chain \(j = 1,\ldots , {\texttt {nChains}}\)

            A. \({\varvec{{\varOmega }}}^{(t)}_j \leftarrow \left[ {\varvec{{\varOmega }}}|{\varvec{{\varLambda }}}_{m;j}^{(t-1)}\right] \). If \(m \in \{\text{ UUU,UCU,UUC,UCC }\}\) use (30), else use (31).

            B. \({\varvec{{\varLambda }}}_{m;j}^{(t)} \leftarrow \left[ {\varvec{{\varLambda }}}|{\varvec{{\varOmega }}}_j^{(t)}, {\varvec{\mu }}_j^{(t-1)}, {\varvec{{\varSigma }}}_{m;j}^{(t-1)}, {\varvec{x}}, {\varvec{y}}_j^{(t-1)}, {\varvec{z}}_j^{(t-1)}\right] \). If \(m \in \{\text{ UUU,UCU,UUC,UCC }\}\) use (23), else use (24).

            C. \({\varvec{\mu }}_j^{(t)}\leftarrow \left[ {\varvec{\mu }}|{\varvec{{\varLambda }}}_{m;j}^{(t)},{\varvec{{\varSigma }}}_{m;j}^{(t-1)}, {\varvec{x}}, {\varvec{y}}_j^{(t-1)}, {\varvec{z}}_j^{(t-1)}\right] \) according to (22).

            D. \({\varvec{z}}_j^{(t)}\leftarrow \left[ {\varvec{z}}|{\varvec{w}}_j^{(t-1)},{\varvec{\mu }}_j^{(t)}, {\varvec{{\varLambda }}}_{m;j}^{(t)}, {\varvec{{\varSigma }}}_{m;j}^{(t-1)}, {\varvec{x}}\right] \) according to (20).

            E. \({\varvec{w}}_j^{(t)}\leftarrow \left[ {\varvec{w}}|{\varvec{z}}_j^{(t)}\right] \) according to (21) with prior parameter \(\gamma = \gamma _{(j)}\).

            F. \({\varvec{{\varSigma }}}_{m;j}^{(t)}\leftarrow \left[ {\varvec{{\varSigma }}}|{\varvec{x}}, {\varvec{z}}_j^{(t)}, {\varvec{\mu }}_j^{(t)}, {\varvec{{\varLambda }}}_{m;j}^{(t)}, {\varvec{y}}_j^{(t-1)}\right] \). If \(m \in \{\text{ UUU,CUU }\}\) use (25); else if \(m \in \{\text{ UCU,CCU }\}\) use (26); else if \(m \in \{\text{ UUC,CUC }\}\) use (27); else use (28).

            G. \({\varvec{y}}_j^{(t)}\leftarrow \left[ {\varvec{y}}|{\varvec{x}}, {\varvec{z}}_j^{(t)}, {\varvec{\mu }}_j^{(t)}, {\varvec{{\varSigma }}}_{m;j}^{(t)},{\varvec{{\varLambda }}}_{m;j}^{(t)}\right] \) according to (29).

         ii. Select randomly \(1\leqslant j^*\leqslant {\texttt {nChains}}-1\) and propose to swap the states of chains \(j^*\) and \(j^*+1\).

     (c) For chain \(j = 1\) compute the BIC, conditionally on the most probable number of alive clusters.

  2. Select the best \((m,q)\) model corresponding to chain \(j = 1\) according to the BIC and reorder the simulated output of the selected model according to the ECR algorithm, conditional on the most probable number of alive clusters.

The MCMC algorithm is initialized using random starting values arising from the “overfitting initialization” procedure introduced by Papastamoulis (2018b). For further details on steps 1.(a) (MCMC initialization) and 1.(b).ii (prior parallel tempering scheme), the reader is referred to “Appendix B” (see also Sections 2.6, 2.7 and 2.9 of Papastamoulis (2018b)).
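To make the swap move in step 1.(b).ii concrete, the following sketch computes the logarithm of the swap acceptance ratio under the assumption (as stated in Sect. 3.2) that the chains share the same likelihood and differ only in the Dirichlet prior parameter of the mixing proportions, so that all other terms cancel; this is a simplified illustration, not the fabMix implementation.

```r
## Simplified sketch of the prior parallel tempering swap: the Metropolis-Hastings
## log-ratio reduces to a ratio of symmetric Dirichlet prior densities evaluated at
## the current mixing proportions of the two chains.
log_ddirichlet <- function(w, gamma) {
  K <- length(w)
  lgamma(K * gamma) - K * lgamma(gamma) + sum((gamma - 1) * log(w))
}
swap_log_ratio <- function(w_j, w_jp1, gamma_j, gamma_jp1) {
  log_ddirichlet(w_jp1, gamma_j) + log_ddirichlet(w_j, gamma_jp1) -
    log_ddirichlet(w_j, gamma_j) - log_ddirichlet(w_jp1, gamma_jp1)
}
## the proposed swap is accepted with probability min(1, exp(swap_log_ratio(...)))
```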

Table 1 Arguments of the fabMix() function

4 Using the fabMix package

The main function of the fabMix package is fabMix(), with its arguments shown in Table 1. This function takes as input a matrix rawData of observed data where rows and columns correspond to observations and variables of the dataset, respectively. The parameters of the Dirichlet prior distribution (\(\gamma _{(j)}; j = 1,\ldots ,\text{ nChains }\)) of the mixing proportions are controlled by dirPriorAlphas. The range for the number of factors is specified in the q argument. Valid input for q is any positive integer vector between 1 and the Ledermann bound (Ledermann 1937) implied by the number of variables in the dataset. By default, all eight parameterizations are fitted; however, the user can specify in model any non-empty subset of them.

The fabMix() function simulates a total number of \({\texttt {nChains}}\times {\texttt {length(models)}}\times {\texttt {length(q)}}\) MCMC chains. For each parameterization and number of factors, the (nChains) heated chains are processed in parallel, while swaps between pairs of chains are proposed. Parallelization is also possible at the parameterization level, using the argument parallelModels: parallelModels models run in parallel, each of which runs its nChains chains in parallel, provided that the number of available threads is at least equal to \({\texttt {nChains}}\times {\texttt {parallelModels}}\). In order to parallelize our code, the doParallel (Revolution Analytics and Steve Weston 2015), foreach (Revolution Analytics and Steve Weston 2014) and doRNG (Gaujoux 2018) packages are imported.

The prior parameters \(g, h, \alpha , \beta \) in Eqs. (18) and (19) correspond to g, h, alpha_sigma and beta_sigma, respectively, with a (common) default value equal to 0.5. It is suggested to run the algorithm using normalize = TRUE, in order to standardize the data before running the MCMC sampler. The default behavior of our method is to normalize the data; thus, all reported estimates refer to the standardized dataset. In the case that the most probable number of mixture components is larger than 1, the ECR algorithm is applied in order to undo the label switching problem. Otherwise, the output is post-processed so that the generated parameters of the (single) alive component are switched to the first component of the overfitting mixture.

Table 2 Output returned to the user of the fabMix() function

The sampler will first run for warm_up iterations before starting to propose swaps between pairs of chains. By default, this stage consists of 5000 iterations. After that, each chain will run for a series of mCycles MCMC cycles, each one consisting of nIterPerCycle MCMC iterations (steps A, B, \(\ldots \), G of the pseudocode). The updates of the factor loadings according to (23) and (24) at step B of the pseudocode are implemented in compiled C++ code using the Rcpp and RcppArmadillo libraries (Eddelbuettel and François 2011; Eddelbuettel and Sanderson 2014). At the end of each cycle, a swap between a pair of chains is proposed.

Obviously, the total number of MCMC iterations is equal to \({\texttt {warm}}\_{\texttt {up}}+ {\texttt {mCycles}}\times {\texttt {nIterPerCycle}}\), and the first \({\texttt {warm}}\_{\texttt {up}}+ {\texttt {burnCycles}}\times {\texttt {nIterPerCycle}}\) iterations are discarded as burn-in. Given the default values of nIterPerCycle, warm_up and overfittingInitialization, choices between \(50\leqslant {\texttt {burnCycles}}\leqslant 500 < {\texttt {mCycles}}\leqslant 1500\) are typical in our implementation (see also the convergence analysis in Papastamoulis (2018b)).
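Putting the arguments described above together, a minimal call might look as follows; myData is a placeholder for any \(n\times p\) numeric matrix, the values shown are illustrative rather than recommended defaults, and the argument names follow Table 1.

```r
## A minimal sketch of a fabMix() call (illustrative values only).
library(fabMix)
set.seed(5)
fit <- fabMix(rawData = myData,                   # observations in rows, variables in columns
              model   = c("UUU", "CUC", "CCC"),   # any non-empty subset of the eight models
              Kmax    = 20,                       # components of the overfitting mixture
              q       = 1:3,                      # candidate numbers of factors
              nChains = 4,                        # heated chains (prior parallel tempering)
              mCycles = 600, burnCycles = 100,    # MCMC cycles retained / discarded
              parallelModels = 3,                 # parameterizations processed in parallel
              normalize = TRUE)                   # standardize the data before sampling
print(fit)
summary(fit)
plot(fit)                                         # e.g. BIC values (see the plot types in Sect. 5.1.1)
```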

While the function runs, some basic information is printed either on the screen (if parallelModels is not enabled) or in separate text files inside the output folder (otherwise), such as the progress of the sampler as well as the acceptance rate of the proposed swaps between chains. The output which is returned to the user is detailed in Table 2. The full MCMC output of the selected model is returned as a list (named mcmc) consisting of mcmc objects, a class imported from the coda package (Plummer et al. 2006). In particular, mcmc consists of the following:

  • y: object of class mcmc containing the simulated factors.

  • w: object of class mcmc containing the simulated mixing proportions of the alive components, reordered according to ECR algorithm.

  • Lambda: list containing objects of class mcmc with the simulated factor loadings of the alive components, reordered according to ECR algorithm. Note that this particular parameter is not identifiable due to sign switching across the MCMC trace.

  • mu: list containing objects of class mcmc with the simulated marginal means of the alive components, reordered according to ECR algorithm.

  • z: matrix of the simulated latent allocation variables of the mixture model, reordered according to ECR algorithm.

  • Sigma: list containing objects of class mcmc with the simulated variance of errors of the alive components, reordered according to ECR algorithm.

  • K_all_chains: matrix of the simulated values of the number of alive components per chain.

The user can call the print, summary and plot methods of the package in order to easily retrieve and visualize various summaries of the output, as exemplified in the next section.

5 Examples

This section illustrates our method. First, we demonstrate a typical implementation on two simulated datasets and explain the workflow in detail. Then we perform an extensive simulation study for assessing the ability of the proposed method to recover the correct clustering and compare our findings to the pgmm package (McNicholas and Murphy 2008, 2010; McNicholas et al. 2010, 2015). Application to four publicly available datasets is provided next.

Fig. 1 Simulated datasets of \(p = 30\) variables consisting of \(n = 300\) observations and \(K = 6\) clusters (dataset 1) and \(n = 200\), \(K = 2\) (dataset 2). The colors display the ground-truth classification of the data. (Color figure online)

5.1 Simulation study

We simulated synthetic data of \(p = 30\) variables consisting of \(n = 300\) observations and \(K = 6\) clusters (dataset 1) and \(n = 200\), \(K = 2\) (dataset 2), as shown in Fig. 1. Both of them were generated using MFA models with \(q = 2\) (dataset 1) and \(q = 3\) (dataset 2) factors. The two datasets exhibit different characteristics: the variance of errors per cluster (\({\varvec{{\varSigma }}}_k\)) is significantly larger in dataset 2 compared to dataset 1. In addition, the selection of factor loadings in dataset 2 results in a more complex covariance structure. The generating mechanism, described in detail in Papastamoulis (2018b), is available in the fabMix package via the simData() and simData2() functions, as shown below.

figure a

Next we estimate the eight overfitting Bayesian MFA models with \(K_{{\max }}=20\) mixture components, assuming that the number of factors ranges in the set \(1\leqslant q\leqslant 5\). The MCMC sampler runs nChains = 4 heated chains, each one consisting of mCycles = 700 cycles, while the first burnCycles = 100 are discarded. Recall that each MCMC cycle consists of nIterPerCycle = 10 usual MCMC iterations and that there is an additional warm-up period of the MCMC sampler (before starting to propose chain swaps) corresponding to 5000 usual MCMC iterations.

figure b

The argument parallelModels = 4 implies that four parameterizations will be processed in parallel. In addition, each model will use nChains = 4 threads to run in parallel the specified number of chains. Our job script used 16 threads so in this case the \({\texttt {parallelModels}}\times {\mathtt{nChains}} = 16\) jobs are efficiently allocated.

5.1.1 Methods for printing, summarizing and plotting the output

The print method for a fabMix.object displays some basic information for a given run of the fabMix function. The following output corresponds to the first dataset.

figure c

The following output corresponds to the print method applied to the second dataset.

figure d

We conclude that the selected models correspond to the UUC parameterization with \(K = 6\) clusters and \(q = 2\) factors for dataset 1 and \(K = 2\), \(q = 2\) for dataset 2. The selected number of clusters and factors for the whole range of eight models is displayed next, along with the estimated posterior probability of the number of alive clusters per model (K_MAP_prob), the value of the BIC for the selected number of factors (BIC_q) as well as the proportion of the accepted swaps between the heated MCMC chains in the last column. The frequency table of the estimated single best clustering of the datasets is displayed in the last field. We note that the labels of the frequency table correspond to the labels of the alive components of the overfitting mixture model, that is, components 4, 7, 13, 14, 15, and 17 for dataset 1 and components 6 and 20 for dataset 2. Clearly, these labels can be renamed to 1, 2, 3, 4, 5, 6 and 1, 2 respectively, but we prefer to retain the raw output of the sampler as a reminder of the fact that it corresponds to the alive components of the overfitted mixture model.

The summary method of the fabMix package summarizes the MCMC output for the selected model by calculating posterior means and quantiles for the mixing proportions, marginal means and the covariance matrix per (alive) cluster. A snippet of the output for dataset 2 is shown below.

figure e

The printed output is also returned to the user via s$posterior_means and s$quantiles.

Fig. 2 BIC values per parameterization and factor level using the plot(fabMix.object) method

The plot() method of the package generates the following types of graphics output:

  (1) Plot of the BIC values per factor level and parameterization.

  (2) Plot of the posterior means of the marginal means (\({\varvec{\mu }}_k\)) per (alive) cluster and highest density intervals of the corresponding normal distribution, along with its assigned data.

  (3) The coordinate projection plot of the mclust package (Fraley and Raftery 2002; Scrucca et al. 2017), that is, a scatterplot of the assigned data per cluster for each pair of variables.

  (4) Visualization of the posterior mean of the correlation matrix per cluster using the corrplot package.

  (5) The MAP estimate of the factor loadings (\({\varvec{{\varLambda }}}_k\)) per (alive) cluster.

The following commands produce plot (1) for datasets 1 and 2.

figure f

The produced plots are shown in Fig. 2. Note that each point in the plot is labeled by an integer, which corresponds to the MAP number of alive components for the specific combination of factors and parameterization.

Fig. 3 Marginal mean with \(95\%\) highest density interval and the corresponding assigned data per alive cluster using the plot(fabMix.object) method

The following commands produce plot (2) for datasets 1 and 2.

figure g

The created plots are shown in Fig. 3. The class_mfrow argument controls the rows and columns of the layout; it should consist of two integers whose product equals the selected number of (alive) clusters. In addition, a legend is placed at the bottom of the layout. The value(s) in the confidence argument draw the highest density interval(s) of the estimated normal distribution. Note that these plots display the original and not the scaled dataset which is used in the MCMC sampler. Therefore, the central curve and confidence limits displayed in this plot correspond to the mean and standard deviation (multiplied by the appropriate quantile of the standard normal distribution) of the random variables obtained by applying the inverse of the z-transformation to the MCMC estimates reported by the fabMix function.

Figure 4 visualizes the correlation matrix for the first cluster of each dataset, using the corrplot package. The argument \({\texttt {sig}}\_{\texttt {correlation}} = \alpha \) is used for marking cases where the equally tailed \((1-\alpha )\) Bayesian credible interval contains zero. The following commands generate the plots in Fig. 4.

figure h
Fig. 4 Correlation matrix for the first (alive) cluster of each dataset

Fig. 5 Adjusted Rand index (first row), estimated number of clusters (second row), estimated number of factors (third row) and selected parameterization (last row) for various replications of scenarios 1 and 2 with varying number of clusters and factors. In all cases, the sample size is drawn randomly in the set \(\{100, 200, \ldots ,1000\}\)

5.1.2 Assessing clustering accuracy and comparison with pgmm

In this section, we compare our findings against the ground truth in simulated datasets and also compare against the pgmm package, considering the same range of clusters and factors per dataset. For each combination of number of factors, components and parameterization, the pgmmEM() algorithm was initialized using three random starting values as well as the K-means clustering algorithm, that is, four different starts in total. Note that the number of different starts of the EM algorithm is set equal to the number of parallel chains in the MCMC algorithm. The input data are standardized in both algorithms.

As shown in Table 3, the adjusted Rand index (ARI) (Rand 1971) between fabMix and the ground-truth classification is equal to 1 and 0.98 for simulated datasets 1 and 2, respectively. The corresponding ARI for pgmm is equal to 0.98 and 0.88, respectively. In both cases, our method finds the correct number of clusters; however, pgmm overestimates K in dataset 1. Both methods select the UUC parameterization in dataset 1, but in dataset 2 different models are selected (UUC by fabMix and CUC by pgmm).
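For reference, the adjusted Rand index reported here can be computed from two label vectors as in the following sketch (the labels shown are hypothetical; we use the implementation in the mclust package, but any ARI implementation would do).

```r
## Sketch of the accuracy measure reported in Table 3: the adjusted Rand index
## between an estimated clustering and the ground-truth classification.
library(mclust)
z_true <- rep(1:2, each = 100)                 # ground-truth labels
z_est  <- c(rep(1, 95), rep(2, 105))           # a hypothetical estimated single best clustering
adjustedRandIndex(z_true, z_est)
```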

Table 3 Selected number of clusters, factors, parameterization and adjusted Rand index for simulated data 1 and 2

The selected number of factors equals 2 for both datasets; however, the “true” number of factors in dataset 2 equals 3. The underestimation of the number of factors in dataset 2 remains true for a wide range of similar data: in particular, we generated synthetic datasets with parameter values identical to those of dataset 2, each time increasing the sample size by 200 observations. We observed that the correct number of factors is returned when \(n \geqslant 1600\) for fabMix and \(n \geqslant 1800\) for pgmm.

Next we replicate the two distinct simulation procedures (according to the simData() and simData2() functions of the package) used to generate the previously described datasets, but considering \(1\leqslant K \leqslant 10\) (true number of clusters) and \(1\leqslant q \leqslant 3\) (true number of factors). The number of variables remains the same as before, that is \(p=30\), and the sample size is drawn uniformly at random from the set \(\{100,200,\ldots ,1000\}\). We will use the terms ‘scenario 1’ and ‘scenario 2’ to label the two different simulation procedures. In scenario 1, the diagonal of the variance of errors is generated as \(\sigma _{kr}^{2} = 1 + 20\log (k+1)\), \(r = 1,\ldots ,p\), whereas in scenario 2: \(\sigma _{kr}^{2} = 1 + u_r\log (k+1)\), where \(u_r\sim \hbox {Uniform}(500,1000)\), \(r = 1,\ldots ,p\); \(k=1,\ldots ,K\). In general, scenario 1 generates datasets with well-separated clusters. On the other hand, the amount of error variance in scenario 2 makes the clusters less separated. For a given simulated dataset with \(K_{\text {true}}\) clusters and \(q_{\text {true}}\) factors, the total number of components in the overfitting mixture model (fabMix), as well as the maximum number of components fitted by pgmm, is set equal to \(K_{{\max }} = K_{\text {true}} + 6\), and the number of factors ranges between \(1\leqslant q \leqslant q_{\text {true}} + 2\). These bounds are selected in order to speed up computation without introducing any bias in the resulting inference (as confirmed by a smaller pilot study). For each scenario, 500 datasets were simulated.

The main findings of the simulation study are illustrated in Fig. 5. Note that in scenario 1 fabMix almost always finds the correct clustering structure: the boxplots of the adjusted Rand index are centered at 1, and, on the second row, the boxplots of the estimated number of clusters are centered at the corresponding true value. On the other hand, observe that for \(K\geqslant 6\) pgmm has the tendency to overestimate the number of clusters. In the more challenging scenario 2, the estimates of the number of clusters exhibit larger variability. However, note that for \(K=8,9,10\) the number of clusters selected by fabMix is closer to the true value than that selected by pgmm, a fact which is also reflected in the ARI, where fabMix tends to have larger values than pgmm. For both scenarios, the estimation of the number of factors is in strong agreement between the two methods, as shown in the third row of Fig. 5. The last row shows the selected parameterization. Observe that the results are fairly consistent between the two methods.

Finally, we note that in the presented simulation study, the generated clusters have equal sizes (on average). The reader is referred to “Appendix C” for exploring the performance of the compared methods in the presence of small and large clusters with respect to the size of the available data (n).

5.2 Publicly available datasets

In this section, we analyze four publicly available datasets: a subset of the wave dataset (Breiman et al. 1984; Lichman 2013) available in the fabMix package, the wine dataset (Forina et al. 1986) available in the pgmm package, the coffee dataset (Streuli 1973) available in the pgmm package, and the standardized yeast cell cycle data (Cho et al. 1998) available at http://faculty.washington.edu/kayee/model/. Note that Papastamoulis (2018b) analyzed the first three datasets but only considered the UUU and UCU parameterizations for fabMix.

The coffee dataset consists of \(n=43\) coffee samples of \(p = 12\) variables, collected from beans corresponding to the Arabica and Robusta species (thus, \(K = 2\)). The wave dataset consists of a randomly sampled subset of 1500 observations from the original wave data (Breiman et al. 1984), available from the UCI machine learning repository (Lichman 2013). According to the available ground-truth classification of the dataset, there are three equally weighted underlying classes of 21-dimensional continuous data. The wine dataset (Forina et al. 1986), available in the pgmm package (McNicholas et al. 2015), contains \(p=27\) variables measuring chemical and physical properties of \(n = 178\) wines, grouped into three types (thus, \(K = 3\)). The reader is referred to McNicholas and Murphy (2008), Papastamoulis (2018b) for more detailed descriptions of the data.

The yeast cell cycle data (Cho et al. 1998) quantify gene expression levels over two cell cycles (17 time points). The dataset has previously been used for evaluating the effectiveness of model-based clustering techniques (Yeung et al. 2001). We used the standardized subset of the 5-phase criterion, containing \(n = 384\) genes measured at \(p = 17\) time points. The expression levels of the \(n=384\) genes peak at different time points corresponding to the five phases of cell cycle, so this five-class partition of the data is used as the ground-truth classification.

Table 4 Selected number of clusters, factors, parameterization and adjusted Rand index for the publicly available data
Fig. 6 Marginal mean with \(95\%\) highest density interval and the corresponding assigned data per alive cluster for the yeast cell cycle data

Fig. 7 Total time needed for fitting the eight parameterizations considering \(q =1,\ldots ,5\) (40 models in total) for various levels of sample size (n) and number of variables (p). We considered \(K_{{\max }} = 20\) components in fabMix and \(1\leqslant K\leqslant 20\) in pgmm. Each parameterization is fitted in parallel using eight threads. No multiple runs (pgmm) or parallel chains (fabMix) are considered. The MCMC algorithm in fabMix ran for 12,000 iterations. The bars display averaged wall clock run times across five replicates

We applied our method using the eight parameterizations of overfitting mixtures with Kmax = 20 components for \(1\leqslant q\leqslant q_{{\max }}\) factors using nChains = 4 heated chains. We set \(q_{{\max }} = 5\) for the coffee, wave and wine datasets, while \(q_{{\max }} = 10\) for the yeast cell cycle dataset. The number of MCMC cycles was set to mCycles = 1100, while the first burnCycles = 100 were discarded as burn-in. The eight parameterizations are processed in parallel on parallelModels = 4 cores, while each heated chain of a given parameterization is also running in parallel. All other prior parameters were fixed at their default values.

We have also applied pgmm considering the same range of clusters and factors per dataset. For each combination of number of factors, components and parameterization, the EM algorithm was initialized using five random starting values as well as the K-means clustering algorithm, that is, six different starts in total. For the coffee dataset, a larger number of different starts is required, as discussed in Papastamoulis (2018b).

Table 4 summarizes the results for each of the publicly available datasets. We conclude that fabMix performs better than pgmm on the coffee and yeast datasets. In the wine dataset, on the other hand, pgmm performs better than fabMix, but we underline the improved performance of our method compared to the one reported by Papastamoulis (2018b), where only the UUU and UCU parameterizations were fitted. The two methods are in agreement on the wave dataset. The plot command of the fabMix package displays the estimated clusters according to the CUU model with six factors for the yeast dataset, as shown in Fig. 6.

6 Discussion and further remarks

This study offered an efficient Bayesian methodology for model-based clustering of multivariate data using mixtures of factor analyzers. The proposed model extended the ideas of Papastamoulis (2018b) building upon the previously introduced set of parsimonious Gaussian mixture models (McNicholas and Murphy 2008; McNicholas et al. 2010). The additional parameterizations improved the performance of the proposed method compared to Papastamoulis (2018b) where only two out of eight parameterizations were available. Furthermore, our contributed R package makes the proposed method available to a wider audience of researchers.

The computational cost of our MCMC method is larger than that of the EM algorithm, as shown in Fig. 7. Of course, when only a point estimate is required, the EM algorithm is the quickest solution. When a point estimate is not sufficient, our method offers an attractive Bayesian treatment of the problem. Moreover, the Bayesian approach shows further advantages (as in the simulated datasets of scenario 1, as well as in the coffee and yeast datasets) in cases where the multimodality of the likelihood potentially causes the EM algorithm to converge to local maxima.

A direction for future research is to generalize the method in order to automatically detect the number of factors in a fully Bayesian manner. This is possible by, for example, treating the number of factors as a random variable and implementing a reversible jump mechanism in order to update it inside the MCMC sampler. Another possibility would be to incorporate strategies for searching the space of sparse factor loading matrices allowing posterior inference for factor selection (Bhattacharya and Dunson 2011; Mavridis and Ntzoufras 2014; Conti et al. 2014). Recent advances on infinite mixtures of infinite factor models (Murphy et al. 2019) also allow for direct inference of the number of clusters and factors and could boost the flexibility of our modeling approach.