1 Introduction

Factor analysis (FA) explains relationships among a set of observed variables using a set of latent variables. This is typically achieved by expressing the observed multivariate data as a linear combination of a smaller set of unobserved and uncorrelated variables known as factors. Let \({\varvec{x}} = ({\varvec{x}}_1,\ldots ,{\varvec{x}}_n)\) denote a random sample of p-dimensional observations with \({\varvec{x}}_i\in {\mathbb {R}}^{p}\); \(i= 1,\ldots ,n\). Let \({\mathcal {N}}_p({\varvec{\mu }},{\varvec{{\varSigma }}})\) denote the p-dimensional normal distribution with mean \({\varvec{\mu }}\) and covariance matrix \({\varvec{{\varSigma }}}\) and also denote by \(\mathbf{I }_p\) the \(p\times p\) identity matrix. The following equations summarize the typical FA model.

$$\begin{aligned}&{\varvec{x}}_i = {\varvec{\mu }} + {\varvec{{\varLambda }}} {\varvec{y}}_i + {\varvec{\varepsilon }}_i,\quad i = 1,\ldots ,n \end{aligned}$$
(1)
$$\begin{aligned}&({\varvec{y}}_i, {\varvec{\varepsilon }}_i) \sim {\mathcal {N}}_q({\varvec{0}},\mathbf{I }_q){\mathcal {N}}_p({\varvec{0}},{\varvec{{\varSigma }}}),\quad \text{ iid } \text{ for } i = 1,\ldots ,n \end{aligned}$$
(2)
$$\begin{aligned}&{\varvec{{\varSigma }}} = \hbox {diag}(\sigma _1^2, \ldots ,\sigma _p^2) \end{aligned}$$
(3)
$$\begin{aligned}&{\varvec{x}}_i|{\varvec{y}}_i \sim {\mathcal {N}}_p({\varvec{\mu }}+{\varvec{{\varLambda }}} {\varvec{y}}_i,{\varvec{{\varSigma }}}),\quad \text{ ind. } \text{ for } i = 1,\ldots ,n \end{aligned}$$
(4)

Before proceeding, note that we do not differentiate the notation between random variables and their corresponding realizations. Bold uppercase letters are used for matrices, bold lowercase letters for vectors and normal typeface for scalars.

In Eq. (1), we assume that \({\varvec{x}}_i\) is expressed as a linear combination of a latent vector of factors \({\varvec{y}}_i\in {\mathbb {R}}^{q}\). The \(p\times q\) dimensional matrix \({\varvec{{\varLambda }}} = (\lambda _{rj})\) contains the factor loadings, while \({\varvec{\mu }} = (\mu _1,\ldots ,\mu _p)\) contains the marginal mean of \({\varvec{x}}_i\). The unobserved vector \({\varvec{y}}_i\) lies in a lower-dimensional space, that is, \(q < p\), and it consists of uncorrelated features \(y_{i1}, \ldots ,y_{iq}\) as shown in Eq. (2), where \({\varvec{0}}\) denotes a vector of zeros. Note that the error terms \({\varvec{\varepsilon }}_i\) are independent of \({\varvec{y}}_i\). Furthermore, the errors consist of independent random variables \(\varepsilon _{i1},\ldots ,\varepsilon _{ip}\), as implied by the diagonal covariance matrix \({\varvec{{\varSigma }}}\) in Eq. (3). As shown in Eq. (4), knowledge of the missing data (\({\varvec{y}}_i\)) implies that the conditional distribution of \({\varvec{x}}_i\) has a diagonal covariance matrix. The previous assumptions lead to

$$\begin{aligned} {\varvec{x}}_i \sim {\mathcal {N}}_p({\varvec{\mu }},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}}), \quad \text{ iid } \text{ for } i = 1,\ldots ,n. \end{aligned}$$
(5)

According to Eq. (5), the covariance matrix of the marginal distribution of \({\varvec{x}}_i\) is equal to \({\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}}\). This is the crucial characteristic of factor analytic models: they aim to explain high-dimensional dependencies using a set of lower-dimensional uncorrelated factors (Kim and Mueller 1978; Bartholomew et al. 2011).
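As a concrete illustration of Eqs. (1)–(5), the following R snippet (not part of fabMix; all values are arbitrary) simulates a single factor analyzer and checks that the sample covariance of the simulated data approaches \({\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}}\).

```r
## Illustrative simulation from the FA model in Eqs. (1)-(3) (arbitrary values):
## the sample covariance of x should approach Lambda Lambda' + Sigma, cf. Eq. (5).
set.seed(1)
n <- 10000; p <- 5; q <- 2
mu     <- rnorm(p)                                   # marginal mean
Lambda <- matrix(rnorm(p * q), nrow = p, ncol = q)   # factor loadings
Sigma  <- diag(runif(p, 0.5, 1.5))                   # diagonal error covariance, Eq. (3)
y   <- matrix(rnorm(n * q), nrow = n)                             # y_i ~ N_q(0, I_q)
eps <- matrix(rnorm(n * p), nrow = n) %*% diag(sqrt(diag(Sigma))) # eps_i ~ N_p(0, Sigma)
x   <- sweep(y %*% t(Lambda) + eps, 2, mu, "+")                   # Eq. (1)
max(abs(cov(x) - (Lambda %*% t(Lambda) + Sigma)))    # close to zero for large n
```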

Mixtures of Factor Analyzers (MFA) generalize the typical FA model by assuming that Eq. (5) becomes

$$\begin{aligned} {\varvec{x}}_i \sim \sum _{k = 1}^{K}w_k {\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + {\varvec{{\varSigma }}}_{k}), \text{ iid } i = 1,\ldots ,n \end{aligned}$$
(6)

where K denotes the number of mixture components. The vector of mixing proportions \({\varvec{w}} := (w_1,\ldots ,w_K)\) contains the weight of each component, with \(0\leqslant w_k\leqslant 1\); \(k = 1,\ldots ,K\) and \(\sum _{k=1}^{K}w_k = 1\). Note that the mixture components are characterized by different parameters \({\varvec{\mu }}_k,{\varvec{{\varLambda }}}_k,{\varvec{{\varSigma }}}_k\), \(k = 1,\ldots ,K\). Thus, MFAs are particularly useful when the observed data exhibit heterogeneity: the approach aims to capture the behavior of each cluster by a separate component of the mixture model. A comprehensive perspective on the history and development of MFA models is given in Chapter 3 of the monograph by McNicholas (2016).

Early works applying the expectation–maximization (EM) algorithm (Dempster et al. 1977) to estimate MFA models include Ghahramani et al. (1996), Tipping and Bishop (1999) and McLachlan and Peel (2000). McNicholas and Murphy (2008), McNicholas and Murphy (2010) introduced the family of parsimonious Gaussian mixture models (PGMMs) by considering the case where the factor loadings and/or error variance may or may not be shared between the mixture components. These models are estimated by the alternating expectation–conditional maximization algorithm (Meng and Van Dyk 1997) and have superior performance compared to other approaches (McNicholas and Murphy 2008). Under a Bayesian setup, Fokoué and Titterington (2003) estimate the number of mixture components and factors by simulating a continuous-time stochastic birth–death point process using a birth–death MCMC algorithm (Stephens 2000). More recently, Papastamoulis (2018b) estimated Bayesian MFA models with an unknown number of components using overfitting mixtures.

In recent years, there has been growing progress in the use of overfitting mixture models in Bayesian analysis (Rousseau and Mengersen 2011; van Havre et al. 2015; Malsiner Walli et al. 2016, 2017; Frühwirth-Schnatter and Malsiner-Walli 2019). An overfitting mixture model consists of a number of components which is much larger than its true (and unknown) value. Under suitable prior assumptions (see “Appendix A”) introduced by Rousseau and Mengersen (2011), it has been shown that asymptotically the redundant components receive zero posterior weight, forcing the posterior distribution to put all its mass on the sparsest way to approximate the true density. Therefore, inference on the number of mixture components can be based on the posterior distribution of the “alive” components of the overfitted model, that is, the components which contain at least one allocated observation.

Other Bayesian approaches to estimate the number of components in a mixture model include the reversible jump MCMC (RJMCMC) (Green 1995; Richardson and Green 1997; Dellaportas and Papageorgiou 2006; Papastamoulis and Iliopoulos 2009), birth–death MCMC (BDMCMC) (Stephens 2000) and allocation sampling (Nobile and Fearnside 2007; Papastamoulis and Rattray 2017) algorithms. However, overfitting mixture models are straightforward to implement, while the other approaches require either careful design of various move types that bridge models with different numbers of clusters, or analytical integration of parameters.

The overall message is that there is a need for developing an efficient Bayesian method that will combine the previously mentioned frequentist advances on parsimonious representations of MFAs and the flexibility provided by the Bayesian viewpoint. This study aims at filling this gap by extending the Bayesian method of Papastamoulis (2018b) to the family of parsimonious Gaussian mixtures of McNicholas and Murphy (2008). Furthermore, we illustrate the proposed method using the R (Ihaka and Gentleman 1996; R Core Team 2016) package fabMix (Papastamoulis 2018a) available as a contributed package from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=fabMix. The proposed method efficiently deals with many inferential problems (see, e.g., Celeux et al. (2000a)) related to mixture posterior distributions, such as (i) inferring the number of non-empty clusters using overfitting models, (ii) efficient exploration of the posterior surface by running parallel heated chains and (iii) incorporating advanced techniques that successfully deal with the label switching issue (Papastamoulis 2016).

The rest of the paper is organized as follows: Section 2 reviews the basic concepts of parsimonious MFAs. Identifiability problems and corresponding treatments are detailed in Sect. 2.1. The Bayesian model is introduced in Sect. 2.2. Section 3 presents the full conditional posterior distributions of the model. The MCMC algorithm is described in Sect. 3.2. A detailed presentation of the main function of the contributed R package is given in Sect. 4. Our method is illustrated and compared to similar models estimated by the EM algorithm in Sects. 5.1 and 5.2 using an extended simulation study and four publicly available datasets, respectively. We conclude in Sect. 6 with a summary of our findings and directions for further research. The Appendix contains further discussion of overfitting mixture models (“Appendix A”), details of the MCMC sampler (“Appendix B”) and additional simulation results (“Appendix C”).

2 Parsimonious mixtures of factor analyzers

Consider the latent allocation variables \(z_i\) which assign observation \({\varvec{x}}_i\) to a component \(k =1,\ldots ,K\) for \(i = 1,\ldots ,n\). A priori each observation is generated from component k with probability equal to \(w_k\), that is,

$$\begin{aligned} \mathrm {P}(z_i = k) = w_k,\quad k = 1,\ldots ,K, \end{aligned}$$
(7)

independent for \(i = 1,\ldots ,n\). Note that the allocation vector \({\varvec{z}} := (z_1,\ldots ,z_n)\) is not observed, so it should be treated as missing data. We assume that \(z_i\) and \({\varvec{y}}_i\) are independent; thus, Eq. (2) is now written as:

$$\begin{aligned} ({\varvec{y}}_i, {\varvec{\varepsilon }}_i|z_i = k)\sim {\mathcal {N}}_q({\varvec{0}},\mathbf{I }_q){\mathcal {N}}_p({\varvec{0}},{\varvec{{\varSigma }}}_{k}), \end{aligned}$$
(8)

and conditional on the cluster membership and latent factors, we obtain that

$$\begin{aligned} ({\varvec{x}}_i|z_i = k,{\varvec{y}}_i) \sim {\mathcal {N}}_p({\varvec{\mu }}_{k}+{\varvec{{\varLambda }}}_{k} {\varvec{y}}_i,{\varvec{{\varSigma }}}_{k}). \end{aligned}$$
(9)

Consequently,

$$\begin{aligned} ({\varvec{x}}_i|z_i = k) \sim {\mathcal {N}}_p({\varvec{\mu }}_{k}, {\varvec{{\varLambda }}}_{k} {\varvec{{\varLambda }}}_{k}^{T}+ {\varvec{{\varSigma }}}_{k}), \end{aligned}$$
(10)

independent for \(i=1,\ldots ,n\). From Eqs. (7) and (10), we derive that the marginal distribution of \({\varvec{x}}_i\) is the finite mixture model in Eq. (6).

Following McNicholas and Murphy (2008), the factor loadings and/or the error variance may be common to all K components in Eq. (6). If the factor loadings are constrained, then:

$$\begin{aligned} {\varvec{{\varLambda }}}_1=\cdots ={\varvec{{\varLambda }}}_K = {\varvec{{\varLambda }}}. \end{aligned}$$
(11)

If the error variance is constrained, then:

$$\begin{aligned} {\varvec{{\varSigma }}}_1=\cdots ={\varvec{{\varSigma }}}_K = {\varvec{{\varSigma }}}. \end{aligned}$$
(12)

Furthermore, the error variance may be isotropic (i.e., proportional to the identity matrix) or not; in the isotropic case, and depending on whether constraint (12) is disabled or enabled, we have:

$$\begin{aligned} {\varvec{{\varSigma }}}_k&= \sigma ^2_k \mathbf{I }_p; \quad k = 1,\ldots ,K \quad \text{ or } \end{aligned}$$
(13)
$$\begin{aligned} {\varvec{{\varSigma }}}_k&= \sigma ^2 \mathbf{I }_p; \quad k = 1,\ldots ,K . \end{aligned}$$
(14)

We note that under constraint (13), the model is referred to as a mixture of probabilistic principal component analyzers (Tipping and Bishop 1999).

Depending on whether a particular constraint is present or not, the following set of eight parameterizations arises.

$$\begin{aligned} \text{ UUU: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + {\varvec{{\varSigma }}}_{k})\\ \text{ UCU: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + {\varvec{{\varSigma }}})\\ \text{ UUC: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + \sigma ^2_k\mathbf{I }_p)\\ \text{ UCC: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + \sigma ^2\mathbf{I }_p)\\ \text{ CUU: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}}_{k})\\ \text{ CCU: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + {\varvec{{\varSigma }}})\\ \text{ CUC: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + \sigma ^2_k\mathbf{I }_p)\\ \text{ CCC: }&{\varvec{x}}_i \sim \sum \limits _{k = 1}^{K}w_k{\mathcal {N}}_p({\varvec{\mu }}_{k},{\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + \sigma ^2\mathbf{I }_p) \end{aligned}$$

independent for \(i = 1,\ldots ,n\). Following the pgmm nomenclature (McNicholas and Murphy 2008): the first, second and third letters denote whether \({\varvec{{\varLambda }}}_k\), \({\varvec{{\varSigma }}}_k = {\mathrm {diag}}(\sigma ^2_{k1}, \ldots ,\sigma ^2_{kp})\) and \(\sigma ^2_{kj}\), \(k=1,\ldots ,K\); \(j=1,\ldots ,p\), are constrained (C) or unconstrained (U), respectively. A novelty of the present study is to offer a Bayesian framework for estimating the whole family of the previous parameterizations (note that Papastamoulis (2018b) estimated the UUU and UCU parameterizations).
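The following R sketch makes this naming convention concrete; it is purely illustrative and does not correspond to the fabMix or pgmm implementation. Given a three-letter code, it builds the K component covariance matrices \({\varvec{{\varLambda }}}_{k}{\varvec{{\varLambda }}}_{k}^T + {\varvec{{\varSigma }}}_{k}\) from unconstrained building blocks.

```r
## Illustrative helper (not the fabMix/pgmm implementation) building the component
## covariance matrices implied by a three-letter code such as "CUC".
component_covariances <- function(code, Lambda, sigma2) {
  ## Lambda: list of K (p x q) loading matrices; sigma2: K x p matrix of error variances
  K <- length(Lambda); p <- ncol(sigma2)
  for (k in seq_len(K)) {
    if (substr(code, 1, 1) == "C") Lambda[[k]] <- Lambda[[1]]  # common loadings, Eq. (11)
    if (substr(code, 2, 2) == "C") sigma2[k, ] <- sigma2[1, ]  # common error variances, Eq. (12)
    if (substr(code, 3, 3) == "C") sigma2[k, ] <- sigma2[k, 1] # isotropic errors, Eqs. (13)-(14)
  }
  lapply(seq_len(K), function(k) Lambda[[k]] %*% t(Lambda[[k]]) + diag(sigma2[k, ], nrow = p))
}
```

For instance, under the code "CCC" all K returned matrices coincide and have the form \({\varvec{{\varLambda }}}{\varvec{{\varLambda }}}^T + \sigma ^2\mathbf{I }_p\), whereas under "UUU" every component has its own loadings and error variances.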

2.1 Label switching and other identifiability problems

Let \(L({\varvec{w}}, {\varvec{\theta }}, {\varvec{\phi }}|{\varvec{x}}) = \prod _{i=1}^{n}\sum _{k=1}^{K}w_kf(x_i|\theta _k, {\varvec{\phi }})\), \(({\varvec{w}}, {\varvec{\theta }},{\varvec{\phi }})\in {\mathcal {P}}_{K-1}\times {\varTheta }^{K}\times {\varPhi }\) denote the likelihood function of a mixture of K densities, where \({\mathcal {P}}_{K-1}\) denotes the parameter space of the mixing proportions \({\varvec{w}}\), \({\varvec{\theta }} = ({\varvec{\theta }}_1,\ldots ,{\varvec{\theta }}_K)\) are the component-specific parameters and \({\varvec{\phi }}\) denotes a (possibly empty) collection of parameters that are common between all components. For instance, consider the UCU parameterization where \({\varvec{\theta }}_k = ({\varvec{\mu }}_k, {\varvec{{\varLambda }}}_k)\) for \(k = 1,\ldots ,K\) and \({\varvec{\phi }} = {\varvec{{\varSigma }}}\). For any permutation \(\tau =(\tau _1,\ldots ,\tau _K)\) of the set \(\{1,\ldots ,K\}\), the likelihood of mixture models is invariant to permutations of the component labels: \(L({\varvec{w}}, {\varvec{\theta }}, {\varvec{\phi }}|{\varvec{x}}) = L(\tau {\varvec{w}}, \tau {\varvec{\theta }}, {\varvec{\phi }}|{\varvec{x}})\). Thus, the likelihood surface of a mixture model with K components will exhibit K! symmetric areas. If \(({\varvec{w}}^{*},{\varvec{\theta }}^{*},{\varvec{\phi }}^{*})\) corresponds to a mode of the likelihood, the same will hold for any permutation \((\tau {\varvec{w}}^{*}, \tau {\varvec{\theta }}^{*},{\varvec{\phi }}^{*})\).

Label switching (Redner and Walker 1984) is the term commonly used to describe this phenomenon. Under a Bayesian point of view, in the case that the prior distribution is also invariant to permutations (which is typically the case, see, e.g., Marin et al. (2005), Papastamoulis and Iliopoulos (2013)), the same invariance property will also hold for the posterior distribution \(f({\varvec{w}}, {\varvec{\theta }}, {\varvec{\phi }}|{\varvec{x}})\). Consequently, the marginal posterior distributions of the mixing proportions and of the component-specific parameters will coincide, i.e., \(f(w_1|{\varvec{x}}) = \cdots = f(w_K|{\varvec{x}})\) and \(f(\theta _1|{\varvec{x}}) = \cdots = f(\theta _K|{\varvec{x}})\). Thus, when approximating the posterior distribution via MCMC sampling, the standard practice of using ergodic averages to estimate quantities of interest (such as the mean of the marginal posterior distribution of each parameter) becomes meaningless. In order to deal with this identifiability problem, we post-process the simulated MCMC output using a deterministic relabeling algorithm, namely the Equivalence Classes Representatives (ECR) algorithm (Papastamoulis and Iliopoulos 2010; Papastamoulis 2014), as implemented in the R package label.switching (Papastamoulis 2016).
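A small numerical illustration of this invariance is given below (for a univariate two-component Gaussian mixture, chosen only for simplicity): permuting the component labels leaves the log-likelihood unchanged.

```r
## Numerical check of the label switching invariance described above; the values
## are arbitrary and the example is purely illustrative.
loglik <- function(x, w, mu, sd) {
  sum(log(sapply(x, function(xi) sum(w * dnorm(xi, mu, sd)))))
}
set.seed(2)
x   <- c(rnorm(50, -2), rnorm(50, 2))
w   <- c(0.3, 0.7); mu <- c(-2, 2); sd <- c(1, 1.5)
tau <- c(2, 1)                                               # a permutation of the labels {1, 2}
loglik(x, w, mu, sd) - loglik(x, w[tau], mu[tau], sd[tau])   # exactly zero
```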

A second source of identifiability problems is related to orthogonal transformations of the matrix of factor loadings. A popular practice (Geweke and Zhou 1996; Fokoué and Titterington 2003; Mavridis and Ntzoufras 2014; Papastamoulis 2018b) to overcome this issue is to pre-assign values to some entries of \({\varvec{{\varLambda }}}\); in particular, we set the entries above the main diagonal of the first \(q\times q\) block of \({\varvec{{\varLambda }}}\) equal to zero:

$$\begin{aligned} {\varvec{{\varLambda }}} = \begin{pmatrix} \lambda _{11} &{} 0 &{} \cdots &{} 0\\ \lambda _{21} &{} \lambda _{22} &{} \cdots &{} 0\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \lambda _{q1} &{} \lambda _{q2} &{} \cdots &{} \lambda _{qq}\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \lambda _{p1} &{} \lambda _{p2} &{} \cdots &{} \lambda _{pq} \end{pmatrix}. \end{aligned}$$
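For illustration, this constraint can be imposed on a generic \(p\times q\) matrix as follows (arbitrary values, not taken from the package):

```r
## Zero the entries above the diagonal of the first q x q block of a generic
## p x q loadings matrix.
p <- 6; q <- 3
Lambda <- matrix(rnorm(p * q), nrow = p, ncol = q)
for (r in 1:(q - 1)) Lambda[r, (r + 1):q] <- 0
round(Lambda, 2)
```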

Another problem is related to the so-called sign switching phenomenon, see, e.g., Conti et al. (2014). Simultaneously switching the signs of a given column \(\ell \) of \({\varvec{{\varLambda }}}\); \(\ell = 1,\ldots ,q\) and of the corresponding coordinate \(y_{i\ell }\) of \({\varvec{y}}_i\); \(i = 1,\ldots ,n\) does not alter the likelihood. Thus, \({\varvec{{\varLambda }}}\) and \({\varvec{y}}_i\); \(i = 1, \ldots ,n\) are not marginally identifiable due to sign switching across the MCMC trace. However, this is not a problem in our implementation, since all parameters of the marginal density of \({\varvec{x}}_i\) in (6) are identified (see also the discussion of sign-invariant parametric functions in Papastamoulis (2018b)).

Parameter-expanded approaches are preferred in the recent literature (Bhattacharya and Dunson 2011; McParland et al. 2017), because the mixing of the MCMC sampler is improved. In our implementation, we are able to obtain excellent mixing using the popular approach of restricting elements of \({\varvec{{\varLambda }}}\): the reader is referred to Figure 2 of Papastamoulis (2018b), where it is obvious that our MCMC sampler has the ability to rapidly move between the multiple modes of the target posterior distribution of \({\varvec{{\varLambda }}}\) (more details on convergence diagnostics are also presented in “Appendix A.4” of Papastamoulis (2018b)).

2.2 Prior assumptions

We assume that the number of mixture components (K) has a sufficiently large value so that it overestimates the “true” number of clusters. Unless otherwise stated, the default choice is \(K = 20\). All prior assumptions of the overfitting mixture models are discussed in detail in Papastamoulis (2018b). For ease of presentation, we repeat them in this section. Let \( {\mathcal {D}}(\cdots )\) denote the Dirichlet distribution, and \( {\mathcal {G}}(\alpha ,\beta )\) denote the Gamma distribution with mean \(\alpha /\beta \). Let also \({\varvec{{\varLambda }}}_{kr\cdot }\) denote the r-th row of the matrix of factor loadings \({\varvec{{\varLambda }}}_k\); \(k = 1,\ldots ,K\); \(r = 1,\ldots ,p\). The following prior assumptions are imposed on the model parameters:

$$\begin{aligned}&{\varvec{w}} \sim {\mathcal {D}}\left( \gamma ,\ldots ,\gamma \right) , \quad \gamma = \frac{1}{K} \end{aligned}$$
(15)
$$\begin{aligned}&{\varvec{\mu }}_k \sim {\mathcal {N}}_p({\varvec{\xi }}, {\varvec{{\varPsi }}}), \quad \text{ iid } \text{ for } k = 1,\ldots ,K \end{aligned}$$
(16)
$$\begin{aligned}&{\varvec{{\varLambda }}}_{kr\cdot } \sim {\mathcal {N}}_{\nu _r}({\varvec{0}},{\varvec{{\varOmega }}}), \quad \text{ iid } \text{ for } k = 1,\ldots ,K; r = 1,\ldots ,p \end{aligned}$$
(17)
$$\begin{aligned}&\sigma _{kr}^{-2} \sim {\mathcal {G}}(\alpha ,\beta ), \quad \text{ iid } \text{ for } k = 1,\ldots ,K; r = 1,\ldots ,p \end{aligned}$$
(18)
$$\begin{aligned}&\omega _{\ell }^{-2} \sim {\mathcal {G}}(g,h), \quad \text{ iid } \text{ for } \ell = 1,\ldots ,q \end{aligned}$$
(19)

where all variables are assumed mutually independent and \(\nu _r =\min \{r,q\}\); \(r=1,\ldots ,p\); \(\ell = 1,\ldots ,q\); \(k=1,\ldots ,K\). In Eq. (17), \({\varvec{{\varOmega }}} = \hbox {diag}(\omega _1^2,\ldots ,\omega _q^2)\) denotes a \(q\times q\) diagonal matrix, where the diagonal entries are distributed independently according to Eq. (19). A graphical representation of the hierarchical model is given in Figure 1 of Papastamoulis (2018b). The default values of the remaining fixed hyper-parameters are given in “Appendix B”.

The previous assumptions refer to the case of the unconstrained parameter space, that is, the UUU parameterization. Clearly, they should be modified accordingly when a constrained model is used. Under constraint (11), the prior distribution in Eq. (17) becomes \({\varvec{{\varLambda }}}_{r\cdot } \sim {\mathcal {N}}_{\nu _r}({\varvec{0}},{\varvec{{\varOmega }}})\), independent for \(r = 1,\ldots ,p\). Under constraint (12), the prior distribution in Eq. (18) becomes \(\sigma _{r}^{-2} \sim {\mathcal {G}}(\alpha ,\beta )\), independent for \(r = 1,\ldots ,p\), while under constraint (13) it becomes \(\sigma _{k}^{-2} \sim {\mathcal {G}}(\alpha ,\beta )\), independent for \(k = 1,\ldots ,K\). Finally, under constraint (14), the prior distribution in Eq. (18) becomes \(\sigma ^{-2} \sim {\mathcal {G}}(\alpha ,\beta )\).
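The sketch below draws one set of parameters from the priors (15)–(19) for the unconstrained case. The values of \({\varvec{\xi }}\), \({\varvec{{\varPsi }}}\) and the remaining hyper-parameters are assumptions for illustration (the package defaults are given in “Appendix B”), and when \(\nu _r < q\) the covariance of row r is taken to be the leading \(\nu _r\times \nu _r\) block of \({\varvec{{\varOmega }}}\).

```r
## Illustrative draw from the priors (15)-(19), UUU case. Hyper-parameter values are
## assumptions for this sketch: alpha = beta = g = h = 0.5, gamma = 1/K, xi = 0, Psi = I_p.
set.seed(3)
K <- 20; p <- 10; q <- 2
gamma <- 1 / K
xi <- rep(0, p); Psi <- diag(p)
w      <- rgamma(K, shape = gamma); w <- w / sum(w)   # Dirichlet(gamma,...,gamma) via normalized Gammas
omega2 <- 1 / rgamma(q, shape = 0.5, rate = 0.5)      # Eq. (19): omega_l^{-2} ~ Gamma(g, h)
sigma2 <- matrix(1 / rgamma(K * p, 0.5, 0.5), K, p)   # Eq. (18): sigma_{kr}^{-2} ~ Gamma(alpha, beta)
mu     <- MASS::mvrnorm(K, mu = xi, Sigma = Psi)      # Eq. (16): mu_k ~ N_p(xi, Psi)
Lambda <- lapply(1:K, function(k) {                   # Eq. (17): row r ~ N_{nu_r}(0, Omega), nu_r = min(r, q)
  L <- matrix(0, p, q)
  for (r in 1:p) {
    nu <- min(r, q)
    L[r, 1:nu] <- rnorm(nu, mean = 0, sd = sqrt(omega2[1:nu]))
  }
  L
})
```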

3 Inference

This section describes the full conditional posterior distributions of model parameters and the corresponding MCMC sampler. Due to conjugacy, all full conditional posterior distributions are available in closed forms.

3.1 Full conditional posterior distributions

Let us define the following quantities:

$$\begin{aligned}&n_k = \sum _{i=1}^{n}I(z_i = k)\\&{\varvec{A}}_k = n_k{\varvec{{\varSigma }}}_k^{-1} + {\varvec{{\varPsi }}}^{-1} \\&{\varvec{B}}_k = {\varvec{{\varSigma }}}_k^{-1}\sum _{i=1}^{n}I(z_i=k)\left( {\varvec{x}}_i - {\varvec{{\varLambda }}}_k{\varvec{y}}_i\right) + {\varvec{{\varPsi }}}^{-1}{\varvec{\xi }}\\&{\varvec{\tau }}_{kr} = \frac{\sum _{i=1}^{n}I(z_i = k)(x_{ir} - \mu _{kr}){\varvec{y}}_i}{\sigma _{kr}^2}\\&{\varvec{{\varDelta }}}_{kr} = \frac{\sum _{i=1}^{n}I(z_i = k){\varvec{y}}_i{\varvec{y}}_i^{T}}{\sigma _{kr}^2}\\&s_{kr} = \sum _{i=1}^{n}I(z_i = k)\left( x_{ir} - \mu _{kr} - {\varvec{{\varLambda }}}_{kr\cdot }{\varvec{y}}_i\right) ^2\\&{\varvec{T}} = \sum _{k=1}^{K}\sum _{r=1}^{p}{\varvec{{\varLambda }}}_{kr\cdot }{\varvec{{\varLambda }}}_{kr\cdot }^{T}\\&{\varvec{M}}_k = \mathbf{I }_q + {\varvec{{\varLambda }}}^{T}_k{\varvec{{\varSigma }}}^{-1}_k{\varvec{{\varLambda }}}_k \end{aligned}$$

for \(k = 1,\ldots ,K\); \(r = 1,\ldots ,p\). For a generic sequence of the form \(\{G_{rc}; r \in {\mathcal {R}}, c\in {\mathcal {C}}\}\), we also define \(G_{\bullet c}=\sum _{r}G_{rc}\) and \(G_{r\bullet }=\sum _{c}G_{rc}\). Finally, \((x|\cdots )\) denotes the conditional distribution of x given the value of all remaining variables.

From Eqs. (6) and (7), it immediately follows that for \(k=1,\ldots ,K\)

$$\begin{aligned} \mathrm {P}(z_i = k|\cdots ) \propto w_kf\left( {\varvec{x}}_i;{\varvec{\mu }}_k,{\varvec{{\varLambda }}}_k{\varvec{{\varLambda }}}_k^{T} + {\varvec{{\varSigma }}}_k\right) , \end{aligned}$$
(20)

independent for \(i =1,\ldots ,n\), where \(f(\cdot ;{\varvec{\mu }},{\varvec{{\varSigma }}})\) denotes the probability density function of the multivariate normal distribution with mean \({\varvec{\mu }}\) and covariance matrix \({\varvec{{\varSigma }}}\). Note that in order to compute the right-hand side of the last equation, inversion of the \(p\times p\) matrix \({\varvec{{\varLambda }}}_k{\varvec{{\varLambda }}}_k^{T} + {\varvec{{\varSigma }}}_k\) is required. Using the Sherman–Morrison–Woodbury formula (see, e.g., Hager (1989)), the inverse matrix is equal to \({\varvec{{\varSigma }}}_k^{-1} - {\varvec{{\varSigma }}}_k^{-1}{\varvec{{\varLambda }}}_k {\varvec{M}}_{k}^{-1}{\varvec{{\varLambda }}}_k^T{\varvec{{\varSigma }}}^{-1}_k\), for \(k = 1,\ldots ,K\). The full conditional posterior distribution of mixing proportions is a Dirichlet distribution with parameters

$$\begin{aligned} {\varvec{w}}|\cdots \sim {\mathcal {D}}(\gamma + n_1,\ldots ,\gamma +n_K). \end{aligned}$$
(21)

The full conditional posterior distribution of the marginal mean per component is

$$\begin{aligned} {\varvec{\mu }}_k|\cdots \sim {\mathcal {N}}_p\left( {\varvec{A}}_k^{-1}{\varvec{B}}_k, {\varvec{A}}_k^{-1} \right) , \end{aligned}$$
(22)

independent for \(k = 1,\ldots ,K\).
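As an illustration of Eq. (22), the following R sketch (not the fabMix implementation) draws \({\varvec{\mu }}_k\) for a single component from its full conditional, using the quantities \({\varvec{A}}_k\) and \({\varvec{B}}_k\) defined above.

```r
## Sketch of the draw in Eq. (22) for one component k; x is n x p, y is n x q and
## z is the allocation vector. Sigma_k_inv and Psi_inv are (p x p) precision matrices.
draw_mu_k <- function(x, y, z, k, Lambda_k, Sigma_k_inv, xi, Psi_inv) {
  idx <- which(z == k)
  n_k <- length(idx)
  A_k <- n_k * Sigma_k_inv + Psi_inv
  resid_sum <- colSums(x[idx, , drop = FALSE] - y[idx, , drop = FALSE] %*% t(Lambda_k))
  B_k <- Sigma_k_inv %*% resid_sum + Psi_inv %*% xi
  A_inv <- solve(A_k)
  MASS::mvrnorm(1, mu = as.vector(A_inv %*% B_k), Sigma = A_inv)
}
```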

The full conditional posterior distribution of the factor loadings without any restriction is

$$\begin{aligned} {\varvec{{\varLambda }}}_{kr\cdot }|\cdots \sim {\mathcal {N}}_{\nu _r}\left( \left[ {\varvec{{\varOmega }}}^{-1} + {\varvec{{\varDelta }}}_{kr}\right] ^{-1}{\varvec{\tau }}_{kr},\left[ {\varvec{{\varOmega }}}^{-1} + {\varvec{{\varDelta }}}_{kr}\right] ^{-1}\right) , \end{aligned}$$
(23)

independent for \(k = 1,\ldots ,K\); \(r = 1,\ldots ,p\). Under constraint (11), we obtain that

$$\begin{aligned} {\varvec{{\varLambda }}}_{r\cdot }|\cdots \sim {\mathcal {N}}_{\nu _r}\left( \left[ {\varvec{{\varOmega }}}^{-1} + {\varvec{{\varDelta }}}_{\bullet r}\right] ^{-1}{\varvec{\tau }}_{\bullet r},\left[ {\varvec{{\varOmega }}}^{-1} + {\varvec{{\varDelta }}}_{\bullet r}\right] ^{-1}\right) , \end{aligned}$$
(24)

independent for \(r = 1,\ldots ,p\).

The full conditional distribution of error variance without any restriction is

$$\begin{aligned} \sigma _{kr}^{-2}|\cdots \sim {\mathcal {G}}\left( \alpha + n_k/2,\beta + s_{kr}/2\right) , \end{aligned}$$
(25)

independent for \(k = 1,\ldots ,K\); \(r = 1,\ldots ,p\). Under constraint (12), we obtain that

$$\begin{aligned} \sigma _{r}^{-2}|\cdots \sim {\mathcal {G}}(\alpha + n/2,\beta + s_{\bullet r}/2), \end{aligned}$$
(26)

independent for \(r = 1,\ldots ,p\). Under constraint (13), we obtain that

$$\begin{aligned} \sigma _{k}^{-2}|\cdots \sim {\mathcal {G}}(\alpha + n_kp/2,\beta + s_{k\bullet }/2), \end{aligned}$$
(27)

independent for \(k = 1,\ldots ,K\). Under constraint (14), we obtain that

$$\begin{aligned} \sigma ^{-2}|\cdots \sim {\mathcal {G}}(\alpha + np/2,\beta + s_{\bullet \bullet }/2). \end{aligned}$$
(28)

The full conditional distribution of latent factors is given by

$$\begin{aligned} {\varvec{y}}_i|\cdots \sim {\mathcal {N}}_q\left( {\varvec{M}}_{z_i}^{-1}{\varvec{{\varLambda }}}_{z_i}^{T}{\varvec{{\varSigma }}}^{-1}_{z_i}({\varvec{x}}_i - {\varvec{\mu }}_{z_i}), {\varvec{M}}_{z_i}^{-1} \right) , \end{aligned}$$
(29)

independent for \(i = 1,\ldots ,n\). Finally, the full conditional distribution for \(\omega _\ell \) is

$$\begin{aligned} \omega _\ell ^{-2}|\cdots \sim {\mathcal {G}}\left( g + Kp/2, h + T_{\ell \ell }/2\right) , \end{aligned}$$
(30)

while under constraint (11) we obtain that

$$\begin{aligned} \omega _\ell ^{-2}|\cdots \sim {\mathcal {G}}\left( g + p/2, h + T_{\ell \ell }/(2K)\right) , \end{aligned}$$
(31)

independent for \(\ell = 1,\ldots ,q\).
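The sketch below (illustrative values only, not the package code) verifies the Woodbury inversion used in Eq. (20) and then draws a latent factor vector from Eq. (29); both computations only require inverting the \(q\times q\) matrix \({\varvec{M}}_k\) rather than the \(p\times p\) covariance matrix.

```r
## Sketch linking Eqs. (20) and (29): invert Lambda Lambda' + Sigma via the
## Sherman-Morrison-Woodbury formula and draw a latent factor vector.
set.seed(4)
p <- 30; q <- 3
Lambda    <- matrix(rnorm(p * q), p, q)
Sigma_inv <- diag(1 / runif(p, 0.5, 2))              # Sigma is diagonal, hence trivial to invert
M <- diag(q) + t(Lambda) %*% Sigma_inv %*% Lambda    # M_k: q x q
woodbury_inv <- Sigma_inv -
  Sigma_inv %*% Lambda %*% solve(M) %*% t(Lambda) %*% Sigma_inv
max(abs(woodbury_inv %*% (Lambda %*% t(Lambda) + solve(Sigma_inv)) - diag(p)))  # ~ 0

mu <- rnorm(p); x_i <- rnorm(p, mean = mu)           # a single artificial observation
M_inv  <- solve(M)
y_mean <- M_inv %*% t(Lambda) %*% Sigma_inv %*% (x_i - mu)
y_i    <- MASS::mvrnorm(1, mu = as.vector(y_mean), Sigma = M_inv)   # Eq. (29)
```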

3.2 MCMC sampler

Given the number of factors (q) and a model parameterization, a Gibbs sampler (Geman and Geman 1984; Gelfand and Smith 1990) coupled with a prior parallel tempering scheme (Geyer 1991; Geyer and Thompson 1995; Altekar et al. 2004) is used in order to produce an MCMC sample from the joint posterior distribution. Each heated chain (\(j = 1,\ldots , {\texttt {nChains}}\)) corresponds to a model with the same likelihood as the original, but with a different prior distribution. Although the prior tempering can be imposed on any subset of parameters, it is only applied to the Dirichlet prior distribution of the mixing proportions (van Havre et al. 2015). The inference is based on the output of the first chain (\(j = 1\)) of the prior parallel tempering scheme (van Havre et al. 2015). The number of factors and the model parameterization are selected according to the Bayesian information criterion (BIC) (Schwarz 1978), conditional on the most probable number of alive clusters per model (see Papastamoulis (2018b) for a detailed comparison of BIC with other alternatives).

Let \({\mathcal {M}}\) and \({\mathcal {Q}}\) denote the sets of model parameterizations and numbers of factors, respectively. In the following pseudocode, \(x\leftarrow [y|z]\) denotes that x is updated by a draw from the distribution f(y|z) and \(\theta _j^{(t)}\) denotes the value of \(\theta \) at the t-th iteration of the j-th chain.

  1. For \((m,q) \in {\mathcal {M}}\times {\mathcal {Q}}\)

     (a) Obtain initial values \(\left( {\varvec{{\varOmega }}}_j^{(0)}, {\varvec{{\varLambda }}}_{m;j}^{(0)}, {\varvec{\mu }}^{(0)}_j, {\varvec{z}}^{(0)}_j, {\varvec{{\varSigma }}}_{m;j}^{(0)}, {\varvec{w}}^{(0)}_j, {\varvec{y}}^{(0)}_j\right) \) by running the overfitting initialization scheme, for \(j = 1,\ldots , {\texttt {nChains}}\).

     (b) For MCMC iteration \(t = 1,2,\ldots \) update

         i. For chain \(j = 1,\ldots , {\texttt {nChains}}\)

            A. \({\varvec{{\varOmega }}}^{(t)}_j \leftarrow \left[ {\varvec{{\varOmega }}}|{\varvec{{\varLambda }}}_{m;j}^{(t-1)}\right] \). If \(m \in \{\text{ UUU,UCU,UUC,UCC }\}\) use (30), else use (31).

            B. \({\varvec{{\varLambda }}}_{m;j}^{(t)} \leftarrow \left[ {\varvec{{\varLambda }}}|{\varvec{{\varOmega }}}_j^{(t)}, {\varvec{\mu }}_j^{(t-1)}, {\varvec{{\varSigma }}}_{m;j}^{(t-1)}, {\varvec{x}}, {\varvec{y}}_j^{(t-1)}, {\varvec{z}}_j^{(t-1)}\right] \). If \(m \in \{\text{ UUU,UCU,UUC,UCC }\}\) use (23), else use (24).

            C. \({\varvec{\mu }}_j^{(t)}\leftarrow \left[ {\varvec{\mu }}|{\varvec{{\varLambda }}}_{m;j}^{(t)},{\varvec{{\varSigma }}}_{m;j}^{(t-1)}, {\varvec{x}}, {\varvec{y}}_j^{(t-1)}, {\varvec{z}}_j^{(t-1)}\right] \) according to (22).

            D. \({\varvec{z}}_j^{(t)}\leftarrow \left[ {\varvec{z}}|{\varvec{w}}_j^{(t-1)},{\varvec{\mu }}_j^{(t)}, {\varvec{{\varLambda }}}_{m;j}^{(t)}, {\varvec{{\varSigma }}}_{m;j}^{(t-1)}, {\varvec{x}}\right] \) according to (20).

            E. \({\varvec{w}}_j^{(t)}\leftarrow \left[ {\varvec{w}}|{\varvec{z}}_j^{(t)}\right] \) according to (21) with prior parameter \(\gamma = \gamma _{(j)}\).

            F. \({\varvec{{\varSigma }}}_{m;j}^{(t)}\leftarrow \left[ {\varvec{{\varSigma }}}|{\varvec{x}}, {\varvec{z}}_j^{(t)}, {\varvec{\mu }}_j^{(t)}, {\varvec{{\varLambda }}}_{m;j}^{(t)}, {\varvec{y}}_j^{(t-1)}\right] \). If \(m \in \{\text{ UUU,CUU }\}\) use (25); else if \(m \in \{\text{ UCU,CCU }\}\) use (26); else if \(m \in \{\text{ UUC,CUC }\}\) use (27); else use (28).

            G. \({\varvec{y}}_j^{(t)}\leftarrow \left[ {\varvec{y}}|{\varvec{x}}, {\varvec{z}}_j^{(t)}, {\varvec{\mu }}_j^{(t)}, {\varvec{{\varSigma }}}_{m;j}^{(t)},{\varvec{{\varLambda }}}_{m;j}^{(t)}\right] \) according to (29).

         ii. Select randomly \(1\leqslant j^*\leqslant {\texttt {nChains}}-1\) and propose to swap the states of chains \(j^*\) and \(j^*+1\).

     (c) For chain \(j = 1\) compute the BIC, conditionally on the most probable number of alive clusters.

  2. Select the best \((m,q)\) model corresponding to chain \(j = 1\) according to the BIC and reorder the simulated output of the selected model according to the ECR algorithm, conditional on the most probable number of alive clusters.

The MCMC algorithm is initialized using random starting values arising from the “overfitting initialization” procedure introduced by Papastamoulis (2018b). For further details on steps 1.(a) (MCMC initialization) and 1.(b).ii (prior parallel tempering scheme), the reader is referred to “Appendix B” (see also Sections 2.6, 2.7 and 2.9 of Papastamoulis (2018b)).
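To make the swap move in step 1.(b).ii concrete, the following sketch computes the logarithm of the swap acceptance ratio under the assumption (as stated in Sect. 3.2) that the chains share the same likelihood and differ only in the Dirichlet prior parameter of the mixing proportions, so that all other terms cancel; this is a simplified illustration, not the fabMix implementation.

```r
## Simplified sketch of the prior parallel tempering swap: the Metropolis-Hastings
## log-ratio reduces to a ratio of symmetric Dirichlet prior densities evaluated at
## the current mixing proportions of the two chains.
log_ddirichlet <- function(w, gamma) {
  K <- length(w)
  lgamma(K * gamma) - K * lgamma(gamma) + sum((gamma - 1) * log(w))
}
swap_log_ratio <- function(w_j, w_jp1, gamma_j, gamma_jp1) {
  log_ddirichlet(w_jp1, gamma_j) + log_ddirichlet(w_j, gamma_jp1) -
    log_ddirichlet(w_j, gamma_j) - log_ddirichlet(w_jp1, gamma_jp1)
}
## the proposed swap is accepted with probability min(1, exp(swap_log_ratio(...)))
```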

Table 1 Arguments of the fabMix() function

4 Using the fabMix package

The main function of the fabMix package is fabMix(), with its arguments shown in Table 1. This function takes as input a matrix rawData of observed data where rows and columns correspond to observations and variables of the dataset, respectively. The parameters of the Dirichlet prior distribution (\(\gamma _{(j)}; j = 1,\ldots ,\text{ nChains }\)) of the mixing proportions are controlled by dirPriorAlphas. The range for the number of factors is specified in the q argument. Valid input for q is any positive integer vector between 1 and the Ledermann bound (Ledermann 1937) implied by the number of variables in the dataset. By default, all eight parameterizations are fitted; however, the user can specify in model any non-empty subset of them.

The fabMix() function simulates a total number of \({\texttt {nChains}}\times {\texttt {length(models)}}\times {\texttt {length(q)}}\) MCMC chains. For each parameterization and number of factors, the (nChains) heated chains are processed in parallel, while swaps between pairs of chains are proposed. Parallelization is also possible at the parameterization level, using the argument parallelModels: parallelModels models run in parallel, each of which runs its nChains chains in parallel, provided that the number of available threads is at least equal to \({\texttt {nChains}}\times {\texttt {parallelModels}}\). In order to parallelize our code, the doParallel (Revolution Analytics and Steve Weston 2015), foreach (Revolution Analytics and Steve Weston 2014) and doRNG (Gaujoux 2018) packages are imported.

The prior parameters \(g, h, \alpha , \beta \) in Eqs. (18) and (19) correspond to g, h, alpha_sigma and beta_sigma, respectively, with a (common) default value equal to 0.5. It is suggested to run the algorithm using normalize = TRUE, in order to standardize the data before running the MCMC sampler. The default behavior of our method is to normalize the data; thus, all reported estimates refer to the standardized dataset. In the case that the most probable number of mixture components is larger than 1, the ECR algorithm is applied in order to undo the label switching problem. Otherwise, the output is post-processed so that the generated parameters of the (single) alive component are switched to the first component of the overfitting mixture.

Table 2 Output returned to the user of the fabMix() function

The sampler will first run for warm_up iterations before starting to propose swaps between pairs of chains. By default, this stage consists of 5000 iterations. After that, each chain will run for a series of mCycles MCMC cycles, each one consisting of nIterPerCycle MCMC iterations (steps A, B, \(\ldots \), G of the pseudocode). The updates of the factor loadings according to (23) and (24) at step B of the pseudocode are implemented in compiled C++ code using the Rcpp and RcppArmadillo libraries (Eddelbuettel and François 2011; Eddelbuettel and Sanderson 2014). At the end of each cycle, a swap between a pair of chains is proposed.

Obviously, the total number of MCMC iterations is equal to \({\texttt {warm}}\_{\texttt {up}}+ {\texttt {mCycles}}\times {\texttt {nIterPerCycle}}\), and the first \({\texttt {warm}}\_{\texttt {up}}+ {\texttt {burnCycles}}\times {\texttt {nIterPerCycle}}\) iterations are discarded as burn-in. Given the default values of nIterPerCycle, warm_up and overfittingInitialization, choices between \(50\leqslant {\texttt {burnCycles}}\leqslant 500 < {\texttt {mCycles}}\leqslant 1500\) are typical in our implementation (see also the convergence analysis in Papastamoulis (2018b)).
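Putting the arguments described above together, a minimal call might look as follows; myData is a placeholder for any \(n\times p\) numeric matrix, the values shown are illustrative rather than recommended defaults, and the argument names follow Table 1.

```r
## A minimal sketch of a fabMix() call (illustrative values only).
library(fabMix)
set.seed(5)
fit <- fabMix(rawData = myData,                   # observations in rows, variables in columns
              model   = c("UUU", "CUC", "CCC"),   # any non-empty subset of the eight models
              Kmax    = 20,                       # components of the overfitting mixture
              q       = 1:3,                      # candidate numbers of factors
              nChains = 4,                        # heated chains (prior parallel tempering)
              mCycles = 600, burnCycles = 100,    # MCMC cycles retained / discarded
              parallelModels = 3,                 # parameterizations processed in parallel
              normalize = TRUE)                   # standardize the data before sampling
print(fit)
summary(fit)
plot(fit)                                         # e.g. BIC values (see the plot types in Sect. 5.1.1)
```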

While the function runs, some basic information is printed either on the screen (if parallelModels is not enabled) or in separate text files inside the output folder (otherwise), such as the progress of the sampler as well as the acceptance rate of the proposed swaps between chains. The output which is returned to the user is detailed in Table 2. The full MCMC output of the selected model is returned as a list (named mcmc) consisting of mcmc objects, a class imported from the coda package (Plummer et al. 2006). In particular, mcmc consists of the following:

  • y: object of class mcmc containing the simulated factors.

  • w: object of class mcmc containing the simulated mixing proportions of the alive components, reordered according to ECR algorithm.

  • Lambda: list containing objects of class mcmc with the simulated factor loadings of the alive components, reordered according to ECR algorithm. Note that this particular parameter is not identifiable due to sign switching across the MCMC trace.

  • mu: list containing objects of class mcmc with the simulated marginal means of the alive components, reordered according to ECR algorithm.

  • z: matrix of the simulated latent allocation variables of the mixture model, reordered according to ECR algorithm.

  • Sigma: list containing objects of class mcmc with the simulated variance of errors of the alive components, reordered according to ECR algorithm.

  • K_all_chains: matrix of the simulated values of the number of alive components per chain.

The user can call the print, summary and plot methods of the package in order to easily retrieve and visualize various summaries of the output, as exemplified in the next section.

5 Examples

This section illustrates our method. First, we demonstrate a typical implementation on two simulated datasets and explain the workflow in detail. Then we perform an extensive simulation study for assessing the ability of the proposed method to recover the correct clustering and compare our findings to the pgmm package (McNicholas and Murphy 2008, 2010; McNicholas et al. 2010, 2015). Application to four publicly available datasets is provided next.

Fig. 1 Simulated datasets of \(p = 30\) variables consisting of \(n = 300\) observations and \(K = 6\) clusters (dataset 1) and \(n = 200\), \(K = 2\) (dataset 2). The colors display the ground-truth classification of the data. (Color figure online)

5.1 Simulation study

We simulated synthetic data of \(p = 30\) variables consisting of \(n = 300\) observations and \(K = 6\) clusters (dataset 1) and \(n = 200\), \(K = 2\) (dataset 2), as shown in Fig. 1. Both of them were generated using MFA models with \(q = 2\) (dataset 1) and \(q = 3\) (dataset 2) factors. The two datasets exhibit different characteristics: the variance of errors per cluster (\({\varvec{{\varSigma }}}_k\)) is significantly larger in dataset 2 compared to dataset 1. In addition, the selection of factor loadings in dataset 2 results in a more complex covariance structure. The generating mechanism, described in detail in Papastamoulis (2018b), is available in the fabMix package via the simData() and simData2() functions, as shown below.

figure a

Next we estimate the eight overfitting Bayesian MFA models with \(K_{{\max }}=20\) mixture components, assuming that the number of factors ranges in the set \(1\leqslant q\leqslant 5\). The MCMC sampler runs nChains = 4 heated chains, each one consisting of mCycles = 700 cycles, while the first burnCycles = 100 are discarded. Recall that each MCMC cycle consists of nIterPerCycle = 10 usual MCMC iterations and that there is an additional warm-up period of the MCMC sampler (before starting to propose chain swaps) corresponding to 5000 usual MCMC iterations.

figure b

The argument parallelModels = 4 implies that four parameterizations will be processed in parallel. In addition, each model will use nChains = 4 threads to run in parallel the specified number of chains. Our job script used 16 threads so in this case the \({\texttt {parallelModels}}\times {\mathtt{nChains}} = 16\) jobs are efficiently allocated.

5.1.1 Methods for printing, summarizing and plotting the output

The print method for a fabMix.object displays some basic information for a given run of the fabMix function. The following output corresponds to the first dataset.

figure c

The following output corresponds to the print method applied to the second dataset.

figure d

We conclude that the selected models correspond to the UUC parameterization with \(K = 6\) clusters and \(q = 2\) factors for dataset 1 and \(K = 2\), \(q = 2\) for dataset 2. The selected number of clusters and factors for the whole range of eight models is displayed next, along with the estimated posterior probability of the number of alive clusters per model (K_MAP_prob), the value of the BIC for the selected number of factors (BIC_q) as well as the proportion of the accepted swaps between the heated MCMC chains in the last column. The frequency table of the estimated single best clustering of the datasets is displayed in the last field. We note that the labels of the frequency table correspond to the labels of the alive components of the overfitting mixture model, that is, components 4, 7, 13, 14, 15, and 17 for dataset 1 and components 6 and 20 for dataset 2. Clearly, these labels can be renamed to 1, 2, 3, 4, 5, 6 and 1, 2 respectively, but we prefer to retain the raw output of the sampler as a reminder of the fact that it corresponds to the alive components of the overfitted mixture model.

The summary method of the fabMix package summarizes the MCMC output for the selected model by calculating posterior means and quantiles for the mixing proportions, marginal means and the covariance matrix per (alive) cluster. A snippet of the output for dataset 2 is shown below.

figure e

The printed output is also returned to the user via s$posterior_means and s$quantiles.

Fig. 2 BIC values per parameterization and factor level using the plot(fabMix.object) method

The plot() method of the package generates the following types of graphics output:

  (1) Plot of the BIC values per factor level and parameterization.

  (2) Plot of the posterior means of the marginal means (\({\varvec{\mu }}_k\)) per (alive) cluster and highest density intervals of the corresponding normal distribution, along with its assigned data.

  (3) The coordinate projection plot of the mclust package (Fraley and Raftery 2002; Scrucca et al. 2017), that is, a scatterplot of the assigned data per cluster for each pair of variables.

  (4) Visualization of the posterior mean of the correlation matrix per cluster using the corrplot package.

  (5) The MAP estimate of the factor loadings (\({\varvec{{\varLambda }}}_k\)) per (alive) cluster.

The following commands produce plot (1) for datasets 1 and 2.

figure f

The produced plots are shown in Fig. 2. Note that each point in the plot is labeled by an integer, which corresponds to the MAP number of alive components for the specific combination of factors and parameterization.

Fig. 3 Marginal mean with \(95\%\) highest density interval and the corresponding assigned data per alive cluster using the plot(fabMix.object) method

The following commands produce plot (2) for datasets 1 and 2.

figure g

The created plots are shown in Fig. 3. The class_mfrow argument controls the rows and columns of the layout; it should consist of two integers whose product equals the selected number of (alive) clusters. In addition, a legend is placed at the bottom of the layout. The value(s) in the confidence argument draw the highest density interval(s) of the estimated normal distribution. Note that these plots display the original and not the scaled dataset which is used in the MCMC sampler. Therefore, the central curve and confidence limits displayed in this plot correspond to the mean and standard deviation (multiplied by the appropriate quantile of the standard normal distribution) of the random variables obtained by applying the inverse of the z-transformation to the MCMC estimates reported by the fabMix function.

Figure 4 visualizes the correlation matrix for the first cluster of each dataset, using the corrplot package. The argument \({\texttt {sig}}\_{\texttt {correlation}} = \alpha \) is used for marking cases where the equally tailed \((1-\alpha )\) Bayesian credible interval contains zero. The following commands generate the plots in Fig. 4.

figure h
Fig. 4 Correlation matrix for the first (alive) cluster of each dataset

Fig. 5 Adjusted Rand index (first row), estimated number of clusters (second row), estimated number of factors (third row) and selected parameterization (last row) for various replications of scenarios 1 and 2 with varying number of clusters and factors. In all cases, the sample size is drawn randomly in the set \(\{100, 200, \ldots ,1000\}\)

5.1.2 Assessing clustering accuracy and comparison with pgmm

In this section, we compare our findings against the ground truth in simulated datasets and also compare against the pgmm package, considering the same range of clusters and factors per dataset. For each combination of number of factors, components and parameterization, the pgmmEM() algorithm was initialized using three random starting values as well as the K-means clustering algorithm, that is, four different starts in total. Note that the number of different starts of the EM algorithm is set equal to the number of parallel chains in the MCMC algorithm. The input data are standardized in both algorithms.

As shown in Table 3, the adjusted Rand index (ARI) (Rand 1971) between fabMix and the ground-truth classification is equal to 1 and 0.98 for simulated datasets 1 and 2, respectively. The corresponding ARI for pgmm is equal to 0.98 and 0.88, respectively. In both cases, our method finds the correct number of clusters; however, pgmm overestimates K in dataset 1. Both methods select the UUC parameterization in dataset 1, but in dataset 2 different models are selected (UUC by fabMix and CUC by pgmm).
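For reference, the adjusted Rand index reported here can be computed from two label vectors as in the following sketch (the labels shown are hypothetical; we use the implementation in the mclust package, but any ARI implementation would do).

```r
## Sketch of the accuracy measure reported in Table 3: the adjusted Rand index
## between an estimated clustering and the ground-truth classification.
library(mclust)
z_true <- rep(1:2, each = 100)                 # ground-truth labels
z_est  <- c(rep(1, 95), rep(2, 105))           # a hypothetical estimated single best clustering
adjustedRandIndex(z_true, z_est)
```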

Table 3 Selected number of clusters, factors, parameterization and adjusted Rand index for simulated data 1 and 2

The selected number of factors equals 2 for both datasets; however, the “true” number of factors in dataset 2 equals 3. The underestimation of the number of factors in dataset 2 remains true for a wide range of similar data: in particular, we generated synthetic datasets with parameter values identical to those of dataset 2, each time increasing the sample size by 200 observations. We observed that the correct number of factors is returned when \(n \geqslant 1600\) for fabMix and \(n \geqslant 1800\) for pgmm.

Next we replicate the two distinct simulation procedures (according to the simData() and simData2() functions of the package) used to generate the previously described datasets, but considering \(1\leqslant K \leqslant 10\) (true number of clusters) and \(1\leqslant q \leqslant 3\) (true number of factors). The number of variables remains the same as before, that is \(p=30\), and the sample size is drawn uniformly at random from the set \(\{100,200,\ldots ,1000\}\). We will use the terms ‘scenario 1’ and ‘scenario 2’ to label the two different simulation procedures. In scenario 1, the diagonal of the variance of errors is generated as \(\sigma _{kr}^{2} = 1 + 20\log (k+1)\), \(r = 1,\ldots ,p\), whereas in scenario 2: \(\sigma _{kr}^{2} = 1 + u_r\log (k+1)\), where \(u_r\sim \hbox {Uniform}(500,1000)\), \(r = 1,\ldots ,p\); \(k=1,\ldots ,K\). In general, scenario 1 generates datasets with well-separated clusters. On the other hand, the amount of error variance in scenario 2 makes the clusters less separated. For a given simulated dataset with \(K_{\text {true}}\) clusters and \(q_{\text {true}}\) factors, the total number of components in the overfitting mixture model (fabMix), as well as the maximum number of components fitted by pgmm, is set equal to \(K_{{\max }} = K_{\text {true}} + 6\), and the number of factors ranges between \(1\leqslant q \leqslant q_{\text {true}} + 2\). These bounds are selected in order to speed up computation without introducing any bias in the resulting inference (as confirmed by a smaller pilot study). For each scenario, 500 datasets were simulated.

The main findings of the simulation study are illustrated in Fig. 5. Note that in scenario 1 fabMix almost always finds the correct clustering structure: the boxplots of the adjusted Rand index are centered at 1, and, on the second row, the boxplots of the estimated number of clusters are centered at the corresponding true value. On the other hand, observe that for \(K\geqslant 6\) pgmm has the tendency to overestimate the number of clusters. In the more challenging scenario 2, the estimates of the number of clusters exhibit larger variability. However, note that for \(K=8,9,10\) the number of clusters selected by fabMix is closer to the true value than that selected by pgmm, a fact which is also reflected in the ARI, where fabMix tends to have larger values than pgmm. For both scenarios, the estimation of the number of factors is in strong agreement between the two methods, as shown in the third row of Fig. 5. The last row shows the selected parameterization. Observe that the results are fairly consistent between the two methods.

Finally, we note that in the presented simulation study, the generated clusters have equal sizes (on average). The reader is referred to “Appendix C” for exploring the performance of the compared methods in the presence of small and large clusters with respect to the size of the available data (n).

5.2 Publicly available datasets

In this section, we analyze four publicly available datasets: a subset of the wave dataset (Breiman et al. 1984; Lichman 2013) available in the fabMix package, the wine dataset (Forina et al. 1986) available in the pgmm package, the coffee dataset (Streuli 1973) available in the pgmm package, and the standardized yeast cell cycle data (Cho et al. 1998) available at http://faculty.washington.edu/kayee/model/. Note that Papastamoulis (2018b) analyzed the first three datasets but only considered the UUU and UCU parameterizations for fabMix.

The coffee dataset consists of \(n=43\) coffee samples of \(p = 12\) variables, collected from beans corresponding to the Arabica and Robusta species (thus, \(K = 2\)). The wave dataset consists of a randomly sampled subset of 1500 observations from the original wave data (Breiman et al. 1984), available from the UCI machine learning repository (Lichman 2013). According to the available ground-truth classification of the dataset, there are three equally weighted underlying classes of 21-dimensional continuous data. The wine dataset (Forina et al. 1986), available in the pgmm package (McNicholas et al. 2015), contains \(p=27\) variables measuring chemical and physical properties of \(n = 178\) wines, grouped into three types (thus, \(K = 3\)). The reader is referred to McNicholas and Murphy (2008), Papastamoulis (2018b) for more detailed descriptions of the data.

The yeast cell cycle data (Cho et al. 1998) quantify gene expression levels over two cell cycles (17 time points). The dataset has previously been used for evaluating the effectiveness of model-based clustering techniques (Yeung et al. 2001). We used the standardized subset of the 5-phase criterion, containing \(n = 384\) genes measured at \(p = 17\) time points. The expression levels of the \(n=384\) genes peak at different time points corresponding to the five phases of cell cycle, so this five-class partition of the data is used as the ground-truth classification.

Table 4 Selected number of clusters, factors, parameterization and adjusted Rand index for the publicly available data
Fig. 6 Marginal mean with \(95\%\) highest density interval and the corresponding assigned data per alive cluster for the yeast cell cycle data

Fig. 7 Total time needed for fitting the eight parameterizations considering \(q =1,\ldots ,5\) (40 models in total) for various levels of sample size (n) and number of variables (p). We considered \(K_{{\max }} = 20\) components in fabMix and \(1\leqslant K\leqslant 20\) in pgmm. Each parameterization is fitted in parallel using eight threads. No multiple runs (pgmm) or parallel chains (fabMix) are considered. The MCMC algorithm in fabMix ran for 12,000 iterations. The bars display averaged wall clock run times across five replicates

We applied our method using the eight parameterizations of overfitting mixtures with Kmax = 20 components for \(1\leqslant q\leqslant q_{{\max }}\) factors using nChains = 4 heated chains. We set \(q_{{\max }} = 5\) for the coffee, wave and wine datasets, while \(q_{{\max }} = 10\) for the yeast cell cycle dataset. The number of MCMC cycles was set to mCycles = 1100, while the first burnCycles = 100 were discarded as burn-in. The eight parameterizations are processed in parallel on parallelModels = 4 cores, while each heated chain of a given parameterization is also running in parallel. All other prior parameters were fixed at their default values.

We have also applied pgmm considering the same range of clusters and factors per dataset. For each combination of number of factors, components and parameterization, the EM algorithm was initialized using five random starting values as well as the K-means clustering algorithm, that is, six different starts in total. For the coffee dataset, a larger number of different starts is required, as discussed in Papastamoulis (2018b).

Table 4 summarizes the results for each of the publicly available datasets. We conclude that fabMix performs better than pgmm on the coffee and yeast datasets. In the wine dataset, on the other hand, pgmm performs better than fabMix, but we underline the improved performance of our method compared to the one reported by Papastamoulis (2018b), where only the UUU and UCU parameterizations were fitted. The two methods are in agreement on the wave dataset. The plot command of the fabMix package displays the estimated clusters according to the CUU model with six factors for the yeast dataset, as shown in Fig. 6.

6 Discussion and further remarks

This study offered an efficient Bayesian methodology for model-based clustering of multivariate data using mixtures of factor analyzers. The proposed model extended the ideas of Papastamoulis (2018b) building upon the previously introduced set of parsimonious Gaussian mixture models (McNicholas and Murphy 2008; McNicholas et al. 2010). The additional parameterizations improved the performance of the proposed method compared to Papastamoulis (2018b) where only two out of eight parameterizations were available. Furthermore, our contributed R package makes the proposed method available to a wider audience of researchers.

The computational cost of our MCMC method is larger than that of the EM algorithm, as shown in Fig. 7. Of course, when only a point estimate is required, the EM algorithm is the quickest solution. When a point estimate is not sufficient, our method offers an attractive Bayesian treatment of the problem. Moreover, the Bayesian approach shows further advantages (as in the simulated datasets of scenario 1, as well as in the coffee and yeast datasets) in cases where the multimodality of the likelihood potentially causes the EM algorithm to converge to local maxima.

A direction for future research is to generalize the method in order to automatically detect the number of factors in a fully Bayesian manner. This is possible by, for example, treating the number of factors as a random variable and implementing a reversible jump mechanism in order to update it inside the MCMC sampler. Another possibility would be to incorporate strategies for searching the space of sparse factor loading matrices allowing posterior inference for factor selection (Bhattacharya and Dunson 2011; Mavridis and Ntzoufras 2014; Conti et al. 2014). Recent advances on infinite mixtures of infinite factor models (Murphy et al. 2019) also allow for direct inference of the number of clusters and factors and could boost the flexibility of our modeling approach.