1 Introduction

Factor analysis (FA) models and finite mixture (FM) models are both popular statistical techniques with wide application to the analysis of data and the extraction of hidden or latent variables. In an FA model, the covariance relationships between variables are explained by a smaller number of latent variables or latent factors, which can be used to simplify analysis in high-dimensional settings or to establish common themes or constructs (e.g., in psychometric testing). The FA model has wide application in fields such as the social sciences, biology, medical sciences and epidemiological studies. An FM model is also a latent variable model and can represent the presence of subpopulations within an overall population, without requiring that the observed data identify the subpopulation to which an individual observation belongs [for further details see, e.g., McLachlan and Peel (2000) and references therein, and Lee and McLachlan (2013a, b)].

A combination of the FA and FM models based on the Gaussian distribution was first studied by Hinton et al. (1997) and Ghahramani and Hinton (1997), and is commonly called the Gaussian mixture of factor analyzers (MFA) model. However, in many applied problems the data may be moderately or severely skewed, and even small departures from normality can result in seriously misleading inference (Wall et al. 2012). In practice, the imposition of symmetry on the components of a mixture model may be a fairly restrictive condition. For example, Lin et al. (2007), Maleki and Arellano-Valle (2017) and Maleki et al. (2018a) argue that the normal mixture model tends to over-fit when additional components are included to capture skewness and that, in some cases, increasing the number of components may lead to computational difficulties (e.g., a small number of observations belonging to a group) and difficulties in interpreting the results (see the discussion and results in Murray et al. 2014).

In recent years, there has been much research on asymmetric families of distributions which contain the Gaussian family as a special (symmetric) case. For example, the class of skew-normal distributions studied by Azzalini (1985), Azzalini and Dalla Valle (1996), Azzalini and Capitanio (1999) and Arellano-Valle and Azzalini (2006) has wide application in many statistical models (for more asymmetric distributions and their applications see Azzalini 2014). In particular, Azzalini and Dalla Valle (1996) and Sahu et al. (2003) studied the so-called restricted multivariate skew-normal (rMSN or rSN) distribution, which is suitable for analyzing skewed as well as symmetrically distributed multivariate data. Recently, Lin et al. (2016) applied the rMSN distribution to the structure of the MFA model, hereafter called the mixtures of skew-normal factor analyzers (MSNFA) model. In the rMSN distribution, skewness is controlled by a vector of skewness parameters multiplied by a common skewing variable in its convolution-type representation. An alternative formulation of skew distributions is through the use of so-called unrestricted forms, in which there is no reliance on a common skewing variable and skewness is allowed to be represented in more than one direction [in contrast to a single direction for the restricted case; see e.g. Lee and McLachlan (2013b), Maleki et al. (2018b)]. However, this extra flexibility can lead to identifiability issues (the skewness matrix is not rotation invariant), and there are also issues in terms of computational tractability. These forms and computational issues could be explored in future research.

In the MSNFA model, the latent component factors follow the family of rMSN distributions in an attempt to model the data adequately in the presence of skewed sub-populations. The MSNFA model provides a novel approach to dimension reduction and an appropriate representation of non-normal data. Lin et al. (2016) used an EM-type algorithm to obtain the maximum likelihood (ML) estimates of the proposed model parameters and estimated the factor scores as by-products of the estimation procedure.

Most estimation methods for MFA models are based on classical maximum likelihood (ML) inference. However, the likelihood function of the MFA and FM models can be unbounded for some samples, which is a problematic issue, so some researchers have considered the Bayesian approach for estimating the Gaussian MFA model. Bishop (1999) proposed a partial Bayesian framework for the mixture of principal component analyzers (PCA) model, which is an isotropic version of the MFA model based on the Gaussian distribution. Bishop (1999) used maximum a posteriori (MAP) estimation with a simple Gaussian prior for the factor loadings and, using approximate Bayesian inference, derived an algorithm for estimating the hyper-parameters (the parameters of the prior). Ghahramani and Beal (2000) proposed an efficient and deterministic variational approximation to full Bayesian integration of the Gaussian MFA model parameters. Utsugi and Kumagai (2001) introduced a full prior on all parameters of the Gaussian MFA model by using conjugate priors.

More recent extensions to the Gaussian MFA model in a Bayesian framework include a matrix variate t distribution for the factor scores (Ando 2009) and normal/independent distributions for the error term to allow for outliers and a robust specification (Lee and Xia 2008a). A number of other extensions have focused on semi-parametric models (Yang and Dunson 2010; Lee and Xia 2008b; Song et al. 2010; Murray et al. 2013), non-parametric approaches (Chen et al. 2010; Paisley and Carin 2009) and flexible prior distributions (Ghosh and Dunson 2009). Other extensions have focused on exploiting prior distributions or prior information for sparse applications in high-dimensional settings (Carvalho et al. 2008; Knowles and Ghahramani 2007; Paisley and Carin 2009; Bhattacharya and Dunson 2011) and in a dynamic time series context (Chen et al. 2011). A difficulty with some of the more flexible semi-parametric or non-parametric models proposed for factor analysis is that they often sacrifice interpretation, parsimony and computational simplicity. This is particularly an issue in the factor analysis context, where various forms of the factor loading matrix can be derived and where simplicity and interpretability are often appealing to users (see, e.g., Frühwirth-Schnatter and Lopes 2012; Conti et al. 2014).

There are several other computational advantages of using a Bayesian approach for mixtures of factor analysis models (compared to ML estimation), including the use of prior information or the specification of prior distributions to regularize the parameter space, particularly in high-dimensional settings (Carvalho et al. 2008) and/or in cases where there is considerable noise (e.g., imaging data). In particular, Suarez and Ghosal (2016) examine the performance of placing a prior distribution on the error term of a principal components approach for functional data, with the degree of informativeness or smoothing determined by a priori knowledge or derived empirically. We note that this information is relatively easily included in a Bayesian model without introducing additional computational demands or complexity. Finally, we note that the number of components and factors could be allowed to vary and be updated as part of the computational approach (Frühwirth-Schnatter and Lopes 2012).

Extensions to the more general case of structural equation modeling are also relatively easier than in the ML setting (e.g., Lee and Xia 2008a). Further extensions to allow for the influence of missing data on parameter estimates are also quite natural in a Bayesian setting, as various patterns of missing data (e.g., class-dependent missingness) can be imputed at each MCMC iteration from the posterior predictive distribution (e.g., using a mixture model defined in open source software such as JAGS or NIMBLE). Computation of the standard error or uncertainty of parameter estimates also does not rely on asymptotic approximations to the observed information matrix, which require a large sample size, or on a bootstrap method, which requires a very large amount of computation (Basso et al. 2010).

In this paper, we consider the MSNFA model of Lin et al. (2016) and propose Bayesian inference with full priors on all model parameters. This parametric model has several desirable properties, including the representation of the symmetric MFA model as a special case. The distribution also has a convenient hierarchical representation which leads to closed-form conditional posteriors and facilitates computation using a Gibbs sampler MCMC algorithm to estimate the model parameters. To illustrate the flexibility of the Bayesian approach in this setting, we also consider the performance of the model in missing data settings.

The paper is organized as follows. In Sect. 2, we provide a review and background to the rSN distribution and the MSNFA model. Section 3 presents a Bayesian analysis of the MSNFA and details of the Gibbs sampling algorithm. In Sect. 4, we illustrate the performance of the proposed model on real datasets. Finally, in Sect. 5, we present our main conclusions and discuss possible extensions and areas of further research.

2 A review of the multivariate rSN family and MSNFA model

In this section we begin with a brief review of the multivariate restricted skew-normal (rMSN) family introduced and studied by Azzalini and Dalla Valle (1996) and Lee and McLachlan (2013b). We then outline details of the mixture of factor analyzers model based on the rMSN family.

2.1 The multivariate restricted skew normal family

A q-dimensional random vector \( \varvec{X} \) following an rMSN distribution with q-dimensional location vector \( \varvec{\mu} \), q × q positive definite dispersion matrix \( {\varvec{\Sigma}} \), and q-dimensional skewness parameter vector \( \varvec{\lambda} \), denoted by \( \varvec{X} \sim rSN_{q} \left( {\varvec{\mu},{\varvec{\Sigma}},\varvec{\lambda}} \right) \), can be constructed stochastically by

$$ \varvec{X} =\varvec{\mu}+\varvec{\lambda}W + {\varvec{\Sigma}}^{1/2} \varvec{V}, $$
(1)

where \( W = \left| {V_{0} } \right| \) is the absolute value of \( V_{0} \sim N_{1} \left( {0,1} \right) \) and independent of \( \varvec{V }\sim N_{q} \left( {0,\varvec{ I}_{q} } \right) \). Note that \( E\left( \varvec{X} \right) =\varvec{\mu}+ c\varvec{\lambda} \) and \( {\text{Cov}}\left( \varvec{X} \right) = {\varvec{\Sigma}} + \left( {1 - c^{2} } \right)\varvec{\lambda \lambda }^{{ \top }} \), where \( c = \sqrt {2/\pi } \).
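The convolution representation in (1) can be used directly to simulate rSN draws; the following is a minimal R sketch of this construction (the function name rsn_sim and the parameter values in the example are our own illustrative choices, not part of the original implementation).

```r
# Minimal sketch: simulate from rSN_q(mu, Sigma, lambda) via the
# convolution representation in Eq. (1) (function name is illustrative)
rsn_sim <- function(n, mu, Sigma, lambda) {
  q <- length(mu)
  W <- abs(rnorm(n))                              # W = |V0|, V0 ~ N(0, 1)
  V <- matrix(rnorm(n * q), n, q)                 # V ~ N_q(0, I_q)
  S_half <- t(chol(Sigma))                        # one choice of Sigma^{1/2}
  matrix(mu, n, q, byrow = TRUE) + W %*% t(lambda) + V %*% t(S_half)
}

# Example: 5000 skewed draws in q = 2 dimensions
set.seed(1)
X <- rsn_sim(5000, mu = c(0, 0), Sigma = diag(2), lambda = c(2, -1))
colMeans(X)   # approximately mu + c * lambda, with c = sqrt(2/pi)
```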

Considering the stochastic representation of \( \varvec{X} \sim rSN_{q} \left( {\varvec{\mu},{\varvec{\Sigma}},\varvec{\lambda}} \right) \) leads to the following probability density function (pdf)

$$ f\left( {\varvec{x}|\varvec{\mu},{\varvec{\Sigma}},\varvec{\lambda}} \right) = 2\phi_{q} \left( {\varvec{x}|\varvec{\mu},{\varvec{\Omega}}} \right)\varPhi_{1} \left( {\sigma^{ - 1}\varvec{\lambda}^{{ \top }} {\varvec{\Omega}}^{ - 1} \left( {\varvec{x} -\varvec{\mu}} \right)} \right) , \quad \varvec{x} \in \varvec{R}^{q} , $$
(2)

where \( {\varvec{\Omega}} = {\varvec{\Sigma}} + \varvec{\lambda \lambda }^{{ \top }} \), \( \sigma^{2} = 1 -\varvec{\lambda}^{{ \top }} {\varvec{\Omega}}^{ - 1}\varvec{\lambda}= \left( {1 +\varvec{\lambda}^{{ \top }} {\varvec{\Sigma}}^{ - 1}\varvec{\lambda}} \right)^{ - 1} \), and \( \phi_{q} \left( { \cdot |\varvec{\mu},{\varvec{\Omega}}} \right) \) and \( \varPhi_{1} \left( \cdot \right) \) are, respectively, the probability density function (pdf) of the multivariate normal distribution \( N_{q} \left( {\varvec{\mu},{\varvec{\Omega}}} \right) \) and the cumulative distribution function (cdf) of the standard univariate normal distribution.
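For completeness, the density in (2) can be evaluated directly; the sketch below is a minimal R implementation (the function name dsn_restricted is our own, and the mvtnorm package is assumed for the multivariate normal pdf).

```r
# Minimal sketch: evaluate the rSN_q density in Eq. (2)
# (function name is illustrative; mvtnorm is assumed for dmvnorm)
library(mvtnorm)

dsn_restricted <- function(x, mu, Sigma, lambda, log = FALSE) {
  Omega     <- Sigma + lambda %*% t(lambda)
  Omega_inv <- solve(Omega)
  sigma2    <- 1 - as.numeric(t(lambda) %*% Omega_inv %*% lambda)
  logdens   <- log(2) +
    dmvnorm(x, mean = mu, sigma = Omega, log = TRUE) +
    pnorm(as.numeric(t(lambda) %*% Omega_inv %*% (x - mu)) / sqrt(sigma2),
          log.p = TRUE)
  if (log) logdens else exp(logdens)
}

# Example evaluation at a single point
dsn_restricted(c(0.5, -0.2), mu = c(0, 0), Sigma = diag(2), lambda = c(2, -1))
```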

Also the random vector \( \varvec{X} \sim rSN_{q} \left( {\varvec{\mu},{\varvec{\Sigma}},\varvec{\lambda}} \right) \) has the following hierarchical representation:

$$ \begin{aligned} \left. \varvec{X} \right|W = w & \sim N_{q} \left( {\varvec{\mu}+\varvec{\lambda}w,{\varvec{\Sigma}}} \right), \\ W & \sim HN_{1} \left( {0,1} \right), \\ \end{aligned} $$
(3)

where \( HN_{1} \left( {0,1} \right) \) denotes the standard half-normal distribution, i.e., the distribution of the absolute value of a standard normal random variable.

For more details about this family of distributions (including the mean, variance, moment generating function, and other interesting properties), see, e.g., Lee and McLachlan (2013b), Lin et al. (2016) and Maleki et al. (2018a, b).

2.2 The mixture of restricted skew-normal factor model

Lin et al. (2016) introduced a generalization of traditional factor analysis (FA), called the restricted skew-normal factor analysis model. Given a p-dimensional random sample \( \varvec{Y} = \left\{ {\varvec{Y}_{1} , \ldots ,\varvec{Y}_{n} } \right\} \), a location vector \( \varvec{\mu} \) and a \( p \times q \) matrix of factor loadings \( \varvec{L} \), factor analysis finds uncorrelated symmetric/asymmetric \( q \)-dimensional \( \left( {q < p} \right) \) vectors of latent factors \( \varvec{F}_{1} , \ldots ,\varvec{F}_{n} \) that explain a large amount of the variability in the data, where \( \epsilon_{1} , \ldots ,\epsilon_{n} \) are \( p \)-dimensional vectors of Gaussian errors. The factor analysis model for \( j = 1, \ldots ,n \) can be written as

$$ \varvec{Y}_{j} =\varvec{\mu}+ \varvec{LF}_{j} + \epsilon_{j} , $$
(4a)

for which the latent factors and model errors are independently distributed as:

$$ \varvec{F}_{j} {\mathop{\sim}\limits^{iid}} rSN_{q} \left( { - c{\varvec{\Delta }}^{ - 1/2} \varvec{\lambda }, {\varvec{\Delta }}^{ - 1 } ,{\varvec{\Delta }}^{ - 1/2} \varvec{\lambda }} \right), \quad \varvec{\epsilon }_{j} {\mathop{\sim}\limits^{iid}} N_{p} \left( {{\mathbf{0}},\varvec{D}} \right), $$
(4b)

where \( c = \sqrt {2/\pi } \), the scale matrix \( {\varvec{\Delta}} = \varvec{I}_{q} + \left( {1 - c^{2} } \right)\varvec{\lambda \lambda }^{{ \top }} \), \( \varvec{D} = {\text{diag}}\left( {D_{1} , \ldots ,D_{p} } \right) \) is a positive diagonal matrix and SN denotes a skew-normal distribution. Note that \( E\left[ {\varvec{F}_{j} } \right] = \varvec{0} \), \( {\text{Cov}}\left[ {\varvec{F}_{j} } \right] = \varvec{I}_{q} \), \( E\left[ {\epsilon_{j} } \right] = \varvec{0} \) and, consequently, \( E\left[ {\varvec{Y}_{j} } \right] =\varvec{\mu} \) and \( {\text{Cov}}\left[ {\varvec{Y}_{j} } \right] = \varvec{LL}^{{ \top }} + \varvec{D} \). We will refer to this model as the SNFA model, and by Proposition 3 of Lin et al. (2016),

$$ \varvec{Y}_{j} \sim rSN_{p} \left( {\varvec{\mu}- c\varvec{\alpha}, {\varvec{\Sigma}},\varvec{\alpha}} \right), \quad j = 1, \ldots ,n, $$
(5)

where \( \varvec{\alpha}= \varvec{L}{\varvec{\Delta}}^{ - 1/2}\varvec{\lambda} \) and \( {\varvec{\Sigma}} = \varvec{L}{\varvec{\Delta}}^{ - 1} \varvec{L}^{ \top } + \varvec{D} \). To ensure identifiability of the SNFA model (4a, b), we constrain the loading matrix \( \varvec{L} \) so that its upper-right triangle is zero and its diagonal entries are strictly positive (Fokoué and Titterington 2003; Lopes and West 2004; Lin et al. 2016). At times these conditions may be too restrictive and may influence the ordering of the factors, so alternative formulations have been examined in Leung and Drton (2016), Frühwirth-Schnatter and Lopes (2012) and Conti et al. (2014).
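To make the construction in (4a, 4b) concrete, the following R sketch simulates from a single SNFA component and checks the stated moments of the latent factors empirically (it reuses the illustrative rsn_sim function above; all parameter values are arbitrary assumptions).

```r
# Minimal sketch: simulate n observations from a single SNFA component,
# Eqs. (4a)-(4b), and check E[F] = 0, Cov[F] = I empirically.
# Reuses the illustrative rsn_sim() above; all parameter values are arbitrary.
mat_inv_sqrt <- function(M) {                    # symmetric inverse square root
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow(M)) %*% t(e$vectors)
}

set.seed(2)
n <- 2000; p <- 4; q <- 2
cc          <- sqrt(2 / pi)
lambda      <- c(1.5, -0.5)                      # factor skewness vector
Delta       <- diag(q) + (1 - cc^2) * lambda %*% t(lambda)
Delta_mhalf <- mat_inv_sqrt(Delta)               # Delta^{-1/2}
L <- matrix(c(1,   0,                            # lower-triangular loadings with
              0.8, 1,                            # positive diagonal (identifiability)
              0.5, 0.3,
              0.2, 0.7), p, q, byrow = TRUE)
mu <- rep(0, p)

Fj  <- rsn_sim(n, mu = as.numeric(-cc * Delta_mhalf %*% lambda),
               Sigma = solve(Delta),
               lambda = as.numeric(Delta_mhalf %*% lambda))
eps <- matrix(rnorm(n * p, sd = sqrt(0.2)), n, p)   # errors with D = 0.2 * I_p
Y   <- matrix(mu, n, p, byrow = TRUE) + Fj %*% t(L) + eps

round(colMeans(Fj), 2)   # approximately 0
round(cov(Fj), 2)        # approximately I_q
```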

Lin et al. (2016) generalized the SNFA model to its corresponding mixture model, called the mixture of restricted skew-normal factor analyzers (MSNFA) model, with the following details. Let \( \varvec{Y}_{j} = \left( {Y_{j1} , \ldots ,Y_{jp} } \right)^{{ \top }} , j = 1, \ldots ,n \), be p-dimensional vectors of p feature variables, where each \( \varvec{Y}_{j} \) arises from one of a finite number of groups. The latent membership-indicator variables \( \varvec{Z}_{1} , \ldots ,\varvec{Z}_{n} \) indicate which component each observation belongs to. In detail, \( Z_{ij} = \left( {\varvec{Z}_{j} } \right)_{i} \) for \( i = 1, \ldots ,{\text{g}} \) and \( j = 1, \ldots ,n \) is one or zero according to whether \( \varvec{Y}_{j} \) belongs or does not belong to the i-th component. These latent variables follow a multinomial distribution denoted by \( \varvec{Z}_{1} , \ldots ,\varvec{Z}_{n} \text{ }\sim{ \mathcal{M}}\left( {1;\varvec{\pi}} \right) \), where \( \varvec{\pi}= \left( {\pi_{1} , \ldots ,\pi_{\text{g}} } \right)^{{ \top }} \), with marginal probability mass function (pmf) given by

$$ P\left( {\varvec{Z}_{j} ;\varvec{\pi}} \right) = \pi_{1}^{{z_{1j} }} \pi_{2}^{{z_{2j} }} \ldots \pi_{\text{g}}^{{z_{{{\text{g}}j}} }} , \quad j = 1, \ldots ,n , \quad {\text{subject to}} \mathop \sum \limits_{i = 1}^{\text{g}} \pi_{i} = 1 \quad {\text{and }}\pi_{i} > 0, \quad i = 1, \ldots ,{\text{g}}. $$

So given \( Z_{ij} = 1 \), \( \varvec{Y}_{j} \) has the structure

$$ \varvec{Y}_{j} =\varvec{\mu}_{i} + \varvec{L}_{i} \varvec{F}_{ij} + \epsilon_{ij} ,\quad {\text{with probability}}\,\pi_{i}, $$
(6)

where the latent factors and model errors are independently distributed as

\( \varvec{F}_{ij} {\mathop{\sim}\limits^{ind}} rSN_{q} \left( { - c{\varvec{\Delta}}_{i}^{ - 1/2} \varvec{\lambda }_{i} , {\varvec{\Delta }}_{i}^{ - 1} , {\varvec{\Delta }}_{i}^{ - 1/2} \varvec{\lambda }_{i} } \right), \epsilon_{ij} {\mathop{\sim}\limits^{ind}} N_{p} \left( {\varvec{0},\varvec{D}_{i} } \right) \), for which \( {\varvec{\Delta}}_{i} = \varvec{I}_{q} + \left( {1 - c^{2} } \right)\varvec{\lambda}_{i}\varvec{\lambda}_{i}^{{ \top }} \) and positive diagonal matrix \( \varvec{D}_{i} = {\text{diag}}\left( {D_{i1} , \ldots ,D_{ip} } \right) \) for \( j = 1, \ldots ,n \) and \( i = 1, \ldots ,{\text{g}} \).

The density of \( \varvec{Y}_{j} \) is then

$$ f\left( {\varvec{y}_{j} |{\varvec{\Theta}}} \right) = \mathop \sum \limits_{i = 1}^{\text{g}} \pi_{i} f_{i} \left( {\varvec{y}_{j} |\varvec{\theta}_{i} } \right);\quad j = 1, \ldots ,n, $$
(7)

where \( f_{i} \left( {\varvec{y}_{j} |\varvec{\theta}_{i} } \right) \) is the pdf of the i-th SNFA component in (6), given by (5) and (2), \( \varvec{\theta}_{i} = \left( {\varvec{\mu}_{i} ,\varvec{L}_{i} ,\varvec{D}_{i} ,\varvec{\lambda}_{i} } \right) \), and \( {\varvec{\Theta}} = \left( {\pi_{1} , \ldots ,\pi_{{{\text{g}} - 1}} ,\varvec{\theta}_{1} , \ldots ,\varvec{\theta}_{\text{g}} } \right) \). Therefore, the log-likelihood function for model (6) is given by

$$ {\mathcal{L}}\left( {{\varvec{\Theta}}|\varvec{y}} \right) = \mathop \sum \limits_{j = 1}^{n} \log \left( {\mathop \sum \limits_{i = 1}^{\text{g}} \pi_{i} f_{i} \left( {\varvec{y}_{j} |\varvec{\theta}_{i} } \right)} \right). $$
(8)
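For reference, the observed-data log-likelihood (8) can be evaluated by combining (5) and (2); the R sketch below does so, reusing the illustrative dsn_restricted and mat_inv_sqrt helpers defined earlier (the list structure used for the parameters is our own assumption, not the authors' code interface).

```r
# Minimal sketch: observed-data log-likelihood (8) for the MSNFA model,
# reusing the illustrative dsn_restricted() and mat_inv_sqrt() helpers above.
# 'theta' is an assumed list structure: theta$pi[i], theta$mu[[i]],
# theta$L[[i]], theta$D[[i]] (diagonal p x p matrix), theta$lambda[[i]].
msnfa_loglik <- function(Y, theta) {
  cc <- sqrt(2 / pi)
  g  <- length(theta$pi)
  comp <- matrix(NA_real_, nrow(Y), g)
  for (i in seq_len(g)) {
    q <- length(theta$lambda[[i]])
    Delta_mhalf <- mat_inv_sqrt(diag(q) + (1 - cc^2) * tcrossprod(theta$lambda[[i]]))
    alpha <- as.numeric(theta$L[[i]] %*% Delta_mhalf %*% theta$lambda[[i]])
    Sigma <- theta$L[[i]] %*% Delta_mhalf %*% Delta_mhalf %*% t(theta$L[[i]]) +
             theta$D[[i]]
    comp[, i] <- log(theta$pi[i]) +
      apply(Y, 1, dsn_restricted, mu = theta$mu[[i]] - cc * alpha,
            Sigma = Sigma, lambda = alpha, log = TRUE)
  }
  sum(log(rowSums(exp(comp))))     # Eq. (8); a log-sum-exp would be more stable
}
```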

3 Bayesian analysis

In this section we construct the augmented likelihood function based on the complete data (including the latent variables) in order to derive the joint posterior distribution.

3.1 Augmented likelihood function

Let \( {\mathbf{\mathcal{C}}} = \left\{ {\varvec{Y},\varvec{W},\varvec{Z}} \right\} \) denote the complete data, where \( \varvec{Y} = \left( {\varvec{Y}_{1} , \ldots ,\varvec{Y}_{n} } \right) \), \( \varvec{W} = \left( {W_{1} , \ldots ,W_{n} } \right) \) and \( \varvec{Z} = \left( {\varvec{Z}_{1} , \ldots ,\varvec{Z}_{n} } \right) \). Applying the hierarchical representation (3) to the model (6), we obtain the following hierarchical representation:

$$ \begin{aligned} \left. {\varvec{Y}_{j} } \right|\varvec{F}_{ij} , Z_{ij} = 1 & {\mathop{\sim}\limits^{ind.}} N_{p} \left( {\varvec{\mu }_{i} + \varvec{L}_{i} \varvec{F}_{ij} ,\varvec{D}_{i} } \right), \\ \left. {\varvec{F}_{ij} } \right|W_{j} = w_{j} , Z_{ij} = 1 & {\mathop{\sim}\limits^{ind.}} N_{q} \left( { - c{\varvec{\Delta}}_{i}^{ - 1/2} \varvec{\lambda }_{i} + w_{j} {\varvec{\Delta}}_{i}^{ - 1/2} \varvec{\lambda }_{i} , {\varvec{\Delta}}_{i}^{ - 1} } \right), \\ \left. {W_{j} } \right| Z_{ij} = 1 & {\mathop{\sim}\limits^{ind.}} HN_{1} \left( {0,1} \right). \\ \end{aligned} $$
(9)

Note that the above hierarchical representation can be reformulated as

$$ \begin{aligned} \left. {\varvec{Y}_{j} } \right|\varvec{\tilde{F}}_{ij} , Z_{ij} = 1 & {\mathop{\sim}\limits^{ind.}} N_{p} \left( {\varvec{\mu }_{i} + \varvec{\tilde{L}}_{i} \varvec{\tilde{F}}_{ij} ,\varvec{D}_{i} } \right), \\ \left. {\varvec{\tilde{F}}_{ij} } \right|W_{j} = w_{j} , Z_{ij} = 1 & {\mathop{\sim}\limits^{ind.}} N_{q} \left( {w_{j} \varvec{\lambda }_{i} , \varvec{I}_{q} } \right), \\ \left. {W_{j} } \right| Z_{ij} = 1 & {\mathop{\sim}\limits^{ind.}} TN_{1} \left( {c,1} \right)I\left( {W_{j} > c} \right), \\ \end{aligned} $$
(10)

where \( c = \sqrt {2/\pi } \), \( \tilde{\varvec{L}}_{i} = \varvec{L}_{i} {\varvec{\Delta}}_{i}^{ - 1/2} \), \( \tilde{\varvec{F}}_{ij} = {\varvec{\Delta}}_{i}^{1/2} \varvec{F}_{ij} \) and \( TN_{1} \left( {c,1} \right)I\left( {W_{j} > c} \right) \) denotes the normal distribution with mean \( c \) and variance one (before truncation), truncated to the interval \( \left( {c, + \infty } \right) \). The complete-data augmented likelihood function of \( {\varvec{\Theta}} \) is then given by

$$ L\left( {{\varvec{\Theta}}\left| {\mathbf{\mathcal{C}}} \right.} \right) = \mathop \prod \limits_{j = 1}^{n} \mathop \prod \limits_{i = 1}^{\text{g}} \left[ {\pi_{i} \phi_{p} \left( {\left. {\varvec{y}_{j} } \right|\varvec{\mu}_{i} + \tilde{\varvec{L}}_{i} \tilde{\varvec{F}}_{ij} , \varvec{D}_{i} } \right)\phi_{q} \left( {\left. {\tilde{\varvec{F}}_{ij} } \right|w_{j}\varvec{\lambda}_{i} , \varvec{I}_{q} } \right)\phi_{1} \left( {\left. {W_{j} } \right|c,1} \right)I\left( {W_{j} > c} \right)} \right]^{{Z_{ij} }} . $$
(11)

3.2 Priors and posteriors

Our Bayesian approach is based on a Gibbs sampler MCMC algorithm that draws samples from the full conditional posteriors. We assign prior distributions to the unknown model parameters and consider independent, weakly informative proper priors for the elements of \( {\varvec{\Theta}} \). We also write the loading matrix as \( \tilde{\varvec{L}}_{i} = \left[ {\ell_{i.rt} } \right] \), where \( \ell_{i.rt} \) denotes its (r, t)-th element. Thus, for the unknown parameters in the MSNFA model, we consider the priors

$$ \varvec{\pi}= \left( {\pi_{1} , \ldots ,\pi_{{{\text{g}} - 1}} } \right) \sim Dir\left( {\eta_{1} , \ldots ,\eta_{\text{g}} } \right), \quad \varvec{ \mu }_{i} \sim N_{p} \left( {\varvec{m}_{i} ,\varvec{M}_{i} } \right), \quad\varvec{ }{\varvec{\uplambda}}_{i} \sim N_{q} \left( {{\mathbf{\mathcal{I}}}_{i} ,{\mathbf{\mathcal{G}}}_{i} } \right), $$
$$ \ell_{i.rt} \sim N_{1} \left( {\mu_{\ell i} ,\sigma_{\ell i}^{2} } \right);\quad r > t, \ell_{i.rr} \sim HN_{1} \left( {\mu_{\ell i} ,\sigma_{\ell i}^{2} } \right), \quad D_{i.r} \varvec{ }\sim\varvec{ }IG\left( {{\mathfrak{a}}_{i} ,{\mathfrak{b}}_{i} } \right), $$

for \( i = 1, \ldots ,{\text{g}} \), \( r = 1, \ldots ,p \) and \( t = 1, \ldots ,q \), where \( Dir \) and \( IG \) denote the Dirichlet and inverse-gamma distributions, respectively.

The joint posterior distribution \( p\left( {\left. {{\varvec{\Theta}},\varvec{F},\varvec{w},\varvec{z}} \right|\varvec{y}} \right) \propto L\left( {{\varvec{\Theta}}\left| {\mathbf{\mathcal{C}}} \right.} \right)p\left( {\varvec{\Theta}} \right) \) is (generally) analytically intractable, and MCMC methods such as Gibbs sampling (Gelfand and Smith 1990), based on the full conditional posterior distributions, are often needed to draw samples from it. The full conditional posteriors for \( i = 1, \ldots ,{\text{g}} \), \( r = 1, \ldots ,p \) and \( t = 1, \ldots ,q \) are given as follows (in the following, \( {\varvec{\Theta}}_{{\left( { - \varepsilon } \right)}} \) denotes the set of parameters excluding the parameter \( \varepsilon \), \( {\Im }_{i} = \left\{ {j: z_{ij} = 1} \right\} \), and \( n_{i} \) is the number of observations allocated to the i-th FA component):

$$ \left.\varvec{\pi}\right|{\varvec{\Theta}}_{{\left( { -\varvec{\pi}} \right)}} ,\varvec{y},\varvec{F},\varvec{w},\varvec{z }\sim Dir\left( {\eta_{1} + n_{1} , \ldots ,\eta_{\text{g}} + n_{\text{g}} } \right). $$
$$ \left. {\varvec{\mu}_{i} } \right|{\varvec{\Theta}}_{{\left( { -\varvec{\mu}_{i} } \right)}} ,\varvec{y},\varvec{F},\varvec{w},z_{ij} = 1 \varvec{ }\sim\varvec{ }N_{p} \left( {\varvec{\mu},{\varvec{\Sigma}}} \right), $$

where \( \varvec{\mu}= {\varvec{\Sigma}}\left( {\varvec{M}_{i}^{ - 1} \varvec{m}_{i} + \sum\nolimits_{{{\Im }_{i} }} {\varvec{D}_{i}^{ - 1} } \left( {\varvec{y}_{j} - \tilde{\varvec{L}}_{i} \tilde{\varvec{F}}_{ij} } \right)} \right) \) and \( {\varvec{\Sigma}} = \left( {\varvec{M}_{i}^{ - 1} + \sum\nolimits_{{{\Im }_{i} }} {\varvec{D}_{i}^{ - 1} } } \right)^{ - 1} \).
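As an illustration of how these closed-form conditionals translate into code, the following R sketch performs one Gibbs draw of \( \varvec{\mu}_{i} \) from the full conditional above (all object names are our own; Y_i and Ft_i hold the observations and factor draws currently allocated to component i).

```r
# Minimal sketch: one Gibbs draw of mu_i from its full conditional
# (object names are illustrative; Y_i holds the n_i rows with z_ij = 1,
#  Ft_i the corresponding draws of F~_ij, m_i / M_i the prior hyperparameters)
draw_mu_i <- function(Y_i, Ft_i, L_tilde_i, D_i, m_i, M_i) {
  n_i      <- nrow(Y_i)
  D_inv    <- diag(1 / diag(D_i))
  Sig_post <- solve(solve(M_i) + n_i * D_inv)
  resid    <- Y_i - Ft_i %*% t(L_tilde_i)         # y_j - L~_i F~_ij, row-wise
  mu_post  <- Sig_post %*% (solve(M_i) %*% m_i + D_inv %*% colSums(resid))
  as.numeric(mu_post + t(chol(Sig_post)) %*% rnorm(length(m_i)))
}
```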

$$ \left. {\ell_{i.rt} } \right|{\varvec{\Theta}}_{{\left( { - \ell_{i.rt} } \right)}} ,\varvec{y},\varvec{F},\varvec{w},z_{ij} = 1 \varvec{ }\sim\varvec{ }N_{1} \left( {\mu ,\sigma^{2} } \right), $$

where \( \mu = \sigma^{2} \left( {\mu_{\ell i} \sigma_{\ell i}^{ - 2} + D_{i.r}^{ - 1} \sum\nolimits_{{{\Im }_{i} }} {F_{ij\left( t \right)} } \left( {y_{jr} - \mu_{ir} - \ell_{{ir\left( { - t} \right)}}^{{ \top }} \tilde{\varvec{F}}_{ij} } \right)} \right) \) and \( \sigma^{2} = \left( {\sigma_{\ell i}^{ - 2} + D_{i.r}^{ - 1} \sum\nolimits_{{{\Im }_{i} }} {F_{ij\left( t \right)}^{2} } } \right)^{ - 1} \), where \( y_{jr} \) and \( \mu_{ir} \) are the r-th components of \( \varvec{y}_{j} \) and \( \varvec{\mu}_{i} \), respectively, \( F_{ij\left( t \right)} \) is the t-th component of \( \tilde{\varvec{F}}_{ij} \), \( \ell_{ir} \) is the r-th row of \( \tilde{\varvec{L}}_{i} \) (so \( \ell_{i.rt} \) is its t-th element), and \( \ell_{{ir\left( { - t} \right)}} \) is the r-th row of \( \tilde{\varvec{L}}_{i} \) with its t-th component set to zero.

Also, \( \left. {\ell_{i.rr} } \right|{\varvec{\Theta}}_{{\left( { - \ell_{i.rr} } \right)}} ,\varvec{y},\varvec{F},\varvec{w},z_{ij} = 1\varvec{ }\sim\varvec{ }N_{1} \left( {\mu ,\sigma^{2} } \right)I\left( {\ell_{i.rr} > 0} \right), \) with the above parameters and with the index \( t \) replaced by \( r \).

$$ \left. {\tilde{\varvec{F}}_{ij} } \right|{\varvec{\Theta}},\varvec{y},\varvec{w},z_{ij} = 1 \varvec{ }\sim\varvec{ }N_{q} \left( {\varvec{\mu},{\varvec{\Sigma}}} \right), $$

where \( \varvec{\mu}= {\varvec{\Sigma}}\left( {w_{j}\varvec{\lambda}_{i} + \tilde{\varvec{L}}_{i}^{{ \top }} \varvec{D}_{i}^{ - 1} \left( {\varvec{y}_{j} -\varvec{\mu}_{i} } \right)} \right) \) and \( {\varvec{\Sigma}} = \left( {\varvec{I}_{q} + \tilde{\varvec{L}}_{i}^{{ \top }} \varvec{D}_{i}^{ - 1} \tilde{\varvec{L}}_{i} } \right)^{ - 1} \).
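The update of the latent factors follows the same pattern; a minimal R sketch of a single draw of \( \tilde{\varvec{F}}_{ij} \) from the conditional above is given below (again, the object names are our own).

```r
# Minimal sketch: one Gibbs draw of F~_ij from its full conditional
# (object names are illustrative)
draw_Ftilde_ij <- function(y_j, mu_i, L_tilde_i, D_i, lambda_i, w_j) {
  q     <- length(lambda_i)
  D_inv <- diag(1 / diag(D_i))
  Sig   <- solve(diag(q) + t(L_tilde_i) %*% D_inv %*% L_tilde_i)
  m     <- Sig %*% (w_j * lambda_i + t(L_tilde_i) %*% D_inv %*% (y_j - mu_i))
  as.numeric(m + t(chol(Sig)) %*% rnorm(q))
}
```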

$$ \left. {D_{i.r} } \right|{\varvec{\Theta}}_{{\left( { - D_{i.r} } \right)}} ,\varvec{y},\varvec{F},\varvec{w},z_{ij} = 1\varvec{ }\sim\varvec{ }IG\left( {a,b} \right), $$

where \( a = {\mathfrak{a}}_{i} + n_{i} /2 \) and \( b = {\mathfrak{b}}_{i} + \frac{1}{2}\sum\nolimits_{{{\Im }_{i} }} {\left( {y_{jr} - \mu_{ir} - \ell_{ir}^{{ \top }} \tilde{\varvec{F}}_{ij} } \right)}^{2} \).

$$ \varvec{ }\left. {{\varvec{\uplambda}}_{i} } \right|{\varvec{\Theta}}_{{\left( { - {\varvec{\uplambda}}_{i} } \right)}} ,\varvec{y},\varvec{F},\varvec{w},z_{ij} = 1\varvec{ }\sim\varvec{ }N_{q} \left( {\varvec{\mu},{\varvec{\Sigma}}} \right), $$

where \( \varvec{\mu}= {\varvec{\Sigma}}\left( {{\mathbf{\mathcal{G}}}_{i}^{ - 1} {\mathbf{\mathcal{I}}}_{i} + \sum\nolimits_{{{\Im }_{i} }} {w_{j} \tilde{\varvec{F}}_{ij} } } \right) \) and \( {\varvec{\Sigma}} = \left( {{\mathbf{\mathcal{G}}}_{i}^{ - 1} + \left[ {\sum\nolimits_{{{\Im }_{i} }} {w_{j}^{2} } } \right]\varvec{I}_{q} } \right)^{ - 1} \).

$$ \varvec{ }\left. {W_{j}^{ } } \right|{\varvec{\Theta}},\varvec{y},\varvec{F},z_{ij} = 1\varvec{ }\sim\varvec{ }TN_{1} \left( {\mu ,\sigma^{2} } \right)I\left( {W_{j}^{ } > c} \right), $$

where \( \mu = \sigma^{2} \left( {c + \sum\nolimits_{i = 1}^{\text{g}} {\varvec{\lambda}_{i}^{{ \top }} \tilde{\varvec{F}}_{ij} } } \right) \) and \( \sigma^{2} = \left( {1 + \sum\nolimits_{i = 1}^{\text{g}} {\varvec{\lambda}_{i}^{{ \top }}\varvec{\lambda}_{i} } } \right)^{ - 1} \).

$$ \left. {\varvec{Z}_{j} } \right|{\varvec{\Theta}},\varvec{y},\varvec{F},\varvec{w}\text{ }\sim{ \mathcal{M}}\left( {1; \frac{{\pi_{1} f_{1} \left( {\varvec{y}_{j} |\varvec{\theta}_{1} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{\text{g}} \pi_{i} f_{i} \left( {\varvec{y}_{j} |\varvec{\theta}_{i} } \right)}}, \ldots ,\frac{{\pi_{\text{g}} f_{\text{g}} \left( {\varvec{y}_{j} |\varvec{\theta}_{\text{g}} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{\text{g}} \pi_{i} f_{i} \left( {\varvec{y}_{j} |\varvec{\theta}_{i} } \right)}}} \right), \quad j = 1, \ldots ,n, $$

where \( f_{i} \left( {\varvec{y}_{j} |\varvec{\theta}_{i} } \right) \), \( i = 1, \ldots ,{\text{g}} \), are the component pdfs defined in (7).
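In code, this allocation step reduces to a row-wise categorical draw; the R sketch below does this given a matrix of component log-weights (e.g. the comp matrix built in the illustrative msnfa_loglik function above).

```r
# Minimal sketch: draw one component label per observation from the
# multinomial full conditional; 'log_comp' is an n x g matrix with entries
# log(pi_i) + log f_i(y_j | theta_i), e.g. as built in msnfa_loglik() above.
draw_z <- function(log_comp) {
  probs <- exp(log_comp - apply(log_comp, 1, max))   # stabilised exponentiation
  probs <- probs / rowSums(probs)
  apply(probs, 1, function(p) sample(length(p), 1, prob = p))
}
```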

3.3 Imputation of missing values

An advantage of the hierarchical representation in (9) or (10) is that it allows the model to be simulated from, and the parameters to be sampled, using existing Bayesian software such as Stan (Stan Development Team 2017) or NIMBLE (NIMBLE Development Team 2017) (JAGS and OpenBUGS could not be used owing to the absence of functions for matrix inversion). A further benefit is the ability to easily accommodate missing data and impute values from the model naturally as part of the parameter updates.

Let \( \varvec{Y}_{M} \) and \( \varvec{Y}_{O} \) represent the missing and observed responses, respectively. Missing data imputation in a Bayesian framework relies on the posterior predictive distribution for the missing data, \( P\left( {\left. {\varvec{Y}_{M} } \right|\varvec{Y}_{O} } \right) = \smallint P(\varvec{Y}_{M} |\varvec{Y}_{O} ,{\varvec{\Theta}}) P({\varvec{\Theta}} | \varvec{Y}_{O} )d{\varvec{\Theta}} \). As for most missing data problems with an unknown missingness pattern, the posterior predictive distribution cannot be simulated from directly, and a Gibbs sampling algorithm is often used with updates in two generic steps: \( \varvec{y}_{j,M}^{{\left( {t + 1} \right)}} \sim P\left( {\left. {\varvec{y}_{j,M} } \right|\varvec{y}_{O} , {\varvec{\Theta}}^{\left( t \right)} } \right) \) for \( j = 1, \ldots ,n \), and \( {\varvec{\Theta}}^{{\left( {t + 1} \right)}} \sim P\left( {\left. {\varvec{\Theta}} \right|\varvec{y}_{O} ,\varvec{y}_{M}^{{\left( {t + 1} \right)}} } \right) \). Starting from reasonable initial values \( \varvec{y}_{M}^{\left( 0 \right)} \) and \( {\varvec{\Theta}}^{\left( 0 \right)} \) and running the algorithm for a large number of iterations provides convergence towards these limiting distributions. In this paper, we implement this approach in NIMBLE, as it can be undertaken relatively easily (using only one extra line of code) and extended (if needed) to situations where missingness may depend on other covariates (e.g. conditions relating to the experiment or particular characteristics of individuals in a survey).
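To illustrate the general pattern (not the full MSNFA code, which is available from the authors), the toy NIMBLE sketch below shows how data entries passed as NA are treated as unobserved stochastic nodes and are therefore imputed from the model at each MCMC iteration; the model, dataset and settings here are purely illustrative assumptions.

```r
# Toy sketch of the NIMBLE missing-data pattern (not the full MSNFA model):
# entries of 'y' supplied as NA are not flagged as data, so the default MCMC
# configuration assigns samplers to them and they are imputed at each iteration.
library(nimble)

toy_code <- nimbleCode({
  for (j in 1:n) {
    for (r in 1:p) {
      y[j, r] ~ dnorm(mu[r], sd = sigma[r])
    }
  }
  for (r in 1:p) {
    mu[r] ~ dnorm(0, sd = 100)
    sigma[r] ~ dunif(0, 100)
  }
})

y_obs <- scale(as.matrix(airquality[, 1:4]))   # toy data; already contains NAs
model <- nimbleModel(toy_code,
                     constants = list(n = nrow(y_obs), p = ncol(y_obs)),
                     data = list(y = y_obs),
                     inits = list(mu = rep(0, 4), sigma = rep(1, 4)))
conf   <- configureMCMC(model, monitors = c("mu", "sigma"))  # add "y" to keep imputations
mcmc   <- buildMCMC(conf)
cmodel <- compileNimble(model)
cmcmc  <- compileNimble(mcmc, project = model)
samples <- runMCMC(cmcmc, niter = 5000, nburnin = 1000)
```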

4 Applications

In this section, we assess the performance and flexibility of the proposed MSNFA model using real data examples which display signs of skewness and are challenging to fit using the MFA model.

4.1 Priors and computation

For estimation of the different models, largely non-informative prior distributions were used for each of the component parameters: \( \varvec{\mu}_{i} \sim N_{p} \left( {\varvec{m}_{i} ,\varvec{M}_{i} } \right) \) with \( \varvec{m}_{i} = 0 \) and \( \varvec{M}_{i} = 10^{3} \varvec{I}_{p} \); \( {\varvec{\uplambda}}_{i} \sim N_{q} \left( {{\mathbf{\mathcal{I}}}_{i} ,{\mathbf{\mathcal{G}}}_{i} } \right) \) with \( {\mathbf{\mathcal{I}}}_{i} = 0 \) and \( {\mathbf{\mathcal{G}}}_{i} = 10^{3} \varvec{I}_{q} \); \( \ell_{i.rt} \sim N_{1} \left( {0, 100} \right) \) for \( r > t \); \( \ell_{i.rr} \sim HN_{1} \left( {0, 100} \right) \); \( D_{i.r} \varvec{ }\sim\varvec{ }IG\left( {1,1} \right) \) for each component \( i \); and \( \varvec{\pi }\sim\varvec{ }Dir\left( {1, \ldots ,1} \right) \). All computations were implemented in R version 3.3.1 (R Core Team 2017) on a Core i7 760 2.8 GHz processor. Gibbs sampling runs of 50,000 iterations with a burn-in of 10,000 were used, and convergence was assessed using the Gelman–Rubin statistic (Gelman and Rubin 1992) and by visual inspection. Computations were also verified, and models developed, using NIMBLE. To address the issue of label switching over the MCMC iterations (Mengersen et al. 2011), we used the maximum a posteriori (MAP) estimate to select one of the k! modal regions and a distance-based measure on the parameter space to re-label parameters in proximity to this region (Celeux et al. 2000). A sample copy of the R and NIMBLE code used is available from the authors upon request (and will be made available on a public website shortly). To avoid some computational issues common to factor analysis (e.g. underflow errors) we scale the datasets examined using the scale function in R. Finally, we note that a number of approaches have been proposed to accelerate MCMC (see, among others, Meng and Van Dyk 1999; van Dyk and Meng 2001; Yu and Meng 2011; Van Dyk 2010).

Model performance was assessed by comparing the classification accuracy and model selection criteria for the MSNFA and MFA models (see Table 2). For classification accuracy we report the adjusted Rand index (ARI) (Hubert and Arabie 1985), which ranges from 0 (no match) to 1 (perfect match). We also report the EAIC and EBIC, which are variations of the classical AIC and BIC criteria for use in a Bayesian setting (Carlin and Louis 2011); lower values indicate a better fit. In a mixture setting it is also possible to compare DIC values using one of the measures suggested by Celeux et al. (2006).
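As a practical note, the ARI can be computed in R from the true labels and the estimated (e.g. MAP) allocations, for instance via the mclust package (the use of mclust here is an illustrative choice, not necessarily the code used for the reported results):

```r
# Minimal sketch: adjusted Rand index between the known classes and the
# MAP cluster allocations ('true_class' and 'z_map' are assumed integer
# vectors of length n; mclust is an illustrative choice of package)
library(mclust)
adjustedRandIndex(true_class, z_map)
```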

4.2 Seeds example

In the first example, we examine a clustering problem for a seeds dataset analyzed by Lin et al. (2016) and originally analyzed by Charytanowicz et al. (2010). The data consist of seven geometric features (area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove) measured from X-ray images of 210 wheat kernels belonging to three different wheat varieties (Kama, Rosa and Canadian). To illustrate the performance of the MSNFA family, we focus on the case where g is a priori known to be 3, with q varying from 2 to 4.

For the MFA model (see Table 1), the best classification results were obtained for \( q = 4 \), with an ARI estimate of 0.69; however, all of the model selection criteria clearly favored \( q = 3 \), with a slightly lower ARI estimate of 0.66. The best classification results for the MSNFA model were obtained for the \( q = 3 \) case, with a higher ARI estimate of 0.76. In terms of model selection criteria, estimates of all of the criteria for the MSNFA also clearly favored this particular model. Overall, the MSNFA model appears to fit the three groups in these data considerably better than the MFA model, with substantial improvements in the model choice criteria and the classification results.

Table 1 Results for seeds data example

4.3 AIS data

The second example considers the Australian Institute of Sport (AIS) data containing a number of physical and hematological measurements (p = 11) from 100 female and 102 male athletes (n = 202). As a number of variables in the dataset (e.g. BMI) display signs of moderate skewness, a number of previous studies have used this dataset to examine the performance of skew-normal and skew-t mixture models to correctly classify the male and female athletes into their respective groups (e.g. Murray et al. 2014; Lee and McLachlan 2013a). Similarly, we are interested in assessing the performance of the MSNFA to correctly classify male and female athletes using all of the variables available (most of the previous studies have used only two variables).

From Table 2 we can see quite clearly that the classification performance of the MSNFA is very good with an ARI of 0.96 and model choice criteria all appear to favor this model. By contrast the MFA model is not able to accommodate the skewness in the data and the best ARI was 0.85 for the \( q = 5 \) model.

Table 2 Results for AIS data

To illustrate one of the benefits of using a Bayesian approach, we conduct an experiment on the AIS data by assessing the classification performance and associated errors in a missing data context. As mentioned previously, the hierarchical structure of the MSNFA allows the model to be coded and computations performed in NIMBLE (or Stan), which relatively easily facilitates the imputation of missing values from the full model (i.e. conditional means). In this experiment, we randomly delete values in the dataset under two different degrees of missingness [5% (low) and 30% (high) of the total sample (\( n \times p \))] and compare the performance of imputing values using the model (conditional approach) with mean imputation (unconditional approach), where the missing values are replaced by their unconditional means (the mean of the complete values for the variable). This type of missingness is often described as missing at random (MAR) (see Little and Rubin 1987). Along with the model selection and performance measures outlined previously, we also assess the results using the mean squared error (MSE),

$$ MSE = \frac{1}{{n^{*} }}\mathop \sum \limits_{j = 1}^{n} \left( {\varvec{y}_{j}^{m} - \hat{\varvec{y}}_{j}^{m} } \right)^{{ \top }} \left( {\varvec{y}_{j}^{m} - \hat{\varvec{y}}_{j}^{m} } \right), $$

where \( \varvec{y}_{j}^{m} \) and \( \hat{\varvec{y}}_{j}^{m} \) denote the true and imputed values of the missing components of \( \varvec{y}_{j} \), and \( n^{*} = \sum\nolimits_{j = 1}^{n} {(p - p_{j}^{o} )} \) is the total number of missing values, with \( p_{j}^{o} \) the number of observed components of \( \varvec{y}_{j} \). A smaller value of MSE indicates a more accurate prediction of missing values.
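The comparison of the two imputation schemes can be sketched in R as follows (all object names are assumptions: Y_true holds the original values, Y_miss the data with NAs, and Y_cond the model-based imputations, e.g. posterior means of the missing-value nodes from the NIMBLE run).

```r
# Minimal sketch: MSE of mean (unconditional) vs. model-based (conditional)
# imputation; 'Y_true', 'Y_miss' and 'Y_cond' are assumed matrices (see text)
miss   <- is.na(Y_miss)
n_star <- sum(miss)                                # total number of missing values

# unconditional approach: replace NAs with the column (variable) means
col_means <- colMeans(Y_miss, na.rm = TRUE)
Y_uncond  <- Y_miss
Y_uncond[miss] <- col_means[col(Y_miss)[miss]]

mse <- function(Y_hat) sum((Y_true[miss] - Y_hat[miss])^2) / n_star
c(unconditional = mse(Y_uncond), conditional = mse(Y_cond))
```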

Table 3 presents the results of the two approaches (unconditional and conditional) in terms of the mean values of the model selection criteria (EAIC, EBIC, etc.), classification performance (ARI) and the MSE over 30 replications of the dataset under each missingness rate scenario (5% or 30%).

Table 3 Results for AIS data (missing data)

Under both degrees of missingness, the results for the conditional model (MSNFA-CO) are clearly superior to those for the unconditional model (MSNFA-UC), showing only a relatively small decrease in performance relative to the complete-data results. In contrast, the results for the unconditional model deteriorate quickly, with an average classification result for the ARI of 0.68 (compared to 0.83 for the conditional model). The extent and type of deterioration obviously depend on the application, but here we observed substantial deterioration even for a relatively small degree of missingness (5%). An alternative to the unconditional approach is listwise deletion, where an entire record is removed from the analysis if a single value is missing. This approach is only really applicable for large samples, which is rarely the case in the applications where factor analysis is commonly used. Thus, the conditional approach (using the full model) is often preferred, but it relies upon the availability and ease of use of the computational approach in practice.

5 Conclusion

We have outlined and assessed the performance of an MSNFA model within a Bayesian framework. Various properties of the SNFA family are well defined, and estimation of the parameters is relatively straightforward in a Bayesian framework, with all of the Gibbs sampling updates available in closed form. Assessments of the performance of the proposed model on simulated and real data suggest that this distribution provides a considerable degree of flexibility in modeling data of varying directional shape. Various extensions to the MSNFA model are possible, including the use of this distribution in the more general setting of a structural equation model and extending existing models where sparse covariance structures are necessary for particular settings or applications. Similar to the work of Suarez and Ghosal (2016), more informative priors (known a priori or empirically derived) could be placed on the variance of the error term [the diagonal matrix \( \varvec{D} \) in (4b)] in noisy or error-prone settings to improve estimates. Such an extension is relatively easy to implement using the computational approach outlined. Further extensions relating to the incorporation of covariates, either as part of the missing data process or separately, also follow in a relatively straightforward way from the proposed model and the software available (e.g. NIMBLE). Further extensions could also incorporate unrestricted skew distributional forms (Maleki et al. 2018b) and asymmetric two-piece distributions belonging to the mixture distributions introduced by Maleki and Mahmoudi (2017) and Hoseinzadeh et al. (2018).