1 Introduction

Mixtures of factor analyzers (MFA), originally introduced by Ghahramani and Hinton (1997), have become one of the most popular tools for clustering and local dimensionality reduction of high-dimensional data, especially when the number of observations is not large relative to their dimension. The MFA model and its applications have been studied extensively by Hinton et al. (1997), McLachlan and Peel (2000) and McLachlan et al. (2002, 2003), among others. To reduce the number of parameters, especially when the number of components or features is quite large, Baek et al. (2010) extended the MFA by using common component-factor loadings, giving the mixtures of common factor analyzers (MCFA), and described an alternating expectation conditional maximization (AECM) algorithm (Meng and Dyk 1997) for maximum likelihood (ML) estimation. Wang (2013) further studied an extension of the MCFA approach that allows practitioners to handle model-based density estimation, clustering, visualization and discriminant analysis of high-dimensional data in the presence of missing values.

A number of Bayesian strategies have been developed for inferring finite mixture models and their extensions through factor-analytic representations. Diebolt and Robert (1994) presented a Gibbs-sampling scheme to perform posterior inference on Gaussian mixture (GMIX) models. Zhang et al. (2004) advocated the use of the reversible jump Markov chain Monte Carlo (MCMC) algorithm (Green 1995; Richardson and Green 1997) for fitting GMIX models with an unknown number of components. Lopes and West (2004) explored feasible MCMC methods for Bayesian model assessment in factor analysis models. Bayesian treatments of MFA have been investigated through a variational Bayes (VB) approximation (Ghahramani and Beal 2000) and a stochastic simulation procedure (Fokouè and Titterington 2003), where there is uncertainty about the dimensionality of the latent spaces, i.e., the unknown number of mixture components and common factors. Recently, Wei and Li (2013) proposed a VB algorithm for learning MCFA from a Bayesian perspective.

In the MFA and MCFA frameworks, component factors and errors are routinely assumed to be normally distributed for mathematical convenience and computational tractability. However, the normality assumption is not always realistic because of its known sensitivity to outliers. Furthermore, a poor fit to data with longer-than-normal tails may subsequently yield a misleading clustering. To cope with this obstacle, McLachlan et al. (2007) proposed the mixtures of t-factor analyzers (MtFA), whereby the multivariate t family (Kotz and Nadarajah 2004) with dimension p, mean vector \({\varvec{\mu }}\) (for \(\nu >1\)), covariance matrix \(\nu (\nu -2)^{-1}{\varvec{\varSigma }}\) (for \(\nu >2\)), and degrees of freedom (df) \(\nu \), denoted by \(t_p({\varvec{\mu }},{\varvec{\varSigma }},\nu )\), is adopted as the underlying distribution for both component factors and errors. The multivariate t density is

$$\begin{aligned} t_p({\varvec{y}}|{\varvec{\mu }},{\varvec{\varSigma }},\nu )=\frac{\Gamma \left( \frac{\nu +p}{2}\right) |{\varvec{\varSigma }}|^{-1/2}}{(\pi \nu )^{p/2}\Gamma \left( \frac{\nu }{2}\right) } \big [1+({\varvec{y}}-{\varvec{\mu }})^\mathrm{T}{\varvec{\varSigma }}^{-1}({\varvec{y}}-{\varvec{\mu }})/\nu \big ]^{-(\nu +p)/2},\quad {\varvec{y}}\in \mathscr {R}^p, \end{aligned}$$

where the df \(\nu \) may be viewed as a robustness tuning parameter that controls the fatness of the tails of the distribution.
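
For concreteness, this density can be evaluated directly from the formula above. The following R sketch is only an illustration; the function name dmvt_manual is ours, not from the paper.

```r
## Multivariate t density t_p(y | mu, Sigma, nu), coded directly from the formula above.
dmvt_manual <- function(y, mu, Sigma, nu, log = FALSE) {
  p <- length(mu)
  d <- as.numeric(t(y - mu) %*% solve(Sigma) %*% (y - mu))   # Mahalanobis distance
  logdens <- lgamma((nu + p) / 2) - lgamma(nu / 2) -
    (p / 2) * log(pi * nu) -
    0.5 * determinant(Sigma, logarithm = TRUE)$modulus -
    ((nu + p) / 2) * log1p(d / nu)
  if (log) as.numeric(logdens) else exp(as.numeric(logdens))
}
```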

Specifically, let \({\varvec{y}}_j=(y_{j1},\ldots ,y_{jp})^{\mathrm{T}}\), \(j=1,\ldots ,n\), be n p-dimensional vectors of feature variables. The MtFA approach formulates \({\varvec{y}}_j\) as:

$$\begin{aligned} {\varvec{y}}_j={\varvec{\mu }}_i+{\varvec{B}}_i{\varvec{u}}_{ij}+{\varvec{e}}_{ij}\quad \text{ with } \text{ probability } \pi _i\quad (i=1,\ldots ,g), \end{aligned}$$
(1)

where \({\varvec{\mu }}_i\) is a \(p\times 1\) vector of component location, \({\varvec{B}}_i\) is a \(p\times q\) matrix of component factor loadings, \({\varvec{u}}_{ij}\) is a q-dimensional vector of component factors, and \({\varvec{e}}_{ij}\) is a p-dimensional vector of component errors. Here, \(({\varvec{u}}_{ij}^{\mathrm{T}},{\varvec{e}}_{ij}^{\mathrm{T}})^\mathrm{T}\) is assumed to jointly follow a multivariate t distribution with zero mean, a block-diagonal scale-covariance matrix \(\text{ diag }\{{\varvec{I}}_q,{\varvec{D}}_i\}\), and the df \(\nu _i\), where \({\varvec{I}}_q\) is an identity matrix of size q and \({\varvec{D}}_i\) is a diagonal matrix. Consequently, the density of \({\varvec{y}}_j\) for the MtFA is

$$\begin{aligned} f({\varvec{y}}_j)=\sum _{i=1}^g\pi _it_p({\varvec{y}}_j|{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,\nu _i), \end{aligned}$$

where \({\varvec{\varSigma }}_i={\varvec{B}}_i{\varvec{B}}_i^{\mathrm{T}}+{\varvec{D}}_i\). Note that MtFA reduces to MFA as all component dfs \(\nu _i\) tend to infinity. McNicholas and Murphy (2008) introduced a new class of Gaussian mixture models with several parsimonious covariance structures, called parsimonious Gaussian mixture models (PGMM). Andrews and McNicholas (2011) investigated a restricted MtFA model, obtained by imposing constraints on the df, the factor loadings, and the error covariance matrices. Recently, Wang and Lin (2013) proposed an ad hoc expectation conditional maximization (ECM; Meng and Rubin 1993) algorithm based on a much smaller hidden data space for fast ML estimation of MtFA. They also conducted a simulation study showing that their procedure substantially outperforms the commonly used expectation maximization (EM; Dempster et al. 1977) algorithm and the AECM algorithm used in McLachlan et al. (2007) in most situations, whether convergence speed is assessed by computing time or by the number of iterations.

For model-based clustering of high-dimensional data, the dimension p is sometimes quite large and/or the number of components (clusters) g is not small. The number of parameters in model (1) may then become unmanageable, leading to near-singular or inestimable component covariance matrix estimates. As a robust extension of MCFA, Baek and McLachlan (2011) proposed a parsimonious version of the MtFA, named mixtures of common t-factor analyzers (MCtFA), which utilizes common factor loadings to further reduce the number of parameters in the specification of the component-covariance matrices. To accommodate different covariance matrices for the latent factors, this paper presents an extended version of MCtFA, called the EMCtFA, studies its essential properties, and describes two variants of the EM algorithm, namely the ECM and the expectation conditional maximization either (ECME; Liu and Rubin 1994) algorithms, for ML estimation of the model parameters.

As an alternative to exact ML methods, simulated ML estimation can be implemented for the model using Monte Carlo (MC) or importance sampling (IS) methods, known as the MCEM and ISEM algorithms. One drawback of simulated ML methods is that the fitting procedure relies on MC estimates, which can be difficult to implement because of the heavy computational burden. Another issue is that an increase in the log-likelihood at each iteration is not guaranteed because of MC errors (McLachlan and Krishnan 2008). Our proposed EM-type algorithms have exact closed-form expressions in the E-step and analytically reduced expressions in the CM-steps, yielding more accurate estimates than the simulated ML methods.

In this work, we provide a guideline for choosing a set of suitable initial values. Furthermore, the probabilistic classification of new subjects and the estimation of latent factors are also investigated. Under the assumption of non-normality, outlier detection in mixture modeling is also an important problem. Outliers usually lead to overestimating the number of components needed to offer a good representation of the data (Fraley and Raftery 2002). We also offer a rule for identifying which observations are suspected outliers under the EMCtFA framework.

The remainder of this paper is structured as follows. In Sect. 2, we establish the notation and formulate the EMCtFA model. Section 3 presents two EM-type algorithms for fitting EMCtFA and outlines a simple way of setting the initialization. Section 4 describes some practical tools, including model-based clustering, classification, outlier detection and model selection. In Sect. 5, the application of the proposed methodology is illustrated through analyzing the Italian wine data. In Sect. 6, we conduct a simulation study to compare the performance of our recommended initialization procedure with the existing method. We conclude the paper with a short summary in Sect. 7. The detailed derivations are sketched in “Appendix”.

2 Extended mixtures of common t-factor analyzers (EMCtFA)

Consider n independent p-dimensional feature vectors \({\varvec{y}}_1,\ldots ,{\varvec{y}}_n\) arising from a heterogeneous population with g subgroups. For dimensionality reduction, q must be smaller than p. The EMCtFA model for continuous features \({\varvec{y}}_j\) can be described as:

$$\begin{aligned} {\varvec{y}}_j={\varvec{A}}{\varvec{u}}_{ij}+{\varvec{e}}_{ij}\quad \text{ with } \text{ probability } \pi _i\quad (i=1, \ldots ,g), \end{aligned}$$
(2)

where \({\varvec{A}}\) is a \(p\times q\) matrix of common factor loadings, \({\varvec{e}}_{ij}\) is a p-dimensional vector of component errors, and \(\pi _i\in (0,1)\) is the mixing proportion subject to \(\sum _{i=1}^g\pi _i=1\). The joint distribution of \({\varvec{u}}_{ij}\) and \({\varvec{e}}_{ij}\) for the ith component is assumed to be

$$\begin{aligned} \begin{bmatrix}{\varvec{u}}_{ij}\\{\varvec{e}}_{ij}\end{bmatrix}\sim t_{q+p}\left( \begin{bmatrix}{\varvec{\beta }}_i\\{\varvec{0}}\end{bmatrix}, \begin{bmatrix}{\varvec{\varOmega }}_i&{\varvec{0}}\\ {\varvec{0}}&{\varvec{D}}_i\end{bmatrix},\nu _i\right) , \end{aligned}$$
(3)

where \({\varvec{\beta }}_i\) is a q-dimensional location vector, \({\varvec{\varOmega }}_i\) is a \(q\times q\) positive-definite scale-covariance matrix, \({\varvec{D}}_i\) is a \(p\times p\) diagonal covariance matrix, and \(\nu _i\) is the df. We further assume that the joint distributions of \(({\varvec{u}}_{ij}^{\mathrm{T}},{\varvec{e}}_{ij}^{\mathrm{T}})^\mathrm{T}\) are independent across subjects. When \({\varvec{D}}_i={\varvec{D}}\) for all i, the EMCtFA reduces to the original MCtFA model (Baek and McLachlan 2011). In general, the component dfs are allowed to vary so as to flexibly control possibly different degrees of tail thickness across the component distributions. The special case of equal dfs, say \(\nu _i=\nu \) for all i, is usually considered for the sake of parsimony and fast convergence of the algorithm. The MCtFA includes the MCFA as a limiting/special case when all component dfs approach infinity simultaneously. It can also be shown that MCtFA is a special case of MtFA by virtue of Eqs. (17)–(21) in Baek et al. (2010).
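
To make the generative mechanism in (2) and (3) concrete, the following R sketch simulates from the EMCtFA model via its gamma scale-mixture representation. All object names are ours, and the parameters are supplied as lists of per-component quantities.

```r
## Simulate n observations from the EMCtFA model (2)-(3); illustrative names only.
## Given tau ~ Gamma(nu/2, nu/2), the factor u and error e are conditionally Gaussian.
rEMCtFA <- function(n, prop, A, beta, Omega, D, nu) {
  g <- length(prop); p <- nrow(A); q <- ncol(A)
  y <- matrix(NA_real_, n, p); z <- integer(n)
  for (j in seq_len(n)) {
    i   <- sample.int(g, 1, prob = prop)                      # component label
    tau <- rgamma(1, shape = nu[i] / 2, rate = nu[i] / 2)     # scaling variable
    u   <- beta[[i]] + t(chol(Omega[[i]] / tau)) %*% rnorm(q) # latent factor u_ij
    e   <- rnorm(p, sd = sqrt(diag(D[[i]]) / tau))            # component error e_ij
    y[j, ] <- A %*% u + e
    z[j]   <- i
  }
  list(y = y, z = z)
}
```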

For the MtFA in (1), \(q(q-1)/2\) uniqueness constraints are imposed for component factor loadings \({\varvec{B}}_i\) and, thus, the total number of parameters in (1) is

$$\begin{aligned} d_1=(2g-1)+2gp+g[pq-q(q-1)/2]. \end{aligned}$$

For the EMCtFA in (2) along with assumption (3), the common factor loading matrix \({\varvec{A}}\) is identifiable only up to postmultiplication by a nonsingular matrix, so its number of free parameters is \(pq-q^2\). As a result, the total number of parameters in (2) is

$$\begin{aligned} d_2=(2g-1)+pg+q(p-q+g)+gq(q+1)/2, \end{aligned}$$

while that in the MCtFA (Baek and McLachlan 2011) is

$$\begin{aligned} d_3=(2g-1)+p+q(p-q+g)+gq(q+1)/2. \end{aligned}$$

It follows straightforwardly that the difference in numbers of parameters between MtFA and EMCtFA is \(d_1-d_2=(g-1)q(p-q)+g(p-q)\), which is nonnegative when \(p\ge q\) and \(g\ge 1\). Meanwhile, the difference in numbers of parameters between the EMCtFA and MCtFA is \(d_2-d_3=(g-1)p\), which is also nonnegative when \(g\ge 1\). Because \(q<p\) and \(g\ge 1\), we therefore have \(d_1\ge d_2\ge d_3\). Clearly, the EMCtFA reaches a compromise between the MtFA and MCtFA approaches through the specification of distinct covariance matrices for component errors. The EMCtFA as well as the MCtFA are preferable to the MtFA model when the dimension p or the number of components g is so large that MtFA suffers from convergence problems. Furthermore, unlike MtFA, the estimated posterior means of the factor scores of EMCtFA can be used to portray the data in low-dimensional subspaces.
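
As a quick check of these parameter counts, the formulas for \(d_1\), \(d_2\) and \(d_3\) can be coded directly; the helper below is ours and is only a convenience for comparing model complexity.

```r
## Numbers of free parameters for MtFA, EMCtFA and MCtFA (unequal dfs), as given above.
count_params <- function(p, q, g) {
  d1 <- (2 * g - 1) + 2 * g * p + g * (p * q - q * (q - 1) / 2)      # MtFA
  d2 <- (2 * g - 1) + p * g + q * (p - q + g) + g * q * (q + 1) / 2  # EMCtFA
  d3 <- (2 * g - 1) + p     + q * (p - q + g) + g * q * (q + 1) / 2  # MCtFA
  c(MtFA = d1, EMCtFA = d2, MCtFA = d3)
}
count_params(p = 13, q = 4, g = 3)   # e.g., the wine-data setting of Sect. 5
```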

Let \({\varvec{\varTheta }}=\{{\varvec{A}},{\varvec{\theta }}_1,\ldots ,{\varvec{\theta }}_g\}\) denote the entire unknown model parameters where \({\varvec{\theta }}_i=(\pi _i,{\varvec{\beta }}_i,{\varvec{\varOmega }}_i,{\varvec{D}}_i,\nu _i)\), \(i=1,\ldots ,g\), represents the parameter vector for the ith component. According to (2) and (3), the probability density function (pdf) of \({\varvec{y}}_j\) is

$$\begin{aligned} f({\varvec{y}}_j\mid {\varvec{\varTheta }})=\sum _{i=1}^g\pi _it_p({\varvec{y}}_j\mid {\varvec{A}}{\varvec{\beta }}_i,{\varvec{\varSigma }}_i,\nu _i), \end{aligned}$$

where \({\varvec{\varSigma }}_i={\varvec{A}}{\varvec{\varOmega }}_i{\varvec{A}}^{\mathrm{T}}+{\varvec{D}}_i\). Therefore, the ML estimate \(\hat{{\varvec{\varTheta }}}\) based on a set of independent observations \({\varvec{y}}=\{{\varvec{y}}_1,\ldots ,{\varvec{y}}_n\}\) is \(\hat{{\varvec{\varTheta }}}=\mathop {\text{ argmax }}_{{\varvec{\varTheta }}}\ell ({\varvec{\varTheta }}\mid {\varvec{y}})\), where \(\ell ({\varvec{\varTheta }}|{\varvec{y}})=\sum _{j=1}^n\log f({\varvec{y}}_j|{\varvec{\varTheta }})\) is the observed log-likelihood function. Unfortunately, there is no explicit analytical solution for the ML estimator of \({\varvec{\varTheta }}\). We therefore resort to EM-type algorithms (Dempster et al. 1977), which are popular iterative devices for ML estimation in models incorporating hidden variables.
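
For reference, the observed log-likelihood \(\ell ({\varvec{\varTheta }}|{\varvec{y}})\) can be evaluated directly from this mixture density. The sketch below assumes the mvtnorm package and stores parameters as lists of per-component quantities; the function name is ours.

```r
## Observed log-likelihood of the EMCtFA model; a minimal sketch assuming mvtnorm.
library(mvtnorm)
loglik_EMCtFA <- function(y, prop, A, beta, Omega, D, nu) {
  g <- length(prop)
  dens <- sapply(seq_len(g), function(i) {
    Sigma_i <- A %*% Omega[[i]] %*% t(A) + D[[i]]
    prop[i] * dmvt(y, delta = as.numeric(A %*% beta[[i]]),
                   sigma = Sigma_i, df = nu[i], log = FALSE)
  })
  sum(log(rowSums(dens)))   # sum_j log f(y_j | Theta)
}
```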

Within the EM framework, which treats latent quantities as missing data, it is convenient to introduce a set of allocation variables \({\varvec{Z}}_j=(z_{1j},\ldots ,z_{gj}), j=1,\ldots ,n\), where the component membership \(z_{ij}=1\) if \({\varvec{y}}_j\) belongs to the ith component and \(z_{ij}=0\) otherwise. This indicates that \({\varvec{Z}}_j\) independently follows a multinomial distribution with one trial and mixing proportions \((\pi _1,\ldots ,\pi _g)\) subject to \(\sum _{i=1}^g\pi _i=1\), denoted as \({\varvec{Z}}_j\sim {\mathscr {M}}(1;\pi _1,\ldots ,\pi _g)\). Based on the essential property of the multivariate t distribution, we also utilize scaling variables \(\tau _j\), which, conditional on \(z_{ij}=1\), follow the gamma distribution with shape \(\nu _i/2\) and rate \(\nu _i/2\). Through introducing the latent variables \({\varvec{Z}}_j\) and \(\tau _j\), for \(j=1,\ldots ,n\), three hierarchical representations of the EMCtFA are sketched in “Appendix”.

As a consequence, we establish Proposition 1, which is useful for evaluating the conditional expectations involved in the ECME algorithm described in the next section.

Proposition 1

Given the hierarchical representations (18)–(20), we have

$$\begin{aligned} \begin{bmatrix}{\varvec{y}}_j\\ {\varvec{u}}_{ij}\end{bmatrix}\Bigg |(\tau _{j}, z_{ij}=1)\sim & {} {\mathscr {N}}_{p+q}\Bigg (\begin{bmatrix}{\varvec{A}}{\varvec{\beta }}_i\\{\varvec{\beta }}_i\end{bmatrix}, \tau _{j}^{-1}\begin{bmatrix}{\varvec{\varSigma }}_i&{\varvec{A}}{\varvec{\varOmega }}_i\\ {\varvec{\varOmega }}_i{\varvec{A}}^{\mathrm{T}}&{\varvec{\varOmega }}_i\end{bmatrix}\Bigg ). \end{aligned}$$

It follows that

$$\begin{aligned} \begin{bmatrix}{\varvec{y}}_j\\ {\varvec{u}}_{ij}\end{bmatrix}\Bigg |(z_{ij}=1) \sim t_{p+q}\Bigg ( \begin{bmatrix}{\varvec{A}}{\varvec{\beta }}_i\\ {\varvec{\beta }}_i\end{bmatrix}, \begin{bmatrix}{\varvec{\varSigma }}_i&{\varvec{A}}{\varvec{\varOmega }}_i\\ {\varvec{\varOmega }}_i{\varvec{A}}^{\mathrm{T}}&{\varvec{\varOmega }}_i\end{bmatrix},\nu _i\Bigg ). \end{aligned}$$

Simple algebra shows that

$$\begin{aligned} {\varvec{u}}_{ij}\mid ({\varvec{y}}_j, \tau _{j},z_{ij}=1)\sim & {} {\mathscr {N}}_q({\varvec{\beta }}_i+{\varvec{\gamma }}_i^\mathrm{T}({\varvec{y}}_j-{\varvec{A}}{\varvec{\beta }}_i), \tau _{j}^{-1}({\varvec{I}}_q-{\varvec{\gamma }}_i^\mathrm{T}{\varvec{A}}){\varvec{\varOmega }}_i), \end{aligned}$$
(4)
$$\begin{aligned} \tau _{j}\mid ({\varvec{y}}_j,z_{ij}=1)\sim & {} \mathrm{Gamma}\left( \frac{\nu _i+p}{2},\frac{\nu _i+\delta _{ij}}{2}\right) , \end{aligned}$$
(5)

where \({\varvec{\gamma }}_i={\varvec{\varSigma }}_i^{-1}{\varvec{A}}{\varvec{\varOmega }}_i\) and \(\delta _{ij}=({\varvec{y}}_j-{\varvec{A}}{\varvec{\beta }}_i)^\mathrm{T}{\varvec{\varSigma }}_i^{-1}({\varvec{y}}_j-{\varvec{A}}{\varvec{\beta }}_i)\) denotes the Mahalanobis distance between the observation \({\varvec{y}}_j\) and the component mean \({\varvec{A}}{\varvec{\beta }}_i\). Subsequently, multiplying (4) by (5) and then integrating out \(\tau _{j}\) implies

$$\begin{aligned} {\varvec{u}}_{ij}\mid ({\varvec{y}}_j, z_{ij}\!=\!1)\sim t_{q}\left( {\varvec{\beta }}_i+{\varvec{\gamma }}_i^\mathrm{T}({\varvec{y}}_j-{\varvec{A}}{\varvec{\beta }}_i), \Bigg (\frac{\nu _i+\delta _{ij}}{\nu _i+p}\Bigg )({\varvec{I}}_q-{\varvec{\gamma }}_i^\mathrm{T}{\varvec{A}}){\varvec{\varOmega }}_i, \nu _i+p\right) . \end{aligned}$$

Proof

The proof is straightforward and, hence, is omitted. \(\square \)
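
Proposition 1 delivers the quantities needed in the E-step of the algorithms of Sect. 3 in closed form: for component i, the posterior scaling weight is \((\nu _i+p)/(\nu _i+\delta _{ij})\) and the conditional factor mean is \({\varvec{\beta }}_i+{\varvec{\gamma }}_i^{\mathrm{T}}({\varvec{y}}_j-{\varvec{A}}{\varvec{\beta }}_i)\). A minimal R sketch for a single component follows; function and argument names are ours.

```r
## Closed-form E-step quantities for component i implied by Proposition 1.
estep_component <- function(y, A, beta_i, Omega_i, D_i, nu_i) {
  # y: n x p data matrix; beta_i: q-vector; Omega_i: q x q; D_i: p x p diagonal matrix
  p <- ncol(y)
  Sigma_i <- A %*% Omega_i %*% t(A) + D_i
  Sinv    <- solve(Sigma_i)
  gamma_i <- Sinv %*% A %*% Omega_i                 # gamma_i = Sigma_i^{-1} A Omega_i
  resid   <- sweep(y, 2, as.numeric(A %*% beta_i))  # rows are (y_j - A beta_i)'
  delta   <- rowSums((resid %*% Sinv) * resid)      # Mahalanobis distances delta_ij
  tau_hat <- (nu_i + p) / (nu_i + delta)            # E(tau_j | y_j, z_ij = 1)
  u_hat   <- sweep(resid %*% gamma_i, 2, beta_i, FUN = "+")  # E(u_ij | y_j, z_ij = 1)
  list(tau_hat = tau_hat, u_hat = u_hat, delta = delta, gamma = gamma_i)
}
```

The posterior membership probabilities \(\hat{z}_{ij}\) are then obtained by normalizing \(\pi _i t_p({\varvec{y}}_j|{\varvec{A}}{\varvec{\beta }}_i,{\varvec{\varSigma }}_i,\nu _i)\) across components.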

3 Parameter estimation

3.1 ML estimation via the ECM and ECME algorithms

The EM algorithm has several appealing features, including simplicity of implementation and monotone convergence, with each iteration increasing the likelihood. However, the EM algorithm is not straightforward for ML estimation of model (2) because its M-step is computationally difficult. We therefore exploit a variant of the EM algorithm, the ECME (Liu and Rubin 1994) algorithm. The ECME algorithm estimates parameters by replacing the M-step of EM with either CM-steps that maximize a sequence of constrained Q-functions, as in ECM, or CML-steps that maximize the correspondingly constrained actual likelihood function. It shares the appealing features of EM (Dempster et al. 1977) and ECM (Meng and Rubin 1993), and typically possesses a faster convergence rate than either EM or ECM in terms of CPU time and/or number of iterations.

For notational convenience, we denote the allocation variables by \({\varvec{Z}}=({\varvec{Z}}_1,\ldots ,{\varvec{Z}}_n)\), the scaling variables by \({\varvec{\tau }}=\{\tau _1,\ldots ,\tau _n\}\) and unobservable factors by \({\varvec{U}}=\{{\varvec{u}}_{ij};i=1,\ldots ,g,j=1,\ldots ,n\}\). Treating \(({\varvec{Z}}, {\varvec{\tau }}, {\varvec{U}})\) as the “missing” data and combining them with the observed data \({\varvec{y}}\) as the “complete” data, the complete-data log-likelihood function of \({\varvec{\varTheta }}\) based on hierarchy (20) is

$$\begin{aligned} \ell _c({\varvec{\varTheta }}|{\varvec{y}},{\varvec{Z}},{\varvec{\tau }},{\varvec{U}})= & {} \sum _{i=1}^g\sum _{j=1}^nz_{ij}\{\log \pi _i+\log \phi _p({\varvec{y}}_j\mid {\varvec{A}}{\varvec{u}}_{ij}, \tau _{j}^{-1}{\varvec{D}}_i)\nonumber \\&+\,\log \phi _q({\varvec{u}}_{ij}\mid {\varvec{\beta }}_i,\tau _{j}^{-1}{\varvec{\varOmega }}_i)+\log \mathscr {G}(\tau _{j}\mid \nu _i/2,\nu _i/2)\}, \end{aligned}$$
(6)

where \(\phi _p(\cdot |{\varvec{\mu }},{\varvec{\varSigma }})\) stands for the pdf of the p-variate normal distribution with mean vector \({\varvec{\mu }}\) and covariance matrix \({\varvec{\varSigma }}\), and \(\mathscr {G}(\cdot |a,b)\) denotes the pdf of the gamma distribution with mean a / b and variance \(a/b^2\).

Let \(\hat{{\varvec{\varTheta }}}^{(k)}=(\hat{{\varvec{A}}}^{(k)},\hat{\pi }_i^{(k)},\hat{{\varvec{\beta }}}_i^{(k)},\hat{{\varvec{\varOmega }}}_i^{(k)},\hat{{\varvec{D}}}_i^{(k)},\hat{\nu }_i^{(k)},i=1,\ldots ,g)\) be the estimates of \({\varvec{\varTheta }}\) at the kth iteration. In the E-step of ECME, one needs to evaluate the conditional expectation of (6) at \({\varvec{\varTheta }}=\hat{{\varvec{\varTheta }}}^{(k)}\), which is the so-called Q-function:

$$\begin{aligned} Q({\varvec{\varTheta }}|\hat{{\varvec{\varTheta }}}^{(k)})=E(\ell _c({\varvec{\varTheta }}|{\varvec{y}},{\varvec{Z}},{\varvec{\tau }},{\varvec{U}})|{\varvec{y}},\hat{{\varvec{\varTheta }}}^{(k)}). \end{aligned}$$
(7)

All necessary conditional expectations in (7) can be obtained from Eq. (21). The CM-steps, each of which maximizes the constrained Q-function or the constrained actual log-likelihood over part of \({\varvec{\varTheta }}\) conditional on the remaining parameters being fixed at their most recent estimates, proceed as follows:

  • CM-step 1 for ECM and ECME Fix \(\nu _i=\hat{\nu }_i^{(k)} (i=1,\ldots ,g)\), and update \(\hat{\pi }_i^{(k)}, \hat{{\varvec{A}}}^{(k)}\), \(\hat{{\varvec{\beta }}}_i^{(k)}, \hat{{\varvec{\varOmega }}}_i^{(k)}\), and \(\hat{{\varvec{D}}}_i^{(k)}\) by maximizing (7), which gives

    $$\begin{aligned} \hat{\pi }_i^{(k+1)}= & {} \sum _{j=1}^n\hat{z}^{(k)}_{ij}/n,\nonumber \\ \hat{{\varvec{A}}}^{(k+1)}= & {} \left\{ \sum _{j=1}^n\sum _{i=1}^g\hat{z}_{ij}^{(k)} \hat{\tau }_{ij}^{(k)}{\varvec{y}}_j\Big [\hat{{\varvec{\beta }}}_i^{(k)\mathrm{T}}+\hat{{\varvec{y}}}_{ij}^{(k) \mathrm{T}}\hat{{\varvec{\gamma }}}_i^{(k)}\Big ]\right\} \nonumber \\&\times \left\{ \sum _{j=1}^n\sum _{i=1}^g\hat{z}_{ij}^{(k)} \Big [({\varvec{I}}_q-\hat{{\varvec{\gamma }}}_i^{(k)\mathrm{T}}\hat{{\varvec{A}}}^{(k)})\hat{{\varvec{\varOmega }}}_i^{(k)}+\hat{\tau }_{ij}^{(k)}\hat{{\varvec{u}}}_{ij}^{(k)}\hat{{\varvec{u}}}_{ij}^{(k)\mathrm T}\Big ]\right\} ^{-1}, \end{aligned}$$
    (8)
    $$\begin{aligned} \hat{{\varvec{\beta }}}_i^{(k+1)}= & {} \hat{{\varvec{\beta }}}_i^{(k)}+\frac{\sum _{j=1}^n\hat{z}_{ij}^{(k)}\hat{\tau }_{ij}^{(k)}\hat{{\varvec{\gamma }}}_i^{(k)\mathrm T}\hat{{\varvec{y}}}_{ij}^{(k)}}{\sum _{j=1}^n\hat{z}_{ij}^{(k)}\hat{\tau }_{ij}^{(k)}}, \end{aligned}$$
    (9)
    $$\begin{aligned} \hat{{\varvec{\varOmega }}}_i^{(k+1)}= & {} \frac{\sum _{j=1}^n\hat{z}_{ij}^{(k)}\hat{\tau }_{ij}^{(k)}\hat{{\varvec{\gamma }}}_i^{(k)\mathrm T}\hat{{\varvec{y}}}_{ij}^{(k)}\hat{{\varvec{y}}}_{ij}^{(k) \mathrm T}\hat{{\varvec{\gamma }}}_i^{(k)}}{\sum _{j=1}^n\hat{z}_{ij}^{(k)}}+({\varvec{I}}_q-\hat{{\varvec{\gamma }}}_i^{(k)\mathrm T}\hat{{\varvec{A}}}^{(k)})\hat{{\varvec{\varOmega }}}_i^{(k)},\nonumber \\ \end{aligned}$$
    (10)

    and

    $$\begin{aligned} \hat{{\varvec{D}}}_i^{(k+1)}=\text{ diag }\left\{ \hat{{\varvec{D}}}_i^{(k)}-\hat{{\varvec{D}}}_i^{(k)}\hat{{\varvec{\varSigma }}}_i^{(k)-1}\hat{{\varvec{D}}}_i^{(k)} +\frac{\sum _{j=1}^n\hat{z}_{ij}^{(k)}\hat{\tau }_{ij}^{(k)}\hat{{\varvec{D}}}_i^{(k)}\hat{{\varvec{\varSigma }}}_i^{(k)-1}\hat{{\varvec{y}}}_{ij}^{(k)}\hat{{\varvec{y}}}_{ij}^{(k)\mathrm T}\hat{{\varvec{\varSigma }}}_i^{(k)-1}\hat{{\varvec{D}}}_i^{(k)}}{\sum _{j=1}^n\hat{z}_{ij}^{(k)}}\right\} , \end{aligned}$$
    (11)

    for \(i=1,\ldots ,g\). When \({\varvec{D}}_i\)s are assumed to be the same across components, that is \({\varvec{D}}_1=\cdots ={\varvec{D}}_g={\varvec{D}}\), the updated formula for \({\varvec{D}}\) is given by

    $$\begin{aligned} \hat{{\varvec{D}}}^{(k+1)}= & {} \text{ diag }\left\{ \frac{\sum _{j=1}^n \sum _{i=1}^g\hat{z}_{ij}^{(k)}\hat{{\varvec{D}}}^{(k)}- \sum _{j=1}^n\sum _{i=1}^g\hat{z}_{ij}^{(k)}\hat{{\varvec{D}}}^{(k)} \hat{{\varvec{\varSigma }}}_i^{(k)-1}\hat{{\varvec{D}}}^{(k)}}{\sum _{j=1}^n \sum _{i=1}^g\hat{z}_{ij}^{(k)}}\right. \nonumber \\&\left. +\frac{\sum _{j=1}^n\sum _{i=1}^g\hat{z}_{ij}^{(k)} \hat{\tau }_{ij}^{(k)}\hat{{\varvec{D}}}^{(k)} \hat{{\varvec{\varSigma }}}_i^{(k)-1}\hat{{\varvec{y}}}_{ij}^{(k)}\hat{{\varvec{y}}}_{ij}^{(k) \mathrm T}\hat{{\varvec{\varSigma }}}_i^{(k)-1}\hat{{\varvec{D}}}^{(k)}}{\sum _{j=1}^n\sum _{i=1}^g\hat{z}_{ij}^{(k)}}\right\} . \end{aligned}$$
  • CM-step 2 for ECM Find the root of the following equation, which maximizes the constrained Q-function, to update \(\hat{\nu }_i^{(k+1)}\):

    $$\begin{aligned} \log \Bigg (\frac{\nu _i}{2}\Bigg )+1-\mathscr {D_G} \Bigg (\frac{\nu _i}{2}\Bigg )+\frac{\sum ^n_{j=1}\hat{z}_{ij}^{(k)} (\hat{\kappa }_{ij}^{(k)}-\hat{\tau }_{ij}^{(k)})}{\sum ^n_{j=1} \hat{z}_{ij}^{(k)}}=0,\quad i=1,\ldots ,g.\nonumber \\ \end{aligned}$$
    (12)

    When \(\nu _1=\cdots =\nu _g=\nu \), we obtain \(\hat{\nu }^{(k+1)}\) as the solution of the following equation:

    $$\begin{aligned} \log \Bigg (\frac{\nu }{2}\Bigg )+1-\mathscr {D_G} \Bigg (\frac{\nu }{2}\Bigg )+\frac{\sum ^g_{i=1}\sum ^n_{j=1} \hat{z}_{ij}^{(k)}(\hat{\kappa }_{ij}^{(k)}-\hat{\tau }_{ij}^{(k)})}{n}=0. \end{aligned}$$
    (13)
  • CM-step 2 for ECME Alternatively, to improve the convergence, we may exploit the advantage of the ECME step. Given current estimates, calculate \(\hat{\nu }_i^{(k+1)}\) by maximizing the constrained log-likelihood function, i.e.,

    $$\begin{aligned} \hat{\nu }_i^{(k+1)}=\arg \max _{\nu _i}\left\{ \sum _{j=1}^n\log \Big (\hat{\pi }_i^{(k+1)}t_p({\varvec{y}}_j|\hat{{\varvec{A}}}^{(k+1)} \hat{{\varvec{\beta }}}_i^{(k+1)},\hat{{\varvec{\varSigma }}}_i^{(k+1)},\nu _i)\Big )\right\} . \end{aligned}$$
    (14)

    Similarly, in the case of common dfs \((\nu _1=\cdots =\nu _g=\nu )\), we calculate

    $$\begin{aligned} \hat{\nu }^{(k+1)}=\arg \max _{\nu }\left\{ \sum _{j=1}^n\log \Bigg (\sum _{i=1}^g\hat{\pi }_i^{(k+1)}t_p({\varvec{y}}_j| \hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\beta }}}_i^{(k+1)}, \hat{{\varvec{\varSigma }}}_i^{(k+1)},\nu )\Bigg )\right\} .\nonumber \\ \end{aligned}$$
    (15)

Note that the solutions of Eqs. (12) and (13) involve a one-dimensional search for each \(\nu _i\) in turn, or for the common df \(\nu \), which can be carried out directly with the uniroot routine built into R (R Development Core Team 2009), with the search constrained to an appropriate interval such as [2, 200]. The procedures (14) and (15) can be implemented straightforwardly using the optim routine with starting value \(\hat{\nu }_i^{(k)}\) at each iteration; a code sketch of these df updates is given below. Given a set of suitable initial values \(\hat{{\varvec{\varTheta }}}^{(0)}\), as recommended in the next subsection, the ECM or ECME algorithm is run to obtain the ML estimates \(\hat{{\varvec{\varTheta }}}=(\hat{{\varvec{A}}},\hat{\pi }_i,\hat{{\varvec{\beta }}}_i,\hat{{\varvec{\varOmega }}}_i,\hat{{\varvec{D}}}_i,\hat{\nu }_i,i=1,\ldots ,g)\) iteratively until a user-specified stopping rule is satisfied. In our numerical analyses, the stopping rule \(\ell (\hat{{\varvec{\varTheta }}}^{(k+1)}|{\varvec{y}})-\ell (\hat{{\varvec{\varTheta }}}^{(k)}|{\varvec{y}})<10^{-6}\) is employed.
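
The sketch below illustrates these df updates; the argument names and the pre-computed weighted average of \(\hat{\kappa }_{ij}^{(k)}-\hat{\tau }_{ij}^{(k)}\) supplied as w_avg are ours.

```r
## CM-step 2 of ECM: solve Eq. (12) for nu_i by a one-dimensional search over [2, 200].
update_nu_ecm <- function(w_avg, interval = c(2, 200)) {
  uniroot(function(nu) log(nu / 2) + 1 - digamma(nu / 2) + w_avg,
          interval = interval)$root
}

## CM-step 2 of ECME: maximize the constrained log-likelihood (14) over nu_i with optim;
## 'obj' stands in for the bracketed sum in (14) viewed as a function of nu_i alone.
update_nu_ecme <- function(obj, nu_start, lower = 2, upper = 200) {
  optim(nu_start, fn = obj, method = "L-BFGS-B", lower = lower, upper = upper,
        control = list(fnscale = -1))$par   # fnscale = -1 makes optim maximize
}
```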

3.2 Initialization

The EM-type algorithm, like other iterative methods, may suffer from computational difficulties such as slow convergence or even non-convergence. In particular, when the data are too sparse or the dimension of the latent factors is over-specified, a poor choice of initial values \(\hat{{\varvec{\varTheta }}}^{(0)}\) may lead to convergence on the boundary of the parameter space. To alleviate such potential problems, a simple way of automatically generating a set of suitable initial values is recommended below (a code sketch is given after the list):

  1. 1.

    Perform a K-means clustering (Hartigan and Wong 1979) initialized with respect to a random start. Specify the zero-one component indicator \(\hat{{\varvec{Z}}}_j^{(0)}=(\hat{z}_{1j}^{(0)},\ldots ,\hat{z}_{gj}^{(0)})\) according to the K-means results. The initial values of the mixing proportions \(\pi _i\)s are taken as

    $$\begin{aligned} \hat{\pi }_i^{(0)}=n^{-1}\sum _{j=1}^n\hat{z}_{ij}^{(0)},\quad i=1,\ldots ,g. \end{aligned}$$
  2. 2.

    Let \({\varvec{y}}_{(i)}\) be the data in the ith subpopulation (group). Perform the ordinary factor analysis (Spearman 1904) for \({\varvec{y}}_{(i)}\). The initial estimate of \(\hat{{\varvec{\varOmega }}}_i^{(0)}\) is chosen as the sample variance–covariance matrix of the estimated factor scores.

  3. 3.

    Obtain the factor loading matrix for \({\varvec{y}}_{(i)}\) via the principal components analysis (PCA; Flury 1984) method, denoted by \(\hat{{\varvec{B}}}^{(0)}_i\) for \(i=1,\ldots ,g\). Set the initial estimate of \({\varvec{A}}\) as

    $$\begin{aligned} \hat{{\varvec{A}}}^{(0)}=\sum _{i=1}^g\hat{\pi }^{(0)}_i\hat{{\varvec{B}}}^{(0)}_i\hat{{\varvec{\varOmega }}}_i^{{(0)}^{-1/2}}. \end{aligned}$$
  4. 4.

    As for the initial estimate of \({\varvec{\beta }}_i\), set \(\hat{{\varvec{\beta }}}_i^{(0)}=\hat{{\varvec{A}}}^{(0)\mathrm{T}}\bar{{\varvec{y}}}_i\), where \(\bar{{\varvec{y}}}_i\) is the sample mean vector of \({\varvec{y}}_{(i)}\), \(i=1,\ldots ,g\).

  5. 5.

    The initial estimate of \({\varvec{D}}_i\) is obtained as a diagonal matrix formed from the diagonal elements of the sample covariance matrix of \({\varvec{y}}_{(i)}\). When \({\varvec{D}}_1=\cdots ={\varvec{D}}_g={\varvec{D}}\), we set \(\hat{{\varvec{D}}}^{(0)}\) to be a diagonal matrix formed from the diagonal elements of the pooled within-cluster sample covariance matrix of the g partitioned groups of the data.

  6. 6.

    With regard to the initial estimate of \(\nu _i\), we recommend setting a relatively large initial value, say \(\hat{\nu }_i^{(0)}=50\), \(\forall i\), which corresponds to an initial assumption of near-normality for the component factors and errors.
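
The code sketch below assembles Steps 1–6 into a single routine. It is only an illustration under stated assumptions: the per-group factanal and prcomp calls are one concrete choice among several, the Cholesky factor is used as one possible \({\varvec{\varOmega }}_i^{-1/2}\), and all object names are ours.

```r
## A sketch of initialization Steps 1-6; illustrative names only.
init_EMCtFA <- function(y, g, q, nu0 = 50) {
  n <- nrow(y); p <- ncol(y)
  km  <- kmeans(y, centers = g)                              # Step 1: K-means partition
  pi0 <- as.numeric(table(km$cluster)) / n
  A0  <- matrix(0, p, q)
  Omega0 <- D0 <- beta0 <- vector("list", g)
  for (i in seq_len(g)) {
    yi <- y[km$cluster == i, , drop = FALSE]
    fa <- factanal(yi, factors = q, scores = "regression")   # Step 2: ordinary FA
    Omega0[[i]] <- cov(fa$scores)                             # covariance of factor scores
    Bi <- prcomp(yi)$rotation[, 1:q, drop = FALSE]            # Step 3: PCA loadings
    A0 <- A0 + pi0[i] * Bi %*% solve(chol(Omega0[[i]]))       # one choice of Omega^{-1/2}
    D0[[i]] <- diag(diag(cov(yi)))                            # Step 5: diagonal covariance
  }
  for (i in seq_len(g))                                       # Step 4: beta_i = A0' ybar_i
    beta0[[i]] <- as.numeric(crossprod(A0, colMeans(y[km$cluster == i, , drop = FALSE])))
  list(pi = pi0, A = A0, beta = beta0, Omega = Omega0, D = D0,
       nu = rep(nu0, g))                                      # Step 6: nu_i^(0) = 50
}
```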

When implementing ECM and ECME for the EMCtFA, it is advantageous to use the Sherman–Morrison–Woodbury formula (Golub and Loan 1989) to avoid inverting any large \(p\times p\) matrix. That is, the inversion of the \(p\times p\) matrix \(({\varvec{A}}{\varvec{\varOmega }}_i{\varvec{A}}^\mathrm{T}+{\varvec{D}}_i)\) can be undertaken using the following result:

$$\begin{aligned} \big ({\varvec{A}}{\varvec{\varOmega }}_i{\varvec{A}}^{\mathrm{T}}+{\varvec{D}}_i\big )^{-1} ={\varvec{D}}_i^{-1}-{\varvec{D}}_i^{-1}{\varvec{A}}\big ({\varvec{\varOmega }}_i^{-1}+{\varvec{A}}^\mathrm{T}{\varvec{D}}_i^{-1}{\varvec{A}}\big )^{-1}{\varvec{A}}^{\mathrm{T}}{\varvec{D}}_i^{-1}, \end{aligned}$$

which involves only the inversion of a \(q\times q\) matrix on the right-hand side. It follows that \({\varvec{\gamma }}_i\) can be rewritten as \({\varvec{D}}_i^{-1}{\varvec{A}}({\varvec{\varOmega }}_i^{-1}+{\varvec{A}}^\mathrm{T}{\varvec{D}}_i^{-1}{\varvec{A}})^{-1}\). Moreover, to obtain a unique solution for \({\varvec{A}}\), as suggested by Baek et al. (2010), we perform the Cholesky decomposition on \(\hat{{\varvec{A}}}\) such that \(\hat{{\varvec{A}}}^{\mathrm{T}}\hat{{\varvec{A}}}={\varvec{C}}^{\mathrm{T}}{\varvec{C}}\), where \({\varvec{C}}\) is an upper triangular matrix of order q. Replacing \(\hat{{\varvec{A}}}\) by \(\hat{{\varvec{A}}}{\varvec{C}}^{-1}\) yields the orthonormal estimate of \({\varvec{A}}\), which satisfies \(\hat{{\varvec{A}}}^{\mathrm{T}}\hat{{\varvec{A}}}={\varvec{I}}_q\). Consequently, the correspondingly transformed estimates of \({\varvec{\beta }}_i\) and \({\varvec{\varOmega }}_i\) are given by \({\varvec{C}}\hat{{\varvec{\beta }}}_i\) and \({\varvec{C}}\hat{{\varvec{\varOmega }}}_i{\varvec{C}}^{\mathrm{T}}\), respectively.
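
A sketch of this inversion, which also returns \({\varvec{\gamma }}_i\) in its reduced form, is given below; the function name is ours and \({\varvec{D}}_i\) is passed as the vector of its diagonal entries.

```r
## Sherman-Morrison-Woodbury inversion of Sigma_i = A Omega_i A' + D_i,
## inverting only q x q matrices.
woodbury_inverse <- function(A, Omega_i, d_i) {     # d_i: diagonal entries of D_i
  Dinv_A <- A / d_i                                 # D_i^{-1} A (row-wise scaling)
  M      <- solve(Omega_i) + crossprod(A, Dinv_A)   # Omega_i^{-1} + A' D_i^{-1} A (q x q)
  gamma_i <- Dinv_A %*% solve(M)                    # gamma_i = D_i^{-1} A M^{-1}
  Sigma_inv <- diag(1 / d_i) - gamma_i %*% t(Dinv_A)
  list(Sigma_inv = Sigma_inv, gamma = gamma_i)
}
```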

Notably, EM-based procedures can get trapped in one of the many local maxima of the likelihood function, and this phenomenon may still occur in the estimation of the EMCtFA, especially when the number of latent factors is over-specified. To mitigate this limitation, we recommend initializing the algorithm with a variety of slightly different starting values by performing the K-means allocation of subjects with various random starts, and then retaining the solution with the largest log-likelihood value.

4 Computational aspects

4.1 Clustering

The estimation of the component labels \({\varvec{Z}}_j\) and the factor scores \({\varvec{u}}_j\) is useful for assigning each observation \({\varvec{y}}_j\) to a suitable cluster and for displaying the high-dimensional data in lower-dimensional plots. Once the EMCtFA model has been fitted, a probabilistic clustering of the data into g clusters can be determined by the maximum a posteriori (MAP) rule for component membership. That is, \(\hat{z}_{ij}^{(k)}\) evaluated at \({\varvec{\varTheta }}=\hat{{\varvec{\varTheta }}}\), denoted by \(\hat{z}_{ij}\), is the estimated posterior probability that \({\varvec{y}}_j\) belongs to the ith component, and each observation is assigned to the component with the highest estimated posterior probability.

From Eq. (21), we calculate the estimated conditional expectation of the component factors \({\varvec{u}}_{ij}\) corresponding to \({\varvec{y}}_j\) evaluated at \({\varvec{\varTheta }}=\hat{{\varvec{\varTheta }}}\), denoted by \(\hat{{\varvec{u}}}_{ij}\). The factor scores corresponding to \({\varvec{y}}_j\) are then estimated as

$$\begin{aligned} \hat{{\varvec{u}}}_j=\sum _{i=1}^g\hat{z}_{ij}\hat{{\varvec{u}}}_{ij}. \end{aligned}$$
(16)

Let \(\tilde{z}_{ij}=1\) if \(\hat{z}_{ij}\ge \hat{z}_{hj}\) for \(h\ne i\), \(i,h=1,\ldots ,g\), and \(\tilde{z}_{ij}=0\) otherwise. Alternatively, substituting \(\tilde{z}_{ij}\) for \(\hat{z}_{ij}\) in (16) leads to an alternative posterior estimate of the factor scores. Therefore, we can display the p-dimensional observations \({\varvec{y}}_j\) in a q-dimensional subspace by plotting the corresponding values \(\hat{{\varvec{u}}}_j\). In addition, the fitted values of \({\varvec{y}}_j\) can be calculated as \(\hat{{\varvec{y}}}_j=\hat{{\varvec{A}}}\hat{{\varvec{u}}}_j\).
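
A brief R sketch of the MAP assignment and the factor-score estimate (16) follows; here z_hat is the \(n\times g\) matrix of converged posterior probabilities, u_hat is a list of \(n\times q\) matrices of component-specific conditional factor means, and all names are ours.

```r
## MAP clustering and estimated factor scores (16).
map_and_scores <- function(z_hat, u_hat) {
  cluster <- max.col(z_hat)                     # MAP component labels
  n <- nrow(z_hat); q <- ncol(u_hat[[1]])
  scores <- matrix(0, n, q)
  for (i in seq_along(u_hat))
    scores <- scores + z_hat[, i] * u_hat[[i]]  # u_hat_j = sum_i z_hat_ij * u_hat_ij
  list(cluster = cluster, scores = scores)
}
## Low-dimensional display, e.g.: res <- map_and_scores(z_hat, u_hat)
##                                pairs(res$scores, col = res$cluster)
```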

4.2 Classification for new subjects

It is also of interest to classify a new subject using the EMCtFA approach. For this purpose, let \({\varvec{y}}_\mathrm{new}=(y_\mathrm{new 1},\ldots ,y_{\mathrm{new}p})^{\mathrm{T}}\) be the observations for a new subject. Suppose that the model for \({\varvec{y}}_\mathrm{new}\) can be written as:

$$\begin{aligned} {\varvec{y}}_\mathrm{new}={\varvec{A}}{\varvec{u}}_{i,\mathrm{new}}+{\varvec{e}}_{i,\mathrm{new}}\quad \text{ with } \text{ probability } \,\,\pi _i~(i=1,\ldots ,g), \end{aligned}$$

where the joint distribution of \({\varvec{u}}_{i,\mathrm{new}}\) and \({\varvec{e}}_{i,\mathrm{new}}\) satisfies assumption (3). We now turn our attention to diagnosing the allocated group of the new subject and characterizing its predictive density. The classification of the new subject is based on a fitted (i) conditional distribution of the observed vector \({\varvec{y}}_\mathrm{new}\) given an appropriate predictor of the factor scores (conditional prediction) and (ii) marginal distribution of the observed vector \({\varvec{y}}_\mathrm{new}\) (marginal prediction).

Given the model parameters, the strength of allocating \({\varvec{y}}_\mathrm{new}\) to the ith group is characterized by a predictive density \(p({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_i)\) whose estimated expression is discussed below. The predictive density of \({\varvec{y}}_\mathrm{new}\) is

$$\begin{aligned} \hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{\varTheta }})=\sum _{i=1}^g\pi _i\hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_i) \end{aligned}$$

in which the predictive density belonging to component i, say \(\hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_i)\), can be estimated by the conditional and marginal predictions described below.

For conditional prediction, the predictive density \(p({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_i)\) is the conditional density of \({\varvec{y}}_\mathrm{new}\) given the estimated factor scores \(\hat{{\varvec{u}}}_{i,\mathrm{new}}\). Specifically,

$$\begin{aligned} \hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_i)=t_p({\varvec{y}}_\mathrm{new}|{\varvec{A}}\hat{{\varvec{u}}}_{i,\mathrm{new}},{\varvec{D}}_i,\nu _i). \end{aligned}$$

As in (16), a suitable estimate of component factors is the conditional mean of \({\varvec{u}}_{i,\mathrm{new}}\) which is calculated using an expression analogous to \(\hat{{\varvec{u}}}_{ij}^{(k)}\) with \({\varvec{y}}_j, {\varvec{u}}_{ij}\) and \(\hat{{\varvec{\varTheta }}}^{(k)}\) replaced by \({\varvec{y}}_\mathrm{new}, {\varvec{u}}_{i,\mathrm{new}}\) and \(\hat{{\varvec{\varTheta }}}\), respectively.

For marginal prediction, the predictive density \(p({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_i)\) is the marginal density of \({\varvec{y}}_\mathrm{new}\), where the term ‘marginal’ reflects the fact that the component factors \({\varvec{u}}_{i,\mathrm{new}}\) are integrated out from the joint density of \(({\varvec{y}}_\mathrm{new}^{\mathrm{T}}, {\varvec{u}}_{i,\mathrm{new}}^{\mathrm{T}})^\mathrm{T}\). We, thus, have

$$\begin{aligned} \hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_i)=t_p({\varvec{y}}_\mathrm{new}|{\varvec{A}}{\varvec{\beta }}_i,{\varvec{\varSigma }}_i,\nu _i). \end{aligned}$$

Subsequently, the estimated allocation of the new subject to group i is determined by combining the prior probabilities \(\pi _1,\ldots ,\pi _g\) with the estimated values of the predictive densities \(\hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_1),\ldots ,\hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_g)\), giving

$$\begin{aligned} \hat{\mathscr {P}}_{i,\mathrm{new}}=\pi _i\hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{A}},{\varvec{\theta }}_i) \big /\hat{p}({\varvec{y}}_\mathrm{new}|{\varvec{\varTheta }}),\quad i=1,\ldots ,g. \end{aligned}$$

Within the likelihood-based approach, all model parameters are estimated by the ML estimates \(\hat{{\varvec{A}}}\) and \(\hat{{\varvec{\theta }}}_i\). Consequently, based on the MAP classification rule, the feature vector \({\varvec{y}}_\mathrm{new}\) is classified to group i if \(\hat{\mathscr {P}}_{i,\mathrm{new}}>\hat{\mathscr {P}}_{h,\mathrm{new}}\), for \(h\ne i, i,h=1,\ldots ,g\).
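
For illustration, the marginal-prediction classifier can be coded as follows; the sketch assumes the mvtnorm package and uses illustrative names throughout.

```r
## Classify a new observation by marginal prediction and the MAP rule.
classify_new <- function(y_new, prop, A, beta, Omega, D, nu) {
  g <- length(prop)
  dens <- sapply(seq_len(g), function(i) {
    Sigma_i <- A %*% Omega[[i]] %*% t(A) + D[[i]]
    mvtnorm::dmvt(y_new, delta = as.numeric(A %*% beta[[i]]),
                  sigma = Sigma_i, df = nu[i], log = FALSE)
  })
  post <- prop * dens / sum(prop * dens)   # estimated allocation probabilities
  list(group = which.max(post), prob = post)
}
```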

4.3 Outlier identification

Identification of outliers is an important issue because even a few outliers may produce poor clustering results. Just like the allocation indicator \(z_{ij}\), the scaling variable \(\tau _{j}\) not only facilitates the implementation of the EM-type algorithm but also aids the interpretation of the estimated model. As can be seen from (8) to (11), \(\hat{\tau }_{ij}^{(k)}\) acts as a weight in the estimation of \({\varvec{A}}, {\varvec{\beta }}_i, {\varvec{\varOmega }}_i\) and \({\varvec{D}}_i\). Because the estimate of \(\tau _j\) decreases as the estimated Mahalanobis distance \(\delta _{ij}\) between \({\varvec{y}}_j\) and \({\varvec{A}}{\varvec{\beta }}_i\) increases, a small value of \(\hat{\tau }_{ij}\) (i.e., \(\hat{\tau }_{ij}^{(k)}\) at convergence) downweights the influence of the corresponding subject, which can therefore be regarded as a suspected outlier.

To explicitly identify which subjects are outliers, we follow the idea of Lo and Gottardo (2012) and establish a convenient rule: a subject is flagged when its associated value \(\hat{\tau }_j=\sum _{i=1}^g\tilde{z}_{ij}\hat{\tau }_{ij}\) is smaller than a critical value, where \(\tilde{z}_{ij}=1\) if \(\hat{z}_{ij}\ge \hat{z}_{hj}\) for \(h\ne i\), and \(\tilde{z}_{ij}=0\) otherwise. From the viewpoint of hypothesis testing, if we treat \(\hat{\tau }_j\) as a test statistic, then the critical value should be selected from an appropriate reference distribution. Given \(z_{ij}=1\), \({\varvec{y}}_j\) follows a p-dimensional t-distribution with location \({\varvec{A}}{\varvec{\beta }}_i\), scale-covariance \({\varvec{\varSigma }}_i\) and df \(\nu _i\), and the Mahalanobis distance \(\delta _{ij}\) follows \(p\mathscr {F}(p,\nu _i)\), where \(\mathscr {F}(a,b)\) denotes an F distribution with dfs a and b. Thus, \(\hat{\tau }_{j}\) follows a scaled Beta distribution, namely \((1+p/\nu _i)\mathscr {B}eta(\nu _i/2, p/2)\). Under a significance level of \(\alpha \), the critical value is determined as:

$$\begin{aligned} c=(1+p/\nu _i)\mathscr {B}_{\alpha }(\nu _i/2, p/2), \end{aligned}$$
(17)

where \(\mathscr {B}_\alpha (\cdot ,\cdot )\) denotes the \(\alpha \) quantile of the Beta distribution, i.e., \(P(B\ge \mathscr {B}_{\alpha })=1-\alpha \). Consequently, given that \({\varvec{y}}_j\) belongs to the ith group, if \(\hat{\tau }_j<c\), then the corresponding subject is treated as a suspected outlier.
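
In R, the critical value (17) amounts to a single quantile call; the sketch below is ours.

```r
## Critical value (17) for the outlier rule: the alpha quantile of the scaled Beta
## reference distribution of tau_hat_j.
outlier_cutoff <- function(p, nu, alpha = 0.05) {
  (1 + p / nu) * qbeta(alpha, nu / 2, p / 2)
}
outlier_cutoff(p = 13, nu = 12.658)   # roughly 0.55, in line with the cutoff in Sect. 5
```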

4.4 Model selection

To choose the preferred models and determine the numbers of latent factors q and components g, we adopt two widely used model selection criteria. Let \(\ell _\mathrm{max}\) be the maximized log-likelihood, and m the number of free parameters in the model. The Bayesian information criterion (BIC; Schwarz 1978), defined as

$$\begin{aligned} \text{ BIC }=m\log n-2\ell _\mathrm{max}, \end{aligned}$$

is the most commonly employed approach to identifying which model gives the best approximation to the underlying density. Accordingly, models with smaller BIC scores are preferred. Under certain regularity conditions, Keribin (2000) presented a theoretical justification for the efficacy of the BIC in determining the number of components of a mixture model. Fraley and Raftery (2002) gave some empirical evidence that the BIC performs well in model-based clustering tasks.

As argued by Biernacki et al. (2000), BIC may not be an ideal way of identifying the number of clusters. Indeed, BIC tends to favor models with more mixture components in order to provide a good density estimate of the data. They instead proposed a promising alternative measure for estimating the proper number of clusters based on the integrated completed likelihood (ICL), calculated as:

$$\begin{aligned} \text{ ICL }=\text{ BIC }+2EN({\varvec{z}}), \end{aligned}$$

where \(EN({\varvec{z}})=-\sum _{j=1}^n\sum _{i=1}^g\hat{z}_{ij}\log \hat{z}_{ij}\) is the entropy of the classification matrix with the (i, j)th entry being \(\hat{z}_{ij}\). In the same vein, the smaller the ICL value, the better the model. The ICL is often preferable to BIC for EMCtFA because the additional entropy term penalizes poorly separated components and thus tends to select more parsimonious models. Nevertheless, there is no unanimity about which criterion is always the best, and a combined use of BIC and ICL can help in screening reasonable candidate models.
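
Both criteria are simple functions of the maximized log-likelihood, the parameter count and the posterior classification matrix; a small sketch (names are ours) is:

```r
## BIC and ICL as defined above; z_hat is the n x g matrix of estimated posterior
## probabilities and m the number of free parameters.
bic_icl <- function(loglik_max, m, n, z_hat) {
  bic <- m * log(n) - 2 * loglik_max
  en  <- -sum(z_hat * log(pmax(z_hat, .Machine$double.xmin)))  # entropy EN(z), guarding log(0)
  c(BIC = bic, ICL = bic + 2 * en)
}
```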

From a classification viewpoint, classification accuracy can be taken as an alternative measure of fit in some sense. To measure the agreement between a clustering of the data and the true group labels, we employ leave-one-out (LOO) cross-validation of the MAP classifications against the true group labels to evaluate the correct classification rate (CCR; Lee et al. 2003) and the adjusted Rand index (ARI; Hubert and Arabie 1985). The LOO technique leaves out one subject at a time and uses the remaining subjects as training data to update the parameters. The CCR, which ranges from zero to one, is computed as the proportion of observations correctly classified with respect to the true group labels. As a measure of agreement between two clusterings, the ARI has an expected value of zero under random classification and attains its maximum value of one for perfect classification.

5 Application: the Italian wine data

Forina et al. (1986) reported 28 chemical and physical properties of three types of Italian wine, including 59 Barolo, 71 Grignolino and 48 Barbera. A subset of \(p = 13\) of these variables (listed in the first column of Table 2) for \(n = 178\) wines is available as part of the gclus package (Hurley 2004) of R software. The proposed techniques are demonstrated on the analysis of these Italian wines.

For comparison, in addition to the EMCtFA model, the MFA, MtFA, MCFA, MCtFA and EMCFA (extended MCFA, i.e., the original MCFA with distinct variance–covariance matrices for the latent factors) approaches are also fitted to the data. Prior to the analyses, each variable is standardized to have zero mean and unit standard deviation. For the MtFA, EMCtFA, and MCtFA, both equal and unequal dfs are considered for the component factors and errors. Henceforth, their counterparts with equal dfs, say \(\nu _i=\nu \) for all i, are named the ‘MtFAe’, ‘EMCtFAe’, and ‘MCtFAe’, respectively. For supervised learning of the wine data with three class labels, the nine candidate models are fitted with \(g=3\) components and q varying from 1 to 8, where the choice of maximum \(q=8\) satisfies the restriction \((p-q)^2\ge (p+q)\), as recommended by McLachlan and Peel (2000, Chapter 8). All models are trained by the proposed ECME algorithm over five trials with different K-means initializations, and the solution providing the largest log-likelihood value is retained.
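
For reproducibility, the data preparation amounts to loading the wine data from gclus and standardizing the 13 measurements; the sketch below assumes the class label is stored in the first column of that data frame.

```r
## Loading and standardizing the wine data; a sketch assuming the gclus package.
library(gclus)
data(wine)
y <- scale(as.matrix(wine[, -1]))   # 178 x 13 matrix, zero mean and unit SD per variable
labels <- wine[, 1]                 # true wine types (Barolo, Grignolino, Barbera)
```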

Table 1 Model comparison based on BIC and ICL for the wine data

Table 1 reports the number of model parameters m and the values of BIC and ICL for the considered 72 scenarios in terms of the specification of models and the number of factors q. In light of BIC and ICL, the t-based models outperform their normal counterparts except for the case of \(q=1\). Furthermore, it is evident that both criteria give a consistent preference in the study, that is, the best fit to the data is EMCtFAe (\(q=4\)), followed by EMCtFA (\(q=4\)), MCtFA (\(q=4\)) and MCtFAe (\(q=4\)).

The resulting ML estimates of the common factor loadings \(\hat{{\varvec{A}}}\) and the component means \(\hat{{\varvec{\mu }}}_i=\hat{{\varvec{A}}}\hat{{\varvec{\beta }}}_i\) (\(i=1,2,3\)), together with the empirical sample means \(\bar{{\varvec{y}}}_i\), for the best model are presented in Table 2. Herein, the cluster components are matched to the wine types by the shortest Euclidean distance between the sample class means \(\bar{{\varvec{y}}}_i\) and the estimated component means \(\hat{{\varvec{\mu }}}_i\), for \(i=1,2,3\). The estimates of the mixing proportions are \(\hat{\pi }_1=0.331\), \(\hat{\pi }_2=0.382\) and \(\hat{\pi }_3=0.287\), which are very close to the proportions of the corresponding groups in the wine data. Besides, the estimate of the common df (\(\hat{\nu }=12.658\)) is somewhat small, signifying that heavy-tailed behavior is exhibited within the multi-dimensional Italian wine data.

Table 2 Estimation results for the fitted EMCtFAe with \(g=3\) and \(q=4\) on the wine data

In the model-based classification framework, the main objective is to estimate the group memberships of new subjects whose group memberships are unknown. To empirically demonstrate the performance of the proposed classification procedure described in Sect. 4.2, we calculate the CCR and ARI of the classification results based on both marginal and conditional predictions. Table 3 tabulates the classification performance of the best-fitting model within each of the EMCtFA and MCtFA classes, namely EMCtFAe (\(q=4\)) and MCtFA (\(q=4\)). The EMCtFAe model shows a slight improvement in classification accuracy, having higher CCR and ARI values than the MCtFA.

Table 3 Summary of outcome groups against classification results under the EMCtFAe (\(q=4\)) and MCtFA (\(q=4\)) models with CCR and ARI values (based on marginal prediction and conditional prediction) together with the corresponding mixing proportions

Figure 1 shows the 3D scatter-ellipsoid plots of two triples of estimated factor scores calculated using (16) for the best model, where the colors of the dots correspond to the true class labels. It is interesting to see that the three groups of wines can be visually separated by mapping the estimated factor scores to a low-dimensional space. Furthermore, it is of interest to detect outlying observations based on the identification rule described in Sect. 4.3. With a significance level of \(\alpha =0.05\), any subject whose estimated \(\hat{\tau }_j\) is less than the critical value \(c=0.552\), calculated from (17), is deemed a suspected outlier. This identification rule indicates that Barolo wine 14, Grignolino wines 62, 69, 70, 74, 96, 97, 111 and 122, and Barbera wines 159 and 160 can be regarded as potential outliers. The finding is consistent with the estimate of the df, reflecting that the wine data have longer-than-normal tails.

Fig. 1
figure 1

3D Scatterplots along with \(95\%\) confidence ellipsoids of the triple estimated factor scores colored according to the classified clusters under the fitted EMCtFAe with \(g=3\) and \(q=4\) for the wine data

Table 4 Averages of numbers of iterations, CPU time (in seconds) for convergence, the log-likelihood values at the initial and converged iterations, and the CCR and ARI values for clustering results along with the number of non-convergence cases (in parentheses) over 100 replicates using ECME algorithm with two different initialization methods
Fig. 2
figure 2

Typical evolution of the log-likelihood values when fitting the EMCtFA with \(g=5\) and \(q=2\) to one of the simulated datasets for each case, implementing the ECME with three different initialization methods

6 Simulation

In this section, we conduct a small-scale simulation study to compare the performance of the initialization method presented in Sect. 3.2 (Method 1) with the strategy described in the Appendix of Baek et al. (2010) (Method 2). The computations were carried out in R 2.13.1 under a 64-bit Windows environment on a desktop PC with a 3.40 GHz Intel Core(TM) i7-2600 CPU and 8.0 GB RAM. We generate artificial datasets in \(\mathscr {R}^{10+p_2}\) of sizes \(n=100\) and 250 from a five-component EMCtFA model with \(q=2\). The number of noise variables \(p_2\) is set to 0 and 20, so the total number of variables p equals 10 and 30, respectively. Specifically, the artificial data were generated from

$$\begin{aligned} {\varvec{y}}_j=({\varvec{A}}_1^{\mathrm{T}},{\varvec{A}}_2^{\mathrm{T}})^\mathrm{T}{\varvec{u}}_{ij}+{\varvec{e}}_{ij}\quad \text{ with } \text{ probability } \pi _i,(i=1,\ldots ,5), \end{aligned}$$

in which the distributional assumption for \(({\varvec{u}}_{ij}^\mathrm{T},{\varvec{e}}_{ij}^{\mathrm{T}})^\mathrm{T}\) satisfies Eq. (3). The presumed model parameters are the same as those specified in Section 6 of Baek et al. (2010), except that \(\nu _i=5\) for \(i=1,\ldots ,5\). Each simulated dataset was fitted with the EMCtFA (\(g=5,~q=2\)) by implementing the ECME algorithm with parameters initialized once from each of the two methods. A total of 100 independent replications were run for each simulated case.

Table 4 lists the averages of the required numbers of iterations, the CPU time (in seconds) consumed until convergence, the initial and maximized (converged) log-likelihood values, and the CCR and ARI values for the clustering results, along with the number of non-convergence cases (in parentheses), over the 100 trials. The non-convergence cases are due mostly to singularity of the scale-covariance matrices \({\varvec{\varSigma }}_i\) during the iterations. Figure 2 displays the typical evolution of the log-likelihood values for one of the 100 replicates for each considered case. The numerical results indicate that the initialization of Method 1 leads to much faster convergence, as it requires fewer iterations and less CPU time than Method 2. Meanwhile, Method 1 yields higher starting log-likelihood values (closer to the maximized log-likelihood values upon convergence of ECME) and higher maximized log-likelihood values, as well as better classification performance in terms of CCR and ARI. This study illustrates the effectiveness of our recommended initialization procedure. The poor performance of Method 2 is largely attributable to its generating initial values for each entry of \(\hat{{\varvec{A}}}^{(0)}\) simply from the standard normal distribution, which is often inappropriate.

7 Conclusion

The MtFA approach provides a flexible formulation of the component scale-covariance matrices and leaves the component means unrestricted. Hence, it is useful for analyzing high-dimensional data with heavy tails or atypical observations. In this paper, we have studied a comparable approach, named EMCtFA, using a factor-analytic representation of the multivariate t component scale-covariance matrices with common factor loadings and distinct covariance matrices for the latent factors and errors. The EMCtFA approach, which contains the MCtFA as a special case, achieves a compromise reduction in the number of parameters, particularly when the dimension p and the number of clusters g are not small. This approach is well suited for clustering a wide variety of high-dimensional data and provides robustness (less sensitivity to outliers) in terms of the resulting number of clusters.

In this work, we have developed two computationally flexible EM-type algorithms and offered a simple way of generating suitable initial values for carrying out ML estimation of the EMCtFA model within a convenient complete-data framework. The utility of the proposed approach has been demonstrated through experimental studies on real and simulated datasets. Numerical results have also shown that the proposed techniques perform reasonably well for the Italian wine data and outperform some common existing approaches.

To alleviate some limitations associated with the deterministic likelihood-based approach, one may resort to the VB approximation method, which maximizes a lower bound on the marginal log-likelihood (Jordan 1999; Corduneanu and Bishop 2001; Tzikas et al. 2008; Zhao and Yu 2009). The VB strategy has been shown to be effective for simultaneously estimating model parameters and determining the number of components for the MFA (Ghahramani and Beal 2000) and MCFA (Wei and Li 2013) models. Therefore, it is worthwhile to establish a novel VB scheme for learning the EMCtFA model under an approximate Bayesian paradigm. In addition, it is of interest to extend the EMCtFA to a broader mixture family of component densities, such as the multivariate skew t (Lin 2010; Lee and McLachlan 2014) and the canonical fundamental multivariate skew t (Lee and McLachlan 2016) distributions.