1 Introduction

Mixtures of factor analyzers (MFA), originally introduced by Ghahramani and Hinton (1997), provide a global non-linear approach to dimension reduction via the adoption of component distributions having a factor-analytic representation for the component-covariance matrices. To substantially reduce the number of parameters in the component matrices, especially when the number of components (g) or features (p) becomes large, Baek et al. (2010) extended the MFA by using common component factor loadings, known as mixtures of common factor analyzers (MCFA), which has since become a popular tool for high-dimensional data analysis. To deal with data with extreme values or outliers commonly observed in microarray experiments, Baek and McLachlan (2011) presented a robust version of MCFA using multivariate Student’s-t distributed component errors and factors, called mixtures of common t-factor analyzers (MCtFA). Recently, Wang (2013, 2015) extended the MCFA and MCtFA approaches to accommodate high-dimensional data with possibly missing values.

The specification of the component factors and errors in both MFA and MCFA rests on the assumption of multivariate normality for computational convenience and mathematical tractability, which leaves the two models highly vulnerable to outliers. Although the MCtFA model is less affected by violations of normality, it may still suffer from a lack of robustness against highly asymmetric observations. In many practical problems, however, the data to be analyzed may contain a group or groups of observations whose distributions are moderately or severely skewed and/or heavy-tailed. As shown in many empirical studies, even a slight deviation from normality may seriously affect the estimates of the mixture parameters and subsequently lead to spurious groups as well as misleading statistical inference.

Over the past few decades, there has been growing interest in adopting more flexible parametric distributions to accommodate non-normal features such as asymmetry and longer-than-normal tails, which lead to non-zero skewness and excess kurtosis; see the monograph by Azzalini (2014) for a more comprehensive overview. Lin et al. (2015) proposed a robust extension of factor analysis models based on the restricted multivariate skew-t (rMST) distribution (Pyne et al. 2009). Other related proposals include mixtures of skew-normal/t factor analyzers (Lin et al. 2016, 2018), mixtures of generalized hyperbolic (GH) factor analyzers (Tortora et al. 2016), mixtures of skew-t factor analyzers (Murray et al. 2014a), and mixtures of common skew-t factor analyzers (Murray et al. 2014b). In addition, Murray et al. (2017a) presented an extended version of MFA with the component factors and errors following the skew-t distribution considered by Sahu et al. (2003), which is referred to as the unrestricted multivariate skew-t (uMST) distribution by Lee and McLachlan (2014).

Note that the rMST and uMST distributions are not nested within each other, and they are equivalent only in the univariate case. Moreover, Sahu et al. (2003) have highlighted that the calculation of the uMST density becomes cumbersome as p increases. The computational difficulty of the uMST formulation was also pointed out by Murray et al. (2017a; Section 5). Azzalini et al. (2016) have provided a detailed comparison between the rMST and uMST distributions in terms of the merits of both distributions for data modeling. When comparing the two distributions in the context of model-based clustering, their illustrative examples indicate that “neither formulation is markedly superior and, if these results were to be taken in favor of either formulation, it would be the classical formulation”, namely the rMST distribution adopted in this paper.

Further, it is interesting to note that the skew-t distribution adopted by Murray et al. (2014a, b), arising from the family of GH distributions (Barndorff-Nielsen and Shephard 2001), is henceforth referred to as the generalized hyperbolic skew-t (GHST) distribution. Its density has a rather different form from that of the rMST distribution and does not include the skew-normal as a limiting case (Lee and Poon 2011). The model proposed by Murray et al. (2014b) is henceforth referred to as mixtures of common generalized hyperbolic skew-t factor analyzers (MCghstFA).

In this paper, we propose an alternative skew extension of the MCtFA model based on the rMST distribution, called the mixture of common restricted skew-t factor analyzers (MCrstFA) model. This new proposal preserves resistance to the extreme non-normal effects that commonly occur in high-dimensional data. As in the MCFA and MCtFA models, common factor loadings are utilized for parsimoniously modeling the component-covariance matrices. To portray the observed data in a lower-dimensional space and avoid possible singularities, the scale-covariance matrices for the component errors (\({\varvec{D}}_i\)) are generally assumed to be homogeneous (\({\varvec{D}}_i={\varvec{D}}\)). Under certain circumstances, \({\varvec{D}}_i\) can be relaxed to be unequal or modified to other types such as \({\varvec{D}}_i=d_i{\varvec{I}}_p\) (isotropic with unequal variances) or \({\varvec{D}}_i=d{\varvec{I}}_p\) (isotropic with equal variance). Lately, Wang and Lin (2017) presented a modification of MCtFA using component-specific \({\varvec{D}}_i\) and empirically demonstrated its advantage in classifying new subjects whose true group labels are unknown in advance.

The rest of the paper is structured as follows. In Sect. 2, we establish the notation and outline some preliminary properties of the rMST distribution. In Sect. 3, we present the specification of the MCrstFA model and develop a workable expectation conditional maximization either (ECME) algorithm for carrying out maximum likelihood (ML) estimation. In Sect. 4, we discuss the initialization and stopping rules, the criteria for model selection and clustering performance, and identifiability issues. In Sect. 5, we conduct two simulation studies to examine the validity of the MCrstFA model. The methodology is illustrated on a real example concerning human liver cancer data in Sect. 6. Concluding remarks and directions for future work are given in Sect. 7. Some detailed proofs and supplementary information are deferred to the appendices.

2 Notation and prerequisites

We first review the rMST distribution and study its related properties. Let \(\phi _p(\cdot ;{\varvec{\mu }},{\varvec{\varSigma }})\) be the probability density function (pdf) of a multivariate normal distribution with mean vector \({\varvec{\mu }}\) and covariance matrix \({\varvec{\varSigma }}\), denoted by \(N_p({\varvec{\mu }},{\varvec{\varSigma }})\); \({\varPhi }(\cdot )\) the cumulative distribution function (cdf) of the standard normal distribution; \(TN(\mu ,\sigma ^2;(a,b))\) the truncated normal distribution defined as a normal distribution \(N(\mu ,\sigma ^2)\) truncated to the interval \((a,b)\); \(t_p(\cdot ;{\varvec{\mu }},{\varvec{\varSigma }},\nu )\) the pdf of a p-variate t distribution with location \({\varvec{\mu }}\), scale-covariance matrix \({\varvec{\varSigma }}\) and degrees of freedom (DOF) \(\nu \); \(g(x;\alpha ,\beta )\) the pdf of the gamma distribution, given by \(\beta ^{\alpha }x^{\alpha -1}\exp \{-\,\beta x\}/{\varGamma }(\alpha )\); \(T(\cdot ;\nu )\) the cdf of the Student’s t distribution with zero location, unit scale and DOF \(\nu \); \({\varvec{1}}_p\) a \(p\times 1\) vector with all elements equal to 1; \({\varvec{I}}_p\) the \(p\times p\) identity matrix; Diag\(\{\cdot \}\) the diagonal matrix formed by extracting the main diagonal elements of a square matrix or by the diagonalization of a vector; and \({\varvec{A}}^{1/2}\) the square root of a symmetric matrix \({\varvec{A}}\).

Following Pyne et al. (2009), a p-dimensional random vector \({\varvec{Y}}\) is said to follow the rMST distribution with location vector \({\varvec{\mu }}\in {\mathbb {R}}^p\), scale-covariance matrix \({\varvec{\varSigma }}\), skewness vector \({\varvec{\lambda }}\in {\mathbb {R}}^p\) and DOF \(\nu \in {\mathbb {R}}^{+}\), denoted as \({\varvec{Y}}\sim rST_p({\varvec{\mu }},{\varvec{\varSigma }},{\varvec{\lambda }},\nu )\), if it has the pdf:

$$\begin{aligned} \psi _p({\varvec{y}};{\varvec{\mu }},{\varvec{\varSigma }},{\varvec{\lambda }},\nu )=2t_p({\varvec{y}};{\varvec{\mu }},{\varvec{\varOmega }},\nu )T\left( M\sqrt{\frac{\nu +p}{\nu +\delta }};\nu +p\right) , \end{aligned}$$
(1)

where \({\varvec{\varOmega }}={\varvec{\varSigma }}+{\varvec{\lambda }}{\varvec{\lambda }}^{\top }\), \(\delta =({\varvec{y}}-{\varvec{\mu }})^{\top }{\varvec{\varOmega }}^{-1}({\varvec{y}}-{\varvec{\mu }})\) and \(M={\varvec{\lambda }}^{\top }{\varvec{\varOmega }}^{-1}({\varvec{y}}-{\varvec{\mu }})/(1-{\varvec{\lambda }}^{\top }{\varvec{\varOmega }}^{-1}{\varvec{\lambda }})^{1/2}\). Note that the distribution of \({\varvec{Y}}\) is reduced to \(t_p({\varvec{\mu }},{\varvec{\varSigma }},\nu )\) by setting \({\varvec{\lambda }}={\varvec{0}}\) and to \(rSN_p({\varvec{\mu }},{\varvec{\varSigma }},{\varvec{\lambda }})\) as \(\nu \rightarrow \infty \). Furthermore, the family of (1) also includes \(N_p({\varvec{\mu }},{\varvec{\varSigma }})\), obtained by letting \({\varvec{\lambda }}={\varvec{0}}\) and \(\nu \rightarrow \infty \).
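To make the density in (1) concrete, the following is a minimal R sketch of \(\psi _p\) built from the p-variate t density in the mvtnorm package and the univariate t cdf; the function name dmst and its arguments are illustrative rather than part of any released package.

```r
## Minimal sketch of the rMST density (1); dmst() is an illustrative name.
## Requires the mvtnorm package for the p-variate t density.
library(mvtnorm)

dmst <- function(y, mu, Sigma, lambda, nu) {
  Omega <- Sigma + tcrossprod(lambda)               # Omega = Sigma + lambda lambda'
  p     <- length(mu)
  delta <- c(t(y - mu) %*% solve(Omega, y - mu))    # delta = (y-mu)' Omega^{-1} (y-mu)
  M     <- c(crossprod(lambda, solve(Omega, y - mu))) /
           sqrt(1 - c(crossprod(lambda, solve(Omega, lambda))))
  2 * dmvt(y, delta = mu, sigma = Omega, df = nu, log = FALSE) *
    pt(M * sqrt((nu + p) / (nu + delta)), df = nu + p)
}

## Example: evaluate a bivariate rMST density at the origin
dmst(c(0, 0), mu = c(0, 0), Sigma = diag(2), lambda = c(2, 1), nu = 4)
```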

Alternatively, the rMST distribution can be hierarchically represented as

$$\begin{aligned} {\varvec{Y}}\mid (\gamma ,\tau )\sim & {} N_p({\varvec{\mu }}+{\varvec{\lambda }}\gamma ,{\varvec{\varSigma }}/\tau ),\nonumber \\ \gamma \mid \tau\sim & {} TN(0,1/\tau ;(0,\infty )),\nonumber \\ \tau\sim & {} \mathrm{Gamma}(\nu /2,\nu /2), \end{aligned}$$
(2)

where Gamma(\(\alpha ,\beta \)) stands for the gamma distribution with mean \(\alpha /\beta \). Figure 1 shows the perspective plots with added contours for rMST densities under \({\varvec{\mu }}=(0,0)^\top \), \({\varvec{\varSigma }}={\varvec{I}}_2\), \(\nu =4\) and various specifications of \({\varvec{\lambda }}=(\lambda _1,\lambda _2)^{\top }\). It is clearly seen that these plots are non-elliptical and can be skewed and correlated toward different directions depending on the chosen parameters. Therefore, the rMST distribution provides a flexible mechanism to adapt well to more complicated data.
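The three-level hierarchy in (2) also suggests a simple way of simulating rMST data, sketched below in R; the function name rmst_sample is illustrative, and each row of the output is one draw of \({\varvec{Y}}\).

```r
## Sketch of sampling from the rMST distribution through the hierarchy (2).
rmst_sample <- function(n, mu, Sigma, lambda, nu) {
  p   <- length(mu)
  tau <- rgamma(n, shape = nu / 2, rate = nu / 2)   # tau ~ Gamma(nu/2, nu/2)
  gam <- abs(rnorm(n)) / sqrt(tau)                  # gamma | tau ~ TN(0, 1/tau; (0, Inf))
  err <- matrix(rnorm(n * p), n, p) %*% chol(Sigma) / sqrt(tau)  # rows ~ N_p(0, Sigma/tau)
  sweep(err, 2, mu, "+") + tcrossprod(gam, lambda)  # Y = mu + lambda * gamma + error
}

set.seed(1)
y <- rmst_sample(500, mu = c(0, 0), Sigma = diag(2), lambda = c(4, 2), nu = 4)
plot(y, xlab = expression(y[1]), ylab = expression(y[2]))  # skewed, heavy-tailed cloud
```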

Fig. 1 The contours of bivariate rMST distribution with \({\varvec{\mu }}=(0,0)^\top \), \({\varvec{\varSigma }}={\varvec{I}}_2\) and \(\nu =4\) for different values of \(\lambda _1\) and \(\lambda _2\)

3 Methodology

3.1 Model formulation

Suppose that \({\varvec{Y}}=({\varvec{Y}}_1,\ldots ,{\varvec{Y}}_n)\) forms a random sample of size n in which each \({\varvec{Y}}_j=(Y_{j1},\ldots ,Y_{jp})^{\top }\) is a p-dimensional vector of feature variables. Suppose further that these samples come independently from g distinct subgroups in a heterogeneous population. The MCrstFA model for each \({\varvec{Y}}_j\) is

$$\begin{aligned} {\varvec{Y}}_j={\varvec{A}}{\varvec{U}}_{ij}+{\varvec{e}}_{ij}~\quad \mathrm{with~probability}~ \pi _i\quad (i=1,\ldots ,g), \end{aligned}$$
(3)

for \(j=1,\ldots ,n\), where \({\varvec{A}}\) is a \(p\times q\) matrix of common factor loadings, \({\varvec{U}}_{ij}\) is a q-dimensional (\(q < p\)) vector of component factors, \({\varvec{e}}_{ij}\) is a p-dimensional vector of component errors, and \(\pi _i\)s are the mixing proportions subject to \(\sum _{i=1}^g \pi _i=1\).

Furthermore, we assume that \({\varvec{U}}_{ij}\) and \({\varvec{e}}_{ij}\) are jointly distributed as

$$\begin{aligned} \left[ \begin{array}{c} {\varvec{U}}_{ij}\\ {\varvec{e}}_{ij} \end{array} \right] \sim rST_{p+q} \left( \left[ \begin{array}{c} {\varvec{\xi }}_i\\ \mathbf 0 \end{array} \right] , \left[ \begin{array}{cc} {\varvec{\varOmega }}_i &{}\quad \mathbf 0\\ \mathbf 0 &{}\quad {\varvec{D}}_i \end{array} \right] , \left[ \begin{array}{c} {\varvec{\lambda }}_i\\ \mathbf 0 \end{array} \right] , \nu _i \right) , \end{aligned}$$
(4)

where \({\varvec{\xi }}_i\) is a q-dimensional location vector, \({\varvec{\varOmega }}_i\) is a \(q\times q\) positive-definite scale covariance matrix, \({\varvec{\lambda }}_i \in {\mathbb {R}}^q\) is a skewness vector, \({\varvec{D}}_i\) is a \(p\times p\) positive diagonal matrix, and \(\nu _i\) is the DOF. The specifications of \({\varvec{D}}_i\) and \(\nu _i\) in (4) can be either constrained to be equal or allowed to vary among components.

Based on (3) along with assumption (4), the pdf of \({\varvec{Y}}_j\) is

$$\begin{aligned} f({\varvec{y}}_j)=\sum _{i=1}^g \pi _i \psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i), \end{aligned}$$
(5)

where

$$\begin{aligned} {\varvec{\mu }}_i={\varvec{A}}{\varvec{\xi }}_i,~~{\varvec{\varSigma }}_i={\varvec{A}}{\varvec{\varOmega }}_i{\varvec{A}}^{\top }+{\varvec{D}}_i,~~{\varvec{\alpha }}_i={\varvec{A}}{\varvec{\lambda }}_i, \end{aligned}$$
(6)

and \(\psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i)\) is the rMST density function defined in (1). Notice that the representations in (6) cannot be uniquely determined because they remain unchanged if the common factor loading matrix \({\varvec{A}}\) is postmultiplied by any nonsingular matrix. Thus, we must impose \(q^2\) constraints to achieve identifiability of \({\varvec{A}}\). As a result, the number of free parameters in the MCrstFA is

$$\begin{aligned} d_1=(g-1)+pg+q(p+g)+\frac{1}{2}gq(q+1)-q^2+gq+g. \end{aligned}$$

If \({\varvec{D}}_i\)s are constrained to be homogeneous across components, the number of parameters is

$$\begin{aligned} d_2=(g-1)+p+q(p+g)+\frac{1}{2}gq(q+1)-q^2+gq+g; \end{aligned}$$

and if component DOFs are further assumed to be identical, the resulting number of parameters is

$$\begin{aligned} d_3=(g-1)+p+q(p+g)+\frac{1}{2}gq(q+1)-q^2+gq+1. \end{aligned}$$

We remark that, compared with the MCFA and MCtFA, the number of parameters in the MCrstFA increases mainly by the qg skewness parameters involved in the \({\varvec{\lambda }}_i\), so the extra flexibility is gained without adding much complexity.
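As a quick check on the counts \(d_1\), \(d_2\) and \(d_3\) above, the following small R helper evaluates them; the function name and arguments are illustrative.

```r
## Illustrative helper for the parameter counts d1, d2, d3 of the MCrstFA.
n_free_params <- function(p, q, g, common_D = FALSE, common_nu = FALSE) {
  (g - 1) +                              # mixing proportions
    ifelse(common_D, p, p * g) +         # diagonal D (common or component-specific)
    q * (p + g) +                        # loadings A and locations xi_i
    g * q * (q + 1) / 2 - q^2 +          # Omega_i minus identifiability constraints
    g * q +                              # skewness vectors lambda_i
    ifelse(common_nu, 1, g)              # degrees of freedom
}
n_free_params(p = 10, q = 2, g = 3, common_D = TRUE)  # d2 for the setting of Sect. 5.2
```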

To indicate the class membership of observation \({\varvec{y}}_j\), we introduce allocation variables \({\varvec{Z}}_j=(Z_{1j},\ldots ,Z_{gj})^{\top }\), defined as

$$\begin{aligned} Z_{ij}= \left\{ \begin{array}{l} 1,~~{\varvec{Y}}_j~\mathrm{belongs~to}~i\mathrm{th~component};\\ 0,~~\mathrm{otherwise}. \end{array} \right. \end{aligned}$$

Thus, we have \({\varvec{Z}}_j{\mathop {\sim }\limits ^\mathrm{iid}}{{\mathscr {M}}}(1;\pi _1,\ldots ,\pi _g)\), a multinomial distribution consisting of a single draw on g categories, where \(\pi _i=\Pr (Z_{ij}=1)\) can be regarded as the prior probability that \({\varvec{y}}_j\) belongs to the ith component.

According to (2) and (3), the MCrstFA model can be formulated by a five-level hierarchical representation:

$$\begin{aligned} {\varvec{Y}}_j\mid ({\varvec{U}}_{ij},\,\gamma _j,\tau _j,\,Z_{ij}=1)\sim & {} N_p({\varvec{A}}{\varvec{U}}_{ij},\tau _j^{-1} {\varvec{D}}_i), \nonumber \\ {\varvec{U}}_{ij}\mid (\gamma _j,\tau _j,Z_{ij}=1)\sim & {} N_q({\varvec{\xi }}_i+{\varvec{\lambda }}_i\gamma _j,\tau _j^{-1}{\varvec{\varOmega }}_i),\nonumber \\ \gamma _j\mid (\tau _j,Z_{ij}=1)\sim & {} TN(0,\,\tau _j^{-1};(0,\infty )),\nonumber \\ \tau _j\mid (Z_{ij}=1)\sim & {} \mathrm{Gamma}\left( \frac{\nu _i}{2},\frac{\nu _i}{2}\right) , \nonumber \\ {\varvec{Z}}_j\sim & {} {{\mathscr {M}}}(1;\pi _1,\ldots ,\pi _g). \end{aligned}$$
(7)

By Bayes’ rule, it suffices to derive the following conditional distributions, the proofs of which are sketched in “Appendix A”. Specifically,

$$\begin{aligned} {\varvec{U}}_{ij}\mid ({\varvec{y}}_j,\gamma _j,\tau _j,Z_{ij}=1)\sim & {} N_q\left( {\varvec{\xi }}_i+{\varvec{\lambda }}_i\gamma _j+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i-{\varvec{\alpha }}_i\gamma _j),\tau _j^{-1}({\varvec{I}}_q\right. \nonumber \\&\left. -{\varvec{\beta }}_i^{\top }{\varvec{A}}){\varvec{\varOmega }}_i\right) ,\nonumber \\ \gamma _j\mid ({\varvec{y}}_j,\tau _j,Z_{ij}=1)\sim & {} TN(h_{ij},\tau _j^{-1}\sigma _i^2;(0,\infty )),\nonumber \\ f(\tau _j\mid {\varvec{y}}_j,\,Z_{ij}=1)= & {} \frac{{\varPhi }\big (\sqrt{\tau _j}M_{ij}\big )}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }g\left( \tau _j;\frac{\nu _i+p}{2},\frac{\nu _i+\delta _{ij}}{2}\right) ,\nonumber \\ {\varvec{Z}}_j\mid {\varvec{y}}_j\sim & {} {{\mathscr {M}}}(1;\,{\tilde{\pi }}_{1j},\,\ldots \,,{\tilde{\pi }}_{gj}), \end{aligned}$$
(8)

where \({\varvec{\beta }}_i={\varvec{\varSigma }}_i^{-1}{\varvec{A}}{\varvec{\varOmega }}_i\), \(\delta _{ij}=({\varvec{y}}_j-{\varvec{\mu }}_i)^{\top }{\varvec{V}}_i^{-1}({\varvec{y}}_j-{\varvec{\mu }}_i)\), and \(M_{ij}=h_{ij}/\sigma _i\) with \({\varvec{V}}_i={\varvec{\varSigma }}_i+{\varvec{\alpha }}_i{\varvec{\alpha }}_i^{\top }\), \(h_{ij}={\varvec{\alpha }}_i^{\top }{\varvec{V}}_i^{-1}({\varvec{y}}_j-{\varvec{\mu }}_i)\) and \(\sigma _i^2=1-{\varvec{\alpha }}_i^{\top }{\varvec{V}}_i^{-1}{\varvec{\alpha }}_i\). Moreover,

$$\begin{aligned} {\tilde{\pi }}_{ij}=P(Z_{ij}=1|{\varvec{y}}_j)=\frac{\pi _i\psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i)}{\sum _{h=1}^g \pi _h \psi _p({\varvec{y}}_j;{\varvec{\mu }}_h,{\varvec{\varSigma }}_h,{\varvec{\alpha }}_h,\nu _h)}. \end{aligned}$$
(9)

To simplify the notation, we define \({\varvec{b}}_{ij}={\varvec{\xi }}_i+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i)\) and \(c_{ij}(r)=\{(\nu _i+p+r)/(\nu _i+\delta _{ij})\}^{1/2}\) for \(r=-2,0,2\), and let “\(|\cdots \)” represent conditioning on \({\varvec{Y}}_j={\varvec{y}}_j\) and \(Z_{ij}=1\). The following proposition summarizes some essential conditional expectations for implementing the ECME algorithm described in the next subsection.

Proposition 1

Considering the posterior distributions given in (8), we establish the following conditional expectations:

$$\begin{aligned} E(\tau _j\mid \cdots )= & {} \{c_{ij}(0)\}^2\frac{T(M_{ij}c_{ij}(2);\nu _i+p+2)}{T(M_{ij}c_{ij}(0);\nu _i+p)},\nonumber \\ E(\gamma _j\mid \cdots )= & {} h_{ij}+\frac{\sigma _it(M_{ij}c_{ij}(-2);\nu _i+p-2)}{c_{ij}(-2)T(M_{ij}c_{ij}(0);\nu _i+p)},\nonumber \\ E(\tau _j\gamma _j\mid \cdots )= & {} h_{ij}E(\tau _j\mid \cdots )+\sigma _i c_{ij}(0)\frac{t(M_{ij}c_{ij}(0);\nu _i+p)}{T(M_{ij}c_{ij}(0);\nu _i+p)},\nonumber \\ E(\tau _j\gamma _j^2\mid \cdots )= & {} \sigma _i^2+h_{ij}E(\tau _j\gamma _j\mid \cdots ),\nonumber \\ E({\varvec{U}}_{ij}\mid \cdots )= & {} {\varvec{b}}_{ij}+\left( {\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i\right) E(\gamma _j\mid \cdots ),\nonumber \\ E(\tau _j{\varvec{U}}_{ij}\mid \cdots )= & {} {\varvec{b}}_{ij}E(\tau _j\mid \cdots )+\left( {\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i\right) E(\tau _j\gamma _j\mid \cdots ),\nonumber \\ E(\tau _j\gamma _j{\varvec{U}}_{ij}\mid \cdots )= & {} {\varvec{b}}_{ij} E(\tau _j\gamma _j\mid \cdots )+\left( {\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i\right) E(\tau _j\gamma _j^2\mid \cdots ),\nonumber \\ E(\tau _j{\varvec{U}}_{ij}{\varvec{U}}_{ij}^\top \mid \cdots )= & {} \left( {\varvec{I}}_q-{\varvec{\beta }}_i^{\top }{\varvec{A}}\right) {\varvec{\varOmega }}_i+E(\tau _j\gamma _j{\varvec{U}}_{ij}\mid \cdots )({\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i)^{\top }\nonumber \\&+\,E(\tau _j{\varvec{U}}_{ij}\mid \cdots ){\varvec{b}}_{ij}^{\top }, \end{aligned}$$
(10)

and

$$\begin{aligned} E(\log \tau _j\mid \cdots )= & {} \frac{ \int _{-\infty }^{M_{ij}}t\left( x;0,\frac{\nu _i+\delta _{ij}}{\nu _i+p},\nu _i+p\right) f_{\nu _i}(x)dx}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }+E(\tau _j\mid \cdots )\nonumber \\&-\,\left( \frac{\nu _i+p}{\nu _i+\delta _{ij}}\right) +\mathrm{DG}\left( \frac{\nu _i+p}{2}\right) -\log \left( \frac{\nu _i+\delta _{ij}}{2}\right) , \end{aligned}$$
(11)

where DG\((\cdot )\) denotes the digamma function and \(f_{\nu _i}(x)\) is defined by (B.10).

Proof

The results follow directly from some fundamental matrix manipulations and the law of iterated expectations. See “Appendix B” for more details. \(\square \)
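For concreteness, the first two expectations in (10) can be coded directly from the univariate t density and cdf, as in the hedged R sketch below; h, sigma, delta and M denote the component-j quantities \(h_{ij}\), \(\sigma _i\), \(\delta _{ij}\) and \(M_{ij}\) defined below (8).

```r
## Sketch of E(tau_j | ...) and E(gamma_j | ...) from Proposition 1.
c_r <- function(r, nu, p, delta) sqrt((nu + p + r) / (nu + delta))   # c_ij(r)

E_tau <- function(M, nu, p, delta) {
  c_r(0, nu, p, delta)^2 *
    pt(M * c_r(2, nu, p, delta), df = nu + p + 2) /
    pt(M * c_r(0, nu, p, delta), df = nu + p)
}

E_gamma <- function(h, sigma, M, nu, p, delta) {
  h + sigma * dt(M * c_r(-2, nu, p, delta), df = nu + p - 2) /
    (c_r(-2, nu, p, delta) * pt(M * c_r(0, nu, p, delta), df = nu + p))
}
```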

3.2 Parameter estimation via the ECME algorithm

The EM algorithm (Dempster et al. 1977) is a popular iterative method for finding ML estimates when the data are incomplete or the model contains latent variables. The main advantage of EM lies in its monotone convergence without sacrificing simplicity. One common limitation of the EM algorithm is that the M-step may yield no closed-form estimators of the parameters. To overcome this weakness, Meng and Rubin (1993) proposed the expectation conditional maximization (ECM) algorithm, which replaces the M-step of EM with several computationally simpler CM-steps, each of which maximizes the expected complete-data log-likelihood function (known as the Q-function) sequentially. Importantly, the authors also showed that the ECM algorithm preserves all the desirable properties of EM. In certain situations, some of the CM-steps of ECM may still be computationally intractable. Liu and Rubin (1994) therefore extended the ECM algorithm by allowing CM-steps that maximize either the Q-function, called CMQ-steps, or the corresponding constrained actual log-likelihood function, called CML-steps. The resulting method is referred to as the ECME algorithm.

For notational simplicity, we denote the observed data by \({\varvec{y}}=({\varvec{y}}_1,\ldots ,{\varvec{y}}_n)\), the allocation indicators by \({\varvec{Z}}=({\varvec{z}}_1,\ldots ,{\varvec{z}}_n)\), the latent factors by \({\varvec{U}}=({\varvec{U}}_1,\ldots ,{\varvec{U}}_n)\), the hidden variables by \({\varvec{\gamma }}=(\gamma _1,\ldots ,\gamma _n)\) and the scaling weight variables by \({\varvec{\tau }}=(\tau _1,\ldots ,\tau _n)\). Therefore, the complete data \({\varvec{y}}_c\) comprise the observed data \({\varvec{y}}\) together with the missing data \({\varvec{y}}_m=({\varvec{Z}},{\varvec{U}},{\varvec{\gamma }},{\varvec{\tau }})\). From (5), it is readily seen that

$$\begin{aligned} {\varvec{Y}}_j\mid (Z_{ij}=1) \sim rST_p({\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i). \end{aligned}$$

Therefore, the joint pdf of \(({\varvec{Y}},{\varvec{Z}})\) is

$$\begin{aligned} f({\varvec{y}},{\varvec{z}})=\prod _{j=1}^n\prod _{i=1}^g \{\pi _i\psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i)\}^{z_{ij}}. \end{aligned}$$
(12)

Let \({\varvec{\theta }}_i=(\pi _i,{\varvec{\xi }}_i,{\varvec{\varOmega }}_i,{\varvec{D}}_i,{\varvec{\lambda }}_i,\nu _i)\) be the parameter vector belonging to the i-th component, and \({\varvec{\varTheta }}=\{{\varvec{A}},{\varvec{\theta }}_1,\ldots ,{\varvec{\theta }}_g\}\) the entire set of unknown parameters to be estimated. According to (7), the complete-data log-likelihood function is

$$\begin{aligned} \ell _c({\varvec{\varTheta }}\mid {\varvec{y}}_c)= & {} \sum _{i=1}^g\sum _{j=1}^n z_{ij}\bigg \{\log \pi _i-\frac{1}{2}\log |{\varvec{D}}_i| -\frac{\tau _j}{2}({\varvec{y}}_j-{\varvec{A}}{\varvec{U}}_{ij})^{\top }{\varvec{D}}_i^{-1}({\varvec{y}}_j-{\varvec{A}}{\varvec{U}}_{ij})\nonumber \\&-\,\frac{1}{2}\log |{\varvec{\varOmega }}_i|-\frac{\tau _j}{2}({\varvec{U}}_{ij}-{\varvec{\xi }}_i-{\varvec{\lambda }}_i\gamma _j)^{\top }{\varvec{\varOmega }}_i^{-1}({\varvec{U}}_{ij}-{\varvec{\xi }}_i-{\varvec{\lambda }}_i\gamma _j)\nonumber \\&-\,\log {\varGamma }\left( \frac{\nu _i}{2}\right) +\frac{\nu _i}{2}\log \left( \frac{\nu _i}{2}\right) +\frac{\nu _i}{2}\log \tau _j-\frac{\nu _i}{2}\tau _j\bigg \}. \end{aligned}$$

To evaluate the Q-function, defined as \(Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})=E\big [\ell _c({\varvec{\varTheta }}\mid {\varvec{y}}_c)\mid {\varvec{y}},\hat{{\varvec{\varTheta }}}^{(k)}\big ]\), we first define the following conditional expectations:

$$\begin{aligned} {\hat{z}}_{ij}^{(k)}= & {} P\left( Z_{ij}=1\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)}\right) ,\quad {\hat{\tau }}_{ij}^{(k)}=E\left( \tau _j\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\nonumber \\ {{\hat{\kappa }}_{ij}}^{(k)}= & {} E\left( \log \tau _j\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\quad {{\hat{s}}_{1ij}}^{(k)}=E\left( \tau _j\gamma _j\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\nonumber \\ {{\hat{s}}_{2ij}}^{(k)}= & {} E\left( \tau _j\gamma _j^2\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\quad \hat{{\varvec{\eta }}}_{ij}^{(k)}=E\left( \tau _j{\varvec{U}}_{ij}\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\nonumber \nonumber \\ {\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}= & {} E\left( \tau _j{\varvec{U}}_{ij}{\varvec{U}}_{ij}^{\top }\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,~ {\hat{{\varvec{\zeta }}}_{ij}}^{(k)}=E\left( \tau _j\gamma _j{\varvec{U}}_{ij}\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) \end{aligned}$$

for \(i=1,\ldots ,g\) and \(j=1,\ldots ,n\), which can be evaluated using (9), (10) and (11).

To update the mixture parameters \({\varvec{\varTheta }}\), the ECME algorithm proceeds as follows:

  1. E-step:

    Given \({\varvec{\varTheta }}=\hat{{\varvec{\varTheta }}}^{(k)}\), calculate the Q-function, obtained as

    $$\begin{aligned} Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})= & {} \sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)} \bigg \{\log \pi _i-\frac{1}{2}\log |{\varvec{D}}_i|-\frac{1}{2}\log |{\varvec{\varOmega }}_i|-\log {\varGamma }\left( \frac{\nu _i}{2}\right) \nonumber \\&+\,\frac{\nu _i}{2}\log \left( \frac{\nu _i}{2}\right) +\frac{\nu _i}{2}({{\hat{\kappa }}_{ij}}^{(k)}-{{\hat{\tau }}_{ij}}^{(k)})-\frac{1}{2}\mathrm{tr}\big ({\varvec{D}}_i^{-1}{\varvec{\varUpsilon }}_{ij}+{\varvec{\varOmega }}_i^{-1}{\varvec{\varLambda }}_{ij}\big )\bigg \},\nonumber \\ \end{aligned}$$
    (13)

    where

    $$\begin{aligned} {\varvec{\varUpsilon }}_{ij}={\varvec{\varUpsilon }}_{ij}({\varvec{A}})={\hat{\tau }}_{ij}^{(k)}{\varvec{y}}_j{\varvec{y}}_j^{\top }-{\varvec{y}}_j\hat{{\varvec{\eta }}}_{ij}^{(k)\top }{\varvec{A}}^{\top }-{\varvec{A}}{\hat{{\varvec{\eta }}}_{ij}}^{(k)}{\varvec{y}}_j^{\top }+{\varvec{A}}{\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}{\varvec{A}}^{\top } \end{aligned}$$
    (14)

    and

    $$\begin{aligned} {\varvec{\varLambda }}_{ij}={\varvec{\varLambda }}_{ij}({\varvec{\xi }},{\varvec{\lambda }})= & {} {\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}-{\hat{{\varvec{\eta }}}_{ij}}^{(k)}{\varvec{\xi }}_i^{\top }-{\hat{{\varvec{\zeta }}}_{ij}}^{(k)}{\varvec{\lambda }}_i^{\top }-{\varvec{\xi }}_i\left( {\hat{{\varvec{\eta }}}_{ij}}^{{(k)}^{\top }}-{\hat{\tau }}_{ij}^{(k)}{\varvec{\xi }}_i^{\top }-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\lambda }}_i^{\top }\right) \nonumber \\&-\,{\varvec{\lambda }}_i\left( {\hat{{\varvec{\zeta }}}_{ij}}^{(k)^{\top }}-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\xi }}_i^{\top }-{{\hat{s}}_{2ij}}^{(k)}{\varvec{\lambda }}_i^{\top }\right) . \end{aligned}$$
    (15)
  2. CM-steps:

    Maximizing (13) with respect to \(\pi _i\), \({\varvec{\xi }}_i\), \({\varvec{\lambda }}_i\), \({\varvec{A}}\), \({\varvec{\varOmega }}_i\) and \({\varvec{D}}_i\), we obtain

    $$\begin{aligned} {\hat{\pi }}_i^{\left( k+1\right) }= & {} \frac{1}{n}\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) },\\ \hat{{\varvec{\xi }}}_i^{\left( k+1\right) }= & {} \frac{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\eta }}}_{ij}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\zeta }}}_{ij}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) }{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) ^2},\\ \hat{{\varvec{\lambda }}}_i^{\left( k+1\right) }= & {} \frac{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\zeta }}}_{ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\eta }}}_{ij}}^{\left( k\right) }\right) }{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) ^2},\\ \hat{{\varvec{A}}}^{\left( k+1\right) }= & {} \left( \sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }{\varvec{y}}_j{\hat{{\varvec{\eta }}}_{ij}}^{\left( k\right) {\top }}\right) \left( \sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\varPsi }}}_{ij}}^{\left( k\right) }\right) ^{-1},\\ {{\hat{{\varvec{\varOmega }}}_i}}^{\left( k+1\right) }= & {} \frac{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\varLambda }}}_{ij}^{\left( k+1\right) }}{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }}~~\text{ and }~~ \hat{{\varvec{D}}}_i^{\left( k+1\right) } =\frac{\mathrm{Diag}\{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\varUpsilon }}}_{ij}^{\left( k+1\right) }\}}{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }}, \end{aligned}$$

    where \(\hat{{\varvec{\varUpsilon }}}_{ij}^{(k+1)}\) and \(\hat{{\varvec{\varLambda }}}_{ij}^{(k+1)}\) are \({\varvec{\varUpsilon }}_{ij}\) and \({\varvec{\varLambda }}_{ij}\) in (14) and (15) with \({\varvec{\xi }}_i\), \({\varvec{\lambda }}_i\) and \({\varvec{A}}\) replaced by \({\hat{{\varvec{\xi }}}_i}^{(k+1)}\), \({\hat{{\varvec{\lambda }}}_i}^{(k+1)}\) and \(\hat{{\varvec{A}}}^{(k+1)}\), respectively. Moreover, when \({\varvec{D}}_i\)s are assumed to be the same, say \({\varvec{D}}_i={\varvec{D}}\) for all i, the updated estimator of \({\varvec{D}}\) is given by \(\hat{{\varvec{D}}}^{(k+1)}=n^{-1}\mathrm{Diag}\{\sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{(k)}\hat{{\varvec{\varUpsilon }}}_{ij}^{(k+1)}\}.\) The proof of the updated estimators is sketched in “Appendix C”.

  3. CML-step:

    In light of (12), the updated estimate of \(\nu _i\) is obtained by maximizing the corresponding constrained actual log-likelihood function, that is,

    $$\begin{aligned} {{\hat{\nu }}}_i^{(k+1)}=\arg \max _{\nu _i}\bigg \{\sum _{j=1}^n{\hat{z}}^{(k+1)}_{ij}\log \Big (\psi _p({\varvec{y}}_j;\hat{{\varvec{\mu }}}_i^{(k+1)},\hat{{\varvec{\varSigma }}}_i^{(k+1)},\hat{{\varvec{\alpha }}}_i^{(k+1)},\nu _i)\Big )\bigg \},\nonumber \\ \end{aligned}$$
    (16)

    for \(i=1,\ldots ,g\), where \(\hat{{\varvec{\mu }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\xi }}}_i^{(k+1)}\), \(\hat{{\varvec{\varSigma }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\varOmega }}}_i^{(k+1)}\)\(\hat{{\varvec{A}}}^{(k+1)\top }+\hat{{\varvec{D}}}_i^{(k+1)}\) and \(\hat{{\varvec{\alpha }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\lambda }}}_i^{(k+1)}\).

In the case of assuming common DOFs, say \(\nu _i=\nu \) for all i, the updated estimator of \(\nu \) is obtained by maximizing the constrained actual log-likelihood function, that is,

$$\begin{aligned} {\hat{\nu }}^{(k+1)}=\arg \max _{\nu }\bigg \{\sum _{j=1}^n\log \Big (\sum _{i=1}^g\hat{\pi }^{(k+1)}_i \psi _p({\varvec{y}}_j;\hat{{\varvec{\mu }}}_i^{(k+1)},\hat{{\varvec{\varSigma }}}_i^{(k+1)},\hat{{\varvec{\alpha }}}_i^{(k+1)},\nu )\Big )\bigg \}.\nonumber \\ \end{aligned}$$
(17)

Herein, we remark that the solutions of (16) and (17) are found by carrying out a one-dimensional search with the built-in R function optim over the box constraint (2, 200). Given an initial guess of the parameters \(\hat{{\varvec{\varTheta }}}^{(0)}\), the above ECME procedure is performed iteratively until the log-likelihood ceases to increase appreciably. The resulting ML estimates are denoted by \(\hat{{\varvec{\varTheta }}}=(\hat{{\varvec{A}}},\hat{\pi }_i,\hat{{\varvec{\xi }}}_i,\hat{{\varvec{\varOmega }}}_i,\hat{{\varvec{D}}}_i,\hat{{\varvec{\lambda }}}_i,\hat{{\varvec{\nu }}}_i,i=1,\ldots ,g)\). As a result, the posterior probability of \({\varvec{y}}_j\) belonging to the i-th component of the mixture is calculated by replacing \({\varvec{\varTheta }}\) in (9) with \(\hat{{\varvec{\varTheta }}}\), denoted by \({\hat{z}}_{ij}=P(Z_{ij}=1\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}})\). Based on the maximum a posteriori (MAP) classification rule, \({\varvec{y}}_j\) is assigned to group s if \(\max \{{\hat{z}}_{ij}\}_{i=1}^g\) occurs at \(i=s\).
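A hedged sketch of the one-dimensional search in (16) is given below; dmst() is the rMST density sketched in Sect. 2, y is the n-by-p data matrix, and z_hat_i, mu_i, Sigma_i and alpha_i are assumed to hold the current ECME quantities for the i-th component.

```r
## Sketch of the CML-step (16): update nu_i by a Brent search over (2, 200).
update_nu_i <- function(y, z_hat_i, mu_i, Sigma_i, alpha_i) {
  neg_obj <- function(nu) {
    dens <- apply(y, 1, dmst, mu = mu_i, Sigma = Sigma_i, lambda = alpha_i, nu = nu)
    -sum(z_hat_i * log(dens))
  }
  optim(par = 10, fn = neg_obj, method = "Brent", lower = 2, upper = 200)$par
}
```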

Consequently, the conditional expectation of the factor scores \({\varvec{U}}_{ij}\), given \({\varvec{y}}_{j}\) and membership of the i-th component of the mixture (i.e., \(Z_{ij}=1\)), can be estimated by \(\hat{{\varvec{u}}}_{ij}=E({\varvec{U}}_{ij}\mid {\varvec{Y}}_j={\varvec{y}}_j,Z_{ij}=1,\hat{{\varvec{\varTheta }}})\), which is given in (10) with \({\varvec{\varTheta }}\) substituted by \(\hat{{\varvec{\varTheta }}}\). Then the j-th estimated factor score corresponding to \({\varvec{y}}_j\) can be calculated as

$$\begin{aligned} \hat{{\varvec{u}}}_j=\sum _{i=1}^g {\hat{z}}_{ij}\hat{{\varvec{u}}}_{ij}, \quad j=1,\ldots ,n. \end{aligned}$$
(18)

An alternative estimator of (18) is given by

$$\begin{aligned} \hat{{\varvec{u}}}_j=\sum _{i=1}^g \text{ MAP }\{{\hat{z}}_{ij}\}\hat{{\varvec{u}}}_{ij}, \end{aligned}$$
(19)

where \(\text{ MAP }\{{\hat{z}}_{ij}\}=1\) if \(\max \{{\hat{z}}_{hj}\}_{h=1}^g\) occurs at \(h=i\), and \(\text{ MAP }\{{\hat{z}}_{ij}\}=0\) otherwise. These estimated factor scores can be used to portray the observed data in a lower-dimensional space (Baek et al. 2010; Baek and McLachlan 2011) and can be applied to feature extraction (Ueda et al. 2000).
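In R, the two estimators (18) and (19) reduce to a few lines once the n-by-g posterior matrix z_hat and an n-by-g-by-q array u_hat of component-wise scores \(\hat{{\varvec{u}}}_{ij}\) are available; both objects are assumed here purely for illustration.

```r
## Sketch of the posterior-mean scores (18) and the MAP-based scores (19).
n <- nrow(z_hat); q <- dim(u_hat)[3]
scores_soft <- sapply(1:q, function(l) rowSums(z_hat * u_hat[, , l]))  # eq. (18): n x q
map_label   <- max.col(z_hat)                                          # MAP component of each y_j
scores_hard <- t(sapply(1:n, function(j) u_hat[j, map_label[j], ]))    # eq. (19): n x q
```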

4 Practical issues from computational aspects

4.1 Initialization and stopping rules

Like other iterative procedures, the ECME algorithm may suffer from convergence difficulties such as singularity of the component covariance matrices or convergence to a spurious local maximum. To alleviate such problems, one simple strategy is to try many different initial values and select the solution that yields the highest likelihood. Different sets of initial values can be obtained by performing multiple runs of K-means clustering (Hartigan and Wong 1979) or by using random starts (McLachlan and Peel 2000), in which each sample point is randomly assigned to one of the g clusters. We recommend below a simple way of generating sensible initial values.

  1. Given initial memberships obtained by a single run of clustering through K-means, we set \(\hat{{\varvec{Z}}}_j^{(0)}=({\hat{z}}_{1j}^{(0)},\ldots ,{\hat{z}}_{gj}^{(0)})\). The initial values of the \(\pi _i\)s are

    $$\begin{aligned} {\hat{\pi }}_i^{(0)}=\frac{1}{n}\sum _{j=1}^n{\hat{z}}_{ij}^{(0)},\quad i=1,\ldots ,g. \end{aligned}$$
  2. Let \({\varvec{y}}_{(i)}\) be the collection of observations in the i-th partitioned group. We then compute factor scores for \({\varvec{y}}_{(i)}\) using the built-in R function factanal. The initial estimates \(\hat{{\varvec{\xi }}}_i^{(0)}\), \(\hat{{\varvec{\varOmega }}}_i^{(0)}\), \(\hat{{\varvec{\lambda }}}_i^{(0)}\) and \(\hat{\nu }_i^{(0)}\), for \(i=1,\ldots ,g\), are obtained by using the R package EMMIXskew (Wang et al. 2009) to fit the rMST distribution to the estimated factor scores.

  3. Perform principal component analysis (PCA) to obtain the factor loading matrix for \({\varvec{y}}_{(i)}\), denoted by \(\hat{{\varvec{B}}}^{(0)}_i\) for \(i=1,\ldots ,g\). The initial estimate of \({\varvec{A}}\) is specified as

    $$\begin{aligned} \hat{{\varvec{A}}}^{(0)}=\sum _{i=1}^g{\hat{\pi }}^{(0)}_i\hat{{\varvec{B}}}^{(0)}_i\hat{{\varvec{\varOmega }}}_i^{{(0)}^{-1/2}}. \end{aligned}$$
  4. The initial estimate of \({\varvec{D}}_i\) is a diagonal matrix formed from the diagonal elements of the sample covariance matrix of \({\varvec{y}}_{(i)}\). For the restricted case \({\varvec{D}}_i={\varvec{D}}\), the initial estimate \(\hat{{\varvec{D}}}^{(0)}\) is formed from the diagonal elements of the pooled within-cluster sample covariance matrix of \({\varvec{y}}_{(1)},\ldots ,{\varvec{y}}_{(g)}\).

Since the ECME algorithm is an iterative method, stopping rules must be specified. In our experimental studies, we adopt by default the traditional criterion that terminates the algorithm when a predefined maximum number of iterations \(k_\mathrm{max}=2\times 10^4\) is reached or when the difference between two successive log-likelihood values is less than \(10^{-6}\). Alternatively, one can use the Aitken acceleration-based stopping criterion (Aitken 1926; McLachlan and Krishnan 2008), which is at least as strict as lack of progress in the likelihood in the neighborhood of a maximum (McNicholas et al. 2010).
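One common form of the Aitken acceleration-based rule is sketched below, assuming that three successive log-likelihood values are stored in ll; convergence is declared when the asymptotic log-likelihood estimate is within tol of the current value.

```r
## Sketch of an Aitken acceleration-based stopping check.
aitken_converged <- function(ll, tol = 1e-6) {   # ll = c(l^(k-1), l^(k), l^(k+1))
  a     <- (ll[3] - ll[2]) / (ll[2] - ll[1])     # Aitken acceleration factor a^(k)
  l_inf <- ll[2] + (ll[3] - ll[2]) / (1 - a)     # asymptotic log-likelihood estimate
  is.finite(l_inf) && abs(l_inf - ll[3]) < tol
}
```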

4.2 Model selection and performance evaluation

The log-likelihood value cannot be adopted as a model selection criterion because it is a nondecreasing function of the number of components (g) and the dimension of factors (q). We use the Bayesian information criterion (BIC; Schwarz 1978) and the integrated classification likelihood (ICL; Biernacki et al. 2000) to determine the best pair of (gq) over a number of candidate models for achieving satisfactory performance (McNicholas and Murphy 2008; Lin et al. 2016). The BIC and ICL are defined as

$$\begin{aligned} \text{ BIC }=d\log n-2\ell _{\mathrm{max}}\quad \text{ and }\quad \text{ ICL }=\text{ BIC }+2\text{ ENT }(\hat{{\varvec{z}}}), \end{aligned}$$

where d is the number of free parameters, \(\ell _{\mathrm{max}}\) is the maximized log-likelihood value, and \(\text{ ENT }(\hat{{\varvec{z}}})=-\sum _{i=1}^g\sum _{j=1}^n{\hat{z}}_{ij}\log {\hat{z}}_{ij}\) is a penalty term, called the entropy, that favors well-separated mixtures. The ICL penalizes complex models more heavily and therefore selects more parsimonious models than does BIC.
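Given the maximized log-likelihood loglik_max, the number of free parameters d (one of the counts in Sect. 3.1), the sample size n and the n-by-g posterior matrix z_hat from a fitted model, both criteria are one-liners; the small guard inside the logarithm avoids log(0) for degenerate posteriors.

```r
## Illustrative computation of BIC and ICL for a fitted model.
BIC <- d * log(n) - 2 * loglik_max
ENT <- -sum(z_hat * log(pmax(z_hat, .Machine$double.eps)))  # entropy penalty ENT(z_hat)
ICL <- BIC + 2 * ENT
```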

To evaluate the clustering performance of the model-based approaches, the adjusted Rand index (ARI; Hubert and Arabie 1985) and the correct classification rate (CCR; Lee et al. 2003) are employed. The ARI typically ranges between 0 and 1, but it can be negative when the level of agreement is poorer than expected by chance, e.g., when fewer instances are correctly classified than would be expected under random assignment. The CCR also takes values between 0 and 1 and is computed as the highest proportion of correctly classified observations obtained by comparing all permutations of the MAP clustering labels with the true class labels.
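A minimal sketch of the CCR computation is given below, assuming true_lab and map_lab are integer class labels in 1,…,g; all g! relabelings of the MAP labels are scanned, which is practical for the small g used here, and the ARI can be obtained, for example, from mclust::adjustedRandIndex.

```r
## Sketch of the correct classification rate (CCR) over all label permutations.
ccr <- function(true_lab, map_lab, g = max(true_lab)) {
  perms <- function(v) {                         # all permutations of v (base R)
    if (length(v) == 1) return(list(v))
    out <- list()
    for (i in seq_along(v))
      for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
    out
  }
  max(vapply(perms(1:g), function(p) mean(p[map_lab] == true_lab), numeric(1)))
}
## ARI, e.g.: mclust::adjustedRandIndex(true_lab, map_lab)
```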

4.3 Identifiability issues

The mixture model itself suffers from a non-identifiability problem arising from the permutation of class labels in the parameter vector. This label-switching issue is often inherent in Bayesian implementations of mixture models. However, it is not a problem in practice when employing an EM-based algorithm to estimate mixture densities, since we can still determine a sequence of ML estimates that are consistent and asymptotically efficient; see McLachlan and Basford (1988).

On the other hand, there is another identifiability problem corresponding to the rotational indeterminacy of the common factor loading matrix \({\varvec{A}}\). As suggested by Baek et al. (2010), a unique solution, say \(\hat{{\varvec{A}}}^*\), can be obtained by postmultiplying \(\hat{{\varvec{A}}}\) by a nonsingular matrix such that the resulting loading matrix has orthonormal columns, i.e., \(\hat{{\varvec{A}}}^{*\top }\hat{{\varvec{A}}}^*={\varvec{I}}_q\). This can be achieved via the Cholesky decomposition: find the upper triangular matrix \({\varvec{C}}\) of order q such that \(\hat{{\varvec{A}}}^{\top }\hat{{\varvec{A}}}={\varvec{C}}^{\top }{\varvec{C}}\), which yields \(\hat{{\varvec{A}}}^*=\hat{{\varvec{A}}}{\varvec{C}}^{-1}\).
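The rotation fix just described amounts to a few lines of R; orthonormalize_A is an illustrative name.

```r
## Sketch of resolving the rotational indeterminacy of the loading matrix.
orthonormalize_A <- function(A_hat) {
  C <- chol(crossprod(A_hat))   # upper triangular C with A_hat' A_hat = C' C
  A_hat %*% solve(C)            # A_star; crossprod(A_star) is (numerically) I_q
}
```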

Related to the standard errors of the ML estimates, it would be of interest to calculate them using the empirical information matrix for \({{\varvec{\varTheta }}}\) in a manner analogous to Wang and Lin (2016). This procedure will be tackled by the authors in a future paper.

5 Simulation

We conduct two simulation experiments to demonstrate the proposed techniques. Unless otherwise stated, we consider only the case of \({\varvec{D}}_i={\varvec{D}}\) for all i in the subsequent analyses.

5.1 Experiment 1

In this experiment, to compare the accuracy of three parsimonious factor-analytic approaches for clustering and representing low-dimensional data, we generate a \(p=3\) dimensional dataset of size \(n=1000\) from a \(g=2\) component mixture of rMST distributions. The presumed mixture parameters as involved in (5) are

$$\begin{aligned} \pi _1= & {} 0.5, \quad \pi _2=0.5, \quad {\varvec{\mu }}_1=(0,0,0)^{\top }, \quad {\varvec{\mu }}_2=(1,1,3)^{\top },\\ \nu _1= & {} 4, \quad \nu _2=5, \quad {\varvec{\alpha }}_1=(-\,2,-\,5,-\,5)^{\top }, \quad {\varvec{\alpha }}_2=(-\,2,5,5)^{\top },\\ {\varvec{\varSigma }}_1= & {} \left[ \begin{array}{c@{\quad }c@{\quad }c} 4 &{} -\,1.8 &{} -\,1\\ -\,1.8 &{} 2 &{} 0.9\\ -\,1 &{} 0.9 &{} 2 \end{array} \right] \quad \text{ and } \quad {\varvec{\varSigma }}_2= \left[ \begin{array}{c@{\quad }c@{\quad }c} 4 &{} 1.8 &{} 0.8\\ 1.8 &{} 2 &{} 0.5\\ 0.8 &{} 0.5 &{} 2 \end{array} \right] . \end{aligned}$$

The MCFA, MCtFA and MCrstFA models with \(q=2\) factors and \(g=2\) components are fitted to the simulated data via the ECME algorithm. Once the parameter estimates and the corresponding factor scores are obtained under each fitted model, we can compare the clustering performance and calculate the predicted values of each observed feature vector \({\varvec{y}}_j\). As anticipated, the MCrstFA approach gives the best clustering result (\(\hbox {ARI}=0.891; \hbox {CCR}=0.972\)), followed closely by MCtFA (\(\hbox {ARI}=0.817; \hbox {CCR}=0.952\)). The MCFA has the worst performance (\(\hbox {ARI}=1.78\times 10^{-6}; \hbox {CCR}=0.51\)), indicating a lack of ability to cluster mixtures of skewed data with outliers. A cross-tabulation of the true and predicted class memberships is given in Table 1. As can be seen, the MCrstFA approach yields fewer misclassified observations and outperforms the other two approaches, namely MCtFA and MCFA.

Table 1 Cross-tabulations of true (A, B) and predicted (1, 2) class memberships for three parsimonious factor-analytic approaches for the simulated data
Fig. 2 Original observations and the predicted observations by MCFA, MCtFA, and MCrstFA

Fig. 3 Scatter plot of generated bivariate factors for each of \(g=3\) components

Figure 2 displays plots of the actual observations \({\varvec{y}}_j\) overlaid with the predicted observations \(\hat{{\varvec{y}}}_j\), calculated as \(\hat{{\varvec{y}}}_j=\hat{{\varvec{A}}}\hat{{\varvec{u}}}_j\), \((j=1,\ldots ,1000)\), where \(\hat{{\varvec{A}}}\) is the estimated projection matrix and \(\hat{{\varvec{u}}}_j\) is the estimated factor score defined in (18). As shown in Fig. 2a, the MCFA model performs poorly because it lacks a mechanism to cope with data exhibiting non-normal features. On the other hand, it is clearly observed from Fig. 2b, c that the original scattering structure of the two groups can be retrieved quite well using the MCtFA and MCrstFA approaches, although the MCtFA fit is slightly less favorable, yielding 20 more misclassified units than the MCrstFA.

5.2 Experiment 2

To further demonstrate the validity of the MCrstFA approach for handling data of higher dimensions, we perform a second simulation experiment in situations where the MCrstFA model holds exactly. In this study, data were generated from a 3-component MCrstFA model with \(q=2\), and \(p=10\) and 20. We carried out 100 Monte Carlo (MC) replications, each of sample size \(n=1500\) with equal mixing proportions, namely \(\pi _i=1/3\) for all i. The elements of the \(p\times q\) common factor loading matrix \({\varvec{A}}\) were randomly generated from N(0, 1), while the component DOFs were taken as \((\nu _1,\nu _2,\nu _3)=(4,6,9)\). The location vectors, scale-covariance matrices and skewness parameters of the component factors \({\varvec{U}}_{ij}\) are chosen as

$$\begin{aligned} {\varvec{\xi }}_1= & {} (0,2.5)^{\top }, \quad {\varvec{\xi }}_2=(-\,2.5,0)^{\top }, \quad {\varvec{\xi }}_3=(2.5,0)^{\top },\\ {\varvec{\lambda }}_1= & {} (5,5)^{\top }, \quad {\varvec{\lambda }}_2=(-\,5,-\,5)^{\top }, \quad {\varvec{\lambda }}_3=(0,0)^{\top },\\ {\varvec{\varOmega }}_1= & {} \left[ \begin{array}{c@{\quad }c} 0.1 &{} 0\\ 0 &{} 0.45 \end{array} \right] , \quad {\varvec{\varOmega }}_2= \left[ \begin{array}{c@{\quad }c} 0.45 &{} 0\\ 0 &{} 0.1 \end{array} \right] , \quad {\varvec{\varOmega }}_3= \left[ \begin{array}{c@{\quad }c} 0.45 &{} 0\\ 0 &{} 0.1 \end{array} \right] . \end{aligned}$$

Figure 3 gives an illustration of the generated bivariate factor scores based on one simulated case for each of the three components. These component factor scores are reasonably well separated and exhibit non-elliptical scattering patterns and heavy tails. The component error vectors \({\varvec{e}}_{ij}\)s were drawn independently from \(t_p({\varvec{0}},{\varvec{D}},\nu _i)\), where the diagonal elements of \({\varvec{D}}\) were randomly generated from a uniform distribution on (0.1, 0.3).

We process each of the 100 MC simulated datasets by fitting the MCFA, MCtFA and MCrstFA models. Comparisons were made on the adequacy of the overall fit in terms of BIC and ICL, and on the classification agreement between the true and predicted memberships as assessed by ARI and CCR. Table 2 lists the average values of these criteria together with the corresponding standard deviations (Std) under every scenario considered. As a guide to selecting the most plausible model, the frequencies (Freq) with which each model is preferred by these criteria are also reported. In all cases, the MCrstFA model provides better fits and clustering results than the other two approaches. In particular, the MCFA and MCtFA are seldom or never chosen by these four indices owing to their lack of robustness against skewness. We have also undertaken the simulation study with a much higher dimension, say \(p=100\), and found that the MCrstFA model still works similarly well without degradation of its performance.

Table 2 Comparison of MCFA, MCtFA and MCrstFA models for simulation based on 100 replications

6 Application to real data

We applied our method to the human liver cancer data (Chen et al. 2002), which consist of \(p=85\) gene expressions partitioned into two subpopulations. Hepatocellular carcinoma (HCC) is one of the 10 leading causes of death in the world. Chen et al. (2002) used cDNA microarrays to characterize patterns of gene expression in HCC, from which they found that the expression patterns in HCC and nontumor liver tissues (LIVER) are distinctly different from one another. The data comprise \(n=179\) tissue samples of genomic expression patterns from patients, of which 104 belong to HCC and 75 to LIVER.

Fig. 4 Boxplots for the 30 genes in the human liver cancer data. The x-coordinate indicates the order of original genes

Figure 4 depicts the boxplots of the top 30 genes having the most significant differences between the two classes, as determined by two-sample t-tests. Apparently, the distribution of each selected gene is highly skewed or has a long tail.
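The gene-screening step described above can be reproduced along the following lines, assuming X is the 179-by-85 expression matrix and group a two-level factor (HCC vs. LIVER); the variable names are illustrative.

```r
## Sketch of ranking genes by two-sample t-tests and plotting the top 30.
pvals <- apply(X, 2, function(gene) t.test(gene ~ group)$p.value)
top30 <- order(pvals)[1:30]
boxplot(X[, top30], las = 2, main = "Top 30 differentially expressed genes")
```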

Table 3 Comparison of fitting results and implied clustering versus the true membership of the human liver cancer data

We implement the two-component MCFA, MCtFA, MCrstFA and MCghstFA approaches with q ranging from 1 to 10. In the same vein as the simulation experiments, we assume \({\varvec{D}}_i={\varvec{D}}\) for all i but place no restrictions on the component DOFs. A comparison of some characteristics of the MCrstFA and MCghstFA models is summarized in Table 5. When fitting the MCghstFA model, we implement the ECM algorithm described in “Appendix D”. For clarity, Table 3 presents only the fitting results and classification agreements of each method with q ranging from 5 to 10. Judging from BIC and ICL, the best-fitting model is the MCghstFA model with \(q=8\). In terms of classification performance, however, the MCrstFA model with \(q=6\) provides the best agreement with the true group memberships (\(\hbox {ARI}=0.2427\) and \(\hbox {CCR}=0.7486\)) for this dataset. Notice that the best classifier does not necessarily give the best fit to the data. Again, the MCrstFA approach demonstrates its usefulness in clustering high-dimensional data with asymmetry and/or fat tails.

Table 4 Cross-tabulations of true and predicted (1,2) class memberships for four mixtures of common factor-analytic approaches for the human liver cancer data

Table 4 compares the best classification results obtained from the fitted MCFA (\(q=10\)), MCtFA (\(q=6\)), MCrstFA (\(q=6\)) and MCghstFA (\(q=10\)) models. We found that the number of correctly classified HCC tissues under the MCrstFA fit is larger than under the other three approaches. However, there is no obvious difference among them in predicting the class memberships of the LIVER tissues.

To visualize the clustering results in a low-dimensional space, Fig. 5 portrays the data in a 3D space using the factor scores estimated by (19). In the plot, we use the second, third and fifth factors from the fit of MCrstFA with \(q=6\) factors. The estimated factor scores in Fig. 5a, b are plotted according to the true and implied clustering labels, respectively. It can be observed from the two plots that the two clusters are inherently overlapped, so that no approach works satisfactorily in classifying these tissues. Most of the misclassified tissues, labelled by ‘+’ in Fig. 5b, appear in the overlapping area between the two clusters.

Fig. 5 Plot of the (estimated) posterior mean factor scores via the MCrstFA approach for the human liver cancer data based on a the true class labels, and b the implied clustering labels. HCC and LIVER tissues are shown with distinct plotting symbols; (+) misclassified tissues

7 Conclusion

We propose an extension of MCFA in which the component factors and errors are jointly modeled by the rMST distribution, called the MCrstFA model, as a new model-based tool for analyzing high-dimensional data with a strong degree of non-normality and multimodality. An attractive feature of the MCrstFA is that the component means, component covariance matrices and component skewness parameters are all represented in terms of common factor loadings, allowing parsimonious model fitting while preserving robustness.

We describe an analytically simple ECME procedure, developed under a five-level hierarchy, for fitting the MCrstFA. This approach enables us to project high-dimensional clustering results into a low-dimensional space by displaying the estimated factor scores. Simulation studies and a real-data example demonstrate its usefulness and flexibility in terms of model fitting and outright clustering.

The techniques presented so far are limited to the likelihood-based approach and focus on complete-data analysis. Some possible avenues for future research include building a framework to handle censored observations (Castro et al. 2015; Lachos et al. 2017) or missing values (Ouyang et al. 2004; Lin 2014; Wang et al. 2017a, b), both of which are common problems in the analysis of high-dimensional data. Although our estimating procedure is easy to implement, there is a lack of feasible guidelines for the joint determination of (g, q) within a single run of the training process. Toward this end, variational Bayes (VB) approximations (Waterhouse et al. 1996; Jordan et al. 1999; Beal 2003) have been presented as an iterative Bayesian alternative to EM-based algorithms owing to their fast and deterministic nature. An attractive feature of the VB scheme is that it allows for automated learning of parameter estimates and model selection. The VB approach has been effectively applied to Gaussian mixtures (Teschendorff et al. 2005), MFA models (Ghahramani and Beal 2000), and mixtures of normal inverse Gaussian distributions (Subedi and McNicholas 2014) for simultaneously estimating model parameters and determining the number of components. Therefore, it is worthwhile to develop a novel VB algorithm for learning the MCrstFA model. Another direction for future work is to extend the MCrstFA model to a broader family of multivariate skew distributions such as the scale mixtures of skew-normal distributions (Cabral et al. 2012; Prates et al. 2013), the multivariate canonical fundamental skew-t distributions (Arellano-Valle and Genton 2005; Lee and McLachlan 2016), and the hidden truncation hyperbolic distributions introduced very recently by Murray et al. (2017b).