1 Introduction

Mixtures of factor analyzers (MFA), originally introduced by Ghahramani and Hinton (1997), provide a global non-linear approach to dimension reduction via the adoption of component distributions having a factor-analytic representation for the component-covariance matrices. To substantially reduce the number of parameters in the component matrices, especially when the number of components (g) or features (p) becomes large, Baek et al. (2010) extended the MFA by using common component factor loadings, known as mixtures of common factor analyzers (MCFA), which has since become a popular tool for high-dimensional data analysis. To deal with data with extreme values or outliers commonly observed in microarray experiments, Baek and McLachlan (2011) presented a robust version of MCFA using multivariate Student’s-t distributed component errors and factors, called mixtures of common t-factor analyzers (MCtFA). Recently, Wang (2013, 2015) extended the MCFA and MCtFA approaches to accommodate high-dimensional data with possibly missing values.

The specification of the component factors and errors in both MFA and MCFA rests on the assumption of multivariate normality for computational convenience and mathematical tractability, which leaves the two models highly vulnerable to outliers. Although the MCtFA model is less affected by violations of normality, it may still suffer from a lack of robustness against highly asymmetric observations. In many practical problems, however, the data to be analyzed may contain a group or groups of observations whose distributions are moderately or severely skewed and/or heavy-tailed. As shown in many empirical studies, even a slight deviation from normality may seriously affect the estimates of the mixture parameters and subsequently lead to spurious groups as well as misleading statistical inference.

Over the past few decades, there has been growing interest in adopting more flexible parametric distributions to accommodate non-normal features such as asymmetry and longer-than-normal tails, which lead to non-zero skewness and excess kurtosis; see the monograph by Azzalini (2014) for a more comprehensive overview. Lin et al. (2015) proposed a robust extension of factor analysis models based on the restricted multivariate skew-t (rMST) distribution (Pyne et al. 2009). Other related proposals include mixtures of skew-normal/t factor analyzers (Lin et al. 2016, 2018), mixtures of generalized hyperbolic (GH) factor analyzers (Tortora et al. 2016), mixtures of skew-t factor analyzers (Murray et al. 2014a), and mixtures of common skew-t factor analyzers (Murray et al. 2014b). In addition, Murray et al. (2017a) presented an extended version of MFA with the component factors and errors following the skew-t distribution considered by Sahu et al. (2003), which is referred to as the unrestricted multivariate skew-t (uMST) distribution by Lee and McLachlan (2014).

Note that the rMST and uMST distributions are not nested within each other, and they are equivalent only in the univariate case. Moreover, Sahu et al. (2003) have highlighted that the calculation of the uMST density becomes cumbersome as p increases. The computational difficulty of the uMST formulation was also pointed out by Murray et al. (2017a; Section 5). Azzalini et al. (2016) have provided a detailed comparison between the rMST and uMST distributions in terms of the merits of both distributions for data modeling. When comparing the two distributions in the context of model-based clustering, their illustrative examples indicate that “neither formulation is markedly superior and, if these results were to be taken in favor of either formulation, it would be the classical formulation”, namely the rMST distribution adopted in this paper.

Further, it is interesting to note that the skew-t distribution adopted by Murray et al. (2014a, b), arising from the family of GH distributions (Barndorff-Nielsen and Shephard 2001), is henceforth referred to as the generalized hyperbolic skew-t (GHST) distribution. Its density has a rather different form from that of the rMST distribution and does not include the skew-normal as a limiting case (Lee and Poon 2011). The model proposed by Murray et al. (2014b) is henceforth referred to as mixtures of common generalized hyperbolic skew-t factor analyzers (MCghstFA).

In this paper, we propose an alternative skew extension of the MCtFA model based on the rMST distribution, called the mixture of common restricted skew-t factor analyzers (MCrstFA) model. This new proposal preserves resistance to the extreme non-normal effects that commonly occur in high-dimensional data. As in the MCFA and MCtFA models, common factor loadings are utilized for parsimoniously modeling the component-covariance matrices. To portray the observed data in a lower-dimensional space and avoid possible singularities, the scale-covariance matrices for the component errors (\({\varvec{D}}_i\)) are generally assumed to be homogeneous (\({\varvec{D}}_i={\varvec{D}}\)). Under certain circumstances, \({\varvec{D}}_i\) can be relaxed to be unequal or modified to other types such as \({\varvec{D}}_i=d_i{\varvec{I}}_p\) (isotropic with unequal variances) or \({\varvec{D}}_i=d{\varvec{I}}_p\) (isotropic with equal variance). Lately, Wang and Lin (2017) presented a modification of MCtFA using component-specific \({\varvec{D}}_i\) and empirically demonstrated its advantage in classifying new subjects whose true group labels are unknown in advance.

The rest of the paper is structured as follows. In Sect. 2, we establish the notation and outline some preliminary properties of the rMST distribution. In Sect. 3, we present the specification of the MCrstFA model and develop a workable expectation conditional maximization either (ECME) algorithm for carrying out maximum likelihood (ML) estimation. In Sect. 4, we discuss the initialization and stopping rules, the criteria for model selection and clustering performance, and identifiability issues. In Sect. 5, we conduct two simulation studies to examine the validity of the MCrstFA model. The methodology is illustrated on a real example concerning human liver cancer data in Sect. 6. Concluding remarks and directions for future work are given in Sect. 7. Some detailed proofs and supplementary information are deferred to the appendices.

2 Notation and prerequisites

We first review the rMST distribution and study its related properties. Let \(\phi _p(\cdot ;{\varvec{\mu }},{\varvec{\varSigma }})\) be the probability density function (pdf) of a multivariate normal distribution with mean vector \({\varvec{\mu }}\) and covariance matrix \({\varvec{\varSigma }}\), denoted by \(N_p({\varvec{\mu }},{\varvec{\varSigma }})\); \({\varPhi }(\cdot )\) the cumulative distribution function (cdf) of the standard normal distribution; \(TN(\mu ,\sigma ^2;(a,b))\) the truncated normal distribution defined as a normal distribution \(N(\mu ,\sigma ^2)\) truncated to the interval \((a,b)\); \(t_p(\cdot ;{\varvec{\mu }},{\varvec{\varSigma }},\nu )\) the pdf of a p-variate t distribution with location \({\varvec{\mu }}\), scale-covariance matrix \({\varvec{\varSigma }}\) and degrees of freedom (DOF) \(\nu \); \(g(x;\alpha ,\beta )\) the pdf of the gamma distribution, given by \(\beta ^{\alpha }x^{\alpha -1}\exp \{-\,\beta x\}/{\varGamma }(\alpha )\); \(T(\cdot ;\nu )\) the cdf of the Student’s t distribution with zero location, unit scale and DOF \(\nu \); \({\varvec{1}}_p\) a \(p\times 1\) vector with all elements equal to 1; \({\varvec{I}}_p\) the \(p\times p\) identity matrix; Diag\(\{\cdot \}\) the diagonal matrix formed by extracting the main diagonal elements of a square matrix or by the diagonalization of a vector; and \({\varvec{A}}^{1/2}\) the square root of a symmetric matrix \({\varvec{A}}\).

Following Pyne et al. (2009), a p-dimensional random vector \({\varvec{Y}}\) is said to follow the rMST distribution with location vector \({\varvec{\mu }}\in {\mathbb {R}}^p\), scale-covariance matrix \({\varvec{\varSigma }}\), skewness vector \({\varvec{\lambda }}\in {\mathbb {R}}^p\) and DOF \(\nu \in {\mathbb {R}}^{+}\), denoted as \({\varvec{Y}}\sim rST_p({\varvec{\mu }},{\varvec{\varSigma }},{\varvec{\lambda }},\nu )\), if it has the pdf:

$$\begin{aligned} \psi _p({\varvec{y}};{\varvec{\mu }},{\varvec{\varSigma }},{\varvec{\lambda }},\nu )=2t_p({\varvec{y}};{\varvec{\mu }},{\varvec{\varOmega }},\nu )T\left( M\sqrt{\frac{\nu +p}{\nu +\delta }};\nu +p\right) , \end{aligned}$$
(1)

where \({\varvec{\varOmega }}={\varvec{\varSigma }}+{\varvec{\lambda }}{\varvec{\lambda }}^{\top }\), \(\delta =({\varvec{y}}-{\varvec{\mu }})^{\top }{\varvec{\varOmega }}^{-1}({\varvec{y}}-{\varvec{\mu }})\) and \(M={\varvec{\lambda }}^{\top }{\varvec{\varOmega }}^{-1}({\varvec{y}}-{\varvec{\mu }})/(1-{\varvec{\lambda }}^{\top }{\varvec{\varOmega }}^{-1}{\varvec{\lambda }})^{1/2}\). Note that the distribution of \({\varvec{Y}}\) is reduced to \(t_p({\varvec{\mu }},{\varvec{\varSigma }},\nu )\) by setting \({\varvec{\lambda }}={\varvec{0}}\) and to \(rSN_p({\varvec{\mu }},{\varvec{\varSigma }},{\varvec{\lambda }})\) as \(\nu \rightarrow \infty \). Furthermore, the family of (1) also includes \(N_p({\varvec{\mu }},{\varvec{\varSigma }})\), obtained by letting \({\varvec{\lambda }}={\varvec{0}}\) and \(\nu \rightarrow \infty \).
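To make the density in (1) concrete, the following is a minimal R sketch of \(\psi _p\) built from the p-variate t density in the mvtnorm package and the univariate t cdf; the function name dmst and its arguments are illustrative rather than part of any released package.

```r
## Minimal sketch of the rMST density (1); dmst() is an illustrative name.
## Requires the mvtnorm package for the p-variate t density.
library(mvtnorm)

dmst <- function(y, mu, Sigma, lambda, nu) {
  Omega <- Sigma + tcrossprod(lambda)               # Omega = Sigma + lambda lambda'
  p     <- length(mu)
  delta <- c(t(y - mu) %*% solve(Omega, y - mu))    # delta = (y-mu)' Omega^{-1} (y-mu)
  M     <- c(crossprod(lambda, solve(Omega, y - mu))) /
           sqrt(1 - c(crossprod(lambda, solve(Omega, lambda))))
  2 * dmvt(y, delta = mu, sigma = Omega, df = nu, log = FALSE) *
    pt(M * sqrt((nu + p) / (nu + delta)), df = nu + p)
}

## Example: evaluate a bivariate rMST density at the origin
dmst(c(0, 0), mu = c(0, 0), Sigma = diag(2), lambda = c(2, 1), nu = 4)
```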

Alternatively, the rMST distribution can be hierarchically represented as

$$\begin{aligned} {\varvec{Y}}\mid (\gamma ,\tau )\sim & {} N_p({\varvec{\mu }}+{\varvec{\lambda }}\gamma ,{\varvec{\varSigma }}/\tau ),\nonumber \\ \gamma \mid \tau\sim & {} TN(0,1/\tau ;(0,\infty )),\nonumber \\ \tau\sim & {} \mathrm{Gamma}(\nu /2,\nu /2), \end{aligned}$$
(2)

where Gamma(\(\alpha ,\beta \)) stands for the gamma distribution with mean \(\alpha /\beta \). Figure 1 shows the perspective plots with added contours for rMST densities under \({\varvec{\mu }}=(0,0)^\top \), \({\varvec{\varSigma }}={\varvec{I}}_2\), \(\nu =4\) and various specifications of \({\varvec{\lambda }}=(\lambda _1,\lambda _2)^{\top }\). It is clearly seen that these plots are non-elliptical and can be skewed and correlated toward different directions depending on the chosen parameters. Therefore, the rMST distribution provides a flexible mechanism to adapt well to more complicated data.
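The three-level hierarchy in (2) also suggests a simple way of simulating rMST data, sketched below in R; the function name rmst_sample is illustrative, and each row of the output is one draw of \({\varvec{Y}}\).

```r
## Sketch of sampling from the rMST distribution through the hierarchy (2).
rmst_sample <- function(n, mu, Sigma, lambda, nu) {
  p   <- length(mu)
  tau <- rgamma(n, shape = nu / 2, rate = nu / 2)   # tau ~ Gamma(nu/2, nu/2)
  gam <- abs(rnorm(n)) / sqrt(tau)                  # gamma | tau ~ TN(0, 1/tau; (0, Inf))
  err <- matrix(rnorm(n * p), n, p) %*% chol(Sigma) / sqrt(tau)  # rows ~ N_p(0, Sigma/tau)
  sweep(err, 2, mu, "+") + tcrossprod(gam, lambda)  # Y = mu + lambda * gamma + error
}

set.seed(1)
y <- rmst_sample(500, mu = c(0, 0), Sigma = diag(2), lambda = c(4, 2), nu = 4)
plot(y, xlab = expression(y[1]), ylab = expression(y[2]))  # skewed, heavy-tailed cloud
```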

Fig. 1 The contours of bivariate rMST distribution with \({\varvec{\mu }}=(0,0)^\top \), \({\varvec{\varSigma }}={\varvec{I}}_2\) and \(\nu =4\) for different values of \(\lambda _1\) and \(\lambda _2\)

3 Methodology

3.1 Model formulation

Suppose that \({\varvec{Y}}=({\varvec{Y}}_1,\ldots ,{\varvec{Y}}_n)\) forms a random sample of size n in which each \({\varvec{Y}}_j=(Y_{j1},\ldots ,Y_{jp})^{\top }\) is a p-dimensional vector of feature variables. Suppose further that these samples come independently from g distinct subgroups in a heterogeneous population. The MCrstFA model for each \({\varvec{Y}}_j\) is

$$\begin{aligned} {\varvec{Y}}_j={\varvec{A}}{\varvec{U}}_{ij}+{\varvec{e}}_{ij}~\quad \mathrm{with~probability}~ \pi _i\quad (i=1,\ldots ,g), \end{aligned}$$
(3)

for \(j=1,\ldots ,n\), where \({\varvec{A}}\) is a \(p\times q\) matrix of common factor loadings, \({\varvec{U}}_{ij}\) is a q-dimensional (\(q < p\)) vector of component factors, \({\varvec{e}}_{ij}\) is a p-dimensional vector of component errors, and \(\pi _i\)s are the mixing proportions subject to \(\sum _{i=1}^g \pi _i=1\).

Furthermore, we assume that \({\varvec{U}}_{ij}\) and \({\varvec{e}}_{ij}\) are jointly distributed as

$$\begin{aligned} \left[ \begin{array}{c} {\varvec{U}}_{ij}\\ {\varvec{e}}_{ij} \end{array} \right] \sim rST_{p+q} \left( \left[ \begin{array}{c} {\varvec{\xi }}_i\\ \mathbf 0 \end{array} \right] , \left[ \begin{array}{cc} {\varvec{\varOmega }}_i &{}\quad \mathbf 0\\ \mathbf 0 &{}\quad {\varvec{D}}_i \end{array} \right] , \left[ \begin{array}{c} {\varvec{\lambda }}_i\\ \mathbf 0 \end{array} \right] , \nu _i \right) , \end{aligned}$$
(4)

where \({\varvec{\xi }}_i\) is a q-dimensional location vector, \({\varvec{\varOmega }}_i\) is a \(q\times q\) positive-definite scale covariance matrix, \({\varvec{\lambda }}_i \in {\mathbb {R}}^q\) is a skewness vector, \({\varvec{D}}_i\) is a \(p\times p\) positive diagonal matrix, and \(\nu _i\) is the DOF. The specifications of \({\varvec{D}}_i\) and \(\nu _i\) in (4) can be either constrained to be equal or allowed to vary among components.

Based on (3) along with assumption (4), the pdf of \({\varvec{Y}}_j\) is

$$\begin{aligned} f({\varvec{y}}_j)=\sum _{i=1}^g \pi _i \psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i), \end{aligned}$$
(5)

where

$$\begin{aligned} {\varvec{\mu }}_i={\varvec{A}}{\varvec{\xi }}_i,~~{\varvec{\varSigma }}_i={\varvec{A}}{\varvec{\varOmega }}_i{\varvec{A}}^{\top }+{\varvec{D}}_i,~~{\varvec{\alpha }}_i={\varvec{A}}{\varvec{\lambda }}_i, \end{aligned}$$
(6)

and \(\psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i)\) is the rMST density function defined in (1). Notice that the representations in (6) cannot be uniquely determined because they remain unchanged if the common factor loading matrix \({\varvec{A}}\) is postmultiplied by any nonsingular matrix. Thus, we must impose \(q^2\) constraints to achieve identifiability of \({\varvec{A}}\). As a result, the number of free parameters in the MCrstFA is

$$\begin{aligned} d_1=(g-1)+pg+q(p+g)+\frac{1}{2}gq(q+1)-q^2+gq+g. \end{aligned}$$

If \({\varvec{D}}_i\)s are constrained to be homogeneous across components, the number of parameters is

$$\begin{aligned} d_2=(g-1)+p+q(p+g)+\frac{1}{2}gq(q+1)-q^2+gq+g; \end{aligned}$$

and if component DOFs are further assumed to be identical, the resulting number of parameters is

$$\begin{aligned} d_3=(g-1)+p+q(p+g)+\frac{1}{2}gq(q+1)-q^2+gq+1. \end{aligned}$$

We remark that, compared with the MCFA and MCtFA, the number of parameters in the MCrstFA increases mainly by the qg skewness parameters involved in the \({\varvec{\lambda }}_i\), so the extra flexibility is gained without adding much complexity.
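As a quick check on the counts \(d_1\), \(d_2\) and \(d_3\) above, the following small R helper evaluates them; the function name and arguments are illustrative.

```r
## Illustrative helper for the parameter counts d1, d2, d3 of the MCrstFA.
n_free_params <- function(p, q, g, common_D = FALSE, common_nu = FALSE) {
  (g - 1) +                              # mixing proportions
    ifelse(common_D, p, p * g) +         # diagonal D (common or component-specific)
    q * (p + g) +                        # loadings A and locations xi_i
    g * q * (q + 1) / 2 - q^2 +          # Omega_i minus identifiability constraints
    g * q +                              # skewness vectors lambda_i
    ifelse(common_nu, 1, g)              # degrees of freedom
}
n_free_params(p = 10, q = 2, g = 3, common_D = TRUE)  # d2 for the setting of Sect. 5.2
```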

To indicate the class membership of observation \({\varvec{y}}_j\), we introduce allocation variables \({\varvec{Z}}_j=(Z_{1j},\ldots ,Z_{gj})^{\top }\), defined as

$$\begin{aligned} Z_{ij}= \left\{ \begin{array}{l} 1,~~{\varvec{Y}}_j~\mathrm{belongs~to}~i\mathrm{th~component};\\ 0,~~\mathrm{otherwise}. \end{array} \right. \end{aligned}$$

Thus, we have \({\varvec{Z}}_j{\mathop {\sim }\limits ^\mathrm{iid}}{{\mathscr {M}}}(1;\pi _1,\ldots ,\pi _g)\), a multinomial distribution consisting of a single draw on g categories, where \(\pi _i=\Pr (Z_{ij}=1)\) can be regarded as the prior probability that \({\varvec{y}}_j\) belongs to the ith component.

According to (2) and (3), the MCrstFA model can be formulated by a five-level hierarchical representation:

$$\begin{aligned} {\varvec{Y}}_j\mid ({\varvec{U}}_{ij},\,\gamma _j,\tau _j,\,Z_{ij}=1)\sim & {} N_p({\varvec{A}}{\varvec{U}}_{ij},\tau _j^{-1} {\varvec{D}}_i), \nonumber \\ {\varvec{U}}_{ij}\mid (\gamma _j,\tau _j,Z_{ij}=1)\sim & {} N_q({\varvec{\xi }}_i+{\varvec{\lambda }}_i\gamma _j,\tau _j^{-1}{\varvec{\varOmega }}_i),\nonumber \\ \gamma _j\mid (\tau _j,Z_{ij}=1)\sim & {} TN(0,\,\tau _j^{-1};(0,\infty )),\nonumber \\ \tau _j\mid (Z_{ij}=1)\sim & {} \mathrm{Gamma}\left( \frac{\nu _i}{2},\frac{\nu _i}{2}\right) , \nonumber \\ {\varvec{Z}}_j\sim & {} {{\mathscr {M}}}(1;\pi _1,\ldots ,\pi _g). \end{aligned}$$
(7)

By Bayes’ rule, it suffices to derive the following conditional distributions, the proofs of which are sketched in “Appendix A”. Specifically,

$$\begin{aligned} {\varvec{U}}_{ij}\mid ({\varvec{y}}_j,\gamma _j,\tau _j,Z_{ij}=1)\sim & {} N_q\left( {\varvec{\xi }}_i+{\varvec{\lambda }}_i\gamma _j+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i-{\varvec{\alpha }}_i\gamma _j),\tau _j^{-1}({\varvec{I}}_q\right. \nonumber \\&\left. -{\varvec{\beta }}_i^{\top }{\varvec{A}}){\varvec{\varOmega }}_i\right) ,\nonumber \\ \gamma _j\mid ({\varvec{y}}_j,\tau _j,Z_{ij}=1)\sim & {} TN(h_{ij},\tau _j^{-1}\sigma _i^2;(0,\infty )),\nonumber \\ f(\tau _j\mid {\varvec{y}}_j,\,Z_{ij}=1)= & {} \frac{{\varPhi }\big (\sqrt{\tau _j}M_{ij}\big )}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }g\left( \tau _j;\frac{\nu _i+p}{2},\frac{\nu _i+\delta _{ij}}{2}\right) ,\nonumber \\ {\varvec{Z}}_j\mid {\varvec{y}}_j\sim & {} {{\mathscr {M}}}(1;\,{\tilde{\pi }}_{1j},\,\ldots \,,{\tilde{\pi }}_{gj}), \end{aligned}$$
(8)

where \({\varvec{\beta }}_i={\varvec{\varSigma }}_i^{-1}{\varvec{A}}{\varvec{\varOmega }}_i\), \(\delta _{ij}=({\varvec{y}}_j-{\varvec{\mu }}_i)^{\top }{\varvec{V}}_i^{-1}({\varvec{y}}_j-{\varvec{\mu }}_i)\), and \(M_{ij}=h_{ij}/\sigma _i\) with \({\varvec{V}}_i={\varvec{\varSigma }}_i+{\varvec{\alpha }}_i{\varvec{\alpha }}_i^{\top }\), \(h_{ij}={\varvec{\alpha }}_i^{\top }{\varvec{V}}_i^{-1}({\varvec{y}}_j-{\varvec{\mu }}_i)\) and \(\sigma _i^2=1-{\varvec{\alpha }}_i^{\top }{\varvec{V}}_i^{-1}{\varvec{\alpha }}_i\). Moreover,

$$\begin{aligned} {\tilde{\pi }}_{ij}=P(Z_{ij}=1|{\varvec{y}}_j)=\frac{\pi _i\psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i)}{\sum _{h=1}^g \pi _h \psi _p({\varvec{y}}_j;{\varvec{\mu }}_h,{\varvec{\varSigma }}_h,{\varvec{\alpha }}_h,\nu _h)}. \end{aligned}$$
(9)

To simplify the notation, we define \({\varvec{b}}_{ij}={\varvec{\xi }}_i+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i)\) and \(c_{ij}(r)=\{(\nu _i+p+r)/(\nu _i+\delta _{ij})\}^{1/2}\) for \(r=-2,0,2\), and let “\(|\cdots \)” represent conditioning on \({\varvec{Y}}_j={\varvec{y}}_j\) and \(Z_{ij}=1\). The following proposition summarizes some essential conditional expectations for implementing the ECME algorithm described in the next subsection.

Proposition 1

Considering the posterior distributions given in (8), we establish the following conditional expectations:

$$\begin{aligned} E(\tau _j\mid \cdots )= & {} \{c_{ij}(0)\}^2\frac{T(M_{ij}c_{ij}(2);\nu _i+p+2)}{T(M_{ij}c_{ij}(0);\nu _i+p)},\nonumber \\ E(\gamma _j\mid \cdots )= & {} h_{ij}+\frac{\sigma _it(M_{ij}c_{ij}(-2);\nu _i+p-2)}{c_{ij}(-2)T(M_{ij}c_{ij}(0);\nu _i+p)},\nonumber \\ E(\tau _j\gamma _j\mid \cdots )= & {} h_{ij}E(\tau _j\mid \cdots )+\sigma _i c_{ij}(0)\frac{t(M_{ij}c_{ij}(0);\nu _i+p)}{T(M_{ij}c_{ij}(0);\nu _i+p)},\nonumber \\ E(\tau _j\gamma _j^2\mid \cdots )= & {} \sigma _i^2+h_{ij}E(\tau _j\gamma _j\mid \cdots ),\nonumber \\ E({\varvec{U}}_{ij}\mid \cdots )= & {} {\varvec{b}}_{ij}+\left( {\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i\right) E(\gamma _j\mid \cdots ),\nonumber \\ E(\tau _j{\varvec{U}}_{ij}\mid \cdots )= & {} {\varvec{b}}_{ij}E(\tau _j\mid \cdots )+\left( {\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i\right) E(\tau _j\gamma _j\mid \cdots ),\nonumber \\ E(\tau _j\gamma _j{\varvec{U}}_{ij}\mid \cdots )= & {} {\varvec{b}}_{ij} E(\tau _j\gamma _j\mid \cdots )+\left( {\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i\right) E(\tau _j\gamma _j^2\mid \cdots ),\nonumber \\ E(\tau _j{\varvec{U}}_{ij}{\varvec{U}}_{ij}^\top \mid \cdots )= & {} \left( {\varvec{I}}_q-{\varvec{\beta }}_i^{\top }{\varvec{A}}\right) {\varvec{\varOmega }}_i+E(\tau _j\gamma _j{\varvec{U}}_{ij}\mid \cdots )({\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i)^{\top }\nonumber \\&+\,E(\tau _j{\varvec{U}}_{ij}\mid \cdots ){\varvec{b}}_{ij}^{\top }, \end{aligned}$$
(10)

and

$$\begin{aligned} E(\log \tau _j\mid \cdots )= & {} \frac{ \int _{-\infty }^{M_{ij}}t\left( x;0,\frac{\nu _i+\delta _{ij}}{\nu _i+p},\nu _i+p\right) f_{\nu _i}(x)dx}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }+E(\tau _j\mid \cdots )\nonumber \\&-\,\left( \frac{\nu _i+p}{\nu _i+\delta _{ij}}\right) +\mathrm{DG}\left( \frac{\nu _i+p}{2}\right) -\log \left( \frac{\nu _i+\delta _{ij}}{2}\right) , \end{aligned}$$
(11)

where DG\((\cdot )\) denotes the digamma function and \(f_{\nu _i}(x)\) is defined by (B.10).

Proof

The results follow directly from some fundamental matrix manipulations and the law of iterated expectations. See “Appendix B” for more details. \(\square \)
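For concreteness, the first two expectations in (10) can be coded directly from the univariate t density and cdf, as in the hedged R sketch below; h, sigma, delta and M denote the component-j quantities \(h_{ij}\), \(\sigma _i\), \(\delta _{ij}\) and \(M_{ij}\) defined below (8).

```r
## Sketch of E(tau_j | ...) and E(gamma_j | ...) from Proposition 1.
c_r <- function(r, nu, p, delta) sqrt((nu + p + r) / (nu + delta))   # c_ij(r)

E_tau <- function(M, nu, p, delta) {
  c_r(0, nu, p, delta)^2 *
    pt(M * c_r(2, nu, p, delta), df = nu + p + 2) /
    pt(M * c_r(0, nu, p, delta), df = nu + p)
}

E_gamma <- function(h, sigma, M, nu, p, delta) {
  h + sigma * dt(M * c_r(-2, nu, p, delta), df = nu + p - 2) /
    (c_r(-2, nu, p, delta) * pt(M * c_r(0, nu, p, delta), df = nu + p))
}
```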

3.2 Parameter estimation via the ECME algorithm

The EM algorithm (Dempster et al. 1977) is a popular iterative method for finding ML estimates when the data are incomplete or the model contains latent variables. The main advantage of EM lies in its monotone convergence without sacrificing simplicity. One common limitation of the EM algorithm is that the M-step may yield no closed-form estimators of the parameters. To overcome this weakness, Meng and Rubin (1993) proposed the expectation conditional maximization (ECM) algorithm, which replaces the M-step of EM with several computationally simpler CM-steps, each of which maximizes the expected complete-data log-likelihood function (known as the Q-function) sequentially. Importantly, the authors also showed that the ECM algorithm preserves all the desirable properties of EM. In certain situations, some of the CM-steps of ECM may still be computationally intractable. Liu and Rubin (1994) therefore extended the ECM algorithm by allowing CM-steps that maximize either the Q-function, called CMQ-steps, or the corresponding constrained actual log-likelihood function, called CML-steps. The resulting method is referred to as the ECME algorithm.

For notational simplicity, we denote the observed data by \({\varvec{y}}=({\varvec{y}}_1,\ldots ,{\varvec{y}}_n)\), the allocation indicators by \({\varvec{Z}}=({\varvec{z}}_1,\ldots ,{\varvec{z}}_n)\), the latent factors by \({\varvec{U}}=({\varvec{U}}_1,\ldots ,{\varvec{U}}_n)\), the hidden variables by \({\varvec{\gamma }}=(\gamma _1,\ldots ,\gamma _n)\) and the scaling weight variables by \({\varvec{\tau }}=(\tau _1,\ldots ,\tau _n)\). Therefore, the complete data \({\varvec{y}}_c\) comprise the observed data \({\varvec{y}}\) together with the missing data \({\varvec{y}}_m=({\varvec{Z}},{\varvec{U}},{\varvec{\gamma }},{\varvec{\tau }})\). From (5), it is readily seen that

$$\begin{aligned} {\varvec{Y}}_j\mid (Z_{ij}=1) \sim rST_p({\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i). \end{aligned}$$

Therefore, the joint pdf of \(({\varvec{Y}},{\varvec{Z}})\) is

$$\begin{aligned} f({\varvec{y}},{\varvec{z}})=\prod _{j=1}^n\prod _{i=1}^g \{\pi _i\psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i)\}^{z_{ij}}. \end{aligned}$$
(12)

Let \({\varvec{\theta }}_i=(\pi _i,{\varvec{\xi }}_i,{\varvec{\varOmega }}_i,{\varvec{D}}_i,{\varvec{\lambda }}_i,\nu _i)\) be the parameter vector belonging to the i-th component, and \({\varvec{\varTheta }}=\{{\varvec{A}},{\varvec{\theta }}_1,\ldots ,{\varvec{\theta }}_g\}\) the entire set of unknown parameters to be estimated. According to (7), the complete-data log-likelihood function is

$$\begin{aligned} \ell _c({\varvec{\varTheta }}\mid {\varvec{y}}_c)= & {} \sum _{i=1}^g\sum _{j=1}^n z_{ij}\bigg \{\log \pi _i-\frac{1}{2}\log |{\varvec{D}}_i| -\frac{\tau _j}{2}({\varvec{y}}_j-{\varvec{A}}{\varvec{U}}_{ij})^{\top }{\varvec{D}}_i^{-1}({\varvec{y}}_j-{\varvec{A}}{\varvec{U}}_{ij})\nonumber \\&-\,\frac{1}{2}\log |{\varvec{\varOmega }}_i|-\frac{\tau _j}{2}({\varvec{U}}_{ij}-{\varvec{\xi }}_i-{\varvec{\lambda }}_i\gamma _j)^{\top }{\varvec{\varOmega }}_i^{-1}({\varvec{U}}_{ij}-{\varvec{\xi }}_i-{\varvec{\lambda }}_i\gamma _j)\nonumber \\&-\,\log {\varGamma }\left( \frac{\nu _i}{2}\right) +\frac{\nu _i}{2}\log \left( \frac{\nu _i}{2}\right) +\frac{\nu _i}{2}\log \tau _j-\frac{\nu _i}{2}\tau _j\bigg \}. \end{aligned}$$

To evaluate the Q-function, defined as \(Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})=E\big [\ell _c({\varvec{\varTheta }}\mid {\varvec{y}}_c)\mid {\varvec{y}},\hat{{\varvec{\varTheta }}}^{(k)}\big ]\), we first define the following conditional expectations:

$$\begin{aligned} {\hat{z}}_{ij}^{(k)}= & {} P\left( Z_{ij}=1\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)}\right) ,\quad {\hat{\tau }}_{ij}^{(k)}=E\left( \tau _j\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\nonumber \\ {{\hat{\kappa }}_{ij}}^{(k)}= & {} E\left( \log \tau _j\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\quad {{\hat{s}}_{1ij}}^{(k)}=E\left( \tau _j\gamma _j\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\nonumber \\ {{\hat{s}}_{2ij}}^{(k)}= & {} E\left( \tau _j\gamma _j^2\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\quad \hat{{\varvec{\eta }}}_{ij}^{(k)}=E\left( \tau _j{\varvec{U}}_{ij}\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,\nonumber \nonumber \\ {\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}= & {} E\left( \tau _j{\varvec{U}}_{ij}{\varvec{U}}_{ij}^{\top }\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) ,~ {\hat{{\varvec{\zeta }}}_{ij}}^{(k)}=E\left( \tau _j\gamma _j{\varvec{U}}_{ij}\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}}^{(k)},Z_{ij}=1\right) \end{aligned}$$

for \(i=1,\ldots ,g\) and \(j=1,\ldots ,n\), which can be evaluated using (9), (10) and (11).

To update the mixture parameters \({\varvec{\varTheta }}\), the ECME algorithm proceeds as follows:

  1. E-step:

    Given \({\varvec{\varTheta }}=\hat{{\varvec{\varTheta }}}^{(k)}\), calculate the Q-function, obtained as

    $$\begin{aligned} Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})= & {} \sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)} \bigg \{\log \pi _i-\frac{1}{2}\log |{\varvec{D}}_i|-\frac{1}{2}\log |{\varvec{\varOmega }}_i|-\log {\varGamma }\left( \frac{\nu _i}{2}\right) \nonumber \\&+\,\frac{\nu _i}{2}\log \left( \frac{\nu _i}{2}\right) +\frac{\nu _i}{2}({{\hat{\kappa }}_{ij}}^{(k)}-{{\hat{\tau }}_{ij}}^{(k)})-\frac{1}{2}\mathrm{tr}\big ({\varvec{D}}_i^{-1}{\varvec{\varUpsilon }}_{ij}+{\varvec{\varOmega }}_i^{-1}{\varvec{\varLambda }}_{ij}\big )\bigg \},\nonumber \\ \end{aligned}$$
    (13)

    where

    $$\begin{aligned} {\varvec{\varUpsilon }}_{ij}={\varvec{\varUpsilon }}_{ij}({\varvec{A}})={\hat{\tau }}_{ij}^{(k)}{\varvec{y}}_j{\varvec{y}}_j^{\top }-{\varvec{y}}_j\hat{{\varvec{\eta }}}_{ij}^{(k)\top }{\varvec{A}}^{\top }-{\varvec{A}}{\hat{{\varvec{\eta }}}_{ij}}^{(k)}{\varvec{y}}_j^{\top }+{\varvec{A}}{\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}{\varvec{A}}^{\top } \end{aligned}$$
    (14)

    and

    $$\begin{aligned} {\varvec{\varLambda }}_{ij}={\varvec{\varLambda }}_{ij}({\varvec{\xi }},{\varvec{\lambda }})= & {} {\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}-{\hat{{\varvec{\eta }}}_{ij}}^{(k)}{\varvec{\xi }}_i^{\top }-{\hat{{\varvec{\zeta }}}_{ij}}^{(k)}{\varvec{\lambda }}_i^{\top }-{\varvec{\xi }}_i\left( {\hat{{\varvec{\eta }}}_{ij}}^{{(k)}^{\top }}-{\hat{\tau }}_{ij}^{(k)}{\varvec{\xi }}_i^{\top }-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\lambda }}_i^{\top }\right) \nonumber \\&-\,{\varvec{\lambda }}_i\left( {\hat{{\varvec{\zeta }}}_{ij}}^{(k)^{\top }}-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\xi }}_i^{\top }-{{\hat{s}}_{2ij}}^{(k)}{\varvec{\lambda }}_i^{\top }\right) . \end{aligned}$$
    (15)
  2. CM-steps:

    Maximizing (13) with respect to \(\pi _i\), \({\varvec{\xi }}_i\), \({\varvec{\lambda }}_i\), \({\varvec{A}}\), \({\varvec{\varOmega }}_i\) and \({\varvec{D}}_i\), we obtain

    $$\begin{aligned} {\hat{\pi }}_i^{\left( k+1\right) }= & {} \frac{1}{n}\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) },\\ \hat{{\varvec{\xi }}}_i^{\left( k+1\right) }= & {} \frac{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\eta }}}_{ij}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\zeta }}}_{ij}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) }{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) ^2},\\ \hat{{\varvec{\lambda }}}_i^{\left( k+1\right) }= & {} \frac{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\zeta }}}_{ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\eta }}}_{ij}}^{\left( k\right) }\right) }{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) ^2},\\ \hat{{\varvec{A}}}^{\left( k+1\right) }= & {} \left( \sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }{\varvec{y}}_j{\hat{{\varvec{\eta }}}_{ij}}^{\left( k\right) {\top }}\right) \left( \sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\varPsi }}}_{ij}}^{\left( k\right) }\right) ^{-1},\\ {{\hat{{\varvec{\varOmega }}}_i}}^{\left( k+1\right) }= & {} \frac{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\varLambda }}}_{ij}^{\left( k+1\right) }}{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }}~~\text{ and }~~ \hat{{\varvec{D}}}_i^{\left( k+1\right) } =\frac{\mathrm{Diag}\{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\varUpsilon }}}_{ij}^{\left( k+1\right) }\}}{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }}, \end{aligned}$$

    where \(\hat{{\varvec{\varUpsilon }}}_{ij}^{(k+1)}\) and \(\hat{{\varvec{\varLambda }}}_{ij}^{(k+1)}\) are \({\varvec{\varUpsilon }}_{ij}\) and \({\varvec{\varLambda }}_{ij}\) in (14) and (15) with \({\varvec{\xi }}_i\), \({\varvec{\lambda }}_i\) and \({\varvec{A}}\) replaced by \({\hat{{\varvec{\xi }}}_i}^{(k+1)}\), \({\hat{{\varvec{\lambda }}}_i}^{(k+1)}\) and \(\hat{{\varvec{A}}}^{(k+1)}\), respectively. Moreover, when \({\varvec{D}}_i\)s are assumed to be the same, say \({\varvec{D}}_i={\varvec{D}}\) for all i, the updated estimator of \({\varvec{D}}\) is given by \(\hat{{\varvec{D}}}^{(k+1)}=n^{-1}\mathrm{Diag}\{\sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{(k)}\hat{{\varvec{\varUpsilon }}}_{ij}^{(k+1)}\}.\) The proof of the updated estimators is sketched in “Appendix C”.

  3. CML-step:

    In light of (12), the updated estimate of \(\nu _i\) is obtained by maximizing the corresponding constrained actual log-likelihood function, that is,

    $$\begin{aligned} {{\hat{\nu }}}_i^{(k+1)}=\arg \max _{\nu _i}\bigg \{\sum _{j=1}^n{\hat{z}}^{(k+1)}_{ij}\log \Big (\psi _p({\varvec{y}}_j;\hat{{\varvec{\mu }}}_i^{(k+1)},\hat{{\varvec{\varSigma }}}_i^{(k+1)},\hat{{\varvec{\alpha }}}_i^{(k+1)},\nu _i)\Big )\bigg \},\nonumber \\ \end{aligned}$$
    (16)

    for \(i=1,\ldots ,g\), where \(\hat{{\varvec{\mu }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\xi }}}_i^{(k+1)}\), \(\hat{{\varvec{\varSigma }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\varOmega }}}_i^{(k+1)}\)\(\hat{{\varvec{A}}}^{(k+1)\top }+\hat{{\varvec{D}}}_i^{(k+1)}\) and \(\hat{{\varvec{\alpha }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\lambda }}}_i^{(k+1)}\).

In the case of assuming common DOFs, say \(\nu _i=\nu \) for all i, the updated estimator of \(\nu \) is obtained by maximizing the constrained actual log-likelihood function, that is,

$$\begin{aligned} {\hat{\nu }}^{(k+1)}=\arg \max _{\nu }\bigg \{\sum _{j=1}^n\log \Big (\sum _{i=1}^g\hat{\pi }^{(k+1)}_i \psi _p({\varvec{y}}_j;\hat{{\varvec{\mu }}}_i^{(k+1)},\hat{{\varvec{\varSigma }}}_i^{(k+1)},\hat{{\varvec{\alpha }}}_i^{(k+1)},\nu )\Big )\bigg \}.\nonumber \\ \end{aligned}$$
(17)

Herein, we remark that the solutions of (16) and (17) are found by carrying out a one-dimensional search with the built-in R function optim over the box constraint (2, 200). Given an initial guess of the parameters \(\hat{{\varvec{\varTheta }}}^{(0)}\), the above ECME procedure is performed iteratively until the log-likelihood ceases to increase appreciably. The resulting ML estimates are denoted by \(\hat{{\varvec{\varTheta }}}=(\hat{{\varvec{A}}},\hat{\pi }_i,\hat{{\varvec{\xi }}}_i,\hat{{\varvec{\varOmega }}}_i,\hat{{\varvec{D}}}_i,\hat{{\varvec{\lambda }}}_i,\hat{{\varvec{\nu }}}_i,i=1,\ldots ,g)\). As a result, the posterior probability of \({\varvec{y}}_j\) belonging to the i-th component of the mixture is calculated by replacing \({\varvec{\varTheta }}\) in (9) with \(\hat{{\varvec{\varTheta }}}\), denoted by \({\hat{z}}_{ij}=P(Z_{ij}=1\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}})\). Based on the maximum a posteriori (MAP) classification rule, \({\varvec{y}}_j\) is assigned to group s if \(\max \{{\hat{z}}_{ij}\}_{i=1}^g\) occurs at \(i=s\).
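A hedged sketch of the one-dimensional search in (16) is given below; dmst() is the rMST density sketched in Sect. 2, y is the n-by-p data matrix, and z_hat_i, mu_i, Sigma_i and alpha_i are assumed to hold the current ECME quantities for the i-th component.

```r
## Sketch of the CML-step (16): update nu_i by a Brent search over (2, 200).
update_nu_i <- function(y, z_hat_i, mu_i, Sigma_i, alpha_i) {
  neg_obj <- function(nu) {
    dens <- apply(y, 1, dmst, mu = mu_i, Sigma = Sigma_i, lambda = alpha_i, nu = nu)
    -sum(z_hat_i * log(dens))
  }
  optim(par = 10, fn = neg_obj, method = "Brent", lower = 2, upper = 200)$par
}
```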

Consequently, the conditional expectation of the factor scores \({\varvec{U}}_{ij}\), given \({\varvec{y}}_{j}\) and membership of the i-th component of the mixture (i.e., \(Z_{ij}=1\)), can be estimated by \(\hat{{\varvec{u}}}_{ij}=E({\varvec{U}}_{ij}\mid {\varvec{Y}}_j={\varvec{y}}_j,Z_{ij}=1,\hat{{\varvec{\varTheta }}})\), which is given in (10) with \({\varvec{\varTheta }}\) substituted by \(\hat{{\varvec{\varTheta }}}\). Then the j-th estimated factor score corresponding to \({\varvec{y}}_j\) can be calculated as

$$\begin{aligned} \hat{{\varvec{u}}}_j=\sum _{i=1}^g {\hat{z}}_{ij}\hat{{\varvec{u}}}_{ij}, \quad j=1,\ldots ,n. \end{aligned}$$
(18)

An alternative estimator of (18) is given by

$$\begin{aligned} \hat{{\varvec{u}}}_j=\sum _{i=1}^g \text{ MAP }\{{\hat{z}}_{ij}\}\hat{{\varvec{u}}}_{ij}, \end{aligned}$$
(19)

where \(\text{ MAP }\{{\hat{z}}_{ij}\}=1\) if \(\max \{{\hat{z}}_{hj}\}_{h=1}^g\) occurs at \(h=i\), and \(\text{ MAP }\{{\hat{z}}_{ij}\}=0\) otherwise. These estimated factor scores can be used to portray the observed data in a lower-dimensional space (Baek et al. 2010; Baek and McLachlan 2011) and can be applied to feature extraction (Ueda et al. 2000).
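In R, the two estimators (18) and (19) reduce to a few lines once the n-by-g posterior matrix z_hat and an n-by-g-by-q array u_hat of component-wise scores \(\hat{{\varvec{u}}}_{ij}\) are available; both objects are assumed here purely for illustration.

```r
## Sketch of the posterior-mean scores (18) and the MAP-based scores (19).
n <- nrow(z_hat); q <- dim(u_hat)[3]
scores_soft <- sapply(1:q, function(l) rowSums(z_hat * u_hat[, , l]))  # eq. (18): n x q
map_label   <- max.col(z_hat)                                          # MAP component of each y_j
scores_hard <- t(sapply(1:n, function(j) u_hat[j, map_label[j], ]))    # eq. (19): n x q
```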

4 Practical issues from computational aspects

4.1 Initialization and stopping rules

Like other iterative procedures, the ECME algorithm may suffer from convergence difficulties such as singularity of the component covariance matrices or convergence to a spurious local maximum. To alleviate such problems, one simple strategy is to try many different initial values and select the solution that yields the highest likelihood. Different sets of initial values can be obtained by performing multiple runs of K-means clustering (Hartigan and Wong 1979) or by using random starts (McLachlan and Peel 2000), in which each sample point is randomly assigned to one of the g clusters. We recommend below a simple way of generating sensible initial values.

  1. Given initial memberships obtained by a single run of clustering through K-means, we set \(\hat{{\varvec{Z}}}_j^{(0)}=({\hat{z}}_{1j}^{(0)},\ldots ,{\hat{z}}_{gj}^{(0)})\). The initial values of the \(\pi _i\)s are

    $$\begin{aligned} {\hat{\pi }}_i^{(0)}=\frac{1}{n}\sum _{j=1}^n{\hat{z}}_{ij}^{(0)},\quad i=1,\ldots ,g. \end{aligned}$$
  2. Let \({\varvec{y}}_{(i)}\) be the collection of observations in the i-th partitioned group. We then compute factor scores for \({\varvec{y}}_{(i)}\) using the built-in R function factanal. The initial estimates \(\hat{{\varvec{\xi }}}_i^{(0)}\), \(\hat{{\varvec{\varOmega }}}_i^{(0)}\), \(\hat{{\varvec{\lambda }}}_i^{(0)}\) and \(\hat{\nu }_i^{(0)}\), for \(i=1,\ldots ,g\), are obtained by using the R package EMMIXskew (Wang et al. 2009) to fit the rMST distribution to the estimated factor scores.

  3. Perform principal component analysis (PCA) to obtain the factor loading matrix for \({\varvec{y}}_{(i)}\), denoted by \(\hat{{\varvec{B}}}^{(0)}_i\) for \(i=1,\ldots ,g\). The initial estimate of \({\varvec{A}}\) is specified as

    $$\begin{aligned} \hat{{\varvec{A}}}^{(0)}=\sum _{i=1}^g{\hat{\pi }}^{(0)}_i\hat{{\varvec{B}}}^{(0)}_i\hat{{\varvec{\varOmega }}}_i^{{(0)}^{-1/2}}. \end{aligned}$$
  4. The initial estimate of \({\varvec{D}}_i\) is a diagonal matrix formed from the diagonal elements of the sample covariance matrix of \({\varvec{y}}_{(i)}\). For the restricted case \({\varvec{D}}_i={\varvec{D}}\), the initial estimate \(\hat{{\varvec{D}}}^{(0)}\) is formed from the diagonal elements of the pooled within-cluster sample covariance matrix of \({\varvec{y}}_{(1)},\ldots ,{\varvec{y}}_{(g)}\).

Since the ECME algorithm is an iterative method, stopping rules must be specified. In our experimental studies, we adopt by default the traditional criterion that terminates the algorithm when a predefined maximum number of iterations \(k_\mathrm{max}=2\times 10^4\) is reached or when the difference between two successive log-likelihood values is less than \(10^{-6}\). Alternatively, one can use the Aitken acceleration-based stopping criterion (Aitken 1926; McLachlan and Krishnan 2008), which is at least as strict as lack of progress in the likelihood in the neighborhood of a maximum (McNicholas et al. 2010).
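One common form of the Aitken acceleration-based rule is sketched below, assuming that three successive log-likelihood values are stored in ll; convergence is declared when the asymptotic log-likelihood estimate is within tol of the current value.

```r
## Sketch of an Aitken acceleration-based stopping check.
aitken_converged <- function(ll, tol = 1e-6) {   # ll = c(l^(k-1), l^(k), l^(k+1))
  a     <- (ll[3] - ll[2]) / (ll[2] - ll[1])     # Aitken acceleration factor a^(k)
  l_inf <- ll[2] + (ll[3] - ll[2]) / (1 - a)     # asymptotic log-likelihood estimate
  is.finite(l_inf) && abs(l_inf - ll[3]) < tol
}
```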

4.2 Model selection and performance evaluation

The log-likelihood value cannot be adopted as a model selection criterion because it is a nondecreasing function of the number of components (g) and the dimension of factors (q). We use the Bayesian information criterion (BIC; Schwarz 1978) and the integrated classification likelihood (ICL; Biernacki et al. 2000) to determine the best pair of (gq) over a number of candidate models for achieving satisfactory performance (McNicholas and Murphy 2008; Lin et al. 2016). The BIC and ICL are defined as

$$\begin{aligned} \text{ BIC }=d\log n-2\ell _{\mathrm{max}}\quad \text{ and }\quad \text{ ICL }=\text{ BIC }+2\text{ ENT }(\hat{{\varvec{z}}}), \end{aligned}$$

where d is the number of free parameters, \(\ell _{\mathrm{max}}\) is the maximized log-likelihood value, and \(\text{ ENT }(\hat{{\varvec{z}}})=-\sum _{i=1}^g\sum _{j=1}^n{\hat{z}}_{ij}\log {\hat{z}}_{ij}\) is a penalty term, called the entropy, that favors well-separated mixtures. The ICL penalizes complex models more heavily and therefore selects more parsimonious models than does BIC.
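Given the maximized log-likelihood loglik_max, the number of free parameters d (one of the counts in Sect. 3.1), the sample size n and the n-by-g posterior matrix z_hat from a fitted model, both criteria are one-liners; the small guard inside the logarithm avoids log(0) for degenerate posteriors.

```r
## Illustrative computation of BIC and ICL for a fitted model.
BIC <- d * log(n) - 2 * loglik_max
ENT <- -sum(z_hat * log(pmax(z_hat, .Machine$double.eps)))  # entropy penalty ENT(z_hat)
ICL <- BIC + 2 * ENT
```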

To evaluate the clustering performance of the model-based approaches, the adjusted Rand index (ARI; Hubert and Arabie 1985) and the correct classification rate (CCR; Lee et al. 2003) are employed. The ARI typically ranges between 0 and 1, but it can be negative when the level of agreement is poorer than expected by chance, e.g., when fewer instances are correctly classified than would be expected under random assignment. The CCR also takes values between 0 and 1 and is computed as the highest proportion of correctly classified observations obtained by comparing all permutations of the MAP clustering labels with the true class labels.
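A minimal sketch of the CCR computation is given below, assuming true_lab and map_lab are integer class labels in 1,…,g; all g! relabelings of the MAP labels are scanned, which is practical for the small g used here, and the ARI can be obtained, for example, from mclust::adjustedRandIndex.

```r
## Sketch of the correct classification rate (CCR) over all label permutations.
ccr <- function(true_lab, map_lab, g = max(true_lab)) {
  perms <- function(v) {                         # all permutations of v (base R)
    if (length(v) == 1) return(list(v))
    out <- list()
    for (i in seq_along(v))
      for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
    out
  }
  max(vapply(perms(1:g), function(p) mean(p[map_lab] == true_lab), numeric(1)))
}
## ARI, e.g.: mclust::adjustedRandIndex(true_lab, map_lab)
```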

4.3 Identifiability issues

The mixture model itself suffers from a non-identifiability problem arising from the permutation of class labels in the parameter vector. This label-switching issue is often inherent in Bayesian implementations of mixture models. However, it is not a problem in practice when employing an EM-based algorithm to estimate mixture densities, since we can still determine a sequence of ML estimates that are consistent and asymptotically efficient; see McLachlan and Basford (1988).

On the other hand, there is another identifiability problem corresponding to the rotational indeterminacy of the common factor loading matrix \({\varvec{A}}\). As suggested by Baek et al. (2010), a unique solution, say \(\hat{{\varvec{A}}}^*\), can be obtained by postmultiplying \(\hat{{\varvec{A}}}\) by a nonsingular matrix such that the resulting loading matrix has orthonormal columns, i.e., \(\hat{{\varvec{A}}}^{*\top }\hat{{\varvec{A}}}^*={\varvec{I}}_q\). This can be achieved via the Cholesky decomposition: find the upper triangular matrix \({\varvec{C}}\) of order q such that \(\hat{{\varvec{A}}}^{\top }\hat{{\varvec{A}}}={\varvec{C}}^{\top }{\varvec{C}}\), which yields \(\hat{{\varvec{A}}}^*=\hat{{\varvec{A}}}{\varvec{C}}^{-1}\).
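The rotation fix just described amounts to a few lines of R; orthonormalize_A is an illustrative name.

```r
## Sketch of resolving the rotational indeterminacy of the loading matrix.
orthonormalize_A <- function(A_hat) {
  C <- chol(crossprod(A_hat))   # upper triangular C with A_hat' A_hat = C' C
  A_hat %*% solve(C)            # A_star; crossprod(A_star) is (numerically) I_q
}
```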

Related to the standard errors of the ML estimates, it would be of interest to calculate them using the empirical information matrix for \({{\varvec{\varTheta }}}\) in a manner analogous to Wang and Lin (2016). This procedure will be tackled by the authors in a future paper.

5 Simulation

We conduct two simulation experiments to demonstrate the proposed techniques. Unless otherwise stated, we consider only the case of \({\varvec{D}}_i={\varvec{D}}\) for all i in the subsequent analyses.

5.1 Experiment 1

In this experiment, to compare the accuracy of three parsimonious factor-analytic approaches for clustering and representing low-dimensional data, we generate a \(p=3\) dimensional dataset of size \(n=1000\) from a \(g=2\) component mixture of rMST distributions. The presumed mixture parameters as involved in (5) are

$$\begin{aligned} \pi _1= & {} 0.5, \quad \pi _2=0.5, \quad {\varvec{\mu }}_1=(0,0,0)^{\top }, \quad {\varvec{\mu }}_2=(1,1,3)^{\top },\\ \nu _1= & {} 4, \quad \nu _2=5, \quad {\varvec{\alpha }}_1=(-\,2,-\,5,-\,5)^{\top }, \quad {\varvec{\alpha }}_2=(-\,2,5,5)^{\top },\\ {\varvec{\varSigma }}_1= & {} \left[ \begin{array}{c@{\quad }c@{\quad }c} 4 &{} -\,1.8 &{} -\,1\\ -\,1.8 &{} 2 &{} 0.9\\ -\,1 &{} 0.9 &{} 2 \end{array} \right] \quad \text{ and } \quad {\varvec{\varSigma }}_2= \left[ \begin{array}{c@{\quad }c@{\quad }c} 4 &{} 1.8 &{} 0.8\\ 1.8 &{} 2 &{} 0.5\\ 0.8 &{} 0.5 &{} 2 \end{array} \right] . \end{aligned}$$

The MCFA, MCtFA and MCrstFA models with \(q=2\) factors and \(g=2\) components are fitted to the simulated data via the ECME algorithm. Once the parameter estimates and the corresponding factor scores are obtained under each fitted model, we can compare the clustering performance and calculate the predicted values of each observed feature vector \({\varvec{y}}_j\). As anticipated, the MCrstFA approach gives the best clustering result (\(\hbox {ARI}=0.891; \hbox {CCR}=0.972\)), followed closely by MCtFA (\(\hbox {ARI}=0.817; \hbox {CCR}=0.952\)). The MCFA has the worst performance (\(\hbox {ARI}=1.78\times 10^{-6}; \hbox {CCR}=0.51\)), indicating a lack of ability to cluster mixtures of skewed data with outliers. A cross-tabulation of the true and predicted class memberships is given in Table 1. As can be seen, the MCrstFA approach yields fewer misclassified observations and outperforms the other two approaches, namely MCtFA and MCFA.

Table 1 Cross-tabulations of true (A, B) and predicted (1, 2) class memberships for three parsimonious factor-analytic approaches for the simulated data
Fig. 2 Original observations and the predicted observations by MCFA, MCtFA, and MCrstFA

Fig. 3 Scatter plot of generated bivariate factors for each of \(g=3\) components

Figure 2 displays plots of the actual observations \({\varvec{y}}_j\) overlaid with the predicted observations \(\hat{{\varvec{y}}}_j\), calculated as \(\hat{{\varvec{y}}}_j=\hat{{\varvec{A}}}\hat{{\varvec{u}}}_j\), \((j=1,\ldots ,1000)\), where \(\hat{{\varvec{A}}}\) is the estimated projection matrix and \(\hat{{\varvec{u}}}_j\) is the estimated factor score defined in (18). As shown in Fig. 2a, the MCFA model performs poorly because it lacks a mechanism to cope with data exhibiting non-normal features. On the other hand, it is clearly observed from Fig. 2b, c that the original scattering structure of the two groups can be retrieved quite well using the MCtFA and MCrstFA approaches, although the MCtFA fit is slightly less favorable, yielding 20 more misclassified units than the MCrstFA.

5.2 Experiment 2

To further demonstrate the validity of the MCrstFA approach for handling data of higher dimensions, we perform a second simulation experiment in situations where the MCrstFA model holds exactly. In this study, data were generated from a 3-component MCrstFA model with \(q=2\), and \(p=10\) and 20. We carried out 100 Monte Carlo (MC) replications, each of sample size \(n=1500\) with equal mixing proportions, namely \(\pi _i=1/3\) for all i. The elements of the \(p\times q\) common factor loading matrix \({\varvec{A}}\) were randomly generated from N(0, 1), while the component DOFs were taken as \((\nu _1,\nu _2,\nu _3)=(4,6,9)\). The location vectors, scale-covariance matrices and skewness parameters of the component factors \({\varvec{U}}_{ij}\) are chosen as

$$\begin{aligned} {\varvec{\xi }}_1= & {} (0,2.5)^{\top }, \quad {\varvec{\xi }}_2=(-\,2.5,0)^{\top }, \quad {\varvec{\xi }}_3=(2.5,0)^{\top },\\ {\varvec{\lambda }}_1= & {} (5,5)^{\top }, \quad {\varvec{\lambda }}_2=(-\,5,-\,5)^{\top }, \quad {\varvec{\lambda }}_3=(0,0)^{\top },\\ {\varvec{\varOmega }}_1= & {} \left[ \begin{array}{c@{\quad }c} 0.1 &{} 0\\ 0 &{} 0.45 \end{array} \right] , \quad {\varvec{\varOmega }}_2= \left[ \begin{array}{c@{\quad }c} 0.45 &{} 0\\ 0 &{} 0.1 \end{array} \right] , \quad {\varvec{\varOmega }}_3= \left[ \begin{array}{c@{\quad }c} 0.45 &{} 0\\ 0 &{} 0.1 \end{array} \right] . \end{aligned}$$

Figure 3 gives an illustration of the generated bivariate factor scores based on one simulated case for each of the three components. These component factor scores are reasonably well separated and exhibit non-elliptical scattering patterns and heavy tails. The component error vectors \({\varvec{e}}_{ij}\)s were drawn independently from \(t_p({\varvec{0}},{\varvec{D}},\nu _i)\), where the diagonal elements of \({\varvec{D}}\) were randomly generated from a uniform distribution on (0.1, 0.3).

We process each of the 100 MC simulated datasets by fitting the MCFA, MCtFA and MCrstFA models. Comparisons were made on the adequacy of the overall fit in terms of BIC and ICL, and on the classification agreement between the true and predicted memberships as assessed by ARI and CCR. Table 2 lists the average values of these criteria together with the corresponding standard deviations (Std) under every scenario considered. As a guide to selecting the most plausible model, the frequencies (Freq) with which each model is preferred by these criteria are also reported. In all cases, the MCrstFA model provides better fits and clustering results than the other two approaches. In particular, the MCFA and MCtFA are seldom or never chosen by these four indices owing to their lack of robustness against skewness. We have also undertaken the simulation study with a much higher dimension, say \(p=100\), and found that the MCrstFA model still works similarly well without degradation of its performance.

Table 2 Comparison of MCFA, MCtFA and MCrstFA models for simulation based on 100 replications

6 Application to real data

We applied our method to the human liver cancer data (Chen et al. 2002), which consist of \(p=85\) gene expressions partitioned into two subpopulations. Hepatocellular carcinoma (HCC) is one of the 10 leading causes of death in the world. Chen et al. (2002) used cDNA microarrays to characterize patterns of gene expression in HCC, from which they found that the expression patterns in HCC and nontumor liver tissues (LIVER) are distinctly different from one another. The data comprise \(n=179\) tissue samples of genomic expression patterns from patients, of which 104 belong to HCC and 75 to LIVER.

Fig. 4 Boxplots for the 30 genes in the human liver cancer data. The x-coordinate indicates the order of original genes

Figure 4 depicts the boxplots of the top 30 genes having the most significant differences between the two classes, as determined by two-sample t-tests. Apparently, the distribution of each selected gene is highly skewed or has a long tail.
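The gene-screening step described above can be reproduced along the following lines, assuming X is the 179-by-85 expression matrix and group a two-level factor (HCC vs. LIVER); the variable names are illustrative.

```r
## Sketch of ranking genes by two-sample t-tests and plotting the top 30.
pvals <- apply(X, 2, function(gene) t.test(gene ~ group)$p.value)
top30 <- order(pvals)[1:30]
boxplot(X[, top30], las = 2, main = "Top 30 differentially expressed genes")
```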

Table 3 Comparison of fitting results and implied clustering versus the true membership of the human liver cancer data

We implement the two-component MCFA, MCtFA, MCrstFA and MCghstFA approaches with q ranging from 1 to 10. In the same vein as the simulation experiments, we assume \({\varvec{D}}_i={\varvec{D}}\) for all i but place no restrictions on the component DOFs. A comparison of some characteristics of the MCrstFA and MCghstFA models is summarized in Table 5. When fitting the MCghstFA model, we implement the ECM algorithm described in “Appendix D”. For clarity, Table 3 presents only the fitting results and classification agreements of each method with q ranging from 5 to 10. Judging from BIC and ICL, the best-fitting model is the MCghstFA model with \(q=8\). In terms of classification performance, however, the MCrstFA model with \(q=6\) provides the best agreement with the true group memberships (\(\hbox {ARI}=0.2427\) and \(\hbox {CCR}=0.7486\)) for this dataset. Notice that the best classifier does not necessarily give the best fit to the data. Again, the MCrstFA approach demonstrates its usefulness in clustering high-dimensional data with asymmetry and/or fat tails.

Table 4 Cross-tabulations of true and predicted (1,2) class memberships for four mixtures of common factor-analytic approaches for the human liver cancer data

Table 4 compares the best classification results obtained from the fitted MCFA (\(q=10\)), MCtFA (\(q=6\)), MCrstFA (\(q=6\)) and MCghstFA (\(q=10\)) models. We found that the number of correctly classified HCC tissues under the MCrstFA fit is larger than under the other three approaches. However, there is no obvious difference among them in predicting the class memberships of the LIVER tissues.

To visualize the clustering results in a low-dimensional space, Fig. 5 portrays the data in a 3D space using the factor scores estimated by (19). In the plot, we use the second, third and fifth factors from the fit of MCrstFA with \(q=6\) factors. The estimated factor scores in Fig. 5a, b are plotted according to the true and implied clustering labels, respectively. It can be observed from the two plots that the two clusters are inherently overlapped, so that no approach works satisfactorily in classifying these tissues. Most of the misclassified tissues, labelled by ‘+’ in Fig. 5b, appear in the overlapping area between the two clusters.

Fig. 5 Plot of the (estimated) posterior mean factor scores via the MCrstFA approach for the human liver cancer data based on a the true class labels, and b the implied clustering labels. HCC and LIVER tissues are shown with distinct plotting symbols; (+) misclassified tissues

7 Conclusion

We propose an extension of MCFA in which the component factors and errors are jointly modeled by the rMST distribution, called the MCrstFA model, as a new model-based tool for analyzing high-dimensional data with a strong degree of non-normality and multimodality. An attractive feature of the MCrstFA is that the component means, component covariance matrices and component skewness parameters are all represented in terms of common factor loadings, allowing parsimonious model fitting while preserving robustness.

We describe an analytically simple ECME procedure, developed under a five-level hierarchy, for fitting the MCrstFA. This approach enables us to project high-dimensional clustering results into a low-dimensional space by displaying the estimated factor scores. Simulation studies and a real-data example demonstrate its usefulness and flexibility in terms of model fitting and outright clustering.

The techniques presented so far are limited to the likelihood-based approach and focus on complete-data analysis. Some possible avenues for future research include building a framework to handle censored observations (Castro et al. 2015; Lachos et al. 2017) or missing values (Ouyang et al. 2004; Lin 2014; Wang et al. 2017a, b), both of which are common problems in the analysis of high-dimensional data. Although our estimating procedure is easy to implement, there is a lack of feasible guidelines for the joint determination of (g, q) within a single run of the training process. Toward this end, variational Bayes (VB) approximations (Waterhouse et al. 1996; Jordan et al. 1999; Beal 2003) have been presented as an iterative Bayesian alternative to EM-based algorithms owing to their fast and deterministic nature. An attractive feature of the VB scheme is that it allows for automated learning of parameter estimates and model selection. The VB approach has been effectively applied to Gaussian mixtures (Teschendorff et al. 2005), MFA models (Ghahramani and Beal 2000), and mixtures of normal inverse Gaussian distributions (Subedi and McNicholas 2014) for simultaneously estimating model parameters and determining the number of components. Therefore, it is worthwhile to develop a novel VB algorithm for learning the MCrstFA model. Another direction for future work is to extend the MCrstFA model to a broader family of multivariate skew distributions such as the scale mixtures of skew-normal distributions (Cabral et al. 2012; Prates et al. 2013), the multivariate canonical fundamental skew-t distributions (Arellano-Valle and Genton 2005; Lee and McLachlan 2016), and the hidden truncation hyperbolic distributions introduced very recently by Murray et al. (2017b).