Abstract
Mixtures of common t factor analyzers (MCtFA) have shown their effectiveness in robustifying mixtures of common factor analyzers (MCFA) for model-based clustering of high-dimensional data with heavy tails. However, the MCtFA model may still suffer from a lack of robustness against observations whose distributions are highly asymmetric. This paper presents a further robust extension of the MCFA and MCtFA models, called the mixture of common restricted skew-t factor analyzers (MCrstFA), obtained by assuming a restricted multivariate skew-t distribution for the common factors. The MCrstFA model can accommodate severely non-normal (skewed and leptokurtic) random phenomena while preserving the parsimony of the factor-analytic representation and allowing graphical visualization in low-dimensional plots. A computationally feasible expectation conditional maximization either algorithm is developed to carry out maximum likelihood estimation. The numbers of factors and mixture components are determined simultaneously based on common penalized likelihood criteria. The usefulness of the proposed model is illustrated with simulated and real datasets, and experimental results signify its superiority over some existing competitors.
1 Introduction
Mixtures of factor analyzers (MFA), originally introduced by Ghahramani and Hinton (1997), provide a global non-linear approach to dimension reduction via the adoption of component distributions having a factor-analytic representation for the component-covariance matrices. To substantially reduce the number of parameters in the component matrices, especially when the number of components (g) or features (p) becomes large, Baek et al. (2010) extended the MFA by using common component factor loadings, known as mixtures of common factor analyzers (MCFA), which has since become a popular tool for high-dimensional data analysis. To deal with data containing extreme values or outliers, commonly observed in microarray experiments, Baek and McLachlan (2011) presented a robust version of MCFA using multivariate Student’s-t distributed component errors and factors, called mixtures of common t-factor analyzers (MCtFA). Recently, Wang (2013, 2015) extended the MCFA and MCtFA approaches to accommodate high-dimensional data with possibly missing values.
The specification of component factors and errors in both MFA and MCFA rests on the assumption of multivariate normality for computational convenience and mathematical tractability, but the two models are highly vulnerable to outliers. Although the MCtFA model is less affected by violations of normality, it may still suffer from a lack of robustness against highly asymmetric observations. In many practical problems, the data to be analyzed may contain a group or groups of observations whose distributions are moderately or severely skewed and/or heavy-tailed. As shown in many empirical studies, a slight deviation from normality may seriously affect the estimates of the mixture parameters and subsequently lead to spurious groups as well as misleading statistical inference.
Over the past few decades, there has been growing interest in adopting more flexible parametric distributions to accommodate non-normal features such as asymmetry and longer-than-normal tails, which lead to non-zero skewness and excess kurtosis; see the monograph by Azzalini (2014) for a comprehensive overview. Lin et al. (2015) proposed a robust extension of factor analysis models based on the restricted multivariate skew-t (rMST) distribution (Pyne et al. 2009). Other related proposals include mixtures of skew-normal/t factor analyzers (Lin et al. 2016, 2018), mixtures of generalized hyperbolic (GH) factor analyzers (Tortora et al. 2016), mixtures of skew-t factor analyzers (Murray et al. 2014a), and mixtures of common skew-t factor analyzers (Murray et al. 2014b). In addition, Murray et al. (2017a) presented an extended version of MFA with the component factors and errors following the skew-t distribution considered by Sahu et al. (2003), which is referred to as the unrestricted multivariate skew-t (uMST) distribution by Lee and McLachlan (2014).
Note that the rMST and uMST distributions are not nested within each other, and they are equivalent only in the univariate case. Moreover, Sahu et al. (2003) have highlighted that the calculation of the uMST density becomes cumbersome as p increases. The computational difficulty of the uMST formulation was also pointed out by Murray et al. (2017a; Section 5). Azzalini et al. (2016) have provided a detailed comparison between the rMST and uMST distributions in terms of the merits of both distributions for data modeling. When comparing the two distributions in the context of model-based clustering, their illustrative examples indicate that “neither formulation is markedly superior and, if these results were to be taken in favor of either formulation, it would be the classical formulation”, namely the rMST distribution adopted in this paper.
Further, it is interesting to note that the skew-t distribution adopted by Murray et al. (2014a, b), arising from the family of GH distributions (Barndorff-Nielsen and Shephard 2001), is referred to as the generalized hyperbolic skew-t (GHST) distribution henceforth. Its density form is rather different from the rMST distribution and does not include the skew-normal as a limiting case (Lee and Poon 2011). The model proposed by Murray et al. (2014b) is henceforth referred to as mixtures of common generalized hyperbolic skew-t factor analyzers (MCghstFA).
In this paper, we propose an alternative skew extension of the MCtFA model based on the rMST distribution, called the mixture of common restricted skew-t factor analyzers (MCrstFA) model. This new proposal preserves resistance to the extreme non-normal effects that commonly occur in high-dimensional data. Similar to the MCFA and MCtFA models, common factor loadings are utilized for parsimoniously modeling the component-covariance matrices. To portray the observed data in a lower-dimensional space and avoid possible singularities, the scale-covariance matrices for component errors (\({\varvec{D}}_i\)) are generally assumed to be homogeneous (\({\varvec{D}}_i={\varvec{D}}\)). Under certain circumstances, \({\varvec{D}}_i\) can be relaxed to be unequal or modified to other types, such as isotropic with unequal variances or isotropic with equal variance. Lately, Wang and Lin (2017) presented a modification of MCtFA using component-specific \({\varvec{D}}_i\) and empirically demonstrated its advantage in classifying new subjects whose true group labels are unknown in advance.
The rest of the paper is structured as follows. In Sect. 2, we establish the notation and outline some preliminary properties of the rMST distribution. In Sect. 3, we present the specification of the MCrstFA model and develop a workable expectation conditional maximization either (ECME) algorithm for carrying out maximum likelihood (ML) estimation. In Sect. 4, we discuss the initialization and stopping rules, the criteria for model selection and clustering performance, and identifiability issues. In Sect. 5, we conduct two simulation studies to examine the validity of the MCrstFA model. The methodology is illustrated on a real example concerning human liver cancer data in Sect. 6. Concluding remarks and directions for future work are given in Sect. 7. Some detailed proofs and supplementary information are deferred to the appendices.
2 Notation and prerequisites
We first review the rMST distribution and study its related properties. Let \(\phi _p(\cdot ;{\varvec{\mu }},{\varvec{\varSigma }})\) be the probability density function (pdf) of a multivariate normal distribution with mean vector \({\varvec{\mu }}\) and covariance matrix \({\varvec{\varSigma }}\), denoted by \(N_p({\varvec{\mu }},{\varvec{\varSigma }})\); \({\varPhi }(\cdot )\) the cumulative distribution function (cdf) of the standard normal distribution; \(TN(\mu ,\sigma ^2;(a,b))\) the truncated normal distribution defined as a normal distribution \(N(\mu ,\sigma ^2)\) restricted to the interval (a, b); \(t_p(\cdot ;{\varvec{\mu }},{\varvec{\varSigma }},\nu )\) the pdf of a p-variate t distribution with location \({\varvec{\mu }}\), scale-covariance matrix \({\varvec{\varSigma }}\) and degrees of freedom (DOF) \(\nu \); \(g(x;\alpha ,\beta )\) the pdf of the gamma distribution, given by \(\beta ^{\alpha }x^{\alpha -1}\exp \{-\,\beta x\}/{\varGamma }(\alpha )\); \(T(\cdot ;\nu )\) the cdf of the Student’s t distribution with zero location, unit scale and DOF \(\nu \); \({\varvec{1}}_p\) a \(p\times 1\) vector with all elements being 1; \({\varvec{I}}_p\) a \(p\times p\) identity matrix; Diag\(\{\cdot \}\) a diagonal matrix made by extracting the main diagonal elements of a square matrix or by diagonalizing a vector; and \({\varvec{A}}^{1/2}\) the square root of a symmetric matrix \({\varvec{A}}\).
Following Pyne et al. (2009), a p-dimensional random vector \({\varvec{Y}}\) is said to follow the rMST distribution with location vector \({\varvec{\mu }}\in {\mathbb {R}}^p\), scale-covariance matrix \({\varvec{\varSigma }}\), skewness vector \({\varvec{\lambda }}\in {\mathbb {R}}^p\) and DOF \(\nu \in {\mathbb {R}}^{+}\), denoted as \({\varvec{Y}}\sim rST_p({\varvec{\mu }},{\varvec{\varSigma }},{\varvec{\lambda }},\nu )\), if it has the pdf:
where \({\varvec{\varOmega }}={\varvec{\varSigma }}+{\varvec{\lambda }}{\varvec{\lambda }}^{\top }\), \(\delta =({\varvec{y}}-{\varvec{\mu }})^{\top }{\varvec{\varOmega }}^{-1}({\varvec{y}}-{\varvec{\mu }})\) and \(M={\varvec{\lambda }}^{\top }{\varvec{\varOmega }}^{-1}({\varvec{y}}-{\varvec{\mu }})/(1-{\varvec{\lambda }}^{\top }{\varvec{\varOmega }}^{-1}{\varvec{\lambda }})^{1/2}\). Note that the distribution of \({\varvec{Y}}\) is reduced to \(t_p({\varvec{\mu }},{\varvec{\varSigma }},\nu )\) by setting \({\varvec{\lambda }}={\varvec{0}}\) and to \(rSN_p({\varvec{\mu }},{\varvec{\varSigma }},{\varvec{\lambda }})\) as \(\nu \rightarrow \infty \). Furthermore, the family of (1) also includes \(N_p({\varvec{\mu }},{\varvec{\varSigma }})\), obtained by letting \({\varvec{\lambda }}={\varvec{0}}\) and \(\nu \rightarrow \infty \).
Alternatively, the rMST distribution can be hierarchically represented as
where Gamma(\(\alpha ,\beta \)) stands for the gamma distribution with mean \(\alpha /\beta \). Figure 1 shows perspective plots with added contours for rMST densities under \({\varvec{\mu }}=(0,0)^\top \), \({\varvec{\varSigma }}={\varvec{I}}_2\), \(\nu =4\) and various specifications of \({\varvec{\lambda }}=(\lambda _1,\lambda _2)^{\top }\). It is clearly seen that these densities are non-elliptical and can be skewed and correlated in different directions depending on the chosen parameters. Therefore, the rMST distribution provides a flexible mechanism for adapting to more complicated data.
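The hierarchical representation above also suggests a direct way to simulate rMST draws. The following sketch (in Python rather than the R used for the paper's implementation, and restricted to a diagonal \({\varvec{\varSigma }}\) for brevity) generates observations through the latent gamma and truncated-normal variables:

```python
import math
import random

def sample_rmst(mu, lam, sigma_diag, nu, rng):
    """One draw from the rMST distribution via its hierarchical representation:
    tau ~ Gamma(nu/2, rate = nu/2), gamma | tau ~ TN(0, 1/tau; (0, inf)),
    and Y | gamma, tau ~ N_p(mu + lam * gamma, Sigma / tau).
    For brevity this sketch assumes a diagonal scale matrix Sigma."""
    tau = rng.gammavariate(nu / 2.0, 2.0 / nu)         # shape, scale = 1 / rate
    gamma = abs(rng.gauss(0.0, 1.0)) / math.sqrt(tau)  # half-normal given tau
    return [m + l * gamma + rng.gauss(0.0, math.sqrt(s / tau))
            for m, l, s in zip(mu, lam, sigma_diag)]

rng = random.Random(42)
draws = [sample_rmst([0.0, 0.0], [3.0, 3.0], [1.0, 1.0], 4.0, rng)
         for _ in range(4000)]
# a positive skewness vector shifts mass toward larger values
mean1 = sum(y[0] for y in draws) / len(draws)
```

With \(\lambda _1=3\), the sample mean of the first coordinate lands well above the location value of 0, reflecting the skewing effect of the latent half-normal variable.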
3 Methodology
3.1 Model formulation
Suppose that \({\varvec{Y}}=({\varvec{Y}}_1,\ldots ,{\varvec{Y}}_n)\) forms a random sample of size n in which each \({\varvec{Y}}_j=(Y_{j1},\ldots ,Y_{jp})^{\top }\) is a p-dimensional vector of feature variables. Suppose further that these samples come independently from g distinct subgroups in a heterogeneous population. The MCrstFA model for each \({\varvec{Y}}_j\) is
for \(j=1,\ldots ,n\), where \({\varvec{A}}\) is a \(p\times q\) matrix of common factor loadings, \({\varvec{U}}_{ij}\) is a q-dimensional (\(q < p\)) vector of component factors, \({\varvec{e}}_{ij}\) is a p-dimensional vector of component errors, and \(\pi _i\)s are the mixing proportions subject to \(\sum _{i=1}^g \pi _i=1\).
Furthermore, we assume that \({\varvec{U}}_{ij}\) and \({\varvec{e}}_{ij}\) are jointly distributed as
where \({\varvec{\xi }}_i\) is a q-dimensional location vector, \({\varvec{\varOmega }}_i\) is a \(q\times q\) positive-definite scale covariance matrix, \({\varvec{\lambda }}_i \in {\mathbb {R}}^q\) is a skewness vector, \({\varvec{D}}_i\) is a \(p\times p\) positive diagonal matrix, and \(\nu _i\) is the DOF. The specifications of \({\varvec{D}}_i\) and \(\nu _i\) in (4) can be either constrained to be equal or allowed to vary among components.
Based on (3) along with assumption (4), the pdf of \({\varvec{Y}}_j\) is
where
and \(\psi _p({\varvec{y}}_j;{\varvec{\mu }}_i,{\varvec{\varSigma }}_i,{\varvec{\alpha }}_i,\nu _i)\) is the rMST density function defined in (1). Notice that the representations in (6) cannot be uniquely determined because they remain unchanged if the common factor loading matrix \({\varvec{A}}\) is postmultiplied by any nonsingular matrix. Thus, we must impose \(q^2\) constraints to achieve identifiability of \({\varvec{A}}\). As a result, the number of free parameters in the MCrstFA is
If \({\varvec{D}}_i\)s are constrained to be homogeneous across components, the number of parameters is
and if component DOFs are further assumed to be identical, the resulting number of parameters is
We remark that, compared with MCFA and MCtFA, the number of parameters in MCrstFA is increased only by the gq additional parameters involved in the \({\varvec{\lambda }}_i\), so skewness is accommodated without adding much complexity.
To indicate the class membership of observation \({\varvec{y}}_j\), we introduce allocation variables \({\varvec{Z}}_j=(Z_{1j},\ldots ,Z_{gj})^{\top }\), defined as
Thus, we have \({\varvec{Z}}_j{\mathop {\sim }\limits ^\mathrm{iid}}{{\mathscr {M}}}(1;\pi _1,\ldots ,\pi _g)\), meaning a multinomial distribution with g possible outcomes which can occur in a single trial, where \(\pi _i=\Pr (Z_{ij}=1)\) can be regarded as the prior probability of \({\varvec{y}}_j\) belonging to the ith component.
According to (2) and (3), the MCrstFA model can be formulated by a five-level hierarchical representation:
By Bayes’ rule, it suffices to derive the following conditional distributions, the proofs of which are sketched in “Appendix A”. Specifically,
where \({\varvec{\beta }}_i={\varvec{\varSigma }}_i^{-1}{\varvec{A}}{\varvec{\varOmega }}_i\), \(\delta _{ij}=({\varvec{y}}_j-{\varvec{\mu }}_i)^{\top }{\varvec{V}}_i^{-1}({\varvec{y}}_j-{\varvec{\mu }}_i)\), and \(M_{ij}=h_{ij}/\sigma _i\) with \({\varvec{V}}_i={\varvec{\varSigma }}_i+{\varvec{\alpha }}_i{\varvec{\alpha }}_i^{\top }\), \(h_{ij}={\varvec{\alpha }}_i^{\top }{\varvec{V}}_i^{-1}({\varvec{y}}_j-{\varvec{\mu }}_i)\) and \(\sigma _i^2=1-{\varvec{\alpha }}_i^{\top }{\varvec{V}}_i^{-1}{\varvec{\alpha }}_i\). Moreover,
To simplify the notation, we define \({\varvec{b}}_{ij}={\varvec{\xi }}_i+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i)\) and \(c_{ij}(r)=\{(\nu _i+p+r)/(\nu _i+\delta _{ij})\}^{1/2}\) for \(r=-2,0,2\), and let “\(|\cdots \)” represent conditioning on \({\varvec{Y}}_j={\varvec{y}}_j\) and \(Z_{ij}=1\). The following proposition summarizes some essential conditional expectations for implementing the ECME algorithm described in the next subsection.
Proposition 1
Considering the posterior distributions given in (8), we establish the following conditional expectations:
and
where \(f_{\nu _i}(x)\) is defined by (B.10).
Proof
The results follow directly from some fundamental matrix manipulations and the law of iterated expectations. See “Appendix B” for more details. \(\square \)
3.2 Parameter estimation via the ECME algorithm
The EM algorithm (Dempster et al. 1977) is a popular iterative method for finding ML estimates when the data are incomplete or the model contains latent variables. The main advantage of EM lies in its monotone convergence without sacrificing simplicity. One common limitation of the EM algorithm is that the M-step may yield no closed-form estimators of the parameters. To overcome this weakness, Meng and Rubin (1993) proposed the expectation conditional maximization (ECM) algorithm, which replaces the M-step of EM with several computationally simpler CM-steps, each of which maximizes the expected complete-data log-likelihood function (known as the Q-function) sequentially. Importantly, the authors also showed that the ECM algorithm preserves all the desirable properties of EM. In certain situations, some of the CM-steps of ECM may be computationally intractable. Liu and Rubin (1994) extended the ECM algorithm with CM-steps that maximize either the Q-function, called the CMQ-step, or the corresponding constrained actual log-likelihood function, called the CML-step. The resulting method is referred to as the ECME algorithm.
For notational simplicity, we denote the observed data by \({\varvec{y}}=({\varvec{y}}_1,\ldots ,{\varvec{y}}_n)\), the allocation indicators by \({\varvec{Z}}=({\varvec{z}}_1,\ldots ,{\varvec{z}}_n)\), the latent factors by \({\varvec{U}}=({\varvec{U}}_1,\ldots ,{\varvec{U}}_n)\), the hidden variables by \({\varvec{\gamma }}=(\gamma _1,\ldots ,\gamma _n)\) and the scaling weight variables by \({\varvec{\tau }}=(\tau _1,\ldots ,\tau _n)\). Therefore, the complete data \({\varvec{y}}_c\) comprise the observed data \({\varvec{y}}\) together with the missing data \({\varvec{y}}_m=({\varvec{Z}},{\varvec{U}},{\varvec{\gamma }},{\varvec{\tau }})\). From (5), it is readily seen that
Therefore, the joint pdf of \(({\varvec{Y}},{\varvec{Z}})\) is
Let \({\varvec{\theta }}_i=(\pi _i,{\varvec{\xi }}_i,{\varvec{\varOmega }}_i,{\varvec{D}}_i,{\varvec{\lambda }}_i,\nu _i)\) be the parameter vector belonging to the i-th component, and \({\varvec{\varTheta }}=\{{\varvec{A}},{\varvec{\theta }}_1,\ldots ,{\varvec{\theta }}_g\}\) the entire set of unknown parameters to be estimated. According to (7), the complete-data log-likelihood function is
To evaluate the Q-function, defined as \(Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})=E\big [\ell _c({\varvec{\varTheta }}\mid {\varvec{y}}_c)\mid {\varvec{y}},\hat{{\varvec{\varTheta }}}^{(k)}\big ]\), we first define the following conditional expectations:
for \(i=1,\ldots ,g\) and \(j=1,\ldots ,n\), which can be evaluated using (9), (10) and (11).
To update the mixture parameters \({\varvec{\varTheta }}\), the ECME algorithm proceeds as follows:
-
E-step:
Given \({\varvec{\varTheta }}=\hat{{\varvec{\varTheta }}}^{(k)}\), calculate the Q-function, obtained as
$$\begin{aligned} Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})= & {} \sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)} \bigg \{\log \pi _i-\frac{1}{2}\log |{\varvec{D}}_i|-\frac{1}{2}\log |{\varvec{\varOmega }}_i|-\log {\varGamma }\left( \frac{\nu _i}{2}\right) \nonumber \\&+\,\frac{\nu _i}{2}\log \left( \frac{\nu _i}{2}\right) +\frac{\nu _i}{2}({{\hat{\kappa }}_{ij}}^{(k)}-{{\hat{\tau }}_{ij}}^{(k)})-\frac{1}{2}\mathrm{tr}\big ({\varvec{D}}_i^{-1}{\varvec{\varUpsilon }}_{ij}-{\varvec{\varOmega }}_i^{-1}{\varvec{\varLambda }}_{ij}\big )\bigg \},\nonumber \\ \end{aligned}$$(13)where
$$\begin{aligned} {\varvec{\varUpsilon }}_{ij}={\varvec{\varUpsilon }}_{ij}({\varvec{A}})={\hat{\tau }}_{ij}^{(k)}{\varvec{y}}_j{\varvec{y}}_j^{\top }-{\varvec{y}}_j\hat{{\varvec{\eta }}}_{ij}^{(k)\top }{\varvec{A}}^{\top }-{\varvec{A}}{\hat{{\varvec{\eta }}}_{ij}}^{(k)}{\varvec{y}}_j^{\top }+{\varvec{A}}{\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}{\varvec{A}}^{\top } \end{aligned}$$(14)and
$$\begin{aligned} {\varvec{\varLambda }}_{ij}={\varvec{\varLambda }}_{ij}({\varvec{\xi }},{\varvec{\lambda }})= & {} {\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}-{\hat{{\varvec{\eta }}}_{ij}}^{(k)}{\varvec{\xi }}_i^{\top }-{\hat{{\varvec{\zeta }}}_{ij}}^{(k)}{\varvec{\lambda }}_i^{\top }-{\varvec{\xi }}_i\left( {\hat{{\varvec{\eta }}}_{ij}}^{{(k)}^{\top }}-{\hat{\tau }}_{ij}^{(k)}{\varvec{\xi }}_i^{\top }-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\lambda }}_i^{\top }\right) \nonumber \\&-\,{\varvec{\lambda }}_i\left( {\hat{{\varvec{\zeta }}}_{ij}}^{(k)^{\top }}-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\xi }}_i^{\top }-{{\hat{s}}_{2ij}}^{(k)}{\varvec{\lambda }}_i^{\top }\right) . \end{aligned}$$(15) -
CM-steps:
Maximizing (13) with respect to \(\pi _i\), \({\varvec{\xi }}_i\), \({\varvec{\lambda }}_i\), \({\varvec{A}}\), \({\varvec{\varOmega }}_i\) and \({\varvec{D}}_i\), we obtain
$$\begin{aligned} {\hat{\pi }}_i^{\left( k+1\right) }= & {} \frac{1}{n}\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) },\\ \hat{{\varvec{\xi }}}_i^{\left( k+1\right) }= & {} \frac{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\eta }}}_{ij}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\zeta }}}_{ij}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) }{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) ^2},\\ \hat{{\varvec{\lambda }}}_i^{\left( k+1\right) }= & {} \frac{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\zeta }}}_{ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\eta }}}_{ij}}^{\left( k\right) }\right) }{\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{\tau }}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) -\left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) ^2},\\ \hat{{\varvec{A}}}^{\left( k+1\right) }= & {} \left( \sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }{\varvec{y}}_j{\hat{{\varvec{\eta }}}_{ij}}^{\left( k\right) {\top }}\right) \left( \sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\varPsi }}}_{ij}}^{\left( k\right) }\right) ^{-1},\\ {{\hat{{\varvec{\varOmega }}}_i}}^{\left( k+1\right) }= & {} \frac{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\varLambda }}}_{ij}^{\left( k+1\right) }}{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }}~~\text{ and }~~ \hat{{\varvec{D}}}_i^{\left( k+1\right) } =\frac{\mathrm{Diag}\{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }\hat{{\varvec{\varUpsilon }}}_{ij}^{\left( k+1\right) }\}}{\sum _{j=1}^n {{\hat{z}}_{ij}}^{\left( k\right) }}, \end{aligned}$$where \(\hat{{\varvec{\varUpsilon }}}_{ij}^{(k+1)}\) and \(\hat{{\varvec{\varLambda }}}_{ij}^{(k+1)}\) are \({\varvec{\varUpsilon }}_{ij}\) and \({\varvec{\varLambda }}_{ij}\) in (14) and (15) with \({\varvec{\xi }}_i\), \({\varvec{\lambda }}_i\) and \({\varvec{A}}\) replaced by \({\hat{{\varvec{\xi }}}_i}^{(k+1)}\), \({\hat{{\varvec{\lambda }}}_i}^{(k+1)}\) and \(\hat{{\varvec{A}}}^{(k+1)}\), respectively. Moreover, when \({\varvec{D}}_i\)s are assumed to be the same, say \({\varvec{D}}_i={\varvec{D}}\) for all i, the updated estimator of \({\varvec{D}}\) is given by \(\hat{{\varvec{D}}}^{(k+1)}=n^{-1}\mathrm{Diag}\{\sum _{i=1}^g\sum _{j=1}^n {{\hat{z}}_{ij}}^{(k)}\hat{{\varvec{\varUpsilon }}}_{ij}^{(k+1)}\}.\) The proof of the updated estimators is sketched in “Appendix C”.
-
CML-step:
In light of (12), the updated estimator of \(\nu _i\) can be obtained by solving the following equations:
$$\begin{aligned} {{\hat{\nu }}}_i^{(k+1)}=\arg \max _{\nu _i}\bigg \{\sum _{j=1}^n{\hat{z}}^{(k+1)}_{ij}\log \Big (\psi _p({\varvec{y}}_j;\hat{{\varvec{\mu }}}_i^{(k+1)},\hat{{\varvec{\varSigma }}}_i^{(k+1)},\hat{{\varvec{\alpha }}}_i^{(k+1)},\nu _i)\Big )\bigg \},\nonumber \\ \end{aligned}$$(16)for \(i=1,\ldots ,g\), where \(\hat{{\varvec{\mu }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\xi }}}_i^{(k+1)}\), \(\hat{{\varvec{\varSigma }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\varOmega }}}_i^{(k+1)}\)\(\hat{{\varvec{A}}}^{(k+1)\top }+\hat{{\varvec{D}}}_i^{(k+1)}\) and \(\hat{{\varvec{\alpha }}}_i^{(k+1)}=\hat{{\varvec{A}}}^{(k+1)}\hat{{\varvec{\lambda }}}_i^{(k+1)}\).
In the case of assuming common DOFs, say \(\nu _i=\nu \) for all i, the updated estimator of \(\nu \) is obtained by maximizing the constrained actual log-likelihood function, that is,
Herein, we remark that the solutions of (16) and (17) involve carrying out a one-dimensional search using the built-in R function optim over the box constraint (2, 200). Given an initial guess of parameters \(\hat{{\varvec{\varTheta }}}^{(0)}\), the above ECME procedure is performed iteratively until the log-likelihood is maximized. The resulting ML estimates are denoted by \(\hat{{\varvec{\varTheta }}}=(\hat{{\varvec{A}}},\hat{\pi }_i,\hat{{\varvec{\xi }}}_i,\hat{{\varvec{\varOmega }}}_i,\hat{{\varvec{D}}}_i,\hat{{\varvec{\lambda }}}_i,\hat{\nu }_i,i=1,\ldots ,g)\). As a result, the posterior probability of \({\varvec{y}}_j\) belonging to the i-th component of the mixture is calculated by replacing \({\varvec{\varTheta }}\) in (9) with \({\varvec{\varTheta }}=\hat{{\varvec{\varTheta }}}\), denoted by \({\hat{z}}_{ij}=P(Z_{ij}=1\mid {\varvec{y}}_j,\hat{{\varvec{\varTheta }}})\). Based on the maximum a posteriori (MAP) classification rule, \({\varvec{y}}_j\) is assigned to group s if \(\max \{{\hat{z}}_{ij}\}_{i=1}^g\) occurs at \(i=s\).
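To make the CM- and CML-steps concrete, the sketch below implements the closed-form updates for the mixing proportion, location and skewness in the scalar-factor case (q = 1), together with a plain grid search over the box (2, 200) standing in for the one-dimensional optim call. The E-step quantities are taken as given, and a univariate Student-t profile replaces the full rMST objective purely for illustration; this is a minimal pedagogical stand-in, not the paper's R implementation:

```python
import math

def cm_updates_scalar(z, tau, s1, s2, eta, zeta):
    """Closed-form CM updates for pi_i, xi_i and lambda_i when q = 1.
    Inputs are the per-observation E-step quantities z_ij, tau_ij, s1_ij,
    s2_ij, eta_ij, zeta_ij (j = 1..n) from Proposition 1."""
    n = len(z)
    Stau = sum(zj * x for zj, x in zip(z, tau))
    Ss1 = sum(zj * x for zj, x in zip(z, s1))
    Ss2 = sum(zj * x for zj, x in zip(z, s2))
    Seta = sum(zj * x for zj, x in zip(z, eta))
    Szeta = sum(zj * x for zj, x in zip(z, zeta))
    den = Stau * Ss2 - Ss1 ** 2
    pi_new = sum(z) / n
    xi_new = (Seta * Ss2 - Szeta * Ss1) / den
    lam_new = (Stau * Szeta - Ss1 * Seta) / den
    return pi_new, xi_new, lam_new

def log_t_pdf(x, nu):
    # log-density of a standard univariate Student-t with nu DOF
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1.0) / 2.0 * math.log1p(x * x / nu))

def update_nu(residuals, weights):
    """CML-step stand-in: maximize the z-weighted log-likelihood over nu
    on the box (2, 200) by a plain grid search."""
    grid = [2.0 + 0.5 * k for k in range(1, 397)]   # 2.5, 3.0, ..., 200.0
    return max(grid, key=lambda nu: sum(w * log_t_pdf(x, nu)
                                        for x, w in zip(residuals, weights)))
```

By construction, feeding `cm_updates_scalar` E-step quantities that satisfy the normal equations exactly recovers the underlying location and skewness, and `update_nu` returns a smaller DOF for residuals containing gross outliers than for light-tailed residuals.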
Consequently, the conditional expectation of the factor scores \({\varvec{U}}_{ij}\) given \({\varvec{y}}_{j}\) and membership of the i-th component (i.e., \(Z_{ij}=1\)) can be estimated by \(\hat{{\varvec{u}}}_{ij}=E({\varvec{U}}_{ij}\mid {\varvec{Y}}_j={\varvec{y}}_j,Z_{ij}=1,\hat{{\varvec{\varTheta }}})\), which is given in (10) with \({\varvec{\varTheta }}\) substituted by \(\hat{{\varvec{\varTheta }}}\). Then, the estimated factor scores corresponding to \({\varvec{y}}_j\) can be calculated as
An alternative estimator of (18) is given by
where \(\text{ MAP }\{{\hat{z}}_{ij}\}=1\) if \(\max \{{\hat{z}}_{hj}\}_{h=1}^g\) occurs at \(h=i\), and \(\text{ MAP }\{{\hat{z}}_{ij}\}=0\) otherwise. These estimated factor scores can be used to portray the observed data in a lower-dimensional space (Baek et al. 2010; Baek and McLachlan 2011) and can be applied to feature extraction (Ueda et al. 2000).
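In code, the two estimators amount to a posterior-weighted average of the component-wise conditional scores versus keeping only the MAP component's score. This minimal Python sketch assumes (following the MCFA convention of Baek et al. 2010) that (18) weights by the posterior probabilities, and that the conditional scores have already been computed:

```python
def factor_scores(z_hat, u_hat):
    """Factor-score estimators for one observation y_j.
    z_hat[i] is the posterior probability z_ij; u_hat[i] is the conditional
    score vector E(U_ij | y_j, Z_ij = 1).  Returns the posterior-weighted
    estimator, as in (18), and the MAP-based alternative, as in (19)."""
    q = len(u_hat[0])
    weighted = [sum(z * u[r] for z, u in zip(z_hat, u_hat)) for r in range(q)]
    s = max(range(len(z_hat)), key=lambda i: z_hat[i])   # MAP component label
    return weighted, list(u_hat[s])
```

With nearly crisp posterior probabilities the two estimators coincide; they differ most for observations lying between components.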
4 Practical issues from computational aspects
4.1 Initialization and stopping rules
Like other iterative procedures, the ECME algorithm may suffer from convergence difficulties such as singularity of the component covariance matrices or convergence to spurious local maxima. To alleviate such problems, one simple strategy is to try many different initial values and select the solution that provides the highest likelihood. Different sets of initial values can be obtained by performing K-means clustering (Hartigan and Wong 1979) multiple times or by using random starts (McLachlan and Peel 2000), in which each sample point is randomly assigned to one of the g clusters. We recommend below a simple way of generating sensible initial values.
-
1.
Given initial memberships obtained by a single run of clustering through K-means, we set \(\hat{{\varvec{Z}}}_j^{(0)}=({\hat{z}}_{1j}^{(0)},\ldots ,{\hat{z}}_{gj}^{(0)})\). The initial values of \(\pi _i\)s are
$$\begin{aligned} {\hat{\pi }}_i^{(0)}=\frac{1}{n}\sum _{j=1}^n{\hat{z}}_{ij}^{(0)},\quad i=1,\ldots ,g. \end{aligned}$$ -
2.
Let \({\varvec{y}}_{(i)}\) be the collection of the i-th partitioned group. After that, we compute factor scores using the R built-in factanal function. The initial estimates of \(\hat{{\varvec{\xi }}}_i^{(0)}\), \(\hat{{\varvec{\varOmega }}}_i^{(0)}\), \(\hat{{\varvec{\lambda }}}_i^{(0)}\) and \(\hat{\nu }_i^{(0)}\), for \(i=1,\ldots ,g\), are obtained by implementing R EMMIXskew package (Wang et al. 2009) for fitting the rMST distribution to the estimated factor scores.
-
3.
Perform the principal components analysis (PCA) method to obtain the factor loading matrix for \({\varvec{y}}_{(i)}\), denoted by \(\hat{{\varvec{B}}}^{(0)}_i\) for \(i=1,\ldots ,g\). The initial estimate of \({\varvec{A}}\) is specified as
$$\begin{aligned} \hat{{\varvec{A}}}^{(0)}=\sum _{i=1}^g{\hat{\pi }}^{(0)}_i\hat{{\varvec{B}}}^{(0)}_i\hat{{\varvec{\varOmega }}}_i^{{(0)}^{-1/2}}. \end{aligned}$$ -
4.
The initial estimate of \({\varvec{D}}_i\) is obtained as a diagonal matrix formed from the diagonal elements of the sample covariance matrix of \({\varvec{y}}_{(i)}\). For the restricted case of \({\varvec{D}}_i={\varvec{D}}\), the initial estimate \(\hat{{\varvec{D}}}^{(0)}\) is formed as the diagonal elements of the pooled within-cluster sample covariance matrix of \({\varvec{y}}_{(1)},\ldots ,{\varvec{y}}_{(g)}\).
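Steps 1 and 4, which require no external fitting routines, can be sketched as follows. Dividing the pooled within-cluster sum of squares by n is one convention (n − g gives an unbiased version); the function and its inputs are illustrative, not part of the paper's R code:

```python
def init_pi_and_D(labels, y, g):
    """Initialization Steps 1 and 4: mixing proportions from the initial hard
    memberships, and (for the restricted case D_i = D) the diagonal of the
    pooled within-cluster sample covariance.  y is an n x p list of lists,
    labels holds cluster indices 0..g-1 from a single K-means run."""
    n, p = len(y), len(y[0])
    pi0 = [sum(1 for l in labels if l == i) / n for i in range(g)]
    D0 = [0.0] * p                       # pooled within-cluster sums of squares
    for i in range(g):
        idx = [j for j, l in enumerate(labels) if l == i]
        if not idx:
            continue
        for r in range(p):
            m = sum(y[j][r] for j in idx) / len(idx)
            D0[r] += sum((y[j][r] - m) ** 2 for j in idx)
    D0 = [d / n for d in D0]             # or divide by n - g for unbiasedness
    return pi0, D0
```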
Since the ECME algorithm is an iterative method, stopping rules must be specified. In our experimental studies, we adopt by default the traditional criterion of terminating the algorithm when a predefined maximum number of iterations \(k_\mathrm{max}=2\times 10^4\) is reached or when the difference between two successive log-likelihood values is less than \(10^{-6}\). Alternatively, one can use the Aitken acceleration-based stopping criterion (Aitken 1926; McLachlan and Krishnan 2008), which is at least as strict as lack of progress in the likelihood in the neighborhood of a maximum (McNicholas et al. 2010).
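A sketch of the Aitken acceleration-based rule follows. Note that published variants differ in whether the extrapolated maximum is compared with the current or the previous log-likelihood value; this version uses the latest one:

```python
def aitken_converged(logliks, tol=1e-6):
    """Aitken acceleration-based stopping rule: from the last three
    log-likelihood values l(k-1), l(k), l(k+1), extrapolate the asymptotic
    maximum l_inf and stop once l_inf - l(k+1) falls below tol."""
    l0, l1, l2 = logliks[-3:]
    denom = l1 - l0
    if abs(denom) < 1e-300:
        return True                      # likelihood has stalled completely
    a = (l2 - l1) / denom                # Aitken acceleration factor
    if a >= 1.0:
        return False                     # no reliable extrapolation yet
    l_inf = l1 + (l2 - l1) / (1.0 - a)   # estimated asymptotic maximum
    return 0.0 <= l_inf - l2 < tol
```

For a geometrically converging sequence the rule fires once the extrapolated gap, not merely the last increment, drops below the tolerance.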
4.2 Model selection and performance evaluation
The log-likelihood value cannot be adopted as a model selection criterion because it is a nondecreasing function of the number of components (g) and the dimension of factors (q). We use the Bayesian information criterion (BIC; Schwarz 1978) and the integrated classification likelihood (ICL; Biernacki et al. 2000) to determine the best pair of (g, q) over a number of candidate models for achieving satisfactory performance (McNicholas and Murphy 2008; Lin et al. 2016). The BIC and ICL are defined as
where d is the number of free parameters, \(\ell _{\mathrm{max}}\) is the maximized log-likelihood value, and \(\text{ ENT }(\hat{{\varvec{z}}})=-\sum _{i=1}^g\sum _{j=1}^n{\hat{z}}_{ij}\log {\hat{z}}_{ij}\) is a penalty term, called the entropy, that favors well-separated mixtures. The ICL penalizes complex models more heavily and hence selects more parsimonious models than BIC does.
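A minimal computation of the two criteria, written here in the "smaller is better" convention (sign conventions vary across papers, so treat the signs as an assumption of this sketch):

```python
import math

def bic_icl(loglik, d, n, z_hat):
    """BIC = -2*l_max + d*log(n); ICL adds twice the entropy penalty
    ENT = -sum_ij z_ij log z_ij, which favors well-separated components.
    z_hat is the n x g matrix of posterior probabilities."""
    ent = -sum(z * math.log(z) for row in z_hat for z in row if z > 0.0)
    bic = -2.0 * loglik + d * math.log(n)
    return bic, bic + 2.0 * ent

# Crisp memberships contribute zero entropy, so ICL coincides with BIC:
bic0, icl0 = bic_icl(-1234.5, 20, 100, [[1.0, 0.0]] * 100)
```

For fuzzy memberships the entropy is strictly positive, so ICL exceeds BIC and penalizes overlapping components.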
To evaluate the clustering performance of model-based approaches, the adjusted Rand index (ARI; Hubert and Arabie 1985) and the correct classification rate (CCR; Lee et al. 2003) are employed. Typically, the ARI ranges between 0 and 1, but it can be negative when there is a poor level of agreement, e.g., when fewer instances are correctly classified than would be expected by chance. The CCR takes values between 0 and 1 and is computed as the highest proportion of correctly classified observations (equivalently, the lowest misclassification rate) over all permutations of the MAP clustering labels against the true class labels.
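Both indices are straightforward to compute directly; the CCR search over all g! relabelings is feasible for the small numbers of components considered here:

```python
import math
from collections import Counter
from itertools import permutations

def ccr(true, pred, g):
    """Correct classification rate: best agreement over all g! relabelings
    of the predicted (MAP) cluster labels."""
    best = 0
    for perm in permutations(range(g)):
        hits = sum(1 for t, p in zip(true, pred) if t == perm[p])
        best = max(best, hits)
    return best / len(true)

def ari(true, pred):
    """Adjusted Rand index of Hubert and Arabie (1985): 1 for a perfect
    match, about 0 for a random partition, possibly negative."""
    n = len(true)
    nij = Counter(zip(true, pred))              # contingency table counts
    a, b = Counter(true), Counter(pred)
    s_ij = sum(math.comb(c, 2) for c in nij.values())
    s_a = sum(math.comb(c, 2) for c in a.values())
    s_b = sum(math.comb(c, 2) for c in b.values())
    exp = s_a * s_b / math.comb(n, 2)           # expected index under chance
    return (s_ij - exp) / (0.5 * (s_a + s_b) - exp)
```

Note that both indices are invariant to relabeling: swapping all predicted labels leaves CCR and ARI at 1 for a perfect partition.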
4.3 Identifiability issues
The mixture model itself suffers from a non-identifiability problem arising from permutations of the class labels in the parameter vector. This label-switching issue is often inherent in Bayesian implementations of mixture models. However, it is not a problem in practice when employing EM-based algorithms to estimate mixture densities, since we can still determine a sequence of ML estimates that are consistent and asymptotically efficient; see McLachlan and Basford (1988).
On the other hand, there is another identifiability problem corresponding to the rotational indeterminacy of the common factor loading matrix \({\varvec{A}}\). As suggested by Baek et al. (2010), a unique solution of \({\varvec{A}}\), say \(\hat{{\varvec{A}}}^*\), can be obtained by postmultiplying \(\hat{{\varvec{A}}}\) by a nonsingular matrix such that the result is orthonormal, i.e., \(\hat{{\varvec{A}}}^{*\top }\hat{{\varvec{A}}}^*={\varvec{I}}_q\). This can be achieved by adopting the Cholesky decomposition to find the upper triangular matrix \({\varvec{C}}\) of order q such that \(\hat{{\varvec{A}}}^{\top }\hat{{\varvec{A}}}={\varvec{C}}^{\top }{\varvec{C}}\), resulting in \(\hat{{\varvec{A}}}^*=\hat{{\varvec{A}}}{\varvec{C}}^{-1}\).
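The orthonormalization can be coded in a few lines; this pure-Python sketch builds the Cholesky factor and applies the inverse by forward substitution rather than forming it explicitly:

```python
import math

def cholesky_upper(M):
    """Upper-triangular C with M = C^T C, for a small symmetric
    positive-definite q x q matrix M."""
    q = len(M)
    L = [[0.0] * q for _ in range(q)]        # lower-triangular L, M = L L^T
    for i in range(q):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return [[L[j][i] for j in range(q)] for i in range(q)]   # C = L^T

def orthonormalize(A):
    """Resolve the rotational indeterminacy of the common loading matrix:
    find upper-triangular C with A^T A = C^T C and return A* = A C^{-1},
    so that A*^T A* = I_q."""
    p, q = len(A), len(A[0])
    AtA = [[sum(A[r][i] * A[r][j] for r in range(p)) for j in range(q)]
           for i in range(q)]
    C = cholesky_upper(AtA)
    Astar = [row[:] for row in A]
    for r in range(p):                       # solve A* C = A row by row
        for j in range(q):
            Astar[r][j] = (A[r][j]
                           - sum(Astar[r][k] * C[k][j] for k in range(j))) / C[j][j]
    return Astar
```

Because C is unique for a positive-definite \(\hat{{\varvec{A}}}^{\top }\hat{{\varvec{A}}}\), the orthonormalized loadings are unique as well.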
Regarding the standard errors of the ML estimates, it would be of interest to calculate them using the empirical information matrix for \({{\varvec{\varTheta }}}\) in a manner analogous to Wang and Lin (2016). This will be tackled by the authors in a future paper.
5 Simulation
We conduct two simulation experiments to demonstrate the proposed techniques. Unless otherwise stated, we consider only the case \({\varvec{D}}_i={\varvec{D}}\) for all i in the subsequent analyses.
5.1 Experiment 1
In this experiment, to compare the accuracy of three parsimonious factor-analytic approaches for clustering and low-dimensional representation, we generate a \(p=3\) dimensional dataset of size \(n=1000\) from a \(g=2\) component mixture of rMST distributions. The presumed mixture parameters involved in (5) are
The MCFA, MCtFA and MCrstFA models with \(q=2\) factors and \(g=2\) components are fitted to the simulated data via the ECME algorithm. Once the parameter estimates and the corresponding factor scores are obtained under each fitted model, we can compare the clustering performance and calculate the predicted value of each observed feature vector \({\varvec{y}}_j\). As anticipated, the MCrstFA approach gives the best clustering result (\(\hbox {ARI}=0.891; \hbox {CCR}=0.972\)), followed closely by MCtFA (\(\hbox {ARI}=0.817; \hbox {CCR}=0.952\)). The MCFA has the worst performance (\(\hbox {ARI}=1.78\times 10^{-6}; \hbox {CCR}=0.51\)), indicating a lack of ability to cluster mixtures of skewed data with outliers. A cross-tabulation of the true and predicted class memberships is given in Table 1. As can be seen, the MCrstFA approach yields fewer misclassified observations and outperforms the other two approaches, namely MCtFA and MCFA.
Figure 2 displays plots of the actual observations \({\varvec{y}}_j\) overlaid with the predicted observations \(\hat{{\varvec{y}}}_j\), calculated as \(\hat{{\varvec{y}}}_j=\hat{{\varvec{A}}}\hat{{\varvec{u}}}_j\), \((j=1,\ldots ,1000)\), where \(\hat{{\varvec{A}}}\) is the estimated projection matrix and \(\hat{{\varvec{u}}}_j\) is the estimated factor score defined in (18). As shown in Fig. 2a, the MCFA model performs poorly because it lacks a mechanism to cope with data exhibiting non-normal features. On the other hand, it is clearly observed from Fig. 2b, c that the original scattering structure of the two groups can be retrieved quite well using the MCtFA and MCrstFA approaches, although the MCtFA is slightly less favorable owing to a somewhat poorer fit, yielding 20 more misclassified units than the MCrstFA.
5.2 Experiment 2
To further demonstrate the validity of the MCrstFA approach for handling data of higher dimensions, we perform a second simulation experiment in situations where the MCrstFA model holds exactly. In this study, data were generated from the 3-component MCrstFA model with \(q=2\) factors and \(p=10\) and 20 features. We perform 100 Monte Carlo (MC) repetitions, each with sample size \(n=1500\) and equal mixing proportions, namely \(\pi _i=1/3\) for all i. The elements of the \(p\times q\) common factor loading matrix \({\varvec{A}}\) were randomly generated from N(0, 1), while the component DOFs are taken as \((\nu _1,\nu _2,\nu _3)=(4,6,9)\). The location vectors, scale-covariance matrices and skewness parameters of the component factors \({\varvec{U}}_{ij}\) are chosen as
Figure 3 illustrates the generated bivariate factor scores based on one simulated case for each of the three components. These component factor scores appear fairly well separated and exhibit non-elliptical scattering patterns and heavy tails. The component error vectors \({\varvec{e}}_{ij}\) were drawn independently from \(t_p({\varvec{0}},{\varvec{D}},\nu _i)\), where the diagonal elements of \({\varvec{D}}\) were randomly generated from a uniform distribution on (0.1, 0.3).
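As a sketch of how such skewed, heavy-tailed component factor scores can be generated, the code below draws from a restricted multivariate skew-t distribution via its usual normal/half-normal/gamma stochastic representation (cf. Pyne et al. 2009). The parameter values are illustrative only, not those of the experiment.

```python
import numpy as np

def rmst_sample(n, xi, Omega, lam, nu, rng):
    """Draw n rMST variates via
    U = xi + lam*|Z0|/sqrt(tau) + Omega^{1/2} Z1 / sqrt(tau),
    with Z0 ~ N(0,1), Z1 ~ N(0, I_q), tau ~ Gamma(nu/2, rate nu/2)."""
    q = len(xi)
    L = np.linalg.cholesky(Omega)
    tau = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    gamma = np.abs(rng.normal(size=n)) / np.sqrt(tau)      # truncated-normal part
    normal_part = rng.normal(size=(n, q)) @ L.T / np.sqrt(tau)[:, None]
    return xi + gamma[:, None] * lam + normal_part

rng = np.random.default_rng(1)
u = rmst_sample(5000, xi=np.zeros(2), Omega=np.eye(2),
                lam=np.array([3.0, 0.0]), nu=8.0, rng=rng)
# the positive skewness parameter in the first coordinate pulls its mean positive
print(u.mean(axis=0))
```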
We process each of the 100 MC simulated datasets by fitting the MCFA, MCtFA and MCrstFA models. Comparisons were made on the overall goodness of fit in terms of BIC and ICL and on the agreement between the true and predicted memberships as assessed by ARI and CCR. Table 2 lists the average values of these criteria together with the corresponding standard deviations (Std) under every scenario considered. As a guide to selecting the most plausible model, the frequencies (Freq) with which each model is preferred by these criteria are also reported. In all cases, the MCrstFA model provides better fits and clustering results than the other two approaches. In particular, the MCFA and MCtFA are seldom or even never chosen by these four indices owing to a lack of sufficient robustness against skewness. We have also undertaken the simulation study with a much higher dimension, say \(p=100\), and found that the MCrstFA model still works similarly well without degrading its performance.
6 Application to real data
We applied our method to the human liver cancer data (Chen et al. 2002), which consist of \(p=85\) gene expressions partitioned into two subpopulations. Hepatocellular carcinoma (HCC) is one of the 10 leading causes of death in the world. Chen et al. (2002) used cDNA microarrays to characterize patterns of gene expression in HCC, finding that the expression patterns in HCC and nontumor liver tissues (LIVER) are distinctly different from one another. The data comprise \(n=179\) tissue samples from patients, of which 104 belong to HCC and 75 to LIVER.
Figure 4 depicts boxplots of the 30 genes with the most significant differences between the two classes, as identified by two-sample t-tests. Apparently, the distribution of each selected gene is highly skewed or heavy-tailed.
We implement the two-component MCFA, MCtFA, MCrstFA and MCghstFA approaches with q ranging from 1 to 10. As in the simulation experiments, we assume \({\varvec{D}}_i={\varvec{D}}\) for all i, but place no restrictions on the component DOFs. A comparison of some characteristics of the MCrstFA and MCghstFA models is summarized in Table 5. When fitting the MCghstFA model, we implement the ECM algorithm described in “Appendix D”. For clarity, Table 3 presents only the fitting results and classification agreements of each method with q ranging from 5 to 10. Judging from BIC and ICL, the best-fitting model is the MCghstFA model with \(q=8\). In terms of classification performance, however, the MCrstFA model with \(q=6\) provides the best agreement with the true group memberships (\(\hbox {ARI}=0.2427\) and \(\hbox {CCR}=0.7486\)) for this dataset. Notice that the best classifier does not necessarily give the best fit to the data. Again, the MCrstFA approach demonstrates its usefulness in clustering high-dimensional data with asymmetry and/or fat tails.
Table 4 compares the best classification results obtained from the fitted MCFA (\(q=10\)), MCtFA (\(q=6\)), MCrstFA (\(q=6\)) and MCghstFA (\(q=10\)) models. The MCrstFA fit correctly classifies more HCC tissues than the other three approaches, whereas there is no obvious difference among them in predicting the class memberships of the LIVER tissues.
To visualize the clustering results in a low-dimensional space, Fig. 5 portrays the data in a 3D space using the factor scores estimated by (19). In the plot, we use the second, third and fifth factors from the fit of MCrstFA with \(q=6\) factors. The estimated factor scores in Fig. 5a, b are plotted according to the true and implied clustering labels, respectively. The two plots show that the two clusters inherently overlap, so that no approach works satisfactorily in classifying these tissues. Most of the misclassified tissues, marked by plus symbols in Fig. 5b, appear in the overlapping area between the two clusters.
7 Conclusion
We propose an extension of MCFA in which the component factors and errors are jointly modeled by the rMST distribution, called the MCrstFA model, as a new model-based tool for analyzing high-dimensional data with a strong degree of non-normality and multimodality. An attractive feature of the MCrstFA is that the component means, component covariance matrices and component skewness parameters are all represented through common factor loadings, allowing parsimonious model fitting while preserving robustness.
We describe an analytically simple ECME procedure, developed under a five-level hierarchy, for fitting the MCrstFA. This approach enables us to project high-dimensional clustering results into a low-dimensional space by displaying the estimated factor scores. Numerical simulation studies and a real-data application demonstrate its usefulness and flexibility in terms of both model fitting and clustering.
The techniques presented so far are limited to the likelihood-based approach and focus on complete-data analysis. Some possible avenues for future research include building a framework to handle the presence of censored observations (Castro et al. 2015; Lachos et al. 2017) or the occurrence of missing values (Ouyang et al. 2004; Lin 2014; Wang et al. 2017a, b), both of which are common problems in the analysis of high-dimensional data. Although our estimating procedure is easy to implement, there is a lack of feasible guidelines for a joint determination of (g, q) within a single run of the training process. Toward this end, variational Bayes (VB) approximations (Waterhouse et al. 1996; Jordan et al. 1999; Beal 2003) have been presented as an iterative Bayesian alternative to the EM-based algorithm owing to their fast and deterministic nature. An attractive feature of the VB scheme is that it allows for automated learning of parameter estimates and model selection. The VB approach has been effectively applied to Gaussian mixtures (Teschendorff et al. 2005), MFA models (Ghahramani and Beal 2000), and mixtures of normal inverse Gaussian distributions (Subedi and McNicholas 2014) for simultaneously estimating model parameters and determining the number of components. Therefore, it is worthwhile to develop a novel VB algorithm for learning the MCrstFA model. Another inspiration for future work is to extend the MCrstFA model to a broader family of multivariate skew distributions, such as the scale mixtures of skew-normal distributions (Cabral et al. 2012; Prates et al. 2013), the multivariate canonical fundamental skew-t distributions (Arellano-Valle and Genton 2005; Lee and McLachlan 2016), and the hidden truncation hyperbolic distributions introduced very recently by Murray et al. (2017b).
References
Aitken AC (1926) On Bernoulli’s numerical solution of algebraic equations. Proc R Soc Edinb 46:289–305
Arellano-Valle RB, Genton MG (2005) On fundamental skew distributions. J Multivar Anal 96:93–116
Azzalini A (2014) The skew-normal and related families. IMS monographs series. Cambridge University Press, Cambridge
Azzalini A, Browne RP, Genton MG, McNicholas PD (2016) On nomenclature for, and the relative merits of, two formulations of skew distributions. Stat Probab Lett 110:201–206
Baek J, McLachlan GJ (2011) Mixtures of common \(t\)-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27:1269–1276
Baek J, McLachlan GJ, Flack LK (2010) Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualization of high-dimensional data. IEEE Trans Pattern Anal Mach Intell 32:1–13
Barndorff-Nielsen O, Shephard N (2001) Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics. J Roy Stat Soc Ser B 63:167–241
Beal MJ (2003) Variational algorithms for approximate Bayesian inference. Ph.D. thesis, The University of London, London, UK
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22:719–725
Cabral CR, Lachos VH, Prates MO (2012) Multivariate mixture modeling using skew-normal independent distributions. Comput Stat Data Anal 56:126–142
Castro LM, Costa DR, Prates MO, Lachos VH (2015) Likelihood-based inference for Tobit confirmatory factor analysis using the multivariate Student-\(t\) distribution. Stat Comput 25:1163–1183
Chen X, Cheung ST, So S, Fan ST, Barry C, Higgins J, Lai KM, Ji J, Dudoit S, Ng IO, Van De Rijn M, Botstein D, Brown PO (2002) Gene expression patterns in human liver cancers. Mol Biol Cell 13:1929–1939
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc B 39:1–38
Ghahramani Z, Beal M (2000) Variational inference for Bayesian mixture of factor analysers. In: Solla S, Leen T, Muller K-R (eds) Advances in neural information processing systems. MIT Press, Cambridge
Ghahramani Z, Hinton GE (1997) The EM algorithm for factor analyzers. Technical Report No. CRG-TR-96-1, The University of Toronto, Toronto
Hartigan JA, Wong MA (1979) Algorithm AS 136: a K-means clustering algorithm. J R Stat Soc C 28:100–108
Hubert LJ, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK (1999) An introduction to variational methods for graphical models. Mach Learn 37:183–233
Lachos VH, Morenoa EJL, Chen K, Cabralc CRB (2017) Finite mixture modeling of censored data using the multivariate Student-\(t\) distribution. J Multivar Anal 159:151–167
Lee SX, McLachlan GJ (2014) Finite mixtures of multivariate skew \(t\)-distributions: some recent and new results. Stat Comp 24:181–202
Lee SX, McLachlan GJ (2016) Finite mixtures of canonical fundamental skew \(t\)-distributions: the unification of the restricted and unrestricted skew \(t\)-mixture models. Stat Comp 26:573–589
Lee YW, Poon SH (2011) Systemic and systematic factors for loan portfolio loss distribution. Econometrics and applied economics workshops, pp 1–61. School of Social Science, University of Manchester
Lee WL, Chen YC, Hsieh KS (2003) Ultrasonic liver tissues classification by fractal feature vector based on M-band wavelet transform. IEEE Trans Med Imaging 22:382–392
Lin TI (2014) Learning from incomplete data via parameterized \(t\) mixture models through eigenvalue decomposition. Comput Stat Data Anal 71:183–195
Lin TI, Wu PH, McLachlan GJ, Lee SX (2015) A robust factor analysis model using the restricted skew-\(t\) distribution. TEST 24:510–531
Lin TI, McLachlan GJ, Lee SX (2016) Extending mixtures of factor models using the restricted multivariate skew-normal distribution. J Multivar Anal 143:398–413
Lin TI, Wang WL, McLachlan GJ, Lee SX (2018) Robust mixtures of factor analysis models using the restricted multivariate skew-\(t\) distribution. Stat Model 28:50–72
Liu C, Rubin DB (1994) The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 81:633–648
McLachlan GJ, Basford KE (1988) Mixture models: inference and application to clustering. Marcel Dekker, New York
McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions, 2nd edn. Wiley, New York
McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York
McNicholas PD, Murphy TB (2008) Parsimonious Gaussian mixture models. Stat Comp 18:285–296
McNicholas PD, Murphy TB, McDaid AF, Frost D (2010) Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput Stat Data Anal 54:711–723
Meng XL, Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80:267–278
Murray PM, Browne RP, McNicholas PD (2014a) Mixtures of skew-\(t\) factor analyzers. Comput Stat Data Anal 77:326–335
Murray PM, McNicholas PD, Browne RP (2014b) Mixtures of common skew-\(t\) factor analyzers. Stat 3:68–82
Murray PM, Browne RP, McNicholas PD (2017a) A mixture of SDB skew-\(t\) factor analyzers. Econom Stat 3:160–168
Murray PM, Browne RP, McNicholas PD (2017b) Hidden truncation hyperbolic distributions, finite mixtures thereof, and their application for clustering. J Multivar Anal 161:141–156
Ouyang M, Welsh W, Georgopoulos P (2004) Gaussian mixture clustering and imputation of microarray data. Bioinformatics 20:917–923
Prates MO, Cabral CR, Lachos VH (2013) mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions. J Stat Soft 54:1–20
Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, De Jager PL, Mesirov JP (2009) Automated high-dimensional flow cytometric data analysis. Proc Natl Acad Sci USA 106:8519–8524
Sahu SK, Dey DK, Branco MD (2003) A new class of multivariate skew distributions with application to Bayesian regression models. Can J Stat 31:129–150
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Subedi S, McNicholas PD (2014) Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions. Adv Data Anal Classif 8:167–193
Teschendorff A, Wang Y, Barbosa-Morais N, Brenton J, Caldas C (2005) A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics 21:3025–3033
Tortora C, McNicholas P, Browne R (2016) A mixture of generalized hyperbolic factor analyzers. Adv Data Anal Classif 10:423–440
Ueda N, Nakano R, Ghahramani Z, Hinton GE (2000) SMEM algorithm for mixture models. Neural Comput 12:2109–2128
Wang WL (2013) Mixtures of common factor analyzers for high-dimensional data with missing information. J Multivar Anal 117:120–133
Wang WL (2015) Mixtures of common \(t\)-factor analyzers for modeling high-dimensional data with missing values. Comput Stat Data Anal 83:223–235
Wang WL, Lin TI (2016) Maximum likelihood inference for the multivariate t mixture model. J Multivar Anal 149:54–64
Wang WL, Lin TI (2017) Flexible clustering via extended mixtures of common \(t\)-factor analyzers. AStA Adv Stat Anal 101:227–252
Wang K, McLachlan GJ, Ng SK, Peel D (2009) EMMIX-skew: EM algorithm for mixture of multivariate skew normal/\(t\) distributions. R package version 1.0-12
Wang WL, Castro LM, Lin TI (2017a) Automated learning of \(t\) factor analysis models with complete and incomplete data. J Multivar Anal 161:157–171
Wang WL, Liu M, Lin TI (2017b) Robust skew-\(t\) factor analysis models for handling missing data. Stat Methods Appl 26:649–672
Waterhouse S, MacKay D, Robinson T (1996) Bayesian methods for mixture of experts. In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Advances in neural information processing systems, vol 8. MIT Press, Cambridge
Acknowledgements
The authors gratefully acknowledge the Coordinating Editor, Maurizio Vichi, the Associate Editor and three anonymous referees for their comments and suggestions that greatly improved this paper. W.L. Wang and T.I. Lin would like to acknowledge the support of the Ministry of Science and Technology of Taiwan under Grant Nos. MOST 105-2118-M-035-004-MY2 and MOST 105-2118-M-005-003-MY2, respectively. L.M. Castro acknowledges support from Grant FONDECYT 1170258 from the Chilean government.
Author information
Authors and Affiliations
Corresponding author
Appendices
Appendix A: Proof of hierarchical representation (8)
It follows from (7) that
and
This gives rise to the following joint distribution:
We then have the following standard results:
and
where \({\varvec{\beta }}_i={\varvec{\varSigma }}_i^{-1}{\varvec{A}}{\varvec{\varOmega }}_i\). Using the characterization of the multivariate normal distribution, we can obtain
With similar arguments, we have
Hence, it is trivial to establish that \(\gamma _j\mid ({\varvec{y}}_j,\tau _j,Z_{ij}=1) \sim TN(h_i,\tau _j^{-1}\sigma _i^2;(0,\infty ))\). Furthermore, standard calculation gives
Appendix B: Proof of Proposition 1
(a)
Standard calculation of conditional expectation yields
$$\begin{aligned}&E(\tau _j\mid {\varvec{y}}_j,\,z_{ij}=1)=\int _{0}^{\infty }\tau _j f(\tau _j\mid {\varvec{y}}_j,\,z_{ij}=1)d\tau _j\nonumber \\&\quad =\int _{0}^{\infty }\tau _j\frac{{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) }{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }g\left( \tau _j;\frac{\nu _i+p}{2},\,\frac{\nu _i+\delta _{ij}}{2}\right) d\tau _j\nonumber \\&\quad =\frac{\left( \frac{\nu _i+p}{\nu _i+\delta _{ij}}\right) }{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }\int _{0}^{\infty }{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) g\left( \tau _j;\frac{\nu _i+p+2}{2},\,\frac{\nu _i+\delta _{ij}}{2}\right) d\tau _j\nonumber \\&\quad =\left( \frac{\nu _i+p}{\nu _i+\delta _{ij}}\right) \frac{T\left( M_{ij}\sqrt{\frac{\nu _i+p+2}{\nu _i+\delta _{ij}}};\nu _i+p+2\right) }{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }. \end{aligned}$$(B.1) -
(b)
Because \(\gamma _j\mid ({\varvec{y}}_j,\tau _j,Z_{ij}=1) \sim TN(h_i,\tau _j^{-1}\sigma _i^2;(0,\infty ))\), we obtain
$$\begin{aligned} E(\gamma _j\mid {\varvec{y}}_j,\,\tau _j,\,z_{ij}=1)=h_{ij}+\frac{\sigma _i}{\sqrt{\tau _j}}\frac{\phi \left( \sqrt{\tau _j}M_{ij}\right) }{{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) }. \end{aligned}$$(B.2) -
(c)
We first need to show
$$\begin{aligned}&E\left( \tau _j^{\frac{k}{2}}\frac{\phi \left( \sqrt{\tau _j}M_{ij}\right) }{{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) }\bigg |{\varvec{y}}_j,\,z_{ij}=1\right) \nonumber \\&\quad =\frac{1}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }\int _{0}^{\infty }\tau _j^{\frac{k}{2}}\phi \left( \sqrt{\tau _j}M_{ij}\right) g\left( \tau _j;\frac{\nu _i+p}{2},\,\frac{\nu _i+\delta _{ij}}{2}\right) d\tau _j\nonumber \\&\quad =\frac{1}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }\int _{0}^{\infty }\tau _j^{\frac{k-1}{2}}\phi \big (M_{ij};0,\,\tau _j^{-1}\big )g\left( \tau _j;\frac{\nu _i+p}{2},\,\frac{\nu _i+\delta _{ij}}{2}\right) d\tau _j\nonumber \\&\quad =\frac{{\varGamma }\left( \frac{\nu _i+p+k-1}{2}\right) \int _{0}^{\infty }\phi \left( M_{ij};0,\,\tau _j^{-1}\right) g\left( \tau _j;\frac{\nu _i+p+k-1}{2},\,\frac{\nu _i+\delta _{ij}}{2}\right) d\tau _j}{{\varGamma }\left( \frac{\nu _i+p}{2}\right) \left( \frac{\nu _i+\delta _{ij}}{2}\right) ^{\frac{k-1}{2}}T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }\nonumber \\&\quad =\frac{{\varGamma }\left( \frac{\nu _i+p+k-1}{2}\right) \sqrt{\frac{\nu _i+p+k-1}{\nu _i+\delta _{ij}}}\,t\left( M_{ij}\sqrt{\frac{\nu _i+p+k-1}{\nu _i+\delta _{ij}}};\nu _i+p+k-1\right) }{{\varGamma }\left( \frac{\nu _i+p}{2}\right) \left( \frac{\nu _i+\delta _{ij}}{2}\right) ^{\frac{k-1}{2}}T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }. \end{aligned}$$(B.3)Applying the result in (B.3) with \(k=-\,1\) and (B.2) yields
$$\begin{aligned}&E(\gamma _j\mid {\varvec{y}}_j,\,z_{ij}=1) =E[E(\gamma _j\mid {\varvec{y}}_j,\,\tau _j,\,z_{ij}=1)\mid {\varvec{y}}_j,\,z_{ij}=1]\nonumber \\&\quad =E\left[ h_{ij}+\frac{\sigma _i}{\sqrt{\tau _j}}\frac{\phi \left( \sqrt{\tau _j}M_{ij}\right) }{{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) }\bigg |{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\&\quad =h_{ij}+\sigma _iE\left( \frac{1}{\sqrt{\tau _j}}\frac{\phi \left( \sqrt{\tau _j}M_{ij}\right) }{{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) }\bigg |{\varvec{y}}_j,\,z_{ij}=1\right) \nonumber \\&\quad =h_{ij}+\frac{\sigma _i}{\sqrt{\frac{\nu _i+p-2}{\nu _i+\delta _{ij}}}}\frac{t\left( M_{ij}\sqrt{\frac{\nu _i+p-2}{\nu _i+\delta _{ij}}};\nu _i+p-2\right) }{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }. \end{aligned}$$(B.4) -
(d)
Using (B.1), (B.2) and (B.3) with \(k=1\), we have
$$\begin{aligned}&E(\tau _j\gamma _j|{\varvec{y}}_j,\,z_{ij}=1) =E[E(\tau _j\gamma _j|{\varvec{y}}_j,\,\tau _j,\,z_{ij}=1)|{\varvec{y}}_j,\,z_{ij}=1]\nonumber \\&\quad =E[\tau _jE(\gamma _j|{\varvec{y}}_j,\,\tau _j,\,z_{ij}=1)|{\varvec{y}}_j,\,z_{ij}=1]\nonumber \\&\quad =E\left[ \tau _j\left( h_{ij}+\frac{\sigma _i}{\sqrt{\tau _j}}\frac{\phi \left( \sqrt{\tau _j}M_{ij}\right) }{{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) }\right) \bigg |{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\&\quad =h_{ij} E(\tau _j|{\varvec{y}}_j,\,z_{ij}=1)+\sigma _i E\left[ \sqrt{\tau _j}\frac{\phi \left( \sqrt{\tau _j}M_{ij}\right) }{{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) }\bigg |{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\&\quad =h_{ij}\left[ \frac{\nu _i+p}{\nu _i+\delta _{ij}}\frac{T\left( M_{ij}\sqrt{\frac{\nu _i+p+2}{\nu _i+\delta _{ij}}};\nu _i+p+2\right) }{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }\right] \nonumber \\&\qquad +\,\sigma _i\left[ \sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}}\frac{t\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }\right] . \end{aligned}$$(B.5) -
(e)
Using the result of (B.2), the second moment of a truncated normal distribution is given by
$$\begin{aligned} E\left( \gamma _j^2|{\varvec{y}}_j,\,\tau _j,\,z_{ij}=1\right)= & {} h_{ij}E(\gamma _j|{\varvec{y}}_j,\,\tau _j,\,z_{ij}=1)+\frac{\sigma _i^2}{\tau _j}\nonumber \\= & {} h_{ij}\left( h_{ij}+\frac{\sigma _i}{\sqrt{\tau _j}}\frac{\phi \left( \sqrt{\tau _j}M_{ij}\right) }{{\varPhi }\left( \sqrt{\tau _j}M_{ij}\right) }\right) +\frac{\sigma _i^2}{\tau _j}. \end{aligned}$$(B.6) -
(f)
Applying the double expectation and using (B.5) and (B.6), we have
$$\begin{aligned} E\left( \tau _j\gamma _j^2|{\varvec{y}}_j,\,z_{ij}=1\right)= & {} E\left[ E\left( \tau _j\gamma _j^2|{\varvec{y}}_j,\,\tau _j,\,z_{ij}=1\right) |{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\= & {} E\left[ \tau _jE\left( \gamma _j^2|{\varvec{y}}_j,\,\tau _j,\,z_{ij}=1\right) |{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\= & {} E\left\{ \tau _j\left[ h_{ij}E(\gamma _j|{\varvec{y}}_j,\,\tau _j,\,z_{ij}=1)+\tau _j^{-1}\sigma _i^2\right] |{\varvec{y}}_j,\,z_{ij}=1\right\} \nonumber \\= & {} h_{ij} E\left( \tau _j\gamma _j|{\varvec{y}}_j,\,z_{ij}=1\right) +\sigma _i^2. \end{aligned}$$(B.7) -
(g)
Applying the double expectation and the result of (B.4), we have
$$\begin{aligned}&E({\varvec{U}}_{ij}|{\varvec{y}}_j,\,z_{ij}=1)=E\left[ E({\varvec{U}}_{ij}|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)|{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\&\quad =E\left[ {\varvec{\xi }}_i+{\varvec{\lambda }}_i\gamma _j+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i-{\varvec{\alpha }}_i\gamma _j)|{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\&\quad ={\varvec{\xi }}_i+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i)+({\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i)E(\gamma _j|{\varvec{y}}_j,\,z_{ij}=1). \end{aligned}$$(B.8) -
(h)
Applying the double expectation and using (B.1) and (B.5), we have
$$\begin{aligned}&E(\tau _j{\varvec{U}}_{ij}|{\varvec{y}}_j,\,z_{ij}=1)=E\left[ E(\tau _j{\varvec{U}}_{ij}|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)|{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\&\quad =E\left[ \tau _j E({\varvec{U}}_{ij}|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)|{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\&\quad =E\left\{ \tau _j\left[ {\varvec{\xi }}_i+{\varvec{\lambda }}_i\gamma _j+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i-{\varvec{\alpha }}_i\gamma _j)\right] |{\varvec{y}}_j,\,z_{ij}=1\right\} \nonumber \\&\quad =\left[ {\varvec{\xi }}_i+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i)\right] E(\tau _j|{\varvec{y}}_j,\,z_{ij}=1)\nonumber \\&\qquad +\,\left( {\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i\right) E(\tau _j\gamma _j|{\varvec{y}}_j,\,z_{ij}=1). \end{aligned}$$(B.9) -
(i)
Applying the double expectation and using (B.5) and (B.7), we have
$$\begin{aligned}&E\left( \tau _j\gamma _j{\varvec{U}}_{ij}|{\varvec{y}}_j,\,z_{ij}=1\right) \\&\quad =E\left[ E(\tau _j\gamma _j{\varvec{U}}_{ij}|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)|{\varvec{y}}_j,\,z_{ij}=1\right] \nonumber \\&\quad =E\left[ \tau _j\gamma _j E({\varvec{U}}_{ij}|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)|{\varvec{y}}_j,\,z_{ij}=1\right] \\&\quad =E\left\{ \tau _j\gamma _j[{\varvec{\xi }}_i+{\varvec{\lambda }}_i\gamma _j+{\varvec{\beta }}_i^\top ({\varvec{y}}_j-{\varvec{\mu }}_i-{\varvec{\alpha }}_i\gamma _j)]|{\varvec{y}}_j,\,z_{ij}=1\right\} \nonumber \\&\quad =\left[ {\varvec{\xi }}_i+{\varvec{\beta }}_i^\top ({\varvec{y}}_j-{\varvec{\mu }}_i)\right] E(\tau _j\gamma _j|{\varvec{y}}_j,\,z_{ij}=1)\\&\qquad +\,({\varvec{\lambda }}_i-{\varvec{\beta }}_i^\top {\varvec{\alpha }}_i)E(\tau _j\gamma _j^2|{\varvec{y}}_j,\,z_{ij}=1). \end{aligned}$$ -
(j)
Applying the double expectation and using (B.8) and (B.9), we have
$$\begin{aligned}&E\left( \tau _j{\varvec{U}}_{ij}{\varvec{U}}_{ij}^{\top }|{\varvec{y}}_j,\,z_{ij}=1\right) =E\left[ E\left( \tau _j{\varvec{U}}_{ij}{\varvec{U}}_{ij}^{\top }|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1\right) |{\varvec{y}}_j,\,z_{ij}=1\right] \\&\quad =E\left[ \tau _jE({\varvec{U}}_{ij}{\varvec{U}}_{ij}^{\top }|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)|{\varvec{y}}_j,\,z_{ij}=1\right] \\&\quad =E\big \{\tau _j[E({\varvec{U}}_{ij}|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)E({\varvec{U}}_{ij}^{\top }|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)\\&\qquad +\,\text{ cov }({\varvec{U}}_{ij}|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)]|{\varvec{y}}_j,\,z_{ij}=1\big \}\\&\quad =E\big \{\tau _j[E({\varvec{U}}_{ij}|{\varvec{y}}_j,\,\gamma _j,\,\tau _j,\,z_{ij}=1)({\varvec{\xi }}_i+{\varvec{\lambda }}_i\gamma _j+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i-{\varvec{\alpha }}_i\gamma _j))^{\top }\\&\qquad +\,\tau _j^{-1}({\varvec{I}}_q-{\varvec{\beta }}_i^{\top }{\varvec{A}}){\varvec{\varOmega }}_i]|{\varvec{y}}_j,\,z_{ij}=1\big \}\\&\quad =E(\gamma _j\tau _j{\varvec{U}}_{ij}|{\varvec{y}}_j,\,z_{ij}=1)({\varvec{\lambda }}_i-{\varvec{\beta }}_i^{\top }{\varvec{\alpha }}_i)^{\top }\\&\qquad +\,E(\tau _j{\varvec{U}}_{ij}|{\varvec{y}}_j,\,z_{ij}=1)[{\varvec{\xi }}_i+{\varvec{\beta }}_i^{\top }({\varvec{y}}_j-{\varvec{\mu }}_i)]^{\top }+({\varvec{I}}_q-{\varvec{\beta }}_i^{\top }{\varvec{A}}){\varvec{\varOmega }}_i. \end{aligned}$$ -
(k)
It is known that \(\int _0^{\infty }f(\tau _j|{\varvec{y}}_j,\,Z_{ij}=1)d\tau _j=1\), that is,
$$\begin{aligned} \int _0^{\infty }\frac{{\varPhi }\big (\sqrt{\tau _j}M_{ij}\big )}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }\frac{{(\frac{\nu _i+\delta _{ij}}{2})}^{(\frac{\nu _i+p}{2})}}{{\varGamma }(\frac{\nu _i+p}{2})}\exp \left\{ -\frac{\nu _i+\delta _{ij}}{2}\tau _j\right\} d\tau _j =1. \end{aligned}$$Then
$$\begin{aligned} \frac{d}{d\nu _i}\int _0^{\infty }b_j{\varPhi }\big (\sqrt{\tau _j}M_{ij}\big )\exp \left\{ -\frac{\nu _i+\delta _{ij}}{2}\tau _j\right\} d\tau _j=0, \end{aligned}$$where
$$\begin{aligned} b_j=\frac{{\left( \frac{\nu _i+\delta _{ij}}{2}\right) }^{\left( \nu _i+p\right) /2}}{{\varGamma }\left( \frac{\nu _i+p}{2}\right) T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }. \end{aligned}$$By Leibniz’s rule, we can obtain
$$\begin{aligned}&E(\log \tau _j|{\varvec{y}}_j,\,z_{ij}=1)-E(\tau _j|{\varvec{y}}_j,\,z_{ij}=1)+\log \left( \frac{\nu _i+\delta _{ij}}{2}\right) +\left( \frac{\nu _i+p}{\nu _i+\delta _{ij}}\right) \\&\quad -\,\mathrm{DG}\left( \frac{\nu _i+p}{2}\right) -\frac{\int _{-\infty }^{M_{ij}}t\left( x;0,\frac{\nu _i+\delta _{ij}}{\nu _i+p},\nu _i+p\right) f_{\nu _i}(x)dx}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }=0, \end{aligned}$$where
$$\begin{aligned} f_{\nu _i}(x)= & {} \mathrm{DG}\left( \frac{\nu _i+p+1}{2}\right) -\mathrm{DG}\left( \frac{\nu _i+p}{2}\right) -\frac{1}{\pi (\nu _i+\delta _{ij})}\nonumber \\&-\,\log \left( 1+\frac{x^2}{\nu _i+\delta _{ij}}\right) +\frac{(\nu _i+p+1)x^2}{(\nu _i+\delta _{ij})(x^2+\nu _i+\delta _{ij})}. \end{aligned}$$(B.10)It follows that
$$\begin{aligned} E(\log \tau _j|{\varvec{y}}_j,\,z_{ij}=1)= & {} E(\tau _j|{\varvec{y}}_j,\,z_{ij}=1)-\log \left( \frac{\nu _i+\delta _{ij}}{2}\right) -\left( \frac{\nu _i+p}{\nu _i+\delta _{ij}}\right) \\&+\,\mathrm{DG}\left( \frac{\nu _i+p}{2}\right) +\,\frac{\int _{-\infty }^{M_{ij}}t\left( x;0,\frac{\nu _i+\delta _{ij}}{\nu _i+p},\nu _i+p\right) f_{\nu _i}(x)dx}{T\left( M_{ij}\sqrt{\frac{\nu _i+p}{\nu _i+\delta _{ij}}};\nu _i+p\right) }. \end{aligned}$$
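The closed-form expression in (B.1) can be checked numerically: since \(f(\tau _j\mid {\varvec{y}}_j,z_{ij}=1)\) is proportional to \({\varPhi }(\sqrt{\tau _j}M_{ij})\) times a \(\mathrm{Gamma}\big (\frac{\nu _i+p}{2},\frac{\nu _i+\delta _{ij}}{2}\big )\) density, self-normalized importance sampling under the Gamma proposal should reproduce \(E(\tau _j\mid {\varvec{y}}_j,z_{ij}=1)\). The parameter values in this sketch are arbitrary.

```python
import numpy as np
from scipy.stats import norm, t as student_t

nu, p, delta, M = 5.0, 3.0, 2.5, 0.8
rng = np.random.default_rng(2)

# closed form from (B.1)
closed = ((nu + p) / (nu + delta)
          * student_t.cdf(M * np.sqrt((nu + p + 2) / (nu + delta)), df=nu + p + 2)
          / student_t.cdf(M * np.sqrt((nu + p) / (nu + delta)), df=nu + p))

# self-normalized importance sampling: Gamma proposal, Phi(sqrt(tau)*M) weights
tau = rng.gamma(shape=(nu + p) / 2.0, scale=2.0 / (nu + delta), size=400_000)
w = norm.cdf(np.sqrt(tau) * M)
mc = np.sum(w * tau) / np.sum(w)

print(closed, mc)   # the two values should nearly coincide
```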
Appendix C: Proof of CM-steps
-
(a)
By the Lagrange multiplier method, we define
$$\begin{aligned} L(\pi _i,\lambda )=Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})-\lambda \left( \sum _{i=1}^g\pi _i-1\right) , \end{aligned}$$and then take partial derivatives, yielding
$$\begin{aligned} \frac{\partial L(\pi _i,\lambda )}{\partial \pi _i} = \sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}\frac{1}{\pi _i}-\lambda =0,\quad \text{ and } \quad \frac{\partial L(\pi _i,\lambda )}{\partial \lambda } = -\left( \sum _{i=1}^g\pi _i-1\right) =0. \end{aligned}$$Since \(\sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}=n\), we obtain \(\hat{\pi }_i^{(k+1)}=\sum _{j=1}^n{\hat{z}}_{ij}^{(k)}/n\).
-
(b)
Differentiating \(Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})\) with respect to \({\varvec{\xi }}_i\) leads to
$$\begin{aligned} \frac{\partial Q}{\partial {\varvec{\xi }}_i}= & {} -\frac{1}{2}\frac{\partial }{\partial {\varvec{\xi }}_i}\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{\varOmega }}_i^{-1}\mathrm{tr}\bigg [-{\hat{{\varvec{\eta }}}_{ij}}^{(k)}{\varvec{\xi }}_i^\top -{\varvec{\xi }}_i{\hat{{\varvec{\eta }}}_{ij}}^{(k)^\top }\\&+\,{\varvec{\xi }}_i{\hat{\tau }_{ij}}^{(k)}{\varvec{\xi }}_i^\top +{\varvec{\xi }}_i{{\hat{s}}_{1ij}}^{(k)}{\varvec{\lambda }}_i^\top +{\varvec{\lambda }}_i{{\hat{s}}_{1ij}}^{(k)}{\varvec{\xi }}_i^\top \bigg ]\\= & {} \mathrm{tr}\bigg \{{\varvec{\varOmega }}_i^{-1}\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}\left[ {\hat{{\varvec{\eta }}}_{ij}}^{(k)}-{\hat{\tau }_{ij}}^{(k)}{\varvec{\xi }}_i-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\lambda }}_i\right] \bigg \}. \end{aligned}$$Moreover, the partial derivative of \(Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})\) with respect to \({\varvec{\lambda }}_i\) is
$$\begin{aligned} \frac{\partial Q}{\partial {\varvec{\lambda }}_i}= & {} -\frac{1}{2}\frac{\partial }{\partial {\varvec{\lambda }}_i}\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{\varOmega }}_i^{-1}\mathrm{tr}\bigg [-{\hat{{\varvec{\zeta }}}_{ij}}^{(k)}{\varvec{\lambda }}_i^\top +{\varvec{\xi }}_i{{\hat{s}}_{1ij}}^{(k)}{\varvec{\lambda }}_i^\top \\&-\,{\varvec{\lambda }}_i{\hat{{\varvec{\zeta }}}_{ij}}^{(k)^\top }+{\varvec{\lambda }}_i{{\hat{s}}_{1ij}}^{(k)}{\varvec{\xi }}_i^\top +{\varvec{\lambda }}_i{{\hat{s}}_{2ij}}^{(k)}{\varvec{\lambda }}_i^\top \bigg ]\\= & {} \mathrm{tr}\bigg \{{\varvec{\varOmega }}_i^{-1}\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}\left[ {\hat{{\varvec{\zeta }}}_{ij}}^{(k)}-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\xi }}_i-{{\hat{s}}_{2ij}}^{(k)}{\varvec{\lambda }}_i\right] \bigg \}. \end{aligned}$$Setting these two derivatives equal to zero, we get
$$\begin{aligned} \frac{\partial Q}{\partial {\varvec{\xi }}_i}= & {} \sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{\varOmega }}_i^{-1}({\hat{{\varvec{\eta }}}_{ij}}^{(k)}-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\lambda }}_i)-\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{\varOmega }}_i^{-1}{\hat{\tau }_{ij}}^{(k)}{\varvec{\xi }}_i=\mathbf 0, \end{aligned}$$(C.1)$$\begin{aligned} \frac{\partial Q}{\partial {\varvec{\lambda }}_i}= & {} \sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{\varOmega }}_i^{-1}({\hat{{\varvec{\zeta }}}_{ij}}^{(k)}-{{\hat{s}}_{1ij}}^{(k)}{\varvec{\xi }}_i)-\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{\varOmega }}_i^{-1}{{\hat{s}}_{2ij}}^{(k)}{\varvec{\lambda }}_i=\mathbf 0. \end{aligned}$$(C.2)After rearrangement, (C.1) and (C.2) can be rewritten as
$$\begin{aligned} \sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\hat{\tau }_{ij}}^{(k)}{\varvec{\xi }}_i+\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{{\hat{s}}_{1ij}}^{(k)}{\varvec{\lambda }}_i= & {} \sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\hat{{\varvec{\eta }}}_{ij}}^{(k)},\\ \sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{{\hat{s}}_{1ij}}^{(k)}{\varvec{\xi }}_i+\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{{\hat{s}}_{2ij}}^{(k)}{\varvec{\lambda }}_i= & {} \sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\hat{{\varvec{\zeta }}}_{ij}}^{(k)}. \end{aligned}$$Using Cramer’s rule, the solutions of the two linear equations are
$$\begin{aligned} \hat{{\varvec{\xi }}}_i^{\left( k+1\right) }= \frac{ \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\eta }}}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) - \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\zeta }}}_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) }{ \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{\tau }_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) - \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) ^2}, \end{aligned}$$and
$$\begin{aligned} \hat{{\varvec{\lambda }}}_i^{\left( k+1\right) }=\frac{ \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{\tau }_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\zeta }}}_{ij}}^{\left( k\right) }\right) - \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{{\varvec{\eta }}}_{ij}}^{\left( k\right) }\right) }{ \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{\hat{\tau }_{ij}}^{\left( k\right) }\right) \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{2ij}}^{\left( k\right) }\right) - \left( \sum _{j=1}^n{{\hat{z}}_{ij}}^{\left( k\right) }{{\hat{s}}_{1ij}}^{\left( k\right) }\right) ^2}. \end{aligned}$$
-
(c)
The partial derivative of \(Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})\) with respect to \({\varvec{A}}\) is
$$\begin{aligned} \frac{\partial Q}{\partial {\varvec{A}}}= & {} -\frac{1}{2}\frac{\partial }{\partial {\varvec{A}}}\sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}\mathrm{tr}\bigg (-{\varvec{D}}_i^{-1}{\varvec{y}}_j{\hat{{\varvec{\eta }}}_{ij}}^{(k)^\top }{\varvec{A}}^\top \nonumber \\&-\,{\varvec{D}}_i^{-1}{\varvec{A}}{\hat{{\varvec{\eta }}}_{ij}}^{(k)}{\varvec{y}}_j^\top +{\varvec{D}}_i^{-1}{\varvec{A}}{\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}{\varvec{A}}^\top \bigg )\nonumber \\= & {} \mathrm{tr}\left( \sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{D}}_i^{-1}{\varvec{y}}_j{\hat{{\varvec{\eta }}}_{ij}}^{(k)^\top }-\sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{D}}_i^{-1}{\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}{\varvec{A}}\right) . \end{aligned}$$(C.3)Equating (C.3) to the zero matrix, we have
$$\begin{aligned} \hat{{\varvec{A}}}^{(k+1)}=\left( \sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\varvec{y}}_j{\hat{{\varvec{\eta }}}_{ij}}^{(k)^\top }\right) \left( \sum _{i=1}^g\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\hat{{\varvec{\varPsi }}}_{ij}}^{(k)}\right) ^{-1}. \end{aligned}$$
-
(d)
The partial derivative of \(Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})\) with respect to \({\varvec{\varOmega }}_i\) is
$$\begin{aligned} \frac{\partial Q}{\partial {\varvec{\varOmega }}_i^{-1}}= & {} \frac{1}{2}\frac{\partial }{\partial {\varvec{\varOmega }}_i^{-1}}\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}\left\{ \log |{\varvec{\varOmega }}_i^{-1}|-\mathrm{tr}\left( {\varvec{\varOmega }}_i^{-1}{\varvec{\varLambda }}_{ij}\right) \right\} \nonumber \\= & {} \frac{1}{2}\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}\Big [2{\varvec{\varOmega }}_i-\mathrm{Diag}\{{\varvec{\varOmega }}_i\}-\big (2{\varvec{\varLambda }}_{ij}-\mathrm{Diag}\{{\varvec{\varLambda }}_{ij}\}\big )\Big ]. \end{aligned}$$(C.4)Equating (C.4) to the zero matrix gives
$$\begin{aligned} {\hat{{\varvec{\varOmega }}}_i}^{(k+1)}=\frac{\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\hat{{\varvec{\varLambda }}}_{ij}}^{(k+1)}}{\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}}. \end{aligned}$$
-
(e)
Taking the partial derivative of \(Q({\varvec{\varTheta }}\mid \hat{{\varvec{\varTheta }}}^{(k)})\) with respect to \({\varvec{D}}_i\) yields
$$\begin{aligned} \frac{\partial Q}{\partial {\varvec{D}}_i^{-1}}= & {} \frac{1}{2}\frac{\partial }{\partial {\varvec{D}}_i^{-1}}\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}\left[ \log |{\varvec{D}}_i^{-1}|-\mathrm{tr}\left( {\varvec{D}}_i^{-1}{\varvec{\varUpsilon }}_{ij}\right) \right] \nonumber \\= & {} \frac{1}{2}\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}({\varvec{D}}_i-{\varvec{\varUpsilon }}_{ij}). \end{aligned}$$(C.5)We have the following estimator
$$\begin{aligned} {\hat{{\varvec{D}}}_i}^{(k+1)}=\frac{\mathrm{Diag}\left\{ \sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}{\hat{{\varvec{\varUpsilon }}}_{ij}}^{(k+1)}\right\} }{\sum _{j=1}^n{{\hat{z}}_{ij}}^{(k)}} \end{aligned}$$obtained by equating (C.5) to the zero matrix.
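For concreteness, the pair of linear equations in (b) can be solved either through the Cramer's-rule expressions or by stacking them into a \(2\times 2\) system for the scalar coefficients. The sketch below uses randomly generated stand-ins for the E-step quantities \({\hat{z}}_{ij}\), \({\hat{\tau }}_{ij}\), \({\hat{s}}_{1ij}\), \({\hat{s}}_{2ij}\), \(\hat{{\varvec{\eta }}}_{ij}\) and \(\hat{{\varvec{\zeta }}}_{ij}\), and checks that the two routes agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 3  # sample size and number of factors (illustrative)

# Hypothetical E-step quantities for one component i at iteration k.
z   = rng.uniform(0.2, 1.0, n)        # z_hat_ij
tau = rng.uniform(0.5, 2.0, n)        # tau_hat_ij
s1  = rng.uniform(0.1, 1.0, n)        # s1_hat_ij
s2  = s1 + rng.uniform(0.5, 1.0, n)   # s2_hat_ij (keeps the 2x2 system nonsingular)
eta  = rng.normal(size=(n, q))        # eta_hat_ij
zeta = rng.normal(size=(n, q))        # zeta_hat_ij

# Scalar coefficients and q-vector right-hand sides of the 2x2 linear system.
A11, A12 = np.sum(z * tau), np.sum(z * s1)
A22      = np.sum(z * s2)
b1, b2   = z @ eta, z @ zeta

# Cramer's rule, exactly as in the displayed CM-step updates.
det = A11 * A22 - A12 ** 2
xi  = (b1 * A22 - b2 * A12) / det
lam = (A11 * b2 - A12 * b1) / det

# Cross-check: solve the stacked system with a generic linear solver.
sol = np.linalg.solve(np.array([[A11, A12], [A12, A22]]),
                      np.vstack([b1, b2]))
assert np.allclose(xi, sol[0]) and np.allclose(lam, sol[1])
```

In practice the stacked solve is preferable numerically when the determinant is small, but the closed form above makes the update cost explicit.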
Appendix D: Parameter estimation for the MCghstFA model using the ECM algorithm
According to Table 5, the MCghstFA model admits a three-level hierarchy:
From (D.1), it can be verified that
and
where \({\varvec{\mu }}_{2\cdot 1}={\varvec{\xi }}_i+W_j{\varvec{\lambda }}_i+ {\varvec{{\varOmega }}}_i{\varvec{A}}^{\top }({\varvec{A}}{\varvec{{\varOmega }}}_i{\varvec{A}}^{\top }+{\varvec{D}})^{-1}({\varvec{y}}_j-{\varvec{A}}{\varvec{\xi }}_i-W_j{\varvec{A}}{\varvec{\lambda }}_i)\) and \({\varvec{{\varSigma }}}_{22\cdot 1}=W_j({\varvec{{\varOmega }}}_i-{\varvec{{\varOmega }}}_i{\varvec{A}}^{\top }({\varvec{A}}{\varvec{{\varOmega }}}_i{\varvec{A}}^{\top }+{\varvec{D}})^{-1}{\varvec{A}}{\varvec{{\varOmega }}}_i)= W_j({\varvec{{\varOmega }}}^{-1}_i-{\varvec{A}}^{\top }{\varvec{D}}^{-1}{\varvec{A}})^{-1}\).
A positive random variable \(W\) is said to follow the Generalized Inverse Gaussian (GIG) distribution (Good 1953), denoted by \(W\sim \mathrm{GIG}(\psi ,\chi ,r)\), if it has the pdf
where \(\psi ,\chi \in {\mathbb {R}}^+\), \(r\in {\mathbb {R}}\), and \(K_r\) is the modified Bessel function of the third kind with index \(r\). Some particular moments of the GIG distribution have tractable forms, for instance,
and
where
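The tractable GIG moments can be sanity-checked numerically. The sketch below assumes the common \(\mathrm{GIG}(\psi ,\chi ,r)\) parameterization with density proportional to \(w^{r-1}\exp \{-(\psi w+\chi /w)/2\}\), under which \(E(W^{\alpha })=(\chi /\psi )^{\alpha /2}K_{r+\alpha }(\sqrt{\psi \chi })/K_{r}(\sqrt{\psi \chi })\); the parameter values are illustrative:

```python
import numpy as np
from scipy import integrate
from scipy.special import kv  # modified Bessel function of the third kind, K_r

# Illustrative GIG(psi, chi, r) parameters.
psi, chi, r = 2.0, 3.0, -1.5
omega = np.sqrt(psi * chi)

def gig_pdf(w):
    # Density of GIG(psi, chi, r) in the parameterization assumed above.
    c = (psi / chi) ** (r / 2) / (2 * kv(r, omega))  # normalizing constant
    return c * w ** (r - 1) * np.exp(-(psi * w + chi / w) / 2)

def gig_moment(alpha):
    # Closed form: E[W^alpha] = (chi/psi)^(alpha/2) K_{r+alpha}(omega) / K_r(omega).
    return (chi / psi) ** (alpha / 2) * kv(r + alpha, omega) / kv(r, omega)

for alpha in (1.0, -1.0):
    numeric = integrate.quad(lambda w: w ** alpha * gig_pdf(w), 0, np.inf)[0]
    assert abs(numeric - gig_moment(alpha)) < 1e-8
```

The same Bessel-ratio formula also delivers \(E(W^{-1})\) via the recurrence \(K_{r-1}(\omega )=K_{r+1}(\omega )-(2r/\omega )K_r(\omega )\), which avoids a second Bessel evaluation.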
By Bayes’ Theorem, the conditional pdf of \(W_j\) given \({\varvec{y}}_j\) can be written as
where \({\varDelta }_{ij}=({\varvec{y}}_j-{\varvec{A}}{\varvec{\xi }}_i)^{\top }({\varvec{A}}{\varvec{\varOmega }}_i{\varvec{A}}^{\top }+{\varvec{D}})^{-1}({\varvec{y}}_j-{\varvec{A}}{\varvec{\xi }}_i)\). It follows from (D.3) that
Alternatively, the MCghstFA model can be represented by a four-level hierarchy:
From (D.7), the complete-data log-likelihood function for \({\varvec{\varTheta }}\) on the basis of \({\varvec{Y}}_{c}=\{{\varvec{y}}_{j},{\varvec{U}}_{ij},W_{j},{\varvec{Z}}_j\}^{n}_{j=1}\), for \(i=1,\ldots ,g\), is given by
To evaluate the expected value of (D.8), called the Q function, we first calculate
which is the posterior probability of \({\varvec{y}}_j\) belonging to the ith component of the mixture. In addition, we utilize the results (D.4) and (D.5) to calculate the following conditional expectations:
where \(\hat{\omega }_{ij}=\sqrt{(\hat{\nu }_i+\hat{{\varDelta }}_{ij})\hat{{\varvec{\lambda }}}^{\top }_i\hat{{\varvec{A}}}^{\top }(\hat{{\varvec{A}}}\hat{{\varvec{\varOmega }}}_i\hat{{\varvec{A}}}^{\top }+\hat{{\varvec{D}}})^{-1}\hat{{\varvec{A}}}\hat{{\varvec{\lambda }}}_i}\) and \(K^{\prime }_{-\frac{(\hat{\nu }_i+p)}{2}}(\hat{\omega }_{ij})\) is evaluated via (D.6). By (D.2), we obtain
and
where \(\hat{{\varvec{\gamma }}}_i = (\hat{{\varvec{A}}} \hat{{\varvec{{\varOmega }}}}_i \hat{{\varvec{A}}}^{\top } + \hat{{\varvec{D}}})^{-1}\hat{{\varvec{A}}}\hat{{\varvec{{\varOmega }}}}_i\).
After some algebraic manipulation, the resulting Q function, with additive constants dropped, is given by
Taking partial derivatives of (D.14) with respect to \({\varvec{\xi }}_i\) and \({\varvec{\lambda }}_i\) and equating them to zero vectors yield
In summary, the ECM algorithm for estimating the parameters of MCghstFA proceeds as follows:
-
E-step:
Given the current value \({\varvec{\varTheta }}=\hat{{\varvec{\varTheta }}}\), compute \({\hat{z}}_{ij}\), \({\hat{s}}_{1ij}\), \({\hat{s}}_{2ij}\), \({\hat{s}}_{3ij}\), \(\hat{{\varvec{u}}}_{ij}\), \(\hat{{\varvec{\eta }}}_{ij}\) and \(\hat{{\varvec{{\varPsi }}}}_{ij}\) as defined in (D.9)–(D.13) for \(i=1,\ldots ,g\) and \(j=1,\ldots ,n\).
-
CM step 1:
Maximizing (D.14) with respect to \(\pi _i\) via the Lagrange multiplier method gives \(\hat{\pi }_i={\hat{n}}_i/n\), where \({\hat{n}}_i=\sum _{j=1}^n {\hat{z}}_{ij}\).
-
CM step 2:
Update parameters \({\varvec{\xi }}_i\) and \({\varvec{\lambda }}_i\) by solving simultaneous Eqs. (D.15) and (D.16). Simple matrix algebra yields
$$\begin{aligned} \hat{{\varvec{\xi }}}_i= & {} \frac{\big (\sum _{j=1}^n{\hat{z}}_{ij}{\hat{s}}_{1ij}\big )\big (\sum _{j=1}^n{\hat{z}}_{ij}\hat{{\varvec{\eta }}}_{ij}\big )- {\hat{n}}_i\big (\sum _{j=1}^n{\hat{z}}_{ij}\hat{{\varvec{u}}}_{ij}\big )}{\big (\sum _{j=1}^n{\hat{z}}_{ij}{\hat{s}}_{1ij}\big )\big (\sum _{j=1}^n{\hat{z}}_{ij}{\hat{s}}_{2ij}\big )-{\hat{n}}^2_i} \end{aligned}$$and
$$\begin{aligned} \hat{{\varvec{\lambda }}}_i= & {} \frac{\big (\sum _{j=1}^n{\hat{z}}_{ij}{\hat{s}}_{2ij}\big )\big (\sum _{j=1}^n{\hat{z}}_{ij}\hat{{\varvec{u}}}_{ij}\big )- {\hat{n}}_i\big (\sum _{j=1}^n{\hat{z}}_{ij}\hat{{\varvec{\eta }}}_{ij}\big )}{\big (\sum _{j=1}^n{\hat{z}}_{ij}{\hat{s}}_{1ij}\big )\big (\sum _{j=1}^n{\hat{z}}_{ij}{\hat{s}}_{2ij}\big )-{\hat{n}}^2_i}. \end{aligned}$$
-
CM step 3:
The updates for \({\varvec{A}}\), \({\varvec{{\varOmega }}}_i\) and \({\varvec{D}}\) are given by
$$\begin{aligned} \hat{{\varvec{A}}}= & {} \left( \sum _{i=1}^g\sum _{j=1}^n{\hat{z}}_{ij}{\varvec{y}}_j\hat{{\varvec{\eta }}}_{ij}^{\top }\right) \left( \sum _{i=1}^g\sum _{j=1}^n{\hat{z}}_{ij}\hat{{\varvec{{\varPsi }}}}_{ij}\right) ^{-1},\\ \hat{{\varvec{{\varOmega }}}}_i= & {} \frac{1}{{\hat{n}}_i}\sum _{j=1}^n{\hat{z}}_{ij}\Big [\hat{{\varvec{{\varPsi }}}}_{ij}-\hat{{\varvec{\eta }}}_{ij}\hat{{\varvec{\xi }}}_{i}^{\top } -\hat{{\varvec{\xi }}}_{i}\hat{{\varvec{\eta }}}_{ij}^{\top }+{\hat{s}}_{2ij}\hat{{\varvec{\xi }}}_{i}\hat{{\varvec{\xi }}}_{i}^{\top }+{\hat{s}}_{1ij}\hat{{\varvec{\lambda }}}_i\hat{{\varvec{\lambda }}}^{\top }_i\\&-\,(\hat{{\varvec{u}}}_{ij}-\hat{{\varvec{\xi }}}_i)\hat{{\varvec{\lambda }}}^{\top }_i-\hat{{\varvec{\lambda }}}_i(\hat{{\varvec{u}}}_{ij}-\hat{{\varvec{\xi }}}_i)^{\top }\Big ],\\ \hat{{\varvec{D}}}= & {} \frac{1}{n}\mathrm{Diag}\Bigg \{\sum _{i=1}^g\sum _{j=1}^n{\hat{z}}_{ij}\big ({\hat{s}}_{2ij}{\varvec{y}}_j{\varvec{y}}^{\top }_j-{\varvec{y}}_j\hat{{\varvec{\eta }}}_{ij}^{\top }\hat{{\varvec{A}}}^{\top }\big )\Bigg \}. \end{aligned}$$
-
CM step 4:
Calculate \(\hat{\nu }_i\) as the root of the following equation:
$$\begin{aligned} \log \left( \frac{\nu _i}{2}\right) - \mathrm{DG}\left( \frac{\nu _i}{2}\right) + 1 -\frac{1}{{\hat{n}}_i} \sum _{j=1}^n {\hat{z}}_{ij}({\hat{s}}_{2ij} + {\hat{s}}_{3ij})=0. \end{aligned}$$
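CM step 4 is a one-dimensional root-finding problem. Since \(\log (\nu _i/2)-\mathrm{DG}(\nu _i/2)\) decreases monotonically from \(+\infty \) to 0, the equation has a unique root whenever the data-driven term \({\hat{n}}_i^{-1}\sum _{j=1}^n {\hat{z}}_{ij}({\hat{s}}_{2ij}+{\hat{s}}_{3ij})\) exceeds one. A minimal sketch, with a hypothetical value standing in for that term, is:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

# Hypothetical E-step summary for component i:
# c = (1/n_i) * sum_j z_hat_ij * (s2_hat_ij + s3_hat_ij).
c = 1.2

def score(nu):
    # Left-hand side of the degrees-of-freedom equation in CM step 4.
    return np.log(nu / 2) - digamma(nu / 2) + 1 - c

# score is positive near 0 and tends to 1 - c < 0 as nu grows,
# so a sign change is bracketed on a wide interval.
nu_hat = brentq(score, 1e-3, 1e3)
assert abs(score(nu_hat)) < 1e-10
```

Bracketing on a generous interval such as \([10^{-3},10^{3}]\) is cheap and sidesteps the need for a starting value; in an ECM run this solve is repeated once per component per iteration.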
Wang, WL., Castro, L.M., Chang, YT. et al. Mixtures of restricted skew-t factor analyzers with common factor loadings. Adv Data Anal Classif 13, 445–480 (2019). https://doi.org/10.1007/s11634-018-0317-2