1 Introduction

Factor analysis (FA), which originated from the work of Spearman (1904), is concerned with summarizing the variability among a number of correlated variables; see, for example, Lawley and Maxwell (1971). The correlations between the variables under consideration are explained by their linear dependence on a usually much smaller number of unobservable (latent) factors. In particular, FA can be considered an extension of principal component analysis (PCA), both of which are widely used statistical tools for reducing dimensionality by constructing linear combinations of the variables. Unlike the PCA model, the FA model enjoys a powerful invariance property: changes in the scales of the variables in \(Y\) appear only as scale changes in the appropriate rows of the matrix of factor loadings.

FA has been successfully applied to numerous problems that arise naturally in many areas; see Basilevsky (2008) for a literature survey. In the FA framework, errors and factors are routinely assumed to have Gaussian distributions because of their mathematical and computational tractability. However, the traditional FA approach has often been criticized for its lack of stability and robustness against non-normal features such as skewness and heavy tails. Statistical methods that ignore departures from normality may yield biased or misleading inferences. To remedy this weakness, authors such as McLachlan et al. (2007), Wang and Lin (2013), and Zhang et al. (2013) considered the use of the multivariate \(t\) (MT) distribution for robust estimation of FA models, known as the tFA model.

When the data have longer-than-normal tails or contain atypical observations (the so-called outliers), the MT distribution has been shown to be a natural extension of the normal for making robust statistical inference (Lange et al. 1989; Kotz and Nadarajah 2004), as it has an extra tuning parameter, the degrees of freedom (df), to regulate the thickness of the tails. In many biological applications (cf. Pyne et al. 2009; Rossin et al. 2011; Ho et al. 2012) and other applied problems, however, the data often involve observations whose distributions are highly asymmetric as well as fat-tailed.

Over the past two decades, there has been a growing interest in proposing more flexible parametric families that can accommodate skewness and other non-normal features. In particular, the family of multivariate skew-\(t\) (MST) distributions (Azzalini and Capitanio 2003; Jones and Faddy 2003; Sahu et al. 2003; Azzalini and Genton 2008) has received considerable attention. This family contains additional skewness parameters for modeling asymmetry and includes the MT family as a special case.

This paper presents a robust version of the standard FA model in which the joint distribution of the factors and the error vector is a restricted skew-\(t\) distribution whose skewness parameters are zero for the error vector; that is, the latent factors have a rMST distribution and the error vector has a symmetric MT distribution. Henceforth, we refer to this skew-\(t\) factor analysis as STFA. Notably, STFA is widely applicable in practice, as it includes the classical FA model as a limiting case and the tFA model as a special case. The rMST distribution denotes the skew distribution of Sahu et al. (2003) with the restriction that the skewing latent variables in its convolution formulation are all equal; that is, the rMST distribution has a univariate skewing function. The rMST distribution is, after an appropriate transformation, the same as the skew-\(t\) distribution proposed by Azzalini and Capitanio (2003), and it has been widely studied in the literature and used in practice. When the df approaches infinity, the limiting distribution of the rMST is the restricted multivariate skew-normal (rMSN) distribution. A comprehensive overview of their characterizations, together with their conditioning-type and convolution-type representations, can be found in Lee and McLachlan (2013, 2014).

Recently, several different skew factor-analytic models have been proposed in the literature, for example, by Montanari and Viroli (2010) and Wall et al. (2012). More recently, Murray et al. (2014a) proposed a skew factor analysis model in which the error vector is taken to have a generalized hyperbolic skew \(t\) (GHST) distribution (Barndorff-Nielsen and Shephard 2001), while the factor vector is assumed to have an MT distribution. For brevity, we call this approach the “generalized hyperbolic skew-\(t\) factor analysis (GHSTFA)” model.

It is important to note that the GHST distribution is quite different from the rMST distribution, as pointed out in Aas and Haff (2006). Firstly, as the degrees of freedom parameter approaches infinity, the rMST distribution reduces to the restricted multivariate skew-normal (rMSN) distribution, whereas the GHST distribution tends to an elliptically symmetric distribution, namely an ordinary multivariate normal (MN) distribution (Lee and Poon 2011). Secondly, the rMST distribution has heavy (polynomial) tails in all directions, whereas the GHST distribution has some tails that are semi-heavy (exponential).

To further reduce the number of free parameters, Murray et al. (2014b) have put forward an alternative to the GHSTFA model that assumes skew common factors. This new approach, called the “generalized hyperbolic common skew-\(t\) factor analysis (GHCSTFA)” model, is constructed by taking the latent factor vector, rather than the error vector, to have the GHST distribution. It is also important to note that in all previous works except Lin et al. (2013) and Murray et al. (2014b), the factor vector is taken to have a symmetric distribution, with the asymmetric distribution assumed for the error vector, as in, for example, Murray et al. (2013) and Tortora et al. (2013).

The paper is structured as follows. In Sect. 2, we establish the notation and briefly outline some preliminary properties of the rMSN and rMST distributions. Section 3 discusses the specification of the STFA model and presents the development of an ECM algorithm for obtaining the ML estimates of model parameters. In Sect. 4, we describe two simple ways of computing the standard errors of the STFA model parameters based on the information-based method and the parametric bootstrap procedure. In Sect. 5, we illustrate the usefulness of the proposed method with a real-life data set. A simulation study is undertaken to compare the performance of the STFA, GHSTFA and GHCSTFA methods. Some concluding remarks are given in Sect. 6 and technical derivations are sketched in Supplementary Appendices.

2 Preliminaries

We begin with a brief review of the rMST distribution and a study of some essential properties. To establish notation, we let \(\phi _p(\cdot ;\mu ,\Sigma )\) be the probability density function (pdf) of \(N_p(\mu ,\Sigma )\) (a \(p\)-variate MN distribution with mean \(\mu \) and covariance matrix \(\Sigma \)); \(\Phi (\cdot )\) be the cumulative distribution function (cdf) of the standard normal distribution; \(t_p(\cdot ;\mu ,\Sigma ,\nu )\) be the pdf of \(t_p(\mu ,\Sigma ,\nu )\) (a \(p\)-variate MT distribution with location \(\mu \), scale covariance matrix \(\Sigma \) and degrees of freedom \(\nu \)); \(T(\cdot ;\nu )\) be the cdf of the Student’s \(t\) distribution with df \(\nu \); \(TN(\mu ,\sigma ^2;(a,b))\) be the truncated normal distribution for \(N(\mu ,\sigma ^2)\) lying within the truncation interval \((a,b)\); \(M^{1/2}\) denote the square root of a symmetric matrix \(M\); \(1_p\) denote a \(p\times 1\) vector of ones; \(I_p\) be the \(p\times p\) identity matrix; \(\mathrm{Diag}\{\cdot \}\) be the diagonal matrix created by extracting the main diagonal elements of a square matrix or by diagonalizing a vector; and \(\mathrm{vec}(\cdot )\) be the operator that vectorizes a matrix by stacking its columns vertically.

Based on Pyne et al. (2009), a \(p\)-dimensional random vector \(Y\) is said to follow a rMST distribution with location vector \(\mu \in \mathbb {R}^p\), scale covariance matrix \(\Sigma \), skewness vector \(\lambda \in \mathbb {R}^p\) and df \(\nu \in (0,\infty )\), denoted as \(rSt_p(\mu ,\Sigma ,\lambda ,\nu )\), if it can be represented by

$$\begin{aligned} Y&= \mu + W^{-1/2}X,~X\sim rSN_p(0,\Sigma ,\lambda ),\nonumber \\ W&\sim \mathrm{gamma}(\nu /2,\nu /2),\quad X~\bot ~W, \end{aligned}$$
(1)

where \(\text{ gamma }(\alpha ,\beta )\) stands for the gamma distribution with shape \(\alpha \), rate \(\beta \) and mean \(\alpha /\beta \). If \(\lambda =0\), the distribution of \(Y\) reduces to \(t_{p}(\mu ,\Sigma ,\nu )\); as \(\nu \rightarrow \infty \), it reduces to \(rSN_p(\mu ,\Sigma ,\lambda )\). In addition, this class of distributions also includes the MN distribution, recovered by setting \(\lambda =0\) and letting \(\nu \rightarrow \infty \). Combining the strengths of the MT and rMSN distributions, the rMST distribution offers a robustness mechanism against both asymmetry and outliers observed in the data.
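Representation (1) also yields a direct simulation recipe. Below is a minimal R sketch, assuming the convolution form of the rMSN law, \(X=\lambda |U_0|+U_1\) with \(U_0\sim N(0,1)\) independent of \(U_1\sim N_p(0,\Sigma )\); the mvtnorm package and the function name rrmst are our own choices.

```r
# Minimal sketch: simulate Y ~ rSt_p(mu, Sigma, lambda, nu) via (1),
# assuming the convolution form X = lambda*|U0| + U1 of the rMSN law.
library(mvtnorm)

rrmst <- function(n, mu, Sigma, lambda, nu) {
  U0 <- abs(rnorm(n))                            # half-normal skewing variable
  U1 <- rmvnorm(n, sigma = Sigma)                # n x p normal component
  X  <- outer(U0, lambda) + U1                   # X ~ rSN_p(0, Sigma, lambda)
  W  <- rgamma(n, shape = nu / 2, rate = nu / 2) # gamma(nu/2, nu/2) weights
  sweep(X / sqrt(W), 2, mu, "+")                 # Y = mu + W^{-1/2} X, row-wise
}
```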

From (1), it is clear that the rMST distribution corresponds to a two-level hierarchical representation

$$\begin{aligned} Y \mid (W=w)&\sim rSN_{p}\big (\mu ,w^{-1}\Sigma ,w^{-1/2}\lambda \big )\quad \text{ and }\quad W \sim \text{ gamma }(\nu /2,\nu /2). \end{aligned}$$
(2)

Integrating \(W\) out of the joint density of \((Y,W)\) yields the marginal density of \(Y\)

$$\begin{aligned} f(y)&= 2t_{p}(y;\mu ,\Omega ,\nu )~T\Big (A\Big (\frac{\nu +p}{\nu +M}\Big )^{1/2}{;}\nu +p\Big ), \end{aligned}$$
(3)

where \(\Omega =\Sigma +\lambda \lambda ^\mathrm{T}\), \(A=(1-\lambda ^\mathrm{T}\Omega ^{-1}\lambda )^{-1/2}\lambda ^\mathrm{T}\Omega ^{-1}(y-\mu )\) and \(M=(y-\mu )^\mathrm{T}\Omega ^{-1}(y-\mu ).\)
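For concreteness, the density (3) can be coded directly from these definitions. The R sketch below assumes the mvtnorm package for the \(t_p\) density; the function name drmst is ours.

```r
# Minimal sketch of the rMST density in Eq. (3).
library(mvtnorm)

drmst <- function(y, mu, Sigma, lambda, nu) {
  p     <- length(y)
  Omega <- Sigma + tcrossprod(lambda)          # Omega = Sigma + lambda lambda'
  Oinv  <- solve(Omega)
  d     <- y - mu
  M     <- drop(crossprod(d, Oinv %*% d))      # M = (y - mu)' Omega^{-1} (y - mu)
  A     <- drop(crossprod(lambda, Oinv %*% d)) /
           sqrt(1 - drop(crossprod(lambda, Oinv %*% lambda)))
  2 * dmvt(y, delta = mu, sigma = Omega, df = nu, log = FALSE) *
    pt(A * sqrt((nu + p) / (nu + M)), df = nu + p)
}
```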

3 Skew-\(t\) factor analysis model

3.1 Model formulation

Suppose that \(Y=\left\{ Y_1,\ldots ,Y_n\right\} \) constitutes a random sample of \(n\) \(p\)-dimensional observations. To improve robustness when modeling correlations in the presence of asymmetric sources of variation, we consider a generalization of the \(t\)FA model in which the latent factors follow the rMST distribution with density (3). The model considered here is

$$\begin{aligned} Y_j&= \mu +BU_j+\varepsilon _j\quad \text{ with }\nonumber \\ \left[ \begin{array}{c}U_j \\ \varepsilon _j\end{array}\right]&\sim rSt_{q+p} \left( \left[ \begin{array}{c}-a_{\nu }\Lambda ^{-1/2}\lambda \\ 0\end{array}\right] , \left[ \begin{array}{cc}\Lambda ^{-1} &{} 0\\ 0 &{} D\end{array}\right] , \left[ \begin{array}{c}\Lambda ^{-1/2}\lambda \\ 0\end{array}\right] , \nu \right) , \end{aligned}$$
(4)

for \(j=1,\ldots ,n\), where \(\mu \) is a \(p\)-dimensional location vector, \(B\) is a \(p\times q\) matrix of factor loadings, \(U_j\) is a \(q\)-dimensional vector \((q<p)\) of latent variables called factors, \(\varepsilon _j\) is a \(p\)-dimensional vector of errors called specific factors, \(D\) is a positive diagonal matrix, \(\Lambda =I_{q}+\left( 1-a_{\nu }^{2}(\nu -2)/\nu \right) \lambda \lambda ^\mathrm{T}\) with

$$\begin{aligned} a_{\nu }&= (\nu /\pi )^{1/2}\displaystyle \frac{\Gamma \left( (\nu -1)/2\right) }{\Gamma \left( \nu /2\right) } \end{aligned}$$
(5)

being a scaling coefficient. Marginally, the latent factors in (4) follow an asymmetric rMST distribution, while the errors follow a (symmetric) MT distribution. Moreover, one appealing feature of (4) is that

$$\begin{aligned} E(U_j)=0\quad \text{ and }\quad \text{ cov }(U_j) =\{\nu /(\nu -2)\}I_{q}, \end{aligned}$$

which coincide with the conditions under the \(t\)FA model. According to (2), the STFA model has a two-level hierarchical representation:

$$\begin{aligned} Y_j \mid w_j&\sim rSN_{p}\left( \mu -a_{\nu }\alpha ,w^{-1}_j\Sigma ,w^{-1/2}_j\alpha \right) \quad \text{ and } \quad W_j \sim \text{ gamma }(\nu /2,\nu /2). \end{aligned}$$
(6)

The marginal distribution of \(Y_j\) can be obtained by direct calculation, which leads to

$$\begin{aligned} Y_j\sim rSt_p(\mu -a_{\nu }\alpha ,\Sigma ,\alpha ,\nu ), \end{aligned}$$

where \(\Sigma =B\Lambda ^{-1}B^\mathrm{T}+D\) and \(\alpha =B\Lambda ^{-1/2}\lambda \). The marginal density of \(Y_j\) is

$$\begin{aligned} f(y_j;\theta )&= 2t_{p}(y_j;\mu -a_{\nu }\alpha ,\Omega ,\nu )T\left( A_j\left( \frac{\nu +p}{\nu +M_j}\right) ^{1/2};\nu +p\right) , \end{aligned}$$
(7)

where \(\Omega =\Sigma +\alpha \alpha ^\mathrm{T}\), \(M_j=(y_j-\mu +a_{\nu }\alpha )^\mathrm{T}\Omega ^{-1}(y_j-\mu +a_{\nu }\alpha )\) and \(A_j=h_j/\sigma \) with \(h_j=\alpha ^\mathrm{T}\Omega ^{-1}(y_j-\mu +a_{\nu }\alpha )\) and \(\sigma ^{2}=1-\alpha ^\mathrm{T}\Omega ^{-1}\alpha \).
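For illustration, the marginal density (7) can be evaluated by reusing drmst() from the sketch in Sect. 2 and computing \(a_\nu \), \(\Lambda \), \(\Sigma \) and \(\alpha \) from the definitions above; the helper names below are ours.

```r
# Sketch: STFA marginal density of Eq. (7) via the rMST density (3).
a_nu_fun <- function(nu) sqrt(nu / pi) * gamma((nu - 1) / 2) / gamma(nu / 2)

mat_inv_sqrt <- function(M) {                  # inverse symmetric square root
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow(M)) %*% t(e$vectors)
}

dstfa <- function(y, mu, B, D, lambda, nu) {
  anu    <- a_nu_fun(nu)
  q      <- length(lambda)
  Lambda <- diag(q) + (1 - anu^2 * (nu - 2) / nu) * tcrossprod(lambda)
  Sigma  <- B %*% solve(Lambda, t(B)) + D      # Sigma = B Lambda^{-1} B' + D
  alpha  <- drop(B %*% mat_inv_sqrt(Lambda) %*% lambda)  # B Lambda^{-1/2} lambda
  drmst(y, mu - anu * alpha, Sigma, alpha, nu) # Y_j ~ rSt_p(mu - a_nu alpha, ...)
}
```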

The mean and covariance matrix of \(Y_j\) can be obtained as

$$\begin{aligned} E(Y_j)=\mu \quad \text{ and }\quad \text{ cov }(Y_j)=\frac{\nu }{\nu -2}(BB^\mathrm{T}+D). \end{aligned}$$

It follows that the tFA and STFA models share the same first two moments for the marginal distribution of \(Y_j\).

For a hidden dimensionality \(q>1\), the STFA model also suffers from an identifiability problem associated with the rotation invariance of the loading matrix \(B\), since model (4) is still satisfied when \(B\) is replaced by \(BR\), where \(R\) is any orthogonal rotation matrix of order \(q\). Several kinds of identifiability constraints can be imposed to remedy this rotational indeterminacy. The most popular method is to choose \(R\) such that \(B^\mathrm{T}D^{-1}B\) is a diagonal matrix (Lawley and Maxwell 1971) with its diagonal elements arranged in descending order. The other commonly used technique is to constrain the loading matrix \(B\) so that its upper-right triangle is zero and its diagonal entries are strictly positive (e.g., Fokoué and Titterington 2003; Lopes and West 2004). Both methods impose \(q(q-1)/2\) constraints on \(B\). Therefore, the number of free parameters to be estimated is \(m=p(q+2)+q-q(q-1)/2+1\); for instance, with \(p=11\) and \(q=4\), \(m=65\).

3.2 Maximum likelihood estimation via the ECM algorithm

To help the derivation of the algorithm, we adopt the following scaling transformation:

$$\begin{aligned} \tilde{B}\mathop {=}\limits ^{\triangle }B\Lambda ^{-1/2}\quad \text{ and } \quad \tilde{U}_j\mathop {=}\limits ^{\triangle }\Lambda ^{1/2}U_j. \end{aligned}$$

Clearly, the model remains invariant under the above transformation. It follows from (6) that the STFA model can be formulated via the following four-level hierarchical representation:

$$\begin{aligned} Y_j\mid (\tilde{U}_j,v_j, w_j)&\sim N_{p}(\mu +\tilde{B}\tilde{U}_j, w^{-1}_jD),\nonumber \\ \tilde{U}_j\mid (v_j, w_j)&\sim N_{q}\big ((v_j-a_{\nu })\lambda , w^{-1}_jI_q\big ),\nonumber \\ V_j\mid w_j&\sim TN\big (0, w^{-1}_j;(0,\infty )\big ),\nonumber \\ W_j&\sim \mathrm{gamma}(\nu /2,\nu /2). \end{aligned}$$
(8)

Consequently, applying Bayes’ rule yields

$$\begin{aligned} \tilde{U}_j\mid (y_j,v_j, w_j)&\sim N_{q}\big (q_j, w^{-1}_j C\big ),\nonumber \\ V_j\mid (y_j, w_j)&\sim TN\big (h_j, w^{-1}_j\sigma ^{2};(0,\infty )\big ),\nonumber \\ f( w_j; y_j)&= \frac{\Phi \left( w_j^{1/2}A_j\right) }{T\Big (A_j\left( \frac{\nu +p}{\nu +M_j}\right) ^{1/2};\nu +p\Big )}\,f_G\Big (w_j;\frac{\nu +p}{2},\frac{\nu +M_j}{2}\Big ), \end{aligned}$$
(9)

where \(q_j=C\big \{d_j+\lambda (v_j-a_{\nu })\big \}\), \(d_j=\tilde{B}^\mathrm{T}D^{-1}(y_j-\mu )\), \(C=(I_q+\tilde{B}^\mathrm{T}D^{-1}\tilde{B})^{-1}\), and \(f_G(\cdot ;\alpha ,\beta )\) denotes the pdf of \(\text{ gamma }(\alpha ,\beta )\).

For notational convenience, let \(y=(y_1^\mathrm{T},\ldots ,y_n^\mathrm{T})^\mathrm{T}\) be the observed data. Moreover, we define \(U=(U_1^\mathrm{T},\ldots ,~U_n^\mathrm{T})^\mathrm{T}\), \(V=(V_1,\ldots ,V_n)^\mathrm{T}\), and \(W=(W_1,\ldots ,W_n)^\mathrm{T}\), which are treated as missing values in the complete data framework. In light of (8), the complete data log-likelihood function for \(\theta =(\mu ,B,D,\lambda ,\nu )\) given \(y_c=(y^\mathrm{T},U^\mathrm{T},V^\mathrm{T},W^\mathrm{T})^\mathrm{T}\), aside from additive constants, is

$$\begin{aligned} \ell _c(\theta ;y_{c})&= -\frac{n}{2}\log \mid D\mid -\frac{1}{2}\mathrm{tr}\left( D^{-1}\sum _{j=1}^n\tilde{\Upsilon }_{j}\right) \nonumber \\&-\frac{1}{2}\sum _{j=1}^n\left[ W_j\left\{ (V_j-a_{\nu })^2\lambda ^\mathrm{T}\lambda -2(V_j-a_{\nu })\lambda ^\mathrm{T}\tilde{U}_j+\tilde{U}_{j}^\mathrm{T}\tilde{U}_{j}\right\} \right] \nonumber \\&+\frac{n\nu }{2}\log \left( \frac{\nu }{2}\right) -n\log \Gamma \left( \frac{\nu }{2}\right) +\frac{\nu }{2}\sum _{j=1}^n(\log W_j-W_j), \end{aligned}$$
(10)

where \(\tilde{\Upsilon }_{j} = W_{j}(y_{j}-\mu -\tilde{B}\tilde{U}_{j})(y_{j}-\mu -\tilde{B}\tilde{U}_{j})^\mathrm{T}.\)

The expectation–maximization (EM) algorithm (Dempster et al. 1977) is a popular iterative method for computing ML estimates when the data are incomplete. Given an initial solution \(\theta ^{(0)}\), the implementation of the EM algorithm consists of alternating between the expectation (E) and maximization (M) steps until convergence is reached. In many practical problems, however, the M-step is difficult in that no closed-form expressions exist for updating some of the parameters. For ML estimation of the STFA model, we therefore resort to the ECM algorithm (Meng and Rubin 1993), in which the M-step is replaced by a sequence of computationally simpler conditional maximization (CM) steps while retaining all the appealing properties of the standard EM algorithm.

To calculate the expectation of the complete data log-likelihood, called the \(Q\)-function, we require the following conditional expectations:

$$\begin{aligned} \hat{w}_j^{(k)}&= E(W_{j}\mid y_j,\hat{\theta }^{(k)}),\quad \hat{\kappa }_{j}^{(k)}=E(\log W_j\mid y_j,\hat{\theta }^{(k)}),\nonumber \\ \hat{s}_{1j}^{(k)}&= E(W_{j}V_{j}\mid y_j,\hat{\theta }^{(k)}),\quad \hat{s}_{2j}^{(k)}=E(W_{j}V_{j}^{2}\mid y_j,\hat{\theta }^{(k)}), \nonumber \\ \hat{\tilde{\Omega }}_{j}^{(k)}&= E(W_{j}\tilde{U}_{j}\tilde{U}_{j}^\mathrm{T}\mid y_j,\hat{\theta }^{(k)}),\nonumber \\ \hat{\tilde{\eta }}_{j}^{(k)}&= E(W_{j}\tilde{U}_{j}\mid y_j,\hat{\theta }^{(k)})~~\text{ and }~~\hat{\tilde{\zeta }}_{j}^{(k)}=E(W_{j}V_{j}\tilde{U}_{j}\mid y_j,\hat{\theta }^{(k)}), \end{aligned}$$
(11)

which are directly obtainable using (A.1)–(A.7) given in Supplementary Proposition 4. As a result, the \(Q\)-function can be written as

$$\begin{aligned} Q(\theta ;\hat{\theta }^{(k)})&= -\frac{n}{2}\log \mid D\mid -\frac{1}{2}\mathrm{tr}\left( D^{-1}\sum _{j=1}^{n}\hat{\tilde{\Upsilon }}_{j}^{(k)}\right) \nonumber \\&-\frac{1}{2}\sum _{j=1}^{n}\left\{ \left( \hat{s}_{2j}^{(k)}-2a_{\nu }\hat{s}_{1j}^{(k)}+a_{\nu }^{2}\hat{w}_{j}^{(k)}\right) \lambda ^\mathrm{T}\lambda -2\lambda ^\mathrm{T}\left( \hat{\tilde{\zeta }}_{j}^{(k)}-a_{\nu }\hat{\tilde{\eta }}_{j}^{(k)}\right) +\mathrm{tr}\big (\hat{\tilde{\Omega }}^{(k)}_j\big )\right\} \nonumber \\&+\frac{n\nu }{2}\log \left( \frac{\nu }{2}\right) -n\log \Gamma \left( \frac{\nu }{2}\right) +\frac{\nu }{2}\sum _{j=1}^n(\hat{\kappa }_j^{(k)}-\hat{w}_j^{(k)}), \end{aligned}$$
(12)

where

$$\begin{aligned} \hat{\tilde{\Upsilon }}_{j}^{(k)}&= \hat{w}_{j}^{(k)}(y_{j}-\mu )(y_{j}-\mu )^\mathrm{T}-\tilde{B}\hat{\tilde{\eta }}_{j}^{(k)}(y_{j}-\mu )^\mathrm{T}-(y_{j}-\mu )\hat{\tilde{\eta }}_{j}^{(k)\mathrm T}\tilde{B}^\mathrm{T}\nonumber \\&+\tilde{B}\hat{\tilde{\Omega }}_{j}^{(k)}\tilde{B}^\mathrm{T}, \end{aligned}$$
(13)

which contains free parameters \(\mu \) and \(\tilde{B}\). In summary, the implementation of the ECM algorithm proceeds as follows:

  • E-step: Given \(\theta =\hat{\theta }^{(k)}\), compute \(\hat{w}_{j}^{(k)}\), \(\hat{\kappa }_{j}^{(k)}\), \(\hat{s}_{1j}^{(k)}\), \(\hat{s}_{2j}^{(k)}\), \(\hat{\tilde{\eta }}_{j}^{(k)}\), \(\hat{\tilde{\zeta }}_{j}^{(k)}\) and \(\hat{\tilde{\Omega }}_{j}^{(k)}\) in (11), for \(j=1,\ldots ,n\).

  • CM-step 1: Update \(\hat{\mu }^{(k)}\) by maximizing (12) over \(\mu \), which leads to

    $$\begin{aligned} \hat{\mu }^{(k+1)}=\frac{\sum _{j=1}^{n}\Big (\hat{w}_{j}^{(k)}y_{j}-\hat{\tilde{B}}^{(k)}\hat{\tilde{\eta }}_{j}^{(k)}\Big )}{\sum _{j=1}^{n}\hat{w}_{j}^{(k)}}. \end{aligned}$$
  • CM-step 2: Given \(\mu =\hat{\mu }^{(k+1)}\), update \(\hat{\tilde{B}}^{(k)}\) by maximizing (12) over \(\tilde{B}\), which gives

    $$\begin{aligned} \hat{\tilde{B}}^{(k+1)}=\left\{ \sum _{j=1}^{n}\left( y_{j}-\hat{\mu }^{(k+1)}\right) \hat{\tilde{\eta }}_{j}^{(k)\mathrm T}\right\} \left( \sum _{j=1}^{n}\hat{\tilde{\Omega }}_{j}^{(k)}\right) ^{-1}. \end{aligned}$$
  • CM-step 3: Given \(\mu =\hat{\mu }^{(k+1)}\) and \(\tilde{B}= \hat{\tilde{B}}^{(k+1)}\), update \(\hat{D}^{(k)}\) by maximizing (12) over \(D\), which leads to

    $$\begin{aligned} \hat{D}^{(k+1)}=\frac{1}{n}\mathrm {Diag}\left( \sum _{j=1}^{n}\hat{\tilde{\Upsilon }}_{j}^{(k)}\right) . \end{aligned}$$

    where \(\hat{\tilde{\Upsilon }}_{j}^{(k)}\) is given by (13) with \(\mu \) and \(\tilde{B}\) replaced by \(\hat{\mu }^{(k+1)}\) and \(\hat{\tilde{B}}^{(k+1)}\), respectively.

  • CM-step 4: Update \(\hat{\lambda }^{(k)}\) by maximizing (12) over \(\lambda \), which gives

    $$\begin{aligned} \hat{\lambda }^{(k+1)}=\frac{\sum _{j=1}^n\left( \hat{\tilde{\zeta }}_{j}^{(k)}-a_{\nu }\hat{\tilde{\eta }}_{j}^{(k)}\right) }{\sum _{j=1}^n\left( \hat{s}_{2j}^{(k)}-2a_{\nu }\hat{s}_{1j}^{(k)}+a_{\nu }^{2}\hat{w}_{j}^{(k)}\right) }. \end{aligned}$$
  • CM-step 5: Calculate \(\hat{\nu }^{(k+1)}\) by maximizing (12) over \(\nu \), which is equivalent to finding the root of the following equation:

    $$\begin{aligned} -\frac{1}{n}\sum _{j=1}^{n}\left\{ \left( -2a^{\prime }_{\nu }\hat{s}_{1j}^{(k)}+2a^{\prime }_{\nu }a_{\nu }\hat{w}_{j}^{(k)}\right) \lambda ^\mathrm{T}\lambda +2a^{\prime }_{\nu }\lambda ^\mathrm{T}\hat{\tilde{\eta }}_{j}^{(k)}\right\}&\\ +\log \left( \frac{\nu }{2}\right) -\mathrm{DG}\left( \frac{\nu }{2}\right) +1+\frac{1}{n}\sum _{j=1}^n \left( \hat{\kappa }_{j}^{(k)}-\hat{w}_{j}^{(k)}\right)&= 0, \end{aligned}$$

    where \(\mathrm{DG}(\cdot )\) denotes the digamma function and

    $$\begin{aligned} a_{\nu }^{\prime }&= \frac{\mathrm{d} a_{\nu }}{\mathrm{d}\nu }=\frac{1}{2} \left( \frac{1}{\pi \nu }\right) ^{1/2}\frac{\Gamma \left( \frac{\nu -1}{2}\right) }{\Gamma \left( \frac{\nu }{2}\right) }+\frac{1}{2}\left( \frac{\nu }{\pi }\right) ^{1/2}\\&\times \frac{\Gamma \left( \frac{\nu -1}{2}\right) }{\Gamma \left( \frac{\nu }{2}\right) }\left\{ \mathrm{DG}\left( \frac{\nu -1}{2}\right) -\mathrm{DG}\left( \frac{\nu }{2}\right) \right\} . \end{aligned}$$

In the above CM-step 5, the R function ‘uniroot’ is employed to obtain the solution for \(\nu \). To facilitate faster convergence, the search range of \(\nu \) is restricted to a maximum of 200, which does not affect the inference when the underlying distribution of the factor scores has a near skew-normal or normal shape. Upon convergence, the ML estimate of \(\theta \) is denoted by \(\hat{\theta }=\left( \hat{\mu }, \hat{B}, \hat{D}, \hat{\lambda }, \hat{\nu }\right) \), where \(\hat{B}=\hat{\tilde{B}}\hat{\Lambda }^{1/2}\) and \(\hat{\Lambda }=I_q+\left( 1-\frac{\hat{\nu }-2}{\hat{\nu }}\hat{a}_{\nu }^2\right) \hat{\lambda }\hat{\lambda }^\mathrm{T}\). Consequently, the estimated factor scores are obtained through conditional prediction as

$$\begin{aligned} \hat{U}_j= E(U_j\mid y_j,\hat{\theta })=\hat{\Lambda }^{-1/2}\hat{C}\Big \{\hat{d}_j+\hat{\lambda }\big (\hat{v}_j-\hat{a}_{\nu }\big )\Big \}, \end{aligned}$$

where \(\hat{v}_j=E(V_j\mid y_j,\hat{\theta })\) can be evaluated via (A.2) with \(\theta \) replaced by \(\hat{\theta }\), and \(\hat{a}_{\nu }\) is \(a_{\nu }\) in (5) with \(\nu \) replaced by \(\hat{\nu }\).
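As an illustration of CM-step 5, the score equation in \(\nu \) can be solved with uniroot as sketched below, reusing a_nu_fun() from the earlier sketch. The search interval \((2.001, 200)\) mirrors the cap of 200 mentioned above but is otherwise an assumption, as are the argument names for the E-step quantities.

```r
# Sketch of CM-step 5: root of the score equation in nu, given the E-step
# quantities w, kappa, s1 (n-vectors), eta (n x q matrix) and the current
# skewness vector lambda.
a_nu_prime <- function(nu) {                 # derivative of a_nu in Eq. (5)
  g <- gamma((nu - 1) / 2) / gamma(nu / 2)
  0.5 / sqrt(pi * nu) * g +
    0.5 * sqrt(nu / pi) * g * (digamma((nu - 1) / 2) - digamma(nu / 2))
}

nu_score <- function(nu, w, kappa, s1, eta, lambda) {
  ap <- a_nu_prime(nu); a <- a_nu_fun(nu)
  -mean((-2 * ap * s1 + 2 * ap * a * w) * sum(lambda^2) +
          2 * ap * drop(eta %*% lambda)) +
    log(nu / 2) - digamma(nu / 2) + 1 + mean(kappa - w)
}

# nu_new <- uniroot(nu_score, interval = c(2.001, 200), w = w, kappa = kappa,
#                   s1 = s1, eta = eta, lambda = lambda)$root
```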

We further make some remarks on the implementation of the proposed ECM algorithm.

Remark 1

To assess convergence based on the monotonicity property of the algorithm, we adopt Aitken’s acceleration method (cf. Aitken 1926; Böhning et al. 1994), which outperforms the lack-of-progress criterion and helps avoid premature convergence (McNicholas et al. 2010). Denote by \(l^{(k)}\) the log-likelihood value evaluated at \(\hat{\theta }^{(k)}\). The asymptotic estimate of the log-likelihood at iteration \(k\) can be calculated as

$$\begin{aligned} l^{(k+1)}_{\infty }=l^{(k)}+\frac{1}{1-a^{(k)}}(l^{(k+1)}-l^{(k)}), \end{aligned}$$

where \(a^{(k)}=(l^{(k+1)}-l^{(k)})/(l^{(k)}-l^{(k-1)})\) is called the Aitken acceleration factor. Lindsay (1995) proposed that the algorithm can be considered to have converged when \(l^{(k)}_{\infty }-l^{(k)}<\epsilon \), where \(\epsilon \) is the desired tolerance.
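A minimal R sketch of this stopping rule, under the common convention of declaring convergence once the gap between the asymptotic estimate and the current log-likelihood drops below \(\epsilon \):

```r
# Aitken-based stopping rule: l2, l1, l0 are the log-likelihood values at
# iterations k-1, k and k+1, respectively.
aitken_converged <- function(l2, l1, l0, eps = 1e-6) {
  a     <- (l0 - l1) / (l1 - l2)        # Aitken acceleration factor a^(k)
  l_inf <- l1 + (l0 - l1) / (1 - a)     # asymptotic log-likelihood estimate
  abs(l_inf - l0) < eps
}
```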

Remark 2

Analogous to other iterative optimization procedures, one needs to choose appropriate initial values to avoid divergence or time-consuming computation. Initial estimates for the mean vector, factor loadings and error covariance matrix can be obtained by performing a simple FA fit using the factanal function in R. The resulting estimates are taken as initial values, namely \(\hat{\mu }^{(0)}\), \(\hat{B}^{(0)}\) and \(\hat{D}^{(0)}\), respectively. Next, compute the factor scores via the conditional prediction method. The initial skewness vector \(\hat{\lambda }^{(0)}\) and df \(\hat{\nu }^{(0)}\) are obtained by fitting the rMST distribution to the sample of factor scores via the R package EmSkew (Wang et al. 2009).

Remark 3

A number of information criteria taking the form of a penalized log-likelihood, \(-2\ell _\mathrm{max}+C(n)m\), are used for model selection and determination of \(q\), where \(\ell _\mathrm{max}\) is the maximized log-likelihood and \(m\) is the number of free parameters in the considered model. Five popular criteria are considered in the later analysis: the Akaike information criterion (AIC; Akaike 1973) with \(C(n)=2\), the consistent version of AIC (CAIC; Bozdogan 1987) with \(C(n)=\log (n)+1\), the Bayesian information criterion (BIC; Schwarz 1978) with \(C(n)=\log (n)\), the sample-size adjusted BIC (SABIC; Sclove 1987) with \(C(n)=\log ((n+2)/24)\), and the Hannan–Quinn criterion (HQC; Hannan and Quinn 1979) with \(C(n)=2\log (\log (n))\). When several competing models are compared, the models with smaller values of these criteria are favored on the basis of fit and parsimony.
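The five criteria are simple to tabulate. A sketch, with \(\ell _\mathrm{max}\), \(m\) and \(n\) as defined above:

```r
# Penalized log-likelihood criteria -2*lmax + C(n)*m for model comparison.
ic_table <- function(lmax, m, n) {
  Cn <- c(AIC   = 2,
          CAIC  = log(n) + 1,
          BIC   = log(n),
          SABIC = log((n + 2) / 24),
          HQC   = 2 * log(log(n)))
  -2 * lmax + Cn * m                    # smaller values are favored
}
```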

4 Provision of standard errors

Under regularity conditions (Zacks 1971), the asymptotic covariance matrix of \(\hat{\theta }\) can be approximated by the inverse of the observed information matrix; see also Efron and Hinkley (1978). Specifically, the observed information matrix is defined as

$$\begin{aligned} I(\hat{\theta };y)=-\frac{\partial ^2\ell (\theta ;y)}{\partial \theta \partial \theta ^\mathrm{T}}\Big |_{\theta =\hat{\theta }}. \end{aligned}$$

To obtain \(I(\hat{\theta };y)\) numerically, Jamshidian (1997) suggested using the central difference method. Let \(G=[g_1,\ldots , g_m]\) be an \(m\times m\) matrix with the \(c\)th column given by

$$\begin{aligned} g_c=\frac{s(\theta +h_c e_c;y)-s(\theta -h_c e_c;y)}{2h_c},\quad c=1,\ldots ,m, \end{aligned}$$

where \(s(\theta ;y)=\partial \ell (\theta ;y)/\partial \theta \) is the score vector of \(\ell (\theta ;y)\), \(e_c\) is a unit vector with all of its elements equal to zero except for its \(c\)th element which is equal to \(1\), \(h_c\) is a small number, and \(m\) is the number of parameters in \(\theta \). Explicit expressions for the elements of \(s(\theta ;y)\) are summarized in Supplementary Appendix B.

Since \(G\) may not be symmetric, we suggest using

$$\begin{aligned} \tilde{I}(\hat{\theta };y)&= -\frac{G+G^\mathrm{T}}{2} \end{aligned}$$
(14)

to approximate \(I(\hat{\theta };y)\). The asymptotic standard errors of \(\hat{\theta }\) can be calculated by taking the square roots of the diagonal elements of \([\tilde{I}(\hat{\theta };y)]^{-1}\).

Notably, the inverse of (14) is not always guaranteed to yield proper (positive) standard errors. The parametric bootstrap method (Efron and Tibshirani 1986), although computationally expensive, is often used instead to obtain estimates of the standard errors. Let \(f(y;\hat{\theta })\) be the estimated density (7) obtained from fitting the STFA model to the original data. The calculation of bootstrap standard error estimates consists of the following four steps (a schematic implementation follows the list).

  1. Draw a bootstrap sample \(y^*_1,\ldots ,y^*_n\) from the fitted distribution \(f(y;\hat{\theta })\).

  2. Compute the ML estimates \(\hat{\theta }^*\) by fitting the STFA model to the generated bootstrap sample \(y^*_1,\ldots ,y^*_n\).

  3. Repeat Steps 1 and 2 a large number of times, say \(B\), thereby obtaining the bootstrap replications \(\hat{\theta }^*_1,\ldots ,\hat{\theta }^*_B\).

  4. Estimate the bootstrap standard errors of \(\hat{\theta }\) via the sample standard deviations of \(\hat{\theta }^*_1,\ldots ,\hat{\theta }^*_B\).
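A schematic R version of these four steps, assuming hypothetical helpers rstfa(n, theta) for simulating from \(f(y;\hat{\theta })\) and fit_stfa(y) returning the ML estimate as a numeric vector:

```r
# Parametric bootstrap standard errors for the STFA model (schematic).
boot_se <- function(theta_hat, n, B = 500) {
  reps <- replicate(B, {
    y_star <- rstfa(n, theta_hat)       # Step 1: simulate a bootstrap sample
    fit_stfa(y_star)                    # Step 2: refit the STFA model
  })
  apply(reps, 1, sd)                    # Steps 3-4: SDs over B replications
}
```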

5 Numerical examples

5.1 A simulation study

We conduct a simulation study to examine the performance of the STFA, GHSTFA and GHCSTFA approaches. We implement the alternating expectation conditional maximization (AECM) algorithms described in Murray et al. (2014a, b) for fitting the latter two models. A comparison of some characterizations among the three considered models is summarized in Table 1.

Table 1 Comparison of some characterizations among three different skew-\(t\) factor analysis models

We generate artificial data from the basic FA model with \(p=10\) and \(50\) variables and \(q=2,~3\) and \(4\) factors, while the underlying distribution for the latent factors is non-normal. The presumed parameter values are \(\mu =0\), \(B=\mathrm{Unif}(p,q)\) and \(D=\mathrm{Diag}\{\mathrm{Unif}(p,q)\}\), where \(\mathrm{Unif}(p,q)\) denotes a \(p\times q\) matrix of random numbers drawn from a uniform distribution on the unit interval \((0,1)\). Moreover, the latent factors \(U\) are assumed to follow one of two standardized distributions with varying degrees of skewness and kurtosis: \(\mathrm{Beta}(0.1, 30)\) and the Chi-square distribution with one df (\(\chi ^2_1\)). The population skewness/kurtosis is 6/52 (high) for \(\mathrm{Beta}(0.1, 30)\) and 2.8/12 (mild) for \(\chi ^2_1\). Errors are generated from the MT distribution with zero mean, scale covariance \(D\) and \(\nu =5\).
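A sketch of this data-generating scheme in R is given below. We read \(\mathrm{Diag}\{\mathrm{Unif}(p,q)\}\) as a \(p\times p\) diagonal matrix of uniform draws and generate the MT errors through their normal scale-mixture form; both readings, and the function name, are our own.

```r
# Schematic generator for the simulation design of Sect. 5.1.
gen_data <- function(n, p, q, factor_dist = c("beta", "chisq")) {
  factor_dist <- match.arg(factor_dist)
  B <- matrix(runif(p * q), p, q)                 # B = Unif(p, q)
  D <- diag(runif(p))                             # diagonal uniform D (our reading)
  U <- switch(factor_dist,
              beta  = matrix(rbeta(n * q, 0.1, 30), n, q),
              chisq = matrix(rchisq(n * q, df = 1), n, q))
  U <- scale(U)                                   # standardize the factors
  W <- rgamma(n, shape = 5 / 2, rate = 5 / 2)     # MT errors with nu = 5 ...
  E <- mvtnorm::rmvnorm(n, sigma = D) / sqrt(W)   # ... via the scale mixture
  U %*% t(B) + E                                  # Y = mu + B U + e with mu = 0
}
```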

The sample sizes evaluated range from small \((n=100)\) to moderately large \((n=300)\). The objective of these settings is to see how the performance of the three models varies with the degree of non-normality of the latent factors across different values of \(p\) and \(q\) and the sample size \(n\). Assuming the number of latent factors is known, each simulated data set is fitted with the three considered models. Simulation results are based on 100 repeated Monte Carlo samples.

The numbers of parameters contained in the three models for various \(p\) and \(q\) are listed in Table 2. As can be seen, the model complexity of STFA falls between those of GHSTFA and GHCSTFA. For ease of exposition, the comparisons made in Table 3 are based on the average BIC values together with the frequencies with which a particular model is chosen on the basis of the smallest BIC value. In 23 out of a total of 24 scenarios, the STFA model provides a better fit than the two GH-based approaches. The performance of STFA improves as the degree of non-normality and the sample size increase. The GHSTFA model has the worst fit and is never chosen, as it is penalized more heavily by BIC.

Table 2 Numbers of free parameters involved in three skew-\(t\) factor analysis models
Table 3 Simulation results based on 100 replications

It is important to note that the results are limited to the extent of our simulation experiments. We are certainly not claiming that the STFA model can replace any of the others. A comparison of our proposed model against any of the models in Murray et al. (2013, 2014a, b) is of limited practical value, because each of the models applies to different situations. For example, if the data were simulated from our proposed model, we could specify the design so as to ensure superior results to those based on the models in Murray et al. (2013, 2014a, b). Likewise, if the data were generated from the GHST distribution as in Murray et al. (2014b), then the configuration of the latter could be chosen to make our model perform relatively poorly. This study contributes to providing the user with a wider choice of existing skew factor models to cover the distinct situations that might arise in practice.

5.2 The AIS data set

As an illustration, we apply the proposed technique to the Australian Institute of Sport (AIS) data, which were originally reported by Cook and Weisberg (1994) and subsequently analyzed by Azzalini and Dalla Valle (1996), Azzalini and Capitanio (1999, 2003) and Azzalini (2005), among others. The data set consists of \(p = 11\) physical and hematological measurements on athletes in different sports and is almost evenly split between 102 males and 100 females.

For simplicity of illustration, we focus solely on the \(n=102\) male observations. A summary of the 11 attributes along with their sample skewness and kurtosis is given in Table 4. It is readily seen that most of the attributes are moderately to strongly skewed with heavy tails.

Table 4 An overview of 11 attributes of 102 male athletes of the AIS data


Figure 1 depicts the histograms and corresponding normal quantile plots of the first three factor score estimates obtained from the classical FA with \(q=4\). The factor score estimates are obtained using the “regression” method; see Sect. 9.5 of Johnson and Wichern (2007). The histograms in the left panels indicate that the distributions of the factor scores deviate from normality owing to positive skewness and high excess kurtosis. This feature is also apparent in the normal quantile–quantile plots shown in the right panels. This result motivates us to advocate the STFA model as a proper tool for the analysis of this data set.

Fig. 1

Histograms and corresponding normal quantile plots of the estimated factor scores obtained from fitting FA to \(102\) male athletes of the AIS data

Next, we are interested in comparing the ML results of STFA with those of three of its nested models, namely the FA, tFA and SNFA models. To assess performance against non-nested models, comparisons are also made with the GHSTFA and GHCSTFA approaches. The data have been standardized to have zero mean and unit standard deviation to avoid variables having a greater impact because of their different scales. We fit these models with \(q\) ranging from 3 to 6 using the ECM algorithm developed in Sect. 3. Notice that the choice of maximum \(q = 6\) satisfies the restriction \((p-q)^2\ge (p+q)\) suggested by Eq. (8.5) of McLachlan and Peel (2000).

A summary of the ML fitting results, including the maximized log-likelihood values and the numbers of parameters together with the five model selection indices, is reported in Table 5. From this table, the model selected by the five information criteria is the STFA model with \(q=4\). Table 6 reports the ML solutions of the best chosen model along with the standard errors, in parentheses, obtained using \(500\) bootstrap replications. We found that the estimated skewness parameters are moderately to highly significant, revealing that the joint distribution of the latent factors is skewed. Moreover, the estimated df (\(\hat{\nu }=6.15\)) is quite small, confirming the presence of thick tails.

Table 5 Comparison of ML estimation results on 102 male athletes
Table 6 Summary ML results together with the associated standard errors in parentheses for the best chosen model

Observing the unrotated solution of factor loadings displayed in the third to sixth columns of Table 6, the first factor can be labeled general nutritional status, with a very high loading on lbm, followed by Wt, Ht and bmi. The second factor, which loads heavily on rcc, Hc and Hg, might be called a hematological factor. The third factor can be viewed as an index of overweight assessment, since bmi, ssf and Bfat load highly on it. The fourth factor is not easily interpreted at this point.

The comparison process is also conducted for the original (non-standardized) data. As shown in Supplementary Figure 1, the STFA model still provides the best overall fit. The fit of FA is the worst, indicating the inadequacy of normality assumptions for this data set. It is also noted that all criteria prefer four-factor solutions under all scenarios.

We consider diagnostics to assess the validity of the underlying distributional assumption for \(Y\). For FA, we can use the Mahalanobis-like distance \((y\!-\!\hat{\mu })^\mathrm{T}\hat{\Omega }^{-1}(y\!-\!\hat{\mu })\), which has an asymptotic Chi-square distribution with \(p\) df, so checking the normality assumption can be achieved by constructing a Healy (1968) plot. To assess the goodness of fit of STFA, it follows from Supplementary Proposition 3 that \(f_j=(y_j-\hat{\mu })^\mathrm{T}\hat{\Omega }^{-1}(y_j-\hat{\mu })/p\) follows the \(F(p,\nu )\) distribution for \(j=1,\ldots ,n\). In this case, one can construct another Healy-type plot (or Snedecor’s \(F\) plot) by plotting the cumulative \(F(p,\nu )\) probabilities associated with the ordered values of \(f_j\) against their nominal values \(1/n,2/n,\ldots ,1\). One can then examine whether the corresponding Healy’s plot resembles a straight line through the origin with unit slope; the greater the departure from the 45-degree line, the greater the evidence of a poorly fitting model. Inspecting the Healy’s plots shown in Fig. 2, the STFA plot adheres to the identity line more closely than does that of FA, suggesting that it is appropriate to use a skew and heavy-tailed distribution.
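A sketch of this Healy-type \(F\) plot for STFA, with \(\hat{\mu }\), \(\hat{\Omega }\) and \(\hat{\nu }\) taken from the fitted model; the function name is ours.

```r
# Healy-type (Snedecor's F) plot: cumulative F(p, nu-hat) probabilities of
# the ordered f_j against the nominal values 1/n, 2/n, ..., 1.
healy_f_plot <- function(y, mu_hat, Omega_hat, nu_hat) {
  n  <- nrow(y); p <- ncol(y)
  d  <- sweep(y, 2, mu_hat)                       # centered observations
  fj <- rowSums((d %*% solve(Omega_hat)) * d) / p # f_j of Supplementary Prop. 3
  plot((1:n) / n, pf(sort(fj), p, nu_hat), xlab = "Nominal probability",
       ylab = "Cumulative F probability")
  abline(0, 1)                                    # 45-degree reference line
}
```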

Fig. 2

Healy’s plot for assessing the goodness of fit of fitted models

Figure 3 depicts coordinate-projected scatter plots for each pair of four selected variables, superimposed with the marginal contours obtained by marginalization of the best fitted STFA model. A visual inspection reveals that the fitted contours capture the shape of the scatter pattern satisfactorily. In summary, the STFA model tends to be the more reasonable choice for analyzing this data set.

Fig. 3

Scatter plots of pairs of four selected variables of 102 male AIS athletes and coordinate projected contours

6 Conclusion

We introduce an extension of FA models, called the STFA model, obtained by replacing the normality assumption for the latent factors and errors with a joint rMST distribution, as a new robust tool for dimensionality reduction. The model accommodates asymmetry and heavy tails simultaneously and allows practitioners to analyze data in a wide variety of situations. We have described a four-level hierarchical representation of the STFA model and presented a computationally feasible ECM algorithm for ML estimation within a flexible complete-data framework. We demonstrate our approach with a simulation study and an analysis of the AIS data set. The numerical results show that the STFA model performs relatively well on the experimental data. The computer programs used in the analyses can be downloaded from http://amath.nchu.edu.tw/www/teacher/tilin/STFA.html.

When missing data occur, our algorithm can be easily modified to account for missingness following the schemes proposed in Lin et al. (2006) and Liu and Lin (2014). Given recent advances in computing power and availability, it would be interesting to develop Markov chain Monte Carlo (MCMC) methods (Lin et al. 2009; Lin and Lin 2011) for Bayesian estimation of the STFA model. It is also of interest to consider a finite mixture representation of STFA models. Our initial work on the latter problem has been limited to mixtures of factor analyzers with a skew-normal distribution (Lin et al. 2013).