1 Introduction

Using y for a p×1 observation vector whose expectation E(y) equals the p×1 zero vector \(\mathbf{0}_{p}\), the factor analysis model is expressed as y = Bz + e, with B = \((b_{jk})\) a p-variables × m-factors loading matrix, z the m×1 latent random vector with unit variances, e a p×1 random error vector, and p > m. It is assumed that E(z) = \(\mathbf{0}_{m}\), E(e) = \(\mathbf{0}_{p}\), E(ee′) = Ψ, and z is uncorrelated with e, where \(\boldsymbol {\varPsi} = \operatorname{diag}\{\psi_{1}, \dots, \psi_{p}\}\) is the p×p diagonal matrix whose diagonal elements \(\psi_{1}, \dots, \psi_{p}\) are called unique variances. Then, the covariance matrix of y is expressed as

$$ \boldsymbol{\varSigma} = \mathbf{BRB}'+ \boldsymbol{\varPsi} , $$
(1)

with R=(r kl )=E(zz′) being an m×m factor correlation matrix. The normality assumptions for z and e lead to the log-likelihood

$$ l(\mathbf{B}, \boldsymbol{\varPsi}, \mathbf{R} \mid \mathbf{C}_{\mathbf{yy}}) = -\frac{n}{2}\bigl\{\log \lvert \boldsymbol{\varSigma} \rvert + \operatorname{tr}\bigl(\boldsymbol{\varSigma}^{-1}\mathbf{C}_{\mathbf{yy}}\bigr)\bigr\} , $$
(2)

given a sample covariance matrix \(\mathbf{C}_{\mathbf{yy}} = N^{-1}\mathbf{Y}'(\mathbf{I}_{n} - n^{-1}\mathbf{1}_{n}\mathbf{1}_{n}')\mathbf{Y}\), where Y is the n×p matrix whose rows are the realizations of y′, N is n or n−1, \(\mathbf{I}_{n}\) denotes the n×n identity matrix, and \(\mathbf{1}_{n}\) expresses the n×1 vector of ones; an additive constant not involving the parameters is omitted in (2). Parameter matrices B, Ψ, and R can be estimated by maximizing (2). Factor analysis is further classified into either exploratory factor analysis (EFA) or confirmatory factor analysis (CFA): in the latter, some elements of B are constrained to be zero, while such constraints are not imposed in the former (e.g., Mulaik, 2010).
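For concreteness, the following NumPy sketch (our illustration, not part of the original procedures; the function names are ours) computes \(\mathbf{C}_{\mathbf{yy}}\) and evaluates the log-likelihood (2) under the model covariance (1):

```python
import numpy as np

def sample_cov(Y, N=None):
    """C_yy = N^{-1} Y'(I_n - n^{-1} 1_n 1_n') Y; N defaults to n."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)          # centering equals (I_n - n^{-1} 1 1') Y
    return Yc.T @ Yc / (N if N is not None else n)

def log_likelihood(B, Psi, R, Cyy, n):
    """Log-likelihood (2), omitting its additive constant."""
    Sigma = B @ R @ B.T + Psi        # model covariance (1)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(Sigma, Cyy)))
```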

In the prevailing procedures for maximum likelihood EFA and CFA, (2) is maximized by gradient algorithms such as the Newton–Raphson and quasi-Newton methods (Jennrich & Robinson, 1969; Jöreskog, 1967, 1969; Mulaik, 2010; Yanai & Ichikawa, 2007). A problem with those procedures is that they can give improper solutions having at least one of two properties: (1) the unique variances (i.e., the diagonal elements of Ψ) include zero or a negative value; (2) the factor correlation matrix R includes an element whose absolute value exceeds one (e.g., Anderson & Gerbing, 1984; Gerbing & Anderson, 1987; Kano, 1998; Sato, 1987; Van Driel, 1978).

One purpose of this paper is to prove that such improper solutions are not given by Rubin and Thayer’s (1982) maximum likelihood EFA and CFA procedures using the EM algorithm (Dempster, Laird, & Rubin, 1977) if the sample covariance matrix C yy and the initial matrices of Ψ and R are positive definite. We refer to the EFA and CFA with the EM algorithm as EM-EFA and EM-CFA, respectively, and generally name them EM factor analysis. After a brief introduction of its algorithms in Section 2, we provide some theorems to prove the above fact in Section 3. Furthermore, in Section 4, we present a modified EM algorithm in which the equations used for the proofs are substituted for Rubin and Thayer’s original formulas.

Another purpose is to numerically assess the behavior of EM factor analysis. In Section 5 we report real data examples in which EM factor analysis avoids the improper solutions produced by factor analysis with the prevailing gradient algorithms. We also report the following observation: the EM algorithm with Rubin and Thayer's original formulas can update factor correlation matrices into asymmetric ones, so that the algorithm fails to converge in real computations, where numbers are treated with finite precision by computers; this problem is overcome by using the modified EM algorithm, or an algorithm in which the original formulas are replaced by ones guaranteeing the symmetry of factor correlation matrices. In Section 6, we further report simulation studies to answer, with simulated data, the questions remaining to be elucidated.

2 Rubin and Thayer’s (1982) EM Factor Analysis

In EM factor analysis, the maximization of log-likelihood (2) is attained by initializing B, Ψ, and R and then iterating the so-called E and M steps until convergence is reached. In the E step, it is required to find the expectation of the complete-data log-likelihood

$$ l(\mathbf{B}, \boldsymbol{\varPsi}, \mathbf{R} \mid \mathbf{Y}, \mathbf{Z}) = -\frac{n}{2}\bigl[\log\lvert\boldsymbol{\varPsi}\rvert + \operatorname{tr}\bigl\{\boldsymbol{\varPsi}^{-1}\bigl(\mathbf{C}_{\mathbf{yy}} - \mathbf{B}\mathbf{C}_{\mathbf{zy}} - \mathbf{C}_{\mathbf{zy}}'\mathbf{B}' + \mathbf{B}\mathbf{C}_{\mathbf{zz}}\mathbf{B}'\bigr)\bigr\} + \log\lvert\mathbf{R}\rvert + \operatorname{tr}\bigl(\mathbf{R}^{-1}\mathbf{C}_{\mathbf{zz}}\bigr)\bigr] , $$
(3)

supposing that an n×m factor score matrix Z, whose rows are the realizations of z′, were to be observed, with \(\mathbf{C}_{\mathbf{zy}} = N^{-1}\mathbf{Z}'(\mathbf{I}_{n} - n^{-1}\mathbf{1}_{n}\mathbf{1}_{n}')\mathbf{Y}\) and \(\mathbf{C}_{\mathbf{zz}}\) defined analogously to \(\mathbf{C}_{\mathbf{yy}}\); the expected log-likelihood is maximized over B, Ψ, and R in the M step.

The computations actually required in the E step are simple: obtain

$$ \boldsymbol{\delta} = \boldsymbol{\varSigma}^{-1}\mathbf{BR} , $$
(4)
$$ \boldsymbol{\Delta} = \mathbf{R} - \mathbf{RB}'\boldsymbol{\varSigma}^{-1}\mathbf{BR} , $$
(5)
$$ \mathbf{Q} = \boldsymbol{\Delta} + \boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta} , $$
(6)

from the current B, Ψ, and R (Rubin & Thayer, 1982, Equations (5) and (7)); here, δ contains the weights for the regression of z on y, Δ is the corresponding residual covariance matrix, and Q serves as the conditional expectation of \(\mathbf{C}_{\mathbf{zz}}\) given Y. This E step is common to EFA and CFA, except that the factor correlation matrix R is fixed at \(\mathbf{I}_{m}\) in EFA. The M step, however, is different for each of the two methods.
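Assuming the formulas (4)–(6) as reconstructed above, the E step amounts to a few matrix products; a minimal sketch (our code) is:

```python
def e_step(B, Psi, R, Cyy):
    """E step: delta by (4), Delta by (5), and Q by (6); R = I_m in EFA."""
    Sigma = B @ R @ B.T + Psi
    delta = np.linalg.solve(Sigma, B @ R)    # (4): Sigma^{-1} B R
    Delta = R - R @ B.T @ delta              # (5): R - R B' Sigma^{-1} B R
    Q = Delta + delta.T @ Cyy @ delta        # (6)
    return delta, Delta, Q
```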

In the M step for EFA, B and Ψ are updated into B new and Ψ new with

$$ \mathbf{B}_{\mathrm{new}} = \mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta} \mathbf{Q}^{-1} , $$
(7)
$$ \boldsymbol{\varPsi}_{\mathrm{new}} = \operatorname{diag}\bigl\{\mathbf{C}_{\mathbf{yy}} - \mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta} \mathbf{Q}^{-1}\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\bigr\} , $$
(8)

(Rubin & Thayer, 1982, Equation (10)), where \(\operatorname {diag}\{\cdot\}\) denotes the diagonal matrix whose diagonal elements are those of the parenthesized matrix.
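A sketch of this EFA M step under the same assumptions (our code, consuming the e_step output above):

```python
def m_step_efa(delta, Q, Cyy):
    """M step for EFA: B by (7) and Psi by (8)."""
    QinvdC = np.linalg.solve(Q, delta.T @ Cyy)               # Q^{-1} delta' C_yy
    B_new = QinvdC.T                                         # (7): C_yy delta Q^{-1}
    Psi_new = np.diag(np.diag(Cyy - Cyy @ delta @ QinvdC))   # (8)
    return B_new, Psi_new
```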

In the M step for CFA, B = [β_1, …, β_p]′ and \(\boldsymbol{\varPsi} = \operatorname{diag}\{\psi_{1}, \dots, \psi_{p}\}\) are updated per variable j = 1, …, p (Rubin & Thayer, 1982, Equation (11)). Consider the jth variable with loading vector \(\boldsymbol{\beta}_{j}\) on the m factors. Permute its elements so that \(\mathbf{P}_{j}\boldsymbol{\beta} _{j} = [\boldsymbol{\beta} _{1j}', \boldsymbol{\beta} _{0j}']'\) with \(\mathbf{P}_{j}\) a permutation matrix, where the \(m_{j}\)×1 vector \(\boldsymbol{\beta}_{1j}\) contains the \(m_{j}\) unconstrained loadings to be estimated, and the \((m - m_{j})\)×1 vector \(\boldsymbol{\beta}_{0j}\) consists of the loadings constrained to be zero. Similarly, partition Q (m×m) and \(\boldsymbol{\delta}'\mathbf{C}_{\mathbf{yy}}\) (m×p) so that the \(m_{j}\)×\(m_{j}\) block \((\mathbf{Q})_{1j}\) and the \(m_{j}\)×1 block \((\boldsymbol{\delta}'\mathbf{C}_{\mathbf{yy}})_{1j}\) correspond to the factors with unconstrained loadings for variable j. The vector \(\boldsymbol{\beta}_{1j}\) and \(\psi_{j}\) are updated into \(\boldsymbol{\beta }_{1j}^{(\mathrm{new})}\) and \(\psi_{j}^{(\mathrm{new})}\) with

$$ \boldsymbol{\beta}_{1j}^{(\mathrm{new})} = (\mathbf{Q})_{1j}^{-1}\bigl(\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\bigr)_{1j} , $$
(9)
$$ \psi_{j}^{(\mathrm{new})} = (\mathbf{C}_{\mathbf{yy}})_{j} - \bigl(\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\bigr)_{1j}'(\mathbf{Q})_{1j}^{-1}\bigl(\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\bigr)_{1j} , $$
(10)

where \((\mathbf{C}_{\mathbf{yy}})_{j}\) is the jth diagonal element of \(\mathbf{C}_{\mathbf{yy}}\). Further, the factor correlation matrix R is updated into

$$ \mathbf{R}_{\mathrm{new}} = \operatorname{diag}\{\mathbf{Q}\}^{-1/2}\mathbf{Q}\operatorname{diag} \{\mathbf{Q}\}^{-1/2} , $$
(11)

with \(\operatorname{diag}\{\mathbf{Q}\}^{-1/2}\operatorname{diag}\{ \mathbf{Q}\} ^{-1/2} = \operatorname{diag}\{\mathbf{Q}\}^{-1}\).
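The per-variable CFA update can be sketched as follows (our code; the argument `free`, with `free[j]` listing the factors carrying unconstrained loadings for variable j, is our encoding of the permutation/selection in the text):

```python
def m_step_cfa(delta, Q, Cyy, free):
    """M step for CFA: (9)-(10) per variable, then R by (11)."""
    p, m = Cyy.shape[0], Q.shape[0]
    dC = delta.T @ Cyy                        # delta' C_yy (m x p)
    B_new = np.zeros((p, m))
    psi_new = np.zeros(p)
    for j in range(p):
        f = list(free[j])                     # indices of unconstrained loadings
        Qj = Q[np.ix_(f, f)]                  # (Q)_{1j}
        dCj = dC[f, j]                        # (delta' C_yy)_{1j}
        beta = np.linalg.solve(Qj, dCj)       # (9)
        B_new[j, f] = beta
        psi_new[j] = Cyy[j, j] - dCj @ beta   # (10)
    d = np.sqrt(np.diag(Q))
    R_new = Q / np.outer(d, d)                # (11)
    return B_new, np.diag(psi_new), R_new
```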

In the log-likelihood (3) for complete data, the positive definiteness of Ψ and R is supposed. However, this supposition does not guarantee that they are estimated as positive definite matrices. We thus prove that Ψ and R can always be updated into positive definite Ψ new and R new, which implies that no improper solution occurs when C yy and the initial Ψ and R are positive definite.

3 Theorems

The positive definiteness of the sample covariance matrix \(\mathbf{C}_{\mathbf{yy}}\) is not required by formulas (4) to (11) or by the log-likelihood (2). In the following lemma and theorems, we thus consider not only positive definite \(\mathbf{C}_{\mathbf{yy}}\) but also the cases where \(\mathbf{C}_{\mathbf{yy}}\) is not positive but rather nonnegative definite.

Lemma 1

If Ψ and R are positive definite and C yy is nonnegative definite, then Δ and Q are positive definite.

Proof

The positive definiteness of Ψ and R guarantees that the inverse of (1) exists. Using this in (5), we have \(\boldsymbol{\Delta} = \mathbf{R} - \mathbf{RB}'(\mathbf{BRB}' + \boldsymbol{\varPsi})^{-1}\mathbf{BR}\), which can be rewritten as

$$ \boldsymbol{\Delta} = \mathbf{R}^{1/2\,\prime}\bigl(\mathbf{I}_{m} + \mathbf{R}^{1/2}\mathbf{B}'\boldsymbol{\varPsi}^{-1}\mathbf{BR}^{1/2\,\prime}\bigr)^{-1}\mathbf{R}^{1/2} , $$
(12)

where \(\mathbf{R}^{1/2}\) is the m×m positive definite matrix satisfying \(\mathbf{R}^{1/2\,\prime}\mathbf{R}^{1/2} = \mathbf{R}\); here we have used the SMW formula described in Appendix A. The nonnegative definiteness of \(\mathbf{R}^{1/2}\mathbf{B}'\boldsymbol{\varPsi}^{-1}\mathbf{BR}^{1/2\,\prime}\) thus leads to the positive definiteness of Δ. This fact and the nonnegative definiteness of \(\boldsymbol{\delta}'\mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta}\) imply the positive definiteness of (6). □
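Appendix A is not reproduced here; the SMW (Sherman–Morrison–Woodbury) formula relied on above is the standard identity

$$ (\mathbf{A} + \mathbf{UCV})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\bigl(\mathbf{C}^{-1} + \mathbf{VA}^{-1}\mathbf{U}\bigr)^{-1}\mathbf{VA}^{-1} , $$

which, applied to \(\boldsymbol{\varSigma}^{-1} = (\boldsymbol{\varPsi} + \mathbf{BRB}')^{-1}\) in \(\boldsymbol{\Delta} = \mathbf{R} - \mathbf{RB}'\boldsymbol{\varSigma}^{-1}\mathbf{BR}\), collapses Δ to \((\mathbf{R}^{-1} + \mathbf{B}'\boldsymbol{\varPsi}^{-1}\mathbf{B})^{-1}\); inserting \(\mathbf{R} = \mathbf{R}^{1/2\,\prime}\mathbf{R}^{1/2}\) then gives (12).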

Theorem 1

If Ψ is positive definite and C yy is nonnegative definite, then Ψ new is nonnegative definite in EM-EFA. Moreover, if both Ψ and C yy are positive definite, then Ψ new is positive definite in EM-EFA.

Proof

Lemma 1 implies that the inverse of (6) exists. Substituting this inverse into (8), we can rewrite it as

$$ \boldsymbol{\varPsi}_{\mathrm{new}} = \operatorname{diag}\bigl\{\mathbf{C}_{\mathbf{yy}}^{1/2\,\prime}\bigl(\mathbf{I}_{p} - \mathbf{C}_{\mathbf{yy}}^{1/2}\boldsymbol{\delta} \mathbf{Q}^{-1}\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}^{1/2\,\prime}\bigr)\mathbf{C}_{\mathbf{yy}}^{1/2}\bigr\} , $$
(13)

with \(\mathbf{C}_{\mathbf{yy}}^{1/2}\) the p×p nonnegative definite matrix satisfying \(\mathbf{C}_{\mathbf{yy}}^{1/2\,\prime}\mathbf{C}_{\mathbf {yy}}^{1/2} = \mathbf{C}_{\mathbf{yy}}\). We can further use the SMW formula (Appendix A) to rewrite (13) as

$$ \boldsymbol{\varPsi} _{\mathrm{new}} = \operatorname{diag}\bigl\{\mathbf{C}_{\mathbf{yy}}^{1/2\,\prime }\bigl(\mathbf{I}_{p} + \mathbf{C}_{\mathbf{yy}}^{1/2}\boldsymbol{\delta\Delta} ^{-1}\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}^{1/2\,\prime}\bigr)^{-1}\mathbf {C}_{\mathbf{yy}} ^{1/2}\bigr\} , $$
(14)

which is nonnegative definite. Moreover, if C yy is positive definite, then (14) is positive definite. □

Theorem 1 shows the following fact for EM-EFA: if unique variances are initialized at positive values, the resulting solution is always proper for a positive definite sample covariance matrix; however, unique variances can be estimated at zero, though never at negative values, for a sample covariance matrix that is not positive but rather nonnegative definite.

Theorem 2

If R and Ψ are positive definite and C yy is nonnegative definite, then Ψ new is nonnegative definite in EM-CFA. Moreover, if R, Ψ, and C yy are positive definite, then Ψ new is positive definite in EM-CFA.

Proof

The scalar \((\mathbf{C}_{\mathbf{yy}})_{j}\), the \(m_{j}\)×1 vector \((\boldsymbol{\delta}'\mathbf{C}_{\mathbf{yy}})_{1j}\), and the \(m_{j}\)×\(m_{j}\) matrix \((\mathbf{Q})_{1j}\) in (10) can be rewritten as \(\mathbf{e}_{j}'\mathbf{C}_{\mathbf{yy}}\mathbf{e}_{j}\), \(\mathbf{H}_{j}'\boldsymbol{\delta} '\mathbf{C}_{\mathbf {yy}}\mathbf{e}_{j}\), and \(\mathbf{H}_{j}'\mathbf{QH}_{j}\), respectively. Here, \(\mathbf{e}_{j}\) denotes the jth column of \(\mathbf{I}_{p}\), and \(\mathbf{H}_{j}\) is the m×\(m_{j}\) binary matrix satisfying \(\boldsymbol{\beta}_{j} = \mathbf{H}_{j}\boldsymbol{\beta}_{1j}\); for example, if m = 2 and \(\boldsymbol{\beta}_{j} = \boldsymbol{\beta}_{1j} = [b_{j1}, b_{j2}]'\) (both loadings unconstrained), then \(\mathbf{H}_{j} = \mathbf{I}_{2}\); and if \(\boldsymbol{\beta}_{j} = [0, b_{j2}]'\) and \(\boldsymbol{\beta}_{1j} = [b_{j2}]\) (a scalar), then \(\mathbf{H}_{j} = [0, 1]'\), with \(b_{jk}\) and 0 standing for unconstrained and constrained loadings, respectively. Therefore, (10) is rewritten as

$$ \psi_{j}^{(\mathrm{new})}= \mathbf{e}_{j}'\mathbf{C}_{\mathbf {yy}}\mathbf{e}_{j }- \mathbf{e}_{j}'\mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta} \mathbf{H}_{j}\bigl(\mathbf{H}_{j}'\mathbf{QH}_{j}\bigr )^{-1}\mathbf{H}_{j}'\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\mathbf{e}_{j} . $$
(15)

Here, it should be noted that the positive definiteness of R and Ψ guarantees that the inverse of \(\mathbf{H}_{j}'\mathbf{QH}_{j}\) exists; this follows from the fact that the positive definiteness of Q (Lemma 1) and \(\mathbf{H}_{j}\) being of full column rank by definition imply the positive definiteness of \(\mathbf{H}_{j}'\mathbf{QH}_{j}\) (e.g., Lütkepohl, 1996, p. 152).

Substituting (6) into (15), we have

$$ \psi_{j}^{(\mathrm{new})} = \mathbf{e}_{j}'\mathbf{C}_{\mathbf{yy}}\mathbf{e}_{j} - \mathbf{e}_{j}'\mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta} \mathbf{H}_{j}\bigl(\mathbf{H}_{j}'\boldsymbol{\Delta} \mathbf{H}_{j} + \mathbf{H}_{j}'\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta} \mathbf{H}_{j}\bigr)^{-1}\mathbf{H}_{j}'\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\mathbf{e}_{j} . $$
(16)

The positive definiteness of Δ (Lemma 1) and H j being of full column rank imply that \(\mathbf{H}_{j}'\boldsymbol{\Delta} \mathbf{H}_{j}\) is positive definite. Using this property and the SMW formula (Appendix A), (16) is further rewritten as

$$ \psi_{j}^{(\mathrm{new})} = \mathbf{e}_{j}'\mathbf{C}_{\mathbf{yy}}^{1/2\,\prime}\bigl\{ \mathbf{I}_{p} + \mathbf{C}_{\mathbf{yy}}^{1/2}\boldsymbol{\delta} \mathbf{H}_{j}\bigl(\mathbf{H}_{j}'\boldsymbol{\Delta} \mathbf{H}_{j}\bigr)^{-1}\mathbf{H}_{j}'\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}^{1/2\,\prime}\bigr\}^{-1} \mathbf{C}_{\mathbf{yy}}^{1/2}\mathbf{e}_{j} . $$
(17)

The nonnegative definiteness of \(\mathbf{C}_{\mathbf{yy}}^{1/2}\boldsymbol{\delta} \mathbf{H}_{j}(\mathbf{H}_{j}'\boldsymbol{\Delta} \mathbf{H}_{j})^{-1}\mathbf{H}_{j}'\boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}^{1/2\,\prime}\) implies that (17) is nonnegative. Moreover, if C yy is positive definite, then (17) is positive. □

Theorem 3

If R and Ψ are positive definite and C yy is nonnegative definite, then R new is a positive definite correlation matrix.

Proof

Lemma 1 shows the positive definiteness of Q and, thus, that of \(\operatorname{diag}\{\mathbf{Q}\}\). Further, Q is symmetric, as found in (6). These facts imply that R new in (11) is a symmetric positive definite matrix whose diagonal elements are unity and whose off-diagonal elements take values within the range (−1, 1). □

Theorems 2 and 3 show the following facts for EM-CFA: if unique variances and a factor correlation matrix are initialized at positive values and at a positive definite matrix, respectively, then the resulting solution is always proper for a positive definite sample covariance matrix; however, unique variances can be estimated at zero, though never at negative values, for a sample covariance matrix that is not positive but rather nonnegative definite. Further, with the same initialization, factor correlations are properly estimated even for such a merely nonnegative definite sample covariance matrix.

Theorems 1, 2, and 3 imply that EM factor analysis is a constrained estimation procedure whose solutions are restricted to proper ones for sample covariance and initial parameter matrices satisfying the above conditions. On the other hand, factor analysis with gradient algorithms is an unconstrained estimation procedure, although a constrained variant exists in which the unique variance \(\psi_{j}\) is restricted to nonnegative values by treating the square root of \(\psi_{j}\) as the parameter to be estimated (e.g., SPSS Inc., 1997). Note that the constraint in this variant is weaker than that in EM factor analysis, which restricts \(\psi_{j}\) to be positive, not merely nonnegative, for \(\mathbf{C}_{\mathbf{yy}}\) and initial R and Ψ that are positive definite.

Besides the properness of unique variances and factor correlations, we can add the following theorem on the loading matrices in EM-EFA:

Theorem 4

If Ψ and C yy are positive definite and \(\operatorname{rank}(\mathbf{B}) = m\), then \(\operatorname{rank}(\mathbf{B}_{\mathrm{new}}) = m\) in EM-EFA, where \(\operatorname{rank}(\mathbf{B})\) denotes the rank of B.

Proof

Substituting (4) into (7) while setting R = \(\mathbf{I}_{m}\), we have \(\mathbf{B}_{\mathrm{new}} = \mathbf{C}_{\mathbf{yy}}\boldsymbol{\varSigma}^{-1}\mathbf{BQ}^{-1}\). The positive definiteness of Ψ and Lemma 1 imply \(\operatorname{rank}(\boldsymbol{\varSigma} ^{-1}) = p\) and \(\operatorname{rank} (\mathbf{Q}^{-1}) = m\). Furthermore, the positive definiteness of \(\mathbf{C}_{\mathbf{yy}}\) implies \(\operatorname{rank}(\mathbf{C}_{\mathbf{yy}}) = p\). Here, we can use the fact that \(\operatorname{rank}(\mathbf{AW}) = \operatorname{rank}(\mathbf{A}) = \operatorname{rank}(\mathbf{W}'\mathbf{A}')\) holds for any matrix A if W is square and of full rank (e.g., Magnus & Neudecker, 1991, p. 8). That is, \(\mathbf{C}_{\mathbf{yy}}\boldsymbol{\varSigma}^{-1}\) and \(\mathbf{Q}^{-1}\) are square and of full rank, which leads to \(\operatorname{rank}(\mathbf{C}_{\mathbf{yy}} \boldsymbol{\varSigma} ^{-1}\mathbf{B}) = \operatorname{rank}(\mathbf{B}) = \operatorname{rank}(\mathbf {C}_{\mathbf{yy}} \boldsymbol{\varSigma} ^{-1}\mathbf{BQ}^{-1}) = \operatorname{rank}(\mathbf{B}_{\mathrm {new}})\). □

The above theorem shows that if the initial unique variances and loading matrix are positive and of full column rank, respectively, EM-EFA always gives a loading matrix of full column rank for a positive definite sample covariance matrix.
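As a quick numerical illustration of Theorems 1 and 4 (our script, reusing the sketches above; the parameter values are arbitrary):

```python
rng = np.random.default_rng(1)
p, m, n = 10, 3, 200
B0 = rng.standard_normal((p, m))                 # full column rank (a.s.)
Psi0 = np.diag(rng.uniform(0.2, 1.0, p))         # positive definite
R0 = np.eye(m)                                   # EFA: R fixed at I_m
Y = rng.multivariate_normal(np.zeros(p), B0 @ R0 @ B0.T + Psi0, size=n)
Cyy = sample_cov(Y)                              # positive definite (a.s.) since n > p
delta, Delta, Q = e_step(B0, Psi0, R0, Cyy)
B1, Psi1 = m_step_efa(delta, Q, Cyy)
assert np.all(np.diag(Psi1) > 0)                 # Theorem 1: proper Psi_new
assert np.linalg.matrix_rank(B1) == m            # Theorem 4: full column rank
```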

4 Modified Algorithm

In order to prove the lemma and theorems in the last section, we rewrote Rubin and Thayer's (1982) original formulas (5), (8), and (10) as Equations (12), (14), and (17), respectively, which explicitly guarantee that EM factor analysis gives proper solutions when \(\mathbf{C}_{\mathbf{yy}}\) and the initial R and Ψ are proper. We can thus consider the EM algorithm in which (12), (14), and (17) are substituted for (5), (8), and (10), respectively. This modified EM algorithm follows the steps below; a sketch in code is given after the list.

Step 1. Initialize B, R, and Ψ so that R and Ψ are positive definite and B is of full column rank.

Step 2. Obtain δ by (4), Δ by (12), and Q by (6).

Step 3. Update Ψ with (14) for EFA and (17) for CFA.

Step 4. Update B with (7) for EFA and (9) for CFA.

Step 5. Update R with (11) for CFA.

Step 6. Finish if convergence is reached; otherwise, go back to Step 2.

Here, R is fixed at I m for EFA.
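A minimal NumPy sketch of Steps 1–6 for EFA (our code; initial values are arbitrary), computing Δ via the closed form that (12) reduces to and Ψ via (14), under the assumption that \(\mathbf{C}_{\mathbf{yy}}\) is positive definite:

```python
def modified_em_efa(Cyy, m, max_iter=100000, tol=1e-10):
    """Modified EM algorithm for EFA (R fixed at I_m)."""
    p = Cyy.shape[0]
    S = np.linalg.cholesky(Cyy).T                # C^{1/2} with C^{1/2}' C^{1/2} = C_yy
    rng = np.random.default_rng(0)
    B = rng.standard_normal((p, m))              # Step 1: full column rank (a.s.)
    psi = np.ones(p)                             # Step 1: positive unique variances
    for _ in range(max_iter):
        Sigma = B @ B.T + np.diag(psi)
        delta = np.linalg.solve(Sigma, B)                            # (4), R = I_m
        Delta = np.linalg.inv(np.eye(m) + B.T @ (B / psi[:, None]))  # (12)
        Q = Delta + delta.T @ Cyy @ delta                            # (6)
        M = np.eye(p) + S @ delta @ np.linalg.solve(Delta, delta.T) @ S.T
        psi_new = np.diag(S.T @ np.linalg.solve(M, S))               # (14)
        B_new = np.linalg.solve(Q, delta.T @ Cyy).T                  # (7)
        if max(np.abs(B_new - B).max(), np.abs(psi_new - psi).max()) < tol:
            return B_new, psi_new                                    # Step 6
        B, psi = B_new, psi_new                                      # back to Step 2
    return B, psi
```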

The modified EM algorithm and the original one in Section 2 are mathematically equivalent; thus, the former algorithm also increases log-likelihood (2) monotonically with the parameter updates, according to the principle of the EM algorithm (Dempster et al., 1977). However, the two algorithms are not necessarily guaranteed to behave equivalently in real computations, where numbers are treated with finite precision by computers. Under such conditions, the modified EM algorithm, whose formulas were used for proving the theorems in Section 3, can be expected to follow those theorems more faithfully than the original one, whose formulas do not explicitly incorporate the facts given by the theorems. Their different behaviors are reported in the second example in the next section.

5 Real Data Examples

We compare the original EM, the modified EM, and gradient algorithms for EFA and CFA, using Maxwell's (1961, p. 55) 10×10 correlation matrix. All computations in this paper are performed with the computer programs detailed in Appendix B: gradient algorithms are run using the popular program package SAS (SAS Institute Inc., 2009), while the EM algorithms are performed by our own FORTRAN programs, starting with initial Ψ and R that are positive definite and an initial B of full column rank (with R = \(\mathbf{I}_{m}\) in EFA).

First, we compared the three algorithms in EFA with m = 4. As found in the first row of results presented in Table 1, the unique variance of Variable 8 estimated by the gradient algorithm was far smaller than zero. On the other hand, the original and modified EM algorithms gave identical four-factor solutions with proper unique variances, shown in the second row of results in Table 1, and a loading matrix of full column rank.

Table 1. Unique variances obtained by EFA with gradient and EM algorithms for Maxwell’s (1961) data.

Next, we performed CFA with m = 3 and the loadings \(b_{41}, \dots, b_{10,1}, b_{12}, b_{22}, b_{62}, \dots, b_{10,2}\), and \(b_{13}, \dots, b_{53}\) constrained to be zero. As a result, the gradient algorithm produced an improper solution with a factor correlation \(r_{21} > 1\), as shown in Table 2. On the other hand, the modified EM algorithm gave a proper one with the correlations shown in Table 2, positive unique variances, and a loading matrix of full column rank. However, the original EM algorithm for CFA failed to converge, with log-likelihood (2) actually decreasing at the 74th iteration. What happened in this algorithm is detailed in the next paragraph.

Table 2. Factor correlation matrices obtained with CFA with gradient and modified EM algorithms for Maxwell’s (1961) data.

We found that the failure of the original EM algorithm originated when \(\mathbf{RB}'\boldsymbol{\varSigma}^{-1}\mathbf{BR}\) in (5) became asymmetric, which led to the asymmetry of the factor correlation matrix R through updates (5), (6), and (11). Although it seems strange that \(\mathbf{RB}'\boldsymbol{\varSigma}^{-1}\mathbf{BR} = (\mathbf{BR})'\boldsymbol{\varSigma}^{-1}\mathbf{BR}\) can be asymmetric, it did occur in our computer program, where (BR)′ was post-multiplied by \(\boldsymbol{\varSigma}^{-1}\mathbf{BR}\); the resulting \((\mathbf{BR})'\boldsymbol{\varSigma}^{-1}\mathbf{BR}\) is not guaranteed to be symmetric in real computation, since BR and \(\boldsymbol{\varSigma}^{-1}\mathbf{BR}\) are different matrices. Indeed, the averaged asymmetries \(\operatorname {AS}(\mathbf{R}) = 2\sum_{k<l}|r_{kl} - r_{lk}|/\{m(m-1)\}\) and \(\operatorname{AS}(\mathbf{RB}'\boldsymbol{\varSigma} ^{-1}\mathbf{BR})\) were 0.203 and 0.074, respectively, at the 74th iteration, which is due to limited numerical precision as described in Appendix C. Figure 1 shows the changes in averaged asymmetry. The increase of asymmetry with the iterations found in Figure 1 is considered to have been caused by \(\mathbf{RB}'\boldsymbol{\varSigma}^{-1}\mathbf{BR}\) being a function of R: the asymmetry of the latter made the former more asymmetric, which further increased the asymmetry of R through (5), (6), and (11). We can thus consider the following scenario leading to the decrease in (2): at the 74th iteration, R became so asymmetric as to deviate from a correlation matrix, which led to the deviation of \(l(\mathbf{B}, \boldsymbol{\varPsi}, \mathbf{R} \mid \mathbf{C}_{\mathbf{yy}})\) from the proper log-likelihood whose monotone increase with the updates the EM algorithm guarantees.

Figure 1. Averaged asymmetry of \(\mathbf{RB}'\boldsymbol{\varSigma}^{-1}\mathbf{BR}\) and R as a function of the number of iterations in the original EM algorithm.

The above considerations are supported by the following fact: the original EM algorithm in which (6) was replaced by its symmetrized counterpart

$$ \mathbf{Q} = \frac{1}{2}\bigl\{\bigl(\boldsymbol{\Delta} + \boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta} \bigr) + \bigl(\boldsymbol{\Delta} + \boldsymbol{\delta} '\mathbf{C}_{\mathbf{yy}}\boldsymbol{\delta} \bigr)'\bigr\} $$
(6′)

to guarantee the symmetry of Q and R yielded the same proper solution that the modified EM algorithm gave for the data set. We refer to the above algorithm with (6′) as the symmetric original EM algorithm, while continuing to call the one with (6) simply the original EM algorithm.
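In code, the symmetrization amounts to one extra line; a sketch under the reconstruction of (6′) above:

```python
Qtilde = Delta + delta.T @ Cyy @ delta   # (6), possibly asymmetric in finite precision
Q = 0.5 * (Qtilde + Qtilde.T)            # (6'): enforce symmetry
d = np.sqrt(np.diag(Q))
R_new = Q / np.outer(d, d)               # (11) now yields an exactly symmetric R_new
```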

6 Simulation Studies

In this section, after describing the purposes and outline of our simulation studies, we report on the studies for EFA and CFA, which are followed by a summary of the major results.

6.1 Purposes and Outline

The observations in the last section raise the following questions: [Q1] How often does the original EM algorithm for CFA fail to converge? [Q2] Can the modified EM algorithm behave differently from the symmetric original EM algorithm? Also of interest is [Q3] how EM factor analysis behaves when sample covariance matrices are not positive but nonnegative definite, since such matrices were considered in Section 3. Furthermore, we should study [Q4] how well the true parameters are recovered, and [Q5] how many iterations are needed until convergence is reached, since the properness of solutions does not necessarily imply goodness of recovery, and convergence is known to be generally slow in EM algorithms (e.g., Minami, 2004).

In order to answer the above questions and demonstrate the results presented in Section 3, we compare the original EM, symmetric original EM, modified EM, and gradient algorithms in simulation studies, where sample correlation matrices are synthesized and analyzed under the proper (P), likely improper (L), true improper (T), and singular correlation (S) conditions outlined in Table 3. In Condition P, the model parameters are set at values unlikely to lead to improper solutions. Each of the other conditions is defined by changing one procedure in Condition P. Procedures under which gradient algorithms are likely to produce improper solutions are used in Condition L, which is divided into three sub-conditions. In Condition T, a true parameter is set at an improper value, so that EM factor analysis necessarily produces solutions that are proper but, in this sense, incorrect. Positive definite sample correlation matrices are synthesized in P, L, and T, but in Condition S, data matrices with n < p (more variables than observations) give singular correlation matrices that are not positive but rather nonnegative definite. In Condition S, only EM factor analysis is performed, as singular matrices cannot be analyzed by the gradient algorithms used in this paper. The studies for EFA and CFA are reported in the next subsections.

Table 3. Setting of true parameters or matrix sizes specifying conditions in simulation studies.

6.2 Exploratory Factor Analysis

In Condition P, we set the true B (12×3) and Ψ as in Panel (A) of Table 4 with R = \(\mathbf{I}_{3}\), and synthesized 180×12 data matrices Y whose rows were sampled from the multivariate normal distribution with mean vector \(\mathbf{0}_{12}\) and covariance matrix (1). The EFA algorithms were applied to the correlation matrix \(\operatorname{diag}(\mathbf{C}_{\mathbf{yy}})^{-1/2}\mathbf {C}_{\mathbf{yy}}\operatorname{diag} (\mathbf{C}_{\mathbf{yy}})^{-1/2}\) with the estimated number of factors \(\hat{m}\) set at m = 3. The procedures in the other conditions were the same as those in Condition P, except for the one described next. The true parameters were set as in Panel (B) of Table 4, with \(\psi_{1} = -0.05\), in Condition T, and the number of rows in Y was set to 11 (< p = 12) in S. Condition L is divided into LE1, LE2, and LE3: the true parameters were set as in Panel (B) of Table 4 with \(\psi_{1} = 0.02\) in Condition LE1; they were set as in Panel (C) of Table 4 in LE2; and EFA was performed with \(\hat{m}= 4\) (> true m = 3) in LE3. It is known that the EFA model is not uniquely identified in Condition LE2 (Anderson & Rubin, 1956) and that gradient algorithms tend to produce improper solutions in LE1, LE2, and LE3 (Kano, 1998; Sato, 1987; Van Driel, 1978).
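A sketch of this data synthesis (our code, reusing sample_cov from Section 1):

```python
def synthesize_corr(B, Psi, R, n, rng):
    """Draw an n x p data matrix with rows from N(0_p, BRB' + Psi) and
    return the sample correlation matrix diag(C)^{-1/2} C diag(C)^{-1/2}."""
    Sigma = B @ R @ B.T + Psi
    Y = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    C = sample_cov(Y)
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)
```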

Table 4. True loadings and unique variances used in simulation studies.

In each of Conditions P, LE1, LE2, LE3, S, and T, 20 data sets (correlation matrices) were synthesized and analyzed. Each estimated loading matrix was rotated so that it was optimally matched to the corresponding true B, as described in Appendix D. Let \(\hat{\mathbf{B}} = (\hat{b}_{jk})\) and \(\hat{\psi} _{j}\) denote the resulting loading matrix and unique variance, respectively. In Appendix D, it is also described how B was redefined so that its size equals that of \(\hat{\mathbf{B}}\) in Condition LE3.

Every algorithm produced \(\hat{\mathbf{B}}\) of full column rank for all data sets. The other results are summarized in Table 5; those for the symmetric original and modified EM algorithms are not shown, since these two always gave the same solution as the original EM algorithm. In the table, the bold font in gray cells indicates the larger of the two algorithms' values. Panel (A) in Table 5 shows the percentages of the data sets for which improper solutions with \(\hat{\psi} _{j}\leq0\) arose. There, we find that the EM algorithm always gave positive \(\hat{\psi} _{j}\), while the gradient algorithm produced improper solutions in Conditions L and T.

Table 5. Percentages of improper solutions, averaged MAE (mean absolute error), averaged CN (condition number), and averaged number of iterations for EFA with gradient and original EM (O-EM) algorithms.

The recovery of B and that of Ψ were assessed by the mean absolute errors \(\operatorname{MAE}(\hat{\mathbf{B}}) = (p\hat{m})^{-1}\sum_{j = 1}^{p} \sum_{k = 1}^{\hat{m}} | \hat{b}_{jk} - b_{jk}|\) and \(\operatorname{MAE}(\hat{\boldsymbol{\varPsi}} )=p^{-1}\sum_{j = 1}^{p} | \hat{\psi} _{j} -\psi_{j}|\), whose averages over data sets are presented in Table 5, Panels (B) and (C). We further obtained the condition number of \(\hat{\mathbf{B}}\), CN(\(\hat{\mathbf{B}}\)) (the ratio of the largest to the smallest singular value of \(\hat{\mathbf{B}}\)), to assess the nearness of \(\hat{\mathbf{B}}\) to rank deficiency. The average of CN(\(\hat{\mathbf{B}}\)) is shown in Panel (D) of Table 5, with CN(B) (the condition number of the true B) being 1.129 in Conditions LE1 and T, 1.581 in LE2, and 1.0 in the others (P, LE3, and S). In Table 5, we find that the gradient algorithm recovered parameters more poorly and gave \(\hat{\mathbf{B}}\) nearer to rank deficiency than the EM algorithm in LE2 and LE3, while the EM algorithm recovered parameters more poorly in Condition T, where that algorithm incorrectly gave proper solutions. The averages of MAE(\(\hat{\boldsymbol{\varPsi}} \)) in LE2 and LE3 for the gradient algorithm were very large, owing to the fact that \(\psi_{j}\) was estimated to be far smaller than zero for some data sets. Such a solution of the gradient algorithm (with \(\hat{\psi} _{11}= -312.25\)) in LE3 is presented in Table 6, where CN(\(\hat{\mathbf{B}}\)) shows that \(\hat{\mathbf{B}}\) for the gradient algorithm was nearer to rank deficiency than \(\hat{\mathbf{B}}\) for the EM algorithm.
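The evaluation measures can be computed as follows (our helpers; the names are not from the paper, and mae is the plain EFA version without the per-loading weights used later for CFA):

```python
def mae(est, true):
    """Mean absolute error over all entries."""
    return np.mean(np.abs(est - true))

def cn(A):
    """Condition number: largest over smallest singular value."""
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]
```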

Table 6. True parameters, the solutions of EFA with gradient and EM algorithms, and their MAE and CN values for a data set in Condition LE3.

Table 5, Panel (E), shows that the gradient algorithm needed more iterations for convergence in LE2 and LE3, though the EM algorithm converged more slowly in the other conditions. Finally, let us note the performance of the EM algorithm in Condition S (Table 5). Though Theorem 1 shows that an improper solution with \(\hat {\psi} _{j}= 0\) could occur in S, such a solution was not found. However, the averaged MAE values were large, which demonstrates the difficulty of estimating parameters from singular sample correlation matrices.

6.3 Confirmatory Factor Analysis

The true parameters used in Conditions P, S, T, and LC1 are the same as those in P, S, T, and LE1 for EFA (Section 6.2), respectively, except that all true inter-factor correlations \(r_{kl}\) (k ≠ l) were set to 0.4 in P, S, T, and LC1 for CFA. In those conditions, CFA was performed subject to the loadings whose true values were zero being constrained to zero. Each of the remaining conditions is defined by changing one procedure in Condition P. In T′, another true improper condition besides T, the true \(r_{21}\) and \(r_{12}\) were set to 1.05 (not 0.4), while they were set to 0.98 in LC2. In Condition LC3, CFA was performed with \(\hat{m}= 4\) (> m = 3) subject to the constraints found in Table 8, Panels (B) and (C). Twenty data sets were synthesized in each condition. We use \(\hat{\mathbf{B}}= (\hat{b}_{jk})\), \(\hat{\psi} _{j}\), and \(\hat{\mathbf{R}}= (\hat{r}_{kl})\) for the solutions of B, \(\psi_{j}\), and R = \((r_{kl})\).

The original EM algorithm always failed to converge, with a decrease in (2), in all conditions except P. The failures are considered to be due to R becoming asymmetric, as observed in Section 5; this can be ascertained from the fact that the symmetric original EM algorithm was successfully applied to all data sets. Thus, we do not summarize the solutions of the original EM algorithm. The other three algorithms always gave positive definite \(\hat{\mathbf{R}}\) and \(\hat{\mathbf{B}}\) of full column rank. The other results are summarized in Table 7. There, the results for the symmetric original EM algorithm are not presented, since it and the modified EM algorithm showed equivalent behavior, except for slightly different performances in Condition S, which were negligible, as reported later. In the table, the bold font in gray cells indicates the larger of the two algorithms' values.

Table 7. Percentages of improper solutions, the averaged MAE (mean absolute error), averaged CN (condition number), and averaged number of iterations for CFA with gradient and modified EM (M-EM) algorithms.

Panels (A) and (B) in Table 7 show the percentages of improper solutions with \(\hat{\psi} _{j}\leq0\) and \(|\hat{r}_{kl}| > 1\), respectively. The gradient algorithm produced improper \(\hat{\psi} _{j}\) in Conditions LC1 and T, and improper \(\hat{r}_{kl}\) in LC2, LC3, and T′, while the EM algorithm always gave proper solutions. The recovery of parameters was assessed by \(\operatorname{MAE}(\hat {\mathbf{B}}) = h^{-1}\sum_{j = 1}^{p} \sum_{k = 1}^{m} h_{jk}| \hat{b}_{jk} - b_{jk}|\), \(\operatorname{MAE}(\hat{\boldsymbol{\varPsi}} ) = p^{-1}\sum_{j = 1}^{p} | \hat{\psi} _{j} -\psi_{j}|\), and \(\operatorname{MAE}(\hat{\mathbf{R}}) = 3^{-1}(|\hat{r}_{21}- r_{21}| + |\hat{r}_{31}- r_{31}|+ |\hat{r}_{32}- r_{32}|)\), whose averages over the data sets are shown in Panels (C), (D), and (E), respectively, where \(h_{jk} = 0\) if \(\hat{b}_{jk}\) is constrained to be zero, \(h_{jk} = 1\) otherwise, and \(h = \sum_{j,k} h_{jk}\). In those panels, we find that the goodness of recovery was not very different between the two algorithms. Panels (F) and (G) show the averages of the condition numbers \(\operatorname{CN}(\hat{\mathbf{B}})\) and \(\operatorname{CN}(\hat {\mathbf{R}})\), with the true counterpart CN(B) being 1.114 in Condition LC2, 1.129 in T, and 1.0 in the others, and with CN(R) being 111.92 in LC2, 45.935 in T′, and 3.0 in the other conditions. We can find that CN(\(\hat{\mathbf{R}}\)) was very large in all conditions except P, LC1, and T. In particular, CN(\(\hat{\mathbf{R}}\)) for the EM algorithm was far larger than CN(\(\hat{\mathbf{R}}\)) for the gradient one in LC3 and T′. That is, in those conditions, the EM algorithm gave nearly rank deficient \(\hat{\mathbf{R}}\), though all \(\hat{\mathbf{R}}\) were proper and positive definite. Such a solution in LC3 is presented in Table 8 together with the corresponding gradient algorithm's solution, which includes improper \(\hat{r}_{42}\) and \(\hat{r}_{43}\). Here, we should note that the above definition of \(\operatorname{MAE}(\hat{\mathbf{R}})\) does not consider \(\hat{r}_{41}\), \(\hat{r}_{42}\), and \(\hat{r}_{43}\), since their true counterparts do not exist (in LC3 with \(\hat{m}= 4 > m = 3\)); thus, \(\operatorname{MAE}(\hat{\mathbf{R}})\) did not take large values.

Table 8. True parameters, the solutions of CFA with gradient and modified EM algorithms, and their MAE and CN values for a data set in Condition LC3, with blank cells standing for zero loadings.

Let us return to Table 7. Panel (H) shows that the EM algorithm needed far more iterations until convergence than the gradient algorithm. The numbers differed between the modified and symmetric original EM algorithms for three data sets in Condition S, but the averaged number of iterations (4678.2) for the latter algorithm was very close to the number (4676.9) for the former in Panel (H), and the other statistics for the symmetric original algorithm were equivalent to those in Table 7. We can find that the EM algorithm did not give improper solutions even in Condition S, but the recovery of parameters was not good, with \(\hat{\mathbf{R}}\) nearly rank deficient in that condition.

6.4 Summary of Major Results

The three EM algorithms for EFA gave equivalent solutions. Also, the symmetric original and modified EM algorithms for CFA yielded equivalent solutions, except for their slightly different behavior for singular sample correlation matrices. However, the original EM algorithm for CFA often failed to converge. The results of that algorithm are thus excluded from the following summary.

The results in Section 3 were demonstrated: the EM algorithm always gave a proper solution with a loading matrix of full column rank. The recovery by the EM algorithm was better than, or almost the same as, that by the gradient algorithm, except for the cases with improper true parameters. However, the EM algorithm for CFA yielded nearly rank deficient factor correlation matrices in the cases where models were over-factorized with \(\hat{m}> m\) and where true factor correlations were improper. The EM algorithm needed more iterations for convergence than the gradient algorithm, except in two conditions (LE2 and LE3) in EFA.

The gradient algorithms often produced improper solutions in the cases where the true parameters were improper and where models were over-factorized, but they also gave proper solutions for some data sets in such cases (Table 5, Panel (A); Table 7, Panel (A)), the implications of which are mentioned in the next section.

7 Conclusions

In this paper, we proved the following mathematical fact: Rubin and Thayer's (1982) EM factor analysis always produces a proper solution when the covariance matrix to be analyzed and the initial parameter matrices, including those of unique variances and inter-factor correlations, are positive definite. We further presented the symmetric original and modified EM algorithms, in which Rubin and Thayer's (1982) original formulas are replaced by ones with more desirable numerical properties. The real data examples and simulation studies showed that the original, symmetric original, and modified EM algorithms behave equivalently well for EFA, but that the original EM algorithm should not be used for CFA. It is inconclusive whether the symmetric original or the modified EM algorithm is superior, since they showed almost equivalent performances for CFA.

Obviously, EM factor analysis is welcomed by users who wish to avoid improper solutions. However, one drawback of EM factor analysis is that it needs more iterations for convergence (except under special conditions for EFA), although this difficulty may be dealt with by incorporating an acceleration technique into the factor analysis (Kuroda & Sakakihara, 2006). The following two points may also be drawbacks. First, EM factor analysis is a constrained estimation procedure that restricts solutions to be proper and thus shares the difficulties of constrained maximum likelihood estimation procedures (e.g., Savalei & Kolenikov, 2008). Second, EM factor analysis yields proper solutions when models are over-factorized and even when true parameter values are improper, which implies that the improperness of user model specifications and the improperness of true parameters are not detected. On the other hand, such improperness can be suggested by improper solutions in the unconstrained gradient algorithms. However, such a suggestion is not always given, since the gradient algorithms can also give proper solutions in true improper and over-factorization cases.