1 Introduction

Obtaining a set of parameter estimates for a theoretically plausible model is the first step in any statistical analysis. For structural equation modeling (SEM), however, parameter estimates have to be computed iteratively, and there is a good chance that a researcher is unable to obtain a set of converged solutions in practice, especially when the sample size is not large enough. Various factors can contribute to non-convergence of an iterative algorithm, including a bad model or bad data, but only after a set of solutions is obtained can we possibly distinguish the different causes. In particular, non-convergence does not occur only with poorly formulated models; it has been repeatedly reported for correctly specified models with simulated data in Monte Carlo studies (Bentler and Yuan 1999; Hu et al. 1992; Jackson 2001) and in bootstrap replications (Ichikawa and Konishi 1995; Yuan and Hayashi 2003). Obtaining converged solutions is equally important in Monte Carlo studies, although replications are essentially cost free, because non-converged replications cannot be regarded as identically distributed with the converged ones (e.g., Yuan and Hayashi 2003). Similar to a missing data analysis that ignores a missing-not-at-random mechanism, when the percentage of non-converged replications is substantial, results based on just the converged replications may not correctly reflect the properties of the methodology being studied. The main purpose of this article is to examine factors affecting the convergence properties of the Fisher-scoring (FS) algorithm, which is used in most SEM packages. Strategies for achieving higher convergence rates in both simulation and real data analysis are explored as well.

In conducting Monte Carlo studies with confirmatory factor models using LISREL (Jöreskog and Sörbom 1981), Anderson and Gerbing (1984) observed that non-converged replications occurred more often with small factor loadings or indicators with low reliability. Boomsma (1985) also observed that in simulation studies the convergence rate of LISREL increased with greater measurement reliabilities. Jackson (2001) noted a similar phenomenon in simulation studies using SAS Calis (SAS Institute 1996). However, recent studies of a ridge method in SEM indicate that adding a diagonal matrix to the sample covariance matrix \(\mathbf{S}\) increases the rate of convergence (Yuan and Chan 2008). Because manipulating factor loadings to be larger works in the direction opposite to the ridge method, we will call this manipulation an anti-ridge method. The findings in the literature are thus seemingly in conflict. By examining factors affecting the convergence properties of the FS algorithm, we will clarify why these two seemingly contradictory methods both lead to higher convergence rates. Our analysis of the ridge and anti-ridge methods also applies to other algorithms for minimizing the normal-distribution-based maximum likelihood (NML) discrepancy function, and it is similarly relevant to other discrepancy functions.

The convergence properties of the FS algorithm are affected by many factors. One of them is multicollinearity among the observed variables. If the sample or model-implied covariance matrix is close to being singular, then the FS algorithm may have difficulty reaching a set of converged solutions. The ridge method developed in Yuan and Chan (2008) aims to address the problem of near-singular covariance matrices. Instead of fitting \(\mathbf{S}\) by a structural model \({\varvec{\Sigma }}({\varvec{\theta }})\), the ridge method fits \(\mathbf{S}_a=\mathbf{S}+a\mathbf{I}\) by \({\varvec{\Sigma }}({\varvec{\theta }})\) through minimizing the NML-based discrepancy function, where \(a>0\) is a constant that may depend on the sample size N and the number of variables p but not on the observed data. Let \(\hat{\varvec{\theta }}_a\) be the resulting estimates from fitting \(\mathbf{S}_a\). Then the final parameter estimates are obtained by subtracting a from the elements of \(\hat{\varvec{\theta }}_a\) that correspond to the variances of errors.

In the literature of numerical analysis, the ratio of the largest to the smallest eigenvalue of a matrix is called the condition number of the matrix (Golub and Van Loan 1983). A large condition number not only causes computations involving the inverse of the matrix to be less accurate, it may also cause the algorithm to fail to converge if the computation is iterative. The condition number of \(\mathbf{S}_a=\mathbf{S}+a\mathbf{I}\) is always smaller than that of \(\mathbf{S}\). Yuan and Chan (2008) briefly discussed how the ridge method improves the convergence properties of the FS algorithm. In this article, we further study the changes in condition numbers, as well as in the rate and speed of convergence, between modeling \(\mathbf{S}\) and \(\mathbf{S}_a\), using numerical examples and Monte Carlo simulation.
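The effect of the ridge constant on the condition number can be seen directly: adding \(a\mathbf{I}\) shifts every eigenvalue of \(\mathbf{S}\) up by a, so \(\kappa (\mathbf{S}_a)=(\lambda _{\max }+a)/(\lambda _{\min }+a)\) is always below \(\kappa (\mathbf{S})=\lambda _{\max }/\lambda _{\min }\). A minimal numpy sketch, with illustrative data of our own rather than any from this article:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 6))
X[:, 5] = X[:, 4] + 0.05 * rng.standard_normal(30)  # a near-collinear pair
S = np.cov(X, rowvar=False)

def kappa(A):
    """Condition number: ratio of the largest to the smallest eigenvalue."""
    eig = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
    return eig[-1] / eig[0]

a = 0.5
print(kappa(S), kappa(S + a * np.eye(6)))  # the ridge matrix is better conditioned
```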

The population or sample covariance matrix corresponding to an SEM model might be close to singular if certain error variances are tiny. Thus, we would expect the convergence rate of the FS algorithm to decrease as factor loadings increase, opposite to what Anderson and Gerbing (1984) and Boomsma (1985) found. However, as we shall show, as factor loadings increase, the relative sampling errors in the sample covariances become smaller. Our analysis and results further show that smaller sampling errors in \(\mathbf{S}\) positively affect both the convergence rate and the speed of convergence of the FS algorithm. Smaller sampling errors also tend to improve the condition of the sample covariance matrix \(\mathbf{S}\). Because the condition number is a rather complicated function of the elements of \(\mathbf{S}\), we will use numerical examples and Monte Carlo simulation to evaluate the change in condition numbers under the anti-ridge method.

In addition, we will discuss how to use the findings in practice when FS fails to obtain a set of converged solutions. As we shall see, the ridge method can be applied to all models in which error variances are subject to estimation. In contrast, the anti-ridge method is mostly usable a priori in Monte Carlo studies, where the factor loadings are subject to manipulation; it can also be applied to special models in post-hoc analysis when the space of the common factors is known or when alternative items are available.

We will not study the properties of parameter estimates or test statistics for overall model evaluation with the ridge method. These have been studied in Yuan and Chan (2008). In particular, Kamada (2011) and Kamada and Kano (2012) found that the ridge method can yield parameter estimates that are substantially more accurate than MLEs at smaller N, even when the population is normally distributed. Since the applicability of the anti-ridge method is limited, we will not study properties of parameter estimates following the anti-ridge method.

In Sect. 2 of the article, we review the formulation of the FS algorithm and examine the factors that affect its convergence properties. In Sect. 3, we obtain formulas that show how the relative errors in \(\mathbf{S}\) are affected by population factor loadings and error/unique variances. In Sect. 4, using examples, we numerically illustrate how converged solutions are obtained with ridge and/or anti-ridge methods. Monte Carlo results on the effectiveness of the ridge and anti-ridge methods are presented in Sect. 5. Recommendations and discussion regarding the application of the ridge and anti-ridge methods are given in the concluding section.

2 Fisher-scoring algorithm

In this section, we will first present a formulation of the FS algorithm in SEM. Factors that affect the speed of convergence of FS as well as whether there exists a vector of parameters that satisfies a given convergence criterion are then examined.

Let \(\mathbf{S}=(s_{jk})\) be a sample covariance matrix of size N from a p-variate population. We are interested in modeling the covariance matrix \({\varvec{\Sigma }}=E(\mathbf{S})\) by a structural model \({\varvec{\Sigma }}({\varvec{\theta }})\) using NML, which defines the parameter estimate \(\hat{\varvec{\theta }}\) by minimizing

$$\begin{aligned} F_{ML}(\mathbf{S},{\varvec{\Sigma }}({\varvec{\theta }}))=\mathrm{tr}[\mathbf{S}{\varvec{\Sigma }}^{-1}({\varvec{\theta }})] -\log |\mathbf{S}{\varvec{\Sigma }}^{-1}({\varvec{\theta }})|-p. \end{aligned}$$
(1)

Let \(\mathrm{vec}(\mathbf{S})\) be the vector formed by stacking the columns of \(\mathbf{S}\), and \(\mathbf{s}=\mathrm{vech}(\mathbf{S})\) be the vector formed by stacking the lower-triangular part of \(\mathbf{S}\) column by column. Then, with \(p^*=p(p+1)/2\), there exists a \(p^2\times p^*\) matrix \(\mathbf{D}_p\) such that \(\mathbf{D}_p\mathrm{vech}(\mathbf{S})=\mathrm{vec}(\mathbf{S})\), and \(\mathbf{D}_p\) is called a duplication matrix (see e.g., Schott 2005, p. 313). Further let \({\varvec{\sigma }}({\varvec{\theta }})=\mathrm{vech}[{\varvec{\Sigma }}({\varvec{\theta }})]\),

$$\begin{aligned} \dot{{\varvec{\sigma }}}({\varvec{\theta }})=\frac{\partial {\varvec{\sigma }}({\varvec{\theta }})}{\partial {\varvec{\theta }}'}, \quad \mathrm{and}\quad \mathbf{W}({\varvec{\theta }}) =\frac{1}{2}\mathbf{D}_p'\left[ {\varvec{\Sigma }}^{-1}({\varvec{\theta }})\otimes {\varvec{\Sigma }}^{-1}({\varvec{\theta }})\right] \mathbf{D}_p. \end{aligned}$$
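For readers who want to follow the computations numerically, below is one way to construct \(\mathrm{vech}(\cdot )\) and the duplication matrix \(\mathbf{D}_p\) defined above; this is a sketch in Python, and the function names are ours:

```python
import numpy as np

def duplication_matrix(p):
    """Build D_p with D_p @ vech(S) = vec(S) for a symmetric p x p matrix S."""
    D = np.zeros((p * p, p * (p + 1) // 2))
    col = 0
    for k in range(p):            # column of S
        for j in range(k, p):     # row of S, lower triangle including diagonal
            D[j + k * p, col] = 1.0   # position (j, k) in column-major vec(S)
            D[k + j * p, col] = 1.0   # symmetric position (k, j)
            col += 1
    return D

def vech(S):
    """Stack the lower-triangular part of S column by column."""
    return np.concatenate([S[k:, k] for k in range(S.shape[0])])

# Check the defining identity D_p vech(S) = vec(S) on a random symmetric matrix
A = np.random.rand(4, 4)
S = A + A.T
assert np.allclose(duplication_matrix(4) @ vech(S), S.flatten(order='F'))
```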

With initial value \({\varvec{\theta }}^{(0)}\), the FS algorithm for computing \(\hat{\varvec{\theta }}\) is given by

$$\begin{aligned} {\varvec{\theta }}^{(t+1)}={\varvec{\theta }}^{(t)}+\Delta {\varvec{\theta }}^{(t)}, \end{aligned}$$
(2)

where

$$\begin{aligned} \Delta {\varvec{\theta }}^{(t)}=[\mathbf{H}({\varvec{\theta }}^{(t)})]^{-1} \dot{{\varvec{\sigma }}}'({\varvec{\theta }}^{(t)}) \mathbf{W}({\varvec{\theta }}^{(t)})[\mathbf{s}-{\varvec{\sigma }}({\varvec{\theta }}^{(t)})] \end{aligned}$$
(3)

with \(\mathbf{H}({\varvec{\theta }})=\dot{{\varvec{\sigma }}}'({\varvec{\theta }})\mathbf{W}({\varvec{\theta }})\dot{{\varvec{\sigma }}}({\varvec{\theta }})\) being the information matrix. The NML estimate \(\hat{\varvec{\theta }}\) is obtained when the algorithm converges, where convergence is typically declared when the absolute values of all the elements of \(\Delta {\varvec{\theta }}^{(t)}\) are smaller than a given number. A variant of (2) is

$$\begin{aligned} {\varvec{\theta }}^{(t+1)}={\varvec{\theta }}^{(t)}+\alpha \Delta {\varvec{\theta }}^{(t)}, \end{aligned}$$
(4)

where the scalar \(\alpha \) is to control the size of the step so that

$$\begin{aligned} F_{ML}(\mathbf{S},{\varvec{\Sigma }}({\varvec{\theta }}^{(t+1)}))<F_{ML}(\mathbf{S},{\varvec{\Sigma }}({\varvec{\theta }}^{(t)})). \end{aligned}$$

The value of \(\alpha \) can be chosen using step halving (.50, .25, \(\ldots \)) or a line search method (e.g., chapter 2 of Everitt 1987; chapter 3 of Nocedal and Wright 1999). We will call (4) the FS algorithm with step-size adjustment.
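To make the iteration in (2)-(4) concrete, the following is a minimal numpy sketch of Fisher scoring with step halving for a one-factor model \({\varvec{\Sigma }}({\varvec{\theta }})={\varvec{\lambda }}{\varvec{\lambda }}'+\mathrm{diag}({\varvec{\psi }})\) with the factor variance fixed at 1. It reuses vech and duplication_matrix from the sketch above; the function names, the step-halving floor, and the positive-definiteness check are our own choices, not taken from any SEM package:

```python
import numpy as np
# Reuses vech() and duplication_matrix() defined in the sketch above.

def fml(S, Sigma):
    """Normal-theory ML discrepancy function, Eq. (1)."""
    M = S @ np.linalg.inv(Sigma)
    return np.trace(M) - np.log(np.linalg.det(M)) - S.shape[0]

def one_factor_sigma_dot(lam, p):
    """Jacobian sigma_dot: columns are vech(dSigma/dtheta_j), theta = (lam, psi)."""
    cols = []
    for j in range(p):                     # dSigma/dlambda_j = e_j lam' + lam e_j'
        dS = np.zeros((p, p)); dS[j, :] += lam; dS[:, j] += lam
        cols.append(vech(dS))
    for j in range(p):                     # dSigma/dpsi_jj = e_j e_j'
        dS = np.zeros((p, p)); dS[j, j] = 1.0
        cols.append(vech(dS))
    return np.column_stack(cols)

def fs_one_factor(S, theta0, tol=1e-4, max_iter=300, step_halving=True):
    """Fisher scoring, Eqs. (2)-(4), with the convergence criterion of (12)."""
    p = S.shape[0]
    D = duplication_matrix(p)
    theta = np.asarray(theta0, dtype=float).copy()
    for t in range(1, max_iter + 1):
        lam, psi = theta[:p], theta[p:]
        Sigma = np.outer(lam, lam) + np.diag(psi)
        Sinv = np.linalg.inv(Sigma)
        W = 0.5 * D.T @ np.kron(Sinv, Sinv) @ D
        sd = one_factor_sigma_dot(lam, p)
        H = sd.T @ W @ sd                         # information matrix
        delta = np.linalg.solve(H, sd.T @ W @ (vech(S) - vech(Sigma)))
        if np.max(np.abs(delta)) < tol:
            return theta, t, True                 # converged
        alpha = 1.0
        if step_halving:                          # choose alpha so F_ML decreases
            f_old = fml(S, Sigma)
            while alpha > 1e-3:
                new = theta + alpha * delta
                Sig = np.outer(new[:p], new[:p]) + np.diag(new[p:])
                if np.min(np.linalg.eigvalsh(Sig)) > 0 and fml(S, Sig) < f_old:
                    break
                alpha *= 0.5
        theta = theta + alpha * delta
    return theta, max_iter, False                 # criterion (12) never met
```

As a usage example, fs_one_factor(S, np.ones(2 * p)) attempts an NML fit of a one-factor model to a given \(p\times p\) sample covariance matrix S, returning the estimates, the number of iterations used, and a convergence flag.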

Since all the elements of the \(\Delta {\varvec{\theta }}^{(t)}\) in (3) must be small enough for FS to converge, we further examine its two major components: the inverse of the information matrix \(\mathbf{H}^{(t)}=\mathbf{H}({\varvec{\theta }}^{(t)})\) and the score vector \({\varvec{\nu }}^{(t)}=\dot{{\varvec{\sigma }}}'({\varvec{\theta }}^{(t)})\mathbf{W}({\varvec{\theta }}^{(t)})[\mathbf{s}-{\varvec{\sigma }}({\varvec{\theta }}^{(t)})]\). Denote \({\varvec{\Sigma }}^{(t)}={\varvec{\Sigma }}({\varvec{\theta }}^{(t)})\) and \(\dot{\varvec{\Sigma }}_j^{(t)}=\partial {\varvec{\Sigma }}({\varvec{\theta }}^{(t)})/ \partial \theta _j^{(t)}\), then the jth element of \({\varvec{\nu }}^{(t)}\) can be further written as

$$\begin{aligned} \nu _j^{(t)}=\mathrm{tr}\left[ ({\varvec{\Sigma }}^{(t)})^{-1}\dot{\varvec{\Sigma }}_j^{(t)} ({\varvec{\Sigma }}^{(t)})^{-1}(\mathbf{S}-{\varvec{\Sigma }}^{(t)})\right] . \end{aligned}$$

Clearly, causes for FS to fail to converge must operate through \({\varvec{\nu }}^{(t)}\) and \(\mathbf{H}^{(t)}\), and they may be classified into four categories.

(C1) The first cause is when \({\varvec{\Sigma }}^{(t)}\) is near singular. Then \({\varvec{\Sigma }}^{(t)}\) may not be invertible, which causes problems in the computation of both \(\mathbf{H}^{(t)}\) and \({\varvec{\nu }}^{(t)}\).

(C2) The second is when \(\dot{{\varvec{\sigma }}}({\varvec{\theta }}^{(t)})\) is close to rank deficient. A rank-deficient \(\dot{{\varvec{\sigma }}}({\varvec{\theta }}^{(t)})\) does not cause operational problems in the calculation of \(\nu _j^{(t)}\), because matrix multiplication is very robust to the values of the elements of the matrices, but it may cause \(\mathbf{H}^{(t)}\) to be close to singular, in which case the calculation of \(\Delta {\varvec{\theta }}^{(t)}\) cannot proceed.

(C3) The third is when there exist large sampling and/or systematic differences between \(\mathbf{S}\) and \({\varvec{\Sigma }}^{(t)}\) relative to the size of the elements of \({\varvec{\Sigma }}^{(t)}\). Notice that the matrix \(({\varvec{\Sigma }}^{(t)})^{-1}(\mathbf{S}-{\varvec{\Sigma }}^{(t)})\) in the expression of \(\nu _j^{(t)}\) essentially represents the relative errors in \(\mathbf{S}\). When the relative errors are large enough, certain elements of \(\Delta {\varvec{\theta }}^{(t)}\) may never satisfy a given convergence criterion.

(C4) The fourth is the effect of interactions between the relative errors in \(\mathbf{S}\) and the conditions of \({\varvec{\Sigma }}^{(t)}\) and/or \(\mathbf{H}^{(t)}\). A matrix \(\mathbf{A}\) is said to be ill-conditioned if its condition number \(\kappa (\mathbf{A})\) is huge. An ill-conditioned matrix need not be near singular if its smallest eigenvalue is not close to zero. However, according to Golub and Van Loan (1983, Sect. 2.5), the relative errors in the \(\mathbf{x}\) resulting from \(\mathbf{x}=\mathbf{A}^{-1}\mathbf{b}\) can be \(\kappa (\mathbf{A})\) times those in \(\mathbf{A}\) and \(\mathbf{b}\). Moderate errors in \(\mathbf{S}\) together with large condition numbers of \({\varvec{\Sigma }}^{(t)}\) and/or \(\mathbf{H}^{(t)}\) can thus result in substantial fluctuations in \(\Delta {\varvec{\theta }}^{(t)}\) from iteration to iteration, which will not satisfy a given convergence criterion.

In addition to the four noted causes, the convergence properties of the FS algorithm are also affected by the initial value \({\varvec{\theta }}^{(0)}\). In the following section, we will examine how ridge and anti-ridge methods change the formulation of \(\Delta {\varvec{\theta }}^{(t)}\) so that the convergence properties of FS improve. We will not discuss initial values because they are not unique to either the ridge or the anti-ridge method.

3 Relative errors in sample covariances, ridge and anti-ridge methods

In this section, we will first quantify the relative errors in \(\mathbf{S}\) using the coefficient of variation (CV). Then we examine how the relative errors change in the ridge and anti-ridge methods. Condition numbers of covariance and information matrices following ridge and anti-ridge methods will also be discussed. Since our interest is in the effect of the size of factor loadings versus that of the size of error variances, we will mainly consider factor models. Another reason for us to consider factor models is that an SEM model can be equivalently expressed as a factor model with structured factor variances-covariances. Notice that, in SEM or factor analysis, the size of factor loadings and that of factor variances-covariances cannot be distinguished before fixing the scales of latent variables. Unless stated otherwise, we assume that the variance of each factor is fixed at 1.0 from now on.

3.1 Relative errors in \(s_{jk}\)

Let \(\mathbf{y}=(y_1,y_2,\ldots ,y_p)'\) represent a population with p random variables. Suppose \(\mathbf{y}\) follows a confirmatory factor model

$$\begin{aligned} \mathbf{y}={\varvec{\mu }}+{\varvec{\Lambda }}{\varvec{\xi }}+{\varvec{\varepsilon }}, \end{aligned}$$
(5)

where \({\varvec{\mu }}=E(\mathbf{y})\); \({\varvec{\Lambda }}\) is a \(p\times q\) matrix of factor loadings; \({\varvec{\xi }}\) is a vector of q latent factors with \(E({\varvec{\xi }})=\mathbf{0}\) and \(\mathrm{Cov}({\varvec{\xi }})={\varvec{\Phi }}=(\phi _{lm})\) being a correlation matrix; and \({\varvec{\varepsilon }}\) is a vector of p errors or uniquenesses with \(E({\varvec{\varepsilon }})=\mathbf{0}\) and \(\mathrm{Cov}({\varvec{\varepsilon }})={\varvec{\Psi }}=\mathrm{diag}(\psi _{11},\psi _{22},\ldots ,\psi _{pp})\). When \({\varvec{\xi }}\) and \({\varvec{\varepsilon }}\) are uncorrelated, the covariance matrix of \(\mathbf{y}\) is given by

$$\begin{aligned} {\varvec{\Sigma }}=(\sigma _{jk})={\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'+{\varvec{\Psi }}. \end{aligned}$$
(6)

In this section, we will further assume that \({\varvec{\xi }}\) and \({\varvec{\varepsilon }}\) are independent to avoid overly complicated analytical results. In Monte Carlo studies in Sect. 5, we will further evaluate relative errors in \(\mathbf{S}\) when \({\varvec{\xi }}\) and \({\varvec{\varepsilon }}\) are uncorrelated but dependent.

Let \(\mathbf{y}_i=(y_{i1},y_{i2},\ldots ,y_{ip})'\), \(i=1, 2,\ldots , N\), be a random sample of the \(\mathbf{y}\) in (5), then the sample covariance matrix is given by \(\mathbf{S}=(s_{jk})\) with

$$\begin{aligned} s_{jk}=\frac{1}{n}\sum _{i=1}^N(y_{ij}-\bar{y}_j)(y_{ik}-\bar{y}_k), \end{aligned}$$

where \(n=N-1\). Notice that each \(s_{jk}\) is a 2nd-order sample moment. Standard large sample theory shows that the asymptotic variance of \(\sqrt{n}s_{jk}\) is given by \(\gamma _{jk}=\mathrm{Var}(y_{j0}y_{k0})\) (e.g., Ferguson 1996), where \(y_{j0}=y_j-\mu _j\) and \(y_{k0}=y_k-\mu _k\). Because the exact expression for \(\mathrm{Var}(\sqrt{n}s_{jk})\) is rather complicated and its difference from \(\gamma _{jk}\) is of the order of \(1/N\), we treat \(\gamma _{jk}\) as the variance of \(\sqrt{n}s_{jk}\) for simplicity. Let \(\mathrm{CV}_{jk}\) denote the coefficient of variation of \(s_{jk}\). Then \(\mathrm{CV}_{jk}=\gamma _{jk}^{1/2}/(\sqrt{n}\sigma _{jk})\). We next quantify \(\mathrm{CV}_{jk}\) with respect to the population values of the parameters in (6). For the obtained formulas to have relatively simple forms, we only consider unidimensional measurement, where each variable loads on only one factor.

With q factors, suppose \(y_j=\lambda _j\xi _{j^*}+\varepsilon _j\) and \(y_k=\lambda _k\xi _{k^*}+\varepsilon _k\), where \(1\le j^*\le k^*\le q\). It follows from the results in Sect. 7 (Appendix) that, when \(j^*\ne k^*\) (\(\sigma _{jk}=\lambda _j\lambda _k\phi _{j^*k^*}\)),

$$\begin{aligned} n\mathrm{CV}_{jk}^2=\frac{1}{\phi _{j^*k^*}^2} \left\{ \left[ E\left( \xi _{j^*}^2\xi _{k^*}^2\right) -\phi _{j^*k^*}^2\right] +\frac{\psi _{kk}}{\lambda _k^2}+\frac{\psi _{jj}}{\lambda _j^2} +\frac{\psi _{jj}}{\lambda _j^2}\frac{\psi _{kk}}{\lambda _k^2}\right\} ; \end{aligned}$$
(7)

when \(j^*=k^*\) but \(j\ne k\) (\(\sigma _{jk}=\lambda _j\lambda _k\)),

$$\begin{aligned} n\mathrm{CV}_{jk}^2=\left[ E\left( \xi _{j^*}^4\right) -1\right] +\frac{\psi _{kk}}{\lambda _k^2}+\frac{\psi _{jj}}{\lambda _j^2} +\frac{\psi _{jj}}{\lambda _j^2}\frac{\psi _{kk}}{\lambda _k^2}; \end{aligned}$$
(8)

and when \(j=k\) (\(\sigma _{jj}=\lambda _j^2+\psi _{jj}\)),

$$\begin{aligned} n\mathrm{CV}_{jj}^2=\frac{\left[ E\left( \xi _{j^*}^4\right) -1\right] +4\psi _{jj}/\lambda _j^2 +\left[ E\left( \varepsilon _j^4\right) -\psi _{jj}^2\right] /\lambda _j^4}{\left( 1+\psi _{jj}/\lambda _j^2\right) ^2}. \end{aligned}$$
(9)

Let \(\varepsilon _j=\psi _j\varepsilon _{j0}\) with \(\psi _j=\psi _{jj}^{1/2}\), then we can further write (9) as

$$\begin{aligned} n\mathrm{CV}_{jj}^2 =\frac{ \left[ E\left( \xi _{j^*}^4\right) -3\right] }{\left( 1+\psi _{jj}/\lambda _j^2\right) ^2} +\frac{\left[ E\left( \varepsilon _{j0}^4\right) -3\right] \left( \psi _{jj}/\lambda _j^2\right) ^2}{\left( 1+\psi _{jj}/\lambda _j^2\right) ^2} +2. \end{aligned}$$
(10)

It is clear from (7) and (8) that, when \(j\ne k\), the relative error in \(s_{jk}\) is an increasing function of \(\psi _{jj}\) and \(\psi _{kk}\), and a decreasing function of \(|\lambda _j|\) and \(|\lambda _k|\). The relationship of \(\mathrm{CV}_{jj}^2\) with \(\lambda _j\) and \(\psi _{jj}\) in (9) or (10) depends on the kurtoses of \(\xi _{j^*}\) and \(\varepsilon _j\). When both \(\xi _{j^*}\) and \(\varepsilon _j\) are normally distributed, \(E(\xi _{j^*}^4)=E(\varepsilon _{j0}^4)=3\), and \(n\mathrm{CV}_{jj}^2\) is unrelated to \(\lambda _j\) or \(\psi _{jj}\). Otherwise, \(n\mathrm{CV}_{jj}^2\) will depend on the values of factor loadings and error variances. Suppose \(E(\varepsilon _{j0}^4)>3\) and \(E(\xi _{j^*}^4)>3\); then the first term on the right side of (10) increases with \(|\lambda _j|\) and decreases with \(\psi _{jj}\), whereas the second term on the right side of (10) changes in the opposite direction.

For the normally distributed populations considered in Anderson and Gerbing (1984) and Boomsma (1985), relative errors in \(s_{jj}\) are not affected by the values of factor loadings or error variances, but relative errors in \(s_{jk}\) become smaller with larger factor loadings. Since each element of \(\Delta {\varvec{\theta }}^{(t)}\) is proportional to the relative errors in \(\mathbf{S}\), the results in (7) and (8) explain why larger factor loadings lead to smaller elements of \(\Delta {\varvec{\theta }}^{(t)}\) and consequently more converged replications, as reported in the literature.
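Equation (8) is easy to check by simulation. The sketch below generates two indicators of one normally distributed factor under three illustrative \((\lambda , \psi )\) settings of our own choosing and compares the empirical \(n\mathrm{CV}_{jk}^2\) with the value \(2+2\psi /\lambda ^2+(\psi /\lambda ^2)^2\) implied by (8) when \(E(\xi ^4)=3\):

```python
import numpy as np

rng = np.random.default_rng(7)
N, reps = 500, 5000
for lam, psi in [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]:
    xi = rng.standard_normal((reps, N))
    yj = lam * xi + np.sqrt(psi) * rng.standard_normal((reps, N))
    yk = lam * xi + np.sqrt(psi) * rng.standard_normal((reps, N))
    # sample covariance s_jk within each of the `reps` replications
    s = np.sum((yj - yj.mean(1, keepdims=True)) *
               (yk - yk.mean(1, keepdims=True)), axis=1) / (N - 1)
    n_cv2 = (N - 1) * s.var() / (lam * lam) ** 2   # sigma_jk = lam_j * lam_k
    r = psi / lam ** 2
    print(f"lam={lam}, psi={psi}: simulated {n_cv2:.2f},"
          f" Eq. (8) predicts {2 + 2 * r + r * r:.2f}")
```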

3.2 Smaller relative errors via the ridge and anti-ridge methods

The results in the previous subsection characterize the relationship of relative errors in \(s_{jk}\) with the population factor loadings and error variances that generated the data. The ridge method of modeling \(\mathbf{S}_a=\mathbf{S}+a\mathbf{I}=(s_{jka})\) is a post-hoc technique after \(\mathbf{S}=(s_{jk})\) is obtained. Since a is a constant, the variance of \(s_{jka}\) is the same as that of \(s_{jk}\). However, \(E(s_{jja})=E(s_{jj})+a\). Consequently, the relative error in \(s_{jja}\) monotonically decreases with a. Thus, the \(\Delta {\varvec{\theta }}^{(t)}\) in (2) following the ridge method becomes smaller, which increases both the speed and rate of convergence of the FS algorithm.

The manipulations on the size of factor loadings and error variances in Anderson and Gerbing (1984) and Boomsma (1985) are a priori. We may consider applying the anti-ridge method in a post-hoc manner after \(\mathbf{S}\) is observed. Suppose we know the variance-covariance matrix of the common scores \({\varvec{\Lambda }}{\varvec{\xi }}\). Then we may consider fitting \(\mathbf{S}_c=(s_{jkc})=\mathbf{S}+(c{\varvec{\Lambda }}){\varvec{\Phi }}(c{\varvec{\Lambda }})'=\mathbf{S}+c^2{\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'\) by the structural model \({\varvec{\Sigma }}({\varvec{\theta }})\). Since \({\varvec{\Sigma }}_c=E(\mathbf{S}_c)=(1+c^2){\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'+{\varvec{\Psi }}\) and \(\mathrm{Var}(s_{jkc})=\mathrm{Var}(s_{jk})\), relative errors in all the elements of \(\mathbf{S}_c\) are smaller than those of \(\mathbf{S}\). Thus, we expect that the post-hoc use of the anti-ridge method is more effective than the ridge method in improving the convergence properties of FS. However, such a technique can only be used in certain applications where the factor loadings are not subject to estimation.

Example 1

Consider the linear latent growth curve model (Preacher et al. 2008) \(y_j=\xi _1+(j-1)\xi _2+\varepsilon _j\), \(j=1, 2, \ldots , p\), where \(\xi _1\) is the latent intercept and \(\xi _2\) is the latent slope, with \(E(\xi _1)=\tau _1\), \(E(\xi _2)=\tau _2\), \(\mathrm{Var}(\xi _1)=\phi _{11}\), \(\mathrm{Var}(\xi _2)=\phi _{22}\), and \(\mathrm{Cov}(\xi _1,\xi _2)=\phi _{12}\). This results in a covariance structure as in (6), where all the elements of the first column of \({\varvec{\Lambda }}\) are 1.0, and those of the second column of \({\varvec{\Lambda }}\) are 0, 1, 2, \(\ldots ,\,p-1\) in sequence; \({\varvec{\Phi }}\) is a free matrix subject to estimation; and \({\varvec{\Psi }}\) is a diagonal matrix. For growth curve modeling, there is also a mean structure \(\mu _j=\tau _1+(j-1)\tau _2\). Then, by the same rationale as for covariance structure analysis alone, keeping the sample means the same and treating \(\mathbf{S}_c=\mathbf{S}+c^2{\varvec{\Lambda }}{\varvec{\Lambda }}'\) as the new sample covariance matrix will increase the likelihood that the Fisher-scoring algorithm converges. Except for the estimate of \({\varvec{\Phi }}\), all other estimates obtained from fitting \((\bar{\mathbf{y}},\mathbf{S}_c)\) by the mean and covariance structure model are consistent, and one can obtain a consistent estimate of \({\varvec{\Phi }}\) by \(\hat{{\varvec{\Phi }}}=\hat{{\varvec{\Phi }}}_c-c^2\mathbf{I}\), where \(\hat{{\varvec{\Phi }}}_c\) is the estimate of \({\varvec{\Phi }}\) under modeling \(\mathbf{S}_c\).
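As a sketch, the post-hoc anti-ridge step described in this example amounts to the following; fit_growth_model stands in for whatever routine (e.g., Fisher scoring) returns estimates from an input covariance matrix, and is an assumed placeholder rather than a library call:

```python
import numpy as np

def anti_ridge_growth(S, fit_growth_model, c2=1.0):
    p = S.shape[0]
    # Fixed loadings of the linear growth model: intercept column and slope column
    Lam = np.column_stack([np.ones(p), np.arange(p)])
    S_c = S + c2 * (Lam @ Lam.T)              # inflate the common-score part
    est = fit_growth_model(S_c)               # estimates other than Phi are consistent
    est["Phi"] = est["Phi"] - c2 * np.eye(2)  # Phi_hat = Phi_hat_c - c^2 * I
    return est
```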

In summary, both the ridge and the anti-ridge methods alleviate the non-convergence problems caused by (C3), as discussed in Sect. 2. If the problem of a nearly rank deficient \(\dot{{\varvec{\sigma }}}({\varvec{\theta }}^{(t)})\) is due to extreme or improper values of \({\varvec{\theta }}^{(t)}\), caused by large sampling errors, then the two methods also alleviate the non-convergence problems caused by (C2). When non-convergence is due to the fluctuations of certain elements of \(\Delta {\varvec{\theta }}^{(t)}\) from iteration to iteration caused by the interaction of sizeable errors in \(\mathbf{S}\) and the conditions of \({\varvec{\Sigma }}^{(t)}\) and/or \(\mathbf{H}^{(t)}\), then the two methods also address the problems caused by (C4).

3.3 Condition numbers following the ridge and anti-ridge methods

In addition to affecting relative errors in FS, the ridge and anti-ridge methods also affect condition numbers of the model covariance and information matrices. In the following discussion, \({\varvec{\Sigma }}^{(t)}\) and \(\mathbf{H}^{(t)}\) are used to denote the model covariance and information matrices corresponding to modeling \(\mathbf{S}\); \({\varvec{\Sigma }}_a^{(t)}\) and \(\mathbf{H}_a^{(t)}\), and \({\varvec{\Sigma }}_c^{(t)}\) and \(\mathbf{H}_c^{(t)}\) are used to denote those corresponding to modeling \(\mathbf{S}_a\) and \(\mathbf{S}_c\), respectively.

When \(\mathbf{S}\) is close to being singular or ill-conditioned, because the algorithm approximates \(\mathbf{S}\) by \({\varvec{\Sigma }}^{(t)}\) as closely as possible, \({\varvec{\Sigma }}^{(t)}\approx \mathbf{S}\) will very likely be close to singular or not invertible from iteration to iteration, and so will \(\mathbf{W}({\varvec{\theta }}^{(t)})\) and the corresponding \(\mathbf{H}^{(t)}\). Since \(\mathbf{S}_a=\mathbf{S}+a\mathbf{I}\), we have \({\varvec{\Sigma }}_a^{(t)}\approx {\varvec{\Sigma }}^{(t)}+a\mathbf{I}\). Thus, with an appropriate a, \({\varvec{\Sigma }}_a^{(t)}\) in the ridge method is always well-conditioned. But this is not always true for \(\mathbf{H}_a^{(t)}\). When an ill-conditioned \(\mathbf{H}^{(t)}\) is due to an ill-conditioned \({\varvec{\Sigma }}^{(t)}\), not a rank-deficient \(\dot{{\varvec{\sigma }}}({\varvec{\theta }}^{(t)})\), then \(\mathbf{H}_a^{(t)}\) will be well-conditioned. However, if an ill-conditioned \(\mathbf{H}^{(t)}\) is caused by a rank-deficient \(\dot{{\varvec{\sigma }}}({\varvec{\theta }}^{(t)})\), then the condition of \(\mathbf{H}_a^{(t)}\) may not improve. This is because the ridge constant a affects mainly the variances of errors, whose values do not have any effect on the Jacobian matrix \(\dot{{\varvec{\sigma }}}({\varvec{\theta }}^{(t)})\).

Another scenario in which the ridge method improves the condition numbers of \({\varvec{\Sigma }}^{(t)}\) and \(\mathbf{H}^{(t)}\) is when \(\mathbf{S}\) is well-conditioned but its elements contain substantial sampling or systematic errors, which cause some \(\psi _{jj}^{(t)}\) to be close to zero or negative during the iterative process. Then \({\varvec{\Sigma }}({\varvec{\theta }}^{(t)})\) might be close to singular and so might be \(\mathbf{H}^{(t)}\). Again, because \({\varvec{\Sigma }}_a^{(t)}\approx {\varvec{\Sigma }}^{(t)}+a\mathbf{I}\), both \({\varvec{\Sigma }}_a^{(t)}\) and \(\mathbf{H}_a^{(t)}\) will be well-conditioned.

A third scenario where the ridge method improves the condition numbers of \({\varvec{\Sigma }}^{(t)}\) and \(\mathbf{H}^{(t)}\) is through reducing relative errors in \(\mathbf{S}_a\). Certain elements of \({\varvec{\theta }}^{(t)}\) can become extreme when a model is not a good representation of the data, due to sampling or systematic errors. Extreme elements other than variances of errors in \({\varvec{\theta }}^{(t)}\) can also make \({\varvec{\Sigma }}^{(t)}\) close to singular or \(\dot{{\varvec{\sigma }}}({\varvec{\theta }}^{(t)})\) close to rank deficient. Due to smaller relative errors in \(\mathbf{S}_a\), \({\varvec{\theta }}_a^{(t)}\) becomes less extreme and results in well-conditioned \({\varvec{\Sigma }}_a^{(t)}\) and \(\mathbf{H}_a^{(t)}\).

The changes in condition numbers of the model covariance and information matrices following the anti-ridge method are much more complicated than those following the ridge method. This is because there is no simple relationship between the eigenvalues of \(\mathbf{S}\) and those of \(\mathbf{S}_c=\mathbf{S}+c^2{\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'\). Although \(\kappa (\mathbf{S}_c)\) is not necessarily always greater than \(\kappa (\mathbf{S})\), we would expect \(\kappa (\mathbf{S}_c)\) to be greater than \(\kappa (\mathbf{S})\) most of the time in practice. However, \(\kappa (\mathbf{H}_c^{(t)})\) is not necessarily greater than \(\kappa (\mathbf{H}^{(t)})\) even if \(\kappa (\mathbf{S}_c)>\kappa (\mathbf{S})\). A scenario where \({\varvec{\Sigma }}_c^{(t)}\) and \(\mathbf{H}_c^{(t)}\) are better conditioned than \({\varvec{\Sigma }}^{(t)}\) and \(\mathbf{H}^{(t)}\) is when elements of \({\varvec{\theta }}^{(t)}\) are extreme due to substantial relative errors in \(\mathbf{S}\). Similar to the third scenario with the ridge method, with smaller relative errors in \(\mathbf{S}_c\), \({\varvec{\theta }}_c^{(t)}\) becomes less extreme and results in well-conditioned \({\varvec{\Sigma }}_c^{(t)}\) and \(\mathbf{H}_c^{(t)}\). We will show the changes in condition numbers due to anti-ridge manipulations using examples and Monte Carlo simulations in the next two sections.

Another interesting fact is that the condition number \(\kappa (\mathbf{H}^{(t)})\) is not invariant with respect to rescaling of \({\varvec{\Sigma }}\) by a constant. That is, \(\kappa (\mathbf{H}^{(t)})\) may change when all the elements in \(\mathbf{S}\) or \({\varvec{\Sigma }}({\varvec{\theta }})\) change proportionally. We will illustrate such a property of condition numbers in Sect. 5.
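A quick numerical illustration of this non-invariance, reusing vech, duplication_matrix and one_factor_sigma_dot from the sketches in Sect. 2 and parameter values of our own choosing:

```python
import numpy as np
# Reuses vech(), duplication_matrix() and one_factor_sigma_dot() from Sect. 2.

def info_matrix(lam, psi):
    """H = sigma_dot' W sigma_dot for a one-factor model at (lam, psi)."""
    p = len(lam)
    Sinv = np.linalg.inv(np.outer(lam, lam) + np.diag(psi))
    D = duplication_matrix(p)
    W = 0.5 * D.T @ np.kron(Sinv, Sinv) @ D
    sd = one_factor_sigma_dot(lam, p)
    return sd.T @ W @ sd

lam, psi, c = np.ones(4), np.ones(4), 2.0
# Rescaling Sigma by c corresponds to lam -> sqrt(c)*lam and psi -> c*psi
print(np.linalg.cond(info_matrix(lam, psi)),
      np.linalg.cond(info_matrix(np.sqrt(c) * lam, c * psi)))  # generally unequal
```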

Having discussed how non-convergence problems are affected by condition numbers of \({\varvec{\Sigma }}^{(t)}\) and/or \(\mathbf{H}^{(t)}\) caused through (C1) and (C4), we would like to note that, when the model \({\varvec{\Sigma }}({\varvec{\theta }})\) is in a neighborhood of \(\mathbf{S}\), a large but not extreme \(\kappa (\mathbf{S})\) alone may not cause a non-convergence problem although it affects the accuracy of parameter estimates. The value of \(\kappa (\mathbf{S})\) affects but does not determine the value of \(\kappa ({\varvec{\Sigma }}^{(t)})\) or \(\kappa (\mathbf{H}^{(t)})\). The effect of \(\kappa (\mathbf{S})\) on convergence is mostly through its interactions with the model and/or the relative errors in \(\mathbf{S}\).

4 Numerical examples

In this section, we consider two examples. The Fisher-scoring algorithm repeatedly runs into non-convergence problems in simulation studies, especially at smaller N. The data (sample covariance matrices) for the examples are just two samples (replications) from our simulation studies. The first example involves a one-factor model with \(p=4\) variables, and the second example involves a structural equation model with \(p=6\) variables and two latent factors. When estimating each of the models, the FS algorithm implemented in SAS IML cannot reach convergence. In addition to the FS algorithm, we will also use the commercial programs EQS (Bentler 2008) and SAS Calis (SAS Institute Inc 2011) to estimate the models in the two examples.

Example 2

The sample covariance matrix

$$\begin{aligned} \mathbf{S}=\left( \begin{array}{rrrr} 1.436 &{}\quad .176 &{}\quad .506 &{}\quad .120\\ .176 &{}\quad 1.681 &{}\quad 1.153 &{}\quad .616\\ .506 &{}\quad 1.153 &{}\quad 1.278 &{}\quad .243\\ .120 &{}\quad .616 &{}\quad .243 &{}\quad 1.946 \end{array}\right) \end{aligned}$$
(11)

is obtained from a normally distributed population with sample size \(N=30\). The population covariance matrix satisfies a one-factor model with all the factor loadings, the factor variance, and all the error variances being 1.0. In estimating the model, we fix the factor variance at 1.0 for model identification. Thus, the model has 8 parameters with \({\varvec{\theta }}=(\lambda _1,\lambda _2,\lambda _3,\lambda _4,\psi _{11},\psi _{22},\psi _{33},\psi _{44})'\). The population counterpart of \({\varvec{\theta }}\) is \({\varvec{\theta }}_0=(1, 1, 1, 1, 1, 1, 1, 1)'\), which is used as the initial value of the FS and other algorithms described below, even when \(\mathbf{S}\) is replaced by \(\mathbf{S}_a=\mathbf{S}+a\mathbf{I}\) or \(\mathbf{S}_c=\mathbf{S}+c^2\mathbf{1}\mathbf{1}'\), where \(\mathbf{1}=(1,1,1,1)'\) with \(\mathbf{1}\mathbf{1}'\) being the population covariance matrix of the common scores. The criterion for convergence of the FS algorithm is defined as

$$\begin{aligned} \max |\Delta {\varvec{\theta }}^{(t)}|<.0001\ \text{ within 300 iterations,} \end{aligned}$$
(12)

where \(\max |\cdot |\) is the maximum absolute value on all the elements of \(\Delta {\varvec{\theta }}^{(t)}\). EQS and SAS Calis have their own convergence criteria that are different from (12).

When fitting the \(\mathbf{S}\) in (11) by the one-factor model, at the 60th iteration (\(t=60\)), FS as implemented in SAS IML declares that the information matrix \(\mathbf{H}^{(t)}\) cannot be inverted, with \(\kappa ({\varvec{\Sigma }}^{(t)})\approx 3.74\times 10^8\) and \(\kappa (\mathbf{H}^{(t)})\approx 1.56\times 10^{17}\). Initial values other than \({\varvec{\theta }}_0\) are also tried, and FS runs into the same problem. The problem of a singular information matrix encountered by FS in this example may also occur with other iterative methods. We may want to address such a problem by adjusting the step size as in (4) with a proper value of \(\alpha \), which is the default implementation in EQS. With the default convergence criterion of EQS (\(\hbox {conv}=.001\)), the program converges in 261 iterations and yields

$$\begin{aligned} \hat{{\varvec{\theta }}}=(.049, .142, 8.267, -.027, 1.434, 1.661, -67.031, 1.945)'. \end{aligned}$$
(13)

The output of EQS indicates that during the iteration process step size is adjusted many times with the \(\alpha \) in (4) ranging from 1.0 to .001. However, using the \(\hat{\varvec{\theta }}\) in (13) as the initial value \({\varvec{\theta }}^{(0)}\), FS declares that \(\mathbf{H}^{(t)}\) is singular at \(t=16\), with \(\kappa (\mathbf{H}^{(t)})=-3.1\times 10^{16}\). To better understand the problem, we reset the convergence criterion in EQS to \(\hbox {conv}=.000001\). Then EQS cannot reach convergence in 1000 iterations.

The program SAS Calis uses the so-called Levenberg-Marquardt optimization method (Nocedal and Wright 1999, pp. 262–266) with step-size adjustment. With the default convergence criterion of Calis and the maximum number of iterations set at 1000, SAS Calis gives the error message “LEVMAR Optimization cannot be completed” at the end of the 1000 iterations.

For the \(\mathbf{S}\) in (11), fitting the \(\mathbf{S}_a\) at \(a=.5\) by the one-factor model, FS converges in 82 iterations and yields

$$\begin{aligned} \hat{\varvec{\theta }}=(.388, .933, 1.236, .243, 1.285, .811, -.250, 1.887)', \end{aligned}$$
(14)

where the estimates of error variances are obtained by subtracting a from the values returned by the FS algorithm, so that the estimates are consistent. Notice that the solution in (14) contains a negative estimate of error variance (called a Heywood case in factor analysis). Fitting this \(\mathbf{S}_a\) by the one-factor model with \(\hbox {conv}=.000001\), after 126 iterations EQS obtains estimates with the first 3 decimals identical to those in (14).

Fitting the \(\mathbf{S}_a\) at \(a=1\) by the one-factor model, FS converges in 29 iterations and yields

$$\begin{aligned} \hat{\varvec{\theta }}=(.335, 1.089, 1.059, .380, 1.324, .495, .157, 1.801)'. \end{aligned}$$
(15)

Fitting this \(\mathbf{S}_a\) by the one-factor model with \(\hbox {conv}=.000001\), EQS converges in 46 iterations and yields estimates with the first 3 decimals identical to those in (15).

We next apply the anti-ridge method by fitting the one-factor model to \(\mathbf{S}_c=\mathbf{S}+c^2\mathbf{1}\mathbf{1}'\) at \(c^2=.5\) and 1.0. These values of \(c^2\) are chosen so that the variances-covariances of the common scores are increased by 50 and 100 %, respectively, corresponding to the parallel increases of error variances at \(a=.5\) and 1.0. Let \(\hat{{\varvec{\lambda }}}_c\) be the vector of estimates of the factor loadings corresponding to the solution of fitting \(\mathbf{S}_c\). The anti-ridge estimates of factor loadings reported below are obtained by \(\hat{{\varvec{\lambda }}}=\hat{{\varvec{\lambda }}}_c-[(1+c^2)^{1/2}-1]\mathbf{1}\), while the estimates of error variances are not changed. Thus, all the reported estimates following the anti-ridge method are consistent.

Fitting the one-factor model to \(\mathbf{S}_c\) at \(c^2=.5\), FS converges in 37 iterations and yields

$$\begin{aligned} \hat{\varvec{\theta }}=(.529, .992, 1.134, .285, 1.368, .700, -.069, 2.186)', \end{aligned}$$
(16)

which also contains a Heywood case. For this \(\mathbf{S}_c\) and the one-factor model, with \(\hbox {conv}=.000001\), EQS converges in 63 iterations and yields the same estimates as in (16) for the first 3 decimal places. Fitting the one-factor model to the \(\mathbf{S}_c\) at \(c^2=1.0\), FS converges in 20 iterations and yields

$$\begin{aligned} \hat{\varvec{\theta }}=(.583, 1.025, 1.081, .438, 1.441, .609, .042, 2.220)'. \end{aligned}$$
(17)

Fitting this \(\mathbf{S}_c\) by the one-factor model with \(\hbox {conv}=.000001\), EQS converges in 34 iterations and yields estimates with the first 3 decimals identical to those in (17).
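Using the fs_one_factor sketch from Sect. 2, the ridge and anti-ridge fits of this example can be replayed as follows; this is our own implementation, so iteration counts and low-order digits may differ from the SAS IML results reported above:

```python
import numpy as np
# Uses fs_one_factor() from the sketch in Sect. 2.
S = np.array([[1.436,  .176,  .506,  .120],
              [ .176, 1.681, 1.153,  .616],
              [ .506, 1.153, 1.278,  .243],
              [ .120,  .616,  .243, 1.946]])   # Eq. (11)
theta0 = np.ones(8)                            # population values, as in the text

# Ridge: fit S + a*I, then subtract a from the error-variance estimates
a = 1.0
theta_a, t_a, ok_a = fs_one_factor(S + a * np.eye(4), theta0)
theta_a[4:] -= a

# Anti-ridge: fit S + c^2*11', then shift the loading estimates back
c2 = 1.0
ones = np.ones((4, 1))
theta_c, t_c, ok_c = fs_one_factor(S + c2 * (ones @ ones.T), theta0)
theta_c[:4] -= np.sqrt(1.0 + c2) - 1.0
```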

Table 1 Condition numbers of the input covariance (ICov) matrix, the estimated covariance (ECov) matrix, and the estimated information (EInf) matrix corresponding to \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\) in Example 2

Table 1 contains the condition numbers of \(\mathbf{S}\), \(\mathbf{S}_a\) (\(a=.5, 1.0\)) and \(\mathbf{S}_c\) (\(c^2=.5, 1.0\)), called the input covariance (ICov) matrix in the table. Condition numbers of the estimated covariance (ECov) matrix \(\hat{{\varvec{\Sigma }}}\) or \({\varvec{\Sigma }}^{(60)}\) as well as the corresponding estimated information (EInf) matrix are also reported in the table. With \(\kappa (\mathbf{S})=14.645\), the sample covariance matrix is not ill-conditioned. However, the sampling errors in \(\mathbf{S}\) cause \({\varvec{\Sigma }}^{(t)}\) and \(\mathbf{H}^{(t)}\) to be close to singular, which further causes FS and other algorithms to fail to converge. The results in Table 1 indicate that \(\kappa (\hat{\varvec{\Sigma }}_a)\) decreases as a increases and \(\kappa (\hat{\varvec{\Sigma }}_c)\) increases as \(c^2\) increases. However, the condition number of the information matrix corresponding to \(\mathbf{S}_c\) at \(c^2=1.0\) is smaller than that at \(c^2=.5\). The condition number of the information matrix corresponding to \(\mathbf{S}_a\) at \(a=.5\) is also much larger than that corresponding to \(\mathbf{S}_c\) at either \(c^2=.5\) or 1.0.

In Example 2, neither FS nor the default algorithm in SAS Calis is able to reach convergence when fitting \(\mathbf{S}\) by the one-factor model. The seeming convergence of EQS at \(\hbox {conv}=.001\) is just a coincidence, the \(\hat{{\varvec{\theta }}}\) in (13) not being a stationary point. By using either the ridge or anti-ridge method, FS easily yields converged solutions. However, the estimates by the different methods are quite different. Although both the ridge and the anti-ridge methods yield estimates that are consistent with the model and population, the \(\mathbf{S}\) in (11) contains substantial sampling errors, which cause the differences among the estimates in (14) to (17). The example shows that, in addition to yielding converged solutions, the ridge and anti-ridge methods can also be effective in removing Heywood cases.

Notice that the convergence criterion in EQS is defined differently from that in (12). The reason we chose \(\hbox {conv}=.000001\) when using EQS to fit \(\mathbf{S}_a\) and \(\mathbf{S}_c\) is that the program was unable to reach convergence when working with \(\mathbf{S}\) under the same value of conv. With conv set at a larger number, EQS would take fewer iterations to reach convergence.

The previous example concerned the convergence of the FS algorithm with a confirmatory factor model. The FS algorithm has similar problems when fitting a structural equation model, as illustrated by the following example.

Example 3

Consider six variables, with \(y_1\), \(y_2\) and \(y_3\) being indicators for the first factor \(\xi _1\); \(y_4\), \(y_5\) and \(y_6\) being indicators for the second factor \(\xi _2\); and \(\xi _2\) predicted by \(\xi _1\) according to

$$\begin{aligned} \xi _2=\gamma _{21}\xi _1+\zeta _2, \end{aligned}$$

where \(\xi _1\) and \(\zeta _2\) are independent with \(\phi _{11}=\mathrm{Var}(\xi _1)\) and \(\varphi _{22}=\mathrm{Var}(\zeta _2)\). Letting \({\varvec{\xi }}=(\xi _1,\xi _2)'\), then the covariance structure of \(\mathbf{y}=(y_1,y_2,\ldots , y_6)'\) is given by Eq. (6) with

$$\begin{aligned} {\varvec{\Lambda }}'=\left( \begin{array}{llllll} 1.0&{}\quad \lambda _2&{}\quad \lambda _3&{}\quad 0&{}\quad 0&{}\quad 0\\ 0&{}\quad 0&{}\quad 0&{}\quad 1.0&{}\quad \lambda _5&{}\quad \lambda _6 \end{array}\right) \quad \mathrm{and}\quad {\varvec{\Phi }}=\left( \begin{array}{ll} \phi _{11}&{}\quad \gamma _{21}\phi _{11}\\ \gamma _{21}\phi _{11}&{}\quad \gamma _{21}^2\phi _{11}+\varphi _{22} \end{array}\right) . \end{aligned}$$
(18)

Note that, for the above structural equation model, we cannot fix the variance of \(\xi _2\) at 1.0 because its value is subject to prediction. We therefore set \(\lambda _1=\lambda _4=1.0\) for the purpose of model identification. In this model, there are 13 free parameters with

$$\begin{aligned} {\varvec{\theta }}=(\lambda _2,\lambda _3,\lambda _5,\lambda _6,\phi _{11},\gamma _{21},\varphi _{22}, \psi _{11},\psi _{22}, \psi _{33}, \psi _{44}, \psi _{55}, \psi _{66})'. \end{aligned}$$

As in Example 2, the sample covariance matrix

$$\begin{aligned} \mathbf{S}=\left( \begin{array}{rrrrrr} 3.498 &{}\quad 1.686 &{}\quad .679 &{}\quad 1.033 &{}\quad 1.426 &{}\quad .286\\ 1.686 &{}\quad 1.964 &{}\quad 1.062 &{}\quad .718 &{}\quad .654 &{}\quad .642\\ .679 &{}\quad 1.062 &{}\quad 1.754 &{}\quad .863 &{}\quad 1.036 &{}\quad .720\\ 1.033 &{}\quad .718 &{}\quad .863 &{}\quad 2.061 &{}\quad 1.402 &{}\quad .779\\ 1.426 &{}\quad .654 &{}\quad 1.036 &{}\quad 1.402 &{}\quad 2.284 &{}\quad 1.380\\ .286 &{}\quad .642 &{}\quad .720 &{}\quad .779 &{}\quad 1.380 &{}\quad 2.646 \end{array}\right) \end{aligned}$$
(19)

is obtained from a normally distributed population with \(N=30\). Except for \(\gamma _{21}=.5\) and \(\varphi _{22}=.75\), all the other elements of \({\varvec{\theta }}\) in the population are 1.0; we denote the vector of these values as \({\varvec{\theta }}_0\). The initial values of the FS and other algorithms described below are set at \({\varvec{\theta }}_0\), regardless of whether the ridge or anti-ridge method is used when estimating the model in (18). The criterion for convergence of the FS algorithm is the same as defined in (12).

When fitting the model in (18) to the sample covariance matrix in (19), the FS algorithm does not converge. Starting at the 66th iteration (\(t=66\)), the \({\varvec{\theta }}^{(t)}\) in Eq. (2) oscillates between

$$\begin{aligned} {\varvec{\theta }}^{(t)}= & {} (.668, .633, 1.346, .972, 2.054, .366, .690, 1.444, .832, \nonumber \\&\quad .926, 1.040, .384, 1.688)' \end{aligned}$$
(20)

and

$$\begin{aligned} {\varvec{\theta }}^{(t+1)}= & {} (.983, .640, 1.535, .948, 1.379, .509, .514, 2.119, .483, 1.184,\nonumber \\&\quad 1.148, .139, 1.828)'. \end{aligned}$$
(21)

Other initial values are also used but FS eventually runs into the same problem.

We would hope that the problem of oscillation between two points encountered by FS in this example would be solved by adjusting the step size via the value of \(\alpha \) in (4). With the default convergence criterion of EQS (\(\hbox {conv}=.001\)), the program converges in 12 iterations and yields

$$\begin{aligned} \hat{{\varvec{\theta }}}= & {} ( .887, .622, 1.445, .963, 1.799, .415, .662, 1.699, .549, 1.058,\nonumber \\&\quad 1.089, .256, 1.745)'. \end{aligned}$$
(22)

The output of EQS indicates that, at the 4th and 10th iterations, step halving is used. However, with the \(\hat{\varvec{\theta }}\) in (22) as the initial values, the FS algorithm in (2) returns to oscillating between the two sets of values in (20) and (21), starting at \(t=134\). To better understand the problem, we reset the convergence criterion in EQS to \(\hbox {conv}=.00001\). Then EQS cannot reach convergence in 1000 iterations.

With the default convergence criterion, SAS Calis yields a vector of converged values essentially the same as that of EQS. However, when the converged values of SAS Calis are used as the initial values for FS and the algorithm is allowed to continue, the iteration returns to oscillating between the two sets of values in (20) and (21).

Since the relative errors in \(s_{jk}\) do not depend on how the structural model is identified, the results and properties regarding the ridge and anti-ridge methods obtained in the previous sections still hold, as illustrated below.

For the \(\mathbf{S}\) in (19), fitting the \(\mathbf{S}_a=\mathbf{S}+a\mathbf{I}\) at \(a=.5\) by the model in (18), the FS algorithm converges in 50 iterations and yields

$$\begin{aligned} \hat{\varvec{\theta }}= & {} (.877, .633, 1.438, .928, 1.758, .462, .609, 1.740, .613, 1.050, \nonumber \\&\quad 1.077, .248, 1.798)'. \end{aligned}$$
(23)

Fitting the \(\mathbf{S}_a\) with \(a=.5\) and \(\hbox {conv}=.00001\) by the model in (18), EQS converges in 59 iterations and yields estimates with the first 3 decimals identical to those in (23).

Fitting the \(\mathbf{S}_a\) at \(a=1\) by the SEM model in (18), the FS algorithm converges in 28 iterations and yields

$$\begin{aligned} \hat{\varvec{\theta }}= & {} (.875, .644, 1.430, .904, 1.725, .484, .593, 1.773, .643, 1.039, \nonumber \\&\quad 1.063, .244, 1.830)'. \end{aligned}$$
(24)

With \(\hbox {conv}=.00001\), fitting the \(\mathbf{S}_a\) at \(a=1\) by the SEM model, EQS converges in 33 iterations and yields estimates with the first 3 decimals identical to those in (24).

Let \(\mathbf{S}_c=\mathbf{S}+c^2{\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'\), where \({\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'\) is the population covariance matrix of the common scores. Because \(\lambda _1\) and \(\lambda _4\) are fixed at 1.0 in the formulation of the SEM model in (18), the population counterpart of \({\varvec{\Phi }}\) corresponding to \(\mathbf{S}_c\) becomes \((1+c^2){\varvec{\Phi }}\), and those of \({\varvec{\Lambda }}\) and \({\varvec{\Psi }}\) remain the same. Thus, consistent estimates of \(\phi _{11}\) and \(\varphi _{22}\) are obtained by \(\hat{\phi }_{11}=\hat{\phi }_{c11}-c^2\phi _{110}\) and \(\hat{\varphi }_{22}=\hat{\varphi }_{c22}-c^2\varphi _{220}\), where \(\hat{\phi }_{c11}\) and \(\hat{\varphi }_{c22}\) are the estimates of \(\phi _{11}\) and \(\varphi _{22}\) under modeling \(\mathbf{S}_c\); and \(\phi _{110}\) and \(\varphi _{220}\) are the population values of \(\phi _{11}\) and \(\varphi _{22}\), respectively.

Fitting the SEM model to the \(\mathbf{S}_c\) at \(c^2=.5\), FS converges in 135 iterations and yields

$$\begin{aligned} \hat{\varvec{\theta }}= & {} (.973, .715, 1.318, .977, 1.690, .417, .687, 1.808, .389, 1.135, \nonumber \\&\quad 1.119, .278, 1.770)'. \end{aligned}$$
(25)

With \(\hbox {conv}=.00001\), fitting the \(\mathbf{S}_c\) by the SEM model in (18) using EQS takes 160 iterations and yields estimates with the first 3 decimals identical to those in (25).

Fitting the SEM model to the \(\mathbf{S}_c\) at \(c^2=1.0\), FS converges in 44 iterations and yields

$$\begin{aligned} \hat{\varvec{\theta }}= & {} (1.006, .769, 1.249, .984, 1.633, .421, .705, 1.865, .299, 1.196, \nonumber \\&\quad 1.139, .287, 1.786)'. \end{aligned}$$
(26)

With \(\hbox {conv}=.00001\), fitting the SEM model to the \(\mathbf{S}_c\) at \(c^2=1\) by EQS takes 53 iterations and yields estimates with the first 3 decimals identical to those in (26).

Table 2 Condition numbers of the input covariance (ICov) matrix, the estimated covariance (ECov) matrix, and the estimated information (EInf) matrix corresponding to \(\mathbf{S}\) (\(t\ge 22\)), \(\mathbf{S}_a\) and \(\mathbf{S}_c\) in Example 3

Parallel to Table 1, the condition numbers of \(\mathbf{S}\), \(\mathbf{S}_a\) (\(a=.5\), 1.0) and \(\mathbf{S}_c\) (\(c^2=.5\), 1.0), as well as those of the corresponding \(\hat{{\varvec{\Sigma }}}\) or \({\varvec{\Sigma }}^{(t)}\) and the associated information matrices for this example, are reported in Table 2. The sample covariance matrix is not ill-conditioned, but \(\kappa (\mathbf{S})=31.381\) is several times that in Table 1. The results in Table 2 indicate that \(\kappa (\hat{\varvec{\Sigma }}_a)\) decreases as a increases and \(\kappa (\hat{\varvec{\Sigma }}_c)\) increases as \(c^2\) increases, as was also observed in Table 1. However, in Table 2 the condition number of the information matrix in fitting \(\mathbf{S}_c\) increases with \(c^2\), and is the largest at \(c^2=1.0\). This may explain why the anti-ridge method in this example is not as effective as in the previous example, and why it took 135 iterations for the FS algorithm to converge when fitting \(\mathbf{S}_c\) at \(c^2=.5\), compared to 50 iterations when fitting \(\mathbf{S}_a\) at \(a=.5\).

In this example, the FS algorithm is unable to reach convergence when \(\mathbf{S}\) is fitted by the SEM model. Step-size adjustment does not solve the problem, as shown by running EQS with \(\hbox {conv}=.00001\). With or without step-size adjustment, FS has no problem reaching a converged solution with either the ridge or the anti-ridge method. Although the \(\mathbf{S}\) in (19) contains substantial sampling errors, the 4 sets of estimates in (23) to (26) are comparable.

Notice that the convergence criterion of EQS when fitting \(\mathbf{S}_a\) and \(\mathbf{S}_c\) in this example is set at \(\hbox {conv}=.00001\), while in the previous example it was set at \(\hbox {conv}=.000001\). This is because, when working with \(\mathbf{S}\), EQS could not reach convergence at these specified values.

5 Monte Carlo results

In this section, we empirically compare the convergence rate and speed of the FS algorithm in fitting \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\). In Sect. 3.1, our characterization of the relative errors in \(\mathbf{S}\) is based on the assumption that factors and errors in the factor model are independent, and the results are derived using asymptotics. We will empirically evaluate the size of relative errors in \(\mathbf{S}\) when factors and errors are dependent but uncorrelated. Because the convergence properties of FS are related to the condition numbers of \(\mathbf{S}\) and/or the information matrix in (3), we will also evaluate how these condition numbers and sampling errors jointly affect the convergence rate and speed of the FS algorithm.

5.1 Conditions

The population distributions are specified through a confirmatory factor model with \(p=15\) observed variables

$$\begin{aligned} \mathbf{y}={\varvec{\mu }}+{\varvec{\Lambda }}(r{\varvec{\xi }})+r{\varvec{\varepsilon }}, \end{aligned}$$
(27)

where \({\varvec{\mu }}\) is a \(15\times 1\) vector of means;

$$\begin{aligned} {\varvec{\Lambda }}=\left( \begin{array}{lll} {\varvec{\lambda }}&{}\quad \mathbf{0}&{}\quad \mathbf{0}\\ \mathbf{0}&{}\quad {\varvec{\lambda }}&{}\quad \mathbf{0}\\ \mathbf{0}&{}\quad \mathbf{0}&{}\quad {\varvec{\lambda }}\end{array}\right) \end{aligned}$$

with \({\varvec{\lambda }}\) being a \(5\times 1\) vector of factor loadings; \({\varvec{\xi }}=(\xi _1,\xi _2,\xi _3)'\) and \({\varvec{\varepsilon }}=(\varepsilon _1, \varepsilon _2, \ldots , \varepsilon _{15})'\) are independent with \(E({\varvec{\xi }})=\mathbf{0}\),

$$\begin{aligned} {\varvec{\Phi }}=\mathrm{Var}({\varvec{\xi }})=(\phi _{jk})=\left( \begin{array}{lll} 1.0 &{}\quad .3 &{}\quad .4\\ .3 &{}\quad 1.0 &{}\quad .5\\ .4 &{}\quad .5 &{}\quad 1.0 \end{array}\right) , \end{aligned}$$

\(E({\varvec{\varepsilon }})=\mathbf{0}\), and \(\mathrm{Var}({\varvec{\varepsilon }})={\varvec{\Psi }}=\mathrm{diag}(\psi _{11},\psi _{22},\ldots , \psi _{pp})\). The multiplier r in (27), to be further specified, is to make the factors \((r{\varvec{\xi }})\) and errors \((r{\varvec{\varepsilon }})\) dependent but uncorrelated. Such a condition was used in Hu et al. (1992) to invalidate the so-called asymptotic robustness conditions in studying the likelihood ratio statistic. Three sets of population parameters are used:

(P1) \({\varvec{\lambda }}=(1, 1, 1, 1, 1)'\) or \(\lambda _j=1\), and \(\psi _{jj}=1\), \(j=1\) to 15;

(P2) \({\varvec{\lambda }}=(2, 2, 2, 2, 2)'\) or \(\lambda _j=2\), and \(\psi _{jj}=1\), \(j=1\) to 15;

(P3) \({\varvec{\lambda }}=(1, 1, 1, 1, 1)'\) or \(\lambda _j=1\), and \(\psi _{jj}=2\), \(j=1\) to 15.

Four population distribution conditions of \(\mathbf{y}\), as described in Table 3a, are used. Each distribution of \(\mathbf{y}\) is defined through \({\varvec{\xi }}={\varvec{\Phi }}^{1/2}\mathbf{z}_{\xi }\) and \({\varvec{\varepsilon }}={\varvec{\Psi }}^{1/2}\mathbf{z}_{\varepsilon }\), where the elements in \(\mathbf{z}_{\xi }=(z_{\xi 1},z_{\xi 2},z_{\xi 3})'\) and \(\mathbf{z}_{\varepsilon }=(z_{\varepsilon 1},z_{\varepsilon 2},\ldots ,z_{\varepsilon 15})'\) are independent and each follows a standardized distribution described in the table. In condition D1, \(r=1\), \({\varvec{\xi }}\sim N(\mathbf{0},{\varvec{\Phi }})\) and \({\varvec{\varepsilon }}\sim N(\mathbf{0},{\varvec{\Psi }})\), and thus \(\mathbf{y}\) is normally distributed with mean \({\varvec{\mu }}\) and covariance matrix as given in (6). In conditions D2, D3 and D4, \(r\sim (3/\chi _5^2)^{1/2}\). Since \(E(r^2)=1\), each \({\varvec{\Sigma }}\) in D2, D3 or D4 is also given by (6). In D2, \(\mathbf{y}\) follows an elliptical distribution. In D3, \(\mathbf{y}\) follows a skewed distribution due to a skewed \({\varvec{\xi }}\). In D4, \(\mathbf{y}\) follows a skewed distribution due to a skewed \({\varvec{\varepsilon }}\).
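For concreteness, here is a sketch of how one replication under, say, condition D2 with parameterization P1 can be generated; the Cholesky factor is used as one choice of \({\varvec{\Phi }}^{1/2}\):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, N = 15, 3, 100
Lam = np.kron(np.eye(q), np.ones((5, 1)))      # 15 x 3 block-diagonal, P1: lambda_j = 1
Phi = np.array([[1.0, .3, .4], [.3, 1.0, .5], [.4, .5, 1.0]])
Psi = np.eye(p)                                # P1: psi_jj = 1

xi  = rng.standard_normal((N, q)) @ np.linalg.cholesky(Phi).T   # Cov(xi) = Phi
eps = rng.standard_normal((N, p)) @ np.sqrt(Psi)                # Cov(eps) = Psi
r = np.sqrt(3.0 / rng.chisquare(5, size=(N, 1)))                # E(r^2) = 1
y = (r * xi) @ Lam.T + r * eps                 # model (27), with mu = 0
S = np.cov(y, rowvar=False)
```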

Table 3 (a) Population distributions of \(\mathbf{y}\) for the Monte Carlo study, \({\varvec{\xi }}={\varvec{\Phi }}^{1/2}\mathbf{z}_{\xi }\) with \(\mathbf{z}_{\xi }=(z_{\xi 1},z_{\xi 2},z_{\xi 3})'\), and the \(z_{\xi j}\) are independent; \({\varvec{\varepsilon }}={\varvec{\Psi }}^{1/2}\mathbf{z}_{\varepsilon }\) with \(\mathbf{z}_{\varepsilon }=(z_{\varepsilon 1},z_{\varepsilon 2},\ldots ,z_{\varepsilon 15})'\), and the \(z_{\varepsilon j}\) are independent. (b) Values of parameters in the population, and condition numbers of \({\varvec{\Sigma }}\), \({\varvec{\Sigma }}_a=E(\mathbf{S}_a)\) and \({\varvec{\Sigma }}_c=E(\mathbf{S}_c)\) as well as those of the corresponding information matrix \(\mathbf{H}=\dot{{\varvec{\sigma }}}'\mathbf{W}\dot{{\varvec{\sigma }}}\)

Since non-convergence is typically associated with smaller sample sizes, we choose \(N=30\), 50, 100 and 200. The number of replications is \(N_r=500\).

To study the convergence properties of FS with the ridge and anti-ridge methods, we fit \(\mathbf{S}\), \(\mathbf{S}_a=\mathbf{S}+a\mathbf{I}\) and \(\mathbf{S}_c=\mathbf{S}+c^2{\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'\), where \({\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'\) is set at the population values of the variances-covariances of the common scores; \(a=1\) and \(c^2=1\) for parameterizations P1 and P2; and \(a=2\) and \(c^2=1\) for parameterization P3. Thus, the choice of a makes the error variances corresponding to \(\mathbf{S}_a\) double those corresponding to \(\mathbf{S}\) in P1, P2 and P3, whereas the choice of c makes the common-score variances-covariances corresponding to \(\mathbf{S}_c\) double those corresponding to \(\mathbf{S}\) in the three conditions. Clearly, the conditions contain both post-hoc and a priori implementations of the ridge and anti-ridge methods. Condition numbers for the population covariance matrix and the information matrix corresponding to \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\) are listed in Table 3b. Notice that the \(\kappa ({\varvec{\Sigma }})\) under P1 equals the \(\kappa ({\varvec{\Sigma }}_c)\) under P3 because \({\varvec{\Sigma }}_c=2{\varvec{\Sigma }}\), but the condition numbers of their corresponding information matrices are not equal.

For each condition in Table 3, the fitted model is the same: a confirmatory 3-factor model as in Eq. (6), in which each factor has 5 unidimensional indicators, the factors are freely correlated, and the errors are uncorrelated. In the estimation, each factor variance is fixed at 1.0. Thus, there are 33 free parameters: 15 factor loadings, 3 factor correlations, and 15 error variances.

For a given condition, let \(s_{ijk}\) be the sample covariance between the jth and kth variables in the ith replication. Since \(E(s_{ijk})=\sigma _{jk}\) is positive in all the conditions, we use the average

$$\begin{aligned} \mathrm{RE}_{od}=\frac{1}{N_r}\sum _{i=1}^{N_r} \left[ \sum _{j=1}^{p-1}\sum _{k=j+1}^{p} \frac{|s_{ijk}-\sigma _{jk}|}{\sigma _{jk}}\right] /[p(p-1)/2] \end{aligned}$$

to measure the relative errors in the off-diagonal elements of the sample covariance matrix \(\mathbf{S}\). Similarly, we use

$$\begin{aligned} \mathrm{RE}_d=\frac{1}{N_r}\sum _{i=1}^{N_r} \left[ \sum _{j=1}^p\frac{|s_{ijj}-\sigma _{jj}|}{\sigma _{jj}}\right] /p \end{aligned}$$

to measure the relative errors in the diagonal elements of \(\mathbf{S}\).
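A direct translation of the two displays, continuing the NumPy sketch (the population \({\varvec{\Sigma }}\) and draw_y are from the previous sketches):

```python
def relative_errors(S_list, Sigma):
    """Average RE_od and RE_d over replications, per the two displays above."""
    p = Sigma.shape[0]
    j, k = np.triu_indices(p, k=1)            # off-diagonal pairs, j < k
    re_od = np.mean([np.mean(np.abs(S[j, k] - Sigma[j, k]) / Sigma[j, k])
                     for S in S_list])
    re_d = np.mean([np.mean(np.abs(np.diag(S) - np.diag(Sigma)) / np.diag(Sigma))
                    for S in S_list])
    return re_od, re_d

S_list = [np.cov(draw_y(30, "D2"), rowvar=False) for _ in range(500)]
print(relative_errors(S_list, Sigma))
```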

Table 4 Relative errors in the sample variances (\(\mathrm{RE}_d\)) and covariances (\(\mathrm{RE}_{od}\)) as factor loadings and/or unique variances vary: (D1) normally distributed population; (D2) elliptically distributed population; (D3) distribution with skewed factors and symmetrically distributed errors; (D4) distribution with skewed errors and symmetrically distributed factors

When fitting \(\mathbf{S}\), \(\mathbf{S}_a\) or \(\mathbf{S}_c\), the population values of \({\varvec{\theta }}\) corresponding to \({\varvec{\Sigma }}=E(\mathbf{S})\), \({\varvec{\Sigma }}_a={\varvec{\Sigma }}+a\mathbf{I}\) and \({\varvec{\Sigma }}_c={\varvec{\Sigma }}+c^2{\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'\) are used as initial values, respectively. For some replications, the FS algorithm cannot reach the convergence criterion defined in (12). These non-converged replications can be classified into two types: in type A, (12) is still not satisfied after 300 iterations; in type B, at some \(t<300\), the ratio of the largest absolute eigenvalue of \(\mathbf{H}^{(t)}\) to the smallest is so large that SAS IML declares \(\mathbf{H}^{(t)}\) singular. The number of replications of each type is recorded to measure the rate of non-convergence/convergence. For each condition, the average number of iterations across all the converged replications (out of 500) is also recorded as an indicator of the speed of convergence of FS.
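To make the iteration and the type A/type B bookkeeping concrete, here is a hedged sketch of FS for this model. The convergence criterion (12) and SAS IML's singularity test are not reproduced here, so a hypothetical step-size tolerance and condition-number cutoff stand in for them; \(\dot{{\varvec{\sigma }}}\) is approximated by forward differences rather than analytic derivatives, and no safeguard against a non-positive-definite \({\varvec{\Sigma }}({\varvec{\theta }}^{(t)})\) is included. The information matrix \(\mathbf{H}=\dot{{\varvec{\sigma }}}'\mathbf{W}\dot{{\varvec{\sigma }}}\) and the step are evaluated in the equivalent trace form \(H_{jk}=\frac{1}{2}\mathrm{tr}({\varvec{\Sigma }}^{-1}\dot{{\varvec{\Sigma }}}_j{\varvec{\Sigma }}^{-1}\dot{{\varvec{\Sigma }}}_k)\).

```python
def sigma_of(theta):
    """Sigma(theta) for the 3-factor CFA: theta = (15 loadings, 3 factor
    correlations, 15 error variances); factor variances fixed at 1."""
    lam, rho, psi = theta[:15], theta[15:18], theta[18:]
    Lam = np.zeros((15, 3))
    for f in range(3):
        Lam[5*f:5*f+5, f] = lam[5*f:5*f+5]
    Phi = np.array([[1.0, rho[0], rho[1]],
                    [rho[0], 1.0, rho[2]],
                    [rho[1], rho[2], 1.0]])
    return Lam @ Phi @ Lam.T + np.diag(psi)

def fisher_scoring(S, theta0, max_iter=300, tol=1e-6, kappa_max=1e12):
    """Returns (theta, status, iterations); tol and kappa_max are hypothetical
    stand-ins for criterion (12) and SAS IML's singularity declaration."""
    theta, q, h = theta0.copy(), len(theta0), 1e-6
    for t in range(1, max_iter + 1):
        Sig = sigma_of(theta)
        Sig_inv = np.linalg.inv(Sig)
        # forward-difference Jacobian dSigma/dtheta_j
        dSig = [(sigma_of(theta + h * np.eye(q)[j]) - Sig) / h for j in range(q)]
        A = [Sig_inv @ D for D in dSig]
        # H_jk = (1/2) tr(Sig_inv dSig_j Sig_inv dSig_k)  (= sigma-dot' W sigma-dot)
        H = 0.5 * np.array([[np.sum(A[j] * A[k].T) for k in range(q)]
                            for j in range(q)])
        ev = np.abs(np.linalg.eigvalsh(H))
        if ev.min() == 0 or ev.max() / ev.min() > kappa_max:
            return theta, "type B", t          # H declared singular
        R = Sig_inv @ (S - Sig) @ Sig_inv
        b = 0.5 * np.array([np.sum(D * R) for D in dSig])  # sigma-dot' W (s - sigma)
        step = np.linalg.solve(H, b)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            return theta, "converged", t
    return theta, "type A", max_iter           # criterion not met in 300 iterations

# population values under P1 (with the hypothetical phi) as initial values
theta0 = np.concatenate([np.ones(15), np.full(3, 0.30), np.ones(15)])
print(fisher_scoring(S, theta0)[1:])
```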

Average condition numbers of \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\) for each condition across the 500 replications are recorded in order to examine their relationship to the convergence properties of FS. Similarly, the average condition numbers of the information matrices corresponding to \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\) across the converged replications are also recorded for each condition.

5.2 Results

Table 4 contains the averages of relative errors in sample variances (\(\mathrm{RE}_d\)) and covariances (\(\mathrm{RE}_{od}\)) across the 500 replications of \(\mathbf{S}\) for each condition. It is clear that, regardless of whether the population distribution is symmetric or skewed, or whether the errors and factors are independent or merely uncorrelated, the \(\mathrm{RE}_{od}\)s in condition P2 (\(\lambda _j=2\), \(\psi _{jj}=1\)) are the smallest and those in condition P3 (\(\lambda _j=1\), \(\psi _{jj}=2\)) are the largest, consistent with the analytical results obtained in Sect. 3.1. The relative errors in sample variances (\(\mathrm{RE}_d\)) do not follow the same pattern as those for the sample covariances. In particular, under the normally or elliptically distributed populations, the \(\mathrm{RE}_d\)s barely change from P1 to P3. With skewed factors, the \(\mathrm{RE}_d\)s are greatest in P2 and smallest in P3. However, with skewed errors/uniquenesses, the \(\mathrm{RE}_d\)s are greatest in P3 and smallest in P2. These results are also consistent with our analysis in Sect. 3.1.

Table 5 The number of replications in which the Fisher-scoring algorithm cannot reach convergence within 300 iterations (before the /) or the information matrix becomes singular during the iterations (after the /); (D1) normally distributed population; (D2) elliptically distributed population; (D3) distribution with skewed factors and symmetrically distributed errors; (D4) distribution with skewed errors and symmetrically distributed factors

Table 5 contains the numbers of non-converged replications of type A (before the /) and type B (after the /), where empty cells correspond to conditions in which all 500 replications converged. We did not include the results for \(N=200\) because all 500 replications converged. It is clear from Table 5 that the number of non-converged replications in fitting \(\mathbf{S}\) is closely related to the size of the \(\mathrm{RE}_{od}\)s reported in Table 4. In particular, the fewest non-converged replications occurred under P2 and the most under P3. Within P3, the three conditions with the largest \(\mathrm{RE}_{od}\) in Table 4 (D3, D2, and D4 under \(N=30\)) correspond to the most non-converged replications in Table 5 (100/62, 74/38, 56/28). The three largest entries of \(\mathrm{RE}_{od}\) under \(N=50\) in Table 4 (D3, D2, D4 following P3) also correspond to the most non-converged conditions following \(N=50\) in Table 5 (18/25, 25/8, 20/5).

Results in Table 5 also show that both the post-hoc ridge and anti-ridge methods are effective in addressing the problem of non-convergence when fitting \(\mathbf{S}\). Under P1, the anti-ridge method is slightly more effective than the ridge method for conditions D2 and D3 but not for D4. Under P2, only two replications in D4 failed to converge when fitting \(\mathbf{S}\); the ridge method solves the problem whereas the anti-ridge method does not. This is because the variances-covariances of the common scores (\({\varvec{\Lambda }}{\varvec{\Phi }}{\varvec{\Lambda }}'\)) corresponding to fitting \(\mathbf{S}\) are already rather large in P2, and further enlarging their values does not make much difference. In other words, the non-converged replications in fitting \(\mathbf{S}\) under P2 are not due to large relative errors in \(\mathbf{S}\) but to something related to the condition numbers of \({\varvec{\Sigma }}^{(t)}\) and/or \(\mathbf{H}^{(t)}\), and the ridge method directly addresses that problem. In contrast, under P3, because the relative errors (\(\mathrm{RE}_{od}\)) are rather large (see Table 4) and \(\kappa (\mathbf{S})\) (to be discussed) is already quite small, reducing the relative errors by modeling \(\mathbf{S}_c\) is more effective than further improving the condition number of \(\mathbf{S}\).

Notice that, under conditions P1 and P3 in Table 5, there are more non-converged replications of type A than of type B when fitting \(\mathbf{S}\), whereas it is the other way around when fitting \(\mathbf{S}_a\). This suggests that the ridge method is less effective in dealing with type B non-convergence. The reason is that, as reported in Table 3b, the condition numbers \(\kappa ({\varvec{\Sigma }})\) for these two conditions are already rather small, and \(\kappa (\mathbf{H}_a)\) is even greater than \(\kappa (\mathbf{H})\). Although the condition number \(\kappa (\mathbf{H}_c)\) under P1 or P3 is also greater than the corresponding \(\kappa (\mathbf{H})\), the relative errors in \(\mathbf{S}\) are effectively reduced by the anti-ridge method, and thus the number of replications in which \(\mathbf{H}^{(t)}\) becomes singular because of large relative errors in \(\mathbf{S}\) is smaller.

Table 6 Average number of iterations across converged replications; (D1) normally distributed population; (D2) elliptically distributed population; (D3) distribution with skewed factors and symmetrically distributed errors; (D4) distribution with skewed errors and symmetrically distributed factors

For each condition, the average number of iterations across the converged replications is reported in Table 6, and a further average across the 4 sample sizes and 4 distribution conditions is reported in the last line of the table. It is clear that both the ridge and anti-ridge methods accelerate the speed of convergence of FS. Under P1, the anti-ridge method is slightly faster than the ridge method on average. Under P2, the ridge method is uniformly faster than the anti-ridge method; and, under P3, the anti-ridge method is uniformly faster than the ridge method. Comparing Tables 6 and 5, we may notice that conditions under which FS converges faster also tend to have fewer non-converged replications. This implies that both the speed and rate of convergence of FS are strongly affected by the relative errors in sample covariances.

Table 7 Average condition number of \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\) across 500 replications; (D1) normally distributed population; (D2) elliptically distributed population; (D3) distribution with skewed factors and symmetrically distributed errors; (D4) distribution with skewed errors and symmetrically distributed factors

The average condition numbers of \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\) across the 500 replications for each condition are reported in Table 7. Due to sampling errors, each of the averages is much greater than the corresponding population condition number reported in Table 3b, especially in condition D4, where the errors follow a skewed distribution. The average condition numbers monotonically decrease as N increases, but they are still substantially above the population values even at \(N=200\). Further averages across the 4 sample sizes and 4 distribution conditions are reported in the last row of Table 7: those under the ridge method are less than twice the population condition number, while those under the anti-ridge method are about five times the corresponding population value.

Table 8 Average condition number of the information matrix across the converged replications; (D1) normally distributed population; (D2) elliptically distributed population; (D3) distribution with skewed factors and symmetrically distributed errors; (D4) distribution with skewed errors and symmetrically distributed factors

Corresponding to \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\), the average condition numbers of the information matrices across the converged replications for each condition are reported in Table 8. Many numbers in the table are huge, far above the population values reported in Table 3b. The condition numbers corresponding to type B non-converged replications are so large that SAS cannot properly store them; including even one of them in the calculation of the average would make the result larger than any of the numbers reported in Table 8. Comparing Tables 8 and 5, we may notice that most of the larger numbers in Table 8 correspond to conditions with many non-converged replications in Table 5, although the numbers in Table 8 are averages over only the converged replications. Comparing Table 8 with Tables 3b and 7 suggests that the condition numbers of the information matrices are affected much more by the relative errors in the sample covariances than by the population condition numbers or those of the sample covariance matrices. The rapid decline of the condition numbers with increasing sample size in Table 8 is likewise due to smaller relative errors in \(\mathbf{S}\).

In summary, the results in this section suggest that the size of sampling errors in \(\mathbf{S}\) affects the convergence properties of FS the most. The sampling errors strongly affect the condition numbers of \(\mathbf{S}\), \(\mathbf{S}_a\) and \(\mathbf{S}_c\) as well as those of the corresponding information matrices, which are key components of the FS algorithm. For the majority of the replications with a singular \(\mathbf{H}^{(t)}\), the ridge method solves the problem by fitting \(\mathbf{S}_a\), which improves the condition number of \(\mathbf{S}\). However, the ridge method is more effective when the non-convergence is caused by fluctuations of \({\varvec{\theta }}^{(t)}\) from iteration to iteration rather than by a singular \(\mathbf{H}^{(t)}\). By directly reducing the size of the relative errors in \(\mathbf{S}\), the anti-ridge method is more effective in improving the convergence properties of the FS algorithm. However, if the non-convergence is not due to the size of the relative errors in \(\mathbf{S}\), the ridge method will outperform the anti-ridge method.

6 Discussion and conclusion

In the context of SEM, the FS and other algorithms are not always able to reach a set of converged solutions. Methods to improve the convergence properties of FS or its variants have been explored empirically, and one of the findings is that using more reliable indicators helps. However, such a finding is in the opposite direction of the ridge method, which improves the convergence properties of the FS algorithm by working with \(\mathbf{S}_a=\mathbf{S}+a\mathbf{I}\). In this article we clarified the mechanisms behind the convergence properties of these two seemingly contradictory methods. Our analytical results indicate that, when the population follows a SEM or a confirmatory factor model, the size of the relative errors in sample covariances increases with the error variances, and decreases with the size of the factor loadings or common-score variances. The improved convergence properties of FS or other algorithms under the anti-ridge method are due to smaller relative errors in \(\mathbf{S}\). On the other hand, the ridge method is a post-hoc method, and the convergence properties of FS improve because both the relative errors in \(\mathbf{S}_a\) and its condition number become smaller. For the majority of the cases where the information matrices \(\mathbf{H}^{(t)}\) corresponding to \(\mathbf{S}\) are singular, the ridge method is able to solve the problem.

Comparing the ridge and anti-ridge methods, the latter is more effective in improving the convergence properties of FS or other algorithms. However, the scope of applicability of the anti-ridge method is limited. In addition to models with fixed factor loadings as in Example 1, the anti-ridge strategy can also be used when multiple indicators for each construct are available and we have the freedom to choose a subset of them; more reliable indicators then correspond to a higher likelihood of obtaining a set of converged solutions. However, if one has only a limited number of indicators for each construct or the indicators are not exchangeable, then the ridge method can be used. Even if all the indicators have good reliabilities, the ridge method can still be used. In particular, as Kamada (2011) and Kamada and Kano (2012) showed, the ridge method can substantially improve the accuracy of parameter estimates when the sample size is small, even when data are normally distributed.

When the model is correctly specified or the difference between data and model is due to sampling error, the ridge method or a priori use of the anti-ridge method still yields consistent parameter estimates. When a model is misspecified, the parameter estimates corresponding to the ridge or a priori use of the anti-ridge method may be systematically different from those obtained by modeling \(\mathbf{S}\), and the differences will also be related to the values of a or c. But it is not clear which estimate will be more biased. We briefly explored the effect of a and c numerically in Sect. 4. Yuan and Chan (2008) used \(a=p/N\) in their empirical study of mean square errors of ridge estimates. Kamada (2011) and Kamada and Kano (2012) obtained a more refined formula for a that depends on the data. However, these results are for correctly specified models and normally distributed data. More studies of the effect of a on the convergence properties of FS and other algorithms as well as on the properties of parameter estimates are needed.

The current research addressed sources of non-convergence problems in the FS algorithm for minimizing the NML-based discrepancy function (1). A modification to the weight matrix \(\mathbf{W}({\varvec{\theta }})\) in (3) yields an iterative algorithm for minimizing the normal-distribution-based generalized least squares (GLS) discrepancy function (Browne 1974). Thus, most of our analyses and discussions also apply to computing the GLS estimator. There also exists a GLS function that does not depend on a normality assumption, called the asymptotically distribution free (ADF) function (Browne 1984). Huang and Bentler (2015) recently showed that the condition numbers of the ADF weight matrices in both covariance and correlation structures are strongly affected by sample size, and are closely related to the performance of test statistics in the ADF method. We expect that a ridge method applied to the weight matrix will improve the convergence properties of the corresponding FS algorithm for minimizing the ADF function. Further studies in this direction are needed.
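In standard notation, the normal-distribution-based GLS function fixes the weight at the data rather than updating it with \({\varvec{\Sigma }}({\varvec{\theta }}^{(t)})\); a common form (Browne 1974) is

$$\begin{aligned} F_{GLS}({\varvec{\theta }})=\frac{1}{2}\mathrm{tr}\left\{ \left[ \left( \mathbf{S}-{\varvec{\Sigma }}({\varvec{\theta }})\right) \mathbf{S}^{-1}\right] ^2\right\} , \end{aligned}$$

so the weight matrix in the resulting FS iteration does not need to be recomputed at each step; the correspondence with \(\mathbf{W}({\varvec{\theta }})\) in (3) is obtained by evaluating the weight at \(\mathbf{S}\) instead of \({\varvec{\Sigma }}({\varvec{\theta }})\).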

Our study of relative errors in sample covariances might be extended to relative errors in sample correlations, which are commonly used in exploratory factor analysis. Since it has been reported that the size of factor loadings is closely related to factor pattern recovery (e.g., Velicer and Fava 1998), we suspect that the positive effect of larger loadings on factor pattern recovery occurs because the relative errors in sample correlations become smaller. Further study is needed in this direction.

The development of this article centers on the Fisher-scoring algorithm for SEM with complete data. With incomplete data, two methods for SEM were found to perform well in practice (Savalei and Falk 2014). One is a two-stage procedure (Yuan and Bentler 2000) in which the saturated means and covariance matrix are obtained via the EM-algorithm (Dempster et al. 1977) in the first stage. In the second stage, the \(\mathbf{S}\) in Eq. (1) is replaced by the estimated covariance matrix, and the estimate of the structural parameter \({\varvec{\theta }}\) is obtained by minimizing the resulting function \(F_{ML}\). It is clear that the ridge and anti-ridge methods equally apply to the second stage of this two-stage procedure. However, they may not be applicable to the first stage, in which the saturated means and covariance matrix are estimated by the EM-algorithm. This is because the E-step involves the conditional expectation of the missing variables given the observed values, and the formulation of the conditional expectation depends on the current values of the covariance matrix; when the covariance matrix is changed as in the ridge or anti-ridge method, the conditional expectation changes as well, and the resulting algorithm may no longer possess the properties described in Wu (1983). Another method for SEM with incomplete data is via direct maximum likelihood, and Jamshidian and Bentler (1999) developed an EM algorithm for this approach. The E-step is performed based on the structured means and covariances, and the M-step is performed by maximizing a counterpart of Eq. (1) that also includes the mean structure. The convergence properties of this EM-algorithm might be improved by applying the ridge or anti-ridge method at the M-step. For example, one may replace the \(\mathbf{S}^*\) in Eq. (4) of Jamshidian and Bentler (1999) with \(\mathbf{S}_a^*=\mathbf{S}^*+a\mathbf{I}\). However, for a reason similar to that with estimating the saturated means and covariance matrix, the ridge or anti-ridge method may not be applicable at the E-step of the EM algorithm. Applying the ridge or anti-ridge idea to improve the convergence of the EM algorithm for SEM with missing data is worth further study.
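As a minimal sketch of the second stage with a ridge adjustment, assuming the stage-1 EM has produced a saturated covariance estimate (denoted Sigma_hat_sat below, not computed here) and reusing the hypothetical fisher_scoring function and parameter layout from Sect. 5.1:

```python
# Hypothetical stage-2 ridge adjustment for the two-stage procedure;
# Sigma_hat_sat is the saturated covariance estimate from the stage-1 EM
# (its computation is not shown here).
a = p / 200                           # e.g., a = p/N (Yuan and Chan 2008), N = 200
S_ridge = Sigma_hat_sat + a * np.eye(p)
theta_a, status, iters = fisher_scoring(S_ridge, theta0)
theta_a[18:] -= a                     # deduct a from the error-variance estimates
```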

The focus of the article is on ridge and anti-ridge techniques for improving the convergence properties of the Fisher-scoring algorithm in SEM. However, both Fisher-scoring and EM can be applied to many models beyond SEM. In particular, the EM-algorithm is not sensitive to starting values, and the sequence of iterated values of EM always converges to a stationary point of the likelihood function (see Wu 1983), whereas FS does not possess such properties. But in most cases, when convergence is not an issue, FS is much faster than EM (Bentler and Tanaka 1983).