1 Introduction

Social scientists, policy makers, and others use scores from psychological scales to make decisions about individuals and groups for a variety of purposes, including hiring, school matriculation, professional licensure, and determinations regarding the need for special educational and psychological services. Given their importance, there must be strong validity evidence for using scores in these ways (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). One important aspect of providing such evidence is determining whether the measures provide equivalent information for members of different groups in the population, such as males and females, or members of different economic subgroups (Wu, Li, & Zumbo, 2007). Traditionally, such factor invariance (FI) assessments have been made using a Chi-square difference test with multiple group confirmatory factor analysis (MGCFA). However, this approach is very sensitive to sample size, so that it may flag trivially small differences in group parameter values as statistically significant (Yuan & Chan, 2016). Perhaps more importantly, it provides no information about the magnitude of any group differences in latent variable model parameters that are identified (Yuan & Chan, 2016). Yuan and Chan described an alternative approach to FI assessment that is based on equivalence testing. When indicator variables are normally distributed, this equivalence testing based method is an effective tool, yielding accurate results with respect to the invariance (or noninvariance) of the latent variable model (Finch & French, 2018). The purpose of the current simulation study was to extend this earlier work by investigating how the equivalence testing technique performs when the observed indicators are ordinal variables (such as items on a scale), rather than normally distributed.

1.1 MGCFA and FI Assessment

FI assessment (Millsap, 2011) refers to a set of nested models with differing levels of cross-group equality constraints on the parameters of a latent variable model linking observed indicators (\( x \)) to latent variables (\( \xi \)). The weakest type of FI is configural invariance (CI), where only the general latent structure (i.e., the number of latent variables and the correspondence of observed indicators to latent variables) is the same across groups. The next level of FI is measurement invariance (MI), where the factor loading matrix (\( \Lambda \)) is assumed to be equivalent across groups (Kline, 2016; Wicherts & Dolan, 2010). If MI holds, researchers might next assess the equality of the factor model intercepts (\( \tau \)) across groups (Steenkamp & Baumgartner, 1998), and/or the equality of the unique variances (\( \delta \)) across groups.

The most common approach for assessing FI is based on the MGCFA model:

$$ x_{g} = \tau_{g} + \Lambda_{g} \xi + \delta_{g} $$
(1)

where

\( x_{g} \):

Observed indicators for group g

\( \tau_{g} \):

Intercept parameters for group g

\( \Lambda _{\text{g}} \):

Factor loading matrix for group g

\( \xi \):

Latent variable(s)

\( \delta_{g} \):

Unique variances of the indicator variables for group g.

The terms in Eq. (1) are as described above, except that the parameters are allowed to vary by group, which is denoted by the g subscript. MGCFA is used to test each type of FI through a series of nested models, which differ in terms of the model parameters that are held equivalent between groups. For example, in order to assess MI, the researcher would constrain the factor loadings to be equivalent across groups, thereby replacing \( \Lambda_{g} \) with \( \Lambda \) in Eq. (1). The fit of the constrained and unconstrained models is then compared using the difference in Chi-square fit statistic values, \( \chi_{\Delta }^{2} \). The null hypothesis of this test is that MI is present.
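As an illustration of this nested-model comparison, the following minimal R sketch uses the lavaan package (which was not part of the original study); the data frame dat, the items y1–y10, and the grouping variable grp are hypothetical.

```r
# Hedged sketch of the chi-square difference test for metric invariance (MI).
# 'dat', the items y1-y10, and the grouping variable 'grp' are hypothetical.
library(lavaan)

model <- 'f =~ y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 + y9 + y10'

# Configural model: same structure in both groups, parameters free to differ
fit_configural <- cfa(model, data = dat, group = "grp")

# MI model: factor loadings constrained to be equal across groups
fit_metric <- cfa(model, data = dat, group = "grp", group.equal = "loadings")

# Chi-square difference test; a nonsignificant result is consistent with MI
lavTestLRT(fit_configural, fit_metric)
```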

Research on the performance of \( \chi_{\Delta }^{2} \) for invariance testing has yielded somewhat mixed results. French and Finch (2006) found that for normally distributed indicators and a sample size of no more than 500, \( \chi_{\Delta }^{2} \) had Type I error rates at the nominal (0.05) level, while also exhibiting relatively high power. Other researchers have reported that \( \chi_{\Delta }^{2} \) is sensitive to sample size, to a lack of normality in the indicators, and to model misspecification errors, and in such cases may yield inflated Type I error rates when assessing MI (Chen, 2007; Yuan & Chan, 2016; Yuan & Bentler, 2004).

1.2 Factor Invariance with Equivalence Testing

Given these problems associated with using \( \chi_{\Delta }^{2} \), Yuan and Chan (2016) extended earlier work on assessing model fit with equivalence testing (Marcoulides & Yuan, 2017; Yuan, Chan, Marcoulides, & Bentler, 2016) to the assessment of FI with MGCFA. In the case of MGCFA for FI assessment, the null hypothesis is:

$$ H_{0I} :\,\left( {F_{bc0} - F_{b0} } \right) > \varepsilon_{0} $$
(2)

Here, \( F_{bc0} \) is the fit function value for the model in which group parameters are constrained to be equal, \( F_{b0} \) is the fit function value for the model in which the latent variable model parameters are allowed to differ between groups, and \( \varepsilon_{0} \) is the maximum acceptable degree of model misspecification. Rejecting \( H_{0I} \) leads to the conclusion that any model misspecification due to constraining factor model parameters to be equal across groups does not greatly degrade model fit relative to the model in which these constraints are relaxed. Therefore, rejecting \( H_{0I} \) in the MI equivalence testing framework indicates that when the groups' factor loadings are constrained to be equal, the difference in fit between the loadings-constrained and loadings-unconstrained models does not exceed an acceptable level of misfit, as expressed by \( \varepsilon_{0} \).

Yuan and Chan (2016) showed that the value of \( \varepsilon_{0} \) can be obtained as follows:

$$ \varepsilon_{0} = \frac{{df\left( {RMSEA_{0} } \right)^{2} }}{m} $$
(3)

where

\( df \):

Model degrees of freedom

\( m \):

Number of groups

\( RMSEA_{0} \):

Maximum value of RMSEA that can be tolerated.
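As a concrete illustration of Eq. (3), the following short R sketch computes \( \varepsilon_{0} \) for illustrative values of the degrees of freedom, number of groups, and RMSEA cutoff; none of these values are taken from the study.

```r
# Worked example of Eq. (3) with illustrative (hypothetical) values
df_model <- 70    # model degrees of freedom
m        <- 2     # number of groups
rmsea0   <- 0.05  # maximum tolerable RMSEA (the conventional "close fit" cutoff)

epsilon0 <- df_model * rmsea0^2 / m
epsilon0  # 0.0875
```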

For FI assessment, Yuan and Chan recommend using this equivalence testing procedure to characterize the relative degree of noninvariance present in the data, as opposed to making strict hypothesis-test-based determinations of equivalence versus nonequivalence. In this framework, the degree of model parameter invariance present in the data can be characterized using common guidelines (e.g., MacCallum, Browne, & Sugawara, 1996) to describe the model constraining group parameters to be equal. These guidelines for interpreting values of RMSEA suggest the following fit categories: Excellent fit (<0.01), Close fit (0.01–0.05), Fair fit (0.05–0.08), Mediocre fit (0.08–0.10), and Poor fit (0.10+). Thus, an RMSEA of 0.17 for a model constraining factor loadings to be equal across groups would suggest poor fit of the MI model, meaning that the loadings are likely not equivalent between the groups. Yuan and Chan (2016) found that for the purposes of determining the value of \( \varepsilon_{0} \), these standard cutoffs for interpreting RMSEA may be too stringent, and thus recommended an alternative approach for obtaining adjusted cutoffs based on the data being analyzed. The interested reader is encouraged to review this earlier paper for a description of how these alternatives are obtained. This equivalence testing approach has been shown to be effective both for assessing the fit of a single model and for invariance assessment (e.g., Finch & French, 2018; Marcoulides & Yuan, 2017; Yuan & Chan, 2016). However, its performance for invariance assessment when the indicators are categorical and not normally distributed has not been investigated.

1.3 Fit Indices for Categorical Indicators

Yuan and Chan (2016) noted that the equivalence-based invariance test was designed for use with normally distributed indicators. However, in many social science contexts researchers work with ordinal observed variables, such as responses to items on a rating scale. In such cases, the equivalence testing approach may not be appropriate, because calculation of the standard full information \( \chi^{2} \) statistic upon which RMSEA is based is problematic (Maydeu-Olivares & Joe, 2006). For categorical indicators, this statistic relies on the full cross-tabulation of the entire set of indicators (hence, full information), which can lead to cell sparsity and resulting problems in its calculation (Maydeu-Olivares & Joe, 2006), which in turn bias the RMSEA estimate.

In order to address these problems caused by sparsity, alternative goodness of fit statistics based on limited information approaches have been proposed for use with latent variable modeling in the context of categorical indicator variables. One set of alternatives is based upon a least squares, rather than maximum likelihood, estimation paradigm. For example, the weighted least squares (WLS) fit function takes the form:

$$ F\left( {\theta ;W} \right) = \left( {\hat{\rho } - \rho \left( \theta \right)} \right)^{\prime } W\left( {\hat{\rho } - \rho \left( \theta \right)} \right) $$
(4)

where

\( \hat{\rho } \):

Sample polychoric correlation matrix for the indicator variables

\( \rho \left( \theta \right) \):

Model implied polychoric correlation matrix

\( W \):

Asymptotic covariance matrix of \( \hat{\rho } \).

Given that WLS yields biased estimates and has difficulty converging when samples are small (Muthén, 1993), the diagonally weighted least squares (DWLS) estimator was proposed (Muthén, du Toit, & Spisic, 1997). DWLS reduces the computational burden and yields less biased parameter estimates for smaller sample sizes by using only the diagonal of \( W \) as the weight matrix (Flora & Curran, 2004). When \( W \) in Eq. (4) is replaced by the identity matrix, the result is the unweighted least squares (ULS) estimator. For each of these estimators, a moment-corrected goodness of fit statistic, \( T_{3} \), can be calculated from the fit function, and is asymptotically distributed as a Chi-square statistic (Asparouhov & Muthén, 2010). \( T_{3} \) can then be used to calculate RMSEA, which in turn can be used with the invariance equivalence methodology described above.
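For instance, an RMSEA based on a corrected least squares test statistic can be obtained in R with the lavaan package, as in the hedged sketch below (lavaan was not used in the original study); estimator = "WLSMV" pairs DWLS estimation with a mean- and variance-adjusted test statistic, and estimator = "ULSMV" gives the ULS analogue. The data frame dat (ordinal items y1–y10) and grouping variable grp are hypothetical.

```r
# Hedged sketch: RMSEA for a loadings-constrained multigroup model with ordinal
# indicators. 'dat', the items y1-y10, and 'grp' are hypothetical.
library(lavaan)

model <- 'f =~ y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 + y9 + y10'

fit_metric <- cfa(model, data = dat, ordered = paste0("y", 1:10),
                  group = "grp", group.equal = "loadings",
                  estimator = "WLSMV")   # DWLS with a corrected test statistic

# RMSEA based on the corrected statistic, usable with Eq. (3)
fitMeasures(fit_metric, c("chisq.scaled", "df.scaled", "rmsea.scaled"))
```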

An alternative limited information goodness of fit statistic for use with categorical indicators was proposed by Maydeu-Olivares and Joe (2006). This statistic is defined as:

$$ M_{2}^{*} = N\hat{e}_{2}^{\prime } {\hat{\Omega }}_{2} \hat{e}_{2} $$
(5)

where

\( \hat{e}_{2} \):

Vector of first and second order residual probabilities.

\( {\hat{\Omega }}_{2} \):

\( {\hat{\Omega }}_{2} =\Xi _{2}^{ - 1} -\Xi _{2}^{ - 1}\Delta _{2} \left( {\Delta _{2}^{{\prime }}\Xi _{2}^{ - 1}\Delta _{2} } \right)^{ - 1}\Delta _{2}^{{\prime }}\Xi _{2}^{ - 1} \)

\( \Xi _{2} \):

Asymptotic covariance matrix of the first and second order sample proportions

\( \Delta _{2} \):

Matrix of derivatives of the first and second order model implied probabilities with respect to the vector of parameter estimates \( \hat{\theta } \).

\( M_{2}^{*} \) is asymptotically distributed as a Chi-square statistic, and can be used to calculate RMSEA for use with the invariance equivalence testing approach described above.
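To connect these statistics to the fit categories described above, the following R sketch converts a Chi-square-type statistic (such as \( M_{2}^{*} \) or \( T_{3} \)) into a point estimate of RMSEA and classifies it. The single-group formula shown is one common form (software may divide by N − 1 or apply a multiple-group adjustment), and the numeric values are purely illustrative.

```r
# Hedged sketch: RMSEA point estimate from a chi-square-type statistic,
# classified with the MacCallum et al. (1996) categories quoted earlier.
rmsea_from_chisq <- function(chi2, df, N) {
  sqrt(max(chi2 - df, 0) / (df * N))   # one common single-group form
}

fit_category <- function(rmsea) {
  cut(rmsea, breaks = c(-Inf, 0.01, 0.05, 0.08, 0.10, Inf),
      labels = c("Excellent", "Close", "Fair", "Mediocre", "Poor"))
}

r <- rmsea_from_chisq(chi2 = 180, df = 70, N = 600)  # illustrative values
fit_category(r)  # "Fair"
```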

1.4 Goals of the Current Study

The goal of the current study was to extend earlier work that investigated the performance of the equivalence testing procedure for normally distributed indicators (Finch & French, 2018). The current study extends this research by examining the performance of \( T_{3} \) for both DWLS and ULS, as well as \( M_{2}^{*} \), in the context of MI assessment when the indicator variables are categorical.

2 Method

A Monte Carlo simulation study (1000 replications) was used to address the study goals. Data simulation was completed in Mplus, version 7.11 (Muthén & Muthén, 1998–2016), and data analyses were conducted using R version 3.3.1 (R Development Core Team, 2016). Data were generated using a single factor confirmatory factor analysis model for two groups, where the factor and the unique factors (errors) each followed a standard normal distribution with a mean of 0 and a variance of 1. Indicator variables were simulated to be ordinal with 5 categories, using the following pattern of thresholds: −1, −0.5, 0.5, 1. Factor loadings were set to 1 for all indicators, unless manipulated to induce measurement noninvariance, as described below. All other model parameters were held invariant between the two groups. The referent indicator method was used to identify the factor models. The following factors were manipulated in the study, and were based upon earlier published work in this area (e.g., Finch & French, 2018).
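A minimal R sketch of this data-generation scheme is shown below (the study itself generated data in Mplus); the group sample size and random seed are illustrative.

```r
# Hedged sketch of the data generation described above: a single-factor model
# with 5-category ordinal indicators formed by cutting continuous responses at
# the thresholds -1, -0.5, 0.5, 1. The values of n and the seed are illustrative.
set.seed(123)
n <- 500                        # group sample size (illustrative)
p <- 10                         # number of indicators
lambda <- rep(1, p)             # loadings; e.g., set lambda[2] <- 0.9 in one
                                #   group to mimic a noninvariance condition below
xi    <- rnorm(n)                     # latent factor, N(0, 1)
delta <- matrix(rnorm(n * p), n, p)   # unique factors, N(0, 1)
ystar <- outer(xi, lambda) + delta    # continuous latent responses

thresholds <- c(-1, -0.5, 0.5, 1)
y <- apply(ystar, 2, function(v) cut(v, c(-Inf, thresholds, Inf), labels = FALSE))
# 'y' is an n x p matrix of ordinal responses in categories 1-5
```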

2.1 Sample Size

Given that sample size has been shown to be important in terms of the performance of the equivalence testing approach, and the \( \chi_{\Delta }^{2} \) test (Chen, 2007; Finch & French, 2018; French & Finch, 2006), it was manipulated in the current study. Total sample sizes were simulated to be 200, 400, 600, 1000, 1500, or 2000, and were designed to reflect small to large samples.

2.2 Number of Indicator Variables

Either 10 or 20 observed factor indicators were simulated, representing a range of values that might be encountered in practice.

2.3 Number of Noninvariant Indicators and Magnitude of Measurement Noninvariance

Measurement noninvariance was simulated by creating group differences in the factor loadings for some observed indicators. In the invariance condition, the difference in factor loadings between the groups was 0 (complete invariance). In the noninvariant conditions, loadings were simulated to differ by 0.1, 0.2, 0.3, 0.4, or 0.5. The percentage of noninvariant indicators was 0, 10, 20, or 30%. As an example of how noninvariance was simulated, in the condition with 10 indicators, 10% noninvariant indicators, and a noninvariance magnitude of 0.1, the factor loading for indicator 2 was set to 0.9 in one group and kept at 1.0 in the other group.

2.4 Invariance Assessment Approaches

For each replication within each simulation condition, invariance was tested using the MGCFA \( \chi_{\Delta }^{2} \) approach, with \( T_{3} \) for DWLS (\( T_{DWLS} \)) and ULS (\( T_{ULS} \)), as well as \( M_{2}^{*} \). In addition, the equivalence testing method was used to assess invariance, with RMSEA values based upon \( T_{DWLS} \), \( T_{ULS} \), and \( M_{2}^{*} \), respectively.

2.5 Study Outcomes

The outcomes were the Type I error and power rates of the \( \chi_{\Delta }^{2} \) tests, and the distribution of the adjusted equivalence test fit categories (Excellent, Close, Fair, Mediocre, or Poor). Analysis of variance (ANOVA) was used to identify statistically significant main effects and interactions of the manipulated conditions with respect to the proportion of cases for which the equivalence testing method identified poor fit. In addition, the partial \( \eta^{2} \) effect size was used to identify ANOVA model terms of interest: to be deemed important, a main effect or interaction of the manipulated conditions had to be statistically significant and have a partial \( \eta^{2} \) of 0.1 or larger, ensuring that it accounted for at least 10% of the outcome variance.

3 Results

3.1 Measurement Invariance Is Present

The interaction of invariance assessment method by sample size was the only statistically significant model term (\( F_{10,8} = 10.527, p = 0.001, \eta^{2} = 0.929 \)) when invariance was present. Across all sample size conditions, the Type I error rate for the \( T_{DWLS} \) statistic was the only one that remained in the acceptable range (0.025–0.075) defined by Bradley (1978) (Table 1). For the other two statistics, samples of at least 1000 (\( M_{2}^{ *} \)) or 1500 (\( T_{ULS} \)) were needed for the Type I error rates to fall in this range. Results in Table 1 also show that equivalence testing based on \( T_{DWLS} \) had the highest proportion of replications in the expected excellent/close fit categories across sample sizes, and this proportion increased with sample size. Finally, even for \( T_{DWLS} \), the best performer in this regard, the proportion in the expected excellent/close fit range was below 0.8 for samples of less than 1000.

Table 1 Type I error rates and proportion of adjusted equivalence test results in excellent or close range, by sample size and method

3.2 Measurement Invariance Is Not Present

When factor loadings were simulated to differ between the groups, ANOVA found the interactions of equivalence test statistic by number of noninvariant indicators by magnitude of group loading difference (\( F_{16,234} = 14.389, p < 0.001, \eta^{2} = 0.496 \)), and equivalence test statistic by number of indicators by magnitude of group loading difference (\( F_{8,234} = 9.104, p < 0.001, \eta^{2} = 0.237 \)), to be statistically significantly related to the performance of the equivalence test procedure. The first set of results examined are those for conditions in which measurement invariance was not present, presented by method, magnitude of group loading difference, and percentage of noninvariant loadings. The proportion of replications in each equivalence testing category for this combination of conditions appears in Fig. 1. These results show that when the magnitude of the group loading difference was 0.3 or more, and 20 or 30% of the indicators were noninvariant between groups, virtually all replications were in the poor fit range (the expected outcome given the simulated lack of invariance) for all of the methods. Under conditions in which the degree of group difference was less pronounced, the invariance tests based on \( M_{2}^{ *} \) and \( T_{ULS} \) tended to indicate worse fit more frequently than did those based on \( T_{DWLS} \). This result was strongest when 30% of the indicators were simulated to have different loadings between groups and the magnitude of these differences was 0.1 or 0.2. Power results for the \( \chi_{\Delta }^{2} \) tests appear in Table 2, and demonstrate that \( T_{ULS} \) had the highest power across conditions, whereas \( T_{DWLS} \) exhibited somewhat lower power than did \( M_{2}^{ *} \), particularly for lower group loading difference magnitudes and fewer noninvariant indicators. When interpreting these results, it is important to recall that the Type I error rates were inflated under many conditions for each of these statistics, particularly \( M_{2}^{ *} \) and \( T_{ULS} \).

Fig. 1 Proportion of adjusted equivalence test results in each fit category by equivalence statistic, number of noninvariant loadings, and magnitude of group loading difference: noninvariance present

Table 2 Power rates for detecting measurement noninvariance for the chi-square difference test by the magnitude of group loading difference, number of noninvariant indicators, and test statistic

The proportion of replications in each equivalence testing category by magnitude of group loading difference and number of indicators, when noninvariance was simulated to be present, appears in Fig. 2. These results reveal that a larger group loading difference was associated with a higher likelihood of mediocre and poor fit based on the equivalence test. In addition, this effect was magnified for each of the statistics when more indicators were present. For example, the proportion of cases in the mediocre and poor fit categories was greater for 20 indicators than for 10, across the methods studied here. Power results for the \( \chi_{\Delta }^{2} \) tests by magnitude of group loading difference and number of indicators appear in Table 3, aggregated over the number of noninvariant indicators. Power based on each of the three statistics was higher when more indicators were present; power for \( T_{ULS} \) was the highest across conditions, whereas power for \( T_{DWLS} \) was the lowest for the smallest magnitudes of group loading difference.

Fig. 2 Proportion of adjusted equivalence test results in each fit category by equivalence statistic, number of indicator variables, and magnitude of group loading difference: noninvariance present

Table 3 Power rates for detecting measurement noninvariance for the chi-square difference test by the magnitude of group loading difference, number of indicators, and test statistic

4 Discussion

The results of this study demonstrated that the equivalence testing procedure based on \( T_{DWLS} \) correctly identified models in which MI held at the highest rate among the methods studied here, while also identifying poorly fitting (noninvariant) models at a high rate. It is important to note that when the magnitude of the group factor loading difference was relatively low (0.2 or less), this statistic was less likely to indicate fair to poor fit than the alternatives studied here. This result could suggest a relative lack of power for this approach, or it could simply reflect the fact that small differences in factor loadings are not indicative of a major lack of equivalence between groups. Finally, the \( \chi_{\Delta }^{2} \) based approaches exhibited inflated Type I error rates in many cases, and may not be as useful as the equivalence testing approach.

Future research in this area should focus on identifying additional alternatives for calculating RMSEA with categorical indicators. Though \( T_{DWLS} \) was the best performer, it was not without problems, particularly for low levels of noninvariance. In addition, future work should include a wider array of numbers of indicator categories (e.g., 3, 4, 6, or 7) and more complex latent structures (e.g., 2 or 3 factors). Such continued work will allow the invariance literature to continue to expand to address group differences in the measurement of constructs used to make decisions about individuals.