
1 Introduction

Cognitive diagnosis models (CDMs) account for the dependence among observations through latent dimensions that are related to the mastery or possession of cognitive skills, or “attributes,” required for a correct response to an item. These models have received considerable attention in educational research because tests based on CDMs promise to provide more diagnostic information about an examinee’s ability than tests based on Item Response Theory (IRT) (Rupp et al., 2010). Specifically, whereas IRT defines ability as a unidimensional continuous construct, CDMs describe ability as a composite of K discrete, binary latent skill variables, called attributes, that define \(2^K\) distinct proficiency classes.

As with other measurement models in assessment, the validity of a CDM depends on whether the latent attributes characterizing each proficiency class entirely determine an examinee’s test performance, so that item responses can be assumed to be independent after controlling for the effect of the attributes. (This property of conditional independence is often called “local independence” in the IRT literature.) As Lord and Novick (1968) pointed out, misspecification of the latent ability space underlying a test usually leads to violations of the conditional independence assumption that, in turn, result in inaccurate estimates of the model parameters and, ultimately, incorrect assessments of examinees’ ability. For cognitive diagnosis, the assumption of conditional independence is equivalent to the assumption that the K attributes span the complete latent space. More to the point, violations of conditional independence are likely to occur if the latent attribute space has been misspecified by including either too few or too many latent attributes in the model.

Within the context of IRT models, various methods have been proposed for examining the dimensionality of the latent ability space underlying a test by checking for possible violations of conditional independence. Stout (1987), for example, developed DIMTEST, a nonparametric procedure for establishing the unidimensionality of test items by testing for conditional independence. Another instance is Rosenbaum’s (1984) use of the Mantel-Haenszel statistic for assessing the unidimensionality of dichotomous items.

Lim and Drasgow (2019) proposed a nonparametric procedure for detecting misspecifications of the latent attribute space in cognitive diagnosis that relies on the Mantel-Haenszel statistic to check for violations of conditional independence when proficiency classes are estimated nonparametrically. The present study extends their work by using the proposed statistic with parametric cognitive diagnosis models for the estimation of proficiency classes.

2 The Mantel-Haenszel Test

Lim and Drasgow (2019) proposed using the Mantel-Haenszel (MH) chi-square statistic to test for the (conditional) independence of two dichotomous variables j and \(j^\prime \) by forming 2-by-2 contingency tables conditional on the levels of a stratification variable C. In their study, the stratification variable C is defined in terms of the latent attribute vector \(\mathbf {\alpha }_c = (\alpha _{c1}, \alpha _{c2}, \ldots , \alpha _{cK})',\) for \(c = 1, 2, \ldots , 2^K\); that is, the different strata of C are formed by the \(2^K\) proficiency classes.

For a fixed item pair, let \(i_{uvc}\) denote the number of examinees in the cth stratum who score u on item j and v on item \(j^\prime \) \((u, v \in \{0, 1\})\), so that the frequencies \(\{i_{uvc}\}\) form a \(2 \times 2 \times C\) contingency table. The marginal frequencies are the row totals \(\{i_{u+c}\}\) and the column totals \(\{i_{+vc}\}\), and \(i_{++c}\) denotes the total sample size in the cth stratum. Then, the MH statistic is defined as

$$\begin{aligned} \text {MH} \chi ^2 = \displaystyle \frac{\left[ \sum _{c} i_{11c} - \sum _{c} E(i_{11c})\right] ^2}{\sum _{c} \text {var}(i_{11c})}, \end{aligned}$$
(1)

where \(E(i_{11c}) = i_{1+c} i_{+1c} / i_{++c}\) and \(\text {var} (i_{11c}) = i_{0+c} i_{1+c} i_{+0c} i_{+1c} / [i_{++c}^2 (i_{++c} - 1)]. \) Only strata with a total sample size \(i_{++c}\) greater than 1 are included, because otherwise the variance term is undefined. Under the null hypothesis of conditional independence of items j and \(j^\prime \), the MH statistic for cognitive diagnosis models has an approximate chi-square distribution with one degree of freedom if examinees’ true latent attribute vectors are used as the levels of the stratification variable C. Assuming that the odds ratio between j and \(j^\prime \) is constant across all strata, the null hypothesis of conditional independence is equivalent to a common odds ratio of one, which can be estimated by

$$\begin{aligned} \text {Odds Ratio}_{\text {MH}j,j^\prime } = \displaystyle \frac{1}{C} \sum _{c = 1}^C \text {or}_{j,j^\prime c}, \end{aligned}$$
(2)

where \(\text {or}_{j,j^\prime c} = (i_{11c} i_{00c}) /(i_{10c} i_{01c}).\)
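To make this computation concrete, below is a minimal R sketch (R being the software used for estimation in this study) that tests one item pair with base R’s mantelhaen.test(), which yields the statistic in Eq. (1) when the continuity correction is switched off. The objects resp (an I-by-J binary response matrix) and class_est (a length-I vector of proficiency-class labels), as well as the helper name mh_item_pair, are hypothetical and not taken from the original study.

```r
## Sketch: MH test of conditional independence for items j and jprime,
## stratified by (estimated) proficiency class.
mh_item_pair <- function(resp, class_est, j, jprime) {
  ## 2 x 2 x C contingency table of responses to the two items per class
  tab <- table(factor(resp[, j], levels = 0:1),
               factor(resp[, jprime], levels = 0:1),
               class_est)
  ## Drop sparse strata; mantelhaen.test() needs a stratum total of at least 2
  tab <- tab[, , apply(tab, 3, sum) > 1, drop = FALSE]
  ## Uncorrected MH chi-square, matching Eq. (1)
  mantelhaen.test(tab, correct = FALSE)
}
```

For example, mh_item_pair(resp, class_est, 1, 2)$p.value would return the p-value for the first two items.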

3 Simulation Studies

The finite test-length and sample-size properties of \(\text {MH} \chi ^2\) were investigated in simulation studies. For each condition, item response data for sample sizes I = 500 or 2000 were drawn from a discretized multivariate normal distribution \(\text {MVN}(\mathbf {0}_K, \varvec{\Sigma }),\) where the covariance matrix \(\varvec{\Sigma }\) has unit variances and a common correlation of \(\rho = 0.3\) or 0.6. The K-dimensional continuous vectors \(\varvec{\theta }_i = (\theta _{i1}, \theta _{i2}, \ldots , \theta _{iK})'\) were dichotomized according to

$${\alpha _{ik}} = {\left\{ \begin{array}{ll} 1, &{} \text{ if } \; {\theta _{ik}} \ge {\varPhi ^{-1}}\!\left( \frac{k}{K+1}\right) ; \\ 0, &{} \text{ otherwise. } \end{array}\right. } $$

Test lengths of J = 20 or 40 were studied with attribute vectors of length K = 3 or 5. The correctly specified Q-matrix for J = 20 is presented in Table 1 (attributes marked with \(\star \) were used for the Q-matrix with K = 3; attributes marked with \(\star \star \) were used for Items 4 and 5). The Q-matrix for J = 40 was obtained by stacking two copies of this matrix.
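As an illustration, the attribute-generation step can be sketched in a few lines of R: MASS::mvrnorm() draws the continuous trait vectors, and the thresholds implement the discretization rule above. The function name sim_attributes is hypothetical.

```r
## Sketch: draw I attribute vectors from a discretized MVN(0, Sigma)
## with unit variances and common correlation rho, as described above.
library(MASS)
sim_attributes <- function(I, K, rho) {
  Sigma <- matrix(rho, K, K)
  diag(Sigma) <- 1
  theta <- mvrnorm(I, mu = rep(0, K), Sigma = Sigma)  # I x K continuous traits
  ## alpha_ik = 1 iff theta_ik >= Phi^{-1}(k / (K + 1))
  thresholds <- qnorm((1:K) / (K + 1))
  sweep(theta, 2, thresholds, ">=") * 1L              # I x K binary attributes
}
alpha <- sim_attributes(I = 500, K = 5, rho = 0.3)
```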

Data were generated from three different models: the DINA model, the additive cognitive diagnosis model (A-CDM), and a saturated model (i.e., the generalized DINA (G-DINA) model). For the DINA model, item parameters were drawn from Uniform(0, 0.3). For the A-CDM and the saturated model, following Chen et al. (2013), the parameters were restricted such that \(P(\alpha ^\star _{ij})_{\min } = 0.10\) and \(P(\alpha ^\star _{ij})_{\max } = 0.90\), where \(\alpha ^\star _{ij}\) is the reduced attribute vector whose components are the attributes required for the jth item (see de la Torre, 2011, for more details). Estimation was carried out in R (e.g., the CDM package; Robitzsch, Kiefer, George, & Uenlue, 2015), with model parameters estimated by maximizing the marginal likelihood.
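A fitting step along these lines might look as follows with the CDM package. This is a sketch under the assumption that the gdina() function’s rule argument ("DINA", "ACDM", or "GDINA") selects among the three models and that IRT.factor.scores() extracts estimated attribute patterns, as in the package’s documentation; the object names dat and Q are placeholders.

```r
## Sketch: marginal maximum likelihood estimation with the CDM package.
## dat (I x J binary responses) and Q (J x K Q-matrix) are assumed given.
library(CDM)
fit <- gdina(data = dat, q.matrix = Q, rule = "DINA")  # "ACDM" or "GDINA" for the other models
## Estimated attribute patterns (assumed API), collapsed to one class label
## per examinee for use as the stratification variable C
patt <- IRT.factor.scores(fit, type = "MLE")
class_est <- apply(patt, 1, paste, collapse = "")
```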

Table 1 Correctly specified Q (K = 5)

For each condition, a set of item response vectors was simulated in each of 100 replications. The proposed MH statistic, the chi-squared statistic \(x_{jj'}\) (Chen and Thissen, 1997), the absolute deviation of observed and predicted correlations \(r_{jj'}\) (Chen et al., 2013), and their corresponding p-values were computed for all \(J (J - 1) / 2\) item pairs in each replication.
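Putting the pieces together, the per-replication computation of the MH p-values over all item pairs reduces to a short loop, reusing the hypothetical mh_item_pair() helper sketched in Section 2.

```r
## Sketch: MH p-values for all J * (J - 1) / 2 item pairs in one replication.
J <- ncol(resp)
pairs <- t(combn(J, 2))                       # one row per item pair
p_mh <- apply(pairs, 1, function(p)
  mh_item_pair(resp, class_est, p[1], p[2])$p.value)
rej_rate <- mean(p_mh < 0.05)                 # proportion flagged at alpha = .05
```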

4 Results

Across the 100 replications of each condition, the proportion of times the p-value of each item pair fell below the significance level of 0.05 was recorded; these rejection rates are summarized in the tables below.

Type I Error Study In this simulation study, the correctly specified Q-matrices (K = 5 or K = 3) were used to fit the data in order to examine type I error rates. Table 2 shows that the type I error rates of the three statistics were mostly around the nominal significance level of 0.05. The chi-squared statistic \(x_{jj^\prime }\) was conservative, with type I error rates below 0.024. The type I error rates of the MH statistic were consistently close to the nominal level under all conditions with J = 40, in line with its asymptotic behavior. In the condition with K = 5, J = 20, and I = 2000, the type I error rates of the MH test slightly exceeded the nominal rate in the A-CDM and the saturated model, presumably because of the difficulty of classifying examinees correctly.

Table 2 Type I error study
Table 3 Power study: 20% misspecified Q.

Power Study: 20% Misspecified Q-matrix For each replication, 20% of the \(q_{jk}\) entries of the correctly specified Q-matrices (K = 5 or K = 3) were randomly misspecified: over-specification occurs when q-entries of 0 are incorrectly coded as 1, and under-specification when q-entries of 1 are incorrectly coded as 0. Table 3 shows that the average rejection rates over all \(J (J-1) / 2 \) item pairs were relatively low for the MH test (i.e., 0.310 or below in the nonparametric model, 0.373 or below in the DINA model, 0.258 or below in the A-CDM, and 0.270 or below in the saturated model). When K = 5 and I = 500, the power rates were low (i.e., 0.074 or below) in the A-CDM and the saturated model; these are relatively complex models, and a small sample size likely increases the difficulty of estimating them accurately.

Power Study: Over-specified Q-matrix For each replication, a data set was generated with the Q-matrix (K = 3) that is embedded as a subset of the Q-matrix (K = 5) in Table 1. The data were then fitted with the Q-matrix (K = 5), thereby over-specifying the correctly specified Q-matrix (K = 3). Either one dimension (9 items in total) or two dimensions (4 items in total) were over-specified. The results were consistent with the findings of Chen et al. (2013).

As Table 4 shows, all statistics were insensitive to over-specified Q-matrices when the true model was the saturated model or the A-CDM. When the true model was the DINA model, the average power rates for item pairs in which both items were over-specified on the same dimension were 0.074 for the nonparametric model, 0.052 for MH, 0.181 for \(x_{jj^\prime }\), and 0.220 for \(r_{jj^\prime }\), and those for item pairs in which either item was over-specified were 0.058 for MH, 0.104 for \(x_{jj^\prime }\), and 0.137 for \(r_{jj^\prime }\). As Rupp et al. (2010) indicated, if more attributes are included in the Q-matrix than required, conditional independence may still be preserved, because the true attribute vector may be embedded as a subcomponent of the modeled vector, resulting in a model that is too complex but preserves conditional independence. This finding implies that, unlike the other statistics, the MH statistic is not appropriate for detecting over-specified Q-matrices when the true model is the DINA model.

Table 4 Power study: over-specified Q (true K = 3, fitted K = 5)

Power Study: Under-specified Q-matrix A data set was generated with the Q-matrix (K = 5) in Table 1, and in each replication the data were fitted with the embedded Q-matrix (K = 3). Either one dimension (9 items in total) or two dimensions (4 items in total) were under-specified. The average power rates for item pairs in which both items were under-specified on the same dimension were 0.572 for MH, 0.669 for \(x_{jj^\prime }\), and 0.735 for \(r_{jj^\prime }\), with power relatively consistent across all conditions, as shown in Table 5. The average rejection rates for item pairs in which either item was under-specified were 0.124 for MH, 0.144 for \(x_{jj^\prime }\), and 0.201 for \(r_{jj^\prime }\). For all statistics, the power rates increased slightly when J = 40, when I = 2000, or when the true model was the A-CDM. Taken together, these findings show that, like the other statistics, the MH test is sensitive to under-specification of the Q-matrix and has high power across conditions.

Table 5 Power study: under-specified Q with true K = 5
Table 6 Power study: model misspecification

Power Study: Model Misspecification In this simulation study, a correctly specified Q-matrix \((K = 3 \ \text {or} \ 5)\) was used, but with a misspecified cognitive diagnosis model. Consistent with Chen et al. (2013), none of the statistics detected the model misspecification in any condition when the fitted model was the saturated model and the true model was the DINA model or the A-CDM (i.e., 0.052 or below for MH, 0.024 or below for \(x_{jj'}\), and 0.059 or below for \(r_{jj'}\)); owing to limited space, this output is not included. The results in Table 6 show that the rejection rates of the MH statistic were low (i.e., 0.186 or below, with few exceptions, when the true model was the DINA model and the fitted model was the A-CDM, and 0.097 or below, with few exceptions, vice versa). When the true model was the A-CDM and the fitted model was the DINA model, the power rates were even lower because the DINA model is simpler than the A-CDM.

5 Discussion

The Mantel-Haenszel (MH) statistic proposed by Lim and Drasgow (2019) was evaluated for detecting misspecifications of the latent attribute space in parametric cognitive diagnosis models; that is, cases in which the Q-matrix contains too many or too few latent attributes. (Recall that a misspecified latent attribute space may result in inaccurate parameter estimates that, in turn, cause incorrect assessments of examinees’ ability.) The proposed MH statistic uses the different proficiency classes as the levels of the stratification variable, with examinees’ individual attribute vectors, which identify proficiency-class membership, estimated from the data. Simulation studies were conducted to investigate the diagnostic sensitivity of the MH statistic in terms of type I error rate and power under a variety of testing conditions. Across different sample sizes, test lengths, numbers of attributes defining the true attribute space, and levels of correlation between the attributes, the MH statistic consistently attained a type I error rate close to the nominal \(\alpha = 0.05\) level when the data were generated using the true Q-matrix based on the correctly specified latent attribute space. When the data were generated using a Q-matrix based on an under-specified latent attribute space, the MH statistic displayed moderate power in detecting the resulting conditional dependence among test items. In summary, the MH statistic appears to be a promising tool for uncovering possible misspecifications of the latent attribute space in cognitive diagnosis. Further research is needed to investigate the specific factors that affect the power of the MH statistic, especially when the latent attribute space has been over-specified (i.e., too many attributes have been included).