Cognitive diagnosis models are used to provide diagnostic feedback to examinees and stakeholders at a finer grain size than a single test score. Many different models have been proposed, but they all share a common feature, the Q-matrix, which specifies the relationship between the J items and the K latent attributes (Tatsuoka 1983). Each entry qjk in the matrix indicates whether the kth attribute is necessary in the solution of the jth item. An examinee's performance with respect to what is measured is assumed to be influenced by a composite of the latent attributes, such that different combinations define profiles of distinct proficiency classes, characterized by the K-dimensional latent attribute vectors α1, α2, …, αC, with C = 2^K.

The validity of a cognitive diagnosis model depends on whether the K-dimensional latent attribute vector entirely determines classes of examinees so that the conditional distributions of item scores are all independent of each other after adjusting for the effect of the attributes. This property is often called local independence (e.g., Rupp et al. 2010; Lord and Novick 1968). The assumption of local independence is equivalent to the assumption that the K attributes α1, α2, …, αK span the complete latent space—that is, no latent attributes have been missed or left out. Said differently, violations of local independence indicate the possible misspecification of attributes.

The testlet effect also calls into question the assumption of local independence. A testlet is a cluster of items that share a common stimulus, such as a reading passage, and measure something additional in common (Wainer and Kiely 1987). One way to account for the testlet effect is to incorporate specific dimensions in addition to the K-dimensional latent attributes the Q-matrix specifies. Therefore, testing for local independence can serve as a diagnostic tool for detecting testlet effects as well as incorrect specifications of the latent attributes in cognitive diagnostic modeling.

In cognitive diagnosis modeling, evaluations of model-data fit provide information about the fit between the cognitive diagnosis model and the data as well as between the Q-matrix and the data (e.g., Chen et al. 2013). Various fit statistics and methods have been proposed for both types of evaluations. Some are conventional relative fit measures such as Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC), the log likelihood, and the Bayes factor (e.g., Chen et al. 2013; Kunina-Habenicht et al. 2012; Rupp et al. 2010). Furthermore, absolute fit measures have been proposed, such as the residual between the observed and predicted Fisher-transformed correlations, the residual between the observed and predicted proportions correct, the residual between the observed and predicted log-odds ratios, and the G statistic (e.g., Chen et al. 2013; Rupp et al. 2010). These statistics and methods are limited to parametric cognitive diagnosis models because most are computed as functions of maximum likelihood estimates, and the predicted item responses are generated from the fitted model.

This article proposes the Mantel-Haenszel (MH) statistic as an index for detecting misspecification of latent attributes as well as testlet effects in nonparametric cognitive diagnosis methods. Under nonparametric methods, of course, evaluation of model-data fit is informative only about the fit between the Q-matrix and the data. The MH statistic is a well-researched tool for evaluating the conditional independence of binary variables that are stratified along the levels of a third random variable, for example, examining conditional independence of item pairs stratified along the levels of total test scores under an IRT model (Rosenbaum 1984).

The next section describes the assumption of conditional independence underlying cognitive diagnosis models and provides a brief review of nonparametric cognitive diagnosis methods. Then, the MH test of model fit is presented. Next, simulation studies are described with a wide range of conditions. Then, an analysis of real data is described. In the final section, applications and implications of the method are discussed.

1 Conditional Independence and Its Violations

Let Yij denote the binary item response of the ith examinee to the jth item, i = 1, ..., I, j = 1, ..., J. Cognitive diagnosis models describe the joint distribution of the item response vector Yi conditional on the binary attribute vector αic = {αick}, for c = 1, 2, ..., 2^K and k = 1, ..., K. Each entry αick indicates whether the ith examinee has mastered the kth attribute. Each binary entry qjk in the Q-matrix indicates whether the kth attribute is relevant for the jth item, with 1 meaning the attribute is relevant and 0 meaning it is irrelevant. The joint probability of a cognitive diagnosis model for the ith examinee is

$$ P\left({\boldsymbol{Y}}_i \mid {\boldsymbol{\alpha}}_i\right)=\prod_{j=1}^J P\left({Y}_{ij} \mid {\boldsymbol{\alpha}}_i\right). $$
(1)

Therefore, most models are required to satisfy the assumption of conditional independence among the item responses Yij given the attribute vector αic (e.g., Rupp et al. 2010), because this assumption makes it possible to evaluate the joint probability, or likelihood, of the models.
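To make the factorization in (1) concrete, the following minimal sketch computes the joint probability of a response vector under conditional independence, using the DINA model (employed later in the simulations) as the item response function; the function names and interfaces are illustrative, not from the original study.

```python
import numpy as np

def dina_item_prob(alpha, q, guess, slip):
    """P(Y_ij = 1 | alpha) under a DINA-type model: 1 - slip if all
    attributes required by the item are mastered, guess otherwise."""
    eta = np.all(alpha >= q)                    # ideal response for this item
    return 1.0 - slip if eta else guess

def joint_prob(y, alpha, Q, guess, slip):
    """Equation (1): the joint probability of response vector y given alpha
    factors into a product of item-wise probabilities."""
    prob = 1.0
    for j, y_j in enumerate(y):
        p1 = dina_item_prob(alpha, Q[j], guess[j], slip[j])
        prob *= p1 if y_j == 1 else 1.0 - p1
    return prob
```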

The assumption of conditional independence is violated when the dimensionality of the Q-matrix is incorrectly specified; more specifically, a necessary attribute may have been omitted. The assumption may also be a concern when the response to an item depends on the responses to previous items, or when items are grouped by a shared common stimulus such as a reading passage or a common scenario. Such a grouping of items is referred to as a testlet, and an additional dimension may be required to adequately model the data; this dimension would be considered a nuisance dimension because it is not substantively meaningful (e.g., Wainer and Kiely 1987). Most cognitive diagnosis models ignore the testlet effect, which may result in underspecified dimensions. Therefore, the existence of testlets calls into question the assumption of local independence. A single misfitting item j may indicate that the item is problematic or that qj is underspecified; several misfitting items may indicate that the dimensions of the Q-matrix are underspecified; misfitting items sharing a common stimulus may indicate a testlet effect.

2 Nonparametric Cognitive Diagnosis Methods

Nonparametric cognitive diagnosis methods assess examinees' mastery and nonmastery of attributes without regard to parametric form. These methods are useful in cognitive diagnosis modeling, especially when parametric model fitting is inefficient because sample sizes are too small or too large, or because the set of latent attributes is complex (Junker 2011).

One approach to nonparametric cognitive diagnosis is to apply cluster analysis to identify groups of examinees with similar patterns of latent attributes, given the assumption of a conjunctive relationship among attributes and a valid Q-matrix. Chiu et al. (2009) clustered the sum score vectors Wi = (Wi1, …, WiK) using hierarchical agglomerative and K-means clustering to produce the 2^K latent classes. Ayers et al. (2008) utilized the capability score vectors Bi = (Bi1, …, BiK), where \( B_{ik}={\sum}_j{Y}_{ij}{q}_{jk}/{\sum}_j{q}_{jk} \), instead of Wi. Park and Lee (2011) mapped item responses to an attribute matrix and then conducted K-means and hierarchical agglomerative clustering.
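As a rough sketch of the capability-score variant of Ayers et al. (2008), the fragment below computes Bik and clusters examinees into 2^K groups; the helper names are illustrative, and scikit-learn's KMeans stands in for whichever clustering routine is preferred.

```python
import numpy as np
from sklearn.cluster import KMeans

def capability_scores(Y, Q):
    """B_ik = sum_j Y_ij q_jk / sum_j q_jk: the proportion of items
    requiring attribute k that examinee i answered correctly."""
    return (Y @ Q) / Q.sum(axis=0)

def cluster_examinees(Y, Q):
    """Group examinees into 2^K classes based on their capability scores."""
    K = Q.shape[1]
    B = capability_scores(Y, Q)
    return KMeans(n_clusters=2 ** K, n_init=10).fit_predict(B)
```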

Another approach utilizes the Hamming distance technique that was originally proposed by Barnes (2003) with a valid Q-matrix. In this technique, examinees' latent attribute vectors are obtained by minimizing the Hamming distance between the observed item responses Yi and all possible ideal responses η1, η2, ⋯, ηC, C = 2^K,

$$ D\left({\boldsymbol{Y}}_i,{\boldsymbol{\alpha}}_c\right)=\sum_{j=1}^J\left|{Y}_{ij}-{\eta}_{cj}\right|. $$
(2)

Like Barnes (2003), Chiu and Douglas (2013) posited a conjunctive relationship among the attributes. Lim and Drasgow (2017) proposed an algorithm given the assumption of conjunctive, disjunctive, or compensatory relationships among attributes. The theoretical justification of this approach is that the true attribute pattern minimizes the expected distance between Yi and ηc regardless of what the true model is, under some regularity conditions (Lim and Drasgow 2017; Wang and Douglas 2015).
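A minimal sketch of this classification rule, assuming a conjunctive relationship among attributes as in Barnes (2003) and Chiu and Douglas (2013), might look as follows; the function names are illustrative.

```python
import numpy as np
from itertools import product

def ideal_responses(Q):
    """Ideal response eta_cj for each of the 2^K attribute patterns under a
    conjunctive rule: 1 iff the pattern masters every attribute item j requires."""
    K = Q.shape[1]
    patterns = np.array(list(product([0, 1], repeat=K)))   # all 2^K alpha_c
    eta = (patterns @ Q.T == Q.sum(axis=1)).astype(int)    # 2^K x J
    return patterns, eta

def classify(Y, Q):
    """Assign each examinee the attribute pattern minimizing Equation (2),
    the Hamming distance between observed and ideal responses."""
    patterns, eta = ideal_responses(Q)
    # D(Y_i, alpha_c) for every examinee-pattern pair: I x 2^K
    D = np.abs(Y[:, None, :] - eta[None, :, :]).sum(axis=2)
    return patterns[D.argmin(axis=1)]
```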

3 Mantel and Haenszel Test of Model Fit

The MH statistic χ2 introduced by Mantel and Haenszel (1959) is generally used to test for conditional independence of two dichotomous or categorical variables j and j′ by forming row-by-column contingency tables conditional on the levels of a control variable C. For IRT models, the MH statistic has commonly been used to detect differential item functioning, that is, items that function differently for two groups of examinees (called the focal and reference groups) with different experiences or backgrounds (Holland and Thayer 1988). In that procedure, the sample is stratified into C classes according to the observed total test scores.

In this study, the latent attribute vector αc = (αc1, αc2, ⋯, αcK), for c = 1, 2, …, 2^K = C, is proposed as the stratification variable. As discussed above, in cognitive diagnosis models, item responses are assumed to be independent given the correct αc, and a higher value of αc implies a higher probability that Yj = 1 for each j = 1, 2, …, J (e.g., Holland and Rosenbaum 1985). Then any pair of monotone nondecreasing functions gj(Y) and gj′(Y) of the vector of dichotomous responses Y to items j and j′ has a nonnegative conditional covariance given any monotone nondecreasing function h(αc), a result of Rosenbaum (1984).

Let \( \{ i_{jj^{\prime}c}\} \) denote the frequencies of examinees in the 2 × 2 × C contingency table for items j and j′. The marginal frequencies are the row totals \( \{ i_{1+c}\} \) and the column totals \( \{ i_{+1c}\} \), and \( i_{++c} \) represents the total sample size in the cth stratum. Strata with a total sample size \( i_{++c} \) equal to or larger than 1 are included. If any cell count in a table is 0, the Haldane correction is applied to each cell in that table to obtain a more accurate significance level for the MH test (e.g., Li et al. 1979). Under the null hypothesis of conditional independence between items j and j′, the following statistic is proposed:

$$ \mathrm{MH}\ {\chi}^2=\frac{\left(\left|\sum_c i_{11c}-\sum_c E\left(i_{11c}\right)\right|-1/2\right)^2}{\sum_c\operatorname{var}\left(i_{11c}\right)}, $$
(3)

where \( E\left(i_{11c}\right)= i_{1+c}\, i_{+1c}/ i_{++c} \) and \( \operatorname{var}\left(i_{11c}\right)= i_{0+c}\, i_{1+c}\, i_{+0c}\, i_{+1c}/\left[ i_{++c}^2\left(i_{++c}-1\right)\right]. \)

Under the null hypothesis, the test statistic approximately follows a chi-squared distribution with one degree of freedom when the sample sizes in the contingency tables become large and, in cognitive diagnosis models, when each examinee's true latent attribute vector αi is known. Mantel and Haenszel (1959) indicated that this summary chi-square reference distribution is suitable even when some strata have small counts. The statistic is therefore suitable for the analysis of sparse contingency tables, provided the overall counts for each cell in the combined table, obtained by collapsing across all C contingency tables, are sufficiently large. The null hypothesis of independence is equivalent to a common odds ratio equal to 1:

$$ {\mathrm{Odds\ ratio}}_{\mathrm{MH}\,j,{j}^{\prime}}=\frac{\sum_{c=1}^C\left(i_{11c}\, i_{00c}\right)/ i_c}{\sum_{c=1}^C\left(i_{10c}\, i_{01c}\right)/ i_c}, $$
(4)

where \( i_c = i_{11c}+i_{00c}+i_{10c}+i_{01c}. \)
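The following sketch assembles Equations (3) and (4) for a single item pair, stratifying on (estimated) attribute-pattern membership and applying the Haldane correction whenever a table contains a zero cell; the function name and interface are our own, not from the original study.

```python
import numpy as np
from scipy.stats import chi2

def mh_test(yj, yjp, strata):
    """Mantel-Haenszel test of conditional independence for two binary items.
    Returns the MH chi-square of Equation (3), its p value, and the common
    odds ratio of Equation (4)."""
    sum_obs = sum_exp = sum_var = 0.0
    or_num = or_den = 0.0
    for c in np.unique(strata):                 # strata with i_{++c} >= 1
        a, b = yj[strata == c], yjp[strata == c]
        # 2 x 2 cell counts i_11c, i_10c, i_01c, i_00c for this stratum
        i11 = np.sum((a == 1) & (b == 1))
        i10 = np.sum((a == 1) & (b == 0))
        i01 = np.sum((a == 0) & (b == 1))
        i00 = np.sum((a == 0) & (b == 0))
        if min(i11, i10, i01, i00) == 0:
            # Haldane correction: add 1/2 to every cell when any count is zero
            i11, i10, i01, i00 = i11 + .5, i10 + .5, i01 + .5, i00 + .5
        n = i11 + i10 + i01 + i00               # i_{++c}
        r1, r0 = i11 + i10, i01 + i00           # row totals i_{1+c}, i_{0+c}
        c1, c0 = i11 + i01, i10 + i00           # column totals i_{+1c}, i_{+0c}
        sum_obs += i11
        sum_exp += r1 * c1 / n                               # E(i_11c)
        sum_var += r0 * r1 * c0 * c1 / (n ** 2 * (n - 1))    # var(i_11c)
        or_num += i11 * i00 / n
        or_den += i10 * i01 / n
    stat = (abs(sum_obs - sum_exp) - .5) ** 2 / sum_var
    return stat, chi2.sf(stat, df=1), or_num / or_den
```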

4 Heuristic Justification of the Large Sample Chi-square Reference Distribution

The estimated test statistic \( \mathrm{MH}\ {\widehat{\chi}}^2 \) would have an asymptotic chi-square distribution with one degree of freedom as the true MH statistic MH χ2 would, if the true attribute vector α were known. Mantel and Haenszel (1959) asserted that under the null hypothesis, the MH χ2 has an asymptotic chi-squared distribution with one degree of freedom, under some general conditions.

It is assumed here that the number of items J is sufficiently large so that \( P\left[\widehat{\boldsymbol{\alpha}}=\boldsymbol{\alpha} \right] \) is close to 1, a result of previous theoretical studies (Lim and Drasgow 2017; Wang and Douglas 2015). A rigorous argument requires that the number of items J grows sufficiently fast with the sample size N. Note that

$$ \mathrm{MH}\ {\widehat{\chi}}^2=\mathrm{MH}\ {\chi}^2+\left(\mathrm{MH}\ {\widehat{\chi}}^2-\mathrm{MH}\ {\chi}^2\right), $$
(5)

where \( \left(\mathrm{MH}\,{\widehat{\chi}}^2-\mathrm{MH}\,{\chi}^2\right) \) represents the error due to using \( \widehat{\boldsymbol{\alpha}} \) rather than α. If the second term on the right of (5) converges in probability to zero, then the approximate MH test statistic \( \mathrm{MH}\,{\widehat{\chi}}^2 \) has the same asymptotic distribution as the desired MH statistic MH χ2. Specifically,

$$ \left( MH{\widehat{\chi}}^2-{MH\chi}^2\right)\overset{P}{\to }0\Longrightarrow MH{\widehat{\chi}}^2\overset{D}{\to }{\chi}_1^2. $$
(6)

The result in (6) is obtained if J is sufficiently large that, under the null hypothesis, the overwhelming majority of estimated attribute patterns are identical to the true attribute patterns. Finite test length and sample size properties are examined in the following simulation studies, and type I error and power rates are summarized.

5 Simulation Study

To investigate the performance of the MH statistic, a variety of simulation conditions were studied by crossing the number of examinees, the length of tests, the number of attributes, and the distribution of α under the nonparametric cognitive diagnosis model.

6 Simulation Design

For each condition, examinee attribute data for sample sizes of I = 500 or 2000 were drawn from a discretized multivariate normal distribution MVN(0K, Σ), where the covariance matrix Σ has unit variances and a common correlation of ρ = .3 or .6 (e.g., Chiu et al. 2009). The K-dimensional continuous vectors θi = (θi1, θi2, ⋯, θiK) were dichotomized by

$$ {\alpha}_{ik}=\left\{\begin{array}{cl}1,& \mathrm{if}\ {\theta}_{ik}\ge {\Phi}^{-1}\left(\frac{k}{K+1}\right);\\ 0,& \mathrm{otherwise}.\end{array}\right. $$
(7)

Test lengths of J = 20 or 40 items were studied with attribute vectors of length K = 3 or 5. The correctly specified Q-matrix for J = 20 is presented in Table 1; the Q-matrix for J = 40 was obtained by duplicating this matrix. Item response data sets were generated from the DINA model, with item parameters drawn from a uniform(0, .3) distribution. The Hamming distance-based nonparametric cognitive diagnosis method (Lim and Drasgow 2017) was used to estimate the latent attributes. A main advantage of the proposed MH procedure is that it can also be applied with parametric models, because only class membership information is necessary.
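For concreteness, a sketch of this data-generating design under the stated assumptions (attributes from a discretized MVN via Equation (7), DINA responses with slip and guess parameters drawn from uniform(0, .3)) is given below; the function name is illustrative, and the Q-matrix (e.g., from Table 1) must be supplied.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def simulate(I, J, K, rho, Q):
    """Generate correlated attributes via a discretized MVN (Equation (7))
    and DINA item responses with slip/guess ~ uniform(0, .3)."""
    Sigma = np.full((K, K), rho) + (1 - rho) * np.eye(K)   # unit variances, common rho
    theta = rng.multivariate_normal(np.zeros(K), Sigma, size=I)
    cut = norm.ppf(np.arange(1, K + 1) / (K + 1))          # Phi^{-1}(k / (K + 1))
    alpha = (theta >= cut).astype(int)
    slip, guess = rng.uniform(0, .3, J), rng.uniform(0, .3, J)
    eta = (alpha @ Q.T == Q.sum(axis=1)).astype(int)       # I x J ideal responses
    p = eta * (1 - slip) + (1 - eta) * guess
    return (rng.uniform(size=(I, J)) < p).astype(int), alpha
```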

Table 1 Correctly specified Q-matrix (K = 5)

6.1 Results

For each condition, sets of item response vectors were simulated for 100 replications. The proposed MH statistics and their corresponding p values were computed for all J × (J − 1)/2 item pairs in each replication. For each item pair, the proportion of the 100 replications in which its p value was smaller than the .05 significance level was recorded and summarized in the tables.
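As an illustration of this procedure for a single replication, the fragment below reuses the hypothetical simulate, classify, and mh_test helpers sketched earlier, with a placeholder Q-matrix standing in for Table 1, and computes the proportion of item pairs rejected at the .05 level.

```python
import numpy as np
from itertools import combinations

# Placeholder Q-matrix standing in for Table 1 (J = 20, K = 5); illustrative only.
Q = np.vstack([np.eye(5, dtype=int)] * 4)

Y, _ = simulate(I=500, J=20, K=5, rho=.3, Q=Q)
alpha_hat = classify(Y, Q)                           # Hamming distance classifier
strata = alpha_hat @ (2 ** np.arange(Q.shape[1]))    # encode each pattern as a stratum label
pvals = [mh_test(Y[:, j], Y[:, jp], strata)[1]
         for j, jp in combinations(range(Y.shape[1]), 2)]
rejection_rate = np.mean(np.array(pvals) < .05)      # proportion of J(J-1)/2 pairs rejected
```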

Type I Error Study

In this simulation study, the correctly specified Q-matrices (K = 5 or K = 3) were used to fit the data, to examine type I error rates. Table 2 shows that most type I error rates were near the nominal .05 significance level. Rejection rates were stable across all conditions when J = 40, consistent with the asymptotic argument above. In the condition with K = 5, J = 20, and I = 2000, the type I error rate was slightly inflated.

Table 2 Type I error study: correctly specified Q-matrix

The MH statistic with known true class membership α was also examined, because it is not confounded by estimation errors arising from the specific algorithm used to estimate the latent attributes. The rejection rates were very close to the nominal .05 significance level for all conditions.

Power Study with Underspecified Q-matrices

A data set was generated with the Q-matrix (K = 5) in Table 1, and in each replication the data were fitted with the embedded Q-matrix (K = 3; Table 3). One dimension (a total of 9 items) or two dimensions (4 items) were thus underspecified. The average power rate for item pairs in which both items were underspecified in the same dimension was .572, with power relatively consistent across all conditions. The average rejection rate for item pairs in which only one item was underspecified was .124. These findings indicate that, like the other statistics, the MH test is sensitive to Q-matrix underspecification and has moderately high power, particularly for the larger sample size.

Table 3 Power study: underspecified Q-matrices with true K = 5 and fitted K = 3

Power Study with Testlet-Dependent Data

For this simulation study, the fixed T-matrix in Table 4 was used to generate testlet-dependent data. The entry tmj of the T-matrix indicates whether the mth testlet, for m = 1, 2, ..., M, includes the jth item. For each replication, the transpose of the T-matrix was combined with the Q-matrix (K = 3) embedded in Table 1 to simulate item responses, and the model was fitted with the Q-matrix (K = 3) only. The T-matrix for J = 40 was obtained by duplicating the matrix.

Table 4 T-matrix: testlet specification (M = 2)

As shown in Table 5, high rejection rates were obtained for testlet-dependent item pairs (i.e., .922 or above), and the power rates were moderately consistent across conditions. The rejection rates of the MH statistic for item pairs in which only one item was testlet dependent were low (i.e., .088 or below). This implies that the MH test specifically flags item pairs in which both items are testlet dependent.

Table 5 Testlet-dependent data with Q-matrix (K = 3)

7 Fraction Subtraction Data

Fraction subtraction data (e.g., Tatsuoka 1983) were analyzed to investigate the performance of the MH statistic in practice. The data include the responses of 536 examinees to 20 items requiring 8 attributes. In this study, the Q-matrix (see Table 6) that originally appeared in de la Torre and Douglas (2004) was used. The specified attributes are interpreted as (1) convert a whole number to a fraction, (2) separate a whole number from a fraction, (3) simplify before subtracting, (4) find a common denominator, (5) borrow from the whole number part, (6) column borrow to subtract the second numerator from the first, (7) subtract numerators, and (8) reduce answers to simplest form.

Table 6 Q-matrix for fraction subtraction data

7.1 Results

The data were analyzed with seven different cognitive diagnosis models: the nonparametric model, the DINA model, the DINO model, the A-CDM, the saturated model, the log-linear model, and the R-RUM. Two additional fit statistics, the chi-squared statistic Xjj′ (Chen and Thissen 1997) and the absolute deviation of observed and predicted correlations rjj′ (Chen et al. 2013), were used to evaluate model-data fit. The average rejection rates over the 190 item pairs are summarized in Table 7. Interestingly, the MH statistic indicates substantially fewer model violations than the other two fit measures.

Table 7 Proportion of conditionally dependent item pairs

Table 8 reports the four most frequently rejected items for each statistic across all model settings. The results are consistent with those of Lim and Drasgow (2017): in their data-driven Q-matrix estimation study, the component-wise agreement rates between the Q-matrix implemented here and a data-driven Q-matrix were obtained as shown in Table 8. The items whose q-vectors may have been incorrectly specified were those most frequently rejected by the MH statistic. The disagreement across methods is especially noticeable for item 8, which may imply that this item was overspecified, in line with previous studies (e.g., Chen et al. 2013).

Table 8 Most frequently rejected items

8 Discussion

The significance of this study lies in proposing a test of model fit for detecting Q-matrix misspecifications and identifying testlet effects. The only requirement of the method is an estimate of the latent attributes, which serves as the stratification variable in the MH statistic. Several simulation studies investigated the usefulness and sensitivity of the MH statistic under a variety of conditions. The primary findings were that the MH test can play an important role in identifying underspecified q-vectors when the true model is unknown, and that it performs reasonably well in detecting testlet-dependent items. These results are important because ignoring such dependencies can lead to inaccurate estimates of model parameters, as shown in Table 9, as well as misclassifications of examinees (e.g., Chen et al. 2015; Rupp et al. 2010).

Table 9 Mean of absolute difference of estimated and true DINA item parameters (α with ρ = .3)

The real data analysis illustrated how the MH test can be used with different cognitive diagnosis models alongside other model fit statistics. The MH test found less misfit and was less sensitive to the choice of model. For q-vector misspecifications, it can effectively identify problematic items. When used with the other test statistics, the results can provide more detail about whether an item may be underspecified or a different model is needed for the data.

Whether the fit evaluation targets Q-matrix underspecification or testlet effects, the MH test is simple, easy to implement, and theoretically supported. The simulation results suggest that the MH statistic is a reasonably efficient test of model fit. Nevertheless, some consideration of other tests of model fit will always be desirable. Future research might include more attributes as well as more complex models. At present, however, the MH test appears to be a promising statistic for detecting local dependence in cognitive diagnosis models.