
1 Introduction

Cognitive diagnostic models (CDMs) are a family of constrained latent class models that estimate relationships between observed item responses and latent traits (Rupp & Templin, 2008). These models assume that the items measure multiple latent traits and that the latent traits are categorical (Liu & Shi, 2020). CDMs have been advocated as having the potential to provide rich diagnostic information from tests to aid instruction and learning, because they produce a profile for each student indicating whether or not the student has mastered the skills (also known as attributes) required to answer the test items correctly. CDMs, therefore, are able to provide useful diagnostic feedback to teachers and students.

As summarized by DiBello et al. (2007), a systematic cognitive diagnostic assessment involves six steps: (i) describing the assessment purpose; (ii) describing the skill space; (iii) developing assessment tasks; (iv) specifying the psychometric model; (v) performing model calibration and evaluation; and (vi) reporting scores. However, very few large-scale tests are designed under a cognitive diagnostic modeling framework. Therefore, in most CDM applications, a non-diagnostic preexisting test is analyzed, which is referred to as retrofitting (Liu et al., 2017). A major challenge in retrofitting is that constructing the post-hoc Q-matrix is time-consuming. In addition, calibrating an existing unidimensional test with a multidimensional CDM may not work or may be inefficient (Haberman & von Davier, 2007).

The Q-matrix represents a particular hypothesis about which attributes are required to answer each test item successfully (Tatsuoka, 1983). As shown in Table 1, each row represents an item of the test and each column represents an attribute. A Q-matrix can have a simple structure, in which each item requires only one attribute, or a complex structure, in which at least one item requires more than one attribute (Rupp et al., 2010). A sound Q-matrix is critical for a successful CDM application (Gorin, 2009).

Table 1 Sample Q-matrix
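For readers who wish to represent a Q-matrix computationally, the sketch below codes a small, hypothetical five-item by three-attribute Q-matrix in R, the environment used by several packages cited later in this review; the entries are purely illustrative and are not taken from Table 1.

```r
# A hypothetical 5-item x 3-attribute Q-matrix (entries are illustrative only).
# A 1 in cell (j, k) indicates that item j requires attribute k.
Q <- matrix(
  c(1, 0, 0,   # item 1: simple structure (one attribute)
    0, 1, 0,   # item 2: simple structure
    0, 0, 1,   # item 3: simple structure
    1, 1, 0,   # item 4: complex structure (two attributes)
    1, 0, 1),  # item 5: complex structure
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Item", 1:5), paste0("A", 1:3))
)
Q
```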

2 Method

Studies that met the following criteria were included in our review. First, the study had to apply CDM(s) to a real dataset. If a study adopted a Q-matrix developed and validated in a previous study, we kept only the earlier study to avoid duplicates. For example, Jang et al. (2015) used the same Q-matrix that was developed and validated in Jang et al. (2013), so we included only Jang et al. (2013). Second, only journal articles in English published from 1983 to March 2021 were included.

To begin with, we performed a systematic search of ERIC and APA PsycInfo using the keywords “cognitive diagnostic” or “diagnostic classification.” The authors read the full text of each article to decide whether it was eligible. Then, we searched Google Scholar using the same keywords and went through each entry to decide whether any new studies should be added to our previous findings. Finally, we consulted the studies included in Sessoms and Henson (2018), who reviewed the application of CDMs from 2008 to 2016.

After selecting the initial set of articles, the authors held several rounds of discussions among themselves to refine the selection criteria. During these discussions, we decided to restrict the review to journal articles published in English in order to complete the study in a timely manner. We also noted that several articles used a Q-matrix developed in an earlier study, which would have created duplicates; as stated above, we included only the earliest study in each case. In the end, we found 80 articles that met our inclusion criteria. Seven of the articles each reported two different CDM studies, and both studies from each of these articles were coded. Therefore, 87 studies from 80 published journal articles were included in this review.

We first drafted a coding sheet based on the existing literature and a prior review of CDM studies (e.g., Sessoms & Henson, 2018), with an emphasis on Q-matrix development and validation. After several rounds of discussion and training on the coding procedure, one author coded all the studies. Another author then went through the coding of each article multiple times. All discrepancies and questions were resolved through discussion and negotiation until consensus was reached.

Table 2 Study descriptive statistics

3 Results

The earliest study included in our review was published in 1993 (Birenbaum & Tatsuoka, 1993). Fewer than 10 articles per year were published until 2019, when 11 were published; 17 were published in 2020. The articles appeared in a variety of journals, with journals focused on education and psychology, such as Studies in Educational Evaluation and Frontiers in Psychology, being the most frequent categories. The largest number of studies conducted in a single country was 34, in the US, followed by 11 in Iran. Thirteen studies used data collected from multiple countries, primarily because the tests used were the Programme for International Student Assessment (PISA) or the Trends in International Mathematics and Science Study (TIMSS). Many sample sizes were extremely large because of the multinational samples and the Korean National Assessment of Educational Achievement (NAEA) exam, which had a sample of over 16 million examinees (Table 2). The number of items per assessment ranged from 4 to 216, and the number of attributes ranged from 2 to 27.

In terms of the construct being tested, language skills and mathematics, which appeared in 30 studies each, were the most frequent. Ten studies examined different areas of psychology, such as personality and pathological behavior (Table 3).

Table 3 Content area studied

In all the studies, the attribute classification was dichotomous (i.e., master vs. non-master). Six of the studies reported correlations among the attributes. Although many of the studies discussed how to prepare score reports, none of the studies reported whether such diagnostic results were actually delivered to students or teachers.

The Q-matrices in 74 studies had a complex structure, while eight had a simple structure. In five of the studies, the relationships among the attributes were hierarchical. In 24 of the studies, the authors modified their initial Q-matrices, and in 52 of the studies, the authors provided the final Q-matrix. A variety of CDMs were used in the applications, and it was common for one study to apply multiple CDMs. The deterministic input, noisy “and” gate (DINA) model and the generalized deterministic input, noisy “and” gate (GDINA) model were the most frequently used, appearing in 23 studies each, followed by the Reduced Reparameterized Unified Model (RRUM, also known as the Fusion Model), which appeared in 20 studies (Table 4).

Table 4 CDM used

The Q-matrices were developed using combinations of different techniques (Table 5). The most common was using a literature review to determine the attributes (or skills) needed to respond to the items correctly. Review of the assessment items by content experts was the second most common method. Consulting the test specifications was used in eight studies, while asking examinees about how they answered questions was used in seven studies. Thirty-six studies did not report how the Q-matrix was developed.

Table 5 Q-matrix development and validation

Checking model fit indices was the most common means of validating the Q-matrix. Indices of both absolute fit (e.g., SRMSR, MAD) and relative fit (e.g., AIC, BIC) were used (Chen et al., 2013). Relative fit indices were used to compare different Q-matrices or different CDMs, and to compare CDM results to results from Classical Test Theory (CTT) or Item Response Theory (IRT) models. The Wald test was also used to compare different models. Checking attribute mastery predictions (both for accuracy and for consistency) was the second most common validation technique, used in 13 studies. Other methods included evaluating item parameters and reliability indices, as well as using empirical validation algorithms (e.g., de la Torre and Chiu’s (2016) validation method, implemented in the R package GDINA; Ma & de la Torre, 2020). Twelve studies did not report their validation methods.
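As an illustration of the fit-index approach, the following minimal R sketch fits the saturated GDINA model and requests test-level and item-level fit statistics with the GDINA package (Ma & de la Torre, 2020); the bundled frac20 fraction-subtraction dataset and the function calls shown are assumptions based on the package documentation rather than analyses drawn from any study reviewed here.

```r
# Minimal sketch: Q-matrix and model checking with the GDINA R package.
# Assumes the package's bundled fraction-subtraction data "frac20",
# a list holding a response matrix (dat) and a Q-matrix (Q).
library(GDINA)
data(frac20)

# Calibrate the saturated G-DINA model with the hypothesized Q-matrix
fit.gdina <- GDINA(dat = frac20$dat, Q = frac20$Q, model = "GDINA")

# Absolute fit at the test level (e.g., M2, RMSEA2, SRMSR)
modelfit(fit.gdina)

# Item-level fit statistics
itemfit(fit.gdina)
```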

4 Discussion and Significance

The Q-matrix is of vital importance for the proper functioning of a CDM. If the Q-matrix is misspecified, the usefulness of the CDM is impaired (Gorin, 2009). It is, therefore, important to have a Q-matrix that is well-founded theoretically, as well as supported by empirical evidence (Rupp et al., 2010).

Developing a Q-matrix is an iterative process that involves theoretical guidance and content knowledge about the construct being tested. After an initial set of attributes has been developed, the Q-matrix needs to be refined to ensure that there are sufficient high-quality items for each attribute to produce stable results. Also, redundant or closely overlapping attributes need to be identified and then combined or removed for parsimony. This is frequently done by removing an attribute that has a high correlation with other attributes, or that has a low correlation with item difficulty (Buck & Tatsuoka, 1998).

The most common methods of Q-matrix development are literature reviews and ratings by content experts. In addition, consulting test specifications and consulting with examinees via think-aloud or posttest interviews are good ways to understand the cognitive processes examinees use when responding to the test items. However, test specifications are often not available for a retrofitted test, and consulting examinees, which can provide valuable insights, may not be feasible given the development context. It is common to find a CDM study that uses several of the methods discussed above in its Q-matrix development stage (e.g., Li & Suen, 2013). Utilizing evidence from multiple sources greatly strengthens the Q-matrix.

Similarly, high-quality CDM application studies tend to adopt multiple procedures to validate their Q-matrices from different perspectives. Our review shows that there are three main types of Q-matrix validation procedures: (a) comparison of CDM results with results from CTT or IRT models; (b) examination of model fit indices; and (c) examination of attribute mastery predictions. First, it is not always valid to compare CDM results with results from IRT models. CDMs are multidimensional, while IRT models are usually unidimensional. Even when multidimensional IRT is used, the assumptions are different: the latent variables in CDMs are categorical, while the latent variables in IRT models are continuous. Therefore, such comparisons do not always lead to meaningful results. Second, model fit indices have played an important role in Q-matrix validation, with both absolute and relative fit indices being utilized. Given the lack of well-established criteria for absolute fit indices for CDMs (Lei & Li, 2016), relative fit indices (e.g., AIC, BIC) seem more useful when results from different models are compared. Some studies (e.g., Ravand et al., 2020) allowed the items to “choose” the best fitting model, but Hemati and Baghaei (2020) found that the overall model fit for this procedure was not as good as using the GDINA model for all items.
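The sketch below illustrates, under the same assumptions as the earlier GDINA example, how relative fit indices and item-level Wald tests might be used to compare a constrained model (DINA) with the saturated GDINA model and to let each item “choose” a reduced model; it is a sketch of the general procedure, not a reproduction of any reviewed study.

```r
# Sketch: comparing a constrained CDM with the saturated G-DINA model,
# continuing from the fit.gdina object in the earlier sketch.
fit.dina <- GDINA(dat = frac20$dat, Q = frac20$Q, model = "DINA")

# Relative fit indices for the two calibrations (smaller is better)
AIC(fit.dina); AIC(fit.gdina)
BIC(fit.dina); BIC(fit.gdina)

# Item-level Wald tests comparing G-DINA with reduced models (e.g., DINA,
# DINO, A-CDM) so that each multi-attribute item can "choose" its own model
modelcomp(fit.gdina)
```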

Attribute mastery predictions usually consist of evaluating both the classification accuracy and the classification consistency of the whole latent class pattern for examinee responses. As Park et al. (2020) note, the accuracy and consistency with which a model classifies examinees is a measure of reliability. CDM studies in earlier years usually did not report attribute reliability information, but more recent studies (Min & He, 2021; Wu et al., 2020, 2021) have started to report such information.
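Continuing the same hypothetical GDINA example, a classification-based reliability check might look like the following sketch; the personparm() and CA() calls are assumptions based on the package documentation, and classification consistency indices may require additional software.

```r
# Sketch: attribute mastery predictions and classification accuracy,
# continuing from the fit.gdina object above.

# Estimated attribute-mastery profiles (MAP classifications) per examinee
head(personparm(fit.gdina, what = "MAP"))

# Estimated classification accuracy at the test and attribute levels
CA(fit.gdina)
```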

In addition, in more recent years, a few studies (e.g., Effatpanah, 2019; Javidanmehr & Anani Sarab, 2019; Kilgus et al., 2020) used the empirical Q-matrix validation algorithm (de la Torre & Chiu, 2016), which has since been made available in the GDINA R package (Ma & de la Torre, 2020). This offers a convenient way to validate the Q-matrix empirically. However, the empirical algorithm can only serve as supplementary information for Q-matrix validation. As recommended by de la Torre (2008), it is always important to combine the empirical validation results with content knowledge.
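A minimal sketch of calling this empirical validation routine, again assuming the GDINA package and the fit from the earlier sketches, is shown below; the entries it flags for revision should be reviewed against content knowledge rather than adopted automatically.

```r
# Sketch: empirical Q-matrix validation (de la Torre & Chiu, 2016),
# applied to the saturated G-DINA fit from the earlier sketches.
qv <- Qval(fit.gdina, method = "PVAF")
qv   # prints Q-matrix entries flagged for possible revision
```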

Suggestions. Our findings suggest several procedures that should be followed when developing a retrofitted Q-matrix, as well as some procedures that should be used only with caution. Our primary recommendation is that researchers and practitioners consider perspectives from both construct theory and statistical analysis. The process of developing a retrofitted Q-matrix should always include subject matter experts who know both the theory and the content of the test construct (Rupp et al., 2010). Second, the set of attributes developed should be based on the principle of parsimony, whereby highly correlated attributes may be combined (Buck & Tatsuoka, 1998). Finally, more than one method should be used to develop the Q-matrix so that evidence from multiple sources can be combined to strengthen the validity of the Q-matrix (Li & Suen, 2013).

Once the Q-matrix has been developed, statistical testing needs to be done to verify its appropriateness. This can be done by testing model fit and reliability with actual data, using both absolute and relative fit indices to compare different models (Lei & Li, 2016) and select the best-fitting one. Researchers should also evaluate reliability by examining both the classification accuracy and the classification consistency of the whole latent class pattern for examinee responses (Park et al., 2020). An empirical validation algorithm (e.g., de la Torre and Chiu’s (2016) method) could also be used.

We recommend against comparing results from CDMs with results from IRT or CTT models, because the models are based on different theories and are not strictly comparable; results from such comparisons may not be meaningful. For optimal model fit, given a sufficient sample size, we recommend starting the analysis with a saturated CDM and examining the significance of the main effects and interaction effects (if any), before considering specific, smaller CDMs with particular assumptions about the relationship between items and attributes (Hemati & Baghaei, 2020).

Limitations. A major limitation of this review is that we only included journal articles published in English. Adding dissertations, conference presentations, and articles published in other languages has the potential to reveal additional methods of Q-matrix development and validation, as well as further insights into CDM applications. These are areas for continued work. Furthermore, some CDM studies did not provide details about their Q-matrix development and validation procedures, so we were not able to code such information for every study included in the review. We therefore call for detailed reporting of Q-matrix development and validation procedures in future CDM application studies.

Notwithstanding these limitations, this review contributes to research on Q-matrices and CDM applications by highlighting the present state of Q-matrix development and validation, some of the available tools for the process, and the need to use multiple methods in developing and validating Q-matrices.