1 Introduction

Diagnostic classification models (DCMs; often also referred to as “cognitive diagnosis models”; DiBello, Roussos, & Stout, 2007; Haberman & von Davier, 2007; Leighton & Gierl, 2007; Rupp, Templin, & Henson, 2010) describe an examinee’s ability in educational assessment as a composite of specific discrete (cognitive) skills, called attributes, each of which an examinee may or may not have mastered. Distinct profiles of attributes define classes of proficiency. Modeling educational testing data within a DCM framework seeks to estimate the item parameters and to assign examinees to proficiency classes (i.e., to estimate their individual attribute profiles). Current methods for fitting DCMs use either marginal maximum likelihood estimation relying on the Expectation-Maximization algorithm (MMLE-EM) or Markov chain Monte Carlo (MCMC) techniques.

These methods work well for simple DCMs (e.g., the implementations of MMLE-EM for the Deterministic Input, Noisy “And” gate [DINA] model; Junker & Sijtsma, 2001; Macready & Dayton, 1977; and the Deterministic Input, Noisy “Or” gate [DINO] model; Templin & Henson, 2006; in the R package CDM; Robitzsch, Kiefer, George, & Uenlue, 2015). However, for more complex DCMs (e.g., the Reduced Reparameterized Unified Model [Reduced RUM]; Hartz, 2002; Hartz & Roussos, 2008) and general DCMs like von Davier’s (2005, 2008) General Diagnostic Model (GDM) or the Log-Linear Cognitive Diagnosis Model (LCDM; Henson, Templin, & Willse, 2009; Rupp et al., 2010; Templin & Bradshaw, 2014; Templin & Hoffman, 2013), MMLE-EM and MCMC can be computationally expensive, which may limit their usefulness in research and practice.

This article explores the potential of joint maximum likelihood estimation (JMLE) for fitting DCMs. JMLE has been mostly avoided in psychometrics—despite the mathematical convenience of simple likelihood functions—because the JMLE parameter estimators typically lack statistical consistency (Baker & Kim, 2004; Haberman, 2004; Neyman & Scott, 1948). The JMLE procedure presented here resolves the consistency issue by incorporating an external, statistically consistent estimator of examinees’ proficiency class membership into the joint likelihood function, which subsequently allows for the construction of item parameter estimators that also have the consistency property.

The presentation is preceded by a brief review of DCMs. The results and proofs concerning the consistency of the item parameter estimators are derived for the LCDM; using the theoretical framework of general DCMs makes the results and proofs also applicable to DCMs that can be expressed as submodels of the LCDM. The theoretical part is augmented by simulation studies that compare the performance of JMLE with that of MMLE-EM under finite sample conditions using artificial data conforming to the LCDM. In addition, the results of a real-world application of JMLE to the analysis of language assessment data are reported. The paper concludes with a discussion of the findings and directions for future research.

2 Background: Diagnostic Classification Models

DCMs model the functional relation between attribute mastery and the probability of a correct item response. Suppose that K latent binary attributes constitute a certain ability domain; there are then \(2^K\) distinct attribute profiles composed of these K attributes, representing \(M = 2^K\) distinct proficiency classes. Let the K-dimensional vector \(\varvec{\alpha }_m = (\alpha _{m1}, \alpha _{m2}, \ldots , \alpha _{mK})^{\prime }\) denote the binary attribute profile of proficiency class \({\mathcal {C}}_m, m = 1,2, \ldots , M\), where the \(k^{th}\) entry indicates whether the respective attribute has been mastered. Let \(Y_{ij}\) denote the observed response of examinee \(i, i = 1,2, \ldots , N\), to binary item \(j, j = 1,2,\ldots , J\). The attribute profile of examinee \(i \in {\mathcal {C}}_m, \varvec{\alpha }_{i \in {\mathcal {C}}_m}\), is written as \(\varvec{\alpha }_i = (\alpha _{i1}, \alpha _{i2}, \ldots , \alpha _{iK})^{\prime }\).

Consider a test consisting of J items. Each individual item j is associated with a K-dimensional binary vector \({\mathbf {q}}_j\) called the item-attribute profile, where \(q_{jk} = 1, k=1,2, \ldots ,K\), if a correct answer requires mastery of the kth attribute, and 0 otherwise. Given K attributes, there are at most \(2^K-1\) distinct item-attribute profiles. The J item-attribute profiles of a test constitute its Q-matrix, \({\mathbf {Q}}=\{q_{jk}\}_{(J \times K)}\) (Tatsuoka, 1985), which summarizes the constraints specifying the associations between items and attributes.

The distinct parameterization of specific DCMs reflects differences in the underlying theories on how (non-)mastery of attributes affects an examinee’s test performance. General DCMs allow for expressing these distinct functional relations in a unified mathematical form and parameterization. The archetypal general DCM is von Davier’s (2005, 2008) General Diagnostic Model (GDM). Von Davier defined \(h({\mathbf {q}}_j, \varvec{\alpha }_i)\) as a general function of the attribute profile of item j and the attribute profile of examinee i to allow for the flexible modeling of examinees’ responses to item j. The item response function (IRF) of presumably the most popular version of von Davier’s GDM is formed by the logistic function of the linear combination of all K attribute main effects

$$\begin{aligned} P(Y_{ij} = 1 \mid \varvec{\alpha }_i) = \frac{ \exp ( \beta _{j0} + \varvec{\beta }^{\prime }_j h({\mathbf {q}}_j, \varvec{\alpha }_i) ) }{ 1 + \exp ( \beta _{j0} + \varvec{\beta }^{\prime }_j h({\mathbf {q}}_j, \varvec{\alpha }_i) ) } = \frac{ \exp ( \beta _{j0} + \sum _{k=1}^K \beta _{jk}q_{jk}\alpha _{ik} ) }{ 1 + \exp ( \beta _{j0} + \sum _{k=1}^K \beta _{jk}q_{jk}\alpha _{ik} ) }, \end{aligned}$$

where \(q_{jk}\) indicates whether mastery of attribute \(\alpha _{ik}\) is required for item j (see Equations 1 and 2; von Davier, 2005). Henson et al. (2009) specified \(v_j\) as the linear combination of the K attribute main effects, \(\alpha _k\), and all their two-way, three-way, \(\ldots , K\)-way interactions

$$\begin{aligned} v_j = \beta _{j0} + \sum _{k=1}^K \beta _{jk}q_{jk}\alpha _{ik} + \sum _{k=1}^{K-1} \sum _{k'=k+1}^K \beta _{j(kk')}q_{jk}q_{jk'}\alpha _{ik}\alpha _{ik'} + \cdots + \beta _{j12\ldots K}\prod _{k=1}^K q_{jk}\alpha _{ik} \end{aligned}$$

and defined the IRF of a general DCM termed the Loglinear Cognitive Diagnosis Model (LCDM) as

$$\begin{aligned} P(Y_{ij} = 1 \mid \varvec{\alpha }_i) = \frac{ \exp (v_j) }{ 1 + \exp (v_j) } \end{aligned}$$
(1)

(see Equation 11 in Henson et al., 2009). By imposing appropriate constraints on the \(\beta \)-coefficients in \(v_j\), the IRFs of specific DCMs can be expressed as submodels of the LCDM. In addition to the logit link, de la Torre (2011) proposed the identity link, \(P(Y_{ij} = 1 \mid \varvec{\alpha }_i) = v_j\), and the log link, \(P(Y_{ij} = 1 \mid \varvec{\alpha }_i) = \exp \big \{v_j\big \}\), for constructing the IRF of a general DCM called the Generalized DINA (G-DINA) model (see Equations 1–3 in de la Torre, 2011). (The identity and the log link require additional constraints on the coefficients to guarantee \(0 \le P(Y_{ij} = 1 \mid \varvec{\alpha }_i) \le 1\).)
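To make the parameterization concrete, the following minimal R sketch (not part of the original model presentation; all names are illustrative) evaluates the LCDM item response probability in Equation 1 for a single examinee and item. The coefficients of \(v_j\) are stored by the attributes entering each term, so main effects and interactions of any order are handled uniformly.

lcdm_prob <- function(beta_j, q_j, alpha) {
  # beta_j: named list of coefficients; "0" is the intercept, and a name such as
  #         "12" denotes the interaction of attributes 1 and 2 (single-digit
  #         attribute labels, so this sketch assumes K <= 9)
  # q_j:    K-vector of 0/1, the item's row of the Q-matrix
  # alpha:  K-vector of 0/1, the examinee's attribute profile
  v <- beta_j[["0"]]
  for (term in setdiff(names(beta_j), "0")) {
    ks <- as.integer(strsplit(term, "")[[1]])    # attributes entering this term
    v  <- v + beta_j[[term]] * prod(q_j[ks] * alpha[ks])
  }
  exp(v) / (1 + exp(v))                          # Equation 1
}

# Example: an item requiring attributes 1 and 2, answered by an examinee who
# has mastered attributes 1 and 3 (illustrative parameter values)
lcdm_prob(beta_j = list(`0` = -1.4, `1` = 1.6, `2` = 1.3, `12` = 0.7),
          q_j = c(1, 1, 0), alpha = c(1, 0, 1))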

3 Joint Maximum Likelihood Estimation for Diagnostic Classification

Let \({\mathbf {Y}}=({\mathbf {y}}_1,{\mathbf {y}}_2,\ldots ,{\mathbf {y}}_N)^{\prime }\) denote the \(N \times J\) matrix of observed item responses, where \({\mathbf {y}}_i=(y_{i1},y_{i2},\ldots ,y_{iJ})^{\prime }\) is the vector of observed item responses of examinee i. Conditional independence, given attribute profile \(\varvec{\alpha }\), is assumed for the observed item responses. Thus, the joint likelihood is

$$\begin{aligned} L(\varvec{\alpha }_1, \varvec{\alpha }_2, \ldots , \varvec{\alpha }_N, \varvec{\Theta };{\mathbf {Y}}) = \prod _{i=1}^N L_i(\varvec{\alpha }_i, \varvec{\Theta };{\mathbf {y}}_i) = \prod _{i=1}^N \prod _{j=1}^J f(y_{ij} | \varvec{\theta }_j, \varvec{\alpha }_i), \end{aligned}$$
(2)

where \(\varvec{\Theta } = (\varvec{\theta }_1, \varvec{\theta }_2, \ldots , \varvec{\theta }_J)\) denotes the matrix of item parameters.
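Under the conditional independence assumption, the joint log-likelihood corresponding to Equation 2 is a double sum of Bernoulli log-densities. A minimal R sketch follows (names are illustrative; the item response function is passed in as an argument so that any DCM can be plugged in):

joint_loglik <- function(Y, alpha, item_pars, irf) {
  # Y:         N x J matrix of 0/1 responses
  # alpha:     N x K matrix of attribute profiles
  # item_pars: list of length J holding each item's parameter vector
  # irf:       function(item_par, alpha_i) returning P(Y_ij = 1 | alpha_i)
  ll <- 0
  for (i in seq_len(nrow(Y))) {
    for (j in seq_len(ncol(Y))) {
      p  <- irf(item_pars[[j]], alpha[i, ])
      ll <- ll + Y[i, j] * log(p) + (1 - Y[i, j]) * log(1 - p)
    }
  }
  ll
}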

3.1 If Examinees’ Attribute Profiles are Known

Suppose examinees’ true attribute profiles \(\varvec{\alpha }_1, \varvec{\alpha }_2, \ldots , \varvec{\alpha }_N\) are known. Then, the joint likelihood in Equation 2 reduces to a function of only a single set of unknowns, the item parameters: \(L ( \varvec{\Theta }; {\mathbf {Y}}, \varvec{\alpha }_1, \varvec{\alpha }_2, \ldots , \varvec{\alpha }_N )\) (Baker & Kim, 2004; Birnbaum, 1968; Embretson & Reise, 2000). The estimators of the elements of the item parameter vector \(\varvec{\theta }_j\) are derived by maximizing the logarithm of the item likelihood

$$\begin{aligned} \ln L_j(\varvec{\theta }_j; {\mathbf {y}}_j, \varvec{\alpha }_1, \varvec{\alpha }_2, \ldots , \varvec{\alpha }_N) = \sum _{i=1}^N \ln \big ( f(y_{ij} | \varvec{\theta }_j, \varvec{\alpha }_i) \big ). \end{aligned}$$
(3)

The resulting estimators of the item parameters are denoted by \(\hat{\varvec{\theta }}_j\).

3.2 If Examinees’ Attribute Profiles are Unknown

However, examinees’ true attribute profiles \(\varvec{\alpha }_1, \varvec{\alpha }_2, \ldots , \varvec{\alpha }_N\) are never known and can only be estimated from the observed item responses. Suppose that an estimator \(\tilde{\varvec{\alpha }}\) of examinees’ true attribute profiles is available that is statistically consistent and does not depend on the JMLE procedure. The estimates of examinees’ proficiency class membership obtained from the external estimator \(\tilde{\varvec{\alpha }}\) can be used in Equation 3, which then becomes

$$\begin{aligned} \ln L_j(\varvec{\theta }_j; {\mathbf {y}}_j, \tilde{\varvec{\alpha }}_1, \tilde{\varvec{\alpha }}_2, \ldots , \tilde{\varvec{\alpha }}_N) = \sum _{i=1}^N \ln \big ( f(y_{ij} | \varvec{\theta }_j, \tilde{\varvec{\alpha }}_i) \big ). \end{aligned}$$
(4)

Maximizing Equation 4 results in the estimators \(\tilde{\varvec{\theta }}_j\) of the item parameters. Note that the “tilde-notation” is used to emphasize that these estimators are linked to the consistent estimator \(\tilde{\varvec{\alpha }}\), as opposed to the estimators \(\hat{\varvec{\theta }}_j\) that require examinees’ attribute profiles \(\varvec{\alpha }_1, \varvec{\alpha }_2, \ldots , \varvec{\alpha }_N\) to be known. The item parameter estimators \(\tilde{\varvec{\theta }}_j\) for the LCDM are derived in the next section; their consistency is proven in the subsequent section.

4 Joint Maximum Likelihood Estimation for the LCDM

Recall the item response function of the LCDM (see Equation 1)

$$\begin{aligned} P(Y_{ij} = 1 \mid \varvec{\alpha }_i) = \frac{ \exp (v_j) }{ 1 + \exp (v_j)}, \end{aligned}$$

where

$$\begin{aligned} v_j = \beta _{j0} + \sum _{k=1}^K \beta _{jk}q_{jk}\alpha _{ik} + \sum _{k=1}^{K-1} \sum _{k'=k+1}^K \beta _{j(kk')}q_{jk}q_{jk'}\alpha _{ik}\alpha _{ik'} + \cdots + \beta _{j12\ldots K}\prod _{k=1}^K q_{jk}\alpha _{ik}. \end{aligned}$$

Suppose the attribute profiles \(\varvec{\alpha }\) are known. The estimators of the elements of the item parameter vector, \(\varvec{\beta }_j = ( \beta _{j0}, \beta _{j1}, \beta _{j2}, \ldots , \beta _{j12\ldots K} )^{\prime }\), are derived by maximizing the item likelihood

$$\begin{aligned} L_j(\varvec{\beta }_j;{\mathbf {y}}_j,\varvec{\alpha }) = \prod _{i=1}^N f(y_{ij} \mid \varvec{\beta }_j, \varvec{\alpha }_i) = \prod _{i=1}^N \left( \frac{ \exp (v_j)}{1+\exp (v_j) }\right) ^{y_{ij}}\left( \frac{1}{1+\exp (v_j) }\right) ^{1-y_{ij}}. \end{aligned}$$

Maximizing the logarithm of the item likelihood results in the same estimators as maximizing the likelihood itself

$$\begin{aligned} \ln L_j(\varvec{\beta }_j; {\mathbf {y}}_j,\varvec{\alpha })= & {} \sum _{i=1}^N y_{ij} \ln \left( \frac{\exp (v_j)}{1+\exp (v_j)} \right) + \sum _{i=1}^N (1-y_{ij}) \ln \left( \frac{1}{1+\exp (v_j)} \right) \nonumber \\= & {} \sum _{i=1}^N \left( y_{ij} v_j-\ln \big ( 1+\exp (v_j) \big )\right) . \end{aligned}$$
(5)

Taking the derivative of Equation 5 with regard to \(\beta _{j0}\) results in

$$\begin{aligned} \frac{\partial \ln (L_j)}{\partial \beta _{j0}} = \sum _{i=1}^N \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) . \end{aligned}$$

Similarly,

$$\begin{aligned} \frac{\partial \ln (L_j)}{\partial \beta _{jk}} = q_{jk}\sum _{i=1}^N \alpha _{ik} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) \, \forall k \end{aligned}$$
(6)

and

$$\begin{aligned} \frac{\partial \ln (L_j)}{\partial \beta _{jkk'}} = q_{jk}q_{jk'} \sum _{i=1}^N \alpha _{ik}\alpha _{ik'} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) \, \forall k, k' \end{aligned}$$
(7)

and so on. The derivative of Equation 5 with regard to the coefficient of the highest-order interaction term, \(\beta _{j12\ldots K}\), is

$$\begin{aligned} \frac{\partial \ln (L_j)}{\partial \beta _{j12\ldots K}} = q_{j1} q_{j2}\ldots q_{jK} \sum _{i=1}^N \alpha _{i1} \alpha _{i2} \ldots \alpha _{iK} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) . \end{aligned}$$
(8)

The inspection of the partial derivatives in Equations 6, 7, and 8 shows that they reduce to zero if item j does not require attribute k because then the corresponding \(q_{jk}\) is zero. Assume that item j requires \(K_j^*\le K\) attributes that, without loss of generality, have been permuted to the first \(K_j^*\) positions of the item-attribute vector \({\mathbf {q}}_j\). Because \(q_{j1} = q_{j2} = \ldots = q_{jK^{*}_j} = 1\), these entries become implicit, whereas all \(\alpha _{ik}\) corresponding to \(q_{jk}=0\) are eliminated from the expressions of the partial derivatives in Equations 6, 7, and 8:

$$\begin{aligned} \frac{\partial \ln (L_j)}{\partial \beta _{jk}} = \sum _{i=1}^N \alpha _{ik} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)}\right) \, \forall k \in {\mathcal {L}}_j, \end{aligned}$$
(9)

where \({\mathcal {L}}_j = \{1,2, \ldots , K^{*}_j\}\) is defined as the collection of indices of the non-zero elements in \({\mathbf {q}}_j\),

$$\begin{aligned} \frac{\partial \ln (L_j)}{\partial \beta _{jkk'}} = \sum _{i=1}^N \alpha _{ik}\alpha _{ik'} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)}\right) \, \forall k,k^{\prime } \in {\mathcal {L}}_j \end{aligned}$$
(10)

and

$$\begin{aligned} \frac{\partial \ln (L_j)}{\partial \beta _{j12\ldots K_j^*}} = \sum _{i=1}^N \alpha _{i1} \alpha _{i2} \ldots \alpha _{iK_j^*} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) . \end{aligned}$$
(11)

The expressions of the item parameter estimators are derived by working backwards, beginning with Equation 11. Using the indicator function \(I[\cdot ]\) and setting Equation 11 to zero yields

$$\begin{aligned} \sum _{i=1}^N \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)}\right) \, I \left[ \bigcap _{k=1}^{K_j^*}\{\alpha _{ik}=1\} \right] = 0 \end{aligned}$$

which is equivalent to

$$\begin{aligned} \sum _{i \in {\mathcal {C}}({\mathcal {L}}_j)} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) = 0, \end{aligned}$$
(12)

where \({\mathcal {C}}({\mathcal {L}}_j)\) denotes the proficiency class consisting of examinees who have mastered all \(K^{*}_j\) attributes required by item j. In general, define the proficiency class \({\mathcal {C}}({\mathcal {A}})\), with \({\mathcal {A}} \subseteq {\mathcal {L}}_j = \{1,2,\ldots ,K_j^*\}\), such that \({\mathcal {C}}({\mathcal {A}}) = \{i \mid \alpha _{ik}=1, \forall k \in {\mathcal {A}} \text{ and } \, \alpha _{ik'}=0, \forall k' \in {\mathcal {A}}^c\}\), where \({\mathcal {A}}^c\) denotes the complement of \({\mathcal {A}}\) with respect to \(\{1,2,\ldots ,K\}\). Equation 12 implies

$$\begin{aligned} \sum _{i \in {\mathcal {C}}({\mathcal {L}}_j)} y_{ij} - |{\mathcal {C}}({\mathcal {L}}_j)| \frac{\exp \big ( \beta _{j0}+\sum _{k=1}^{K_j^*}\beta _{jk}+ \cdots +\beta _{j12\ldots K_j^*} \big )}{1+\exp \big ( \beta _{j0}+\sum _{k=1}^{K_j^*}\beta _{jk}+\cdots +\beta _{j12\ldots K_j^*} \big )}= & {} 0\\ \Rightarrow \,\frac{\exp \big ( \beta _{j0}+\sum _{k=1}^{K_j^*}\beta _{jk}+\cdots +\beta _{j12\ldots K_j^*}\big )}{1+\exp \big ( \beta _{j0}+\sum _{k=1}^{K_j^*}\beta _{jk}+\cdots +\beta _{j12\ldots K_j^*} \big )}= & {} \frac{\sum _{i \in {\mathcal {C}}({\mathcal {L}}_j)} y_{ij}}{|{\mathcal {C}}({\mathcal {L}}_j)|} = {\bar{y}}_{j \,{\mathcal {C}}({\mathcal {L}}_j)}, \end{aligned}$$

where \({\bar{y}}_{j \,{\mathcal {C}}({\mathcal {L}}_j)}\) is the mean response of proficiency class \({\mathcal {C}}({\mathcal {L}}_j)\) to item j. Therefore,

$$\begin{aligned} {\hat{\beta }}_{j0}+\sum _{k=1}^{K_j^*}{\hat{\beta }}_{jk}+\cdots +{\hat{\beta }}_{j12\ldots K_j^*}=\ln \left( \frac{{\bar{y}}_{j \,{\mathcal {C}}({\mathcal {L}}_j)}}{1-{\bar{y}}_{j \,{\mathcal {C}}({\mathcal {L}}_j)}}\right) . \end{aligned}$$

There are \({K_j^*\atopwithdelims ()K_j^*-1}= K_j^*\) interaction terms of order \((K_j^*-1)\). Without loss of generality, only the partial derivative of \(\ln (L_j)\) with regard to the parameter \(\beta _{j12 \ldots (K_j^*-1)}\) is analyzed here in detail:

$$\begin{aligned} \frac{\partial \ln (L_j)}{\partial \beta _{j12\ldots (K_j^*-1)}} = \sum _{i=1}^N \alpha _{i1} \alpha _{i2} \ldots \alpha _{i(K_j^*-1)} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) . \end{aligned}$$
(13)

Setting Equation 13 to zero and using the indicator function leads to

$$\begin{aligned}&\sum _{i=1}^N \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)}\right) \, I\left[ \bigcap _{k=1}^{K_j^*-1}\{\alpha _{ik}=1\} \right] = 0 \nonumber \\&\quad \Rightarrow \, \sum _{i\in {\mathcal {C}}(\{1,2,\ldots , K_j^*-1\})} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) + \sum _{i\in {\mathcal {C}}({\mathcal {L}}_j)} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) = 0 \nonumber \\&\quad \Rightarrow \, \sum _{i\in {\mathcal {C}}(\{1,2,\ldots , K_j^*-1\})} \left( y_{ij} - \frac{\exp (v_j)}{1+\exp (v_j)} \right) = 0. \end{aligned}$$
(14)

(The second sum on the left-hand side of Equation 14 is zero due to Equation 12.) The last equation can be written explicitly as

$$\begin{aligned} \sum _{i\in {\mathcal {C}}(\{1,2,\ldots , K_j^*-1\})} y_{ij} - |{\mathcal {C}}(\{1,2,\ldots , K_j^*-1\})| \, \frac{\exp \big ( \beta _{j0}+\sum _{k=1}^{K_j^*-1}\beta _{jk}+\cdots +\beta _{j12\ldots (K_j^*-1)} \big )}{1+\exp \big ( \beta _{j0}+\sum _{k=1}^{K_j^*-1}\beta _{jk}+\cdots +\beta _{j12\ldots (K_j^*-1)} \big ) } = 0 \end{aligned}$$

providing the final result

$$\begin{aligned} {\hat{\beta }}_{j0}+\sum _{k=1}^{K_j^*-1}{\hat{\beta }}_{jk}+\cdots +{\hat{\beta }}_{j12\ldots (K_j^*-1)} = \ln \left( \frac{{\bar{y}}_{j \, {\mathcal {C}}(\{1,2,\ldots , K_j^*-1\})}}{1-{\bar{y}}_{j \, {\mathcal {C}}(\{1,2,\ldots , K_j^*-1\})}} \right) . \end{aligned}$$

The partial derivatives for the remaining parameters are manipulated in the same manner; it suffices here to present the results for Equations 10 and 9

$$\begin{aligned} {\hat{\beta }}_{j0}+{\hat{\beta }}_{jk}+{\hat{\beta }}_{jk'}+{\hat{\beta }}_{jkk'}= & {} \ln \left( \frac{{\bar{y}}_{j \, {\mathcal {C}}(\{k,k'\})}}{1-{\bar{y}}_{j \, {\mathcal {C}}(\{k,k'\})}} \right) \end{aligned}$$
(15)
$$\begin{aligned} {\hat{\beta }}_{j0}+{\hat{\beta }}_{jk}= & {} \ln \left( \frac{{\bar{y}}_{j \, {\mathcal {C}}(\{k\})}}{1-{\bar{y}}_{j \, {\mathcal {C}}(\{k\})}} \right) \end{aligned}$$
(16)

and finally

$$\begin{aligned} {\hat{\beta }}_{j0} = \ln \left( \frac{{\bar{y}}_{j \, {\mathcal {C}}(\emptyset )}}{1-{\bar{y}}_{j \, {\mathcal {C}}(\emptyset )}} \right) . \end{aligned}$$
(17)

The expression of the estimator \({\hat{\beta }}_{jk}\) is then obtained by subtracting \({\hat{\beta }}_{j0}\) in Equation 17 from Equation 16

$$\begin{aligned} {\hat{\beta }}_{jk}= & {} \ln \left( \frac{{\bar{y}}_{j \, {\mathcal {C}}(\{k\})}}{1-{\bar{y}}_{j \, {\mathcal {C}}(\{k\})}}\right) - {\hat{\beta }}_{j0}\nonumber \\= & {} \ln \left( \frac{{\bar{y}}_{j \, {\mathcal {C}}(\{k\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{k\})}}\right) - \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\emptyset )}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\emptyset )}}\right) . \end{aligned}$$

The expression of the estimator \({\hat{\beta }}_{jkk'}\) is obtained by subtracting the lower-order parameter estimators \({\hat{\beta }}_{j0}, {\hat{\beta }}_{jk}\), and \({\hat{\beta }}_{jk'}\) from the left-hand sum of estimators in Equation 15

$$\begin{aligned} {\hat{\beta }}_{jkk'}= & {} \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\{k,k'\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{k,k'\})}} \right) - {\hat{\beta }}_{jk} - {\hat{\beta }}_{jk'} - {\hat{\beta }}_{j0}\nonumber \\= & {} \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\{k,k'\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{k,k'\})}} \right) - \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\{k\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{k\})}} \right) - \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\{k'\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{k'\})}} \right) + \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\emptyset )}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\emptyset )}} \right) . \end{aligned}$$

In summary, if examinees’ attribute profiles are known, then the (closed-form) expressions of the estimators \({\hat{\beta }}_{j \, \ldots }\) of the coefficients of k-order terms, \(k \in \{1, 2, \ldots , K^*_j\}\) are functions of the means of the \(2^k\) proficiency classes characterized by attribute profiles \(\varvec{\alpha }\), where the first k attributes are mastered or not, and the remaining attributes are not mastered. For example, let \(K^*=3\); then, \({\hat{\beta }}_{12}\) is a function of the means of the four proficiency classes with \(\varvec{\alpha }_{{\mathcal {C}}(\emptyset )} = (000)^{\prime }, \varvec{\alpha }_{{\mathcal {C}}(\{1\})} = (100)^{\prime }, \varvec{\alpha }_{{\mathcal {C}}(\{2\})} = (010)^{\prime }\), and \(\varvec{\alpha }_{{\mathcal {C}}(\{1,2\})} = (110)^{\prime }\)

$$\begin{aligned} {\hat{\beta }}_{j12}= & {} \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\{1,2\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{1,2\})}} \right) - {\hat{\beta }}_{j1} - {\hat{\beta }}_{j2} - {\hat{\beta }}_{j0}\nonumber \\= & {} \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\{1,2\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{1,2\})}} \right) - \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\{1\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{1\})}} \right) - \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\{2\})}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\{2\})}} \right) + \ln \left( \frac{{\bar{y}}_{j\, {\mathcal {C}}(\emptyset )}}{1-{\bar{y}}_{j\, {\mathcal {C}}(\emptyset )}} \right) . \end{aligned}$$
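As a purely numerical illustration of these closed-form expressions (the class means below are invented for illustration and are not taken from any data set in this article), the estimates are differences of class-mean logits:

logit <- function(p) log(p / (1 - p))
# invented mean responses to item j of the classes (000)', (100)', (010)', (110)'
ybar_00 <- 0.20; ybar_10 <- 0.45; ybar_01 <- 0.40; ybar_11 <- 0.85
beta_j0  <- logit(ybar_00)                                # Equation 17
beta_j1  <- logit(ybar_10) - beta_j0                      # from Equation 16
beta_j2  <- logit(ybar_01) - beta_j0
beta_j12 <- logit(ybar_11) - beta_j1 - beta_j2 - beta_j0  # the display above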

The estimators of the \(\beta _{j \, \ldots }\) under the proposed JMLE framework are obtained by solving Equation 12 with \(\alpha _{ik}\) replaced by \({\tilde{\alpha }}_{ik}\). To be specific, define the proficiency class \(\tilde{{\mathcal {C}}}({\mathcal {A}}) = \{i \mid {\tilde{\alpha }}_{ik}=1, \forall k \in {\mathcal {A}} \text { and } {\tilde{\alpha }}_{ik'}=0, \forall k' \in {\mathcal {A}}^c\}\). The expressions of the estimators of the item parameters \({\tilde{\beta }}_{j0}, {\tilde{\beta }}_{jk}\), and \({\tilde{\beta }}_{jkk'}\), are derived as

$$\begin{aligned} {\tilde{\beta }}_{j0}= & {} \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\emptyset )}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\emptyset )}} \right) \nonumber \\&\nonumber \\ {\tilde{\beta }}_{jk}= & {} \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k\})}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k\})}} \right) - {\tilde{\beta }}_{j0} \nonumber \\= & {} \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k\})}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k\})}}\right) - \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\emptyset )}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\emptyset )}} \right) \nonumber \\&\nonumber \\ {\tilde{\beta }}_{jkk'}= & {} \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k,k'\})}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k,k'\})}} \right) - {\tilde{\beta }}_{jk} - {\tilde{\beta }}_{jk'} - {\tilde{\beta }}_{j0} \nonumber \\= & {} \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k,k'\})}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k,k'\})}}\right) - \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k\})}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k\})}} \right) - \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k'\})}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\{k'\})}} \right) + \ln \left( \frac{{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\emptyset )}}{1-{\bar{y}}_{j \, \tilde{{\mathcal {C}}}(\emptyset )}} \right) .\nonumber \\ \end{aligned}$$
(18)

The expressions of the estimators of the remaining parameters can be readily deduced from the pattern emerging from the equations for \({\tilde{\beta }}_{j0}, {\tilde{\beta }}_{jk}\), and \({\tilde{\beta }}_{jkk'}\).
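The pattern in Equation 18 can be written compactly as an inclusion-exclusion (Möbius) sum: for any \({\mathcal {A}} \subseteq {\mathcal {L}}_j\), the estimate of the coefficient indexed by \({\mathcal {A}}\) is the alternating sum of the class-mean logits over all subsets of \({\mathcal {A}}\). The following R sketch implements this for a single item; the function names are illustrative, and it presumes that every class \(\tilde{{\mathcal {C}}}({\mathcal {A}})\) is non-empty so that all class means are defined.

logit <- function(p) log(p / (1 - p))

subsets_of <- function(idx) {                  # all subsets of an index vector
  n <- length(idx)
  if (n == 0) return(list(integer(0)))
  lapply(0:(2^n - 1), function(b) idx[bitwAnd(b, 2^(seq_len(n) - 1)) > 0])
}

# Closed-form estimates of all coefficients of item j (Equation 18 and its
# higher-order extensions).
#   y_j:   N-vector of 0/1 responses to item j
#   alpha: N x K matrix of (estimated) attribute profiles
#   q_j:   K-vector, row j of the Q-matrix
item_betas <- function(y_j, alpha, q_j) {
  K   <- length(q_j)
  req <- which(q_j == 1)                       # L_j, attributes required by item j
  class_logit <- function(A) {                 # logit of the mean response in class C(A)
    target <- rep(0, K); target[A] <- 1        # attributes in A mastered, all others not
    members <- apply(alpha, 1, function(a) all(a == target))
    logit(mean(y_j[members]))                  # assumes the class is non-empty
  }
  out <- list()
  for (A in subsets_of(req)) {
    name <- if (length(A) == 0) "beta_j0" else paste0("beta_j", paste(A, collapse = ""))
    # Moebius inversion: beta_A = sum over B subset of A of (-1)^(|A| - |B|) * logit(ybar_C(B))
    out[[name]] <- sum(sapply(subsets_of(A), function(B)
      (-1)^(length(A) - length(B)) * class_logit(B)))
  }
  out
}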

5 Consistency of the Item Parameter Estimators

It was stated earlier that the estimators of the item parameters of the LCDM, \({\tilde{\beta }}_{j \ldots }\), derived by differentiating the joint likelihood function, are statistically consistent, provided the estimation procedure uses an (external) estimator of examinees’ proficiency class membership that is itself statistically consistent. In the previous section, the conceptual distinction was made between the item parameter estimators \({\hat{\beta }}_{j \ldots }\) and \({\tilde{\beta }}_{j \ldots }\), depending on whether examinees’ attribute profiles are known or must be estimated using the consistent estimator \(\tilde{\varvec{\alpha }}\). In this section, two consistency theorems are presented that establish the asymptotic consistency of \({\tilde{\beta }}_{j \ldots }\) (Theorem 1) and a stronger form of consistency, called uniform consistency, of \({\tilde{\beta }}_{j \ldots }\) (Theorem 2). To avoid redundancy, only the proofs concerning \({\tilde{\beta }}_{j0}\) are presented because the proofs for \({\tilde{\beta }}_{jk}, {\tilde{\beta }}_{jkk^{\prime }}\), and so on, can be readily constructed using the same argument. The proofs of Theorems 1 and 2 require two lemmas that establish (a) the asymptotic normality of \({\hat{\beta }}_{j0}\) and (b) the convergence of \({\tilde{\beta }}_{j0}\) to \({\hat{\beta }}_{j0}\). These lemmas are presented first.

Lemma 1

Let \({\hat{\beta }}_{j0}\) be the estimator of the parameter \(\beta _{j0}, j=1,\ldots ,J\), of the LCDM when examinees’ attribute profiles \(\varvec{\alpha }_i\) are known (see Equation 17). Let \({\mathcal {A}}\) be a subset of \({\mathcal {L}}_j = \{1,2,\ldots ,K_j^*\}\) for item j, where \(K^*_j=\sum _{k=1}^K q_{jk}\) and \({\mathcal {C}}({\mathcal {A}}) = \{i \mid \alpha _{ik}=1, \forall k \in {\mathcal {A}} \text { and } \alpha _{ik'}=0, \forall k' \in {\mathcal {A}}^c\}\). Then, \(\sqrt{|{\mathcal {C}}(\emptyset )|} \,({\hat{\beta }}_{j0}-\beta _{j0}) \, \overset{{\mathcal {D}}}{\longrightarrow } \, N\left( 0,\frac{\big (1+\exp (\beta _{j0})\big )^2}{\exp (\beta _{j0})} \right) \) for all j.

Proof

Let \(\varvec{Y}_i=(Y_{i1},\ldots ,Y_{iJ})\) be the item response vector of examinee i. Based on the item response function of the LCDM in Equation 1, the conditional expectation of \(Y_{ij}\) reduces to

$$\begin{aligned}&E(Y_{ij} \mid \varvec{\alpha }_i) \\= & {} \frac{ \exp \left( \beta _{j0} + \sum _{k=1}^{K^{*}_j} \beta _{jk}\alpha _{ik} + \sum _{k=1}^{K^{*}_j-1} \sum _{k'=k+1}^{K^{*}_j} \beta _{jkk'}\alpha _{ik}\alpha _{ik'} + \cdots + \beta _{j12 \ldots K^{*}_j} \prod _{k=1}^{K^{*}_j} \alpha _{ik} \right) }{ 1 + \exp \Big ( \beta _{j0} + \sum _{k=1}^{K^{*}_j} \beta _{jk}\alpha _{ik} + \sum _{k=1}^{K^{*}_j-1} \sum _{k'=k+1}^{K^{*}_j} \beta _{jkk'}\alpha _{ik}\alpha _{ik'} + \cdots + \beta _{j12 \ldots K^{*}_j} \prod _{k=1}^{K^{*}_j} \alpha _{ik} \Big ) }. \end{aligned}$$

If \(\varvec{\alpha }_i=(0,0,\ldots , 0)^{\prime }\), then the conditional expectation of \(Y_{ij}\) is

$$\begin{aligned} E \big ( Y_{ij} \mid \varvec{\alpha }_i=(0,0,\ldots , 0)^{\prime } \big ) = \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})}. \end{aligned}$$

Because \(Y_{ij}\) is binary, the variance of \(Y_{ij}\) is

$$\begin{aligned} \text{ Var } \big ( Y_{ij} \mid \varvec{\alpha }_i=(0,0,\ldots , 0)^{\prime } \big )= \frac{\exp (\beta _{j0})}{\big ( 1+\exp (\beta _{j0}) \big )^2} \end{aligned}$$

with \(0<\frac{\exp (\beta _{j0})}{\big ( 1+\exp (\beta _{j0}) \big )^2} < \infty \). Due to the Central Limit Theorem

$$\begin{aligned} \sqrt{|{\mathcal {C}}(\emptyset )|} \, \left( {\bar{Y}}_{j \, {\mathcal {C}}(\emptyset )} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right) \, \overset{{\mathcal {D}}}{\longrightarrow } \, N \left( 0,\frac{\exp (\beta _{j0})}{\big ( 1+\exp (\beta _{j0}) \big )^2} \right) \, \forall j \end{aligned}$$
(19)

where \({\bar{Y}}_{j \, {\mathcal {C}}(\emptyset )}\) is the mean of the responses \(Y_{ij}\) in proficiency class \({\mathcal {C}}(\emptyset )\). Define the function \(g: (0,1)\rightarrow \mathbb {R}\) as \(g(x)=\ln (\frac{x}{1-x})\). Then g has a continuous derivative at \(\frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})}\) because \(0<\frac{\exp (\beta _{j0})}{1 + \exp (\beta _{j0})} < 1\). Applying the Delta Method to Equation 19 results in

$$\begin{aligned}&\sqrt{|{\mathcal {C}}(\emptyset )|} \, \left( g ({\bar{Y}}_{j \, {\mathcal {C}} (\emptyset )}) - g\left( \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})}\right) \right) \, \overset{{\mathcal {D}}}{\longrightarrow } \\&N \left( 0,\frac{\exp (\beta _{j0})}{\big (1+\exp (\beta _{j0})\big )^2}\left( g'\left( \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})}\right) \right) ^2 \right) \end{aligned}$$

which can be simplified to

$$\begin{aligned} \sqrt{|{\mathcal {C}}(\emptyset )|} \, \left( \ln \left( \frac{{\bar{Y}}_{j \, {\mathcal {C}}(\emptyset )}}{1-{\bar{Y}}_{j \, {\mathcal {C}}(\emptyset )}} \right) - \beta _{j0} \right) \, \overset{{\mathcal {D}}}{\longrightarrow } \, N \left( 0,\frac{\big (1+\exp (\beta _{j0})\big )^2}{\exp (\beta _{j0})}\right) . \end{aligned}$$

Recall that the estimator \({\hat{\beta }}_{j0}=\ln \frac{{\bar{Y}}_{j \, {\mathcal {C}}(\emptyset )}}{1-{\bar{Y}}_{j \, {\mathcal {C}}(\emptyset )}}\) (see Equation 17); hence,

$$\begin{aligned} \sqrt{|{\mathcal {C}}(\emptyset )|} \, \big ({\hat{\beta }}_{j0}-\beta _{j0}\big ) \, \overset{{\mathcal {D}}}{\longrightarrow } \, N \left( 0,\frac{\big ( 1+\exp (\beta _{j0})\big )^2}{\exp (\beta _{j0})} \right) \end{aligned}$$

for all j. \(\square \)
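The asymptotic behavior described in Lemma 1 is easy to visualize numerically. The following Monte Carlo check is a purely didactic sketch (it is not part of the formal argument, and all values are illustrative): simulate the responses of a \({\mathcal {C}}(\emptyset )\) class of size n, compute \({\hat{\beta }}_{j0}\) as in Equation 17, and compare the empirical variance of \(\sqrt{n}\,({\hat{\beta }}_{j0}-\beta _{j0})\) with the asymptotic variance \(\big (1+\exp (\beta _{j0})\big )^2/\exp (\beta _{j0})\).

set.seed(1)
beta_j0 <- -1.2
p0      <- exp(beta_j0) / (1 + exp(beta_j0))   # mean response in C(empty)
n       <- 500                                 # |C(empty)|
reps    <- 2000
beta_hat <- replicate(reps, {                  # Equation 17 on simulated responses
  ybar <- mean(rbinom(n, 1, p0))
  log(ybar / (1 - ybar))
})
c(empirical  = var(sqrt(n) * (beta_hat - beta_j0)),
  asymptotic = (1 + exp(beta_j0))^2 / exp(beta_j0))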

Lemma 2

Let \({\hat{\beta }}_{j0}\) and \({\tilde{\beta }}_{j0}\) be the parameter estimators as defined in Equations 17 and 18, respectively. Assume \(J \rightarrow \infty \) and \(J<N\). Then \({\tilde{\beta }}_{j0} \overset{{\mathcal {P}}}{\longrightarrow } {\hat{\beta }}_{j0}\) for all j.

Proof

Because \(\tilde{\varvec{\alpha }}\) is a consistent estimator of \(\varvec{\alpha }\)

$$\begin{aligned} P \left( \bigcup _{i=1}^N \big \{ |\tilde{\varvec{\alpha }}_i- \varvec{\alpha }_i|>\varepsilon \big \} \right) \rightarrow 0 \end{aligned}$$
(20)

as \(J \rightarrow \infty \). Equations 17 and 18 show that \({\tilde{\beta }}_{j0} = {\hat{\beta }}_{j0}\) if \(\tilde{\varvec{\alpha }}_i = \varvec{\alpha }_i\) for all i, which can be expressed as

$$\begin{aligned} P \left( \bigcap _{i=1}^N \big \{ \tilde{\varvec{\alpha }}_i = \varvec{\alpha }_i \big \} \right) \le P \big ( \{{\tilde{\beta }}_{j0} = {\hat{\beta }}_{j0}\} \big ) \end{aligned}$$

or equivalently as

$$\begin{aligned} P \big ( \{{\tilde{\beta }}_{j0}\ne {\hat{\beta }}_{j0}\} \big ) \le P \left( \bigcup _{i=1}^N \big \{ \tilde{\varvec{\alpha }}_i\ne \varvec{\alpha }_i \big \} \right) . \end{aligned}$$

Hence, for every \(\varepsilon >0\),

$$\begin{aligned} P \big ( |{\tilde{\beta }}_{j0} - {\hat{\beta }}_{j0}|> \varepsilon \big ) \le P \left( \bigcup _{i=1}^N \big \{ |\tilde{\varvec{\alpha }}_i- \varvec{\alpha }_i|>\varepsilon \big \} \right) \end{aligned}$$
(21)

Due to Equation 20, Equation 21 also implies

$$\begin{aligned} P \big ( |{\tilde{\beta }}_{j0}- {\hat{\beta }}_{j0}|>\varepsilon \big ) \rightarrow 0 \end{aligned}$$

as \(J \rightarrow \infty \). Note that \(J \rightarrow \infty \) implies \(N \rightarrow \infty \) because \(N>J\). Therefore,

$$\begin{aligned} {\tilde{\beta }}_{j0} \, \overset{{\mathcal {P}}}{\longrightarrow } \, {\hat{\beta }}_{j0} \end{aligned}$$

for all j. \(\square \)

To reiterate, Lemma 1 establishes that the estimator \({\hat{\beta }}_{j0}\) is asymptotically normally distributed if examinees’ attribute profiles \(\varvec{\alpha }_i\) are known, a useful property if hypothesis testing or interval estimation is desired. Lemma 2 establishes the convergence of \({\tilde{\beta }}_{j0}\) to \({\hat{\beta }}_{j0}\) and thus connects \({\tilde{\beta }}_{j0}\) with the true \(\beta _{j0}\). Lemmas 1 and 2 are instrumental for the proofs of Theorems 1 and 2.

Theorem 1

Suppose that there exist \(0<\varepsilon _1, \varepsilon _2 <1\) such that \(\varepsilon _1<\frac{| {\mathcal {C}}(\emptyset )|}{N}<1-\varepsilon _2\). Let \({\hat{\beta }}_{j0}\) and \({\tilde{\beta }}_{j0}\) be the estimators of \(\beta _{j0}\) with properties as specified in Lemmas 1 and 2. Assume \(N>J\) and \(J \rightarrow \infty \). Also, assume \(\tilde{\varvec{\alpha }}_i\) is a consistent estimator of \(\varvec{\alpha }_i\). Then \({\tilde{\beta }}_{j0} \overset{{\mathcal {P}}}{\longrightarrow } \beta _{j0}\) for all j.

Proof

Lemma 1 stated

$$\begin{aligned} \sqrt{| {\mathcal {C}} (\emptyset )|}({\hat{\beta }}_{j0} - \beta _{j0}) \, \overset{{\mathcal {D}}}{\longrightarrow } \, N \left( 0,\frac{\big (1+\exp (\beta _{j0})\big )^2}{\exp (\beta _{j0})} \right) . \end{aligned}$$
(22)

Lemma 2 stated

$$\begin{aligned} {\tilde{\beta }}_{j0}-{\hat{\beta }}_{j0} \, \overset{{\mathcal {P}}}{\longrightarrow } \, 0 \, \forall j \end{aligned}$$
(23)

Because \(\varepsilon _1<\frac{| {\mathcal {C}} (\emptyset )|}{N}<1-\varepsilon _2\), the condition \(| {\mathcal {C}} (\emptyset )|\rightarrow \infty \) as \(N\rightarrow \infty \) is guaranteed. Then, because \(\frac{1}{\sqrt{| {\mathcal {C}} (\emptyset )|}} \overset{{\mathcal {P}}}{\longrightarrow } 0\), due to Slutsky’s theorem, Equation 22 can be written as

$$\begin{aligned} {\hat{\beta }}_{j0}-\beta _{j0} \, \overset{{\mathcal {D}}}{\longrightarrow } \, \left( \frac{1}{\sqrt{| {\mathcal {C}}(\emptyset )|}}\right) \, N \left( 0,\frac{\big (1+\exp (\beta _{j0})\big )^2}{\exp (\beta _{j0})} \right) = 0 \, \forall j \end{aligned}$$
(24)

Note that because \({\hat{\beta }}_{j0}-\beta _{j0}\) converges to 0 in distribution, and 0 is a constant, \({\hat{\beta }}_{j0}-\beta _{j0}\) converges to 0 in probability as well—that is,

$$\begin{aligned} {\hat{\beta }}_{j0} - \beta _{j0} \, \overset{{\mathcal {P}}}{\longrightarrow } \, 0 \, \forall j \end{aligned}$$

Now, applying Slutsky’s theorem again to Equations 23 and 24 yields

$$\begin{aligned} {\tilde{\beta }}_{j0} - {\hat{\beta }}_{j0} + {\hat{\beta }}_{j0} - \beta _{j0} = {\tilde{\beta }}_{j0}-\beta _{j0} \, \overset{{\mathcal {D}}}{\longrightarrow } \, 0 \end{aligned}$$

which implies

$$\begin{aligned} {\tilde{\beta }}_{j0}-\beta _{j0} \, \overset{{\mathcal {P}}}{\longrightarrow } \, 0 \, \forall j \end{aligned}$$

because 0 is a constant—or equivalently,

$$\begin{aligned} {\tilde{\beta }}_{j0} \, \overset{{\mathcal {P}}}{\longrightarrow } \, \beta _{j0} \end{aligned}$$

for all j. \(\square \)

Theorem 1 states that \({\tilde{\beta }}_{j0}\) converges in probability to \(\beta _{j0}\); hence, \({\tilde{\beta }}_{j0}\) is a statistically consistent estimator of \(\beta _{j0}\). Theorem 2 states that \({\tilde{\beta }}_{j0}\) is also a uniformly consistent estimator of \(\beta _{j0}\). The proof of Theorem 2 requires a modification of Lemma 2 that is given here as

Proposition 1

If \(\tilde{\varvec{\alpha }}_i \overset{{\mathcal {P}}}{\longrightarrow } \varvec{\alpha }_i\) uniformly, then \({\tilde{\beta }}_{j0} \overset{{\mathcal {P}}}{\longrightarrow } {\hat{\beta }}_{j0}\) uniformly.

Proof

The claim \({\tilde{\beta }}_{j0}={\hat{\beta }}_{j0}\) if \(\tilde{\varvec{\alpha }}_i=\varvec{\alpha }_i\) for all i (used already in the proof of Lemma 2) also implies

$$\begin{aligned} P \left( \bigcap _{i=1}^N \big \{ \tilde{\varvec{\alpha }}_i=\varvec{\alpha }_i \big \} \right)\le & {} P \left( \bigcap _{j=1}^J \big \{ {\tilde{\beta }}_{j0} = {\hat{\beta }}_{j0} \big \} \right) \\ \Rightarrow \, P \left( \bigcup _{j=1}^J \big \{ {\tilde{\beta }}_{j0}\ne {\hat{\beta }}_{j0} \big \} \right)\le & {} P \left( \bigcup _{i=1}^N \big \{ \tilde{\varvec{\alpha }}_i \ne \varvec{\alpha }_i \big \} \right) . \end{aligned}$$

Hence, for every \(\varepsilon >0\),

$$\begin{aligned} P \left( \bigcup _{j=1}^J \big \{ |{\tilde{\beta }}_{j0}- {\hat{\beta }}_{j0}|>\varepsilon \big \} \right) \le P\left( \bigcup _{i=1}^N \big \{ |\tilde{\varvec{\alpha }}_i- \varvec{\alpha }_i|>\varepsilon \big \} \right) . \end{aligned}$$
(25)

Because \(\tilde{\varvec{\alpha }}_i \overset{{\mathcal {P}}}{\longrightarrow } \varvec{\alpha }_i\) uniformly

$$\begin{aligned} P \left( \bigcup _{i=1}^N \big \{ |\tilde{\varvec{\alpha }}_i- \varvec{\alpha }_i|>\varepsilon \big \} \right) \, \rightarrow \, 0. \end{aligned}$$
(26)

Combining Equations 25 and 26 results in

$$\begin{aligned} P \left( \bigcup _{j=1}^J \big \{ |{\tilde{\beta }}_{j0}- {\hat{\beta }}_{j0}|>\varepsilon \big \} \right) \rightarrow 0 \end{aligned}$$

as \(J \rightarrow \infty \). Note that \(J \rightarrow \infty \) implies \(N \rightarrow \infty \) because \(N>J\). Therefore,

$$\begin{aligned} {\tilde{\beta }}_{j0} \, \overset{{\mathcal {P}}}{\longrightarrow } \, {\hat{\beta }}_{j0} \end{aligned}$$

uniformly. \(\square \)

Theorem 2

Assume \(\tilde{\varvec{\alpha }}_i\) is a uniformly consistent estimator of \(\varvec{\alpha }_i\). Also, assume \(J<N\) and \(\varepsilon _1<\frac{|{\mathcal {C}}(\emptyset )|}{N}<1-\varepsilon _2\), where \(0<\varepsilon _1, \varepsilon _2<1\). Then \({\tilde{\beta }}_{j0}\) is a uniformly consistent estimator of \(\beta _{j0}\) for all j if \(N \exp (-J)\rightarrow 0\) as \(J\rightarrow \infty \).

Proof

The proof of Theorem 2 uses Hoeffding’s theorem (Hoeffding, 1963), which states that \(P(|\frac{1}{N}\sum _{i=1}^N X_i-E(X_i)|>\epsilon )<2 \exp (-2N\epsilon ^2)\) where \(X_1,\ldots ,X_N\) are iid random variables and \(0\le X_i \le 1\) for all i. In formal agreement with Hoeffding’s theorem, the argument is developed for \(\frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})}\) (and not \({\tilde{\beta }}_{j0}\))—the final convergence result, however, is transformed into an expression in terms of \({\tilde{\beta }}_{j0}\). Consider

$$\begin{aligned}&\max _j \left| \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right| \le \max _j \left( \left| \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} - \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} \right| \right. \\&\quad + \left. \left| \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right| \right) . \end{aligned}$$

Hence,

$$\begin{aligned}&P\left( \max _j \left| \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right|> \varepsilon \right) \nonumber \\&\quad \le P\left( \max _j \left| \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} - \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} \right| \right. \nonumber \\&\left. \quad + \max _j \left| \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})}\right|> \varepsilon \right) \nonumber \\&\quad \le P\left( \left\{ \max _j \left| \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} - \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} \right|> \frac{\varepsilon }{2} \right\} \nonumber \right. \\&\quad \left. \bigcup \left\{ \max _j \left| \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right|> \frac{\varepsilon }{2} \right\} \right) \nonumber \\&\quad \le P \left( \max _j \left| \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} - \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} \right|> \frac{\varepsilon }{2} \right) \nonumber \\&\quad + P \left( \max _j \left| \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right| > \frac{\varepsilon }{2} \right) . \end{aligned}$$

Due to Hoeffding’s theorem and the assumption \(\varepsilon _1<\frac{| {\mathcal {C}}(\emptyset )|}{N}<1-\varepsilon _2\), the first term in the last line can be written as

$$\begin{aligned}&P\left( \max _j \left| \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right|> \frac{\varepsilon }{2} \right) \nonumber \\&\quad = P\left( \bigcup _j \left\{ \left| \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right|> \frac{\varepsilon }{2} \right\} \right) \nonumber \\&\quad \le \sum _{j=1}^J P \left( \left| \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right| > \frac{\varepsilon }{2} \right) \nonumber \\&\quad \le J \left( 2 \exp \Big ( -2 | {\mathcal {C}}(\emptyset ) | \, \Big (\frac{\varepsilon }{2} \Big )^2 \Big ) \right) \nonumber \\&\quad \le 2N \exp \Big ( -N \varepsilon _1 \, \frac{\varepsilon ^2}{2} \Big )\nonumber \\&\quad = 2N \exp \Big ( -N \frac{(\sqrt{\varepsilon _1}\varepsilon )^2}{2} \Big ) \nonumber \\&\quad \le 2N \exp \Big ( -J \frac{(\sqrt{\varepsilon _1}\varepsilon )^2}{2} \Big ) \, {\longrightarrow } \, 0 \end{aligned}$$
(27)

as \(J\rightarrow \infty \) and \(N \exp (-J) \rightarrow 0\), for all \(\varepsilon \). Proposition 1 states that \({\tilde{\beta }}_{j0} \overset{{\mathcal {P}}}{\longrightarrow }{\hat{\beta }}_{j0}\) uniformly. Consider the continuous function \(t: \mathbb {R}\rightarrow (0,1)\) defined as \(t(x) = \frac{\exp (x)}{1+\exp (x)}\). Applying the Continuous Mapping Theorem to Proposition 1 results in

$$\begin{aligned} \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} \, \overset{{\mathcal {P}}}{\longrightarrow } \, \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} \end{aligned}$$

uniformly. Thus,

$$\begin{aligned} P \left( \max _j \left| \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} - \frac{\exp ({\hat{\beta }}_{j0})}{1+\exp ({\hat{\beta }}_{j0})} \right| > \frac{\varepsilon }{2} \right) \, \rightarrow \, 0 \end{aligned}$$
(28)

as \(J \rightarrow \infty \) and \(N \exp (-J) \rightarrow 0\), for all \(\varepsilon \). Then, Equations 27 and 28 imply that

$$\begin{aligned} P\left( \max _j \left| \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} - \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \right| > \varepsilon \right) \, \rightarrow \, 0 \end{aligned}$$

as \(J \rightarrow \infty \) and \(N \exp (-J) \rightarrow 0\). Equivalently,

$$\begin{aligned} \frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} \, \overset{{\mathcal {P}}}{\longrightarrow } \, \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})} \end{aligned}$$

uniformly. Notice that the assumption \(J < N\) implies that \(N \rightarrow \infty \) as \(J \rightarrow \infty \). Hence, \(\frac{\exp ({\tilde{\beta }}_{j0})}{1+\exp ({\tilde{\beta }}_{j0})} \overset{{\mathcal {P}}}{\longrightarrow } \frac{\exp (\beta _{j0})}{1+\exp (\beta _{j0})}\) as \(N \rightarrow \infty \) and \(N \exp (-J) \rightarrow 0\). Define the continuous function \(g: (0,1) \rightarrow \mathbb {R}\) as \(g(x) = \ln (\frac{x}{1-x})\). Thus, based on the Continuous Mapping Theorem

$$\begin{aligned} {\tilde{\beta }}_{j0} \overset{{\mathcal {P}}}{\longrightarrow } \beta _{j0} \end{aligned}$$

uniformly. \(\square \)

In summary, Theorems 1 and 2 address two different kinds of convergence. Theorem 1 establishes pointwise convergence in probability, whereas Theorem 2 establishes uniform convergence, that is, the guarantee that the estimators of all J items converge simultaneously, which is a much stronger form of statistical consistency. (Note that uniform convergence implies pointwise convergence. However, Theorem 1 has been presented here as well because its proof uses a different technical argument and has its own theoretical merit.) Finally, it should be noted that the assumptions \(J < N\) (i.e., the test length must not exceed the number of examinees) and \(\varepsilon _1< \frac{|{\mathcal {C}}(\emptyset )|}{N} < 1-\varepsilon _2\) are often made implicitly in testing without being explicitly mentioned.

6 Simulation Studies

Two simulation studies were conducted to evaluate the performance of the JMLE estimators on artificial data sets under finite sample conditions for selected test settings. All data sets conformed to the LCDM. The item parameter estimates and the classification of examinees obtained from JMLE were compared to those obtained from MMLE-EM (using the implementation of the EM algorithm in Mplus; Muthén & Muthén, 1998–2012; see also Templin & Hoffman, 2013). In addition, all data sets were also fitted by a procedure called conditional maximum likelihood estimation (CMLE). CMLE uses examinees’ (known) true attribute profiles as input when estimating the item parameters, and the (known) true item parameters as input when estimating examinees’ attribute profiles. Thus, CMLE provided a most conservative benchmark for the results obtained from JMLE and MMLE-EM. (CMLE is implemented in the R package NPCD; Zheng & Chiu, 2014.)

6.1 Implementation of Joint Maximum Likelihood Estimation

6.1.1 The Consistent Estimator \(\tilde{\varvec{\alpha }}\) of Examinees’ Proficiency Class Membership

The Nonparametric Classification (NPC) method (Chiu & Douglas, 2013) was used here to obtain estimates of examinees’ attribute profiles \(\varvec{\alpha }\) (i.e., their proficiency class membership) that are needed as input to the JMLE algorithm. Wang and Douglas (2015) proved that under certain regularity conditions \(\tilde{\varvec{\alpha }}\) obtained by the NPC method is a statistically consistent estimator of an examinee’s attribute profile for any DCM:

“\(\ldots \) the only general condition required of the underlying item response function is that the probability of a correct response for masters of the attributes is bounded above 0.5 for each item, and the probability for non-masters is bounded below 0.5. If the true model satisfies these simple conditions, nonparametric classification will be consistent as the test length increases.” (Wang & Douglas, 2015, p. 99)

The NPC method estimates the proficiency class membership of examinees by comparing their observed item response profiles with the ideal response profiles of all possible proficiency classes. The ideal response to item j is a function of the item’s q-vector, \({\mathbf {q}}_j\), and the attribute profile \(\varvec{\alpha }_m\) of proficiency class \({\mathcal {C}}_m\); it is the score that an examinee in proficiency class \({\mathcal {C}}_m\) (having attribute profile \(\varvec{\alpha }_m\)) would realize if no perturbation occurred. The NPC estimator \(\tilde{\varvec{\alpha }}\) of an examinee’s attribute profile is the attribute profile whose ideal item response profile is closest to, or most similar to, the examinee’s observed item response profile among all possible ideal item response profiles (for further technical details, consult Chiu & Douglas, 2013). An implementation of the NPC method is available in the R package NPCD (Zheng & Chiu, 2014).
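The following R sketch illustrates the core of the NPC method for a conjunctive (DINA-type) ideal-response rule and the plain Hamming distance; the function name is illustrative, and the NPCD package should be preferred in practice since it implements the full method of Chiu and Douglas (2013).

npc_classify <- function(Y, Q) {
  K <- ncol(Q)
  # all 2^K candidate attribute profiles
  profiles <- as.matrix(expand.grid(rep(list(0:1), K)))
  # conjunctive ideal response: 1 iff a profile masters every attribute the item requires
  ideal <- t(apply(profiles, 1, function(a) as.integer(Q %*% a == rowSums(Q))))
  # assign each examinee the profile whose ideal response vector is closest (Hamming distance)
  t(apply(Y, 1, function(y) {
    d <- rowSums(abs(sweep(ideal, 2, y)))      # distance to each candidate profile
    profiles[which.min(d), ]                   # ties resolved by the first minimum
  }))
}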

6.1.2 JMLE Algorithm

The algorithm used here is an adaptation of Birnbaum’s paradigm (Birnbaum, 1968), a two-stage procedure for JMLE (Baker & Kim, 2004; Embretson & Reise, 2000). Examinees’ attribute profiles and the item parameters are treated as two sets of parameters: the parameters in one set are assumed to be known, whereas those in the other set are to be estimated. The algorithm is initialized with the estimates of examinees’ attribute profiles as input, which are obtained by the consistent estimators \(\tilde{\varvec{\alpha }}_1, \tilde{\varvec{\alpha }}_2, \ldots , \tilde{\varvec{\alpha }}_N\) of the NPC method. The joint likelihood in Equation 2 then reduces to a function of only the item parameters. The estimator of \(\varvec{\beta }_j\) is derived by maximizing the logarithm of the item likelihood \(L_j\) (see Equation 5) for all j.
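Putting the two stages together, and assuming the npc_classify() and item_betas() functions sketched earlier in this article are available, the procedure amounts to a single classification pass followed by a loop over items; whether the two stages are then alternated further is an implementation choice not shown here.

jmle_lcdm <- function(Y, Q) {
  alpha_tilde <- npc_classify(Y, Q)                 # stage 1: consistent NPC estimates
  lapply(seq_len(ncol(Y)), function(j)              # stage 2: closed-form item estimates
    item_betas(Y[, j], alpha_tilde, Q[j, ]))
}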

6.2 Simulation Study I

The purpose of Study I was to assess the accuracy of the JMLE estimates (i.e., item parameters and examinees’ attribute profiles) with replicated data sets.

6.2.1 Design

Item responses of \(N = 3000\) examinees conforming to the LCDM were generated. \(K = 3\) attributes were used; the number of items was \(J = 30, 40\). Examinees’ attribute profiles were generated based on the multivariate normal threshold model, with variances and covariances set to 1 and 0.3, respectively (for further details, consult Chiu, Douglas, & Li, 2009). The Q-matrices were designed such that (a) the maximum number of attributes required per item was two, (b) each attribute was used by 36% to 40% of the items (a proportion that corresponds to Q-matrix compositions used in other simulation studies of general DCMs; e.g., de la Torre, 2011; de la Torre & Chiu, 2010), and (c) they were complete (Chiu et al., 2009; Chiu & Köhn, 2015). Based on the result by Wang and Douglas (2015) mentioned earlier, the item parameters of the underlying LCDM were chosen such that \(P(Y_{ij}=1 \mid \xi _{ij}=0)\) was less than 0.5 and \(P(Y_{ij}=1 \mid \xi _{ij}=1)\) was about 0.8 (\(\xi \) is generic notation for the ideal item response). Examinees’ manifest item responses were sampled from a Bernoulli distribution, with \(P(Y_{ij}=1 \mid \varvec{\alpha })\) determined by the IRF of the LCDM. For each condition, 25 replicated data sets were generated. For each replicated data set, examinees’ attribute profiles and responses were re-sampled, whereas the Q-matrix and the item parameters were held constant.
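A brief R sketch of the attribute-generation step described above (the threshold used to dichotomize the latent variables is not reported here and is set to zero purely for illustration; for further details, including how the thresholds are chosen, consult Chiu, Douglas, & Li, 2009):

library(MASS)                                  # for mvrnorm()
set.seed(2016)
N <- 3000; K <- 3
Sigma <- matrix(0.3, K, K); diag(Sigma) <- 1   # variances 1, covariances 0.3
theta <- mvrnorm(N, mu = rep(0, K), Sigma = Sigma)
alpha <- (theta > 0) * 1L                      # multivariate normal threshold model
# Manifest responses would then be drawn as Y_ij ~ Bernoulli(P(Y_ij = 1 | alpha_i)),
# with the success probability given by the LCDM item response function.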

6.2.2 Results

The data were analyzed using JMLE, MMLE-EM, and CMLE. The multivariate normal threshold model that was used for generating examinees’ attribute profiles imposes a higher-order structure on the attributes, which determines the distribution of the attribute profiles \(\varvec{\alpha }\) that characterize the different proficiency classes. Recall that fitting educational data by a DCM, in addition to estimating the item parameters, requires estimating the distribution of the \(2^K\) proficiency classes, which involves \(2^K-1\) parameters. Explicitly modeling the distribution of the \(\varvec{\alpha }\) allows for a reduction of the number of model parameters, which might be useful especially if K, the number of attributes, is large. Mplus, for example, offers the option to construct a loglinear model for the distribution of \(\varvec{\alpha }\), which can range from a parsimonious main effects model to more complicated models with any order of interactions (for details, consult Rupp et al., 2010, Ch. 8). As a by-product, estimates of the correlations between the attributes can be derived from the parameter estimates of the loglinear model. The NPC method used for estimating examinees’ \(\varvec{\alpha }\)-profiles—and thus, the proportions of the different proficiency classes—relies on a nonparametric algorithm that, by definition, does not incorporate a (parametric) higher-order structure among the attributes. Chiu and Douglas (2013) established in a series of simulation studies the accuracy of the NPC method in estimating examinees’ \(\varvec{\alpha }\)-profiles for data without as well as for data with an underlying higher-order attribute structure. As JMLE relies on the input from NPC, JMLE does not—unlike Mplus—provide the means for explicitly modeling the distribution of the \(\varvec{\alpha }\). These conceptual and algorithmic differences between JMLE and Mplus-MMLE-EM raised the question of how their performance could be compared in a fair and meaningful way. Specifically, when fitting the data with Mplus-MMLE-EM, should the distribution of \(\varvec{\alpha }\) be explicitly modeled, or not? Using Mplus-MMLE-EM without explicitly modeling the distribution of \(\varvec{\alpha }\) appeared to be the closest match to JMLE. But, on the other hand, this choice could lead to inferior Mplus-MMLE-EM parameter estimates, especially when the number of attributes is large (as is shown in Simulation Study II below). Thus, as a reasonable compromise, the data were fitted by Mplus-MMLE-EM without and with explicitly modeling the distribution of \(\varvec{\alpha }\). (In the latter case, following Rupp et al. [2010], a logistic model including all main effects and all two-way interactions was used.)

Table 1 Simulation Study I: average CCR and RMSE of JMLE, Mplus MMLE-EM, and CMLE plus average CPU times of JMLE and Mplus MMLE-EM when the data conformed to the LCDM; \(N=3000, K=3\), 25 replications.

The performance of JMLE, MMLE-EM, and CMLE in assigning examinees to their true proficiency classes was assessed by computing the classification-correct rate (CCR; i.e., the proportion of correctly identified examinee attribute profiles). The accuracy of the item parameter estimates was evaluated by computing the root mean squared error (RMSE). The CPU times were recorded for JMLE and MMLE-EM. The results are reported in Table 1. JMLE, MMLE-EM, and CMLE obtained (more or less) identical CCR values. Not too surprisingly, CMLE realized the lowest RMSE, whereas the RMSE of JMLE was slightly larger than that of MMLE-EM. Table 1 also reports the \(95\%\) coverage intervals computed for CCR and RMSE; all three methods produced stable results, as can be concluded from the narrow intervals. Surprisingly, explicitly modeling the distribution of \(\varvec{\alpha }\) with Mplus appeared to affect neither the CCR nor the accuracy of the item parameter estimates (assessed via the RMSE). The effect on CPU time was inconclusive: for \(J=30\) items, an average increase of the CPU time by about 30 min was observed, whereas for \(J=40\) the average CPU time decreased by almost 8 min. Perhaps the most remarkable result was the difference in average CPU times between JMLE and MMLE-EM. The former required on average 7 to 8 s per data set, whereas the average CPU times used by MMLE-EM ranged from about 5 to 50 min per data set, depending on the number of items and on whether the distribution of \(\varvec{\alpha }\) was explicitly modeled.
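For completeness, the two accuracy measures reported in Table 1 can be computed as in the following minimal sketch (names are illustrative):

ccr <- function(alpha_est, alpha_true) {
  # proportion of examinees whose entire attribute profile is recovered correctly
  mean(apply(alpha_est == alpha_true, 1, all))
}
rmse <- function(beta_est, beta_true) {
  # root mean squared error over all item parameter estimates
  sqrt(mean((beta_est - beta_true)^2))
}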

6.3 Simulation Study II

The purpose of Study II was to examine the performance of JMLE for a relatively large number of attributes and items. As in Study I, the results of CMLE served as a benchmark.

6.3.1 Design

Compared to Study I, the design of Study II had to be scaled down to accommodate the CPU time demands of Mplus. Hence, no replicated data sets were used; instead, four data sets were generated, containing the responses of \(N=3000\) examinees to \(J = 30, 40, 50\), and 60 items, respectively, conforming to the LCDM. \(K=5\) attributes were used. The Q-matrices were again designed such that (a) the maximum number of attributes required per item was three, (b) each attribute was used by 36% to 40% of the items, and (c) they were complete. Examinees’ attribute profiles were generated based on the multivariate normal threshold model, with variances and covariances set to 1 and 0.3, respectively.

6.3.2 Results

As in Simulation Study I, the data were fitted by Mplus-MMLE-EM without and with explicitly modeling the distribution of \(\varvec{\alpha }\) (using a logistic model including all main effects and all two-way interactions; see Rupp et al., 2010). Table 2 reports the results observed on the four LCDM data sets. CMLE realized the highest CCR and the lowest RMSE, whereas JMLE and MMLE-EM attained about the same CCR and RMSE. Similar to Simulation Study I, the findings concerning the effect of explicitly modeling the distribution of \(\varvec{\alpha }\) on the CPU times for Mplus-MMLE-EM were inconclusive. For \(J=30\) items, the CPU time increased when the distribution of \(\varvec{\alpha }\) was explicitly modeled, whereas the CPU times decreased for \(J=40, 50\). Remarkably, for \(J=60\), Mplus-MMLE-EM did not provide any results when used without explicitly modeling the distribution of \(\varvec{\alpha }\). (Perhaps, without imposing this additional structure, the number of parameters to be estimated is simply too large when the test involves 60 items; however, an observation made on a single data set should not be over-interpreted.) The differences in CPU time between JMLE and MMLE-EM, however, were striking: JMLE required between 22 and 34 s per data set, whereas MMLE-EM required at least about three and a half hours (\(J=30\), without explicitly modeling the distribution of \(\varvec{\alpha }\)) and more than 37 h for \(J=60\) (with explicitly modeling the distribution of \(\varvec{\alpha }\)).

Table 2 Simulation Study II: CCR and RMSE of JMLE, Mplus MMLE-EM, and CMLE plus CPU times of JMLE and MMLE-EM when the data conformed to the LCDM; \(N=3000, K=5\).

7 Practical Application: Analysis of Language Testing Data

As a real-world application of JMLE, data from a retired version of the Examination for the Certificate of Proficiency in English (ECPE) were fitted with the LCDM. The ECPE is a test of advanced English language proficiency, developed, administered, and scored by Cambridge Michigan Language Assessments (CaMLA). The test is given annually around the world to approximately 41,000 non-native speakers of English (ECPE 2013 Report, 2013). The items used in this study are a subset from the grammar section of the ECPE. They have been previously analyzed by Buck and Tatsuoka (1998), Feng, Habing, and Huebner (2014), Henson and Templin (2007), Liu, Douglas, and Henson (2009), Templin and Hoffman (2013), and Templin and Bradshaw (2014). Responses to \(J = 28\) items were collected from \(N=2922\) examinees. The test involved \(K=3\) attributes (\(\alpha _1\) = morphosyntactic skills; \(\alpha _2\) = cohesive skills; and \(\alpha _3\) = lexical skills); the individual items required the mastery of at most two attributes (the complete Q-matrix is given in Templin & Hoffman, 2013).

7.1 Results

The ECPE data were recently analyzed with the LCDM by Templin and Hoffman (2013), who used MMLE-EM relying on the implementation of the EM algorithm in Mplus (Muthén & Muthén, 1998–2012). The Templin-Hoffman findings (see also Templin & Bradshaw, 2014) are presented here for comparison with the JMLE results. Table 3 reports the item parameter estimates obtained from JMLE and MMLE-EM.

Table 3 Estimates of the item parameters obtained by JMLE and MMLE-EM (Templin & Hoffman, 2013).

The estimates obtained by JMLE and MMLE-EM were quite different. Overall, MMLE-EM yielded higher intercept estimates \(\hat{\beta }_0\), and hence higher guessing probabilities, than JMLE (under the LCDM, the intercept determines the probability of a correct response for examinees who have mastered none of the required attributes, so \(\hat{\beta }_0 > 0\) corresponds to a guessing probability above 0.5), and lower coefficient estimates for the interaction terms than JMLE. In an attempt to resolve the discrepancies between MMLE-EM and JMLE, the ECPE data were also fitted using MCMC. However, the MCMC results (not reported here) differed, in turn, from both the MMLE-EM and the JMLE results.
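For reference (stated in generic LCDM notation, with \(\beta _{j0}\) denoting the intercept of item j), the baseline probability follows directly from the LCDM response function, because all attribute main effects and interaction terms vanish for the all-zero attribute profile:

$$\begin{aligned} P\big (Y_{ij} = 1 \mid \varvec{\alpha }_i = \varvec{0}\big ) = \frac{\exp (\beta _{j0})}{1 + \exp (\beta _{j0})}, \end{aligned}$$

which exceeds 0.5 exactly when \(\beta _{j0} > 0\).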

To further probe these discrepant findings, the estimated attribute profiles (i.e., proficiency class memberships) of a subset of five examinees were inspected. These five examinees had already been used by Templin and Hoffman (2013, p. 47) as exemplary cases for a deeper analysis of their findings. Table 4 presents the observed item response profiles; the estimated attribute profiles of these examinees, as obtained from JMLE and MMLE-EM, are reported in Table 4(a). Note that the two methods produced identical attribute profile estimates for examinees 1 and 29. For the other three examinees, the estimates disagreed: JMLE resulted in higher probabilities of attribute mastery than MMLE-EM.

As a descriptive measure for evaluating the attribute profile estimates, the proportion of correct answers to all items requiring mastery of the kth attribute was computed for each of the five examinees, for \(k=1,2,3\). For example, examinees 1 and 33 each answered 83% (i.e., 5 out of 6) of the items requiring attribute 2 correctly. The attribute estimates obtained from MMLE-EM indicated that examinee 1 did master attribute 2, but that examinee 33 did not. Closer inspection, however, showed that examinees 1 and 33 had identical responses to all items requiring attribute 2. Put differently, regardless of the particular DCM used, it would presumably be difficult to explain to parents and students why a student (here, examinee 33) was flagged as not having mastered attribute 2 when, in fact, he or she answered 5 of the 6 items requiring that attribute correctly.
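The descriptive measure itself is simple to compute from an examinee's response vector and the Q-matrix. The following minimal sketch (with illustrative function and variable names) returns, for a single examinee, the proportion of correct responses among the items requiring each attribute.

```python
import numpy as np

def attribute_proportion_correct(y, Q):
    """y : (J,) binary response vector of one examinee;
    Q : (J, K) binary Q-matrix.
    Returns a length-K vector: the proportion correct on the items
    that require each attribute."""
    y, Q = np.asarray(y), np.asarray(Q)
    requires = Q == 1                          # (J, K): item j requires attribute k
    n_items = requires.sum(axis=0)             # number of items requiring each attribute
    n_correct = (y[:, None] * requires).sum(axis=0)
    return n_correct / n_items
```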

Table 4 Analysis of the ECPE data: Observed item response profiles and estimated attribute profiles of examinees 1, 10, 14, 29, and 33 obtained from JMLE and MMLE-EM.

Templin and Bradshaw (2014) re-analyzed the ECPE data but, unlike Templin and Hoffman (2013), used the Hierarchical Diagnostic Classification Model (HDCM), a variant of the LCDM modified to accommodate a hierarchical structure assumed to underlie the attributes. A hierarchy implies an order among attributes such that mastery of certain attributes requires mastery of other attributes as a prerequisite. If the attributes have a hierarchical structure, then, typically, several proficiency classes are empty because the corresponding combinations of attributes cannot occur under the specific attribute hierarchy. Templin and Bradshaw (2014) postulated a linear hierarchy among the attributes (i.e., the attributes are ordered along a line, with precedence implying prerequisite; see also Leighton, Gierl, & Hunka, 2004) such that \(\alpha _3\): lexical rules \(\preccurlyeq \alpha _2\): cohesive rules \(\preccurlyeq \alpha _1\): morphosyntactic rules, where \(a \preccurlyeq b\) denotes that a precedes (is a prerequisite for) b. This hierarchy implies that, of the eight proficiency classes from the previous analysis, only four are meaningfully defined: \(\varvec{\alpha }_1 = (000), \varvec{\alpha }_2 = (001), \varvec{\alpha }_3 = (011)\), and \(\varvec{\alpha }_4 = (111)\). The interesting question, of course, is whether this modification of the LCDM might resolve the inconsistencies described earlier for the subset of examinees. Hence, the probabilities of attribute mastery were re-computed based on the item parameter estimates obtained for the HDCM by Templin and Bradshaw (2014, Table 5) and the estimates of the “structural parameters” reported in their Figure 1. The revised probabilities of attribute mastery are reported in Table 4(b). Unfortunately, using the HDCM did not resolve the issues discussed earlier. In summary, further research seems warranted to determine which method might be most appropriate for analyzing the ECPE data.
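As a small illustration of how a linear hierarchy prunes the latent classes, the following sketch enumerates the attribute profiles \((\alpha _1, \alpha _2, \alpha _3)\) compatible with the ordering \(\alpha _3 \preccurlyeq \alpha _2 \preccurlyeq \alpha _1\); it reproduces the four profiles listed above.

```python
from itertools import product

def profiles_under_linear_hierarchy(K):
    """Enumerate profiles (alpha_1, ..., alpha_K) compatible with a linear
    hierarchy in which attribute k+1 is a prerequisite for attribute k;
    numerically, this means alpha_k <= alpha_{k+1} for every k."""
    return [p for p in product((0, 1), repeat=K)
            if all(p[k] <= p[k + 1] for k in range(K - 1))]

print(profiles_under_linear_hierarchy(3))
# [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
```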

8 Discussion

In this article, JMLE was developed for fitting DCMs, that is, for estimating the item parameters and examinees' proficiency class membership. JMLE has barely been used in psychometrics because JMLE parameter estimators typically lack statistical consistency. The JMLE procedure presented here resolves the consistency issue by incorporating an external, statistically consistent estimator of examinees' proficiency class membership into the joint likelihood function, which subsequently allows for the construction of item parameter estimators that also have the consistency property. This claim was developed and proven using the framework of general DCMs and the LCDM in particular. Two consistency theorems established (a) pointwise convergence in probability (Theorem 1) and (b) uniform convergence (Theorem 2) of the JMLE item parameter estimators to the true item parameters. Two simulation studies were conducted to evaluate the performance of JMLE with tests of varying length and numbers of attributes. The results showed that the JMLE-based item parameter estimates and examinee classifications were essentially as accurate as those obtained from MMLE using the EM algorithm. However, JMLE was far more computationally efficient than Mplus MMLE-EM, occasionally reducing the CPU times by a factor of 1000 or more. In light of these results, three questions remain to be addressed.

First, the \(\tilde{\varvec{\alpha }}\) provided by the NPC method was used as an (external) statistically consistent estimator of examinees' attribute profiles \(\varvec{\alpha }\) (i.e., their proficiency class memberships) for initializing JMLE in the computational experiments reported here. At present, NPC appears to be the only method available for obtaining \(\tilde{\varvec{\alpha }}\). But what are the specific conditions under which \(\tilde{\varvec{\alpha }}\) from NPC is guaranteed to be a statistically consistent estimator of \(\varvec{\alpha }\)? Wang and Douglas (2015) proved that the NPC method guarantees the statistical consistency of \(\tilde{\varvec{\alpha }}\) for any DCM, provided the probability of a correct response is greater than 0.5 for examinees who master all the required attributes and less than 0.5 for examinees who do not.
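To give a concrete sense of how such an estimator can be obtained, the following minimal sketch implements a simple nonparametric classifier in the spirit of the NPC method: each examinee is assigned to the attribute profile whose conjunctive (DINA-type) ideal response pattern is closest to the observed responses in Hamming distance. This is a simplified illustration under stated assumptions, not a reimplementation of the published method, which also accommodates other ideal response rules and weighted distances.

```python
import numpy as np
from itertools import product

def npc_classify(Y, Q):
    """Assign each examinee to the attribute profile whose conjunctive ideal
    response pattern is closest to the observed responses in Hamming distance.
    Y : (N, J) binary responses; Q : (J, K) binary Q-matrix."""
    Y, Q = np.asarray(Y, dtype=int), np.asarray(Q, dtype=int)
    K = Q.shape[1]
    profiles = np.array(list(product((0, 1), repeat=K)))                      # (2^K, K)
    # ideal response = 1 iff all attributes required by the item are mastered
    ideal = np.all(profiles[:, None, :] >= Q[None, :, :], axis=2).astype(int)  # (2^K, J)
    dist = np.abs(Y[:, None, :] - ideal[None, :, :]).sum(axis=2)               # (N, 2^K)
    return profiles[dist.argmin(axis=1)]                                       # (N, K)
```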

Second, to increase the numerical accuracy of the estimates, the JMLE estimation procedure can be iterated (an option that was used in the simulations reported earlier). The item parameter estimates obtained initially are used in a second stage for re-estimating examinees' attribute profiles by maximizing \(\ln L \big ( \varvec{\alpha }_1, \varvec{\alpha }_2, \ldots , \varvec{\alpha }_N; {\mathbf {Y}}, (\tilde{\varvec{\beta }}_1, \tilde{\varvec{\beta }}_2, \ldots , \tilde{\varvec{\beta }}_J) \big )\). The updated attribute profile estimates \(\tilde{\varvec{\alpha }}\) are then used as input for re-estimating the item parameters \(\tilde{\varvec{\beta }}_j\), and so on. These steps can be repeated until the estimates stabilize (a brief code sketch of this iteration scheme follows the stopping criteria below). The convergence of the estimation can be monitored by the relative likelihood change

$$\begin{aligned} \frac{\ln L^{t} - \ln L^{t-1}}{\ln L^{t-1}}, \end{aligned}$$

where t and \(t-1\) refer to consecutive iterations. For the estimates of examinees' attribute profiles, a viable stopping criterion is

$$\begin{aligned} \frac{1}{N}\sum _{i=1}^N I \big [ \tilde{\varvec{\alpha }}_i^{(t)}=\tilde{\varvec{\alpha }}_i^{(t-1)} \big ] \ge 0.99 \end{aligned}$$

(\(I[\cdot ]\) denotes the indicator function). For the item parameter estimates, consider (in generic \(\theta \)-notation)

$$\begin{aligned} \max _{j = 1}^J \Big \{ \big | \tilde{\varvec{\theta }}_j^{(t)}-\tilde{\varvec{\theta }}_j^{(t-1)} \big | \Big \} \le 0.001. \end{aligned}$$
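The following minimal sketch shows the resulting iteration skeleton under these stopping criteria; the callables `estimate_alpha`, `estimate_beta`, and `log_lik` are placeholders for the two maximization steps and the joint log-likelihood described above, not implementations taken from the paper.

```python
import numpy as np

def iterate_jmle(Y, Q, alpha, beta, estimate_alpha, estimate_beta, log_lik,
                 max_iter=100, tol_alpha=0.99, tol_beta=0.001):
    """Alternate re-estimation of attribute profiles and item parameters until
    the agreement and parameter-change criteria in the text are satisfied."""
    ll_old = log_lik(Y, Q, alpha, beta)
    for _ in range(max_iter):
        alpha_new = estimate_alpha(Y, Q, beta)      # maximize lnL over attribute profiles
        beta_new = estimate_beta(Y, Q, alpha_new)   # maximize lnL over item parameters
        ll_new = log_lik(Y, Q, alpha_new, beta_new)
        rel_change = (ll_new - ll_old) / ll_old     # relative likelihood change (for monitoring)
        agree = np.mean(np.all(alpha_new == alpha, axis=1))                       # criterion: >= 0.99
        max_change = np.max(np.abs(np.asarray(beta_new) - np.asarray(beta)))      # criterion: <= 0.001
        alpha, beta, ll_old = alpha_new, beta_new, ll_new
        if agree >= tol_alpha and max_change <= tol_beta:
            break
    return alpha, beta
```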

One should note that iterations are not required to guarantee the consistency of the JMLE estimators; the sole purpose of iterating the algorithm is to improve the numerical accuracy of the estimates. An interesting question in this context is whether the consistency property of the parameter estimators is preserved during the iterations. Theorem 4.2 in Junker (1991) suggests that this is the case.

Third, what are the current computational options for educational researchers and practitioners who wish to use the LCDM in their testing programs and empirical research? Of course, writing code from scratch is always an option. If a researcher wants to use MCMC, then he or she can turn, for example, to OpenBUGS (Lunn, Spiegelhalter, Thomas, & Best, 2009). Alternatively, a user can opt for a commercial package that offers an implementation of the EM algorithm for fitting (constrained) latent class models, such as Latent GOLD (Vermunt & Magidson, 2000) or Mplus (Muthén & Muthén, 1998–2012). (For details on how to use Mplus for fitting the LCDM, consult the tutorial by Templin & Hoffman, 2013.) In light of the considerable amounts of CPU time often required by MCMC and MMLE-EM, researchers and educational practitioners might indeed consider JMLE a viable computational alternative for estimating the item parameters of DCMs and examinees' attribute profiles.