The Graded Response Model (Samejima 1969) is a psychometric generalized linear mixed model to infer a continuously distributed latent trait from a set of ordinal item scores. In psychology, the model is particularly useful in the field of personality (e.g., Fraley, Waller, & Brennan, 2000; Emons, Meijer, & Denollet, 2007) where questionnaires are administered using items with Likert-type scales.

The GRM can be derived by assuming that the ordinal scores on item i have risen from categorization of an underlying normally distributed variable, Z i , at increasing thresholds. This variable is modeled as a linear function of the latent trait, θ. The category response functions—which specify the relation between the probability of a response and θ—are then given by integrating over the conditional density function of Z i|θ between the corresponding thresholds (see Takane & de Leeuw, 1987; Wirth & Edwards, 2007). In the present article, we focus on how violations of the normality assumption affect the model. We will present the Heteroscedastic GRM with Skewed Latent Trait which is an extension of the traditional GRM that enables specific tests on this assumption. These tests are useful both in the statistical and the substantive setting.

1 The Normality Assumption for Z i

The assumption of a normal distribution for Z i is pragmatic but testable (see, e.g., Jöreskog 2002, p. 13, for an unconditional test on the bivariate marginal distribution of Z i ). The normal distribution can be replaced by a logistic distribution (Birnbaum 1968) as an approximation. The shape of the category response functions in the GRM depends on Z i|θ . Thus, using a normal distribution for Z i implies a normal distribution for Z i|θ which results in the normal-ogive GRM. On the contrary, using a logistic distribution for Z i implies a logistic distribution for Z i|θ which results in the logistic GRM (see Samejima 1969). Note that the distribution of θ does not affect the category response function. For example, if the distribution of θ is logistic and the distribution of Z i|θ is normal, the category response functions are still characterized by a normal-ogive function while Z i is non-normally distributed (due to the non-normal θ).

Central to both the logistic and normal distribution is that they are symmetric, implying a symmetric distribution for Z i|θ , which results in symmetric category response functions. As a result, the category response functions increase and decrease at the same rate to their upper and lower limits, respectively, i.e., plotting the probability of answering in category c as a function of the underlying trait, θ, will result in symmetric functions for all c. This property of generalized (non-)linear (mixed) models has been questioned in general with respect to binary data (e.g., Czado & Santner, 1992; Chen, Dey, & Shao, 1999), and specifically in item response theory (IRT). Notably, Samejima (1997, 2000, 2008) pointed out that the assumption of a symmetrical distribution for Z i|θ can lead to an inconsistent relation between the category location parameters, β ic and estimated θ (see Samejima, 1997, 2000). As a solution, Samejima (2000, 2008) proposed the logistic positive exponent family that incorporates an acceleration parameter within the logistic GRM, which allows the category response functions to be asymmetric. In addition, Bazán, Branco, and Bolfarine (2006) (see also Bazán, Bolfarine, & Branco, 2004) obtained asymmetric category response functions by using a skewed normal distribution function for Z i|θ . In applying the model to a dataset on mathematics ability and on weight perception, Bazán et al. found that a model with a skewed distribution function fitted better than a model with the traditional normal distribution function. Finally, Ramsay and Abrahamowicz (1989) proposed using a spline function for the category response functions.

Past research as described above focused on asymmetric category response functions due to skewness in the conditional distribution of Z i|θ . We argue that asymmetry in the category response functions can also arise if the distribution of Z i|θ is perfectly symmetric. Specifically, we show that when the errors of the underlying regression of Z i on θ are heteroscedastic, the distribution of Z i|θ is symmetric, but the category response functions are not. As this effect will make the marginal distribution of Z i skewed, we also take into account that the marginal distribution could be asymmetric due to a skewed distribution of θ. However, this will not result in asymmetric category response functions.

The heteroscedastic GRM with a skewed latent trait—as presented in this paper—can be used to test for specific basis for the departure from normality in the marginal distribution of Z i ; i.e., heteroscedasticity or a skewed latent trait. The basis of the departure may be of interest for both statistical and substantive reasons.

1.1 Statistical Point of View

1.1.1 Parameter Bias

In the IRT literature, it has been shown that the presence of a non-normal latent trait will bias the parameter estimates of the discrimination parameters (Azevedo, Bolfarine, & Andrade, 2011), the item category parameters (Zwinderman & van der Wollenberg, 1990), and the ability estimates (Seong 1990; Ree 1979; Swaminathan & Gifford, 1983). However, some authors found no bias (Stone 1992) which Kirisci, Hsu, and Yu (2001) attributed to the different simulation conditions used in the different studies. It was concluded that for decreasing test length and decreasing sample size, bias on the item and person parameters increase (see Kirisci et al. 2001).

With respect to heteroscedasticity in IRT, little is known as—interestingly—the subject of heteroscedasticity has not yet been investigated within this field. Heteroscedasticity has been investigated in depth in other fields including (generalized) linear regression (e.g., Agresti 2002, p. 151; Long & Ervin, 2000), (repeated measures) analysis of (co)variance (e.g., Keselman & Lix, 1997; Rochon 1992), and factor analysis (e.g., Bollen 1996; Meijer & Mooijaart, 1996; Hessen & Dolan, 2009).

As heteroscedasticity has not yet been investigated in the GRM, it is unclear whether the presence of heteroscedastic error variances will bias either item and/or person parameter estimates. We suspected that this would be the case given that heteroscedasticity has a similar effect on the distribution of Z i compared to a non-normal latent trait. We conducted a small simulation study to investigate the effect of heteroscedasticity in the GRM on the parameter estimates. As it is not the main focus of the present paper, results are presented in the appendix. From the results it is clear that for increasing effect sizes item parameters get more biased (in the most extreme case, the root mean squared differences between the parameter estimates and the true values exceeded twice the values under the tradition GRM).

Thus, we think that, from a statistical point of view, testing for heteroscedasticity and non-normality in the latent trait within the GRM is valuable, at least to compare the results to those obtained from the traditional GRM. If the effects of heteroscedasticity and non-normality of the trait are small, and item parameters are highly similar to those obtained using the traditional GRM, one can safely rely on the traditional GRM. However, when differences are large, and effects appear to be large, it might be better to take the effects of heteroscedasticity and/or non-normality in the trait into account, which can be done using the models outlined in this paper.

1.1.2 Consistency of Estimated θ

As Samejima (1997, 2000) pointed out, estimating θ in the normal-ogive model using symmetric category response curves will result in an inconsistent relation between category location parameters, β ic , and estimated θ. For simplicity, consider two items that follow a 1-parameter normal-ogive model with category location parameter β 1 for item 1 and β 2 for item 2. Assume that item 2 is more difficult than item 1, i.e., β 1<β 2. Within this model Samejima (1997) showed that for subjects with \(\theta < \operatorname{Mean}(\theta)\), answering the more difficult item 2 correctly is given more credit, while for subjects with \(\theta > \operatorname{Mean}(\theta)\) answering the easier item 1 correctly is given more credit. Stated differently, for \(\theta < \operatorname{Mean}(\theta)\) answering a more difficult item correctly will result in a higher estimate of θ compared to answering an easier item correctly. In addition, for \(\theta > \operatorname{Mean}(\theta)\) answering an easier item correctly will result in a higher estimate of θ compared to answering a more difficult item correctly. When category response functions are asymmetric, this inconsistency can be overcome. For a detailed discussion, see Samejima (1997) for the 2-parameter normal-ogive model, and Samejima (2000) for the GRM. For the present undertaking, it thus seems reasonable to take the possibility of asymmetric category response functions into account, to address this inconsistency.

1.1.3 Advantage over Existing Tests

When testing for heteroscedasticity and/or for non-normality in θ, an omnibus test is less sensitive than a more specifically focused test. In an omnibus test, such as the standard tests in LISREL, see Jöreskog (2002, p. 13), or the test proposed by Muthén and Hofacker (1988), no distinction is made between non-normality due to heteroscedastic errors and/or due to a non-normal trait. We see three reasons that specific tests (on the trait and the errors) have added value relative to the omnibus tests on normality. First, omnibus tests suggest no solution when a violation is found, whereas the models in this paper can be used to take the non-normality into account. Second, it is important to distinguish between departures from non-normal Z i due to heteroscedasticity, and departures due to a non-normal trait. Heteroscedasticity will make the category response functions asymmetric similarly to Samejima (1997, 2000, 2008), but a non-normal trait will not influence the category response functions. Third, the effects of heteroscedasticity and the non-normal trait could be in opposite directions, causing the marginal distribution of Z i to be reasonably normal while homoscedasticity and normality of the trait are violated.

1.1.4 Non-linear Moderated Factor Analysis and Differential Item Functioning

In non-linear moderated factor analysis (Bauer & Hussong, 2009; see also Neale 1998; Neale, Aggen, Maes, Kubarych, & Schmitt, 2006; Molenaar, Dolan, Wicherts, & van der Maas, 2010b), the parameters in a given latent variable model are a function of an external variable (e.g., age), which may moderate these parameters. These models have important applications as models for differential item functioning (DIF; Mellenbergh 1989) with respect to a continuous background variable (see Neale et al. 2006; and Bauer & Hussong, 2009). We note that heteroscedasticity in the GRM, as formulated below, may be attributable to DIF. For instance, suppose that the error variance in the regression of a given item on the latent trait varies with the continuous background variable that is correlated with the latent trait. This heteroscedasticity may be detected and interpreted as a manifestation of DIF by conditioning on the background variable (as in Neale et al. 2006). In the unconditional application of our GRM, such DIF would give rise to heteroscedasticity in the regression of the item on the latent trait.

1.2 Substantive Point of View

Finding that a set of items that purport to measure the same underlying latent trait is associated with heteroscedasticity across the trait can have interesting implications for the theory of that trait. This applies equally well to the finding that a given latent trait has a skewed distribution. We consider three examples, which we discuss shortly below.

1.2.1 Schematicity

In personality research, schematicity refers to the hypothesis that people differ in the certainty and accuracy with which they answer to self-report measures because of differences in their cognitive structures concerned with processing information about the self (Markus 1977; Rogers, Kuiper & Kirker, 1977; Tellegen 1988). Research indicates that being highly schematic (i.e., having strong cognitive structures about the self) is associated with an extreme position on the trait (Markus 1977). Note that this phenomenon implies heteroscedastic errors in the GRM, i.e., conditional variances decrease across the trait. We are not aware of any latent variable approaches to modeling schematicity. However, using the present model this is possible, as we illustrate later.

1.2.2 Gene-by-Environment Interaction

In behavior genetics, a gene-by-environment interaction (GxE) will result in heteroscedasticity of the environment factors with respect to the genetic factors (Jinks & Fulker, 1970). Within the linear factor model, van der Sluis, Dolan, Neale, Boomsma, and Posthuma (2006) propose testing for heteroscedastic errors as a test of GxE, i.e., the variance conditional on the genetic factor increases or decreases across that factor (see also Molenaar, van der Sluis, Boomsma, & Dolan, 2012). This approach is, however, limited to continuous factor indicators (e.g., subtest scores). The GRM proposed in this paper can be used to investigate GxE at the item level.

1.2.3 Ability Differentiation

This hypothesis from intelligence research refers to the claim that the general intelligence factor is not equally strong across its range (Spearman 1927; Tucker-Drob 2009). This phenomenon could be attributed to heteroscedastic errors, i.e., increasing conditional variances across the intelligence factor (Hessen & Dolan, 2009), non-linear factor loadings (Tucker-Drob 2009) or a skewed ability factor (Molenaar, Dolan, & van der Maas, 2011). All previous studies have focused on the subtest level; however, using the heteroscedastic GRM with non-normal trait enables the investigation of the phenomenon at the item level.

In conclusion, the heteroscedastic GRM with non-normal trait can be used to test statistical and substantive hypotheses. The outline of the present paper is as follows: First, we present the derivation of the traditional GRM using the underlying Z i . Next we discuss how category response functions can be skewed. Then, we extend the GRM to incorporate heteroscedastic errors and a skewed θ. Next, we present a simulation study to show the viability of the model and to investigate the power to detect both effects. Finally, we apply the model to a questionnaire measuring affect, and to a dataset on alexithymia to test the schematicity hypothesis.

1.3 The Graded Response Model

In the GRM, it is assumed that a normally distributed variable Z i underlies Y i , the observed scores on item i. To enable inferences about a latent trait θ, Z i is regressed on θ resulting in

$$ Z_{i} = \upsilon_{i} + \lambda_{i}\theta + \varepsilon_{i}, $$
(1)

where υ i is the intercept, λ i is the regression weight or factor loading, and ε i is the error term.Footnote 1 The marginal distribution of Z i , g(.), is given by

$$ g(Z_{i}) = \int_{ - \infty}^{\infty} f( Z_{i|\theta} )h(\theta)\,d\theta, $$
(2)

where h(.) is the density function for θ, and Z i|θ is the score on Z i conditional on θ with conditional density function f(.) and

$$ E(Z_{i|\theta} ) = \upsilon_{i} + \lambda_{i} \theta $$
(3)

and

$$ \operatorname{Var}(Z_{i|\theta} ) = \sigma_{\varepsilon i}^{2}. $$
(4)

The probability of a subject with a given θ answering in category c on item i is given by

$$ P(Y_{i|\theta} = c) = \int_{\tau _{ic}}^{\tau _{i(c + 1)}} f(Z_{i|\theta} )\,dZ_{i|\theta} \quad \mbox{for }c = 0, \dots, C_{i}-1, $$
(5)

where Y i|θ is the observed score on item i conditional on θ,C i is the number of answer categories of item i,τ i0=−∞, and τ i(Ci−1)=∞. Equations (3), (4), and (5) give

$$ P(Y_{i|\theta} = c) = F\biggl( \frac{\upsilon_{i} + \lambda_{i}\theta - \tau_{ic}}{\sigma_{\varepsilon i}} \biggr) - F\biggl( \frac{\upsilon_{i} + \lambda_{i}\theta - \tau_{i(c + 1)}}{\sigma_{\varepsilon i}} \biggr), $$
(6)

where F(.) is the conditional distribution function of Z i|θ . Substituting \(\upsilon_{i} = 0,\alpha_{i} = \frac{\lambda_{i}}{\sigma_{\varepsilon i}}\), and \(\beta_{ic} = - \frac{\tau_{ic}}{\sigma_{\varepsilon i}}\), we obtain

$$ P(Y_{i|\theta} = c) = F( \alpha_{i}\theta + \beta_{ic} ) - F( \alpha_{i}\theta + \beta_{i(c + 1)} ) $$
(7)

which is the normal-ogive GRM with item scale parameter α i and category location parameter β ic ; see Takane and de Leeuw (1987), McDonald (1999).

2 Asymmetric Z i

In Equations (1) and (2), it holds that if θ and ε i are both normally distributed, Z i is normally distributed by convolution. Conversely, Cramér’s theorem (Cramér 1937) ensures that when Z i is normally distributed, θ is normally distributed and Z i|θ (i.e., ε i ) is normally distributed.Footnote 2 Additionally, for a normal distribution for Z i to hold, Z i|θ should be homoscedastic (i.e., constant across θ; see Meijer & Mooijaart, 1996; Hessen & Dolan, 2009). See Figure 1 for a graphical representation for the case where C=3.

Figure 1.
figure 1

Graphical representation of the relation between P(Y i|θ =c) and Z i|θ for C=3. In the figure, only three of the infinite conditional distributions are depicted. N(.) refers to a normal density.

Thus, if we accept the linear relation between Z i and θ in Equation (1), then any violation of symmetry in the marginal distribution of Z i is caused by the following not mutually exclusive causes:

  1. 1.

    \(\sigma_{\varepsilon i}^{2}\) is heteroscedastic (i.e., variable across θ), see Figure 2(b).

    Figure 2.
    figure 2

    Graphical representation of the causes of non-normality in Z i : (a) traditional model (see also Figure 1), (b) heteroscedastic \(\sigma_{\varepsilon i}^{2}\), (c) skewed θ (d) skewed Z i|θ . For clarity, effect sizes are exaggerated.

  2. 2.

    The distribution of θ is skewed, see Figure 2(c).

  3. 3.

    The distribution of Z i|θ is skewed, see Figure 2(d).

Existing approaches by Samejima (1997, 2000, 2008), Bazán et al. (2006), and Ramsay and Abrahamowicz (1989), as discussed above, focus exclusively on cause 3. However, cause 1 will also result in asymmetric category response functions. Thus, if normality of Z i|θ is not rejected by one of the existing approaches, category response functions can still be asymmetric due to heteroscedastic \(\sigma_{\varepsilon i}^{2}\). In addition, non-normality can arise in Z i due to a skewed θ.

Specific tests on the violations as presented above are developed within the framework of factor analysis for continuous variables; Hessen and Dolan (2009) proposed a one-factor model to test for heteroscedastic errors, and Molenaar, Dolan, and Verhelst (2010a) proposed a one-factor model to test for heteroscedastic errors, non-linear factor loadings, and/or a non-normal factor distribution. In the framework of factor analysis for ordinal variables, including the GRM, such specific tests have received relatively little attention. Effort has been invested in approximating the distribution of θ using mixture distributions (Muthén & Muthén, 2007; Vermunt 2004; Vermunt & Hagenaars, 2004; Schmitt, Mehta, Aggen, Kubarych, & Neale, 2006), extensions of the normal distribution (van den Oord, 2005; Verhelst 2009; Azevedo et al., 2011), and splines (Woods 2007). However, we are not aware of any work on heteroscedastic errors in the GRM.

3 Extending the GRM

We extended the GRM to include heteroscedastic \(\sigma_{\varepsilon i}^{2}\) and a skewed θ. We present the model development in factor analysis notation, as conceptualization of the GRM in terms of an underlying Z i variable originates in this field (Wirth & Edwards, 2007). In addition, within factor analysis, heteroscedasticity has already been studied to some degree, so that our presentation connects better with the existing literature. Still, translating the final result to IRT notation is straightforward (using the results in Takane & de Leeuw, 1987, as motivated above).

3.1 Heteroscedastic Errors

In Equations (2), (3), and (4), \(\sigma_{\varepsilon i}^{2}\) are assumed to be homoscedastic. Thus, we propose to account for heteroscedasticity by making \(\sigma_{\varepsilon i}^{2}\) a function of θ, i.e.,

$$\sigma_{\varepsilon i|\theta}^{2} = k(\theta; \boldsymbol{\delta} _{i}), $$

where δ i is a vector of parameters with elements δ i0,…,δ ir , and k(.) is any function that is strictly positive. Within the framework of linear factor analysis for continuous variables, Hessen and Dolan (2009) used an exponential function for k(.), i.e.,

$$\sigma_{\varepsilon i|\theta}^{2} = \exp(\delta_{i0} + \delta_{i1}\theta + \delta_{i2}\theta^{2} + \cdots + \delta_{ir}\theta_{r}). $$

Thus, \(\log(\sigma_{\varepsilon i}^{2})\) is modeled as an rth degree polynomial function of θ. Bauer and Hussong (2009) used a similar function to model the relation between \(\sigma_{\varepsilon i}^{2}\) and a moderator variable. Generally, as a simple test on heteroscedasticity it will suffice to consider the minimal heteroscedasticity model, i.e., r=1,

$$ \sigma_{\varepsilon i|\theta}^{2} = \exp(\delta_{i0} + \delta_{i1}\theta), $$
(8)

with baseline parameter δ i0∈(−∞,∞) and heteroscedasticity parameter δ i1∈(−∞,∞). When δ i1=0, error variances are homoscedastic. When δ i1>0 error variances are increasing for increasing levels of θ, and when δ i1<0 error variances are decreasing for increasing levels if θ. Note that, in principle, in the homoscedastic case of the GRM (i.e., δ i1=0) and the heteroscedastic case of the GRM (i.e., δ i1≠0) all parameters are interpreted the same (the thresholds, factor loadings and intercepts), except for δ i0. In the homoscedastic model, δ i0 represents \(\log(\sigma_{\varepsilon i}^{2})\) for all θ’s, while in the heteroscedastic model it represents \(\log(\sigma_{\varepsilon i|\theta}^{2})\) for θ=0.

The function in Equation (8) is not generally suitable within the GRM. To see this, consider the top graph in Figure 3 in which P(Y i =c|θ) from Equation (6) is plotted for C i =5, λ i =1, υ i =0, τ ic ={−3,−1,1,3}, −5≤θ≤5, and \(\sigma_{\varepsilon i|\theta}^{2} = \exp( 0.5 + 0.5 \times \theta)\). It appears that the heteroscedastic errors underlying the distribution of Z i make the category response curves skewed. In the bottom graph of Figure 3, the behavior of the same category response functions is shown for −5≤θ≤50. Clearly, the category response functions for c=0 and c=4 (first and last category) behave undesirably as both have an upper limit of 0.5. How this happens is clear from the category response functions in Equation (6), e.g., for c=0

$$P(Y_{i|\theta} = 0) = 1 - \varPhi \biggl( \frac{\upsilon_{i} + \lambda_{i}\theta - \tau_{i(c + 1)}}{\exp (\delta_{i0} + \delta_{i1}\theta ) ^{\frac{1}{2}}} \biggr) $$

because τ i0=−∞. From this equation it can be seen that when θ increases, the denominator of the fraction in the normal distribution function, Φ(.), accelerates more than the numerator (as the denominator involves an exponential function of θ and the numerator a linear function of θ). Therefore, for increasing θ the fraction approaches 0 causing P(Y i|θ =c) to approach 1−Φ(0)=0.5 in the limit. Key of the problem is that category 0 has only one effective threshold (τ i1) as the other threshold (τ i0) equals minus infinity. Therefore, a similar problem occurs for c=4 (i.e., the upper category), but not for the intermediate categories as these categories have thresholds that do not involve (minus) infinity. The problem outlined in Figure 3 appears to be occurring on a range of θ that will generally fall outside the observed data. However, when the degree of heteroscedasticity increases, the category response curves of category 0 and 4 will approach 0.5 in the more reasonable ranges for θ as well.

Figure 3.
figure 3

Heteroscedastic item category curves for −5≤θ≤5 (top) and −5≤θ≤50 (bottom).

We therefore propose an alternative function for k(.) to link \(\sigma_{\varepsilon i}^{2}\) and θ, i.e.,

$$ \sigma _{\varepsilon i|\theta }^2=\frac{2\delta _{i0}}{1+\exp(-\delta _{i1}\frac{\theta -E(\theta )}{\mathit{SD}(\theta )})} $$
(9)

with baseline parameter δ i0∈[0,∞) and heteroscedasticity parameter δ i1∈(−∞,∞).

Note that when δ i1>0 and \(\theta \rightarrow \infty \Rightarrow \sigma_{\varepsilon i|\theta}^{2} \rightarrow \delta_{0}\), i.e., the function has an upper bound which prevents the category functions of c=0 and c=C−1 to approach 0.5. In addition, when \(\theta \rightarrow - \infty \Rightarrow \sigma_{\varepsilon i|\theta}^{2} \rightarrow 0\), i.e., the function has a lower bound of 0. Similarly when δ i1<0 and \(\theta \rightarrow - \infty \Rightarrow \sigma_{\varepsilon i|\theta}^{2} \rightarrow \delta_{i0}\), and when \(\theta \rightarrow - \infty \Rightarrow \sigma_{\varepsilon i|\theta}^{2} \rightarrow 0\). In addition, δ i1 is a heteroscedasticity parameter, i.e., when \(\delta_{i1} = 0 \Rightarrow \sigma_{\varepsilon i|\theta}^{2} = \delta_{i0}\).

3.2 Skewed Latent Trait

Within the 2-parameter logistic IRT model, Azevedo et al. (2011) proposed the use of the skew-normal density function (Azzalini, 1985, 1986; Azzalini & Capatanio, 1999; Arnold, Beaver, Groeneveld & Meeker, 1993).Footnote 3 Note that within the field of latent variable modeling, this distribution is also used by Bazán et al. (2006) to model skewness in the category response function of a 2-parameter normal-ogive model (as discussed above) and the distribution is also used by Molenaar et al. (2010a) and Molenaar et al. (2011) to approximate the latent variable distribution in the linear factor model.

An appealing feature of the class of skew-normal distributions is that it includes the normal distribution as a special case, which makes it straightforward to test a given variable for normality. Specifically, if random variable η has a skew-normal distribution, its probability density function is given by

$$ h(\eta) = \frac{2}{\omega} \times\varPhi\biggl( \zeta\frac{\eta - \kappa}{ \omega} \biggr) \times\varphi\biggl( \frac{\eta - \kappa}{\omega} \biggr) $$
(10)

where κ is a location parameter, κ∈(−∞,∞), ω is a scale parameter, ω∈[0,∞), and ζ is a shape parameter, ζ∈(−∞,∞). From these parameters the expected value, variance, and skewness of η can be calculated using the so-called centered parameterization of the skew-normal distribution (see Azzalini 1985)

(11)
(12)

and

$$\operatorname{Skew}(\eta) = \frac{4 - \pi}{2}\frac{( \rho \sqrt{2/\pi} )^{3}}{( 1 - 2\rho^{2}/\pi )^{\frac{3}{2}}}, $$

with

$$\rho = \frac{\zeta}{\sqrt{1 + \zeta^{2}}}. $$

Note that in Equation (10), for ζ=0, h(.) reduces to the normal density function. Within the heteroscedastic GRM with skewed θ, the latent trait, θ, can still be interpreted as in the traditional GRM. That is, a high position on θ in the traditional GRM corresponds with a high value on θ in the extended GRM.

3.3 Identification

The present form of the model in Equation (6), is unidentified as both θ and Z i are unobserved variables. Identification of θ traditionally proceeds by restricting E(θ)=0, \(\operatorname{Var}(\theta) = 1\), and identification of Z i proceeds by restricting E(Z i )=0 and \(\operatorname{Var}(Z_{i}) = 1\). Because of Equation (13), these restrictions have the following implications:

$$\sigma_{\varepsilon i|\theta}^{2} = \operatorname{Var}(Z_{i}) - \lambda_{i}^{2} \times \operatorname{Var} ( \theta) = 1 - \lambda_{i}^{2} $$

and

$$\upsilon_{i} = E(Z_{i}) -\lambda_{i} \times E(\theta) = 0. $$

That is, intercepts are not in the model and the error variances are a deterministic function of the factor loadings. Thus, in the present model, we could not introduce heteroscedastic errors, as \(\sigma_{\varepsilon i}^{2}\) is not a free parameter. We therefore identify the scale of Z i by fixing two adjacent thresholds to some arbitrary values (see Mehta, Neale, & Flay, 2004). The scale of Z i is now identified because the distance between the fixed thresholds define one unit on the Z i scale. An advantage is that we can estimate \(\sigma_{\varepsilon i}^{2}\) and υ i for all i. We are still free to fix the scale of θ by either fixing \(\operatorname{Var}(\theta)\) to equal 1, or by fixing λ i to equal 1 for a given i. In the latter case we can estimate ω from Equation (10). In addition, in multi-group applications we can fix E(θ) to equal 0 in the first group and estimate κ from Equation (10) in the second group. However, in single group applications E(θ) (and thus κ) is fixed.

The restrictions presented above result in an identified model, as we verify in our simulation study (see below). In addition, in all applications of the model (among which are the simulations and illustrations that are presented below), using (marginal) maximum likelihood, the Hessian was positive definite, parameter recovery was satisfactory, and altering starting values did not result in a different solution with the same likelihood. We also note that within the linear one-factor model, we already established that heteroscedastic errors and a skew-normal distribution are simultaneously identified (see Molenaar et al., 2010a). Taking these results together, we are confident that the present model is identified.

3.4 Estimation

Equation (6), with \(\sigma_{\varepsilon i}^{2}\) given by Equation (9) and θ distributed with density function given by Equation (10), constitutes the heteroscedastic GRM with a skewed trait distribution. To fit this model to data, we propose a marginal maximum likelihood procedure (MML; Bock & Aitkin, 1981). That is, we maximize the log-marginal likelihood function, i.e.,

$$ \ell(\boldsymbol{\gamma} |\mathbf{X}) = \sum_{p = 1}^{N} \log\int_{ - \infty}^{\infty} \prod _{i = 1}^{q} P(Y_{i|\theta} = x_{(pi)}) h( \theta)\,d\theta, $$
(13)

where X denotes the N×q matrix with item scores of N subjects on q items, x (pi) is element (p,i) from this matrix, and γ is the vector of parameters in the model. This vector includes: δ i0, δ i1, λ i , υ i , τ i2, τ i3, and ζ for all i=1,…,q. Note that κ and ω are not free parameters as these are fixed, such that E(θ)=0 and \(\operatorname{Var}(\theta) = 1\) according to Equations (11) and (12) as discussed above.

4 Simulation Study

In the present section, we present results of a small simulation study. The simulation study served three aims. First, we wanted to establish the viability of the model. Second, we wanted to investigate the statistical power to detect the different effects simultaneously and in isolation. Finally, we wanted to investigate the specificity of both effects, i.e., whether the models can distinguish between a dataset which is generated according to a skew-normal θ and a dataset which is generated according to heteroscedastic \(\sigma_{\varepsilon i}^{2}\).

4.1 Design

We simulated data according to the model and varied three aspects in the data: (i) The number of respondents: N=400 or N=800; (ii) the number of answer categories, C=3 or C=5; and (iii) the nature of the effect(s) in the data: no effects in the data, heteroscedastic \(\sigma_{\varepsilon i}^{2}\), skew-normal θ, or both. We simulated data for 10 items (i.e., q=10). In the case that one or both of the effects were present in the data, we chose ζ=2.17 [\(\operatorname{Skew}(\theta) = 0.5\)] and/or δ i1=0.4. All other parameters were fixed to λ i =1, υ i =0, δ i0=1.5, E(θ)=0, \(\operatorname{Var}(\theta) = 1\), and τ ic ={−2,−0.75,0.75,2} when C=5, and τ ic ={−0.75,0.75} when C=3. The models were identified by fixing E(θ)=0, \(\operatorname{Var}(\theta)= 1\), and τ i1=−2, τ i2=−0.75 (if C=5) and τ i1=−0.75, τ i2=0.75 (if C=3).

For each of the 16 conditions (2×2×4) we simulated 40 data sets and fitted four different models: the baseline model with no effects (i.e., ζ=0 and δ i1=0 for all i), a heteroscedasticity model (i.e., δ i1 is free for all i but ζ=0), a skewed-trait model (i.e., ζ is free but δ i1=0 for all i), and a full model with both effects simultaneously (i.e., ζ is free and δ i1 is free for all i).

4.2 Likelihood Ratio and Power Calculation

In testing for skewness in the distribution of θ and/or heteroscedasticity of the errors, we use the likelihood ratio test (LRT). In this test, H0 states that ζ=0 and/or δ i1=0 for some or all of the items, and HA states that ζ≠0 and/or δ i1≠0. The LRT statistic, T, is calculated as

$$ T=-2\times\bigl[ \ell(\hat{\boldsymbol{\gamma}_0 }|\mathbf{X})-\ell(\hat{\boldsymbol{\gamma}_4 }|\mathbf{X}) \bigr], $$
(14)

where \(\hat{\boldsymbol{\gamma }}_{0}\) is the estimated vector of parameters from Equation (13) subjected to the parameter restrictions of H0, and \(\hat{\boldsymbol{\gamma }}_{\mathrm{A}}\) is the vector of parameter estimates under HA. Under H0, T has a central-χ 2 distribution with degrees of freedom (df) equal to the number of restrictions in H0. Under HA, this statistic has a non-central χ 2 distribution with a non-centrality parameter that depends on the effect size and sample size. Two important conditions that need to be satisfied for the LRT statistic to approach these theoretical distributions are (i) that H0 is nested under HA and (ii) that the restrictions in H0 should not be on a boundary of the parameter space (Cramér 1946). As our likelihood ratio tests concern either fixing ζ with parameter space (−∞,∞) to 0, and/or fixing δ i1 with parameter space (−∞,∞) to 0, these regularity conditions are satisfied.

In the present simulations, we study the empirical power of the LRT to reject H0 of a normal θ and/or homoscedastic ε i . To this end, we calculated T from Equation (14) for each replication in the design. As T has a non-central χ 2 distribution under HA, it holds that (Fisher 1928):

$$E(T) = \mathrm{df} + \mathrm{ncp}. $$

Thus, the non-centrality parameter (ncp) of the LRT statistic, T, could empirically be approximated by subtracting the df from the average T across the replications in a given cell in the simulation study. As now the ncp of the distribution of T is known under HA, we can choose a nominal alpha level (α) and calculate the power to reject H0 by integrating over the non-central χ 2 distribution. Within covariance structure modeling, Satorra and Saris (1985), (see Molenaar, Dolan, & Wicherts, 2009, for an illustrative example) developed an analytical approach to calculate the ncp given that N is large, and that the misspecification under H0 is not severe. We use essentially the same approach as that of Satorra and Saris; however, because we have no analytical expression for the ncp, we therefore establish the ncp empirically from the data, as described above.

We used Gauss–Hermite quadratures with 50 quadrature points to approximate the integral in the log marginal likelihood function in Equation (13). We fitted the model using the freely available software package Mx (Neale, Boker, Xie, & Maes, 2002). A script to fit the heteroscedastic GRM with skewed θ is available from the website of the first author.

4.3 Results

In Figures 4 and 5, box plots of the parameter estimates are shown for the heteroscedasticity parameters, δ i1, and the skewness parameter, ζ, for the case that both effects were in the data and the full model was fitted to that data. As it appears, true parameter values are recovered quite well. Variability in the parameter estimates is highest when N=400 and C=3 but decreases with increasing N and increasing C.

Figure 4.
figure 4

Box plots of the parameter estimates of the heteroscedasticity parameters, δ i1 (in black; left axis) and the skewness parameter, ζ (in grey, right axis) for N=400 and respectively C=3 (top figure) and C=5 (bottom figure). The horizontal line represents the true value on both the δ i1 axis and the ζ axis.

Figure 5.
figure 5

Box plots of the parameter estimates of the heteroscedasticity parameters, δ i1 (in black; left axis) and the skewness parameter, ζ (in grey, right axis) for N=800 and respectively C=3 (top figure) and C=5 (bottom figure). The horizontal line represents the true value on both the δ i1 axis and the ζ axis.

In Table 1, power coefficients (α=0.05) are shown for the 16 conditions in the simulation study and for three classes of models. The first class of models is the full model, i.e., the heteroscedastic GRM with a skewed trait (abbreviated het-GRM-skew in Table 1), second is the heteroscedastic GRM (het-GRM; i.e., a GRM with heteroscedastic errors only), and third is the GRM with a skewed trait only (GRM-skew). Ideally, when a given effect is in the data (e.g., heteroscedastic \(\sigma_{\varepsilon i}^{2}\)) it is not detected by a misspecified model (i.e., the GRM-skew). Thus, power of the misspecified model should ideally be equal to 0.05 (i.e., equal to the Type I error probability specified a priori). If this is the case, the model is said to be highly specific, that is, it detects only the effects that are in the model and it does not pick up any additional effects. In Table 1, the power coefficients of the misspecified models are underlined and are desired to be close to the nominal alpha level. All other coefficients should be reasonably large.

Table 1. Power to detect heteroscedastic errors and skewness in the trait simultaneously and in isolation.

As can be seen from the table, all underlined values are close to or equal to 0.05 for the het-GRM-skew, which suggests that the two effects (heteroscedastic \(\sigma_{\varepsilon i}^{2}\) and skewed θ) are well separable within this model. For the het-GRM and GRM-skew, the underlined values are departing somewhat more from the 0.05 level (at most 0.31). This indicates that unmodeled skewness in the trait or heteroscedasticity in the errors will increase the false positive rate somewhat.

For N=400 and C=3, power values are around 0.4–0.5 which is judged to be low power (exception is the 0.74 power of the GRM-skew to detect the skewness in θ, which is already acceptable). For C=5, power is acceptable, as power coefficients vary around 0.7–0.8. An exception is the power of the het-GRM-skew to detect the skewed θ in the data. This situation is associated with a power of 0.46 which is still low. However, this is not surprising as the model has 11 extra parameters compared to the standard GRM of which only 1 is concerned with the skewed θ. Power is thus lowered because of the 10 parameters that remain unaffected. For N=800 power is judged to be acceptable-to-high as coefficients are around 0.8–1.0, with the larger power coefficients when C=5.

From Table 1 it also appears that when both effects are in the data, the het-GRM-skew outperforms both the het-GRM and GRM-skew in terms of power. For instance, in the cases that N=800 and C=3, power to detect the heteroscedastic errors is 0.72 and 0.67, respectively, in the het-GRM-skew, compared to 0.53 and 0.29, respectively, in the het-GRM. In addition, power of the het-GRM-skew to detect the skewness in θ is 0.84 compared to 0.25 in the GRM-skew. When only one effect is in the data but the het-GRM-skew is applied nevertheless (which includes both effects), power is somewhat smaller compared to the GRM-skew and het-GRM. For instance, when N=800 and C=3 and a skewed θ underlies the data, the het-GRM-skew has a power of 0.88 to detect the effect, while the GRM-skew has a power of 0.96. In addition, when heteroscedastic errors underlie the data, the het-GRM skew has power of 0.82 to detect this effect, and the het-GRM has a power of 0.88. In the case of one effect in the data, power is, thus, somewhat increased in the models with a single effect only. However, benefits are not that large as compared to when both effects are in the data and the het-GRM-skew is applied. It seems, thus, to be advisable that if one is interested in detecting one effect, e.g., heteroscedastic errors, it is safest to take the possibility of the skewed θ into account as well because of substantial power gains.

Results above depend highly on the present choice of effect sizes. However, we showed that given the chosen effect sizes power can be acceptable for reasonable sample sizes. Regardless of effect size, we can conclude that the model is viable and that the effects are highly specific and, therefore, well separable.

In the present simulation study, we only used effects on θ and \(\sigma_{\varepsilon i}^{2}\) that are in the same direction, i.e., they both make the distribution of Z i positively skewed. We did consider the case in which both effects were in opposite directions (specifically, ζ=−2.17 and δ i1=0.4), but this did not affect the results as presented above. This finding is not surprising as both effects are highly specific and, therefore, do not interfere with each other.

5 Illustration

In this section, we present two applications purported to test the schematicity hypothesis from personality research (Markus 1977; Rogers et al., 1977; Tellegen 1988). As discussed above, this hypothesis supposes that people who are low on a given personality trait have a less clear self-schemata, i.e., they are more ambiguous about their trait level compared to people high on that trait. In the literature, a similar phenomenon is the hypothesis that not all personality traits apply equally well to everybody (Allport, 1937; Baumeister & Tice, 1988). As discussed previously, the schematicity hypothesis predicts that people who are low on a given personality trait are less accurate in reporting their exact position on the answer scale of a given item from a questionnaire. On the item level, this prediction implies heteroscedasticity of the error variances in the GRM. In the applications below, we investigate the schematicity hypothesis by testing for heteroscedastic errors. We do also take the possibility of skewness in the trait into account as this might benefit the power to detect the heteroscedastic errors (as we showed in the simulation). The first application involves an investigation of the schematicity hypothesis in an affect questionnaire, and the second application involves an investigation of the schematicity hypothesis in an alexithymia questionnaire.

6 Application 1

6.1 Description of the Data

A total of 557 psychology freshman completed a Dutch version of the Positive Affect and Negative Affect scale (Guadagnoli & Mor, 1989). All items in these scales consist of words that describe a particular affect (e.g., wistful, desperate, happy, cheerful). Respondents are asked to report on a 5-point Likert scale to what degree they experienced each affect during past week. The questionnaire contains 40 items in total of which 20 purport to measure positive affect and 20 items purport to measure negative affect. Here we limit the analysis to the Negative Affect scale only.Footnote 4 The negative affect scale was judged to have acceptable fit to a discrete one-factor model (RMSEA=0.066).Footnote 5 We identified the model by fixing E(θ)=0, \(\operatorname{Var}(\theta)=1\), τ i0=−2, and τ i1=−0.75.

In the data analysis, we proceeded as follows: First, we fitted the full model (including the skew-normal distribution for θ and the heteroscedastic \(\sigma_{\varepsilon i}^{2}\)). From the full model, we dropped the skewness parameter, ζ, resulting in a model with heteroscedastic errors only. Next, from the full model, we dropped the heteroscedastic errors, resulting in a model with a skewed θ only. Finally, we fitted a model with no effects, i.e., the traditional GRM. To see which of these models fitted the data best, we considered various fit indices. First, we conducted the LRT between the model of interest and the full model as described above. Next, we consulted Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both the AIC and BIC are calculated on basis of the marginal likelihood, i.e., the incidental parameters, θ, are integrated out and are not included in these indices. For the AIC and BIC, a lower value suggests a better model fit. Note that these indices are not tests in a statistical sense (e.g., Konishi & Kitagawa, 2008), i.e., they are used to decide which model predicts future data best among competing models. For present purposes we adopt this model selection based strategy using the AIC, BIC and LRT. However, we note that a more thorough approach would be to test whether data are a true realization of the stochastic model under investigation (Schmueli 2010).

6.2 Results

Table 2 contains the parameter estimates of υ i , τ i2, τ i3, δ 0i , δ 1i , and ζ in the full model. From Table 3, it can be seen that dropping the skewness parameter, ζ, from the full model did not result in a significant LRT. However, dropping the heteroscedasticity parameters, δ i1, did result in a significant LRT. Thus, judged by the LRT we would choose a model with heteroscedastic errors only. The AIC provides the same conclusion; however, judged by the BIC the model with no effects should be preferred. Thus, results concerning the Negative Affect scale are mixed. The observation that the AIC and the BIC do not agree can be attributed to differences in the penalty that is used in these indices. BIC is known to favor more parsimonious models, while the AIC tends to favor less parsimonious models. We tend to favor the model without effects as this model is more parsimonious, i.e., we put more weight on the BIC. We can thus conclude that we did not find any support for the schematicity hypothesis with respect to self-reported negative affect.

Table 2. Parameter estimates of the full model in Application 1.
Table 3. Modeling results of Application 1.

7 Application 2

7.1 Description of the Data

A sample purported to be representative of the Dutch population (N=816) completed the Bermond–Vorst alexithymia questionnaire (Vorst & Bermond, 2001) at home on a computer.Footnote 6 Alexithymia is a personality trait that reflects the inability to understand, process, or describe emotions. The Bermond–Vorst alexithymia questionnaire consists of 6 subscales: Emotionalizing, Fantasizing, Identifying, Verbalizing, Cognitive Analyzing, and Affective Analyzing. Each subscale consists of 8 items measured on a 5-point Likert scale. An example of an item in the Identifying subscale is: “If I am nervous, it is unclear to me which emotion caused this feeling.”

We analyzed all subscales separately. A discrete one-factor model was judged to fit in an acceptable way to the Fantasizing (RMSEA=0.061), Identifying (RMSEA=0.073), Affective Analyzing (RMSEA=0.064), and Cognitive Analyzing (RMSEA=0.060) subscales. The subscales Emotionalizing (RMSEA=0.110) and Verbalizing (RMSEA=0.091) were judged to fit poor to a one-factor model and are, therefore, not included in the analysis. We identified the model by fixing E(θ)=0, \(\operatorname{Var}(\theta)=1\), τ i0=−2, and τ i1=−0.75.

7.2 Results

Table 4 contains the parameter estimates of υ i , τ i2, τ i3, δ 0i , δ 1i , and ζ in the full model for the six subscales. In Table 5, modeling results are given. We followed the same procedure as in Application 1. As shown in Table 5, for all subscales, the full model appears to fit best according to the AIC and BIC. See Figure 6 for the model implied category response functions for the Affective Analyzing subscale under the full model.

Figure 6.
figure 6

Model implied item category functions for the Affective Analyzing subscale of Application 2. Estimated heteroscedasticity parameters, δ i1, are given in the figure. Estimated baseline parameters, δ i0, are 0.35, 1.60, 0.24, 1.23, 0.28, 3.00, 0.34, and 0.67 for, respectively, items 1 to 8. See also Table 4. Category 0 is colored darkest. Items 2, 4, 6, and 8 are contra indicative items (i.e., λ i <0), as they are worded as such in the questionnaire.

Table 4. Parameter estimates of the full model in Application 2.
Table 5. Modeling results of Application 2.

Taken together, in contrast to Application 1, conclusions are relatively clear in the sense that we can conclude that for all 4 subscales analyzed, the distribution of Z i is characterized by both heteroscedastic errors and a skewed θ distribution. Additionally, we tested for equality of δ 1i to see whether we could model the heteroscedasticity with a single parameter for all items instead of an individual parameter for each item. We found that this equality restriction held for the Cognitive Analyzing subscale only (AIC reduced to 1267.99 and the BIC reduced to −14622.42). For the other subscales, δ 1i differed between items.

It can be seen in Table 4 that for the Affective Analyzing, Cognitive Analyzing, and Identifying subscales the heteroscedasticity is systematically in the same direction for almost all items.Footnote 7 That is, lower levels of θ are associated with higher error variance. Thus individuals with high levels of alexithymia are more consistent in their answers compared to individuals with lower levels of alexithymia. This finding is in line with the schematicity hypothesis, suggesting that high alexithymia individuals are less ambiguous about their psychological functioning in this regard because they are in little doubt concerning their lack of insight and knowledge of the emotions that they experience. Allport (1937) and Baumeister and Tice (1988) questioned whether personality traits do apply equally well to everybody. In perspective of this line of work, the present results suggest that for people that are well able to describe their emotions, the alexithymia trait applies less well (i.e., is a weaker source of individuals differences) compared to people that are less able to describe their emotion.

8 Discussion

In the present paper, we presented the Heteroscedastic Graded Response Model with a skewed latent trait—a unified model that extends the traditional Graded Response Model by adding heteroscedastic errors and a non-normal latent trait. In introducing heteroscedastic errors, we specified a parametric function between the trait and the error variances. As we showed, this choice is not straightforward, as the commonly used exponential function (Hessen & Dolan, 2009; Bauer & Hussong, 2009) gives undesirable results in the limits of the trait. We therefore adopted a logistic function, which was shown to be adequate. However, the choice for this function was pragmatic. The function could be replaced with any other function that behaves reasonably given the purposes of the researcher.

Although the logistic function performs well in the limits of the trait, it has some drawbacks. When effect sizes are very large (i.e., the logistic function tends to a step function), category response functions of some categories will show patterns that are not realistic from a psychological or measurement point of view. That is, category response curves will have more than one stationary point; for an example, see Figure 7. However, we stress that this occurs only given very large effect sizes, which we consider unrealistic. In addition, in applications of the model, this behavior of the category response function can easily be detected by inspecting relevant plots of the results. When a given item shows these patterns, a different function can be used for this item or all items. Or a non-parametric approach could be considered, as explained in the following.

Figure 7.
figure 7

Example of category response functions in case of an extreme effect size for the heteroscedasticity in the error variances. Parameters used to create the picture are: δ i0=1.5, δ i1=3, λ i =1, υ i =0, and τ ic ={−2,−0.75,0,0.75,2}.

An alternative to specifying a parametric function between the trait and the error variances is to consider a non-parametric approach. In this approach, heteroscedasticity is detected using a step function. An advantage is that the method is more flexible, i.e., highly specific forms of heteroscedasticity can be tested depending on the substantive theory and/or the aims of the researcher. However, the method will likely be associated with smaller power as compared to the parametric function approach, particularly as the number of steps in the step function increases. This remains to be investigated in depth.

With respect to the distribution of the trait, we considered the skew-normal distribution because, first, this distribution enables straightforward tests on normality; second, its statistical properties are well documented (e.g., Azzalini & Dalla Valle, 1996; Azzalini & Capatanio, 1999; Arnold et al., 1993; Arnold & Beaver, 2002; Chiogna, 2005; Monti, 2003); and, third, the distribution is applied in general (for examples, see Azzalini, 2005) and specifically with respect to psychometrics (e.g., Bazán et al. 2006; Molenaar et al. 2010b, 2011; Azevedo et al. 2011). However, our choice is based on pragmatic considerations; other distributions can be considered as well, e.g., mixture distributions (Muthén & Muthén, 2007; Vermunt 2004; Vermunt & Hagenaars, 2004; Schmitt et al. 2006), the shifted-log normal (Verhelst 2009), or distributions based on Johnson curves (van den Oord, 2005) or splines (Woods 2007).