1 Introduction

Polytomous item response theory (IRT; Lord 1980) models apply to tests whose items have more than two response categories. Polytomous responses may be nominal or ordinal. Ordinal polytomous responses, such as Likert scale items (Likert 1932), are widely used in many fields, including education, psychology, and marketing. This study focuses on the graded response model (GRM; Samejima 1969), the most widely used IRT model for polytomous response data (e.g., Forero and Maydeu-Olivares 2009; Rubio et al. 2007).

In many circumstances, multidimensional IRT (MIRT; Reckase 1997, 2009) models are adopted when multiple distinct traits are involved in producing the manifest responses to an item. A special case of the MIRT model applies when the instrument consists of several subscales, each measuring one latent trait, such as the Minnesota Multiphasic Personality Inventory (MMPI; Buchanan 1994). In the IRT literature, such a model is called the multi-unidimensional (Sheng and Wikle 2007) or simple structure MIRT (McDonald 1999) model, and it is the major focus of this study.

The multi-unidimensional GRM applies to situations where a K-item instrument consists of m subscales or dimensions, each containing \(k_{v}\) polytomous response items that measure one latent dimension. With a probit link, the probability that the ith person (i = 1, 2, …, N) responds in category c (c = 1, 2, …, \(C_{j}\)) of the jth item (j = 1, 2, …, K) is defined as

$$\displaystyle\begin{array}{rcl} P(Y _{vij} = c\vert \theta _{vi},\alpha _{vj},\boldsymbol{\delta }_{j})& =& \Phi (\alpha _{vj}\theta _{vi} -\delta _{j,c-1}) - \Phi (\alpha _{vj}\theta _{vi} -\delta _{j,c}) \\ & =& \int _{\delta _{j,c-1}}^{\delta _{j,c} }\phi (z;\alpha _{vj}\theta _{vi})dz, {}\end{array}$$
(1)

where \(\Phi (\cdot )\) is the standard normal CDF, \(\phi (z;\mu )\) is the normal PDF with mean \(\mu\) and unit variance evaluated at z, \(\alpha _{vj}\) and \(\theta _{vi}\) denote the item discrimination and the person’s latent trait in the vth dimension (v = 1, 2, …, m), and \(\delta _{j,c}\) denotes the item threshold parameter for the cth response category of item j (Samejima 1969). The thresholds satisfy

$$\displaystyle{ -\infty =\delta _{j,0} <\delta _{j,1} <\ldots <\delta _{j,C_{j}-1} <\delta _{j,C_{j}} = \infty. }$$
(2)
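As an illustration of Eqs. (1) and (2), the following minimal sketch computes the category probabilities for a single person and item. The function names `std_normal_cdf` and `grm_category_probs` and the numeric values are hypothetical choices made for this example; they are not taken from the study.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def grm_category_probs(theta, alpha, deltas):
    """Return [P(Y = 1), ..., P(Y = C_j)] for one person-item pair, Eq. (1).

    theta  : latent trait value of the person on the item's dimension
    alpha  : item discrimination
    deltas : the C_j - 1 finite thresholds, strictly increasing
    """
    deltas = list(deltas)
    if any(b <= a for a, b in zip(deltas, deltas[1:])):
        raise ValueError("thresholds must be strictly increasing, as in Eq. (2)")
    # Pad with delta_{j,0} = -inf and delta_{j,C_j} = +inf.
    padded = [-math.inf] + deltas + [math.inf]
    # Phi(alpha * theta - delta_c) for c = 0, ..., C_j (1 at c = 0, 0 at c = C_j).
    cum = [std_normal_cdf(alpha * theta - d) for d in padded]
    # P(Y = c) = Phi(alpha*theta - delta_{c-1}) - Phi(alpha*theta - delta_c).
    return [cum[c - 1] - cum[c] for c in range(1, len(padded))]


# Hypothetical three-category item with illustrative parameter values.
probs = grm_category_probs(theta=0.5, alpha=1.2, deltas=[-0.4, 0.9])
```

Because the padded thresholds run from −∞ to ∞, the differences telescope and the returned probabilities sum to one for any admissible set of thresholds.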

From a theoretical perspective, latent trait distributions in the IRT literature are often assumed to be normal. Consequently, common estimation methods, such as marginal maximum likelihood and Bayesian techniques, were developed under the assumption of normal latent traits. However, for some psychological instruments, such as depression and anxiety tests, the population latent traits may follow a non-normal distribution. Research has shown that violating the normality assumption may bias the estimates of IRT item and latent trait parameters (e.g., Sass et al. 2008; Reise and Revicki 2014). Studies have investigated item and person parameter recovery in estimating unidimensional dichotomous (e.g., Kirisci et al. 2001; Sass et al. 2008) and unidimensional multi-group dichotomous (e.g., Santo et al. 2013) models where the latent trait follows a non-normal distribution. However, little research has investigated parameter recovery in estimating multidimensional polytomous models in this regard.

In view of the above, this study investigates parameter recovery in estimating multi-unidimensional GRMs when latent traits are either normal or non-normal. Specifically, different distributions of person parameters are adopted, and the performance of the Hastings-within-Gibbs (HwG; Kuo and Sheng 2015) procedure in estimating item and person parameters is compared across them. The remainder of the paper is organized as follows. In Sect. 2, the HwG estimation is introduced. The simulation study is described and its results are discussed in Sect. 3. Finally, the conclusions of the study are summarized in Sect. 4.

2 Hastings-Within-Gibbs Estimation Procedure

Over the past two decades, fully Bayesian estimation has gained popularity due to improved computational efficiency. Markov chain Monte Carlo (MCMC) algorithms rest on two fundamental mechanisms: Gibbs sampling (Geman and Geman 1984) and Metropolis-Hastings (MH; Hastings 1970; Metropolis and Ulam 1949). Gibbs sampling is adopted when the full conditional distribution of each parameter can be derived in closed form. If any full conditional distribution cannot be obtained in closed form, MH can be used instead: a proposal (or candidate) distribution is chosen that depends on the current values of the parameters, a proposal value is generated from it, and that value is accepted into the Markov chain with a certain probability. Otherwise the chain retains its current value.
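The propose-and-accept mechanism of MH can be sketched as follows. This is a generic random-walk Metropolis sampler for a one-dimensional toy target, written only to illustrate the idea; it is not the HwG procedure of Cowles (1996) or Kuo and Sheng (2015). The function name `random_walk_mh`, the normal proposal, the step size, and the seed are all assumptions made for this example.

```python
import math
import random


def random_walk_mh(log_target, start, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a one-dimensional target.

    log_target returns the log of the (possibly unnormalized) target density.
    Returns the list of draws and the observed acceptance rate.
    """
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    draws, n_accepted = [], 0
    for _ in range(n_iter):
        # The proposal distribution is centered at the current value.
        proposal = rng.gauss(current, step)
        proposal_lp = log_target(proposal)
        # The normal proposal is symmetric, so the acceptance probability
        # reduces to min(1, target(proposal) / target(current)).
        log_ratio = proposal_lp - current_lp
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_lp = proposal, proposal_lp
            n_accepted += 1
        # A rejected proposal repeats the current value in the chain.
        draws.append(current)
    return draws, n_accepted / n_iter


# Toy target: a standard normal density known only up to a constant.
draws, accept_rate = random_walk_mh(lambda x: -0.5 * x * x, start=3.0, n_iter=20000)
```

In a Hastings-within-Gibbs scheme, a step of this kind would replace the direct draw only for those parameters whose full conditional distribution is not available in closed form, while the remaining parameters are still drawn by Gibbs sampling.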

Hastings-within-Gibbs (HwG) is a hybrid of Gibbs sampling and MH and has proved useful for complicated IRT models, such as GRMs. In the literature, Albert and Chib (1993) proposed a Gibbs sampler for the unidimensional GRM. Cowles (1996) proposed an HwG procedure that uses an MH step within the Gibbs sampler developed by Albert and Chib (1993) to sample the threshold parameters, which improves mixing and accelerates convergence. Kuo and Sheng (2015) extended Cowles’ approach to the more general multi-unidimensional GRM.

3 Simulation Study

To investigate parameter recovery of the HwG procedure when latent traits are not normal, a Monte Carlo simulation study was carried out with tests consisting of two subscales, so that the first half of the items measured one latent trait (\(\theta _{1}\)) and the second half measured the other (\(\theta _{2}\)).

3.1 Simulated Data

In the study, three factors were manipulated: sample size (N), test length (K), and intertrait correlation (ρ). The choice of N, K, and ρ was based on previous studies with similar models. For example, when investigating multidimensional GRMs, Fu et al. (2010) adopted N = 500, 1000, K = 10, 20, 30, and ρ = 0.1, 0.3, 0.5, 0.7, 0.9 for dichotomous items and N = 1000, K = 20, and ρ = 0.2, 0.4, 0.6, 0.8 for polytomous items involving three categories. Working with dichotomous multi-unidimensional models, Sheng (2008) adopted N = 1000, K = 18, and ρ = 0.2, 0.5, 0.8 in the simulation studies, while Sheng and Headrick (2012) adopted N = 1000, K = 10, and ρ = 0.2, 0.4, 0.6. Wollack et al. (2002) conducted simulation studies with nominal response models, and they observed that parameter recovery was improved by increasing the test length from 10 to 30 items but that increasing the test length from 20 to 30 items did not produce a noticeable difference. Consequently, in our study, N polytomous responses (N = 500, 1000) to K items (K = 20, 40) were generated according to the multi-unidimensional GRM, where the population correlation between the two latent traits (ρ) was set to 0.2, 0.5, or 0.8. Each item was measured on a Likert scale with three categories, so that two threshold parameters were estimated for each item. The item discrimination parameters \(\boldsymbol{\alpha }_{v}\) were generated randomly from a uniform distribution, \(\alpha _{vj} \sim U(0,2)\). The threshold parameters \(\delta _{j1}\) and \(\delta _{j2}\) were the sorted values of two random draws from a standard normal distribution, i.e., \(\delta _{j1} =\min (X_{1},X_{2})\) and \(\delta _{j2} =\max (X_{1},X_{2})\), where \(X_{1},X_{2} \sim N(0,1)\).

The person parameters of the first dimension (\(\boldsymbol{\theta }_{1}\)) and the second dimension (\(\boldsymbol{\theta }_{2}\)) were generated using the Method of Percentile (MOP; Koran et al. 2015) Power Method transformation. The MOP transformation was developed to generate multivariate distributions with specified values of the median, interdecile range, left-right tail-weight ratio (a skewness function), and tail-weight factor (a kurtosis function) for each distribution, along with specified pairwise correlations.

Using the MOP transformation, \(\boldsymbol{\theta }_{1}\) was generated from a standard normal distribution, and \(\boldsymbol{\theta }_{2}\) was generated from one of the following four distributions: (1) skewness = 0, kurtosis = 0 (Dist. 1), (2) skewness = 0, kurtosis = 25 (Dist. 2), (3) skewness = 2, kurtosis = 7 (Dist. 3), and (4) skewness = 3, kurtosis = 21 (Dist. 4). The correlation between \(\boldsymbol{\theta }_{1}\) and \(\boldsymbol{\theta }_{2}\) (i.e., the true intertrait correlation, ρ) was set to 0.2, 0.5, or 0.8. Note that the skewness and kurtosis listed for each of the four distributions are conventional values, which can be converted to left-right tail-weight ratios and tail-weight factors in order to implement the MOP transformation technique (see Koran et al. 2015).
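The data-generating design described above can be sketched for the normal case (Dist. 1) as follows. This is a simplified illustration under stated assumptions, not the code used in the study: the MOP transformation is not implemented, so both latent traits are drawn as correlated standard normal variates, and the function name `simulate_grm_data`, its arguments, and the seed are hypothetical. Responses are generated through the latent-variable form of Eq. (1), in which a normal variate with mean \(\alpha _{vj}\theta _{vi}\) and unit variance is compared with the item thresholds.

```python
import math
import random


def simulate_grm_data(n_persons=500, n_items=20, rho=0.5, seed=1):
    """Simulate three-category responses under a two-dimensional
    multi-unidimensional GRM with normal latent traits (Dist. 1 only)."""
    rng = random.Random(seed)
    half = n_items // 2
    # First half of the items measures dimension 0, second half dimension 1.
    dims = [0 if j < half else 1 for j in range(n_items)]
    # Discriminations from U(0, 2).
    alphas = [rng.uniform(0.0, 2.0) for _ in range(n_items)]
    # Two thresholds per item: sorted pair of standard normal draws.
    deltas = [sorted((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
              for _ in range(n_items)]
    # Correlated standard normal traits with population correlation rho.
    thetas = []
    for _ in range(n_persons):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        thetas.append((z1, rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
    # Response = 1 + number of thresholds below the latent variate.
    responses = []
    for theta in thetas:
        row = []
        for j in range(n_items):
            z = rng.gauss(alphas[j] * theta[dims[j]], 1.0)
            row.append(1 + sum(z > d for d in deltas[j]))
        responses.append(row)
    return alphas, deltas, thetas, responses


alphas, deltas, thetas, responses = simulate_grm_data(n_persons=500, n_items=20, rho=0.5)
```

Reproducing Dists. 2 to 4 would require replacing the normal draws for the second trait with the MOP transformation of Koran et al. (2015), which is outside the scope of this sketch.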

Harwell et al. (1996) suggested that a minimum of 25 replications is needed for Monte Carlo studies in IRT-based research to obtain adequate accuracy. Therefore, this study carried out 25 replications for each scenario, and root-mean-squared differences (RMSDs) and bias were used to evaluate the recovery of each item parameter. Let π denote the true value of a parameter (e.g., \(\alpha _{vj}\) or \(\delta _{j,c}\)) and let \(\hat{\pi }_{r}\) denote its estimate in the rth replication (r = 1, …, R). The RMSD is defined as

$$\displaystyle{ RMSD_{\pi } = \sqrt{\frac{\sum _{r=1 }^{R }(\hat{\pi }_{r } -\pi )^{2 } } {R}}, }$$
(3)

and the bias is defined as

$$\displaystyle{ bias_{\pi } = \frac{\sum _{r=1}^{R}(\hat{\pi }_{r}-\pi )} {R}. }$$
(4)

The 10% trimmed means of these measures were calculated across items to provide summary statistics.
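Equations (3) and (4) and the trimmed-mean summary can be sketched as follows. The function names and the example values are hypothetical. The sketch assumes that a 10% trimmed mean drops 10% of the values from each tail; the text does not state which trimming convention was used.

```python
import math


def rmsd(estimates, true_value):
    """Root-mean-squared difference over R replications, Eq. (3)."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


def bias(estimates, true_value):
    """Average signed difference over R replications, Eq. (4)."""
    return sum(e - true_value for e in estimates) / len(estimates)


def trimmed_mean(values, proportion=0.10):
    """Mean after dropping floor(proportion * n) values from each tail.

    Assumption: trimming is applied to both tails symmetrically.
    """
    ordered = sorted(values)
    k = int(proportion * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical estimates of one parameter whose true value is 1.0.
example_estimates = [1.1, 0.9, 1.2, 0.8]
example_rmsd = rmsd(example_estimates, 1.0)
example_bias = bias(example_estimates, 1.0)
```

In the design described here, `rmsd` and `bias` would be evaluated once per item parameter across the 25 replications, and `trimmed_mean` would then summarize the resulting values across items.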

3.2 Results

Tables 1, 2, 3, and 4 display the results of the simulation study under the twelve test situations. The results indicate that the HwG procedure estimated the parameters better overall when \(\boldsymbol{\theta }_{2}\) followed a normal distribution. The non-normality of \(\boldsymbol{\theta }_{2}\) affected the accuracy of estimating \(\boldsymbol{\alpha }_{2}\). Specifically, distributions 2–4 had overall larger RMSDs for \(\boldsymbol{\alpha }_{2}\) than distribution 1 (normal). \(\boldsymbol{\alpha }_{1}\) had similar RMSDs across the four distributions when ρ = 0.2 or 0.5. However, the non-normality of \(\boldsymbol{\theta }_{2}\) had more influence on estimating \(\boldsymbol{\alpha }_{1}\) when the two dimensions were highly correlated (i.e., ρ = 0.8). On the other hand, the estimation of \(\boldsymbol{\delta }\) was affected more by skewness than by kurtosis. Specifically, even though distribution 2 had the largest kurtosis, its RMSDs for estimating \(\boldsymbol{\delta }\) were smaller than those from the skewed distributions (i.e., distributions 3 and 4). The estimation of ρ was sensitive to both skewness and kurtosis. Distributions 2–4 had larger RMSDs in estimating ρ than distribution 1. A further comparison of the RMSDs for ρ under the four distributions indicated that they were similar when ρ = 0.2 but diverged when the true correlation was higher (i.e., 0.5 or 0.8).

Table 1 Average RMSD and bias (italic values) for estimating \(\boldsymbol{\alpha }\), \(\boldsymbol{\delta }\), and ρ when N = 500, K = 20
Table 2 Average RMSD and bias (italic values) for estimating \(\boldsymbol{\alpha }\), \(\boldsymbol{\delta }\), and ρ when N = 500, K = 40
Table 3 Average RMSD and bias (italic values) for estimating \(\boldsymbol{\alpha }\), \(\boldsymbol{\delta }\), and ρ when N = 1000, K = 20
Table 4 Average RMSD and bias (italic values) for estimating \(\boldsymbol{\alpha }\), \(\boldsymbol{\delta }\), and ρ when N = 1000, K = 40

Posterior estimates for the person parameters (\(\boldsymbol{\theta }_{1}\) and \(\boldsymbol{\theta }_{2}\)) were also obtained and correlated with their corresponding true values. Tables 1, 2, 3, and 4 summarize all the correlation results, where \(r(\boldsymbol{\theta }_{1},\hat{\boldsymbol{\theta }}_{1})\) and \(r(\boldsymbol{\theta }_{2},\hat{\boldsymbol{\theta }}_{2})\) represent the correlations between the posterior estimates (\(\hat{\boldsymbol{\theta }}\)) and their corresponding true values (\(\boldsymbol{\theta }\)) for dimensions 1 and 2, respectively. The results indicate that \(\boldsymbol{\theta }_{1}\) was estimated fairly well because it satisfied the normality assumption. On the other hand, the estimation of \(\boldsymbol{\theta }_{2}\) was affected by kurtosis more than by skewness, as distribution 2 had an overall lower \(r(\boldsymbol{\theta }_{2},\hat{\boldsymbol{\theta }}_{2})\) than distribution 3 (less kurtotic but more skewed). However, the extremely skewed distribution (i.e., distribution 4) had an overall lower \(r(\boldsymbol{\theta }_{2},\hat{\boldsymbol{\theta }}_{2})\) than distributions 2 and 3. In addition, a comparison of K = 40 and K = 20 for the same sample size conditions (i.e., Table 2 vs. Table 1 and Table 4 vs. Table 3) indicates that the former had consistently larger \(r(\boldsymbol{\theta }_{2},\hat{\boldsymbol{\theta }}_{2})\) values than the latter. This suggests that the accuracy of estimating \(\boldsymbol{\theta }_{2}\) improved with increasing test length regardless of its distribution.

Further, an increase in sample size was found to improve the accuracy of estimating model parameters. For example, with a test length of K = 20, the RMSDs for estimating \(\boldsymbol{\alpha }\), \(\boldsymbol{\delta }\), and ρ when N = 1000 were in general smaller than those when N = 500, especially when the true intertrait correlation was higher. One should note that when ρ = 0.2, larger sample sizes helped reduce the RMSDs of \(\boldsymbol{\alpha }_{2}\) when \(\boldsymbol{\theta }_{2}\) was non-normal. This was, however, not observed with ρ = 0.5 or 0.8. In terms of estimating \(\boldsymbol{\theta }\), a larger sample size tended to increase the accuracy of estimating \(\boldsymbol{\theta }_{1}\). For \(\boldsymbol{\theta }_{2}\), this pattern was observed only in distributions 2 and 3 when ρ < 0.8.

4 Conclusion and Discussion

In general, using Monte Carlo simulations, this study demonstrates that departure from normality of the latent traits in the multi-unidimensional GRM does affect the accuracy of parameter recovery. This is in line with findings from previous studies with unidimensional IRT models (e.g., Sass et al. 2008; Reise and Revicki 2014). Specifically, we found that skewed distributions have a greater effect on the accuracy of estimating the item threshold parameters and that kurtotic distributions affect the estimation of person parameters. In situations where not all latent traits are normally distributed (as was considered in the simulation study), the non-normal shape of some latent traits affects the estimation of parameters in other dimensions when the intertrait correlation is moderate to high. Non-normal latent trait distributions are common for instruments with polytomous response items, such as those measuring mental health, business satisfaction, or cross-cultural differences, so one needs to be aware of the shapes of the latent trait distributions before fitting the model to actual data. However, such information may not always be available in practice. It is hence important to find alternative solutions, such as using a more robust estimation method or a non-normal prior distribution. In addition, this study shows that increased sample size and/or test length can help improve the estimation of the multi-unidimensional GRM parameters. This finding not only confirms results from previous studies dealing with normal latent trait(s) (e.g., Linacre 2002; Sheng 2010; Kuo and Sheng 2015; Wollack et al. 2002) but also extends them to situations where the latent traits are not normal. One may consider reducing the effect of non-normality by increasing the sample size or test length under non-normal conditions. The minimum number of persons or items necessary to reach a desired level of accuracy requires further investigation.

This study focuses on Likert scale items with three response categories, so that two threshold parameters need to be estimated for each item. Further study can evaluate the performance of the estimation procedure using items with more than three categories or with different numbers of categories across items. In addition, this study investigates the effects of non-normal latent traits using the HwG estimation method. Further study can include other estimation techniques, such as marginal maximum likelihood (Bock and Aitkin 1981) and Metropolis-Hastings Robbins-Monro (Cai 2010a,b). Lastly, the simulation study adopted 25 replications due to the computational expense of the MCMC procedures. Further studies can consider more replications to achieve better accuracy.