1 Background

Reliability is one of the hallmark measures of an assessment’s quality. It is a necessary condition for validity. Several authors have noted that a test’s reliability is a function of the scores on a test, not the test itself or multiple forms of a test (Brennan, 2001; Thompson & Vacha-Haase, 2000). A host of measures have been developed to estimate reliability (Feldt & Brennan, 1989; Kane, 1996). The basic definition of reliability is based on the classical test theory assumption that for individual i, a test score X i is the sum of two unobservable and uncorrelated components, a true score, T i, and measurement error, E i:

$$ {X}_i={T}_i+{E}_i. $$
(10.1)

Reliability is then defined as the squared correlation between the observed test scores and the corresponding unobserved true scores which can be shown to be equal to the ratio of true score variance, \( {\sigma}_T^2 \), to total observed score variance, \( {\sigma}_X^2 \):

$$ {\rho}_{TX}^2=\frac{\sigma_T^2}{\sigma_X^2} $$
(10.2)

As noted by Sijtsma (2009a, b), over the years, the one standard reliability index that researchers and psychologists have adopted is coefficient alpha (Cronbach, 1951), further referred to as α. Although Cronbach’s name is tied to the statistic, this measure can be traced through the works of Kuder and Richardson (1937), who published a version of α for dichotomous items, the KR-20 coefficient. Hoyt (1941) proposed an equivalent statistic using analysis of variance with dichotomous responses.

Finally, Guttman (1945) derived a series of reliability coefficients. One coefficient, denoted as λ 3, was equivalent to α.

Assume a test composed of J items, where the random variable Y j represents the score on item j and the total score on the test for an examinee is defined as the sum,

$$ X=\sum_{j=1}^J{Y}_j, $$
(10.3)

α for a group of examinees can be expressed as

$$ \alpha =\frac{J}{\left(J-1\right)}\left[1-\frac{\sum_{j=1}^J{\sigma}_{Y_j}^2}{\sigma_X^2}\right] $$
(10.4)

where \( {\sigma}_{Y_j}^2 \) is the variance of item j and \( {\sigma}_X^2 \) is the variance of the total scores.
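
To make Eq. 10.4 concrete, it can be computed directly from an item-score matrix. The following minimal R sketch assumes a hypothetical N × J matrix Y (rows are examinees, columns are items); the function name is ours, and this is not the code used in the study.

# Coefficient alpha computed directly from Eq. 10.4.
# Y: hypothetical N x J matrix of item scores (rows = examinees).
coef_alpha <- function(Y) {
  J <- ncol(Y)                   # number of items
  item_vars <- apply(Y, 2, var)  # item variances
  total_var <- var(rowSums(Y))   # variance of the total scores
  (J / (J - 1)) * (1 - sum(item_vars) / total_var)
}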

If the item scores are standardized, the formula for α can be expressed in terms of the mean of the inter-item correlations, \( \overline{\rho} \); that is,

$$ \alpha =\frac{J\overline{\rho}}{1+\left(J-1\right)\overline{\rho}}, $$
(10.5)

or, equivalently, in terms of the mean of the inter-item covariances, \( \overline{\sigma_{YY}} \), and the mean of the item variances, \( \overline{\sigma_Y^2} \),

$$ \alpha =\frac{J\left(\frac{\overline{\sigma_{YY}}}{\overline{\sigma_Y^2}}\right)}{1+\left(J-1\right)\left(\frac{\overline{\sigma_{YY}}}{\overline{\sigma_Y^2}}\right)}. $$
(10.6)

It should be noted that α also approximates the mean of all possible Spearman-Brown split-half coefficients (Spearman, 1910; Brown, 1910), where each split-half coefficient is the Spearman-Brown adjustment of the Pearson product-moment correlation, r 12, between the two half-test scores:

$$ {r}_{\mathrm{split}-\mathrm{half}\left(\mathrm{SB}\right)}=\frac{2{r}_{12}}{1+{r}_{12}}. $$
(10.7)

Coefficient α equals the mean of the split-half coefficients when the standard deviations of all possible halves are equal and is smaller when the standard deviations are heterogeneous (Cortina, 1993). Feldt and Brennan (1989) and Lord and Novick (1968) further noted that α will equal the mean of all split-half coefficients when they are calculated with the Flanagan-Rulon formula:

$$ {r}_{\mathrm{split}-\mathrm{half}\left(\mathrm{FR}\right)}=\frac{4{r}_{12}{s}_1{s}_2}{s_T^2}, $$
(10.8)

where s 1 and s 2 are the standard deviations of the two halves and \( {s}_T^2 \) is the variance of the total test (Flanagan, 1937; Rulon, 1939).
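
For illustration, both adjustments can be computed for a single, arbitrary (odd/even) split of the items. This minimal R sketch again assumes a hypothetical item-score matrix Y; the split and the function name are ours.

# Split-half reliability for one odd/even split of Y,
# using the Spearman-Brown (Eq. 10.7) and Flanagan-Rulon (Eq. 10.8) adjustments.
split_half <- function(Y) {
  odd  <- rowSums(Y[, seq(1, ncol(Y), by = 2), drop = FALSE])
  even <- rowSums(Y[, seq(2, ncol(Y), by = 2), drop = FALSE])
  r12  <- cor(odd, even)
  c(SB = 2 * r12 / (1 + r12),                             # Eq. 10.7
    FR = 4 * r12 * sd(odd) * sd(even) / var(odd + even))  # Eq. 10.8
}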

Many researchers have criticized the pervasive use of α (Green et al., 1977; Green & Yang, 2009; Rodriguez & Maeda, 2006; Sijtsma, 2009a, b), and others have written about the shortcomings of the statistic and its interpretations (Cronbach & Shavelson, 2004; Ten Berge & Socan, 2004). One drawback is the ubiquitous interpretation of α as a measure of internal consistency. Internal consistency is a characteristic of the item interrelationships (i.e., the pattern of inter-item covariances), not of the test as a whole, and it does not reflect the length of the test. Another caveat is that α can yield values outside the range of possible values of the score reliability derivable from a single test administration (Cho & Kim, 2015; Sijtsma, 2009a).

It is often thought that α requires the test to be unidimensional and that it can be used as a measure signifying the degree of multidimensionality. Cronbach (1951) did address the test dimensionality issue when he wrote that for a test:

to be interpretable,…it is not essential that all the items be factorially similar. What is required is that a large proportion of the test variance be attributable to the first principal factor running through the test.

Several authors have noted that multidimensional tests can exhibit high values of α (Davenport et al., 2015; Davison & Davenport, 2015). When a test has been empirically demonstrated to be multidimensional, it is important that the test developer be able to articulate the meaning of the composite scale that α is characterizing (e.g., that the total test score is, by design, a weighted linear composite of two or more subscores). In any case, it has been well documented that a multidimensional test does not necessarily have a lower α than a unidimensional test.

Friedman and Weisberg (1981) demonstrated that if all the inter-item correlations are positive, the first principal component eigenvalue is approximately a linear function of the average correlation of the J items

$$ {\lambda}_1\approx 1+\left(J-1\right)\overline{r}. $$
(10.9)

Using this relationship, α can be approximated as

$$ \alpha \approx \frac{J\overline{r}}{\lambda_1}. $$
(10.10)
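
The approximations in Eqs. 10.9 and 10.10 are easy to verify numerically. The short R sketch below, again assuming a hypothetical item-score matrix Y with positive inter-item correlations, compares the first eigenvalue of the inter-item correlation matrix with the approximation in Eq. 10.9, and standardized α (Eq. 10.5) with its eigenvalue-based approximation (Eq. 10.10).

# Numerical check of Eqs. 10.9-10.10 for a hypothetical item-score matrix Y.
R_mat   <- cor(Y)
J       <- ncol(R_mat)
r_bar   <- mean(R_mat[lower.tri(R_mat)])  # average inter-item correlation
lambda1 <- eigen(R_mat, symmetric = TRUE, only.values = TRUE)$values[1]
c(lambda1      = lambda1,
  approx       = 1 + (J - 1) * r_bar,                # Eq. 10.9
  alpha_std    = J * r_bar / (1 + (J - 1) * r_bar),  # Eq. 10.5
  alpha_approx = J * r_bar / lambda1)                # Eq. 10.10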

Another approach, one that attempts to capture a possibly multidimensional underlying structure, is to assess reliability with a factor-analytic coefficient such as ω h (McDonald, 1985, 1999; Zinbarg et al., 2005). The subscript h denotes that this measure of reliability is derived from a hierarchical factor-analytic model. That is, all items are assumed to measure a common factor that accounts for a major proportion of the variance in the scale scores. In addition, each item is assumed to measure a unique skill that is uncorrelated with the common factor. For the purposes of this study, we used a bifactor model in which all items load on a general factor and on a unique factor, with all unique factors uncorrelated. The ω h statistic is calculated as

$$ {\omega}_h=\frac{{\left(\sum_{j=1}^J{\lambda}_{gj}\right)}^2}{\sigma_X^2}, $$
(10.11)

where the λ gj are the loadings of the J items on the general factor and \( {\sigma}_X^2 \) is the variance of the total scores.
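
Equation 10.11 itself is simple to compute once the general-factor loadings are available. The following R sketch assumes a hypothetical vector g_loadings holding the J general-factor loadings from a fitted bifactor model and the item-score matrix Y; the function name is ours. In practice, the psych package’s omega() function fits a bifactor-type solution and reports omega hierarchical directly, which is the route taken in this study (see Sect. 3).

# Omega_h per Eq. 10.11: squared sum of general-factor loadings over the
# total-score variance. 'g_loadings' (length J) is assumed to come from a
# fitted bifactor model; Y is the hypothetical item-score matrix.
omega_h_coef <- function(g_loadings, Y) {
  sum(g_loadings)^2 / var(rowSums(Y))
}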

The goal of this research is to examine and compare the performance of α and ω h under several different test conditions including the correlations between dimensions, number of items, discrimination power of the items, and whether the difficulty of the items is optimal given the ability distribution of the examinees.

The response data were generated using the compensatory multidimensional two-parameter logistic IRT model (M2PL; Reckase, 2009). The M2PL can be expressed as

$$ {p}_j\left({\boldsymbol{\theta}}_i\right)=P\left({u}_{ij}=1\mid {\boldsymbol{\theta}}_i\right)=\frac{1}{1+{e}^{-\left(\sum_{k=1}^m{a}_{jk}{\theta}_{ik}+{d}_j\right)}}, $$
(10.12)

where θ i = (θ i1, θ i2, ⋯, θ ik, ⋯, θ im) is an m-length vector of latent scores for person i, with element θ ik denoting the score of person i on dimension k; a jk is the discrimination of item j on dimension k; and d j is an intercept term denoting the composite difficulty of item j. The MDISC index is the multidimensional analog of the unidimensional discrimination parameter, a. It is a composite discrimination index for each item that can be expressed as

$$ {\mathrm{MDISC}}_j=\sqrt{\sum_{k=1}^m{a}_{jk}^2}, $$
(10.13)

where the \( {a}_{jk} \) are defined above.
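
A minimal R sketch of how dichotomous responses can be generated under Eq. 10.12, together with the MDISC computation in Eq. 10.13, is given below. The object and function names are ours, and the inputs (an N × m matrix Theta of latent scores, a J × m matrix A of discriminations, and a length-J vector d of intercepts) are hypothetical.

# Generate 0/1 responses under the M2PL (Eq. 10.12).
simulate_m2pl <- function(Theta, A, d) {
  # Theta: N x m latent scores; A: J x m discriminations a_jk; d: length-J intercepts.
  P <- plogis(Theta %*% t(A) + matrix(d, nrow(Theta), length(d), byrow = TRUE))
  matrix(rbinom(length(P), 1, P), nrow(P), ncol(P))
}

# MDISC (Eq. 10.13): one composite discrimination value per item (row of A).
mdisc <- function(A) sqrt(rowSums(A^2))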

2 Research Design

This is a simulation study. The response data were generated under prescribed testing conditions with multiple replications. Three coefficients were computed for each data set, and the comparative results were then aggregated across replications: (i) \( {\rho}_{TX}^2 \), the true scale reliability when the true score and error variances are known (through simulation), (ii) α (Eq. 10.4), and (iii) ω h (Eq. 10.11). This design demonstrates how test design considerations that are expected to be influential, such as test length, item discrimination, and the location of the items relative to the population mean(s), impact those three reliability coefficients. The study included five completely crossed design factors:

  • Number of items (J = 24, J = 48)

  • Levels of MDISC (low MDISC, 0.4–0.8; moderate MDISC, 0.8–1.2; high MDISC, 1.2–1.6)

  • Number of dimensions (m = 1, 2, 3, 4)

  • Location of mean item difficulty (d = 0, 1), given that the examinee distribution is always centered at the origin

  • Correlation of abilities (ρ = .0, .5)

The sample size for each simulation was fixed at 1000 examinees, randomly sampled from a standard univariate or multivariate normal distribution centered at the origin. Each condition was further replicated 100 times to provide empirical sampling distributions of each reliability coefficient for comparative purposes.
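
A minimal R sketch of how the crossed design and the correlated abilities could be assembled is shown below. This is illustrative only (the object names and factor labels are ours, not the study’s actual code); the correlated abilities are drawn with MASS::mvrnorm.

# Crossed simulation design: 2 x 3 x 4 x 2 x 2 = 96 conditions.
library(MASS)  # for mvrnorm
design <- expand.grid(J      = c(24, 48),
                      MDISC  = c("low", "moderate", "high"),
                      m      = 1:4,
                      offset = c(0, 1),
                      rho    = c(0, .5))
nrow(design)  # 96

# Example draw of 1000 correlated abilities for one condition (m = 3, rho = .5).
m     <- 3
rho   <- .5
Sigma <- matrix(rho, m, m); diag(Sigma) <- 1
Theta <- mvrnorm(1000, mu = rep(0, m), Sigma = Sigma)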

3 Reliability Estimation and Evaluation

Three reliability coefficients were calculated for each of the simulated data sets: the true scale reliability, \( {\rho}_{TX}^2 \), coefficient α, and ω h (based on a fitted bifactor model). As noted earlier, the true scale reliability was calculated using Eq. 10.2, where the true score variance is the variance, across the N examinees, of the expected scores summed over the J items:

$$ {\sigma}_T^2={\sigma}^2\left(\sum_{j=1}^J P\left({u}_{ij}=1\mid {\boldsymbol{\theta}}_i,{\boldsymbol{a}}_j,{d}_j\right)\right), $$
(10.14)

using the generated N × m matrix of latent scores, θ, and the J × (m + 1) matrix of generated item parameters. The raw score variance was calculated from the total score of each person across all items in the test. Coefficients α and ω h were calculated using the corresponding functions in the R package psych (Revelle, 2021), which implements the reliabilities given in Eqs. 10.4 and 10.11. In aggregate, there were 96 conditions (2 × 3 × 4 × 2 × 2), and each condition was replicated 100 times to provide empirical sampling distributions of the three coefficients. In particular, the means and standard deviations of those sampling distributions were computed across the 100 replications per condition, and graphical visualizations were created using the R package ggplot2 (Wickham, 2016). All simulation, data management, and analytical aspects of this study were carried out in R (R Core Team, 2021).
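
A minimal sketch of these computations is given below, assuming the hypothetical objects Theta, A, and d from the earlier M2PL sketch and a simulated 0/1 response matrix U produced by it. The specific psych calls and extracted elements reflect typical usage of that package and are not a transcript of the study’s code.

# True scale reliability (Eq. 10.2), with the true score for each examinee
# taken as the sum of the model-implied response probabilities (Eq. 10.14)
# and the observed variance taken from the simulated total scores.
U         <- simulate_m2pl(Theta, A, d)
P_true    <- plogis(Theta %*% t(A) + matrix(d, nrow(Theta), length(d), byrow = TRUE))
true_rel  <- var(rowSums(P_true)) / var(rowSums(U))

# Coefficient alpha (Eq. 10.4) and omega_h (Eq. 10.11) via the psych package.
alpha_est <- psych::alpha(U)$total$raw_alpha
omega_est <- psych::omega(U, nfactors = m, plot = FALSE)$omega_h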

4 Results

The five design factors produced 96 simulation test design conditions. These factors were expected to have a direct or indirect impact on the three reliability indices, \( {\rho}_{TX}^2 \), α, and ω h. The impact of the number of items (test length) on reliability is well known, given the extensive body of research on the Spearman-Brown formula (e.g., Angoff, 1953; Traub, 1997),

$$ {\rho}_{XX^{\prime}}^{\ast }=q{\rho}_{XX^{\prime }}/\left[1+\left(q-1\right){\rho}_{XX^{\prime }}\right] $$
(10.15)

where \( {\rho}_{XX^{\prime }} \) is the original reliability index and q is the ratio of the new to the original test length. In contrast, the average MDISC (composite item discrimination) and the item locations, which were generated either to match or to be offset from the population centroids, influence the contribution of each item to the score variance (e.g., Gulliksen, 1950). These two factors also directly and indirectly reflect item quality, especially through the item discrimination parameters and MDISC, which act as weights for the latent scores. Finally, the number of underlying dimensions and the correlation between those dimensions represent the dispersion of the measurement signal across the latent structures underlying the item covariances. Including these latter two conditions in the simulation speaks directly to the motivation for ω h, that is, to have a reliability index that responds to unintended or idiosyncratic dimensionality, or to a test that includes multiple dimensions by design and perhaps reports the total score as a weighted linear composite of subscores. Increasing the dimensionality and the covariance(s) among the underlying factors should disperse the “measurement signal” relative to a reported total score.
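
As a small illustration of Eq. 10.15 (the function name is ours), doubling a test with an original reliability of .80 gives an adjusted reliability of about .89:

# Spearman-Brown prophecy (Eq. 10.15): rho is the original reliability,
# q is the ratio of new to original test length.
sb_prophecy <- function(rho, q) q * rho / (1 + (q - 1) * rho)
sb_prophecy(.80, 2)  # approximately 0.889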

For the most part, these factors produced results that met expectations. Figures 10.1, 10.2, 10.3, 10.4, 10.5 and 10.6 are “trellis” or faceted multi-plots in which each panel is a bivariate plot conditioned on the number of items (columns) and the magnitude of the correlation between the underlying dimensions or factors (none implies a zero correlation between the factors; moderate implies a correlation of .5 between all factors). The number of dimensions is shown along the horizontal axis of each panel, and the vertical axis represents the magnitude of the reliability coefficient. The three plotted outcomes in each cell of the multi-plot denote the three reliability indices: \( {\rho}_{TX}^2 \), α, and ω h. These results are summarized as the mean and standard error of the reliability coefficients across the 100 replications per combination of simulation conditions.
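
A plot of this general form can be produced from a summary of the replications. The following ggplot2 sketch assumes a hypothetical data frame summary_df with columns m (number of dimensions), J (number of items), rho_label (inter-factor correlation: none or moderate), coefficient (alpha, omega_h, or true reliability), and mean_rel (mean across the 100 replications); the column names are illustrative.

# Faceted plot of mean reliability by number of dimensions,
# conditioned on test length (columns) and inter-factor correlation (rows).
library(ggplot2)
ggplot(summary_df, aes(x = factor(m), y = mean_rel, colour = coefficient)) +
  geom_point(position = position_dodge(width = 0.3)) +
  facet_grid(rho_label ~ J) +
  labs(x = "Number of dimensions", y = "Reliability")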

Fig. 10.1 Summary of reliability coefficients for high MDISC and item difficulty matched to the population proficiency score centroids: μ(d) − μ(θ k) = 0 (100 replications per condition)

Fig. 10.2 Summary of reliability coefficients for high MDISC with item difficulty offset from the population proficiency score centroids: μ(θ k) − μ(d) = 1 (100 replications per condition)

Fig. 10.3 Summary of reliability coefficients for moderate MDISC and item difficulty matched to the population proficiency score centroids: μ(θ k) − μ(d) = 0 (100 replications per condition)

Fig. 10.4 Summary of reliability coefficients for moderate MDISC with item difficulty offset from the population proficiency score centroids: μ(θ k) − μ(d) = 1 (100 replications per condition)

Fig. 10.5 Summary of reliability coefficients for low MDISC and item difficulty matched to the population proficiency score centroids: μ(θ k) − μ(d) = 0 (100 replications per condition)

Fig. 10.6 Summary of reliability coefficients for low MDISC with item difficulty offset from the population proficiency score centroids: μ(θ k) − μ(d) = 1 (100 replications per condition)

As Fig. 10.1 shows (high MDISC, with the mean item difficulty matched to the population centroids, μ(d) − μ(θ k) = 0 for all k), there was a noticeable increase in the \( {\rho}_{TX}^2 \) and α coefficients as the test length increased from 24 to 48 items and a decrease in those coefficients as the number of dimensions increased from 1 to 4, reflecting the dispersion of the total test score signal among the dimensions. The three coefficients are all highly similar in the unidimensional case (m = 1), with α and ω h being essentially identical. The coefficients only start to decline as the total score signal is dispersed across two or more underlying factors. Note that the zero-correlation condition is rather unrealistic in a practical sense but provides a reasonable baseline under “maximum dispersion” conditions. Interestingly, the mean values of ω h tend to track somewhat with the inter-factor correlations (.0 = none or .5 = moderate).

Figure 10.2 (high MDISC, with the mean item difficulty offset from the population centroids, μ(θ k) − μ(d) = 1 for all dimensions) shows a pattern that is very consistent with Fig. 10.1. Cronbach’s α values tend to be smaller than the “true reliabilities” computed from the known true scores, \( {\rho}_{TX}^2 \). This likely reflects some sampling error in estimating the item and total score variances (see Eq. 10.4). The ω h coefficients, again, track somewhat with the magnitude of the inter-factor correlations, although the mean values are also confounded with the high MDISC of the items.

Figure 10.3 (moderate average MDISC, with the mean item difficulty matched to the population centroids, μ(θ k) − μ(d) = 0 for all dimensions) begins to show an interesting pattern: the mean α and \( {\rho}_{TX}^2 \) values respond to the reduced composite item discrimination, but the ω h coefficients do not.

Figure 10.4 (moderate average MDISC, with the mean item difficulty offset from the population centroids, μ(θ k) − μ(d) = 1 for all dimensions) confirms the coefficient patterns of Fig. 10.3; that is, the ω h coefficients respond more to the amount of total score signal dispersion than to the reduced composite item discrimination. The mean α and \( {\rho}_{TX}^2 \) values respond to the reduced composite item discrimination and, to a lesser degree, to the signal dispersion across dimensions.

Figures 10.5 and 10.6 show an overall decline in the mean α and \( {\rho}_{TX}^2 \) values proportional to both the low average MDISC values and the dimensional dispersion of the total score signal. Interestingly, and similar to Figs. 10.3 and 10.4, the latter dispersion has less impact across the increasing number of dimensions than under the high-discrimination condition. Increasing the test length helps to somewhat offset the decline in the reliability coefficients, but the recommendation to write high-quality items and to monitor that the composite item discrimination remains as high as possible seems to be good advice.

5 Conclusion

In this study, we varied testing conditions that we expected would influence the performance of the three reliability coefficients: (1) the true reliability, (2) Cronbach’s α, and (3) ω h. As the number of items was doubled from 24 to 48, there was the expected increase in reliability. Likewise, as the discrimination of the items (MDISC) increased, the magnitude of the reliability coefficients also uniformly increased. The simulation response data were generated with an underlying multidimensional simple structure for three of the four dimensionality conditions. As the correlations between the multidimensional latent abilities increased from 0 to .5, thus “collapsing” the latent space, the reliability coefficients also increased. Increasing the average difficulty of the items, that is, increasing the offset between the location of maximum measurement information and the centroid of the examinees’ joint latent ability distribution, did not induce any prominent change in reliability.

The simulation condition that appeared to have the greatest impact on the reliability coefficients was multidimensionality. As the number of dimensions increased, coefficient ω h dropped considerably in comparison to the true scale reliability and coefficient α. This was anticipated because ω h was computed using the sum of the loadings on the general factor in the hierarchical, orthogonal bifactor model, in which all factors are uncorrelated. Because the data were generated using simple structure, the loadings on the unique factors were higher than the loadings on the general factor, creating substantial dispersion in the measurement “signal” and, specifically, inducing “noise” relative to the general factor. That is, the R package that was used estimated ω h from the bifactor model rather than from a common factor or component model.

In the unidimensional case, α and ω h were always equal. In some cases, these coefficients exceeded the true scale reliability. As dimensionality increased, α, like \( {\rho}_{TX}^2 \), decreased, though not nearly as much as ω h; that is, α was not affected as much as ω h by the increase in dimensionality. There was one notable inconsistency: in the two-dimensional case, ω h was consistently lower than in the three- and four-dimensional cases across all conditions. This may have been a function of the sampled item discrimination parameters.

It seems clear that testing practitioners should be advised always to conduct a thorough dimensionality analysis of their test results relative to the intended, reported score scale(s) and to evaluate the dimensionality results against the test specifications so that they can articulate the meaning of the observed score scale. Evaluating only reliability coefficients or standard errors of measurement is not sufficient.

Future research will extend the current work to incorporate factorially complex item structures in which the multidimensionality may relate to nuisance dimensions or idiosyncratic characteristics of the items (i.e., items loading on both intended and unintended factors underlying the data). We also plan to examine reliability from a multidimensional IRT perspective and relate it more directly to the concept of a unidimensional composite of intended multidimensional traits (i.e., Wang’s (1985) reference composite). Lastly, we plan to experiment with the formulation of ω h to determine whether additional information about dimensionality and its effect on reliability can be delineated for testing practitioners.