Keywords

1 Introduction

For test and questionnaire data, respondents are often assigned a latent value for the attribute that the test is aimed to measure, based on a model for the dependencies that exist between the test items. Because the latent variable is unobserved, it is convenient to consider the observable sum scores across the test items as a proxy for the latent values instead. For the use of the sum score, a desirable feature of a measurement model would then be that a higher sum score also corresponds to a higher expected latent value, so that the ordering of respondents by their sum scores stochastically agrees with the ordering by their latent values. The use of the sum score for making such ordinal inferences has been studied for various item response theory models (Hemker et al., 1996, 1997; Ligtvoet, 2012, 2015), based on a monotone likelihood ratio (MLR) ordering of the latent variable by the sum scores. For these models, Hemker (2001) also looked at the effect of joining response categories. The results show that the MLR property is not implied by all models, and that for some models the model may be invalidated when (adjacent) outcomes are joined (cf. Andrich, 1995a, 1995b; Roskam, 1995). With the omnipresent use of the sum (or average) score across many test applications and the common practice of collapsing outcomes (e.g., median split), there thus seems to be a mismatch between the models proposed for response data and the way these data are handled in practice. The present chapter looks at the use of the sum score and the discretization of response data under the linear normal one-factor (LNF) model.

1.1 The Linear Normal One-Factor

Factor analysis provides a framework to account for dependencies that exist between the multiple item response variables of a test, whereby item response variables serve as indicators of the common factors that the test aims to measure. The LNF model proposes a single latent variable or factor Z to account for the covariances between the item variables X1, …, Xn by the linear relationships

$$\displaystyle \begin{aligned} X_i=a_iZ+U_i\mbox{, with }a_i>0, \end{aligned}$$

where ai denotes the ith factor loading and Ui is the ith residual or unique factor. It is assumed that U1, …, Un, Z are independent and normally distributed, with zero means (centered), and (non-negative) variances

$$\displaystyle \begin{aligned} \mbox{Var}(U_i)=\sigma_i^2\mbox{ and Var}(Z)=\sigma^2. \end{aligned}$$

Further, assume that Cov(Ui, Uj) = 0 (for ij) and Cov(Ui, Z) = 0 (Jöreskog, 1971; Lord & Novick, 1968). Hence, under the LNF model, the variables X1, …, Xn are conditionally independent (CI), given Z = z.

1.2 Monotone Transformations

In this chapter, two monotone (non-decreasing) transformations are considered that are often used on X1, …, Xn in practice. Let X = (X1, …, Xn) denote the random vector containing the variables Xi, with realizations \(\boldsymbol {x}\in \mathbb {R}^n\). Then, a function ϕ(x) is said to be monotone, whenever x < y (element-wise) implies that ϕ(x) ≤ ϕ(y).

The Sum Score

The first transformation that is considered is the sum score S = X1 + … + Xn often used in practice as a proxy for Z (McNeish & Wolf, 2020), with ϕ(x) representing a mapping of many-to-one or aggregation; i.e., \(\phi :\mathbb {R}^n\rightarrow \mathbb {R}\). In practice, a LNF model is often fitted to the data in order to assess the validity (factorial structure) of the test, whereas in subsequent analysis the sum or average score is used to ascertain the validity indices (e.g., predictive validity) of classical test theory. A minimal requirement of the LNF model for such practice is that the model implies a stochastic ordering of the factor scores by the sum scores, such that higher factor scores are expected for higher sum scores. Let f(s, z) denote the joint density of S and Z, then the stochastic ordering by the sum scores is satisfied, whenever the density f(s, z) is totally positive of order 2; that is

$$\displaystyle \begin{aligned} f(s_1,z_1)f(s_2,z_2)\ge f(s_1,z_2)f(s_2,z_1), \end{aligned}$$
(TP2 )

for all s1 < s2 and z1 < z2 (Karlin, 1968, cf. the MLR property).

Discretization

As a second transformation, consider the discretization of the variables X1, …, Xn. The LNF model is often used for the analysis of discrete ordinal response data (Jöreskog and Moustaki, 2001), where X1, …, Xn are taken as the ghosts underlying the observable discrete variables V1, …, Vn. Here, ϕ(x) takes on the form (ϕ1(x1), …, ϕn(xn)), with

$$\displaystyle \begin{aligned} V_i=\phi_i(X_i;b_{1},\ldots,b_{m_i})\mbox{ and }b_{1}<\ldots<b_{m_i}, \end{aligned}$$

where

$$\displaystyle \begin{aligned} V_i=v_i\mbox{, whenever }b_v\le x_i<b_{v+1} \end{aligned}$$

(\(b_{m_i+1}=-b_0=\infty \) by definition). In words, each discretization ϕi proposes mi ordered thresholds, where vi denote the largest threshold passed by the outcome of Xi. With vi ∈{0, 1, …, mi}, the outcomes of Vi are said to be equidistant (Andrich, 1995a). Sijtsma and Van der Ark (2017) discuss the use of an equidistant scoring rule in relationship with the use of the sum score in the context of Mokken’s monotone homogeneity (MH) model (Mokken, 1971; Molenaar, 1997). In addition to CI and the unidimensionality assumption, the MH model assumes that the tail distributions 1 − F(xi|z) are non-decreasing in z (Holland & Rosenbaum, 1986). The LNF model satisfies the MH model assumptions, also after applying a discretization to X1, …, Xn. In case mi = m = 1, the transformation ϕ(x) corresponds to a dichotomization of the response variables.

For later reference, the concept of Pólya frequency functions of order 2 (PF2) is introduced (e.g., Efron, 1965; Schoenberg, 1951).

Definition 1

The density f(x) is said to be PF2, if for all x1 < x2 and y1 < y2

$$\displaystyle \begin{aligned} f(x_1-y_1)f(x_2-y_2)\ge f(x_1-y_2)f(x_2-y_1). \end{aligned} $$
(1)

Ellis (2015) showed that the (monotone higher-order) one-factor model, with residuals having PF2 densities (e.g., normally distributed), implies that f(v) is multivariate TP2 (Karlin & Rinott, 1980). This in turn implies that each f(vi, vj) is TP2.

Strictly speaking, the normality requirements of the LNF model does not hold for the discrete variables V1, …, Vn. However, the LNF may still provide an adequate approximation of discrete response data (Rhemtulla et al., 2012).

Chapter Overview

The purpose of this chapter is to investigate the effect the discretization of the indicators X1, …, Xn has on the use of the sum score for the stochastic ordering on Z. In the next section, it is shown that the LNF model implies a stochastic ordering on Z by the sum score S. However, it is also shown that the stochastic ordering property does not generally hold when the sum score is used after discretizing the indicators X1, …, Xn, except in the special case of a dichotomization. These results and their implications are further discussed in Sect. 3.

2 The LNF Model and the Sum Score

In this section, it is shown that the LNF model implies a stochastic ordering on Z by the sum S = X1 + … + Xn. However, after discretization of the variables X1, …, Xn obtained under the LNF model, the sum R = V1 + … + Vn of the newly obtained variables V1, …, Vn no longer needs to provide a stochastic ordering on the factor Z. That is, f(s, z) is TP2 does not imply that f(r, z) is also TP2.

2.1 Preliminaries

In order to show that the LNF model implies a stochastic ordering of Z by S = X1 + … + Xn, it is convenient to express the model in terms of the more general properties TP2 and PF2. To this end, consider the joint and conditional densities f(xi, z) and f(xi|z), respectively. First, assuming CI. Then, with E(Xi) = E(Z) = 0, the covariance between Xi and Z equals

$$\displaystyle \begin{aligned} E(X_iZ)=a_iE(Z^2)+E(U_iZ)=a_i\sigma^2>0. \end{aligned}$$

This implies that f(xi, z) is TP2 (Karlin & Rinott, 1983). Second, because Ui is normally distributed, so is the conditional density f(xi|z). Consequently, f(xi|z) has a PF2 density (Efron, 1965). Note that, for strictly positive densities, if f(xi, z) is TP2, then f(xi|z) is TP2 as a function of (xi, z) (Holland & Rosenbaum, 1986). Because the TP2 property implies that the tail distribution 1 − F(xi|z) is non-decreasing in z, it thus follows that the LNF model satisfies the assumptions (i.e., is a special case) of the MH model (Holland & Rosenbaum, 1986).

The following observation is useful for the proof of Theorem 1 below. Assume that f(x, y|z) > 0 (as implied by the LNF model). Then, for z1 < z2, the inequality

$$\displaystyle \begin{aligned} f(x_2,y_2|z_2)f(x_1,y_1|z_1)\ge f(x_2,y_2|z_1)f(x_1,y_1|z_2) \end{aligned} $$
(2)

holds, whenever

$$\displaystyle \begin{aligned} \frac{f(x_2,y_2|z_2)}{f(x_2,y_2|z_1)}\ge\frac{f(x_2,y_1|z_2)}{f(x_2,y_1|z_1)}\ge \frac{f(x_1,y_1|z_2)}{f(x_1,y_1|z_1)}. \end{aligned}$$

Hence, (2) holds, if both (a) f(x, y|z) is TP2 as a function of (y, z), with y1 < y2 and X = x2 (fixed), and (b) f(x, y|z) is TP2 in (x, y), with x1 < x2 and Y = y1.

The next result is proven in Ligtvoet (2021), but here adapted to the LNF model.

Theorem 1

The LNF model implies that f(s, z) is TP2.

Proof

Suppose that Theorem 1 holds for n = 2, with X1 = X and X2 = Y . The proof of Theorem 1 then follows, by sequentially taking X = X1 + … + Xi−1 and Y = Xi, for i = 2, …, n. Hence, it is sufficient to show that f(s, z) is TP2, for S = X + Y .

Due to CI, the conditional density of S is given by the convolution \(f(s|z)=\int g(x|z)h(s-x|z)dx\). Then, f(s, z) is TP2, if for any s1 < s2 and z1 < z2 it holds that f(s2|z2)f(s1|z1) ≥ f(s2|z1)f(s1|z2), which yields

(3)

Note that (3) has the form of (2), with f(x, y|z) = g(x|z)h(s − x|z). So, it is sufficient to show that (3) holds for the case that X is constant between z1 and z2, and the case that Y  is constant between z1 and z2. Because both cases are symmetric in their arguments, we’ll only consider taking Y  to be constant at Z = z (for z1 ≤ z ≤ z2). For (3), this yields

(4)

Because g(x, z) is TP2, the function within the integral of (4) has a positive outcome for x1 < x2 and negative values for x1 > x2. What remains to be shown is that the density of the area of the integral spanning all \(x_1^\prime >x_2^\prime \) is smaller than the area spanning all \(x_1^{\prime \prime }<x_2^{\prime \prime }\) (see Fig. 1 for illustration). Let \(x_1^\prime =x_0\) and \(x_2^\prime =x_0-\varepsilon \), with ε > 0, and accordingly \(x_1^{\prime \prime }=x_0-\varepsilon \) and \(x_2^{\prime \prime }=x_0\). This yields a one-to-one (injective) mapping of each pair \((x_1^\prime ,x_2^\prime )\) that yields negative values in (4) to \((x_1^{\prime \prime },x_2^{\prime \prime })\), as shown in Fig. 1. Also, let s1 = s and s2 = s + δ, with δ > 0. Then, it is sufficient to show that for all values x0, ε, s, δ, and z1 < z2,

$$\displaystyle \begin{aligned} \begin{array}{rcl} & &\displaystyle (g(x_0|z_2)g(x_0-\varepsilon|z_1)- g(x_0|z_1)g(x_0-\varepsilon|z_2))\cdot{}\\ & &\displaystyle (h(s-x_0+\varepsilon|z)h(s-x_0+\delta|z)- h(s-x_0+\varepsilon+\delta|z)h(s-x_0|z))\ge0. \end{array} \end{aligned} $$
(5)

The first part of (5) in parentheses is non-negative, because g(x, z) is TP2. The second part in parentheses is also non-negative, because h(y|z) is PF2. This can be seen by taking x1 = s − x0, x2 = s − x0 + ε, y1 = −δ, and y2 = 0 in (1). □

Fig. 1
figure 1

Plot of x1 and x2, with the point \((x_1^{\prime \prime },x_2^{\prime \prime })\) in the gray areas above the identity yielding positive values of (4)

2.2 The Sum Score After Discretization of Variables

Theorem 1 shows that under the LNF model f(s, z) is TP2 for the sum score S = X1 + … + Xn. Next, a discretization is considered, whereby we denote the sum of the discretized variables as R = V1 + … + Vn. As mentioned earlier, the LNF model satisfies the MH model assumptions. For the special case when a dichotomization is applied to all variables (i.e., mi = m = 1), the LNF reduces to Mokken’s MH model for binary variables, which has been shown to imply a stochastic ordering of the latent variable by the sum score (Ghurye & Wallace, 1959; Grayson, 1988; Huynh, 1994; Ünlü, 2008). For mi > 2, however, the MH model does not imply a stochastic ordering of the latent variable by the sum score (Hemker et al., 1996, 1997). The next example also shows that f(r, z) need not be TP2, after discretizing the variables X1, …, Xn obtained under the LNF model.

Example 1

Consider the LNF for n = 2 variables, with a1 = a2 = 1, σ2 = 1, and \(\sigma _1^2=2/5\) and \(\sigma _2^2=5/2\). Also, let Vi = ϕi(Xi;−1∕2, 1∕2), for i = 1, 2 (i.e., mi = m = 2), and R = V1 + V2. Further, define the log-odds \(\ln \omega _r(z)=\ln f(r|z)-\ln f(r-1|z)\), for r = 1, …, 4, which are non-decreasing in z, whenever f(r, z) is TP2. Figure 2 shows that for r = 2, 3, the log-odds are decreasing (in violation of the stochastic property). For r = 2, the gray area in Fig. 2 shows the decrease in log-odds between z = −1.37 and z = −0.10, indicating that for two subjects with these factor scores, the subject with the (higher) factor score z = −0.10 is about 2.5 times more likely to obtain the lower sum score than the subject that has a factor score that is more than one standard deviation lower (i.e., a substantial violation).

Fig. 2
figure 2

Log-odds for the sum scores as a function of z, showing a violation (decrease; e.g., gray area) of the stochastic ordering of Z by R = V1 + V2

3 Discussion

The property f(s, z) is TP2 was proposed as a minimal requirement for the use of the sum S = X1 + … + Xn. This property is less restrictive than the tau-equivalence requirement from classical test theory (Lord & Novick, 1968; McNeish & Wolf, 2020), but is also limited to ordinal inferences. Theorem 1 implies that the confirmation of the LNF model justifies ordinal inferences about the latent factor based on the sum score. For applications that require more than mere ordinal inferences, the use of the estimated factor score may be more advantageous.

The discretization of variables obtained under the LNF model, not only jeopardizes the normality assumption, but also has implications for the practical use of the sum score. The extent to which the stochastic ordering property by the sum score is violated, in a practical sense, will depend on the number of items variables of the test, as well as the number of categories resulting from the discretization. Simulation studies may further address this issue.

Instead of assuming a normal distribution to underlie the observed discrete ordinal response data, alternative approaches for analyzing these data impose restrictions on the cumulative distributions similar to Samejima’s (1969) graded response model. Jöreskog and Moustaki (2001) and Takane and De Leeuw (1987) showed that the normal ogive model for graded responses is formally equivalent to the LNF model that assumes a normal distribution to underlie the ordinal responses. The difference between these models is that, for the graded response model, the conditional response distributions is discretized prior to taking the marginal across the latent factor, whereas for the LNF model the discretization is applied (afterwards) to the marginal distribution (Takane & De Leeuw, 1987, p. 397). In the latter case, the discretization may invalidate the property f(s, z) is TP2, when applying the LNF to discrete ordinal response data. That the graded response model does not imply this property was already shown by Hemker et al. (1996, 1997).

To conclude, the use of the sum score, albeit practical, is not what most models are designed for. The applied researcher should realize that an ordering on a latent variable by the sum score is not something that can be simply assumed to hold. If the applied researcher has a model that accurately describes the response data, it might generally be best to rely on the model estimates, rather than using the sum scores. And if a transformation of the data is deemed necessary, the validity of the model will need to be reassessed.