1 Introduction

PISA is an international large-scale educational assessment study conducted by member countries of the OECD (2001, 2002a, 2004, 2007, 2010) and investigates how well 15-year-old students approaching the end of compulsory schooling are prepared to meet the challenges of today’s knowledge societies (OECD 2005, 2012). The study does not focus on the students’ achievement regarding a specific school curriculum but rather aims at measuring the students’ ability to use their knowledge and skills to meet real-life challenges (OECD 2009a).

PISA started in 2000 and takes place every 3 years. Proficiency in the domains of reading, mathematics, and science is assessed. In each assessment cycle, one of these domains is chosen as the major domain under fine-grained investigation: reading in 2000, followed by mathematics in 2003, science in 2006, and so on. The definitions of the domains can be found in the respective assessment frameworks (e.g., OECD 2009a). In addition to these domains, further competencies may also be assessed by a participating OECD member; for example, digital reading in 2009 (OECD 2012). Besides the actual test in PISA, student and school questionnaires are used to provide additional background information (e.g., about the socio-economic status of a student). In PISA 2009, for instance, parents in 14 countries were additionally asked to fill in an optional questionnaire. The background information is used in the form of so-called conditioning variables for the scaling of the PISA cognitive (i.e., test) data.

The number of countries (and economies) participating in PISA continues to increase (e.g., 32 and 65 countries for PISA 2000 and 2009, respectively). In each participating country, a sample of at least 150 schools (or all schools) was drawn. In each participating school, 35 students were drawn (in schools with fewer than 35 eligible students, all students were selected).

The PISA study involves a number of technical challenges: for example, the development of the test design, the measurement instruments, and the survey and questionnaire scales. Accurate sampling designs, covering both school sampling and student sampling, must be developed. The multilingual and multicultural nature of the assessment must be taken into account, and various operational control and validation procedures have to be applied. The scaling and analysis of the data, the focus of this paper, require sophisticated psychometric methods, and PISA employs a scaling model based on item response theory (IRT; e.g., Adams et al. 1997; Fischer and Molenaar 1995; van der Linden and Hambleton 1997). The proficiency scales and levels, which are the basic tool in reporting PISA outcomes, are derived through IRT analyses.

The PISA Technical Reports describe these methodologies (OECD 2002b, 2005, 2009b, 2012). The description is provided at a level that allows for review and, potentially, replication of the implemented procedures. In this paper, we recapitulate the scaling procedure used in PISA (Sect. 2). We discuss the construction of the proficiency scales and proficiency levels and explain how the results are reported and interpreted in PISA (Sect. 3). We comment on whether the information provided in the Technical Reports is sufficient to replicate the sampling and scaling procedures and the central results of PISA, on the classification procedures and alternatives thereof, and on other, for instance more automated, ways of reporting in the PISA Technical Report (Sect. 4). Limitations of PISA, together with some reflections and suggestions for improvement, are discussed throughout the paper.

2 Scaling Procedure

To scale the PISA cognitive data, the mixed coefficients multinomial logit model (MCMLM; Adams et al. 1997) is applied (OECD 2012, Chap. 9). This model is a generalized form of the Rasch model (Rasch 1980) in IRT. In the MCMLM, the items are characterized by a fixed set of unknown parameters, ξ, while the student outcome levels, represented by the latent random variable θ, are treated as random effects.

2.1 Notation

Assume I items (indexed \(i = 1, \ldots, I\)) with \(K_i + 1\) possible response categories \((0, 1, \ldots, K_i)\) for an item i. The vector-valued random variable \(X_i' = (X_{i1}, X_{i2}, \ldots, X_{iK_i})\) of order \(1 \times K_i\), with \(X_{ij} = 1\) if the response of the person to item i is in category j and \(X_{ij} = 0\) otherwise, indicates the \(K_i + 1\) possible responses of the person to item i. The zero category of an item is denoted by a vector of zeros, which makes the zero category a reference category for model identification. Collecting the \(X_i'\) together into a vector \(X' = (X_1', X_2', \ldots, X_I')\) of order \(1 \times t\) (\(t = K_1 + \cdots + K_I\)) gives the response vector, or response pattern, of the person on the whole test.

In addition to the response vector X (person level), assume a \(1 \times p\) vector \(\xi' = (\xi_1, \xi_2, \ldots, \xi_p)\) of p parameters (\(p \geq I\)) describing the I items. These are often interpreted as the items’ difficulties. In the response probability model, linear combinations of these parameters are used to describe the empirical characteristics of the response categories of each item. To define these linear combinations, a set of design vectors \(a_{ij}\) (\(i = 1, \ldots, I\); \(j = 1, \ldots, K_i\)), each of length p, is collected into a \(p \times t\) design matrix \(A' = (a_{11}, \ldots, a_{1K_1}, a_{21}, \ldots, a_{2K_2}, \ldots, a_{I1}, \ldots, a_{IK_I})\), and the linear combinations are calculated as \(A\xi\) (of order \(t \times 1\)). In the multidimensional version of the model, it is assumed that \(D \geq 2\) latent traits underlie the persons’ responses. The scores of the individuals on these latent traits are collected in the \(D \times 1\) vector \(\theta = (\theta_1, \ldots, \theta_D)'\), where the θ’s are real-valued and often interpreted as the persons’ abilities.

In the model, the notion of a response score \(b_{ijd}\) is also introduced, which gives the performance level of an observed response in category j of item i with respect to dimension d (\(d = 1, \ldots, D\)). For dimension d and item i, the response scores across the \(K_i\) categories of item i can be collected in a \(K_i \times 1\) vector \(b_{id} = (b_{i1d}, \ldots, b_{iK_id})'\) and, across the D dimensions, in the \(K_i \times D\) scoring sub-matrix \(B_i = (b_{i1}, \ldots, b_{iD})\). For all items, the response scores can be collected in a \(t \times D\) scoring matrix \(B = (B_1', \ldots, B_I')'\).

2.2 MCMLM

The probability \(\Pr(X_{ij} = 1;\, A, B, \xi \mid \theta)\) of a response in category j of item i, given an ability vector θ, is \(\exp(b_{ij}\theta + a_{ij}'\xi)\,/\,(1 + \sum_{q=1}^{K_i} \exp(b_{iq}\theta + a_{iq}'\xi))\), where \(b_{iq}\) is the qth row of the corresponding matrix \(B_i\), and \(a_{iq}\) is the \((\sum_{l=1}^{i-1} K_l + q)\)th row of the matrix A. The conditional item response model (conditional on a person’s ability θ) can then be expressed as \(f_x(x;\, \xi \mid \theta) = \exp[x'(B\theta + A\xi)]\,/\,\sum_{z} \exp[z'(B\theta + A\xi)]\), where x is a realization of X and the sum in the denominator runs over all possible response vectors z.
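As a minimal numerical illustration of these category probabilities, the following R sketch computes them for a single polytomous item under a partial-credit-type parametrization with D = 1; the design rows, response scores, and step difficulties are hypothetical values chosen for illustration, not parameters from any PISA scale.

K  <- 2                                          # item with categories 0, 1, 2
b  <- c(1, 2)                                    # response scores b_i1, b_i2 (D = 1)
a  <- matrix(c(-1,  0,                           # design rows a_i1', a_i2': category j
               -1, -1), nrow = K, byrow = TRUE)  # picks up minus the first j step parameters
xi <- c(0.3, 0.7)                                # hypothetical step difficulties

# Pr(X_ij = 1 | theta) = exp(b_ij*theta + a_ij'xi) / (1 + sum_q exp(b_iq*theta + a_iq'xi))
category_probs <- function(theta) {
  num <- exp(b * theta + as.vector(a %*% xi))
  c(1, num) / (1 + sum(num))                     # probabilities of categories 0, ..., K
}

round(category_probs(theta = 0.5), 3)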

In the conditional item response model, θ is given. The unconditional, or marginal, item response model requires the specification of a density \(f_\theta(\theta)\). In the PISA scaling procedure, students are assumed to have been sampled from a multivariate normal population with mean vector μ and variance-covariance matrix Σ, that is, \(f_\theta(\theta) = ((2\pi)^D \vert\Sigma\vert)^{-1/2} \exp[-(\theta - \mu)'\Sigma^{-1}(\theta - \mu)/2]\). Moreover, this mean vector is parametrized as \(\mu = \Gamma'w\), so that \(\theta = \Gamma'w + e\), where w is a \(u \times 1\) vector of u fixed and known background values for a student, Γ is a \(u \times D\) matrix of regression coefficients, and the error term e is \(N(0, \Sigma)\). In PISA, \(\theta = \Gamma'w + e\) is referred to as the latent regression, and w comprises the so-called conditioning variables (e.g., gender, grade, or school size). This is the population model.
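The population model can be sketched in a few lines of R; the dimension D = 2, the conditioning variables, and all coefficient values below are hypothetical and serve only to make the latent regression \(\theta = \Gamma'w + e\) concrete.

library(MASS)                            # provides mvrnorm for multivariate normal draws

w     <- c(1, 0.5)                       # u = 2 conditioning variables (incl. intercept)
Gamma <- matrix(c(0.2, 0.1,
                  0.4, 0.3), nrow = 2)   # u x D matrix of regression coefficients
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 1.0), nrow = 2)   # D x D residual covariance matrix

mu    <- drop(t(Gamma) %*% w)            # mean vector mu = Gamma'w
theta <- mvrnorm(1, mu, Sigma)           # theta = Gamma'w + e with e ~ N(0, Sigma)
theta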

The conditional item response model and the population model are combined to obtain the unconditional, or marginal, item response model, which incorporates not only performance on the items but also information about the students’ background: \(f(x;\, \xi, \Gamma, w, \Sigma) = \int_\theta f_x(x;\, \xi \mid \theta)\, f_\theta(\theta;\, \Gamma, w, \Sigma)\, d\theta\). The parameters of this MCMLM are Γ, Σ, and ξ. They can be estimated using the software ConQuest® (Wu et al. 1997; see also Adams et al. 1997).

The idea of parametrizing the mean of a multivariate prior distribution for the person abilities can also be applied to the broader family of multidimensional item response models (e.g., Reckase 2009). Alternative models capable of capturing the multidimensional aspects of the data while, at the same time, allowing for the incorporation of covariate information are explanatory item response models (e.g., De Boeck and Wilson 2004). The scaling procedure in PISA may be performed using those models. In further research, it would be interesting to compare the different approaches to scaling the PISA cognitive data.

2.3 Student Score Generation

For each student (response pattern), it is possible to specify a posterior distribution for the latent variable θ, given by \(h_\theta(\theta;\, w, \xi, \Gamma, \Sigma \mid x) = f_x(x;\, \xi \mid \theta)\, f_\theta(\theta;\, \Gamma, w, \Sigma)\, /\, \int_\theta f_x(x;\, \xi \mid \theta)\, f_\theta(\theta;\, \Gamma, w, \Sigma)\, d\theta\). Estimates for θ are random draws from this posterior distribution, and they are called plausible values (e.g., see Mislevy 1991).

Plausible values are drawn in PISA as follows. For each individual n, M vector-valued random deviates \((\varphi_{mn})_{m=1,\ldots,M}\) are sampled from the parametrized multivariate normal distribution. For PISA, the value M = 2,000 has been specified (OECD 2012). These vectors are used to approximate the integral in the equation for the posterior distribution by Monte Carlo integration: \(\int_\theta f_x(x;\, \xi \mid \theta)\, f_\theta(\theta;\, \Gamma, w, \Sigma)\, d\theta \approx \frac{1}{M}\sum_{m=1}^{M} f_x(x_n;\, \xi \mid \varphi_{mn}) = \mathfrak{I}\). The values \(p_{mn} = f_x(x_n \mid \varphi_{mn})\, f_\theta(\varphi_{mn};\, \Gamma, w, \Sigma)\) are calculated, and the set of pairs \((\varphi_{mn},\, p_{mn}/\mathfrak{I})_{m=1,\ldots,M}\) can be used as an approximation of the posterior density; the probability that \(\varphi_{jn}\) is drawn from this density is given by \(q_{jn} = p_{jn}/\sum_{m=1}^{M} p_{mn}\). L uniformly distributed random numbers \((\eta_i)_{i=1}^{L}\) are generated, and for each random draw, the vector \(\varphi_{i_0 n}\) for which the condition \(\sum_{s=1}^{i_0-1} q_{sn} < \eta_i \leq \sum_{s=1}^{i_0} q_{sn}\) is satisfied is selected as a plausible vector.
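The following R sketch mirrors this drawing scheme for a single student, using a unidimensional Rasch model for the conditional part; the item difficulties, responses, and latent regression values are hypothetical, only M follows the PISA specification, and the code is a simplified illustration rather than the operational PISA implementation.

set.seed(1)

xi    <- c(-0.5, 0.2, 1.1)               # hypothetical item difficulties
x_n   <- c(1, 1, 0)                      # scored responses of student n
w_n   <- c(1, 0.3)                       # conditioning variables (incl. intercept)
Gamma <- c(0.1, 0.8)                     # latent regression coefficients
Sigma <- 0.9                             # residual variance (D = 1)
M     <- 2000                            # number of random deviates (OECD 2012)
L     <- 5                               # number of plausible values to draw

# Conditional probability f_x(x; xi | theta) of a response pattern (Rasch model)
f_x <- function(theta, x, xi) prod(plogis(theta - xi)^x * (1 - plogis(theta - xi))^(1 - x))

mu_n <- sum(Gamma * w_n)                           # population model mean Gamma'w
phi  <- rnorm(M, mu_n, sqrt(Sigma))                # M deviates from the population model
p    <- sapply(phi, f_x, x = x_n, xi = xi) *
        dnorm(phi, mu_n, sqrt(Sigma))              # p_mn = f_x(x_n | phi_mn) f_theta(phi_mn)
q    <- p / sum(p)                                 # weights q_mn

# For each uniform draw eta_i, select the deviate phi_{i0} whose cumulative
# weight first exceeds eta_i (the selection condition stated above).
eta <- runif(L)
pv  <- phi[findInterval(eta, cumsum(q)) + 1]       # L plausible values for student n
pv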

A computational question that remains open at this point concerns the mode of drawing plausible values. A perfect reproduction of the generated PISA plausible values is not possible. It also remains unclear whether individual plausible values (for a student, generally five values are generated for each dimension), the means of those values, or even aggregations of individual results (one computed for each plausible value) were used for “classifying” individuals into the proficiency levels.

The MCMLM is fitted to each national data set, based on the international item parameters and national conditioning variables. However, the random sub-sample of students across the participating nations and economies used for estimating the item parameters is not identifiable (e.g., OECD 2009b, p. 197). Hence, the item parameters cannot be reproduced with certainty either.

3 Proficiency Scale Construction and Proficiency Levels

In addition to plausible values, PISA also reports proficiency (scale) levels. The proficiency scales developed in PISA do not describe what students at a given level on the PISA “performance scale” actually did in a test situation; rather, they describe what students at a given level on the PISA “proficiency scale” typically know and can do. Through the scaling procedure discussed in the previous section, it is possible to locate student ability and item difficulty on “performance continua” θ and ξ, respectively. These continua are discretized in a specific way to yield the proficiency scales with their discrete levels.

The methodology used to construct the proficiency scales and to associate students with levels was developed for PISA 2000 and essentially retained for PISA 2009. In the PISA 2000 cycle, defining the proficiency levels progressed in two broad phases. In the first phase, a substantive analysis of the PISA items in relation to the aspects of literacy that underpinned each test domain was carried out. This analysis produced a detailed summary of the cognitive demands of the PISA items and, together with information about the items’ difficulty, descriptions of increasing proficiency. In the second phase, decisions were made about where to set the cut-off points between levels and how to associate students with each level.

To implement these principles, a method was developed that links three variables (for details, see OECD 2012, Chap. 15): the expected success of a student at a particular proficiency level on items uniformly spread across that level (a minimum of 50 % is proposed for students at the bottom of the level, and higher values for other students at that level); the width of a level on the scale (determined largely by substantive considerations of the cognitive demands of items at that level and observations of student performance on the items); and the probability that a student in the middle of the level would correctly answer an item of average difficulty for this level (referred to as the “RP-value” for the scale, where “RP” indicates “response probability”).
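Under a Rasch-type model, the RP-value translates directly into a distance on the ability scale: a student solves an item of difficulty b with probability RP exactly when \(\theta = b + \text{logit}(\text{RP})\). The short R sketch below illustrates this relation; the RP-value of 0.62 is an illustrative choice, not a claim about the operational PISA setting.

# Ability theta at which an item of difficulty b is solved with probability RP,
# obtained from plogis(theta - b) = RP, i.e., theta = b + qlogis(RP)
rp_ability <- function(b, RP) b + qlogis(RP)

rp_ability(b = 0, RP = 0.62)   # approx. 0.49 logits above the item difficulty
rp_ability(b = 0, RP = 0.50)   # with RP = 0.50, theta coincides with b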

As an example, for print reading in PISA 2009, seven levels of proficiency were defined; see Fig. 1.

Fig. 1 Print reading proficiency scale and levels (taken from OECD 2012, p. 266). PISA scales were linear transformations of the natural logit metrics that result from the PISA scaling procedure. Transformations were chosen so that the mean and standard deviation of the PISA scores were 500 and 100, respectively (OECD 2012, p. 143)
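A one-line R function makes the transformation mentioned in the caption concrete; the logit-scale value and the calibration mean and standard deviation used below are hypothetical inputs.

# Linear transformation from the logit metric to the PISA reporting metric
# (mean 500, standard deviation 100); m and s denote the mean and standard
# deviation of the logit scores in the calibration sample (hypothetical here)
to_pisa_scale <- function(theta, m, s) 500 + 100 * (theta - m) / s

to_pisa_scale(theta = 0.8, m = 0.1, s = 1.2)   # approx. 558 PISA points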

A description of the sixth proficiency level can be found in Fig. 2.

Fig. 2 Summary description of the sixth proficiency level on the print reading proficiency scale (taken from OECD 2012, p. 267)

The PISA study provides a basis for international collaboration in defining and implementing educational policies. The described proficiency scales and the distributions of proficiency levels in the different countries play a central role in the reporting of the PISA results. For example, in all international reports the percentage of students performing at each of the proficiency levels is presented (see OECD 2001, 2004, 2007, 2010). Therefore, it is essential to determine the proficiency scales and levels reliably.

Are there alternatives? It is important to note that the specification of the proficiency levels and the classification based on the proficiency scale depend on qualitative expert judgments. Statistical statements about the reliability of the PISA classifications (e.g., in terms of numerical misclassification rates) are, in general, not possible in the sense of a principled psychometric theory. Such a theory can be based on (order-)restricted latent class models (see Sect. 4).

4 Conclusion

The basic psychometric concepts underlying the PISA surveys are elaborate. Complex statistical methods are applied to simultaneously scale persons and items in categorical large-scale assessment data based on latent variables.

A number of questions remain unanswered when it comes to trying to replicate the PISA scaling results. For example, international item parameters are used for student score generation. These parameters are estimated based on a sub-sample of the international student sample. Although all international data sets are freely available (www.oecd.org/pisa/pisaproducts), it is not evident which students were contained in that sub-sample. It would have been easy to add a filter variable or, at least, to describe the randomization process more precisely. Regarding the reproduction of the plausible values, it would seem appropriate to tabulate, at a minimum, the random number seeds. It should also be reported clearly whether the plausible values themselves are aggregated before, for instance, the PISA scores are calculated, or whether the PISA scores are computed separately for each plausible value and then aggregated. Indeed, the sequence of averaging may matter (e.g., von Davier et al. 2009).

An interesting alternative to the “two-step discretization approach” used in PISA for the construction of proficiency scales and levels is psychometric model-based classification, for instance with cognitive diagnosis models (e.g., DiBello et al. 2007; Rupp et al. 2010; von Davier 2010). The latter are discrete latent variable models (restricted latent class models), so no discretization (e.g., based on subjective expert judgments) is necessary, and classification based on these diagnostic models is purely statistical. We expect that such an approach may reduce the classification error.

It may be useful to automate the reporting in PISA. One way to implement this is to use Sweave (Leisch 2002), a tool that allows one to embed R code for complete data analyses in LaTeX documents. The purpose is to create dynamic reports, which can be updated automatically whenever the data or analyses change. This tool could facilitate the reporting in PISA; a minimal sketch is given below. Interestingly, different educational large-scale assessment studies could then be compared heuristically by data or text mining their Technical Reports.
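As a minimal sketch of how such a dynamic report could look, the following Sweave file embeds an R chunk in a LaTeX document; the data file name and the variable holding the proficiency levels are hypothetical.

% Minimal Sweave sketch (file report.Rnw); running Sweave("report.Rnw") in R
% executes the chunk and weaves its result into the LaTeX output
\documentclass{article}
\begin{document}

<<levels, echo=FALSE>>=
pisa <- read.csv("pisa2009_scores.csv")      # hypothetical data file
tab  <- round(100 * prop.table(table(pisa$level)), 1)
@

The percentage of students at each proficiency level is
\Sexpr{paste(names(tab), tab, sep = ": ", collapse = "; ")}.

\end{document}

Rerunning Sweave after a data update regenerates these numbers without any manual editing of the report text.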