1 Introduction

PISA is an international large-scale educational assessment study conducted by member countries of the OECD (2001, 2002a, 2004, 2007, 2010) and investigates how well 15-year-old students approaching the end of compulsory schooling are prepared to meet the challenges of today’s knowledge societies (OECD 2005, 2012). The study does not focus on the students’ achievement regarding a specific school curriculum but rather aims at measuring the students’ ability to use their knowledge and skills to meet real-life challenges (OECD 2009a).

PISA started in 2000 and takes place every 3 years. Proficiency in the domains of reading, mathematics, and science is assessed. In each assessment cycle, one of these domains is chosen as the major domain under fine-grained investigation: reading in 2000, followed by mathematics in 2003, science in 2006, and so on. The definitions of the domains can be found in the respective assessment frameworks (e.g., OECD 2009a). In addition to these domains, further competencies may also be assessed by a participating OECD member; for example, digital reading in 2009 (OECD 2012). Besides the actual test in PISA, student and school questionnaires are used to provide additional background information (e.g., about the socio-economic status of a student). In PISA 2009, for instance, parents in 14 countries were additionally asked to fill in an optional questionnaire. The background information is used in the form of so-called conditioning variables for the scaling of the PISA cognitive (i.e., test) data.

The number of countries (and economies) participating in PISA continues to increase (e.g., 32 and 65 countries for PISA 2000 and 2009, respectively). In each participating country, a sample of at least 150 schools (or all schools) was drawn. In each participating school, 35 students were drawn (in schools with fewer than 35 eligible students, all students were selected).

The PISA study involves a number of technical challenges: for example, the development of the test design, the measurement instruments, and the survey and questionnaire scales. Accurate sampling designs, covering both school sampling and student sampling, must be developed. The multilingual and multicultural nature of the assessment must be taken into account, and various operational control and validation procedures have to be applied. The scaling and analysis of the data, the focus of this paper, require sophisticated psychometric methods, and PISA employs a scaling model based on item response theory (IRT; e.g., Adams et al. 1997; Fischer and Molenaar 1995; van der Linden and Hambleton 1997). The proficiency scales and levels, which are the basic tool in reporting PISA outcomes, are derived through IRT analyses.

The PISA Technical Reports describe these methodologies (OECD 2002b, 2005, 2009b, 2012). The description is provided at a level that allows for review and, potentially, replication of the implemented procedures. In this paper, we recapitulate the scaling procedure used in PISA (Sect. 2). We discuss the construction of the proficiency scales and proficiency levels and explain how the results are reported and interpreted in PISA (Sect. 3). We comment on whether the information provided in the Technical Reports is sufficient to replicate the sampling and scaling procedures and the central results of PISA, on the classification procedures and alternatives thereof, and on other, for instance more automated, ways of reporting in the PISA Technical Report (Sect. 4). Limitations of PISA, together with some reflections and suggestions for improvement, are discussed throughout the paper.

2 Scaling Procedure

To scale the PISA cognitive data, the mixed coefficients multinomial logit model (MCMLM; Adams et al. 1997) is applied (OECD 2012, Chap. 9). This model is a generalized form of the Rasch model (Rasch 1980) in IRT. In the MCMLM, the items are characterized by a fixed set of unknown parameters, ξ, while the student outcome levels, represented by the latent random variable θ, are treated as random effects.

2.1 Notation

Assume I items (indexed \(i = 1, \ldots, I\)) with \(K_i + 1\) possible response categories \((0, 1, \ldots, K_i)\) for an item i. The vector-valued random variable \(X_i' = (X_{i1}, X_{i2}, \ldots, X_{iK_i})\) of order \(1 \times K_i\), with \(X_{ij} = 1\) if the response of the person to item i is in category j and \(X_{ij} = 0\) otherwise, indicates the \(K_i + 1\) possible responses of the person to item i. The zero category of an item is denoted by a vector of zeros, which makes the zero category a reference category for model identification. Collecting the \(X_i'\) together into a vector \(X' = (X_1', X_2', \ldots, X_I')\) of order \(1 \times t\) (\(t = K_1 + \cdots + K_I\)) gives the response vector, or response pattern, of the person on the whole test.

In addition to the response vector X (person level), assume a \(1 \times p\) vector \(\xi' = (\xi_1, \xi_2, \ldots, \xi_p)\) of p parameters (\(p \geq I\)) describing the I items. These are often interpreted as the items’ difficulties. In the response probability model, linear combinations of these parameters are used to describe the empirical characteristics of the response categories of each item. To define these linear combinations, a set of design vectors \(a_{ij}\) (\(i = 1, \ldots, I\); \(j = 1, \ldots, K_i\)), each of length p, is collected into a \(p \times t\) design matrix \(A' = (a_{11}, \ldots, a_{1K_1}, a_{21}, \ldots, a_{2K_2}, \ldots, a_{I1}, \ldots, a_{IK_I})\), and the linear combinations are calculated as \(A\xi\) (of order \(t \times 1\)). In the multidimensional version of the model, it is assumed that \(D \geq 2\) latent traits underlie the persons’ responses. The scores of the individuals on these latent traits are collected in the \(D \times 1\) vector \(\theta = (\theta_1, \ldots, \theta_D)'\), where the θ’s are real-valued and often interpreted as the persons’ abilities.

In the model, the notion of a response score \(b_{ijd}\) is also introduced, which gives the performance level of an observed response in category j of item i with respect to dimension d (\(d = 1, \ldots, D\)). For dimension d and item i, the response scores across the \(K_i\) categories of item i can be collected in a \(K_i \times 1\) vector \(b_{id} = (b_{i1d}, \ldots, b_{iK_id})'\) and, across the D dimensions, in the \(K_i \times D\) scoring sub-matrix \(B_i = (b_{i1}, \ldots, b_{iD})\). For all items, the response scores can be collected in a \(t \times D\) scoring matrix \(B = (B_1', \ldots, B_I')'\).

2.2 MCMLM

The probability \(\Pr(X_{ij} = 1;\, A, B, \xi \mid \theta)\) of a response in category j of item i, given an ability vector θ, is \(\exp(b_{ij}\theta + a_{ij}'\xi)\,/\,(1 + \sum_{q=1}^{K_i} \exp(b_{iq}\theta + a_{iq}'\xi))\), where \(b_{iq}\) is the qth row of the corresponding matrix \(B_i\), and \(a_{iq}\) is the \((\sum_{l=1}^{i-1} K_l + q)\)th row of the matrix A. The conditional item response model (conditional on a person’s ability θ) can then be expressed as \(f_x(x;\, \xi \mid \theta) = \exp[x'(B\theta + A\xi)]\,/\,\sum_{z} \exp[z'(B\theta + A\xi)]\), where x is a realization of X and the sum in the denominator runs over all possible response vectors z.
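As a minimal numerical illustration of these category probabilities, the following R sketch computes them for a single polytomous item under a partial-credit-type parametrization with D = 1; the design rows, response scores, and step difficulties are hypothetical values chosen for illustration, not parameters from any PISA scale.

K  <- 2                                          # item with categories 0, 1, 2
b  <- c(1, 2)                                    # response scores b_i1, b_i2 (D = 1)
a  <- matrix(c(-1,  0,                           # design rows a_i1', a_i2': category j
               -1, -1), nrow = K, byrow = TRUE)  # picks up minus the first j step parameters
xi <- c(0.3, 0.7)                                # hypothetical step difficulties

# Pr(X_ij = 1 | theta) = exp(b_ij*theta + a_ij'xi) / (1 + sum_q exp(b_iq*theta + a_iq'xi))
category_probs <- function(theta) {
  num <- exp(b * theta + as.vector(a %*% xi))
  c(1, num) / (1 + sum(num))                     # probabilities of categories 0, ..., K
}

round(category_probs(theta = 0.5), 3)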

In the conditional item response model, θ is given. The unconditional, or marginal, item response model requires the specification of a density \(f_\theta(\theta)\). In the PISA scaling procedure, students are assumed to have been sampled from a multivariate normal population with mean vector μ and variance-covariance matrix Σ, that is, \(f_\theta(\theta) = ((2\pi)^D \vert\Sigma\vert)^{-1/2} \exp[-(\theta - \mu)'\Sigma^{-1}(\theta - \mu)/2]\). Moreover, this mean vector is parametrized as \(\mu = \Gamma'w\), so that \(\theta = \Gamma'w + e\), where w is a \(u \times 1\) vector of u fixed and known background values for a student, Γ is a \(u \times D\) matrix of regression coefficients, and the error term e is \(N(0, \Sigma)\). In PISA, \(\theta = \Gamma'w + e\) is referred to as the latent regression, and w comprises the so-called conditioning variables (e.g., gender, grade, or school size). This is the population model.
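The population model can be sketched in a few lines of R; the dimension D = 2, the conditioning variables, and all coefficient values below are hypothetical and serve only to make the latent regression \(\theta = \Gamma'w + e\) concrete.

library(MASS)                            # provides mvrnorm for multivariate normal draws

w     <- c(1, 0.5)                       # u = 2 conditioning variables (incl. intercept)
Gamma <- matrix(c(0.2, 0.1,
                  0.4, 0.3), nrow = 2)   # u x D matrix of regression coefficients
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 1.0), nrow = 2)   # D x D residual covariance matrix

mu    <- drop(t(Gamma) %*% w)            # mean vector mu = Gamma'w
theta <- mvrnorm(1, mu, Sigma)           # theta = Gamma'w + e with e ~ N(0, Sigma)
theta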

The conditional item response model and the population model are combined to obtain the unconditional, or marginal, item response model, which incorporates not only performance on the items but also information about the students’ background: \(f(x;\, \xi, \Gamma, w, \Sigma) = \int_\theta f_x(x;\, \xi \mid \theta)\, f_\theta(\theta;\, \Gamma, w, \Sigma)\, d\theta\). The parameters of this MCMLM are Γ, Σ, and ξ. They can be estimated using the software ConQuest® (Wu et al. 1997; see also Adams et al. 1997).

The idea of parametrizing the mean of a multivariate prior distribution for the person abilities can also be applied to the broader family of multidimensional item response models (e.g., Reckase 2009). Alternative models capable of capturing the multidimensional aspects of the data while, at the same time, allowing for the incorporation of covariate information are explanatory item response models (e.g., De Boeck and Wilson 2004). The scaling procedure in PISA may be performed using those models. In further research, it would be interesting to compare the different approaches to scaling the PISA cognitive data.

2.3 Student Score Generation

For each student (response pattern), it is possible to specify a posterior distribution for the latent variable θ, given by \(h_\theta(\theta;\, w, \xi, \Gamma, \Sigma \mid x) = f_x(x;\, \xi \mid \theta)\, f_\theta(\theta;\, \Gamma, w, \Sigma)\, /\, \int_\theta f_x(x;\, \xi \mid \theta)\, f_\theta(\theta;\, \Gamma, w, \Sigma)\, d\theta\). Estimates for θ are random draws from this posterior distribution, and they are called plausible values (e.g., see Mislevy 1991).

Plausible values are drawn in PISA as follows. For each individual n, M vector-valued random deviates \((\varphi_{mn})_{m=1,\ldots,M}\) are sampled from the parametrized multivariate normal distribution. For PISA, the value M = 2,000 has been specified (OECD 2012). These vectors are used to approximate the integral in the equation for the posterior distribution by Monte Carlo integration: \(\int_\theta f_x(x;\, \xi \mid \theta)\, f_\theta(\theta;\, \Gamma, w, \Sigma)\, d\theta \approx \frac{1}{M}\sum_{m=1}^{M} f_x(x_n;\, \xi \mid \varphi_{mn}) = \mathfrak{I}\). The values \(p_{mn} = f_x(x_n \mid \varphi_{mn})\, f_\theta(\varphi_{mn};\, \Gamma, w, \Sigma)\) are calculated, and the set of pairs \((\varphi_{mn},\, p_{mn}/\mathfrak{I})_{m=1,\ldots,M}\) can be used as an approximation of the posterior density; the probability that \(\varphi_{jn}\) is drawn from this density is given by \(q_{jn} = p_{jn}/\sum_{m=1}^{M} p_{mn}\). L uniformly distributed random numbers \((\eta_i)_{i=1}^{L}\) are generated, and for each random draw, the vector \(\varphi_{i_0 n}\) for which the condition \(\sum_{s=1}^{i_0-1} q_{sn} < \eta_i \leq \sum_{s=1}^{i_0} q_{sn}\) is satisfied is selected as a plausible vector.
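The following R sketch mirrors this drawing scheme for a single student, using a unidimensional Rasch model for the conditional part; the item difficulties, responses, and latent regression values are hypothetical, only M follows the PISA specification, and the code is a simplified illustration rather than the operational PISA implementation.

set.seed(1)

xi    <- c(-0.5, 0.2, 1.1)               # hypothetical item difficulties
x_n   <- c(1, 1, 0)                      # scored responses of student n
w_n   <- c(1, 0.3)                       # conditioning variables (incl. intercept)
Gamma <- c(0.1, 0.8)                     # latent regression coefficients
Sigma <- 0.9                             # residual variance (D = 1)
M     <- 2000                            # number of random deviates (OECD 2012)
L     <- 5                               # number of plausible values to draw

# Conditional probability f_x(x; xi | theta) of a response pattern (Rasch model)
f_x <- function(theta, x, xi) prod(plogis(theta - xi)^x * (1 - plogis(theta - xi))^(1 - x))

mu_n <- sum(Gamma * w_n)                           # population model mean Gamma'w
phi  <- rnorm(M, mu_n, sqrt(Sigma))                # M deviates from the population model
p    <- sapply(phi, f_x, x = x_n, xi = xi) *
        dnorm(phi, mu_n, sqrt(Sigma))              # p_mn = f_x(x_n | phi_mn) f_theta(phi_mn)
q    <- p / sum(p)                                 # weights q_mn

# For each uniform draw eta_i, select the deviate phi_{i0} whose cumulative
# weight first exceeds eta_i (the selection condition stated above).
eta <- runif(L)
pv  <- phi[findInterval(eta, cumsum(q)) + 1]       # L plausible values for student n
pv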

A computational question that remains open at this point concerns the mode of drawing plausible values. A perfect reproduction of the generated PISA plausible values is not possible. It also remains unclear whether individual plausible values (for a student, generally five values are generated for each dimension), the means of those values, or even aggregations of individual results (one computed for each plausible value) were used for “classifying” individuals into the proficiency levels.

The MCMLM is fitted to each national data set, based on the international item parameters and national conditioning variables. However, the random sub-sample of students across the participating nations and economies used for estimating the item parameters is not identifiable (e.g., OECD 2009b, p. 197). Hence, the item parameters cannot be reproduced with certainty either.

3 Proficiency Scale Construction and Proficiency Levels

In addition to plausible values, PISA also reports proficiency (scale) levels. The proficiency scales developed in PISA do not describe what students at a given level on the PISA “performance scale” actually did in a test situation; rather, they describe what students at a given level on the PISA “proficiency scale” typically know and can do. Through the scaling procedure discussed in the previous section, it is possible to locate student ability and item difficulty on “performance continua” θ and ξ, respectively. These continua are discretized in a specific way to yield the proficiency scales with their discrete levels.

The methodology used to construct the proficiency scales and to associate students with levels was developed for PISA 2000 and essentially retained for PISA 2009. In the PISA 2000 cycle, defining the proficiency levels progressed in two broad phases. In the first phase, a substantive analysis of the PISA items in relation to the aspects of literacy that underpinned each test domain was carried out. This analysis produced a detailed summary of the cognitive demands of the PISA items and, together with information about the items’ difficulty, descriptions of increasing proficiency. In the second phase, decisions were made about where to set the cut-off points between levels and how to associate students with each level.

To implement these principles, a method was developed that links three variables (for details, see OECD 2012, Chap. 15): the expected success of a student at a particular proficiency level on items uniformly spread across that level (a minimum of 50 % is proposed for students at the bottom of the level, and higher values for other students at that level); the width of a level on the scale (determined largely by substantive considerations of the cognitive demands of items at that level and observations of student performance on the items); and the probability that a student in the middle of the level would correctly answer an item of average difficulty for this level (referred to as the “RP-value” for the scale, where “RP” indicates “response probability”).
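Under a Rasch-type model, the RP-value translates directly into a distance on the ability scale: a student solves an item of difficulty b with probability RP exactly when \(\theta = b + \text{logit}(\text{RP})\). The short R sketch below illustrates this relation; the RP-value of 0.62 is an illustrative choice, not a claim about the operational PISA setting.

# Ability theta at which an item of difficulty b is solved with probability RP,
# obtained from plogis(theta - b) = RP, i.e., theta = b + qlogis(RP)
rp_ability <- function(b, RP) b + qlogis(RP)

rp_ability(b = 0, RP = 0.62)   # approx. 0.49 logits above the item difficulty
rp_ability(b = 0, RP = 0.50)   # with RP = 0.50, theta coincides with b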

As an example, for print reading in PISA 2009, seven levels of proficiency were defined; see Fig. 1.

Fig. 1 Print reading proficiency scale and levels (taken from OECD 2012, p. 266). PISA scales were linear transformations of the natural logit metrics that result from the PISA scaling procedure. Transformations were chosen so that the mean and standard deviation of the PISA scores were 500 and 100, respectively (OECD 2012, p. 143)
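A one-line R function makes the transformation mentioned in the caption concrete; the logit-scale value and the calibration mean and standard deviation used below are hypothetical inputs.

# Linear transformation from the logit metric to the PISA reporting metric
# (mean 500, standard deviation 100); m and s denote the mean and standard
# deviation of the logit scores in the calibration sample (hypothetical here)
to_pisa_scale <- function(theta, m, s) 500 + 100 * (theta - m) / s

to_pisa_scale(theta = 0.8, m = 0.1, s = 1.2)   # approx. 558 PISA points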

A description of the sixth proficiency level can be found in Fig. 2.

Fig. 2 Summary description of the sixth proficiency level on the print reading proficiency scale (taken from OECD 2012, p. 267)

The PISA study provides a basis for international collaboration in defining and implementing educational policies. The described proficiency scales and the distributions of proficiency levels in the different countries play a central role in the reporting of the PISA results. For example, in all international reports the percentage of students performing at each of the proficiency levels is presented (see OECD 2001, 2004, 2007, 2010). Therefore, it is essential to determine the proficiency scales and levels reliably.

Are there alternatives? It is important to note that the specification of the proficiency levels and the classification based on the proficiency scale depend on qualitative expert judgments. Statistical statements about the reliability of the PISA classifications (e.g., in terms of numerical misclassification rates) are, in general, not possible in the sense of a principled psychometric theory. Such a theory can be based on (order-)restricted latent class models (see Sect. 4).

4 Conclusion

The basic psychometric concepts underlying the PISA surveys are elaborate. Complex statistical methods are applied to simultaneously scale persons and items in categorical large-scale assessment data based on latent variables.

A number of questions remain unanswered when it comes to trying to replicate the PISA scaling results. For example, international item parameters are used for student score generation. These parameters are estimated based on a sub-sample of the international student sample. Although all international data sets are freely available (www.oecd.org/pisa/pisaproducts), it is not evident which students were contained in that sub-sample. It would have been easy to add a filter variable or, at least, to describe the randomization process more precisely. Regarding the reproduction of the plausible values, it would seem appropriate to tabulate, at a minimum, the random number seeds. It should also be reported clearly whether the plausible values themselves are aggregated before, for instance, the PISA scores are calculated, or whether the PISA scores are computed separately for each plausible value and then aggregated. Indeed, the sequence of averaging may matter (e.g., von Davier et al. 2009).

An interesting alternative to the “two-step discretization approach” used in PISA for the construction of proficiency scales and levels is psychometric model-based classification, for instance with cognitive diagnosis models (e.g., DiBello et al. 2007; Rupp et al. 2010; von Davier 2010). The latter are discrete latent variable models (restricted latent class models), so no discretization (e.g., based on subjective expert judgments) is necessary, and classification based on these diagnostic models is purely statistical. We expect that such an approach may reduce the classification error.

It may be useful to automate the reporting in PISA. One way to implement this is to use Sweave (Leisch 2002), a tool that allows one to embed R code for complete data analyses in LaTeX documents. The purpose is to create dynamic reports, which can be updated automatically whenever the data or analyses change. This tool could facilitate the reporting in PISA; a minimal sketch is given below. Interestingly, different educational large-scale assessment studies could then be compared heuristically by data or text mining their Technical Reports.
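As a minimal sketch of how such a dynamic report could look, the following Sweave file embeds an R chunk in a LaTeX document; the data file name and the variable holding the proficiency levels are hypothetical.

% Minimal Sweave sketch (file report.Rnw); running Sweave("report.Rnw") in R
% executes the chunk and weaves its result into the LaTeX output
\documentclass{article}
\begin{document}

<<levels, echo=FALSE>>=
pisa <- read.csv("pisa2009_scores.csv")      # hypothetical data file
tab  <- round(100 * prop.table(table(pisa$level)), 1)
@

The percentage of students at each proficiency level is
\Sexpr{paste(names(tab), tab, sep = ": ", collapse = "; ")}.

\end{document}

Rerunning Sweave after a data update regenerates these numbers without any manual editing of the report text.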