1 Introduction

Large-scale assessments of learning have shaped public discourse and influenced educational policy in many countries (Breakspear 2012; Grek 2009). While attention has mostly focused on students’ achievement, large-scale assessment studies also provide rich information regarding the school context and educational processes such as teaching practices. Given its importance for student learning, teaching quality needs to be evaluated using measures that are reliable and valid (Klieme 2013; Marsh and Roche 1997; Müller et al. 2016; Wallace et al. 2016).

This study evaluates indicators of science teaching quality provided by the 2015 cycle of the Programme for International Student Assessment (PISA, OECD 2016). We focus on PISA for several reasons. First, while many studies have used PISA’s teaching scales to address substantive questions (Aditomo and Klieme 2020; Gee and Wong 2012; Hwang et al. 2018; Jiang and McComas 2015; McConney et al. 2014), few have focused on evaluating their psychometric properties. We found only one study which examined PISA’s teaching scales across all participating countries (Wenger et al. 2018). Wenger et al.’s study did not consider teaching quality in science and only focused on reliability.

Second, PISA randomly samples students (as opposed to classes) from schools. Thus, students from the same school likely come from different classes and may be taught by different teachers. When aggregated at the school level, student ratings in PISA do not refer to specific teachers (Wang and Degol 2016). For this reason, studies utilizing PISA data have mostly avoided the aggregation of its teaching measures by treating them as individual-level constructs (Aditomo and Klieme 2020). This is problematic because teaching is mostly a class-level process, and hence, its effect is best assessed at the classroom level. Some authors have conceptualised teaching quality as a dimension of school climate, suggesting the possibility of assessing it at the school level (Samuel n.d.; Wang and Degol 2016). Few studies to date have critically addressed the extent to which PISA’s teaching measures can be used to assess teaching as a school-level characteristic.

1.1 Conceptualising teaching quality

Teachers and the way they teach are major factors which determine students’ learning outcomes. Teachers vary in how effective they are in improving cognitive outcomes (Blazar 2015; Gershenson 2016; Jackson 2012; Slater et al. 2012) as well as affective and behavioural outcomes (Jennings and Di Prete 2010; Kraft and Grace 2016). Similarly, there is also large variation in the effects of different teaching practices on student learning, as powerfully shown in the work of John Hattie (2008; 2017). Teacher and teaching effects are linked in that effective teachers are those who practise more effective teaching. One study indicated that teacher effects seem to be less related to teachers’ background, experience, and qualifications, and more to what they do in their classrooms, i.e. their teaching practices (Slater et al. 2012). Another study showed that specific teaching practices explain why some teachers are more effective than others, with emotional support predicting differences in affective outcomes and classroom management predicting behavioural outcomes (Blazar and Kraft 2017).

Teaching quality can be described in terms of generic dimensions which characterise effective teaching practices. A number of frameworks have been proposed to describe these key dimensions. According to one framework, teaching quality has three basic dimensions: classroom management, student support, and cognitive activation (Klieme et al. 2009; Praetorius et al. 2018). Each of these dimensions facilitates a different aspect of learning, namely time on task, motivation, and knowledge construction, respectively. As detailed below, the relation of each dimension to learning is supported by different theories and research traditions.

Classroom management refers to the organisation and structure of lessons, which involves the establishment of clear rules, frequent monitoring of student behaviour, effective response to disruptions, and efficient use of time (Praetorius et al. 2018). The importance of classroom management is highlighted in early models of school learning which defined learning opportunity in terms of “time on task” (Carroll 1989). In that model, good classroom management produces an orderly climate which allows students to focus their attention on the relevant materials and activities. The opportunity to learn provided by an orderly climate may not necessarily lead students to develop a deeper understanding of the materials. However, a disruptive climate is assumed to lead to frustration and diminished motivation (Carroll 1989; Egeberg et al. 2016; Emmer and Stough 2001). In other words, good classroom management should correlate positively with student motivation. In many studies, including PISA, classroom management is measured by way of students’ reports of how orderly or disruptive their typical lessons are.

Student support, the second teaching quality dimension, refers to teacher actions which cater to students’ psychological needs. The importance of student support is emphasised by theories of motivation such as self-determination theory (Ryan and Deci 2000). According to this theory, teachers and schools should strive to fulfil students’ basic psychological needs of autonomy (feeling empowered to exercise individual choice), belongingness (feelings of being valued members of a community), and competence (feelings of having the opportunity to learn and grow). In PISA, student support is most directly reflected in students’ reports of emotional support, personal feedback, and adaptive instruction. From a self-determination theory perspective, these measures should help fulfil students’ needs, which in turn should promote intrinsic motivation (Deci et al. 1991).

Two other teaching measures in PISA, teacher-directed and inquiry-based instruction, can also be seen as catering for students’ psychological needs, albeit in a less direct manner. Teacher-directed instruction provides cognitive scaffolds which, when properly implemented, should help students feel competent. Inquiry-based instruction, again when properly implemented, provides room for personal choices (e.g. in designing experiments and analysing data), which should help cater for students’ need for autonomy. Hence, both forms of instruction should correlate positively with intrinsic motivation.

The final basic teaching quality dimension, cognitive activation, refers to activities which prompt students to engage in deep processing of the learning materials. The importance of cognitive activation is highlighted by constructivist theories of learning, which assume that learning can only occur through active construction of knowledge by the learner (Bransford et al. 2000; Derry 1996; Wittrock 2010). Cognitive activation can be implemented through the provision of tasks which activate relevant prior knowledge, the scaffolding of metacognitive processes, and the use of questions and collaborative discourse around important ideas (Praetorius et al. 2018). Theoretically, measures of cognitive activation should predict students’ scores in achievement tests. However, unlike in previous PISA cycles, cognitive activation was not directly assessed in 2015.

1.2 Teaching quality as a classroom-level phenomenon

Teaching is primarily a classroom process, and therefore, teaching quality is considered to be a classroom-level phenomenon (Cooley and Leinhardt 1980). Due to the comparative proximity of the classroom environment to students’ experiences, classroom-level processes such as teaching are considered to have a stronger influence on student achievement than school-level factors (Kyriakides et al. 2000; Scheerens and Creemers 1989). Indeed, teaching quality is seen as a key factor which mediates the effects of tangible components of the education system (e.g. teacher qualifications, school infrastructure, and programmes) on student learning outcomes (Creemers 1994). Several studies have found that student achievement often varies more between classrooms than between schools (Goldstein 1997; Muthén 1991). This further suggests that teaching quality may vary substantially between classrooms within the same school.

The implication is that teaching quality should be evaluated at the classroom level. Analysis which ignores the classroom level has been shown to produce distorted estimates of variance, biased standard errors, and thus potentially misleading conclusions about the relationships between the variables of interest (Hutchinson and Healy 2001; Moerbeek 2004). More specifically, omitting the classroom level in such analysis tends to inflate the between-school variance, while simultaneously underestimating the overall effect of schooling on student achievement (Martínez 2012). Furthermore, without explicitly modelling classroom-level factors, the strength of relationships between process variables and student achievement can be mistakenly attributed to school-level factors—thereby masking inequalities in educational opportunities which may exist within each school (Martínez 2012).

Overall, these considerations suggest that in addressing questions regarding the effects of schooling on student learning, and the mechanisms which explain those effects, researchers need to explicitly model classroom-level variables. This poses a particular problem for researchers who are working with data which omit classroom identification, such as PISA. Can such data yield meaningful insights regarding the quality of teaching? We postpone discussing this issue until after describing the use of student ratings to evaluate teaching.

1.3 Using student ratings to evaluate teaching quality

Many studies evaluate teaching quality based on student rating data. In addition to being relatively efficient, such data are typically based on students’ extensive experience of the assessed behaviours. Moreover, student ratings reflect students’ interpretations of the learning environment, which is an important mediator between teaching and learning. Research suggests that student ratings are not simply a function of teacher popularity (Fauth et al. 2014; Kunter and Baumert 2006) and can yield insights that complement information from teacher self-reports (Aldrup et al. 2018; Kunter and Baumert 2006).

Regardless of whether teaching is analysed at the classroom or the school level, evaluating teaching quality based on student ratings requires the aggregation of individual-level (L1) data into the group level (L2). The current methodological best practice for achieving this is the application of doubly-latent multilevel models (Morin et al. 2013). In such models, multiple ratings from each student are treated as manifest (observable) indicators of a latent (unobservable) construct. In this case, teaching quality is regarded as a construct that is latent in relation to the multiple items in a scale measuring a specific aspect of teaching. In other words, the construct represents each student’s perception of an objective aspect of teaching. Inconsistencies between item responses are regarded as measurement error.

In addition, these models are also latent in the sense that L2 constructs are inferred from latent aggregations of responses from multiple students at L1 (Morin et al. 2013). In other words, teaching quality at L2 is inferred from latent L1 constructs, instead of being simply formed by averaging manifest responses. Here, students from the same classroom may differ in their ratings of teaching quality. This variation between latent L1 constructs can be interpreted as representing either sampling error or meaningful variation in how an aspect of teaching is implemented within the same lesson/classroom. The latter interpretation makes sense when teachers differentiate their teaching in response to individual students’ needs within the same lesson.

Doubly-latent multilevel models guard researchers against the ecological fallacy, i.e. falsely assuming that effects observed at one level can be generalised to another level (Morin et al. 2013). In addition, such models allow researchers to simultaneously control for measurement error due to the sampling of items and sampling error due to the sampling of students from a classroom or school (Morin et al. 2013). Measurement error is controlled for by utilizing multiple manifest indicators to measure a latent construct, while sampling error is controlled for through the aggregation of responses from multiple students to represent classroom- or school-level constructs. As an illustration, a doubly-latent multilevel model for teacher-directed instruction is depicted in Fig. 1.
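To make the structure of such a model concrete, the following minimal sketch (in Python, with purely illustrative parameter values that are not PISA estimates) simulates data of the kind a doubly-latent model assumes for a four-item scale such as teacher-directed instruction: a school-level latent factor, student-level latent deviations, and item-level measurement error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, n_students, n_items = 200, 30, 4   # hypothetical design, not the PISA sample

sigma2_b, sigma2_w = 0.10, 0.90               # illustrative latent variances at L2 and L1

# School-level latent teaching quality (eta_B) and student-level deviations (eta_W)
eta_b = rng.normal(0.0, np.sqrt(sigma2_b), size=n_schools)
eta_w = rng.normal(0.0, np.sqrt(sigma2_w), size=(n_schools, n_students))
eta = eta_b[:, None] + eta_w                  # each student's latent perception of teaching

# Manifest item responses: latent perception weighted by loadings plus measurement error
loadings = np.array([0.80, 0.70, 0.75, 0.65])
items = loadings * eta[:, :, None] + rng.normal(0.0, 0.5, size=(n_schools, n_students, n_items))

# A doubly-latent model would recover the school-level factor by latent aggregation;
# here we simply verify that the simulated latent trait has the intended ICC-1 (about 0.10).
print(round(eta_b.var() / (eta_b.var() + eta_w.var()), 3))
```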

Fig. 1 Measurement models representing teacher-directed instruction

1.4 Reliability and validity of teaching quality measures

The use of doubly-latent multilevel models to assess teaching quality means that reliability and validity need to be ensured at both L1 and L2. Reliability is typically assessed by calculating internal consistency, which can be estimated at both levels (Hox 2010; Miller and Murdock 2007; Raudenbush and Bryk 2002). To determine internal consistency at L1, the most commonly applied measures are Cronbach’s alpha and different forms of the rWG index proposed by James et al. (1984; see, e.g., Wenger et al. 2018). The L1 reliability of the PISA teaching scales has been examined and found to be adequate (OECD 2017).

Meanwhile, reliability at L2 is often assessed using ICC-1 (Shrout and Fleiss 1979), which is defined by:

$$ \text{ICC-1}=\frac{\sigma_{\eta_{\mathrm{B}}}^2}{\sigma_{\eta_{\mathrm{B}}}^2+\sigma_{\eta_{\mathrm{W}}}^2} \tag{1} $$

where \( \sigma_{\eta_{\mathrm{W}}}^2 \) is the variance of the latent trait η at the within level (L1) and \( \sigma_{\eta_{\mathrm{B}}}^2 \) is the variance of the latent trait η at the between level (L2). ICC-1 hence indicates the proportion of latent-trait variance located at L2. It can take values between 0 and 1, with high values indicating that a large share of the variance in the latent variable is due to the clustering of individuals.

Another essential measure for evaluating the reliability of latent constructs at L2 is ICC-2, which is calculated similarly to ICC-1 (Raudenbush and Bryk 2002; Shrout and Fleiss 1979):

$$ \text{ICC-2}=\frac{\sigma_{\eta_{\mathrm{B}}}^2}{\sigma_{\eta_{\mathrm{B}}}^2+\frac{\sigma_{\eta_{\mathrm{W}}}^2}{\overline{n}}} \tag{2} $$

where \( \overline{n} \) represents the average cluster size. Because the within-level variance of the latent trait η is divided by the average cluster size, large clusters shrink the within-level component in the denominator, which increases ICC-2. ICC-2 is hence considered the reliability of group mean scores in relation to sampling error (Lüdtke et al. 2011; Stapleton et al. 2016). Sampling error occurs when not all students in a class or school provide data. Like ICC-1, ICC-2 can take values between 0 and 1.
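As a simple computational illustration of Eqs. 1 and 2, the sketch below (with hypothetical variance components scaled so that the total latent variance is 1, and helper names icc1 and icc2 of our own choosing) shows how the same ICC-1 yields very different ICC-2 values depending on the average cluster size; the example values roughly correspond to the adaptive instruction figures for Hungary and Montenegro discussed later.

```python
def icc1(var_between: float, var_within: float) -> float:
    """Eq. 1: proportion of latent-trait variance located at the school level (L2)."""
    return var_between / (var_between + var_within)

def icc2(var_between: float, var_within: float, avg_cluster_size: float) -> float:
    """Eq. 2: reliability of the latent school mean, given the average cluster size."""
    return var_between / (var_between + var_within / avg_cluster_size)

# Hypothetical variance components giving an ICC-1 of about 0.056
var_b, var_w = 0.056, 0.944
print(round(icc1(var_b, var_w), 3))        # 0.056
print(round(icc2(var_b, var_w, 18.5), 3))  # about 0.52 with 18-19 students per school
print(round(icc2(var_b, var_w, 75.0), 3))  # about 0.82 with roughly 75 students per school
```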

Multilevel models also need to be checked in terms of their validity. One important source of evidence to consider is the unidimensionality of the model at both L1 and L2. This is depicted visually in Fig. 1 for the case of teacher-directed instruction. This model assumes that the inter-item covariation can be explained by one latent factor. At L1, this means that a student’s response to one item should be consistent with responses to the other items because the student has a certain perception of teacher-directed instruction in his/her science lessons. Correspondingly at L2, the aggregated response to an item (from students in a school) should be consistent with the aggregated responses to the other items because the students have a shared perception about teacher-directed instruction in their school. The extent to which these assumptions are met is the issue of structural or factorial validity which can be evaluated using multilevel confirmatory factor analysis (Brown 2015).

Model fit within the confirmatory factor analysis (CFA) framework can be evaluated via several indices. The standardised root mean square residual (SRMR) conceptually reflects the discrepancy between observed and model-predicted correlations. The SRMR is particularly useful for evaluating multilevel models because it can be calculated separately for L1 and L2. In this paper, we focus on the SRMR at the between level, since we are particularly interested in how well the model describes the variance-covariance matrix at the school level. Another index is the root mean square error of approximation (RMSEA), which is based on the chi-square but takes into account sample size and model complexity (complex models are penalised and more parsimonious ones are rewarded). Smaller RMSEA values reflect better models (Hu and Bentler 1999). Other fit indices, such as the Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI), are based on comparing the proposed model’s chi-square to that of a more restricted baseline model. Typically, this baseline is an “independence model” which assumes that the covariances among items are zero.
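For readers who wish to compute these indices themselves, the sketch below shows the standard formulas based on a model’s chi-square, degrees of freedom, sample size, and (for the SRMR) residual correlations. The chi-square values in the example are invented for illustration and do not come from our analyses.

```python
import numpy as np

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation: misfit per degree of freedom, adjusted for N."""
    return float(np.sqrt(max(chi2 - df, 0.0) / (df * (n - 1))))

def cfi(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Comparative fit index: improvement of the model over the independence baseline."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 if d_b == 0 else 1.0 - d_m / d_b

def tli(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Tucker-Lewis index: baseline comparison with a penalty for model complexity."""
    return ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)

def srmr(observed: np.ndarray, implied: np.ndarray) -> float:
    """Standardised root mean square residual over the non-redundant correlations;
    in multilevel models it is computed separately per level (e.g. on the L2 matrices)."""
    rows, cols = np.tril_indices_from(observed)
    resid = observed[rows, cols] - implied[rows, cols]
    return float(np.sqrt(np.mean(resid ** 2)))

# Invented chi-square values for a model with 20 df, a baseline with 28 df, and N = 5000
print(round(rmsea(120.0, 20, 5000), 3), round(cfi(120.0, 20, 2500.0, 28), 3),
      round(tli(120.0, 20, 2500.0, 28), 3))
```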

In addition to factor structure, another source of validity evidence is the relation between the construct of interest and other relevant variables. Since our study concerns teaching effects, we are interested in whether teaching quality is related to learning outcomes. As described in the preceding section, the teaching measures in PISA are theoretically related to intrinsic motivation, such that schools with better teaching quality should also have students with higher collective intrinsic motivation. Relationships between intrinsic motivation and classroom management, student support, and teacher-directed and inquiry-based instruction have been documented in prior empirical studies (Aditomo 2020; Aditomo and Klieme 2020; Aldrup et al. 2018; Decristan et al. 2016; Fauth et al. 2014; Kunter and Baumert 2006; Rjosk et al. 2014; Schiepe-Tiska 2019; Wallace et al. 2016). Thus, associations with intrinsic motivation at the school level can be used as one source of evidence to assess the validity of the teaching quality measures in PISA.

1.5 Plausibility of evaluating school-level teaching quality

As discussed above, teaching is primarily a classroom process and hence its quality should be evaluated at the classroom level. However, PISA randomly selects students from within a school. Therefore, the L2 in PISA refers not to intact classes, but to schools which may be composed of more than one class and one teacher. This raises concerns regarding the meaning, reliability, and validity of teaching quality when measured at the school level.

Nonetheless, there are reasons to suggest the plausibility of interpreting PISA’s teaching quality scales at the school level. The simplest reason is that students in a school may be referring to the same science teacher, even when they are not from the same classroom. This possibility is stronger in schools that are relatively small, and in schools or education systems where science is taught as a general science subject (and hence could be taught by the same teacher). If this is the case, then school-level teaching quality actually reflects teacher effects. Unfortunately, PISA does not collect information to identify schools which employ only one science teacher.

Even if students in a school are referring to different teachers, there are other reasons to support interpreting teaching quality as a school-level property. Theoretically, teaching quality can be seen as a part of school climate. School climate has been defined as perceptions about values, beliefs, interactions, and relationships held by students and staff within a school (Rudasill et al. 2018). School climate is often described as covering three domains: academic, social, and organisational (Wang and Degol 2016).

From this perspective, teachers in a school may have shared values and beliefs regarding what constitutes good teaching, i.e. the kinds of teaching practices and student-teacher interactions which are encouraged and supported (e.g. via training programmes) by the school leadership and community. For example, some school policies may encourage teachers to be more attentive to students’ individual learning needs (hence allowing teachers to implement higher levels of adaptive instruction, emotional support, and personal feedback). Other schools may have whole-school positive discipline programmes, which teachers implement in the form of more effective classroom management strategies.

A more technical reason has to do with the wording of the items in the PISA teaching scales. Items in most of the scales refer to some aspect of the classroom situation or learning environment: teachers, teacher behaviour, teaching practice, or classroom climate. For example, classroom management items refer to students as a group (“Students don’t listen to what the teacher says”) or the classroom situation (“There is noise and disorder”). Adaptive instruction items refer to the teacher’s typical behaviour (e.g. “The teacher adapts the lesson to my class’s needs and knowledge”). Teacher-directed instruction items refer to classroom activities (e.g. “A whole class discussion takes place with the teacher”). The one exception is the personal feedback scale, which is composed of items referring to the individual student’s experience. The item “The teacher tells me how I am performing in this course”, for instance, reflects the student’s personal experience of receiving feedback. Because students may vary in their experience of feedback, the L1 latent factor bears substantive meaning. From this perspective, the personal feedback scale should exhibit relatively lower L2 reliability and validity compared with the other teaching scales.

1.6 Research questions

The preceding section presents reasons for the plausibility of interpreting PISA’s teaching quality scales at the school level. The extent to which they are supported by empirical data is the focus of this article. This is addressed by examining six scales provided by PISA 2015 intended to assess the following dimensions of teaching quality in science: classroom discipline, adaptive instruction, emotional support, constructive feedback, teacher-directed instruction, and inquiry-based instruction.

We are aware of only one prior study that investigated the reliability of PISA’s school-level teaching quality indicators (Wenger et al. 2018). That study focused on mathematics and reading. Our study contributes to the literature by considering not only reliability but also factorial and predictive validity of the teaching scales. We focus on the science domain in PISA 2015. Our specific research questions are as follows:

  1. How reliable are PISA’s science teaching scales when used to assess school-level teaching quality, and how much does reliability vary across regions/countries?

  2. To what extent do the teaching scales fit a two-level unidimensional factor structure, and how much does this vary across regions/countries?

  3. Do the school-level factor scores exhibit positive relations with intrinsic motivation, and how much does this vary across regions/countries?

Regarding research question three, we postulate that higher ratings of teaching quality should be associated with higher intrinsic motivation to learn. This postulation is based on the theoretical considerations outlined in the preceding section.

2 Method

2.1 Sample

The PISA student sample is drawn using a two-stage stratified procedure in which each participating country/region first randomly selects schools and then randomly selects 15-year-old students within those schools. We use the PISA 2015 sample, which consisted of 503,146 students from 17,678 schools in 69 countries/regions (see Appendix for details). We include all regions/countries in PISA for practical reasons: PISA’s strength lies in the international nature of its database, and researchers often use it to conduct comparative analyses across many different regions/countries. We intend our study to be useful to future researchers who wish to examine science teaching quality using PISA 2015, in whichever region/country they are interested.

2.2 Instruments

The six teaching scales in PISA 2015 are (OECD 2017):

  • Adaptive instruction—three items referring to the science teacher’s adaptations of his/her instruction in response to the students’ needs (e.g. “The teacher adapts the lesson to my class’s needs and knowledge” and “The teacher changes the structure of the lesson on a topic that most students find difficult to understand”).

  • Classroom management—five items describing the disciplinary climate in the science lessons (e.g. “Students don’t listen to what the teacher says” and “There is noise and disorder”).

  • Teacher-directed instruction—four items referring to teacher-led activities in the science lessons (e.g. “The teacher explains scientific ideas” and “A whole class discussion takes place with the teacher”).

  • Emotional support—five items describing the teacher’s commitment to create a supportive climate in the science lessons (e.g. “The teacher shows an interest in every student’s learning” and “The teacher continues teaching until the students understand”).

  • Personal feedback—five items describing personal feedback that the student receives from the science teacher (e.g. “The teacher tells me how I am performing in this course” and “The teacher tells me in which areas I can still improve”).

  • Inquiry-based instruction—eight items describing the use of inquiry activities in the science lessons (e.g. “Students spend time in the laboratory doing practical experiments” and “Students are required to argue about science questions”).

Meanwhile, intrinsic motivation is defined as enjoyment of and interest in learning about science. It is measured using five items (e.g. “I generally have fun when I am learning <broad science> topics”). The four response options for each item on every scale were “in all lessons” (or “every lesson”), “most lessons”, “some lessons”, and “never or hardly ever”. For this study, all items were scored such that higher scores reflected higher levels of teaching quality.

2.3 Analysis

Data preparation was conducted using SPSS v.23, while all analyses were conducted using Mplus 8 (Muthén and Muthén 2017). To answer research question one, reliability at L2 was determined by calculating ICC-1 and ICC-2 for each scale in each participating country/region. The within- and between-school variances of the latent factors estimated from the two-level CFA—with students at L1 and schools at L2—were used to compute ICC-1 and ICC-2 according to Eqs. 1 and 2, respectively.

For shared constructs, ICC-1 values often lie around 0.10 and seldom exceed 0.30 (see, e.g., Klein et al. 2000; Stapleton and Hancock 2016; Wagner et al. 2016). Meanwhile, LeBreton and Senter (2008) suggest that values of 0.01, 0.10, and 0.25 correspond to small, medium, and large ICC-1 values. For manifest measures, Klein et al. (2000) provide rules of thumb for ICC-2: 0.70 is considered acceptable and 0.50 marginally reliable. LeBreton and Senter (2008) recommend values between 0.70 and 0.85 to justify aggregation of ratings. Note, however, that ICC-2 depends on the average cluster size as well as on the degree of agreement among individuals; ICC-2 values therefore rise with increasing average cluster sizes.
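Purely to illustrate how these rules of thumb can be combined in practice, the hypothetical helper below flags whether a scale’s school-level reliability appears sufficient; the 0.05 cutoff for ICC-1 anticipates the one used later in the Discussion, and the 0.50/0.70 conventions for ICC-2 follow Klein et al. (2000). These are heuristics, not statistical tests.

```python
def judge_school_level_reliability(icc1_value: float, icc2_value: float) -> str:
    """Rough classification of school-level (L2) reliability based on common cutoffs."""
    if icc1_value < 0.05 or icc2_value < 0.50:
        return "insufficient for school-level aggregation"
    if icc2_value < 0.70:
        return "marginally reliable"
    return "acceptable"

print(judge_school_level_reliability(0.116, 0.755))  # e.g. average classroom management values
print(judge_school_level_reliability(0.056, 0.520))  # e.g. adaptive instruction in Hungary
```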

Regarding our second research question, the fit of the two-level CFA models is examined. We provide the general model fit indices CFI and RMSEA, but more importantly the SRMR at the between level. For interpretation purposes, SRMR values close to or below 0.08 are considered to reflect good fit (Hu and Bentler 1999). Smaller RMSEA values reflect better models, with values close to 0.06 or below considered acceptable (Hu and Bentler 1999). For the CFI and TLI, values above 0.95 indicate good fit, and values between 0.90 and 0.95 indicate marginal fit (Brown 2015; Hu and Bentler 1999).

All scales are assumed to have latent factors at both the individual and school levels. For these CFAs, we did not impose cross-level invariance, meaning that factor loadings at the student and school levels were not constrained to be equal (Schweig 2014; Zyphur et al. 2008), because we did not necessarily assume that the meaning of the construct at L1 was the same as at L2. However, in five countries, namely Canada, Chile, China, Colombia, and Costa Rica, we additionally tested cross-level invariance of the scales in order to check whether imposing it drastically changes the model fit indices.

To answer our third research question, the latent factor representing school-level intrinsic motivation (measured by five items) was regressed on the latent school-level teaching quality factor for each scale and each region/country. Motivation was chosen as the external criterion because there is a strong theoretical rationale, as well as consistent prior empirical evidence, for expecting positive associations with the teaching scales included in PISA 2015. All CFA and SEM analyses used the robust maximum likelihood estimator.
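To convey what this school-level regression estimates, the sketch below approximates it by correlating manifest school means on simulated data. This is not the doubly-latent multilevel SEM we actually used (which additionally corrects for measurement and sampling error), and all variable names and values are hypothetical.

```python
import numpy as np

def school_level_correlation(teaching: np.ndarray, motivation: np.ndarray,
                             school_id: np.ndarray) -> float:
    """Correlate manifest school means of a teaching scale with school means of motivation."""
    schools = np.unique(school_id)
    t_means = np.array([teaching[school_id == s].mean() for s in schools])
    m_means = np.array([motivation[school_id == s].mean() for s in schools])
    return float(np.corrcoef(t_means, m_means)[0, 1])

# Hypothetical data: 100 schools, 30 students each, with a positive school-level association
rng = np.random.default_rng(1)
sid = np.repeat(np.arange(100), 30)
school_quality = rng.normal(size=100)
teaching = school_quality[sid] + rng.normal(scale=2.0, size=sid.size)
motivation = 0.5 * school_quality[sid] + rng.normal(scale=2.0, size=sid.size)
print(round(school_level_correlation(teaching, motivation, sid), 2))
```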

3 Results

3.1 Reliability

Table 1 summarises the ICC-1 and ICC-2 values across all countries/regions for each of the scales. As is evident from the table, reliability differs across the scales. In most countries/regions, school-level reliability is high for the classroom management scale and low for the inquiry and adaptive instruction scales. The other scales could, on average, be regarded as marginally reliable at the school level. Note that these results also indicate large variability across countries/regions (depicted visually in Fig. 2). Hence, in certain countries/regions, the classroom management scale can be unreliable, while the inquiry and adaptive instruction scales can have adequate reliability. Researchers who aim to work with the data at the school level are referred to the Appendix for exact values for each country/region; by comparing these empirical values with the recommended cutoffs provided in Section 1, they can decide whether the reliabilities in a particular country are sufficient.

Table 1 School-level reliability indices (summary across all regions/countries)

3.2 Internal structure

On average, the classroom management, emotional support, and feedback scales exhibited good between-level SRMR values in almost all countries/regions. In contrast, the inquiry and the teacher-directed instruction scales exhibited poor fit in most countries/regions. We paid particular attention to the between-level SRMR values, since they indicate how well the model fits at the school level. The remaining fit indices, which largely reflect model fit at L1, support the findings of model fit at L2. Only the teacher-directed instruction scale shows slightly better fit at L1, especially regarding the CFI, compared with fit at L2. Note that fit indices for the adaptive instruction scale cannot be meaningfully interpreted because, with only three items, the measurement model is saturated. Again, as depicted in Fig. 3, the fit indices of each scale vary across countries/regions.

Fig. 2 Distribution of ICC-1 and ICC-2 values across all countries for science teaching scales in PISA 2015

The comparison between the models with and without invariance constraints across the levels, which we conducted in five countries, showed inconclusive results regarding which model is preferable according to the AIC and BIC. The change in the CFI was very small for all scales in almost all countries, and the RMSEA values hardly differed (Figure 3, Table 2).

Fig. 3 Fit of two-level unidimensional measurement models of the science teaching scales in PISA 2015

Table 2 Fit of measurement models (summary across all regions/countries)

3.3 Relationship with intrinsic motivation

On average, higher scores of adaptive instruction, classroom management, teacher-directed instruction, and emotional support predicted higher levels of intrinsic motivation to learn science. Meanwhile, the average regression coefficients for the feedback and inquiry scales were close to zero and mostly non-significant (Table 3). Again, there is considerable variation across countries/regions in the magnitude and statistical significance of the regression coefficients (see the Appendix for country-specific values).

Table 3 Average standardised regression coefficients across all regions/countries reflecting the latent school-level (L2) relationships between teaching quality and intrinsic motivation

4 Discussion

This study examined the extent to which student ratings from PISA provide reliable and valid information about teaching quality at the school level. With a focus on science in PISA 2015, we calculated ICC-1 and ICC-2 to investigate the school-level reliability of the two scales measuring generic quality, namely classroom management and emotional support, and the four instructional practice scales, namely inquiry-based instruction, individual feedback, adaptive instruction, and teacher-directed instruction. We also examined the factorial and predictive validity of those six scales.

A potential problem with regard to assessing teaching quality in PISA arises from its sampling procedure. Because students are sampled randomly from schools, the PISA sample lacks a teacher/classroom level (OECD 2017). Instead, the second level (L2) in PISA directly comprises the school. Thus, responses from students within the same school may pertain to different teachers. This raises the question of whether student ratings in PISA, which reflect student perceptions of different teachers, can be aggregated at the school level. Our analyses show that the answer depends on the specific scale and country/region.

For the classroom management scale, L2 reliability was fairly high with an average ICC-1 of 0.116 and ICC-2 of 0.755. Using the cutoff points of 0.05 for ICC-1 and 0.70 for ICC-2, this scale could be judged as a sufficiently reliable measure of L2 teaching quality in the vast majority of the countries/regions examined. This means that within a school, students tend to agree on the organisation and structure of science lessons. The measurement model for this scale also showed a good fit in most countries/regions. Overall, these findings indicate that the scale can be used to reliably measure a unidimensional latent factor at both the individual and school levels.

Unfortunately, the same conclusion cannot be made about the other science teaching scales in PISA 2015. On average, the ICC-1 values for the emotional support, individual feedback, and teacher-directed instruction scales were between 0.07 and 0.08. This represents non-trivial agreement between students in a school (LeBreton and Senter 2008), suggesting that to some extent, these dimensions of teaching are class-spanning and can still be considered a feature of the school. For the inquiry and adaptive instruction scales, ICC-1 values were lower (about 0.05) but still indicate some level of agreement between students in a school. However, given the cluster size in the sample (around 30 students per school), the ICC-2 for these scales did not reach 0.70 in most countries/regions.
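As a rough illustration (assuming, for simplicity, that the latent variances sum to 1), plugging these average ICC-1 values into Eq. 2 with about 30 students per school shows why scales with ICC-1 near 0.05 fall short of the 0.70 benchmark at this cluster size, whereas values near 0.08 only just reach it:

$$ \text{ICC-2} \approx \frac{0.05}{0.05+0.95/30} \approx 0.61, \qquad \text{ICC-2} \approx \frac{0.08}{0.08+0.92/30} \approx 0.72 $$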

With regard to their factorial validity, the teacher-directed instruction, emotional support, and feedback scales exhibited good fit with the data in most countries/regions. This supports the use of these scales to measure a single latent factor of teaching quality at the student and school levels (compare with, e.g., Wagner et al. 2013). Meanwhile, the measurement model of the inquiry scale exhibited poor fit with the data in almost all countries/regions. This finding is consistent with prior research, which applied a more exploratory approach and found that the inquiry scale is not unidimensional (Aditomo and Klieme 2020; Lau and Lam 2017). Rather, the scale seems to tap into guided and unguided forms of science inquiry.

In general, the school-level reliabilities observed in this study are low when compared with those reported by studies which utilize class or teacher-based—as opposed to school-based—student samples to assess teaching (Fauth et al. 2014; Lüdtke et al. 2009; Lüdtke et al. 2006). On the one hand, this is not surprising given that in the PISA sample, students in the same school report their perceptions about different teachers. Thus, lower reliability estimates are observed because the aggregated scores reflect not only subjective individual experiences of the same learning environment but also objectively different targets of perception (i.e. different classrooms and teachers). On the other hand, the low reliabilities raise the question of what the PISA teaching scales actually measure at the school level. Put differently, how meaningful is it to assume the existence of constructs which reflect school-level teaching quality?

In this respect, perhaps it is not coincidental that the most reliable scale in PISA 2015, that is, classroom management, does not directly assess teaching. Instead of measuring teacher activities or behaviours, items of this scale refer to disciplinary climate. For example, the scale asks students to consider how often lessons are disrupted due to noisy or unruly student behaviour. This distinction is important because disciplinary climate is not only a function of teachers’ classroom management skills but also of student-related factors such as the classroom SES and prior achievement composition, which are typically more reliable measures at L2. Meanwhile, items of the other teaching scales refer directly to teacher behaviour or teaching activities which may differ between teachers within the same school. This line of reasoning suggests that the classroom management scale is more reliable at L2 because it does not directly measure teaching.

Does this mean that the other PISA teaching scales cannot be used to assess teaching quality at the school level? This is not necessarily the case. Our findings show that although the quality of science teaching varies within each school, there is also some meaningful variation of teaching quality between schools. Even if the level of ICC-1 is relatively low, it still lies above 0.05 for the inquiry scale and is often above 0.07 for the other scales. The caveat is that with relatively low ICC-1, one would need larger numbers of students per school to achieve adequate reliability. This can be illustrated by comparing Hungary and Montenegro, two countries with the same ICC-1 for adaptive instruction (about 0.056) but very different cluster sizes. In Hungary, where only 18 to 19 students were sampled in each school, the ICC-2 was 0.520. In contrast, almost 75 students per school were sampled in Montenegro, which resulted in a much higher ICC-2 of 0.811 for the same scale. Thus, the adaptive instruction scale provides a reliable estimate of school-level teaching in Montenegro, but not in Hungary.

Results from our investigations regarding predictive validity also suggest that student ratings from some of the PISA scales capture meaningful differences in teaching quality between schools. Intrinsic motivation is a key affective outcome in science education, and there are strong theoretical grounds for proposing that teaching quality is positively associated with higher motivation (Deci et al. 1991; Deci et al. 1996; Ryan and Deci 2000). Our findings show that the classroom management, emotional support, teacher-directed instruction, and adaptive instruction scales were indeed predictive of higher intrinsic motivation in many countries/regions. This finding was not replicated for school-level feedback, which predicted higher intrinsic motivation in only 19% of the regions/countries for which data were available, and even predicted lower intrinsic motivation in some countries/regions. A reason for this might be that the feedback scale measures practices aimed at individual students. Unlike the other scales, the feedback scale is intended to measure a student-level construct. Thus, its effect at the aggregated level is better seen as reflecting a composition effect rather than a climate effect (Lüdtke et al. 2009). This conjecture is supported by additional analyses which show that at the student level, feedback was positively and significantly associated with intrinsic motivation in almost all the regions/countries. As theory suggests, within a school, students who were provided with more feedback were more likely to enjoy learning science.

Like the feedback scale, the inquiry scale failed to predict higher intrinsic motivation in most countries/regions. In this case, the most likely reason has to do with the scale’s poor factorial validity. As mentioned, poor fit of the measurement model may be an indication that the scale is tapping into a multidimensional construct. Indeed, previous analysis of the inquiry scale for a subset of the PISA regions has shown that the scale reflects a two-factor structure representing teacher-guided and unguided forms of inquiry (Aditomo and Klieme 2020). When used to measure only a single latent factor, the inquiry scale yields a score that conflates the two dimensions. Considering the importance of guidance and structure in facilitating learning from inquiry (Lazonder and Harmsen 2016), such conflation likely masks the substantive relations between inquiry and learning outcomes.

5 Conclusions and implications

Our findings show that in most countries/regions, the teaching scales in PISA 2015 have low reliabilities when used to assess school-level teaching quality. The exception was for the classroom management scale, which suggests that effective classroom management is a school quality that spans across classes. For the other scales, adequate school-level reliability could only be observed in certain countries/regions. This stands in contrast to the high single-level reliabilities of these scales in all countries/regions reported by the OECD (2017, p. 313). Nonetheless, findings also indicate that the classroom management, emotional support, adaptive instruction, and teacher-directed instruction scales capture meaningful differences in teaching quality between schools. This opens the possibility of using student ratings of teaching in PISA to investigate school-level effects in certain countries/regions, thereby allowing extensions of prior analysis that thus far only utilized the data at the individual student level.

To reiterate, the relatively low aggregate reliabilities of PISA’s teaching scales observed in many countries are likely due to student ratings being aggregated at the school level rather than at the teacher/classroom level. A clear implication of these findings is that researchers and policy makers wishing to make inferences about school-level teaching quality using PISA data should proceed cautiously and check the L2 reliability for the specific regions or school sectors they wish to evaluate. Relatedly, researchers need to formulate clear theoretical reasoning about the substantive meaning of measures of teaching at the school level. In addition, the findings point to the importance of using a doubly-latent approach (multilevel structural equation modelling), which uses latent factors and latent aggregation (Morin et al. 2013). In this way, relationships between school-level teaching quality and other variables can be estimated while taking into account the unreliability in the measurement model (Kelloway 2015).

Our study also points to several avenues for further research. The first is related to the inquiry scale, whose measurement model exhibited poor model fit and even failed to converge in many countries/regions (see Appendix). Given the centrality of inquiry-based instruction in science education, its assessment based on student ratings requires a more in-depth analysis than can be provided by our study. Also, our study did not examine the comparability of the teaching scales across different regions/countries (He and van de Vijver 2013). Although international studies like PISA adopt rigorous procedures to minimise cross-cultural bias, measurement invariance of its scales cannot be assumed (for the case of teaching quality in mathematics, see Fischer et al. 2019). However, examining this issue for the whole set of participating regions/countries in PISA was beyond the scope of the current paper.