Introduction

How do preservice teachers evaluate improvement and deterioration in students’ achievement? Is there a gender bias in the evaluation of student performance, with girls preferred to boys? The present study addresses these questions in the context of school placement recommendations, which are made in Germany (and some other European countries) at the end of primary school to recommend a student for one of the different tracks in secondary school.

Secondary school in Germany is hierarchically structured, consisting of higher and lower tracks, where “high” and “low” correspond both to the achievement level students need to be eligible for a specific track and to the level of instruction delivered in that track. When students finish primary school, which is, depending on the federal state, at the end of school year 4 (e.g., in Bavaria) or school year 6 (e.g., in Berlin), teachers decide on the future track students should attend in secondary school. Traditionally, three distinct tracks constitute the German secondary school. The lowest track (“Hauptschule”) is dedicated to students with major learning difficulties and below-average achievement profiles. Students who attend this track can acquire qualifications for rather limited vocational areas. The intermediate track (“Realschule”) provides students with general education and vocational-training courses. The highest track (“Gymnasium”) offers students with above-average achievement profiles the qualification for university entrance upon successful completion of the track.

According to the German transition regulations, teachers should value the student’s achievements as the major factor when making a school placement recommendation (KMK 2010). The students’ achievements that are considered for placement recommendations are mainly represented by their school grades, given in the penultimate school report in primary school. Moreover, working habits and social behavior, which are related to achievement and mentioned in the school report, also play a role in track recommendations. School grades in Germany vary between 1 and 6, with 1 meaning “very good” and 6 meaning “insufficient”; that is, lower scores on the grade scale represent higher achievements. School grades are based on predetermined education standards representing the knowledge and skills that should be mastered at each stage of the school system. Although school grades seem to be criterion-referenced at first glance, they are actually affected by the school or even the class in which the teacher examines the students. For example, the meaning of “very good” performance might vary between classes of different achievement levels (Phillips 1991). Hence, the likelihood that students receive poorer grades increases with the ability level of the class (Preckel et al. 2008; Zeidner and Schleyer 1999). School grades are therefore not directly comparable between schools.

In Berlin, which is both a federal state and the capital of Germany, teachers are given two school reports instead of one to reach a decision about a school-track recommendation. In Berlin, where secondary school starts in grade 7, the school reports serving as the decision base are from the last semester of school year 5 (5/2) and the first semester of school year 6 (6/1). The teachers are advised to take both school reports into consideration when opting for one of the different tracks. If the grand mean of the grades of both school reports equals 2.2 or less, the recommendation for the highest track is advised. If the grand mean is 2.8 or larger, students are recommended for the lower track. If, however, the grand mean is between 2.2 and 2.8, additional information regarding the student’s skills and achievements (e.g., her or his reflection of the learning process) should be gathered (Senatsverwaltung für Bildung, Jugend und Wissenschaft 2015). One possible advantage of using two reports rather than one might be that they provide a more stable and reliable estimate of the “true” achievements of a student. However, since it is likely that both school reports differ from one another in their grade point averages (GPA), teachers are faced with both a grand mean across the grades obtained in semester 5/2 and semester 6/1 and the difference between the GPAs of the two school reports. According to official regulations and school policies, only the grand mean should be used for placement decisions (Senatsverwaltung für Bildung, Jugend und Wissenschaft 2015). However, information about how to handle changes between both reports is not given.
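
The decision rule described above can be written down as a short sketch. The thresholds of 2.2 and 2.8 are taken from the text; the function name, the return labels, and the example grades are illustrative assumptions and are not part of the official regulation.

```python
def berlin_recommendation(report_5_2, report_6_1):
    """Classify a student from the grades of the two school reports.

    Thresholds (2.2 and 2.8) are the ones quoted in the text; the
    function name and the return labels are hypothetical.
    """
    grades = list(report_5_2) + list(report_6_1)
    grand_mean = sum(grades) / len(grades)  # mean over all grades of both reports
    if grand_mean <= 2.2:
        return grand_mean, "highest track"
    if grand_mean >= 2.8:
        return grand_mean, "lower track"
    return grand_mean, "gather additional information"


# Hypothetical student: GPA 2.50 in semester 5/2 and GPA 2.17 in semester 6/1.
mean, advice = berlin_recommendation([1, 2, 2, 2, 4, 4], [1, 2, 2, 2, 2, 4])
print(f"grand mean {mean:.2f}: {advice}")
```

Note that the rule uses only the grand mean; the direction of change between the two reports does not enter the function at all, which mirrors the gap in the regulations described above.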

Factors affecting school-track recommendations

Although it is clearly stated in official regulations that achievement be the only factor that should affect the teachers’ track recommendations (KMK 2010), factors that are not related to achievement have also been shown to affect track recommendations. Recent research has brought ample evidence that the main predictors of school placement decisions are the school grades (Bos et al. 2004; Stubbe and Bos 2008), scores from standardized achievement tests (Bos et al. 2004; Stahl 2007), achievement motivation (Bos et al. 2004; Neugebauer 2011), the socioeconomic status of the students (Stubbe 2009), the immigration status of the students (Kristen 2006), parental aspirations (Braun and Mehringer 2009), the composite achievement level of the school class (Trautwein and Baeriswyl 2007), and students’ gender (Arnold et al. 2007).

Gender

Several studies show that girls were more likely than boys to be recommended for the highest track in German secondary school (Arnold et al. 2007; Jürges and Schneider 2011; Lehmann and Peek 1997; Milek et al. 2009; Schmitt 2008; Schneider 2011; Schulze et al. 2009). This effect suggests that boys are, compared to girls, disadvantaged in the German school system. Although the predictive power of gender for school-track recommendations dropped when the students’ achievement was controlled for, the effect of gender was still significant in some studies (Arnold et al. 2007; Jürges and Schneider 2011; Lehmann and Peek 1997; Schulze et al. 2009). An explanation as to why teachers were prone to recommend girls more often to the highest track than boys despite equal achievements might be based on different expectations in regard to the students’ cognitive and personal development in secondary school. These expectations might result from teachers’ knowledge about students’ maturation and its effect on school performance. Boys lag behind girls by 2 years in physical maturation (Marshall and Tanner 1970). There is also evidence that girls are emotionally more mature than boys (Carrothers et al. 2000) and outperform boys in regard to self-efficacy (Kumar and Lal 2006). With respect to intelligence, boys and girls mature at different rates. Girls mature on average faster at the age of about 9 years and remain in advance of boys until the age of 14 or 15 years (Colom and Lynn 2004; Lynn 1999). Similarly, girls have been shown to mature earlier than boys in personality. For instance, girls exhibit higher levels of agreeableness (Branje et al. 2007; Klimstra et al. 2009) and conscientiousness (Klimstra et al. 2009) in early and middle adolescence. Both personality traits contribute to positive interpersonal relationships, which in turn teachers regard as valuable for learning and classroom behavior (e.g., Lane et al. 2004).

Teacher expectations might also refer to gender-related differences in student attitudes to school (e.g., Chesterfield and Enge 1998; Parks and Kennedy 2007), which may also result from girls’ earlier maturation (Darom and Rich 1988). For instance, girls show on average a more positive attitude to school than boys (OECD 2004), they usually enjoy going to school more than boys (Segeritz et al. 2010; Van Ophuysen 2008), and they tend to show more positive approaches to learning such as attentiveness and task persistence than boys (Ready et al. 2005). All these attitudes and behaviors might alleviate the subjective burden that comes along with the transition from primary to secondary school and, therefore, might be considered when teachers make their school-track recommendations. Likewise, results of a recent study (Timmermans et al. 2016) suggest that teachers’ expectations for the future academic performance of their students during the final grade of primary school were related to several teacher perceptions of student attributes. In general, teachers had higher expectations for a student if they perceived the student as having positive working habits. Positive working habits, however, are generally associated with female students rather than with male students (Reyna 2000; Siegle and Reis 1994).

Teacher expectations may also be driven by implicit gender stereotypes (Chalabaev et al. 2009). Much research has focused on stereotypes that teachers have concerning girls’ ability in mathematics and science (e.g., Li 2005). Teachers tend to stereotype mathematics as a male domain and attribute boys’ successes and failures to ability, whereas they tend to attribute girls’ successes and failures to effort (Fennema et al. 1990; Tiedemann 2002). Even if female students are high achievers, they are seen as less logical, less independent in mathematics, and liking mathematics less compared to equally achieving male students (Fennema et al. 1990). Gender stereotypes favoring male students are discussed as a factor influencing educational choices, resulting—for example—in an unequal distribution of males and females in STEM (science, technology, engineering, and mathematics) fields of studies (Glass and Minnotte 2010). If teachers tend to underestimate the mathematical ability of girls relative to boys (Frome and Eccles 1998), girls may be placed at a disadvantage.

However, teachers’ stereotypical perceptions of students are moderated by the students’ performance level. Teachers attribute more developmental resources in mathematics to male than to female primary school students if they are low or average achievers, but not if they are high achievers (Tiedemann 2002). Since high-achieving students (male or female) are likely to be recommended for the highest track in secondary school, teachers’ stereotyped perceptions may not necessarily result in favoring boys in placement recommendations. Moreover, whereas boys are usually stereotyped as having stronger mathematical abilities than girls, girls are stereotyped as having stronger verbal abilities than boys (Plante et al. 2013). Primary school teachers, however, weight mathematical abilities to a smaller degree than verbal abilities, when it comes to placement recommendations (e.g., Bos et al. 2004; Klapproth et al. 2013). Hence, high-track recommendations may be more likely for girls than for boys.

Teacher-student gender interaction

Student assessment might depend not only on students’ gender but also on their teachers’ gender. According to the gender-stereotypic model (Martin and Marsh 2005), boys achieve higher scores in classes taught by males, and girls perform better when instructed by female teachers. Therefore, some policy makers (e.g., in Great Britain) attempted to increase the number of male teachers in primary schools (Francis et al. 2008), although studies on the effect of teacher gender on the achievement, attitudes, or behaviors of their male and female students are quite rare (Driessen 2007). Teachers might prefer students of their own gender, and hence, female teachers might assess girls better than boys, whereas male teachers might prefer boys to girls (Holmlund and Sund 2005). Indeed, some studies corroborate this hypothesis. For instance, Lavy (2004) showed that girls got on average higher scores than boys in the main school subjects, but in some school subjects (e.g., biology and chemistry), the difference between boys and girls was larger with female teachers than with male teachers. Likewise, Dee (2007) found that in secondary school, boys and girls were evaluated more positively when they were taught by a same-gender teacher rather than by a teacher of the opposite gender. In contrast, Hopf and Hatzichristou (1999) found an interaction between student and teacher gender, such that male primary school teachers judged boys to show more problematic behavior than did female primary school teachers. Other studies found that female teachers generally evaluated both boys and girls more positively than male teachers (Ehrenberg et al. 1995), or did not find a teacher-gender bias in assessing male and female students at all (e.g., Driessen 2007). Particularly in Germany, studies concerning the effect of teacher gender on student assessment have focused on school placement decisions. For instance, Helbig (2010) as well as Neugebauer (2011) showed that there was no significant interaction between teacher gender and student gender. One reason for not finding teacher-student gender interactions might be that both male and female teachers simply follow the rules given by their administrations or schools when judging students. Since these rules often entail students’ learning motivation (Neugebauer 2011), girls are on average preferred over boys, as girls show on average higher degrees of learning motivation than boys (Ready et al. 2005).

Development of achievement

When teachers in Germany are urged to opt for one of the different tracks in secondary school, they mainly resort to grades as indicators of the students’ achievement (Arnold et al. 2007), which are given in the school reports. In the federal state (Berlin) where our study was conducted, teachers are presented with two school reports instead of one when it comes to placement recommendations. Although successive school reports from a single student may be very similar, they are hardly exactly the same.

How are changes of achievement perceived or evaluated by teachers, when it comes to school placement recommendations? When recommending a student for the highest track, the teacher in charge is likely to expect that the student will perform adequately in this track. According to Jussim et al. (1998), students’ past performance predicts teacher expectations about students’ future performance. A student who performed well in the past is expected to perform well in the future, whereas a student who performed poorly in the past is expected to perform poorly in the future. When teachers expect students to continue to perform according to previously established patterns, resulting expectation effects have been categorized as “sustaining expectation effects” (Cooper 1985; Good and Brophy 2003). Evidence for sustaining expectation effects comes, for example, from a study conducted by Cooper et al. (1976) where college students imagined themselves as primary school teachers who had to predict the performance of a child whose report card reflected either an increasing or decreasing performance pattern. Results indicated that expectations were higher in the increasing rather than decreasing condition. Rolison and Medway (1985) reported similar results with teachers as participants.

Although in Germany the development of academic achievement is not considered a factor for determining the suitable track, Caro et al. (2009) demonstrated that students growing more rapidly in their mathematics skills, measured by standardized achievement tests, were more likely to get a high-track recommendation than students with a smaller degree of growth. Caro et al. (2009) concluded that teachers value the growth rate of students for their placement recommendations. However, when teachers in Germany make placement recommendations, they usually do not have access to achievement test data, nor are they legally allowed to use these data. Therefore, in the Caro et al. study, teachers may have used indicators of achievement growth that were not explicitly given by test scores, but instead by the information that was actually present in the school reports. In particular, the teachers possibly gauged the development of achievement from the two school reports that were given for each student, even if the differences between them were rather small.

Research questions and hypotheses

The aim of the present study was to examine whether preservice teachers, who will become primary school teachers after successful completion of their teacher study program, and who therefore will eventually be in charge of making school placement decisions, are biased by two factors when making these decisions: the gender of the students and whether the students improved or deteriorated within one semester at the end of primary school. To the authors’ knowledge, no study so far has investigated this research question experimentally, although student achievement in general is the major determinant of school placement decisions.

The rationale of the present study was as follows. Since teachers (or even preservice teachers) may evaluate girls to be more mature than boys, they would presumably predict higher achievements in secondary school for girls than for boys, even if both currently show equal achievements. Moreover, since teachers are likely to predict students’ future achievements on the basis of their previous development of achievement, an increasing pattern of achievement would result in the prediction of higher achievements in secondary school than a decreasing pattern of achievement.

We therefore hypothesized that female students would be more likely to be recommended for the highest track than male students, despite their grades being equal. Furthermore, we assumed that when preservice teachers are shown two different school reports as the basis for deciding whether a student is eligible for the highest track, they will opt more frequently for the highest track when the GPA of the second report is lower (i.e., “better”) than that of the first report than when the GPA of the second report is higher (i.e., “worse”) than that of the first report. Hence, we expected that an improvement in GPA would make a recommendation for the highest track more likely, whereas a deterioration in GPA would make it less likely.

In addition to the change of the GPA, we postulated that—as it is the standard official rule for teachers in Germany—the grand mean of the grades (i.e., the average over all grades of both school reports) would also be predictive for the recommendation of the participants. In particular, we expected that the lower the grand mean was, the more likely it would be that the participants recommend students for the highest track.

Finally, we examined whether the gender of the participants contributed differentially to the recommendations for male and female students, that is, whether there is an interaction between participants’ gender and students’ gender with respect to school-track recommendations. However, based on previous research, a straightforward hypothesis was hardly derivable, so that we treated this issue as an open research question.

Method

Participants

In total, 260 preservice teachers took part in the study. Of these participants, 172 (66.2%) completed the study fully, whereas 83 participants (31.9%) had missing values on more than 3% but less than 10% of the responses. The remaining participants (1.9%) omitted 50% or more of the possible responses and were excluded from further analyses. In the sample used (N = 255), 176 (69.0%) participants were female and 79 (31.0%) were male. The gender distribution in our sample closely matched the distribution of teachers’ gender in German primary schools (Neugebauer and Gerth 2013). The participants’ age varied between 18 and 40 years, with a mean age of 22.1 years (SD = 3.6).

For the 83 participants with less than 10% missing values, we conducted multiple imputation on the dependent variables using the respective tool offered by SPSS, Version 23, since missing data pose a problem for data interpretation when not handled appropriately (Peugh and Enders 2004). Imputation was conducted five times, and the five imputations were finally aggregated into a single data set. Prior to multiple imputation, the pattern of missing values was analyzed. Little’s MCAR (missing completely at random) test (Little 1988) yielded χ2 (364) = 384.31, p = 0.101, which is consistent with the values being missing completely at random.

All participants were enrolled in a primary school teacher education program at the time of the study. In teacher education programs, the assessment of students and the training of diagnostic competences is a major part (e.g., Abs 2006). All participants had previous teaching experiences as student teachers or student observers in the classroom within the framework of their study program (the mean practice duration was 19.6 weeks (SD = 19.7)), and they had studied on average for 4.5 semesters (SD = 2.3). The study was open for 4 weeks.

Materials

Each participant received all 24 student vignettes, which were displayed online via the internet platform “soscisurvey.de” and were accessible on any device connected to the internet. The vignettes mimicked the two school reports that teachers in Berlin primary schools receive to make their track recommendations for secondary school. The vignettes were developed according to guidelines presented by Evans et al. (2015). Prior to the study, the vignettes were pretested in a small sample (N = 6) of students and teachers who were asked to rate the content of the vignettes with respect to plausibility and comprehensibility, and to judge whether they appeared similar to real-life student reports. One example of the vignettes is shown in the Appendix.

Each vignette contained six grades varying between 1 (“very good”) and 4 (“sufficient”), with each grade being related to one school subject. However, the school subjects were not specified (e.g., subject A: “2,” subject B: “2,” subject C: “4,” subject D: “1,” subject E: “2,” subject F: “3”), so that the participants were to rely only on the amount of the grades and therefore could not apply subjective weighting of the school subjects. The rationale behind leaving the school subjects unspecified was as follows: To control for subjective weighting of school subjects, a design would have been necessary that would allow every single school subject to be crossed with all remaining factors, rendering the design virtually unfeasible. To reach a single GPA (e.g., 2.17), the combination of grades (e.g., 1; 2; 2; 2; 2; 4) was always the same.

The realized grand means of the grades displayed in the vignettes were M = {2.33; 2.50; 2.67; 2.83; 3.00; 3.17}, with higher means representing lower achievements. However, the grand means emerged from two school reports showing either improvement or deterioration in grades. In case of improvement, the GPAs of the first school report (representing grades obtained in semester 5/2) were 2.50, 2.67, 2.83, 3.00, 3.17, or 3.33, and the corresponding GPAs of the second school report (representing grades obtained in semester 6/1) were 2.17, 2.33, 2.50, 2.67, 2.83, or 3.00, respectively, so that the magnitude of improvement was always the same between two school reports. Accordingly in case of deterioration, the GPAs of the first school report were smaller than those of the second school report. Note that the change of GPAs was always due to the change of grades in one school subject by an amount of 2.0. For instance, when a student improved in her GPA (e.g., from 2.50 to 2.17, yielding a grand mean of 2.33), she realized this improvement by the change of grades in a single school subject from 4 (“sufficient”) to 2 (“good”). The grand means of the GPAs were unrelated to both the students’ gender and whether there was improvement or deterioration in grades.
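
The arithmetic behind these values can be checked with a short sketch. It assumes only what the text states: six grades per report and a change of two grade points in a single subject. The grade sums (13 to 18 for the better report) are inferred from the reported GPAs, and all variable names are illustrative.

```python
N_SUBJECTS = 6            # six grades per school report (from the text)
CHANGE_IN_ONE_GRADE = 2   # one subject changes by two grade points (from the text)

# Grade sums of the better report; 13 corresponds to, e.g., 1, 2, 2, 2, 2, 4.
# These sums are inferred from the reported GPAs of 2.17 to 3.00.
better_sums = range(13, 19)

rows = []
for better in better_sums:
    worse = better + CHANGE_IN_ONE_GRADE
    worse_gpa = round(worse / N_SUBJECTS, 2)
    better_gpa = round(better / N_SUBJECTS, 2)
    grand_mean = round((better + worse) / (2 * N_SUBJECTS), 2)
    rows.append((worse_gpa, better_gpa, grand_mean))

# Improvement case: the worse report (semester 5/2) comes first.
for first, second, mean in rows:
    print(f"5/2: {first:.2f}  6/1: {second:.2f}  grand mean: {mean:.2f}")
```

For the deterioration case, the same pairs apply with the order of the two reports reversed, which leaves the grand means unchanged.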

The gender of the students was manipulated by the names that were assigned to the students. The names were common either for male or female German students.

Each vignette was supplemented with information regarding the students’ working habits and social behavior in order to make the vignettes more similar to real-world school reports. This information was delivered by two rather short sentences, which were derived from standardized sentences used for appraisal of the working habits and social behavior in school (Niedersächsisches Kultusministerium 2010). All sentences used in the vignettes displayed behavior that is regarded in school as “meeting the expectations.” Thus, all vignettes showed student behavior that was evaluated in quite the same way.

The three factors (the students’ grand mean of GPAs, their gender, and whether they improved or deteriorated) were varied orthogonally, resulting in a 2 (gender: male vs. female) × 2 (change of GPA: positive vs. negative) × 6 (grand mean 2.33, 2.50, 2.67, 2.83, 3.00, 3.17) within-subjects factorial design. The dependent variable was the decision of the participants, which was either in favor of or against placement in the highest school track. We additionally collected data about the participants’ sociodemographic background.
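
As an illustration, the fully crossed design can be enumerated in a few lines. This is a sketch only; the dictionary keys and labels are assumptions chosen for readability and do not reflect how the vignettes were actually generated.

```python
from itertools import product

student_genders = ["female", "male"]
gpa_changes = ["negative", "positive"]
grand_means = [2.33, 2.50, 2.67, 2.83, 3.00, 3.17]

# Crossing all factor levels yields one vignette per design cell.
vignettes = [
    {"student_gender": g, "change": c, "grand_mean": m}
    for g, c, m in product(student_genders, gpa_changes, grand_means)
]

print(len(vignettes))  # 2 x 2 x 6 design cells
```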

Procedure

The participants were instructed to imagine that they were the teacher of a class in the last grade of primary school and had to decide on the future secondary school track of every student in this class. They were given the options “in favor of the highest track” or “not in favor of the highest track” at the end of each student description. The participants were instructed to make use of the information that was presented to them for each of the 24 students. After the general instruction, an example task followed to acquaint the participants with the procedure. After that, the student vignettes followed in random order. In case a participant did not make a judgment (and instead clicked on the “next” button), a prompt popped up which reminded the participant to make a judgment, or otherwise to proceed with the experiment without making a judgment. A new vignette was shown on the screen after the preceding vignette was closed by the decision of the participant. After the participants had made decisions for all 24 students, they were asked to give some information about their sociodemographic background. Finally, they were thanked, debriefed about the purpose of the study, and given the opportunity to take part in a lottery for a 20 Euro prize.

Data analyses

We used multilevel logistic regression analysis to test our hypotheses. In this analysis, the judgments of the participants are nested within the participants. Hence, the level-1 unit of the analysis consists of the repeated measures for each participant, and the level-2 unit is the participant. The predictors in the regression model were the grand mean of grades as a metric covariate (with the values 2.33, 2.50, 2.67, 2.83, 3.00, and 3.17) and student gender (female = 0, male = 1) as well as the change of GPA (negative = 0, positive = 1) as binary factors. Since in our hypotheses we predicted three main effects to occur, we firstly estimated a regression model that contained only main effects (model 1). However, since we could not exclude the possibility of interactions between the predictors, we also estimated a model that specified interaction terms (model 2). We additionally examined whether the gender of the participants affected their placement recommendations (model 3). Finally, in order to examine whether the results obtained in this study can be generalized to potentially all “admissible” preservice teachers, we conducted a generalizability study.

Results

The results are reported according to guidelines suggested by Jaccard (2001). Table 1 shows the mean proportions as well as the respective standard deviations for each condition.

Table 1 Means and standard deviations of the proportions of high-track recommendations as a function of students’ grand mean in grades, their gender, and whether their grades improved or declined

Apparently, the proportions of high-track recommendations were dependent on the grand mean of grades, with a smaller grand mean resulting in higher proportions. Moreover, Table 1 also indicates that the change of grades contributed to the proportions of high-track recommendations, as a positive change (meaning that students improved) yielded higher proportions compared to a negative change. To examine whether these apparent effects were statistically significant, we conducted multilevel logistic regression analysis, of which the results are depicted in Table 2.

Table 2 Results of multilevel logistic regression analyses

The resulting logistic regression equation for model 1, which contained only main effects, reads as follows:

Predicted logit of high-track recommendation = 6.09 + 0.94 * Change of GPA + 0.05 * Student Gender − 2.69 * Grand Mean. (I)

Except for student gender, all predictors were significant. For change of GPA, holding student gender and the grand mean constant, the logit increased by B = 0.94 when the change was positive. This effect translates to an odds ratio of 2.55, meaning that the odds of getting a recommendation for the highest track increased by a factor of 2.55 when the change in GPA was positive rather than negative. For the grand mean, lower values corresponded significantly with higher probabilities of high-track recommendations. An increase of one unit on the German grade scale (which ranges from 1 (“very good”) to 6 (“insufficient”)) corresponded to roughly 15 times lower odds of a high-track recommendation.
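
The step from logit coefficients to odds ratios can be illustrated with a minimal sketch based on the rounded coefficients of model 1. The function and variable names are illustrative. Because the printed coefficients are rounded, the odds ratio for the change of GPA comes out at about 2.56 rather than the reported 2.55; this small gap presumably reflects rounding of B.

```python
import math

# Rounded coefficients of model 1 as printed in the equation above.
B0, B_CHANGE, B_GENDER, B_MEAN = 6.09, 0.94, 0.05, -2.69


def logit_model1(change, student_male, grand_mean):
    """Predicted logit; change: 0 = negative, 1 = positive; student_male: 0/1."""
    return B0 + B_CHANGE * change + B_GENDER * student_male + B_MEAN * grand_mean


def probability(logit):
    """Convert a logit into a probability."""
    return 1 / (1 + math.exp(-logit))


odds_ratio_change = math.exp(B_CHANGE)  # odds multiplier for a positive change
odds_ratio_mean = math.exp(-B_MEAN)     # factor by which odds drop per grade unit

print(f"odds ratio, change of GPA: {odds_ratio_change:.2f}")
print(f"odds reduction per grade unit: {odds_ratio_mean:.1f}")
print(f"P(high track | improved, female, 2.33): "
      f"{probability(logit_model1(1, 0, 2.33)):.2f}")
```

Note that these factors refer to odds, not to probabilities; the corresponding change in probability depends on the values of the other predictors.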

In model 2, we added four interaction terms to the main effects, resulting in the following logistic regression model:

Predicted logit of high-track recommendation = 3.49 + 4.04 * Change of GPA + 0.55 * Student Gender − 1.72 * Grand Mean + 2.80 * Change of GPA × Student Gender − 1.15 * Change of GPA × Grand Mean − 0.17 * Student Gender × Grand Mean − 1.06 * Change of GPA × Student Gender × Grand Mean. (II)

Compared with model 1, the quasi-likelihood under the independence model criterion (QIC) was smaller in model 2, indicating a better fit to the data. As in model 1, the effects of the change of GPA and of the grand mean were significant. However, the coefficient for the change of GPA increased (from B = 0.94 in model 1 to B = 4.04 in model 2), whereas the coefficient for the grand mean decreased in magnitude (from B = − 2.69 in model 1 to B = − 1.72 in model 2). Note that when interaction terms are included in a logistic regression equation, the coefficients of the lower-order terms no longer represent main effects in the traditional sense; instead, each represents the effect of a predictor when the other predictors involved in the interaction are set to zero. Model 2 revealed three significant interaction terms: the Change of GPA × Student Gender interaction, the Change of GPA × Grand Mean interaction, and the three-way interaction. However, the Student Gender × Grand Mean interaction was not significant.

The Change × Grand Mean interaction means that when holding the value of student gender constant (e.g., when the students were all female or all male), the slopes of the functions were nonparallel, with steeper slopes for students who improved their grades than for students whose grades deteriorated. The Change of GPA × Student Gender interaction means that male students were more likely to get a high-track recommendation than female students, when they improved their grades rather than deteriorated.

The interpretation of the three-way interaction obtained from model 2 necessitates a look at the differences in logits between female-negative change and male-negative change students on the one hand, and female-positive change and male-positive change students on the other hand. At the smallest grand mean (2.33), the difference was Diff = 0.48 for positive-change students and Diff = 0.16 for negative-change students (with a higher probability of high-track recommendations for male than for female students). Hence, at this grand mean, gender of the students was of lower importance for judging negative-change students than for judging positive-change students. At the largest grand mean (3.17), the difference was Diff = − 0.56 for positive-change students and Diff = 0.01 for negative-change students, which indicates that at this achievement level gender played again a stronger role for the judgments of positive-change students than for the judgment of negative-change students. However, at this grand mean, female students were favored over male students, when they improved their achievements. Obviously, when students showed deterioration, the different grand means were considered to a lesser degree than when they improved, and the effect of the grand mean on the probability of a high-track recommendation was even stronger when the improving students were male rather than female.
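
The reported differences can be approximately reproduced from the rounded coefficients of model 2. The following is a sketch with illustrative names; it uses the coding given in the Data analyses section (student gender: female = 0, male = 1; change of GPA: negative = 0, positive = 1), and small deviations from the reported values are to be expected because the printed coefficients are rounded.

```python
# Rounded coefficients of model 2 as printed in equation (II).
M2 = {
    "intercept": 3.49, "change": 4.04, "gender": 0.55, "mean": -1.72,
    "change_gender": 2.80, "change_mean": -1.15, "gender_mean": -0.17,
    "change_gender_mean": -1.06,
}


def logit_model2(change, male, grand_mean):
    """Predicted logit for one design cell of model 2."""
    return (M2["intercept"]
            + M2["change"] * change
            + M2["gender"] * male
            + M2["mean"] * grand_mean
            + M2["change_gender"] * change * male
            + M2["change_mean"] * change * grand_mean
            + M2["gender_mean"] * male * grand_mean
            + M2["change_gender_mean"] * change * male * grand_mean)


def gender_gap(change, grand_mean):
    """Male minus female logit; positive values favor male students."""
    return logit_model2(change, 1, grand_mean) - logit_model2(change, 0, grand_mean)


for gm in (2.33, 3.17):
    print(f"grand mean {gm}: positive change {gender_gap(1, gm):+.2f}, "
          f"negative change {gender_gap(0, gm):+.2f}")
```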

In model 3, the participants’ gender was included both as a main effect and as part of interaction terms. The resulting model was as follows:

(III) Predicted logit of high-track recommendations: 9.49 + 1.22 * Change of GPA − 0.20 * Student Gender − 4.19 * Grand Mean − 10.97 * Participant Gender + 5.52 * Change of GPA × Student Gender + 0.16 * Change of GPA × Grand Mean + 0.06 * Student Gender × Grand Mean + 2.39 * Student Gender × Participant Gender + 2.77 * Change of GPA × Participant Gender + 4.56 * Grand Mean × Participant Gender − 2.09 * Change of GPA × Student Gender × Grand Mean − 5.96 * Change of GPA × Student Gender × Participant Gender − 1.55 * Change of GPA × Grand Mean × Participant Gender − 0.78 * Student Gender × Grand Mean × Participant Gender + 2.24 * Change of GPA × Student Gender × Grand Mean × Participant Gender.

Compared to model 1 and model 2, the QIC score of model 3 was smaller and therefore indicated a better fit. Notably, the four-way interaction effect was significant, which we illustrate below.

Fig. 1 depicts the predicted logits obtained from model 3 for all combinations of factors realized in our study. The upper panels show the logits for students with a positive change in GPA, and the lower panels show the logits for students with a negative change in GPA. The figure helps to interpret the four-way interaction of model 3.

Fig. 1 Predicted logits of the probability of high-track recommendations, obtained from the different conditions in the experiment, depending on the grand mean of all grades. Upper panels: predicted logits for students who increased in grades. Lower panels: predicted logits for students who decreased in grades. Left panels: predicted logits for female participants. Right panels: predicted logits for male participants

To interpret the four-way interaction obtained from model 3, it is useful first to consider the effects for positively and negatively changing students separately. For students with positively changing grades, the slopes of the functions were steeper with female (B = − 5.04) than with male (B = − 1.31) participants, meaning that the grand mean of grades affected the participants’ decisions to a larger degree when the participants were female rather than male. Moreover, the relationship between the grand mean of grades and student gender was dependent on the participants’ gender. The difference in slopes between male and female students was larger with female (Diff = 2.03) than with male (Diff = 0.56) participants. In addition, whereas female participants judged male and female students differentially depending on their grand mean of grades, male participants preferred boys relative to girls at almost all values of the grand mean of grades.

The pattern of results was quite different for students with negatively changing grades. With female participants, the slopes of the functions were still steeper (B = −4.16) than with male (B = 0.01) participants. However, in contrast to their judgments of positively changing students, female participants made no visible distinction between boys and girls, since the slopes for boys and girls were virtually the same (Diff = 0.06). Male participants, however, appeared to devalue high-performing girls more than low-performing girls, since the slope for the girls was positive (B = 0.37), whereas the slope for the boys was negative (B = −0.35).
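The slopes and slope differences reported above can be recovered from the coefficients of model (III). The following sketch does so under an assumed dummy coding that is not stated in this passage: Change of GPA = 1 for a positive change, Student Gender = 1 for boys, and Participant Gender = 1 for male participants (0 otherwise). This coding is inferred only because it reproduces the reported values up to rounding; the names `COEF`, `grand_mean_slope`, and `slope_summary` are illustrative.

```python
# Coefficients of model (III) as printed in the text.
# Keys list the predictors in each term:
# C = Change of GPA, S = Student Gender, G = Grand Mean, P = Participant Gender.
COEF = {
    (): 9.49,
    ("C",): 1.22,
    ("S",): -0.20,
    ("G",): -4.19,
    ("P",): -10.97,
    ("C", "S"): 5.52,
    ("C", "G"): 0.16,
    ("S", "G"): 0.06,
    ("S", "P"): 2.39,
    ("C", "P"): 2.77,
    ("G", "P"): 4.56,
    ("C", "S", "G"): -2.09,
    ("C", "S", "P"): -5.96,
    ("C", "G", "P"): -1.55,
    ("S", "G", "P"): -0.78,
    ("C", "S", "G", "P"): 2.24,
}


def grand_mean_slope(change, student, participant):
    """Slope of the predicted logit with respect to the grand mean in one cell.

    Assumed coding (inferred, not given in the text): 1 = positive change,
    1 = male student, 1 = male participant; 0 otherwise.
    """
    values = {"C": change, "S": student, "P": participant}
    slope = 0.0
    for term, b in COEF.items():
        if "G" not in term:
            continue  # only terms containing the grand mean contribute
        weight = 1.0
        for name in term:
            if name != "G":
                weight *= values[name]
        slope += b * weight
    return slope


def slope_summary(change, participant):
    """Mean slope over student gender and absolute girl-boy slope difference."""
    girls = grand_mean_slope(change, 0, participant)
    boys = grand_mean_slope(change, 1, participant)
    return (girls + boys) / 2, abs(girls - boys)


if __name__ == "__main__":
    for change in (1, 0):
        for participant in (0, 1):
            mean, diff = slope_summary(change, participant)
            print(f"change={change} participant={participant} "
                  f"mean slope={mean:.2f} diff={diff:.2f}")
```

Under this coding, the mean slopes correspond to the B values reported per participant gender, and the differences correspond to the Diff values; small deviations in the second decimal are expected because the printed coefficients are rounded.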

Finally, we conducted a generalizability study in order to examine whether the results obtained in this study can be generalized to all admissible preservice teachers. Generalizability (G) theory is a statistical framework that allows for the identification and estimation of different sources of measurement error (Shavelson and Webb 2006) in order to determine the limits of generalizability of the results obtained from the measurement made (Cronbach et al. 1972). These different sources of error (e.g., item, occasion, participants, test form) are called facets of a measurement. In a simple G study, we attempted to isolate and estimate the measurement error attributable to the participants. In the G study we conducted, participants were considered a facet, whereas the independent variables of the study (grand mean, change, gender) were considered the objects of measurement. Table 3 shows the variance components for the main effects and interactions.

Table 3 Variance components for the main effects and interactions

The largest variance components were obtained for participants (12.8%) and for the four-way interaction (44.7%). The variance component for participants shows that, averaging over all levels of the independent variables, the participants in the sample differed systematically in their judgments. The large variance component for the four-way interaction reflects that the participants’ judgments were dependent on the independent variables, a result that was predicted in advance and confirmed by logistic regression analysis. In order to estimate the degree to which the results can be generalized to (similar) persons not part of the sample, we estimated the dependability index (Brennan 2001; see also Shavelson and Webb 2006). The dependability index is analogous to the reliability coefficient in classical test theory, and its formula is as follows (cf. Shavelson and Webb 2006):

$$ \Phi = \frac{\sigma^2_{\mathrm{p}}}{\sigma^2_{\mathrm{p}} + \sigma^2_{\Delta}} $$

where σ²p denotes the variance component for participants and σ²Δ denotes the error variance. Due to the large number of participants in the study (N = 255), the error variance is quite small (σ²Δ = 0.00073), yielding a dependability index of Φ = 0.976.
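As a rough check on the reported index, the formula can be evaluated numerically. The sketch below is illustrative only: the participant variance component is not quoted in this passage, so it is back-calculated from the reported Φ and error variance rather than taken from Table 3, and the function names are hypothetical.

```python
def dependability_index(var_p, var_delta):
    """Phi = var_p / (var_p + var_delta), as in the formula above."""
    return var_p / (var_p + var_delta)


def implied_participant_variance(phi, var_delta):
    """Solve the Phi formula for var_p, given Phi and the error variance."""
    return phi * var_delta / (1.0 - phi)


if __name__ == "__main__":
    var_delta = 0.00073    # error variance reported in the text
    phi_reported = 0.976   # dependability index reported in the text

    # Back-calculated, NOT a value from Table 3 (assumption for illustration).
    var_p = implied_participant_variance(phi_reported, var_delta)
    print(f"implied participant variance: {var_p:.4f}")
    print(f"Phi recomputed: {dependability_index(var_p, var_delta):.3f}")
```

The calculation also shows why a small error variance drives the index toward 1: with the participant variance held fixed, Φ increases as σ²Δ shrinks.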

Discussion

Comparison between the hypotheses and the results obtained

With the present study, we aimed at examining whether preservice teachers considered both the gender of primary school students and the development of their GPAs, indicated by two successive school reports, when making recommendations for the students’ track in secondary school. The results obtained seem to be highly reliable, as the G study showed, so that they might be generalizable to similar participants. We predicted that, in accordance with legal regulations, the grand mean of the grades should affect the probability of a high-track recommendation, with lower grand means (i.e., higher achievements) resulting in higher probabilities. This hypothesis was confirmed. When the grand mean was reduced by one unit on the German grade scale, the odds of a high-track recommendation increased by a factor of approximately 15. This result clearly shows that the preservice teachers acknowledged the overall achievement indicated by the grades of two school reports as a basis for their decision. Hence, what legal regulations envisage was actually adopted by our participants.

However, we additionally assumed that when students improved their GPA within one-half year of schooling, they would be more likely to get a high-track recommendation than students who deteriorated within the same time period, even if their grand mean was the same. This hypothesis was also confirmed. The odds of a high-track recommendation were about two and a half times higher for students who improved than for students who deteriorated. This result clearly contradicts the official regulations provided by the authorities, since teachers (and therefore preservice teachers) are allowed to take into consideration only the grand mean of the GPA of both school reports, but not their change. Also note that this effect resulted from a rather moderate change in GPA. The difference between the successive school reports was an increase or a decrease of one single grade (out of six) by 2 units on the German grade scale. That is, the GPAs of the two school reports differed by 1/3 unit, with one unit being roughly equivalent to a letter grade in the US grading system.
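Both effect sizes follow from exponentiating the model 1 coefficients reported in the Results (B = −2.69 for the grand mean and B = 0.94 for the change of GPA). The sketch below assumes that the change of GPA was coded 0/1 with 1 for improvement, which is not stated in this passage. It also illustrates that exp(B) is a ratio of odds, so the corresponding ratio of probabilities depends on the baseline probability; the baseline of 0.05 is a made-up value for illustration.

```python
import math

B_GRAND_MEAN = -2.69  # model 1 coefficient for the grand mean (from the text)
B_CHANGE = 0.94       # model 1 coefficient for the change of GPA (from the text)


def odds_ratio(b, delta=1.0):
    """Multiplicative change in the odds when the predictor changes by delta."""
    return math.exp(b * delta)


def shifted_probability(p_baseline, ratio):
    """Probability obtained after multiplying the baseline odds by `ratio`."""
    odds = p_baseline / (1.0 - p_baseline) * ratio
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Grand mean reduced by one grade unit (delta = -1), i.e., better achievement.
    or_grand_mean = odds_ratio(B_GRAND_MEAN, delta=-1.0)
    # Improving versus deteriorating student (assumed 0/1 coding).
    or_change = odds_ratio(B_CHANGE)
    print(f"odds ratio, grand mean one unit lower: {or_grand_mean:.1f}")
    print(f"odds ratio, improvement vs. deterioration: {or_change:.2f}")

    # Hypothetical baseline probability, chosen only for illustration.
    p0 = 0.05
    print(f"probability {p0:.2f} becomes {shifted_probability(p0, or_grand_mean):.2f}")
```

For rare outcomes the odds ratio approximates the ratio of probabilities, but for baseline probabilities near 0.5 the two diverge considerably.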

Finally, we predicted that preservice teachers would account for student gender by preferring girls over boys in their high-track recommendations. This hypothesis was not confirmed since the assumed main effect of gender was not significant, which means that on average the participants did not distinguish between male and female students when judging their suitability for the highest track.

However, there were some unexpected significant interactions that qualified the obtained main effects. First, the significant Change of GPA × Grand mean interaction showed that when students were rather low in achievement, the probabilities of receiving a high-track recommendation were quite similar between improving and deteriorating students. However, with high-performing students (indicated by a low grand mean), positively developing students were much better off than negatively developing students.

A possible explanation refers to the students’ achievements that were indicated by the second school report, as the GPA of the second report might have served as a predictor of future achievement. When the grand mean of all grades was rather high (e.g., 3.00), the student’s improvement yielded a second-report GPA of 2.83, and when the student deteriorated, the second-report GPA was 3.17. Although both second-report GPAs were produced by a change of 0.33, neither the improvement nor the deterioration would justify a recommendation for the highest track. That is, at this grand mean, a change of the GPA in either direction would probably be of no consequence for the participants’ decision. Consider now the grand mean of 2.33, indicating rather high achievements. When students improved, their GPA of the second school report was 2.17, whereas when they deteriorated, the GPA of the second report was 2.50. If the participants valued the second report as a predictor of future achievement, a student who improved would certainly be recommended for the highest track, whereas a student who deteriorated would presumably be judged as not eligible for the highest track by quite a high number of participants. Thus, the change between both school reports appears to be valued differently depending on the second-report GPA. Since the GPA of the second report was directly related to the grand mean, the change effect was dependent on the grand mean, thus resulting in a Change of GPA × Grand mean interaction.
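The second-report GPAs used in this argument can be reproduced with simple arithmetic. The sketch assumes that the grand mean is the unweighted mean of the two report GPAs, so that each report lies half of the 1/3-unit change away from the grand mean, and that the grand mean printed as 2.33 is the rounded value 7/3. Both assumptions are inferred from the numbers in the text; `report_gpas` is an illustrative name.

```python
CHANGE = 2 / 6  # one of six grades changes by 2 units, i.e., 1/3 GPA unit


def report_gpas(grand_mean, improved, change=CHANGE):
    """Return (first_report_gpa, second_report_gpa) for one vignette.

    Lower numbers are better grades on the German scale, so an improving
    student moves from a higher to a lower GPA value.
    """
    half = change / 2
    if improved:
        return grand_mean + half, grand_mean - half
    return grand_mean - half, grand_mean + half


if __name__ == "__main__":
    for grand_mean in (3.0, 7 / 3):
        for improved in (True, False):
            first, second = report_gpas(grand_mean, improved)
            label = "improved" if improved else "deteriorated"
            print(f"grand mean {grand_mean:.2f}, {label}: "
                  f"first {first:.2f}, second {second:.2f}")
```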

Second, the significant Change of GPA × Gender interaction means that the difference in the probabilities of getting a high-track recommendation between boys and girls was larger when the students grew in achievement than when they were downgraded. Contrary to our prediction, male students were preferred over female students, at least when they improved rather than declined. A possible explanation refers to maturational differences between boys and girls, which the participants presumably had assumed. If the participants recognized improvement with boys, they might have seen a latent potential in male students that eventually could result in high performance in secondary school.

However, the significant three-way interaction qualified all the significant two-way interactions. Only among rather high-performing students were positively developing boys preferred over girls. When the students were rather low performers, the reverse was the case, meaning that girls were favored over boys. Since female primary school students are on average more mature than their male counterparts (Colom and Lynn 2004; Lynn 1999) and show more positive attitudes to school (OECD 2004), more attentiveness and task persistence (Ready et al. 2005), and more positive working habits (Reyna 2000) than boys, the participants might have evaluated highly performing and positively developing male students as exceptionally good. As such, the participants could have attributed more positive characteristics to these students and hence might have expected a more positive development in secondary school compared to their female counterparts, who in contrast were usually expected to be good performers in school. Indeed, studies on teacher biases in identifying talented students have revealed that teachers who were asked to nominate students for gifted programs based on hypothetical student profiles were more likely to select profiles where the students’ behavior did not match the expected gender stereotype (e.g., Bianco et al. 2011; Powell and Siegle 2000). For example, Powell and Siegle (2000) showed that teachers who generally believed that female students were better at reading than male students rated the profile of a male student who was a very good reader higher than that of a female student with the same skills. Similar results were obtained by Bianco et al. (2011), who demonstrated that teachers were less willing to refer a female student to a gifted and talented program than a male student, although the two students were described identically.

How did the participants judge the change of students’ achievement?

This study provided evidence that preservice teachers regarded the development of grades as an indicator of future success in school, since students who improved their GPA were more than twice as likely to be recommended for the highest track as those who deteriorated. Despite this intriguing effect, the mechanism by which it occurred is still unclear. We assumed that the participants would extrapolate the students’ development in achievement (indicated by their successive GPAs) in line with their sustaining expectations (Cooper 1985; Good and Brophy 2003), according to which improvement would be followed by further improvement, and impairment would be followed by further impairment. When applying this rule of development to the student vignettes of our study, an improving student would certainly be judged as being more successful in secondary school than a deteriorating student, since the former is likely to improve further, whereas the latter is likely to get worse. Even if we assume a less strict rule, for instance by proposing that an increase might be followed by further increase or by maintaining the level of achievement, high-track recommendations would still be more likely compared to students who fall behind their initial achievements. If the participants adopted a growth rule like this, they ignored that the change of GPA might have been the result of pure randomness. When students’ grades increase or decrease, this can be caused by a variety of factors, which might be a “true” change in achievement or skills, but also (and perhaps equally likely) factors like a change of motivation, a change of teachers and a corresponding change of grading rules, or changes of the learning environment at home.

How did the participants’ gender affect their judgments?

Our final analysis has shown that participants’ gender significantly contributed to the likelihood of their high-track recommendations. On average, male participants recommended students more frequently for the highest track than female participants. This “leniency” of male participants in recommending students for the highest track is also mirrored in the results from studies investigating teachers’ characteristics as predictors of their ratings of students. For instance, Taylor et al. (2001) showed that female participants (inservice as well as preservice teachers) rated the degree of learning and behavioral problems of videotaped students on average higher than male participants. Moreover, the female participants in our study considered the grand mean of grades more strongly than did the male participants for their high-track recommendations for both improving and deteriorating students, and even more strongly for boys than for girls when the students were improving in grades. Similarly, research has shown that female teachers were more accurate than male teachers in identifying students’ behavioral problems (Ritter 1989) or learning difficulties (Hopf and Hatzichristou 1999). Hence, it seems that the female participants of our study evaluated the students’ grades with more caution and precision, particularly the boys’ grades, and were less optimistic than the male participants. This effect might also be due to a proneness of men to be more optimistic than women. This gender-dependent “optimistic bias” (Weinstein 1989) has been found in several areas, such as marriage (Lin and Raghubir 2005), self-evaluation (Beyer and Bowden 1997), and the accuracy of grade expectancies (Beyer 1999). With deteriorating students, however, female participants did not differentiate between boys and girls, whereas male participants preferred boys relative to girls and even devalued high-performing girls relative to low-performing girls. The male participants’ preference for boys indicates a same-gender bias, which has also been found in some previous investigations (e.g., Dee 2007; Lavy 2004).

Limitations

Some limitations inherent in the study should be mentioned. First, we did not implement baseline conditions in which no change between successive school reports occurred. Second, this study was experimental in nature. While we therefore could expect a high level of internal validity, field investigations are nevertheless needed in order to show the effects obtained in a natural environment. Third, at the end of the experiment, we did not ask the participants about their preferences with respect to the placement decisions they made, for instance, whether or not they favored male over female students. If we had asked them for subjective reasons for their preferences, we might have gained more insight into the participants’ judgment process. Fourth, grades were not associated with specific school subjects. This might have been confusing for some participants, as in practice grades are always related to school subjects. However, if we had assigned grades to specific school subjects, it is likely that students would have been judged based on the grades in distinct school subjects. To make the design of the study feasible, we abstained from specifying school subjects. For the same reason, we did not incorporate personality traits of students, which might have had an effect on the participants’ judgments, into the vignettes. Fifth and finally, since we examined decisions made by preservice teachers, we cannot safely generalize from the study’s results to inservice teachers’ decision-making. Recent research has identified some differences in decision-making between (rather inexperienced) preservice teachers or students and experienced inservice teachers. For example, Krolak-Schwerdt et al. (2009) showed that teachers (i.e., experts) were more flexible than university students of natural sciences (i.e., laymen) in switching between different modes of information processing when the task was to evaluate characteristics of primary school students. In addition, teachers seem to be not only more flexible in choosing the appropriate information processing strategy, but they also use more information than rather inexperienced teacher students (e.g., Sabers et al. 1991). There is also evidence that when decisions are high-stakes (as is certainly the case with real school-track recommendations), teachers decide more carefully than if they make decisions in a rather artificial experimental context (cf. Glock et al. 2012). However, like preservice teachers, inservice teachers are prone to be affected by their implicit attitudes toward students (e.g., Glock and Karbach 2015; Mertler 2004). Hence, preservice teachers should be made aware of their proneness to judge students differently according to factors not related to the predetermined educational standards.

Follow-up studies could shed more light on the complex decision-making processes involved when preservice or even inservice teachers evaluate students regarding their suitability for a secondary school track. For instance, establishing a baseline condition within a similar experimental design would allow for examining whether participants weight improvement and deterioration of grades equally or differentially. Furthermore, the provision of instructions that explicitly state how to handle information from both school reports could reduce the bias in placement recommendations. Finally, in order to validate the experimental studies, case studies could be conducted in which participants would be presented with more elaborate student descriptions, and participants’ responses would be coded qualitatively.

Conclusions

In Germany (and some other European countries), students leave primary school with a certification from their teachers that recommends them for one of the school tracks in secondary education. The tracks in secondary school are hierarchically ordered, with only the highest track (“Gymnasium”) allowing for university entrance upon successful completion. Hence, it is of great importance for students, preservice teachers, teachers, and policy makers to know which factors have an effect on these recommendations. This study is the first to show that the change of grades between two successive school semesters has a large impact on preservice teachers’ judgments as to the eligibility of students for the highest track in secondary school. Moreover, this effect was dependent on the gender of the students, the gender of the participants, and the students’ overall achievement indicated by the grand mean of the grades of both school reports. Since the effect of the change of grades on the preservice teachers’ judgments is not envisaged by school authorities, and presumably not known by preservice teachers or educational personnel involved in teacher education, we deem it of utmost importance to provide this knowledge for teacher education programs. Further studies should examine whether the effects obtained in this study generalize to real school settings. If data provide evidence that students are partially evaluated by the development of their grades, the tracking policy would have to be reconsidered. Moreover, even in school systems where students are not separated into different school tracks, the results of this study should be considered as a caveat, since the development of school marks, measured on two occasions, might be the result of coincidence and does not necessarily reflect real change.