Many studies have shown that retrieval enhances learning. When students are given a test or quiz over information they have studied, the act of retrieving information leads to significant improvements in long-term memory, even when compared to additional opportunities to restudy the information. For example, Kang et al. (2007) found that undergraduate students’ learning of scientific articles was significantly enhanced by reading the articles once and then completing short-answer quizzes over the articles, relative to simply reading them twice. This benefit of retrieval over restudying is typically referred to as the testing effect or retrieval-enhanced learning (for recent reviews, see Rawson and Dunlosky 2011; Roediger and Butler 2011).

Retrieval-enhanced learning is widely studied, having been demonstrated in over 100 studies in the last several years alone. Retrieval has been shown to produce significant learning gains for a wide variety of materials, including word lists (e.g., Carpenter 2009, 2011; Halamish and Bjork 2011; Karpicke and Zaromb 2010; Kornell et al. 2011; Kuo and Hirshman 1997; Peterson and Mulligan 2013; Zaromb and Roediger 2010), foreign language vocabulary (Carpenter et al. 2008; Coppens et al. 2011; Finn and Roediger 2011; Kang and Pashler 2014; Karpicke and Roediger 2008; Pyc and Rawson 2010; Toppino and Cohen 2009; Vaughn and Rawson 2011; Vaughn et al. 2013), text passages (Agarwal et al. 2008; Butler 2010; Clark and Svinicki 2014; Hinze and Wiley 2011; Kubik et al. 2014; Roediger and Karpicke 2006), and video-recorded lectures (Butler and Roediger 2007). Given the consistency of these findings in laboratory studies, researchers have advocated more frequent use of retrieval-based approaches as a means of promoting students’ learning in authentic educational environments (e.g., Dunlosky et al. 2013; Pashler et al. 2007; Roediger and Pyc 2012).

Confidence in the power of retrieval to improve educational outcomes is tempered by the lack of classroom-based studies on this topic, however. Compared to hundreds of laboratory-based demonstrations of retrieval-enhanced learning under highly controlled conditions, we know much less about the promises—and potential pitfalls—of using retrieval to promote learning in classrooms. Particularly important—but especially lacking in the current literature—are studies comparing the effects of retrieval versus restudying on students’ learning of their course material. Determining whether the benefits of retrieval hold up in these authentic contexts is critical for establishing the generality of the effect and for clarifying the degree to which, and the conditions under which, retrieval should be advocated as a means of promoting student learning.

Retrieval-Enhanced Learning in the Classroom

To date, only a handful of studies has explored the effectiveness of retrieval vs restudying on students’ learning of course-relevant material. Some studies have observed significant benefits of retrieval over restudying in middle-school classrooms (e.g., Carpenter et al. 2009; Roediger et al. 2011) and in college-level online courses (e.g., McDaniel et al. 2007; McDaniel et al. 2012). However, others have observed no significant benefits of retrieval over alternative, nonretrieval-based activities (e.g., creating concept maps) on elementary school children’s learning of science concepts (Karpicke et al. 2014, experiment 1).

These studies provide some support for the use of retrieval as a learning tool in classrooms but suggest that there may be some limitations to its effectiveness. The effectiveness of retrieval could be linked with a variety of individual and situational factors that are present in classrooms. Unlike a laboratory setting that strives to control such factors, a classroom includes a wider range of individual differences in student achievement, prior knowledge of the material, interest, and motivation. It is presently unknown how such factors might interact with the effects of retrieval, as previous studies addressing the role of individual differences in learning from retrieval—particularly in classroom settings—are lacking. Given the diversity among students in classrooms and the need for more classroom-based research on retrieval-enhanced learning, the primary goal of the current study was to provide data on the role of individual differences in learning from retrieval.

In the current study, undergraduate students in an introductory biology course learned information from their course pertaining to the topic of reproduction. Students completed different in-class exercises, some of which required retrieval (e.g., recalling definitions for terms such as primary oocyte and constructing a diagram of the process of oogenesis), and some that provided exposure to the information without requiring retrieval (e.g., copying the definitions rather than recalling them, or labeling a diagram that was already provided). All students completed a quiz 5 days later assessing knowledge of the information learned from the exercises.

The Importance of Student Achievement

To examine retrieval-enhanced learning as a function of individual differences, we focused on a measure that varies considerably across students and is known to influence learning—student achievement. Previous research has shown that the level of knowledge students have achieved on a given topic can be a powerful predictor of further learning. Specifically, students who have learned more on a given topic acquire new knowledge on that topic more readily than students who have not learned as much. In the literature on text comprehension, for example, students who score higher on a pretest measuring knowledge of a particular topic (e.g., the formation of stars), compared to those who score lower, demonstrate better learning of a never-before-seen passage relating to that topic. This effect has been shown for high school students learning about concepts in physical science (Boscolo and Mason 2003) and biology (McNamara et al. 1996), undergraduate students learning about concepts in biology (McNamara 2001), and both undergraduate and graduate students learning about concepts in physics (e.g., Alexander et al. 1994).

This association between prior knowledge and new learning can be influenced—and sometimes reversed—by other factors. For example, high-knowledge learners show the usual advantage over low-knowledge learners when the coherency of a text passage is reduced (e.g., by eliminating connective links between paragraphs or replacing nouns with pronouns), but this advantage for high-knowledge learners can be reduced or eliminated when the coherency of the text is increased (e.g., McNamara 2001; McNamara et al. 1996; see also Kalyuga et al. 2013). One explanation for this effect is that low-coherence text requires effortful, active processing to fill in the gaps, and high-knowledge learners are better equipped to do this than are low-knowledge learners, who lack the appropriate background knowledge to make these inferences. High-coherence text is less dependent on inferences and may even produce inferior learning among high-knowledge students due to its tendency to discourage active processing (e.g., see McNamara 2001).

In a related body of research on the expertise reversal effect, certain instructional methods have been shown to affect learning in different ways depending upon the expertise of the learner. For example, Lee et al. (2006) found that middle-school students’ learning of chemistry concepts via a computerized simulation depended upon how the information was represented, as well as students’ prior knowledge of science. Components of the computer simulation were either represented by a verbal label accompanied by a visual icon (e.g., “temperature” with an image of a burner), or only a verbal label (e.g., “temperature”). Whereas low-knowledge students learned better when the components were represented by the easier-to-understand visual icon with a verbal label compared to the verbal label alone, high-knowledge learners demonstrated the opposite pattern. In a study by Leppink et al. (2012), undergraduate students learned statistics concepts by either providing arguments to support their answer to a true/false statement (e.g., the sample mean is 10), or by reading arguments that had already been provided for them. Students who scored high on a pretest of statistical reasoning learned better from coming up with their own arguments than from reading the examples, whereas students who scored low on the same pretest learned better from reading the examples than from trying to come up with their own arguments. In another study, Cooper et al. (2001, experiment 4) taught high school students how to use a computerized spreadsheet program to make different types of calculations. After completing a tutorial on the program, students were given new problems and were asked to either imagine performing the steps in the calculations or to refer to the on-screen instructions that walked them through each step. Students who performed higher in their mathematics classes learned the calculations better through imagining than through following examples, whereas students who performed lower in their mathematics classes learned better through following examples than through imagining.

These studies suggest that a higher degree of baseline knowledge benefits learning on tasks that require learners to supply information that is not currently present. This appears to be the case for making inferences while reading a text (e.g., McNamara et al. 1996), and for coming up with solutions to problems rather than reading worked examples (Cooper et al. 2001; Leppink et al. 2012). When the task requires processing of information that is already present (e.g., following a worked example), having a high degree of knowledge may not help—and may even hurt—learning because the task encourages processing that is redundant with current knowledge and may lead to disengagement or distraction (e.g., see Lee and Kalyuga 2014). Indeed, in the study by Leppink et al., when students learned the material through worked examples, high-knowledge students actually performed slightly worse than low-knowledge students on a later test over the material.

Applied to the current study, this suggests that retrieval practice may be more effective for high-knowledge students than for low-knowledge students. Retrieval requires an active search of memory, and high-knowledge students may be better equipped to engage in this processing because they have acquired knowledge of the topic that will facilitate successful retrieval. If low-knowledge students have acquired very little knowledge of the topic, there may be less information in memory to retrieve, and therefore, retrieval may be ineffective for learning, or even counter-effective, compared to simply reading the material. Thus, in the current study, retrieval-enhanced learning may be expected to be more pronounced for students who have achieved a higher degree of knowledge over the course material than for students who have achieved a lower degree of knowledge.

Metacognition

Prior knowledge has been shown to predict not only students’ learning, but also their perceptions of their learning. Research on metacognition has shown that students can be poor predictors of their own learning, often giving estimates of their knowledge that exceed actual performance as measured by a later test (e.g., Carpenter et al. 2013; Castel et al. 2007; Dunlosky and Lipko 2009; Finn and Metcalfe 2007; Kornell and Bjork 2009). Furthermore, low performers show a greater tendency than high performers to be overconfident. For example, immediately before the exam in a college-level psychology course, Miller and Geraci (2011) asked students to predict their score. Across a series of experiments, low-performing students (those who scored within the lower quartile of the class) overpredicted their scores by as much as 18 %, whereas high-performing students (those who scored within the upper quartile of the class) consistently underpredicted their scores. Similar results were observed by Bol et al. (2005), who asked students in an undergraduate education course to predict their final exam scores, and found that lower-performing students overpredicted their scores by about twice as much as higher-performing students (12 vs 6 %, respectively).

Other studies show that students often fail to appreciate the benefits of retrieval, predicting that they will recall the information better after having read or restudied it compared to having retrieved it (Agarwal et al. 2008). Presently, however, it is unknown whether this effect occurs in a classroom setting for students learning course material, and whether it varies according to student achievement level. To explore this, after students completed the different types of exercises in the current study, we asked them to estimate how well they would score on a later quiz over the information. Comparing students’ predicted scores with their actual scores allows an assessment of metacognitive awareness as a function of student achievement and the type of exercise (retrieval-based, or nonretrieval-based) that they used to learn the material.

Method

Participants

A total of 311 students from an introductory biology course at a large Midwestern university were invited to participate in the study in exchange for class participation credit. Thirty-six students either did not complete the in-class exercises or were absent on the day the follow-up quiz was administered, resulting in 275 students who completed all phases of the study.

Materials and Design

The study materials consisted of five term-definition pairs and a diagram depicting the process of oogenesis (see Appendix A). These materials were part of the regular course curriculum. On the day the study was conducted, this information had not yet been covered by the instructor in class but had been included on a study guide that students received outside of class, and also in the assigned readings that students received the previous week.

Students completed in-class exercises that required them to engage with this material in one of four different ways, modeled after common instructional methods that have been used to learn this type of material. In the Copy Definitions + Label Diagram condition, students were provided with a sheet of paper that provided the terms and definitions, along with an unlabeled diagram. Students were asked to copy the definitions and label each of the terms within the diagram. In the Copy Definitions + Draw Diagram condition, students were also provided with the terms and definitions and were asked to copy the definitions, but this time, no diagram was provided, and students were asked to draw the diagram and label each of the corresponding terms. In the Recall Definitions + Label Diagram condition, students were given the terms (but not the definitions) and an unlabeled diagram. They were asked to recall the definition for each term and label each term within the diagram. Finally, in the Recall Definitions + Draw Diagram condition, students were provided with only the terms and asked to recall the definitions, draw the diagram, and label each of the terms within the diagram. Thus, the first two conditions did not require students to retrieve the definitions, whereas the last two conditions did.

Each student was assigned to one of these four conditions based on the seating arrangement of the class, which was randomly determined at the beginning of the semester. The classroom was organized into eight seating “zones,” each consisting of two rows of seats that shared a tier within the classroom. The conditions were distributed in alternating fashion by zone, such that students in zones 1 and 5 received the Recall Definitions + Draw Diagram condition (n = 82), students in zones 2 and 6 received the Copy Definitions + Draw Diagram condition (n = 66), students in zones 3 and 7 received the Recall Definitions + Label Diagram condition (n = 73), and students in zones 4 and 8 received the Copy Definitions + Label Diagram condition (n = 54). During the next class period, 5 days later, all students were given an unannounced quiz over the information from the in-class exercises.

Procedure

The study was conducted at the beginning of class. Following some introductory announcements by the instructor, students were informed that they would be working on some in-class exercises related to material they were learning in the course. They were asked to work on the exercises individually, without help from notes, books, or other resources, and to raise their hands when finished. Students were given an opportunity to ask questions, and then the exercises were distributed according to the system described above.

Students completed the exercises at their own pace. Upon completion, each student raised his/her hand. At that point, the worksheet was collected, and the student was then given an answer sheet containing the terms and definitions, as well as the complete labeled diagram (see Appendix A). Students were asked to use this information to reflect on their accuracy on the exercise they had just completed. After reviewing the answer sheet at their own pace, each student was asked to make a judgment of learning (JOL) regarding how well they believed they would score on a multiple-choice quiz over this information. Students were asked to make two JOLs, one estimating their score (expressed as percent correct) on a multiple-choice quiz if it were given immediately, and the other estimating their score on the same quiz if it were given during the following week. Asking students to make both JOLs provided an opportunity to assess their predictions of their own performance both immediately and after a delay. Students wrote down these two JOL values on their answer keys, and then returned the sheets to the researchers.

After finishing the exercise on oogenesis, each student was given a similar exercise pertaining to the process of spermatogenesis, another topic that would soon be covered in the course. Because students were expected to complete the oogenesis exercises at different times (e.g., the Recall Definitions + Draw Diagram condition takes longer than the Copy Definitions + Label Diagram condition), this second exercise was included to provide some additional course-relevant activities for students to work on (to keep them engaged) while other students completed the oogenesis exercises. Performance on the second exercise was not of primary interest, however, and will not be discussed. Some of the content relating to spermatogenesis is similar (but not identical) to that of oogenesis, raising the possibility that portions of the filler exercise could have provided some exposure to concepts that were encountered on the target exercise over oogenesis. Only performance on the oogenesis exercise was analyzed, with the understanding that the filler exercise could have provided additional feedback over some of the concepts. All students completed both exercises (lasting approximately 20 min altogether), and after all materials were collected from students, the instructor commenced with regular class activities.

During the next class period (5 days later), students were given an unannounced quiz over the information from the in-class exercises. All students received the same double-sided sheet of paper containing 20 quiz questions (the first 10 pertaining to oogenesis and the next 10 to spermatogenesis). In consultation with the revised Bloom’s taxonomy (Anderson et al. 2001), the first five questions from each topic were designed to assess knowledge and the last five to assess comprehension (see Appendix B). Students were asked to complete the quiz individually and were encouraged to do their best but were informed that their score would not count toward their grade in the course. As with the in-class exercises, participation credit was granted for completion of the quiz, regardless of students’ scores. At the bottom of the second page of the quiz, students were asked to indicate how much, over the last 5 days, they had studied any of the information contained on the quiz. Students indicated their response using a 0–5 scale (0 representing “not at all” and 5 representing “a lot”). After all students handed in their quizzes, they were given a debriefing concerning the nature of the study, were encouraged to ask questions, and were provided with the contact information of the primary researcher.

Results

Student Achievement

Student achievement was operationalized as overall performance in the course based on credit earned from four mandatory exams (all of which were completed prior to the in-class experiment) and a number of daily in-class activities (excluding the in-class experiment) that were administered throughout the semester via individualized response systems, or “clickers.” For the 275 students who completed the study, overall course performance ranged from 38 to 95 % (M = 71.5 %, SD = 11 %).

Students were partitioned into performance levels based on whether their overall course performance fell within the upper, middle, or lower third of the students who completed the study. Table 1 displays mean course performance for these levels across the four experimental conditions. Across the four conditions, no significant differences in course performance were observed among high performers, F(3, 86) = 1.15, p = .33, among middle performers, F(3, 88) = 1.35, p = .26, or among low performers, F(3, 89) = 2.25, p = .09.

Table 1 Mean course performance for high, middle, and low performers across the four experimental conditions
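
For readers who want to see how such a tertile split and the condition-equivalence checks can be computed, the sketch below gives a minimal Python illustration. It is not the authors’ analysis code; the file name (course_data.csv) and column names (course_pct, condition) are hypothetical.

```python
# Minimal sketch (hypothetical file and column names): split students into
# achievement tertiles, then check that overall course performance does not
# differ across the four exercise conditions within each tertile.
import pandas as pd
from scipy import stats

df = pd.read_csv("course_data.csv")  # one row per student

# Upper, middle, and lower thirds of overall course performance (0-100 %)
df["level"] = pd.qcut(df["course_pct"], q=3, labels=["low", "middle", "high"])

# One-way ANOVA on course performance across conditions, within each tertile
for level, grp in df.groupby("level", observed=True):
    samples = [g["course_pct"].values for _, g in grp.groupby("condition")]
    f, p = stats.f_oneway(*samples)
    print(f"{level}: F = {f:.2f}, p = {p:.3f}")
```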

Accuracy on the In-Class Exercises

Accuracy on the in-class exercises was scored by two independent raters who were knowledgeable about the content. Accuracy of the definitions was scored by awarding two points for a fully correct answer, one point for a partially correct answer, and zero points for an incorrect answer. Accuracy of the diagrams was scored by awarding one point for each component of the diagram that was correctly drawn, plus one additional point if the correct label was included. For the two conditions requiring students to label the diagram but not draw it (the Copy Definitions + Label Diagram condition and the Recall Definitions + Label Diagram condition), one point was automatically included for the presence of the component in the diagram, and accuracy of the labeling was scored by awarding one point for each component that was correctly labeled. Across all conditions, inter-rater correlations for the accuracy of the definitions and diagrams (excluding the two conditions that involved merely copying the definitions) ranged from .72 to .91, ps < .001. Accuracy across all conditions was computed by averaging the two raters’ scores.
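
The reliability computation described above amounts to a Pearson correlation between the two raters’ scores, with the final accuracy score taken as their mean. A minimal sketch with hypothetical rater scores follows; it is illustrative only, not the authors’ scoring script.

```python
# Minimal sketch (hypothetical values): inter-rater agreement and the final
# accuracy score, computed as the mean of the two raters' scores.
import numpy as np
from scipy import stats

rater1 = np.array([8, 5, 9, 3, 7, 6, 10, 4])  # hypothetical scores, rater 1
rater2 = np.array([7, 5, 10, 2, 7, 5, 9, 4])  # hypothetical scores, rater 2

r, p = stats.pearsonr(rater1, rater2)   # inter-rater correlation
accuracy = (rater1 + rater2) / 2        # final score = average of the two raters
print(f"inter-rater r = {r:.2f}, p = {p:.3f}")
```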

Table 2 displays accuracy on the experimental exercises as a function of student course performance and condition. For the conditions that required retrieval of the definitions (Recall Definitions + Label Diagram and Recall Definitions + Draw Diagram), high performers retrieved the definitions better than middle performers [ts > 2.38, ps < .03, ds > .65] or low performers [ts > 3.86, ps < .001, ds > 1.16]. Though middle performers retrieved the definitions better than low performers in the Recall Definitions + Label Diagram condition [t(48) = 3.07, p = .003, d = .88], this was not true for the Recall Definitions + Draw Diagram condition [t(45) = .49, p = .62]. In the Recall Definitions + Draw Diagram condition, high performers drew and labeled the diagram better than middle performers [t(61) = 2.03, p = .047, d = .52] or low performers [t(52) = 3.89, p < .001, d = 1.22], and middle performers drew and labeled the diagram better than low performers [t(45) = 2.11, p = .04, d = .67]. In the Copy Definitions + Draw Diagram condition, middle performers drew and labeled the diagram better than high performers [t(36) = 2.49, p = .018, d = .81] or low performers [t(45) = 2.74, p = .009, d = .82], with no significant difference between the latter two, t(45) = .25.

Table 2 Accuracy on the in-class exercises for high, middle, and low performers across the four experimental conditions

For the conditions requiring students to label the diagram that was provided, high performers labeled the diagram better than middle or low performers. In the Recall Definitions + Label Diagram condition, high performers labeled the diagram better than middle performers [t(48) = 3.48, p = .001, d = 1.00] or low performers [t(44) = 2.5, p = .016, d = .74], with no significant difference between the latter two, t(48) = .29, p = .77. Though the same pattern occurred for the Copy Definitions + Label Diagram condition, accuracy was fairly high for all students. A marginally significant advantage occurred for high performers over low performers [t(34) = 1.77, p = .08, d = .64], but not between high and middle performers [t(29) = 1.13, p = .27], or between middle and low performers [t(39) = .61, p = .55].

Quiz Scores

Quiz scores were examined as a function of experimental condition, type of quiz question (knowledge vs comprehension), and student course performance. Table 3 displays the mean quiz scores as a function of experimental condition and course performance for knowledge questions (upper half) and comprehension questions (lower half). For knowledge questions, high performers appeared to benefit more from exercises requiring retrieval of the definitions compared to copying of the definitions, whereas middle performers and low performers appeared to show the opposite pattern. Within each of the three course performance levels, no significant differences were observed between the two conditions requiring retrieval of the definitions [ts < .34, ps > .70], or between the two conditions requiring copying of the definitions [ts < .95, ps > .35]. Thus, to increase the sample sizes within each group for the comparisons of interest—retrieval vs copying—we combined the two conditions requiring retrieval of the definitions (Recall Definitions + Draw Diagram and Recall Definitions + Label Diagram), and the two conditions requiring copying of the definitions (Copy Definitions + Label Diagram and Copy Definitions + Draw Diagram). A 3 × 2 (performance × retrieval) between-subject analysis of variance (ANOVA) revealed a significant interaction, F(2, 269) = 5.04, p = .007, η² = .04, in that high-performing students benefited more from exercises that required retrieval vs copying, whereas middle- and low-performing students did not.

Table 3 Mean quiz scores for knowledge and comprehension questions across the four experimental conditions as a function of student course performance

Figure 1 (left panel) displays this interaction. High performers’ scores were significantly higher following retrieval than copying of the definitions, t(88) = 2.45, p = .016, d = .50, middle performers’ scores were similar following retrieval vs copying, t(90) = .53, p = .60, and low performers’ scores were actually lower following retrieval than copying, t(91) = 2.31, p = .02, d = .48. The same 3 × 2 ANOVA revealed no overall main effect of retrieval vs copying, F(1, 269) = .12, p = .73, but a main effect of performance level, F(2, 269) = 17.31, p < .001, η² = .11, in that high performers achieved higher quiz scores overall compared to middle performers or low performers.
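
The overall analysis strategy—collapsing the four conditions into retrieval vs copying, running the 3 × 2 between-subjects ANOVA, and following up with t tests within each performance level—can be sketched as below. The data file, column names (knowledge_score, level, condition), and condition labels are hypothetical, and the snippet is an illustration rather than the authors’ analysis code.

```python
# Minimal sketch (hypothetical names): 3 x 2 (performance level x retrieval)
# between-subjects ANOVA on knowledge-question quiz scores, plus follow-up
# retrieval-vs-copying t tests within each performance level.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("quiz_data.csv")  # one row per student

# Collapse the four exercise conditions into retrieval vs copying
df["retrieval"] = df["condition"].str.startswith("Recall")

# Two-way ANOVA: performance level x retrieval
model = smf.ols("knowledge_score ~ C(level) * C(retrieval)", data=df).fit()
print(anova_lm(model, typ=2))

# Follow-up: retrieval vs copying within each performance level
for level, grp in df.groupby("level", observed=True):
    t, p = stats.ttest_ind(grp.loc[grp["retrieval"], "knowledge_score"],
                           grp.loc[~grp["retrieval"], "knowledge_score"])
    print(f"{level}: t = {t:.2f}, p = {p:.3f}")
```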

Fig. 1 Mean quiz scores for knowledge questions (left panel) and comprehension questions (right panel) as a function of student course performance and whether definitions were learned through retrieval vs copying. Error bars represent standard errors

As a supplemental analysis, we computed correlations between student course performance and later quiz scores for the groups that learned the definitions through retrieval vs copying. Consistent with the interaction described above, this correlation was stronger for those students who learned the definitions through retrieval [r(155) = .44, p < .001] than through copying [r(120) = .21, p = .02], Fisher’s r-to-z transformation: z = 2.11, p = .03.
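
The comparison of two independent correlations via Fisher’s r-to-z transformation can be reproduced as in the sketch below, using the values reported above. The sample sizes are inferred on the assumption that the degrees of freedom in parentheses equal n − 2; the snippet is illustrative only.

```python
# Minimal sketch: compare two independent correlations with Fisher's r-to-z.
import numpy as np
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    z1, z2 = np.arctanh(r1), np.arctanh(r2)      # Fisher's r-to-z transformation
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))    # SE of the difference in z values
    z = (z1 - z2) / se
    return z, 2 * norm.sf(abs(z))                # two-tailed p value

# Sample sizes assume r(df) reports df = n - 2 (i.e., n = 157 and 122)
z, p = compare_correlations(r1=.44, n1=157, r2=.21, n2=122)
print(f"z = {z:.2f}, p = {p:.3f}")
```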

For comprehension questions (Fig. 1, right panel), the same 3 × 2 ANOVA revealed only a main effect of performance level, F(2, 269) = 12.13, p < .001, η² = .08. There was no main effect of retrieval vs copying, and no interaction, Fs < 1.29, ps > .27. Correlations between student course performance and later quiz scores were similar for those who learned the definitions through retrieval [r(155) = .39, p < .001] vs copying [r(120) = .34, p < .001], Fisher’s r-to-z transformation: z = .47, p = .64, indicating a positive relationship between student course performance and later quiz scores that did not depend upon whether students learned definitions through retrieval vs copying.

Metacognition

Students’ metacognitive calibration was assessed by comparing predicted quiz scores with actual quiz scores. Although students made two predictions—one pertaining to an immediate quiz and one pertaining to a quiz “next week”—only the latter is relevant in computing calibration, as the quiz itself was administered the week after the in-class exercises. Calibration scores were based on knowledge questions (and not comprehension questions), as the knowledge questions were identical to those that students encountered on the in-class exercises and would thus represent the information that was available when students made their predictions.

Table 4 shows students’ predicted vs actual quiz scores on knowledge questions for high performers, middle performers, and low performers. Individual calibration scores (computed by subtracting the actual quiz score from the predicted quiz score) were computed for each student. Across performance levels, no significant differences in calibration scores were observed between the two conditions requiring retrieval of the definitions [ts < .50, ps > .60], or between the two conditions requiring copying of the definitions [ts < .40, ps > .70]. Thus, as before, we combined scores from the two conditions requiring retrieval of the definitions (Recall Definitions + Label Diagram and Recall Definitions + Draw Diagram) and the two conditions requiring copying of the definitions (Copy Definitions + Label Diagram and Copy Definitions + Draw Diagram). A 3 × 2 (performance × retrieval) between-subject ANOVA on the calibration scores revealed a significant interaction, F(2, 264) = 3.21, p = .042, η² = .02.

Table 4 Mean predicted and actual quiz scores across the four experimental conditions as a function of student course performance
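
The calibration measure described above, and the one-sample tests of overconfidence and the correlation with course performance reported below, can be sketched as follows. The data file and column names (jol_delayed, knowledge_score, level, course_pct) are hypothetical; this is an illustration of the computation (predicted minus actual), not the authors’ code.

```python
# Minimal sketch (hypothetical names): calibration = predicted score for the
# delayed quiz minus actual knowledge-question score; positive = overconfident.
import pandas as pd
from scipy import stats

df = pd.read_csv("quiz_data.csv")                              # one row per student
df["calibration"] = df["jol_delayed"] - df["knowledge_score"]

# One-sample t test against zero within each performance level
for level, grp in df.groupby("level", observed=True):
    cal = grp["calibration"].dropna()          # students without a JOL are excluded
    t, p = stats.ttest_1samp(cal, 0)
    print(f"{level}: M = {cal.mean():.1f} %, t = {t:.2f}, p = {p:.3f}")

# Overconfidence as a function of overall course performance
sub = df.dropna(subset=["calibration"])
r, p = stats.pearsonr(sub["course_pct"], sub["calibration"])
print(f"r = {r:.2f}, p = {p:.3f}")
```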

This interaction is displayed in Fig. 2. High performers demonstrated better calibration following retrieval than copying, t(87) = 2.11, p = .037, d = .44, middle performers demonstrated a similar but nonsignificant pattern, t(89) = .89, p = .37, and low performers demonstrated slightly worse calibration following retrieval than copying, t(88) = 1.45, p = .15. The ANOVA revealed no overall main effect of retrieval vs copying, F(1, 264) = .83, p = .36, but did reveal a significant main effect of performance level, F(2, 264) = 10.74, p < .001, η² = .07, in that overall calibration was better for high performers than for middle performers or low performers. The overall calibration score of high performers (M = 4.45 %) did not differ significantly from zero, t(88) = 1.47, p = .15, indicating a close match between students’ predicted scores and their actual scores. However, the calibration score of middle performers (M = 11.59 %) was significantly greater than zero, t(90) = 3.67, p < .001, d = .39, as was the calibration score of low performers (M = 25.82 %), t(89) = 8.39, p < .001, d = .88, indicating overconfidence. A correlation between student course performance and calibration scores confirmed this pattern, revealing a negative relationship, r(270) = −.27, p < .001, indicating that overconfidence increased as student course performance decreased.

Fig. 2 Mean predicted and actual quiz scores for knowledge questions as a function of student course performance and whether definitions were learned through retrieval vs copying. Means exclude data from five students who did not provide a judgment of learning. Error bars represent standard errors

It is also informative to compare students’ predictions of their scores on an immediate quiz vs a delayed quiz. Are students aware that forgetting occurs over time, such that a delayed quiz would likely yield lower scores than an immediate quiz? To the contrary, students’ predictions for the delayed quiz were actually higher than for the immediate quiz. This was true for high performers (77 vs 80 % for the immediate and delayed quiz, respectively), middle performers (64 vs 79 %), and low performers (63 vs 79 %). A possible explanation for this pattern is that while thinking about a future quiz and how they would score, students may have assumed that they would have the opportunity to study for that quiz. As a result, they would expect their score on a future quiz to be higher than on a quiz given right at that moment. The fact that students were asked to report how much they studied prior to taking the quiz provides an opportunity to explore calibration under conditions in which this assumption was met—i.e., when students actually did study the information. Table 5 reports predicted quiz scores vs actual quiz scores for high, middle, and low performers as a function of how much students reported studying the information prior to the quiz. Table 5 indicates that even when students studied the information, they still overpredicted their own scores, and this overconfidence was greatest for lower performing students.

Table 5 Mean predicted and actual quiz scores as a function of student course performance and the amount of studying that students engaged in prior to the quiz

Explaining Different Learning Patterns as a Function of Student Course Performance

Why are high performers more likely than middle or low performers to benefit from retrieval? There are at least two possibilities. First, quiz scores may reflect different degrees of postretrieval exposure to the material. Given that the quiz was administered 5 days after the in-class exercises, it is possible that students studied some of the material between completing the exercises and taking the quiz, and if so, performance could reflect the effects of studying (which may be more likely among high performers) and not the effects of the retrieval exercises per se. However, this possibility seems unlikely, as the amount that students reported studying was not correlated with later quiz scores, r(270) = −.014, p = .82.

The other possibility is that retrieval-enhanced learning is linked with the rate of success at recalling the definitions during the in-class exercises. Analysis of performance on the in-class exercises supports this idea (Table 2). In the conditions requiring retrieval of the definitions, high performers recalled 35 % of the definitions correctly (SD = 24 %), middle performers recalled 18 % correctly (SD = 15 %), and low performers recalled 11 % correctly (SD = 13 %). Independent sample t tests confirmed that high performers recalled significantly more than middle performers [t(111) = 4.52, p < .001, d = .86] and low performers [t(98) = 5.85, p < .001, d = 1.24], and middle performers recalled significantly more than low performers [t(95) = 2.22, p = .03, d = .46]. Furthermore, across all performance levels, the success rate of initial retrieval of the definitions was significantly correlated with later scores on the knowledge quiz questions, r(155) = .38, p < .001.

Thus, learning from retrieval appears to be linked with the amount of information that students initially retrieve. Students who recall more of the information—i.e., high performers—may benefit from retrieval because they gain additional exposure to this information through recall, and then gain further exposure through studying the answers during feedback. If students do not know enough to successfully retrieve the information—or if they are seeing it for the first time—retrieval may be ineffective because the information has not been effectively encoded, in which case learning would take place primarily by studying the answer sheet. Thus, an important contributor to the effectiveness of retrieval in classroom situations could be the degree to which a student has encoded and can successfully retrieve the information.

Discussion

In a large introductory biology course, classroom-based retrieval exercises were more effective for high performers than for middle or low performers. Whereas high performers learned biology definitions better when they had to retrieve the definitions, low performers learned the definitions better when they had to copy them. This finding is consistent with studies on the expertise reversal effect (e.g., Kalyuga 2007; Kalyuga et al. 2003; Lee and Kalyuga 2014), showing that high performers typically benefit from methods that require them to fill in or elaborate on information that is not currently present, whereas low performers benefit from additional processing of information that is provided for them (e.g., Cooper et al. 2001; McNamara et al. 1996).

In the current study, a likely explanation for this interaction is that high performers have a greater degree of baseline knowledge on the topic that permits successful retrieval on the initial test. Initial retrieval accuracy of the definitions was significantly correlated with later quiz scores. This is somewhat different from laboratory studies showing that initial retrieval success is not always necessary for benefits of retrieval to occur, as long as corrective feedback is provided. Some laboratory studies have shown, for example, that students learn better from failed retrieval attempts than from merely reading, as long as the correct answer is provided after the retrieval attempt (e.g., Kornell et al. 2009; Pashler et al. 2005). In the current study, students were always provided with the correct answers after trying to retrieve them, but they still benefited more from successful retrieval than from failed retrieval.

There may be important differences between laboratory- and classroom-based studies that influence the effectiveness of learning from feedback, however. Typical laboratory studies involve relatively simple stimuli that are learned under highly controlled conditions, with the goal of controlling factors such as the level of prior knowledge of the to-be-learned material (keeping it at minimal-to-none) and—unless it is directly manipulated—the learner’s motivation. In an environment that minimizes distractions and encourages task engagement, failing to retrieve an item in a laboratory study (assuming feedback is provided) may not be accompanied by any harmful consequences to learning. Classrooms, on the other hand, are less controlled environments that contain students with a broad range of background knowledge, interests, and motivation. Compared to a laboratory study, failing to retrieve curriculum-based information in a classroom study could more likely reflect lapses in prior knowledge, interest, or motivation that could perpetuate suboptimal learning. It may be the case, therefore, that successful initial retrieval is more important to retrieval-enhanced learning in the classroom than in the laboratory.

Indeed, at least one recent study has shown that positive effects of retrieval practice do not always occur in classroom environments, even when feedback is provided. Karpicke et al. (2014, experiment 1) gave elementary-school children a series of retrieval exercises over science concepts and found that children only retrieved about 10 % correct initially. On a later test, retrieval was not more effective than alternative, nonretrieval-based activities that required children to interact with the material, and this was true even though children received feedback after their initial retrieval attempts. In a follow-up experiment in which the retrieval conditions were more likely to facilitate success (i.e., by reducing the amount of information to retrieve and improving its organizational structure), retrieval demonstrated its usual advantage over simply reading the material (Karpicke et al., experiment 3). Thus, like the current results and unlike results of some laboratory-based experiments, failed retrieval attempts followed by feedback may not always benefit learning in authentic educational environments, particularly when the level of initial retrieval is low.

Consistent with this finding, previous studies in classroom environments have shown significant overall advantages of retrieval vs rereading under conditions that increase the success of initial retrieval—for example, by administering a quiz immediately after a lesson (which resulted in 89 % correct on the quiz and a later advantage of retrieval (91 %) over rereading (83 %) in the study by Roediger et al. 2011), or by allowing students multiple retrieval opportunities with feedback (which resulted in over 95 % correct on the quizzes and a later advantage of retrieval (87 %) over rereading (75 %) in the study by McDaniel et al. 2012). These results highlight the importance of providing guidance or scaffolding to facilitate retrieval in classroom environments, and the current results suggest that this may be especially important for lower performing students.

Retrieval success per se may not be the only factor contributing to the current results. The degree of success while retrieving information during a class activity could reflect individual student differences in motivation, interest, cognitive abilities, propensity to learn from feedback, or other factors. High performers may embody these characteristics more so than middle or low performers, such that retrieval success may be only a partial contributor—or even a by-product—of other individual characteristics. Though the influence of such additional factors cannot be ascertained from the current study, one previous study showed that the frequency of students’ reported use of retrieval practice as a study strategy was positively related to student achievement (Hartwig and Dunlosky 2012). An interesting possibility, therefore, is that higher performing students may be more familiar with the use of retrieval practice, increasing the likelihood that they use it effectively to learn course material.

Just as a given construct (i.e., student achievement) may relate to retrieval-enhanced learning for a variety of reasons, retrieval-enhanced learning could reflect a number of additional constructs. A small number of studies has begun to explore these possibilities: one study reported that individual differences in retrieval-enhanced learning do not appear to be linked with working memory (Brewer and Unsworth 2012), whereas another reported that these differences were accounted for by the interactive effects of working memory capacity and trait test anxiety (Tse and Pu 2012). Along similar lines, the positive effects of retrieval have been attenuated by the application of performance pressure at the time of retrieval (Hinze and Rapp 2014). These initial studies suggest that there may be important individual and situational factors underlying retrieval-based learning—particularly as they relate to performance, or one’s perception of one’s own performance—that have much potential for further exploration.

In the current study, effects of retrieval were only observed for quiz questions that tested memory of the same definitions that appeared on the retrieval exercises (i.e., knowledge questions) and did not occur for never-before-seen questions that tested a higher degree of understanding (i.e., comprehension questions). This is consistent with the results of some recent studies showing that retrieving the answer to a particular question does not always facilitate retrieval of the answer to a related but never-before-seen question (e.g., Hinze and Wiley 2011; Wooldridge et al. 2014). The degree of retrieval-enhanced facilitation to new questions may depend, in part, on how the retrieved information relates to the never-before-seen question. In the study by Wooldridge et al., final test questions were drawn from the same textbook chapter as the information that was originally retrieved, but may not have tapped the same concepts. In Hinze and Wiley’s study (experiments 1–2), final test questions were drawn from the same paragraph of text as the retrieved information, but it is possible that students did not draw a connection between the retrieved information (e.g., the fact that daughter cells are created from parent cells in mitosis) and the nonretrieved information (e.g., the fact that daughter cells are genetically identical) that would facilitate later memory for the nonretrieved information. Studies demonstrating retrieval-induced transfer have often involved initial testing conditions in which the retrieved information bears a strong link with—and may even prompt explicit recall of—the nonretrieved information (e.g., Butler 2010; Carpenter and Kelly 2012; Chan et al. 2006). Indeed, in a third experiment, Hinze and Wiley found that free recall of entire paragraphs—which is more likely than short-answer questions to activate knowledge of the entire passage—resulted in better performance (compared to rereading the passage) on later, never-before-seen multiple-choice questions. In another recent study, Bjork et al. (2014) observed positive effects of retrieval on the concepts that served as incorrect lures on a multiple-choice test. To the extent that students processed each of the lures as potential answers to the question, they may have activated or retrieved information associated with each lure while answering the question, increasing the chances that a later question tapping knowledge of the lure would be answered correctly. Along similar lines, McDaniel et al. (2012) found that answering a quiz question in an undergraduate Brain and Behavior course (e.g., “Information coming INTO a structure (arriving) is called:” answer: “afferent”) facilitated later performance on a conceptually related but nonidentical question (e.g., “Information leaving a nervous system structure is called:” answer: “efferent”).

This provides some insight into why recalling answers to knowledge-based questions did not facilitate later performance on comprehension-based questions in the current study. Even though the knowledge and comprehension questions related to the same content, it seems unlikely that students would have needed to activate comprehension-based information in order to answer the knowledge-based questions. On the other hand, answering comprehension-based questions would seem to require knowledge-level representations; so, practice at retrieving comprehension-based information may be more likely to transfer to knowledge-based information than the other way around (e.g., see McDaniel et al. 2013). Indeed, one recent study reported that high-level quiz questions (those requiring application, evaluation, and analysis of information) were more effective than low-level questions (those requiring mere recall of information) at promoting later exam performance on both high-level and low-level questions (Jensen et al. 2014). Similarly, Hinze et al. (2013) found that students who read a science passage while expecting a future test containing higher-order inference-based questions scored higher than students who read the same passage while expecting a future test containing detail-based questions that were explicitly stated in the passage. Furthermore, students who expected the inference-based test performed better on inference-based questions and detail-based questions, relative to students who expected the detail-based test.

Thus, the degree of transfer resulting from retrieval may depend on how students approach, and engage with, the retrieval task. In the absence of conditions that promote connectivity between the retrieved information and later transfer questions (as in the current study), retrieval-enhanced learning may be relatively specific to the information that was practiced. Factors that promote this connectivity, however—through using higher-order questions during practice (Jensen et al. 2014), test expectancy instructions (Hinze et al. 2013, experiment 2), or the construction of explanations during retrieval of complex text materials (Hinze et al. 2013, experiment 3)—may increase the flexibility of retrieval-enhanced learning. Though much of the literature on retrieval practice has focused on measuring direct retention of relatively specific types of knowledge, a timely and worthwhile goal for future studies is to develop and apply retrieval-based methods for promoting the types of higher-order comprehension and application skills that align with educational goals (e.g., Carpenter 2012; Pellegrino 2012).

Finally, we found that metacognitive calibration was better for high performers than for middle or low performers. This is consistent with prior studies conducted in university classrooms showing that low-performing students tend to overpredict exam scores more than high-performing students (e.g., Bol et al. 2005; Miller and Geraci 2011). The relationship between achievement and metacognitive awareness exists even when students do not make specific performance predictions. For example, Schraw and Dennison (1994) administered a 52-item survey measuring everyday metacognitive monitoring behavior among students (e.g., “I ask myself periodically if I am meeting my goals” and “I find myself pausing regularly to check my comprehension”) and found that greater monitoring was associated with higher scores on a subsequent test of reading comprehension. Thus, students who show higher academic achievement also tend to show a higher degree of metacognitive monitoring.

Consistent with previous research, we also found that students displayed improved metacognitive calibration following retrieval (Agarwal et al. 2008; Little and McDaniel 2014; Tullis et al. 2013); however, this was only apparent for high performers and not for low performers. This could have been driven, in part, by the tendency for the conditions involving copying to increase the perceived ease of processing of the material, which has been known to inflate judgments of learning (e.g., Benjamin et al. 1998; Carpenter and Olson 2012; Diemand-Yauman et al. 2011; Rhodes and Castel 2008; Serra and Dunlosky 2010). The fact that this occurred for high performers suggests that high achievement may not inoculate students against using a (sometimes faulty) ease-of-processing heuristic while making judgments of learning.

Such findings are also consistent with a recent study by Szpunar et al. (2014), who showed that students’ learning of statistics concepts from a video-taped lecture increased, and metacognitive calibration improved, when quizzes were inserted periodically throughout the lecture. In addition, at least one classroom-based study has documented an improvement in students’ predictions of their own exam scores over the course of a semester (with overconfidence initially very high but then decreasing with each subsequent exam) and has shown that high performers were more likely than low performers to show this improvement (Hacker et al. 2000). Thus, whereas high performers appear to get better at aligning their predictions with performance as a result of practice, middle and low performers may be in need of additional metacognitive training to improve calibration.

In conclusion, the current classroom-based study highlights the important role that individual student achievement can play in the effectiveness of retrieval-based learning. High performers were more likely than middle or low performers to benefit from retrieval and were more likely to accurately estimate their own performance on a later quiz. Given the wide range of student achievement that is present in many classrooms, these results encourage researchers to consider individual differences in student achievement when evaluating the effectiveness of educational interventions.