Over the past decade, there has been robust interest in the effects of retrieval practice on learning, with a special emphasis on how best to apply the benefits of retrieval (or testing) to the complex tasks, materials, and assessments found in educational settings. The consistent finding from recent research has been that retrieval practice promotes meaningful learning of complex materials (Carpenter 2012; Dunlosky et al. 2013; Karpicke 2012; Roediger and Pyc 2012). In this issue, Van Gog and Sweller claim that there is no testing effect for complex materials and that this represents a “boundary condition” on the effect. This is a dangerous claim because it may mislead educators into thinking that retrieval practice is not effective for learning complex educational materials when in fact a wealth of research has shown that it is.

The reasoning behind the claim is flawed. Whether educational materials are simple or complex is orthogonal to whether retrieval practice enhances learning. To be able to retrieve, use, and apply knowledge in the long term, it is highly effective to practice retrieving, using, and applying that knowledge during learning. Consider an analogy: A student who wants to learn to play a piece of music on the piano practices playing it, rather than merely reading the sheet music or reading a book about how to play the piece. Van Gog and Sweller’s claim is akin to saying that practicing the piano works only for simple pieces, but that to learn a complex piece of music, practicing does not work and students should not bother doing it. The reasoning simply makes no sense.

Van Gog and Sweller’s analysis is questionable as well. The following sections describe specific problems with it and highlight research showing benefits of retrieval practice for learning complex materials.

“Complexity” and “Element Interactivity” Are Poorly Defined

Van Gog and Sweller’s analysis is ambiguous in part because it conflates the complexity of the materials, the complexity of the initial learning activity, and the complexity of the criterial assessment. Van Gog and Sweller define complex materials as those that are high in element interactivity. Material that is “high” in element interactivity contains elements or ideas that are interrelated, such that learning some ideas depends on learning other ideas in the material. Material that is “low” in element interactivity contains items that can be learned in isolation, without reference to other items or ideas in the materials (paraphrasing Van Gog and Sweller 2015). Element interactivity is certainly an important aspect of educational materials that should be examined in a thorough analysis of the literature and in rigorous experiments (indeed, that sentiment is not new; see McDaniel and Einstein 1989). Unfortunately, neither was done in the present issue.

The central problem is that Van Gog and Sweller never offer a quantitative metric for measuring element interactivity. Their analysis of previous research is entirely subjective, and without an objective measure, the idea of element interactivity can be applied on the fly to suit the authors’ needs. A wide variety of measures exist to assess dimensions of educational materials, including, for example, Latent Semantic Analysis (Foltz et al. 1998), which was used by de Jonge et al. (2015), and Coh-Metrix (Graesser et al. 2004; Graesser et al. 2011), which provides more advanced measures of the cohesiveness of materials. No such measure was used to define and assess element interactivity; instead, the critical idea in Van Gog and Sweller’s analysis of the retrieval practice literature was evaluated purely subjectively.
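
To make the concern concrete, an objective index of element interactivity could be as simple as the proportion of content words shared between adjacent sentences, in the spirit of the referential cohesion indices computed by Coh-Metrix. The sketch below is a minimal illustration of such a proxy; the tokenization, stopword list, and overlap formula are our own simplifications, not the actual Coh-Metrix algorithm.

```python
import re

STOPWORDS = frozenset({"the", "a", "an", "of", "and", "or", "to", "in",
                       "is", "are", "was", "were", "that", "this", "it"})

def adjacent_sentence_overlap(text):
    """Mean proportion of content words shared between adjacent sentences:
    a crude, illustrative proxy for referential cohesion."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_sets = [
        {w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS}
        for s in sentences
    ]
    overlaps = [
        len(prev & curr) / min(len(prev), len(curr))
        for prev, curr in zip(word_sets, word_sets[1:])
        if prev and curr
    ]
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

# Illustrative three-sentence passage (our own, loosely echoing the
# black-holes topic of the de Jonge et al. materials).
sample = ("Black holes form when massive stars collapse. "
          "The collapse compresses the star's core. "
          "The core becomes so dense that light cannot escape it.")
print(f"adjacent-sentence overlap = {adjacent_sentence_overlap(sample):.2f}")
```

Even a crude index like this would allow element interactivity ratings to be checked against the materials themselves rather than assigned subjectively.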

Several strange things happen without an objective measure of complexity or element interactivity. The only materials that Van Gog and Sweller deemed high in element interactivity were those in their own worked-example experiments (Leahy et al. 2015; Van Gog and Kester 2012; Van Gog et al. 2015), a paper by Tran et al. (2015), and de Jonge et al. (2015). de Jonge et al. had students study a 1000-word text on black holes. For a retrieval practice task, students filled in missing words in individual, isolated sentences. Van Gog and Sweller rated this as high element interactivity; they did not specify whether this rating applied to the materials, the retrieval activity, or both. Like de Jonge et al., Tran et al. (2015) had students study a set of seven to nine sentences (e.g., “Students commute from off-campus housing to campus by any of 3 routes”) and practice retrieval by filling in words missing from the sentences (“Students commute from off-campus housing to campus by any of ___ routes”). The retrieval task involved recall of isolated words within individual facts. Nevertheless, these materials and activities received a high rating from Van Gog and Sweller.

The remaining experiments in which students read text materials (or watched videos) and practiced retrieval in various ways (e.g., by answering conceptual questions) were deemed “medium” element interactivity by Van Gog and Sweller. For example, Butler and Roediger (2007) used a 30-min videotaped classroom lecture on art history and had students answer short-answer questions. Van Gog and Sweller rated their materials as “Medium/High?” (question mark in original) and their short-answer activity as low element interactivity. All the other studies that used educational texts were rated as medium/high, often with a question mark (e.g., Agarwal et al. 2008; Blunt and Karpicke 2014; Johnson and Mayer 2009; Roediger and Karpicke 2006; Weinstein et al. 2010).

Van Gog and Sweller wrote that “instructional texts on scientific phenomena or mechanical systems are typically high in element interactivity” (2015), yet their analysis discounted or excluded essentially all the previous research with such materials. We wondered whether there were measurable differences between the materials deemed high complexity and those deemed medium complexity by Van Gog and Sweller. Table 1 shows several measures of the materials used in experiments highlighted by Van Gog and Sweller (and some experiments excluded from their analysis), including length in words, Flesch Reading Ease, Flesch-Kincaid Grade Level, and a measure of referential cohesion from Coh-Metrix. Referential cohesion is the degree to which ideas within a text overlap and are connected across sentences (see Graesser et al. 2011); it provides one possible measure that may capture element interactivity within a text. Table 1 shows that the text used by de Jonge et al. was relatively high in referential cohesion, and that two of the brief scenarios used by Tran et al. (2015) exhibited very high referential cohesion while the other two were reasonably high. Yet several experiments demonstrating retrieval practice effects have also employed materials with relatively high referential cohesion; for example, materials used by Hinze and Wiley (2011), Johnson and Mayer (2009), and Roediger and Karpicke (2006) all scored above the 70th percentile for referential cohesion. Some experiments used materials with referential cohesion scores as high as or higher than the de Jonge et al. and Tran et al. materials (namely, Karpicke and Blunt 2011, and McDaniel et al. 2009). All of the experiments displayed in Table 1, except de Jonge et al. and Tran et al., showed retrieval practice effects. Clearly, retrieval practice enhances learning for both low- and high-complexity materials.

Table 1 Measures of text characteristics from retrieval practice experiments
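
For reference, the readability measures reported in Table 1 can be computed with the standard Flesch formulas. The sketch below shows the calculation; the vowel-group syllable counter is a rough stand-in for the dictionary-based counters used by readability tools, so treat its output as approximate.

```python
import re

def flesch_measures(text):
    """Flesch Reading Ease and Flesch-Kincaid Grade Level for a text.
    Syllables are approximated by counting vowel groups per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    wps = len(words) / max(1, len(sentences))   # words per sentence
    spw = syllables / max(1, len(words))        # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return len(words), reading_ease, grade_level

# Example: one of the Tran et al. (2015) sentences quoted above.
n, ease, grade = flesch_measures(
    "Students commute from off-campus housing to campus by any of 3 routes."
)
print(f"{n} words, Reading Ease = {ease:.1f}, Grade Level = {grade:.1f}")
```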

Van Gog and Sweller’s analysis of the complexity of retrieval practice tasks is perhaps even more bizarre than their analysis of the materials. Van Gog and Sweller consider freely recalling material (e.g., Blunt and Karpicke 2014; Roediger and Karpicke 2006) or summarizing it (e.g., Johnson and Mayer 2009; Weinstein et al. 2010) to be low element interactivity retrieval tasks. Yet to freely recall or create a summary, a learner must rely on a mental model of how the material is organized and use this relational knowledge structure as a plan to guide retrieval. Free recall and summarization require high degrees of element interactivity by their very nature.

Van Gog and Sweller also rated experiments in which students answered conceptual short-answer questions as low element interactivity (Agarwal et al. 2008; Blunt and Karpicke 2014; Johnson and Mayer 2009; Kang et al. 2007; Weinstein et al. 2010). For example, Agarwal et al. (2008) had students read 1000-word texts (e.g., one was about the Voyager spacecraft) and answer short-answer questions requiring them to make inferences and explain concepts (e.g., “Why did the Voyager have instruments that would measure ultraviolet and infrared light?”). Agarwal et al.’s retrieval task was rated low by Van Gog and Sweller. Similarly, Johnson and Mayer (2009) had students study a notoriously difficult set of materials on how lightning storms develop, write a summary explanation of the materials, and answer inferential questions (e.g., “What could you do to decrease the intensity of lightning?”). Van Gog and Sweller rated this experiment “Medium/High?” (question mark in original) and did not distinguish whether the rating referred to the materials, the retrieval tasks, or both.

Blunt and Karpicke (2014) had subjects freely recall or create concept maps as retrieval practice tasks. Concept mapping involves creating diagrams in which students identify the individual elements within a set of material, place those elements in nodes, draw links to connect related nodes in a network, and write descriptions along the links to specify how the elements are related. If anything, concept mapping would seem to be a quintessential method for promoting and assessing the processing of element interactivity within a set of materials. Van Gog and Sweller rated the task as low/medium element interactivity.
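
In computational terms, a concept map is a labeled graph: the nodes hold the individual elements, and each labeled link states how two elements are related. A minimal sketch (the example propositions are our own, chosen to echo the lightning materials mentioned above, purely for illustration) makes the point concrete.

```python
# A concept map represented as a labeled graph: each edge explicitly
# relates two elements, which is why concept mapping inherently
# engages element interactivity.
concept_map = {
    ("lightning", "charge separation"): "is caused by",
    ("charge separation", "updrafts"): "develops from",
    ("updrafts", "warm moist air"): "carry",
}

for (source, target), relation in concept_map.items():
    print(f"{source} --[{relation}]--> {target}")
```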

To recap, tasks in which students filled in individual words in isolated sentences were rated as high in element interactivity, while tasks in which students freely recalled, produced summaries, answered inferential short-answer questions, or created concept maps were deemed low or, at best, medium in element interactivity. The rating scheme appears to be completely backwards. Even without a quantitative measure of element interactivity during retrieval, it is clear that filling in individual words in sentences requires little or no integration across ideas, whereas all the other retrieval activities described here require substantially more organizational and relational processing. We will return to this point, because the nature of the retrieval tasks is ultimately the key reason why some of the present studies failed to observe retrieval practice effects.

None of the Experiments Manipulated Complexity or Element Interactivity

There is an even more serious limitation in the research summarized by Van Gog and Sweller: Element interactivity was not manipulated in any of the worked-example experiments reported by Leahy et al. (2015), Van Gog and Kester (2012), or Van Gog et al. (2015). None of the experiments showed that, holding everything else constant, the testing effect exists for simple (low element interactivity) materials but disappears for complex (high element interactivity) materials.

The case can be made that de Jonge et al. (2015) manipulated element interactivity across experiments. In Experiment 1, de Jonge et al. presented the materials intact (deemed high element interactivity by Van Gog and Sweller), while in Experiment 2, they presented the materials as a series of randomly ordered, but still clearly interrelated, facts. This is exactly what Chan (2009) did in a series of experiments (see too Chan 2010, and Chan et al. 2006; none of these papers was mentioned in the present special issue). Chan (2009) had students read lengthy texts either intact or with the sentences randomly ordered, which he referred to as high and low integration conditions, respectively. The students then answered short-answer questions that required them to relate multiple concepts within the texts. Chan observed robust benefits of retrieval practice on delayed tests given 1 day after the initial learning phase for both the low and high integration conditions.

Karpicke and Blunt (2011) also directly manipulated the type of materials that learners studied and practiced retrieving; again, a discussion of this fact is absent from Van Gog and Sweller’s analysis. Karpicke and Blunt had students read texts with enumeration structures, which listed a series of facts and concepts about a topic, and texts with sequential structures, which described a connected series of events and steps in a process (see too Cook and Mayer 1988; Meyer 1975). Sequential texts are likely higher in element interactivity than are enumeration texts; the measures of referential cohesion in Table 1 support this claim. Karpicke and Blunt showed large benefits of retrieval practice on long-term retention for both types of text (see their Figure 2). It is also worth noting that Karpicke and Blunt used concept mapping as a final assessment of learning; thus, robust benefits of retrieval practice were evident on final assessments that explicitly required students to specify the interactions among elements.

Existing Research Has Shown Retrieval Practice Effects with Complex Materials

In addition to experiments that directly manipulated the complexity of the materials, several studies have shown retrieval practice effects with complex materials, often carried out in authentic educational settings. Many of these studies were excluded from Van Gog and Sweller’s analysis.

A surprising omission is McDaniel et al. (2009). They used complex materials (included in the analysis in Table 1) about the workings of mechanical systems (brakes and pumps) and showed benefits of retrieval practice on delayed assessments that measured recall and the ability to apply knowledge and solve new problems. Chan’s research (Chan 2009, 2010; Chan et al. 2006), which was also not discussed by Van Gog and Sweller, showed that retrieval practice enhanced long-term retention in low and high text-integration conditions. He also showed that practicing retrieval of a portion of complex material can spread to and enhance long-term retention of portions that were not explicitly tested, a phenomenon called retrieval-induced facilitation. Butler’s (2010) results, which showed that retrieval practice enhanced transfer of knowledge with questions that required learners to integrate multiple concepts, were discounted by Van Gog and Sweller for unclear reasons. Several studies have shown that retrieval practice enhances learning of spatial information, such as the locations and relationships among objects on maps or diagrams (e.g., Carpenter and Kelly 2012; Rohrer et al. 2010). The task of retrieving spatial relations seems high in element interactivity, yet again these findings were excluded from Van Gog and Sweller’s analysis.

A wealth of recent research has extended the benefits of retrieval practice to classroom learning. Several studies, carried out in authentic classroom settings, have shown that retrieval practice improves student learning of the materials studied in school, using educationally relevant retrieval activities and assessments (e.g., Agarwal et al. 2012; Butler et al. 2014; Dobson and Linderholm 2015; Jensen et al. 2014; Lyle and Crawford 2011; McDaniel et al. 2007a, b, 2013; McDermott et al. 2014; Roediger et al. 2011). In one striking example, Larsen et al. (2013) had medical students practice retrieval of clinical knowledge (e.g., the symptoms that would be diagnostic of particular disorders). Six months after initial learning, practicing retrieval improved the medical students’ performance at forming diagnoses in a simulated patient scenario. To us, this unquestionably represents complex learning of complex materials.

Given the evidence, Van Gog and Sweller’s claim that retrieval practice does not enhance learning of complex materials is jarring. Indeed, existing research has already affirmed that retrieval practice enhances learning in “educationally relevant tasks that are closer to the ultimate goal of education” (Van Gog and Sweller 2015).

The Worked-Example Data Are Ambiguous but Tend to Show a Positive Effect of Retrieval Practice

Based on the evidence reviewed so far, Van Gog and Sweller’s central claim that retrieval practice does not enhance learning of complex materials is incorrect. The overwhelming evidence shows that retrieval practice is effective for both simple and complex materials, and that it benefits meaningful, long-term learning in authentic educational settings. Why, then, were so few positive effects observed in the studies highlighted by Van Gog and Sweller?

There are two clear explanations. First, as we have emphasized, de Jonge et al. (2015) and Tran et al. (2015) had people practice retrieval by filling in individual words in isolated sentences. That retrieval activity does not require the kind of integrative, relational processing that occurs in free recall, summarization, or answering inferential short-answer questions. The idea of element interactivity is indeed important for retrieval practice, but it is element interactivity during the retrieval activity that matters, not the complexity or element interactivity within the set of materials. Hinze and Wiley (2011) directly compared fill-in-the-blank retrieval activities, like those used by de Jonge et al. and Tran et al., with more integrative retrieval activities; theirs is another report not included in Van Gog and Sweller’s analysis (see Table 1). Hinze and Wiley showed that initial fill-in-the-blank tests did not produce retrieval practice effects relative to restudying, whereas freely recalling the material in paragraph format produced reliable retrieval practice effects.

Second, the worked-example experiments (Leahy et al. 2015; Van Gog and Kester 2012; Van Gog et al. 2015) essentially involved massed retrieval practice immediately after each worked example. That is, students were given a worked example and then immediately given a problem to solve as a “retrieval practice” event. It is not at all clear that students needed to retrieve anything about the prior learning episode to solve such problems, and episodic retrieval is an essential ingredient for retrieval practice effects (Karpicke et al. 2014; Karpicke and Zaromb 2010; Lehman et al. 2014). At best, the task afforded immediate, massed retrieval practice, which requires little or no episodic context reinstatement (Delaney et al. 2010) and is certain to produce very poor long-term retention (e.g., Carpenter and DeLosh 2005; Karpicke and Bauernschmidt 2011; Karpicke and Roediger 2007; Pyc and Rawson 2007). In sum, the experiments reported by de Jonge et al. (2015) and Tran et al. (2015) and the worked-example experiments all failed to show retrieval practice effects because of the way retrieval practice was implemented, not because of the complexity of the materials.

Finally, even with these limitations (specifically, that the worked-example procedures involved massed retrieval practice), the data from the worked-example experiments still show a small but positive benefit of retrieval practice. The data summarized by Van Gog et al. (2015) are very noisy, and the individual studies reported in their small-scale meta-analysis (their Figure 1) are all underpowered, as evidenced in part by the very wide error ranges. Nevertheless, the data reported in the meta-analysis show an overall effect of d = 0.19 with an error range that barely includes zero. Notably, the effect size observed in Leahy et al.’s (2015) Experiment 3 on a delayed final test was also d = 0.19. Yet Van Gog et al. interpret the existing data not as evidence for a small, positive effect, but as evidence that there is no effect at all.

The results of the worked-example experiments, despite their problems, certainly show a small but positive effect of retrieval practice. A nonsignificant p value does not provide evidence against an effect or, put differently, in favor of a null effect (see Rouder et al. 2009). To gain more insight into the data, we entered the t statistics and sample sizes from the contrasts in Van Gog et al.’s (2015) small-scale meta-analysis into a Bayesian meta-analysis (Rouder and Morey 2011). This allowed us to evaluate the strength of evidence for the hypothesis that there was no effect (d = 0) relative to a small, positive effect (0 < d < 0.20). The Bayesian meta-analysis showed positive evidence in favor of a small effect relative to a null effect (BF = 4.13). In other words, the observed data are about 4 times more likely under a small positive effect than under a null effect.
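
For readers who want to see the machinery, the sketch below shows how a meta-analytic Bayes factor of this kind can be computed from t statistics and sample sizes, following the logic of Rouder and Morey (2011): the likelihood of each t under effect size d is a noncentral t density, the joint likelihood is their product across studies, and the alternative hypothesis places a uniform prior on 0 < d < 0.20 as described above. The sketch assumes paired (one-sample) contrasts for simplicity, and the t values and sample sizes shown are hypothetical placeholders, not the actual values from Van Gog et al.’s (2015) Figure 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import nct

# Hypothetical placeholders; the real inputs are the contrasts in
# Van Gog et al.'s (2015) Figure 1.
t_stats = [1.1, 0.4, 1.6, -0.2]
n_sizes = [30, 24, 40, 28]

def joint_likelihood(d):
    """Product across studies of noncentral-t densities for effect size d.
    (For between-subjects contrasts, the df and noncentrality differ.)"""
    like = 1.0
    for t, n in zip(t_stats, n_sizes):
        like *= nct.pdf(t, df=n - 1, nc=d * np.sqrt(n))
    return like

# Marginal likelihood under H0 (d = 0) and under H1 (d uniform on (0, 0.20)).
m0 = joint_likelihood(0.0)
m1, _ = quad(lambda d: joint_likelihood(d) / 0.20, 0.0, 0.20)

print(f"BF (small positive effect vs. null) = {m1 / m0:.2f}")
```

With the actual t statistics and sample sizes from the meta-analysis, this calculation yields the BF = 4.13 reported above.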

Conclusions

The claim that there is no testing effect for complex materials is incorrect. A wealth of research, reviewed here only briefly, has shown that practicing retrieval enhances learning of complex materials in educational settings. Much of this literature, including experiments that directly manipulated the complexity of the materials, was not included in Van Gog and Sweller’s analysis. The experiments emphasized by Van Gog and Sweller involved either recall of isolated words in individual sentences or immediate, massed retrieval practice with worked-example materials. Retrieval practice effects were not observed in those experiments because of methodological issues, not because of the complexity of the materials. Despite the limitations of the worked-example experiments, they provided good evidence of a small but positive effect of (massed) retrieval practice with worked-example materials, contrary to Van Gog et al.’s interpretation. Finally, if element interactivity is to be a useful construct in educational research, it needs to be defined in a quantitative, measurable way. We offered referential cohesion as one possible measure, but better measures can likely be developed. Ultimately, the influence of material complexity must be assessed in experiments that directly manipulate it. Given the large base of relevant research, the testing effect is alive and well with complex materials.