Practice testing is arguably one of the most potent learning techniques documented to date. Several hundred experiments from more than 100 years of research have established that taking a test does more than just assess learning—it actually enhances learning (for recent reviews, see Dunlosky et al. 2013; Roediger et al. 2011; Rowland 2015). Although practice testing has been shown to enhance learning under a wide range of conditions, it is unreasonable to assume that testing (or any other learning strategy) will work equally well for all learners, materials, and tasks. Thus, identifying factors that moderate the testing effect is important for both theoretical and applied purposes.

Whereas prior research has identified some moderators (as described in the target articles and discussed further below), this special issue highlights material complexity as a plausible moderator that has not received sufficient attention in the literature. Furthermore, although testing effects have been demonstrated across a wide range of learning materials, the target articles correctly note that the extant literature is heavily populated by research involving relatively simple materials (e.g., word pairs or word lists). Because practice testing is increasingly being prescribed to students and teachers as an effective learning technique (e.g., Dunlosky et al. 2013; Pashler et al. 2007), one would want high confidence that testing effects also hold for the kinds of complex materials that are commonly the object of educational learning goals. Thus, this special issue addressing the extent to which testing effects depend on material complexity is timely and important. If the testing effect is consistently absent—or worse yet, reversed—for complex materials, this finding would certainly have important implications for prescriptions for teachers and students.

Is the Testing Effect Absent for Complex Materials?

Prior research has shown that testing effects may be absent and sometimes even reversed when criterion performance is assessed immediately or shortly after initial learning (as briefly discussed by van Gog and Sweller 2015). However, this known moderator is arguably not troublesome for educational purposes, given that the goal of education is long-term maintenance of knowledge. Thus, the question of greatest practical interest here is whether the testing effect is absent for complex materials when criterion performance is assessed after a delay.

First, here is the good news: In the target articles, none of the experiments involving a delayed criterion test found a reversed testing effect. But here is the potentially bad news: Only one experiment reported a statistically significant positive testing effect (de Jonge et al. 2015, experiment 2), and then only for an incoherent text (d = 0.58 vs. d = 0.13 for the coherent text in experiment 1). Based on the outcomes reported in their target article, van Gog et al. (2015) concluded that “In none of these experiments, nor in an overall analysis, did we find evidence of a testing effect” (current p. 19). Similarly, Leahy et al. (2015) concluded that “The testing effect may not be obtainable using high element interactivity materials” (current p. 11).

However, these all-or-none conclusions are arguably too strong and too heavily weighted by failures to reach conventional levels of statistical significance in underpowered experiments. For example, the observed effect in Leahy et al. (2015) experiment 3 was d = 0.20, which was not significant given the small sample size (achieved power was only 0.08). Similarly, the mean effect size estimate in the mini-meta-analysis reported by van Gog et al. (2015) was d = 0.19, but despite the large combined sample, achieved power for the reported two-tailed test was 0.43. Arguably, a one-tailed test would be warranted for this comparison (given that it tests an a priori directional prediction that testing outperforms restudy), and it would have yielded a significant effect (p = 0.038). With that said, the effects demonstrated across the target articles are undeniably small, but they are consistently positive. Thus, the weight of the evidence across the target articles does not easily support the all-or-none conclusion that the testing effect is absent for complex materials. Rather, a more nuanced but arguably more appropriate conclusion is that the testing effect generalizes to complex materials in direction, although not in magnitude.
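To make the power argument concrete, the sketch below illustrates how little power a two-group comparison has to detect an effect of roughly d = 0.19, and why a one-tailed test halves the two-tailed p-value when the effect falls in the predicted direction. The sample sizes are illustrative placeholders, not the actual ns from the target articles, and this is not the authors' analysis.

```python
# Minimal illustration (not the authors' analysis): achieved power for a
# two-sample comparison at a small standardized effect size, computed with
# statsmodels. The per-group ns below are assumptions for illustration only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for n_per_group in (30, 90, 180):
    two_tailed = analysis.power(effect_size=0.19, nobs1=n_per_group,
                                alpha=0.05, ratio=1.0,
                                alternative='two-sided')
    one_tailed = analysis.power(effect_size=0.19, nobs1=n_per_group,
                                alpha=0.05, ratio=1.0,
                                alternative='larger')
    print(f"n = {n_per_group}/group: power two-tailed = {two_tailed:.2f}, "
          f"one-tailed = {one_tailed:.2f}")

# For an effect observed in the predicted direction, the one-tailed p is
# half the two-tailed p (e.g., a two-tailed p of ~.076 corresponds to a
# one-tailed p of ~.038).
```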

Broader conclusions concerning whether testing effects obtain for complex materials can also be informed by available evidence from prior research. To this end, what is the evidence from prior research involving the kinds of complex materials used in the target articles (either problem-solving tasks or text materials)? Concerning prior research involving problem-solving tasks, few prior studies in the worked-example literature have administered criterion tests after a delay, and fewer still included the practice conditions of interest here (example-problem pairs vs. examples only, as in the target articles by Leahy et al. and van Gog et al., or the closely related comparison of examples only vs. problems only). van Gog and Kester (2012) compared example-problem pairs vs. examples only for novice undergraduates learning how to troubleshoot electrical circuits and found a reversed testing effect on a criterion test administered 1 week later (d = −0.66), which is potentially worrisome. However, Darabi et al. (2007) showed strong positive effects of practice tests in a troubleshooting task. Undergraduates in engineering courses worked with software simulating a water-alcohol distillation plant to diagnose and repair malfunctions. After initial basic instruction, students either studied four descriptive worked examples (similar in kind to those used in the target articles) or completed four problem-solving trials. On a transfer test several days later, the problem-solving group significantly outperformed the worked-example group (d = 0.98), which is quite promising. Given the mixed outcomes reported by van Gog and Kester (2012) and Darabi et al. (2007) and the relative paucity of research on problem-solving tasks involving delayed tests, more research including these conditions is essential before any strong conclusions can be drawn.
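For readers less familiar with the metric used throughout, the following generic sketch shows how a standardized mean difference (Cohen's d) is computed from group summary statistics and why a negative d indicates a reversed effect. The scores below are invented for illustration; they are not data from van Gog and Kester (2012) or Darabi et al. (2007).

```python
# Generic sketch of Cohen's d for two independent groups (pooled SD).
# All summary statistics are hypothetical, for illustration only.
import math

def cohens_d(mean_test, sd_test, n_test, mean_restudy, sd_restudy, n_restudy):
    """Standardized mean difference (testing minus restudy/examples-only)."""
    pooled_var = (((n_test - 1) * sd_test ** 2 +
                   (n_restudy - 1) * sd_restudy ** 2) /
                  (n_test + n_restudy - 2))
    return (mean_test - mean_restudy) / math.sqrt(pooled_var)

# Hypothetical criterion-test scores (percent correct):
print(cohens_d(62.0, 18.0, 30, 70.0, 16.0, 30))   # reversed effect -> negative d
print(cohens_d(74.0, 15.0, 30, 60.0, 14.0, 30))   # positive testing effect
```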

Fortunately, the extant literature concerning testing effects with text material is on firmer footing. It is true that the majority of prior research on testing effects involved simpler materials, particularly older research (with notable exceptions such as Gates 1917, as described by van Gog and Sweller 2015). But the recent surge of research on test-enhanced learning has increasingly involved text materials. Rawson and Dunlosky (2011) summarized methods from 168 experiments reported from 2000 to 2010, and 36 of these involved text or lecture materials (32 of the 36 experiments included a delayed criterion test, although not all of them directly compared a testing condition to a restudy condition). Since 2010, many more papers have been published on test-enhanced learning for text materials with delayed criterion tests (some of these are summarized in Table 1 of van Gog and Sweller 2015). Thus, the available prior outcomes are too extensive to describe at length here. As luck would have it, Rowland (2015) recently conducted a meta-analysis of the testing effect literature, specifically focusing on comparisons of testing vs. restudy and examining type of material as a moderator. The mean weighted effect size was similar for prose (g = 0.58, based on k = 23 effect size estimates) vs. paired associates (g = 0.59, k = 71) and stronger than for word lists (g = 0.39, k = 58). Rowland also reported outcomes for the subset of studies that included feedback and/or reported at least 75% performance on the practice tests, given that these conditions represent a more level playing field with respect to re-exposure to target information (i.e., all information is re-exposed during restudy, whereas in the absence of feedback, only correctly retrieved information is re-exposed during practice testing). Under these conditions, estimated effects for prose were even more impressive (g = 0.73, k = 13, vs. g = 0.69, k = 47 for paired associates or g = 0.64, k = 27 for word lists). Thus, the testing effect would appear to be alive and well for text materials.
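For readers unfamiliar with how a "mean weighted effect size" such as Rowland's g values is formed, the sketch below shows a generic inverse-variance weighted mean under a fixed-effect model. The effect sizes and variances are invented for illustration and are not Rowland's (2015) data or his exact modeling approach.

```python
# Generic sketch of an inverse-variance weighted mean effect size
# (fixed-effect model). The g values and variances are hypothetical.
import math

effects = [  # (Hedges' g, sampling variance) for hypothetical studies
    (0.45, 0.04),
    (0.70, 0.09),
    (0.55, 0.02),
]

weights = [1.0 / v for _, v in effects]          # inverse-variance weights
g_bar = sum(w * g for (g, _), w in zip(effects, weights)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))               # SE of the weighted mean

print(f"weighted mean g = {g_bar:.2f}, 95% CI = "
      f"[{g_bar - 1.96 * se:.2f}, {g_bar + 1.96 * se:.2f}]")
```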

In sum, the weight of the available evidence (from the target articles and from prior research) does not support the conclusion that the testing effect is not obtainable for complex materials. Although the magnitude of the effect is disappointingly small in some cases, the effect is, with very few exceptions, consistently positive. If testing is often but not always substantially better than restudy, the prescriptive conclusions for teachers and students remain unchanged: Testing is still the strategy that has the highest likelihood of paying off.

In closing, I again point to the importance and timeliness of this special issue, which highlights the need for more empirical and theoretical work on test-enhanced learning for complex materials. Although the weight of the evidence still favors practice testing as an effective learning technique for complex materials, more research is clearly needed to further examine when and why these effects may be limited, which in turn can inform efforts to optimize test-enhanced learning for educationally relevant materials and tasks.