The potential impact of perceptual disfluency on reasoning and memory is provocative. The dual-process model that links disfluency to performance is plausible: perceptually degraded materials (vs. non-degraded materials) are more difficult to process, and these difficulties may increase the likelihood that people will think analytically (for reasoning tasks) or process the material more deeply (for memory tasks). Better thinking and deeper processing in turn improve performance. Moreover, the benefits of disfluency may be obtained by simply changing the font of texts or the contrast of non-verbal material. Thus, to take advantage of disfluency as an educational tool, students do not need to be instructed in tactics for reasoning better or in effective techniques for processing materials more deeply while they study. When they encounter disfluent processing while studying, the disfluency itself is expected to trigger those more effective processes without any support beyond placing a perceptual hurdle in front of them; trip students up, and watch them excel.

Given the provocative nature of perceptual disfluency as a reasoning and memory modifier, researchers’ interest in processing disfluency is not surprising, and the target articles in this special issue collectively provide a significant advance toward answering an important question. Namely, are disfluency effects robust? Answers to this question will have critical implications for whether disfluency should be recommended as a tool to improve people’s reasoning and memory. Nevertheless, the importance of being cautious about such recommendations cannot be overstated, especially for educational applications in which provocative interventions are adopted by well-meaning educators and administrators before the interventions are sufficiently supported by empirical evidence (the impact of learning styles is a high-profile example, see Pashler et al. 2008).

The conclusions from the abstracts of the target articles are not so encouraging and certainly raise doubts, not only about whether disfluency effects are robust, but also about whether they exist at all – after all, any significant effect may be a statistical fluke, which is one reason why replicating potentially important effects is essential in psychological science (see Pashler and Wagenmakers 2012). We suspect that if you are reading our commentary, then you did more than just flip through the abstracts of the target articles – if not, we encourage you to examine the details of each one, because the details indicate that even more evidence is needed before a confident verdict can be reached concerning the potential role of disfluent processing in improving reasoning and memory. What is evident from the target articles is that perceptual disfluency may matter in some contexts and for some people, given that materials presented in a perceptually degraded format (a) improved retention and comprehension of expository text for those with high working-memory spans (Lehmann et al. 2016; but see Strukelj et al. 2016); (b) improved rates of solving tricky problems presented on a computer screen (Sidi et al. 2016); and (c) influenced people’s judgments of their learning (Magreehan et al. 2016, Experiments 4 & 5) and study time (e.g., Rummer et al. 2016).

Given these positive results as well as others previously reported (for reviews, see Alter 2013; Unkelbach and Greifeneder 2013), we suspect that research on disfluency will continue. Thus, a major contribution of the target articles is how they offer recommendations – both explicitly and implicitly – about how to explore the disfluency hypothesis to afford the most compelling and conclusive advances. We expand on these recommendations in the remainder of our commentary.

Treat the contribution of disfluency as a hypothesis and evaluate it against competitors

A great start for testing the disfluency hypothesis is to treat the contribution of perceptual disfluency as a possibility, not a given. The researchers of the target articles are treating disfluency in this manner, at least with respect to the question, Do perceptually degraded materials impact performance? The target articles provide many examples of hypotheses that are explicitly stated and that can be disconfirmed. The most prevalent hypotheses were versions of the moderated disfluency hypothesis (from Eitel and Kühl 2016), which in general claims that the positive effects of disfluency will be moderated by a variety of factors. For instance, Lehmann et al. (2016) and Strukelj et al. (2016) evaluated the hypothesis that the impact of perceptually degraded materials is moderated by individual differences in working memory; Sidi et al. (2016) evaluated whether its impact is moderated by whether materials are presented on a computer screen or on paper; and Kühl and Eitel (2016) evaluated whether it is moderated by test expectancy. The current focus on evaluating the moderated disfluency hypothesis illustrates how research on the positive impact of disfluency is in its infancy, because the main goal of the target articles was largely to address whether perceptually degrading materials impacts performance at all. The outcomes from the articles are compelling: The disfluency hypothesis failed to pass many tests that it could have passed (for an excellent overview of how the moderated disfluency hypothesis fared in the current articles, see Table 1, Kühl and Eitel 2016). The positive impact of perceptually degrading materials appears to be limited, although constrained versions of the hypothesis, such as those based on the moderated disfluency hypothesis, may yet succeed.

The first part of this recommendation – to treat the possible impact of disfluency as a hypothesis – is trivial, but its implications are important. One implication is that trying to find materials “that work” is not appropriate – for instance, it would be inappropriate to investigate the potential impact of four different kinds of presumably disfluent fonts, obtain a positive effect for one font (e.g., a t-test with p < .05), and then report only the outcome relevant to that font. Doing so treats disfluency effects as though they exist somewhere in nature, waiting for researchers to reveal them. Instead, per the recommendation to treat the impact of disfluency as a hypothesis, the effect sizes involving all the fonts investigated should be reported, not only to sidestep the file-drawer problem, but also to establish the breadth of the impact of perceptually degrading materials on performance. Most researchers understand the perils of withholding evidence (and many journals now require a statement about whether any data are being withheld), so we hope that this recommendation is unnecessary.

Even so, we included this recommendation so that we could raise a more subtle issue, which pertains to understanding why degraded formats impact performance when they do. That is, the disfluency hypothesis is about an empirical relationship, but if this relationship is obtained, differential processing fluency is not necessarily the cause. Consider the intriguing results from Magreehan et al. (2016, Experiment 4), who found that judgments of learning (JOLs) were lower for words that were italicized in light gray than for words that were bolded in black. They also measured study time (which is one measure of processing fluency), and as expected, study times were longer for words that were italicized in light gray than for those in black. The effects were small but significant, and they reflect an empirical disfluency effect. But why did changes in font impact people’s JOLs? A seductive answer is that it must be fluency; that is, the relative difficulty in processing the light-gray font (as measured by study time) produces a subjective experience (or emotion) that in turn influences people’s judgments. This answer is seductive simply because the experimenters presumably manipulated processing difficulty, so how could differential processing fluency not explain the outcomes? Fortunately, when measures of processing fluency are taken, this answer can be evaluated by conducting a mediation analysis. That is, if processing differences are responsible for the effects of font on JOLs, then this relationship should be reduced when controlling for study time. We conducted a mediation analysis on data from their Experiment 4. The mean intra-individual correlations between the indices were as follows: (a) font (1 = bold-black font; 2 = italicized-light font) and JOLs = −.08 (SEM = .02), indicating that JOLs were lower for words presented in the italicized-light font; (b) font and study time = −.04 (SEM = .01), demonstrating the small but significant relation between font and processing difficulty; and (c) study time and JOL = .09 (SEM = .04). When study time was factored out of the relationship between font and JOLs, the resulting mean correlation was −.07 (SEM = .02), which did not differ from the correlation between font and JOLs (−.08).
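To make the logic of this within-participant mediation check concrete, here is a minimal sketch in Python. The data frame trials, its column names, and the use of a simple partial correlation are illustrative assumptions on our part; they are not the actual data or analysis code from Magreehan et al. (2016).

```python
# A minimal sketch of the intra-individual mediation logic described above.
# Assumes a hypothetical data frame `trials` with columns: participant,
# font (1 = bold-black, 2 = italicized-light), study_time, and jol.
import numpy as np
import pandas as pd
from scipy import stats

def partial_r(r_xy, r_xz, r_yz):
    """Partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def per_participant_correlations(df):
    """Compute the three pairwise correlations (and the partial correlation
    of font and JOL controlling for study time) separately for each participant."""
    rows = []
    for pid, d in df.groupby("participant"):
        r_fj = stats.pearsonr(d["font"], d["jol"])[0]        # font–JOL
        r_ft = stats.pearsonr(d["font"], d["study_time"])[0]  # font–study time
        r_tj = stats.pearsonr(d["study_time"], d["jol"])[0]   # study time–JOL
        rows.append({
            "participant": pid,
            "font_jol": r_fj,
            "font_time": r_ft,
            "time_jol": r_tj,
            "font_jol_partial": partial_r(r_fj, r_ft, r_tj),
        })
    return pd.DataFrame(rows)

# Mean intra-individual correlations and their SEMs across participants:
# cors = per_participant_correlations(trials)
# print(cors[["font_jol", "font_time", "time_jol", "font_jol_partial"]]
#       .agg(["mean", "sem"]))
```

If processing fluency mediates the font–JOL relationship, the mean partial correlation (font_jol_partial) should be noticeably smaller in magnitude than the mean raw font–JOL correlation; in the data described above, it was not.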

These data are inconsistent with the hypothesis that differential processing fluency is responsible for the impact of font on JOLs and hence suggest that some other factor is responsible. An alternative hypothesis is that some participants believed that words in the italicized-light font would be more difficult to remember, and it was this belief that was responsible for the relationship between font and JOLs. Of course, the relationships in the present case (Magreehan et al. 2016, Experiment 4) were rather small (and some were in the direction opposite to that predicted by the disfluency hypothesis), so one could argue that these data do not provide a fair – or at least sensitive – test of the possible contribution of disfluency to JOLs. Regardless, our main point is that after one establishes that a manipulation (e.g., kind of font) influences a person’s judgments, learning, or reasoning, further empirical work may be needed to reveal the source of the influence. In general, our recommendation is to develop plausible alternatives to the disfluency hypothesis and to conduct experiments that competitively evaluate them (as per Platt 1964).

Replication is vital to establish effects worth pursuing

We appreciate the excitement about processing (dis)fluency, so much so that we have explored its contribution to people’s judgments ourselves (e.g., Matvey et al. 2001; Mueller et al. 2014). In looking back at some highly cited papers on disfluency effects, we understand why they captivated our interest. But given the current replicability crisis across the sciences, tempering one’s excitement until a phenomenon has been replicated is not a bad idea. For this reason, the target articles provide a major contribution to the literature by reporting a variety of attempts to replicate the disfluency-relevant outcomes reported by Diemand-Yauman et al. (2011). And, as summarized in Table 1 of Kühl and Eitel (2016), not one of the target articles reported a significant effect (collapsed across moderating groups) of presenting perceptually degraded materials on performance.

One caveat here is that the attempted replications in the target articles were conceptual replications. And, as Pashler and Harris (2012) argue, such conceptual replications may not be as informative as direct replications:

If a conceptual replication attempt fails, what happens next? Rarely, it seems to us, would the investigators themselves believe they have learned much of anything. We conjecture that the typical response of an investigator in this (not uncommon) situation is to think something like “I should have tried an experiment closer to the original procedure—my mistake.” Whereas the investigator may conclude that the underlying effect is not as robust or generalizable as had been hoped, he or she is not likely to question the veracity of the original report. As with direct replication failures, the likelihood of being able to publish a conceptual replication failure in a journal is very low. But here, the failure will likely generate no gossip—there is nothing interesting enough to talk about here. (p. 532)

Their arguments are well taken and should be considered in the context of the target articles. First, from our perspective, we have learned a great deal from the conceptual replications in the target articles, because the failures to conceptually replicate indicate (at a minimum) that the breadth of any disfluency effect is limited. Second, Pashler and Harris (2012) are likely correct that many failed attempts to conceptually replicate would not be publishable, and not publishing is one reason why little gossip would be generated. Accordingly, we applaud Metacognition and Learning (and the guest editors, Alexander Eitel & Tim Kühl) for pursuing disfluency effects and publishing the target articles, despite their failures to conceptually replicate prior research. After all, the target articles not only report possible boundary conditions for disfluency effects, but they also provide a great deal of positive evidence and advances within the areas in which the conceptual replications were conducted.

Compared to conceptual replications, direct replications are arguably more valuable for establishing a new phenomenon (for attempts to replicate high-profile outcomes, see the Reproducibility Project at https://osf.io/ezcuj, which also includes a failure to replicate a possible fluency effect). To this point, Rummer et al. (2016) employ an informative approach by attempting to replicate the focal effect as closely as possible and to extend it to explore its underlying nature. Their evidence did not replicate outcomes from Diemand-Yauman et al. (2011), but notably, Rummer et al. (2016) point out that although their methods were very similar to the original, they were not identical, and perhaps minor differences in methods are responsible for the failures to replicate. Even attempts to directly replicate a robust phenomenon will not always produce statistically significant outcomes (e.g., Schimmack 2012). What is needed is a method to combine outcomes from multiple direct replications to evaluate the overall robustness of an effect, and fortunately, Braver et al. (2014) recently introduced the continuously cumulating meta-analysis (CCMA) to do just that. We hope that the use of CCMA catches on, if only because conducting a CCMA requires multiple direct replications of a target effect.
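To illustrate the cumulating logic behind a CCMA, here is a minimal sketch in Python. It pools effect sizes across successive replications with a fixed-effect, inverse-variance-weighted meta-analysis; the replication values below are invented placeholders, and the sketch is our illustration of the general idea rather than Braver et al.’s (2014) own procedure or software.

```python
# A minimal sketch of continuously cumulating meta-analysis: as each direct
# replication arrives, its effect size is pooled with all prior ones using
# fixed-effect, inverse-variance weighting.
import math

def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def cumulative_meta(studies):
    """Yield the pooled effect size and its standard error after each new study."""
    weights, weighted_effects = [], []
    for d, n1, n2 in studies:
        w = 1.0 / d_variance(d, n1, n2)
        weights.append(w)
        weighted_effects.append(w * d)
        pooled_d = sum(weighted_effects) / sum(weights)
        pooled_se = math.sqrt(1.0 / sum(weights))
        yield pooled_d, pooled_se

# Hypothetical replications: (Cohen's d, n in group 1, n in group 2)
replications = [(0.45, 30, 30), (0.05, 60, 60), (-0.10, 50, 50)]
for i, (d, se) in enumerate(cumulative_meta(replications), start=1):
    print(f"After study {i}: pooled d = {d:.2f}, 95% CI half-width = {1.96 * se:.2f}")
```

The appeal of this approach is that each additional direct replication narrows the confidence interval around the pooled effect, so the overall verdict about robustness becomes progressively less dependent on any single significant (or nonsignificant) result.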

Given that disfluency effects – if they are robust – will likely be moderated, replicating outcomes relevant to the moderated disfluency hypothesis is essential. For instance, Sidi et al. (2016, Experiment 2) reported a disfluency effect when the difficult-to-read material was presented on a computer screen but did not find the effect when the materials were presented on paper. This interaction is encouraging. We suspect this one (and others like it) will promote further research on disfluency effects, but replications of this interaction (and others) will be needed before our own excitement is entirely rejuvenated. Our hesitation here is heightened by the fact that the moderated disfluency hypothesis based on working-memory capacity received mixed support in the target articles: Outcomes from Lehmann et al. (2016) confirmed the hypothesis, whereas outcomes from Strukelj et al. (2016) did not. Given the number of differences in methods between these studies, we will not speculate about why the discrepancy occurred. Our main point is simply that before much fanfare is raised about the possibility of moderation, replication is needed.

Measure processing fluency

In the larger literature on fluency (for overviews, see Unkelbach and Greifeneder 2013), researchers often argue that materials or conditions that are perceptually degraded impact performance, because the materials or conditions impact processing fluency. What strikes us is that researchers sometimes do not empirically evaluate whether the materials or conditions under scrutiny produce differences in processing fluency. For instance, words presented in large 48-point font are presumably easier to process than words in a smaller 18-point font, and such processing differences were believed to influence people’s judgments (e.g., Rhodes and Castel 2008). This possibility is testable – and when we tested it recently by measuring processing fluency (using study times as well as response times during a lexical decision task), we did not find that differences in font size had any impact on processing fluency (Mueller et al. 2014). We were surprised by these outcomes because the fluency-based explanation for the font-size effect on judgments is reasonable. Nevertheless, although empirically evaluating our assumption revealed the limits of our intuition, doing so also led to new insights into how people judge their learning (for an overview, see Dunlosky et al. 2014).

Thus, whenever processing fluency is presumed to drive differences in reasoning, judgments, or learning, we recommend that researchers establish the impact of the focal variable on processing fluency. Two approaches have been used. The first approach is to have participants rate how difficult processing was for certain items or under certain conditions. As noted by Lehmann et al. (2016), researchers have postulated “that learners are aware of their own cognitive load and that subjective ratings are therefore useful to measure mental effort in general”. We suspect that learners may have some awareness of their mental effort and processing difficulties, but plenty of research has shown that subjective ratings are often not good indicators of mental processing because the ratings are inferential in nature (for detailed discussion, see Koriat 1997; Nisbett and Wilson 1977; Schwartz et al. 1997). For instance, Sidi et al. (2016) did report that many (but not all) participants rated prose printed in Arial 9-point italicized light grey as requiring effort to read. In this case, we suspect that this font was difficult to read, but an alternative possibility is that participants reported it as difficult because they inferred that it “should be” difficult to read. That is, participants could have explicitly used differences in font as a cue to infer their effort. The bottom line is that subjective, metacognitive ratings are influenced by many factors and hence will not be entirely valid measures of processing (dis)fluency.

The second approach is to collect objective measures of processing fluency, an approach that was prevalent in the target articles. Eitel and Kühl (2016), Magreehan et al. (2016), and Rummer et al. (2016) used self-paced study time as a measure of fluency. This measure is particularly relevant, because if perceptually degrading materials improves performance, it presumably does so in part by leading to longer or more thoughtful processing of materials presented in a degraded format. Overall, the outcomes were not encouraging: The focal manipulations did not consistently impact study times. Like subjective ratings, however, study time can be influenced by a variety of other factors (for a review, see Dunlosky and Ariel 2011) – study time is not a process-pure measure of processing fluency. Moreover, processing occurs at multiple levels, and different measures may be needed to reveal the impact of disfluency. Toward this end, the use of eyetracking by Strukelj et al. (2016) was innovative, because eyetracking can measure the rate and frequency of different reading behaviors that reflect different kinds of processing; thus, if the focal manipulation of degrading text disrupts only a subset of processes, these fine-grained measures would be more likely to reveal the effect. In this case, however, perceptually degrading texts (vs. non-degraded texts) had no significant impact on reading as measured with eyetracking.
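As a concrete illustration of this second approach, here is a minimal sketch in Python of how one might test whether a perceptual manipulation actually affects an objective fluency measure such as self-paced study time. The data frame trials, its column names and condition labels, and the paired t-test are illustrative assumptions on our part, not the analysis pipeline of any target article.

```python
# A minimal sketch: does a perceptual manipulation affect self-paced study time?
# Assumes a hypothetical data frame `trials` with columns: participant,
# format ('degraded' or 'intact'), and study_time (seconds per item).
import pandas as pd
from scipy import stats

def fluency_check(trials: pd.DataFrame):
    """Compare per-participant mean study times for degraded vs. intact items."""
    means = (trials
             .groupby(["participant", "format"])["study_time"]
             .mean()
             .unstack("format"))  # one row per participant; columns: degraded, intact
    t, p = stats.ttest_rel(means["degraded"], means["intact"])
    mean_diff = (means["degraded"] - means["intact"]).mean()
    return mean_diff, t, p

# Usage (hypothetical):
# diff, t, p = fluency_check(trials)
# print(f"Mean study-time difference (degraded - intact) = {diff:.2f} s, "
#       f"t = {t:.2f}, p = {p:.3f}")
```

Only if such a check shows that the degraded format reliably slows (or otherwise alters) processing does it make sense to attribute any downstream effects on judgments or performance to processing fluency.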

Avoid using theory-laden terms as labels for independent or dependent variables

A final recommendation is related to measuring processing fluency and may seem more like a quibble than a solid recommendation for future research; it is important nonetheless: Researchers should resist labeling manipulations as if disfluent processing were a foregone conclusion. For instance, presenting material in different formats should not be called a disfluency manipulation, and one level of the manipulation should not be labeled “disfluent” and the other “fluent.” This recommendation holds for any manipulation – that is, terms for independent variables and dependent measures should not presuppose a construct or theory. A reason for this recommendation is evident from several of the target articles, which include degraded materials that do not appear to have an objective impact on processing fluency – the materials were perceptually degraded, but were they really disrupting the fluency of processing? We understand that it may be more elegant to use the phrase “fluent text” than “8-point bold font text” (and “disfluent text” for “9-point italicized font”), or to use the term disfluent materials (instead of “perceptually degraded materials”).

The elegance of these labels (“disfluent” and “fluent”) comes with the disadvantage of luring readers into believing that the focal manipulation impacts processing fluency. The latter needs to be empirically established, and avoiding theory-laden labels is one way to remind ourselves that intuitions about disfluent processing should be confirmed by systematic empirical research. In the target articles, we were encouraged that everyone based conclusions on systematic empiricism (some with a focus on whether their perceptual manipulation influenced processing fluency), so our recommendation here pertains more to writing style.

Closing remarks

The allure of disfluency is unmistakable, especially for education, because merely presenting materials in a degraded format may improve students’ reasoning, learning, and memory. Unfortunately, the outcomes of the target articles in this Special Issue suggest that the positive impact of disfluency is limited and may even provoke one to question whether it has a genuine impact at all. We hope it does, and we hope our recommendations will be useful to researchers who attempt to reveal its influence – certainly, a wide-scale attempt to directly replicate some of the original effects would provide the most convincing evidence about whether perceptually degrading materials has a meaningful impact on reasoning and learning. Given the limited evidence for the effect of degrading materials on performance, we also want to emphasize that a related idea championed by Robert Bjork, referred to as desirable difficulties (e.g., Bjork 1994), is on much firmer ground – especially when used as a description of outcomes relevant to education. For instance, telling teachers that spacing study is a desirable difficulty emphasizes that a strategy that appears to impede performance during study (a difficulty) produces greater retention (and hence is desirable). Even so, we feel that this moniker (“desirable difficulty”) should be reserved for describing well-established phenomena, because the term is a description of outcomes and not an explanation for them – after all, processing difficulty is not responsible for the spacing effect, and spacing probably does not make people think “more deeply” about the to-be-learned materials (for detailed descriptions of process-oriented models of the spacing effect, see Benjamin and Tullis 2010; Delaney et al. 2010). Nevertheless, people’s perceptions that spacing is difficult may dissuade some teachers and students from using this superior study schedule, so educating them about these desirable difficulties can have a positive impact.

Many difficulties are not desirable, and some difficulties may be desirable only when people also experience an undesirable easiness (i.e., the benefits of the difficulty arise only in within-participant designs, as per Magreehan et al. 2016), which further limits the relevance of processing disfluency to education. As noted by Sidi et al. (2016), “the types of difficulties that are indeed desirable, and the appropriate conditions under which they enhance performance, are still unclear”. We agree, but we would also question the search for desirable difficulties as an endeavor in its own right, because one only knows which techniques are both difficult and desirable after the results of an experiment are known. The target articles go a long way toward establishing whether presenting degraded materials represents a desirable difficulty. And, at least until future research establishes when degraded materials will consistently boost performance, our final recommendation is to reserve the term desirable difficulty for other techniques that impose difficulties yet are undeniably desirable. If researchers continue exploring disfluency effects with experiments that competitively test hypotheses, include high-powered replications, and attempt to measure processing fluency, we suspect that the resulting evidence will more convincingly support any claims about the educational value of disrupting students’ perceptual processing while they are studying.