Introduction

Perceptual disfluency is an important focus of research in educational psychology (for a systematic review, see Reber and Greifeneder 2017). The term perceptual disfluency, considered to be a form of metacognition, refers to the learner’s subjective sense of difficulty while completing a cognitive task. It reflects the ease of perceptual-level processes associated primarily with material form (Reber et al. 2004). For example, if the font of a visual text is irregular, we are aware that it is difficult to process. If the quality of a spoken text is poor, we recognize the challenge in decoding it. A paradoxical finding in some studies is that students learn better when they perceive disfluency, that is, when educational materials are slightly harder, rather than easier, to read. This effect was first reported by Diemand-Yauman et al. (2011), who provided empirical evidence that learning with disfluent (hard-to-read) materials helped learners recall more details of the information than learning with fluent (easy-to-read) materials, not only in a controlled laboratory setting but also in actual classroom environments. These intriguing findings, which run counter to intuition, have attracted considerable attention, especially in educational psychology. By July 2017, the original article by Diemand-Yauman et al. (2011) had been cited more than 200 times (source: Google Scholar®), and a number of follow-up studies had aimed to replicate the compelling results in a text-based educational context (e.g., Eitel et al. 2014; French et al. 2013; Strukelj et al. 2016). Researchers’ enthusiasm was probably aroused by the potential for these results to be applied in education: if a simple intervention based on the disfluency effect proves to foster learning, it could be implemented with minimal investment.

Before disfluency-based educational interventions are widely introduced into actual classrooms, however, it is essential to evaluate the robustness of the disfluency effect. Since the work of Diemand-Yauman et al. (2011), dozens of studies using a variety of participants, learning materials, and experimental designs have been conducted to verify the effects of disfluency manipulation on recall (French et al. 2013; Guenther 2012; Lee 2013), transfer (Eitel and Kühl 2016; Eitel et al. 2014; Kühl et al. 2014a; Lehmann et al. 2016), judgments of learning (JOLs) (Pieger et al. 2016, 2017; Weissgerber and Reinhard 2017), and learning time (Pieger et al. 2016, 2017; Rummer et al. 2016; Seufert et al. 2017). However, the mixed findings in this body of research call into question the generality of the disfluency effect, leading to a call for an assessment of the overall effects of disfluency (Bjork and Yue 2016; Dunlosky and Mueller 2016). Therefore, the most important aim of the present meta-analysis was to evaluate the effects of perceptual disfluency, compared to fluency, in educational learning texts on different kinds of performance (i.e., recall, transfer, JOL, and learning time). In addition, many researchers have tentatively noted that disfluency might facilitate learning under some circumstances but not others (e.g., Eitel and Kühl 2016; Kühl et al. 2014a; Lehmann et al. 2016). That is, there might be undiscovered moderators influencing the effects of disfluency (Kühl et al. 2014b; Oppenheimer and Alter 2014). Thus, an exploratory aim of the study was to identify potential factors (related to participants, learning material, and experimental design) that might moderate the effect of disfluency on learning, specifically on the most frequently studied outcome, recall.

Theoretical Frameworks for Explaining Perceptual Disfluency

According to the notion of “desirable difficulties” (Bjork 1994, 2013), facing challenges in the encoding phase can help learners process information more deeply and retrieve it better later. For example, research on the generation effect (e.g., deWinstanley and Bjork 1988; Hirshman and Bjork 1988) found that asking learners to actively generate a word presented in an incomplete format (e.g., “aff_ct_v_”) resulted in better recall of that word than asking learners to passively read the word presented in its entirety (e.g., “affective”). This kind of difficulty (e.g., evoked by generation) is desirable because it activates more elaborative processing.

Increasing the difficulty of a task is likely to trigger deeper processing, not because of an increase in the objective difficulty, but because of an increase in the subjective sense of difficulty (Alter et al. 2007; Eitel et al. 2014). This “perceptual disfluency” can be achieved by presenting words or texts in a format that makes them harder to read. An explanation of the metacognitive mechanism of perceptual disfluency has recently been proposed by disfluency theory (Alter et al. 2007; Kühl and Eitel 2016). This theory is based on the considerations of William James (1890/1950), who proposed two distinct processing systems: one that leads to quick, effortless, associative, and intuitive processing (system 1) and another that works slowly, effortfully, analytically, and deliberately (system 2). The activation of systems 1 and 2 depends on the subjective sense of ease or difficulty of a cognitive task (Alter et al. 2007). If information processing is perceived as easy, system 1 is more likely to be activated. If, on the other hand, information processing is perceived as difficult, system 2 is more likely to be activated. Making learning materials harder to read (i.e., perceptual disfluency) can increase the perceived difficulty (Alter and Oppenheimer 2009), thus activating system 2 and stimulating deep rather than superficial processing. Deep processing in turn fosters learning performance (e.g., recall or transfer).

However, an opposite prediction follows from cognitive load theory (CLT; Sweller et al. 2011; Sweller et al. 1998). In the field of educational research, CLT is mainly used to explain the effects of various kinds of instructional design. When performing a cognitive task, learners consume cognitive resources in working memory, producing cognitive load. Resources in working memory, however, are limited (cf. Baddeley 1992); only a small number of elements can be consciously handled per unit time. Therefore, one important objective of instructional design is to ensure that the cognitive load remains within the learner’s working memory capacity. According to CLT, three kinds of cognitive load can arise in working memory when learning with instructional material: intrinsic cognitive load (ICL), extraneous cognitive load (ECL), and germane cognitive load (GCL) (Sweller et al. 2011; Sweller et al. 1998). ICL is not directly affected by instructional design but is related to element interactivity (i.e., inherent complexity) in the learning materials and to learners’ prior knowledge. ECL is affected by the quality of instructional design. When material is poorly presented (e.g., in a way that requires learners to split their attention), substantial ECL is imposed by coping with the poor design rather than by the learning task per se, possibly resulting in poor learning performance. Reducing ECL is beneficial to learning because it allows adequate cognitive resources to be allocated to learning processes. GCL is a desirable working memory load that relates directly to the learning task per se and contributes to learning performance. For instance, this load can be imposed by prompting learners to generate self-explanations when learning from worked examples (Paas and Van Gog 2006; Atkinson et al. 2003).

Although not specifically addressed by CLT, the perceived difficulty due to text disfluency might be assumed to be detrimental to learning because of an increase in ECL (Eitel et al. 2014). When learners in different groups are presented with material that is otherwise identical (i.e., without a change in ICL), making instructional texts perceptually harder to read is thought to create additional cognitive demands for coping with the low legibility of the information. This load is neither related to the learning task per se nor helpful for learners in actively generating information (i.e., the generation effect), leading to an increase in ECL with GCL unchanged (Eitel et al. 2014; Kühl et al. 2014a). From this perspective, learning performance (e.g., recall, transfer) should be hindered by perceptual disfluency.

In summary, these two theories make opposite predictions about the effect of perceptual disfluency on learning. Disfluency theory predicts that learners who receive disfluent materials will show better learning outcomes than learners who receive fluent materials, whereas CLT predicts that learners who receive fluent materials will show better learning outcomes than those who receive disfluent materials.

Effects of Perceptual Disfluency on Recall and Transfer

According to the empirical evidence, the effects of disfluency on recall and transfer are not consistent, and the generality of the disfluency effect with respect to text-based learning is an open question. These inconsistencies were the impetus for our meta-analytic investigation. The first important question we addressed was whether or not students who are learning with a disfluent text recall and transfer better than students learning with a fluent text. The answer to this question is key to testing the opposite predictions made by disfluency theory and CLT.

Dozens of studies have been conducted to check whether disfluency facilitates or impedes recall and transfer of learning in a text-based educational context. However, empirical evidence regarding this question is quite mixed (for overviews, see Kühl and Eitel 2016; Weissgerber and Reinhard 2017; Xie et al. 2016).

Both disfluency theory and CLT are supported only to a small extent. Only a handful of studies have found positive effects of disfluency on text-based learning outcomes, such as shallow-level recall (Diemand-Yauman et al. 2011; French et al. 2013; Lee 2013; Weltman and Eakin 2014) and deep-level transfer (Eitel et al. 2014, Experiment 1). For example, French et al. (2013) asked students to read a text describing facts about a fictional star at the beginning of class. About 35 min later, students who had read the text in a disfluent font (i.e., Monotype Corsiva) recalled more facts than those who had read the information in an easier-to-read font (i.e., Arial). Similarly, only limited findings show that disfluency hinders educational performance in the form of recall and transfer (Carpenter et al. 2016, Experiment 3; Kühl et al. 2014a; Miele and Molden 2010, Experiment 3; Pieger et al. 2017). For example, Miele and Molden (2010) asked participants to read a brief expository text describing the ways in which television news affects its viewers and found that participants who had read the disfluent text (italicized 12-point Juice ITC font) showed worse recall of explicitly stated information than those who had read the fluent text (12-point black Times New Roman font).

More often, however, null (direct) effects of disfluency on educational performance are found (e.g., Ball et al. 2014; Carpenter et al. 2016, Experiments 1 and 2; Carpenter et al. 2013; Eitel and Kühl 2016; Eitel et al. 2014, Experiments 2, 3, and 4; Faber et al. 2017; Guenther 2012; Pieger et al. 2016; Sanchez and Khan 2016; Strukelj et al. 2016). Eitel et al. (2014) conducted a Bayesian analysis of conditions with disfluent versus fluent text across their four experiments and found no overall effect of perceptual disfluency on recall and transfer. Even researchers who used the same material as Experiment 1 of the original study (Diemand-Yauman et al. 2011) did not detect differences in learning outcomes between disfluent and fluent groups (Haysom 2012; Rummer et al. 2016; Whitehouse 2011), providing support for neither disfluency theory nor CLT.

Effects of Perceptual Disfluency on Judgments of Learning and Learning Time

Perceptual disfluency is largely assumed to reduce judgments of learning and to increase learning time. The second question we addressed was whether this pattern is apparent in text-based educational contexts. Experiments often manipulate perceptual disfluency in order to assess its effect on learning performance, but this type of manipulation is also used to investigate the effect of perceptual disfluency on metacognitive monitoring, usually inspected via JOLs. JOLs are learners’ predictions about the likelihood of recalling recently studied knowledge in a subsequent recall test. For instance, participants might be asked to explicitly judge what percentage of questions about the studied text passages they would answer correctly on a continuous scale from 0 to 100 (Pieger et al. 2016, 2017).

Learners use a variety of cues to monitor a cognitive task (Koriat 1997). In dual-system processing, perceptual disfluency is an effective metacognitive cue that affects processing and judgments (Alter et al. 2007; Pieger et al. 2016). If information is processed fluently, intuitive processes in system 1 will be activated to guide judgment. However, if information is processed disfluently, this experience will function as a cue that the cognitive task is difficult and that one’s intuitive judgment is likely to be wrong. In this case, more analytic and deliberate processes in system 2 will be activated and judgments will be made with more caution.

A large number of word-learning studies indicate that when asked to make JOLs, students are likely to predict lower performance after learning disfluent items than after learning fluent items (Besken and Mulligan 2013; Magreehan et al. 2016, Experiments 4 and 5; Mueller et al. 2014; Yue et al. 2013, Experiments 1a, 2a, 2b, and 3). Some studies with more complex instructional materials show the same results (Ball et al. 2014; Carpenter et al. 2013; Pieger et al. 2017; Weltman and Eakin 2014). Carpenter et al. (2013) asked learners to watch a video explaining a scientific concept and then to make a JOL about how they would perform 10 min later. They found that learners in the disfluent speaker group predicted that they would recall less information than those in the fluent speaker group, while actual recall performance did not differ between the two groups. Analogously, immediately after reading text passages about social psychology, participants in Pieger et al.’s (2017) study predicted that they would perform worse on the disfluent text than on the fluent text. However, there is also limited evidence that students do not make different predictions, based on the fluency-disfluency distinction, about the amount of information they would accurately recall (Carpenter et al. 2016).

One impact of metacognitive monitoring relates to metacognitive control during learning, for example, the allocation of learning time. Students’ metacognitive monitoring appears to influence decisions about how to regulate their learning time (Metcalfe and Finn 2008). Under normal circumstances, students spend more time learning hard-to-read items than easy-to-read ones (for reviews, see Dunlosky and Ariel 2011; Son and Metcalfe 2000). In this sense, the time spent learning disfluent instructional materials might exceed the time spent learning fluent materials if learning is time-unlimited or controlled by the learners themselves. This has been confirmed in several disfluency-related empirical studies (Ball et al. 2014; Eitel and Kühl 2016; Miele and Molden 2010, Experiment 3; Pieger et al. 2016, 2017), which showed that learning with a disfluent text required significantly more time than learning with a fluent text. For instance, Seufert et al. (2017, Experiment 2) set no time limits for participants and found that the time learners needed to read a scientific text increased with increasing disfluency levels. It should be noted, however, that a couple of studies using educationally relevant materials did not find an influence of disfluency on learning time (Kühl et al. 2014a; for an eye-tracking study, see Strukelj et al. 2016).

Previous and Present Meta-analyses

There has been only one published meta-analysis that has addressed questions related to those in the current study. To test whether disfluent fonts contributed to math problem-solving, Meyer et al. (2015) conducted a meta-analysis of 17 studies that attempted to replicate Alter et al.’s (2007) findings on reasoning. Overall, there was no effect of disfluency on solution rates (d = − 0.01). Therefore, there was little evidence that disfluent fonts activated analytic reasoning in system 2.

The current meta-analysis differs from that of Meyer et al. (2015) in numerous respects. First, Meyer et al. (2015) included only studies that manipulated the disfluency of written questions on the Cognitive Reflection Test (CRT, Frederick 2005). The CRT consists of three misleading math problems (e.g., “A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? _____ cents”). In disfluency-related studies, the CRT items are usually manipulated in different fonts and directly presented to the participants (e.g., Alter et al. 2007; Thompson et al. 2013). That is, there is no learning phase before the test in these studies. However, in the field of educational psychology, learners are commonly required to learn a (visual or spoken) text first before being tested. It is the materials presented in the learning phase, rather than the test items presented in the test phase, that are manipulated to be perceptually disfluent or fluent (e.g., Eitel et al. 2014; Seufert et al. 2017). Second, what Meyer et al. (2015) were concerned with was the effect of disfluent fonts on reasoning, rather than on memory or learning outcomes (e.g., recall). Therefore, it is still an open question whether perceptual disfluency affects (text-based) memory or learning.

The current meta-analysis, unlike Meyer et al.’s (2015) work, was designed to test the influence of perceptual disfluency in a text-based educational context in which the task is analogous to a classroom learning task. Note that we focused on learning with (visual or spoken) texts rather than with isolated words. Previous research often focused on word lists or word pairs when investigating disfluency and metacognition (for reviews, see Alter and Oppenheimer 2009; Dunlosky and Metcalfe 2009). However, text-based learning is considered to be more complex and closer to classroom learning than word-based learning (Pieger et al. 2016, 2017). When learning with words, learners only have to decipher the words at a surface level. When learning with texts, they not only have to process the words but also information within and between sentences at a deep level, in order to integrate this information into memory (De Bruin and Van Gog 2012) and ultimately to comprehend the text (Pieger et al. 2017). Before introducing disfluency-based manipulations to actual classrooms, it is worthwhile to understand the effect of disfluency on learning with texts.

Theory-Based Analyses

Our general plan for the meta-analysis was first to test the competing predictions about the effect of perceptual disfluency on learning performance made by disfluency theory and CLT. According to disfluency theory, better recall and transfer would be expected in conditions with disfluent text than in conditions with fluent text (Hypothesis 1a). According to CLT, better recall and transfer would be expected in conditions with fluent text than with disfluent text (Hypothesis 1b). We then tested whether perceptual disfluency works as a cue for learners, both when they predict how well they will recall recently studied knowledge and when they allocate their learning time. According to the notion of dual-system processing, we expected that learners in the disfluent group would make lower JOLs (Hypothesis 2) and spend more time learning with texts (Hypothesis 3) than learners in the fluent group.

Exploratory Analyses

The mixed findings on the effects of disfluency have triggered a controversial discussion concerned with undetected moderators of these effects (Kühl et al. 2014b; Oppenheimer and Alter 2014). However, in most cases, there is no theoretical rationale or empirical evidence to justify hypotheses about moderators. On an exploratory basis, we examined three sets of study characteristics (i.e., participant, learning material, and experimental design characteristics) that might moderate the impact of disfluency; because the effect of disfluency has been most studied in terms of its effects on recall, we focused on that specific outcome. We examined the following moderators: (a) prior knowledge level (none, low, medium), (b) learning material domain (science, technology, engineering, mathematics, social science), (c) pacing of presentation (self-paced vs. system-paced), (d) type of presentation (static vs. dynamic), (e) modality of presentation (only a visual format vs. both visual and auditory formats), (f) medium of presentation (screen vs. paper), (g) inclusion of images (yes vs. no), (h) fluency manipulation type (font-related vs. audio-related), (i) design of study (between-subjects vs. within-subjects), (j) learning duration (equal to or shorter than 10 min vs. longer than 10 min), (k) time interval between learning and test (equal to or shorter than 10 min vs. longer than 10 min), (l) use of distraction task (yes vs. no), and (m) expectation of testing (yes vs. no). These potential moderators should prove of interest to disfluency researchers (Kühl et al. 2014b; Oppenheimer and Alter 2014).

Method

Literature Search

To identify relevant studies on the effects of perceptual disfluency on learning educational materials, a systematic literature search was conducted in the electronic databases PsycINFO, Educational Resources Information Center (ERIC), Science Direct, PubMed, and ProQuest. The search keywords were “perceptual disfluency,” “perceptual fluency,” and “font” in different combinations with “judgment of learning,” “learning time,” “recall,” “transfer,” “learning outcome,” and “educational performance.” Search engines such as Google Scholar and the reference lists of the identified articles were also used. The literature search covered studies published up to August 2017. Finally, we emailed researchers requesting details of their unpublished studies.

Study Selection

Studies were selected from journal articles, dissertations, conference presentations, and unpublished research. Studies were included in the analysis if they met all of the following criteria: (a) they were based on an experimental design; (b) text-based rather than word-based learning materials were used; (c) a group with disfluent material was compared with a group with fluent material; (d) disfluency was manipulated by changing font-related attributes of visual texts (i.e., font type, font size, font grayscale, or font bolding) or audio-related attributes of spoken texts (i.e., audio quality, stumbling during speech, or a foreign accent), because these intrinsic perceptual features are most often manipulated (Xie et al. 2016); (e) there was a specific learning task; (f) recall performance was measured after the learning phase; and (g) sufficient quantitative data (e.g., means, standard deviations, and ns; t test or F test values) were reported to allow calculation of effect sizes.

Studies were excluded if they did not meet the inclusion criteria mentioned above. For example, a study was not included when it used word pairs as materials (e.g., Magreehan et al. 2016) or when learners were required to complete a reasoning task, rather than a memory or learning task (e.g., Sidi et al. 2016).

Coding of Studies

Three types of information were collected from each study. First, basic information was recorded, including authors, year of publication, sample size, and experiments or conditions from which the effect sizes were computed. Second, quantitative information was collected to calculate effect sizes. All studies had information needed to calculate effect sizes associated with disfluency effects and recall; if the study also reported information on transfer, JOL, or learning time, that information was also collected. Third, characteristics related to participants, learning materials, and experimental design were coded in order to test for moderation effects. The following section provides information about how these characteristics were coded for moderation analyses.

Participant characteristic:

(1) Prior knowledge level. From an instructional design perspective, domain-specific prior knowledge level (Kalyuga 2007; Kalyuga et al. 2003) might affect the influence of the perceptual disfluency manipulation. Participants with no domain-specific prior knowledge were classified as “none”; participants with some prior knowledge were categorized as “low” or “medium” according to the classification made by the authors.

Learning material characteristics:

(2) Learning material domain. Research on disfluency has used instructional materials from various domains, such as science (e.g., Kühl et al. 2014a), technology (e.g., Sanchez and Khan 2016), engineering (e.g., Strukelj et al. 2016), mathematics (e.g., Weltman and Eakin 2014), and social science (e.g., Pieger et al. 2016). Coding studies into the categories of science, mathematics, and social science was straightforward. Studies using technology and engineering materials were combined into a single category (i.e., technology/engineering) to maximize the number of studies in this category.

(3) Pacing of presentation. If participants were allowed to interactively control the presentation by themselves (e.g., start, play, pause, replay, and stop), the pacing of the presentation was categorized as self-paced. If learners had no option to interact with the presentation, it was categorized as system-paced.

(4) Type of presentation. Educationally relevant materials can be designed to be either static or dynamic (Tversky et al. 2002). Static learning materials, in which the elements remain still, are very common in educational environments (e.g., electronic text or printed text with static pictures). With the development of newer technologies, dynamic visual materials (e.g., animations, videos) are increasingly being used to provide richer and more vivid instructional messages that change over time. The type of information presentation (static or dynamic) was extracted.

(5) Modality of presentation. Humans process learning materials through two modalities: the visual modality (e.g., for pictures, printed text) and the auditory modality (e.g., for narration) (Mayer 2009). A single modality was coded when the learning material was presented only in a visual form. If the material was presented in both formats (i.e., visual and auditory), the modality of presentation was classified as dual.

(6) Medium of presentation. Different kinds of media, such as screen and paper, can be used to present learning materials. The presentation medium of the learning material (screen or paper) was extracted.

(7) Inclusion of images. Many studies (e.g., Faber et al. 2017; French et al. 2013; Lehmann et al. 2016) used pure text to present instructional materials as in the original work (Diemand-Yauman et al. 2011, Study 1). Some studies, meanwhile, added accompanying images (e.g., static picture, animation, or video) to the materials. Studies were coded according to whether there were learning-related images or not.

Experimental design characteristics:

(8) Fluency manipulation type. Because the perceptual fluency of materials was usually manipulated via features of font (e.g., font type, font size, grayscale) or quality of spoken text, the fluency manipulation type was classified as font-related or audio-related.

(9) Design of study. Most studies examining disfluency effects on learning with texts have used a between-subjects design, such that some participants are asked to learn a disfluent text, whereas the rest are asked to learn a fluent one. However, a within-subjects design has also been used in some studies (e.g., Ball et al. 2014; Katzir et al. 2013; Lee 2013). Thus, study design was coded as a categorical variable with between-subjects and within-subjects levels.

(10) Learning duration. Learning duration was defined as the amount of time spent learning the material. Learning duration can be set as either long or short, and it should be carefully considered together with the effects of instructional design (Kalyuga 2012). We treated learning duration as a categorical variable, rather than a continuous one, because some studies did not explicitly report the actual value. Thus, to guarantee a sufficient number of effect sizes for moderator analyses, studies were coded and analyzed by dichotomizing the duration. Following earlier research (Reinwein 2012), learning duration was categorized as either “equal to or shorter than 10 minutes (10−)” or “longer than 10 minutes (10+).”

(11) Time interval between learning and test. This variable was also treated as categorical, for the same reason as learning duration. Again, 10 min was used as the cutoff point (10− or 10+).

(12) Use of distraction task. After the learning phase, learners in a number of studies (e.g., Diemand-Yauman et al. 2011, Study 1; Eitel and Kühl 2016; Rummer et al. 2016) were asked to complete distraction or filler tasks that were completely unrelated to the contents of the learning materials. Other studies used no distraction tasks (e.g., Eitel et al. 2014; Kühl et al. 2014a; Sanchez and Khan 2016). Studies were coded according to whether there were distraction tasks or not (yes or no).

(13) Expectation of testing. Studies were coded according to whether or not learners knew they would be tested (yes or no).

Calculation of Effect Sizes and Analysis

Data were analyzed using the Comprehensive Meta-Analysis (CMA) 2.0 software (https://www.meta-analysis.com/). Effect sizes were weighted by the reciprocal of their variances, so that effect sizes from studies with larger sample sizes carried more weight in the analysis. The random-effects model was adopted for all analyses because the studies included in the meta-analysis differed on a number of variables (e.g., characteristics of participants, research design, and procedures), which is consistent with the random-effects assumption that the true effect sizes are not exactly the same in all studies (Borenstein et al. 2009).
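To make the weighting scheme concrete, the sketch below pools study-level effect sizes under a random-effects model. It assumes the method-of-moments (DerSimonian-Laird) estimator of the between-study variance, a common default in meta-analysis software; the function name and interface are ours for illustration and do not reproduce CMA's internals.

```python
import numpy as np

def random_effects_pool(d, v):
    """Pool study-level effect sizes under a random-effects model.

    d : array-like of study-level effect sizes (Cohen's d)
    v : array-like of their within-study sampling variances
    Returns the pooled d, its standard error, tau^2, and the Q statistic.
    """
    d, v = np.asarray(d, float), np.asarray(v, float)
    w = 1.0 / v                                # inverse-variance (fixed-effect) weights
    d_fixed = np.sum(w * d) / np.sum(w)        # fixed-effect pooled estimate
    q = np.sum(w * (d - d_fixed) ** 2)         # homogeneity statistic Q
    df = len(d) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)              # DerSimonian-Laird between-study variance
    w_re = 1.0 / (v + tau2)                    # random-effects weights
    d_re = np.sum(w_re * d) / np.sum(w_re)     # random-effects pooled estimate
    se_re = np.sqrt(1.0 / np.sum(w_re))
    return d_re, se_re, tau2, q
```

The pooled estimate plus or minus 1.96 times its standard error gives a 95% CI of the kind reported in the forest plots.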

Because the data included in the present study were continuous and had no consistent unit, we chose Cohen’s d as the standardized estimate of effect size (Cohen 1988). Specifically, Cohen’s d was calculated from the mean difference in recall, transfer, JOL, or learning time between a disfluent group and a fluent group. It should be pointed out that most studies tested the effect of disfluency on multiple dependent variables. A basic methodological assumption of meta-analysis is the independence of effects, and including multiple dependent variables from each article would violate this assumption (Lipsey and Wilson 2001). Thus, to abide by this assumption and avoid bias due to dependencies among effect sizes introduced by multiple variates per study, we conducted separate analyses for the different dependent variables (i.e., recall test, transfer test, JOL, and learning time). When a study reported multiple experiments or multiple conditions that were not related to the moderators, the data were merged to compute one pooled study-level effect size, so that studies contributing many effect sizes would not receive disproportionate weight (Borenstein et al. 2009). The resulting study-level effect sizes were then averaged to obtain an overall average effect size point estimate quantifying the central tendency among the effect sizes. A forest plot with 95% confidence intervals (95% CIs) was created to detect patterns in the magnitude of the individual effect sizes. The direction of Cohen’s d was negative if the recall, transfer, JOL, or learning time of the disfluent group was lower or shorter than that of the fluent group. The magnitude of an effect size was interpreted using Cohen’s (1992) standards of small (d = ± 0.20), moderate (d = ± 0.50), and large (d = ± 0.80).
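As an illustration, a minimal sketch of how d and its sampling variance can be computed from the group statistics (means, SDs, ns) that the included studies report; the standard independent-groups formulas are assumed (cf. Borenstein et al. 2009), and all variable names and the example numbers are hypothetical.

```python
import numpy as np

def cohens_d(m_dis, sd_dis, n_dis, m_flu, sd_flu, n_flu):
    """Cohen's d (disfluent minus fluent) and its sampling variance.

    A negative d means the disfluent group's score (or time) was lower
    (or shorter) than the fluent group's, matching the sign convention above.
    """
    # Pooled standard deviation of the two independent groups
    sp = np.sqrt(((n_dis - 1) * sd_dis ** 2 + (n_flu - 1) * sd_flu ** 2)
                 / (n_dis + n_flu - 2))
    d = (m_dis - m_flu) / sp
    # Large-sample variance of d for independent groups
    var_d = (n_dis + n_flu) / (n_dis * n_flu) + d ** 2 / (2 * (n_dis + n_flu))
    return d, var_d

# Hypothetical example: the disfluent group recalls slightly less than the fluent group
d, var_d = cohens_d(m_dis=6.1, sd_dis=2.0, n_dis=40, m_flu=6.5, sd_flu=2.1, n_flu=40)
```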

The homogeneity statistic Q, along with its p value, was used for two purposes. First, Q was used to test whether there was significant variance within the set of effect sizes for each of the dependent measures. A related statistic, tau2, estimates the between-study variance component and can be tested against 0. Another related statistic, I2, estimates the percentage of total variance that is due to true between-study heterogeneity rather than random error. According to Higgins et al. (2003), I2 values of around 25, 50, and 75% are generally interpreted as indicating low, medium, and high heterogeneity, respectively. A high degree of heterogeneity indicates that the random-effects model used in the meta-analysis is reasonable and calls for tests of moderation. Second, Q was used in tests of moderation to determine whether two sets of effect sizes differed significantly (e.g., effect sizes from screen presentation vs. paper presentation). In this case, a statistically significant Q test indicates that the difference between the subgroup mean effect sizes is larger than would be expected from sampling error alone.
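Given the Q statistic from the pooling sketch above, the p value and I2 follow directly. A minimal helper, with our own naming, assuming that Q follows a chi-square distribution with k − 1 degrees of freedom under homogeneity:

```python
from scipy import stats

def interpret_heterogeneity(q, k):
    """p value of the Q test and the I^2 index for k effect sizes.

    Under homogeneity, Q follows a chi-square distribution with k - 1 df.
    I^2 estimates the share of total variance due to between-study
    heterogeneity; ~25/50/75% read as low/medium/high (Higgins et al. 2003).
    """
    df = k - 1
    p = stats.chi2.sf(q, df)
    i2 = max(0.0, (q - df) / q) * 100.0
    return p, i2
```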

Studies have empirically confirmed the existence of publication bias (Franco et al. 2014; Rosenthal 1979). Publication bias arises in meta-analyses when there are systematic differences between the studies that ought to be included and those actually available for inclusion (Borenstein et al. 2009). This potential problem may lead to an overestimation of the mean effect size. In the present work, three strategies were adopted to examine the potential existence of publication bias. First, we visually examined the funnel plot, a simple scatter plot with the standard error (a measure of study size) on the vertical axis and the effect estimates on the horizontal axis. In the absence of bias, the funnel plot is symmetrically distributed around the weighted mean effect; an asymmetric funnel plot suggests publication bias. Second, we performed Egger’s linear regression test (Egger et al. 1997) to further assess funnel plot asymmetry. This test fits a regression equation with the standard normal deviate of each study as the dependent variable and the precision of each study’s estimate as the independent variable. The intercept of the regression equation provides a measure of publication bias: the smaller its deviation from zero, the less pronounced the bias. Additionally, we conducted a rank correlation test (Begg and Mazumdar 1994) to evaluate the degree of asymmetry in the funnel plot. The rank correlation test quantifies the association between the effect sizes and their sampling variances: the weaker the correlation, the more independent the effect sizes are of study size, and the less likely publication bias is.
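The two tests can be sketched as follows, assuming the standard formulations (Egger et al. 1997; Begg and Mazumdar 1994); variable names are ours, and published implementations may differ in details such as weighting.

```python
import numpy as np
from scipy import stats

def egger_test(d, v):
    """Egger's regression: each study's standard normal deviate (d / se)
    regressed on its precision (1 / se); the intercept indexes asymmetry."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    se = np.sqrt(v)
    x, y = 1.0 / se, d / se
    n = len(d)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    s2 = np.sum(resid ** 2) / (n - 2)          # residual variance
    se_int = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / np.sum((x - x.mean()) ** 2)))
    p = 2 * stats.t.sf(abs(intercept / se_int), n - 2)  # two-sided test of the intercept
    return intercept, p

def begg_test(d, v):
    """Begg and Mazumdar's rank correlation: Kendall's tau between
    standardized effect sizes and their sampling variances."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    w = 1.0 / v
    d_pooled = np.sum(w * d) / np.sum(w)
    v_star = v - 1.0 / np.sum(w)               # variance of (d_i - pooled d)
    tau, p = stats.kendalltau((d - d_pooled) / np.sqrt(v_star), v)
    return tau, p
```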

Results

Descriptive Analysis

A total of 25 empirical articles met the inclusion criteria and were included in the analysis. An overview of the 25 articles, with basic information and coded moderators, is presented in Table 1. Twenty-one studies were obtained from journals, 2 from dissertations, 1 from an academic conference, and 1 from unpublished work. Across the 25 studies, 39 independent effect sizes for recall were computed, involving 3135 participants. We were also able to compute 8 study-level effect sizes for transfer tests (939 participants), 8 for JOLs (901 participants), and 9 for learning time (689 participants).

Table 1 A list of studies included in the meta-analysis

The distribution of the derived effect sizes related to recall performance is presented in Fig. 1. The graph shows a roughly skewed distribution, with most of the effect sizes (74.4%) clustering between − 0.60 and 0.20. Twenty-two of the 39 effect sizes (56.4%) were negative. Figure 2 presents forest plots with the point estimate of each effect size and its 95% confidence interval. No study-level effect size fell more than three SDs from the mean of all effect sizes; thus, no outliers representing extreme values needed to be deleted.

Fig. 1 Distribution of 39 effect sizes related to recall performance

Fig. 2 Forest plots of effect sizes for a recall, b transfer, c JOL, d learning time. Note: the order of the effect sizes for recall is the same as in Table 1

Overall Analysis

Table 2 presents the results regarding the effects of disfluency on recall, transfer, JOL, and learning time. For tests of recall, the meta-analysis revealed that the overall pooled effect size was not significantly different from 0 and was small in magnitude (d = − 0.01, p > 0.05). Similarly, the overall effect size for transfer performance was small and not statistically significant (d = 0.03, p > 0.05). With regard to JOL scores, the overall effect size was significant, negative, and small-to-medium in magnitude (d = − 0.43, p < 0.001), suggesting that learners in the disfluent group predicted lower performance than those in the fluent group. The overall analysis of learning time was limited to studies that used self-pacing (i.e., studies in which participants were allowed to work for as long as they wished, in contrast to studies that put limits on learning time). In this analysis, the overall effect size was statistically significant, positive, and medium in magnitude (d = 0.52, p < 0.001), indicating that (in a self-paced environment) learners spent more time learning disfluent materials than fluent materials.

Table 2 Main effects of disfluency on recall, transfer, JOL, and learning time, as well as results of the homogeneity test

As shown in Table 2, the homogeneity test showed that effect sizes varied significantly across studies for each of the four dependent variables (ps < 0.05), with medium-to-high heterogeneity due to variance across studies (all I2 > 50). These results warranted tests of moderation to identify sources of this heterogeneity. Only a small number of effect sizes could be computed for the dependent measures of transfer, JOL, and learning time (fewer than 10 each), possibly affecting the reliability of subgroup (moderation) analyses on these dependent variables. Therefore, the moderator analyses were conducted only on recall test scores (39 effect sizes).

Moderator Analyses

The results are presented in Table 3. With respect to the participant characteristic, we did not find a significant moderating effect of learners’ prior knowledge; the difference between the disfluent group and the fluent group was not significant at any level of prior knowledge, and the three pooled effect sizes did not differ significantly from each other (ps > 0.05).

Table 3 Moderator analysis on recall performance for all studies

Regarding learning material characteristics as well as experimental design characteristics, no variable was found to be a moderator of the disfluency effect on recall performance (all QB < 3.00, ps > 0.05). Furthermore, no subgroup showed a study-level effect size significantly different from 0 (see Table 3).

Further Moderator Analysis for Studies with Adults

The first moderation analyses showed no significant moderator of the disfluency effect on recall test scores among the participant, learning material, and experimental design characteristics. However, most of the effect sizes (87.2%) came from adult samples. We therefore conducted a further moderator analysis restricted to studies with adults.

The results are presented in Table 4. Again, no individual variable was a significant moderator of the disfluency effect. However, certain trends are of interest. These trends are apparent when the effect size of disfluency is significant at one level of the moderator but not at the other level(s), or when the effect size is significant at both levels of the moderator. These trends are presented in italics in Table 4. First, the trends suggest that certain characteristics of the learning materials may affect the disfluency effect. Specifically, learners in the disfluent group recalled less than learners in the fluent group when the presentation was system-paced. Second, the trends suggest that certain characteristics of the experimental design might influence the effect of disfluency. Learners in the disfluent group showed worse recall than learners in the fluent group when a between-subjects design was used, when the time interval between learning and test was ≤ 10 min, and when no distraction tasks were presented.

Table 4 Moderator analysis on recall performance for studies with adults

In summary, again no factor was found to moderate the disfluency effect on recall performance in studies with adults. It should be pointed out that although some potential moderators (i.e., pacing of presentation, design of study, time interval between learning and test, and distraction tasks) did not produce significant effects, the pattern of results suggests that these factors should not be ignored in future research on the disfluency effect on adults’ recall.

Publication Bias Analysis

We first examined the funnel plot (see Fig. 3) for signs of publication bias. The scatterplot was nearly symmetrical, suggesting little evidence of bias. Given the subjectivity involved in interpreting a funnel plot, we used Egger’s linear regression test (Egger et al. 1997) and a rank correlation test (Begg and Mazumdar 1994) to further assess funnel plot asymmetry. Egger’s linear regression test showed that publication bias was an unlikely influence on the findings of the present meta-analysis (intercept = − 0.83, p > 0.05). This result was confirmed by the rank correlation test (tau = − 0.06, p > 0.05).

Fig. 3 Funnel plot of standard error by Cohen’s d effect sizes

Discussion

The present meta-analysis of 25 empirical studies, involving 3135 participants, examined the effects of experimentally manipulated perceptual disfluency on learners’ recall, transfer, JOL, and learning time in a text-based educational context. In addition, this meta-analysis tentatively tested three sets of study characteristics (i.e., participant, learning material, and experimental design characteristics) to check whether they might moderate the impact of disfluency on the most frequently studied outcome, namely recall.

Null Effects of Perceptual Disfluency on Text-Based Learning Performance

The most important theory-based test in the present study addressed whether perceptual disfluency facilitates or impedes recall and transfer performance when learning with texts. According to disfluency theory (Alter et al. 2007), a disfluent text should introduce desirable difficulty and trigger deeper processing of the instructional material in system 2, leading to better recall and transfer. By contrast, CLT (Sweller et al. 2011; Sweller et al. 1998) predicts that disfluent text would increase ECL, leading to worse recall and transfer.

However, contrary to Hypotheses 1a and 1b, the results showed that students who learned with a disfluent text performed comparably to students who learned with a fluent text on both recall (d = − 0.01) and transfer (d = 0.03). Thus, neither disfluency theory nor CLT was supported. These results are consistent with other empirical studies using texts as learning materials that have failed to find significant differences between disfluent and fluent groups on recall and transfer tests (for overviews, see Kühl and Eitel 2016; Weissgerber and Reinhard 2017; Xie et al. 2016).

Given the null effects of perceptual disfluency on recall and transfer in the present meta-analysis, one might argue that it is text comprehension that counts. Compared with word learning, learning with texts requires more cognitive resources to select and build connections between relevant information for text comprehension (Mayer 1984). Thus, even if disfluency is able to activate analytic processing as predicted by disfluency theory, its influence might not be strong enough to invoke the cognitive resources needed for deep text learning to occur. From the perspective of CLT, one might cautiously argue that perceptual disfluency increases ECL during text-based learning, but the ECL does not necessarily result in overload of total working memory capacity. Learners, for example, have the chance to compensate for the negative effect of ECL through monitoring or control of the learning process (e.g., through increasing learning time). Thus, even if disfluency is able to increase ECL as predicted by CLT, the total cognitive load may be still well within the limits of working memory (Sweller et al. 1998).

One important consideration is the boundary line of disfluency (Diemand-Yauman et al. 2011; Seufert et al. 2017). On a descriptive level, Seufert et al. (2017) found an inverted U-shaped pattern when investigating the relationship between increasing disfluency levels and learning outcomes. Their results pointed to an optimal level of perceptual disfluency, that is, the level of perceptual disfluency associated with the best learning performance on the study task. Therefore, it is vital to determine the exact point at which a text begins to be disfluent or fluent, as well as the exact point at which disfluency begins to (significantly) improve or hinder learning outcomes. Because very few studies in our meta-analysis took this into account, we cannot be sure whether the disfluency manipulations fell within an optimal range. A poor manipulation of disfluency might narrow the gap between learning performance in the disfluent and fluent text conditions, leading to a null effect.

After failing to find a disfluent font benefit for math problem reasoning, Meyer et al. (2015, e20) speculated that “disfluent fonts may aid memory but not reasoning—presumably because reading words more slowly benefits memory, but not reasoning.” However, the current meta-analysis did not confirm a disfluent text benefit for recall, and the exploratory moderator analyses did not discover any factor that might moderate the impact of disfluency on recall. In short, our meta-analysis is one more failed attempt to replicate the disfluency effect in the text-based learning domain.

With respect to the moderator analyses, although no significant moderators were found in adult samples, the pattern of results suggested that the negative effect of disfluency might be especially apparent under some circumstances (i.e., system-paced presentation, between-subjects design, a shorter time interval between learning and test, and no distraction task included). It should be pointed out that because very few studies were represented in some subgroups (e.g., design of study), these results must be treated with some caution.

Perceptual Disfluency Influences JOL and Learning Time

Another theory-based test in the present study addressed whether students would use perceptual disfluency as a metacognitive cue for monitoring and controlling their learning with texts. According to models of dual-system processing, making a text disfluent would reduce learners’ judgments of learning and increase learning time.

Our findings revealed that perceptual disfluency reduced the magnitude of judgments of learning. Specifically, students predicted that they would perform worse after learning disfluent instructional materials than after learning regular fluent materials (d = − 0.43), which is in line with Hypothesis 2 and previous empirical studies (Ball et al. 2014; Carpenter et al. 2013; Pieger et al. 2017; Weltman and Eakin 2014). Meanwhile, we found that making instructional materials harder to read influenced the allocation of learning time when the learning process could be controlled by students themselves. Specifically, students spent more time learning disfluent information than fluent information (d = 0.52), which is in line with Hypothesis 3 and prior qualitative reviews (Dunlosky and Ariel 2011; Son and Metcalfe 2000) as well as empirical studies (e.g., Ball et al. 2014; Eitel and Kühl 2016; Miele and Molden 2010, Experiment 3).

Compared with learning simple words, learning with instructional materials is more complex (Pieger et al. 2017). Regulating this complex learning process requires monitoring and control, which typically draw on a variety of cues (Koriat 1997, 2012; Koriat et al. 2006), including metacognitive cues (Koriat 1997; Pieger et al. 2016). When the studied information is perceived as difficult, poor learning performance is likely to be predicted or assessed. This monitoring outcome (i.e., a low JOL) will in turn affect students’ control during learning, for example, by leading them to read more slowly and to spend more time studying.

These results should also be interpreted in the context of the results on all four dependent variables (i.e., recall, transfer, JOL, and learning time). Based on the entire set of findings, one might conclude that a perceptually disfluent text may affect learning processes (monitoring or control) rather than learning outcomes (recall or transfer). It is possible that perceptual disfluency functions as a cue that adjusts the learning process rather than one that changes the final learning performance. That would explain the lower JOLs and longer learning times as well as the null effects on recall and transfer. However, these results should not be overstated, because several studies included in the meta-analysis had small sample sizes (N < 30; Diemand-Yauman et al. 2011, Study 1; French n.d.; Lee 2013) and, as noted earlier, only a small number of effect sizes could be computed for the dependent measures of transfer, JOL, and learning time.

Limitations and Future Directions

The current meta-analysis provides a reference for research on text disfluency and suggests that it is too early to widely introduce disfluency-based educational interventions into instructional design. The following limitations should be acknowledged. First, there are two issues related to the dependent measures. One of the four dependent measures we analyzed was the magnitude of judgments of learning, but the impact of disfluency on judgment accuracy remains unclear. On the one hand, lower judgments of learning in the disfluency group may make learners more cautious and rational (Carpenter et al. 2013); on the other hand, disfluency may lead to significant under-confidence (Pieger et al. 2016). The other dependent variable that needs further research attention is learning time. If a disfluent text produces longer learning time than a fluent text under time-unlimited circumstances, the composition of that “learning time” is ambiguous: it may include both the time spent on genuine learning and the time spent deciphering the disfluent words.

Second, only a small number of studies used transfer, JOL, or learning time as a dependent variable. Therefore, the tests of moderation were limited to studies using the most commonly used dependent measure, namely recall. It was also not possible to test age as a moderator because almost all studies on text disfluency have been conducted with adults. However, other research suggests that there might be a developmental change in the disfluency effect (e.g., Katzir et al. 2013). That is, the positive or negative influence of disfluency might emerge at certain stages of learners’ lives but disappear, or even reverse, during other periods. Of course, this possibility needs more evidence. In addition, to ensure a sufficient number of effect sizes for moderator analyses in the present study, both learning duration and the time interval between learning and test were treated as categorical variables rather than continuous ones, which might influence the reliability of the corresponding results.

Finally, several questions could be addressed in future studies. For example, in previous text-based disfluency studies, it was usually the learning materials, rather than the test materials, that were manipulated to be disfluent or fluent. Perhaps manipulating the disfluency of test items would yield additional, unexpected, but important results. There may also be boundary conditions that deserve further examination, such as learning motivation and working memory capacity. In addition, closer examination of the specific nature of the disfluency manipulation (e.g., 8- vs. 20-point font in one study; 12- vs. 24-point font in another) may generate information about when exactly a text becomes disfluent. As another example, because our focus was perceptual disfluency, we included studies in which disfluency was manipulated on a perceptual level but not on a conceptual level (e.g., by altering text coherence; McNamara et al. 1996); another interesting direction for future research would be to investigate the differing effects of perceptual and conceptual disfluency on learning with texts.

Conclusions

This meta-analysis of the effects of disfluent versus fluent texts on participants’ recall, transfer, JOL, and learning time indicates that (a) perceptual disfluency can reduce judgments of learning and increase learning time, but appears to have no effect on recall or transfer, providing insufficient evidence that it either stimulates analytic processing or increases extraneous cognitive load; and (b) characteristics of participants, learning materials, and experimental design do not appear to moderate the effect of disfluency on recall, although the pattern of results in adult samples suggests several factors that should be examined in future research. This study has implications for future research and serves as a caution against premature disfluency-based interventions in educational settings.