Self-explanation is a constructive cognitive activity learners can enact, at will or in response to a prompt, to comprehend new information or learn skills (Fonseca and Chi 2011). When self-explaining, it is theorized that learners generate inferences about causal connections and conceptual relationships that enhance understanding. The content of self-explanations ranges widely; for example, explanations can describe how a system functions, the effects of serial steps in a procedure, the motives of characters in a story, or concepts presented in expository text (Chi 2000; Siegler 2002).

It has often been observed that students learn the steps in a procedure without understanding how each step relates to the others or contributes to the goal of the procedure (Siegler 2002). Consequently, learners are less able to transfer the procedure to tasks with differing conditions. Similarly, students studying an expository text may read each sentence but neither connect new information to prior knowledge nor consider implications arising from new information. Both situations can be characterized by absent or ineffective metacognition, whereby a learner fails to recognize gaps in understanding (metacognitive monitoring) and repair them (metacognitive control).

Self-explanation is conceptualized as a tactic some students spontaneously use to fill in missing information, monitor understanding, and revise how new information is integrated with prior knowledge when discrepancies or deficiencies are detected (Chi 2000). Chi considers self-explanation to differ from summarizing, explaining to another, or talking aloud. Self-explanation is directed toward one’s self for the purpose of making new information personally meaningful. Because it is self-focused, the self-explanation process may be entirely covert or, if a self-explanation is expressed overtly, it may be intelligible only to the learner.

In early self-explanation research, stable individual differences were observed in the frequency and quality with which students spontaneously self-explained, and positive relationships were reported between self-explanation and learning (Chi et al. 1989; Renkl 1997). Renkl found more successful learners tended to self-explain either by predicting the next step in a problem solution (anticipative reasoning) or identifying the overall goal structure of the problem and the purpose of its subgoals (principle-based explaining).

Because self-explanation could be an effect rather than a cause of learning, observational data showing better learning outcomes for learners who self-explain are not evidence that learners benefit from self-explanation training or prompts. Prompts and other exogenous inducements could lead learners to generate explanations less finely adapted to gaps in their knowledge than spontaneous self-explanations. Thus, prompted “self” explanations may be less effective.

Some research has found teaching or prompting self-explanation produces better achievement than comparison conditions (e.g., Lin and Atkinson 2013). These studies typically pose a primary learning activity for all participants and introduce a treatment in which some participants are prompted to self-explain during or after the activity. Lin and Atkinson had college students learn about the cardiovascular system from animated and static visualizations. After studying each visualization, participants in the self-explanation condition were given open-ended questions (e.g., “Could you explain how the blood vessels work?”, p. 90) and an empty text box to provide their response to each question. Participants in the comparison condition were given an empty text box where they were told they could write notes. Participants prompted to self-explain performed better on a 20-item multiple-choice posttest than learners prompted to generate notes.

Although several reviews of relevant self-explanation studies support its use as a study skill (Rittle-Johnson and Loehr 2017; Chi and Wylie 2014; Dunlosky et al. 2013), we could find no published report that comprehensively synthesized research investigating the relationship between self-explanation and learning outcomes. Wittwer and Renkl (2010) conducted a meta-analysis exploring the influence of instructional explanations on example-based learning. One moderator variable coded whether participants in the control condition were prompted to self-explain. No effect for this condition was found among the six studies reviewed. A second meta-analysis was conducted on self-explanation in learning mathematics (Rittle-Johnson et al. 2017). The results showed prompting students to self-explain had a small to moderate effect on learning outcomes.

To assess the cognitive effects and instructional effectiveness of self-explanation interventions, we meta-analyzed experimental and quasi-experimental research that compared the learning outcomes of learners induced to self-explain with those of learners who were not. In the meta-analysis, we examined moderating variables to investigate how learning outcomes varied under a range of conditions and to explore their theoretical implications for applying self-explanation.

Self-Explanation Versus Instructional Explanation

One can speculate that self-explanation is effective because it has the potential to add information beyond what is given to the learner. That is, learning may be enhanced by the product, not the process, of self-explanation. Hausmann and VanLehn (2010) labeled this the coverage hypothesis, proposing that self-explanation works by generating “additional content that is not present in the instructional materials” (p. 303). The coverage hypothesis predicts the cognitive outcomes of self-explanation are the same as those of a provided instructional explanation. An instructional explanation would presumably be superior in cases where the learner was unable to generate a correct or complete self-explanation.

Several studies examined various effects of self-explanation and instructional explanations. Cho and Jonassen (2012) observed instructional explanations were as effective as self-explanations, but outcomes were even better when learners were prompted to compare their self-explanations to instructional explanations. de Koning et al. (2010) demonstrated equivalence between self-explanations and instructional explanations on measures of retention and transfer, but learners achieved higher inference scores when instructional explanations were provided. Gerjets et al. (2006) posited both self-explanations and instructional explanations elevated germane load when students studied worked-out examples presented in a modular format. The coverage hypothesis was not supported, but these researchers noted design issues with the content learners studied. Kwon et al. (2011) contrasted self-explanations with “partially” instructional explanations for which participants filled in small bits of missing information. Self-explaining led to better outcomes. Given these diverse findings, our meta-analysis assessed the relative efficacy of self-explanations vs. instructional explanations.

Does It Matter How a Self-Explanation Is Elicited?

Although self-explanation can be induced in various ways, theory is equivocal about their relative efficacy. For example, some researchers induced self-explanation by directions given before a learning activity (e.g., Ainsworth and Burcham 2007). Others provided self-explanation prompts during the learning activity (Hausmann and VanLehn 2010) or after it (Tenenbaum et al. 2008). Supplying prompts periodically during a learning activity may help learners engage in sufficient self-explanation by signaling when there is suitable content to explain. Alternatively, allowing learners to apply their own criteria to decide when and what to self-explain may support constructing explanations better adapted to idiosyncratic prior knowledge.

Several studies used unvarying prompts to avoid reference to specific content. For example, learners studying moves by a chess program were repeatedly asked to predict the next move and “explain why you think the computer will make that move” (de Bruin et al. 2007). In contrast, most research provided content-specific self-explanation prompts, such as “Write your explanation of the diagram in regard to how excessive water intake and lack of water intake influence the amount of urine in the human body” (Cho and Jonassen 2012). Providing more specific prompts might draw learners’ attention to difficult conceptual issues that would otherwise be overlooked, but it may also interfere with or override learners’ use of personal standards for metacognitively monitoring content meriting an explanation.

The format used for self-explanation prompts may affect features of learners’ elaborative processing of content. Some researchers asked learners to generate open-ended responses while other researchers provided fill-in-the-blank or multiple-choice questions. As with other prompt characteristics, prompt format may modulate specificity provided by the prompt which, in turn, may affect qualities of elaboration generated in the self-explanation. By containing fewer cues than multiple-choice and fill-in-the-blank formats, an open-ended prompt format may invite elaborative processing better adapted to each learner’s unique gaps in knowledge. However, in some contexts, stronger cues present in a multiple-choice question may signal more clearly a possible misconception the learner should address.

Prompts often convey the purpose or nature of the expected self-explanation. Researchers have prompted learners to (a) justify or give reasons for a decision or belief, (b) explain a concept or information in the content, (c) explain a prediction, or (d) make a metacognitive judgment about qualities of their understanding, reasoning, or explanation. When combined with the task context, self-explanation prompts may imply the learner should generate a description of causal relationships, conceptual relationships, or evidence supporting a claim. Of interest in our meta-analysis is the relative effectiveness of the types of self-explanation the inducement is intended to elicit.

Does It Matter that Self-Explanation Takes More Time?

Performing a learning task that invites self-explaining usually requires more time than performing the same learning task without self-explanation. From the perspective of a student interested in optimizing instructional efficacy, it is important to know whether additional time spent on self-explanation might have been better spent otherwise, e.g., re-studying a text or doing additional problems. From a theoretical perspective, the comparative efficiency of self-explaining may depend on the duration and timing of instances of self-explanation in an experiment. The theoretical significance of self-explanation duration and timing is apparent in research by Chi et al. (1994) in which 8th grade students studied an expository text about the human circulatory system. Participants prompted to self-explain after reading each line of text showed greater learning gains than those who read each line a second time. However, the self-explanation group spent almost twice as much time (approximately 1 h longer) studying the materials. Because students who took longer to study the materials were allowed to return for an additional session, those students in the self-explanation group may have benefited from the spaced-learning effect (Carpenter et al. 2012).

In research examining effects of self-explanation prompts on mathematics problem solving by elementary school students, McEldoon et al. (2013) included a comparison group who practiced additional problems to equate for the additional time required by the self-explanation group. Over a variety of problem types and learning outcomes in the posttest, there was little difference between the time-matched self-explanation and additional-problem conditions. Unfortunately, most self-explanation research does not report the learning task completion time of each treatment condition, and many studies do not even report the mean completion time across all treatment conditions. Nevertheless, where possible, we compared effect sizes of studies in which treatment duration was greater for self-explanation groups to those in which treatment duration was equivalent.

Another time-related question is whether the duration of learning activities is associated with the efficacy of self-explanation inducement. It is plausible that self-explanation helps learners maintain engagement throughout a lengthy task. Alternatively, prompts to self-explain that are initially effective and are internalized by learners may lose their potency over time if they interfere with learners’ nascent ability to spontaneously self-explain.

Other Moderators of Interest

Several other variables potentially moderate the benefits of self-explanation inducement. Self-explanation is most commonly used to support either problem solving (including study of worked examples) or text comprehension tasks, and it seems likely that the cognitive effects of self-explanation inducement vary depending on the nature of the primary learning task. The type of knowledge being acquired through the learning activity (i.e., conceptual vs. procedural) is a related factor that we anticipated might also moderate the treatment effects. Because these two variables, task and knowledge type, are more directly related to cognitive operations, we hypothesized they are more likely to moderate treatment effects than subject matter (e.g., science, mathematics, social sciences), a variable we also coded.

Participants learned via several types of instructional media (i.e., digital, print, video), with digital media allowing three types of interactivity (computer-based instruction, intelligent tutoring system, simulation). Computer-based instruction (CBI) was defined as offering response feedback with no student modeling, and intelligent tutoring systems (ITS) were defined as modeling student knowledge to adapt feedback, task selection, or prompts (Ma et al. 2014). We investigated whether the interactivity afforded by digital media, especially the more adaptive forms of interaction provided by intelligent tutoring systems and simulations, would enhance self-explanation effects. Much of the research included diagrams in instructional materials and asked participants to explain or interpret them. We hypothesized explaining diagrams while problem solving may offer opportunities to translate between verbal and visual knowledge representations and thereby strengthen learners’ ability to retrieve memories and generate inferences from them (Booth and Koedinger 2012; Chu et al. 2017). We also investigated whether studies that used visual pedagogical agents (i.e., images depicting fictional tutors) to prompt self-explanation might show enhanced treatment effects due to a “persona effect” (Dunsworth and Atkinson 2007; Schroeder and Gotch 2015).

During posttests, particular conditions and requirements of the assessment may allow participants to demonstrate self-explanation effects more fully. The beneficial effects of self-explanation may be evident more on transfer tests than recall tests, and more on long-form test items such as essays and problems than multiple-choice questions. Because interpreting diagrams often requires a more fully elaborated mental model to support translating between visual and verbal representations, tests that include diagrams may be more sensitive to self-explanation effects. To investigate the moderating effects of test characteristics, we coded for assessed learning outcomes (e.g., recall, transfer), test format (e.g., fill-in-the-blank, essay), and whether test items included diagrams.

Students’ metacognitive and self-regulatory abilities increase throughout childhood and adolescence (Gestsdottir and Lerner 2008; Kopp 1982; Raffaelli et al. 2005; Schneider 2008; Weil et al. 2013). To investigate whether these developmental changes moderate the effect of self-explanation inducement, we coded for level of schooling (e.g., elementary, secondary) which was the most reliable proxy for developmental level reported in the primary research.

There is an increasing awareness of how regional or cultural differences affect student learning (Frambach et al. 2012; Marambe et al. 2012). Differences have been reported between geographic regions in students’ learning patterns (Marambe et al. 2012) and metacognition (Hartman 2001; Hartman et al. 1996), and these have the potential to moderate self-explanation effects.

Finally, we examined treatment fidelity (Smith et al. 2007), a methodological feature of the primary research that is a key marker of internal validity. Observing a relationship between effect size and methodological quality can help in interpreting meta-analytic results. For example, observing that studies of higher methodological quality tend to find lower effect sizes should decrease our confidence in the effect sizes found by lower quality studies.

Research Goals

We reviewed research that compared learning outcomes when participants were instructed or prompted to self-explain to a condition in which participants were not induced to self-explain. We coded 20 moderator variables to investigate the hypotheses and research questions discussed in the preceding section.

Method

Following the meta-analytic principles described by Borenstein et al. (2009), Lipsey and Wilson (2001), and Hedges and Olkin (1985), we (a) identified all relevant studies, (b) coded study characteristics, and (c) calculated weighted mean effect sizes and assessed their statistical significance.

Identifying Relevant Studies

Using the term self-expla*, searches of ERIC, PsycINFO, Web of Science, Education Source, Academic Search Elite, and Dissertation Abstracts were carried out to identify and retrieve primary empirical studies of potential relevance. These searches retrieved 1306 studies. Reference sections of review articles (e.g., Atkinson et al. 2000; Chi 2000; VanLehn et al. 1992) were examined to identify candidate studies for inclusion in the meta-analysis. Inclusion criteria were first applied to abstracts. If details provided in abstracts were insufficient to accurately classify a study as appropriate for inclusion, the full text of the study was examined.

To be included, a study was required to:

  • Include an experimental treatment in which learners were directed or prompted to self-explain during a learning task. The requirement for self-explanation was not considered satisfied by observing another person self-explain, explaining to another person (i.e., not a self-explanation), rehearsing information, making a prediction without an explanation, or choosing a rule or principle without providing an explanation (Atkinson et al. 2003).

  • Provide a comparison treatment in which learners were not directed or prompted to self-explain.

  • Avoid confounding the effect of self-explanation by combining it with another study tactic such as summarizing (Chung et al. 1999; Conati and Vanlehn 2000) or providing feedback (Siegler 1995).

  • Measure a cognitive outcome such as problem solving or comprehension.

  • Use a between-subjects research design.

  • Provide sufficient statistical information to compute an effect size, Hedges’ g.

  • Be available in English.

A handful of studies claimed to use think-aloud protocols to elicit self-explanations, especially when participants were young children. However, as Chi (2000) observed, the expected outcomes of think-aloud and self-explanation inducements are quite different:

Think-aloud protocols, often collected in the context of problem-solving research, is a method of collecting verbal data that explicitly forbids reflection. Think-aloud protocols presumably ask the subject to merely state the objects and operators that s/he is thinking of at that moment of the solution process. It is supposed to be a dump of the content of working memory…. Self-explaining is much more analogous to reflecting and elaborating than to “thinking aloud.” Talking aloud is simply making overt whatever is going through one’s memory (see Ericsson & Simon, 1993), without necessarily exerting the effort of trying to understand. (p. 170)

Furthermore, there is evidence that think-aloud directions can improve performance on some types of problems (e.g., Fox and Charness 2010).

We decided to include studies identified by authors as using a think-aloud method only if the method matched our previously described operationalization of self-explanation inducement and did not provide prompts to talk aloud. Specifically, we excluded think-aloud studies in which:

  • The experimenter reminded participants to keep talking (McNamara 2004; Wong et al. 2002).

  • Participants were provided with neutral prompts if they remained quiet (Gadgil et al. 2012).

Applying these criteria, we identified 64 studies reporting 69 independent effect sizes which are examined in the present meta-analysis.

Coding of Study Characteristics

The coding form included 43 fixed-choice items and 24 brief-comment items cataloging information about each study in eight major categories: reference information (e.g., date of publication), sample information (e.g., participants’ ages), materials (e.g., subject domain), treatment group (e.g., prompt type), comparison group (e.g., study tactic), research design (e.g., group assignment), dependent variable (e.g., outcome measure), and effect size computation. In an initial coding phase, three of the authors independently coded ten studies and obtained 86% agreement on the fixed-choice items and 78% agreement on the brief-comment items. All disagreements in this initial phase were discussed and resolved. In the main coding phase, each of the three coders was assigned to code approximately a third of the remaining studies. Bi-weekly meetings were held throughout the main coding phase in which the coders met with the rest of the research team. During these meetings, borderline cases, discrepancies, and unintended exceptions that arose while coding were resolved. In addition, an online forum was used to facilitate discussion in weeks when a meeting did not occur. Meetings and the online communications served four purposes: (1) resolving emerging issues, (2) sharpening category definitions, (3) disseminating definitional refinements to all coders, and (4) creating an archive showing how the group arrived at decisions.

After items in the coding form were used to code the studies, we decided several items should not be analyzed as moderator variables due to a lack of interpretable variance. For example, as a measure of methodological quality, we coded the method of assignment to treatment groups. However, we found 90% of studies randomly assigned individuals to groups by a method supporting high internal validity (simple random assignment, block randomization, etc.). Almost all the remaining studies failed to report the method of assignment, and only one study reported a quasi-experimental method. Consequently, we decided not to present further analysis of this variable.

Effect Size Extraction

To avoid inflating the weight attributed to any particular study, it is crucial to ensure coded effect sizes are statistically independent (Lipsey and Wilson 2001). To this end, when a study reported more than a single comparison treatment (i.e., treatments in which participants were not trained, directed, or prompted to self-explain), the weighted average of all comparison groups was used in calculating the effect size. When repeated measures of an outcome variable were reported, only the last measure was used to calculate the effect size. Lastly, when multiple outcome variables were used to measure learning, we used a combined score calculated using the approach outlined in Borenstein et al. (2009, p. 27).
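To make the combining step concrete, the sketch below shows one way to form a single composite effect from several correlated outcome measures reported for the same sample, in the spirit of the Borenstein et al. approach cited above. It is a minimal illustration rather than the code used in this meta-analysis; the function name, the assumed common correlation r between outcomes, and the example values are all hypothetical.

```python
import math

def composite_effect(effects, variances, r=0.5):
    """Combine m correlated effect sizes from one study into a composite.

    effects, variances: per-outcome effect sizes and their variances.
    r: an assumed common correlation between the outcome measures.
    Returns the composite effect and its variance.
    """
    m = len(effects)
    g_bar = sum(effects) / m
    # Variance of the mean of m correlated estimates:
    # (1/m^2) * [sum of variances + 2 * r * sqrt(Vi * Vj) summed over pairs]
    var_sum = sum(variances)
    cov_sum = sum(
        2 * r * math.sqrt(variances[i] * variances[j])
        for i in range(m) for j in range(i + 1, m)
    )
    return g_bar, (var_sum + cov_sum) / m**2

# Example with two outcome measures from the same participants (illustrative values)
g, v = composite_effect([0.50, 0.60], [0.04, 0.05], r=0.5)
```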

Data were analyzed using random effects models as operationalized in the software program Comprehensive Meta-Analysis (CMA) Version 3.3.070 (Borenstein et al. 2005). Hedges’ g, a bias-corrected standardized mean difference, was computed for each contrast. Statistics reported are the number of participants (N) in each category, the number of studies (k), the weighted mean effect size (g) and its standard error (SE), the 95% confidence interval around the mean, and a test of heterogeneity (Q) with its associated level of type I error (p).
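For reference, Hedges’ g applies a small-sample correction factor J to the standardized mean difference between the self-explanation (SE) and comparison (C) groups. A standard formulation (the exact computation in CMA depends on the statistics each primary study reports) is:

g = J \cdot \frac{\bar{X}_{SE} - \bar{X}_{C}}{S_{pooled}}, \qquad S_{pooled} = \sqrt{\frac{(n_{SE}-1)S_{SE}^{2} + (n_{C}-1)S_{C}^{2}}{n_{SE}+n_{C}-2}}, \qquad J = 1 - \frac{3}{4(n_{SE}+n_{C}-2)-1}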

When significant heterogeneity among levels of a moderator was detected, post-hoc tests using the Z-test method described by Borenstein et al. (2009, pp. 167-168) were conducted to compare the effect size of each level with the others. The critical value was adjusted for the number of comparisons following the Holm-Bonferroni procedure (Holm 1979) to maintain α = .05 for each moderator.
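A minimal sketch of this comparison procedure follows: pairwise z tests on subgroup mean effects, with the Holm-Bonferroni step-down adjustment applied to the resulting p values. The helper function and the input values are illustrative only and are not drawn from the tables reported here.

```python
from itertools import combinations
from math import sqrt
from scipy.stats import norm

def pairwise_holm(levels, alpha=0.05):
    """Pairwise z tests between moderator levels with Holm-Bonferroni adjustment.

    levels: dict mapping a level name to (g, se), its mean effect and standard error.
    Returns (level_a, level_b, z, p, significant) tuples.
    """
    tests = []
    for (a, (g1, se1)), (b, (g2, se2)) in combinations(levels.items(), 2):
        z = (g1 - g2) / sqrt(se1**2 + se2**2)
        p = 2 * (1 - norm.cdf(abs(z)))
        tests.append((a, b, z, p))
    tests.sort(key=lambda t: t[3])  # order p values from smallest to largest
    results, m = [], len(tests)
    for i, (a, b, z, p) in enumerate(tests):
        if p < alpha / (m - i):  # Holm step-down criterion
            results.append((a, b, z, p, True))
        else:
            # once one comparison fails, all remaining (larger) p values also fail
            results.extend((a2, b2, z2, p2, False) for a2, b2, z2, p2 in tests[i:])
            break
    return results

# Illustrative (hypothetical) subgroup means and standard errors
print(pairwise_holm({"no additional explanation": (0.67, 0.07),
                     "instructional explanation": (0.35, 0.10),
                     "other strategy": (0.40, 0.09)}))
```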

Results

Overall Effect Size for Self-Explanation

As shown in Table 1, a random effects analysis of 69 effect sizes (5917 participants) obtained an overall point estimate of g = .55 with a 95% confidence interval ranging from .45 to .65 favoring participants who were prompted or directed to self-explain (p < .001). There was statistically detectable heterogeneity, Q = 196.63 (p < .001, df = 68), a result which warrants analyzing moderator variables. Factors other than random sampling accounted for 65% of variance in effect sizes (I² = 65.42).
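The reported I² follows directly from Q and its degrees of freedom, which serves as a simple consistency check on these statistics:

I^{2} = 100\% \times \frac{Q - df}{Q} = 100\% \times \frac{196.63 - 68}{196.63} \approx 65.4\%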

Table 1 Overall effect and weighted mean effect sizes for comparison treatments

Moderator Analysis

The coverage hypothesis can be evaluated by comparing the effects of self-explanation prompts to instructional explanation. Comparison treatments were coded as (a) no additional explanation if they provided no intervention beyond the primary instructional task, (b) instructional explanation if they provided scripted explanations corresponding to the self-explanation prompts presented to the self-explanation group, (c) other strategy/technique if they provided some other intervention beyond the primary instructional task, or (d) mixed if they provided two or more of the preceding three types of comparison treatments. Results, shown in Table 1, indicate that each type of comparison treatment was associated with a statistically detectable effect size (p < .05) and there was a statistically detectable difference among them. Post-hoc tests indicated studies that provided no additional explanation to the comparison group had effect sizes significantly greater than those that provided instructional explanation (z = 2.696, p = .007) and those that provided another strategy or technique (z = 3.119, p = .002).

Researchers used methods and formats to induce self-explanation that varied in specificity. These ranged from directive prompts, such as multiple-choice questions, to less directive inducements such as open-ended questions or a general instruction to explain. Researchers also attempted to engage learners in different types of cognitive or metacognitive processes. We used four moderators to characterize the self-explanation inducements: inducement timing, content specificity, inducement format, and type of self-explanation elicited.

The optimal timing of self-explanation inducements is arguable and may depend on each learner’s knowledge and self-explanation ability. Although prompts given during a learning activity model when to self-explain and may induce more frequent self-explanation than an instruction to self-explain given before the activity, such prompts may inhibit learners’ choices of content for additional processing. The timing of the self-explanation inducement was coded as (a) concurrent if prompts or reminders to self-explain were interspersed with the learning material, (b) retrospective if a prompt was provided after participants studied the learning materials, and (c) beginning if the inducement was training or an instruction to self-explain given before participants began the learning activity. If a beginning prompt specified the participant was to self-explain after each sentence, paragraph, or problem step and there were no explicit prompts during the learning activity, the inducement was coded as beginning.

The content specificity of inducements was coded as specific to named content (e.g., “please self-explain how the valve works”), general (e.g., “please self-explain the last paragraph”), or both if the treatment included both content-specific and content-general inducements.

Inducement format was coded as (a) interrogative if a prompt asked the participant a question, (b) imperative if a prompt directed the participant to self-explain, (c) predirected if the participant was told at the beginning of the activity to self-explain, (d) fill-in-the-blank if the participant was asked to complete a sentence with a self-explanation, (e) multiple choice if the participant was to choose a self-explanation from a list of explanations, or (f) mixed if participants were provided more than one inducement type. Although the overall comparison of inducement types returned p = .048, post-hoc tests found no significant differences among them.

Lastly, the type of self-explanation elicited by the inducement was coded as (a) justify if the participant was asked to provide reasoning for a choice or decision, (b) conceptualize if the inducement asked the participant to define or elaborate the meaning of a named concept presented in the material (e.g., iron oxide), (c) explain if the inducement asked the participant to explain a structural part of the material (e.g., the last paragraph), (d) mixed when the inducement elicited more than one type of response, (e) justify for another if the inducement asked the participant to provide reasoning for someone else’s choice or decision, (f) metacognitive if the inducement asked the participant to self-explain their planning or performance, (g) anticipatory if the prompt asked participants to predict the next step in a problem solution and self-explain why they believed this was the correct next step, and (h) other if the inducement asked participants to provide another type of self-explanation. Table 2 shows all categories of inducement were associated with a statistically detectable effect except multiple-choice inducements and metacognitive self-explanations. The overall comparison of levels returned p < .001, and post-hoc tests indicated that the prompts eliciting conceptual self-explanations (z = 3.284, p = .001) were more effective than those inducing metacognitive self-explanations. Post-hoc tests also suggested that prompts inducing anticipatory self-explanations were more effective than those inducing metacognitive self-explanations, but this outcome can be discounted because there was only one study inducing anticipatory self-explanations and its large effect size could be due to any of its particular characteristics.

Table 2 Weighted mean effect sizes for self-explanation inducements

The length of time spent using a study strategy and engaging with learning materials may affect learning outcomes. In examining the problem of differences in time on task between intervention and comparison groups discussed in the introduction, we found that most studies neither reported separate mean durations for each group’s study activities nor statistically compared durations across groups. For studies that did statistically compare duration, those in which the self-explanation treatment group took detectably longer were coded as greater for SE group and those in which there was no detectable difference in time were coded as equal for both groups. We also coded studies according to the overall duration of the learning activity. Possibly, a brief learning phase might produce only transitory effects due to the novelty of the intervention, or it might not allow enough time for participants to learn how to respond to inducements (Table 3).

Table 3 Weighted mean effect sizes for learning task duration

To capture intended learning goals, three moderator variables were defined to characterize the learning task and the type of knowledge students were expected to acquire. Researchers have tended to investigate self-explanation in the context of problem solving (including worked examples) or text comprehension. Although these two learning activities are usually treated separately in the wider arena of educational research, and one might anticipate they would interact differently with self-explanation inducements, self-explanation theory deals with and thereby emphasizes their common elements. For example, both learning activities call for cognitive strategies that require metacognitive monitoring, a process which may detect gaps in understanding and productively trigger self-explanation. To determine if outcomes were affected by type of learning task, the task common to all treatment groups was considered the primary learning activity and was coded as solve problem, study text, study worked example, study case, study simulation, or other. Learning activities that combined two or more of these types were coded as mixed. As shown in Table 4, each type of primary learning activity was associated with a statistically detectable mean effect size except studying simulation, for which there were only two studies. There was no statistically detectable difference among the different types of primary learning activity.

Table 4 Weighted mean effect sizes for learning task

The knowledge type was coded as conceptual, procedural, or both. Often students study texts to develop conceptual knowledge, and they solve problems to develop a combination of conceptual and procedural knowledge. However, the type of learning task assigned does not reliably indicate the type of knowledge students are intended to acquire. One can productively study a text about how to execute a skill, and problem solving can be undertaken solely for conceptual insights it affords. Learning materials were coded according to their subject area (e.g., mathematics, social sciences). Associations between inducement to self-explain and learning outcome were statistically detected for each knowledge type and subject. No differences were statistically detected among the levels of each variable.

Features of the learning environment have potential to moderate effects of self-explanation inducement. Prompts may offer greater benefit in learning environments lacking interactive and adaptive features and, aside from the prompts, present only text to be studied or problems to be solved. More adaptive environments such as intelligent tutoring systems (Ma et al. 2014) are designed to detect gaps in individual learner’s knowledge and provide remediating instruction or feedback, possibly obtaining outcomes similar to self-explanation via a different route. On the other hand, self-explanation prompts delivered by a visual pedagogical agent may have amplified effects due to greater salience or inducing greater compliance than prompts embedded in a studied text or problem. As shown in Table 5, learning environments were coded and analyzed for four moderator variables: media type, interactivity, diagrams in materials, and visual pedagogical agent. The medium of the learning materials was coded as (a) digital when materials were presented on a computer, (b) print if materials were printed on paper, (c) video when the participants were asked to learn from a video, (d) other when the learning materials could not be placed in the previously stated categories. The type of interactivity was coded as computer-based instruction (CBI), intelligent tutoring system (ITS), simulation, or none. Learning materials were coded for whether or not diagrams were included. Providing diagrams in learning materials suggests visuospatial processing was a component of the intended learning goal. Associations between inducement to self-explain and learning outcome were statistically detected regardless of media type, interactivity, and whether diagrams were present or not. No differences were statistically detected between the levels of each variable.

Table 5 Weighted mean effect sizes for learning environment characteristics

A visual pedagogical agent was defined as a simulated person that communicated to participants and was represented by a static or animated face. Although the overall comparison among levels of the pedagogical agent moderator returned p = .004, post-hoc tests found only that a single study with non-reported pedagogical agent status outperformed studies that did not use a pedagogical agent to present self-explanation prompts—a difference that affords little meaningful interpretation.

Guided by the theory of self-explanation as explained in the introduction, we hypothesized that self-explanation would have a small beneficial effect on measures assessing comprehension and recall, and a larger beneficial effect on measures assessing transfer and inference. As shown in Table 6, the learning outcome was coded as comprehension, inference, recall, problem solving, transfer, mixed (if several outcomes were tested), or other depending on how it was described by the researchers. For example, the learning outcome in Ainsworth and Burcham (2007) was coded as inference because they described posttest questions as requiring “generation of new knowledge by inferring information from the text” (p. 292). The studies coded as assessing problem solving dealt with problems ranging from diagnosing medical cases to geometry problems. While not coded as transfer, these could be regarded as assessing problem solving transfer because they posed posttest problems that differed to some degree from problems solved or studied in the learning phase. The effect sizes for all types of learning outcomes, except comprehension, were statistically significant and no differences between them were detected.

Table 6 Weighted mean effect sizes for learning outcome assessment

Also shown in Table 6, learning assessments were coded according to test format and whether the test asked participants to complete or label a diagram. Test format was coded as essay questions, fill-in-the-blank questions, multiple choice questions, problem questions, short-answer questions, mixed, and other. The effect sizes for tests with or without diagrams and all test formats except essay questions were statistically significant. No differences among the categories of each moderator were statistically detected.

Because participants’ ages (in years) were not reliably reported, we assessed whether the effect of self-explanation inducements varied with level of schooling. As shown in Table 7, statistically detectable effects were found at all levels of schooling. No difference was detected among levels.

Table 7 Weighted mean effect sizes for grade level and region

Social science research has been criticized for using data gathered only in Western societies to make claims about human psychology and behavior (Henrich et al. 2010). We were curious about the extent to which the research we reviewed was subject to similar sampling biases. We coded the region in which the research took place as North America, Europe, East Asia, and Australia/New Zealand. As shown in Table 7, most of the studies were conducted in North America, and most of the remainder was conducted in Europe. A comparison among regions returned p = .011. Post-hoc tests indicated that studies conducted in East Asia had effect sizes significantly greater than those conducted in North America (z = 3.105, p = .002), Europe (z = 3.275, p = .001), and mixed (z = 3.263, p = .001).

Finally, we coded a treatment fidelity variable that indicated whether researchers ensured the group induced to self-explain actually engaged in self-explanation. Studies were coded as (a) Yes, analyzed if the researchers recorded or examined the self-explanations and verified that participants self-explained, (b) No if participants’ responses were not verified as legitimate self-explanations, (c) Yes, followed-up if the participant was prompted again when the initial self-explanation was insufficient, and (d) not reported if it could not be determined whether the provided self-explanations were verified. As shown in Table 8, all four categories of effect sizes were statistically detectable and no differences among them were detected.

Table 8 Weighted mean effect sizes for methodology

Publication Bias

Two statistical tests were computed with Comprehensive Meta-Analysis to further examine the potential for publication bias. First, a classic fail-safe N test was computed to determine the number of null-effect studies needed to raise the p value associated with the average effect above an arbitrary level of type I error (set at α = .05). This test revealed that 6028 additional studies would be required to render the overall effect found in this meta-analysis statistically non-significant. This large fail-safe N implies our results are robust to publication bias when judged relative to the threshold for interpreting fail-safe N of ≥ 5k + 10 (Rosenthal 1995), where k is the number of studies included in the meta-analysis. Orwin’s fail-safe N, a more stringent publication bias test, revealed 274 missing null studies would be required to bring the mean effect size found in this meta-analysis to under 0.1 (Orwin 1983). Although there is potential for publication bias in this meta-analysis, results of both tests suggest it does not pose a plausible threat to our findings.
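As a worked check of the Rosenthal criterion using the 69 independent effect sizes analyzed here (using the 64 studies instead gives a threshold of 330), the observed fail-safe N far exceeds the threshold:

5k + 10 = 5 \times 69 + 10 = 355 \ll 6028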

Discussion

In a review of over 800 meta-analyses relating to educational achievement, Hattie (2009) identified some of the most efficacious instructional interventions as reciprocal teaching (d = .74), feedback (d = .73), and spaced learning (d = .71). Our results indicate inducement to self-explain, with a mean weighted effect size of g = .55, offers benefits similar in magnitude to interventions such as mastery learning (d = .50) and peer tutoring (d = .55). In almost all the categories we examined, inducements to self-explain were associated with statistically detectable benefits. The only exceptions were categories represented by a small number of studies. Only three moderator variables showed statistically detectable differences that have implications for theory or practice: (a) comparison treatments, (b) type of self-explanation elicited, and (c) region.

Self-Explanation Outperforms Instructional Explanation

Contrary to the coverage hypothesis (Hausmann and VanLehn 2010), which describes the effects of self-explanation as adding information that might instead be supplied by an instructor or instructional system, our results showed a statistically detectable advantage (g = .35) of self-explanation over instructional explanations. We attribute this result to the cognitive processes learners use when generating an explanation and/or the idiosyncratic match between the new knowledge generated by self-explaining and the learner’s prior knowledge. We hypothesize that meaningful associations are formed when learners retrieve relevant previously learned information from memory and elaborate it with relevant features of the new information; constructing the explanation engages fundamental cognitive processes involved in understanding the explanation, recalling it later, and using it to form further inferences. At the same time, because self-explanation compared with no additional explanation yielded a detectably larger effect size (g = .67), a substantial portion of the benefit reported for self-explanation appears to be obtainable from instructor-provided explanations as well. Considering the difficulties learners often have in generating complete and correct self-explanations, Renkl (2002) proposed that optimal outcomes might be attained by first prompting learners to self-explain and then providing an instructional explanation upon request. However, Schworm and Renkl (2006) reported that participants who received only self-explanation prompts outperformed participants who received self-explanation prompts plus supplementary instructional explanations on demand. Schworm and Renkl argued that making a relevant instructional explanation available undermines learners’ incentive to self-explain effortfully, thus depriving them of cognitive benefits they would otherwise receive.

How Self-Explanation Is Elicited

Our introduction discussed how the cue strength associated with different prompt formats might affect learning outcomes. In the meta-analysis, multiple-choice prompts were associated with an effect size (g = .24) not statistically different from zero (Table 2). Although the small number of studies using multiple-choice prompts (k = 2) warrants caution, it may be that the greater cue strength of multiple-choice prompts undermines self-explanation effects. To address the question of optimal cue strength more thoroughly, we recommend that further research directly compare the effects of multiple-choice prompts, fill-in-the-blank prompts, and open-ended prompts. Optimal cue strength likely depends on each student’s ability to self-explain the concepts or procedures they are learning. When first introduced to a topic, students may benefit more from strongly cued self-explanation prompts, and as their knowledge increases they may benefit more from less strongly cued prompts.

We found all types of elicited self-explanations yielded a detectable effect size except metacognitive prompts (Table 2). A metacognitive prompt engages the learner in a self-explanation of planning or performance, and therefore we speculate that learners might not address content in these explanations. That is, content measured by the learning outcome may not be involved in responding to a metacognitive prompt. If so, it is not surprising learning outcomes are variable. Future research might explore different kinds of effects resulting from metacognitive prompts, e.g., increased planning when presented additional problems to solve.

Anticipatory self-explanations (g = 1.37) had substantially better learning outcomes than those in which participants justified their own decisions (g = .42) or the decisions of another (g = .43). Since only one study induced anticipatory self-explanations, the large effect size could be due to any of its particular characteristics. We suggest future research should investigate the efficacy of anticipatory self-explanations.

Findings of our meta-analysis regarding the timing, content specificity, and format of inducement (Table 2), as well as the nature of the self-explanation elicited (Table 2), show variation in these factors rarely matters when learners self-explain. The same was the case for variations of the learning task (Table 4) and the learning environment (Table 5). As well, variation in learning outcome and test type rarely mattered. In almost all these cases, self-explaining benefited learning outcomes. We note the absence of statistically detectable aggregate effect sizes in three cases: multiple-choice prompts, metacognitive self-explanations, and learning outcomes measured by essay questions. These cases invite consideration in future research.

Self-Explanation Usually Takes More Time. Does that Matter?

Although the effect size for studies in which self-explanation took more time than the comparison treatment (g = .72) was greater than for studies in which self-explanation took a similar amount of time (g = .41), the difference was not statistically significant. We interpret these results as showing that inducing self-explanation is a time-efficient instructional method but, in some research, its efficiency may be exaggerated because time-on-task was not controlled or reported. Most research on self-explanation has, regrettably, not fully reported time-on-task. We advocate reporting the mean and standard deviation of learning task duration for all treatment groups in self-explanation research.

Why Did Studies from East Asia Return High Effect Sizes?

Looking more closely at the six studies conducted in East Asia, we found that almost all had participants study texts as the learning task (Table 4) and gave comparison treatments providing no additional explanation or alternate learning strategies (Table 1). These two conditions were associated with relatively large effect sizes, and we speculate their confluence in the East Asian studies led to a significantly larger effect size for that region. An alternative interpretation rests on evidence that Asian students’ learning strategies, shaped by cultural and educational contexts, tend toward reproduction-oriented studying aiming for verbatim recall on tests (Biemans and Van Mil 2008; Marambe et al. 2012; Vermunt 1996; Vermunt and Donche 2017). If, as a result of this orientation, learners in East Asia are less likely to engage spontaneously in self-explanation, they may receive greater benefit from prompts to self-explain than learners who are more likely to self-explain without being prompted.

Automatic Generation of Self-Explanation Prompts

In all the research we reviewed, self-explanation prompts or pre-study instructions were scripted by researchers or instructors. In most cases, the prompts were derived from the content of the learning task and were not generic statements that could be re-used with other content. Pre-scripted, instructor-generated prompts may be suitable for instructional settings in which the same readings or problems are assigned to many students over repeated courses or modules, but they cannot be delivered at scale when the content is highly differentiated or personalized. In resource-inquiry models of instruction (e.g., problem-based learning conducted in an environment with access to the internet and resource databases), students collaborate to identify information needs that are satisfied by online searches (Nesbit and Winne 2003). In such settings, computer-generation of content-specific prompts may be the most feasible method for supporting self-explanation.

Two of the authors have collaborated with computer scientists in developing and evaluating algorithms that automatically generate questions for expository texts (Lindberg et al. 2013; Odilinye et al. 2015). The premise underlying this work is that a computer-generated question can serve as a self-explanation prompt even when the computer has no way of assessing the correctness of a student’s explanation or when the answer to the question is not explicitly represented in the source text. Systems like this could be used to scaffold students’ skills in self-explaining, even if students are engaged in academic tasks where the materials they study are not controlled, such as researching material for writing an essay. If a learner model is part of such a system, the prompts could be adapted to match each learner’s subject knowledge, reading vocabulary, and self-regulatory ability.

Conclusions

Our findings have significant practical implications. The foremost is that having learners generate an explanation is often more effective than presenting them with an explanation. Another major implication for teaching and learning is that beneficial effects of inducing self-explanation seem to be available for most subject areas studied in school, and for both conceptual (declarative) and procedural knowledge. The most powerful application of self-explanation may arise after learners have made an initial explanation and then are prompted to revise it when new information highlights gaps or errors.

Research on self-explanation has arrived at a stage where the efficacy of the learning strategy is established across a range of situations. There is now a need for clearer mapping of the unique cognitive benefits self-explanation may promote, the specific effects of different types of self-explanations and prompts, and how self-explanation might be optimally combined and sequenced with other instructional features such as extended practice, explanation modeling, and inquiry-based learning. Future research that investigates these questions should be designed to directly compare different self-explanation conditions. Effective as self-explanation prompts may be, the primary goal of future research should be to identify strategies whereby prompts are faded so that self-explanation becomes fully self-regulated.