Introduction

Problem solving is an important part of daily educational practice. Students learn to solve application problems in math classes, conduct science experiments in chemistry and physics, and propose solutions for socio-scientific issues in cross-curricular projects. Problem solving is also required in students’ future professional lives, from electricians who need to fix faulty wiring to event managers who need to find a solution when a catering supplier falls ill. Learning how to engage in understanding a new problem and finding a solution is thus one of the key elements in preparing students for their future jobs and for society in general (Greiff et al., 2014; OECD, 2014).

When trying to solve a problem, students apply their knowledge to work toward a certain goal—the problem solution. For example, students in elementary school learn to solve multi-digit addition problems. To do so, they need to understand what addition entails, apply a solution strategy such as compensation, and change it if necessary to reach a solution (cf. De Jong & Ferguson-Hessler, 1996; Schoenfeld, 1979). After having solved the problem, students should accurately judge to what extent they mastered the problem-solving skill. This so-called monitoring accuracy is critical to learning, as it causally relates to students’ decision to continue or stop practicing. This decision, in turn, affects the success of their problem solving in future tasks (Başokçu & Güzel, 2022; Jacobse & Harskamp, 2012; Rinne & Mazzocco, 2014).

Up until recently, problem solving received hardly any attention in monitoring accuracy research. Instead, most studies addressed the accuracy of judgments about learning lists of items (e.g., see reviews by Rhodes, 2016; Rhodes & Tauber, 2011) or texts (summarized in meta-analyses by Prinz et al., 2020a, b). This text-based focus was expanded in the 2010s when researchers realized that interventions designed to improve monitoring accuracy in text learning cannot be directly applied to problem-solving tasks (Baars et al., 2013; Kostons et al., 2010). Problem-solving tasks differ from learning items or studying texts in that students not only need to retrieve or understand information but also need to apply this information in such a way that they correctly perform problem-solving steps to reach a solution. Furthermore, judging the complex process of applying knowledge during problem solving is difficult and studies have consistently reported that students’ monitoring accuracy in problem solving is generally poor (Başokçu & Güzel, 2022; Dentakos et al., 2019; García et al., 2016; Jacobse & Harskamp, 2012).

Bearing this theoretical and empirical evidence in mind, several researchers developed support to improve students’ knowledge about their problem-solving skills, for example, by asking them to generate explanations (Pilegard & Fiorella, 2016) or answer metacognitive questions during the task (Gidalevich & Kramarski, 2019). Other studies sought to strengthen students’ knowledge about the judgment itself by offering feedback on the accuracy of their performance and judgments (Kim, 2018; Oudman et al., 2022). The results of these studies were mixed: some interventions improved students’ monitoring accuracy (Kim, 2018; Mihalca et al., 2015; Oudman et al., 2022), whereas others did not (Gidalevich & Kramarski, 2019; Kim, 2018; Pilegard & Fiorella, 2016).

This seemingly inconsistent evidence raises questions regarding the true effect of monitoring accuracy interventions in problem solving, and the factors that contribute to their effectiveness. Both questions were addressed in the present meta-analysis, which classified different types of interventions in terms of how they target students’ judgments and investigated their relative effectiveness to improve monitoring accuracy in problem-solving tasks.

Theoretical Foundations of Monitoring Accuracy

Interventions aiming to improve monitoring accuracy are rooted in theories of metacognition. The seminal article by developmental psychologist Flavell (1979) defined metacognition as “knowledge and cognition about cognitive phenomena” (p. 906). This definition was further elaborated in the metamemory model by Nelson and Narens (1990), which specified the relations between cognitive processes, monitoring, and subsequent control in the learning context. This model consisted of two levels: the object level—which is about cognitive processes, and the meta level—which concerns metacognitive thinking and judgment processes. The meta level is informed by the object level through monitoring. Judgments on the meta level determine which control decisions are made on the object level, such as initiating changes in the learning process. So, according to this model, there is a causal connection between monitoring and control such that the accuracy of monitoring influences the quality of regulative actions undertaken by a student.

Monitoring and control processes play a central role in self-regulated learning (SRL) theories. These theories holistically describe how successful students self-regulate their learning: they set learning goals which they attempt to reach by monitoring and regulating their cognition, motivation, and behavior within the learning environment (Pintrich, 2000). To illustrate, the Zimmerman (2002) model posits that these processes take place in the phases of forethought (when self-assessment, goal setting, and planning processes occur), performance and volitional control (when students monitor and regulate learning strategies), and self-reflection (when students judge and attribute learning outcomes and set new plans for future learning). Other SRL models include similar phases with minor modifications depending on the researcher’s perspective. One SRL-model that explicitly considers cognitive monitoring is Efklides’ (2011) metacognitive and affective model of self-regulated learning (the MASRL model), which emphasizes the motivational components of metacognition and describes how motivational and affective learner characteristics interact with monitoring and control processes. Winne and Hadwin’s (1998) conditions, operations, products, evaluations, and standards (COPES) model includes factors that influence cognitive monitoring and control processes during learning, such as task conditions and standards learners use to monitor their progress. Finally, the two-process model by Gutierrez et al. (2016) focuses on task-level monitoring by asking students to give judgments after a task. They include monitoring accuracy in their model and add monitoring error as a different but inversely related process.

The general assumption in the above-mentioned models is that the causal relation between monitoring and control extends to learning outcomes. Empirical research confirmed this assumption by showing that the accuracy of monitoring predicts control decisions and that both determine the extent to which the student masters the learning content (e.g., Thiede et al., 2003, 2012). The extent to which monitoring accuracy can be influenced is thus important to the success of the learning process.

To find out how students’ monitoring accuracy can be improved, researchers have investigated what underlies students’ judgment processes. This metamemory research typically instructed students to learn lists of items, such as words or word pairs, and asked them to judge how confident they would be in their performance on a later test (Rhodes, 2019). Using experimental designs, these studies found that students do not have direct access to their memory traces and therefore use information sources within the environment as cues to monitor their learning (Koriat, 1997), such as concreteness (Tauber & Rhodes, 2012, exp 4) or fluency of the to-be-learned item (Benjamin et al., 1998). The cue-utilization theory further holds that the use of cues does not necessarily result in correct judgments; only cues that are consistent with factors that affect performance would lead to improved monitoring accuracy (De Bruin & Van Merriënboer, 2017; Koriat, 1997). Thus, monitoring accuracy can be improved by interventions that target students’ use of such appropriate cues.

Intervention Type as a Potential Moderator of Monitoring Accuracy

Insights from research on metamemory and text learning were taken as a starting point for classifying interventions aiming to improve monitoring accuracy in problem-solving tasks. A broad distinction was made between interventions that focus on the learning task and interventions that directly address metacognitive judgment (Dunlosky & Lipko, 2007). Interventions addressing the learning task aim to focus students’ attention on the cues within the learning content that are most informative for judging the extent to which this content is mastered. These interventions typically target students’ use of cues during problem solving. If students use these cues effectively, their monitoring will be more accurate (Prinz et al., 2020b; Rawson et al., 2000). These types of interventions thus have an indirect impact on monitoring accuracy. Interventions addressing metacognitive judgment, by contrast, directly target monitoring accuracy by asking students to consider the process by which they determine how well they have mastered the learning content, that is, their metacognitive judgment after completing problem-solving tasks. If students carefully consider the correctness of their metacognitive judgment process, their monitoring accuracy is assumed to improve (Dunlosky & Lipko, 2007).

Interventions Addressing the Learning Task

As learning task interventions focus on cues in the learning content that are most informative for monitoring accuracy, these interventions depend on what students need to learn. In metamemory research students have to recall certain items and therefore need to use retrieval-based cues for their monitoring. Examples of effective interventions are practice and self-testing, which stimulate students to directly retrieve the to-be-recalled items (Rhodes, 2016). Effective interventions in the realm of text learning try to help students reach a deep understanding of the meaning of the text (Prinz et al., 2020b). For example, students are asked to create concept maps during reading to encode the information (Redford et al., 2012), or generate keywords or diagrams some time after reading the text so as to also retrieve the information (De Bruin et al., 2011; Thiede et al., 2005; Van Loon et al., 2014). Such whole task interventions thus focus on helping students understand what they (do not) know about the learning content.

In problem-solving tasks, students not only have to understand information but also have to apply it to a problem situation, which requires both procedural knowledge and metacognitive knowledge (Braithwaite & Sprague, 2021; Schoenfeld, 1979). Procedural knowledge concerns the content-related strategies to solve problems. Using the math example from the introduction, the compensation strategy requires students to perform three steps: (1) round each number to the nearest 10, (2) add the rounded numbers, and (3) compensate by subtracting or adding the rounding differences to obtain the exact sum. Metacognitive knowledge concerns thinking about the selection and application of problem-solving strategies. In the math example, students could opt for alternative strategies or could make mistakes executing the steps of the compensation strategy, for example, by omitting a step or by confusing addition and subtraction in step 3. Monitoring the applied strategies and changing them when mistakes are made is vital for efficient and effective problem solving (Schoenfeld, 1979, 2015). Attention to procedural and metacognitive knowledge therefore seems fruitful for interventions aiming to improve monitoring accuracy in problem solving.
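As a purely illustrative worked example (the numbers are ours and are not taken from any included study), adding 38 and 47 with the compensation strategy proceeds as follows:

```latex
\begin{aligned}
38 + 47 \;&\approx\; 40 + 50 \;=\; 90 && \text{steps 1 and 2: round each addend to the nearest 10 and add}\\
90 - (2 + 3) \;&=\; 85 && \text{step 3: compensate for the rounding (38 was rounded up by 2, 47 by 3)}
\end{aligned}
```

A mistake in step 3, such as adding the rounding differences instead of subtracting them, would yield 95 instead of 85—exactly the kind of error that accurate monitoring should detect.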

Thus, in problem-solving tasks, interventions could take a whole task approach, similar to metamemory and text learning research. Additionally, interventions might address procedural knowledge to help students understand what they (do not) know about the content-related steps for problem solving, or might stimulate students to apply metacognitive strategies to help them consider how well they solved the problem.

Interventions Addressing Metacognitive Judgment

Interest in interventions aimed at metacognitive judgment was instigated by Dunlosky and Lipko (2007), who found that even though interventions addressing the learning task improved monitoring accuracy in text learning, the accuracy scores remained rather low. They reasoned that interventions that directly target metacognitive judgment might better support students in reaching a correct judgment. With this reasoning in mind, three types of interventions can be distinguished: interventions aimed at the timing of judgments, the availability of an external standard, and monitoring training.

Interventions that focus on the timing of judgments ask students to offer their judgment after a delay or ask them to more often judge their confidence during a learning task. Delayed judgments have mainly been applied in metamemory research. The idea behind this intervention is that the delay indirectly stimulates students to retrieve the information from memory, which consequently results in higher monitoring accuracy (Rhodes & Tauber, 2011). Asking students to more often judge their confidence has been investigated in early attempts to improve monitoring accuracy in text learning research. These judgments were elicited during a practice phase, while monitoring accuracy during the test phase was calculated to test their effectiveness (Bol et al., 2005; Nietfeld et al., 2006; Rawson & Dunlosky, 2007). Results were mixed and closer inspection of the studies revealed that monitoring accuracy improved only when students received feedback on their judgments during practice.

The reason that feedback improved monitoring accuracy when judgment frequency was increased might be that these studies offered students an external standard—a point of reference to determine the extent to which their judgments were accurate. Students generally use internal standards for performance (i.e., what they think is a correct answer) when evaluating their work (Butler & Winne, 1995). However, especially for novice students, such standards are generally of low quality, which consequently results in inaccurate monitoring (Dunning et al., 2003). Offering external standards might alleviate the adverse effects of low-quality internal standards. Indeed, several studies have shown that students who received performance feedback―i.e., the correct answer (Lipko et al., 2009; Nederhand et al., 2019)―or calibration feedback―i.e., the accuracy of their judgment (Callender et al., 2016, experiment 2; Geurten & Meulemans, 2017, experiment 1; Miller & Geraci, 2011, experiment 2)―after practice problems were more accurate in their judgments on a later test than students who did not receive such feedback.

Finally, interventions that include monitoring training have not been applied in research on monitoring accuracy in metamemory and text learning, but might be especially effective for problem-solving tasks. As students need to engage in several problem-solving steps to reach a solution, they need to take each of these steps into account when judging the correctness of their problem solution. Research into students’ performance during problem solving has shown that modeling examples effectively support them in systematically reflecting on problem-solving steps and improving their problem-solving procedures (e.g., Atkinson et al., 2000; van Gog & Rummel, 2010). Such interventions might therefore similarly support students in improving monitoring accuracy.

Previous Meta-Analyses on Interventions for Monitoring Accuracy

Previous meta-analyses on interventions aimed at improving monitoring accuracy examined the relative effectiveness of a certain type of intervention or focused on a specific type of task. Rhodes and Tauber (2011) did both by comparing delayed and immediate judgments during recall tasks. The effect of delayed judgments was robust (g = 0.93), indicating that delaying judgments improves students’ monitoring accuracy.

In a meta-analysis on text learning tasks, the positive effect of delayed judgments could not be replicated (Prinz et al., 2020a). This might be because a delay mainly targets students’ recall, but not their understanding of the information. In a follow-up meta-analysis, Prinz et al. (2020b) discovered that interventions that took a whole task approach by targeting text understanding improved monitoring accuracy by approximately half a standard deviation (g = 0.46).

Two recent meta-analyses did not specify the type of learning tasks. Gutierrez de Blume (2022) included studies incorporating a certain type of intervention, namely, learning strategy interventions. He identified a moderate effect (g = 0.57) and found that deep strategy interventions resulted in higher monitoring accuracy than superficial strategies. León et al. (2023) investigated the effects of several types of interventions. However, they did not include monitoring accuracy, but focused on the closely related field of self-assessment accuracy—that is, the accuracy of students’ assessment of “the quality of their own learning process and products” (Panadero et al., 2016, p. 804). They found a small effect (g = 0.21) with the strongest effects for external standards interventions.

These meta-analyses indicate that instructional interventions can improve monitoring accuracy during recall tasks and text learning tasks. However, it is still unclear whether and to what extent these benefits apply to problem-solving tasks. Equally unclear is whether and how this effectiveness depends on the design characteristics of the interventions used. Our meta-analysis extends these previous works by focusing on problem-solving tasks and including intervention types that target either the learning task or metacognitive judgment. Three intervention types addressed the learning task: (1) whole-task interventions that help students understand what they know about the problem, (2) interventions aimed at students’ procedural knowledge, and (3) interventions aimed at students’ metacognitive knowledge. Additionally, three intervention types addressed metacognitive judgment: (1) timing of the judgment, (2) providing external standards, and (3) training students to monitor their performance. Definitions and examples of these six intervention types are presented in Table 2.

Research Questions and Hypotheses

This meta-analysis set out to determine the overall effectiveness of monitoring accuracy interventions during problem solving (question 1) and how this effectiveness is moderated by intervention type (question 2). Based on the above literature review, we expected that monitoring accuracy interventions would have an overall positive effect (hypothesis 1), and that the magnitude of this effect would differ depending on the type of intervention used (hypothesis 2). Additionally, we explored to what extent the six types of interventions described above differed in effectiveness, and how the overall mean effect size was moderated by the substantive characteristics (school level, domain, and judgment type) and the methodological study features (research design and setting).

Method

The research questions, procedures, and analytic plan for this meta-analysis were preregistered and data were shared on the Open Science Framework (OSF; https://osf.io/u52w9/).

Literature Search

The literature search was inspired by Alexander’s (2020) guidelines to systematically find potentially relevant research articles. The timeframe was restricted to the years 2002–2022 to include recent research on monitoring accuracy in problem-solving tasks. The electronic databases of Web of Science and ERIC were used for our main search. Web of Science was selected because it offers wide access to research articles in the educational and psychology fields. ERIC was selected because of its focus on educational research and because it includes research reports and dissertations by universities or non-profit organizations.

In Web of Science, we restricted our search to the abstract field and the categories Education Educational Research, Education Scientific Disciplines, Psychology, Psychology Applied, and Psychology Educational. The database was searched using the following query: (“monitoring accuracy” OR “calibration” OR “metacognitive monitoring”) AND (“effect*” OR “enhanc*” OR “improv*” OR “increas*”). As ERIC supports neither Boolean searches with more than six keywords nor truncations (Gusenbauer & Haddaway, 2020), we conducted separate searches for each possible query and replaced the truncated terms with the most likely terms for an experimental research context―namely, effect, enhance, improve, and increase.

To extend the search results, Alexander (2020) recommends three actions: researcher checking, referential backtracking, and journal scouring. Researcher checking is the examination of the publication records of authors frequently appearing in the search results. As the first authors in our set differed widely, we performed this step by investigating all first authors’ publication records in Web of Science. We additionally perused the publication records of two co-authors with more than five papers (Van Gog, k = 11, and Paas, k = 9). Referential backtracking involves examining references within important documents, such as related reviews. We included the articles that were meta-analyzed by Gutierrez de Blume (2022) in our pool of records for screening. Finally, journal scouring refers to hand searching journals that appear regularly in the search results. In our final set of records, not one journal stood out, so we skipped this action. The literature search was performed in January 2023 and the initial set of search outcomes contained 1945 records.

Study Selection

Figure 1 shows the steps taken to include the relevant studies for this meta-analysis. As a first step, we removed duplicates, which resulted in a total of 1736 unique records. Next, titles and abstracts were screened to determine whether the records administered a quantitative (quasi-)experimental between-subject design aimed at improving the participants’ own monitoring accuracy. The 164 records that met this criterion were retrieved for more detailed information. Examples of excluded records were studies about calibration in other research fields (Goodney & Silverstein, 2013; Gorrini et al., 2018), theoretical or correlational studies (Hadwin & Webster, 2013; Hattie, 2013), studies that administered a within-subject design (Oudman et al., 2022), and studies of teachers’ monitoring accuracy (Van De Pol et al., 2021). The full texts of retrieved records were screened for detailed evaluation of the tasks participants had to perform and the measures used. Included records satisfied the following criteria:

  • Corresponded with our definition of problem-solving tasks. That is, tasks in which participants had to apply their knowledge to reach a certain goal—the problem solution. Included studies explicitly mentioned problem-solving tasks (e.g., Baars et al., 2014a; Mihalca et al., 2015); used generally known problem-solving tasks, such as the Raven’s test task (Mitchum & Kelley, 2010) or the Latin square task (Double & Birney, 2018); or administered tasks that did not explicitly mention problem solving but required problem-solving steps to reach a solution, such as a location in navigational map reading (Kok et al., 2022) or an answer to a research question in inquiry tasks (Kant et al., 2017; Morrison et al., 2015). Excluded studies were mostly about text learning (e.g., Huff & Nietfeld, 2009; Kimball et al., 2012; Thiede et al., 2011) or lacked a clear description of the learning tasks (e.g., Bingham et al., 2021; Bol et al., 2005; Morphew, 2021; Osterhage, 2021).

  • Contained at least one intervention condition that aimed to improve students’ monitoring accuracy and a control condition that was either untreated or included an intervention that did not aim to improve monitoring accuracy.

  • Reported absolute monitoring accuracy scores (see Schraw, 2009). If bias scores were reported, the corresponding author was asked to e-mail us a conversion into absolute accuracy scores.

  • Addressed typical students in formal education and adults.

  • Contained sufficient statistical information to calculate effect sizes. If this was not the case, authors were e-mailed to request the missing information.

Fig. 1 PRISMA flow chart of the search and selection of studies

The final set of 28 records consisted of 35 independent studies; their characteristics are presented in Table 1.

Table 1 Studies included in this meta-analysis

Outcome Measure

Monitoring accuracy is defined as the deviation of students’ judgments from their actual performance. The most common monitoring accuracy measures in problem-solving research are bias and absolute accuracy. Bias is calculated by subtracting actual performance scores from students’ judgment scores. Although this metric offers valuable information about the direction of students’ judgment, it does not reflect the magnitude of the judgment error. The absolute accuracy measure does, because it represents the absolute deviance between students’ judgments and their performance (Schraw, 2009). This meta-analysis therefore used the absolute accuracy scores as the outcome measure, which were calculated in the primary studies either by squaring or recoding the deviation between students’ judgments and performance.
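In formula form, one common formulation consistent with Schraw (2009) is given below, where c_i denotes the judgment and p_i the performance score on item i, both expressed on the same scale across n items:

```latex
\text{bias} \;=\; \frac{1}{n}\sum_{i=1}^{n} (c_i - p_i),
\qquad
\text{absolute accuracy} \;=\; \frac{1}{n}\sum_{i=1}^{n} (c_i - p_i)^{2}
\;\;\text{or}\;\;
\frac{1}{n}\sum_{i=1}^{n} \lvert c_i - p_i \rvert
```

Bias retains the sign of the judgment error (positive values indicate overconfidence, negative values underconfidence), whereas absolute accuracy is bounded below by zero, with smaller values indicating more accurate monitoring.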

Moderator Variables

Six moderators were included in this meta-analysis. The first moderator, intervention type, defined the instructional interventions taken to improve students’ monitoring accuracy. Table 2 offers an overview of the types of interventions included in this meta-analysis. Three intervention types focus on the learning task: whole task, procedural knowledge, and metacognitive knowledge. Intervention types targeting metacognitive judgment manipulate the timing of the judgment, offer external standard support, or give monitoring training. To determine interrater agreement on the classification of the interventions used in the included studies, two independent raters coded the interventions of all 35 studies and discussed differences until agreement was reached. Disagreements were randomly distributed across codes and the Cohen’s κ of 0.79 showed that agreement between raters was substantial (cf. Landis & Koch, 1977).

Table 2 Classification of the interventions on monitoring accuracy in problem-solving tasks

Substantive characteristics encompassed school level, domain, and judgment type. School level was used to examine whether and how students’ age moderated the findings. We distinguished between three age groups, based on the most common school systems: elementary school children (ages 5–12), secondary school adolescents (ages 13–17), and adults (age ≥ 18). Based on the tasks used in the primary studies, the problem-solving domain was classified as either mathematics, science, or reasoning. Mathematics included problem-solving tasks in mathematics education. Science comprised tasks in the exact disciplines of physics, chemistry, and biology. Reasoning involved logical, inductive, and analogical reasoning tasks. Lastly, we used an “other” category that comprised domains not included above. The moderator judgment type indicated whether the monitoring accuracy score was based on retrospective confidence judgments (RCJs) or judgments of learning (JOLs). RCJs ask students to indicate their confidence in the correctness of their performance on a completed problem-solving task, whereas JOLs ask students to make a predictive judgment of how confident they are regarding their performance on a future problem-solving task. Note that this is a conceptual distinction: in both cases, monitoring accuracy is calculated as the difference between the students’ judgment (either RCJ or JOL) and actual performance.

Methodological study features included design and setting. Design was classified as experimental or quasi-experimental. Setting involved the amount of experimental control and was classified as either classroom or laboratory experiments. Classroom experiments had low experimental control; examples are authentic classroom settings, but also cases where adults performed the problem-solving tasks in an online setting. Laboratory experiments had high experimental control; examples are experiments with individuals or small groups, and quiet settings in which students were individually tested at school. When it was unclear in which setting the study took place, it was coded as “not reported.”

Effect Size Calculation

Hedges’ g was used as effect size measure because it includes a correction for small sample sizes (Borenstein et al., 2009). Effect sizes were calculated based on the means and standard deviations of monitoring accuracy scores in the intervention condition and control condition. As more accurate monitoring implies smaller deviations, lower scores reflect higher accuracy. Such positive results are thus reflected in negative effect sizes. To avoid confusion, the effect sizes were mirrored so that a positive effect size reflects higher accuracy scores in the experimental conditions.
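For reference, a standard formulation of the small-sample-corrected standardized mean difference (cf. Borenstein et al., 2009) is sketched below; under the mirroring described above, the sign of g is flipped so that positive values indicate more accurate monitoring in the intervention condition:

```latex
g \;=\; \left(1 - \frac{3}{4(n_E + n_C - 2) - 1}\right)
\frac{\bar{X}_E - \bar{X}_C}{SD_{pooled}},
\qquad
SD_{pooled} \;=\; \sqrt{\frac{(n_E - 1)\,SD_E^{2} + (n_C - 1)\,SD_C^{2}}{n_E + n_C - 2}}
```

Here, the subscripts E and C refer to the intervention (experimental) and control conditions, the means are the mean absolute accuracy scores in each condition, and the leading factor is the small-sample correction that distinguishes Hedges’ g from Cohen’s d.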

In studies that provided both pretest and posttest monitoring accuracy scores, the mean gains and pre- and posttest standard deviations were used to calculate effect sizes (Lipsey & Wilson, 2001). In studies that included multiple outcomes (e.g., multiple difficulty levels or timepoints), these outcome measures were collapsed into a composite effect size via the formulas by Borenstein et al. (2009). When studies only presented outcomes for subgroups within each condition, the subgroups’ means and standard deviations were merged (Borenstein et al., 2009) and the effect size was calculated (Lipsey & Wilson, 2001).
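As an illustration of how subgroup statistics can be merged before computing the effect size, a minimal sketch following the standard combination formulas in Borenstein et al. (2009) is given below; the function name and the example values are ours, and this is not the exact script used for this meta-analysis:

```python
import math

def merge_subgroups(n1, m1, sd1, n2, m2, sd2):
    """Merge two reported subgroups into one group mean and standard deviation.

    The combined SD accounts for both the within-subgroup variances and the
    spread between the subgroup means (cf. Borenstein et al., 2009).
    """
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
                + (n1 * n2 / n) * (m1 - m2) ** 2) / (n - 1)
    return n, mean, math.sqrt(variance)

# Illustrative values only: two subgroups reported within one control condition.
print(merge_subgroups(20, 0.30, 0.10, 25, 0.24, 0.12))
```

The merged mean and standard deviation can then be entered into the effect size formula as if the condition had been reported as a single group.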

In studies investigating multiple interventions, a composite effect size value was calculated if the interventions were of the same type (e.g., completion problems and practice problems are both whole-task interventions), following the same procedure as studies that only presented outcomes for subgroups. If different types of interventions were investigated, the intervention that matched the study’s main purpose was selected. If this was not possible, one intervention was selected at random.

Data Analysis

Main analyses were conducted with Meta Essentials (Suurmond et al., 2017). We first analyzed the effect sizes for signs of publication bias through visual inspection of the funnel plot, Begg and Mazumdar’s (1994) rank correlation test, and Egger et al.’s (1997) regression test. Next, the overall mean effect size was calculated and a random effects model was applied to examine whether the mean effect differed significantly from zero (hypothesis 1). A Q-test was conducted to analyze whether intervention type moderated the findings (hypothesis 2); significant p-values indicate that the true effects vary between intervention types. Pairwise comparisons (Hedges & Pigott, 2004) were conducted to investigate differences in effect size between the six intervention types. Hochberg’s (1988) step-up procedure was used to control the familywise type I error at α = 0.05, and adjusted p-values are reported.
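To make the alpha correction concrete, a minimal, generic sketch of Hochberg’s (1988) step-up procedure is shown below (assuming numpy); it illustrates the correction named above and is not the authors’ own analysis code, which was run in Meta Essentials:

```python
import numpy as np

def hochberg_adjust(p_values, alpha=0.05):
    """Generic sketch of Hochberg's (1988) step-up correction.

    Returns Hochberg-adjusted p-values (in the original test order) and a
    boolean rejection decision per comparison at the given alpha level.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)          # indices that sort the p-values in ascending order
    ranked = p[order]

    # Adjusted p-value for the i-th smallest p (0-indexed) is
    # min over j >= i of (m - j) * p_(j), capped at 1; this reproduces the
    # step-up rule "reject H_(1)..H_(k) for the largest k with p_(k) <= alpha / (m - k + 1)".
    adjusted = np.empty(m)
    running_min = np.inf
    for i in range(m - 1, -1, -1):
        running_min = min(running_min, (m - i) * ranked[i])
        adjusted[i] = min(running_min, 1.0)

    # Map the adjusted p-values back to the original ordering of the tests.
    adjusted_original = np.empty(m)
    adjusted_original[order] = adjusted
    return adjusted_original, adjusted_original <= alpha

# Illustrative use with made-up p-values from five pairwise comparisons.
p_adjusted, reject = hochberg_adjust([0.001, 0.017, 0.042, 0.237, 0.968])
```

Each pairwise comparison itself can be tested with the usual z-test for the difference between two independent effect sizes, z = (g1 − g2) / sqrt(SE1² + SE2²) (cf. Borenstein et al., 2009).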

In a series of exploratory analyses, which were not included in our pre-registration, Q-tests were used to investigate whether school level, domain, judgment type, design, and setting moderated the findings. Pairwise comparisons and alpha-level corrections were conducted similarly to the moderator analysis of intervention type.

Results

Publication Bias

The funnel plot in Fig. 2 indicates a low risk of publication bias, as the studies were evenly distributed around the combined effect size. The follow-up tests statistically confirmed this symmetric distribution. The rank-correlation test showed a low and nonsignificant correlation coefficient, Kendall’s τ = 0.08, z = 0.64, p = 0.523, and in the regression test the intercept did not differ significantly from zero, β = 0.90, 95% CI [−1.29, 3.08], p = 0.411. Thus, neither test showed evidence of publication bias.

Fig. 2 Funnel plot of effect sizes

Overall Effect

The overall mean effect size (g) of the 35 included studies was 0.25, SE = 0.06, 95% CI [0.12, 0.37]. This result indicates that all interventions together had a small effect (Cohen, 1988) on monitoring accuracy. The magnitude of the overall effect size differed significantly from zero, z = 4.09, p < 0.001. The I² statistic showed that 63.62% of the variance in effect sizes reflects true score variation; the estimated variance of the true effect sizes (T²) was 0.08.
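For readers less familiar with these heterogeneity statistics, the conventional definitions are sketched below (e.g., Borenstein et al., 2009); the exact estimator implemented in Meta Essentials may differ in detail:

```latex
I^{2} \;=\; \max\!\left(0,\; \frac{Q - df}{Q}\right)\times 100\%,
\qquad
T^{2} \;=\; \max\!\left(0,\; \frac{Q - df}{C}\right),
\qquad
C \;=\; \sum_{i} w_i - \frac{\sum_{i} w_i^{2}}{\sum_{i} w_i}
```

Here, Q is the weighted sum of squared deviations of the individual effect sizes from the fixed-effect mean, df = k − 1 for k studies, and the w_i are the inverse-variance study weights.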

Moderator Analysis: Intervention Type

The heterogeneity of effect sizes was scrutinized through moderator analysis. As shown in Table 3, intervention type significantly moderated the findings. The 95% confidence intervals indicate that the intervention types whole task, metacognitive knowledge, and external standards all significantly improved students’ monitoring accuracy. On the other hand, the timing interventions significantly decreased students’ monitoring accuracy. The other interventions did not significantly impact monitoring accuracy. Pairwise comparisons of all intervention types—except the mixed interventions—showed that timing had a significantly lower effect size than all the other intervention types, with values ranging from z = −4.56, p < 0.001 for whole task to z = −2.76, p = 0.017 for training. The remaining intervention types did not differ significantly in their effectiveness, with p-values ranging from 0.237 to 0.968.

Table 3 Summary of effect sizes per moderator variable category

Exploratory Moderator Analyses: Substantive and Methodological Study Features

As shown in Table 3, the interventions significantly and positively impacted monitoring accuracy in all study-feature categories, except the learning domains “reasoning” and “other,” and studies that did not report their setting. Results of the Q-tests show that setting was the only study feature that significantly moderated the findings. However, a nonsignificant p-value does not necessarily prove that the effect sizes are consistent; a lack of significance may be due to low power because of the limited number of studies (Borenstein et al., 2009). We therefore explored differences between moderator categories for each moderator variable by means of pairwise comparisons, with the exception of the moderators domain and design, due to similarities in confidence intervals and the 0.00% score on I².

Regarding students’ school level, interventions were generally less effective for secondary school students than for elementary school students, z = −4.75, p < 0.001, and adults, z = −2.14, p = 0.049, while differences between elementary school students and adults were not significant, z = 1.28, p = 0.201. Interventions were more effective for retrospective confidence judgments than judgments of learning, z = 2.03, p = 0.042, and studies conducted in laboratory settings had significantly higher effect sizes than classroom studies, z = 4.98, p < 0.001.

Discussion

This research built on prior meta-analyses by examining the effectiveness of interventions aimed at improving monitoring accuracy in the context of problem solving. In line with the first hypothesis, the interventions significantly impacted monitoring accuracy, and there was little evidence of publication bias. However, with g = 0.25 this effect was small, especially compared to effects found in other meta-analyses on monitoring accuracy interventions (e.g., Gutierrez de Blume, 2022: g = 0.57; Prinz et al., 2020b: g = 0.46). Furthermore, not every intervention type positively and significantly affected monitoring accuracy. Interventions addressing the whole task, metacognitive knowledge, and external standards had a positive impact; procedural knowledge, mixed, and training interventions had no significant effect; and timing interventions negatively impacted monitoring accuracy. In line with hypothesis 2, moderator analysis showed that effects varied between intervention types. Timing interventions had a negative effect on monitoring accuracy and differed significantly from all other intervention types. Exploratory moderator analyses revealed effects of study features: school level, judgment type, and setting warrant consideration in future research.

The relatively small overall effect of the interventions on monitoring accuracy might be due to the large variation in true effect sizes. This is most apparent in the difference between timing and the other intervention types, although variations within and between other intervention types were considerable, too. Another reason for the small effect could be related to the complexity of the tasks in which the interventions were implemented. Previous meta-analyses mainly included recall and text learning tasks that do not require students to apply information to a certain situation. Problem solving does, and therefore additionally draws on procedural and metacognitive knowledge (Braithwaite & Sprague, 2021; Schoenfeld, 1979). Combining and applying these knowledge components places considerable demands on students’ working memory (Cornoldi et al., 2015; Sweller, 2023; Sweller et al., 1998) and might have reduced the impact of the interventions. Future research should investigate whether this is indeed the case and consider other possible factors that might have influenced the effect size of the interventions.

The reason that timing stood out compared to the other interventions might also be related to the complexity of problem-solving tasks. Previous meta-analyses already showed that students benefit from timing interventions in recall tasks (Rhodes & Tauber, 2011), but not in text learning (Prinz et al., 2020a). This meta-analysis adds that the effect even reverses in problem-solving tasks. The two studies that included this intervention type differed in how they changed the timing of judgments. Baars, Van Gog et al. (2018b) asked elementary school students to judge their math performance after a time delay, while Double and Birney (2018) asked adults to more often judge their confidence during the Latin square task. The authors of both studies reasoned that their intervention kept participants from using experience-based cues. Specifically, a time delay might have made it difficult to use the cues that were most salient during problem solving (Baars et al., 2018b), while increasing judgment frequency focused participants’ attention on their existing confidence-related beliefs. The latter was confirmed by a follow-up analysis showing that participants in the intervention condition relied more on judgments they made prospectively than the control participants did (Double & Birney, 2018). It thus seems that, although studies are few, timing interventions do not draw students’ attention to their problem-solving experiences and therefore negatively impact monitoring accuracy.

Several intervention types had no significant impact on monitoring accuracy. The monitoring training interventions might have been ineffective because they focused on monitoring the problem-solving steps without offering an external standard against which students could evaluate their knowledge (Dunlosky & Lipko, 2007). Procedural knowledge interventions were included in three studies. Nietfeld and Schraw (2002, exp 2) offered university students a strategy training for probability problems and found this intervention to be effective, whereas Baars et al. (2018a, 2018b) found no effects of self-explaining the steps in solving biology problems in two studies with secondary school students. The sole focus on explaining problem-solving steps might not sufficiently improve monitoring accuracy because students also needed to consider their conceptual understanding and metacognitive thinking (Braithwaite & Sprague, 2021; Schoenfeld, 1979). For mixed interventions it is difficult to draw conclusions, as these interventions differed in many ways.

Another reason why the above-mentioned interventions are seemingly ineffective for improving monitoring accuracy might be related to the students’ school level. In all these intervention types, at least two-thirds of the studies were conducted in secondary education, and in line with the results of this moderator variable, these studies lowered the overall effect size of the intervention types. Given the limited number of studies on these intervention types, with the majority conducted in secondary schools, future research should further investigate their effectiveness in different contexts, starting with primary school students and adults.

The finding that interventions were generally less effective for secondary school students compared to primary school students and adults offers new insight into the effectiveness of monitoring accuracy interventions. Until now, only the meta-analysis by Gutierrez de Blume (2022) took age into account. He found that interventions were more effective for adults than for children. A reason for the lower effect sizes in secondary school in this meta-analysis might be related to students’ development in motivational beliefs. Classical research on students’ motivation shows a decline in school-related motivational beliefs during adolescence (e.g., Eccles et al., 1993; Wigfield et al., 1991), and recent research confirms that secondary school students with low motivation learn less from interventions aimed at improving monitoring accuracy than students with moderate or high motivation (Wijnia & Baars, 2021). Thus, the general decline in motivation during the secondary school years might have resulted in lower effects of monitoring accuracy interventions. Still, considering the limited number of studies and nonsignificant Q-statistics, future research should first replicate the present findings with a larger set of studies. As such studies are not yet available in the problem-solving field, the impact of school level should be tested in related fields that include interventions on monitoring accuracy, such as text learning.

Although setting made a difference in that laboratory studies showed higher effects than classroom studies, it should be noted that interventions in classroom studies also had a significantly positive impact on monitoring accuracy. This is in line with the meta-analysis by Gutierrez de Blume (2022) and suggests that despite these more “messy” settings, such interventions can be recommended for use in authentic classrooms. Regarding judgment type, pairwise comparisons revealed that interventions had a greater effect on retrospective confidence judgments than judgments of learning. This might be because retrospective confidence judgments ask students to recall their performance on a task they just completed, while judgments of learning are prospective and therefore require students to predict an uncertain future (Dougherty et al., 2005). Future research should investigate how to optimize monitoring accuracy interventions in authentic classrooms for both retrospective and prospective judgments.

Limitations

With 35 studies included in this meta-analysis, the number of studies that aim to improve monitoring accuracy in problem solving is relatively low. This number could have been higher if all retrieved studies had reported sufficient information to calculate effect sizes. Ten studies lacked this statistical information, and the authors of only four of these studies sent us the requested information. The others did not respond or did not have the required data. It is thus important for future authors, as well as reviewers and editors, to ensure that sufficient statistical information is reported when studies are published so that they can be included in the growing body of meta-analyses in educational psychology.

The low number of studies within some intervention types has consequences for the strength of the conclusions that can be drawn. A good example is that timing included only two studies. Although it was clearly found that timing interventions did not improve monitoring accuracy, with only two studies there is a higher chance that other study-related factors influenced the results. A related consequence of the low number of studies was that the Q-statistics were not significant for the moderators school level and judgment type, while some of the pairwise comparisons were. As the pairwise comparisons and Q-tests are different tests, it is reasonable to anticipate different results, especially given the Q-test’s sensitivity to low power (Borenstein et al., 2009). To ensure no false positives were reported, Hochberg’s alpha correction procedure was used. However, as each study has unique contextual features, such as the sample, materials, and procedure, caution is needed when generalizing these results to educational practice. Nevertheless, these moderator analyses offer sufficient starting points for future research on interventions for monitoring accuracy.

Implications

This meta-analysis was conducted in the context of theories that posit that SRL is beneficial to students’ long-term learning and development (Dignath et al., 2023). Early SRL interventions aimed to improve students’ regulation during all phases of SRL models (e.g., De Boer et al., 2014; Dignath & Büttner, 2008; Dignath et al., 2008), while more recently, meta-analyses addressed interventions on monitoring and their effectiveness for students’ learning outcomes (Dignath et al., 2023; Guo, 2022a, b). The current meta-analysis is even more specific, as it contributes to SRL research by investigating the accuracy of students’ monitoring and demonstrating that this accuracy can be improved.

Specifically, the results of this meta-analysis show that interventions addressing the whole task, metacognitive knowledge, and external standards positively impact monitoring accuracy. These results are largely consistent with previous meta-analyses (Gutierrez de Blume, 2022; León et al., 2023; Prinz et al., 2020b) and suggest that the effectiveness of these three interventions generalizes across tasks and domains. To better understand the mechanisms underlying these successful interventions, future research should investigate whether different design configurations lead to differential effectiveness. For example, a good starting point might be to investigate whether an intervention type should offer support on all steps of the problem-solving process or whether it should only support the final step—i.e., the solution.

Theoretical implications relate to the incorporation and support of monitoring accuracy in models of (meta)memory and SRL. Our results indicate that monitoring accuracy can be improved and that successful interventions give students insight into their cognitive and metacognitive processes during problem solving and support the monitoring of their performance. Monitoring accuracy can thus be improved by providing support on each component of Nelson and Narens’ (1990) metamemory model, which suggests that this descriptive model can be used as a normative framework for research and development of successful instructional interventions. Our results substantiate the pivotal role of standards in the COPES model (Winne & Hadwin, 1998) by showing that monitoring accuracy improves if students evaluate their performance against an offered, and hence appropriate, benchmark. Finally, implications for the monitoring model of Gutierrez et al. (2016) are less straightforward. In line with their conclusion of a domain-general monitoring model, our moderator analysis found that the effectiveness of interventions was independent of the task domain. However, we could not find direct evidence for the domain-general notion that older individuals would be more proficient in monitoring, as adults benefitted as much from the interventions as children did. These results suggest that monitoring accuracy with additional support differs from unsupported monitoring and that successful interventions might compensate for age-related differences in monitoring accuracy. Future research should investigate both suggestions.

For educators, this meta-analysis provides evidence to suggest the use of interventions to improve students’ monitoring accuracy in problem-solving tasks. Specifically, interventions addressing the whole task, metacognitive knowledge, and external standards are recommended for problem solving given their positive impact on monitoring accuracy. On the other hand, as timing adversely affected students’ monitoring accuracy, educators should refrain from implementing this intervention until further research is conducted to ascertain its effectiveness.

To conclude, the current meta-analysis contributes to the growing body of evidence addressing monitoring accuracy in the context of SRL. Several factors not included in this meta-analysis might be of interest to future meta-analyses on monitoring accuracy, for example, the effort students (are willing to) put into their learning (De Bruin et al., 2020; Van Gog et al., 2020) or task characteristics (cf. Prinz et al., 2020a), such as the number of problem-solving steps or problem-solving performance during practice tasks. The studies that include such factors are few but promising (Baars et al., 2014b; Kostons et al., 2012; Mihalca et al., 2017). When more studies are available, these could give a more comprehensive understanding of how to best support students in improving their monitoring of problem solving, their regulatory decisions, and consequently the overall efficiency of learning problem-solving skills.