Metacomprehension describes thoughts and ideas about reading comprehension and includes monitoring and control of the comprehension process (Dunlosky & Lipko, 2007; Dunlosky et al., 2005; Maki & Berry, 1984). A common metacognitive framework demonstrates that a person can monitor or evaluate their basic cognitions then, based on this evaluation, control the basic cognitive task (Nelson & Narens, 1990). Using metacomprehension as an example, a person may read a paragraph and realize that they lost focus and do not comprehend a text (monitoring), then reread that section of the text (controlling). Metacomprehension benefits all readers, but is particularly important for students tested on expository readings (Dunlosky & Lipko, 2007) and those who read for work because a misperception of comprehension could result in poor performance on exams or projects.

Efficient readers tend to adjust reading behaviors to reach learning goals, which requires them to have accurate judgments of their comprehension (Dunlosky et al., 2005; Efklides, 2014; Schunk & Zimmerman, 1998). During this process, called self-regulated learning (SRL), an individual will plan and set study goals and then monitor their learning to assess whether they have met their goals. As monitoring learning is an integral part of SRL, students should be accurate in their judgments for successful studying (Lee et al., 2010). However, metacognitive monitoring accuracy tends to be poor, leading many students to struggle with SRL (Lee et al., 2010). Specifically, relative accuracy, or the ability to distinguish between what is learned and not learned, helps learners evaluate what to spend time studying, a crucial part of SRL (Dunlosky et al., 2005; Wiley et al., 2016). For example, if a student is studying for an exam, they might determine that chapter two is learned well and does not need further studying, but that chapter four is not well-learned and they must study it further in order to be successful on an exam. If that judgment is correct, then the student demonstrates high relative accuracy. To measure relative accuracy in the context of metacomprehension, a person is typically given multiple texts to read, and then asked to predict their future performance on each text on some scale predetermined by the experimenter. This prediction magnitude is compared to actual test performance using a gamma correlation. When relative accuracy (gamma correlation) is high, students tend to study efficiently by, for example, allocating more time for passages that are challenging but within their reach (Kornell & Metcalfe, 2006; Metcalfe, 2009).

Relative accuracy is typically low, but implementing metacognitive strategies may lead to significant improvements. Without any strategy, students’ mean relative accuracy is a gamma correlation of 0.27 (Dunlosky & Lipko, 2007). Although this correlation is significantly greater than zero, there is still much room for improvement. Interventions can greatly improve relative accuracy, especially when there is a generative component; for example, generating keywords after a short delay (de Bruin et al., 2011; Thiede et al., 2003, 2005), mental self-explanation of the text during reading (Griffin et al., 2008), and delayed written summaries (Anderson & Thiede, 2008; Fukaya, 2013; Thiede & Anderson, 2003) all seem to increase relative accuracy. For each of these interventions, participants’ relative accuracy rose to a gamma correlation of approximately 0.6. It is generally believed that these interventions increase accuracy by bringing attention to the situation model, which is a gist-based mental representation of the text (Kintsch, 1998) that helps participants better monitor their comprehension (Anderson & Thiede, 2008). The purpose of the present study was to further explore the benefit of written summarization to relative accuracy by comparing it to oral summarization. We reasoned that summary modality may affect the cue basis for comprehension judgments. For example, oral summaries typically have more inference-based ideas (Anderson & Thiede, 2008; Kellogg, 2007), which might lead the situation model to be especially salient in the oral condition, benefitting relative accuracy. The use of judgment cues and how they may differ between summary modalities is discussed next.

Cue utilization

Individuals use their experiences with the material to predict their comprehension; thus, inequitable summarizing experiences between speakers and writers may alter the cues that individuals might use to make judgments. The cue-utilization hypothesis states that people cannot directly judge the accuracy of their cognitions, and therefore must use cues, or heuristics, to estimate the information they know (Koriat, 1997). For example, a person might estimate they will score well on a comprehension exam because they enjoy the topic, or because they read the information multiple times. Certain cues allow for better predictions, particularly those that represent a deeper understanding of the material (Griffin et al., 2008) and are referred to as having higher cue validity. Using the earlier example, if a person thinks they will pass a comprehension test because they enjoy the topic but they do not pass, then topic enjoyment is not a valid judgment cue. The cue utilization hypothesis provides a framework to describe why oral and written summaries may contribute to potential differences in prediction accuracy. The accuracy of oral summaries may depend on which cues the participants are using in their judgment; by comparing two similar but distinct metacognitive strategies, we can start to uncover when students use certain cues, and whether the use of cues is influenced by summary modality. For example, if oral summaries are longer and associated with less accurate metacomprehension judgments compared to written summaries, we might conclude that oral summarizers were basing their judgments on less valid cues, such as summary length. There are two cues already well described in the literature that may differ between oral and written summaries: the accessibility of information at retrieval and the situation model. Theoretical and empirical accounts for each will be described in turn.

The accessibility hypothesis explains how individuals use the amount of information they recall from a task as a cue to make prediction judgments. While making a judgment, recalling a large quantity of information tends to increase an individual’s confidence in their knowledge, and therefore increases prediction magnitude (Koriat, 1993, 1997). Importantly, higher prediction magnitude does not equate to higher metacognitive accuracy (Baker & Dunlosky, 2006; Koriat, 1993; Maki et al., 2009; Morris, 1990). Accessibility of information can be a valid cue if the retrieved information is both correct and relevant, but incorrect, repetitive, or irrelevant information may erroneously inflate prediction magnitude. The accessibility hypothesis predicts that a high word count or number of total ideas during a summary will lead to a high prediction magnitude (Maki et al., 2009) but not necessarily high accuracy.

The situation model, or a gist-based mental representation of the text (Kintsch, 1998), is another cue used to judge comprehension (Anderson & Thiede, 2008; Fukaya, 2013; Thiede & Anderson, 2003; Thiede et al., 2005). The situation model connects information within the text to previous knowledge about the topic. This deep level of processing is resistant to forgetting and therefore considered a valid cue on which to base judgments of comprehension (Kintsch et al., 1990; Thiede et al., 2005). The situation model may be related to summary length, but summary length includes gist-based ideas and details as well as distorted or irrelevant information (Anderson & Thiede, 2008). Therefore, although the accessibility of information and the situation model are related, they are distinct concepts, and not equally valid judgment cues. Although evidence suggests that delayed summaries increase salience of the situation model, which improves relative accuracy (Anderson & Thiede, 2008; Thiede & Anderson, 2003), the situation model is not always the most salient cue. For instance, students are more likely to use other cues like the accessibility of information, or surface level cues, like the readability of the text. This is especially true when students make judgments without using a metacognitive intervention such as delayed summaries, as these interventions may work in part by bringing attention to the situation model (Anderson & Thiede, 2008; Thiede et al., 2010).

Summarization modality and comprehension

The ability to summarize a text depends on one’s ability to comprehend it (Alterman, 1991), which is likely why summarizing increases metacomprehension accuracy. There are generally two types of summaries that students construct: one more surface level, and one that is deeper, connecting it to past information, representative of the situation model. College students tend to draw from their prior knowledge when summarizing, therefore tapping the situation model (Leon et al., 2006). When creating a summary, individuals typically form a group of key ideas that represent the main ideas of the text (León & Escudero, 2015). Summarizing is more complicated than text comprehension, as this process includes identifying and generalizing the most important ideas, and conveying them in a coherent and concise manner (León & Escudero, 2015). Being able to construct a good summary is an active process rather than a passive one. Although this core process is similar in oral and written modalities, there are some processes that differ between the two.

Written and oral summarizing are similar in that both involve condensing texts to their key components in a cohesive manner, and result in similar output quality (Hidi & Hildyard, 1983; Scardamalia et al., 1982). However, studies have shown that several summary characteristics can differ between modalities. For example, oral production (recall, summaries, narratives) tends to have more idea units and a higher word count, yet take less time (Hidi & Hildyard, 1983; Kellogg, 2007; Vieiro & García-Madruga, 1997). Oral summaries have also been shown to include more gist-based ideas, whereas written summaries tend to be more verbatim (Vieiro & García-Madruga, 1997). Written output tends to be slightly more cohesive, which is believed to be a result of being able to pause during the writing process and assess what is already written (Hidi & Hildyard, 1983). Additionally, there are potential differences in the demand placed on working memory during oral and written summarizing. First, while speaking has a small motor component, writing has a much larger motor component, which must therefore be considered in the working memory process (Kellogg, 2007). Furthermore, although both modalities include the macrostructures of language (e.g. generating speech), spoken summaries should tax working memory less, as some microstructures, such as spelling, are less prominent during speaking (Vanderber & Swanson, 2007). On the other hand, oral summarizers must rely on their memory of what they have already said. Because written summarizers can reread what they wrote, they may have less strain on their working memories.

In the metacomprehension literature, oral and written summaries seem to differ in ways that affect the situation model and the accessibility of information. The oral modality tends to produce a greater number of inference or gist-based ideas (Kellogg, 2007; Vieiro & García-Madruga, 1997), an indicator of a strong situation model (Anderson & Thiede, 2008), which may increase metacomprehension accuracy. Unfortunately, oral production may also contain higher levels of distortions and more idea units in general (Hidi & Hildyard, 1983; Kellogg, 2007), and therefore have a higher word count. When word count is driven in part by distortions, the accessibility hypothesis would predict that individuals inflate prediction judgments, but have potentially lower accuracy because summary length is not always a valid cue (Dunlosky et al., 2005b; Koriat, 1993). The accessibility of information is only a valid cue when the amount of information recalled is both correct and relevant. Another issue with oral summarization revolves around retrieval fluency as a judgment cue. Retrieval fluency, or the ease with which a person retrieves information, is closely related to response time (Bjork et al., 2013), and is associated with higher prediction magnitude but not necessarily comprehension (Benjamin et al., 1998). Because oral production takes much less time than written production (Kellogg, 2007), it could lead to lower metacomprehension accuracy if retrieval fluency is used as a cue.

Although students may prefer the oral modality because it is faster and feels easier (Kellogg, 2007; McPhee et al., 2014), it may not afford the same metacomprehension accuracy as written summarization. If oral summaries lead to worse metacognitive outcomes, we can still gain vital information about why the outcome was worse. During oral summarization, participants may use heuristics that sometimes work but are not valid cues in this scenario (such as length as a cue), due to oral summaries generally having more distortions. Alternatively, there might be explanations outside of cue use that contribute to group differences. For example, unlike written summaries, oral summaries cannot be reread. If participants are unable to review their summaries, they lose an additional opportunity to evaluate their knowledge and improve their metacomprehension accuracy.

Perceived cognitive load differences between summarizing modality

Summary modality may also affect perceived cognitive load, which may be another cue affecting judgments and therefore judgment accuracy. Kellogg (2007) proposed that speaking and writing differ in terms of working memory demand, which can influence cognitive load, defined as the amount of cognitive resources required to complete a task (Chandler & Sweller, 1991). Bourdin and Fayol (1994) proposed that writing may tax working memory more than speaking. This may primarily be because writing requires spelling, whereas speaking, obviously, does not (Vanderber & Swanson, 2007). Furthermore, the writing process is slower than speaking, so text representations may remain in working memory for longer, using more resources (Kellogg, 2007). Currently there is evidence that writing taxes working memory more than speaking for both children (Bourdin & Fayol, 1994) and adults (Grabowski, 2010). There is also evidence that participants prefer to speak because writing is more effortful (McPhee et al., 2014). For these reasons, it is possible that perceived cognitive load may be higher when summarizing in writing, and act as another cue that could influence comprehension judgments and group differences.

The way in which perceived cognitive load may affect metacomprehension accuracy is unclear. For example, when a task is experienced as more difficult, students are less confident in their judgments (Maki et al., 2005, although see Moore et al., 2005 for counter evidence), which could decrease overconfidence, making judgments more accurate. However, one study found that written summaries increased cognitive load in comparison to a control group without increasing relative accuracy (Reid et al., 2017), but there was no oral summarization condition, so no modality comparisons were made. Even if perceived cognitive load does not act as a judgment cue, or is not a major explanatory one, the difference in perceived cognitive load between conditions is still worth measuring. If metacomprehension monitoring accuracy is equivalent between summarizing orally and in writing, but one modality leads to lower perceived cognitive load, then summarizing in that modality would have an obvious study benefit: students should summarize in the “easier” modality if the harder one has no metacomprehension benefits. Another reason it is difficult to predict the effect of perceived cognitive load on metacomprehension judgment accuracy is that cognitive load is not a unitary construct. Paas and colleagues (2003) conceptualized cognitive load as having three separate components: intrinsic, extraneous, and germane. These components describe, respectively, how the actual difficulty of the task, the presentation or environment of the information, and motivation all influence the perceived effort required. For example, a tricky puzzle would increase intrinsic cognitive load, but trying to complete it in a noisy café would lead to high extraneous cognitive load, and a love of puzzles would increase motivation and decrease germane load (Paas et al., 2003). Summarizing modality could affect one or more of these three components, so all three were measured.

The present study

The delayed summary technique has been found to increase metacomprehension accuracy (Anderson & Thiede, 2008), but is summarizing in one modality superior to the other? Both written (Anderson & Thiede, 2008; Maki et al., 2009; Reid et al., 2017; Thiede & Anderson, 2003) and oral summaries (Baker & Dunlosky, 2006; Fukaya, 2013; Fulton, 2021) have been used in past metacomprehension research but, to our knowledge, they have never been compared in the context of metacomprehension monitoring accuracy. The present study was conducted to address this gap in the research literature. With this study, we can learn (a) whether oral summaries seem to improve relative accuracy compared to written summaries and (b) whether summary characteristics drive potential differences in relative accuracy. Both the situation model and the accessibility of information were expected to differ between spoken and written summaries, which could influence which cues were available or salient and thus the prediction magnitude and the accuracy of the predictions. Comparing the accuracy for three conditions (written, oral, and no summary), as well as their summary characteristics, informs metacognitive theory and could elucidate which summarization type might be most useful for students while studying. We included a control condition to ensure that the delayed summarization manipulation reliably improves metacognitive accuracy. If neither summary condition increases accuracy more than the control condition, then delayed summarization loses its practical and possibly theoretical implications.

It is important to note that the cues measured in the study do not encompass all cues that a person can use, and that a person can use multiple cues to make their judgments (Morris, 1990; Undorf et al., 2018). We focused on cues that past literature suggested might change with summary modality, but we acknowledge that other cues, such as familiarity or interest in the topic, may play an important role in prediction magnitude (Koriat, 1997; Thiede et al., 2010). We also note that because multiple cues can be used simultaneously, we are not pitting the accessibility hypothesis directly against the situation model hypothesis, per se. We assessed relative reliance on each cue first by correlating each participant’s cues with their predictions; the cue(s) with the strongest relationship to prediction magnitude were considered the most salient to the participants. Next, we assessed which cue had the highest validity by correlating cues with multiple-choice accuracy. We believed that summary modality would likely alter the cues in the summaries (e.g. amount of information recalled), and that the salience and validity of each cue might differ between modalities.

  • Hypothesis 1: We predicted that the three groups would (a) differ in their prediction magnitude, (b) not differ in their comprehension, but (c) differ in their relative accuracy.

Written summaries generally increase metacomprehension accuracy, likely due to the increased attention to the situation model (Anderson & Thiede, 2008; Thiede & Anderson, 2003). Because oral summaries tend to have more valid cues (gist-based ideas) and more invalid cues (more total words, faster summary time; Kellogg, 2007; Vieiro & García-Madruga, 1997), it was unclear how oral summarizing would impact metacomprehension accuracy. Thus, hypotheses did not specify whether oral summaries would be associated with more or less accurate judgments compared to written summaries, just that there would be a difference.

  • Hypothesis 2: Each summary characteristic was expected to differ between conditions.

We were interested in length (word count), the situation model (latent semantic analysis), and retrieval fluency (latency to begin summarizing and total time). Past research suggests higher word count and latent semantic analysis scores in the oral condition (Kellogg, 2007; Vieiro & García-Madruga, 1997), and higher summary time and latency to begin summarizing in the written condition (Kellogg, 2007). Although these predictions are directional, we planned two-tailed tests because we felt the presence of any differences was more important than the direction of that difference.

  • Hypothesis 3: (a) We expected summary characteristics (word count, latent semantic analysis, latency to begin summarizing, and total summary time) to relate to prediction magnitude and (b) comprehension. (c) Further, it was hypothesized that these relationships would differ between cues, and by condition.

In order to assess cue use and cue validity, intra-individual gamma correlations were calculated for each participant, which correlated summary characteristics and prediction magnitude, as well as summary characteristics and multiple-choice scores (Anderson & Thiede, 2008; Maki et al., 2009). Higher correlations represent higher cue use (when correlated with prediction) and cue validity (when correlated with comprehension). Differences between modalities can suggest which cues are used to a greater extent, or are more valid, in one condition compared to another.

  • Hypothesis 4: We expected perceived cognitive load to differ between groups.

Because working memory demand will likely increase the most in the written condition (Kellogg, 2007), it was predicted that the control group would exhibit the lowest levels of perceived cognitive load, and the written summary group would exhibit the highest levels of perceived cognitive load. However, we planned two-tailed tests because we felt the presence of any differences was more important than the direction of that difference.

This hypothesis was exploratory, as our main focus was on summary modality and differences in cue use. Although perceived mental effort can certainly act as a cue, the timing of the cognitive load measure makes it unclear whether this measure reflects perceptions of load during the summarization process, per se. This was an intentional design choice to prevent artificially increasing the salience of cognitive load before making prediction judgments. We believe our cognitive load measure has merit in the current study, but results should be viewed as preliminary evidence.

Methods

Participants

To estimate the target sample size, we used an effect size of d = 0.45, derived from a study which assessed modality differences in recall ability (Putnam & Roediger, 2013); the current study required 95 participants at 0.80 power (Bausell & Li, 2002). Overall, 116 individuals over the age of 18 participated in this study, recruited from the SONA system at Idaho State University. Students were compensated with course credits. Ten people who did not speak English as their first language were excluded. Three additional participants were excluded for not following directions, and one was excluded for making uniform predictions, so a gamma correlation could not be calculated. It should be noted that there was a problem in the original audio recordings. Unfortunately, the oral summaries could not be transcribed accurately, so the oral condition was completely replaced. Both the original and new samples were similar in their demographics and performance. There were 102 participants included in the final analysis. There were 35 participants in the written condition, 34 participants in the oral condition, and 33 in the control condition.

The sample was primarily white (91.2 %) and female (70.6 %). Of our sample, 11.8 % identified as Hispanic. The mean age of the sample was 21.76 (SD = 5.20) years, and most participants were early in their college career, with 68.6 % in their first or second year.

Design

This experiment used a one-factor, between-subjects design. Participants were randomly assigned to one of three groups, an oral summary group, a written summary group, or a no summary control group.

Materials

Eprime 3.0 (Psychology Software Tools; www.pstnet.com) was used to display each of the texts in this study. We used six texts that have been used in similar metacomprehension experiments (Fulton, 2021; Rawson & Dunlosky, 2002). These texts come from the Scholastic Aptitude Test (Board, 1997) and are at a Flesch-Kincaid grade-level of 9.8–12.0 (M = 11.6). Each text was between 337 and 398 words. The titles of the texts are: Television Newscast, Precision of Science, Women in the Workplace, Zoo Habitats, American Indians, and Real vs. Fake Art (see Appendix A for sample). There were eight multiple choice questions for each text. Four could be answered from information that was explicitly stated in the text and four required making inferences from the text. Inference-based questions best assess comprehension of the situation model.

Additionally, Eprime recorded prediction and postdiction judgments, the multiple-choice answers, latency to begin summarizing and summary time for the oral and written conditions, and the typed summaries for the written condition. The oral condition summaries were recorded with Audacity (http://audacity.sourceforge.net/) and transcribed for analysis. Qualtrics was used to administer the demographic questionnaire, as well as an exploratory cognitive load survey. The cognitive load survey has been validated and shows strong reliability (Cronbach’s α = 0.81, Klepsch et al., 2017). To score the cognitive load survey, participant answers were averaged across a 7-point Likert-type scale for each subscale (intrinsic, extraneous, germane).
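The subscale scoring described above can be illustrated with a minimal sketch. The item names and their assignment to subscales below are hypothetical placeholders, since the text does not list the questionnaire items; only the averaging of 7-point ratings within each subscale is taken from the description.

```python
from statistics import mean

# Hypothetical item-to-subscale mapping (an assumption for illustration);
# the actual item assignments come from the published questionnaire.
SUBSCALE_ITEMS = {
    "intrinsic": ["item1", "item2"],
    "extraneous": ["item3", "item4", "item5"],
    "germane": ["item6", "item7"],
}


def score_cognitive_load(responses):
    """Average one participant's 7-point ratings within each subscale."""
    scores = {}
    for subscale, items in SUBSCALE_ITEMS.items():
        ratings = [responses[item] for item in items]
        # Ratings outside the 1-7 scale indicate a data-entry problem.
        if any(not 1 <= rating <= 7 for rating in ratings):
            raise ValueError(f"rating outside the 1-7 scale in {subscale}")
        scores[subscale] = mean(ratings)
    return scores
```

Each participant then contributes three scores, one per subscale, which can be compared across conditions.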

Procedure

Participants first read each of the six texts in a random order. The texts were displayed so that only one sentence appeared at a time to control for rereading, as rereading can lead to an increase in accuracy (Rawson et al., 2000). The reading task had no time limit, but the time that it took to read each text was recorded with Eprime. A two-minute word search was presented to the oral and written conditions after the texts to ensure that they were summarizing at a delay and not rehearsing the information that they read. After the two-minute delay, participants in the oral and written conditions were asked to summarize one text at a time. The title of each text appeared as a prompt for them to begin summarizing. After each summary was completed, a prediction question was presented, which asked, “How many questions out of eight do you think you will answer correctly about this passage?” A key press presented the next title for them to summarize. For both conditions, the summary order did not necessarily match reading order; both reading order and summarizing order were randomized. Participants were told to read carefully for a future test, but they were not informed of the nature of the test.

The control condition did not summarize the texts. The control participants read the texts as in the experimental conditions but were given a 15-minute word search as an easy distraction task in place of generating summaries. The distraction task prevented these participants from rehearsing information from the passages, which could influence their comprehension and metacomprehension. After spending 15 min on the word search, the control group made their multiple-choice comprehension predictions. Participants were shown the title of the text and asked to predict their multiple-choice performance, as in the two experimental conditions.

After completing predictions, all participants completed the multiple-choice comprehension test. The test was composed of eight questions for each text. Once they finished each set of questions, participants took the cognitive load questionnaire. Finally, they completed a demographic survey and were debriefed about the study. See Fig. 1 for a procedural summary.

Fig. 1 Summary of methods

Data analysis

The summaries were measured on four dimensions: length, situation model, latency, and total time. Length was measured by a word count. The situation model was measured using a technique called latent semantic analysis (LSA; http://lsa.colorado.edu/; Landauer, 1998). Latency to begin summarizing and total time were measured using Eprime. To measure latency, participants were instructed to remain on a screen until they were ready to summarize. The time that they spent on this screen was considered latency to begin summarizing. Total time was measured as the time from the presentation of the summarizing prompt until the participant finished summarizing and moved to the next screen. Total time did not include latency to begin summarizing.

LSA, the current measure for the situation model, measures how closely a summary relates to an ideal target summary, using a cosine that measures the semantic relatedness of the two texts. The cosine is comparable to a correlation coefficient. LSA does not assess synonyms; rather, it compares how words are used in similar contexts. Because of this feature, LSA is able to measure the gist of the text and can therefore potentially measure the situation model (Landauer et al., 2007). LSA has been shown to measure both comprehension and cohesion (Landauer & Dumais, 1997). The target summaries were adapted from the grading rubric in Fulton (2021), which described the main ideas and important details of each text, as agreed upon by two judges. LSA has been used in metacognitive research in the past (Maki et al., 2009; Thiede & Anderson, 2003) and found to be comparable to a trained scorer (Landauer, 1998).
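The cosine at the heart of this comparison can be sketched as follows. This is only an illustration of the final similarity step, not the LSA tool itself: it assumes that the participant summary and the target summary have already been projected into a semantic space as numeric vectors, and the function name is hypothetical.

```python
import math


def cosine_similarity(summary_vec, target_vec):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(summary_vec) != len(target_vec):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(summary_vec, target_vec))
    norm_summary = math.sqrt(sum(a * a for a in summary_vec))
    norm_target = math.sqrt(sum(b * b for b in target_vec))
    if norm_summary == 0 or norm_target == 0:
        # The cosine is undefined when either vector has zero length.
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_summary * norm_target)
```

Vectors pointing in the same direction yield a cosine near 1, and orthogonal vectors yield 0, which is why a higher cosine is read as a summary that is semantically closer to the target.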

Relative accuracy was calculated for each participant using a gamma correlation between participant prediction magnitude and multiple-choice performance. A one-way ANOVA was used to compare the three groups on prediction magnitude, multiple-choice accuracy, and relative accuracy. A Tukey test was run after each significant ANOVA to assess which groups differed from each other.
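For readers unfamiliar with the statistic, the following is a minimal sketch of a Goodman-Kruskal gamma computed for one participant across texts. It assumes the standard definition, (concordant − discordant) / (concordant + discordant), with tied pairs ignored; the function name is hypothetical and this is not the analysis script used in the study.

```python
from itertools import combinations


def goodman_kruskal_gamma(predictions, scores):
    """Gamma between one participant's predictions and test scores.

    Returns None when every pair of texts is tied on at least one
    variable, for example when all predictions are identical.
    """
    if len(predictions) != len(scores):
        raise ValueError("need one prediction and one score per text")
    concordant = discordant = 0
    for (p1, s1), (p2, s2) in combinations(zip(predictions, scores), 2):
        product = (p1 - p2) * (s1 - s2)
        if product > 0:
            concordant += 1  # the two texts are ordered the same way
        elif product < 0:
            discordant += 1  # the two texts are ordered in opposite ways
        # pairs tied on either variable are not counted
    if concordant + discordant == 0:
        return None
    return (concordant - discordant) / (concordant + discordant)
```

The `None` case corresponds to the situation noted in the Participants section, where a participant who made uniform predictions had no computable gamma.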

The summary characteristics in the oral and written conditions were compared using t-tests, and a Bonferroni correction was used to account for increased error rate. After differences in summary characteristics were established, summary characteristics were correlated to prediction magnitude and multiple-choice scores to establish cue use and cue validity. To assess whether individuals were using our measured summary characteristics to make predictions of their performance, intra-individual gamma correlations were calculated between each summary characteristic and prediction magnitude. To measure whether these cues were valid, intra-individual gamma correlations between each summary characteristic and multiple-choice scores were calculated. There were eight gamma correlations calculated for each individual: four correlations between summary cues and predictions, and four between summary cues and multiple-choice scores. Finally, mean gamma correlations were compared across groups using a 2 (oral vs. written) x 4 (summary characteristic) repeated measures ANOVA. Type of summary characteristic was considered a repeated measure variable, as each person had four measures of summary characteristics, and these were compared between groups. A separate 2 (oral vs. written) x 4 (summary characteristic) repeated measures ANOVA was conducted for gammas relating summary characteristics to prediction magnitude and to multiple-choice scores.
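The Bonferroni step can be sketched as below. The family-wise alpha of .05 and the choice to divide it by the number of summary characteristics tested are assumptions made for illustration; the text does not state either value, and the function and key names are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at the Bonferroni-adjusted threshold.

    p_values maps a test label to its uncorrected p-value. The alpha
    default of 0.05 is an assumption, not a value taken from the study.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    threshold = alpha / len(p_values)  # divide alpha by the number of tests
    return {label: p < threshold for label, p in p_values.items()}
```

With four summary characteristics and an alpha of .05, each t-test would be evaluated against a threshold of .0125 under these assumptions.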

Results

Prediction magnitude and multiple-choice performance

No group differences were found for prediction magnitude [F(2, 99) = 0.04, p = .96, η² = 0.00] or for multiple-choice score [F(2, 99) = 0.68, p = .51, η² = 0.01; see Table 1 for means]. These results show mixed support for our hypotheses, as group differences were expected for prediction magnitude, but multiple-choice performance was expected to be consistent across groups. This analysis suggests that summary modality does not influence mean prediction magnitude or comprehension.

Table 1 Mean prediction magnitude and multiple-choice performance by condition

Relative accuracy

The average gamma correlation across all conditions was small, at 0.18 (SE = 0.05), but significantly different from zero [t(101) = 3.55, p < .01]. This suggests that participants were, on average, above chance at distinguishing the texts on which they would score well. An ANOVA showed differences in average gamma correlations between conditions [F(2, 99) = 4.46, p = .01, η² = 0.08], supporting the hypothesis that relative accuracy would differ between groups. A Tukey test revealed that the written condition had the highest average gamma correlation (Fig. 2), which differed significantly from that of the control condition, which had the lowest relative accuracy of the three groups (95% CI [0.07, 0.63]; p = .01). The oral condition had intermediate relative accuracy; it did not differ significantly from either the written condition (95% CI [-0.08, 0.47]; p = .22) or the control condition (95% CI [-0.47, 0.08]; p = .40). However, the written condition was the only condition with a gamma correlation significantly different from zero [t(34) = 4.53, p < .01]. The average gamma correlation for the oral condition was marginally different from zero [t(33) = 2.01, p = .052], and that of the control condition did not differ from zero [t(32) = 0.05, p = .95]. Thus, we are confident that delayed written summaries effectively increased relative accuracy, as this was the only condition to differ both from zero and from the control condition.

Fig. 2 Mean relative accuracy by condition. Note: Error bars represent standard error. **p < .05

Summary characteristics between modalities

Some, but not all, summary characteristics differed between the written and oral conditions (Table 2). The written summaries took longer to complete on average [t(67) = 9.28, p < .01, d = 2.25], as we anticipated. The written condition was also quicker to begin summarizing [t(67) = -4.38, p < .01, d = 1.05]. The oral and written summaries did not differ in LSA score [t(67) = -0.68, p = .50, d = 0.17] or word count [t(67) = -1.64, p = .10, d = 0.39], contrary to the hypothesis. All tests were Bonferroni corrected, with a corrected significance threshold of α = .0125. Overall, the groups did not differ in their situation models or word count, but oral summarizers took less time to summarize and more time to begin summarizing.
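The corrected threshold of .0125 is consistent with dividing a familywise alpha of .05 by the four t-tests reported; the .05 starting value is an assumption, as the text does not state it. A minimal sketch of the arithmetic, applied to the two exact p values given above:

```python
def bonferroni_alpha(familywise_alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return familywise_alpha / n_tests

# Assumed familywise alpha of .05 spread over four summary characteristics.
corrected_alpha = bonferroni_alpha(0.05, 4)

# Exact p values reported in the text for LSA score and word count.
reported_p = {"lsa": 0.50, "word_count": 0.10}
significant = {name: p < corrected_alpha for name, p in reported_p.items()}
```

Both values exceed the corrected threshold, matching the conclusion that LSA score and word count did not differ between modalities.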

Table 2 Differences in averages of summary characteristics by modality

Relation between summary characteristics, prediction magnitude, and MC accuracy

As hypothesized, summary characteristics were significantly related to prediction magnitude: each gamma correlation between a summary characteristic and prediction magnitude differed significantly from zero (Table 3). Next, to test whether some characteristics were more strongly related to prediction magnitude than others, a 2 × 4 repeated measures ANOVA was conducted. There was a significant main effect of summary cues [F(3, 65) = 15.17, p < .001], suggesting that cues did not equally influence prediction magnitude. Specifically, the gamma for latency to begin summarizing was significantly lower than that for each other cue (all p < .001). Additionally, the gamma for word count was marginally larger than those for LSA (p = .051) and total time (p = .094). Although approaching traditional significance levels, group differences in cue use were not significant [F(1, 67) = 2.91, p = .09], and the interaction was not significant [F(3, 65) = 0.33, p = .80].

Table 3 Gamma correlations between summary characteristics and predictions, as well as between summary characteristics and multiple-choice accuracy

With regard to comprehension, there were significant mean differences in the relationship between summary characteristics and comprehension scores [F(3, 65) = 7.00, p < .001; Table 3]. Again, the gamma for latency to begin summarizing was significantly lower than that for each other cue (all p < .001). No other cues differed from one another. The cue-by-condition interaction was significant [F(3, 65) = 4.16, p < .009]: the validity of total time differed between the written and oral conditions (p = .045), with time being a valid cue only in the oral condition. Finally, there was no main effect of condition [F(1, 67) = 0.26, p = .98].

Cognitive load

Cognitive load had three subscales, and each was compared across groups using a one-way ANOVA. First, intrinsic load did not differ between groups [F(2, 101) = 1.55, p = .21; see Table 4 for group means]. Extraneous load was marginally different [F(2, 101) = 2.77, p = .067], with a Tukey test revealing a marginal difference between the oral condition and the written condition (p = .066) and no other differences. Finally, germane load did not differ between conditions [F(2, 101) = 1.40, p = .25]. Therefore, extraneous load, or the cognitive load that depends on the environment, may be higher in the oral modality than in the written modality.

Table 4 Group means for perceived cognitive load

Discussion

Our findings provide the first evidence of an effect of summary modality on metacomprehension relative accuracy and evidence of multiple cue use in a metacomprehension context. The results suggest that written summaries, but not oral summaries, benefit relative accuracy in metacomprehension, as the written summary condition was the only group whose relative accuracy was greater than chance and differed from the control group. Explanations for this effect remain unclear, but we believe the findings have implications for metacognitive theory and SRL, and we discuss possible explanations and implications below. We also discuss the novel evidence for multiple cue use in metacomprehension predictions and the extent to which they are valid cues.

Summarizing in writing appears to benefit relative accuracy in metacomprehension predictions, as the written summary group was the only group whose judgment accuracy was significantly different than the control group and significantly different from zero. We are aware that no strong statement can be made about oral summarizing as that group was not significantly different from either the written or the control condition. However, we note that Griffin and colleagues (2019) argued that gamma correlations, which are assumed to be an ordinal variable, experience a reduction in variation and thus statistical power, which can inflate type II error rates. For this reason, it is possible that a replication with greater statistical power could show that oral summarizing is significantly better than the control and/or worse than written summarizing. Regardless, we believe our results provide evidence that summarizing, particularly in writing, may improve SRL through its impact on metacomprehension. In particular, students who write summaries of texts they read may be better judges of which of those texts are more or less well understood. This can allow them to make informed choices about continued study, such as which texts need to be restudied or need the most time allocated to them during either study or test (Metcalfe & Finn, 2008). We had planned to use analyses of cue use to help interpret why summarizing and/or a particular summary modality might afford greater advantages for metacomprehension judgment accuracy, but the complex and nuanced nature of the findings prevents strong conclusions about mechanism. Nonetheless, we provide some possible explanations below.

We originally hypothesized that group differences in cue use could help explain why one summary modality might lead to better metacomprehension relative accuracy. As such, the written condition may have outperformed the oral condition because they were better able to use the cues (i.e., summary characteristics we measured) at their disposal. However, the group difference in the relationship between prediction judgments and cues was only marginally significant (p = .09), with no significant interaction between condition and summary characteristic. It is interesting, though, that all gamma correlations between cues and predictions (barring latency) were higher in magnitude for the written condition than the oral condition, particularly for word count (g = 0.48 versus g = 0.31). The gamma between word count and predictions was also marginally larger than between predictions and two other summary characteristics, LSA and total time; importantly, this was only the case in the written condition. Again, although we must very cautiously interpret null effects, it could mean that word count is a stronger judgment cue than the others, especially when considering the conservative nature of gamma correlations (Griffin et al., 2019). If so, one way this might have occurred is that the visual nature of written summaries could have increased the salience of word count as a cue for participants in the written condition, making it easier to monitor word count when one can see how much space it takes up on the page, a visual that is absent while speaking. Higher salience of word count might lead those in the written condition to incorporate word count into their judgments to a greater degree, leading the written condition to be more accurate. Nonetheless, we fully acknowledge that this reasoning is speculative given the null/marginally significant differences in cue use, but we do believe that this possibility is worth exploring.

In addition to showing that summary modality can affect metacomprehension relative accuracy, our study provides the first experimental evidence, to our knowledge, that people use multiple cues when making metacomprehension judgments (see Undorf et al., 2018 for evidence in metamemory). Each of the cues measured (word count, LSA, summary time, latency to begin summaries) was related to predictions of future comprehension performance. The well-established cues, accessibility of information and the situation model, both seemed to be used by participants to approximately the same extent, expanding our knowledge of how these cues are used. Thus, similar to a recent metamemory study (Undorf et al., 2018), we argue that participants use multiple cues to make predictions about their comprehension performance, but the cues can vary in validity, and some may be weighted more than others. Most of the cues (LSA was the exception) were valid, as they were significantly related to comprehension performance, but the gamma correlations were fairly low, indicating that there may be other, more valid cues that should be used for prediction judgments. One possible explanation invokes the transfer-appropriate monitoring theory (Dunlosky et al., 2005), which posits that encoding and retrieval are more successful when the processes required at test match those employed at study. Because summarization involves some cognitive processes that differ from those used to successfully complete a multiple-choice comprehension test, high performance in one does not necessarily transfer to high performance in the other (Head et al., 1989). However, this explanation cannot fully account for current and previous findings. In our current study, the summarizing strategy, particularly in the written condition, still afforded greater metacognitive accuracy than not summarizing, so it seems that some part of the summary is representative of comprehension performance.
Furthermore, Anderson and Thiede (2008) found a rather high correlation between gist-based ideas from summaries and multiple-choice performance. Even total ideas in that study were more highly correlated with multiple choice performance than in the present study, with their participants achieving much higher relative accuracy (Anderson & Thiede, 2008). Perhaps, then, the mismatch between processes involved in summarization and those involved in multiple-choice test performance is not fully to blame. Rather, poor summarizing and/or comprehension test performance in our sample may have diminished the relationship between the two, in part due to restriction of range.

The variation in cue validity deserves further interpretation. First, the average gamma correlation between comprehension and latency to begin summarizing differed significantly from zero: the longer people paused before summarizing, the worse they did on the comprehension test. Perhaps this pause is an indicator of difficulty retrieving information about the text. This sense of disfluency was a fairly accurate cue, replicating some other disfluency findings (Pieger et al., 2016), although disfluency is not always the most beneficial for learning (Kühl & Eitel, 2016; Yue et al., 2013). Second, some cue validity depended on the summary modality. Summary time was related to comprehension in the oral summary condition but not in the written summary condition. Although we cannot be sure why, we conjecture that those in the written condition took time to reread or organize their summaries, such that time in the written condition was less correlated with actual content output than it was in the oral condition. In the oral condition, on the other hand, summary time was a decent indicator of how much one actually understood because people would stop summarizing when they could not recall or articulate more. Thus, summary time was more correlated with accessibility of information in the oral condition than it was in the written condition, making it a valid cue for oral summarizers.

Some describe fluency (at encoding and retrieval) as the most prevalent cue individuals use in making judgments (Koriat et al., 2004), but most fluency research is limited to the metamemory field. Typically, fluency is measured using reading speed or the speed of recalling words and sentences (Benjamin et al., 1998; Pieger et al., 2016), and we do not believe others have assessed summary time as a cue for fluency. Most often, faster recall or processing is found to lead to higher prediction judgments (Benjamin et al., 1998; Pieger et al., 2016; Rawson & Dunlosky, 2002); however, the current study finds that slower summary times are associated with higher judgments. Total summary time likely relates to greater accessibility of information; if the participants took a long time to summarize, they likely knew more about the subject. Another aspect to consider in the future is whether the participants took breaks while summarizing. One study conceptualized the fluency of processing as the regularity of task timing, rather than the speed of the task, and demonstrated that this consistency affects metacognitive judgments more than speed (Stevenson & Carlson, 2020). Measuring consistency of summary production may enhance our understanding of fluency’s role in metacomprehension judgments in future research. Latency is likely a more traditional proxy for processing speed: when information comes quickly to mind, participants tend to believe they know the material better. However, according to pairwise comparisons, the gamma correlation between predictions and latency to begin summarizing was the only one that differed significantly from the gamma correlations between predictions and the other three cues. This suggests that latency was not as strong a cue as the other summary characteristics.

Based on the differences in fluency-related cues across conditions, it is surprising that average prediction magnitude did not differ between the summary modality conditions. However, past research has shown that average prediction magnitude tends to remain consistent and a within-subjects measure shows that judgments of performance are highly correlated after a week (Kelemen et al., 2000). It is possible that participants were using an unmeasured anchor while making judgments, and then based on cues during their summarization experience, they adjusted their predictions from this anchor (Zhao & Linderholm, 2008). So, even though written summaries were faster to begin summarizing, and took longer to complete (both associated with higher judgment magnitude), the average prediction magnitude was not higher because participants could have been adjusting from an anchor.

It was surprising that the situation model, as measured by LSA, was not correlated with multiple-choice performance. First, it was numerically only slightly smaller than the other measured cues. While the other cues were significantly different from zero, they were not much larger than the correlation between LSA and performance. Second, it is possible that participants are simply quite poor at judging the situation model. In Maki et al. (2009), the correlations between metacognitive judgments and LSA scores were quite low, although their method differed substantially from the one in the current study. Another possibility is that LSA is a poor measure of the situation model. As suggested by a reviewer, we assessed the relationship between LSA and inference questions, as the situation model is associated with a greater ability to make inferences. Surprisingly, despite a higher mean correlation, LSA was not significantly more highly correlated with inference-based questions (g = 0.10) on the comprehension test than with questions that could be answered by what was explicitly stated in the texts (g = -0.09; t(67) = -1.59, p = .11), suggesting that those with higher LSA scores did not have better situation models. Thus, LSA may tap a slightly different construct, potentially summary quality, as our analysis suggests that LSA is a cue. Finally, it could be that participants’ expectations about the comprehension test did not match the actual difficulty of the test (we did not offer a practice test). So, perhaps LSA did not correlate with performance because the comprehension test was unexpectedly challenging for the participants.

It should be noted that our participants had noticeably lower relative accuracy compared to past research; the average gamma correlation in the control condition was almost zero, while in the past, it has been 0.27 without intervention (Dunlosky & Lipko, 2007). Participants also scored more poorly on the multiple-choice comprehension test compared to other students on the same task (Fulton, 2021). This lower comprehension ability may have led to lower relative accuracy (Maki et al., 2005). Optimistically, relative accuracy in the written condition was similar to the average found in the literature (Dunlosky & Lipko, 2007). The written summary intervention appears to be effective at increasing relative accuracy, even when comprehension performance is low, and may help those with lower comprehension ability to maximize their learning potential.

Cognitive load

Our cognitive load results revealed a very different pattern than we predicted, with those in the written summarization condition reporting the lowest cognitive load and oral summarizers reporting the highest. While initially surprising, there are several reasons why this pattern makes sense in hindsight. Those in the written condition had the ability to reread summaries, which could have allowed them to summarize the passages and monitor their comprehension separately in time, something those in the oral condition could not have done. Offloading in this way can free up cognitive resources (Risko & Dunn, 2015), which may then be allocated to metacomprehension monitoring, potentially allowing for more accurate judgments (Griffin et al., 2008). This possibility is partially supported by the cognitive load results. Whereas the intrinsic and germane loads reported were approximately the same between groups, extraneous load, the type of cognitive load that relates to presentation/environment, appears to be lowest in the written condition. Although differences between all groups in reported extraneous load only approached statistical significance (p = .07), a post hoc analysis comparing the two summary conditions shows that the written condition had a marginally lower extraneous load than the oral condition, supporting the idea that cognitive offloading might explain differences between groups. Opportunities for offloading are a mechanism that may be worth exploring in the future to help explain group differences in metacomprehension monitoring accuracy. One study (Reid et al., 2017) found that cognitive load was higher in the summarizing condition than in a control condition, contrary to the data in the current study, but they measured cognitive load before the comprehension test.
It is unclear whether these differences are driven by the type of measure, as we used a survey that differentiated between the types of load, or by the timing of the cognitive load questionnaire, as our measure occurred at the end of the multiple-choice test. Because our measure occurred after the comprehension test, we do not know whether cognitive load acted as a cue for participants’ judgments, or whether our cognitive load survey measured perceived difficulty of the summaries, the multiple-choice test, both, or perhaps the entire experiment experience. However, differences seem to be due to summarization modality, as that was the only manipulated variable.

Limitations

First, we recognize that there may be other cues (e.g., prior knowledge) that influence prediction judgments (Koriat, 1997). Second, LSA was used to measure the situation model because it has been used to measure summary quality in the past (Kintsch, 1998; Maki et al., 2009; Thiede & Anderson, 2003). We used LSA to avoid a high correlation between word count and number of gist-based ideas and to reduce potential bias of human scorers, but we acknowledge that there are other approaches to measuring the situation model (e.g., number of gist-based ideas; Anderson & Thiede, 2008), and each approach has its strengths and limitations. Third, some of our results were only marginally significant, and despite the conservative nature of gamma, replication is certainly necessary. A fourth limitation may be the relative lack of practice and experience with oral summarization among our participants. Although most people speak more than they write, summarization is more often done in writing. With the exception of presentations, oral assignments and examinations are relatively rare (Huxham et al., 2012). Although we do not know the exact experience that our participants have with oral summarization, it is likely a novel task for many. If students primarily summarize in the written modality, they may be more attuned to the cues available in written summaries, and thus relative metacomprehension accuracy under those conditions could be largely an artifact of practice. Importantly, past research suggests that oral test anxiety is related in part to social anxiety (Laurin-Barantke et al., 2016) and, anecdotally, participants in the oral condition expressed nervousness when summarizing aloud in front of a researcher. As such, social anxiety may have been unintentionally evoked in some participants and may be in part driving the marginal differences in cognitive load.
The effect of social anxiety on metacomprehension monitoring accuracy is unknown but is the topic of a study currently underway in our lab.

Future directions

Future studies could confirm whether writing summaries truly benefits relative accuracy in metacomprehension monitoring, whether summarization leads to utilization of multiple cues, and whether cognitive offloading (perhaps from summarizing in writing) is driving and benefiting accuracy. Some of our findings, namely that the accessibility of information and fluency were valid cues, are inconsistent with past findings (Dunlosky et al., 2005b; Koriat, 1993), so it is important to first replicate these effects and then understand which conditions allow the accessibility of information to be a valid cue. Finally, very few studies have addressed whether multiple cues increase accuracy, which is vital to our understanding and use of metacognitive strategies, so future research should confirm the benefit of multiple cue utilization to relative accuracy in metacomprehension monitoring.

Conclusions

Summarizing in writing may offer better relative metacomprehension accuracy than summarizing orally or not at all. Perhaps this is due to greater salience of cues or to cognitive offloading. Because higher relative accuracy improves study practices, such as the allocation of study time, instructing students to summarize in writing, rather than orally, may improve their study efficiency and therefore their academic performance. With more research, we may find that instructing students to pay attention to multiple cues additionally benefits their metacognitive accuracy. Although researchers have previously used summaries to assess the situation model and the accessibility of information (Anderson & Thiede, 2008; Maki et al., 2009), we believe that studying cues from a multi-cue perspective will advance our understanding of monitoring strategies.