Introduction

Metacognition refers to learners’ awareness and regulation of their cognition (Flavell 1979). Monitoring is a core component of metacognition that drives learners to detect errors and optimize their performance. Schraw and Moshman (1995) define monitoring as “one’s on-line awareness of comprehension and task performance” (p. 355). Specifically, monitoring represents learners’ ability to concurrently examine their learning processes and outcomes. Accurate monitoring enables learners to perceive task demands, select appropriate strategies, and evaluate and reflect on task performance in ways that improve their future performance (Winne 1995, 2001; Winne and Hadwin 1998). In contrast, inaccurate monitoring can be detrimental to students’ academic achievement.

Students’ monitoring accuracy has been studied by examining the extent to which students’ perceived performance is discrepant from their actual performance, a discrepancy also called calibration (Keren 1991; Nietfeld et al. 2006b; Pieschl 2009). Confidence bias and absolute accuracy are two indices that capture this discrepancy (Nelson 1996; Pieschl 2009). Specifically, confidence bias represents the signed difference between students’ perceived performance and actual performance. A positive difference indicates overconfidence, while a negative difference indicates underconfidence. In comparison, absolute accuracy represents the absolute value of the discrepancy. Monitoring is considered accurate when both indices (i.e., confidence bias and absolute accuracy) are close to zero.
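In symbols, letting c̄ denote a student’s mean confidence rating rescaled to the [0, 1] interval and p̄ the student’s mean proportion of items answered correctly, one formalization consistent with this description (and with the calculation detailed in the Method section) is:

```latex
\text{Confidence bias} = \bar{c} - \bar{p}, \qquad
\text{Absolute accuracy} = \left| \bar{c} - \bar{p} \right|
```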

Existing research using these two indices suggests that students often have difficulty rendering accurate judgments of their objective academic performance, typically because they are overconfident (e.g., García et al. 2016; Glenberg and Epstein 1987; Hacker et al. 2000; Miller and Geraci 2011a). As metacognition is related to domain-specific knowledge (Schraw 1998), this phenomenon, in which students’ inaccurate appraisals of their own performance hinder them from deploying self-regulated learning strategies and attaining learning goals, is also an issue in the domain of mathematics.

In the domain of mathematics, students often need to engage in multiple cognitive activities, such as coordinating different strategies during problem-solving tasks (Silver 1987, 1994). To do so, students need to be equipped with both cognitive and metacognitive knowledge of mathematics (Garofalo and Lester 1985). However, this knowledge is not readily acquired, perhaps especially as students transition from primary school mathematics to middle school mathematics and confront a more intensive and comprehensive curriculum. Cleary and Chen (2009) reported that middle school students rarely used strategies effectively and efficiently during mathematics problem-solving tasks. One primary reason for this lack of effective strategy use is a lack of metacognition: students fail to recognize gaps in their mathematics knowledge and their need for further learning to remedy such gaps. Further, even students with a strong and accurate knowledge base may be unable to accurately monitor their learning processes and, as a result, report inaccurate or biased judgments about their own performance (e.g., Dinsmore et al. 2015; Glenberg and Epstein 1987). These deficits in monitoring may create obstacles for middle school students pursuing academic goals in mathematics. Thus, further investigation of interventions to improve middle schoolers’ monitoring accuracy in mathematics is warranted.

Researchers (e.g., Nietfeld and Schraw 2002; Zimmerman et al. 2011) have reported positive outcomes from the implementation of metacognitive interventions in mathematics with college students. These interventions typically teach students to judge their performance accurately by providing monitoring guidelines and/or feedback to improve students’ monitoring accuracy as well as improve their mathematics achievement. Interventions in monitoring, however, have not been adequately implemented and tested with middle schoolers.

Further, few studies have examined Asian students’ monitoring in any academic domain. In one of the few such studies, Yates et al. (1989) compared American and Chinese college students’ monitoring accuracy when answering general knowledge questions (e.g., Which is farther north: London or New York?). Specifically, the American and Chinese students were asked to answer equally difficult general knowledge questions and indicate the probability that they had arrived at the correct answers. Findings suggested that the Chinese students were significantly more inaccurate and overconfident than the American students.

However, findings of such cultural differences are not always consistent. For instance, Zabrucky et al. (2009) examined Taiwanese college students’ ability to accurately monitor their performance on reading comprehension items. Specifically, students were asked to judge their performance on reading comprehension before (prediction) and after (postdiction) the task. Findings indicated that, overall, Taiwanese students’ reading comprehension performance was positively associated with accurate performance judgments, corresponding to findings with students from Western countries (e.g., Dunlosky et al. 2005b; List and Alexander 2015; Pressley and Ghatala 1988). A similar study conducted earlier in the United States (Lin et al. 2001), however, reported discrepant results. In that study, American college students were asked to make predictions and postdictions about their performance on reading expository texts. Findings indicated that the American students’ postdictions tended to be more accurate than their predictions, consistent with the Taiwanese students in the later study (Zabrucky et al. 2009). However, the American students’ monitoring accuracy was not associated with their performance, in contrast to the Taiwanese students. Comparing the two studies (i.e., Lin et al. 2001; Zabrucky et al. 2009), the Taiwanese students also demonstrated more accurate monitoring than the American students. These findings indicate potential differences in monitoring between Asian and Western students. The current study addressed known gaps in the monitoring literature and targeted Chinese middle school students’ monitoring in mathematics.

The present study contributes to the literature on monitoring accuracy in several ways. First, with a considerable amount of the monitoring literature focused on reading comprehension (e.g., Dabarera et al. 2014; Gillström and Rönnberg 1995; Glenberg et al. 1987; Kinnunen and Varuas 1995; Lauterman and Ackerman 2014; Lin et al. 2002; Ozuru et al. 2012; Schraw et al. 1993; Shiu and Chen 2013; Singer and Alexander 2017; Walczyk and Hall 1989), less is known about monitoring accuracy in mathematics. Further, the present study focused on first-year (7th grade) middle school students’ monitoring accuracy, based on our consideration that these students are likely to experience unique challenges in mathematics learning during the transition between elementary and middle school. Moreover, we asked middle school students not only to judge their performance but also to justify their judgments. Studying middle school students’ justifications of their monitoring judgments allows us to identify factors that younger students may consider when making performance judgments and, more importantly, may provide insight into how to improve future monitoring interventions for middle school students in mathematics. Finally, in contrast to most reading comprehension monitoring studies, which were conducted in Western countries, the present study was conducted in classrooms within China and represents one of very few monitoring intervention studies conducted in China.

Monitoring in metacognition and self-regulated learning

Schraw and Moshman’s (1995) model of metacognition explains the importance of monitoring for students’ learning. Specifically, they divided metacognition into two metacognitive processes: knowledge of cognition and regulation of cognition. Knowledge of cognition refers to one’s awareness or knowledge about one’s cognition, the nature of tasks, and the selection of strategies. Regulation of cognition refers to a dynamic and proactive self-regulatory process of learning, emphasizing the ongoing processes learners deploy to reach their learning goals and meet task criteria. Importantly, knowledge of cognition and regulation of cognition dynamically interact with one another: when learners develop accurate awareness of their knowledge, they correspondingly learn to control and regulate their learning. Monitoring falls under regulation of cognition; it not only enables learners to track their learning trajectories but also supports the development of metacognitive awareness.

Winne and Hadwin’s (1998) model of self-regulated learning also places metacognitive monitoring as the central component that drives the processes of self-regulated learning. Specifically, they identified four phases of self-regulated learning: task definition, goal setting and planning, enacting tactics and strategies, and adaptation. Monitoring assists learners in moving through each of the four phases. In Phase 1 (task definition), learners draw on information from the task context as well as their prior task experience to form a definition of the task. Monitoring helps learners compare such information against the current task to establish their perceptions of the task. In Phase 2 (goal setting and planning), monitoring assists learners in setting goals and plans that match their initial task perceptions from Phase 1. Moving to Phase 3 (enacting tactics and strategies), learners use monitoring to select and employ tactics and strategies while completing the task. Finally, in Phase 4 (adaptation), monitoring guides learners to evaluate and reflect on their task product and make necessary revisions for future tasks.

Both Schraw and Moshman’s and Winne and Hadwin’s models demonstrate the crucial function of monitoring in students’ learning processes. In other words, students’ accurate monitoring spurs improved awareness of their knowledge, effective strategy deployment, and desired learning outcomes (Butler and Winne 1995; Dunlosky et al. 2005a; Flavell 1979; Huff and Nietfeld 2009; Pressley and Ghatala 1990; Winne and Jamieson-Noel 2002).

Monitoring interventions

Although monitoring is critical for learners’ academic achievement, prior literature on students’ monitoring has found that students are typically inaccurate monitors (e.g., Dunlosky and Rawson 2012; Schraw et al. 1995). Some scholars and practitioners have therefore developed interventions to support students’ monitoring abilities. As noted, while the majority of monitoring studies examined reading comprehension (e.g., Dunlosky et al. 2005b; Pressley and Ghatala 1988, 1990), researchers have also studied monitoring in a variety of other domains and developed interventions to improve students’ monitoring. Consistent with the reading comprehension results, across academic domains high monitoring accuracy was associated with improved academic achievement, and students were generally inaccurate monitors (e.g., research methods, Dinsmore and Parkinson 2013; undergraduate educational psychology, Hacker et al. 2000; multiple texts task, List and Alexander 2015; math, Ramdass and Zimmerman 2008).

For instance, Nietfeld and Schraw (2002) implemented a strategy training session to improve college students’ monitoring accuracy during mathematics tasks. Students were asked to complete a set of mathematics problems and to provide confidence ratings for each item on a 100-point scale. Students in the training condition received a 2-h training that included strategy instruction for solving mathematical probability problems. In contrast, students in the control condition were not provided with strategy instruction. As a result of this brief intervention, students who received the strategy training demonstrated improved mathematics problem-solving performance as well as better monitoring accuracy. Building on these positive results, Nietfeld et al. (2006a) extended this research and implemented another monitoring strategy training study. They distributed weekly monitoring exercise sheets that directed undergraduate students to monitor their learning in an introductory educational psychology course. Specifically, students in the treatment condition received weekly monitoring instruction and intensive monitoring exercises over 16 weeks. Students were asked to practice item-by-item monitoring on multiple-choice questions corresponding to the course contents and received feedback from the instructor the following week. Students in the comparison condition did not practice monitoring on a weekly basis. Findings showed that students in the treatment condition demonstrated improved monitoring accuracy as well as academic performance when compared to the comparison condition, indicating that training with monitoring exercises has potential benefits for both students’ monitoring accuracy and their academic achievement.

In testing an additional intervention, Bol et al. (2012) provided high school students with monitoring practice guidelines and instruction in a biology class. Specifically, monitoring practice guidelines and instruction directed students to check their answers and understanding during task completion. In addition, students in the experimental condition were further instructed to reflect on their understanding of targeted biology content and to make confidence judgments during task completion. Results indicated that students’ monitoring accuracy was improved, as was their biology achievement.

While these positive results suggest support for monitoring interventions, inconsistent findings demonstrate that metacognitive interventions are not always effective. For instance, Bol, Hacker, et al. (2005) examined whether having students practice monitoring would improve their monitoring accuracy and academic achievement. Specifically, they asked college students to make predictions and postdictions about their performance on several quizzes across an academic semester in an education major-related course. Students in the control condition were asked to complete the same quizzes but without pre- and postdictions. Unexpectedly, results indicated that students’ overt monitoring practice improved neither their monitoring accuracy nor their academic performance. This may be due to the lack of explicit monitoring instruction provided to students to inform their monitoring practice. Similarly ineffective intervention results were also reported in other studies (e.g., Bol and Hacker 2001; Nagel and Lindsey 2018). These mixed findings demonstrate the inconsistent effects of interventions on students’ monitoring.

Factors that influence monitoring

In addition to studying the effectiveness of interventions, other scholars have focused on investigating the formation and the multifaceted nature of students’ performance judgments. Lin and Zabrucky’s (1998) review suggested that external task demands play a role in forming students’ performance judgments. Particularly, students’ performance judgments were driven by individual-related, task-related, and text-related factors. In a reading comprehension task, individual-related factors refer to readers’ individual differences that influence their reading comprehension and monitoring accuracy. Such factors may include learners’ prior knowledge about the text content, reading ability, and related motivational constructs (e.g., interest in the topic and self-efficacy for reading). Task-related factors refer to task characteristics such as types of tests, test item difficulty, task demands, and the involvement of feedback or practice. Text-related factors include the genre of texts and text difficulty level. Lin and Zabrucky suggested that accurate performance judgments require learners to take both internal (individual-related) and external (task-related and text-related) factors into consideration simultaneously. When students neglect to consider multiple factors, poorer academic performance and inaccurate monitoring may result.

Drawing from Lin and Zabrucky (1998), Pieschl (2009) argued that traditional studies of students’ performance judgments focused solely on students’ internal processes, that is, how students capture their individual learning processes when making judgments, and that prior investigations neglected external task demands and task complexity. As previous findings indicated that students tend to be overconfident on difficult tasks, it may be that they fail to accurately perceive task demands and prematurely stop working to improve on the task (Pressley and Ghatala 1988; Schraw and Roedel 1994). Pieschl (2009) argued that learners must consider more than internal factors in order to render accurate performance judgments and suggested that future research capture the external factors that influence students’ performance judgments.

Similar conclusions were voiced by Dinsmore and Parkinson (2013), who examined college students’ monitoring accuracy and academic performance as they read two introductory statistics passages. After reading the passages, students answered a set of multiple-choice questions related to the passage content and then rated their performance on a 100-point scale. Notably, Dinsmore and Parkinson further asked students to justify their performance judgments in open-ended responses explaining the reasons for their judgments, and they identified five categories of factors that students considered when making judgments: prior knowledge, text characteristics, item characteristics, guessing, and ‘other’. Besides identifying internal (prior knowledge) and external (text characteristics and item characteristics) factors similar to Lin and Zabrucky (1998) and Pieschl (2009), Dinsmore and Parkinson also recognized two additional factors: guessing and ‘other’. Guessing captured when students stated that their judgments were based on a guess or a feeling, while the category of ‘other’ represented responses not within previously identified categories [see Dinsmore and Parkinson (2013) for sample responses]. They reported that college students considered both personal and environmental factors when making performance judgments, further verifying the multifaceted nature of students’ performance judgments.

A recent study (Wang and List 2019) also examined college students’ self-evaluations, but in a complex writing composition task. Participants were asked to compose a written product (i.e., a research report or an argument) based on reading multiple texts. After they composed a product, they evaluated their written responses by assigning themselves a letter grade. They then explained the reasons for their grade assignment. Results demonstrated that students considered 12 categories of factors, such as strategy deployment, specific writing mechanics, and personal attributions. Consistent with previous findings, students’ justifications were multifaceted but broadly mapped onto personal skills, task context, and the strategies that they enacted (Dinsmore and Parkinson 2013; Lin and Zabrucky 1998; Pieschl 2009).

The multifaceted nature of students’ monitoring justifications has only been established with college students, and generally in reading and writing tasks. In the present study, we extended this work and examined middle school students’ justifications about their performance in mathematics problem-solving tasks.

Monitoring in a mathematics context

Although metacognitive monitoring is often considered domain-general in nature (Gutierrez et al. 2016; Schraw 1998), the motivational (e.g., goal setting) and cognitive components (e.g., strategy use) involved in self-regulated learning may vary across subject areas. Wolters and Pintrich (1998) asked Grade 7 and 8 students to complete self-report questionnaires regarding motivation and cognition across three subject areas: mathematics, English, and social studies. They found differences in students’ reported motivation- and cognition-related constructs across the three subject areas. Specifically, in terms of the motivational components, students demonstrated higher task value for mathematics than for English and social studies and higher self-efficacy in English than in mathematics and social studies. Students also demonstrated varied degrees of cognitive strategy use, reporting that they used strategies more often in social studies than in English and mathematics. Given that monitoring plays a critical role throughout the phases of self-regulated learning, students may be expected to monitor and further regulate their learning differently across domains (Hadwin et al. 2001).

Mathematics is differentiated from other subject areas in a number of ways. First, teachers have different views toward mathematics than toward other subjects. Particularly, teachers consider mathematics more structured, sequential, and heavily dependent on previously taught topics, while they consider social studies more open and less sequential (Grossman and Stodolsky 1995; Stodolsky and Grossman 1995). Students also hold different views toward mathematics when compared to other subjects. For instance, compared to social studies, which students may relate to real life during task completion, mathematical problem-solving tasks often require students to confront abstract mathematical concepts and sequential operations (Schoenfeld 1992). Such cognitive activities during mathematics tasks require students to enact specific self-regulatory strategies and accurate monitoring. Yet, students usually fail to do so (e.g., Cleary and Chen 2009; Kramarski and Gutman 2006). For example, García et al. (2016) examined elementary school students’ monitoring on mathematics problem-solving tasks using confidence rating scales. Students solved two math word problems and indicated their confidence while also showing their work. These scholars reported that students were inaccurate monitors and were particularly overconfident relative to their actual performance. Their analysis of mathematical problem-solving processes further demonstrated that students who were accurate monitors used strategies more frequently than those who were not. These results reflect students’ deficits in monitoring and strategy deployment, which may further undermine their mathematics achievement. Subsequent research by Callan and Cleary (2019) also found that middle school students’ mathematics performance was positively associated with metacognitive monitoring and strategy use.

The present study

The present study examined Chinese middle school students’ monitoring accuracy in mathematics by implementing a monitoring intervention and exploring students’ monitoring judgments. We addressed three primary research questions.

First, what are the associations among absolute accuracy, confidence bias, mathematics performance, and other psychological constructs (i.e., self-regulated learning strategies, metacognitive awareness, and self-efficacy) in Chinese 7th grade students?

According to the theoretical frameworks that guided this study (Schraw and Moshman 1995; Winne and Hadwin 1998), and consistent with previous literature (e.g., Bol et al. 2012), we anticipated that students’ mathematics performance would be negatively correlated with the absolute accuracy and confidence bias indices; that is, better performance would accompany more accurate, less overconfident monitoring. Meanwhile, self-regulated learning strategies, metacognitive awareness, and self-efficacy were expected to be positively associated with students’ mathematics performance and monitoring accuracy. That is, students who use strategies effectively, have high metacognitive awareness, and have high self-efficacy in mathematics were also expected to have higher math performance and more accurate monitoring.

Second, within the school setting, to what extent does the monitoring intervention improve 7th grade students’ mathematics performance, monitoring accuracy, reported self-regulated learning strategies, metacognitive awareness, and self-efficacy?

Consistent with Nietfeld and Schraw (2002), where students improved their academic performance and monitoring accuracy after receiving both verbal and written monitoring training, we expected that students in the experimental condition, who would receive both explicit monitoring instructions and calibration practice, would improve their mathematics performance and monitoring accuracy more than students in the other conditions. Similarly, students’ self-regulated learning strategies, metacognitive awareness, and self-efficacy were also expected to improve. We anticipated, however, that the magnitude of improvement in monitoring in this study would be constrained for two reasons. First, students were provided only written monitoring instructions without verbal intervention instructions from teachers. Further, the length of three weeks, with one session each week, was relatively short compared to other effective long-term monitoring intervention studies (e.g., 14 sessions: Huff and Nietfeld 2009; 16 sessions: Nietfeld et al. 2006a).

Third, how do 7th grade students justify their performance judgments and do students’ performance justifications predict their math performance and monitoring accuracy?

Dinsmore and Parkinson (2013) reported that college students considered multiple factors when making performance judgments. In the present study, we also expected 7th grade students to report a variety of justifications for their performance judgments. However, given the developmental nature of metacognition (Brown 1987), we anticipated that middle school students may not consider as many factors as college students; that is, middle school students may be more likely to consider a limited number of factors when making performance judgments. We further expected that the specific factors that students commonly considered would significantly predict their math performance as well as their monitoring accuracy.

Method

Study design

A three-group pretest/posttest quasi-experimental design with random assignment of classrooms to conditions was implemented in this study. A pretest was administered to all students to control for potential differences among students and across classrooms. This study examined the effects of the metacognitive intervention on students’ monitoring accuracy and mathematics achievement across three practice sessions, once a week for three weeks. During each practice session, students across conditions received the same practice material developed by the teachers, but conditions varied by the inclusion of monitoring directions and confidence rating scales. The three conditions were a control condition, a confidence rating only condition (CR), and a confidence rating with monitoring instructions condition (CR + MI).

Students in the control condition were asked to complete the mathematics practice questions only. In addition to the mathematics practice questions, students in the CR condition were asked to rate their confidence for each item on a 10-point Likert-type scale. Students in the CR + MI condition were asked to rate their confidence and also were provided with written explicit monitoring directions that instructed them to monitor during the practice session. An example of an explicit monitoring direction was “When you monitor, you ask yourself questions…After you pick an answer you stop and ask yourself if it is the right answer.” A detailed description of the study design is presented in Fig. 1.

Fig. 1 Study design

Participants

Participants were 133 Grade 7 students in a public middle school located in Southwestern China. Of the 133 student participants, 54.14% (n = 72) were female and 45.86% (n = 61) were male. Students’ average age was 13 years. The average class size was 44. Data screening was completed to address invalid or missing data, and assumption testing was conducted prior to analyses.

Procedures

Students across all conditions received the same practice sets each session during the intervention. The written explicit monitoring directions and/or confidence rating scales that preceded the practice sets were what differentiated the three conditions.

The procedures of the study can be briefly described in four steps. First, school permission and students’ informed consent were obtained. Second, students completed the pretest. Third, after the pretest, the intervention was implemented across 3 weeks, with one session per week. Last, after the intervention phase, all students completed the posttest measures. These measures were identical to those administered in the pretest phase, except that a parallel form of the mathematics items was used for the posttest. The total duration of the study was 6 weeks, including the consent process, pretest, three practice sessions, and posttest. Each practice session took approximately 45 minutes during a regular class period. A timeline is presented in Fig. 1.

Teachers’ primary role in this study was to administer the study materials. In advance of the study, the first author provided a 2-h training session, consisting of a group session and individual sessions, to the three participating teachers to ensure intervention fidelity. Contents of the group training session included an overall description of the study, specific study procedures, and the timeframe. Specifically, the first author met with the three teachers in a conference room and provided general information regarding the study, such as the general instructions that students needed when completing the pre- and posttest. During the meeting, the first author also coordinated with the three teachers on schedules, class locations, and other logistical details for conducting the study.

After the group session and prior to the study, the first author also held individual meetings with each teacher regarding the specific materials they would receive and instructions for administering the materials with fidelity. Specific study procedures and instructions were provided to each teacher individually, corresponding to the designated condition. Typically, students in the target school would complete regular math practice tests independently, and teachers would provide feedback regarding students’ performance after task completion. As such, during the intervention phase, teachers were asked to provide the practice sets and performance feedback as usual, and students were asked to complete the practice independently. Notably, as the written monitoring instructions and monitoring practice were novel for students in the designated conditions, teachers in the CR + MI and CR conditions were asked to prompt students to read the monitoring instructions and/or complete the monitoring practice provided in the materials. The first author also conducted several observations during the study to ensure intervention fidelity.

Materials

Testing and practice materials

The materials used for this study included two parts: the pre- and posttest materials and the practice materials for the intervention. Specifically, the pretest and the posttest included a mathematics self-efficacy measure, a metacognitive awareness inventory, a self-regulated learning measure, selected and adapted TIMSS (2011) released math items, confidence rating scales, and open-ended justification questions about students’ confidence ratings. In particular, the adapted TIMSS items were shown to the three participating teachers and an expert in middle school mathematics in advance of the study to ensure the TIMSS items for the pre- and posttest corresponded to the math content students learned in class. Items were only slightly adapted.

The materials for the intervention were teacher-generated math practice sets. These practice sets varied by condition to include CR and CR + MI in the two intervention conditions. Teachers selected the mathematics items from their practice item pool, corresponding to the math curriculum taught in their classes, in discussion with the first author and the expert in middle school mathematics. A variety of topics were included in these items (e.g., geometry, probability, and algebra). Across conditions, all students completed the same items.

Materials for monitoring instructions

The monitoring instructions for the CR + MI condition were developed and adapted from a previous study that demonstrated positive effects on students’ monitoring and learning (Sperling et al. 2012). Specifically, the monitoring instructions included three parts: (1) an introduction to monitoring, (2) specific examples for practicing monitoring with instructions, and (3) feedback on the practice examples and instructions that directed students to reflect upon their monitoring process. In Part 1, students learned about the definition of monitoring and its importance in supporting their learning. An example from the introduction was “When we stop and think about what we are doing, it is called monitoring.” In Part 2, specific examples of math items were used to demonstrate to students how to monitor. Specifically, monitoring instructions were provided to guide students’ monitoring processes during the completion of a specific math item, such as suggesting that students check their work. In Part 3, students rated their certainty about their answer to the math item and were provided feedback regarding the correct answer. Students then received additional instructions that directed them to reflect upon their monitoring process and emphasized the usefulness of monitoring. In total, students were provided with three example math items that included these three instructional scaffolds for each practice session.

Measures

The measures used in this study were all originally published in English. As the students’ first language was Chinese and they were not fluent in English, the measures were double translated. The translation process included two steps. First, two translators whose first language is Mandarin Chinese and who are also fluent in English, including the first author, translated all the instruments administered in this study. Specifically, the two translators individually translated the English instruments into Mandarin Chinese and then reconciled their translations with each other. Second, a third person, a Chinese researcher in the field of middle school mathematics and educational theories, verified the translation. The translated measures were then finalized for administration. These procedures were consistent with the Programme for International Student Assessment translation guidelines for double translation and reconciliation (PISA 2018).

Mathematics self-efficacy

Student self-efficacy was measured by the Middle School Mathematics Self-Efficacy Scale developed and validated by Usher and Pajares (2009). The instrument includes 24 items with a six-point Likert-type scale (1 = definitely false; 6 = definitely true). Original reliability estimates (Cronbach’s α) for the measure ranged from α = .84 to α = .88 across four subscales and were α = .86 for pretest and α = .82 for posttest in the present study.
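For readers who wish to verify such estimates, Cronbach’s alpha can be computed directly from an item-response matrix. The following is a minimal Python sketch; the function and the simulated responses are illustrative only, not the study’s data or analysis code:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative use with simulated six-point responses (30 students, 24 items):
rng = np.random.default_rng(0)
responses = rng.integers(1, 7, size=(30, 24))
print(round(cronbach_alpha(responses), 2))
```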

Junior metacognitive awareness (Jr.MAI)

Students’ metacognitive awareness was assessed by the Junior Metacognitive Awareness Inventory Version B developed by Sperling et al. (2002). The Jr.MAI for Grades 6 to 9 includes 18 items on a 5-point Likert scale (1 = Never; 5 = Always). Internal consistency for Jr.MAI was α = .85 reported by Sperling et al. (2002). The Cronbach’s alphas were α = .84 for pretest and α = .87 for posttest in the present study.

Self-regulated learning (SRSI-SR)

Student self-regulated learning was measured by the Self-Regulation Strategy Inventory-Self-Report developed by Cleary (2006). The inventory includes 28 items on a seven-point Likert-type scale (1 = Never; 7 = Always). Cronbach’s alpha reported by Cleary (2006) was .92. The original items were designed based on a science-learning context. We adapted the items to a mathematics context. In our study, the Cronbach’s alphas were .94 for pretest and .93 for posttest.

Mathematics achievement

Released items from an international standardized mathematics achievement test were adapted and administered to measure students’ mathematics achievement. Specifically, 10 selected and adapted mathematics items across different difficulty levels and topics from TIMSS (2011) for 8th grade were administered to all participating students. Parallel forms were used for the pretest and posttest. The assessment included three multiple-choice questions, four fill-in-the-blank questions, and three “show-all-work” questions. Students were asked to show their work and provide written justifications about their confidence for the three show-all-work questions. One point was given for each correct, dichotomously scored item. The internal consistency was α = .63 for the pretest and α = .62 for the posttest. Given that the assessment measured a wide variety of mathematics concepts with few items, the reliabilities were considered sufficient based on the revised Dutch rating system for test quality (Evers 2001).

Monitoring accuracy

Monitoring accuracy was assessed by a confidence rating provided below each mathematics item. The confidence scale ranged from 1 (not confident at all) to 10 (totally confident). Confidence bias and absolute accuracy were calculated following the formulas suggested by previous literature (e.g., Schraw and Nietfeld 1998) to indicate differences between students’ perceived performance and their actual performance. Specifically, students’ average confidence ratings were rescaled to range from 0 to 1 (i.e., subtracting one and dividing by nine). Students’ math scores were then averaged by dividing by the number of items. Confidence bias was the signed difference between the average rescaled confidence score and the average math score, and absolute accuracy was the absolute value of that difference. Thus, confidence bias ranged from −1 to 1, and absolute accuracy ranged from 0 to 1. See Table 1 for the calculation of confidence bias and absolute accuracy.
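As a concrete illustration, the following minimal Python sketch computes both indices for one student under the rescaling just described; the function and example data are hypothetical, not the authors’ analysis code:

```python
import numpy as np

def monitoring_indices(confidence, correct):
    """Confidence bias and absolute accuracy for one student.

    confidence: per-item ratings on the 1-10 scale
    correct:    1 if the item was answered correctly, else 0
    """
    rescaled = (np.asarray(confidence, dtype=float) - 1) / 9  # map 1-10 onto 0-1
    performance = np.asarray(correct, dtype=float)
    bias = rescaled.mean() - performance.mean()  # signed: positive = overconfident
    absolute_accuracy = abs(bias)                # 0 = perfectly calibrated
    return bias, absolute_accuracy

# Example: mean confidence of 8/10 (rescaled ≈ .78) but only 6 of 10 items correct
# yields a positive bias of about .18 (overconfidence).
print(monitoring_indices([8, 9, 7, 8, 8, 9, 7, 8, 8, 8],
                         [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]))
```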

Table 1 Calculation for monitoring indices

Confidence justifications

After students rated their confidence for the show-all-work questions, they were asked to justify how they arrived at their judgments. Unlike the multiple-choice and fill-in-the-blank items, these three items required students to show all their work and problem-solving steps. The direction for providing confidence justifications was “Please explain why you would rate your confidence as above.”

Coding

To capture students’ justifications of their confidence judgments, we employed a bottom-up approach to develop a coding scheme. Justifications were coded in four steps. In the first step, the first author read through all of the students’ justification responses across the pre- and posttest and then developed an initial coding scheme reflecting the variability of students’ justifications. In the second step, another researcher fluent in Mandarin Chinese joined as a second coder, read through all of the justifications independently, and then discussed the initial coding scheme with the first author to ensure that they shared an understanding of the coding categories. In the third step, the first author and the second coder coded independently. During this phase, the two researchers reconciled any disagreements found in 50% of the students’ justifications. Based on the discussion of disagreements, some coding categories were either modified or collapsed into other categories. A final coding scheme of ten categories was thus formed, reflecting the various factors that students considered when judging their performance. Overall, these categories represented different dimensions of students’ judgments of performance, including person-related, item-related, and context-related characteristics, corresponding to findings in prior research (Dinsmore and Parkinson 2013; Lin and Zabrucky 1998; Pieschl 2009).

Specifically, three coding categories mapped onto students’ person-related characteristics: (1) prior knowledge, (2) confidence, and (3) effort. In particular, prior knowledge was identified as a category when responses reflected how familiar students were with the question (e.g., “I have done a similar question before”). Another person-related category, confidence, indicated students’ consideration of how confident they were about their performance in general, without further detail (e.g., “I am confident”). The category of effort was identified when students justified their performance based on perceived effort (e.g., “Because I put effort into this question”).

Furthermore, four categories [i.e., (4) required knowledge, (5) problem-solving process, (6) item difficulty, and (7) calculation] reflected item-related characteristics tied to a particular item. Specifically, the required knowledge category was identified when students mentioned the knowledge they had for solving a specific item, making statements such as “I learned the unit of triangles well.” In addition, the problem-solving process category captured students’ explanations of how they solved the problem and the specific steps of their problem-solving procedures. For example, Item 10 asked students to calculate the interior angle sum of a pentagon. One student responded, “I drew two lines so that the pentagon becomes three triangles. The interior angle sum of one triangle is 180 degrees. Therefore, adding them up is 540 degrees.” This student thus demonstrated the process by which one particular problem was solved. When students took the item difficulty into consideration, we coded it as item difficulty (e.g., “I made my decision based on how difficult the question is”). Moreover, the calculation category was identified when students rated their confidence based on their calculation skills or accuracy (e.g., “I am just not sure about my calculation on this item”). These four categories represented characteristics of the specific items.

The remaining categories captured context-related characteristics, such as perceived task demands, along with a general unknown category. Specifically, when students rated their performance based on the format of what they wrote on the paper, we coded it as (8) format, reflecting students who were especially concerned about the written format required for the task (e.g., “I am not sure whether or not the format is right for this task”). When students stated that checking processes led to their confidence ratings, we coded it as (9) checking (e.g., “I checked my answer after I completed the item”). Finally, when students expressed uncertainty or guessing in general (e.g., “I don’t know”), we coded such statements as (10) unknown.

Finally, in the last phase, both coders independently read through all of the justifications again to finalize their ratings. Exact agreement between the two coders was 92.36% across the pretest and posttest. See Table 2 for the coding categories and response examples. Additional examples of students’ justifications by study condition are provided in Appendix A.
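Exact agreement here is simply the proportion of responses assigned the same category by both coders. A minimal Python sketch, with hypothetical category labels, is:

```python
def exact_agreement(codes_a, codes_b):
    """Proportion of responses given identical category codes by two coders."""
    assert len(codes_a) == len(codes_b)
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

# Hypothetical labels for five justification responses from each coder:
coder1 = ["required_knowledge", "prior_knowledge", "unknown", "format", "checking"]
coder2 = ["required_knowledge", "prior_knowledge", "unknown", "format", "unknown"]
print(exact_agreement(coder1, coder2))  # 0.8
```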

Table 2 Coding scheme for students’ justifications about their confidence ratings

Results

Data screening

Data were collected via printed paper copies. We first conducted data screening to identify any missing or invalid data. There were very few missing entries across the pre- and post-measures. A missing data analysis explored potential missingness patterns and identified only 1.16% missing data across all the variables. The dataset met the assumption of data missing completely at random (MCAR), χ2 (11084) = 175.485, p = 1.00. An EM (expectation-maximization) estimation method was used to impute missing values, following Peng et al. (2006).
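The EM routine itself is typically run in a statistics package; as a rough stand-in, the sketch below uses scikit-learn’s IterativeImputer, which performs a comparable iterative conditional imputation. The simulated score matrix and missingness rate are placeholders, not the study’s data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Placeholder data: 133 students by 6 measures with ~1.2% values set to missing.
rng = np.random.default_rng(1)
scores = rng.normal(size=(133, 6))
scores[rng.random(scores.shape) < 0.012] = np.nan

# Iteratively model each variable from the others until the imputations converge.
imputer = IterativeImputer(max_iter=25, random_state=0)
completed = imputer.fit_transform(scores)
print(np.isnan(completed).sum())  # 0: every missing entry has been imputed
```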

We further tested the normality of all the variables across the pre- and posttest. Absolute accuracy did not meet the assumption of normality; however, we found similar statistical findings and interpretations after performing both parametric and non-parametric analyses. Therefore, we report the parametric results here for consistency. Please see the notes in Tables 6 and 7 for the non-parametric results.

Research question 1

What are the associations among absolute accuracy, confidence bias, mathematics performance, and other psychological constructs (i.e., self-regulated learning strategies, metacognitive awareness, and self-efficacy) for Chinese 7th grade students?

The first research question examined the extent to which the key variables were associated with one another. Descriptive statistics showed that students performed well overall on both the math pretest and posttest. Furthermore, students overall had very low confidence bias and absolute accuracy scores, indicating effective metacognitive monitoring on both the pre- and posttest. See Table 3 for descriptive statistics.

Table 3 Descriptive statistics across conditions over time

Students’ math scores were negatively associated with their confidence bias scores for both the pre- and posttest. Their math scores were also negatively associated with absolute accuracy for the posttest. As expected, these two negative associations indicated that students who received lower mathematics scores tended to be overconfident and less accurate. As anticipated, students’ math scores were also found to be positively associated with metacognitive awareness and self-efficacy in mathematics.

Moreover, students’ reported self-regulatory strategy use was positively associated with their metacognitive awareness and self-efficacy, indicating that students with higher metacognitive awareness and self-efficacy tended to report more self-regulatory strategies. Notably, while students’ absolute accuracy was associated with the three psychological self-regulation measures (i.e., SRSI-SR, Jr.MAI, and Mathematics Self-Efficacy) on the pretest, there were no significant associations on the posttest. Correlation results are presented in Tables 4 and 5.

Table 4 Pearson correlations among absolute accuracy, confidence bias, mathematics performance, and other measures for pretest
Table 5 Pearson correlations among absolute accuracy, confidence bias, mathematics performance, and other measures for posttest

Research question 2

To what extent does the monitoring intervention improve 7th grade students’ mathematics performance, monitoring accuracy, reported self-regulated learning strategies, metacognitive awareness, and self-efficacy?

Our second research question examined the effects of the intervention on the dependent variables, including students’ math scores, confidence bias, absolute accuracy, metacognitive awareness, self-regulated strategy use, and mathematics self-efficacy. Prior to these analyses, we examined whether significant differences existed before the intervention was implemented. Importantly, results showed no significant pretest differences on mathematics performance [F (2, 131) = 1.17, p = .31], confidence bias [F (2, 131) = 0.26, p = .77], absolute accuracy [F (2, 131) = 0.50, p = .61], metacognitive awareness [F (2, 131) = 0.63, p = .53], self-regulated strategy use [F (2, 131) = 0.71, p = .49], or mathematics self-efficacy [F (2, 131) = 0.25, p = .77].
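Baseline checks of this kind are one-way ANOVAs comparing the three conditions on each pretest measure. A minimal Python sketch with placeholder group arrays (not the study’s data) is:

```python
import numpy as np
from scipy import stats

# Placeholder pretest scores for the three conditions (three intact classes).
rng = np.random.default_rng(2)
control = rng.normal(7.0, 1.5, size=44)
cr = rng.normal(7.1, 1.5, size=44)
cr_mi = rng.normal(6.9, 1.5, size=45)

# One-way ANOVA: a non-significant F suggests comparable groups at baseline.
f_stat, p_value = stats.f_oneway(control, cr, cr_mi)
print(f"F = {f_stat:.2f}, p = {p_value:.2f}")
```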

Mathematics performance

To examine the effect of the intervention on students’ mathematics performance, we performed a 3 × 2 (Condition |Control, CR, CR + MI| × Time |pretest, posttest|) repeated measures analysis. Results showed no significant differences among the three conditions (p = .10). Nevertheless, a significant interaction between condition and time was found [F (2, 129) = 3.39, p < .05, η2 = .05], indicating that students’ changes in mathematics achievement differed significantly over time across conditions. Further, post hoc analyses demonstrated that students in the control condition surprisingly decreased in their mathematics performance over time [F (1, 43) = 6.69, p < .05, η2 = .14]. However, there were no significant changes for the CR (p = .18) or CR + MI (p = .51) conditions between the pre- and posttest measures (see Table 6).
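A mixed-design ANOVA of this kind (condition as the between-subjects factor, time as the within-subjects factor) can be sketched in Python with the pingouin package; the file name and column names below are assumptions for illustration, as the authors do not report their analysis software:

```python
import pandas as pd
import pingouin as pg

# Long-format data: one row per student per time point, with columns
# 'student', 'condition' (Control/CR/CR+MI), 'time' (pre/post), 'math_score'.
df_long = pd.read_csv("math_scores_long.csv")  # hypothetical file name

aov = pg.mixed_anova(data=df_long, dv="math_score", within="time",
                     subject="student", between="condition")
print(aov[["Source", "F", "p-unc", "np2"]])  # condition, time, and interaction
```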

Table 6 Results for ANOVA repeated measure for math performance

Monitoring accuracy

Monitoring accuracy was assessed by the indices of confidence bias and absolute accuracy. Two 3 × 2 (Condition |Control, CR, CR + MI| × Time |pretest, posttest|) repeated measures analyses were conducted separately for confidence bias and absolute accuracy. Results demonstrated significant changes in students’ confidence bias over time [F (1, 129) = 7.14, p < .01, η2 = .05]. Specifically, students in the control condition became marginally more underconfident [F (1, 43) = 3.93, p = .05, η2 = .08]. Students in the CR + MI condition [F (1, 44) = 6.28, p < .05, η2 = .13], however, significantly changed from underconfident to overconfident between the pretest and the posttest, indicating more confidence on the posttest. There were no significant changes over time for the CR condition (p = .88). The results are presented in Table 7.

Table 7 Results for ANOVA repeated measure for confidence bias

There were no significant results for students’ absolute accuracy (between effect p = .92, within effect p = .73, interaction effect p = .38) over time.

Metacognitive awareness, self-regulated strategy use, and self-efficacy

Another three 3 × 2 (Condition |Control, CR, CR + MI| × Time |pretest, posttest|) repeated measures analyses were performed to examine the effects on students’ metacognitive awareness, self-regulated strategy use, and mathematics self-efficacy, respectively.

With the assumption of homogeneity of covariance matrices met (Box’s M = 6.95, p = .34), the repeated measures ANOVA demonstrated a significant increase in students’ metacognitive awareness, following a significant increasing linear trend, F (1, 129) = 5.48, p < .05, η2 = .04, across the three conditions. Nevertheless, there was no significant difference among the three conditions (p = .23), indicating that students in all conditions improved in their metacognitive awareness. In contrast, there were no significant differences among conditions over time in students’ self-reported self-regulated strategy use or their mathematics self-efficacy.

Research question 3

How do 7th grade students justify their performance judgments and do students’ performance justifications predict their math performance and monitoring accuracy?

For our third research question, we were interested in investigating the factors that students considered when making confidence ratings. The coding scheme captured the various dimensions that students reported, including categories of person-related, item-related, and context-related characteristics, and an unknown category. We also explored how many factors students reported when arriving at their confidence ratings. Overall, most students considered only a single factor, with only a few students considering multiple factors. Specifically, across the six items, students considered from zero to four factors, but only one student considered four factors when justifying their rating.

Among the ten identified factors, item-related categories were represented more often than other justifications. For instance, more than half of the students (pretest: 66.67%, n = 88; posttest: 63.64%, n = 84) made their performance judgments based on the extent to which they knew the item-relevant mathematical conceptual knowledge. Beyond item-relevant knowledge, a number of students cited problem-solving procedures as the basis for their performance judgments (pretest: 28.03%, n = 37; posttest: 19.70%, n = 26).

Within the person-related characteristics, most students considered prior knowledge as an important factor when justifying their confidence judgments (pretest: 18.94%, n = 25; posttest: 21.97%, n = 29). In contrast, another person-related factor, effort, was considered by very few students (pretest: 0.76%, n = 1; posttest: 0.76%, n = 1) when judging their performance. For context-related categories, students considered the format of their responses for pretest justifications (16.67%, n = 22).

While many students provided specific justifications for their confidence judgments, some students guessed their answers and did not know whether they were correct (pretest: 27.27%, n = 36; posttest: 18.94%, n = 25). Table 2 presents the frequencies of the justification categories across pre- and posttest.

We further selected the most frequently cited posttest category from each of the three dimensions (i.e., person-related, item-related, and context-related), along with the unknown category. We then performed multiple regression analyses to examine the extent to which the four selected justification categories (i.e., prior knowledge, item-relevant knowledge, formatting, and unknown) predicted students’ math performance and monitoring accuracy (i.e., the two indices of confidence bias and absolute accuracy) for the three justification items on the posttest.

Confidence bias and absolute accuracy indices were calculated for each of the three items. Students’ pretest math scores were entered at Step 1; prior knowledge, item-relevant knowledge, formatting, and the unknown category for each item were entered at Step 2.
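The two-step (hierarchical) regression for one justification item can be sketched with statsmodels; the file and column names are assumptions for illustration, with the four category predictors coded as 0/1 dummies indicating whether a student’s justification cited that factor:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("posttest_justifications.csv")  # hypothetical file name

# Step 1: pretest math score only.
step1 = sm.OLS(df["post_math"], sm.add_constant(df[["pre_math"]])).fit()

# Step 2: add the four justification-category dummies.
predictors = ["pre_math", "prior_knowledge", "item_knowledge",
              "formatting", "unknown"]
step2 = sm.OLS(df["post_math"], sm.add_constant(df[predictors])).fit()

print(step1.rsquared_adj, step2.rsquared_adj)  # change in adjusted R2 across steps
print(step2.summary())                         # individual predictors at Step 2
```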

Mathematics performance

Results showed that the multiple regression models for math posttest scores were significant for all three justification items. Specifically, students’ consideration of prior knowledge, item-relevant knowledge, formatting, and their uncertainty (i.e., the unknown category) significantly predicted their performance on the justification items (Item 1: F (5, 126) = 5.08, p < .001, R2adj = .14; Item 2: F (5, 126) = 5.03, p < .001, R2adj = .13; Item 3: F (5, 126) = 2.51, p < .05, R2adj = .06).

There were no significant individual predictors for Item 1 (ps > .36) or Item 3 (ps > .06). In comparison, prior knowledge and the unknown category were significant predictors (ps < .05) of students’ math performance on Justification Item 2. Specifically, students’ consideration of prior knowledge positively predicted their performance on Item 2, and their uncertainty negatively predicted their performance. This indicated that students who took prior knowledge into consideration when judging their performance were likely to perform well on the item, whereas those who were not sure about their judgments were likely to perform poorly. See Tables 8, 9, and 10 for the regression models.

Table 8 Multiple regression results for justification Item 1 for math performance
Table 9 Multiple regression results for justification Item 2 for math performance
Table 10 Multiple regression results for justification Item 3 for math performance

Confidence bias

We next performed regression analyses to examine the extent to which the selected justification categories predicted students’ confidence bias. The regression models were significant overall for Item 1 [F (5, 126) = 2.67, p < .05, R2adj = .06] and marginally significant for Item 3 [F (5, 126) = 2.25, p = .05, R2adj = .05]. The model was not significant for Item 2 (p = .64). The unknown category was a significant predictor (ps < .05) for both Items 1 and 3. In particular, the unknown category negatively predicted students’ bias scores, which indicated that students who were uncertain when judging their performance tended to be overconfident. These regression results are presented in Tables 11 and 12.

Table 11 Multiple regression results for justification Item 1 for confidence bias
Table 12 Multiple regression results for justification Item 3 for confidence bias

Absolute accuracy

We performed another three regression models for absolute accuracy with the same four justification categories. None of the models were significant (ps > .08), indicating that the selected justification categories did not predict students’ absolute accuracy.

Discussion

In this study, we examined the extent to which Chinese middle school students’ monitoring accuracy and mathematics achievement changed through intervention. Absolute accuracy and confidence bias were adopted as the indices of students’ monitoring accuracy. We examined the extent to which students’ metacognitive awareness, self-regulatory strategy use, and self-efficacy were associated with students’ monitoring accuracy and mathematics performance. Finally, we investigated students’ justifications for their metacognitive judgments.

Overall, this study contributes to the monitoring literature in at least three ways. First, to our knowledge, this is the first monitoring intervention study targeting Chinese middle school students in the domain of mathematics in a school setting. Though monitoring has been widely studied with college students in other domains (e.g., reading comprehension) in Western countries, it remains valuable to extend the research to younger populations in an Asian country.

In general, Asian students have relatively low self-efficacy when compared to Western students (Eaton and Dembo 1997; Klassen 2004), which may be explained by cultural differences in the sources of self-efficacy (Bandura 1994). For example, the Chinese sample in the present study demonstrated different patterns with regard to the sources of self-efficacy in mathematics when compared to a similar American sample (Usher and Pajares 2009). Students in both samples were given the same Middle School Mathematics Self-Efficacy Scale. The Chinese students in the present study reported lower scores for mastery experience and social persuasion than did the American students in Usher and Pajares (2009). Social persuasion refers to the encouragement students receive regarding their performance from others, such as teachers and parents (Bandura 1994). Thus, the lower scores on social persuasion in the current sample may be attributed to Chinese teachers’ or parents’ high standards and expectations when evaluating students’ academic progress and achievement (Kifer 2002; Kifer and Robitaille 1989; Salili 1996), which may lead to less frequent persuasion. Such differences in self-efficacy may affect students’ performance judgments as well as the effects of interventions. The included psychological measures (i.e., Mathematics Self-Efficacy, Jr.MAI, and SRSI-SR) were translated into Mandarin Chinese and demonstrated high reliabilities. The administration of these measures in Mandarin Chinese enhances their practical value and generalizability.

Second, findings indicated that 7th grade students were only slightly inaccurate in judging their performance, as reflected by the indices of absolute accuracy and confidence bias. While this finding does not directly correspond to previous monitoring research, which found students to be overconfident and inaccurate monitors across age groups (e.g., college students: Dunlosky and Rawson 2012; high school students: Bol et al. 2012; primary school students: van Loon et al. 2013), the negative associations found between the monitoring indices and mathematics performance are consistent with previous research. That is, students’ overconfidence and low accuracy were associated with poor performance. This finding held across the pretest and posttest in the present study, which confirms the critical role of monitoring in students’ academic achievement.

Last, we examined students’ justifications to understand which factors they considered when making performance judgments on mathematics tasks. Such considerations were previously explored in text comprehension tasks (Dinsmore and Parkinson 2013; Lin and Zabrucky 1998; Wang and List 2019). The present study extended this exploration to mathematics problem-solving tasks in order to investigate the factors that 7th grade students consider when making confidence ratings. For instance, Dinsmore and Parkinson (2013) found that college students considered multiple factors when forming performance judgments. In contrast, the present study found that middle schoolers tended to take only a single factor into consideration when rendering performance judgments in mathematics.

Research question 1

Associations among Absolute Accuracy, Confidence Bias, Mathematics Performance, and Other Psychological Measures.

Our first research question examined associations among key constructs, including students’ absolute accuracy, confidence bias, mathematics performance, metacognitive awareness, mathematics self-efficacy, and self-regulated learning strategy use. To do so, we asked students to complete a set of psychological measures and, after each math item, to rate on a 10-point scale how confident they were about their answer.

Interestingly, the mean scores of students’ confidence bias and absolute accuracy showed that students tended to be only slightly underconfident and slightly inaccurate, across both pretest and posttest. This is distinct from most prior work, which reported that students are inaccurate monitors who generally overestimate their performance relative to their actual performance (e.g., Dunning et al. 2003). This inconsistent finding may be due to the ease of the math items included in the current study, as students tend to be more accurate on easier items than on more difficult items (Nietfeld et al. 2005; Schraw and Roedel 1994). Moreover, from a cultural perspective, Klassen (2004) suggested that Asian individuals tend to be realistic and to generate more accurate judgments of their abilities than Western students. Our finding may therefore reflect a cultural distinction in making performance judgments.

Furthermore, the negative association between mathematics performance and confidence bias showed that students who were overconfident about their performance performed poorly on the mathematics tests. This finding is consistent with previous research demonstrating that low-achieving students are often overconfident relative to their actual performance (e.g., Bol and Hacker 2001; Labuhn et al. 2010; Stone and Opel 2000). The negative association between students’ mathematics scores and absolute accuracy indicated that low-achieving students monitored less accurately (i.e., had larger absolute accuracy scores). This finding is also consistent with previous research in other contexts, which has found that more accurate monitoring is associated with better objective academic performance (e.g., Hadwin and Webster 2013). These common findings, in which students’ overconfidence and low monitoring accuracy accompany weaker academic performance, may be explained by the theoretical framework of metacognition. Specifically, with deficits in accurate monitoring, students are likely unable to detect errors and employ strategies to remedy the obstacles they encounter during task completion (Butler and Winne 1995; Flavell 1979).

Finally, for the pretest, students’ absolute accuracy was moderately and negatively associated with mathematics self-efficacy, metacognitive awareness, and self-regulatory strategy use. That is, students who monitored their performance inaccurately were likely to have low self-efficacy in mathematics, low metacognitive awareness, and deficits in using self-regulatory strategies. These findings correspond to theories of self-regulated learning (Winne and Hadwin 1998) and metacognition (Schraw and Moshman 1995), which hold that inaccurate monitoring hinders students’ awareness and regulation of cognition, resulting in failure to enact appropriate strategies.

Research question 2

The Intervention Effects on Students’ Mathematics Performance, Monitoring Accuracy, Metacognitive Awareness, Self-Regulated Strategy Use, and Self-Efficacy.

We further examined the intervention effects on students’ mathematics performance, confidence bias, absolute accuracy, and other psychological outcomes (i.e., metacognitive awareness, self-efficacy, and self-regulatory strategy use). Results suggested that students in the control condition significantly decreased in their mathematics performance over time; no such decrease emerged in the other two conditions. Further, although students in the control condition and the CR + MI condition slightly increased in confidence bias over time, there was no significant change in the CR condition. Students across the three conditions showed increases in metacognitive awareness, while there were no significant changes in students’ self-efficacy and self-regulatory strategy use.

While there were no increases in mathematics performance for the CR and CR + MI conditions, the significant decrease for the control condition may indicate that the posttest was more difficult than the pretest and that students did benefit from the intervention conditions (i.e., the CR and CR + MI conditions). This finding deviates from a previous monitoring study by Nietfeld et al. (2006), in which students who received feedback and verbal instructions about monitoring produced better learning results. The nonsignificant increases in the CR and CR + MI conditions may be due to the low intensity of the intervention in the present study. Specifically, the intervention was delivered in a written format without teachers’ verbal directions for performing accurate monitoring. Although we found positive effects for written directions with middle school students in the United States in a previous study (Sperling et al. 2012), this may not be the case in the current sample. Including teachers’ explicit monitoring instructions may be a focused modification for future interventions.

In addition, in terms of intervention intensity, one may argue that the nonsignificant effects were due to the low frequency and short duration of the intervention. Specifically, students in the present study received three intervention sessions, one per week. This dosage may not be powerful enough to improve students’ monitoring and academic performance. Interestingly, previous monitoring interventions have demonstrated mixed results regarding intervention frequency and duration. For instance, Bol et al.’s (2012) intervention included only one treatment session yet improved students’ monitoring and academic performance, whereas Huff and Nietfeld’s (2009) monitoring intervention included multiple treatment sessions over several weeks yet produced nonsignificant effects on students’ performance. A recent systematic review that examined the characteristics of effective mathematics SRL interventions reported no consistent patterns distinguishing effective from ineffective interventions when comparing effect sizes for learning and monitoring outcomes (Wang and Sperling 2020). Thus, other factors, such as students’ ineffective use of the intervention materials, may also have contributed to the nonsignificant effects.

Furthermore, because students were already accurate monitors before the intervention, the intervention effects on students’ monitoring accuracy were weak. This finding corresponds to Bol et al. (2005), who explicitly asked college students in an overt monitoring condition to make predictions and postdictions about their performance. They reported no increases in students’ monitoring accuracy for the overt condition when compared to the control condition. The finding in the present study may indicate the ineffectiveness of monitoring instructions, especially when students are already good monitors.

Moreover, all participating students’ metacognitive awareness improved between pretest and posttest, as measured by the Jr.MAI. This finding was somewhat surprising, as we expected increases in metacognitive awareness only for the CR and CR + MI conditions. It may be that exposure to the stems of the Jr.MAI items served as an intervention in itself and led students to think metacognitively.

Research question 3

Factors Students Considered When Justifying Their Metacognitive Judgments.

Our final research question aimed to understand students’ rationales for making performance judgments, which may inform avenues for improving future monitoring interventions. We expected that students would consider a variety of factors when judging their performance. Nevertheless, most students considered only one factor, even though multiple categories were identified across the sample. This is inconsistent with prior research that investigated the formation of metacognitive judgments when reading texts (Dinsmore and Parkinson 2013; Lin and Zabrucky 1998; Wang and List 2019), in which learners were found to consider multiple dimensions or factors when rendering their performance judgments. From a developmental perspective, this inconsistency is likely a result of 7th grade students’ limitations in metacognitive awareness, which may prevent them from thinking holistically (Flavell et al. 1995). Metacognition develops as students accumulate metacognitive and educational experiences (Flavell 1976, 1979). In particular, Baker and Brown (1984) suggested developmental differences between child and adult readers, in which children tend to be less aware of their reading processes than college students. Such developmental limitations in children’s metacognition in reading may also apply to mathematics. For instance, Shilo and Kramarski (2019) examined fifth grade students’ metacognitive processes in mathematics through qualitative analyses of recorded math classroom videos. They reported that fifth graders tended to have difficulties verbalizing and justifying their learning processes, further indicating the limited metacognition of students in this age range. This may explain the limited justification factors the Chinese middle school students offered for their performance judgments in the present study.

Furthermore, students commonly considered only item-related characteristics. This focus may be due to the perceived nature of mathematics. Unlike open-ended items in other domains (e.g., social science), for which students can compose their responses in different ways, the present items required specific knowledge and objective answers. In contrast, previous studies have demonstrated students’ consideration of multiple factors when judging their performance on a writing composition task based on multiple texts (List and Alexander 2015; Wang and List 2019).

As Mosenthal (1998) suggested, text-based reading tasks require students to integrate and connect inferential information from texts and to match the information found in the texts to the task questions. It may be difficult for students to produce high-quality responses when readings are lengthy and lack cohesion. In contrast, mathematics items tend to avoid syntactic complexity and commonly ask for one objective answer (Martiniello 2009). Such characteristics of math items may directly activate the targeted mathematical knowledge necessary for solving a particular item. This may explain the differences in justifications of performance between essay questions and math problem-solving items. In addition to the prevalence of item-related factors, a person-related factor (i.e., prior knowledge) was also found to significantly predict students’ math performance. This finding likewise suggests the objectivity and specificity of mathematics problem-solving items; that is, they require specific item knowledge as well as specific prior task experience. Thus, students’ perceptions of the task context may be crucial for activating relevant knowledge and using strategies, which perhaps help students justify their performance judgments (List et al. 2019; Wang and List 2019; Wright 1981).

In addition, Chinese culture may also affect how students justify their performance judgments. For instance, Lundeberg et al. (2000) examined college students’ monitoring across five regions: the United States, the Netherlands, Israel, Palestine, and Taiwan. Students were asked to judge their performance on multiple course exams in subjects such as mathematics, biology, and psychology. Findings suggested that college students from Taiwan demonstrated more accurate monitoring and more underconfidence than students from the other four regions. Such findings may point to cultural components in students’ metacognitive judgments and in the factors they consider during monitoring. For example, the justification data in the present study demonstrated that the Chinese seventh grade students were mostly accurate about their prior knowledge for a given item when justifying their performance judgments. In particular, students who justified their performance judgments based on accurate perceptions of their prior knowledge tended also to judge their performance accurately. However, the present study was conducted only with Chinese students and lacked a comparison sample. Future research should further explore potential cultural differences in middle school students’ justification factors when making performance judgments in mathematics.

Furthermore, the unknown category negatively predicted students’ performance (Item 2) and confidence bias (Items 1 and 3). Specifically, students who were uncertain about how they arrived at their performance judgments were likely to perform poorly on the corresponding item and to feel underconfident about their performance. These findings suggest the importance of supporting middle school students’ metacognition. Empirical evidence indicates that providing students with metacognitive guidelines and feedback is a viable avenue for improving students’ metacognitive awareness as well as their monitoring accuracy (e.g., Miller and Geraci 2011b; Nietfeld and Schraw 2002; Shilo and Kramarski 2019). Additional research should explore effective metacognitive guidelines for middle school students in mathematics.

Moreover, as the items in the present study were generally easy, future interventions that include more difficult math items may reveal more about the factors Chinese middle school students consider when making performance judgments. Asking students to explain their performance judgments may also prove to be a viable instructional strategy for supporting students’ metacognition and accurate monitoring. In a seminal review, Schraw (1998) encouraged the use of a strategy evaluation matrix (SEM) and a regulatory checklist (RC) as strategies to promote students’ knowledge and regulation of cognition. Teachers who implemented these strategies observed positive increases in students’ metacognition. Moreover, according to recent studies investigating the attributions of performance judgments, teaching students about the multidimensional structure of monitoring may be another approach to improving students’ metacognitive awareness and monitoring accuracy. For instance, teachers can guide and encourage students to consider the varied person-, item-, and context-related factors that may influence their performance judgments.

Conclusions and implications

Findings from this intervention study indicated that, overall, the participating Chinese middle school students were accurate monitors in mathematics. Consistent with the extant monitoring literature, overconfidence and poorer accuracy were associated with lower mathematics performance. The study explored the potential benefits of two interventions designed to scaffold students’ metacognition. Findings indicated some support for the interventions but also pointed to potential instrumentation concerns. Given that the control group’s mathematics scores significantly decreased between the pre- and post-measures, we explored potential differences in performance on the pretest and posttest mathematics measures. Findings indicated the posttest was significantly more difficult than the pretest (z = 12.77, p < .001). As scores in neither intervention condition significantly decreased while the control condition’s scores did, the viability of the interventions for improving monitoring needs to be further explored.

Further, findings demonstrated consistent relationships among students’ math performance, confidence bias, metacognitive awareness, and self-efficacy, corresponding to previous studies in mathematics (e.g., Labuhn et al. 2010; Nietfeld and Schraw 2002; Usher and Pajares 2009). Specifically, improved mathematics performance was associated with low confidence bias, high metacognitive awareness, and high self-efficacy. While the consistent relationships among these constructs indicate their importance to students’ mathematics achievement, one challenge is that such stable relationships may be resistant to intervention. Future research should further investigate the complexity of the potential effects of interventions that target one or more of these constructs.

Also warranting consideration in future research are the dosage and delivery of metacognitive interventions. In the current study, paper-and-pencil written interventions did not successfully improve 7th grade students’ monitoring. While there are relatively few metacognitive intervention studies in middle school mathematics, Kramarski (2004) reported benefits of a verbal instruction intervention for junior high school students. Consistent with recommendations for strategy instruction (e.g., prompting students to use appropriate strategies for problem-solving questions), more frequent and longer exposure is likely necessary to realize long-term benefits for children’s metacognition. Future intervention research should carefully consider dosage implications. Like the present study, which was conducted in classrooms with interventions administered by teachers, future intervention research should strive to include teachers so that interventions integrate even more seamlessly into classroom practice.

In this study we also examined the justifications students made for their performance judgments. Findings indicate that middle school students generally consider only a single factor as they form their performance judgments. These factors were grouped into person-, item-, and context-related categories. Future research should continue to examine the rationales students use to form their performance judgments, both to advance theory and to inform instructional practice. Although students’ performance scores and the monitoring indices demonstrated high monitoring accuracy, the justification categories reveal potential deficits in metacognitive awareness. Continued investigation into effective interventions for improving middle school students’ monitoring accuracy in mathematics is needed.