Introduction

Solving math problems is a very common activity in elementary school. However, it is also a complex cognitive activity involving multiple processes. In order to successfully perform these tasks, students must integrate cognitive, metacognitive, and self-regulatory mechanisms (Cleary and Chen 2009; Montague et al. 2011) to accurately regulate and monitor their learning processes. An important aspect of monitoring is calibration, or the degree to which students’ judgments about the correctness or adequacy of their performance correspond to their actual performance, the latter determined on the basis of an objective measure (Hacker et al. 2008a). These judgments can be expressed in terms of predictions or postdictions, depending on whether they are made prior to or after completing a task. Postdictions (or post-performance judgments) are indicative of monitoring mechanisms during task performance and have been shown to be more accurate than predictions, presumably due to the additional feedback provided by performing the task (Ackerman and Wolman 2007).

The relevance of calibration mechanisms in mathematics has been substantially demonstrated, suggesting a direct and positive relationship between these metacognitive control processes and mathematics achievement. Within the context of math problem-solving, high-calibrated students tend to perform more successfully than low-calibrated students, presumably because they have more control over their problem-solving processes (Desoete and Roeyers 2006; Hadwin and Webster 2013; Jacobse and Harskamp 2012; Lipko et al. 2009; Özsoy 2012; Rinne and Mazzocco 2014). However, previous literature has shown that students are often poorly calibrated, showing a tendency towards over-confidence. Furthermore, these patterns of inaccuracy tend to persist over time (Bol et al. 2012; Dinsmore and Parkinson 2013; Hacker et al. 2008b; Stolp and Zabrucky 2009). Given the possible implications of different patterns of calibration accuracy on students’ learning, it is necessary to determine which factors influence their judgments. In this sense, a review of the current literature shows that: (a) most research has focused on the study of these mechanisms in undergraduate or secondary school students, paying less attention to elementary school ages; (b) most studies have been conducted in laboratory settings, which may be an important constraint for generalization of results; and (c) although both predictions and postdictions have been substantially studied in math problems, the amount of research conducted from the perspective of the analysis of the processes involved in these tasks is almost non-existent to date. This analysis is especially interesting in the case of postdictions, given their relationship with monitoring processes during task performance.

According to the above, the main objective of the present study was to analyze the relationship between students’ post-performance calibration accuracy and the metacognitive processes they displayed while solving two math problems. A large sample of 524 fifth- and sixth-grade students took part in this study. To address this goal, students were assessed in a regular mathematics class, using the Triple Task Procedure in Mathematics (TTPM; García and González-Pienda 2012; García et al. 2015) as a measure of the process.

Importance of calibration for problem-solving

Calibration has been characterized as an important metacognitive process involved in the development of self-regulatory competence (Dinsmore and Parkinson 2013; Hadwin and Webster 2013; Labuhn et al. 2010; Zimmerman and Schunk 2011). Because of its implications for students’ motivation, metacognitive control, and self-regulation, it is an essential process affecting academic success and task completion (Alexander 2013; Cleary 2009; Efklides and Misailidi 2010; Schunk and Pajares 2009). In this sense, as Alexander (2013) pointed out, students’ ability to accurately gauge their progress and performance plays an important role in their subsequent effort and strategy use in problem-solving situations. In this context, mathematical problem-solving is an important issue in current scientific literature. It is characterized as a complex cognitive activity, but students tend to show poor metacognitive skills while performing math problems. Many students (even those without learning disabilities) forge ahead without considering alternative approaches, jumping immediately into calculations, giving impulsive responses, and using trial and error as a strategy to solve these tasks. They commonly get stuck in irrelevant details, or fail to verify solution paths and evaluate the suitability of their responses (Cleary and Chen 2009; Kramarski and Gutman 2006; Montague et al. 2011; Pennequin et al. 2010; Pereis et al. 2009). Literature also indicates that students tend to be inaccurate when asked to predict or judge the results of their performance, both in mathematics and in other academic areas, showing a tendency towards over-confidence (Bol and Hacker 2001; Bol et al. 2005, 2010; Hacker et al. 2008a; Lipko et al. 2009; Özsoy 2012). This is an important aspect, as the calibration of students’ metacognitions to their performance has important consequences for learning outcomes (Dunlosky and Rawson 2012; Dunlosky and Thiede 2013; Finn and Metcalfe 2014; Hacker et al. 2008a).

Calibration measurement

Whereas the relevance of calibration in learning has been substantially demonstrated, an important issue that must still be addressed is the distinction between absolute and relative calibration. The term calibration used in the present study refers to absolute accuracy, or the degree of correspondence between the judged level of performance and the actual performance (Hacker et al. 2008a; Maki et al. 2005). Thus, absolute accuracy (or calibration) differs from the concept of relative accuracy, also referred to as resolution or discrimination. This component has traditionally been defined as the degree to which a person’s judgments can predict the likelihood of correct performance of one item relative to another (Hacker et al. 2008a). Both types of calibration are different aspects of metacognitive monitoring, showing low correlations with each other (Maki et al. 2005).

In calibration studies, learners are asked to make confidence judgments, which consist of judging the degree to which they think they will be able to acquire new knowledge or perform a task (prediction), or to judge their performance once the task has been completed (postdiction). This judgment is then compared with an objective measure of that learning (their actual performance), such as the score on an exam or test, after completing the task (Bol et al. 2010; Winne 2004). The more closely a student’s judgment of performance matches his/her actual performance, the better calibrated the student is (Hacker et al. 2008a). In the context of the present study, the distinction between prediction and postdiction acquires a special meaning. In this sense, postdictions have been shown to be more accurate than predictions, possibly due to the additional feedback provided by performing the task (Ackerman and Wolman 2007). This has led many authors to characterize these judgments as informative of monitoring processes during task performance, incorporating this measure into their studies (Bol and Hacker 2001; Bol et al. 2005, 2010; Hacker et al. 2008b; Nietfeld et al. 2006; Sheldrake et al. 2014).

Among the methods to calculate calibration, Parkinson et al. (2010) note dichotomous ratings, categorical ratings (such as Likert-type scales), data count, and 100-mm scales as the most commonly used measures. In studies using dichotomous ratings, students are asked to indicate, on a multiple-choice recall measure, whether they feel “confident” or “not confident” about each answered item (Schraw et al. 2012). In the case of Likert-type scales, they are requested to rate the degree to which they feel confident, ranging from “not confident” to “very confident” (Hattie 2013). Among the numerous indexes used to express judgment accuracy, different measures are feasible, from Pearson correlations to more complex measures, such as rho or the Gamma coefficient (Dinsmore and Parkinson 2013; Schraw 2009; Winne and Muis 2011). In the present study, a dichotomous measure of calibration was used. Specifically, the Phi coefficient was calculated (Schraw et al. 2012). As these authors point out, although continuous measures may be more sensitive, dichotomous judgments are the most common in the literature. This may be due to the fact that a dichotomous scoring scheme is easier for participants to understand than a more complex scheme. This is especially relevant when studies are conducted with young samples. In this sense, although there are several dichotomous measures of calibration (e.g., the G index, the Odds ratio, Gamma, Kappa, Phi, or the Sokal distance measure), Schraw et al. (2012) showed in their study that all of them present high and similar sensitivity and specificity.
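As an illustration of the dichotomous approach, the Phi coefficient can be obtained from the 2 × 2 table that crosses judgments with actual performance. The following is a minimal sketch that uses the standard textbook formula for Phi; the function name, the 0/1 coding, and the example data are assumptions made for illustration and do not reproduce the computation or data of the present study.

```python
import math


def phi_coefficient(judgments, outcomes):
    """Phi for paired binary data.

    judgments: 1 = student judged the answer as correct, 0 = as incorrect.
    outcomes:  1 = answer was actually correct, 0 = actually incorrect.
    (Hypothetical coding chosen for this sketch.)
    """
    if len(judgments) != len(outcomes):
        raise ValueError("judgments and outcomes must have the same length")

    pairs = list(zip(judgments, outcomes))
    a = sum(1 for j, o in pairs if j == 1 and o == 1)  # judged correct, was correct
    b = sum(1 for j, o in pairs if j == 1 and o == 0)  # judged correct, was incorrect
    c = sum(1 for j, o in pairs if j == 0 and o == 1)  # judged incorrect, was correct
    d = sum(1 for j, o in pairs if j == 0 and o == 0)  # judged incorrect, was incorrect

    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # Phi is undefined when a row or column of the table is empty.
        return float("nan")
    return (a * d - b * c) / denominator


# Made-up data: eight judgment/outcome pairs.
example_judgments = [1, 1, 1, 0, 0, 1, 0, 1]
example_outcomes = [1, 0, 1, 0, 1, 1, 0, 0]
example_phi = phi_coefficient(example_judgments, example_outcomes)
```

Values near 1 indicate close agreement between judgments and performance, values near 0 indicate no association, and negative values indicate systematic disagreement.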

As mentioned above, an important issue in calibration studies is the fact that students tend to be inaccurate in their performance judgments, with a tendency to over-confidence (Bouffard et al. 2011; Butler 2011; Dinsmore and Parkinson 2013; Hadwin and Webster 2013). This tendency has been found in studies at different educational stages, across different subject areas, and in mathematics in particular (Bol and Hacker 2001; Bol et al. 2005, 2010; Rinne and Mazzocco 2014; Sheldrake et al. 2014). Furthermore, there is a good deal of evidence suggesting that calibration judgments tend to be stable over time (Bouffard et al. 2011; Hacker et al. 2008a). These studies also indicate that performance judgments (and related biases) are resistant to improvement. This has led to an important line of research on the determining factors underlying students’ judgments and how to increase their accuracy. In this context, numerous studies have focused on exploring many different components. Aspects such as students’ general cognitive ability or prior knowledge, academic achievement or their level of expertise in making such judgments, task or item characteristics, affective-motivational components such as goal-setting orientation or self-efficacy beliefs, or more stable and persistent traits such as attribution styles or personality, have been proposed as possible explanatory factors (Alexander 2013; Bol et al. 2012; Dinsmore and Parkinson 2013; Hacker et al. 2008b; Hadwin and Webster 2013; Stolp and Zabrucky 2009).

Metacognition, self-regulated learning, and calibration in mathematics

Within this perspective, another important determining factor of students’ judgment accuracy is the link between calibration and metacognitive and self-regulatory mechanisms. The two terms (metacognition and self-regulation) have occasionally been used interchangeably (Dinsmore et al. 2008). However, it is important to note that there is a large difference between them. On the one hand, self-regulation is defined in the context of learning as the control students exert over their cognition, behaviour, emotion, and motivation to achieve established goals. On the other hand, metacognition (or the ability to think about thinking) refers to the cognitive component of self-regulation, in charge of processes such as self-awareness of problem-solving, monitoring, and controlling one’s mental processing (Dinsmore et al. 2008; Panadero and Alonso-Tapia 2014). Nevertheless, both metacognitive and self-regulatory components are articulated under the paradigm of Self-regulated Learning (SRL: Zimmerman 2000, 2008).

SRL explores the acquisition, evaluation, and regulation of knowledge. It is rooted in the conception of learners as being capable of monitoring their own learning, setting goals, evaluating, and regulating their own learning progress (Stolp and Zabrucky 2009). This perspective is intimately linked to the concept of calibration, or the ability to accurately judge one’s performance on a task. Calibration is relevant in this context, as it consists of a metacognitive control process that provides information about the status of one’s knowledge and strategies at a cognitive level. It also sets the stage for effective self-regulation by generating the internal feedback that students use to control their learning and performance (Bol et al. 2012; Dunlosky and Rawson 2012). In this sense, given that Zimmerman’s (2000) model establishes that self-regulated learners are more aware of what they do and do not know, they are also expected to exhibit more accurate perceptions of their performance. This statement is supported by studies showing that self-regulated learners are commonly high-achieving students who are more accurately calibrated than their peers with lower levels of self-regulation. These studies also reveal that high-achieving students tend to be somewhat under-confident, whereas low-achieving students are commonly inaccurate and over-confident when asked to predict or postdict their performance (Bol and Hacker 2001; Bol et al. 2005; Hacker et al. 2008b; Özsoy 2012). These results support the role of calibration as a metacognitive and self-regulatory process, as well as its relationship with students’ achievement.

Regarding the impact of calibration mechanisms on mathematics (and problem-solving in particular), the relationship between calibration mechanisms and mathematics has been substantially demonstrated (Desoete and Roeyers 2006; Jacobse and Harskamp 2012; Özsoy 2012; Rinne and Mazzocco 2014). First, calibration accuracy has been shown to explain between 16 and 36 % of the variance of mathematics achievement in different studies (see Jacobse and Harskamp 2012). Specifically, a recent study by Rinne and Mazzocco (2014) supported this association, suggesting that calibration accuracy of mental arithmetic judgments represents an important developmental predictor of future mathematics performance in elementary school. Second, highly calibrated students have been observed to solve math problems more successfully. Specifically, previous studies show that these students can solve more complex problems using more flexible and effective strategies, such as separating the tasks into simpler parts, or designing different ways to represent information, concepts, and relationships. They also formulate self-queries to clarify their thoughts and they tend to evaluate the result of their performance more frequently than their less accurate peers (Özsoy 2012). Hence, highly calibrated or accurate students tend to show more self-regulatory and metacognitive skills during problem-solving (Boekaerts and Rozendaal 2010; Dinsmore and Parkinson 2013; Hadwin and Webster 2013; Stolp and Zabrucky 2009; Winne and Muis 2011). However, previous research has also revealed that many students fail to use self-regulatory and metacognitive mechanisms spontaneously when solving math problems, leading to poor performance and inaccurate self-evaluations (Kramarski and Gutman 2006; Pennequin et al. 2010; Pereis et al. 2009). 
In this context, one of the greatest challenges for researchers is to obtain evidence of the cognitive, metacognitive, and strategic processes used by students during these tasks, and of how these processes relate to their ability to make accuracy judgments. This information can be gathered by means of on-line measures, such as Think-Aloud (Jacobse and Harskamp 2012; Montague et al. 2011) or Triple Task protocols (Olive and Piolat 2002; Piolat et al. 2001, 2005).

Process-based measures: the triple task procedure in mathematics (TTPM) and its predecessor

Self-regulatory and metacognitive mechanisms have been traditionally assessed by means of questionnaires or structured interviews. These kinds of measures are based on students’ self-reports about how they tackle a problem, or the extent to which they use different strategies. Students’ responses are then scored depending on their quality, and a total score is calculated. However, this kind of assessment—although widely applied—may yield inconsistent or incorrect information due to inaccurate recall or response biases, such as social desirability. In this sense, these tools have been considered useful to assess students’ declarative and situational knowledge, but they do not provide information about how learners transfer such knowledge to regulate their problem-solving process (Cleary and Chen 2009; Veenman 2011). In order to address this issue, the use of on-line methods (i.e., process-based measures) is an alternative. These measures are taken concurrently with task performance and have been shown to be especially useful to provide evidence of students’ cognitive processes during different activities (Azevedo and Aleven 2013; Throndsen 2011; Tillema et al. 2011).

An example of these measures is the Triple Task technique (Piolat et al. 2001). This procedure has a long tradition in the study of the processes involved in composition writing (Olive and Piolat 2002; Piolat et al. 2005). In traditional Triple Task studies, participants are required to perform three tasks simultaneously: a primary task (e.g., the composition of a text); a secondary probe task (based on response time [RT]); and a third task in which they are asked to verbalize or label the actions or thoughts that are interrupted by the probe. This method uses directed introspection. In this sense, probes are presented at a certain time interval, and after each reaction to a probe, participants are asked to categorize their actions or thoughts according to a given category system that includes different writing sub-processes. This technique is aimed at providing information about the activation of different sub-processes and cognitive demands as the written task progresses. For this purpose, this protocol is based on two measures: (a) introspective data, which provide information about how the writing sub-processes are activated over time; and (b) RT to the probes, which allows studying the cognitive effort involved in these sub-processes. Specifically, RTs are taken before (i.e., baseline condition) as well as during Triple Task administration (i.e., secondary probe task). RTs taken during Triple Task are called “secondary RTs” and are linked to a reported action or sub-process. Differences between RTs in the two conditions are interpreted as an estimation of the cognitive effort associated with each writing sub-process or category. Before starting the tasks, participants are trained to recognize each of the sub-processes of the given category system. As a measure of the process, the Triple Task procedure differs from Think-aloud protocols (Montague et al. 2011) in two main aspects: (a) the Triple Task procedure uses directed introspection, as opposed to the undirected introspection used in Think-aloud. In this sense, providing a system of categories facilitates students’ categorization of their thoughts or actions. It also allows the experimenter to limit the number of target sub-processes, exploring those of special interest; and (b) response transcription and coding are not required. This aspect makes the Triple Task technique less costly than Think-aloud, and therefore, also more suitable to conduct studies with larger sample sizes.
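The RT-based estimate of cognitive effort described above can be sketched as the difference between the secondary RTs linked to each reported sub-process and the baseline RT. This is only an illustrative sketch: summarizing each condition by its mean is an assumption (the original protocol may use another summary), and the category labels and RT values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cognitive_effort(baseline_rts, probes):
    """Estimate cognitive effort per reported sub-process.

    baseline_rts: RTs (ms) collected before the task (baseline condition).
    probes: iterable of (category, secondary_rt_ms) pairs collected during
            the task, where category is the sub-process the participant
            reported after reacting to the probe.
    Returns {category: mean secondary RT minus mean baseline RT}.
    Using means is an assumption of this sketch.
    """
    baseline = mean(baseline_rts)

    secondary_by_category = defaultdict(list)
    for category, rt in probes:
        secondary_by_category[category].append(rt)

    return {
        category: mean(rts) - baseline
        for category, rts in secondary_by_category.items()
    }


# Hypothetical data for one participant (labels are illustrative only).
effort = cognitive_effort(
    baseline_rts=[300, 320, 310],
    probes=[
        ("planning", 450),
        ("planning", 470),
        ("translating", 520),
        ("reviewing", 400),
    ],
)
```

Under this reading, a larger difference for a category suggests that the corresponding sub-process left fewer resources available to respond to the probe.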

Within this context, an adaptation of the Triple Task method has been recently proposed for application in mathematical problem-solving, the Triple Task Procedure in Mathematics (TTPM; García and González-Pienda 2012; García et al. 2015). This is a modification of the traditional Triple Task technique, which grants less prominence to cognitive effort (RTs) than the original version. Instead, it has been designed to examine the process underlying math problem-solving from the viewpoint of the main SRL phases of Planning, Execution, and Evaluation (Rosário et al. 2008; Zimmerman 2000). The authors combined the SRL perspective with Bransford and Stein’s (1993) IDEAL model of problem-solving to design a new category system, which serves as the basis for directed introspection. García et al. (2015) administered this protocol to 510 fifth- and sixth-grade students from Northern Spain, who carried out mathematical tasks of varying difficulty. Students in this study showed ineffective planning strategies and a lack of evaluation mechanisms. However, students who successfully solved the problems showed significantly better metacognitive skills while performing the problems, mainly related to the SRL phase of planning. These differences gained prominence as task difficulty increased, suggesting a relationship among problem-solving performance, metacognitive processes, and task difficulty. A brief review of the SRL and problem-solving models used in TTPM is presented below, before the proposed assessment protocol is described more extensively.

Self-regulated learning model

There are different theories and models that attempt to explain SRL mechanisms. Most of them share the common ground that self-regulation involves different processes (e.g., planning, monitoring, etc.) and is cyclical (i.e., previous performance provides feedback for future tasks). Authors such as Panadero and Alonso-Tapia (2014) suggest that the models proposed by Boekaerts (1999), Winne (2001), and Zimmerman (2000, 2008) are the most representative of the last few years. Within this context, Zimmerman’s model has been the most frequently cited in the literature, and the most extensively used to guide interventions (Moos and Ringdal 2012; Stoeger and Ziegler 2008). Due to its practical implications, it was selected as a framework for the TTPM design. Briefly, this model comprises three cyclical phases (forethought, performance, and self-reflection), which correspond to the processes that occur before, during, and after SRL takes place. In the forethought phase, students analyze the task and assess their capacity to perform it, establishing goals and plans to complete it. The second phase is performance, in which two types of processes are carried out: self-control and self-observation. This phase is aimed at keeping track of progress during task performance, and involves using the strategies identified during the forethought phase. These strategies can be classified as metacognitive (e.g., self-instruction, imagery, time management, etc.), and motivational strategies (e.g., self-given messages to remind one of the goal, self-rewards, etc.). The last phase is self-reflection, in which students judge their performance and formulate the reasons for their results (Panadero and Alonso-Tapia 2014).

Problem-solving model

Several problem-solving models have been proposed from different conceptual perspectives. One of these perspectives is rooted in cognitive psychology. Focused on the mental processes underlying problem-solving, some models conceive the application of general problem-solving strategies, called “heuristics”, as the basis for problem-solving expertise and performance. Heuristics involve different processes (e.g., identifying the problem, drawing a graph or diagram, thinking about structurally similar problems, simplifying the problem, reviewing, or generalizing), which are commonly expressed as a series of stages through which problem-solving progresses. The models of Polya (1954), Gick (1986), and the IDEAL Model of Bransford and Stein (1993) have been the most studied models within this perspective. The IDEAL Model describes five stages: Identifying potential problems, Defining and representing the problem, Exploring possible strategies, Acting on those strategies, and Looking back and evaluating the effects of those activities. It is currently one of the most widely used models (Newton et al. 2009, 2012). In contrast to the other mentioned models, it establishes the problem-solving process as cyclical, in accordance with Zimmerman’s SRL Model (2000, 2008). As it also provides a comprehensive system to examine the metacognitive and self-regulatory processes involved in problem-solving during performance, this model constitutes the other basis for the TTPM.

From the combination of Zimmerman’s (2000, 2008) SRL model and the IDEAL model (Bransford and Stein 1993), a system emerged with eight categories or sub-processes, organized in three higher level categories, corresponding to the main SRL phases (Rosário et al. 2008; Zimmerman 2000): Planning, Execution, and Evaluation. Table 1 shows the category system used. In accordance with previous studies, an additional category (“other”) has been included to reflect all the thoughts or activities unrelated to problem-solving performance. Some examples of processes within this category are day-dreaming (e.g., “I’m thinking about what I’m going to do this afternoon”) or possible distracters (e.g., “I’m looking out the window”). This additional category was also included in the categorization training phase, which is previous to the TTPM (described in Procedure).

Table 1 Category system. Based on the SRL phases (Rosário et al. 2008; Zimmerman 2000, 2008) and the IDEAL model (Bransford and Stein 1993)

An important issue in this category system is the allocation of the drawing/summarizing sub-process in the SRL planning phase. In this sense, imagery, or the construction of mental images to organize the information, is similar in essence to drawing/summarizing, and falls under the performance phase in Zimmerman’s (2000, 2008) SRL model. Specifically, it consists of a metacognitive strategy under self-control, as mentioned above. The location of this strategy in the planning phase in TTPM is due to the problem-solving model adopted (i.e., IDEAL model). Specifically, in this model, the Definition and representation stage is prior to execution (or Action based on the strategy). Previous studies using TTPM (García and González-Pienda 2012; García et al. 2015) confirmed the allocation of this sub-process under the SRL planning phase. This latter study analyzed the temporal sequence of the process in a sample of 350 elementary students during TTPM. This sequence was based on the percentage of students who reported each of the eight TTPM categories or sub-processes across 20 different moments while solving several math problems. Results indicated that: (a) although the drawing/summarizing sub-process tends to be present both in planning and execution phases, it is mainly activated during the first phase; and (b) the frequency of drawing/summarizing decreases dramatically as the frequency of the calculation sub-process (i.e., execution phase) increases.

Among the strengths of the TTPM are its design and implementation features (see Procedure section in the present study), which facilitate its applicability to broad samples and, more importantly, in classroom settings (García et al. 2015). In this sense, one of the most discussed aspects in calibration studies is the fact that they are commonly conducted in laboratory settings (Dinsmore and Parkinson 2013; Hadwin and Webster 2013; Winne and Muis 2011). As Hacker et al. (2008a) pointed out, although these contexts often provide important information on calibration accuracy mechanisms, generalizing these findings to different contexts, especially to classroom contexts, can be difficult. Laboratory study components, materials, and procedures (such as highly structured contexts, tasks with little meaning for students, or tasks presented over short periods of time) differ considerably from those used in classroom settings, where students must deal with multiple requirements from the environment and the task. Additionally, although a large amount of research has focused on the study of the relationship between calibration, metacognitive and self-regulatory mechanisms in mathematics (Bol et al. 2010; Hacker et al. 2008a, 2008b; Rinne and Mazzocco 2014; Sheldrake et al. 2014), and in mathematical problem-solving in particular (Jacobse and Harskamp 2012; Zimmerman et al. 2011), none of these studies has analyzed the relationship between students’ accuracy judgments and the metacognitive process involved in solving math problems. The study of this relationship is especially interesting in the case of postdictions, which have been suggested to be informative of monitoring mechanisms during task performance.

The present study

This study is aimed at examining the metacognitive process shown by a sample of 524 fifth- and sixth-grade students while solving two math problems, and how this process explains differences in students’ post-performance accuracy judgments. For this purpose, the TTPM technique (García and González-Pienda 2012; García et al. 2015) will be used as a measure of the process. Additionally, the possible influence of variables such as mathematics achievement and grade level both on metacognitive processes and calibration accuracy will be analyzed. In this sense, whereas the relationship between students’ achievement and calibration has been substantially demonstrated (Bol et al. 2005, 2010; Hacker et al. 2008b; Özsoy 2012), it is not clear whether grade level may lead to the same pattern of results. However, as different grade levels may indicate different cognitive or developmental stages (Boston and Smith 2009), this variable is expected to exert some degree of influence on the analyzed variables.

Overall, this study aimed to answer the following questions: (a) Do students with different accuracy in their post-performance judgments show differences in their problem-solving processes? (b) Which SRL phases or sub-processes explain these differences? (c) Will different patterns of calibration accuracy emerge according to students’ achievement in mathematics and grade level, and will these variables account for differences in the metacognitive processes? Prior to addressing these questions, the correspondence between students’ post-performance judgments and actual performance was estimated to establish the degree to which the students were calibrated and the tendency of their judgments (i.e., over- or under-confidence) in both problems. In addition, post-performance judgment and actual performance stability across problems were analyzed. Previous laboratory and classroom studies have shown that calibration tends to be relatively stable over time and across tasks. Students also tend to be more stable in their performance judgments than in their actual performance (Bol et al. 2005; Bouffard et al. 2011; Hacker and Bol 2004; Hacker et al. 2008b).

Given these assumptions and according to previous research results, it is expected that:

1) Students will be imprecise in their judgments, showing a tendency towards over-confidence.

2) Students will be stable in their judgments and actual performance across problems. This stability will be higher in judgments than in actual performance.

3) Differences in the metacognitive process (TTPM phases and sub-processes) will be found between students with different accuracy in their post-performance judgments (Accurate group = when students’ judgments and actual performance match vs. Inaccurate group = when they do not). Accurate students will show more effective metacognitive strategies during TTPM, such as using different forms of organizing and representing the information, more frequently than their inaccurate peers. They will also evaluate the progress and results of their performance more than students in the inaccurate group.

4) Different patterns of calibration accuracy will emerge as a function of students’ mathematics achievement level. High-achieving students will be more accurate and under-confident than their peers with lower levels of achievement. This variable is also expected to generate differences in the metacognitive process during TTPM. High-achieving students will use more effective problem-solving strategies (e.g., more frequent use of information organization and representation strategies, better evaluation mechanisms).

5) If grade level has an impact on calibration and the metacognitive process during TTPM, sixth-grade students will show higher levels of calibration accuracy, and better planning and evaluation strategies.

Materials and methods

Participants

Five-hundred and twenty-four students from fifth and sixth grade of elementary school took part in this study. They were recruited from 11 schools in Northern Spain. Ages ranged between 10 and 13 years (M = 10.991, SD = 0.716). Of these students, 260 (49.6 %) were female, and 264 (50.4 %) were male. The sample comprised 220 students from the fifth grade (42 %; male = 108, female = 112) and 304 students from the sixth grade of elementary school (58 %; male = 156, female = 148).

Students were selected through convenience sampling. They volunteered for the study and presented signed parental permission to participate. Children with a diagnosis of severe learning disabilities or those who showed reading difficulties were excluded from the analyses.

Given that data analyses were conducted separately for each math problem, students were distributed according to the accuracy of their post-performance judgments in each problem (Accurate vs. Inaccurate groups). Post-performance judgments and actual performance were both expressed in dichotomous terms (Success = 1, Failure = 0). Students were assigned to the accurate group when their post-performance judgments matched their actual performance, and to the inaccurate group when there was a mismatch. There were no statistically significant group differences in age (p = .954), grade level (p = .489), and gender distribution (p = .253) in Problem 1. Regarding Problem 2, gender (p = .177) and grade level (p = .405) were equally distributed in the two groups, and there were no group differences in age (p = .921).

Variables and measurement instruments

In accordance with the objectives proposed in this study, three sets of measures were used: those related to the problem-solving process (TTPM); those applied to calculate post-performance calibration; and a measure of students’ achievement in mathematics.

Process measures

Evidence of students’ problem-solving processes was obtained by means of the Triple Task Procedure in Mathematics (TTPM; García and González-Pienda 2012; García et al. 2015), previously described in the present study. The main characteristics of the TTPM are related to its category system, based on Bransford and Stein’s (1993) IDEAL Model and the SRL perspective (Rosário et al. 2008; Zimmerman 2000). It comprises eight sub-processes (reading, drawing or summarizing, recalling similar problems, thinking about a solution, mental calculation, writing a response, reviewing, correcting mistakes) grouped into three main phases (planning, execution, evaluation). Dependent variables were the relative frequencies with which students selected each category and phase, expressed as percentages.

Post-performance calibration measures

Three measures were established (actual performance, post-performance judgments, and a calibration index):

Actual performance was obtained by asking students to complete two math word problems taken from the book “Problem-solving and comprehension” (Whimbey and Lochhead 1999), published in Spanish. Actual performance in the problems was expressed in terms of success (1) or failure (0). Although the second problem involves two different questions (number of dogs and their breeds), no partial scores were awarded in order to unify the measurement system. Thus, both parts of the problem had to be answered correctly to receive credit. Prior to commencing the study, these problems were reviewed by teachers to ensure that they were developmentally appropriate for students. The two math problems are presented below:

  1. Problem 1

    “Beatriz lends 700 to Susana. But Susana borrows 1500 from Esther and 300 from Juana. In addition, Juana owes Esther 300 and Beatriz 700. One day they meet at Beatriz’s home to settle their debts. Who went back home with 1800 more than she brought?”

  2. Problem 2

    “Paula, Mari, and Juana have a total of 16 dogs, 3 of which are poodles, 6 are hounds, and the rest of them are German shepherds and Pekinese dogs. Juana does not like poodles and Pekinese dogs, but she has 4 hounds and 2 German shepherds, leading to a total of 6 dogs. Paula has a poodle and 2 more dogs, which are German shepherds. Mari has 3 Pekinese dogs and several dogs of other breeds. Which breeds, and how many dogs of each breed, does Mari have?”

Post-performance judgments were obtained by asking students after each problem whether they considered that they had solved the problem successfully. They had to respond in dichotomous terms (yes/no), and the values of 1/0 were assigned, respectively. The dichotomous nature of the post-performance judgment scores is based on two criteria: (a) the characteristics of the problems (i.e., actual performance was established as success or failure); (b) an attempt was made to provide participants with an easier scoring scheme, given the youth of the sample (i.e., fifth- and sixth-grade students).

Finally, a calibration index was calculated as the correspondence between post-performance judgments and actual performance. As both variables were dichotomous, the Phi correlation coefficient (ρ ϕ) was used for this purpose.

Mathematics achievement level

Students’ final academic grades in mathematics were used as an indicator of mathematics achievement. This information was provided by the teachers in charge of mathematics, after the parents’ permission was obtained. Academic grades ranged from 0 to 10 points, although three levels of achievement were distinguished: low achievement (grades of 5 or below), medium achievement (grades between 5 and 7.5), and high achievement (grades over 7.5). Due to the dichotomous nature of the calibration measure, and the fact that the process variables did not meet normality conditions (see Table 4), regression analyses with achievement as a continuous variable were unsuitable. Thus, this distinction was made to determine whether different levels of mathematics achievement lead to different patterns of calibration accuracy, as well as to differences in the metacognitive process during TTPM.
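The three achievement bands described above can be sketched as a simple mapping from grade to level. This is only an illustration: the function name is hypothetical, and the handling of the boundary values (a grade of exactly 5 counted as low, exactly 7.5 counted as medium) is an assumption inferred from the wording "5 or below" and "over 7.5".

```python
def achievement_level(grade: float) -> str:
    """Map a final mathematics grade (0-10 scale) to an achievement band.

    Boundary handling is an assumption: 5 -> low, 7.5 -> medium.
    """
    if not 0 <= grade <= 10:
        raise ValueError("grade must lie between 0 and 10")
    if grade <= 5:
        return "low"
    if grade <= 7.5:
        return "medium"
    return "high"


# Hypothetical grades, for illustration only.
for g in (4.0, 5.0, 6.5, 7.5, 9.0):
    print(g, achievement_level(g))
```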

Procedure

The study was conducted in accordance with the Helsinki Declaration of the World Medical Association (Williams 2008), which reflects the ethical principles for research involving humans. The evaluation was administered collectively during a regular mathematics class, in the students’ usual classroom. Students performed the TTPM on their personal computers. This session was part of a broader research project, and this was the third time that examiners and students worked together. Although two trained examiners carried out the evaluations, different groups of students were assigned to each one, so that the children had only one examiner as a reference. Teachers were not present in this evaluation session. Data collection was implemented through the Moodle platform. For this purpose, a multidisciplinary team including psychologists, teachers, and a computer engineer collaborated during the study.

Participants in this study were not given any incentive to take part. However, every attempt was made to design the materials and the evaluation procedure to be appealing for students (e.g., assessment protocol administered through a hypermedia environment, the Moodle platform; category system displayed by means of pop-ups; the use of graphics). Fig. 1 shows an example of the interface used to present the category system. In addition, the hypothetical case of a boy of their age (Alex), who tried to solve a math problem, was used to introduce the TTPM procedure and guide the training phase. Before the evaluation, students were told that it was not an exam. However, they were encouraged to do their best because the main goal of this study was to learn how they solved math problems in order to help them to improve their problem-solving skills and get better grades.

Fig. 1 Example of category system presentation during TTPM. This figure shows seven of the nine categories (reading, drawing/summarizing, recalling, thinking, calculating, reviewing, and correcting)

The first step consisted of a training phase to help students to familiarize themselves with the system of categories and the assessment procedure. This phase started by presenting Alex and the different actions he carried out while solving a problem. After training, students performed a category-recognition test consisting of 18 multiple-choice items (2 per category) with four response alternatives. Students had to indicate the category that best expressed each proposed activity, including the “other” category. For instance, for the statement, “Alex realized he made a mistake, so he is erasing”, the alternatives were: “Alex is thinking about a solution”, “Alex is writing”, “Alex is reviewing”, or “Alex is correcting mistakes”. The length of the training phase is variable and depends on students’ engagement (comments, questions, etc.). However, it must never last more than 20–25 min (García et al. 2015). Students with scores below 16 points on the category-recognition test (around 90 % of correct responses) were excluded from the analyses. Originally, the sample was comprised of 567 students, of whom 43 (7.58 %) scored below the cut point. The remaining 524 students (included in this study) obtained a mean score of 16.984 (SD = 0.493) on this test.

Because of the characteristics of the TTPM, only two math problems were used in the present study. This technique establishes that training and administration of the evaluation protocol must be conducted in the same session. Additionally, this method is designed to match the length of a regular class session (40–45 min). Thus, time has to be carefully assigned to the different TTPM components, taking into consideration that students must perform the problems without time pressure.

Before starting the TTPM administration, students performed a reaction time (RT) task, responding to an image accompanied by a tone and presented at varying time intervals (10–15 s). They had to click on the stimulus appearing on the screen as quickly as possible. A total of 30 beeps were presented. Because this task was aimed at familiarizing the students with the response method, an index of cognitive effort was not calculated in the present study. Students were told that the same stimuli would appear while solving the problems, after which a box with the category system would appear on the computer screen. They had to indicate the sub-process they were engaged in at each moment. During TTPM performance, the visual-auditory stimuli were presented at intervals of 40–45 s. Stimulus presentation during TTPM lasted until students indicated they had finished each problem by clicking on a “finish” button set up for this purpose. This task was designed so that, regardless of each student’s response speed and the time they started or finished each problem, the time intervals between probes were the same for all the participants.

An important constraint in the Triple Task technique is the choice of the time interval with which the probe interrupts the primary task. Piolat et al. (1996) examined whether the choice of a particular time interval affects text production, cognitive effort, and temporal organization of the writing process. In their study, three groups completed the Triple Task protocol under different time intervals: a high-rate group (auditory probe every mean interval of 15 s, varying between 10 and 20 s), average-rate group (auditory probe every 30 s, varying between 15 and 45 s), and low-rate group (auditory probe every 45 s, varying between 30 and 60 s). The results showed that there were no differences either in the temporal organization of the writing processes or in text production as a function of time interval. However, secondary RTs, used as a measure of cognitive effort in their study, were affected by this variation. Specifically, longer RTs were found in both the low- and high-rate groups. Subsequently, García and Rodríguez (2007) found in their study that the longer the time interval, the better the performance in composition writing. Results from both studies suggest that high-rate patterns are more intrusive than average- and low-rate patterns in terms of cognitive effort and even of performance. These studies focused on writing composition. However, the number of phases and sub-processes analyzed and the time devoted to them may vary considerably depending on the cognitive task under investigation. In addition, excessively long time intervals may result in a loss of information, as certain processes may occur between probes. In order to reach a balance between informative capacity and the degree of interference caused by the assessment procedure, a time interval of 40–45 s was established in the present study.

Data were collectively gathered, and headphones were given to the students to prevent them from disturbing each other. Data were collected and stored on the Moodle platform for later analyses. Students accessed this platform through an individual username and password in order to guarantee anonymity. Math problems were presented on the computer screen and on paper. Students could use the paper to write, draw, take notes, etc. They were asked to write their answer on the paper after finishing each problem. After students reported finishing the problem by pressing the “finish” button, a new box appeared on the computer screen displaying the following question: “Do you think that you have solved the problem successfully?” Responses were coded as yes (1) or no (0). Students’ achievement in the math problems was established in terms of success (1) or failure (0), according to their written answers.

Once this information was coded, students were assigned to groups as a function of the correspondence between post-performance judgment and actual performance in each problem. Students who judged their performance accurately were assigned to Group 1 (Accurate), whereas those who failed in their judgments were assigned to Group 2 (Inaccurate). The number of students who reported having failed to solve the problem when they actually did solve the problem successfully (under-confident students) was about 1–2 % in both problems. This made it unfeasible to compare under- and over-confident groups in further analyses. Group composition differed between the two problems, and separate analyses were conducted for each problem.
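The group assignment just described can be illustrated with a short sketch. The function name and the returned labels are hypothetical; only the 1/0 coding of judgments and performance comes from the text.

```python
def classify_calibration(judgment: int, performance: int) -> str:
    """Label one student's calibration on one problem.

    judgment:    1 if the student reported solving the problem, else 0
    performance: 1 if the written answer was correct, else 0
    """
    if judgment not in (0, 1) or performance not in (0, 1):
        raise ValueError("both inputs must be coded as 0 or 1")
    if judgment == performance:
        return "accurate"
    # A mismatch is over-confidence when success was claimed but not achieved.
    return "over-confident" if judgment == 1 else "under-confident"


def accuracy_group(judgment: int, performance: int) -> int:
    """Return 1 (Accurate) or 2 (Inaccurate), as in the grouping above."""
    return 1 if classify_calibration(judgment, performance) == "accurate" else 2
```

Under this coding, both kinds of mismatch fall into Group 2, which matches the decision not to analyze under-confident students separately.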

Process variables were based on relative frequency counts. In order to minimize the effect of students’ differences in the total number of reported categories, the frequency of each category was established by dividing the number of times that category was selected by the total number of selections across categories. In addition, and given the high variability shown by students in their process profiles, with large differences in frequencies across students and categories (see means and standard deviations in Table 4), frequency counts were then transformed into percentages. Relative frequencies were multiplied by 100 in order to simplify data analysis and interpretation. This high variability also made it necessary to conduct non-parametric analyses, as described below.
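As a minimal sketch of this transformation, the following code converts one student's sequence of probe responses into percentages per category. The category labels and the example sequence are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter

# Hypothetical labels for the eight sub-processes plus the "other" category.
CATEGORIES = (
    "reading", "drawing_summarizing", "recalling", "thinking",
    "calculating", "writing", "reviewing", "correcting", "other",
)


def category_percentages(selections):
    """Return the percentage of probes assigned to each category."""
    counts = Counter(selections)
    total = len(selections)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    # Relative frequency = selections of a category / all selections, times 100.
    return {category: 100.0 * counts[category] / total for category in CATEGORIES}


# Hypothetical student who answered five probes while solving one problem.
example = ["reading", "thinking", "calculating", "calculating", "writing"]
percentages = category_percentages(example)
print(percentages["calculating"])  # 40.0
```

Expressing each category as a share of the student's own total makes profiles comparable across students who answered different numbers of probes.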

Data analysis

To address the proposed goals, the data were analyzed in five steps: (a) calibration, defined as the correspondence between post-performance judgments and actual performance, was calculated with the Phi coefficient (ρ ϕ). This statistic was used due to the dichotomous nature of these variables (Adeyemi 2011; Schraw et al. 2012). Although other dichotomous measures of calibration are available, showing similar specificity and sensitivity, the Phi coefficient was used, as it is provided by the SPSS statistical package and easily interpreted in terms of association or correspondence; (b) considering the dichotomous nature of the variables, McNemar’s χ 2 statistic was calculated to analyze the stability of post-performance judgments and actual performance across problems. This statistic is based on a 2 × 2 cross-classification of paired responses to a dichotomous item. Data from the concordant cells of the cross-classification table provide an indicator of stability, whereas χ 2 focuses on the change (i.e., non-concordant cells), and establishes whether this change is statistically significant (Adedokun and Burgess 2012). A p value < .05 indicates the existence of statistically significant differences between the responses given at two different moments (e.g., performance judgments in Problems 1 and 2). Because these two indicators capture different aspects of the data, high stability can be found even when the change is significant; (c) prior to analyzing differences in the process between the groups with different accuracy, the distributions of the dependent variables were examined, and students’ metacognitive process during mathematical problem-solving was described.
Dependent variables included the three SRL phases (planning, execution and evaluation), and the eight TTPM categories or sub-processes (reading, drawing or summarizing, recalling similar problems, thinking about solutions, calculating, writing a response, reviewing, and correcting mistakes); (d) as the distributions of the dependent variables did not meet normality conditions (see Table 4), non-parametric analyses were conducted to examine group differences. Specifically, the Mann-Whitney U-test was applied, using Cliff’s delta (δ) as a measure of effect size (Macbeth et al. 2011). The non-parametric nature of this statistic reduces the influence of characteristics such as distribution shape, dispersion differences, and extreme values. It provides a measure of dominance, or the degree of overlap between two distributions of scores. The value of this statistic ranges from −1 (if all scores in Group 2 are larger than the scores in Group 1) to +1 (if all scores in Group 2 are smaller than the scores in Group 1), and takes on the value of zero if the two distributions are similar (i.e., absence of significant group differences in the measured variables). Cohen (1988) established a bridge between Cohen’s d and Cliff’s δ statistic. In this sense, a δ value of .147 corresponds to an effect size of d = .20 (small effect); a δ value of .330 corresponds to an effect size of d = .50 (medium effect); and a δ of .474 corresponds to an effect size of d = .80 (large effect); and (e) the emergence of different calibration accuracy patterns and possible differences in the metacognitive process as a function of mathematics achievement and grade level were examined. Phi coefficient (ρ ϕ ) was calculated for each group of students (low, medium, and high achievers; fifth- and sixth-grade students) as an indicator of calibration accuracy. Differences in the metacognitive process among groups with different levels of achievement were examined through the Kruskal-Wallis statistic.
As the non-parametric nature of these statistics did not allow conducting post-hoc analyses, differences between pairs of groups were analyzed using the Mann-Whitney U-test. This statistic was also used to analyze differences in the process between fifth- and sixth-grade students. Cliff’s delta (δ) was calculated as a measure of effect size.
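To make the effect size measure concrete, the sketch below computes Cliff's delta by direct pairwise comparison, following the sign convention stated above (positive when Group 1 scores dominate Group 2). It is an illustration, not the Cliff's Delta Calculator used in the study; the function names are hypothetical. The second function relies on the standard identity linking delta to the Mann-Whitney U statistic, |δ| = |1 − 2U / (n1 · n2)|.

```python
def cliffs_delta(group1, group2):
    """Cliff's delta: (pairs with x > y minus pairs with x < y) / (n1 * n2)."""
    n1, n2 = len(group1), len(group2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must contain at least one score")
    greater = sum(1 for x in group1 for y in group2 if x > y)
    less = sum(1 for x in group1 for y in group2 if x < y)
    return (greater - less) / (n1 * n2)


def delta_magnitude_from_u(u, n1, n2):
    """Absolute Cliff's delta implied by a Mann-Whitney U and the group sizes."""
    return abs(1 - 2 * u / (n1 * n2))


# Hypothetical scores: complete dominance in either direction, then full overlap.
print(cliffs_delta([3, 4, 5], [1, 2]))        # 1.0
print(cliffs_delta([1, 2], [3, 4, 5]))        # -1.0
print(cliffs_delta([1, 2, 3], [1, 2, 3]))     # 0.0
```

As a rough check, a U of 29143.500 with groups of 221 and 303 students (the values reported in the Results for drawing or summarizing in Problem 1) implies a magnitude of about .13, in line with the reported δ = .129.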

SPSS v.19 (Arbuckle 2010) was used to carry out the statistical analyses. However, as this program does not provide any non-parametric alternative to obtain an estimation of effect size, Cliff’s Delta Calculator (CDC: Macbeth et al. 2011) was used to calculate Cliff’s delta. A p-value ≤ .05 was established as the criterion of statistical significance.

Results

Correspondence between post-performance judgments and actual performance

Table 2 shows the frequency distribution of post-performance judgments and actual performance, as well as the Phi correlation coefficients and their statistical significance in the two math problems. In Problems 1 and 2, a total of 221 (42.2 %) and 278 (53.1 %) students, respectively, were accurate in their post-performance judgments, whereas 303 (57.8 %) and 246 (46.9 %) were not. The correlation between the two variables (post-performance judgments and actual performance) was .207 (p < .001) in Problem 1 and .272 (p < .001) in Problem 2. Thus, a weak, positive relationship between post-performance judgments and actual performance was found. Given the weakness of this relationship, there was a considerable mismatch between students’ post-performance judgments and their actual performance.

Table 2 Table of cross-classified data frequency according to correspondence between post-performance judgments and actual performance, and phi correlation (ρ ϕ ). Problems 1 and 2

As Table 2 also shows, inaccurate students displayed a tendency towards over-confidence. Specifically, 298 students (56.871 %) in Problem 1 and 232 (44.275 %) in Problem 2 reported having successfully solved the problem when, in fact, they gave an incorrect response. On the other hand, the number of students who erroneously reported having failed to solve the problem (i.e., under-confident students) accounted for less than 1 % (n = 5) and 2.671 % (n = 14) of the sample in Problems 1 and 2, respectively. At this point, it is also important to note the students’ low rates of success in solving both math problems, as only 156 students (29.771 %) gave a correct answer to Problem 1, and 185 (35.305 %) to Problem 2. These data contrast with the high perception of success shown by students, with 449 (85.687 %) and 403 (76.908 %) students reporting having successfully solved Problems 1 and 2, respectively.
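The Phi coefficients can be checked against the frequencies reported in this section. The sketch below is an illustration under one assumption: the four cell counts are not copied from Table 2 but derived from the figures given in the text (for Problem 1, 156 correct answers of which 5 were judged as failures, 298 over-confident students, and the remaining 70 who failed and said so; for Problem 2, 185, 14, 232, and 107, respectively).

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table of judgment (rows) by performance (columns).

                   correct   incorrect
    judged yes        a          b
    judged no         c          d
    """
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("a row or column total is zero; phi is undefined")
    return (a * d - b * c) / denominator


# Cell counts derived from the frequencies reported in the text (N = 524).
phi_problem1 = phi_coefficient(a=151, b=298, c=5, d=70)
phi_problem2 = phi_coefficient(a=171, b=232, c=14, d=107)
print(round(phi_problem1, 3), round(phi_problem2, 3))  # 0.207 0.272
```

The derived tables reproduce the reported coefficients, and the large b cell in each table shows that the weak association is driven mainly by over-confident judgments.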

Stability of post-performance judgments and actual performance

Data about students’ stability in post-performance judgments and actual performance are provided in Table 3. Regarding post-performance judgments, 354 students reported solving Problems 1 and 2 successfully, whereas 26 students reported being unsuccessful in both problems. This means that a total of 380 students (72.5 % of the sample) were stable in their post-performance judgments across problems. Thus, only 27.5 % of the students showed a change in their judgments. McNemar’s χ 2 statistic indicated that this change was statistically significant (p < .001). Actual performance followed a similar pattern: 252 students were unsuccessful solving Problems 1 and 2, whereas 69 solved both problems successfully. This indicates that 321 students (61.2 % of the sample) were stable in their actual performance, whereas 38.8 % were not. This change in actual performance across problems was statistically significant (p = .039).
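The logic of McNemar's test can be sketched from the discordant cells alone. The counts below are derived from figures reported in the text rather than taken from Table 3: for judgments, 449 and 403 students claimed success in Problems 1 and 2 and 354 did so in both, leaving 95 and 49 students who changed in each direction; for performance, 156 and 185 students succeeded and 69 did so in both, leaving 87 and 116. The code uses the uncorrected chi-square form, which is an assumption; statistical packages may apply a continuity correction or an exact test, so the p values need not match the reported ones exactly.

```python
import math


def mcnemar_chi2(b, c):
    """Uncorrected McNemar statistic and p value from the two discordant cells."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Discordant counts derived from the reported frequencies (assumption, see above).
judgment_stat, judgment_p = mcnemar_chi2(b=95, c=49)
performance_stat, performance_p = mcnemar_chi2(b=116, c=87)
print(round(judgment_stat, 2), judgment_p < 0.001)
print(round(performance_stat, 2), performance_p < 0.05)
```

The example also shows why stability and significant change can coexist: the test ignores the 380 and 321 concordant students and asks only whether the changes that did occur were balanced in direction.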

Table 3 Table of cross-classified data frequency according to post-performance judgments and actual performance stability, and McNemar’s χ 2 statistic. Problems 1 and 2

Differences in the metacognitive process shown by students with different calibration accuracy

Descriptive statistics for each TTPM sub-process and phase are shown in Table 4. High standard deviations indicated high variability across students in their problem-solving process. Kurtosis and skewness values confirmed the unsuitability of using parametric analyses. High variability across students was also observed in the total time that they spent on the problems. Specifically, students spent a mean of 376.794 s (SD = 151.491) on Problem 1, and 339.084 s (SD = 125.626) on Problem 2.

Table 4 Descriptive statistics for dependent variables (TTPM variables). Problems 1 and 2

Table 4 also provides information about students’ metacognitive process before differences in calibration accuracy were considered. As can be observed, profiles are quite similar in both problems. Students reported spending a greater amount of time on calculations in comparison to the rest of sub-processes. Within the planning phase, students reported spending a large amount of time thinking about solutions, in contrast to other sub-processes such as recalling similar problems (i.e., previous knowledge and experiences) or drawing/summarizing (i.e., organization of the information). Finally, the sub-processes of the evaluation phase (i.e., reviewing and correcting mistakes) were the least frequently reported by students in both problems.

With regard to differences between groups with different accuracy, the Mann-Whitney U-test indicated the existence of statistically significant differences in some sub-processes and phases. Regarding Problem 1, differences in the sub-processes of drawing or summarizing (U = 29143.500, p = .006, δ = .129), writing (U = 29760.000, p = .026, δ = −.111), and reviewing (U = 29301.000, p = .006, δ = −.125) were found, as well as in the phases of planning (U = 29807.500, p = .032, δ = .110) and evaluation (U = 28340.500, p = .001, δ = −.153). Means in Table 5 show that students who were accurate in their post-performance judgments reported drawing or summarizing (and also planning) more frequently than their inaccurate peers. Accurate students also reported having spent less time on writing, reviewing, and on the evaluation phase than their inaccurate peers. In Problem 2, a similar pattern of results was observed. Statistically significant differences were found in the sub-processes of drawing or summarizing (U = 29101.000, p = .001, δ = .149) and writing (U = 28067.000, p = .001, δ = −.179), and in the phases of planning (U = 29453.000, p = .006, δ = .139) and evaluation (U = 31088.500, p = .042, δ = −.098). Although differences in correcting mistakes were not statistically significant, accurate students reported doing this less frequently than students in the inaccurate group in both problems.

Table 5 Means (M) and Standard Deviations (SD) of the groups with different calibration accuracy in TTPM variables. Problems 1 and 2

Differences in calibration and the metacognitive processes as a function of math achievement level

Table 6 shows the existence of different patterns of correspondence between post-performance judgments and actual performance as a function of students’ achievement level (low-medium-high). As can be observed, calibration accuracy improved progressively as students’ achievement level increased. This change was more evident in Problem 2, where ρ ϕ reached values close to .50 in the high-achieving group. This improvement in calibration accuracy was reflected in over-confidence rates. Specifically, 124 (61.083 %) low achievers made over-confident judgments in Problem 1, whereas only 64 (45.714 %) high achievers did so. In Problem 2, the number of over-confident students ranged from 114 (56.157 %) to 37 (26.428 %) in the groups of low and high achievers, respectively.

Table 6 Patterns of calibration accuracy as a function of achievement level in math. Problems 1 and 2

Regarding group differences in the metacognitive process, the Kruskal-Wallis statistic revealed statistically significant differences in the sub-processes of reading, K-W (χ 2) = 9.332, p = .009 in Problem 1; and K-W (χ 2) = 16.243, p < .001 in Problem 2, and drawing/summarizing, K-W (χ 2) = 6.904, p = .032 in Problem 1; and K-W (χ 2) = 15.636, p < .001 in Problem 2. High-achieving students in mathematics reported having read less, but using representation strategies more frequently than their low-achieving peers (see Table 7). Mann-Whitney’s U-test showed that: (a) the groups with low and medium achievement only differed in the sub-process of drawing/summarizing in Problem 1 (U = 16332.500, p = .037, δ = −.111); (b) differences in this sub-process were also found between the groups with medium and high achievement in both problems (Problem 1: U = 11007.000, p = .032, δ = −.131; Problem 2: U = 10363.500, p = .003, δ = −.182); and (c) the groups with low and high achievement showed the strongest differences in both problems. They differed in drawing/summarizing (Problem 1: U = 10938.500, p < .001, δ = −.230; Problem 2: U = 11128.000, p < .001, δ = −.216), as well as in reading (Problem 1: U = 11537.500, p = .003, δ = .188; Problem 2: U = 11961.500, p = .012, δ = .158). Effect sizes, although higher than in previous analyses, were low.

Table 7 Means (M) and Standard Deviations (SD) of the groups with different math achievement levels in TTPM variables. Problems 1 and 2

Differences in calibration and metacognitive processes as a function of grade level

Table 8 shows patterns of calibration accuracy as a function of grade level (fifth and sixth grade) in Problems 1 and 2. As can be observed, calibration accuracy was low, especially in the case of sixth-grade students in Problem 1 (ρ ϕ  = .114). In contrast to mathematics achievement, calibration accuracy did not present a distinguishable pattern of change as a function of grade level. As in previous analyses, calibration patterns were characterized by a marked tendency towards over-confidence. Specifically, 123 (55.909 %) and 104 (47.272 %) fifth-grade students were over-confident in Problems 1 and 2, respectively. Sixth-grade students were over-confident in 175 (57.565 %) and 128 (42.105 %) of the cases, respectively.

Table 8 Patterns of calibration accuracy as a function of grade level. Problems 1 and 2

Regarding group differences in TTPM sub-processes and phases, means in Table 9 indicate that both groups showed a similar metacognitive process. This was confirmed by the Mann-Whitney U-test. Specifically, statistically significant group differences were only found in Problem 1, in the sub-processes of calculating (U = 29533.500, p = .022, δ = −.117), and reviewing (U = 29414.500, p = .008, δ = −.120). Sixth-grade students reported calculating and reviewing more frequently than fifth graders in this problem. However, the same pattern of results was not found in Problem 2. Effect sizes for the differences were low.

Table 9 Means (M) and Standard Deviations (SD) of fifth- and sixth-grade students in TTPM variables. Problems 1 and 2

Discussion and conclusions

The present study was aimed at examining patterns of post-performance calibration and their relationship with the metacognitive process shown by a broad sample of elementary school students while solving two math problems. In order to evaluate the process, the Triple Task Procedure in Mathematics (TTPM: García and González-Pienda 2012; García et al. 2015) was administered to 524 fifth- and sixth-grade students in a regular mathematics class. Once the TTPM was completed, the students were asked to judge the results of their performance in terms of success or failure. These judgments were then compared with their actual performance, also expressed in dichotomous terms. The possible influence of students’ mathematics achievement and grade level on calibration accuracy patterns and the metacognitive process was analyzed. Separate analyses were conducted for each problem. Results are discussed according to the five initial predictions:

Correspondence between post-performance judgments and actual performance

As hypothesized, students were poorly calibrated in the present study, with a strong tendency towards over-confidence. Nearly 60 and 45 % of the sample in Problems 1 and 2, respectively, reported having successfully solved the math problems when they actually gave an incorrect answer. This finding is consistent with previous studies in mathematics and problem-solving, conducted in elementary school and higher educational stages (Bol et al. 2005, 2010; Hacker et al. 2008b; Özsoy 2012).

Stability of post-performance judgments and actual performance

Previous studies suggest that performance judgments (and biases) are stable over time, and more stable than actual performance (Bol et al. 2012; Dinsmore and Parkinson 2013; Hacker et al. 2008b; Stolp and Zabrucky 2009). In this sense, results from the present study tentatively support this statement, as well as the second hypothesis of the study. Specifically, post-performance judgments and actual performance were both quite stable across problems, with about 70 % of students showing stability in their judgments, and 60 % in their performance. However, as will be discussed later in the present study, the use of only two math problems to estimate stability makes it difficult to generalize these findings.

Differences in the metacognitive process shown by students with different calibration accuracy

Descriptive analyses of the metacognitive process during TTPM revealed that students in the overall sample spent a great amount of time performing calculations, but little time organizing or representing information (drawing/summarizing), and even less time reviewing the progress and results of their performance. In addition, the time they spent recalling similar problems (i.e., seeking previous knowledge) was almost negligible in general. These findings indicate an absence of effective metacognitive skills in this group of students, which would make it difficult for them to monitor their own solution process or use different strategies flexibly (e.g., representing problem situations in many forms, such as a graph, a sketch or a table), getting stuck in familiar procedures instead (i.e., calculations). This is consistent with some previous studies pointing out that students at these educational stages tend to show poor metacognitive strategies while solving math problems, which commonly leads to poor performance and low calibration accuracy (Cleary and Chen 2009; Desoete and Roeyers 2006; Kramarski and Gutman 2006). The low rates of success in solving the math problems, and the weak correspondence between post-performance judgments and actual performance shown by students in the present study, confirmed this pattern of results.

Additionally, differences between accurate and inaccurate students in the metacognitive process during TTPM were found. Specifically, accurate students reported spending significantly more time planning (especially drawing and summarizing) than their inaccurate peers. Accurate students also spent significantly less time writing and reviewing than the inaccurate group. These results initially confirmed the third hypothesis of the present study, but only partially. Thus, accurate students planned more and used different information representation strategies, but they did not evaluate their progress and results more than low calibrated students did. However, the finding that inaccurate students wrote and reviewed significantly more, along with the fact that they also reported correcting (and presumably making) more mistakes than their accurate peers—although this last variable did not generate statistically significant differences—indicates the use of “trial-and-error” mechanisms by this group of students. These mechanisms involve a tendency towards tackling a problem with various methods until a solution is found, and have been reported to be frequent during problem-solving tasks (Callander 2011; Clements et al. 2008; Dijk et al. 2003). A similar pattern of results was found in both problems. Although effect sizes were low in general, these findings initially suggest a relationship between planning strategies (mainly representation and organization of the information) and making accurate post-performance judgments. In fact, there is an inverse relationship between these strategies and the use of trial-and-error as a means to solve the problems.

Evidence from students’ metacognitive processes while solving the math problems could be obtained by analyzing performance artifacts (e.g., drawings, notes, or editing from students’ drafts on paper). An example of these artifacts is provided below (Fig. 2), corresponding to the performance of two different children in Problem 1 (two 11-year-old boys, the first boy belonging to the accurate group and the second one to the inaccurate group). This figure represents an example of organization of the information and trial-and-error mechanisms, respectively. As can be observed, the student who judged his performance accurately (left side of the image) used different strategies to organize the information, establishing clear relationships between data and facts. He also seems to have tried only one approach to solving the problem, as the lack of mistakes or corrections indicates. Finally, he successfully solved the problem, giving a unique and correct answer (i.e., “Esther”). Regarding the student who was inaccurate in his performance judgment (right side of the image), his problem-solving process was less smooth than in the former case, with some signs of correcting and editing. A lack of information-organization strategies can be observed, and data and relationships are rather chaotic. Although the student gave a unique answer (i.e., “Susana”), it was incorrect.

Fig. 2 Example of students’ artifacts during TTPM (Problem 1). Two 11-year-old students: accurate (left) and inaccurate (right)

This pattern of results indicates the presence of some degree of self-regulatory and metacognitive mechanisms in calibrated students. As Jacobse and Harskamp (2012) pointed out, an important cognitive sub-process in mathematical problem-solving is building a representation of the problem situation. However, few elementary school students use this strategy spontaneously. The use of schematic visualizations allows students to establish those much-needed relationships between variables that help to solve the problem. For these authors, making these kinds of representations can be interpreted as an expression of sophisticated metacognitive regulation mechanisms, which provide insight into the episodes of analyzing and exploring a problem (i.e., metacognitive regulation in the first episodes of the problem-solving process). Moreover, schematic representations have been shown to predict students’ successful problem-solving (Fagnant and Vlassis 2013).

Differences in calibration and the metacognitive process as a function of math achievement level

Different patterns in calibration accuracy, as well as differences in the metacognitive process during TTPM, emerged as a function of mathematics achievement level. These results confirmed the fourth hypothesis of the present study. Regarding calibration accuracy, high-achieving students made more accurate judgments than their peers with lower levels of achievement. Specifically, the higher the achievement level, the more precise the post-performance judgments, and the lower the tendency towards over-confidence shown by students. The same pattern of results was obtained in both problems. These findings are consistent with previous studies showing that high-achieving students are commonly more accurate in their judgments of performance (and somewhat under-confident) than their low-achieving peers (Bol et al. 2005, 2010; Hacker et al. 2008b; Özsoy 2012). However, no conclusions about under-confidence were reached in the present study, as the proportion of under-confident students was negligible (about 1–2 %). This could be related to the fact that most students in the current sample were low achievers, as problem-solving scores evidenced. Results also indicated the existence of some differences in the metacognitive process during TTPM as a function of achievement level. In this sense, previous differences between accurate and inaccurate students in drawing/summarizing are at least partially explained by this variable. This is not surprising, as high achievers have been referred to as being more highly calibrated and self-regulated than low achievers (Bol et al. 2005; Dunlosky and Rawson 2012; Nietfeld et al. 2005, 2006). The low effect sizes found in the previous analyses may also be a consequence of the influence of this variable.

Differences in calibration and the metacognitive process as a function of grade level

With regard to grade level, the results did not confirm the fifth hypothesis of the present study. First, no distinguishable patterns of calibration accuracy were obtained as a function of this variable. In general, the overall sample showed low calibration accuracy. Additionally, sixth-grade students’ calibration accuracy was considerably lower in the first problem. Second, regarding the metacognitive process during TTPM, statistically significant differences between fifth- and sixth-grade students were mainly found in reviewing, in Problem 1. In this sense, sixth-grade students reported reviewing more than their fifth-grade peers. These differences could be due to the fact that sixth-grade students were considerably less accurate in this first problem. Thus, these results reflect the previously described differences between accurate and inaccurate students in this sub-process. This pattern of results shows that grade level did not influence calibration accuracy or metacognitive processes in this sample of students. This finding could be related to the fact that both school grades are very close in time and constitute a single educational cycle in the current Spanish Educational System. This makes it difficult to observe developmental or cognitive differences between fifth- and sixth-grade students. However, as there is also a good deal of evidence suggesting that low calibration accuracy and over-confident judgments are present across different educational stages, an absence of differences can also be expected.

Implications

The main implication of the present study is related to the impact of inaccurate performance judgments on learning processes and outcomes, and how these self-evaluation biases can be corrected. In this sense, as students show a strong tendency towards over-confidence (also corroborated in the present study), most research has focused on whether this positive bias is adaptive or maladaptive (Dupeyrat et al. 2011). Results are mixed. On the one hand, based on Bandura’s social cognitive theory (Bandura 1986), positive self-evaluation bias is adaptive because it engages motivation, persistence in the face of failures, and protects against negative emotions such as anxiety (Stolp and Zabrucky 2009). On the other hand, authors such as Bouffard and Narcis (2011), Narciss et al. (2011), and Schunk (2008) emphasize the negative effects of unrealistic judgments. There is also some evidence that positive illusions are either adaptive or maladaptive depending on variables such as the domain or the temporariness of the related outcomes (Bouffard et al. 2011; Gonida and Leondari 2011; Gramzow et al. 2003). This debate has prompted important efforts to improve calibration accuracy, many of them from the perspective of metacognition and SRL.

Some of these studies have shown that gains in calibration accuracy are difficult to achieve or are not durable (Bol and Hacker 2001; Bol et al. 2005; Nietfeld et al. 2005), whereas others have reported some improvement (DiGiacomo 2014; Hacker et al. 2008b; Nietfeld et al. 2006; Zimmerman et al. 2011). These studies agree that more comprehensive and adapted interventions must be designed and implemented. For this purpose, and as self-evaluation biases are present at all academic stages, the first step consists of analyzing the factors involved in making performance judgments at early educational stages. Results from the present study suggest the usefulness of process-based measures, specifically the Triple Task Procedure in Mathematics (García and González-Pienda 2012; García et al. 2015), to study the metacognitive mechanisms involved in post-performance judgments of math problem-solving. This procedure also reflected the influence of the mathematics achievement level on calibration accuracy and the metacognitive process itself. This aspect is relevant because the relationship between these components has been shown to have important implications for intervention (Hacker et al. 2000). These authors analyzed the effectiveness of an intervention based on feedback, practice tests, and course instruction to improve calibration accuracy and academic performance in a sample of undergraduate students. The results revealed that calibration accuracy improved, but only for higher achieving students. Similar results were obtained by Nietfeld et al. (2006) and Hacker et al. (2008b). These studies suggest that intervention strategies must be adapted to students’ characteristics, and achievement level is an important factor moderating the effects of the intervention.

Limitations

Finally, some limitations in the present study must be acknowledged. First, the low effect sizes found call for some degree of caution concerning the scope of the findings. This result could be related to the high inter-subject variability observed in the students’ metacognitive process. These low effect sizes may also indicate that other variables account for differences in the metacognitive process. This should be addressed in further studies, establishing more homogeneous groups based on additional variables.

Second, the use of only two math problems must also be considered. As explained before, the features of the assessment method used (TTPM) justified this decision. However, including more problems would lead to a clearer measure of post-performance judgments and actual performance stability. Similarly, and given the mentioned time constraints, the possibility of administering the problems across days must be considered. Taking into account that the dichotomous nature of post-performance judgments and actual performance may be a limitation in the present study, more continuous measures of calibration could be obtained by increasing the number of math problems. Along the same lines, retaining the classical measure of cognitive effort in future studies would help to establish the appropriate number of problems by examining the extent to which they are challenging for students.

Third, another limitation in the present study is that the TTPM may be intrusive for students. Specifically, it may lead to a “reactivity” effect (Bowles and Leow 2005), that is, that the type of measure employed triggers changes in learners’ cognitive and metacognitive processes while performing the task. This effect has been substantially studied in other on-line measures, mainly Think-aloud protocols (Bannert and Mengelkamp 2008; Ericsson and Simon 1996; Fox et al. 2011; Veenman et al. 1993). These authors concluded that, although the use of this sort of measure may be related to an increase in the time to complete the task, simply instructing participants to verbalize their thoughts during a task does not alter the sequence of the cognitive processes or task performance. Along the same lines, Kellog (1987), using Triple Task procedures, reached similar conclusions. This author found no differences in written fluency or quality between writers engaged in directed retrospection and 30 control writers who made a written composition as a single task, suggesting that directed introspection in Triple Task was not reactive. However, the use of different tasks (i.e., math problems in the present study) must be considered. In this sense, whereas there is extensive tradition in the study of mathematics problem-solving processes through Think-aloud protocols (see Montague et al. 2011), the flexibility of the Triple Task technique and its suitability for larger samples motivated its application in the present study. Given its design, however, it may be more intrusive for students than Think-aloud. In order to reduce its possible effect on students’ performance, the TTPM places special emphasis on the training phase so that the categorization task (directed introspection) becomes more automatic, generating less interference and involving less cognitive effort. Additional research is being conducted to examine the possible reactivity of this technique.

Fourth, students were not provided with any incentive for taking part in the study. This is relevant because students’ performance was low. Providing incentives in the form of prizes or extra credits may increase motivation, improving performance. In fact, motivation has been shown to be important for students to engage in and successfully complete a task, as well as to evaluate their performance (Sheldrake et al. 2014). This is relevant, as Zimmerman’s SRL model (2000, 2008) includes cognitive, behavioural, emotional, and motivational components. Thus, exploring motivation and its relationship to calibration accuracy and performance is an aspect to be considered in future studies.

Fifth, the categorization of the variable mathematics achievement in different levels must be considered. The dichotomous measure of calibration, and mainly the non-parametric nature of the process variables, with high levels of variability among values, made it unfeasible to conduct regression analyses considering this variable as continuous. Thus, in order to obtain information about the influence of mathematics achievement on the mentioned variables, different levels of achievement were established. This decision may have led to a loss of statistical power, given the lower effect sizes found in the present study. In this sense, additional studies focused on specific sub-processes or phases would help to reduce variability, making it possible to conduct regression analyses properly.

Finally, although this study was initially aimed at increasing ecological validity, some contextual aspects such as the fact that teachers were not present during the evaluation, or the above-mentioned absence of incentives, may have compromised this attempt. Future studies will be conducted in order to address these issues.