Metacognition predicts performance in many learning domains for children and adults (Ohtani & Hisasaka, 2018). The definition of metacognition as “cognition about cognition” (Flavell, 1979) encompasses declarative and procedural components. Declarative metacognition involves knowledge of and beliefs about the factors that affect learning and cognition, whereas procedural metacognition involves skills to monitor and control one’s learning while working on tasks. Procedural metacognition is strongly associated with the outcomes of self-regulated learning (Ohtani & Hisasaka, 2018). The Nelson and Narens (1990, 1994) model of metacognition explains procedural metacognition. Two levels of cognition, the meta-level and the object-level, are assumed to be interrelated via monitoring and control. To assess monitoring, a person can, for instance, be asked to make a judgment of learning (JOL) when studying an item or make a confidence judgment (CJ) about the correctness of their test answers. Measures of control are, for example, study time allocation, restudy selections, and the extent to which a person strategically decides to submit correct responses and opts out by withdrawing incorrect responses (Dunlosky et al., 2016; Koriat & Goldsmith, 1996; Krebs & Roebers, 2010). Research on procedural metacognition typically used single tasks and measures instead of combining multiple components. The present study combines multiple measures of procedural metacognition to obtain insights into the heterogeneity between elementary school children’s procedural metacognition.

Consider, for example, Anna, a fourth-grade student, who has to read a text about bear hibernation and solve math questions. She needs to apply procedural metacognition to understand the text and the math questions. When monitoring, she can ask herself, “Did I correctly answer the question?” When controlling, she can, for instance, reread text sections to improve understanding or try once more to solve a math question about which she is unsure before submitting her response. If Anna is skilled at monitoring and controlling her text comprehension but struggles to identify errors in math assignments, her comprehension skills will likely improve, but her math performance may suffer (De Bruin & Van Gog, 2012).

Development of procedural metacognition

Procedural metacognition develops as early as during the preschool years (see Godfrey et al., 2023; Roebers, 2017; and Schneider et al., 2022 for reviews). During elementary school, monitoring skills undergo further development (Godfrey et al., 2023; Roebers, 2017). For example, Van Loon et al. (2017) showed that between the ages of eight and 11, children’s ability to recognize their incorrect test responses improved, such that children became less overconfident for errors.

Moreover, children’s control skills improve from mid-elementary school onwards. Approximately from the age of eight, children can strategically choose which items they need to restudy to optimize learning (De Bruin et al., 2011; Dufresne & Kobasigawa, 1989; Metcalfe & Finn, 2013; Van Loon et al., 2017). Further, skills to strategically withdraw or submit test responses develop during mid-elementary school. Approximately at the age of eight, children decide to withdraw more incorrect than correct answers, and the accuracy of maintenance/withdrawal decisions improves between the ages of eight and 11 years (Bayard et al., 2021; Roebers et al., 2009).

Most studies on children’s metacognition are cross-sectional and only use one task, for instance, picture learning (Liu et al., 2018), paired associates (Metcalfe & Finn, 2013), concept learning (Van Loon et al., 2017), or comprehension of videos or expository texts (De Bruin et al., 2011; Roebers et al., 2009). These studies mainly aimed to describe developmental improvements for different monitoring and control aspects. Although well-suited for estimating general developmental tendencies over time, such approaches might conceal differential developmental patterns of monitoring and control skills.

Heterogeneity of metacognition

In the current study, by combining metacognition measures, we aim to understand how children differ when working on learning tasks (i.e., a paired-associate language learning task and a text-comprehension task) requiring continual monitoring and control. Monitoring and control processes, despite being closely intertwined, appear conceptually distinct. Monitoring involves judging the alignment between knowledge and task goals, while control entails making choices for optimizing performance. Studies show both relations and dissociations between them (e.g., Grainger et al., 2016; Peng & Tullis, 2021; Qui et al., 2018), and from a developmental perspective, monitoring skills appear to precede control skills in childhood (Destan et al., 2014).

Within samples, children of homogenous age groups exhibit substantial heterogeneity in monitoring accuracy and restudy decisions. Follow-up data analyses by Van Loon & Oeri (2023) revealed a moderate to strong overall relation (-0.49 gamma correlation) between fourth graders’ CJs and restudy decisions. However, children appeared heterogeneous in how monitoring was translated into control: 50.6% restudied < 20% of items judged as not learned, while 21.5% restudied > 80%. Individual differences in motivation, task value (Wigfield, 1994), and learning goals appear to influence restudy choices (Ariel et al., 2009). Further, children appear to vary in their decisions to maintain or withdraw test responses. Follow-up data analyses by Van Loon et al. (2024) showed that overall, fourth graders maintained 74.5% correct and 32.7% of their incorrect answers when learning texts. However, decisions about incorrect answers varied greatly: 30.9% submitted < 20% of the incorrect responses, while 25.7% submitted ≥ 50%, despite awareness of point deductions when submitting errors.
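As an illustration of the gamma correlation mentioned above, the following minimal sketch computes a Goodman-Kruskal gamma between one child's confidence judgments and restudy selections. The data and the function name are hypothetical and are not taken from the cited studies; ties are simply ignored, which is the conventional definition of gamma.

```python
from itertools import combinations


def goodman_kruskal_gamma(xs, ys):
    """Gamma = (C - D) / (C + D) over all item pairs; tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        return None  # undefined when every pair is tied
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical child: confidence judgments (1-7) and restudy choices (1 = selected)
confidence = [7, 6, 2, 1, 5, 3]
restudy = [0, 0, 1, 1, 0, 1]
gamma = goodman_kruskal_gamma(confidence, restudy)
```

A negative value means that items judged with lower confidence were selected for restudy more often, which is the direction expected when monitoring guides control.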

Studies using multiple tasks to obtain insights into the generalizability of metacognitive skills show that, for adults, monitoring for memory tasks has generalizable elements (Lehmann et al., 2022; Mazancieux et al., 2020). Also, in middle childhood, metacognition appears to some extent to be generalizable (Kleitman & Moscrop, 2010; Van de Stel & Veenman, 2010). For instance, when eight-to-nine-year-old children completed arithmetic and spelling tasks, Bellon et al. (2020) found that monitoring accuracy measures were moderately correlated across tasks. However, particularly for young elementary school children, there may be heterogeneity in the extent to which metacognitive skills are generalizable. Between the ages of eight and ten, metacognitive skills appear to transition from being rather task- or domain-specific to more generalizable across tasks and domains (Bellon et al., 2020; Geurten et al., 2018). In sum, differences between children are likely to exist in (a) their accuracy and effectiveness of monitoring and control, (b) the interplay between monitoring and control, and (c) the generalizability of metacognitive skills across tasks.

In this study, latent profile analysis (LPA) is used to simultaneously consider multiple measures of procedural metacognition to understand how metacognitive monitoring and control processes co-occur within children. Focusing on metacognition profiles rather than mean values for single variables allows us to investigate heterogeneity within individual children’s procedural metacognition. For example, some children may struggle with monitoring and control of reading comprehension but not with a paired-associate memory task, or some may be skilled at monitoring but not at controlling. A better understanding of such heterogeneity within children’s metacognitive skills may improve the identification of children who are at risk for low metacognition and benefit the design of personalized metacognition support.

A further aspect to consider when aiming to understand heterogeneity in procedural metacognition is the extent to which individual development patterns differ between children over time. The few longitudinal studies on children’s development of metacognition mainly focus on metacognitive skills improvement for one specific task as an effect of age (e.g., Fandakova et al., 2017; Roebers & Spiess, 2017). This research, however, does not address to what extent children identified as having proficient metacognition maintain high levels over time and to what extent children who initially lag behind persist in having lower metacognitive skills.

The novelty of the present approach lies in investigating children’s membership in metacognitive profiles over time. Building on longitudinal research into the stability and change of self-evaluations such as self-esteem and self-concept (e.g., Putnick et al., 2020; Trzesniewski et al., 2003), we employed an annual measurement interval. This allowed us to explore the extent to which children’s metacognitive skills remain stable or change over a year. By investigating the changes in children’s profile classifications over time, we can obtain insights into whether metacognitive deficits are temporary and resolved within a year or represent a persistent developmental lag where children with initial metacognitive deficits remain classified as having low procedural metacognitive skills.

Present research

The present study includes multiple metacognition indicators (i.e., confidence judgments, restudy selections, and response maintenance/withdrawal decisions). Two age groups (eight-year-old second graders and ten-year-old fourth graders) participated. With our measures of metacognition, we focus specifically on (a) children’s ability to recognize errors and assign lower confidence judgments (CJs) to incorrect responses than to correct ones; (b) children’s skills to correct errors by strategically selecting these task items for restudy; and (c) their ability to strategically maintain correct test responses and withdraw their errors (thereby avoiding losing points). In the age range under investigation, metacognitive monitoring and control skills develop extensively. There are pronounced changes in children’s capacity to recognize and correct errors and to use monitoring to guide their control decisions and actions, and metacognition transitions from a task-specific to a more generalizable skill (Bellon et al., 2020; Geurten et al., 2018; Selmeczy & Ghetti, 2019; Van Loon et al., 2013).

Two verbal learning tasks, text comprehension and Kanji learning, were completed at two measurement points, one year apart. Despite similarities in task phases (learning, testing, monitoring, and controlling), task formats differed. For the text comprehension task, children read short texts at their own pace, with open-ended questions in the test phase. The Kanji task involved paired-associate visual stimuli with fixed learning times and a multiple-choice recognition test. Procedural metacognition indicators (monitoring accuracy, effective restudy selections, and response maintenance accuracy) were calculated per task and measurement point.

Using person-oriented LPAs, we aimed to identify profiles of procedural metacognition. We expected quantitative differences between children in metacognitive skills, including at least one high and one low metacognitive skills profile in both age groups (H1). To foreshadow, because differences appeared between the two grade levels in profile classifications, LPAs are reported separately for each grade. To better understand potential reasons for differences between children’s profile memberships, we investigate whether task performance measures were related to metacognition profile classifications. The relationship between metacognition and performance is subject to ongoing debate, with some research arguing for a close connection (showing that higher task performance often correlates with enhanced metacognition, e.g., Roebers & Spiess, 2017), whereas other research suggests that these are dissociable (Howard et al., 2010; Scott et al., 2014) and should be separable in measurement to interpret “purer” measures of metacognition (Fleming & Lau, 2014). For explorative purposes, we relate task performance to the metacognition profile classifications to obtain insights into whether children’s retrospective monitoring and control skills (i.e., monitoring judgments and control decisions were made after answering test questions) may be driven by test performance.

By combining two tasks, we can obtain insights into the generalizability of metacognition. Generalizability appears to be developing in the age range under investigation; children may differ in the extent to which their levels of monitoring and control are generalizable. Bellon et al. (2020) found evidence for generalizable metacognitive skills between ages eight and ten. We therefore expected that, particularly for the older age group, at least some of the identified profiles would show generalizable metacognitive skills (i.e., similar levels of metacognitive ability for the text task and the Kanji task; H2).

Longitudinal measurements (one year apart) provide insights into the developmental trajectories of metacognition profiles and the probabilities of changing profile membership. We aim to understand if identified profiles remain similar or change over time (potentially due to biological maturation, repeated task and testing experiences, and schooling experiences). Further, we examine to what extent children’s profile memberships develop and change over time. This way, we can obtain insights into how often children with a high level of metacognition remain high, and children initially classified with deficits in monitoring and/or control catch up over time or continue to exhibit metacognitive deficits. Based on findings showing moderate relations between children’s level of confidence across two time points (Roebers & Spiess, 2017), we hypothesize that children classified with accurate metacognition tend to remain high in metacognition, and children poorer in metacognition also tend to stay in lower-metacognition profiles over time (H3).

Methods

Participants

At the first measurement point (M1), 327 children participated: 151 second-graders (M age 7.61 years, SD 0.50, 50% female) and 176 fourth-graders (M 9.62 years, SD 0.49, 46.38% female). They attended primary schools in the German-speaking part of Switzerland. Of these participants, 325 completed the Kanji task, and 304 completed the text task. At M2, which took place one year later, 308 children participated: 138 second-graders (then in third grade) and 170 fourth-graders (then in fifth grade). Of these participants at M2, 307 children completed the text task and 303 the Kanji task; 279 children completed both the text and the Kanji tasks at both measurement points.

All children were used to following school instructions in German; 76.47% had German as a native language. Among the other native languages, Albanian (7.19%), Italian (1.96%), and Hungarian (1.96%) were most common. At M2, the children had moved on to the next school year and were thus in third and fifth grade. However, for clarity, we refer to these groups as second and fourth graders throughout the paper.

Note that children participated in a large project including seven measurements in one year: in Fall Year 1 with a Kanji and text task, in Spring Year 1 with a Kanji, text, and secret code task, and in Fall Year 2 with a Kanji and text task. Given the aims of this study, the measurements in the Fall of Year 1 (M1) and one year later in the Fall of Year 2 (M2) were used. Findings from the overarching project are reported in detail for the Kanji tasks by Roebers et al. (2019) and for the text task by Steiner et al. (2020). The current approach is the first to use a latent profile approach to combine findings for the text and Kanji tasks cross-sectionally and longitudinally.

Materials and procedure

Children were tested in their school classes. At each measurement point (one year apart), they completed the text and Kanji tasks during two sessions with approximately one week in between. The order of completion was counterbalanced per class. Both tasks consisted of a learning, test, monitoring (CJs), and control phase (restudy selections and response maintenance/withdrawal). Figure 1 shows these phases. All materials were previously piloted with a different sample of children.

Fig. 1 Procedure

Text comprehension task

Six expository texts were read in randomized order on a tablet; text reading was self-paced. The texts had the same topics but were longer and more difficult for the fourth than for the second graders (average word count 80.9, SD = 6.2 for second and 127.4, SD = 7.0 for fourth grade). More details about the texts are reported by Steiner et al. (2020). After reading, the children received a booklet with test questions. They answered twelve open-ended questions (two per text). For each text, a comprehension question required a sentence as an answer, and a detail question required a single-word response. After giving each answer, they monitored how certain they were that their answer was correct by making a CJ on the 7-point thermometer (adapted from Koriat & Shitzer-Reichert, 2002) printed to the right of each question. When introducing the thermometer at the first testing session, we asked, “What color is grass?” and prompted them to indicate their confidence level. Children unanimously selected the highest point on the scale (red, very sure) for this known answer. When then asked, “How many hairs do I have on my head?” (for which they made a guess), they uniformly chose the blue side (unsure). This introduction ensured that all children received the same guidance on how to use the scale, making our measures of monitoring accuracy comparable across participants. After making confidence judgments, children could select texts for restudy; the text topics were presented on a page in a 3 × 2 grid format (as recommended by Thiede & Dunlosky, 1999). Children could put a checkmark next to the text they would like to restudy (note that they did not actually restudy these texts). Finally, children could decide which responses they wanted to maintain and which they wanted to withdraw. They were informed that a +1, −3 bonus-to-penalty ratio (based on Roebers et al., 2009) would be used to award or deduct points, such that they would earn one point for maintaining correct answers, and three points would be deducted for maintaining incorrect answers. They then looked back at their test answers and decided for each of their responses whether they wanted to maintain it (by adding a check mark) or withdraw it (by crossing the answer out).
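The bonus-to-penalty rule can be illustrated with a minimal sketch. The function name and the example responses are hypothetical, and the sketch assumes that withdrawn answers neither earn nor cost points, as the description above implies.

```python
def score_decisions(correct, maintained, bonus=1, penalty=3):
    """Total points under a +bonus / -penalty rule; withdrawn answers score 0."""
    points = 0
    for is_correct, is_maintained in zip(correct, maintained):
        if not is_maintained:
            continue  # withdrawn answers are assumed to be point-neutral
        points += bonus if is_correct else -penalty
    return points


# Hypothetical child with six answers
correct = [True, True, False, True, False, False]
maintained = [True, True, True, False, False, False]
total = score_decisions(correct, maintained)  # 1 + 1 - 3 = -1
```

Under such a rule, maintaining an answer pays off in expectation only when the probability p of being correct satisfies p − 3(1 − p) > 0, that is, p > .75, which is why accurate monitoring matters for these decisions.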

Kanji learning task

Children learned 12 (second graders) or 16 (fourth graders) Japanese characters (Kanjis, as used by Destan et al., 2014) and their meanings presented as pictures. The Kanji materials are reported in more detail by Roebers et al. (2019). The task was completed on a tablet computer. During learning, the Kanjis and their meanings were shown in random order for 5 s each. For the recognition test, one Kanji at a time was presented together with four alternatives (the correct alternative and three randomly selected alternatives that had appeared in the learning phase). Children had to select the image depicting the meaning of the Kanji. There was no time limit to complete the test. After children chose an answer, a red frame surrounded it, and they made a monitoring judgment (CJ). They answered, “How sure are you that you have chosen the correct picture?” by clicking on a 7-point scale ranging from very unsure to very sure, illustrated with the colored thermometer scale (adapted from Koriat & Shitzer-Reichert, 2002; the thermometer was introduced similarly to the text task). Then, participants could select which Kanjis they would like to restudy. The Kanjis were presented in a grid format (a 4 × 3 grid for the second graders and a 4 × 4 grid for the fourth graders); they could click on the Kanjis they wanted to restudy and could also select a “none” button.

Finally, participants were instructed that they could decide to maintain or withdraw their answers. They were informed that the +1, −3 bonus-to-penalty ratio would be used to award or deduct points. Answers appeared one at a time. They could maintain or withdraw each answer by selecting a green or red traffic light.

Analyses

To assess performance, the percentage of correct test responses was calculated for each task. To assess monitoring accuracy, discrimination was calculated as the difference between the mean confidence for accurate and for inaccurate answers (Roebers, 2002). Discrimination could range from −6 (very inaccurate discrimination, i.e., consistently being highly certain for incorrect responses and very uncertain for correct responses) to +6 (perfectly accurate discrimination, i.e., consistently being highly certain for correct responses and very uncertain for incorrect responses; see Footnote 1). To assess the effectiveness of restudy selections, for the Kanji task, the percentage of restudy selections for items with incorrect responses was calculated per child. For the text task, the percentage of restudy selections for texts for which at least one of the two questions was answered incorrectly was calculated. That is, restudy selections were effective when many of the not-well-learned task materials were selected for further study. Moreover, to investigate decision accuracy, the percentage of correct response maintenance/withdrawal decisions (accurate answers maintained and inaccurate responses withdrawn) was calculated for each child. Monitoring accuracy, effective restudy, and maintenance/withdrawal decision accuracy measures were converted into z-scores for all analyses. For full sample analyses, z-scores were calculated for the whole sample; for the reported follow-up analyses per grade level, z-scores were calculated separately per age group.
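The three indicators and their standardization can be sketched as follows. This is a minimal illustration with hypothetical data and function names, not the scoring script used in the study. Two points are assumptions: a child with no correct or no incorrect responses receives no score (None), and z-scores use the sample standard deviation. The restudy function works at the item level, as for the Kanji task; for the text task, the unit would be the text.

```python
from statistics import mean, stdev


def discrimination(confidence, correct):
    """Mean CJ for correct answers minus mean CJ for incorrect answers."""
    conf_correct = [c for c, ok in zip(confidence, correct) if ok]
    conf_incorrect = [c for c, ok in zip(confidence, correct) if not ok]
    if not conf_correct or not conf_incorrect:
        return None  # assumption: undefined without both answer types
    return mean(conf_correct) - mean(conf_incorrect)


def effective_restudy(selected, correct):
    """Percentage of incorrectly answered items that were selected for restudy."""
    selections_for_errors = [sel for sel, ok in zip(selected, correct) if not ok]
    if not selections_for_errors:
        return None  # assumption: undefined when there are no errors
    return 100 * sum(selections_for_errors) / len(selections_for_errors)


def decision_accuracy(maintained, correct):
    """Percentage of decisions that maintain correct or withdraw incorrect answers."""
    accurate = sum(m == ok for m, ok in zip(maintained, correct))
    return 100 * accurate / len(correct)


def z_scores(values):
    """Standardize scores within a sample (sample SD is an assumption)."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]


# Hypothetical child with six test items
correct = [True, True, False, False, True, False]
confidence = [7, 6, 3, 2, 5, 4]
selected = [False, False, True, False, False, True]
maintained = [True, True, False, True, True, False]

disc = discrimination(confidence, correct)       # 6 - 3 = 3
restudy = effective_restudy(selected, correct)   # 2 of 3 errors selected
accuracy = decision_accuracy(maintained, correct)  # 5 of 6 decisions accurate
```

Applying z_scores to each indicator across the children of a grade would correspond to the per-grade standardization described above.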

Latent profile analyses were conducted with LatentGold software (Vermunt & Magidson, 2005, 2021). Only children with complete data sets per measurement point (i.e., who completed both the text and the Kanji task) were included in these analyses. To test whether we would find different profiles of metacognition (addressing H1), we performed time-specific LPAs for both measurement points separately, using a stepwise procedure from a one-profile to a six-profile solution. The LPAs included the z-scores for the following indicators: (a) monitoring accuracy texts, (b) effective restudy selections for not well-understood texts, (c) decision accuracy of response maintenance/withdrawal text task, (d) monitoring accuracy Kanji, (e) effective restudy Kanji, and (f) decision accuracy of response maintenance/withdrawal Kanji. For the LPAs, the statistics and p-values for the Vuong-Lo-Mendell-Rubin likelihood ratio test (VLMR) are reported to indicate whether adding more profiles significantly improves model fit. To select the best-fitting profile solution, the solutions of the LPAs were compared, and the most parsimonious model was selected that displayed the smallest BIC value and an entropy value higher than 0.70 (as recommended by Nylund et al., 2007). Further, when conducting LPAs with relatively small sample sizes (lower than 300), a larger number of profiles can result from over-extraction, and small profiles may be challenging to interpret (Nylund-Gibson & Young Choi, 2018). Therefore, we only selected profile solutions for which the smallest profile had a prevalence of at least 10% of the sample. Note that for our initial LPAs, data for second and fourth graders were combined, and to account for potential differences between grades, grade level (2 or 4) was included as a nominal, active covariate. This way, we could benefit from the larger sample size (combining both grade levels) and still investigate whether profile classifications would differ between the age groups. To foreshadow, the covariate grade level was significant, indicating that profile classifications differed between the age groups. Therefore, follow-up LPAs were conducted for each grade level separately.
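One possible reading of these selection criteria is sketched below: among solutions whose entropy exceeds 0.70 and whose smallest profile covers at least 10% of the sample, the one with the smallest BIC is chosen. The fit values are invented for illustration and are not the study's results; the function names are hypothetical, and the entropy function implements the common relative entropy statistic computed from posterior membership probabilities, which is assumed here to be the index reported.

```python
from math import log


def relative_entropy(posteriors):
    """1 - sum(-p * ln p) / (N * ln K) over N children and K profiles."""
    n, k = len(posteriors), len(posteriors[0])
    if k == 1:
        return 1.0  # a single profile leaves no classification uncertainty
    uncertainty = -sum(p * log(p) for row in posteriors for p in row if p > 0)
    return 1 - uncertainty / (n * log(k))


def select_solution(solutions, min_entropy=0.70, min_share=0.10):
    """Smallest-BIC solution among those meeting the entropy and size criteria."""
    eligible = [
        s for s in solutions
        if s["entropy"] > min_entropy and min(s["shares"]) >= min_share
    ]
    return min(eligible, key=lambda s: s["bic"]) if eligible else None


# Invented fit values, for illustration only
solutions = [
    {"profiles": 2, "bic": 2100.0, "entropy": 0.81, "shares": [0.60, 0.40]},
    {"profiles": 3, "bic": 2050.0, "entropy": 0.78, "shares": [0.50, 0.30, 0.20]},
    {"profiles": 4, "bic": 2040.0, "entropy": 0.65, "shares": [0.40, 0.30, 0.20, 0.10]},
    {"profiles": 5, "bic": 2030.0, "entropy": 0.80, "shares": [0.40, 0.30, 0.15, 0.10, 0.05]},
]
best = select_solution(solutions)
```

In this invented example, the 5-profile solution has the smallest BIC but is excluded because its smallest profile covers only 5% of the sample, and the 4-profile solution is excluded for low entropy, so the 3-profile solution is selected.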

For the LPA profile solutions for both grade levels, we addressed to what extent the identified metacognition profiles differed in task performance for the text and the Kanji task using a three-step approach (with a BCH adjustment). This approach preserves the LPAs’ classification uncertainty when relating profile membership to the distal task performance variables (Bakk & Kuha, 2021; Nylund-Gibson et al., 2019).

The data for the children with complete data sets (i.e., including both measurement points for both the text and the Kanji task) were used for the analyses investigating longitudinal profile changes (for the second graders, 95 children had complete longitudinal data sets; for the fourth graders, 123 children had complete longitudinal data sets). As profile classifications differed between grades, we calculated the within-person profile change probabilities across time points separately for each age group. This enabled us to investigate to what extent children initially classified with higher levels of metacognition would remain higher in metacognition (relative to their peers) over time and vice versa (testing H3). With χ² tests, we investigated whether the pattern of changes between profiles from M1 to M2 could be considered random or whether it deviated significantly from what would be expected by chance. That is, with these tests, we address whether being classified into a certain profile at M1 is accompanied by a higher or lower likelihood than chance of being classified into a specific profile one year later at M2. Significant χ² results would indicate that certain movement patterns were more or less probable than expected by chance alone. For significant omnibus χ² tests, the standardized residuals were inspected for each profile change type to determine which specific movements between profiles were more or less likely than would be expected by chance alone.
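The logic of the omnibus test and its follow-up can be sketched for a transition table in which rows are M1 profiles and columns are M2 profiles. The counts and the function name are hypothetical. The residuals are Pearson residuals, (observed − expected) / √expected; whether the reported standardized residuals are of this type or adjusted residuals is not stated, so this is an assumption. The sketch also assumes that every row and column total is nonzero.

```python
from math import sqrt


def chi_square_with_residuals(table):
    """Pearson chi-square, degrees of freedom, and residuals for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            # Expected count if M2 profile were independent of M1 profile
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            residual_row.append((observed - expected) / sqrt(expected))
        residuals.append(residual_row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, residuals


# Hypothetical transition counts: rows = profile at M1, columns = profile at M2
transitions = [
    [20, 10],
    [10, 20],
]
chi2, df, residuals = chi_square_with_residuals(transitions)
```

The p-value would follow from comparing the statistic with a χ² distribution with df degrees of freedom. Positive residuals mark transitions that occur more often than expected by chance, and negative residuals mark transitions that occur less often.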

The full dataset and supplementary materials (including tables showing correlations between variables and results of the LPAs for the total sample) are available online: https://osf.io/nx2wp/?view_only=9623bc6ce8a24b91ad53c10333f1305e.

Results

Before testing our hypotheses, descriptive statistics for performance, monitoring, restudy, and response maintenance/withdrawal are presented. Preliminary analyses then investigate differences in means between age groups and differences in means between measurement points. The online supplementary materials show tables with correlations between the LPA-indicator variables and task performance for M1 and M2.

Descriptive statistics and preliminary analyses

For both age groups, Table 1 shows the mean values for performance, CJs for correct and incorrect responses, monitoring accuracy (i.e., CJ discrimination), the percentage of restudy selections, the percentage of effective restudy selections for not-well-learned task materials, the percentage of maintained responses, and decision accuracy (indicated by the percentage of accurate maintenance/withdrawal decisions), for both tasks (text and Kanji) and both measurement points. This table indicates significant differences between age groups for both measurement points.

Table 1 Mean performance, CJs, restudy selections, and response maintenance/withdrawal decisions

As shown in Table 1, for the text task at M1, t-tests showed significant differences between age groups in CJs for incorrect responses (second > fourth graders), the number of restudy selections (second < fourth graders), effective restudy selections (second < fourth graders), and the number of maintained responses (second > fourth graders). For the Kanji task at M1, t-tests showed significant differences between age groups in mean performance (second < fourth graders), CJs for correct responses (second < fourth graders), CJ discrimination (second < fourth graders), effective restudy selections (second < fourth graders) and decision accuracy (second < fourth graders).

Further, Table 1 shows that at M2, there were significant differences between age groups for the text task in mean performance (second < fourth graders), the number of restudy selections (second < fourth graders), and the number of maintained responses (second < fourth graders). For the Kanji task, there were significant differences between age groups in mean performance (second < fourth graders), the percentage of restudy selections (second > fourth graders), and decision accuracy (second < fourth graders).

Moreover, paired t-tests were used to investigate whether, within each age group, the means differed across measurement points. For the text task, significant differences were found for second graders for performance (M1 < M2, p = .01), CJs for incorrect responses (M1 > M2, p < .001), CJ discrimination (M1 < M2, p < .001), effective restudy selections (M1 < M2, p = .04), response maintenance (M1 > M2, p < .001), and decision accuracy (M1 < M2, p < .001). For fourth graders, for the text task, there were differences between measurement points in performance (M1 > M2, p < .001), CJs for incorrect responses (M1 > M2, p < .001), CJ discrimination (M1 < M2), response maintenance (M1 > M2, p < .001), and decision accuracy (M1 < M2, p = .002).

For the Kanji task, comparisons between measurement points for second graders showed significant differences in mean performance (M1 < M2, p < .001), CJs for correct responses (M1 < M2, p = .001), CJ discrimination (M1 < M2, p < .001), restudy for incorrect responses (M1 < M2, p = .003), and decision accuracy (M1 < M2, p < .001). For the fourth graders, for the Kanji task, there were significant differences between measurement points in task performance (M1 < M2, p < .001), CJs for correct responses (M1 < M2, p < .001), CJ discrimination (M1 < M2, p = .003), the overall number of restudy selections (M1 > M2, p < .001), the number of restudy selections for incorrect responses (M1 > M2, p < .006), the number of maintained responses (M1 < M2, p < .001) and decision accuracy (M1 < M2, p < .001).

Profiles of metacognition at measurement 1

The supplementary materials show the model fit indices for the full sample for the latent profile solutions with 1–6 latent profiles. Although the BIC value was lowest for the 6-profile solution, the smallest profile had a prevalence of only 4.9% of the sample. The 5-profile solution had a high entropy, and all identified profiles had prevalences of at least 16.4%. Therefore, this 5-profile solution was selected. Profile prevalences and mean values for the metacognition measures per profile are shown in the supplementary materials. The Wald test showed a significant effect of the covariate grade level, χ²(1) = 17.37, p = .002, indicating that classification outcomes differed between the age groups (grade differences per profile are described in the supplementary materials). Therefore, LPAs were subsequently calculated for each grade level separately.

Profiles for second graders

Table 2 shows the fit indices for latent profile solutions with 1–6 profiles for the second graders. The BIC value was the smallest for the 3-profile solution, entropy was high, and the 3-profile solution fitted better than the 2-profile solution. Therefore, this solution was selected. The mean values for monitoring accuracy, restudy effectiveness, and decision accuracy are presented in Table 3. Further, Fig. 2 shows the z-scores for metacognition for both tasks for the three profiles. The largest profile (54.3% of the included 2nd graders, n = 63) had overall low to average metacognition; particularly for texts, metacognitive skills were low. This profile is called the variable metacognition profile. Profile 2 (35.3%, n = 41) had high metacognition for the text task; metacognition was average for the Kanji task. This profile is referred to as the high text metacognition profile. Profile 3 was a small profile (10.3%, n = 12); children in this profile had low metacognition for the text task, particularly for monitoring and decision accuracy, and average to high metacognition for the Kanji task. This profile is referred to as the low text metacognition profile. Profiles differed in performance on the text task, Wald test χ²(2) = 9.46, p = .008. Post hoc comparisons showed that the low text metacognition profile had lower performance on the text task than the variable metacognition profile (p = .004) and the high text metacognition profile (p = .005); there was no difference between high text metacognition and variable metacognition (p = .79). There were no differences between profiles in performance on the Kanji task (p = .14).

Metacognitive skills did not appear generalizable across tasks for second graders. Profiles differed most in their ability to monitor and control text comprehension, with one profile showing particularly high skills and a small profile showing particularly low metacognition for text.

Fig. 2 Latent profile solutions for the first measurement point for second graders

Table 2 Model fit indices from latent profile analyses for both grade levels for the solutions with a maximum of 6 latent profiles for measurement point 1 and measurement point 2
Table 3 Mean scores (unstandardized) for monitoring accuracy, effective restudy, and decision accuracy for the identified profiles for both grades at both measurement points

Profiles for fourth graders

Table 2 shows the fit indices for the solutions with 1–6 profiles for the fourth graders. The 3-profile solution had the smallest BIC value and high entropy and fitted the data better than the 2-profile solution. Therefore, this solution was selected. Table 3 shows the mean scores for the metacognition measures for the identified profiles; Fig. 3 shows the z-standardized means. The largest profile (39.0% of the sample, n = 57) showed average metacognition for the text task; for the Kanji task, however, monitoring was lowest and restudy was also low. This profile is referred to as the low Kanji monitoring profile. The second profile (35.6%, n = 52) showed consistently low metacognition for both the text and the Kanji task and is referred to as the at-risk metacognition profile. The third profile (25.3%, n = 37) had high monitoring and control for both tasks and is called the proficient metacognition profile.

Fig. 3 Latent profile solutions for the first measurement point for fourth graders

There were no differences between profiles in performance on the text task (p = .09), but for the Kanji task, the profiles differed in performance, χ²(2) = 7.94, p = .019. The low Kanji monitoring and the at-risk metacognition profiles had lower performance on the Kanji task than the proficient metacognition profile (ps = .009 and .015, respectively). There was no difference in Kanji task performance between the low Kanji monitoring and the at-risk metacognition profiles (p = .76).
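For a chi-square statistic with two degrees of freedom, the upper-tail probability has the closed form exp(−x/2), so the p value of the Wald test for the Kanji task can be checked by hand. A minimal sketch (the function name is hypothetical; the statistic is the rounded value reported above, so small rounding differences are possible):

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Wald statistic reported above for profile differences in Kanji task performance.
p = chi2_sf_df2(7.94)
print(round(p, 3))
```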

For fourth graders, metacognitive skills appeared generalizable across tasks for two of the three identified profiles, i.e., the at-risk and proficient metacognition profiles. For the low Kanji monitoring profile, monitoring was less accurate and restudy less effective for the Kanji task than for the text task.

Profiles of metacognition at measurement 2

For the LPA for the total sample, the fit indices for the solutions with 1–6 latent profiles are shown in the supplementary materials. Although the 5-profile solution had the smallest BIC value, the smallest profile only contained 7.7% of the sample. The 4-profile solution also had a high entropy value, and all profile prevalences were higher than 10%. Therefore, this solution was selected. The supplementary materials show the profile prevalences and the scores for monitoring accuracy, effective restudy, and decision accuracy for the classified profiles for the 4-profile solution. Because the grade level covariate affected the classification outcomes, χ²(1) = 17.20, p < .001 (grade differences are described in the supplementary materials), follow-up LPAs were conducted for both grade levels separately.

Profiles for second graders

Table 2 shows the profile classifications for M2 for the second graders. BIC values were lowest for the 3-profile solution, entropy was high, and this solution fitted the data better than the 2-profile solution. Further, all profiles had acceptable prevalences. Therefore, the 3-profile solution was selected. Table 3 shows the mean values, and Fig. 4 shows the z-scores for the metacognition measures. The largest profile (40.3%, n = 50) showed average monitoring and decision accuracy and highly effective restudy for both tasks. This profile is referred to as the effective restudy profile. Profile 2 (35.5%, n = 44) showed average metacognition for the text task but the lowest monitoring and decision accuracy and low restudy for the Kanji task and is referred to as the low Kanji monitoring profile. The third profile (24.2%, n = 30) had average to low monitoring and decision accuracy and made ineffective restudy selections for both tasks; this profile is referred to as the ineffective restudy profile.

Fig. 4 Latent profile solutions for the second measurement point for second graders

There were no differences between profiles in performance on the text task (p = .34) or the Kanji task (p = .27), indicating that task performance was not driving profile membership. Moreover, as indicated by Fig. 4, for the effective restudy and the ineffective restudy profiles, indicators of metacognition appeared generalizable across tasks. The level of metacognitive skills did not appear generalizable for the low Kanji monitoring profile (i.e., metacognition was consistently lower for Kanji than for texts).

Profiles for fourth graders

For M2, for the fourth graders, as shown in Table 2, the LPAs had the lowest BIC value for the 4-profile solution. However, inspection of the profiles showed that the smallest profile only had a prevalence of 5.6%. The 3-profile solution also had a high entropy, profile prevalences were acceptable (all prevalences > 10%), and the solution fitted better than a 2-profile solution. Therefore, the 3-profile solution was selected.
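One possible reading of this selection logic is "choose the lowest-BIC solution among those in which every profile exceeds a minimum prevalence". The sketch below illustrates that reading only; it is not the authors' procedure, and it leaves out the entropy check and the comparison with the solution with one profile fewer. The function name, the 10% cut-off as a hard rule, all BIC values, and the 2-profile prevalence are assumptions for illustration; only the prevalences of 5.6% and 10.4% are taken from the text.

```python
def select_solution(candidates, min_prevalence=10.0):
    """Pick the lowest-BIC solution whose smallest profile is large enough.

    candidates: list of dicts with keys 'k' (number of profiles), 'bic',
    and 'smallest' (prevalence of the smallest profile, in percent).
    min_prevalence: ASSUMED cut-off, based on the "> 10%" wording in the text.
    """
    admissible = [c for c in candidates if c["smallest"] > min_prevalence]
    if not admissible:
        raise ValueError("no solution has all profiles above the cut-off")
    return min(admissible, key=lambda c: c["bic"])

# Hypothetical BIC values (NOT the values in Table 2). The smallest-profile
# prevalences of 10.4% and 5.6% come from the text; 38.0% is made up.
fits = [
    {"k": 2, "bic": 2450.0, "smallest": 38.0},
    {"k": 3, "bic": 2410.0, "smallest": 10.4},
    {"k": 4, "bic": 2395.0, "smallest": 5.6},
]

print(select_solution(fits)["k"])
```

With these illustrative inputs, the 4-profile solution is excluded despite its lower BIC, and the 3-profile solution is returned.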

Table 3 shows the unstandardized scores for monitoring accuracy, effective restudy, and decision accuracy for the three profiles, and Fig. 5 shows the z-standardized scores. The first and largest profile (58.3%, n = 84) had average monitoring and control skills for the text and the Kanji task. This profile is referred to as the average metacognition profile. The second profile (31.3%, n = 45) had average-to-low monitoring and decision accuracy for both tasks and restudy was ineffective. This profile is called the ineffective restudy profile. The third and smallest profile (10.4%, n = 15) had average-to-low monitoring and decision accuracy; however, restudy selections were highly effective. This profile is referred to as the effective restudy profile.

Fig. 5 Latent profile solutions for the second measurement point for fourth graders

Task performance did not affect profile membership; there were no differences between profiles in performance on the text task (p = .17) or the Kanji task (p = .70). Moreover, for all three profiles, metacognitive skills appeared generalizable across tasks, such that the level of metacognitive skills was average-to-high for both tasks for the average metacognition profile, discrimination was average-to-low and restudy particularly low for both tasks for the ineffective restudy profile, whereas restudy was highest for both tasks for the effective restudy profile.

In sum, for both age groups and at both time points, identified profiles showed differences in the quality of metacognitive monitoring and control skills (confirming H1). Additionally, there was heterogeneity in the extent to which metacognitive skills appeared generalizable. There was stronger evidence for H2 (expecting profiles with generalizable metacognitive skills across tasks) at M2 than at M1, particularly for the older age group. At M2, all identified profiles showed generalizable metacognitive skills across tasks for fourth graders.

Longitudinal movement patterns between profiles

The extent to which profile membership at M1 was related to profile membership at M2 was inspected for both grade levels separately. The types of changes between profiles over time are shown in Table 4 for second graders and Table 5 for fourth graders. For both grades, the changes between profiles from M1 to M2 differed from what would be expected by random chance (second grade, χ² = 10.90, p = .028; fourth grade, χ² = 12.42, p = .015). Follow-up inspections for the second graders did not indicate that specific changes over the course of one year were significantly more or less likely to happen than would be expected by chance. However, for the fourth grade, being classified into the at-risk metacognition profile at M1 was predictive of being classified into the ineffective restudy profile at M2 (standardized residual = 2.1, p < .05). Thus, for fourth-grade children, findings confirm H3. Almost half (44.2%) of the children with low metacognitive skills in fourth grade appeared likely to keep lagging behind and stay at risk for showing metacognitive deficiencies one year later.
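The kind of analysis described here can be sketched as a Pearson chi-square test on the M1-by-M2 cross-classification, followed by an inspection of adjusted standardized residuals per cell. The example below is an illustration under stated assumptions rather than the authors' exact analysis: the transition counts are made up (they are not the values in Tables 4 and 5), the function names are hypothetical, and the closed-form p value assumes 4 degrees of freedom for a 3 × 3 table, which the text does not state.

```python
import math

def chi_square_with_residuals(table):
    """Pearson chi-square and adjusted standardized residuals for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            # Adjusted residual: approximately standard normal under independence.
            se = math.sqrt(expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            res_row.append((observed - expected) / se)
        residuals.append(res_row)
    return chi2, residuals

def chi2_sf_df4(x):
    """Upper-tail probability for a chi-square statistic with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Hypothetical counts (NOT the study data): rows = M1 profile, columns = M2 profile.
transitions = [
    [30, 10, 8],
    [15, 22, 5],
    [20, 8, 12],
]

chi2, residuals = chi_square_with_residuals(transitions)
print(round(chi2, 2), round(chi2_sf_df4(chi2), 3))
for row in residuals:
    print([round(r, 1) for r in row])
```

Cells with an absolute adjusted residual above roughly 1.96 occur more or less often than expected under independence at the .05 level, which corresponds to the interpretation of the residual of 2.1 reported above.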

Table 4 Probabilities of longitudinal changes between profiles from measurement points 1 to 2 for second graders (in percentage)
Table 5 Probabilities of longitudinal changes between profiles from measurement points 1 to 2 for fourth graders (in percentage)

Discussion

This study aimed to improve understanding of the development of elementary school children’s metacognitive monitoring and control. The novelty of this research is that a latent profile approach was used to combine multiple indicators of metacognition per participant. This way, metacognition could be investigated as a holistic set of skills rather than focusing on mean values for single variables. Monitoring accuracy (discrimination between CJs for correct and incorrect responses) and control processes (effective restudy selections and the accuracy of response maintenance/withdrawal decisions) were measured in two age groups (second graders and fourth graders) across two tasks (text comprehension and Kanji memory) at two time points one year apart. We investigated whether we could uncover different patterns of metacognition at the individual level (i.e., metacognitive profiles) to understand heterogeneity in children’s metacognition, the generalizability of metacognition across two tasks, and heterogeneity in developmental trajectories over time.

As expected (H1), different metacognitive profiles emerged, reflecting heterogeneity in the quality of children’s metacognition. For the second graders at M1, three profiles emerged: variable, high text, and low text metacognition profiles. Most second graders appeared to have variable metacognitive skills, and heterogeneity in metacognitive skills was mainly visible for metacognition for the text task. Notably, children with the lowest metacognitive skills for the text task also exhibited the poorest performance on it. Between the ages of 7 and 11, after obtaining text decoding skills, children rapidly develop reading comprehension abilities (Oakhill et al., 2003). The observed differences in metacognition for the text task among second graders possibly mirror individual differences in their developing reading skills.

Also, differences in the level of metacognitive skills were observed for fourth graders, and primary task demands seemed to influence metacognitive skills at M1. Particularly for the Kanji task, fourth-grade children who experienced more difficulties with learning the Kanji stimuli also demonstrated lower metacognitive abilities for this task. Further, in addition to a low Kanji monitoring profile, for fourth graders, an at-risk metacognition profile (with overall low skills) and a proficient metacognition profile (with overall high skills) were identified.

Also, at the second measurement point one year later (M2), three distinct metacognition profiles were identified for second graders (then in third grade). The largest profile was an effective restudy profile. Further, a low Kanji monitoring and an ineffective restudy profile were identified. At M2, for second graders, metacognition seemed more generalizable than at M1, such that the level of metacognition appeared similar across two different tasks for the effective restudy and the ineffective restudy profiles. Moreover, profile classifications were unrelated to task performance. Also, for fourth graders at M2 (who were then in fifth grade), three profiles were found: average metacognition, ineffective restudy, and effective restudy. For all three profiles, metacognitive skills seemed generalizable across the two tasks. Further, task performance did not affect profile classifications for the fourth graders at M2.

Findings thus indicate heterogeneity between children in the quality of metacognition. Moreover, cross-sectional and longitudinal findings indicated development in generalizability of metacognition. For second graders, fewer profiles showed generalizability across tasks than for fourth graders, and for both age groups, at M1, metacognition appeared less generalizable than at M2. Bellon et al. (2020) proposed that the domain generality of metacognition develops around the age of eight. Our findings indicate that when children reached fifth grade and were approximately 11 years old, metacognitive abilities appeared generalizable rather than task-specific for all identified profiles. This indicates that when children cannot monitor and control their text comprehension at this age, they are most likely also unable to monitor and control their Kanji memory.

While profiles show variations in metacognitive skills among children, the identified profile types were not exactly similar for the different grade levels and measurement points. This difference between grade levels and measurement points may indicate that biological maturation and richer educational experiences impact metacognitive development differently as children age. At M1, the metacognition tasks were new, and at M2, the children had experience completing these; the repeated exposure to the metacognition task at M2 could have contributed to observed differences. Further, at M1, profile membership seemed to some extent driven by performance (particularly for the profiles with less generalizable metacognitive skills, low performance on a task was related to low metacognition). However, at M2, the classified profiles appeared to reflect differences in metacognition rather than differences in performance.

The most pronounced difference between the profiles seems to be the extent to which children made effective restudy selections. Most children in profiles showing metacognitive deficiencies decided to restudy very few not-learned items; often, they did not restudy anything at all. The high metacognition profiles were characterized by highly effective restudy selections. In the present study, children made restudy selections after answering test questions, and they did not actually restudy to improve their learning. However, in self-regulated learning settings, these highly strategic learning decisions would most likely lead to improved learning scores and the largest long-term learning improvements (Ohtani & Hisasaka, 2018; Van Loon & Oeri, 2023).

This study specifically focused on heterogeneity in children’s discrimination between correct responses and errors and the extent to which they decided to correct their errors through restudying and withdrawing these. During the elementary school years, the influence of monitoring judgments on control, and particularly restudy, strengthens (Metcalfe & Finn, 2013). Although our study did not specifically focus on monitoring-based control, it is noteworthy that (as visible in the online supplementary material) children who more accurately monitored their performance also made more effective restudy selections. However, the correlations between monitoring accuracy and effective restudy skills were low to moderate, indicating that while monitoring was related to restudy choices, it was not a perfect relationship. This suggests that other variables, for instance, motivational factors, liking of the study tasks, perceptions about task difficulty, and beliefs about the extent to which they would be able to improve, also affected children’s decisions to restudy. Future research could further investigate reasons for heterogeneity in restudy decisions by including motivational variables.

For second graders, overall, the effectiveness of restudy improved for both tasks over one year. However, for fourth graders, mean restudy selections and effective restudy decreased for the Kanji task, and further, performance on the text task appeared to decline over time. These time-related decreases in fourth graders’ control and in their learning performance may be explained by declines in intrinsic learning motivation from middle childhood (around age 9) onwards (Gnambs & Hanfstingl, 2015; Gottfried et al., 2001). Interestingly, the LPAs show extensive individual differences among fourth graders in their decisions to restudy, suggesting that only some experienced this motivational decline. Future research should explore how motivational factors impact differences in metacognitive skills among children.

The analyses investigating longitudinal profile changes show how children moved between profiles over one year. These analyses can bring insight into whether children with initially low metacognition at M1 catch up over time or persist with low skills. In second grade, children with lower metacognitive skills tended to catch up over one year, suggesting a pattern of ongoing developmental progression. This indicates that around the age of eight, poor metacognition may indicate delayed development rather than longer-lasting metacognitive deficits. Notably, developmental improvements, especially in restudy effectiveness, were evident over one year for this age group.

However, signs of longer-term challenges in metacognitive development appeared more identifiable in fourth grade. This may indicate that spontaneous metacognition improvement is less likely at this stage. For this age group, deficits in metacognitive skills tended to amplify over time, potentially evolving into developmental risks. These findings seem to imply that, particularly after age 10, children’s metacognitive skills should be monitored so that when children appear to have low metacognition, they can be given help earlier rather than later to promote their skills. The children who appeared to be at risk for persistent metacognitive deficits exhibited the most challenges in making effective restudy selections. It thus seems highly unlikely that these children can strategically improve their learning through further study. This may imply that, when aiming to identify whether children have metacognitive challenges, it may be advisable to pay particular attention to children’s ability to take appropriate action to learn unknown information and rectify errors.

For fourth graders, 15.4% of the children showed metacognition deficiencies both at M1 (being classified in the at-risk metacognition profile) and M2 (then being classified in the ineffective restudy profile), indicating that they may be at risk for a developmental delay in metacognition. This prevalence seems similar to the prevalence of children in classrooms who have specific learning disorders or disabilities (in the USA 15%, National Center for Education Statistics, 2024). For these children, most often specific learning disorders (SLD, characterized by difficulties in acquiring reading, writing, and math skills, Morsanyi et al., 2018) and attention deficit hyperactivity disorder (ADHD) are diagnosed. SLD and ADHD seem to be characterized by difficulties in metacognition and effortful control (Butzbach et al., 2021; Tzohar-Rozen et al., 2021). Children diagnosed with a form of SLD (dyscalculia or dyslexia) appear to make less accurate self-monitoring judgments than peers without these disabilities (Baten & Desoete, 2019; Geurten & Lemaire, 2023). Currently, very little research investigates reasons for low procedural metacognition and the potential comorbidity with other learning problems. Future research should prioritize investigating the comorbidity of low procedural metacognition with ADHD, SLD, and other risk factors for learning difficulties.

One limitation of the present research is the sample size. Although this sample size is larger than in most developmental research, an even larger sample size would have been preferable for LPAs, as it could provide more robust findings and support the interpretation of the small profiles (Spurk et al., 2020). Selecting profile solutions and then naming profiles is a nuanced and somewhat interpretative process. Our chosen profile solutions revealed distinct levels of metacognitive skills, particularly when examining the effectiveness of restudy selections. We selected the 3-profile solutions for reasons of interpretation; although these profile solutions fit the data well, a higher number of profiles may fit the data even better. In this study, we could not pursue this further because of the sample size (i.e., very small profiles would then emerge). We took a conservative approach to ensure we would not over-interpret solutions with very small profiles; however, our findings should be evaluated carefully. The 3-profile solutions may be a simplification of reality, and possibly, there is more heterogeneity in metacognition than what appears from our data. The main finding of the present research is that there are heterogeneous groups that clearly differ in the quality of procedural metacognition. The differences between profiles appear most pronounced for children’s effectiveness of restudy selections. However, we acknowledge that the reality may be more nuanced, and future research with much larger sample sizes should provide more insights into how many different metacognition profiles would emerge, how this may depend on children’s age, how profiles differ from each other, and how profiles may be most adequately named.

When aiming to holistically understand metacognitive skills, a limitation is that we only had verbal metamemory tasks, asking children to memorize materials and then retrieve this information when taking the test. Interpretations about the extent to which the level of metacognitive skills is similar for multiple tasks may be different when verbal tasks are combined with non-verbal tasks (as done, e.g., by Bellon et al., 2020). Moreover, we only had one indicator of monitoring per task (CJs) and two indicators of control (restudy selections and decision accuracy). All these measures of metacognition were retrospective, such that these were collected after children completed test questions. Future research should aim to extend results with more varied task types and other measures of metacognition (e.g., combining prospective and retrospective judgments).

In summary, heterogeneity between children in metacognition is often overlooked. Findings may advance measurement methods to assess procedural metacognition and to distinguish typical from at-risk development. The present study found different profiles of metacognition, varying in the quality of their monitoring and control skills and the generalizability of metacognition across tasks. Children mainly differed in the effectiveness of their restudy selections. Findings contribute to a better understanding of the typical development of metacognitive skills and offer insights for identifying at-risk development. While eight-year-old children with low metacognition tended to catch up over time, ten-year-old children who were low in metacognition were likely to continue falling behind and had limited chances of catching up later. Specifically, for older children, the inability to improve their learning and correct errors through restudy may signal a more prolonged risk. Approximately 15% of the children appeared to be at risk of severely lagging behind in metacognitive development over time. This suggests that educators should be aware of low metacognition in children from age ten onwards. Findings emphasize the importance of early identification of elementary school children with low metacognition and providing individualized support for improvement, given the limited likelihood of spontaneous catch-up.