Introduction

Theory of mind (ToM) is a well-investigated component of social cognition and consists in attributing mental states such as desires, beliefs, and thoughts to self and others in order to understand, predict, and interpret one’s own and others’ behavior (Mitchell, 1997). The measure widely used to assess ToM is the first-order false belief (FB) task, originally designed by Wimmer and Perner (1983). Typically developing (TD) 3-year-old children systematically fail the task, while from 4 years onwards, the probability of solving it correctly goes from below chance to above chance. Several kinds of FB task have been administered to samples of children over the years, and a meta-analysis has suggested that the developmental pattern of TD children’s responses does not depend on variations in task type (Wellman, 2018; Wellman et al., 2001).

ToM development has been studied in atypical populations, such as children with specific language impairment (SLI), deafness, or autism spectrum disorder (ASD). The main issue at stake is whether these children display a delay in ToM, i.e., development that is slower but substantially similar to that of typically developing children, or a deficit, i.e., differential patterns of development of this competence.

ASD is characterized by a specific deficit in interaction and communication skills. There is consensus in the literature that a delay is also observed in autistic children’s performance on ToM tasks and that the gap between their performance and that of TD children widens with age (Baron-Cohen, 2000; Blijd-Hoogewys et al., 2010; Boucher, 2012; Yirmiya et al., 1998). A longitudinal study showed that the ToM abilities of children with ASD improved during adolescence, but without typical functioning being attained (Ozonoff & McEvoy, 1994). In a recent longitudinal study (Peterson & Wellman, 2019), 3- to 13-year-old children with autism, deafness, or typical development completed the six ToM tasks from the ToM Scale (Peterson et al., 2012) at Time 1 and at Time 2, on average 18 months later. Deaf children and children with ASD obtained lower scores than their TD peers at both Time 1 and Time 2; all three groups displayed steady progress in ToM competence and individual differences among children remained stable over time. Other studies suggest that when the various aspects of the ability to reason about mental states in self and others are compared, children with ASD appear to have greater difficulty attributing beliefs to themselves than to others (for a review, see Williams, 2010). Adults with ASD close this gap and perform similarly to TD adults, although a deficit in false belief processing continues to be observed (Bradford et al., 2018). Furthermore, high-functioning children and adolescents with ASD, aged between 6 and 20 years, can perform as well as TD peers on ToM tasks, but still display limited ToM abilities in everyday social interactions, prompting us to lean towards the conclusion that they ultimately show a deficit (Scheeren et al., 2013).

SLI is characterized by poor performance on tests of language ability, while subjects’ non-verbal cognitive functioning is normal and there is no evidence of neurological damage (Leonard, 2014). Given that ToM and language are interdependent competences (Milligan et al., 2007; for a review, see Bulgarelli et al., 2022), several studies have examined how they are related in children with SLI, obtaining mixed results (see Table 1 for a summary of the participants, tasks, and outcomes). More specifically, some studies found a ToM delay in childhood (Andrés-Roqueta et al., 2013; Durrleman et al., 2017; Farrant et al., 2006, 2012; Hanley et al., 2014) and adolescence (Botting & Conti-Ramsden, 2008). Other studies in contrast did not identify any differences between children with and without SLI (Bulgarelli & Molina, 2013; Ziatas et al., 1998). Interestingly, some studies suggest that the development of children with SLI may not actually be delayed; rather, their performance may be hampered by the linguistic nature of ToM tasks: when children with SLI are tested using tasks of low linguistic complexity, they obtain scores similar to those of TD children (Loukusa et al., 2014; Miller, 2001, 2004; Schaeffer et al., 2018; van Buijsen et al., 2011). Furthermore, inter-individual differences may depend on the severity of the disorder: children with Phonological Language Impairment (who display difficulty in properly forming the sounds of words) seem to be more likely to perform similarly to TD children, while children with Expressive Language Impairment (difficulty understanding and producing words and sentences) or Pragmatic Language Impairment (difficulty making appropriate use of language in social situations) seem more likely to perform poorly. 
Nevertheless, given that it is difficult to recruit large groups of children with SLI to participate in research, only a couple of studies to date have been sufficiently well powered to compare children with different degrees of severity of SLI (Bulgarelli & Molina, 2013; Shields et al., 1996). A meta-analysis examining 17 studies published between 1998 and 2014 found that children with SLI display a delay in ToM compared with age-matched TD children, and that age and gender do not moderate the difference in performance between the two groups (Nilsson & de Lopez, 2016). This meta-analysis does not yet provide sufficient evidence to fully reject the hypothesis that the linguistic complexity of ToM tasks may affect children’s performance, given that it did not include studies that matched participants on language competence.

Table 1 Summary of the literature about theory of mind in children with specific language impairment

Notably, the studies summarized in Table 1 cannot shed light on the question of whether children with SLI show a ToM deficit, because they do not test the correlation of ToM performance with age across the groups of children (with and without language impairment), with the sole exception of van Buijsen et al. (2011). The last-mentioned study found that the pattern of correlations between chronological age and ToM tasks was quite similar in children with SLI and TD children, offering initial support for the hypothesis that children with language impairment do not have a ToM deficit.

The aim of this study was to contribute to the debate about whether children with SLI and ASD display a ToM delay or a ToM deficit. Our first research hypothesis was that children with SLI would not display a deficit or a delay in ToM competence, given that, on controlling for verbal competence, we expected both their performance (i.e., mean ToM scores) and their developmental patterns (i.e., correlation between ToM scores and chronological age) to be similar to those of TD children. Our second hypothesis was that children with ASD would display a delay in ToM, but we did not formulate a hypothesis about whether or not they would display a deficit, given that the literature reports mixed findings in this regard, while recent longitudinal studies have reported substantially similar developmental processes in ASD and TD children. The present study represents an innovative contribution to the literature because it drew on a comprehensive ToM test, the ToM Storybooks, which enables a composite and more precise measure of Theory of Mind competence than FB tasks alone. Furthermore, this instrument enables comparison of children across a wide age range, given that it meaningfully differentiates between the performances of TD children between 3 and 8 years.

Method

Participants

Three different groups of Italian children participated in the study. The children with SLI (N = 43, 13 girls, age range: 4–9 years) were attending the public child neuropsychology service; they had received this diagnosis during routine health care checks, based on their parents’ accounts and tests of their linguistic production and understanding. Of these children, 29 also participated in a study by Bulgarelli and Molina (2013). The children with ASD (N = 47, 8 girls, age range: 5–12 years) were attending two centers run by the national health system and specialized in the treatment of ASD; this group had been diagnosed directly at the centers. The TD children (N = 227, 106 girls, age range: 3–10 years) were recruited at mainstream kindergartens and primary schools across several districts of a large Northern Italian city. Teachers and/or parents did not report any psychological or developmental problems that could hamper these participants’ performance on the study measures. The age and gender distribution of the groups is reported in Table 2.

Table 2 Total sample by age and gender

Measures and Procedures

Three measures were collected: ToM was evaluated through the Italian version of the ToM Storybooks (Blijd-Hoogewys et al., 2008; Bulgarelli et al., 2015; Molina & Bulgarelli, 2012); receptive vocabulary was assessed through the Italian version of the PPVT-R (Stella et al., 2000), and non-verbal cognitive ability was measured with the Leiter-R Brief IQ (Roid & Miller, 2011).

The ToM Storybooks is a comprehensive test that evaluates five ToM abilities drawn from Wellman’s theory of mental state understanding (1990): recognizing emotions, making a distinction between physical and mental entities, appreciating that perception leads to knowledge, understanding how desires affect behavior, and understanding how beliefs affect behavior. The test is based on six books with full color illustrations, which present tasks in the context of a story about a character called Sam. Some tasks are repeated several times in different contexts throughout the stories, with a view to obtaining a more reliable measure. The test comprises 95 items. The Quantitative score ranges from 0 to 77 and is based on 77 closed-ended questions, some of which require non-verbal responses (e.g., pointing at images). The Qualitative score ranges from 0 to 36 and is assigned based on the child’s responses to 18 open-ended questions, which investigate whether the child spontaneously attributes mental states to the story characters (e.g., “Sam looks for the skates in the box because he thinks they are there”: 2 points), only invokes situational aspects (e.g., “Sam looks for the skates in the box because he put them there”: 1 point), or provides wholly inconsistent or incorrect explanations (zero points). Notably, the scores for the qualitative items partly depend on the quantitative items preceding them: if a closed-ended question is answered incorrectly, then a score of zero is automatically assigned to the following open-ended item. The Total score is obtained by summing the Quantitative and Qualitative scores.
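The scoring rules described above can be illustrated with a minimal sketch. It is not the official scoring procedure of the ToM Storybooks: it assumes that each closed-ended question is worth one point (consistent with 77 questions and a 0 to 77 range), and the function names and the index used to link each open-ended item to the closed-ended question preceding it are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def qualitative_item_score(explanation_points: int, closed_item_correct: bool) -> int:
    """Score one open-ended item (0, 1, or 2 points).

    Assumption: the item is gated by a single preceding closed-ended question;
    if that question was answered incorrectly, the item scores zero.
    """
    if explanation_points not in (0, 1, 2):
        raise ValueError("explanation_points must be 0, 1, or 2")
    return explanation_points if closed_item_correct else 0


def storybooks_scores(
    closed_correct: Sequence[bool],
    open_items: Sequence[Tuple[int, int]],
) -> Dict[str, int]:
    """Compute Quantitative, Qualitative, and Total scores.

    closed_correct: one boolean per closed-ended question (assumed 1 point each).
    open_items: (explanation_points, index_of_preceding_closed_question) pairs;
                the index linkage is a hypothetical encoding for illustration.
    """
    quantitative = sum(1 for ok in closed_correct if ok)
    qualitative = sum(
        qualitative_item_score(points, closed_correct[idx])
        for points, idx in open_items
    )
    return {
        "quantitative": quantitative,
        "qualitative": qualitative,
        "total": quantitative + qualitative,
    }


# Toy protocol with three closed-ended and three open-ended items.
# The second open-ended answer would earn 2 points, but its gating
# closed-ended question was answered incorrectly, so it scores zero.
example = storybooks_scores(
    closed_correct=[True, False, True],
    open_items=[(2, 0), (2, 1), (1, 2)],
)
print(example)
```

In the toy protocol, the gating rule is what separates the Qualitative score from a simple sum of explanation points.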

The ToM Storybooks have been standardized for use with Dutch-speaking (Blijd-Hoogewys et al., 2008) and Italian-speaking (Bulgarelli et al., 2015; Molina & Bulgarelli, 2012) populations. The instrument offers good internal consistency (Cronbach’s alpha = .90), test–retest reliability (r = .86, p < .001), and inter-rater reliability (range of Cohen’s Kappa values among coders = .81–.97). It also displays discriminant validity (differentiating children with ASD from TD children), and divergent and convergent validity (Blijd-Hoogewys et al., 2008, 2010; Molina et al., 2020). Concerning content validity, a Principal Component Analysis of the data in the Dutch validation study showed that a five-component model (belief action, emotion recognition, mental physical, belief emotion, and desire emotion) offered the best theoretical interpretation. This solution accounted for 53.8% of variance. A confirmatory factor analysis applied to the data from the Italian study also supported a five-component solution (emotion, desire, mental–physical, belief, and perception knowledge). For a discussion of these two factorial structures, see Bulgarelli et al. (2015). The ToM Storybooks have been translated into multiple languages (English, Finnish, French, Italian, and Spanish), and an adapted Italian version of the test has been developed for blind children (Bartoli et al., 2019).

The children were individually assessed in a quiet room at the healthcare center they attended, or at their kindergarten or primary school. The children agreed to take the tests and both parents provided their written informed consent for the administration of the tests.

Data Analysis

To assess whether children with SLI displayed delayed or atypical development in terms of their ToM performance and whether their pattern of ToM development differed from that of children with ASD, we adopted a developmental trajectory approach (Thomas et al., 2009). When subjects of a wide range of ages are available to form both clinical and typical development samples, the developmental trajectory approach outperforms the matching approach because it permits discrimination between different forms of developmental delay (delayed onset, slowed rate, and delayed onset + slowed rate) as well as different forms of atypical development (non-linear trend, premature asymptote) as compared to the linear pattern usually observed in the TD group.

First, we examined the relationship between the three scores for ToM ability (Quantitative, Qualitative, and Total ToM Storybooks scores) and chronological age (CA) by calculating Pearson’s correlation coefficients and creating a scatter plot for each of the three groups of participants. To control for the effects of differences in verbal and general cognitive ability, these descriptive analyses were conducted partialling out the influence of receptive vocabulary scores (VQ) and non-verbal cognitive ability scores (IQ) from the ToM Storybooks scores.
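As an illustration of this step, the sketch below computes a partial correlation between a ToM score and CA while controlling for VQ and IQ, using the standard recursive formula for partial correlations. It is not the SPSS procedure used in the study; the function names and the toy values are hypothetical, and the toy ToM scores are constructed as an exact linear function of the three predictors so that the expected result (a partial correlation of 1) is known in advance.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(
    x: Sequence[float],
    y: Sequence[float],
    controls: Sequence[Sequence[float]],
) -> float:
    """Correlation between x and y with the control variables partialled out.

    Uses the recursion r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)),
    applied once per control variable.
    """
    if not controls:
        return pearson(x, y)
    *rest, z = controls
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy data: chronological age in months, VQ, and IQ.
age = [36.0, 48.0, 60.0, 72.0, 84.0, 96.0]
vq = [80.0, 95.0, 90.0, 110.0, 100.0, 105.0]
iq = [100.0, 90.0, 105.0, 95.0, 110.0, 98.0]
# ToM built as an exact linear combination, so that once VQ and IQ are
# partialled out, the remaining variation in ToM is fully explained by age.
tom = [0.5 * a + 0.2 * v + 0.1 * i for a, v, i in zip(age, vq, iq)]

r_partial = partial_corr(tom, age, [vq, iq])
print(round(r_partial, 6))
```

With real data, one such coefficient would be computed per ToM score (Quantitative, Qualitative, Total) and per group.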

Second, we tested three regression models in which the different ToM Storybooks scores (Quantitative, Qualitative, and Total scores) were the respective dependent variables, and gender, CA, IQ, VQ, and two between-group factors ASD and SLI (TD was the reference group) were the independent variables. To test whether the developmental trajectory of the children with ASD and SLI differed from that of the TD children, we included the two interactions between group factors and CA in the regression models (ASDxCA and SLIxCA) and conducted simple slope analyses. We entered CA, IQ, and VQ in the models as mean-centered variables. The regression coefficients of the between-group factors (ASD and SLI) express the mean difference between these clinical groups and the TD group at the mean value of CA, allowing us to assess delays in the development of ToM competence as evaluated by the ToM Storybooks. More specifically, a ToM average score that was lower than that obtained by the group of TD children was considered to show a delay when it occurred in combination with a similar regression slope for CA (i.e., a non-statistically significant coefficient for the interaction); a lower ToM average score combined with a lower regression slope compared to that of the TD group was considered to show a deficit.
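The logic of this model can be sketched as follows. The example is a simplified illustration, not the analysis run in SPSS and PROCESS: it omits gender, IQ, and VQ for brevity, uses noiseless synthetic data with hypothetical coefficients, and performs no significance testing (the decision rule takes the outcome of such tests as inputs). All function names are assumptions introduced for the example.

```python
from typing import Dict, List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def design_row(ca_centered: float, group: str) -> List[float]:
    """Columns: intercept, CA, ASD, SLI, ASDxCA, SLIxCA (TD is the reference)."""
    asd = 1.0 if group == "ASD" else 0.0
    sli = 1.0 if group == "SLI" else 0.0
    return [1.0, ca_centered, asd, sli, asd * ca_centered, sli * ca_centered]


def simple_slopes(b: Sequence[float]) -> Dict[str, float]:
    """Slope of ToM on CA within each group, derived from the coefficients."""
    return {"TD": b[1], "ASD": b[1] + b[4], "SLI": b[1] + b[5]}


def classify(mean_lower: bool, slope_lower: bool) -> str:
    """Decision rule from the text; inputs are outcomes of significance tests."""
    if mean_lower and slope_lower:
        return "deficit"
    if mean_lower:
        return "delay"
    if not slope_lower:
        return "no delay or deficit"
    return "not covered by the stated rule"


# Synthetic sample: (group, chronological age in months); values are hypothetical.
records = (
    [("TD", a) for a in (40, 60, 80, 100, 120)]
    + [("ASD", a) for a in (60, 90, 120, 140)]
    + [("SLI", a) for a in (50, 70, 90, 110)]
)
mean_ca = sum(a for _, a in records) / len(records)
X = [design_row(a - mean_ca, g) for g, a in records]  # CA is mean-centered

# Hypothetical generating coefficients: the ASD group has a lower mean and a
# flatter slope; the SLI group matches the TD group on both.
TRUE = [50.0, 0.5, -8.0, 0.0, -0.3, 0.0]
y = [sum(b * x for b, x in zip(TRUE, row)) for row in X]

coefs = ols(X, y)
slopes = simple_slopes(coefs)
print({g: round(s, 4) for g, s in slopes.items()})
```

Because CA is mean-centered, the coefficients of the ASD and SLI factors are read as group differences at the mean age of the sample, and the interaction coefficients are read as differences in slope relative to the TD group.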

We evaluated the overall goodness of fit of the regression models by calculating R2 and used the variance inflation factor (VIF) to assess collinearity among the independent variables (values greater than 2.5 were taken to indicate collinearity). We measured the effect size of the interactions by calculating the difference in R2 between a model that omitted the interaction term and one including the interaction term. We used IBM SPSS Statistics 26 to perform these analyses and PROCESS v3 macro (Hayes, 2012–2020) for the simple slope analyses.
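The two diagnostics mentioned above reduce to simple arithmetic, sketched below. The helper names and the numeric inputs are hypothetical; only the 2.5 threshold comes from the text. The VIF of a predictor is computed from the R2 obtained when that predictor is regressed on all the other independent variables, and the effect size of an interaction is the gain in R2 when the interaction term is added.

```python
def vif(r2_predictor: float) -> float:
    """Variance inflation factor: 1 / (1 - R2_j).

    r2_predictor is the R2 from regressing predictor j on the other predictors.
    """
    if not 0.0 <= r2_predictor < 1.0:
        raise ValueError("r2_predictor must be in [0, 1)")
    return 1.0 / (1.0 - r2_predictor)


def indicates_collinearity(vif_value: float, threshold: float = 2.5) -> bool:
    """Flag collinearity when the VIF exceeds the threshold used in the text."""
    return vif_value > threshold


def delta_r2(r2_with_interaction: float, r2_without_interaction: float) -> float:
    """Effect size of an interaction: gain in R2 from adding the term."""
    return r2_with_interaction - r2_without_interaction


# Hypothetical inputs for illustration only (not values from the study).
print(vif(0.50), indicates_collinearity(vif(0.50)))
print(vif(0.75), indicates_collinearity(vif(0.75)))
print(round(delta_r2(0.60, 0.56), 2))
```

Under this definition, a threshold of 2.5 is reached when the other independent variables explain 60% of the variance of a given predictor.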

Results

Descriptive statistics for the three groups are reported in Table 3.

Table 3 Sample characteristics by groups

In terms of the associations between age and the other variables, we found a significant correlation between age and IQ and VQ in the TD and SLI children, whereas the ASD children displayed a different pattern: their correlation coefficients for age were lower in general, and the correlation between age and Qualitative score, representing the ability to explain behavior in terms of mental states, was particularly weak (see Table 4 and supplementary materials).

Table 4 Partial correlation with age, controlling for IQ and VQ in TD, ASD, and SLI children

Moving on to the results of the regression analyses (Table 5), overall goodness of fit was satisfactory for all three models estimated; explained variance was 58% for Quantitative scores, 61% for Qualitative scores, and 63% for Total scores.

Table 5 Regression results

All VIF values were lower than 2.5, indicating that collinearity was not an issue. With regard to the control variables, the coefficient for gender was not statistically significant in any of the regression models, while IQ and VQ scores both exerted a positive and statistically significant influence on the Qualitative, Quantitative, and Total scores. With regard to the variables CA and group membership and their interactions, the coefficient for CA—which measured the influence of chronological age among TD children (the reference group)—was positive and statistically significant for all three models. Its standardized value ranged from .67 to .71, reproducing the pattern of associations identified in the correlational analysis (Table 4, first column). The coefficient for the SLI group factor was not statistically significant in any of the models, meaning that the mean ToM Storybooks scores of the SLI group did not significantly differ from those of the TD group, once the influence of the control variables was held constant and the mean difference was evaluated at the mean chronological age. A different outcome was identified for the ASD group condition factor: in this case, the coefficient was statistically significant in all the models and the relationship was negative and strong. Specifically, the standardized regression coefficient (see the column headed ‘Beta’ in Table 5), ranged from − .22 to − .24, meaning that the mean ToM Storybooks scores of the ASD group were around 1/5 or 1/4 of a standard deviation lower than the mean ToM Storybooks scores obtained by the TD group. The mean differences between the TD group and the ASD and SLI groups, respectively, are represented in Fig. 1, along with their 95% confidence interval values.

Fig. 1 Mean differences and 95% confidence intervals for SLI and ASD scores compared to TD scores (regression estimates)

When we added the interaction between CA and group membership, this led to a statistically significant increase in R2 in all the models. The greatest improvement was observed in relation to the Qualitative score outcome variable (ΔR2 = 0.04, p < .0001), followed by the Total score outcome variable (ΔR2 = 0.02, p < .0005), and lastly the Quantitative score outcome variable (ΔR2 = 0.01, p < .05). In all three models, only the coefficient for the ASD x CA interaction was statistically significant. As illustrated in Figs. 2, 3 and 4, simple slope analyses revealed that the slope of the ASD group was not significantly different from zero in the Qualitative score model and was weaker than that of the TD group in the other two models. As may be observed in the figures, the simple slopes estimated for the SLI group overlapped with those of the TD group, reflecting the fact that the interaction term was not statistically significant.

Fig. 2 Qualitative ToM score: simple slope tests

Fig. 3 Quantitative ToM score: simple slope tests

Fig. 4 Total ToM score: simple slope tests

Discussion

The literature offers mixed evidence concerning ToM development in children with SLI, insofar as some studies have found that these children are delayed in their development (Andrés-Roqueta et al., 2013; Durrleman et al., 2017; Farrant et al., 2006, 2012; Hanley et al., 2014; Nilsson & de Lopez, 2016), while others have not (Bulgarelli & Molina, 2013; Ziatas et al., 1998), and there is only preliminary evidence suggesting that they may not show a ToM deficit (van Buijsen et al., 2011). The findings of this study contribute to the literature by confirming our first research hypothesis: the children with SLI in our sample displayed neither a delay nor a deficit in ToM competence, once their receptive vocabulary had been controlled for; indeed, their mean ToM scores did not significantly differ from those of the TD children and the patterns of correlation between age and ToM scores were substantially similar in the two groups. Furthermore, in the regression analysis, neither membership of the SLI group nor the interaction between SLI status and age wielded a statistically significant effect.

With regard to the second hypothesis, the children with ASD displayed a deficit in ToM, not only a delay. Indeed, compared to the TD children, the children with ASD obtained lower mean ToM scores, and lower correlations between age and ToM scores. Furthermore, in the regression analysis, age, language, non-verbal cognitive functioning, and ASD were the factors that explained variance in the ToM scores. Notably, the Quantitative ToM scores of children with ASD improved with age, albeit at a slightly lower rate than those of the TD children and children with SLI, while their Qualitative score was similar to that of the other two groups in the younger cohort (at the age of 52 months) and only differed in the older ones. This outcome reflects the specificity of the Qualitative score, which measures the ability to explicitly explain responses to the ToM tasks in terms of mental states; in other words, in the qualitative items, children are not simply asked to recognize the effects of mental states on behaviors and/or emotions, but rather to refer to these mental states on their own initiative. Hence, a growing gap between children with and without ASD is to be expected from the age of 3 years, when TD children usually have not yet acquired the ability to spontaneously invoke mental states as explanations, to 12 years of age, when spontaneous references to mental states are extremely common in TD children but still rare in children with ASD.

These findings partly differ from those of the two longitudinal studies reported in the literature that appeared to exclude a specific deficit in ToM. The first of these studies, by Blijd-Hoogewys et al. (2010), found that children with ASD displayed nonlinearities in their ToM development just as TD children do; the overall developmental pattern was similar in both groups, with children on the autism spectrum only displaying a delay. Blijd-Hoogewys and colleagues only examined children’s development over a limited time period and from a micro-genetic perspective, while our own investigation was conducted with subjects of a broad range of ages in a cross-sectional sample. These two different levels of developmental analysis may not reveal the same pattern, and thus the question of whether children with ASD display a ToM deficit or not remains open. The limitation of Blijd-Hoogewys et al.’s work (2010) is the restricted period of time investigated, whereas the shortcoming of our own work was that we could not track developmental patterns over time, because we lacked a sufficient number of cases and longitudinal data. The second longitudinal study, by Peterson and Wellman (2019), based on the 6-step ToM Scale (including measures of diverse desires, diverse beliefs, knowledge access, false belief, hidden emotion, sarcasm), found that children with ASD displayed steady progress in ToM competence over time as did TD children. This outcome appears to suggest a ToM delay in the former group. However, Peterson and Wellman (2019) reported that 65% of the children with ASD who passed the Hiding Emotion Task failed the False Belief task, while all TD children who passed the Hiding Emotion Task also passed the False Belief task. This inversion in the scaling of the tasks in the two populations is a sign of qualitatively different development in children with ASD as compared to TD children, thus suggesting a ToM deficit.
This is in line with our own observations of a differential pattern of ToM scores across groups and a different pattern of correlation between ToM scores and age. In Peterson and Wellman’s article, further possible signs of qualitatively different development in children with and without ASD might be detected, in support of the deficit hypothesis: specifically, the participants were divided into two groups as a function of their performance at Time 1: children with low ToM scores and children with high ToM scores. In the ASD sample, the children in the low scoring group displayed a bigger increase in their scores from Time 1 to Time 2 than did the children in the high scoring group. In the TD sample, the opposite pattern was observed, with the children in the high scoring group displaying stronger gains at Time 2 than their peers in the low scoring group.

At present, our results offer no clear contribution to the debate about the impact of the linguistic complexity of the ToM tasks used to assess children with SLI. Some studies have shown that these children are more likely to obtain similar scores to TD children when tested using tasks of low linguistic complexity (Loukusa et al., 2014; Miller, 2001, 2004; Schaeffer et al., 2018; van Buijsen et al., 2011). Overall, the ToM Storybooks administered in our study is not an instrument of low linguistic complexity, although the Quantitative items only involve receptive language (requiring the child to answer yes/no or point to images), while the Qualitative questions also involve expressive language and the use of mental state lexicon. Neither Quantitative nor Qualitative ToM scores differed significantly across the SLI and TD groups, suggesting that children with SLI could succeed on the tasks irrespective of the type of language involved. One possible explanation for this outcome, which suggests that children with SLI develop typically in terms of their ToM abilities, is that we administered all six books in the ToM Storybooks instrument, thus maximizing the number of similar ToM tasks that were repeated, and arguably thus obtaining a more reliable measure of the children’s competence than if we had only used classical FB tasks. Further research is required to test the linguistic complexity hypothesis. It should also be noted that the impact of task linguistic complexity may be mediated by severity of linguistic impairment, as already discussed in Bulgarelli and Molina (2013). To test this hypothesis, future studies should compare children with phonological, expressive, and pragmatic language impairment.

With respect to the limitations of the current study, first, the number of participants in each age group was generally low and some age groups were not represented at all. This is usually referred to as the “common support problem”, which may increase the root mean square error of the regression estimator (Lechner & Strittmatter, 2019). A second limitation is that the children with SLI and with ASD were not divided on the basis of the severity of their condition: this information was available for the children with SLI, but the small number of participants did not allow statistical comparisons to be conducted; the equivalent information was not available for the children with ASD. A third limitation concerns the use of the PPVT-R, which only evaluates general language competence in terms of receptive language and vocabulary size. However, receptive language is the skill that children require to take the ToM test, so it seems reasonable to view it as a sufficient measure of their linguistic competence for the purposes of this study. Another limitation is the low mean score for receptive vocabulary obtained by the TD participants. Nevertheless, the children were recruited at mainstream schools and neither the participants’ parents nor their teachers reported developmental issues of any kind. It is possible that this limitation was related to the wider range of socio-economic backgrounds represented in our sample compared to the PPVT-R normative sample. Nevertheless, we cannot test this hypothesis, given that the standardization study for the Italian version of the PPVT-R did not report information about the socio-economic background of the children who took part in it.
Moreover, to control for this issue, we conducted independent samples t-tests, which showed that the mean PPVT-R scores of the TD children differed significantly from the average PPVT-R scores of the children with SLI and with ASD, suggesting that the instrument was able to discriminate adequately between the groups. Finally, we should also acknowledge the wider call in the literature to revisit research perspectives on autism (e.g., Woods et al., 2018): notably, the Double Empathy Problem suggests that the limited ToM abilities observed in persons with ASD are actually the outcome of barriers created when neurotypical interlocutors misunderstand and misperceive their communications. Given that little research on such reciprocal dynamics has been conducted to date (Mitchell et al., 2021), our study cannot contribute to this debate.

Future research about ToM development in children with SLI and with ASD should focus on obtaining longitudinal data to validate the developmental patterns observed in our cross-sectional data and on observing whether delays or deficits in these groups persist beyond childhood. In addition, further research on different types of SLI and different degrees of severity in ASD is needed, given the crucial clinical value of being able to discriminate between children with ASD and children with SLI (and specifically those with Pragmatic Language Impairment). Comprehensive tools such as the ToM Storybooks could then be used in support of more reliable diagnostic processes, across a broad range of ages during childhood.

The novel contribution of the present study to the literature on ToM development in atypical populations lies in our use of a comprehensive test, the ToM Storybooks, which yields a composite and more reliable measure of theory of mind ability than FB tasks used in isolation. Furthermore, we did not find any ceiling effect across the age range under study, an issue that arises when relying on classical FB tasks (see also Thomas et al., 2009). The impact of this type of research on clinical practice is potentially crucial, given that ToM tasks and tests can be used to better describe children’s mental state understanding and related abilities and to plan more tailored interventions accordingly. This seems particularly important for children with SLI who mainly need support in relation to their development of language and communication skills, given that their ToM ability is mainly preserved.