Introduction

At the beginning of our infancy, we as human beings are physiologically premature and deficient (Gehlen 2014; Portmann 1941), and in contrast to many nonhuman animals, we are highly dependent on intense care and social bonding to our loved ones. Embedded in a sociocultural environment, many developmental processes rely on social interaction (Portmann 1941) to the extent that the development of complex human cognition builds upon basic social cognitive processes formed in early childhood (e.g., Rochat et al. 1999; Soto-Icaza et al. 2015). Social cognition (SC) can be regarded as a dynamic constructive and conceiving process of perception, categorization, recall, and evaluation of social stimuli. Experience with the social world emerges from the perception and categorization of behavioral events that unfold in a mutual interplay between self and others over time (Macrae and Miles 2012; Vogeley 2009), and develops by increasing implicit and explicit knowledge of new concepts (Low and Perner 2012). SC acts as an umbrella term and, in this context, we can differentiate several subfunctions like empathy, imitation and social meta-cognition. In this paper we will especially focus on several aspects of mindreading, since mindreading serves as a relevant concept to describe the process of perspective taking and tracking the intentional states of others.

Agency, intentionality, and purposive explanations as efficacious factors of human SC enable us to ascribe observable actions of others to their unobservable mental states. The ability to hypothesize about feelings, desires, intentions and beliefs of oneself and of other people is subsumed in the technical term Theory of Mind (e.g., Perner and Lang 1999), and depends on basal processes like (self-)representation, (self-)perception, and categorization. These processes do not always involve a conscious subjective experience (Brosch et al. 2010) and several studies have shown that judgments within the scope of SC take place without perceptual awareness (Stewart et al. 2012; Todorov et al. 2009). One possibility for defining this fragmentation as a testable construct is to subdivide Theory of Mind (ToM) into implicit and explicit components to describe different aspects of processing social stimuli (Frith and Frith 2012). The implicit component is delineated as unconscious, procedural, non-conceptual, automatic, and spontaneous processing, while explicit processes are described as conscious, declarative, conceptual, and controlled (Kahneman 2011; Low and Perner 2012; Payne 2012). Explicit processes interfere with currently ongoing activity and interrupt automatic and procedural behavior or cognition (Kahneman and Frederick 2002). Implicit processes are characterized by the near impossibility to report how a decision or an action has been accomplished. However, a wide range of cognitive abilities thought to be explicit are based on implicit processes (Gasper and Clore 2000; Higgins 1998; Oikawa 2010).

Contrary to the common belief that the causes of behavior in social situations lie in conscious decisions, many noteworthy studies (Asch 1951; Festinger and Carlsmith 1959; Milgram 1963; Schachter and Singer 1962) have shown that social actions are influenced more strongly by automatic, unconscious, and uncontrolled processes than expected (Dijksterhuis et al. 2006; Frith and Frith 2012; Payne 2012). Facial expressions are among the most important carriers of information about the emotional state of the person displaying them (Martinez and Du 2012), and the attribution of emotions to others is mainly based on their facial expressions. Therefore, “emotion reading is a fundamental kind of mindreading” (Goldman 2006, p. 113). Generally, neurotypical individuals are highly motivated to attend to this continuous sequence of varying muscular contractions and relaxations and to learn to discriminate it into categories of basic emotions ab initio (Farroni et al. 2004; Grossmann et al. 2008; Johnson et al. 1991; Rigato et al. 2011).

The amount of information conveyed by a facial emotional expression requires the ability to reduce the complexity of the continuous sequence and interaction of action units with varying intensity of muscular contractions and relaxations (Brosch et al. 2010). To achieve this, the categorization of facial expressions depends on very early implicit learning, understood as “the acquisition of knowledge that takes place largely independently of conscious attempts to learn and largely in the absence of explicit knowledge about what was acquired” (Reber 1993, p. 5). Following the de- and reconstruction of the concept of ToM by Schaafsma et al. (2015), we regard emotion processing, in the sense of inferring emotional cues from human actions, animate motion, and facial expressions, as a relevant aspect of ToM.

Nevertheless, whether facial emotion categorization (FEC) should be attributed to a lower-order process or to a higher-order feature of social cognition is still contested. Some neuroimaging studies suggest that basic visual perception in neurotypically developed individuals (NTD; Yamasaki et al. 2002) and in individuals with autism spectrum disorder (ASD; Dawson et al. 2004) is based on lower-order processes. On the other hand, there is evidence from current studies and theoretical approaches that the perception and categorization of complex emotional stimuli are influenced by two different processes. For the categorization of the simplest, basic emotions, it is sufficient to process the stimuli by means of purely perceptual, automatic processes. With increasing complexity of facial expressions and their embedding in complex social situations, the categorization is carried out by a cognitive process that uses already acquired background knowledge to conceptualize and categorize the emotions in accordance with the situation (Marchi and Newen 2015). These conceptualizations, considered higher-level cognitive states, then influence the perceptual experience (Lindquist et al. 2014).

Although typically developed children are able to name basic facial emotions, the process of perception and categorization remains unconscious, automatic, procedural, and accordingly implicit (Goldman 2006; Kahneman 2011; Shi et al. 2013). Thus, perceived facial emotions have an important influence on decision making, especially in complex, highly contextual social situations (Van’t Wout and Sanfey 2008; Willis and Todorov 2006; Winston et al. 2002) that take place under time pressure, with limited information and limited cognitive resources. Some authors describe this kind of social information processing as simple and frugal heuristics in a social world (Hertwig and Hoffrage 2013), where a heuristic is characterized as “a strategy that ignores part of the information, with the goal of making decisions more quickly, frugally and/or accurately than more complex methods” (Gigerenzer and Gaissmaier 2011, p. 454). In particular, facial identity and facial expression that are processed automatically rather than explicitly influence our socially relevant cognitions and decisions (Mehu et al. 2007; Scharlemann et al. 2001; Van’t Wout and Sanfey 2008).

Implicit ToM (iToM) in this context includes the ability to perceive and recognize expressed basic facial emotions (Esteves and Ohman 1993; Todorov 2012; Uleman et al. 2008; Whalen et al. 1998). It can be assessed by tasks concerning the perception and categorization of facially expressed emotions, whereas explicit ToM (eToM) is mainly measured by classical false belief tasks (FBT) of first and second order.

Transferring dual process theories of social cognition (for an overview of non-ToM-related accounts in SC, see Sherman et al. 2014) to the domain of ToM, it is conceivable that the perception and categorization of facial expressions continuously influence our presumptions, beliefs, and decisions in social situations without our awareness. Apperly and Butterfill (2009) describe a similar approach: they assume that human infants and non-human animals have access to an implicit ToM system with high cognitive efficiency but low flexibility. During the course of development, human children gradually acquire more complex psychological concepts (Apperly and Robinson 2003) that improve with explicit training and build a second ToM system that is flexible but slow and inefficient. NTD adults are equipped with both systems working in parallel.

While it is common in many publications to label tasks using verbal responses as ‘explicit’ and tasks measuring non-verbal behavior as ‘implicit’ (e.g., Kliemann et al. 2013) we start from the premise that predominantly automatic processes are also implicit even if the responses are given verbally. With respect to this issue we are in line with current studies establishing evidence that explicit mentalizing does not influence automatic processing (Schneider et al. 2012, 2014).

ASD, conceived as one of many neurodevelopmental disorders (DSM-5, American Psychiatric Association 2013), is associated with severe and pervasive impairments in social interaction and communication (Frith and Frith 2012; Hill et al. 2004). Some studies suggest that individuals with ASD and typical IQ are able to solve explicit ToM-tasks (Happé 1995; Senju 2012), likely using compensatory strategies (Frith and Frith 2012; Senju 2012). However, they still score significantly lower in tests concerning complex social situations (Dziobek et al. 2006) and in facial emotion categorization (FEC) tasks (Schuwerk et al. 2014).

As Uljarevic and Hamilton (2012) and Lozier et al. (2014) outlined in their meta-analyses, a considerable emotion recognition deficit forms part of ASD. This finding should be interpreted with due care: on the one hand, it is presumably influenced by publication bias; on the other hand, the deficit is most likely underestimated because most studies used static stimuli and did not limit stimulus presentation time. This may have led to similar accuracy rates for neurotypically developed individuals (NTD) and individuals with ASD in many studies. Thus, in terms of ecological validity, many researchers recommend the use of dynamic and time-limited facial emotion stimuli in future research (Gepner and Féron 2009; Parish-Morris et al. 2013; Rahko et al. 2012; Uljarevic and Hamilton 2012). We will address this issue later in the “Method” section.

From the first study investigating explicit ToM in ASD (Baron-Cohen et al. 1985) to the latest meta-analyses concerning neural correlates of explicit ToM in ASD (Aboulafia-Brakha et al. 2011; Devine and Hughes 2014; Sugranyes et al. 2011), a deficient ability to infer the mental states of others has been a consistent observation in individuals with ASD. Nevertheless, the extent to which differences between ASD and NTD can be measured depends on the level of difficulty of the explicit ToM-tasks. While typically developed children between the ages of 3 and 4 years are able to solve verbally presented first-order FBT (Wellman et al. 2001; Wimmer and Perner 1983), most children with ASD do not pass this task until the age of 11 years (Happé 1995). In a test using a method of implicit learning, typically developed children aged 2 years are able to spontaneously anticipate others’ actions, while adults with ASD still fail to pass the test (Senju et al. 2010). Second-order FBT can be solved by children at the age of 6 to 7 years (Perner and Wimmer 1985). Advanced ToM abilities, needed for complex social situations in which wrong behavior has to be represented by a cognitive and an empathic affective component, first appear in typically developed children between the ages of 9 and 11 years (Baron-Cohen et al. 1997).

The process of perception and categorization of social stimuli is an adaptive function, and in addition to facial emotional expression, it is based on the integration of several sources, including voice cues like prosody, body cues like gestures and posture, odor, time, space, and environment (Murphy 2012). The basic mechanism by which context effects are integrated cognitively was investigated by Rumelhart and McClelland (1981, 1982). Their interactive activation and competition model, tested by means of target letter perception in words, supports the assumption that context has a bearing on perception and that the duration and timing of the context in which a target appears improve its perceivability. Different constructivist approaches consider contextual information an important framework that can change the categorization of the same facial expression, depending on the contextual background (Kim et al. 2003, 2004). The categorization of facial expressions can be influenced by odor (Leppanen and Hietanen 2003) or emotional context (Righart and de Gelder 2008), and it has furthermore been shown that contextual factors can have an equal or greater influence on the categorization of emotional information than the facial expression itself (Carroll and Russell 1996).

To assess the full range of demands on SC we compiled a test battery covering a broad range of well established and newly developed tasks that will put to the test the concept of implicit and explicit ToM with (1) facial emotion categorization (implicit), (2) first and second-order FBT (explicit), and (3) complex SC (tasks with complex social interaction in a contextual framework of a coherent story including implicit and explicit aspects of SC).

Compiling this detailed test battery, we investigated which differences arise between adolescents with high-functioning ASD and their typically developed peers. The aim was to determine if there is an overall deficit in SC, especially in ToM abilities of the participants with ASD as compared to NTD. Furthermore, we examined to what extent differences between implicit (concerning facial emotion categorization) and explicit (concerning FB conditions) ToM measures correlate with autistic symptomatology and/or IQ.

Method

Participants

A total of 47 adolescents aged between 14 years 0 months and 17 years 11 months took part in the present study. Two participants had to be excluded from the original ASD group (n = 25): one participant performed at chance level in all facial emotion recognition tasks, conceivably because of prosopagnosia, and a second participant had to be excluded due to technical problems during the test session. Thus, 22 neurotypically developed participants (NTD) and 23 participants with ASD were included in the statistical analyses. ASD and control participants were recruited from project databases of the Department of Child and Adolescent Psychiatry, Psychotherapy, and Psychosomatics of the Medical Center – University of Freiburg. The groups were matched by chronological age and nonverbal intelligence (see Table 1) as assessed by the CFT 20-R (Weiß 2008), which belongs to the family of culture-fair intelligence tests measuring general fluid intelligence (Cattell 1963). ASD diagnoses were established by means of instruments considered to be the gold standard, the Autism Diagnostic Observation Schedule (ADOS; German version: Rühl et al. 2004) and the Autism Diagnostic Interview-Revised (ADI-R; German version: Bölte et al. 2006a). To assess the severity of ASD symptoms concerning social responsiveness, we used the Social Responsiveness Scale (SRS; German version: Bölte and Poustka 2008). For the NTD group, the CBCL/4-18 (Döpfner et al. 1994), a parent questionnaire assessing behavioral and emotional problems of children, and its self-rating equivalent, the YSR/11-18 (Döpfner et al. 1998), were completed in order to exclude psychiatric symptoms. To classify empathizing and systemizing abilities based on the E-S theory of Baron-Cohen (2009), participants filled out the EQ and the SQ (German versions). To measure alexithymia, that is, the inability to identify and describe one’s own emotions and those of others, we used the Toronto Alexithymia Scale (TAS-26; German version: Kupfer et al. 2001). The study was conducted according to the principles expressed in the Declaration of Helsinki. Informed written consent was obtained before testing. The local ethics committee approved the study.

Table 1 Sample’s characteristics

Tests

(1) Complex Tasks

MASC (Dziobek et al. 2006). The Movie for the Assessment of SC (MASC) is a sensitive video-based test to assess mindreading difficulties in complex social situations. Participants watch a 15 min movie about four individuals getting together for a dinner party. The video stops 51 times, and the participants are required to answer questions about the mental states of the different characters in a multiple-choice format. The main outcome is the accuracy rate for correctly given answers.

AToM (Schaller and Rauh 2011). The Animated-Theory-of-Mind (AToM) test aims at increasing ecological validity by using dynamic video-based stimuli close to everyday life (taken from the TV animation series “Shaun the Sheep”). Within a non-verbal story that features social interactions, participants are asked about different aspects of SC, including facial emotion expression, intentions, deception, and first- and second-order false beliefs of the protagonists (see Fig. 1). After every relevant scene, the participants are asked questions about the facial expressions, thoughts, beliefs, and intentions of the protagonist(s) that are observable or inferable in a contextual and coherent story. Items concern FEC, FB, and comprehension of complex SC. The mean of these items serves as the AToM-Total score, which integrates all SC-related answers. In addition, participants are asked questions about the presence and number of peripheral objects (PO) outside the (social) gist of the scene. PO items are included as a measure of the participants’ basic story understanding beyond the items concerning SC. On the one hand, PO serves as a subscale of working memory, providing a measure to compare this aspect of executive functions (Miyake et al. 2000) between ASD and NTD. On the other hand, we wanted to examine whether individuals with ASD show a basic understanding of the task and pay attention to it at all.

Fig. 1 Scene from the Animated-Theory-of-Mind Test (AToM). The sheep deceive the farmer into abandoning the swimming pool by letting him believe that a storm is brewing (printed with permission from Aardman Animations Limited. Shaun the Sheep (word mark) and the characters “Shaun the Sheep” © and ™ Aardman Animations Limited. Licensed by WDR mediagroup GmbH)

In addition to the AToM-Total score, the task allows for the assessment of six subscales: Facial Emotion Categorization (FEC = 14 items; e.g., “What facial expression does the dog show when he recognizes the shark in the pool?”), first-order false belief (FB1O = 16 items; e.g., “What does the farmer think is swimming in the pool?”), second-order false belief (FB2O = 6 items; e.g., “What does Shaun think, what the farmer thinks, is swimming in the pool?”), comprehension questions concerning the causes for the emotional expression of the protagonists [Comprehension Facial Emotion Categorization (AToM-Comp-FEC) = 13 items; e.g., “Why does he show this facial expression?”], reality questions (AToM-Comp-RQ = 4 items; e.g., “Who is actually swimming in the pool?”) concerning distinctions between appearance (beliefs of protagonists) and reality (perspective of the omniscient observer), and peripheral objects (AToM-PO = 19 items; e.g., “How many objects have you seen standing on the side table next to the dog?”). The AToM-FEC subscale includes 14 FEC items. The AToM-FB subscale includes the items of FB1O and FB2O and thus comprises 23 items. In this version of AToM, all answers are given verbally without a fixed response format and are audio-recorded. Scoring was carried out by two independent blinded raters, who assigned the verbally given answers to a prescribed list of possible answers for each question. Inter-rater reliability was assessed for (i) consistency as well as for (ii) absolute agreement using two-way random, single-measures intraclass correlation coefficients (ICC; McGraw and Wong 1996) for all AToM scales. According to the guidelines of Cicchetti (1994), the resulting ICC values were fair to excellent for consistency [AToM-Total: ICC(C,1) = 0.892; AToM-FEC: ICC(C,1) = 0.808; AToM-FB: ICC(C,1) = 0.664; AToM-Comp-FEC: ICC(C,1) = 0.894; AToM-Comp-RQ: ICC(C,1) = 0.538; AToM-PO: ICC(C,1) = 0.832] as well as for absolute agreement [AToM-Total: ICC(A,1) = 0.788; AToM-FEC: ICC(A,1) = 0.747; AToM-FB: ICC(A,1) = 0.606; AToM-Comp-FEC: ICC(A,1) = 0.828; AToM-Comp-RQ: ICC(A,1) = 0.542; AToM-PO: ICC(A,1) = 0.802].
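The verbal labels attached to the ICC values can be made explicit with a short sketch. It assumes the commonly cited cut-offs from Cicchetti (1994): below .40 poor, .40 to .59 fair, .60 to .74 good, and .75 or above excellent. The function name and the treatment of values that fall between two published bands (e.g., .747) are illustrative assumptions and not part of the original scoring procedure.

```python
def cicchetti_label(icc: float) -> str:
    """Verbal label for an ICC value, assuming Cicchetti's (1994) cut-offs.

    Assumption: unrounded values are compared against .40, .60 and .75,
    so a value such as .747 is labelled "good" rather than "excellent".
    """
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# ICC values as reported in the text.
consistency = {  # ICC(C,1)
    "AToM-Total": 0.892, "AToM-FEC": 0.808, "AToM-FB": 0.664,
    "AToM-Comp-FEC": 0.894, "AToM-Comp-RQ": 0.538, "AToM-PO": 0.832,
}
absolute_agreement = {  # ICC(A,1)
    "AToM-Total": 0.788, "AToM-FEC": 0.747, "AToM-FB": 0.606,
    "AToM-Comp-FEC": 0.828, "AToM-Comp-RQ": 0.542, "AToM-PO": 0.802,
}

for scale, value in consistency.items():
    print(f"{scale}: consistency {value:.3f} -> {cicchetti_label(value)}; "
          f"agreement {absolute_agreement[scale]:.3f} -> "
          f"{cicchetti_label(absolute_agreement[scale])}")
```

Under these cut-offs, no scale falls into the "poor" band, and only AToM-Comp-RQ falls into the "fair" band.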

In cases of ambiguity a third independent and blinded rater decided if the answer should be counted as right or wrong. Main outcome is the accuracy rate for correctly given answers.

(2) Implicit Tasks

Dynamic Emotion Categorization Test (DECT; Rauh and Schaller 2009). To examine the usefulness of dynamic facial emotion stimuli, we combined videos of natural actors with animations of artificial actors displaying facial expressions of basic emotions at a defined level of intensity. Four characters (two natural, two virtual) express six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) at three levels of intensity (weak, medium, strong), resulting in 72 items. In a pilot study, a certified coder of the Facial Action Coding System (FACS; Ekman and Friesen 1976) screened the 72 items for consistency with the FACS. One female and one male virtual character were designed and specified with the Agent Framework (Helzle et al. 2004). Each video clip or animation lasts 6 s: for the first 3 s the face is neutral, and it then starts to progress into the emotional expression. The DECT starts with seven practice trials followed by the 72 video clips or animations. Facial emotion type and intensity level were distributed in a pseudo-randomized manner with the restriction that within one block of trials all six basic emotions were presented exactly once and each intensity level (weak, medium, strong) exactly twice. Responses were given verbally, and the experimenter entered each answer by button press. For every item, the six basic emotions (anger, disgust, fear, happiness, sadness, surprise) were offered as response options. The main outcome is the accuracy rate for correctly categorized emotions.
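The block restriction described for the DECT can be illustrated with a small sketch. The text does not report how the trial sequence was actually generated, so the construction below is only one possible way to satisfy the stated constraints: 4 characters × 6 emotions × 3 intensities = 72 unique items, arranged in 12 blocks of six trials in which every emotion occurs once and every intensity twice. The character labels and the function name are hypothetical.

```python
import random
from collections import Counter

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
INTENSITIES = ["weak", "medium", "strong"]
# Hypothetical labels; the text only states two natural and two virtual characters.
CHARACTERS = ["natural_1", "natural_2", "virtual_female", "virtual_male"]


def build_dect_sequence(seed=None):
    """Return 12 blocks of 6 trials as (character, emotion, intensity) tuples.

    Assumed construction: block (c, s) pairs emotion e with intensity
    (e + s) mod 3 and character (c + e) mod 4. Across all 12 blocks this
    uses each of the 72 combinations exactly once; trial order within a
    block and the order of the blocks are then shuffled.
    """
    rng = random.Random(seed)
    blocks = []
    for c in range(len(CHARACTERS)):
        for s in range(len(INTENSITIES)):
            block = [
                (CHARACTERS[(c + e) % len(CHARACTERS)],
                 EMOTIONS[e],
                 INTENSITIES[(e + s) % len(INTENSITIES)])
                for e in range(len(EMOTIONS))
            ]
            rng.shuffle(block)   # randomize trial order inside the block
            blocks.append(block)
    rng.shuffle(blocks)          # randomize the order of the blocks
    return blocks


blocks = build_dect_sequence(seed=2009)
for character, emotion, intensity in blocks[0]:
    print(character, emotion, intensity)
print(Counter(intensity for _, _, intensity in blocks[0]))
```

The shuffles make the sequence look irregular while the block constraint remains guaranteed by construction. A purely random draw would need rejection sampling to achieve the same guarantee.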

Faces-Test (Baron-Cohen et al. 1997). In the Faces-Test b/w photographs of an actress expressing seven “basic” emotions (afraid, angry, disgusted, distressed, happy, sad, and surprised), and nine “complex” mental states (admiring, arrogant, bored, flirting, guilty, interested, quizzical, scheming, and thoughtful) are used. Each photo is accompanied by a target adjective and an incorrect adjective (e.g., “admiring” versus “arrogant”), describing the depicted mental state of the actress. The original paper-pencil version was implemented in Presentation® (Neurobehavioral Systems Inc., Berkeley, USA) and modified with regard to time limitation. The pictures are presented for four duration conditions (150, 300, 450 ms, self-paced) in pseudo-randomized order, and the participant has to make his decision in a forced choice procedure between the two alternatives (target state and foil alternative state) by pressing one of two keys on the numeric keypad. Main outcome is the accuracy rate for correctly categorized emotions.

RMET (Baron-Cohen et al. 2001). Similar to the Faces-Test, the computerized child version of the Reading-the-Mind-in-the-Eyes Test (RMET) uses 36 b/w photographs. Instead of whole faces, the RMET only shows the eye-region of a face expressing an emotion depicting a distinguishable mental state. Every picture is accompanied by four words, describing mental states (e.g., “bored”, “interested”, “relaxed”, “thoughtful”). The original paper-pencil version was also implemented in Presentation® (Neurobehavioral Systems Inc., Berkeley, USA). The main outcome is the proportion of target selections (versus non-target selections). Responses were recorded by pressing one of four keys on the numeric keypad.

FEEL (Facially Expressed Emotion Labeling; Kessler et al. 2002). The FEEL test (Version 3.2) is a computer-based test measuring the perception of facially expressed emotions. Stimuli showing the six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) from the JACFEE set (Matsumoto and Ekman 1988) are presented on a computer screen: for the first 1500 ms the stimulus character shows a neutral face, which then switches into one of the six basic emotions for 300 ms. After image presentation, the participants choose the corresponding emotion via mouse click within a time interval of 10 s. The main outcome is the accuracy rate for correctly categorized emotions.

FEFA (Frankfurt Test and Training of Facial Affect Recognition; Bölte et al. 2006b). A computer-based emotion recognition test consisting of 50 b/w photographs of faces depicting all six basic emotions and a neutral expression. Each picture is shown on the screen and the participant is asked to select one out of six emotions or the description “neutral” by mouse click. Main outcome is the accuracy rate for correctly categorized emotions.

(3) Explicit Tasks

Classical FBT (sensu Wimmer and Perner 1983). In order to test the comprehension of others’ FBs, three different stories of first-order (n = 1) and second-order (n = 2) FB were presented. The first-order FB story is a slightly adapted version of the classical “Sally & Anne-Task” by Baron-Cohen et al. (1985). The two second-order FB stories are slightly modified adaptations of the “ice cream van-story” by Perner and Wimmer (1985) and the so-called “Waldgeschichte” (“wood-story-task”) by Wichmann (1996). In all three stories the experimenter asks a false belief question (“Where will person A look for object X/person B?”), an explanation question (“Why?”), a control question (“Where is object X/person B really?”), and a knowledge question (“Does person A know that person B has hidden object X?”). In the second-order FB stories there are additional questions: a question about the state of not-knowing (“Does person A know that person B knows?”) and a second-order belief question (“What does person A believe what person B believes?”). This procedure assesses whether subjects have an explicit representation of another person’s FBs (Wimmer and Perner 1983). The main outcome is the accuracy rate for correctly given answers.

Procedure

Individuals who fulfilled inclusion criteria were invited to participate in the study. Subsequently, they had to complete the above mentioned compilation of surveys (all: SRS, EQ, SQ, TAS-26; NTD only: CBCL/4-18, YSR/11-18). Adolescents were assessed within two sessions, separated by a maximum interval of 14 days. All participants completed the test battery individually in a quiet room. In the first session with a total duration of approximately 2 h, the paper and pencil short form of CFT 20-R was conducted within a time frame of about 30 min, followed by the computerized DECT, RMET, Faces-Test, FEEL, and FEFA. In the second session, the participants had to complete the second part of the SC test battery including MASC, AToM and classical FBT within approximately one and a half hours.

Measures

Computerized stimuli were presented on a 17 inch monitor connected to a PC running Presentation® software (Version 15.1, Neurobehavioral Systems, Inc., Berkeley, CA, www.neurobs.com) on a Microsoft Windows XP operating system. Group differences were analyzed by means of ANOVA. In case of violations of the distributional assumptions of parametric tests, non-parametric alternatives were applied. Assumptions of normality were checked with Kolmogorov–Smirnov tests. In the case of accuracy rates, the well-established arcsine transformation (also known as “arcsine square root transformation” or “angular transformation”) of proportional values was applied in order to see whether it remedied the normality violations (e.g., Hair et al. 2014). If so, ANOVAs were applied to the arcsine-transformed data; if not, non-parametric Wilcoxon–Mann–Whitney tests were computed. To examine differences between ASD and NTD, we analyzed all implicit, explicit, and complex measures of the SC test battery. In order to investigate convergent and concurrent relations of implicit, explicit, and complex tasks, we computed Pearson correlations of all accuracy outcomes. Furthermore, we examined whether there were any developmental effects of age on performance in SC for the above-mentioned measures (implicit, explicit, complex tasks). For this purpose, we ran Pearson correlations for the total sample and for each group separately. Effect sizes for group differences are reported in terms of standardized mean differences (SMD): unbiased Hedges’s g, rather than Cohen’s d, is used as the point estimator of effect sizes (Borenstein et al. 2009), because it allows the computation of the 95% CIs that also enter the forest plot in the systematic review of results. All statistical analyses were performed with SAS software, Version 9.4 (SAS Institute Inc., Cary, NC, USA). For all statistical analyses, a significance level of α = .05 was adopted.
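The transformation and the decision rule described above can be sketched as follows. The analyses themselves were run in SAS; the Python functions below are only an illustration, and their names are hypothetical. The sketch assumes the common form asin(sqrt(p)) in radians (some texts multiply the result by 2) and takes the outcome of the normality checks as given rather than computing Kolmogorov–Smirnov tests.

```python
import math


def arcsine_transform(p: float) -> float:
    """Angular transformation of a proportion p in [0, 1]: asin(sqrt(p)), in radians."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(p))


def choose_test(raw_is_normal: bool, transformed_is_normal: bool) -> str:
    """Decision rule as described in the text, given the normality check results."""
    if raw_is_normal:
        return "ANOVA on raw accuracy rates"
    if transformed_is_normal:
        return "ANOVA on arcsine-transformed accuracy rates"
    return "Wilcoxon-Mann-Whitney test"


# Illustrative accuracy rates (not data from the study).
for p in (0.50, 0.75, 0.90, 1.00):
    print(f"p = {p:.2f} -> {arcsine_transform(p):.4f}")

print(choose_test(raw_is_normal=False, transformed_is_normal=True))
```

The transformation stretches values near 0 and 1, which is why it is often tried for accuracy rates that pile up near the ceiling. When most participants score at or close to 100%, however, it cannot restore normality, and the rule falls through to the non-parametric test.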

Results

Sample’s Characteristics

In Table 1, the sample’s characteristics are presented. There were no significant group differences for age and IQ [both F(1, 43) < 1]. Concerning autistic symptomatology, the ASD group showed significantly higher T-scores for the SRS and its subscales—with large effect sizes (all gs ≥ 2.33). In addition, alexithymia scores as assessed by the TAS-26 were also significantly higher for the ASD group, except for the TAS-26-EOT (= Externally-Oriented Thinking) subscale. Concerning empathizing and systemizing, results were mixed: ASD adolescents reported significantly lower empathy (lower EQ scores), whereas no significant group difference was found concerning degree of systemizing as assessed by SQ.

Group Differences in Measures of SC

In the following, the results of the behavioral tests concerning social cognition (complex tests, implicit tests, and explicit tests) will be reported. In Table 2, descriptive (means and standard deviations) and the inferential statistics are summarized.

Table 2 Mean accuracy rates for implicit, explicit, and compound measures of social cognition

Complex Test—AToM

Considering the compound AToM-Total Score (assessing complex social cognition), the ASD group (M = 0.852, SD = 0.117) scored significantly lower than the NTD group (M = 0.936, SD = 0.057) [F(1,43) = 9.27, p = .004, g = −0.89]. Since the assumption of normal distribution was violated before and after arcsine transformation of the AToM-FEC subscale, non-parametric tests were applied. The Wilcoxon–Mann–Whitney test yielded a significant difference between both groups (NTD: M = 0.912, SD = 0.085; ASD: M = 0.826, SD = 0.150; W_s = 593.50, z = 2.024, p = .043). For the AToM-FB subscale the assumption of normal distribution was also violated before and after arcsine transformation. The Wilcoxon–Mann–Whitney test revealed a significant difference between both groups (NTD: M = 0.981, SD = 0.030; ASD: M = 0.934, SD = 0.068; W_s = 608.00, z = 2.449, p = .014).
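As a worked illustration of the effect size measure, the following sketch recomputes Hedges’s g for the AToM-Total score from the reported means, standard deviations, and group sizes (ASD: n = 23; NTD: n = 22). It uses the standard formulas given in Borenstein et al. (2009). The normal-approximation confidence interval (±1.96 SE) and the function name are assumptions; the interval computed in SAS for the forest plot may differ slightly.

```python
import math


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Return (g, ci_low, ci_high) for the difference group 1 minus group 2."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / pooled_sd                 # Cohen's d
    j = 1 - 3 / (4 * df - 1)                  # small-sample correction factor
    g = j * d                                 # unbiased Hedges's g
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    se_g = math.sqrt(j ** 2 * var_d)
    # Assumption: normal approximation for the 95% CI.
    return g, g - 1.96 * se_g, g + 1.96 * se_g


# AToM-Total: ASD (group 1) versus NTD (group 2), values as reported in the text.
g, ci_low, ci_high = hedges_g(0.852, 0.117, 23, 0.936, 0.057, 22)
print(f"g = {g:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")  # g is about -0.89
```

The negative sign reflects the lower accuracy of the ASD group. An interval that excludes zero corresponds to the significant group difference reported above.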

In addition to the items assessing SC, we examined potential group differences on the PO items of the AToM and found no significant effect [NTD: M = 0.512, SD = 0.123; ASD: M = 0.487, SD = 0.121; F(1,43) < 1].

Complex Test—MASC

For the MASC (assessing complex social cognition), the total score in the ASD group deviated significantly from the normal distribution. Therefore, an arcsine transformation was applied before parametric testing. ASD adolescents were more inaccurate when responding to mental state questions [F(1,43) = 6.36, p = .015, g = −0.78].

Implicit Test—DECT

One-way ANOVA with the main outcome (overall accuracy) revealed significant deficits in (dynamic) facial emotion categorization for the ASD group [F(1,43) = 5.66, p = .022, g = −0.70].

Implicit Test—Faces-Test

For the Faces-Test (assessing FEC), the total accuracy score in each group deviated significantly from the normal distribution. Therefore, an arcsine transformation was applied before parametric testing. A one-way ANOVA revealed no significant group differences, neither for the arcsine-transformed overall accuracy rate [F(1,43) = 2.44, p = .126] nor for the four duration conditions (150 ms, 300 ms, 450 ms, self-paced) [all Fs < 1, except for the 450 ms condition: F(1,43) = 3.18, p = .082].

Implicit Test—RMET

Considerable differences in facial emotion categorization were found between ASD and NTD in the RMET and yielded significant results in the one-way ANOVA [F(1,43) = 15.84, p = .0003, g = −1.17]. Performance of the ASD group was much lower than in the NTD group (see also Table 2).

Implicit Test—FEEL

No significant differences were found for the FEEL test (that also assesses FEC) comparing both groups for accuracy in a one-way ANOVA [F(1,43) = 1.28, p = .264]. Separate analyses for accuracy rates of the six basic emotions also yielded no significant group differences.

Implicit Test—FEFA

For the FEFA, the one-way ANOVA revealed distinct deficits in facial emotion categorization for the ASD group [F(1,43) = 8.27, p = .006, g = −0.84]. Moreover, regarding the accuracy rates for the six basic emotions, differences were significant for anger [F(1,43) = 6.48, p = .015] and for fear [F(1,43) = 6.11, p = .018], but not for disgust [F(1,43) = 1.22, p = .276], happiness [F(1,43) < 1], sadness [F(1,43) = 2.09, p = .156], or surprise [F(1,43) < 1].

Explicit Test—Classical FBT

Since the assumption of normal distribution was violated before and after arcsine transformation, non-parametric tests were applied. Wilcoxon–Mann–Whitney tests yielded no significant differences between both groups [neither in FB1O (W_s = 517.00, z = 0.551, p = .581) nor FB2O (W_s = 575.50, z = 1.748, p = .081)]. The FB-Total score (NTD: M = 0.944, SD = 0.089; ASD: M = 0.894, SD = 0.144) was also not significant (W_s = 569.50, z = 1.579, p = .114), indicating that both groups at least do not differ widely in their ability to detect false beliefs in stories presented in verbal format.
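As an illustration of how a rank sum relates to a z value, the sketch below applies the standard normal approximation of the Wilcoxon–Mann–Whitney test, with an optional tie correction. It assumes that W_s is the rank sum of a group of size n1; the group sizes (22 and 23) and the tie pattern are hypothetical, and the function name is introduced here only for illustration.

```python
import math


def rank_sum_z(w_s, n1, n2, tie_sizes=()):
    """Normal approximation for a Wilcoxon rank sum, with tie correction.

    w_s       -- rank sum of the group with n1 observations
    tie_sizes -- size of each group of tied values in the pooled sample
    """
    n = n1 + n2
    expected = n1 * (n + 1) / 2.0
    # Ties shrink the variance of the rank sum.
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term)
    return (w_s - expected) / math.sqrt(variance)


# ASSUMPTION: hypothetical group sizes; no ties in the first call.
z_untied = rank_sum_z(593.5, 22, 23)
# Hypothetical tie pattern: 20 participants sharing one score.
z_tied = rank_sum_z(593.5, 22, 23, tie_sizes=(20,))
print(round(z_untied, 3), round(z_tied, 3))
```

Because accuracy scores near ceiling produce many ties, a tie-corrected z is larger in absolute value than the uncorrected one, so reported z values may reflect such a correction.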

Systematic Review of Results

In Fig. 2, group differences for all measures are displayed in a forest plot to give a concise systematic review of the obtained results. All measures were scaled so that results falling to the right of the line-of-no-effect (zero) indicate worse performance, more symptomatology, more systemizing or less empathizing in the ASD group than in the NTD group. As can be seen, nearly all measures, except for SQ and TAS-26-EOT, point descriptively in the hypothesized direction. Significant differences were obtained for (a) the implicit tests DECT, RMET, and FEFA, (b) the complex tests MASC and AToM, and (c) the AToM subscales AToM-FEC, AToM-FB, and AToM-Comp-FEC. Within the group of questionnaires, the SRS together with its five subscales yielded the largest effects (all gs > 2.33). In addition, the EQ as well as the TAS-26 together with the two subscales TAS-26-DIF and TAS-26-DDF showed significant differences. In contrast, the 95% CIs of TAS-26-EOT and SQ cover the line-of-no-effect.
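The logic of reading a forest plot (an effect is significant when its 95% CI excludes zero) can be sketched with a large-sample approximation of the standard error of a standardized mean difference. The group sizes (23 and 22) are assumptions, the second effect size is hypothetical, and the interval formula is a common approximation rather than necessarily the one used for Fig. 2.

```python
import math
from statistics import NormalDist


def hedges_g_ci(g, n1, n2, level=0.95):
    """Approximate confidence interval for a standardized mean difference."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Large-sample standard error of the standardized mean difference.
    se = math.sqrt((n1 + n2) / (n1 * n2) + g ** 2 / (2.0 * (n1 + n2)))
    return g - z_crit * se, g + z_crit * se


def covers_zero(interval):
    """True if the interval includes the line-of-no-effect."""
    low, high = interval
    return low <= 0.0 <= high


# ASSUMPTION: hypothetical group sizes of 23 and 22.
large_effect = hedges_g_ci(-0.89, 23, 22)   # size of the reported AToM-Total g
small_effect = hedges_g_ci(-0.33, 23, 22)   # hypothetical small effect
print(covers_zero(large_effect), covers_zero(small_effect))
```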

Fig. 2
Forest plot of standardized mean differences for all instruments (total scores and subscales) applied in the study

Correlations Between Measures of SC

In Table 3, the correlations of the implicit facial emotion categorization tests together with the explicit test and the tests of complex social cognition (MASC and AToM scales) are displayed (omitting the AToM subscales unrelated to social cognition, AToM-Comp-RQ and AToM-PO). Concerning FEC, there are roughly three groupings of noteworthy intercorrelations: (1) all pairwise correlations of DECT, Faces-Test and RMET are significant; (2) FEEL and FEFA also correlate significantly; (3) in contrast, the AToM-FEC subscale does not correlate with any of the other implicit tests. The only exception to this grouping is the significant correlation of RMET with FEFA (r = .380, p = .010).

Table 3 Summary of intercorrelations, means, and standard deviations of accuracy rates for implicit, explicit, and compound measures of social cognition

Concerning the explicit test, the classical FBT correlates significantly with DECT (r = .372, p = .012) and RMET (r = .365, p = .014), but also with the complex measures MASC, AToM-Total and AToM-FB (r = .443, p = .002; r = .341, p = .022; r = .325, p = .029, respectively). For the total scores of the two complex measures of SC there is a significant correlation of MASC with AToM-Total (r = .509, p < .001), which is considered a large effect according to common statistical conventions (Cohen 1988). In addition, all scales of the compound tests correlate significantly with each other, as can be seen in Table 3. However, the intercorrelations should be interpreted with caution, because they are based on a rather small sample. For the same reason, we refrained from running principal component analyses (PCA): their results hinge on reliable correlations, and our sample size is below recommended rules of thumb. Hair et al. (2014), for example, state: “The minimum absolute sample size should be 50 observations” (p. 100).
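To make the caution about sample size concrete, the following sketch computes an approximate 95% confidence interval for a correlation coefficient via the Fisher z-transformation. The total sample size of 45 is an assumption inferred from the 43 error degrees of freedom, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r_ci(r, n, level=0.95):
    """Approximate confidence interval for Pearson's r (Fisher z method)."""
    z = math.atanh(r)                  # Fisher z-transformation
    se = 1.0 / math.sqrt(n - 3)        # standard error on the z scale
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Back-transform the interval limits to the r scale.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# MASC with AToM-Total: r = .509; ASSUMPTION: n = 45.
low, high = pearson_r_ci(0.509, 45)
print(round(low, 2), round(high, 2))
```

Under these assumptions the interval spans roughly .25 to .70, so even the largest correlation in the table is estimated with considerable uncertainty.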

Effects of Age on SC

In order to consider tentative influences of age on performance in SC, we computed Pearson correlations for the total sample and each group separately, for all measures of SC with age.

With regard to the five context-free FEC tests, there were two tests (RMET, Faces-Test) for which the total score correlated significantly with age [RMET: r = .304, p = .043; Faces-Test (arcsine-transformed): r = .326, p = .029]. In both cases, the correlation within the NTD group was not significant [RMET: r = .074, p = .744; Faces-Test (arcsine-transformed): r = .231, p = .301], whereas within the ASD group the correlation was significant for the RMET and showed a trend toward significance for the Faces-Test [RMET: r = .450, p = .031; Faces-Test (arcsine-transformed): r = .375, p = .078]. In addition, all correlation coefficients in the other context-free FEC tests were descriptively greater in the ASD group than in the NTD group.

For the explicit task [Classical FB-Total Score (arcsine-transformed)], no significant correlation with age was found (r = −.175, p = .250). Within groups, there was an unexpected significant negative correlation in the NTD group (r = −.444, p = .038), mostly due to the oldest NTD participant, who performed rather poorly in the classical FBT. After removing this most influential observation, the correlation is no longer significant (r = −.300, p = .186). For the ASD group, no significant correlation was found (r = −.044, p = .842).
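The idea of identifying a single influential observation can be sketched with a leave-one-out check of the correlation coefficient. The ages and scores below are invented for illustration and are not data from the study; the function names are hypothetical, and the text does not state which influence diagnostic was actually used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def most_influential(xs, ys):
    """Return (index, r_full, r_without) for the observation whose
    removal changes the correlation the most."""
    r_full = pearson_r(xs, ys)
    best_index, best_r, largest_shift = None, r_full, -1.0
    for i in range(len(xs)):
        r_loo = pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shift = abs(r_loo - r_full)
        if shift > largest_shift:
            best_index, best_r, largest_shift = i, r_loo, shift
    return best_index, r_full, best_r


# Hypothetical data: the oldest participant scores unusually low.
ages = [14, 15, 16, 17, 18]
scores = [0.90, 0.92, 0.91, 0.93, 0.50]
index, r_full, r_without = most_influential(ages, scores)
print(index, round(r_full, 2), round(r_without, 2))
```

In this toy example a negative overall correlation turns positive once the last observation is left out, which mirrors how one poorly performing older participant can drive a group-level correlation.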

For the complex scores of SC (MASC, AToM), the pattern of results was mixed: For the MASC (arcsine-transformed Total score) there was a significant correlation with age for the whole sample (r = .362, p = .015). Within groups, no significant correlation was obtained for the NTD group (r = .193, p = .391); for the ASD group, however, there was a significant correlation with age (r = .438, p = .037). In Fig. 3, the differential developmental trajectories are depicted in a scatterplot.

Fig. 3
Scatterplot of MASC-Total scores (arcsine-transformed) and age of participants, together with LOESS curve and 95% CI. For the ASD group, a linear increase of MASC performance with age can be noted, whereas for the NTD group a plateau seemed to be reached already at the age of 14

For the AToM (Total score), no significant correlation with age was obtained for the total sample (r = −.006, p = .967) or for the two subsamples of ASD (r = −.072, p = .745) and NTD (r = .059, p = .795).

Discussion

The aim of the present study was to investigate SC in order to assess performance differences between adolescents with ASD and neurotypical controls on the basis of a multifaceted test battery. Comprised of eight behavioral tasks, the ToM test battery addresses implicit (FEC), explicit (false belief), and complex aspects of SC.

We found significant differences between ASD and NTD in three of five tasks concerning iToM, that is, the perception and categorization of facial emotion without any contextual factors. In contrast, there were no significant eToM differences in the performance measures of the classical FBT (first and second order). In the case of complex social cognition tasks, we found significant differences in the newly developed AToM test and in the well-established MASC.

Additionally, in the area of FEC we found significant correlations between DECT, Faces-Test and RMET, between RMET and FEFA, and between FEFA and FEEL (see Table 3). The influence of age on SC, especially on implicit ToM and complex SC abilities, was descriptively higher for ASD in all tasks and reached significance in the correlations with RMET and Faces-Test (FEC) and MASC (complex SC).

SC as the dynamic constructive and conceptual processing of the social environment arises as a result of multimodal perception. Obviously, alert human individuals are predominantly exposed to a more or less permanent flow of social information in everyday life. The incoming perceptions must be handled efficiently within a limited timeframe and without having full access to all necessary information (Hertwig and Hoffrage 2013). In order to efficiently and successfully develop personal expectations that correspond to the expectations of the social environment, the SC system depends on the implicit processing of the inrushing stimuli. Combining paradigms measuring implicit as well as explicit and complex responses in the test battery, our results suggest that—as adolescents with ASD transition into adulthood—they still display an overall deficit in social cognition.

Concerning general task performance, the comparison of accuracy between ToM items and distractor PO items in the AToM suggests that participants with ASD do have deficits in SC regardless of their general effort in the task, because their performance on the PO items is as good as that of the NTD group.

These findings are exemplified by the newly developed AToM test, which shows significant performance differences between ASD and NTD in FEC items as well as in FB items under complex social conditions. The compound structure of the test consequently provides more information to process than an emotional facial expression in a noncontextual task can yield. Thus, we suggest that embedding implicit and explicit aspects of social cognition in a continuous/sequential story poses a higher demand for ASD than explicit false belief stories or static and exaggerated displays of facial emotion.

Implicit

In accordance with previous research, we mainly find significant performance differences between autistic and neurotypical individuals concerning facial emotion categorization (Harms et al. 2010; Lozier et al. 2014; Uljarevic and Hamilton 2012). On closer examination of the implicit tasks, it is remarkable that there is a wide range of results, from small differences on the descriptive level (FEEL, Faces-Test) to significant differences with large effect sizes (RMET, FEFA, DECT). The RMET in particular yielded considerable performance disparities between NTD and ASD. It seems plausible that these differences can be ascribed to the piecewise reduction of relevant facial expression information and to the lack of any contextual information. Since the ability to relate the fragments of facial expressions of mental states to mental concepts is described as unconscious, rapid, and automatic (Todorov 2012), the distinct deficits of the ASD group can be regarded as strong evidence for impairments concerning implicit facial emotion processing.

Looking at it the other way round, the addition of contextual information to a reduced social stimulus would then facilitate the categorization of the mental state within the lexicon provided for that purpose. This could explain the minor differences between NTD and ASD in the Faces-Test and in the FEEL, where facial stimuli are presented as whole faces in an exaggerated and static form. And as Baron-Cohen et al. (2001) stated, the results of ASD in basic emotion recognition tests may be camouflaged by learned compensatory strategies and lead to the fallacious assumption that individuals with high-functioning autism spectrum conditions show no deficits in facial emotion categorization.

Besides the presentation of fragmentary faces in the RMET, some of the implicit tests in the SC test battery are characterized by the use of dynamic stimuli and virtual characters (DECT) as well as time-limited presentation of stimuli (Faces-Test, FEEL). Although NTD performed significantly better than ASD, the accuracy rates in the DECT are distinctly lower in both samples. Dynamic virtual characters, especially in a context-free setting, seem to be more challenging to categorize than static photographs of exaggerated facial expressions. Looking at previous research, it can be assumed that dynamic stimuli reveal subtle emotion categorization deficits (Evers et al. 2015; Kessels et al. 2010; Sato et al. 2013; Uono et al. 2009). Moreover, imaging studies using dynamic stimuli have shown that individuals with ASD compared to NTD generate different activity patterns in the human face processing system including the amygdala, fusiform gyrus and posterior superior temporal sulcus region (Pelphrey et al. 2007). Nonetheless, it is notable that previous research comparing static versus dynamic stimuli has yielded mixed results in NTD: (i) they show performance loss (Brosnan et al. 2015) and (ii) there is no significant difference from ASD participants (Bekele et al. 2014). Furthermore, it is conceivable that especially dynamic basic emotions with weak or medium intensities are very difficult to categorize for both NTD individuals and participants with ASD without having contextual factors like voice, gesture, posture, and situation. There is no doubt that contextual information can be helpful in situations of ambiguity, but in case of absent contextual information about emotional states, one has to revert to solely facial information. That means the smaller the amount of decisive contextual information provided, the more important it is to be able to discriminate between minimal differences.

While the FEFA, using static, exaggerated stimuli presented without time limit, shows a significant difference between ASD and NTD, the tests using time restrictions (FEEL, Faces-Test) did not result in significant group differences. Our findings are similar to those of another study presenting facial emotion stimuli with limited exposure time that failed to elicit differences between ASD and NTD (Tracy et al. 2010). This may be, at least in part, attributable to the fact that the durations of the stimulus presentations were too long to establish differences between ASD and NTD, both in the study of Tracy et al. (2010) with stimulus exposure times of 1500 ms and in our study with varying exposure times of 150, 300, and 450 ms. In contrast, the use of markedly shorter stimulus exposure times (15 and 30 ms) has led to significant differences in emotion categorization between ASD and NTD (Clark et al. 2008).

A further explanation for the lack of significant differences in both implicit tests can be seen in the visual stimulus presentation, which uses static depictions and exaggerated facial expressions of emotion with insufficient ecological validity. Everyday social interaction involves a wide range of rapidly displayed, subtle and nuanced emotional expressions, while most studies investigating social cognition in ASD are restricted to prototypical stimulus material. This methodological point of criticism can be found in several publications with increasing frequency (Clark et al. 2008; Rump et al. 2009; Sasson et al. 2015; Tracy et al. 2010; Uljarevic and Hamilton 2012). In contrast to the results of FEEL and Faces-Test, the FEFA, a test procedure using static, exaggerated stimuli presented without time limit, shows a significant difference between ASD and NTD, indicating that even tests with lower ecological validity can elicit deficits in emotion categorization in ASD. Furthermore, the tasks concerning FEC show only modest intercorrelations (see Table 3); thus, the consistency of the construct of facial emotion categorization strongly depends on the structure and features of the tasks. This applies especially to the case of complex tasks, where we can see that the FEC items of the AToM show no significant correlations with any of the implicit tasks, underlining that FEC embedded in a complex social event differs distinctly from FEC without any contextual information.

Explicit

Based on recent studies, adolescents and adults with high-functioning ASD apparently show typical performance in FBT as long as they are explicitly instructed (Baron-Cohen et al. 1999; Frith and Happé 1994; Senju 2013). These findings can be reconciled with the results of the classical false belief paradigm we used to test explicit ToM abilities, indicating no significant differences between ASD and NTD. Showing no impairments in the case of eToM, compensation seems to be possible somewhere in the age range from 14 to 18 years, while there are still substantial impairments in iToM at least until the age of 18.

If we take a closer look at the process of comparative perspective taking as is customary in FBT, knowledge about the wrong or missing information of the protagonist results primarily from the distinction between the information that the protagonist can have and the information the participant as observer of the whole scene has. In none of the items presented in the explicit FB stories do the participants have to process bodily expressions like facial emotion expression, gesture, posture, or other decisive contextual social information. The only relevant aspect for a correct answer is the correct conclusion that the presence or absence of a protagonist is decisive, and that the protagonist has or has not perceived the same information as an observer of the entire scene. Referring to causal learning models, we suggest that participants solving a FBT can attend to causal properties. Causal representations provide relations and processes with a mechanistic basis for the correlations between protagonists, time and space. Following Johnson-Laird (1981), a participant using causal properties given in a simple explicit FB story can create a model of correlations between subjects and their agency, objects, time and space that interprets the intentions and beliefs of the protagonist. Causal learning models are basic elements for the development of social schemata. While NTD are able to build up a rich variety of social schemata continuously growing into modules of folk psychology, our data suggest that adolescents with ASD show at least a different development of social schemata leading to different interpretations of complex social scenes. However, the FB stories used here have such a simple structure, without any description of emotional expressions or social complexity, that they are solvable without drawing on fully shaped social schemata, using simple inferences based on explicitly given premises instead.

The study of Jolliffe and Baron-Cohen (1999) using the “strange stories-task” (Happé 1994) shows that one of the differences between classical FB stories and the “strange stories” is the use of more naturalistic and complex stories (Happé 1994) containing social scripts, social schemata, metaphorical expressions, sarcasm, and jokes. In the study of Jolliffe and Baron-Cohen (1999), the comparison between a causal reasoning task and the “strange stories-test” suggests that even when individuals with ASD show normal performance in causal reasoning, they show deficits in the mentalizing task. Given the basic structure of the classical FBT, it can be assumed that the deficits arise primarily from the difficulty of understanding the social scripts and schemata. Thus, individuals with ASD fail in everyday social interactions, because they are exposed to a continual stream of different ToM challenges. When they are confronted with this vast quantity of nonverbal and contextual cues and are incapable of processing them automatically, the congestion leads to a bottleneck in the processing of social stimuli. In contrast, NTD individuals are able to process social information mostly implicitly and procedurally, having recourse to fully fledged social schemata.

Complex

With regard to the AToM test including both implicit and explicit stimuli in a contextual framework of sounds, music, vocalizations, course of action, spatial and environmental cues, it is remarkable that participants with ASD performed significantly worse in both ToM conditions than their controls. We attribute these differences in performance to the high demands arising from a complex task. It contains child-oriented, social-interactional but non-verbal stories with several interacting protagonists and their network of relationships, combined with a catalogue of questions putting high demands on executive functions, working memory processes, and attention in the areas of facial emotion categorization, understanding of false beliefs, and remembering objects and events outside the social gist. Neurotypically developed participants may have the ability to refer to facial emotions, gesture and posture in the context of a story without incurring high cognitive load. This cognitive effortlessness can be associated with the familiarity of a situation. A familiar course of an event is easier to capture and interpret than an entirely unknown situation. NTD are able to identify key properties and relations between objects and/or subjects. This ability results from the use of schemata and scripts (Schank and Abelson 1977). Besides the above-mentioned causal learning model, we especially use these schemata to organize commonly experienced social situations (Christensen and Michael 2016). The use of schemata and scripts is implicit and makes it possible to simplify a complex social situation by establishing a framework that integrates information and emphasizes key relationships (Hoernig et al. 1993; Kintsch and van Dijk 1978).

Social cues in particular can be used by NTD to retrieve associated stored scenarios, which come to mind in a fast and automatic way, whereas individuals with ASD have no or insufficient recourse to such prototypical mental schemata, especially if they are induced by emotional or social stimuli. These considerations are consistent with a previous study comparing scripts for social routines between high-functioning ASD and NTD (Trillingsgaard 1999).

Similar to the AToM, the setup and structure of the MASC shows a complex framework of social interactions, embedded in a self-contained story including a variety of interaction patterns that have to be categorized and answered in a multiple choice task. However, in the MASC another component is important for the correct categorization of social interactions between protagonists. The plot of this task highlights social categories of friendship, personal relationships and their initiation in peer groups. Thus, the comprehension of these social schemata and categories, that is, the implicit and explicit knowledge of the stereotypical roles of social groups (and their interactions), is decisive for constructing meaningful representations of others (Bodenhausen et al. 2012). From an evolutionary perspective, we have to regard dealing with other humans as a risky endeavor (Heatherton et al. 2004). To master this challenge successfully, we must be capable of categorizing the behavior of the counterpart correctly. The basis for this is that, long before verbal communication, we capture information from bodily expression in the form of gesture and facial expression. In interplay with the verbal information, a pattern of congruent and incongruent (e.g. irony) perceptions emerges (Murphy 2012).

Categorizing these social perceptions and then attributing them to larger clusters and social schemata is not thought to be innate but a process of learning (Heatherton et al. 2004) starting from the very first day and being encouraged by parents, family, peers, milieu, and media. It is conceivable that there is a direct relation between iToM abilities and the understanding of such social schemata as friendship and personal relationship. Following the model of social development suggested by Soto-Icaza et al. (2015), the biological and behavioral specialization for social cognition over time can be traced to a progressive development of social abilities that depends on the interaction between the social brain and the surrounding environment. While NTD have direct access to implicit, automatic knowledge and scripts, it can be assumed that individuals with ASD are lacking social concepts and schemata. The simultaneity of relevant and irrelevant events in the stories of AToM and MASC forces individuals with ASD to process the incoming information in an explicit and conscious manner under high cognitive load. This may lead to excessive demands in real social situations and to deficits in accuracy in complex tasks.

Age Correlations

As indicated, age can be an important factor for accuracy in SC tasks. A closer look at correlations reveals significance for MASC, Faces-Test and RMET. Specifically the correlation between MASC as a complex task (see Fig. 3) and age as well as facial emotion categorization (RMET) and age can be found in the ASD group. This finding may imply that crucial and decisive steps in the development of facial emotion categorization and the development of social schemata are still incomplete in the ASD group, while neurotypically developed adolescents show a flatter learning curve.

Referring to the meta-analysis of Uljarevic and Hamilton (2012), impairments of FEC in ASD across age can be assumed. With respect to the age range between childhood and adolescence, many studies reveal distinct FEC deficits, while others cannot find significant differences in that age group (for a review, Harms et al. 2010). Neuroimaging and electrophysiological data suggest that face processing in adults can be traced back to earliest activation of brain areas specific for face detection mechanisms within the first months after birth (Grossmann et al. 2007, 2008).

The significant correlation between age and accuracy in the MASC is also solely attributable to the ASD group. Thus the developmental deficit can be associated with impairments in facial emotion categorization, leading to deficient answers in those items concerning facial emotion expression. On the other hand it might be possible that the developmental deficits of ASD show their influence when concurrent processing of various different stimuli of social and non-social kind needs recourse to elaborated social schemata.

Conclusions

In this study we applied several tasks of SC for assessing FEC, FB and complex SC in a sample of adolescents with ASD and their NTD peers. We found that the use of tasks with fragmented emotional faces and the use of dynamic facial emotion stimuli lead to distinct differences in emotion categorization abilities between NTD and ASD.

In contrast, adolescents with ASD can solve classical FBT nearly as accurately as their NTD peers. We suggest that this is largely because of the simple structure of classical FBT, which enables adolescent individuals with ASD to find the correct answer by drawing inferences from explicitly given premises conveyed in verbal format instead of relying heavily on implicit ToM abilities that would be needed, for example, in non-verbal versions of FBTs.

In our study, especially SC tasks with a complex structure (like MASC and AToM) revealed impairments of SC in the ASD group. Such tasks are characterized by a plethora of relevant stimuli, all of which have to be considered and handled to arrive at a correct evaluation of what has been perceived. In order to process this wealth of information quickly and implicitly, NTD draw on social schemata and scripts. We assume that impaired social cognition in ASD is attributable to deficiently developed social schemata. Thus, we proceed on the assumption that the deficient structure of social schemata is in part a result of the early emerging impairments in facial emotion categorization.

Limitations and Implications for Future Research

In this study we operationalized a theoretical approach via a wide range of various behavioral tests to investigate SC and the impact of iToM, eToM and socio-contextual abilities. As only large studies possess sufficient power to yield reliable results (e.g., Charman et al. 2011), our sample size is not big enough (i) to detect small and medium effects, (ii) to apply principal components analyses, or (iii) to perform more advanced statistical modelling approaches like SEM (Hair et al. 2014) that would be better suited for delineating between implicit, explicit, and complex social cognition. Thus, we suggest that future research should build upon larger sample sizes.
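The point about statistical power can be illustrated with a rough normal-approximation sketch for a two-sided comparison of two independent groups. The group sizes (23 and 22) are assumptions, the effect sizes follow Cohen's conventional benchmarks, and the approximation slightly overstates the power of an exact t-test; it is not a calculation reported by the authors.

```python
import math
from statistics import NormalDist


def approx_power(d, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic under the alternative.
    ncp = abs(d) * math.sqrt(n1 * n2 / (n1 + n2))
    # Probability of exceeding the critical value in either tail.
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)


# ASSUMPTION: hypothetical group sizes of 23 and 22.
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, round(approx_power(d, 23, 22), 2))
```

With groups of this size, the approximate power stays well below the conventional 80% for small and medium effects, which is consistent with the limitation stated above.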

The age range (14–18 years) of our sample yielded significant group differences in implicit as well as contextual SC tasks. However, to make a clear statement about the trajectory of SC in terms of implicit, explicit and schema-based developments, it would be desirable to include younger cohorts as well as adults of all age groups.

Moreover, regarding the male-only sample, it is impossible to make any statement concerning gender differences in the development and abilities of social cognition.

Age- and gender-related aspects should also be reflected in the presentation of social stimuli. Stimuli of the DECT are limited to only one old male and one young female virtual character. A wider range of age cohorts and genders could help to diversify the stimulus material and allow for valid conclusions about age and gender aspects of facial emotion categorization.

Our approach to investigating SC, distinguishing between iToM and eToM with and without contextual factors, is operationalized by a behavioral test battery. To investigate underlying mechanisms and to understand the relevant processes applied in SC, further empirical and theoretical conceptualizations are needed.

In order to isolate the effect of contextual factors, future research should rely on consistent facial stimuli with and without contextual elements, to obtain distinct data that allow non-contextual factors to be separated from contextual ones.

The broad selection of our study tasks concerning different aspects of SC has revealed distinct impairments in ASD especially within the scope of FEC and complex social context. The findings point to a need to intensify some investigations that have not been elaborated sufficiently within this study.

According to the recommendations of Happé and Frith (2014) and the results of our study, we assume that implicit and explicit aspects of SC and the concept of social schemata should be investigated more thoroughly in terms of development. Opening the age range in both directions could be a remarkable chance to track whether and how attention to faces, emotion processing, facial emotion categorization, the ability to draw on contextual cues and the use of compensational heuristics, social schemata and scripts are related and in which chronological order they develop and appear.

As perceptual categorization of emotional stimuli seems to be considerably influenced by contextual factors and by contextual knowledge (Brosch et al. 2010), a profound analysis of the relationship between facial emotion categorization and additional sources of information could help to understand how the social brain network organizes the completion of insufficient information, especially in comparisons between ASD and NTD. In our study, we used diverse tests with different stimuli to investigate this phenomenon. Hence, the results and their implications should be replicated and tested by a task design dealing with consistent stimuli in a contextual and non-contextual condition.

The use of purely behavioral tasks for testing iToM-abilities in individuals with ASD compared to NTD participants often provides uncertain or ambiguous results in accuracy. Assisted by additional physiological measurements (eye-tracking, EEG, fMRI), most studies reveal deviant processing of the applied stimuli. For a better understanding of the perception, categorization and integration processes of SC especially in terms of contextual and non-contextual iToM and eToM tasks, it could be insightful to combine behavioral tests with eye-tracking, EEG or fMRI.

Further relevant methodological aspects to focus on are time constraints for the ecologically valid presentation of social stimuli. In our study only a few tasks imposed time limits, and where they did, the time constraints were possibly not short enough to elicit significant differences between ASD and NTD. As we suggest that especially iToM is processed mainly spontaneously, future studies should consider pertinent time-limited stimuli and reaction-time measurement in their design. Besides time limitations, the use of dynamic stimuli provides higher ecological validity (Uljarevic and Hamilton 2012). Thus, future research should also focus on dynamic stimuli.

The data of the study presented here suggest that social cognition is a heterogeneous field comprising several different aspects and domains. Besides the obligatory facial emotion categorization and false belief tasks, future research should work with ecologically valid material that takes into account the complexity of the environment in which neurotypically developed individuals and individuals with ASD live. Hence, a core theme of future research should be the investigation of differences in the acquisition and revision of social schemata between ASD and NTD.