Introduction

Autism spectrum disorder (ASD) comprises a phenotypically and genetically heterogeneous group of disorders. Phenotypic aspects indicating heterogeneity are, among others, age of onset, severity and combination of symptoms, as well as language and cognitive development. This heterogeneity has often been offered as an explanation for the difficulties in replicating specific genetic risk factors involved in autism (Hus et al. 2007; Hu and Steinberg 2009; Hu and Lai 2013). Consequently, numerous attempts have been made to stratify children into distinct subgroups. Several researchers have aimed at eliciting distinct underlying genetic etiologies by subgrouping the ASD phenotype (Veatch et al. 2014; Hu and Steinberg 2009; Hu and Lai 2013). Yet it has not become clear whether a categorical or a dimensional view of ASD is more appropriate to delineate etiology, diagnosis, and treatment options for children and adolescents with ASD (Wiggins et al. 2011). It is commonly expected that the delineation of phenotypically distinct ASD subgroups will allow a more comprehensive understanding of the disorder and the individual’s needs with regard to diagnostic and treatment options (Brennan et al. 2014).

Empirical, unbiased approaches for exploring underlying substructures of disorders include factor analyses and cluster analyses. Results derived from factor analyses can be interpreted as an association between the observed variables and underlying latent variables (Snow et al. 2008). Recent factor analytic results supported the novel DSM-5 two-domain ASD model, with the social communication domain on the one hand and stereotyped repetitive behavior on the other, in a sample of 1861 individuals aged 4–18 years with a mean estimated IQ of 99.4 (SD 19.3) provided by the Autism Genetic Research Exchange program (Snow et al. 2008). In contrast, cluster analytic approaches aim at grouping individuals according to their observed phenotypic similarities and differences. One advantage of both methods is the unbiased analysis, which allows researchers to reduce data in an objective manner based on descriptive, atheoretical, multivariate techniques (Hair and Black 2000). Past research using the cluster analytic approach on different ASD symptom measures identified two to four ASD subgroups (Eaves et al. 1994; Sevin et al. 1995; Fein et al. 1999). Some studies described distinct diagnostic categories (Siegel et al. 1986), others a low- and a high-functioning group (Fein et al. 1999; Stevens et al. 2000). Some studies were limited by relatively small sample sizes. Also, differing ASD populations were studied. Some studies concentrated on high-functioning children (Verté et al. 2006; Ring et al. 2008), others included toddlers (Wiggins et al. 2011; Georgiades et al. 2013; Jansen et al. 2013) or pre-pubertal boys only (Obafemi-Ajayi et al. 2014). A broad range of measurement variables was also used to identify distinct cluster groups. While most of the studies included data from one primary instrument, e.g. the Autism Diagnostic Interview-Revised (ADI-R, Rutter et al. 2003), the Childhood Autism Rating Scale (CARS, Schopler et al. 1988), or the Social Communication Questionnaire (SCQ, Rutter et al. 2001), some analyzed data from a combination of phenotypic measures (Brennan et al. 2014; Veatch et al. 2014). These methodological differences may have resulted in the heterogeneous findings.

The ADI-R is a comprehensive assessment instrument, which contains items on age of onset, language development, social interaction and communication abilities as well as stereotyped and repetitive behaviors. Thus, it is an ideal measure for studies on the autism phenotype (Snow et al. 2008), and cluster analyses on the ADI-R will likely result in clinically meaningful subgroups (Wiggins et al. 2012).

In a cluster analysis of ADI-R algorithm subscale scores, verbal-nonverbal status, and IQ scores, a three-cluster solution was observed in autism-affected sib-pairs from n = 171 multiplex families, excluding children with severe mental retardation (Spiker et al. 2002). Clusters were characterized by symptom severity (mild, moderate, severe) in all three DSM-IV-TR domains. A limitation of the study is its strict inclusion criteria, which presumably did not cover individuals across the whole spectrum (Asperger Syndrome, pervasive developmental disorders not otherwise specified (PDD-nos)). A three-cluster solution was also identified by Verté et al. (2011) in a different sample of children within the high functioning range (IQ > 80; Asperger’s Syndrome, high functioning autism, PDD-nos) aged 6–13 years (N = 135). This cluster analysis was based on an agglomerative hierarchical approach on z-standardized ADI-R subscale scores. The first cluster included children with high functioning autism (HFA), the second cluster combined HFA and Asperger’s Syndrome, and the third cluster mostly comprised individuals with PDD-nos. The authors concluded that these findings support the severity concept; still, the cluster groups were described by distinct DSM-IV-TR diagnostic categories. It remains unclear whether these results also indicate differing subtypes apart from a severity gradient, or whether the diagnostic categories might be based on an underlying severity gradient labeled as categories.

Four phenotypic clusters based on similarity of symptom severity across the ADI-R algorithm items of nearly 2000 autistic individuals from the Autism Genetic Research Exchange (AGRE) were identified by Hu and Steinberg (2009). The sample covered a broad age range from 1.85 to 47.68 years, with an estimated nonverbal IQ of 73–140. One cluster was characterized by severe language deficits, another exhibited intermediate severity, the third included milder symptoms across the domains, and the last showed a high frequency of savant skills. This study offers support for both a symptomatic profile of distinct subtypes and a severity gradient.

With regard to the severity concept proposed in DSM-5 and current efforts to develop ICD-11 ASD criteria, the present study builds on previous attempts by exploring homogeneous subgroups within the ASD spectrum. Cluster groups were derived, in terms of the ADI-R algorithm scores, from a large sample including all three ICD-10/DSM-IV-TR diagnoses with a broad age range and an estimated IQ in line with previous studies (e.g. Snow et al. 2008; Hu and Steinberg 2009). The study’s specific objective was to identify homogeneous and clinically meaningful subgroups. We first performed cluster analysis to describe empirically derived subgroups from the ADI-R algorithm scores. Second, subgroups were carefully characterized by additional phenotypic measures, covering ASD and global psychopathological symptoms.

Methods

Participants

Participants were consecutive referrals receiving a comprehensive diagnostic workup between March 1992 and January 2014 at a Department of Child and Adolescent Psychiatry, Psychosomatics, and Psychotherapy in Germany. Individuals with a clinical diagnosis of autistic disorder (F84.0), Asperger’s syndrome (F84.5), or atypical autism (F84.1) according to ICD-10 research criteria (WHO 1992) were included in the study. All participants were aged 3 to 21 years. Diagnoses were established as clinical best estimates by a team of independent and experienced clinicians (clinical psychologists, psychiatrists), all trained to research standards by certified ADOS/ADI-R trainers, based on information from the Autism Diagnostic Interview-revised (ADI-R; Rühl et al. 2004), the Autism Diagnostic Observation Schedule (ADOS, Bölte et al. 2006), the Social Responsiveness Scale (Bölte and Poustka 2008), and/or the Social Communication Questionnaire (Bölte and Poustka 2006). Children who scored as nonverbal on the ADI-R, individuals with IQ <35, and those who could not be tested were excluded from the study. The final sample with complete ADI-R data consisted of n = 463 individuals, including n = 194 (41.9 %) with autistic disorder, n = 113 (24.4 %) with atypical autism, and n = 156 (33.7 %) with Asperger’s syndrome.

Measures

The Autism Diagnostic Interview-revised is a semi-structured parent/primary caregiver interview on DSM-IV-TR/ICD-10 autism criteria. The ADI-R provides empirically derived diagnostic algorithms for three subdomains of qualitative impairments. The social interaction domain includes questions about emotional sharing, offering and seeking comfort, social smiling, and social responding, and comprises four subscores (A1–A4). The second domain (communication) assesses stereotyped expressions, pronoun reversal, and the social usage of language (B1–B4). In the third domain, repetitive and stereotyped behaviors such as hand and finger mannerisms and unusual sensory interests or activities are investigated by four subscales (C1–C4). Furthermore, a score for abnormality of development evident at/before 36 months (D) can be calculated from five items. For a description of the specific subscales please see Table 2. The interviewer determines a rating score for each item based on her/his evaluation of the caregiver’s response to the semi-structured questions. Scores are defined as 0 (“Behavior of the type specified in the coding is not present”), 1 (“Behavior of the type specified is present in an abnormal form, but not sufficiently severe or frequent to meet the criteria for a 2”), 2 (“Definite abnormal behavior”), 3 (“Extreme severity of the specified behavior”), 7 (“Definite abnormality in the general area of the coding, but not of the type specified”), 8 (“Not applicable”), and 9 (“Not known”). For calculating subdomain scores, 3 is recoded to 2, and the scores of 7, 8, and 9 are recoded to 0. A total score is then calculated for each of the three subdomains. In this study, data of the ever/4–5 diagnostic algorithm were used for children aged ≥4 years; the current/ever algorithm was used for children aged 3 years 0 months to 3 years 11 months.
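
The recoding rule described above can be sketched in a few lines of Python. This is only an illustration of the rule stated in the text (3 is recoded to 2; 7, 8, and 9 are recoded to 0). The function names and the example item codes are hypothetical, and the assignment of individual ADI-R items to subscales is not reproduced here.

```python
# Recoding table taken from the rule in the text: 3 -> 2; 7, 8, 9 -> 0.
# The names RECODE, recode_item and subdomain_total are illustrative only.
RECODE = {0: 0, 1: 1, 2: 2, 3: 2, 7: 0, 8: 0, 9: 0}


def recode_item(raw_code: int) -> int:
    """Map a raw ADI-R item code to the value used in the algorithm score."""
    if raw_code not in RECODE:
        raise ValueError(f"unexpected ADI-R item code: {raw_code}")
    return RECODE[raw_code]


def subdomain_total(raw_codes) -> int:
    """Sum the recoded item values that belong to one subdomain."""
    return sum(recode_item(code) for code in raw_codes)


# Made-up item codes for illustration (not study data).
example_codes = [0, 1, 2, 3, 7, 8, 9]
print(subdomain_total(example_codes))  # 0 + 1 + 2 + 2 + 0 + 0 + 0 = 5
```

Capping codes of 3 at 2 keeps a single extreme item from dominating a subdomain total, and mapping 7, 8, and 9 to 0 means that non-specific, non-applicable, or unknown codings do not add to the severity score.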

The Autism Diagnostic Observation Schedule is a semi-structured, cross sectional observation tool, designed for the assessment of ASD children, adolescents, and adults with varying developmental age and language abilities by four different modules. It measures reciprocal interaction and communication impairment, as well as repetitive, stereotyped behaviors. Here, calibrated severity scores for social affect (SA), restrictive, repetitive behavior (RRB) (Hus et al. 2014; Hus and Lord 2014), and the total severity score (Gotham et al. 2009) were calculated for modules 1–4. The separately calibrated severity scores allow a comparison of ASD severity across the different modules.

Both ADI-R and ADOS are well-established and excellently validated diagnostic tools for children and adolescents with ASD (Risi et al. 2006). In this study, both instruments were administered by independent clinical experts (psychologists, psychiatrists) who were trained to research standards.

The Social Responsiveness Scale is a 65-item rating scale that measures symptoms indicative of ASD over the previous 6 months in 4- to 18-year-olds. It is a parent/teacher questionnaire. Each item is scaled from 0 (never true) to 3 (almost always true), generating a total score ranging from 0 to 195. Scores can also be generated for five subdomains: social awareness, social cognition, social communication, social motivation, and autistic mannerisms. For comparability, the usage of SRS raw scores is recommended for research (Bölte and Poustka 2008), and raw scores of the German version were applied in this study. Reliability and validity findings for this version were similar to the data of the US original sample (Bölte et al. 2008).

The German version of the Social Communication Questionnaire (Bölte and Poustka 2006) as a further parent-report screening questionnaire for autism was obtained in this study. It is widely available and has good psychometric properties (Bölte et al. 2000). The 40-item SCQ (composed of “yes” or “no”- questions) is based on ADI-R items. Here, the total score was analysed.

General psychopathology was assessed by the German version of the Child Behavior Checklist (CBCL 4-18) (Achenbach 1998; Döpfner et al. 1994). The CBCL comprises 113 items. A total score, second order scores for internalizing and externalizing problems, and first order syndrome scales for behavioral and emotional problems can be derived: withdrawn, somatic complaints, anxious/depressed, social problems, thought problems, attention problems, delinquent behavior and aggressive behavior. Responses are coded on a Likert Scale from 0 (not true), 1 (sometimes true) to 2 (often true). The CBCL is an internationally validated and widely used parent report form (Achenbach 1998).

Intelligence quotient (IQ) was measured in n = 277 individuals by the age-appropriate German version of the Wechsler Intelligence Scales for children (Hamburg-Wechsler-Intelligence Test for children, HAWIK-IV, HAWIVA) (Petermann and Petermann 2010) and adults (Wechsler Intelligence Test for adults, WIE) (Aster et al. 2006), or by the current version of the revised Culture Fair Intelligence Test (CFT 20-R) (Weiss 2006). Additional cognitive estimates, obtained in n = 105 participants, comprised the Kaufman Assessment Battery for Children (K-ABC) (Melchers and Preuß 2009), the Peabody Picture Vocabulary Test (PPVT) (Dunn and Dunn 2007), the Standard Progressive Matrices (SPM) (Horn 2009), or the Coloured Progressive Matrices (CPM) (Bulheller and Häcker 2002).

Statistical Analysis

IBM SPSS Statistics version 22 was used for all statistical analyses. For descriptive statistics, mean scores and standard deviations (SD) were calculated. The first order ADI-R domains were tested for internal consistency by Cronbach’s α. To achieve comparability among subscales, the raw scores of the ADI-R subscales A1–A4, B1–B4, C1–C4, and D were transformed into standardized z-scores (mean = 0, SD = 1), so that they can be compared directly with each other. A positive z-score means that the score is above the group mean, a negative z-score that it is below, and a value of 0 that it equals the group mean. Multiple cluster analyses were employed to classify symptom patterns on the basis of similarity derived from the scores of the described ADI-R algorithm subscales. In the first step, a single-linkage analysis was carried out to identify outliers with atypical profiles for elimination. In the second step, an agglomerative hierarchical clustering method was applied, using the squared Euclidean distance as proximity measure. Fusions were made by Ward’s method. The number of clusters was determined statistically by inspecting the dendrogram, applying the elbow criterion, and examining the development of the error sum of squares. Subsequently, the K-means procedure was implemented to optimize the final cluster solution. K-means is a partition-based clustering method that minimizes the sum of squared errors over all clusters. The same cluster analytic procedure was then applied to the summarized social communication and stereotyped behavior domains, as postulated by DSM-5. To check the quality of the clustering solution, a MANOVA with cluster affiliation as independent variable and ADI-R algorithm subscale scores as dependent variables was conducted. Discriminant analysis was used to validate the final cluster results. The clustering solution is considered satisfactory when 90–95 % of the cases are correctly classified (Steinhausen and Langer 1977).
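
The core of this procedure (z-standardization, Ward's agglomerative clustering, K-means refinement) can be sketched in plain Python. This is a minimal illustration and not the SPSS implementation used in the study. The function names and toy data are hypothetical, the population SD is assumed for standardization, and seeding K-means with the Ward cluster centroids is an assumption, since the text does not state how the K-means start values were chosen. The single-linkage outlier step is omitted.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to mean 0, SD 1 (population SD; assumes SD > 0)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Column-wise mean of a list of profiles."""
    return [mean(col) for col in zip(*points)]


def ward_clusters(points, k):
    """Naive agglomerative clustering with Ward's criterion; returns k index lists."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                ca = centroid([points[t] for t in a])
                cb = centroid([points[t] for t in b])
                # Increase in the error sum of squares caused by fusing a and b.
                cost = len(a) * len(b) / (len(a) + len(b)) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def kmeans_refine(points, centers, max_iter=100):
    """K-means started from the given centers; returns one label per point."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels


# Toy data for illustration only (two made-up subscales, six made-up cases).
raw = [[1, 1], [1, 2], [5, 5], [5, 6], [9, 1], [9, 2]]
z = zscore_columns(raw)
groups = ward_clusters(z, k=3)
start_centers = [centroid([z[i] for i in g]) for g in groups]
labels = kmeans_refine(z, start_centers)
print(groups)  # [[0, 1], [2, 3], [4, 5]]
print(labels)  # [0, 0, 1, 1, 2, 2]
```

Ward's method fuses, at each step, the two clusters whose merger increases the error sum of squares the least. Starting K-means from a hierarchical solution lets cases that were assigned early in the agglomeration be reallocated to a closer centroid.
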
After obtaining the clusters, comparisons of the cluster means were done by (M)ANOVA (or appropriate χ2-test) for age, IQ, gender, ICD-10 diagnoses, as well as ADOS scores for social affect (SA), and restricted repetitive behaviors (RRB), and the severity score, SCQ, SRS, and CBCL scores with post hoc Bonferroni correction. Effect sizes were calculated by Cohen’s φ, or partial η2, as appropriate.
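
As an illustration of the effect sizes mentioned above, the following sketch computes φ from a χ2 statistic and partial η2 from sums of squares, using their common textbook definitions (φ = sqrt(χ2 / N); partial η2 = SS_effect / (SS_effect + SS_error)). The function names are hypothetical. The example uses the χ2 value reported for gender in the Results together with N = 458, the sum of the three reported cluster sizes; the exact N entering that test is not stated, so this N is an assumption and the result may differ from the reported value in the third decimal.

```python
import math


def phi_coefficient(chi2: float, n: int) -> float:
    """Phi effect size for a chi-square test: sqrt(chi2 / N)."""
    return math.sqrt(chi2 / n)


def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta squared: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


# chi2 = 2.49 is the reported gender comparison; N = 458 is an assumption
# (sum of the three cluster sizes 168 + 137 + 153).
print(round(phi_coefficient(2.49, 458), 3))

# Made-up sums of squares for illustration only.
print(partial_eta_squared(ss_effect=10.0, ss_error=90.0))  # 0.1
```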

Results

Descriptive Data of the Sample

The entire sample consisted of n = 405 (87.5 %) male and n = 58 (12.5 %) female individuals, n = 463 in total. Age ranged from 3 to 21 years with a mean of 10.41 years (SD = 4.15). IQ test results were available for n = 382 individuals, ranging from 41 to 147 (mean = 94.60, SD = 20.64).

ADI-R subscale scores were available for all n = 463 participants, with scores for social interaction (mean = 15.64, SD = 5.78), communication (mean = 12.31, SD = 4.63), stereotyped behavior (mean = 4.48, SD = 2.41), and abnormal development (mean = 2.27, SD = 1.48). ADOS data (modules 1, 2, 3, 4) were available for n = 415 individuals. Severity scores were calculated across all four modules (mean = 6.70, SD = 2.06). Raw scores were available for n = 28 children for ADOS module 1 (SA: mean raw score = 13.43, SD = 3.93; RRB: mean raw score = 3.00, SD = 1.66), n = 97 for module 2 (SA: mean raw score = 10.38, SD = 3.75; RRB: mean raw score = 2.54, SD = 1.88), n = 220 for module 3 (SA: mean raw score = 9.97, SD = 3.80; RRB: mean raw score = 1.64, SD = 1.32), and n = 78 for module 4 (SA: mean raw score = 11.72, SD = 5.04; RRB: mean raw score = 2.49, SD = 1.75). The calibrated severity scores for both domains and each module are presented for each cluster in Table 3. Scores on the Social Responsiveness Scale (SRS) were available for n = 334 (mean = 97.08, SD = 28.64). Raw scores on the Social Communication Questionnaire (SCQ) were obtained for n = 358 (mean = 17.92, SD = 6.90). Scores on the Child Behavior Checklist (CBCL) were available for n = 347 children and adolescents with a mean total T-score of 70.20 (SD = 7.96).

Cluster Analysis: Cluster Solution

Analysis of the three ADI-R algorithm domain scores resulted in fair reliability for all domains (Cronbach’s α = .80 for social interaction, α = .77 for communication, α = .78 for stereotyped behaviors), except for the subscale abnormal development (Cronbach’s α = .36).

The initial single linkage procedure indicated one outlier according to the dendrogram elbow criterion. Examination of the dendrogram identified one spike, reflecting an atypical profile of the indicated participant compared with the rest of the sample. The outlier was excluded from further analysis. In the hierarchical clustering analysis, the dendrogram and inspection of the reduction of the error sums of squares indicated a three-cluster solution. The subsequent adjusted cluster solution showed the following characteristics: cluster 1 comprised n = 168 (36.68 %) individuals, cluster 2 n = 137 (29.91 %), and cluster 3 n = 153 (33.41 %). A MANOVA with the ADI-R subscale scores as dependent and the three clusters as independent variables showed a significant main effect [F(26) = 54.47, p < .001, η2 = .615]. Similar results were found for all subscale scores (A1: F(2) = 115.23; A2: F(2) = 70.26; A3: F(2) = 137.97; A4: F(2) = 156.17; B1: F(2) = 77.12; B2: F(2) = 85.13; B3: F(2) = 119.21; B4: F(2) = 97.48; C1: F(2) = 24.63; C2: F(2) = 20.71; C3: F(2) = 47.07; C4: F(2) = 42.31; D: F(2) = 34.47; all p < .001). For details, see Table 1, which presents the means and standard deviations of the z-standardized subscale scores. Table 3 gives more details on the comparison of the ADI-R raw scores, which yielded the same significant main effect [F(26) = 54.47, p < .001, η2 = .615] and significant differences for all subscale scores.

Table 1 The z-standardized ADI-R algorithm subscales due to a three cluster solution

The most severely affected individuals were allocated to cluster 2. With the exception of A1 (“Failure to use nonverbal behaviors to regulate social interaction”), cluster 2 showed the highest impairment on all subscales related to reciprocal social interaction, communication, stereotyped behaviors, and abnormal development.

Cluster 3 comprised individuals with moderate symptom severity in the domains of reciprocal social interaction, communication, and abnormal development. For stereotyped, repetitive or idiosyncratic speech and behaviors (B3, C1–C4), the individuals of cluster 3 showed minor symptom severity in comparison with the social affect (SA) symptoms within cluster 3.

Overall, the least affected individuals in the social interaction, communication, and abnormal development domains were included in cluster 1. In contrast to the social affect domains, the restricted and stereotyped behavior domain was moderately pronounced in cluster 1, indicating a trend for relatively higher restricted and repetitive behaviors (RRB) than social affect symptoms in cluster 1.

Across the cluster groups, a graded difference between all three clusters was found only for the social affect domains (cluster 2 > cluster 3 > cluster 1), whereas the RRB domains showed comparable severity in cluster 1 and cluster 3 (cluster 2 > cluster 1 = cluster 3), with a descriptive trend toward more severely affected individuals in cluster 1. To provide a better overview, the average behavioral profile of each cluster is displayed in Fig. 1.

Fig. 1

The average behavioral profile for each of the three cluster groups. Note: A1–A4 = impairments in social interaction, B1–B4 = impairments in communication and language, C1–C4 = restricted, repetitive and stereotyped behaviors/interests, D = abnormality of development evident at/before 36 months; Cluster 2: high impairments in all subscales; Cluster 3: more impairments in the social affect domain (except for B3: stereotyped speech) than for stereotyped behaviors; Cluster 1: fewer impairments in all subscales, with stereotyped behaviors more severely pronounced than social affect deficits

For greater clarity, comparisons were also made for the z-standardized overall scores. Instead of summarizing the subscales into the original ADI-R algorithm domains, subscales were reorganized according to current DSM-5 criteria. Impairments in reciprocal social interaction and communication were combined into a social affect score. The subscale B3 on stereotyped speech was reorganized under the overall score of stereotyped, repetitive behavior, as recommended by DSM-5. Significant differences between the cluster groups were found for both social affect (cluster 2 > cluster 3 > cluster 1) and RRB (cluster 2 > cluster 1 > cluster 3) (see Table 2).

Table 2 Comparisons of the z-standardized overall scores including confidence intervals

Comparisons within each cluster group showed that the confidence intervals of the mean scores for social affect and repetitive behavior in cluster 1 did not overlap, indicating a significant difference between both domains, with fewer deficits in the SA domain than in the RRB domain (SA < RRB). In contrast, the confidence intervals of both mean scores overlapped in cluster 2, indicating a similar severity of difficulties in SA and RRB (SA = RRB). As in cluster 1, the confidence intervals of the mean SA and RRB scores did not overlap in cluster 3, but the difference ran in the opposite direction, indicating more problems in the SA domain than in the RRB domain (SA > RRB). Results are visualized in Fig. 2.
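
The overlap criterion used above can be illustrated with a short sketch. It assumes a normal-approximation 95 % confidence interval (mean ± 1.96 · SD / sqrt(n)); the text does not state how the intervals in Table 2 were computed, so this formula, the function names, and all numeric inputs below are assumptions made for illustration and are not study data.

```python
import math


def mean_ci(mean: float, sd: float, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for a mean."""
    half_width = z * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)


def intervals_overlap(a, b) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up z-score means and SDs for two domains within one cluster.
sa_ci = mean_ci(mean=-0.80, sd=0.50, n=168)
rrb_ci = mean_ci(mean=-0.10, sd=0.80, n=168)
print(intervals_overlap(sa_ci, rrb_ci))  # False: the domains differ

# Made-up values where the two domains are similarly severe.
sa_ci2 = mean_ci(mean=0.90, sd=0.60, n=137)
rrb_ci2 = mean_ci(mean=0.85, sd=0.90, n=137)
print(intervals_overlap(sa_ci2, rrb_ci2))  # True: no clear difference
```

Checking for overlap is a descriptive heuristic and not a formal test of the difference between two domain scores measured in the same individuals.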

Fig. 2

Overall scores for Social Affect (domain AB) and Restricted, Repetitive Behavior (domain C) according to their cluster affiliation

As a last step, a discriminant analysis was computed to confirm the results of the cluster solution. In total, 96.7 % of the cases were correctly classified by the presented three-cluster solution. In detail, 98.2 % of the participants were correctly classified in cluster 1, 94.9 % were correctly allocated to cluster 2, and 96.7 % to cluster 3 (see Fig. 3).
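
The relation between the per-cluster and the overall classification rates can be checked with simple arithmetic. In the sketch below, the cluster sizes are the reported ones, whereas the counts of correctly classified cases (165, 130, 148) are not given in the text. They are an assumption, back-calculated as the whole numbers that reproduce the reported percentages.

```python
# Reported cluster sizes.
cluster_sizes = {1: 168, 2: 137, 3: 153}

# ASSUMPTION: counts back-calculated from the reported percentages,
# not taken from the text.
correct_counts = {1: 165, 2: 130, 3: 148}


def hit_rate(correct: int, total: int) -> float:
    """Percentage of correctly classified cases."""
    return 100.0 * correct / total


per_cluster = {
    c: round(hit_rate(correct_counts[c], cluster_sizes[c]), 1)
    for c in cluster_sizes
}
overall = round(
    hit_rate(sum(correct_counts.values()), sum(cluster_sizes.values())), 1
)

print(per_cluster)  # {1: 98.2, 2: 94.9, 3: 96.7}
print(overall)      # 96.7
```

The overall rate is the size-weighted average of the per-cluster rates, which is why it lies between the lowest and the highest cluster value.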

Fig. 3

Discriminant analysis of correct cluster classification

Cluster Comparisons

Demographic and Diagnostic Variables

No differences between clusters were observed for demographic data (Table 3), namely age [F(2) = 1.49, p = .228, η2 = .008] and gender [χ2(2) = 2.49, p = .293, φ = .073]. IQ [F(2) = .40, p = .670, η2 = .006] did not differ between the cluster groups. Cluster groups were also compared with regard to ICD-10 diagnostic classification, for which again no differences were observed between the three cluster groups [χ2(4) = 2.74, p = .603, φ = .077].

Table 3 Cluster comparison due to sample characteristics, autism specific, and comorbid variables

Autistic Symptoms

The Social Responsiveness Scale total score did not differ between clusters [F(2) = 1.69, p = .187, η2 = .010]. No significant group differences were found for the SRS subscales, but trends were observed, with the most severe social impairments in cluster 2 and elevated autistic mannerisms in cluster 1, similar to the ADI-R cluster profiles.

Statistically significant differences were found for the Social Communication Questionnaire [F(2) = 3.20, p = .042, η2 = .018] with highest scores in cluster 2, followed by cluster 1, and cluster 3.

To assess directly observed autistic symptoms, ADOS modules 1, 2, 3 and 4 were applied. The ADOS severity scores [F(2) = .76, p = .469, η2 = .004] did not differ according to cluster affiliation. None of the calibrated severity scores for SA or RRB for modules 1, 2, 3, and 4 differed between the three cluster groups [module 1/SA: F(2) = 1.13, p = .341, η2 = .086; RRB: F(2) = 1.91, p = .169, η2 = .138; module 2/SA: F(2) = .75, p = .476, η2 = .016; RRB: F(2) = .58, p = .560, η2 = .012; module 3/SA: F(2) = .16, p = .851, η2 = .002; RRB: F(2) = .19, p = .825, η2 = .002; module 4/SA: F(2) = .70, p = .502, η2 = .019; RRB: F(2) = .16, p = .854, η2 = .004]. For details please see Table 3.

Comorbid Psychopathology

A comparison of the Child Behavior Checklist (CBCL) total score between clusters revealed no differences [F(2) = .67, p = .574, η2 = .309]. Similar results were found for all CBCL subscale scores (see Table 3). Descriptively, cluster 2 showed the highest comorbid psychopathology.

Discussion

Research that attempts to stratify children and adolescents with ASD into distinct subgroups assumes that an improved understanding of the underlying phenotypic structure will help to better delineate etiology, diagnosis, and treatment options. The major aim of the present study was thus to identify homogeneous and clinically meaningful ASD subgroups based on the ADI-R algorithm subscale scores. This is a highly topical question with regard to the recent changes from DSM-IV-TR to DSM-5 and current efforts to develop ICD-11.

Our approach to reducing the heterogeneity of ASD samples differs from many other studies, which have attempted to analyze the factor structure of the ADI-R (Snow et al. 2009; Duku et al. 2013) or applied clustering in very circumscribed samples (e.g. Brennan et al. 2014). By using the hierarchical cluster analytic approach in a large sample with a broad range of IQ and age over the full spectrum of ASD symptoms, our study identified three empirically derived homogeneous subgroups. Discriminant analysis revealed an excellent rate of correctly classified participants, confirming the presented cluster solution. The empirically derived number of clusters is in line with previously reported results on differing ASD populations from toddlers to high functioning adolescents (Verté et al. 2006; Wiggins et al. 2011; Brennan et al. 2014).

The three derived clusters indicate a different pattern of social interaction and communication problems versus stereotyped behaviors. The characteristics of social interaction and communication problems fit with the assumption of a severity gradient across all three cluster groups, with cluster 2 comprising the most affected individuals, followed by cluster 3, and the least affected ASD individuals in cluster 1. These results support the novel DSM-5 concept of a severity gradient with regard to social communication, which combines the former social interaction and communication domains. This result is in line with replicated findings that a differentiation between both DSM-IV-TR/ICD-10 domains seems to be arbitrary (Lord and Bishop 2015). Based on our results, a severity gradient can also be described for the stereotyped, repetitive behavior domain; however, the pattern of severity differed considerably from the first one. Even though the strongest impairments were again found for cluster 2, the less affected individuals in clusters 1 and 3 did not show a distinct symptom pattern for the stereotyped behavior domain as found for the social communication domains. Interestingly, the subscale for stereotyped speech (B3) (formerly included in the communication domain in DSM-IV-TR) showed the same cluster pattern as demonstrated for the repetitive and stereotyped behavior domain. These findings again support the shift of idiosyncratic and repetitive speech under the umbrella of the stereotyped behavior domain in DSM-5. Due to the differing symptom distribution, our findings also strongly support the reorganization from the triadic to the dyadic domain architecture, as suggested by the DSM-5 concept.

As a consequence, clusters were subsequently compared with regard to the DSM-5 based dyadic structure. For both domains, social communication (SA) impairments and stereotyped, repetitive behavior (RRB), differences between clusters were observed. The significant difference, especially for the RRB domain, might have arisen from combining the subscale scores and shifting stereotyped speech into the RRB domain. Non-overlapping confidence intervals between the social communication and stereotyped behavior domains for clusters 1 and 3 revealed symptom patterns different from that of cluster 2. Cluster 1 included ASD individuals exhibiting significantly more stereotyped behaviors than social communication impairments. ASD individuals with significantly fewer stereotyped behaviors in contrast to moderate impairments in the social communication domain were allocated to cluster 3. The most severely affected individuals with regard to both (not differing) domains, who also showed the earliest age of onset, were allocated to cluster 2. Taken together, a severity gradient as well as differing symptom patterns can be found within ASD. While the most affected ASD individuals were allocated to one cluster group (cluster 2), the less affected individuals were grouped into two clusters with differing symptom profiles. Consequently, our study offers support for both a gradient model and a distinct phenotypic profile model. Nevertheless, the small number of items in the RRB domain and the size of the standard deviation need to be considered when interpreting the non-overlapping confidence intervals.

Thus, no cluster group characterized by social-communication problems only was observed, which supports the concept of a phenotypic distinction between social communication disorder (SCD) and ASD. Still, ASD children in cluster 3 showed the least stereotyped behavior, and the cluster probably included some children who may also meet SCD criteria according to DSM-5. Groups mainly exhibiting social communication problems without many restrictive and repetitive behaviors have also been described in earlier cluster analyses of ASD samples (Greaves-Lord et al. 2012; Brennan et al. 2014). Similar to the study of Greaves-Lord et al. (2012), almost 30 % of our sample showed the described symptom pattern. A study by Mandy et al. (2011) indicated that the majority of individuals formerly diagnosed with PDD-NOS presented a combination of social interaction and communication impairments with fewer or no repetitive and stereotyped behaviors. To date, it is widely discussed but still unclear whether a substantial proportion of children and adolescents with ASD will indeed be shifted into the SCD category. Because we did not include any diagnostic assessment of SCD, we cannot provide any empirical data on this important question. Further studies including a broader range of communication and language impaired children with and without ASD need to clarify this issue.

After deriving the cluster groups, we aimed at further characterizing them by additional phenotypic measures. No influence of demographic variables or cognitive abilities was observed, which replicates previous results (Verté et al. 2006; Wiggins et al. 2012; Kim et al. 2013) providing some preliminary evidence for the cluster groups being independent of IQ, age and gender. Further studies found that the ADI-R algorithm totals are influenced by the developmental level, particularly the expressive language of the children, in linear regression models (Hus and Lord 2013). This influence was stronger for the ADI-R current than the ADI-R diagnostic algorithm. Future studies, also including non-verbal participants using the current algorithm, are needed to analyse if there might also be any similar effects of language-level for the derived cluster groups.

Evidence that DSM-IV-TR defined subtypes cannot be distinguished by the previous characteristics has led to the proposal of merging them into a single dimensional ASD group (Lord et al. 2012). In our study, the distribution of the distinct ICD-10 diagnoses autistic disorder, Asperger’s syndrome, and PDD-NOS likewise did not differ between the three empirically derived cluster groups. This result is in line with previous studies (e.g. Brennan et al. 2014). These findings again support the revision of previously existing sub-groupings of ASD, as implemented by the DSM-5 concept, and underscore the need for a more empirically based conceptualization in the prospective eleventh revision of the ICD criteria.

Cluster comparisons with additional phenotypic measures covering ASD symptoms showed significant differences for the Social Communication Questionnaire. As in the presented ADI-R cluster distribution, the most affected individuals were in cluster 2, but the less affected individuals were in cluster 3 instead of cluster 1. Even though the results for the Social Responsiveness Scale did not differ significantly, descriptive statistics also indicated a trend for the highest mean scores in cluster 2. In line with the cluster analytic results, a trend towards less affected individuals with regard to autistic mannerisms measured by the SRS was found for cluster 3. We did not find any significant cluster differences on the ADOS severity score. Similar results were found by Georgiades et al. (2013). Several measurement issues may have influenced this result. One explanation may be that the ADOS severity metric used in our analyses combines scores of social-communication symptoms and stereotyped, repetitive behaviors, so that differences in individual domains may be obscured by summarizing all symptoms into one combined score. To address this, we also compared the ADOS social affect and restricted, repetitive behavior scores between the three cluster groups. Again, no significant differences between cluster groups were found. The lack of statistically significant cluster differences on the ADOS scores might also reflect the differences between a retrospective parent interview and a current direct observation of ASD symptoms. A further explanation could be that cluster groups derived from retrospective reports of symptoms at ages 4–5 or ever were compared to current ASD symptoms. Past studies comparing agreement of the ADI-R and ADOS algorithms similarly observed rather low correlations, indicating that both instruments make independent, additive contributions (Kim and Lord 2012). Another reason might be the comparatively small sample sizes for modules 1, 2 and 4 in each cluster group.
In sum, the lack of significant differences between cluster groups for most of the ASD questionnaires might be attributable to these measurement differences, probably also confounded by age.

Since individuals with ASD vary not only with regard to level and type of core autistic symptoms, but also in terms of co-occurring psychopathological symptoms, clusters were also compared with regard to the Child Behavior Checklist syndrome scales. No differences between clusters were observed, in contrast to previous findings of higher rates of psychiatric comorbidity in more severely affected ASD individuals (Greaves-Lord et al. 2013). However, previous findings in this area are very heterogeneous. While some data showed a tendency towards more co-occurring behavioral problems in individuals with more severe ASD traits (e.g. Pearson et al. 2006), others also reported high rates of comorbidities among less affected children with PDD-NOS (de Bruin et al. 2006), and further studies revealed few differences between the ASD subtypes (e.g. Snow and Lecavalier 2011), in line with this study. As a limitation, the cluster comparisons were based on the subset of children assessed with the CBCL. Thus, the amount of missing data needs to be taken into account.

Our study has methodological strengths that enhance the validity of the results compared to previous studies. First, we included a large sample with a broad range of ages, covering the broad spectrum of ASD symptoms from the less severe end (e.g. atypical autism, Asperger’s syndrome) to the severely affected end (autistic disorder), which enhances the generalizability of the study results. Diagnoses were carefully established using the gold standard diagnostic system, underlining the internal validity of our study. The cognitive level was available for most of the participants, and mean IQ scores were remarkably similar across clusters. This homogeneous distribution enables the core symptom domains to emerge within the cluster analyses. The empirically derived cluster groups were additionally compared on various child psychiatric variables, such as cognitive level, comorbidities, and autism specific questionnaires, which were not part of the preceding cluster analyses. Second, by using the ADI-R, one of the most widely used and comprehensive assessments for the broad range of behavioral and functional ASD symptoms (Snow et al. 2008), we chose an ideal measure for deriving clinically relevant and homogeneous subgroups in the cluster analytic approach (Hu and Steinberg 2009). Besides the limitation already noted, we have to emphasize that the cluster analyses used only data assessed by an interview with the primary caregiver. Including data from one primary instrument instead of analyzing data from multiple instruments allowed post hoc comparisons of the clusters on variables that were not part of the cluster analytic process. When interpreting the results of the current study, it has to be noted that the ADI-R was also included, alongside other instruments, in the diagnostic process. ICD-10 diagnoses were therefore not fully independent of the ADI-R algorithm scores used in the cluster analyses.
Non-verbal children were excluded to reduce missing data and to enhance the homogeneity of the sample, which also limits the generalizability of our results, although Snow et al. (2008) did not find any differences between verbal and non-verbal children in a factor analysis of the ADI-R algorithm items. Lastly, two aspects concerning the measurement and the statistical procedure have to be addressed. First, the ADI-R was developed as a categorical diagnostic instrument. This might limit the identification of a severity model and affect the cluster pattern. This aspect has to be taken into account even though previous studies already showed evidence for an underlying severity gradient using the ADI-R (e.g. Spiker et al. 2002). A study by Constantino et al. (2004) also found evidence for a continuously distributed underlying factor for both the ADI-R and the SRS as a dimensional measurement. Second, although creating z-scores has the strength of eliminating inherent subscale differences, the accompanying reduction of variance is a shortcoming, because it yields relative instead of absolute differences. This is of special interest as the social communication differences across cluster groups might be more meaningful given the number of items included. Finally, even if the sample size and the sample characteristics strengthen the representativeness of the findings, the results have to be replicated in further samples.
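The trade-off of z-standardization noted above can be illustrated with a minimal sketch. The values below are hypothetical and are not study data; the variable names and the helper `z_scores` are assumptions introduced only for this illustration, and the study's actual standardization procedure may have differed in detail.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]

# Hypothetical raw domain totals for five children (not study data).
social_raw = [10, 14, 18, 22, 26]  # domain with many items, wide raw range
rrb_raw = [2, 3, 4, 5, 6]          # domain with few items, narrow raw range

social_z = z_scores(social_raw)
rrb_z = z_scores(rrb_raw)

# The raw ranges differ by a factor of four ...
print(max(social_raw) - min(social_raw), max(rrb_raw) - min(rrb_raw))
# ... but after standardization both domains have the same spread, so only
# each child's relative position within a domain is retained.
print([round(z, 2) for z in social_z])
print([round(z, 2) for z in rrb_z])
```

In this sketch the two domains become directly comparable, which is the intended strength, while the information that the first domain spans a much wider absolute range is lost.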

Despite these limitations, results from this study could have important clinical and research implications. First, the empirically derived cluster groups largely differed from the single diagnostic subtypes as classified by ICD-10. Asperger’s syndrome, PDD-NOS, and autistic disorder were nearly equally distributed across the cluster groups, indicating that the underlying ASD phenotype visibly differs from the previously proposed diagnostic criteria of the subtypes. In line with numerous other studies, this points to a need for a more empirically based conceptualization in the prospective eleventh revision of the ICD concept. Second, the ASD phenotype, according to the distribution of social communication deficits, confirms the novel dyadic concept as well as the severity gradient proposed by DSM-5. This finding should also be considered in the revision of the ICD-10 diagnostic concept. A closer look at the data even showed a somewhat differing pattern for the stereotyped behavior domain, also indicating a symptom profile for the less affected groups. This is of special interest for diagnostic concepts, but needs to be replicated in further studies, especially considering the small number of items in the RRB domain.

All in all, our results essentially offer support for both the severity gradient model, which fits the severely affected individuals, and specific symptom profiles among the less affected participants. Given the differing symptom profiles at the less affected end, we want to draw attention to the risk for treatment approaches when distinct subtypes are no longer distinguished. Different children with different symptoms might need differing intervention approaches. Finally, it has to be taken into account that the presented cluster solution depends on the variables and the specific sample included in these analyses. Our results first have to be replicated in independent samples, especially with cluster variables not previously used in the diagnostic process, to verify the external validity of these findings.