Introduction

Much recent attention has been directed towards elucidating the structure of autistic symptoms. The heart of the matter is whether and how symptoms coalesce. This is important because such knowledge informs the use and development of instruments used to measure symptoms of Autism Spectrum Disorders (ASDs). Relatedly, these efforts also inform our conceptualization of diagnostic classification systems, an especially relevant point given the impending release of the DSM-V. A better understanding of the structure of ASD symptoms may also lead to a better understanding of the relationship between genes and behavior. In this respect, quantitative phenotypes have proven more useful than categorical designations (Szatmari et al. 2007a, b).

Factor analysis (FA) is one way researchers have examined symptom structure. A summary of such studies examining the structure of ASD symptoms can be found in Table 1. Thus far, results have been inconclusive. Studies largely support models composed of either two or three symptom domains, but there is variability in the composition as well as the number of factors identified. For instance, Frazier et al. (2008) submitted ADI-R data from 1,170 verbal individuals aged 2–46 years to FA and found support for a two-factor model, with a social/communication factor and a stereotyped behavior factor (which also included stereotyped language). A three-factor model, which separated difficulties in peer relationships and imaginative play from the social/communication factor (only containing two items), also performed well. Likewise, using the ADI-R in a sample of 1,861 individuals with ASD aged 4–18 years, Snow et al. (2009) found that both two- and three-factor models were defensible.

Table 1 Summary of studies of ASD structure

Studies supporting a three-factor model have not always aligned with the three domains of the DSM-IV (e.g., Georgiades et al. 2007; van Lang et al. 2006). The DSM-IV triad of symptom domains was largely based on clinical judgment, and has received limited empirical support. Additional models (e.g., a one-factor model) also have been proposed in the literature.

Results from studies may differ for several reasons, including the instrument used and type of statistical procedures employed. To date, most FA studies have used data from the ADI-R (8 out of 12 studies in Table 1). Perhaps the ADI-R has been used most often in this line of research because of its strong psychometric properties and the fact that it contains many items from each of the three DSM domains. However, use of other well-validated instruments can be informative. The ADOS is an ideal candidate for several reasons. It has been extensively validated (e.g., de Bildt et al. 2004; Mazefsky and Oswald 2006) and is based on clinician observation, thereby providing a different source of information. Recently, the ADOS scoring algorithms have been revised (Gotham et al. 2008; Gotham et al. 2007) to improve diagnostic validity. The revised algorithms also have been validated in an independent sample (Oosterling et al. 2010).

Methodological and procedural variations in the statistics can also impact FA results. For instance, while some authors use FA, others use PCA. Extraction methods and type of rotations also vary across studies. Some studies have analyzed data at the item level, while others have used subscales. Of course, results are dependent on the nature/content of items that are analyzed (i.e., the phenotype is limited to the behaviors that are analyzed).

The nature of the sample can also impact findings. As illustrated in Table 1, many previous studies have included a wide range of ages, verbal abilities, and/or levels of functioning. This complicates comparisons across studies, as research has shown that ASD symptoms change over time (Charman et al. 2005; Happé et al. 2006; Piven et al. 1996). Since ASD symptom patterns change with age, results may be ‘muddied’ in heterogeneous samples (e.g., if preschool-aged children and older children/adults are included in the same analyses). Other studies have shown that symptom severity and expression vary with level of functioning (e.g., Dawson et al. 2007; Munson et al. 2008; Szatmari et al. 2006). Examination of ASD subtypes also has indicated that variability among those with ASDs is perhaps best explained by differences in IQ (Witwer and Lecavalier 2008). In addition to providing further reason to revisit DSM criteria, these studies point to the importance of examining homogenous subgroups of those with ASDs.

The primary goal of the current study was to examine the structure of ASD symptoms as measured by the ADOS in a large sample of children and adolescents with ASD. In doing so, this study complemented previous research by using an observation-based measure of current behaviors. This study is unique in that it included all ADOS items related to DSM domains and it examined several models of ASD symptoms within the same sample.

A secondary goal was to examine how different subgroups impacted model fit. Because previous studies have demonstrated that ASD symptoms may manifest differently as a function of IQ and age, the sample was split according to these variables. While the ADOS controls for level of functioning and age to some extent through appropriate module selection, there can still be significant variability in terms of age or level of functioning within a given module.

Methods

Participants

Participant data were collected from a large database available for research purposes, the Autism Genetic Resource Exchange (AGRE; Geschwind et al. 2001). The majority of children in the AGRE database are from families in which at least two members have an ASD diagnosis (i.e., multiplex families). Participants with an ASD diagnosis are referred to AGRE; diagnoses are then confirmed via team consensus, based in part on both ADI-R and ADOS scores. Individuals are assigned to one of four diagnostic categories: Autism, Not Quite Autism (NQA, those within one point of domain autism cut-offs on the ADI-R), Broad Spectrum (more than one point away from domain cut-offs on the ADI-R, but still determined to fall on the spectrum), or Not Spectrum. Individuals were excluded if they were referred with an ASD but categorized as ‘Not Spectrum’ by AGRE researchers; all other individuals were included if they met spectrum cut-offs on the ADOS and had complete data.

Children and adolescents aged 3–18 years were included in analyses. Modules 1 and 3 were examined because they are the most frequently used modules in children and adolescents. There were a total of 1,409 participants, with 720 individuals for Module 1 and 689 individuals for Module 3. Sample characteristics are reported in Tables 2 and 3.

Table 2 Summary of module 1 participant characteristics
Table 3 Summary of module 3 participant characteristics

Instruments

The Autism Diagnostic Observation Schedule (ADOS; Lord et al. 2000) is a semi-structured, standardized observation for the assessment of ASD. It consists of social “presses” which require a trained examiner to engage an individual in various social interactions. The ADOS comprises four modules; the appropriate module is selected based on the individual’s expressive language level and chronological age. Administration of each module requires approximately 30–45 min. Most ADOS items are coded from zero (0, no evidence of abnormality) to three (3, markedly abnormal behavior). A code of seven (7) is used when the behavior is abnormal but in a different way than the protocol specifies, and a code of eight (8) is used when the rating is not applicable. For analyses, ratings of seven and eight were transformed to zeros. Module 1 consists of 10 activities and 29 ratings; 12 items are part of the scoring algorithm. Module 3 consists of 14 activities and 28 ratings; 11 items are part of the scoring algorithm.
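The recoding rule described above can be sketched in Python as follows. This is a minimal illustration, assuming item codes are stored as integers; the function name `recode_ados` is hypothetical and is not part of any ADOS scoring software, and the decision to reject any other code is an assumption made here for safety.

```python
def recode_ados(codes):
    """Map raw ADOS item codes to the values used in the analyses.

    Codes 0-3 are kept as they are. Codes 7 (abnormal in a way the protocol
    does not specify) and 8 (not applicable) are transformed to 0, as
    described in the text. Any other code raises an error (an assumption of
    this sketch, not a rule stated in the source).
    """
    recoded = []
    for code in codes:
        if code in (0, 1, 2, 3):
            recoded.append(code)
        elif code in (7, 8):
            recoded.append(0)
        else:
            raise ValueError(f"unexpected ADOS code: {code!r}")
    return recoded
```

For example, `recode_ados([0, 3, 7, 8, 2])` returns `[0, 3, 0, 0, 2]`.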

The Raven’s Progressive Matrices (Raven et al. 2003) is a nonverbal measure of cognitive functioning. It consists of five sets of 12 items, with subsequent items increasing in difficulty. Estimated non-verbal age was used to select participants for the high-functioning subgroup in Module 3.

The Vineland Adaptive Behavior Scale (VABS; Sparrow et al. 1984) is a semi-structured interview of adaptive behavior. Items are organized into four domains: Communication, Daily Living Skills, Socialization, and Motor Skills. An overall composite score also is obtained. Both the overall composite and the Daily Living Skills standard score were used to select participants for the low-functioning subgroups. This was because the composite score incorporates functioning in areas essential to the diagnosis of autism (i.e., socialization and communication) so a more independent domain was also used as criterion to ensure low-functioning status was not the result of severe autistic symptomology.

Procedure

Participant families for AGRE are found through various sources. At initial contact, families are given a packet containing information about the program and a preliminary enrollment form. On the form, families identify the diagnosis they have received from their physician or specialist; Autism, PDD-NOS, and Asperger’s Disorder are accepted. Diagnoses are then confirmed via team consensus, based in part on both ADI-R and ADOS scores. Raters are assessed periodically to ensure that they are research-reliable (Autism Genetic Resource Exchange 2009).

Statistical Analyses

Confirmatory Factor Analysis

Modules 1 and 3 were examined by CFA individually. Analyses were conducted on the entire module sample, then also on subgroups based on age and level of functioning. CFA was performed using LISREL (Jöreskog and Sörbom 2004). Polychoric correlation matrices were used due to the ordinal nature of the data. Diagonally weighted least squares (DWLS) was used for estimation because it is preferred over other methods when data are categorical, and it requires smaller sample sizes than other methods (Wirth and Edwards 2007).
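To make the estimation criterion concrete, the sketch below computes the textbook form of the diagonally weighted least squares discrepancy: each squared difference between a sample correlation and its model-implied counterpart is divided by the estimated asymptotic variance of that sample correlation. This is an illustration only; it is not LISREL's implementation, the function name `dwls_discrepancy` is hypothetical, and the inputs (vectorized correlations and their variances) are assumed to have been obtained elsewhere.

```python
def dwls_discrepancy(sample, implied, asymptotic_variances):
    """Diagonally weighted least squares discrepancy (textbook form).

    sample               -- unique sample (e.g., polychoric) correlations
    implied              -- model-implied correlations, in the same order
    asymptotic_variances -- estimated asymptotic variance of each sample
                            correlation (the diagonal of the weight matrix)
    """
    if not (len(sample) == len(implied) == len(asymptotic_variances)):
        raise ValueError("all three inputs must have the same length")
    total = 0.0
    for s, sigma, variance in zip(sample, implied, asymptotic_variances):
        if variance <= 0:
            raise ValueError("asymptotic variances must be positive")
        # Residuals on precisely estimated correlations are weighted more.
        total += (s - sigma) ** 2 / variance
    return total
```

Estimation searches for the parameter values that minimize this quantity. Because only the diagonal of the weight matrix is used, no full weight matrix has to be inverted, which is one reason the method is workable with smaller samples.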

Measures of Fit

There are many different indices of fit, but little consensus regarding which measure (or combination of measures) is best to use for evaluating models (Bollen 1989). Often indices can disagree with each other. Indices in this study were chosen in order to follow recommendations which suggest selecting indices from different categories to reflect various criteria (Brown 2006; Garson 2009) and in order to maximize the ability to compare this study to previous work.

RMSEA was selected as the primary measure of model fit because it is one of the measures least impacted by sample size and does not involve a null model (M. W. Browne, personal communication, June 1, 2010). It also has a confidence interval and graduated guidelines exist for its interpretation (rather than a single cut-off). Finally, it is commonly reported in the literature, allowing for direct comparison between previous studies and results found here. Additional measures considered were the Non-Normed Fit Index (NNFI), Comparative Fit Index (CFI), Standardized Root Mean Square Residual (SRMR), and Goodness of Fit Index (GFI). Although GFI may recently have fallen out of favor as a preferred measure of fit (Garson 2009), it is frequently reported in previous studies, so it was included here. Guidelines have been proposed to help researchers interpret goodness of fit indices, though there is some variability in recommendations. Generally accepted cut-offs were used (i.e., for the RMSEA, values less than 0.08 were considered acceptable; for the NNFI, CFI, and GFI, values greater than 0.95 were considered acceptable; and for the SRMR, values less than 0.10 were considered acceptable).
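The cut-offs listed above, together with a commonly cited point estimate of the RMSEA, can be sketched as follows. The formula sqrt(max(chi-square − df, 0) / (df × (N − 1))) is the usual textbook version; some programs use N rather than N − 1, so values may differ slightly from software output. The names `rmsea`, `CUTOFFS`, and `is_acceptable` are hypothetical and are used here only for illustration.

```python
import math


def rmsea(chi_square, df, n):
    """Point estimate of the RMSEA from a model chi-square, its degrees of
    freedom, and the sample size (N - 1 convention; an assumption)."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Cut-offs as stated in the text: (direction, threshold).
CUTOFFS = {
    "RMSEA": ("<", 0.08),
    "NNFI": (">", 0.95),
    "CFI": (">", 0.95),
    "GFI": (">", 0.95),
    "SRMR": ("<", 0.10),
}


def is_acceptable(index, value):
    """Return True if the value meets the cut-off used in this study."""
    direction, threshold = CUTOFFS[index]
    return value < threshold if direction == "<" else value > threshold
```

As a usage note, `is_acceptable("RMSEA", 0.056)` is true and `is_acceptable("RMSEA", 0.083)` is false, which mirrors the lowest RMSEA reported for Module 1 and the highest reported for Module 3 in the Results.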

Model Specification

Three different models were tested within each module (total sample and subgroup). First, a one-factor model was specified, in which all items loaded onto a single factor. The second model was the three-domain DSM-IV model (as reported in the DSM-IV-TR; American Psychiatric Association 2000) that includes social impairments, communication, and repetitive behaviors and interest (see Fig. 1). Finally, a third model based on recently proposed changes to the DSM was specified (American Psychiatric Association 2010) called hereafter the DSM-V model (see Fig. 2). It consisted of one factor of social and communication items and one factor of restricted and repetitive behavior and language (RRB/L) items.

Fig. 1 DSM-IV model

Fig. 2 Proposed DSM-V model
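One way to see how the three specifications differ is to count their free parameters. The sketch below assumes a correlation matrix is analyzed, each item loads on exactly one factor, factor variances are fixed at 1, factors are allowed to correlate, and residuals are uncorrelated. Under those assumptions the free parameters are one loading per item plus one correlation per pair of factors. The function name and the item counts in the usage note are hypothetical; the source does not report how many items loaded on each factor, and degrees of freedom reported by software may follow a different convention.

```python
def cfa_degrees_of_freedom(items_per_factor):
    """Degrees of freedom for a simple-structure CFA on a correlation matrix.

    items_per_factor -- number of items loading on each factor,
                        e.g. [8, 4] for a two-factor model
    """
    if not items_per_factor or any(n < 1 for n in items_per_factor):
        raise ValueError("each factor needs at least one item")
    p = sum(items_per_factor)           # total number of items
    k = len(items_per_factor)           # number of factors
    observed = p * (p - 1) // 2         # unique off-diagonal correlations
    free = p + k * (k - 1) // 2         # loadings + factor correlations
    return observed - free
```

With a hypothetical set of 12 items, a one-factor model (`[12]`) has 54 degrees of freedom, a two-factor model (`[8, 4]`) has 53, and a three-factor model (`[5, 4, 3]`) has 51. The models therefore differ by only a handful of parameters, namely the inter-factor correlations.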

Subgroups

In order to examine the impact of subject characteristics, modules were divided into more homogenous subgroups by age and level of functioning (as determined by IQ or adaptive behavior scores). Decisions related to dividing modules 1 and 3 were made with two goals in mind: to create homogenous subgroups based on variables of interest, and to retain enough participants for stable factor solutions. As such, analyses were conducted on modules as a whole first and then by subgroup. A “youngest subgroup” in Module 1 consisted of those 6 years and younger and an “oldest subgroup” in Module 3 consisted of those 10 years and older. The subgroup analyses based on level of functioning were as follows: a “lowest-functioning subgroup” in Module 1 was based on VABS composite and DLS < 55; a “lowest-functioning subgroup” in Module 3 was based on VABS composite and DLS ≤ 70. There were two higher functioning groups in Module 3: a “highest-functioning subgroup” based on VABS composite and DLS > 70 and a “highest-functioning subgroup” based on Raven estimated NVIQ ≥ 100. This resulted in subgroups ranging from 217 participants (Module 3, VABS Composite and DLS ≥ 70) to 399 participants (Module 3, Raven estimated NVIQ ≥ 100). It was necessary to divide groups in this manner due to missing data for many participants (e.g., the Raven was not administered to participants if they were too low-functioning, too young, or engaged in difficult behaviors which precluded test administration).
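The subgroup rules above can be expressed as a small selection function. This is a sketch under stated assumptions: the `Participant` fields and the `assign_subgroups` function are hypothetical names, age is taken in completed years, and "VABS composite and DLS" is read as requiring both scores to meet the threshold. Missing scores are represented as `None`, reflecting that not every instrument was administered to every participant.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Participant:
    module: int                             # ADOS module (1 or 3)
    age_years: int                          # age in completed years (assumption)
    vabs_composite: Optional[float] = None  # VABS overall composite
    vabs_dls: Optional[float] = None        # VABS Daily Living Skills
    raven_nviq: Optional[float] = None      # Raven estimated NVIQ


def assign_subgroups(p: Participant) -> Set[str]:
    """Return the labels of every subgroup this participant falls into."""
    labels: Set[str] = set()
    has_vabs = p.vabs_composite is not None and p.vabs_dls is not None
    if p.module == 1:
        if p.age_years <= 6:
            labels.add("Module 1 youngest")
        if has_vabs and p.vabs_composite < 55 and p.vabs_dls < 55:
            labels.add("Module 1 lowest-functioning")
    elif p.module == 3:
        if p.age_years >= 10:
            labels.add("Module 3 oldest")
        if has_vabs and p.vabs_composite <= 70 and p.vabs_dls <= 70:
            labels.add("Module 3 lowest-functioning")
        if has_vabs and p.vabs_composite > 70 and p.vabs_dls > 70:
            labels.add("Module 3 highest-functioning (VABS)")
        if p.raven_nviq is not None and p.raven_nviq >= 100:
            labels.add("Module 3 highest-functioning (Raven)")
    return labels
```

Because a participant can carry several labels at once, the subgroups produced this way are not independent of each other. Under the reading adopted here, a participant whose two VABS scores fall on opposite sides of a threshold belongs to neither functioning subgroup.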

Results

Results are presented by module, with Module 1 first. Unless specified otherwise, results are presented for total sample analyses.

Module 1

Model Fit

Selected indices of fit for Module 1, all models, for the total sample and both subsamples are presented in Table 4.

Table 4 Module 1 indices of fit across models

Within the total Module 1 sample (N = 720), RMSEAs ranged from 0.056 (DSM-IV model) to 0.062 (the one-factor model). RMSEA confidence intervals overlapped, indicating that based on the RMSEA, no single model fit the data significantly better than other models. However, the DSM-IV model performed somewhat better than the other models. When other measures of model fit were considered, all three models had acceptable to good fits. In general, subsample fits were an improvement over the total sample fits. Fits tended to be best for each model within the lowest-functioning children.

Factor Loadings

Overall, the mean factor loadings for the total sample analyses were on the order of 0.50–0.60 for social and communication factors and 0.35–0.40 for the RRB and RRB/L factors. For the one-factor model, the mean item loading was 0.50. For the DSM-IV model, the average loadings were: social = 0.59, communication = 0.51, and RRB = 0.37. For the DSM-V model, mean loadings were: social-communication = 0.54, and RRB/L = 0.41.

Association between Factors

The inter-factor correlations for each model were quite high. For the DSM-IV model the intercorrelations were: social and communication factors = 0.86, social and RRBs = 0.75, and communication and RRBs = 0.92. For the DSM-V model, correlation between factors was 0.89.
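To illustrate what a given inter-factor correlation implies at the item level, the sketch below applies the standard result for a simple-structure factor model: the model-implied correlation between two items is the product of their loadings, multiplied by the correlation between their factors when they load on different factors. The function name is hypothetical, and the mean loadings reported above are used only as stand-ins for individual item loadings, which are not given in the text.

```python
def implied_item_correlation(loading_a, loading_b, factor_correlation=1.0):
    """Model-implied correlation between two items in a simple-structure
    factor model. Leave factor_correlation at 1.0 for items that load on
    the same factor."""
    for value in (loading_a, loading_b, factor_correlation):
        if not -1.0 <= value <= 1.0:
            raise ValueError("loadings and correlations must lie in [-1, 1]")
    return loading_a * loading_b * factor_correlation
```

With the Module 1 DSM-IV mean loadings, a typical social item (0.59) and a typical RRB item (0.37) on factors correlated at 0.75 would have an implied correlation of about 0.16, compared with about 0.35 for two typical social items. High inter-factor correlations therefore coexist with fairly modest correlations between individual items.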

Module 3

Model Fit

Selected indices of fit for Module 3, all models, for the total sample and all subsamples are presented in Table 5.

Table 5 Module 3 indices of fit across models

Within the total Module 3 sample (N = 689) RMSEAs ranged from 0.076 (DSM-V model) to 0.083 (the one-factor model). As within Module 1, confidence intervals for RMSEAs overlapped between models, indicating similarities in model fit across models. Overall, the DSM-V model was slightly preferable as two indices fell within acceptable ranges.

The only subgroup analysis that yielded slightly better indices of fit than the total sample was with children over 10 years (n = 277). Analyses with the two highest-functioning subgroups yielded poor fit, and the poorest fits for each model were found with the lowest-functioning subgroup (n = 299). As with the total sample, there was little difference between the three models in the subgroup analyses.

Factor Loadings

As in Module 1, the average factor loadings were approximately 0.50–0.60 for social and communication factors and 0.40 for the RRB and RRB/L factors. For the one-factor model, the mean loading was 0.47. For the DSM-IV model, the mean loadings were: social = 0.57, communication = 0.51, and RRB = 0.43. For the DSM-V model, mean loadings were: social-communication = 0.57 and RRB/L = 0.41.

Association between Factors

Inter-factor correlations for each model ranged from modest to strong. Inter-factor correlations for the DSM-IV model were: between social and communication factors = 0.93, between social and RRBs = 0.36, and between communication and RRBs = 0.33. The DSM-V model inter-factor correlation was 0.48.

Discussion

This study examined the structure of ASD symptoms with items from the ADOS in a large group of children and adolescents with ASDs. Taken as a whole, the analyses did not support any single model over others. They also suggested that development may impact how symptoms coalesce. Indeed, data from Modules 1 and 3 yielded significantly different results, despite having most items in common. Overall, analyses in Module 1 yielded better fits across all models than analyses in Module 3. Indices of fit for Module 1 were acceptable, but suggested room for improvement. However, within Module 3, most indices were poor or less than acceptable. This pattern was more evident in subgroup analyses: Module 1 fits generally improved within subgroup analyses, whereas they generally worsened in Module 3. These results may help explain why so many plausible solutions are proposed in the literature: ultimately, there may not be much difference between the models, and fits may vary significantly with sample characteristics.

In Module 1, there was little differentiation between models. There is little reason to prefer one model over the others given that they all have some theoretical support. That being said, the DSM-IV model had slightly better indices of fit than other models, which echoes findings with the ADI-R and ECI/CSI (Lecavalier et al. 2006, 2009). Fit indices within subgroups generally did improve a little, but not to the extent that any model was clearly preferable.

There was no clear-cut winner between models in Module 3, but the DSM-V model did appear to perform marginally better than the others. This is encouraging as the DSM-V criteria have only recently been proposed and have yet to be studied. Although indices of fit in Module 3 were worse than in Module 1, they were comparable to several previous studies. For instance, Frazier et al. (2008) reported the following indices for the best models in their analyses: RMSEA = 0.07, CFI = 0.94, and NNFI = 0.92. Georgiades et al. (2007) found an RMSEA of 0.067, SRMR of 0.08, CFI of 0.92, and NNFI of 0.90 for their proposed model. Fit indices within subgroups were more variable than in Module 1, with model fit improving in the oldest subgroup but generally worsening in the other subgroups.

Sample composition is an important variable to consider in factor analytic studies. A possible explanation for the notable difference in fits across modules could be that Module 1 participants had more severe ASD symptoms than Module 3 participants, resulting in more clear-cut symptom relationships to latent factors. ADOS domain scores and autism diagnosis rates support this idea. Module 1 participants were generally lower functioning than those in Module 3. Comparatively, Module 3 may have been administered to individuals with a wider range of functioning, which may have impacted results.

Though analyses of the more homogenous subgroups did not yield expected results, they did provide evidence that sample characteristics impact fit. Fit indices changed, sometimes drastically, when homogenous subgroups were analyzed. The poorest indices of fit of all analyses were found in the lowest-functioning subgroup of Module 3. This could be a reflection of Module 3 being less appropriate for lower-functioning individuals. The best fits were found in the lowest functioning subgroup of Module 1. This may indicate that autism symptoms, at least as they are currently defined by diagnostic instruments, are best measured in relatively young, lower-functioning individuals. This converges with previous work which has shown that DSM-IV criteria are most accurate in school-aged children with mild-to-moderate delays in functioning (see Lord and Corsello 2005).

Association between Factors

In Module 1, inter-factor correlations indicated a strong relationship between factors in all models (range from 0.75 to 0.89). Even the RRB and RRB/L correlated highly with other factors. This is in contrast to previous research indicating that the domains of ASD symptoms may be relatively distinct (e.g., Ronald et al. 2005; Ronald et al. 2006). However, the two studies by Ronald and colleagues were composed of community-based samples, not ASD-specific samples, which likely resulted in greater variability in scores. Indeed, other studies have reported a strong relationship between domains, including the RRB domain (e.g., Wing and Gould 1979; Lecavalier et al. 2009).

In Module 3, inter-factor correlations were still quite strong between social and communication factors, but much lower between RRB and RRB/L factors and other domains as compared to Module 1. Together, these findings may suggest that core ASD symptom domains become more distinct with development. These findings also converge with other studies (e.g., Charman et al. 2005) showing that the developmental trajectories of domains do indeed differ. The very high inter-factor correlations between social and communication factors across both modules provide some empirical support for collapsing these domains into one (as with the DSM-V model). However, the variability in inter-factor correlations for RRB and RRB/L factors across modules is less decisive. Research into the relationship between domains has been inconclusive, and further examination of this topic is necessary.

Considering both modules, one important point is that overall, the newly proposed DSM-V criteria performed as well as the other models. This is one of the first studies to explicitly evaluate this new model and to provide some evidence supporting it. The original conceptualization of autism discussed two main symptom domains: social/communication impairments and RRBs (Kanner 1943). This study provides evidence that the social and communication domains may be so closely related that they represent one factor (as in Kanner’s work and the recent DSM-V conceptualization). However, as discussed above, inter-factor correlations across modules for the DSM-IV RRB and DSM-V RRB/L factors were more variable. In summary, this indicates the DSM-V model is promising, but the RRB/L domain may need further examination.

The primary method by which models were compared in this study was by examining different indices of fit, most notably the RMSEA. It is important to note that the guidelines for interpretation of these measures are arbitrary. Researchers need to consider several measures of fit as well as clinical and theoretical meaningfulness of solutions. Even unacceptable fits can represent progress when compared to previous results (Bollen 1989). Moreover, not all indices of fit yield similar conclusions. For instance, there is less variability in NNFI values across modules and subsamples compared to RMSEA values. Furthermore, it is not clear how indices of fit translate into clinical significance. A model with an RMSEA of 0.03 will not necessarily be more clinically meaningful than a model with an RMSEA of 0.06. Research into this topic is lacking.

Limitations

Almost all participants in this study came from multiplex families. Because of this, results may not generalize to individuals from simplex families. Additionally, all individuals had a significant level of ASD symptoms. This can lead to decreased variability in the data. Some researchers have argued that studying autistic symptoms among individuals with ASD may artificially inflate associations (e.g., Happé et al. 2006). Further examination of models within community samples or family members with ASD symptoms without a full-blown ASD might increase variability in scores and provide valuable information in understanding the structure of ASD symptoms. Additionally, because the ADOS controls somewhat for language level, age, and level of functioning, even total sample analyses were (theoretically) being performed on somewhat homogenous samples. This may have led to less variability in scores, which could have resulted in less differentiation between models.

This was a sample of convenience. The nature of the AGRE database dictates that not every child is rated on each instrument, due to age, level of functioning, or behavior difficulties. For this reason, some information (e.g., NVIQ scores) was not available for some participants. Additionally, creating more homogenous subgroups to examine the impact of age and level of functioning was difficult due to missing information. It was impossible to use both IQ and adaptive behavior to create subgroups for level of functioning, and if subgroups had been independent of each other, they would not have been large enough for CFA to be performed. For these reasons results of subgroup analyses, especially those based on level of functioning, should be interpreted with caution.

Future Directions

Elucidation of the structure of ASD symptoms will necessitate different research strategies and data sources. Continued work with large samples and homogenous subgroups offers a number of advantages. If the trend for large research databases continues, such studies may become possible sooner rather than later. Used in isolation, factor analysis is unlikely to yield a completely satisfying answer to the question: “what is the structure of ASD symptoms?” With the variety of etiologies related to the variable phenotypes, this question is obviously a complicated one. Genetic and family studies suggest that a single underlying etiology is unlikely to explain the observed behavioral heterogeneity. For example, some studies have demonstrated that over half the genes which contribute to variation in one symptom domain are independent from those responsible for variation in the other domains (Ronald et al. 2005; Ronald et al. 2006). Ultimately, a growing body of literature is indicating that many different phenotypes make up ASDs, each associated with different symptom profiles, and each possibly with distinct etiologies (Abrahams and Geschwind 2008).