Introduction

The Autism Diagnostic Observation Schedule (ADOS) is a semi-structured, standardized assessment of communication, social interaction, play, and imagination designed for use in diagnostic evaluations of individuals referred for a possible Autism Spectrum Disorder (ASD). For both research and clinical diagnostic purposes, the ADOS is intended to complement information obtained from developmental tests and a caregiver history, such as the Autism Diagnostic Interview-Revised (ADI-R; Rutter, LeCouteur, & Lord, 2003). The ADOS encompasses four modules, each with its own schedule of activities that allow examiners to observe behavior in participants of particular developmental and language levels, ranging from those with no expressive language to verbally fluent children and adults. Items are scored on a 4-point scale, with the highest scores of 2 and 3 collapsed in the algorithm in order to reduce impact of individual items.

To receive an ADOS classification of Autism or ASD, an individual’s scores must meet separate cut-offs in a Communication domain, a Social domain, and a summation of the two. To date, the ADOS has been effective in categorizing children who definitely have autism or not, but has had lower specificity and sometimes sensitivity for distinctions involving children with milder ASDs (Lord et al., 2000; Bishop & Norbury, 2002; de Bildt et al., 2004). In the original norming sample for Modules 1–3, the ADOS generally achieved 94% correct classification. The exceptions were the ASD versus Non-spectrum Module 2 specificity of 87% and Module 3 sensitivity of 90%, and the Pervasive Developmental Disorder-Not Otherwise Specified (PDD-NOS) versus Non-spectrum Module 2 specificity of 88% and sensitivity of 89% and Module 3 sensitivity of 80% (Lord, Rutter, DiLavore, & Risi, 1999). That sample included 188 children and adolescents (recipients of Modules 1–3), with at most 21 participants in a diagnostic group per module; the 2000 data were published with a request for replication in larger samples.
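The three-threshold rule described above can be sketched in a few lines. This is only an illustration: the cut-off values are placeholders chosen for the example and should not be read as the published thresholds for any module, and the names `Cutoffs`, `meets`, and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    """Thresholds for one classification level (illustrative values only)."""
    communication: int
    social: int
    combined: int


def meets(comm: int, social: int, c: Cutoffs) -> bool:
    """True only if both domain totals and their sum reach the thresholds."""
    return (comm >= c.communication
            and social >= c.social
            and comm + social >= c.combined)


def classify(comm: int, social: int, autism: Cutoffs, asd: Cutoffs) -> str:
    """Return the most severe classification whose three cut-offs are all met."""
    if meets(comm, social, autism):
        return "Autism"
    if meets(comm, social, asd):
        return "ASD"
    return "Non-spectrum"


# Placeholder thresholds for illustration; not taken from the ADOS manual.
AUTISM = Cutoffs(communication=5, social=8, combined=14)
ASD = Cutoffs(communication=3, social=5, combined=9)

print(classify(6, 9, AUTISM, ASD))  # all three Autism thresholds met
print(classify(6, 6, AUTISM, ASD))  # Social total too low for Autism
print(classify(2, 9, AUTISM, ASD))  # Communication cut-off missed at both levels
```

Because each domain has its own threshold, a high total in one domain cannot compensate for a low total in the other, which is the property the third call illustrates.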

Despite the initial evidence for strong validity in classifying ASDs, several concerns can be raised about using the ADOS, including floor and ceiling effects in the current algorithm totals and the effect of level of impairment. The ADOS scores in the norming sample had minimal association with verbal mental age (Lord et al., 1999), and divisions into modules ensured that two individuals functioning at the same mental age would receive the same schedule of activities and thus receive scores on the same items, regardless of their chronological age. However, an 8- and a 4-year-old who primarily used simple phrases would be scored similarly on the same ADOS algorithm, despite a clear difference in their levels of developmental impairment. In 2002, Joseph et al. (Joseph, Tager-Flusberg, & Lord, 2002) reported that ADOS social domain totals were correlated with level of cognitive impairment for preschool children and with the discrepancy between verbal and non-verbal IQ as measured on the Differential Ability Scales (DAS; Elliott, 1990). Data from de Bildt et al. (2004) suggested that, in a sample of children with mental retardation (MR), ADOS classifications appeared to be least valid for children with mild MR. The exact role that chronological age plays in the issue of impairment level is unclear; de Bildt et al. reported that ADOS sensitivity increased with age in their mentally retarded sample (2004), while Lord et al. found the opposite effect in the original norming sample, which had a smaller percentage of participants with MR (2000). Results from Bishop and Norbury (2002) suggest the ADOS may also be overinclusive with children with specific language impairments, though Noterdaeme et al. found excellent agreement between ADOS and clinicians’ classification in their language-impaired sample (Noterdaeme, Mildenberger, Sitter, & Amorosa, 2002).

One approach to improve the sensitivity and specificity while possibly reducing the age and IQ effects of the ADOS is to divide the sample into smaller, more homogeneous cells by developmental level, language level, or age, and then create algorithms composed of the items that best differentiated between clinical diagnoses within each cell. Because our goal was to generate improvements that could be used with existing data, the current ADOS divisions by module were retained. An eventual goal in ASD identification is to integrate information from the ADOS and the ADI-R for individual cases; thus, at a minimum, we also wanted the new groupings proposed to be comparable to ADI-R distinctions of language level.

Previous factor analyses of the ADOS identified one factor underlying the social and communication domain items (Robertson, Tanguay, L’Ecuyer, Sims, & Waltrip, 1999; Lord et al., 1999, 2000), though separating the two domains yielded slightly higher specificity. Although the two domains are considered separately in DSM-IV and ICD-10, several recent studies have suggested that non-verbal communication and social items often load onto the same factor (Robertson et al., 1999; Constantino et al., 2004). At issue, too, is the inclusion of restricted, repetitive behavior (RRB) items in an ADOS diagnostic total. Currently, these items appear on the algorithm but do not contribute to the total score that results in a spectrum/non-spectrum classification. This decision was based on the narrow window of time available to observe such behaviors in the context of the administration (Lord et al., 2000). However, recent findings suggest that RRB, even in the limited context of the ADOS, may make an independent contribution to diagnostic stability (Lord et al., 2006). Another reason for reviewing the current ADOS algorithms was to test whether including RRB items in the total score, so that they contribute to the total but are not required for a classification of autism, increased the validity of the measure.

The goal of the present research is to address these topics in order to improve the sensitivity and specificity of the ADOS algorithms for Modules 1 through 3, and to test the feasibility of employing items of similar conceptual content, though developmentally graded, in algorithms across all three modules used with children, allowing for easier comparison of cases. This endeavor is the first step of a larger project that aims to use ADOS algorithm scores from existing data to generate a calibrated metric of severity of autism, as independent as possible of current language levels. For ease in scoring, we wanted to create an algorithm with fewer items, chosen from items with the best possible diagnostic distinctions for increased predictive value of the measure and organized to remain as consistent as possible across developmental cells while maintaining or improving classification performance. Because we suspected floor and/or ceiling effects may occur in the current ADOS totals, we began with all available items as options for inclusion in a new algorithm, instead of attempting to adjust the item pool of the current algorithm. Our goal is to improve the usefulness of the ADOS in quantifying social-communicative deficits and in making more difficult diagnostic distinctions between ASD and other disorders.

Methods

Participants

Analyses were conducted on data from 1,139 different participants. Some participants had repeated assessments yielding a total of 1,630 cases (each case was defined by complete data from a contemporaneous ADOS, verbal IQ, and best estimate clinical diagnosis); thus one participant could provide data for two or three cases based on evaluations conducted at different points in time. From the sample of 1,630, 321 were given an ADOS precursor, the Pre-Linguistic ADOS (PL-ADOS), from which scores on identical items were recoded as Module 1 ADOS scores. The majority of participants completed a diagnostic evaluation at the University of Chicago Developmental Disorders Clinic or the University of Michigan Autism and Communication Disorders Center (UMACC). The rest participated in a longitudinal study conducted through TEACCH Centers at the University of North Carolina, Chapel Hill, and a clinic at the University of Chicago, or in recent, ongoing studies at UMACC, in which participants with non-ASD developmental delays, ASD-affected sibling pairs, or children between 12 and 36 months of age who fail a social-communication screener are recruited for a comprehensive evaluation. The sample was limited to participants aged 12 years or younger for Modules 1 and 2, and 16 years or younger for Module 3. The resulting age range of the sample is 14–192 months. Because older adolescents and adults with ASD were seen as a behaviorally distinct group that merited individual study, ADOS Module 4 recipients were excluded from the outset.

The final dataset included 912 cases with clinical diagnoses of autism (56% of entire sample), 439 with non-autism ASD (27%), and 279 with non-ASD developmental delays (17%). Within the non-spectrum sample of 279 cases, 115 had non-specific MR (41% of non-spectrum total), 58 were cases with language disorders (21%), 35 with oppositional defiant disorder, ADD and/or ADHD (12%), 38 with Down syndrome (14%), 16 with mood and/or anxiety disorders (6%), and 17 with an unspecified early delay (6%). Refer to Table 1 for a more detailed description of this sample.

Table 1 Sample description

Gender varied across module and diagnostic group from 57 to 86% male. Ethnicity across module and diagnostic group ranged from 71 to 91% Caucasian, 4–27% African American, 1–5% Asian American, 0–0.8% Native American, 0–2.2% biracial, and 0–0.6% other or unknown race.

Measures and Procedure

The ADOS was administered by a clinical psychologist or a trainee who had completed research training and met standard requirements for research reliability (Lord et al., 1999). A developmental hierarchy of psychometric measures, most frequently the Mullen Scales of Early Learning (MSEL; Mullen, 1995) and the DAS (Elliott, 1990), was used to determine IQ scores. Cognitive testing generally took place immediately before the ADOS administration. The ADI-R was available for 1,357 cases in our sample and the Vineland Adaptive Behavior Scales (VABS; Sparrow, Balla, & Cicchetti, 1984) for 1,409 cases. These two measures were administered together during a parent appointment that generally preceded the child assessment. Clinicians (usually a clinical psychologist and child psychiatrist) involved in each case together determined a best estimate diagnosis after review of all information. Clinic-referred participants received oral feedback and a written report without financial compensation. Participants who were recruited only for the purpose of research received compensation and a written summary of evaluation results. All procedures related to this research were approved by the Institutional Review Boards at the University of Chicago or the University of Michigan.

Inter-rater reliability on the ADOS was monitored through joint administration and scoring by two different examiners for at least 1 in 10 cases and, in some cases, through scoring of videotapes. Agreement remained at greater than 85%. Disagreements were resolved through discussion. Within this sample, 26 different examiners collected the data from the ADOS over 10 years.

Design and Analysis

The ADOS domain means were compared by module and diagnosis for the current sample and the original ADOS norming sample. Domain total distributions for this sample were generated within each module. When distributions appeared to exhibit floor or ceiling effects, items within that domain were evaluated to identify individual variables contributing to the effect. Correlations between ADOS totals and chronological age, verbal IQ, and verbal mental age were examined, and where possible, the sample was divided by age and language level within each module to yield cells with lower correlations between the ADOS totals and these variables. At that point, item distributions were considered within each cell in order to select those that best differentiated between diagnoses.

Each item in each cell was labeled as preferred or not preferred for inclusion in a new algorithm, with inclusion criteria generated from and applied to social-communication items, but not RRB items, which had an expected diversity (Bishop, Richler, & Lord, 2006). The criteria specified no more than 20% of autism cases scoring a zero on an item, and no more than 20% of non-spectrum cases scoring a 2 or 3. The former percentage was allowed to rise to 27–45% for two theoretically important items which performed well in some but not all of the cells (“Gestures” for Module 3 and “Shared Enjoyment” for Modules 2 and 3). From this pool of preferred items, roughly equivalent items across modules were selected, so as to promote a conceptually uniform model across modules that would enhance inter-module comparisons.
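The two-part inclusion rule can be expressed as a simple check on the item-score distributions within a cell. The sketch below assumes unscorable codes have already been removed; the function name `is_preferred` and the toy scores are hypothetical and serve only to illustrate the rule.

```python
from typing import Sequence


def is_preferred(autism_scores: Sequence[int],
                 nonspectrum_scores: Sequence[int],
                 max_autism_zero: float = 0.20,
                 max_ns_high: float = 0.20) -> bool:
    """Apply the inclusion rule to one item within one developmental cell.

    An item is preferred when at most `max_autism_zero` of autism cases
    score 0 and at most `max_ns_high` of non-spectrum cases score 2 or 3.
    """
    if not autism_scores or not nonspectrum_scores:
        raise ValueError("both groups need at least one scored case")
    prop_autism_zero = sum(s == 0 for s in autism_scores) / len(autism_scores)
    prop_ns_high = sum(s >= 2 for s in nonspectrum_scores) / len(nonspectrum_scores)
    return prop_autism_zero <= max_autism_zero and prop_ns_high <= max_ns_high


# Made-up item codes for ten cases per group (not study data).
autism = [2, 2, 1, 3, 0, 2, 1, 2, 3, 2]        # 10% score 0
nonspectrum = [0, 0, 1, 0, 2, 0, 1, 0, 0, 0]   # 10% score 2 or 3
print(is_preferred(autism, nonspectrum))

# An item on which 30% of autism cases score 0 fails the default rule
# but passes when the first threshold is relaxed, as was done for two items.
weak = [0, 0, 0, 2, 1, 2, 3, 2, 1, 2]
print(is_preferred(weak, nonspectrum))
print(is_preferred(weak, nonspectrum, max_autism_zero=0.45))
```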

Exploratory multi-factor item response analysis provided insight into the factor structure within each cell and was used to organize the items into new domains. All factor analyses reported here employed Mplus software (Muthen & Muthen, 1998) to address the ordinal nature of ADOS data. In an effort to balance getting the best fit by cell with having one model consistent across cells, factor loadings from promax oblique rotations were used to select better-performing items across developmental cells in a theoretically meaningful way. For example, where the item “Pointing” had failed to differentiate effectively between diagnoses in one cell, “Response to Joint Attention” replaced it; the item was theoretically similar (relating to shared interest) and loaded on the same factor. Goodness-of-fit was verified through confirmatory factor analysis, and logistic regression was used to examine the weighting of the two domains in view of the relative predictive value of scores from the different factors.

Domain total distributions of the new algorithm model were assessed for floor and ceiling effects, and correlations were generated between items and the remainder of the domain, as well as between items and participant characteristics like age and IQ. The ROC curves were calculated, and the sensitivity and specificity of the existing and newly revised ADOS algorithms compared within each cell. Since adjustment for the minority of subjects with multiple observations left the factor analysis results reported here largely unchanged, no adjustment has been made. Reported logistic regression coefficients for predicting diagnosis were adjusted using cluster robust standard errors, confidence intervals, and test statistics (Binder, 1983).

Results

Comparison of Domain Means

The ADOS domain total means and standard deviations were calculated for this sample in order to compare them to those of the original ADOS norming sample (Lord et al., 1999). For this comparison only, data from the original norming sample were removed from the current sample. As expected with a sample of the current size, mean differences in chronological age, verbal mental age, and non-verbal mental age between the module and diagnostic groups of the norming and current samples were statistically significant; however, they were clinically marginal. For example, the mean chronological age of Module 3 Autism groups was 8.45 years in the new sample (N = 123; SD = 2.51) and 9.14 years in the original norming sample (N = 21; SD = 2.36).

The ADOS domain means of the autism and ASD samples were similar, with a trend toward slightly lower means in Communication and Social domains for the Autism groups and slightly higher means in the Restricted-Repetitive domain for the non-autism ASD groups in the current larger sample. As examples, the mean combined social-communication totals in the Module 2 Autism groups were 18.38 in the norming sample (N = 21) and 18.11 in the current sample (N = 171), while the means for the Module 2 non-autism ASD groups were 11.83 (N = 18) in the norming sample and 11.94 (N = 83) in the current sample. On the whole, the similarities in domain distributions with greater numbers and a more diverse population suggest it would be appropriate to apply a new algorithm calibrated on this new sample to existing research databases.

Domain Total Distributions

In the original algorithm communication domain total, Module 1 scores of 8 were frequent (22.2% of Module 1 ASD sample), while scores of 9 or 10 were very rarely achieved (a total of 2.1% of Module 1 ASD cases received either score), implying an item set that provided little discrimination at the severe end of the spectrum. This effective range restriction was associated with the fact that 62% of the Module 1 ASD participants were non-verbal (i.e., participants used fewer than five words during the ADOS administration, as reflected in scores of 3 or 8 on Item 1: overall level of language [or a score of ‘4’ on the equivalent PL-ADOS item]). For these children, only four items were scorable in the algorithm communication total. Scores of 9 and 10 were largely unattainable because the algorithm items “Stereotyped/Idiosyncratic Use of Language” and “Frequency of Vocalization” were unscorable for non-verbal participants, and thus did not contribute to the domain totals. Across all modules, distributions were broadened considerably when “3” codes were not recoded to “2” prior to algorithm calculation, but because standard use of the ADOS does not require reliable distinctions between these codes, our primary data analyses focused on continued use of this recoding. Because of the range restriction, we proposed the creation of distinct algorithms for verbal and non-verbal recipients of Module 1.

Correlations with Participant Characteristics

In Module 2, ASD participants’ social-communication totals were significantly correlated with their chronological age (r = 0.24, p < 0.001) and verbal IQ (r = −0.34, p < 0.001). Perusal of scatterplots provided evidence of curvilinear relationships between chronological age and ADOS social-communication totals, such that these variables were negatively related in children under age 5 (as age increased, ADOS scores decreased) and positively related in children over 5 (as age increased, ADOS scores increased). The children over 5 with phrase-speech-only seemed to represent a different group than children under 5 in Module 2, who may well acquire fluent speech as they get older. When Module 2 was split into “Younger than 5” and “Greater or Equal to 5,” the former group showed no significant correlation between the social-communication total and either age or verbal IQ (age: r = −0.06, p = 0.52; VIQ: r = −0.16, p = 0.08); in the latter group, scores were still significantly correlated with these variables (age: r = 0.16, p = 0.04; VIQ: r = −0.34, p < 0.001), but less so in the case of age than before the division.

Division of the Modules

Dividing the sample by age in Module 3 did not produce more homogeneous samples. Dividing Module 3 recipients by language level was deemed not appropriate because past data had shown that “Item 1: Overall Level of Language” in Module 3 was particularly difficult to score reliably. Division based on specific item scores, such as “Reporting of Events,” met with little success. Thus, at this point, the cells for which revised algorithms have been formulated are Module 1, No Words; Module 1, Some Words; Module 2 Younger than 5; Module 2, 5 or Older; and Module 3 (Fig. 1).
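The assignment of a case to one of the five developmental cells can be sketched as a small lookup. The sketch assumes that "No Words" in Module 1 is identified by an Item 1 code of 3 or 8, as described for non-verbal participants, and that the Module 2 split falls at 60 months; the equivalent PL-ADOS code is not handled, and the function name `developmental_cell` is hypothetical.

```python
from typing import Optional

# Module 1, Item 1 codes treated as non-verbal (assumption taken from the
# description of non-verbal participants; PL-ADOS coding is not covered).
NO_WORDS_CODES = {3, 8}


def developmental_cell(module: int, age_months: int,
                       item1_code: Optional[int] = None) -> str:
    """Map a case to the developmental cell used for the revised algorithms."""
    if module == 1:
        if item1_code is None:
            raise ValueError("Module 1 needs the Item 1 language code")
        if item1_code in NO_WORDS_CODES:
            return "Module 1, No Words"
        return "Module 1, Some Words"
    if module == 2:
        # Split at 5 years (60 months): "Younger than 5" vs "5 or Older".
        if age_months < 60:
            return "Module 2, Younger than 5"
        return "Module 2, 5 or Older"
    if module == 3:
        return "Module 3"
    raise ValueError("only Modules 1-3 are covered by the revised algorithms")


print(developmental_cell(1, 30, item1_code=8))
print(developmental_cell(1, 42, item1_code=1))
print(developmental_cell(2, 59))
print(developmental_cell(2, 60))
print(developmental_cell(3, 110))
```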

Fig. 1 Revised algorithm developmental cells

Factor Analysis

Exploratory factor analysis was performed by cell (Fig. 1) with the ‘preferred’ items included from all domains. Item scores of 2 and 3 were collapsed and scores of 8 were labeled as missing data and excluded. Because ADOS data are ordinal and do not represent equal intervals, the analyses were run as ordinal probit item response models with Mplus Version 3.0 software. A Root Mean Square Error of Approximation (RMSEA) of 0.08 or less is commonly taken as a satisfactory fit (Browne & Cudeck, 1993). The results shown in Table 2 indicate that 2-factor solutions generally fitted well, with items loading onto clear Social Affect (SA) and RRB factors that were positively correlated. Confirmatory factor analysis, which assigned each item to one of two factors, showed the 2-factor model to fit substantially better than the 1-factor model, with goodness-of-fit ratings ranging from a Comparative Fit Index (CFI) of 0.94 (CFI between 0.9 and 1 indicating good fit; Skrondal and Rabe-Hesketh, 2004) and RMSEA of 0.08 in the Module 3 cell to a CFI of 0.97 and RMSEA of 0.09 in the Module 1, Some Words cell. Thus, the final mapping of the new algorithm model includes a Social Affect domain and a Restricted-Repetitive domain (Table 2).
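For readers less familiar with the two fit indices, their usual textbook definitions are given below. These are the standard forms and are stated here as an assumption about what was computed; the exact variants produced by the software for ordinal data may differ. Here $\chi^2_M$ and $df_M$ refer to the fitted model, $\chi^2_B$ and $df_B$ to the baseline (independence) model, and $N$ is the number of cases.

```latex
\mathrm{RMSEA} = \sqrt{\max\!\left(\frac{\chi^2_M - df_M}{df_M\,(N-1)},\; 0\right)}
\qquad
\mathrm{CFI} = 1 - \frac{\max\left(\chi^2_M - df_M,\; 0\right)}
                        {\max\left(\chi^2_M - df_M,\; \chi^2_B - df_B,\; 0\right)}
```

Lower RMSEA and higher CFI both indicate better fit, which is why the two indices are read against an upper bound (0.08) and a lower bound (0.9), respectively.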

Table 2 Revised algorithm mapping

Although “Stereotyped/Idiosyncratic Use of Words or Phrases” was a communication domain item in the previous algorithms, it loaded onto the RRB factor and was thus included in that domain on the new algorithms.

The eigenvalues of a third factor, called “Joint Attention,” ranged across cells from 0.93 to 1.12 in exploratory factor analysis; this factor was comprised of pointing, gesturing, showing, initiating joint attention, and unusual eye contact in the Module 1, Some Words and both Module 2 groups, and response to joint attention, gesturing, showing, initiating joint attention, and unusual eye contact in the Module 1, No Words group. Statistics from confirmatory factor analysis were satisfactory (CFI ranged from 0.92 to 0.96; RMSEA from 0.06 to 0.09) in the four relevant developmental cells. The two-factor model, however, was more consistent across the five cells and more parsimonious. Although it was not included in the algorithm and overlaps with the SA factor, the Joint Attention factor was consistent across Modules 1 and 2 and therefore may be of interest to some researchers and clinicians.

Logistic Regression Check on Weighting Domains

Logistic regressions for autism versus not-autism (non-autism ASD and non-spectrum cases together), and ASD versus non-spectrum indicated that both the SA and RRB factors made significant independent contributions to the prediction of diagnosis. Since factor scores were not uniformly better at prediction of diagnosis than simple totals, we describe results for the simple item totals that would be ordinarily used in clinical practice. We report raw and standardized log-odds coefficients, the latter being easier to compare when the predictor variables have widely differing variability.

Item totals within SA and RRB factors were both predictive of diagnosis. For children with autism versus all other groups, the raw partial log-odds coefficients were 0.29 (C.I. 0.26, 0.32; z = 16.53; standardized coefficient = 1.74) for SA and 0.36 (C.I. 0.29, 0.44; z = 9.30; standardized coefficient = 0.84) for RRB. For Autism and PDD versus no PDD, the raw partial log-odds coefficients were 0.25 (C.I. 0.20, 0.29; z = 10.52; standardized coefficient = 1.47) for SA and 0.51 (C.I. 0.38, 0.64; z = 7.92; standardized coefficient = 1.20) for RRB.

While both factors were predictive for both comparisons, the standardized coefficients and z-scores are lower for RRB than SA. In addition, it is interesting to note that, for the SA factor, the log-odds coefficients are similar for the two comparisons, but for the RRB factor, the coefficient for the autism and PDD versus no-PDD comparison appears to be larger than that for the autism versus all other groups comparison.
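As a reading aid, a raw log-odds coefficient can be converted to an odds ratio per one-point increase in a domain total by exponentiation. The sketch below uses the coefficients reported for the autism versus all other groups comparison. The second helper assumes the common convention that a standardized coefficient equals the raw coefficient multiplied by the predictor's standard deviation; that convention is an assumption here, not something stated in the analysis, and both function names are hypothetical.

```python
import math


def odds_ratio(log_odds: float, points: float = 1.0) -> float:
    """Multiplicative change in the odds for a `points`-unit increase."""
    return math.exp(log_odds * points)


def implied_sd(raw: float, standardized: float) -> float:
    """Predictor SD implied by standardized = raw * SD (assumed convention)."""
    return standardized / raw


# Reported raw and standardized coefficients, autism versus all other groups.
reported = {"SA": (0.29, 1.74), "RRB": (0.36, 0.84)}

for name, (raw, std) in reported.items():
    print(f"{name}: odds ratio per point = {odds_ratio(raw):.2f}, "
          f"implied SD = {implied_sd(raw, std):.2f}")
```

Under that convention, the larger raw coefficient for RRB is consistent with its smaller standardized coefficient, because the RRB total varies over a narrower range than the SA total.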

Item Correlations with Domain Totals, Chronological Age, Mental Age, and IQ

Item-’rest’ correlations (domain scores minus the particular item) were significant for each algorithm item in each developmental cell; they ranged from 0.45 to 0.78 in the SA domain and 0.27–0.53 in the RRB domain. The two domains were significantly correlated with each other (0.34–0.57 by cell). Internal consistency was assessed using Cronbach’s alpha (Cronbach, 1951). Cronbach’s alphas were consistently highest for the SA domain (0.87–0.92 by developmental cell) and ranged from 0.51 to 0.66 in the Restricted, Repetitive domain.
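The two internal-consistency statistics reported above can be sketched with the standard library alone. The sketch assumes complete data and Pearson correlations (the type of correlation is not specified in the text), and the toy scores are made up for illustration; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; rows[i][j] is the score of case i on item j."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(rows: Sequence[Sequence[float]]) -> List[float]:
    """Correlate each item with the domain total minus that item."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]
        result.append(pearson(item, rest))
    return result


# Made-up scores for six cases on four items (not study data).
toy = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 2, 1, 1],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [0, 0, 0, 1],
]
print(round(cronbach_alpha(toy), 2))
print([round(r, 2) for r in item_rest_correlations(toy)])
```

Removing the item from the total before correlating avoids inflating the correlation by correlating the item partly with itself.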

Item correlations with age and verbal mental age were also reviewed. “Intonation” in Module 1, No Words was the only item correlated above 0.30 with chronological age (r = 0.45, p < 0.001). Seven items across cells showed correlations with verbal mental age greater than 0.30; five of these items applied only to the Module 1, Some Words cell (ranging from “Unusual Eye Contact,” r = −0.32, p < 0.001 to “Showing,” r = −0.39, p < 0.001). Clearly, the delineation of children with “Some Words” in Module 1 still yields a heterogeneous group, in which social skills are related to children’s language abilities, ranging from a few single words to the use of occasional phrases.

Sensitivity and Specificity

Receiver Operating Characteristic (ROC) curves (Siegel, Vukicevic, Elliott, & Kraemer, 1989) were run to obtain the sensitivity and specificity of both the old and the new algorithms by cell. For the new algorithms, ROC curves were run twice, for items from the SA factor alone and then for the sum of items from the SA and RRB factors. As in the past, scores of 3 were recoded to 2 for this procedure. Cases with acceptable missing data (for example, an ‘8’ on the ‘stereotyped speech’ item in Module 1, Some Words) were included (contributing a zero score), but 56 cases were excluded because of other missing data from items comprising either the old or new algorithm, for a resulting N of 1,574. The new algorithm includes the item “Integration of Gaze and Other Behaviors during Social Overtures” in Module 1; because the children who received the PL-ADOS did not have this item available and yet represented a clinically important group, scores on the item “Unusual Eye Contact” were substituted for the missing item data in PL-ADOS recipients. The inclusion of the PL-ADOS cases greatly reduced specificity in the Module 1, No Words cell, the most obvious reason being the inclusion of children with very low non-verbal mental age in the Early Diagnosis sample (Lord et al., 2006), all of whom initially were assessed using the PL-ADOS. Because evaluating children with low non-verbal mental age is a reality in clinical practice, sensitivity and specificity were generated for all of Module 1, No Words cases, but were reported separately for those with non-verbal mental ages of 15 months or lower and those with non-verbal mental ages above 15 months for comparison (Table 3). Another point to note in Table 3 is that “ASD” is often reported in literature as including the Autism and non-autism ASD cases, whereas we have provided separate comparisons in this table of Autism versus Non-spectrum cases, and non-autism ASD cases (PDD-NOS and Asperger Disorder) versus Non-spectrum cases. This was done to give a true indication of how well the measure performs within the most conservative diagnostic groupings.
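The scoring and cut-off evaluation described above can be sketched as follows. The recoding rules (3 becomes 2, an acceptable ‘8’ contributes zero) follow the text; everything else, including the function names and the toy totals, is hypothetical and intended only to show how sensitivity and specificity are read off at each candidate cut-off.

```python
from typing import Iterable, List, Sequence, Tuple


def recode(code: int) -> int:
    """Map a raw item code to its algorithm value (3 -> 2, 8 -> 0)."""
    if code == 3:
        return 2
    if code == 8:
        return 0
    if code in (0, 1, 2):
        return code
    raise ValueError(f"unexpected item code: {code}")


def algorithm_total(item_codes: Iterable[int]) -> int:
    """Sum of recoded item scores for one case."""
    return sum(recode(c) for c in item_codes)


def sens_spec(totals: Sequence[int], is_case: Sequence[bool],
              cutoff: int) -> Tuple[float, float]:
    """Sensitivity and specificity when totals >= cutoff count as positive."""
    tp = fn = tn = fp = 0
    for total, case in zip(totals, is_case):
        positive = total >= cutoff
        if case and positive:
            tp += 1
        elif case:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one non-case")
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(totals: Sequence[int],
               is_case: Sequence[bool]) -> List[Tuple[int, float, float]]:
    """(cutoff, sensitivity, specificity) for every observed total."""
    return [(c, *sens_spec(totals, is_case, c)) for c in sorted(set(totals))]


# Made-up totals and diagnoses for eight cases (not study data).
totals = [2, 4, 5, 7, 9, 10, 12, 15]
is_case = [False, False, False, True, False, True, True, True]

print(algorithm_total([3, 2, 8, 1, 0]))
for cutoff, sens, spec in roc_points(totals, is_case):
    print(cutoff, round(sens, 2), round(spec, 2))
```

Raising the cut-off trades sensitivity for specificity, which is the trade-off reported for the Module 1, No Words cell.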

Table 3 Sensitivities and specificities of current and revised algorithms

For Autism versus Non-spectrum, and for ASD versus NS (Table 3), the new and old algorithms perform approximately equally well in terms of sensitivity, with the new algorithm showing slightly reduced sensitivity in some cells and notable gains in others (Module 1, Some words; AUT versus NS and ASD versus NS). For non-autism ASD versus Non-spectrum, sensitivity of the new algorithm is somewhat lower in Module 1, No Words (as was necessary to raise specificity), but shows improvement from the old algorithm in the higher-functioning Modules 1 (AUT versus NS) and 2 (ASD versus NS) cells.

The new algorithm shows substantial gains in specificity in each of the diagnostic categories. Specificity in Module 1, No Words (both non-verbal mental age groups) improves in each diagnostic comparison; the specificity of both Module 2 groups improves for non-autism ASD versus NS.

Overall, the first factor by itself tends to perform somewhat less well, so a summation of both domain totals is recommended to complete a total algorithm score. Analyses also were rerun including scores of 3 to see if using a broader distribution resulted in greater predictive value; there was little impact on sensitivity and specificity in comparison with the new totals with recoded 3’s. Further information about cut-offs using “3’s” is available from the authors.

Discussion

With a much larger, more diverse sample (in terms of participants and examiners), both domain means and sensitivity and specificity remained similar to the original norming data, indicating that the ADOS continues to be a valid and reliable measure. Used with the original norming sample, any new algorithm was unlikely to improve on the old, since the latter had been chosen to best classify that particular sample. The new sample, being so much larger, offers less scope for overfitting, and thus for achieving artifactually high levels of classification success. Nonetheless, as intended, the algorithm changes described here increase specificity in classifying non-autism ASD in lower functioning populations, evidenced by the 12–31% increase in specificity for children without any words (depending on non-verbal mental age) and the modest gain in specificity for older children who have not progressed beyond phrase speech.

A more homogeneous algorithm has been achieved, with similar items used across developmental cells to allow for easier comparison of ADOS scores within and between individuals. This is a step closer to the use of the ADOS as a measure of severity. The inclusion of repetitive behavior items in an algorithm model that is relatively uniform across developmental cells will be useful in the derivation and application of the future severity metric.

With the use of the proposed model, all items appearing on the algorithm will contribute to a single score with two classification thresholds, one for Autism and one for ASD. The existing social and communication domains were merged, as proposed in previous research (Lord et al., 1999, 2000; Robertson et al., 1999), and only very strong, salient factors were retained. Because longitudinal data suggest that trajectories between social communication and RRB profiles are different (Lord et al., 2006), the two-factor solution chosen from these analyses adds to the clinical utility of this diagnostic instrument. Inclusion of the RRB domain did not improve predictive value of the ADOS in differentiating individuals with autism from those with PDD-NOS, though surprisingly it aided in distinguishing PDD-NOS from non-spectrum cases.

Clearly some of the many goals for this algorithm revision were more difficult to achieve: the specificity of classification in children with non-verbal mental ages of 15 months and younger remained weak. For these children, ADOS cut-offs do not reliably differentiate Autism or ASD from other disorders. It may be that expectations of interaction in the ADOS are too high for passive, low-functioning children (Hepburn, Lord, John, & Rogers, Submitted). A version of the ADOS employing novel tasks for infants and toddlers is now being piloted at the UMACC to address the diagnostic needs of very young and/or more severely delayed toddlers.

Although predictive value of the ADOS for children with autism was strong across all groups, sensitivity for non-autism ASD across modules was lower than desired even in children with non-verbal mental ages above 15 months. Correlations between the revised totals and age, IQ, mental age, ADI-R, and former ADOS totals are reported here (see Tables 4, 5, 6). Division of the sample into cells accomplished the goal of creating algorithms independent of age effects (except for Module 1), and minimizing the effect of verbal IQ across modules. Greater association between ADOS and cognitive scores remained in the Module 1 cells relative to other developmental cells; this was expected due to the role of social communication in the measurement of cognitive skill in very young or low-functioning children. In fact, some of the earliest MSEL items in the language domain, as well as in other developmental tests (Bayley, 1993), overlap with social-communication items on the ADOS (e.g., MSEL item ‘Responds to voice and face by vocalizing’). Severity of expressive language impairment (though not so much current language functioning) continues to influence our interpretation of autistic symptomatology, and even with Module 1 divided further on language level, the relationship between ADOS domain scores and verbal IQ remained.

Table 4 Correlations between revised algorithm totals and participant characteristics
Table 5 Correlations between revised algorithm totals and ADI-R domain totals
Table 6 Correlations between revised algorithm totals and previous ADOS algorithm totals

The new algorithm did not greatly improve the distinction between autism and other ASDs in the ADOS. It continues to be the case that social-communication deficits within ASD (Constantino et al., 2004; Gilmour, Hill, Place, & Skuse, 2004) and those shared by ASD and other developmental or psychiatric disorders (Bishop & Norbury, 2002) appear to represent a continuous dimension. Our next step is to generate continuous scores along this dimension, based on calibrations across modules. Because we found virtually the same total distribution and predictive value when including scores of 3 in the new algorithm, the calibration effort will likely recode these to 2’s, as is common practice. For those who have become reliable on coding 3’s as well as 2’s (Rogers, Hepburn, & Wehner, 2004), retaining 3’s remains an option for increasing variability in ADOS data, particularly in treatment studies that look for changes in individuals.

Limitations

Because different thresholds were necessary to attain the best sensitivity and specificity within each developmental cell, calibration is necessary to achieve the final goal of providing a simple way to compare cases across modules. Although the goal of creating algorithms more similar across modules was met, it is ultimately limited by the fact that children receiving a Module 2, for example, still complete different tasks than do those receiving a Module 3. Shifts in modules clearly add complexity in interpreting data, but this is difficult to avoid: adequate social behavior differs with chronological age, language level, and examiner expectations, and thus must be measured by different modules of tasks. To maintain the validity of this measure, it is important that the appropriate module is selected (see also Klein-Tasman, Risi, & Lord, in press).

The sensitivity and specificity of the instrument may vary across clinics and research centers due to the skill of the examiner, sequence of administration, and other factors; therefore, we might not see the same predictive value of the instrument in other clinical or research databases (see also Risi et al., 2006). A further limitation of this study is the relatively small number of non-spectrum participants in each module.

The research protocol described here was unable to divide recipients of Module 3 into more homogeneous groups based on chronological age or language level. Comparison of item distributions between younger and older recipients of Module 3 revealed few age differences in children aged 5 and over; children under 5 who received Module 3 exhibited some mean differences at the item level, but this group represents such a small minority of Module 3 recipients that we felt it unnecessary to divide the sample on that basis alone. The sensitivity and specificity of both old and new ADOS classifications are generally lower in Module 3 than in other developmental cells. Ideally, study of higher-functioning children and verbal adolescents will lead to better understanding of autism and ASDs in these populations and perhaps inform decisions on new tasks and/or scored items for future revision of this diagnostic schedule. Modifications of the ADOS for older children and adults with single words or phrase speech are also needed in order to present more age-appropriate tasks and materials to lower-functioning individuals while preserving standardization. The extent to which one can measure more subtle social-cognitive differences in one-to-one interaction with an adult in an office visit is unknown. Information from measures such as the VABS, ADI-R, Social Responsiveness Scale (SRS; Constantino et al., 2003), and Pervasive Developmental Disorder Behavior Inventory (PDDBI; Cohen, Schmidt-Lackner, Romanczyk, & Sudhalter, 2003), which allow consideration of information from a broader range of contexts, may be critical.

A comparison of ADOS classification to clinical diagnosis, which was done to generate the sensitivity and specificity numbers reported here, is confounded by the fact that the two classifications are not independent, as the ADOS was one of the tools used to make the clinical best-estimate diagnosis. When constructing an entirely new algorithm for a new instrument, an entirely independent validation criterion is desirable as proof of validity. However, when revising an algorithm, the concern is to identify improved performance over an existing algorithm, each being measured against the best available criterion. Lord et al. (2006) have shown in a longitudinal study that clinical judgment, the ADI-R, and the ADOS all made independent contributions in predicting long-term best-estimate diagnoses. No single source can be considered either a gold standard or the best possible criterion. This suggests that calibrating the algorithm against a criterion diagnosis that excluded information from the ADOS would mean calibrating it against a potentially inferior criterion. The fact that the best-estimate diagnosis was not independent of the ADOS might upwardly bias the absolute performance of an ADOS algorithm; however, its influence on the relative performance would be slight. By contrast, the much larger sample size of this study makes it less prone than previous studies to the upward bias in absolute performance that can arise from over-fitting.
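As a minimal illustration of the quantities discussed above, the following Python sketch computes sensitivity and specificity by comparing binary instrument classifications with a criterion diagnosis. The function name and the example values are hypothetical and are not drawn from this study's sample; the sketch simply assumes that each case has one classification and one criterion label.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    classified: Sequence[bool], criterion: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of `classified` against `criterion`.

    True means "on the spectrum" in both sequences (an assumption of this sketch).
    """
    if len(classified) != len(criterion):
        raise ValueError("classified and criterion must have the same length")

    pairs = list(zip(classified, criterion))
    true_pos = sum(1 for c, d in pairs if c and d)
    false_neg = sum(1 for c, d in pairs if not c and d)
    true_neg = sum(1 for c, d in pairs if not c and not d)
    false_pos = sum(1 for c, d in pairs if c and not d)

    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one criterion-positive and one criterion-negative case")

    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical data for six cases (illustration only).
example_classified = [True, True, False, True, False, False]
example_criterion = [True, True, True, False, False, False]
example_sens, example_spec = sensitivity_specificity(example_classified, example_criterion)
```

Because the criterion here would be a best-estimate diagnosis that itself used the ADOS, figures computed this way are better read as a basis for comparing algorithms than as absolute estimates of accuracy.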

Conclusions

The satisfactory performance of the revised algorithm found here must be replicated in other research samples before it replaces the existing ADOS algorithm. The revised algorithm total can currently be calculated by adding scores from the items listed under the relevant developmental cell in Table 2, and applying the parenthetical thresholds for Autism or ASD from Table 3. Pending replication and the future calibration project, we expect a new published version of the algorithm to be provided by Western Psychological Services.
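The calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the published scoring procedure: the item names and cut-off values are placeholders (the actual items and thresholds are those given in Tables 2 and 3), item scores are assumed to be limited to 0 through 3 with 3 recoded to 2, and a total is assumed to meet a threshold when it is equal to or greater than the cut-off.

```python
from typing import Dict, Iterable


def recode(score: int) -> int:
    """Collapse an item score of 3 to 2; scores of 0, 1, and 2 are unchanged."""
    if score not in (0, 1, 2, 3):
        raise ValueError(f"unexpected item score: {score}")
    return min(score, 2)


def algorithm_total(item_scores: Dict[str, int], algorithm_items: Iterable[str]) -> int:
    """Sum the recoded scores of the items belonging to one developmental cell."""
    return sum(recode(item_scores[name]) for name in algorithm_items)


def classify(total: int, autism_cutoff: int, asd_cutoff: int) -> str:
    """Apply two thresholds to one total (assumes autism_cutoff >= asd_cutoff)."""
    if total >= autism_cutoff:
        return "Autism"
    if total >= asd_cutoff:
        return "ASD"
    return "Non-spectrum"


# Placeholder item names, scores, and cut-offs (hypothetical, for illustration only).
example_scores = {"item_a": 3, "item_b": 1, "item_c": 0, "item_d": 2}
example_items = ["item_a", "item_b", "item_d"]  # item_c is not on this cell's algorithm
example_total = algorithm_total(example_scores, example_items)
example_label = classify(example_total, autism_cutoff=5, asd_cutoff=3)
```

Keeping the recoding step separate from the summation makes it straightforward to retain scores of 3 instead, for investigators who prefer the additional variability.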

The ADOS has begun to be used in relation to neurobiological measures (Critchley et al., 2000; Schultz et al., 2000; Klin, Jones, Schultz, Volkmar, & Cohen, 2002) and continues to contribute to improved diagnosis in conjunction with the ADI-R (Risi et al., 2006). Nevertheless, researchers and clinicians must bear in mind that this measure is not a replacement for a historical account by a caregiver or for the diagnosis of a well-trained, experienced clinician. Replication across sites and across other well-defined populations with and without ASD, and further exploration of how we can best organize time-limited, clinician-structured observations of social-communication behavior to better understand and treat ASD, are all much needed.