Introduction

Currently, levels of impairment in children with autism spectrum disorders (ASD) are measured largely in terms of language delay, cognitive functioning, or behavioral issues such as aggression. While these are important factors in overall adaptive functioning, they are not core features of the autism spectrum. Measuring the relative severity of autism-specific features could contribute to our ability to accurately describe ASD phenotypes across samples and across time in clinical and treatment research. An ASD severity metric could be useful for categorizing samples into more homogeneous groups in genetic and other neurobiological studies; it would also address a need to document severity as part of clinical assessment.

At this point, measures that provide autism severity ratings, such as the Childhood Autism Rating Scale (CARS; Schopler et al. 1986), the Gilliam Autism Rating Scale (GARS; Gilliam 1995), or the Autism Behavior Checklist (ABC; Krug et al. 1980), tend to yield scores that are either strongly correlated with IQ or that do not correspond to standard measures of diagnosis (Gilliam 1995; Volkmar et al. 1988; Spiker et al. 2002; South et al. 2002; Szatmari et al. 2003). The Social Responsiveness Scale (SRS; Constantino et al. 2003) provides a method for quantifying social impairment that has shown relative independence from participant characteristics such as IQ. SRS scores are based on parent or teacher report, however, and thus a complementary measure of ASD severity that offers the opportunity to take into account the observations of an experienced clinician would be desirable.

For genetic, neuroscience, and intervention research, severity of core autism features often has been estimated using primary phenotyping measures, the Autism Diagnostic Observation Schedule (ADOS; Lord et al. 2000) and the Autism Diagnostic Interview-Revised (ADI-R; Rutter et al. 2003). While it is true that higher ADI-R and ADOS scores indicate that an individual has a greater number of items representing core deficits and/or greater severity of impairment, scores were not normalized for this purpose and vary in the degree to which they are correlated with both IQ and chronological age. Attempts to indicate severity using ADI-R item scores selected to operationalize ICD-10 criteria for the disorder proved successful in predicting the number of affected relatives of verbal probands, but not for nonverbal probands (Pickles et al. 2000). One limitation of ADI-R scores as a severity metric is that nonverbal children are not scored on roughly 25% of the total ADI-R items, and so communication domain summary scores are restricted by non-random missing data.

The ADOS, a semi-structured autism diagnostic observation, has shown strong predictive validity against best estimate diagnoses (Gotham et al. 2007), making it a common choice among phenotyping measures. In each of four developmental- and language-level dependent modules, a protocol of social presses is administered by a trained examiner, and then behavioral items relevant to ASD are scored on a 4-point scale, with 0 indicating ‘no abnormality of type specified’ and 3 indicating ‘moderate to severe abnormality.’ Specific items comprise an algorithm for each module; these items are summed and compared to thresholds, which results in a classification of “autism,” “autism spectrum disorder,” or “nonspectrum.”

Because the ADOS has been used to catalogue ASD features in large samples, ADOS raw totals are a common stand-in for a measure of autism severity. This instrument was created for diagnostic purposes, and thus was not specifically designed to facilitate longitudinal and cross-sectional comparison of data. As an individual gains language skills, he or she potentially moves through ADOS modules, making raw scores not directly comparable across time. Additionally, effects of age and language level on domain total and algorithm scores have been observed (Joseph et al. 2002; de Bildt et al. 2004; Gotham et al. 2007).

In 2007, the original ADOS algorithms were revised in part for the purpose of increasing the comparability across modules 1–3. Algorithms with the same number of items and of similar content across modules were created (Gotham et al. 2007). These revisions resulted in improved specificity of the measure among more impaired populations, while generally maintaining or improving predictive validity among individuals of other developmental levels (e.g., fluent speakers). The algorithm domain structure now includes a Social Affect (SA) and a Restricted, Repetitive Behavior (RRB) domain for each of the five developmentally-based algorithms corresponding to modules 1–3. Comparability of item content and total item number across these algorithms was intended to improve the interpretability of longitudinal comparisons using the measure. Still, items are necessarily developmentally graded across modules, making calibration necessary to compare algorithm totals.

Some effects of participant characteristics still exist within and across ADOS modules as well. Revised algorithm totals met the goal of independence from chronological age and decreased association with verbal IQ, with the exception of Module 1 scores (Gotham et al. 2007). A replication of the algorithm revisions in an independent dataset again found low correlations between raw scores and age, verbal IQ, and nonverbal IQ, though significant associations remained between verbal IQ and SA domain total scores for Module 1 recipients with few or no single words and Module 2 recipients aged 5 or older (Gotham et al. 2008).

True normalization of severity of autism would require a representative population, but to date, population studies have been too small, e.g., Brick Township (Bertrand et al. 2001), have not used the ADOS (Chakrabarti and Fombonne 2005; CDC 2007), or have collected samples older than most clinically assessed children (Baird et al. 2006). Acknowledging these limitations, in the present study we elected to standardize ADOS scores using a large “convenience” sample of individuals with ASD. Our goals were to reduce remaining participant demographic effects to the greatest possible degree, and generate standard scores that would approximate a severity metric for the construct of ‘autism spectrum’ as it is measured on the ADOS. This metric ideally will be useful in (1) allowing comparison of assessments across modules and time; (2) providing a means of assessing the relationship between severity in ASD and verbal and nonverbal IQ; and (3) identifying different trajectories of autism severity independent of verbal IQ both for clinical purposes and for phenotypic subgrouping in genetic and neurobiological research. We hope that calibrated severity scores can then be replicated in smaller population-based studies and tested for validity in predicting treatment responsiveness and other clinical outcomes in children with ASD.

Our first approach to developing a severity metric was to calibrate ADOS algorithm totals using eight age/language cells chosen on the basis of theoretically-driven expectations for specific age ranges with similar developmental impairments. This would have allowed a ‘prefix’ on the severity score that indicated age and language level out of the eight possible groups (ranging from young Module 1’s with no words to fluent speakers, aged 5–10). Within each cell, raw totals were converted to Z-scores, which were then converted to a 100-point scale. This method yielded calibrated scores that fanned out, with increasing variability of individuals’ ADOS totals over time and age. Thus, an alternative approach was chosen in which a greater number of age/language cells were used, and severity scores within each cell were based on the raw total percentiles that corresponded to each of three possible ADOS diagnostic classifications. This method is described in more detail below.

Methods

Participants

Analyses were conducted on data from 1,415 individuals, of whom 355 individuals with ASD diagnoses had repeated measure data. The final dataset included 2,195 assessments, where ‘assessment’ is defined as contemporaneous ADOS data and a best estimate clinical diagnosis. Autism diagnoses were assigned to 1,187 assessments (54% of entire sample); 599 assessments were given diagnoses of non-autism ASD (27% of the sample, including n = 12 with Asperger Disorder, n = 3 with Childhood Disintegrative Disorder, and n = 584 with Pervasive Developmental Disorder, Not Otherwise Specified, or PDD-NOS), and 409 had non-ASD developmental delays (19%). Contemporaneous verbal IQ data were available for 2,007 assessments (91.4% of the entire sample) and nonverbal IQ data for 1,989 assessments (91.0%). Please refer to Table 1 for a detailed description of the dataset by revised algorithm group.

Table 1 Sample description

Chronological ages in the sample ranged from 2 to 16 years (see Table 1 for age range by algorithm group). Recipients of ADOS Module 4 (older adolescents and adults with fluent speech) were not included in these analyses because of smaller sample size and the different relevance of age equivalents in adults. Females comprised 22% of the dataset (N = 478 assessments). Ethnicities represented by these data include 14% African American (N = 306 assessments); 3% Asian American (N = 58); 77% Caucasian (N = 1699); 0.5% Native American (N = 10); 2% biracial (N = 40); and other (N = 20) or race not specified (N = 62) totaling 4% of assessments. Twenty-three percent of the sample reported maternal education at the graduate or professional level; 56% of mothers had a bachelor’s degree or some college education, and 21% of mothers had a high school degree or less.

Within the nonspectrum sample of 409 assessments, 111 had a primary diagnosis of a language disorder (27% of nonspectrum total), 80 were assessments with nonspecific intellectual disability (20%), 56 with Down syndrome (14%), 55 with oppositional defiant disorder, ADD and/or ADHD (13%), 31 with mood and/or anxiety disorders (8%), 29 with Fetal Alcohol Spectrum Disorders (7%), 24 with non-ASD genetic and/or physical disabilities such as Fragile X, Williams syndrome, or mild cerebral palsy (6%), and 23 had an early delay that clinicians were not comfortable categorizing (5%).

The majority of participants were self-, school-, or physician-referred clinic patients at the University of Michigan Autism and Communication Disorders Center (UMACC) or the University of Chicago Developmental Disorders Clinic. The rest participated in a longitudinal study conducted through the Treatment and Education of Autistic and Communication Handicapped Children (TEACCH) Centers at the University of North Carolina, Chapel Hill, and the University of Chicago clinic, or received diagnostic evaluations through recent, ongoing studies at UMACC, including those focused on participants with non-ASD developmental delays, ASD-affected sibling pairs, or children between 12 and 36 months of age who failed a social-communication screener. Out of 399 participants with repeated assessments through clinic reevaluations or longitudinal research, 301 individuals had 2 or 3 ADOS assessments (57% with autism, 31% with PDD-NOS, and 12% NS), and 98 individuals had between 4 and 8 assessments (58% with autism, 33% with PDD-NOS, and 9% NS). Individuals with longitudinal data did not differ significantly in gender, race, or maternal education from those with only one assessment point; however, they had significantly lower mean verbal IQs (M = 49.6, SD = 27.8) and nonverbal IQs (M = 73.0, SD = 23.8) at first assessment than did individuals with a single assessment (verbal IQ M = 68.2, SD = 32.8; nonverbal IQ M = 77.9, SD = 27.5); verbal IQ t(1,351) = 9.7, p < 0.001 and nonverbal IQ t(1,334) = 3.0, p < 0.01.

Measures and Procedure

The most typical research protocol across sites and projects was the initial administration of the ADI-R and the Vineland Adaptive Behavior Scales, 1st (VABS; Sparrow et al. 1984) or 2nd edition (Vineland II; Sparrow et al. 2005), to a parent or caregiver, followed by a child evaluation in which psychometric testing preceded the ADOS. The second most common protocol was a re-evaluation consisting of psychometric testing and an ADOS. In both cases, a clinical diagnosis was made by a psychologist and/or psychiatrist after review of all data. The ADI-R was available for 1,700 assessments (77% of sample) and the Vineland for 1,710 assessments (78%). The ADOS was administered and scored by a clinical psychologist or trainee who met standard requirements for research reliability. The Pre-Linguistic Autism Diagnostic Observation Schedule (PL-ADOS; DiLavore et al. 1995) was given in 418 assessments (19%) and the piloted ADOS-T (Luyster et al. 2009), a toddler version of the ADOS, was given in 82 assessments (4%); for both measures, identical items were recoded to Module 1 algorithm scores. A developmental hierarchy of cognitive measures, most frequently the Mullen Scales of Early Learning (MSEL; Mullen 1995) and the Differential Ability Scales (DAS; Elliot 1990), determined IQ scores.

Clinic-referred participants received oral feedback and a written report without financial compensation. Participants recruited only for the purpose of research received financial compensation and a written summary of evaluation results. Institutional Review Boards at the University of Chicago or the University of Michigan approved all procedures related to this project.

Mapping a Standardized Severity Metric onto Raw ADOS Scores

Severity scores were created by dividing the pool of assessments from individuals with ASD into narrowly defined age and language cells, and standardizing raw total scores from the revised algorithms (Gotham et al. 2007) within these cells. In order to maximize the number of cases available for standardization, assessments missing data from any one item from either the SA or RRB domains of the revised ADOS algorithms were retained by adding to the domain total an average item score from that participant’s existing domain data. The ASD sample alone was used for raw total standardization: this included all assessments corresponding to a best estimate diagnosis of autism or ASD, as well as data from 13 individuals who had ADOS data with a contemporaneous nonspectrum diagnosis but who were later diagnosed with ASD. This subsample (N = 1,807 assessments from 1,118 individuals) was separated into groups based on the five revised algorithms used with children: Module 1 No Words, Module 1 Some Words, Module 2 Younger than 5; Module 2 Age 5 and Older; and Module 3. Within each of these five developmental cells, distributions of summed SA and RRB totals were generated separately for every 1-year age group between 2 and 16 years; these age cells were collapsed when possible in order to create the fewest number of age- and language-level-determined ‘calibration cells’ with similar raw total score distributions. Younger age cells were purposely kept distinct to anticipate developmental changes and more frequent assessments in young children as they transition from toddlerhood to preschool to school programs. Age cells with similar distributions were collapsed only within the same algorithm. Eighteen calibration cells resulted (see Fig. 1).
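The single-item imputation described above can be illustrated with a minimal sketch. The function name and the example item scores are hypothetical, and the sketch assumes that the imputed value is the unrounded mean of the participant's observed items in that domain; the text does not state whether any rounding was applied.

```python
from typing import Optional, Sequence


def prorated_domain_total(item_scores: Sequence[Optional[int]]) -> float:
    """Sum one domain's item scores, filling at most one missing item.

    A missing item is represented by None. Following the description in
    the text, the missing item is replaced by the average of the
    participant's observed items in the same domain (assumption: no
    rounding is applied to that average).
    """
    observed = [score for score in item_scores if score is not None]
    n_missing = len(item_scores) - len(observed)
    if n_missing > 1:
        raise ValueError("only a single missing item per domain is imputed")
    if not observed:
        raise ValueError("no observed items to average")
    total = float(sum(observed))
    if n_missing == 1:
        # Add the mean of the observed items in place of the missing one.
        total += total / len(observed)
    return total
```

For example, a domain with observed item scores of 2, 1, and 3 and one missing item would receive a total of 6 plus the observed mean of 2, that is, 8.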

Fig. 1 Age by language level calibration cells. Note. N’s denote the number of ASD assessments within each cell

Within each of these 18 cells, raw ADOS totals were mapped onto a 10-point severity metric. After considering a variety of approaches, severity scores 1–3 were set so as to represent the distribution of raw scores receiving a nonspectrum ADOS classification within that calibration cell, severity scores 4–5 represented ASD-classification ADOS totals, and 6–10 represented raw totals receiving an autism classification within that cell. ADOS classification thresholds were determined by the revised algorithm relevant to each calibration cell. The range of raw totals corresponding to each point on the severity metric was determined by the percentiles of available data associated with each severity point within a classification range. Lower severity scores are associated with less autism impairment. Table 2 shows the raw score range corresponding to each severity point within each calibration cell.
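The classification-anchored mapping can be sketched for a single calibration cell as follows. This is an illustration under stated assumptions rather than the exact procedure used to produce Table 2: the function name, the cutoff values passed to it, the maximum raw total, and the rule of splitting each classification range into equal percentile slices are all hypothetical choices made for the example.

```python
import numpy as np

# Severity points reserved for each ADOS classification, as described above.
BANDS = (("nonspectrum", 1, 3), ("asd", 4, 5), ("autism", 6, 10))


def build_severity_map(raw_totals, asd_cutoff, autism_cutoff, max_raw):
    """Return {raw_total: severity_score} for one calibration cell.

    asd_cutoff and autism_cutoff are the lowest raw totals receiving an
    ASD or autism ADOS classification in this cell (hypothetical inputs).
    """
    observed = np.asarray(raw_totals)
    # Inclusive raw-score range covered by each classification.
    ranges = {
        "nonspectrum": (0, asd_cutoff - 1),
        "asd": (asd_cutoff, autism_cutoff - 1),
        "autism": (autism_cutoff, max_raw),
    }
    mapping = {}
    for name, low_sev, high_sev in BANDS:
        lo, hi = ranges[name]
        n_levels = high_sev - low_sev + 1
        in_band = observed[(observed >= lo) & (observed <= hi)]
        for raw in range(lo, hi + 1):
            if in_band.size:
                # Share of this band's observed totals falling below `raw`.
                frac = float(np.mean(in_band < raw))
            else:
                # No data in this band: fall back to position in the range.
                frac = (raw - lo) / (hi - lo + 1)
            # Assumption: equal-width percentile slices within the band.
            level = min(int(frac * n_levels), n_levels - 1)
            mapping[raw] = low_sev + level
    return mapping
```

Because the severity points available to a raw total are fixed by its ADOS classification, a cell whose sample is concentrated in one classification range constrains only the severity points tied to that range. The raw-score ranges actually assigned to each severity point are those listed in Table 2.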

Table 2 Mapping of ADOS raw totals onto calibrated severity scores

Design and Analysis

Distributions of raw totals and severity scores were compared to assess whether severity score distributions across age/language cells were more uniform than raw score distributions. Linear regression models were analyzed to compare the relative independence of severity scores and raw totals from participant characteristics, such as chronological age, verbal and nonverbal IQ, and verbal and nonverbal “current” mental ages. Several assessments with longitudinal data were then chosen to exemplify various patterns of severity change over time across diagnostic groups.

Results

Comparing Distributions of Severity Scores and Raw ADOS Totals by Calibration Cell

In line with the goal of increasing comparability across modules and developmental levels, severity scores for ASD participants were expected to have a more uniform distribution across age- and language-level calibration cells than would raw totals. Distributions of raw ADOS totals were generated for each of the 18 calibration cells (Fig. 2) and compared to the distribution of severity scores for each cell (Fig. 3).

Fig. 2 Distributions of ADOS raw total scores by age/language cells (ASD assessments only). Note. See Table 3 for a key of age/language cells 1–18

Fig. 3 Distributions of calibrated severity scores by age/language cells (ASD assessments only). Note. See Table 3 for a key of age/language cells 1–18

Distributions of severity scores showed increased comparability across the age/language cells, though they were not uniform. The means and standard deviations of both severity scores and raw totals are listed by age/language cell in Table 3.

Table 3 Raw score and calibrated severity score means and standard deviations by age/language cell (ASD assessments only)

Severity score distributions exhibited a ceiling effect that is inherent to the metric. By ensuring that scores 6–10 correspond to approximate fifths of the ASD participants who received scores in the autism classification range, roughly 20% of participants received the maximum score of ‘10’ (in this dataset, 19.3% of participants with an autism classification on the ADOS have a severity score of ‘10,’ which is 16.5% of all participants). Though some overlap exists, severity scores showed expected heterogeneity of distribution across the three diagnostic groups: autism, PDD-NOS, and nonspectrum (see Fig. 4).

Fig. 4 Distributions of calibrated severity scores by diagnostic group

Relative Independence of Severity Score from Participant Characteristics

Multiple linear regression analyses were performed separately for the dependent variables severity score and raw total to examine whether participant characteristics such as age and IQ would be less associated with severity scores than they were with raw scores. For ASD assessments with complete contemporaneous demographic data (N = 1,369), potential predictors were entered into a structured hierarchical model, in which Block 1 included verbal and nonverbal IQ and mental age variables (which are known to affect the expression of ASD symptoms; Lord and Spence 2006), and Block 2 included age, gender, maternal education, and race (variables that could affect ASD symptoms but that often have had non-significant effects when Block 1 variables are controlled). Whereas 44% of the variance in raw totals was explained by this model, only 12% of variance was explained for severity scores using these covariates. Verbal IQ and one maternal education variable (mothers with graduate/professional degrees vs. all others) emerged as significant predictors for both severity score and raw score. Nonverbal IQ, verbal mental age, nonverbal mental age, chronological age, and gender were not significant predictors of either severity scores or raw totals for ASD participants. When covarying for these variables, as well as verbal IQ and maternal education, there was a trend for African American participants to have lower severity scores than other racial groups combined (B = −0.35; β = −0.06, p = 0.04), but this is not easily interpreted due to the confounding effects of possible referral bias. For all ASD assessments with racial affiliation data (N = 1,749), mean severity score for African-American participants was 7.4 (SD = 1.8) compared to 7.3 (SD = 2.2) for the combined other participant groups, t(1,747) = −0.71; p = 0.48.
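The comparison of variance explained across blocks can be sketched as follows. This is a generic ordinary least squares illustration, not the statistical software or code used for the analyses reported here; the function names are hypothetical, and the predictor blocks are assumed to be numeric arrays with categorical variables already dummy-coded.

```python
import numpy as np


def r_squared(predictors, outcome):
    """Proportion of variance in `outcome` explained by an OLS model."""
    outcome = np.asarray(outcome, dtype=float)
    # Prepend an intercept column to the predictor matrix.
    X = np.column_stack([np.ones(len(outcome)), predictors])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    residuals = outcome - X @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((outcome - outcome.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def hierarchical_r_squared(block1, block2, outcome):
    """Return (R^2 for Block 1 alone, additional R^2 from adding Block 2)."""
    r2_block1 = r_squared(block1, outcome)
    r2_full = r_squared(np.column_stack([block1, block2]), outcome)
    return r2_block1, r2_full - r2_block1
```

Running the same function once with raw totals and once with severity scores as the outcome gives the two variance-explained figures that are compared in this section.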

Verbal IQ and the graduate/professional maternal education variable were then entered into Forward Stepwise models (see Table 4), at which point maternal education was excluded from the model as a predictor of severity score, though retained as a predictor of raw score. Standardization reduced the effect of verbal IQ, the most influential participant characteristic on ADOS scores. Verbal IQ explained 43% of the variance in raw totals in the model, but accounted for only 10% of the variance in severity scores in this model. This represents a change from a large effect size (R = 0.67) for verbal IQ on ADOS scores to an effect size just outside the accepted range for ‘small’ (R = 0.32; see McCarthy et al. 1991; Cohen 1988). The effect of maternal education on raw total scores was likely an artifact of recruitment biases (Graduate/Professional raw total M = 14.9, SD = 7.2; other maternal education levels raw total M = 15.4, SD = 7.2; t(1,887) = 1.13, p = 0.26).

Table 4 Multiple linear regression models for calibrated severity scores and ADOS raw totals in ASD assessments

When the initial hierarchical block models were applied to the full sample (ASD and nonspectrum assessments combined), significant predictors of severity scores included verbal IQ, gender (with males the more severe group), and maternal education; significant predictors of raw totals included verbal IQ, nonverbal mental age, gender, chronological age, and maternal education (these statistics are available from the authors). This again indicates that, when severity scores are applied to a clinical referral population, they are less influenced by participant characteristics than are raw ADOS totals.

Case Summaries

Four children with ASD diagnoses and longitudinal data were chosen to exemplify patterns in severity score change over time. Their scores by chronological age are plotted in Fig. 5, with ADOS module and raw total score displayed for each time point.

Fig. 5 Case summaries of longitudinal severity scores. Note. Parentheses by individual data points indicate (Module, Raw Score) for each assessment. The calibrated severity metric allows change across time and module to be evaluated in a standardized fashion in children of varying age and verbal ability. Adam and Daniel follow relatively consistent trajectories despite module changes, while a marked change in severity is apparent in Cara’s scores despite seemingly small increases in raw total within the same module. Bianca’s decreasing raw totals alone indicate a drop in ASD severity, but the clinical import of this is obscured by her module change and increasing chronological age. Severity scores are not necessarily more stable than raw totals, but were created to allow the change or consistency in these cases to be interpreted more readily than perceived patterns in raw total scores

Case 1. “Adam,” a Caucasian male, was seen at 45 months of age as part of a clinical research project. He received a diagnosis of autism at that time. He was evaluated with ADOS Module 2 until age 13, when he received Module 3. His mental ages were 34 months nonverbal and 21 months verbal at first assessment, and 165 months nonverbal and 111 months verbal at final assessment at age 13 (NVIQ: 71 at first, 107 at last; VIQ: 44 first, 80 last). Despite his increase in IQ, Adam showed a persistently severe trajectory, with scores varying between 8 and 10 over seven assessments.

Case 2. “Bianca,” a Caucasian female, was first seen at age 48 months as a clinical referral, at which point she received a diagnosis of autism. She was evaluated with ADOS Module 2 until age 5, when she received Module 3. Her mental ages were 46 months nonverbal and 56 months verbal at first assessment, and 107 months nonverbal and 120 months verbal at her 8.5-year-old assessment (NVIQ: 80 at first, 107 last; VIQ: 108 first, 126 last). Bianca showed decreasing autism severity over time, with scores dropping from 9 to 4 across six assessments.

Case 3. “Cara,” an African American female, was first seen as part of a research project at age 3. She received a diagnosis of autism. She was evaluated consistently using ADOS Module 1. Her mental ages were 16 months nonverbal and 8 months verbal at first assessment, and 51 months nonverbal and 11 months verbal at her last assessment at age 10 (NVIQ: 47 at first, 40 last; VIQ: 23 first, 20 last). Despite the stability of her IQ scores over time, Cara showed worsening autism severity, with scores increasing from 5 to 10 over four assessments.

Case 4. “Daniel,” a Caucasian male, was first seen at 34 months of age as a clinical referral and was given a nonspectrum diagnosis; at 46 months of age he received a PDD-NOS diagnosis which then remained stable over time. He was evaluated with ADOS Module 1 in his assessments through age 5; at age 10 he received Module 3. His mental ages were 38 months nonverbal and 36 months verbal at first assessment, and 162 months nonverbal and 142 months verbal at final assessment at age 10 (NVIQ: 112 at first, 129 at last; VIQ: 105 first, 113 last). Daniel showed consistently mild severity scores varying between 1 and 3 over four assessments.

Discussion

The calibrated severity metric based on ADOS raw totals offers a method of quantifying ASD severity with relative independence from individual characteristics such as age and verbal IQ. It should have utility in various genetic, neurobiological, and clinical research endeavors, including treatment trials, that otherwise would use unstandardized ADOS raw totals. Calibrated scores have more uniform distributions across age- and language-groups compared to raw totals, making it possible to compare children’s scores longitudinally across distinct algorithms. In part because of the modular system of the ADOS, chronological age, nonverbal IQ, and verbal and nonverbal mental age did not predict either raw totals or severity scores in this sample. The severity metric builds on this modular system to reduce the influence of participants’ verbal IQ, which accounted for 10% of the variance in severity scores versus 43% of the variance in raw totals, a reduction from a large to medium effect size. The remaining influence of verbal IQ on the severity metric can be seen in the drift of mean scores toward greater severity in older age groups with lower language levels (Modules 1 and 2). This apparent age effect seems likely to be explained by lower verbal IQ in the older children without fluent speech. Though this effect has not been eliminated entirely, the calibrated metric is better able to measure autism severity beyond verbal impairment than are raw ADOS totals.

Calibrating scores within narrowly-defined age/language cells achieved the reduction in verbal IQ effects within the new metric and corrected for artificial variability in individuals’ scores across time. Unfortunately, a greater number of calibration cells precludes a user-friendly age/language ‘prefix’ to the severity score, as mentioned in the “Introduction”. The method described here necessarily defines autism severity in relation to individuals of similar age and language ability. When using these scores clinically and for research, one must keep in mind the age/language level of the child/sample, as there clearly will be developmental and adaptive functioning differences among children with the same severity score on this 10-point scale. This is true of all standardized scores. Calibrated severity scores do not measure functional impairment, but are intended to provide a marker of severity of autism symptoms relative to age and language level. The module a child can be given depends on his/her expressive language level, and thus will continue to be an important indicator of adaptive functioning for most children.

The dataset described here included children from various areas in the US, both urban and rural. Participants represented both consecutive clinic referrals and research participants. While this is likely a representative sample for a North American clinical research center, it is worth examining how referral bias might have influenced these calibrated scores. Though the dataset was large (N = 1,807 assessments from children with ASD), its division into age/language cells for calibration resulted in a few small cell sizes. For example, children under age 5 who are not language delayed are unlikely to be referred for an evaluation unless they exhibit notable ASD symptomatology, so we would expect these cells to have a more limited distribution in the higher end of the range of ADOS scores. Another referral bias involved the tendency for children of higher severity to have more clinic reevaluations than those with less pronounced features of ASD. Indeed, the mean severity scores across the 18 calibration groups ranged from 6.64 (in young children with fluent speech) to 8.10 (in older children with phrase speech only), indicating that severity scores are still somewhat influenced by developmental level and referral bias.

After attempting a number of methods for standardizing ADOS scores, we believe that the present method of using ADOS diagnostic classifications to ‘anchor’ severity scores best controls for recruitment effects that would be present in any large clinical research sample, and therefore results in a metric more likely to be generalizable across datasets. If a cell in this calibration sample had predominantly high- or low-scoring children, this restricted range would only be assigned to severity scores associated with one classification (autism, ASD, or nonspectrum), allowing for more variability in other datasets across the other possible classifications. Ideally this method circumvents to some degree the inevitable effects of recruitment. Anchoring severity scores to ADOS classification instead of clinical diagnosis also avoids conflicting dimensional and diagnostic assignment. Within the present method, severity scores reflect ADOS raw totals regardless of the participant’s diagnosis, so a child with a non-ASD best estimate diagnosis potentially could receive a score of 6 on the metric while a child with autism receives a 3, if the former child showed more autistic symptomatology relative to his/her age and language within that 45 min assessment than did the child with autism.

More work is needed to test the validity and utility of this calibrated severity metric. Module change, especially into Module 3 (fluent speech), may inflate an individual’s severity score. Some longitudinal variation in these scores is expected, but the purpose of the metric is to measure change beyond typical variation in ASD. For this reason, the fact that approximately 20% of ASD assessments with ‘autism’ ADOS classifications receive the highest severity score of 10, creating a ceiling effect, was preferred over drawing out the distribution of the metric with the result of less meaningful differences between scores. We hope to further examine patterns of severity score change over time in a longitudinal sample, identifying trajectory classes and the risk variables that predict class membership.

Another future direction is to calibrate the SA and RRB domains of the revised ADOS algorithms separately in order to measure severity within these symptom domains. This process will need to employ a different method of mapping raw scores onto a severity metric, due to the fact that each domain has a smaller range of possible raw totals than the overall score (with a maximum of only 8 points for the RRB domain).

Limitations

Although based on a large sample, this is not a metric of symptom severity in a “true” ASD population because ADOS data on such samples do not exist at present. As larger population studies become available, the metric should be recalibrated within those samples for a more accurate reflection of the distribution of ADOS scores in the ASD population.

These results also may be influenced by the historical period in which some of the data were collected. This sample grew over a 16-year period in which patterns in ASD identification evolved. As greater numbers of children are identified at earlier ages (thus including milder cases at younger ages), it is possible that severity scores might have been assigned differently to raw totals if only recently collected data were used.

Conclusion

The ADOS calibrated severity metric represents a step towards achieving greater comparability of scores across time, age, and module, and is less influenced by verbal IQ than raw scores. Therefore, it should provide a better measure of ASD severity than other methods currently available, including ADOS raw total scores. This metric must be replicated in a large independent sample. To test the validity of the metric, calibrated scores should be used to track observed changes in ASD severity against sources of convergent validity.

Calibrated scores could be used to predict outcome, changes in adaptive skills over time, and associations between severity of core features and clinical characteristics such as behavior problems, peer relationships, and school achievement. This metric may also prove useful in interpreting results from studies of the effectiveness of interventions, and in characterizing samples for genetic and neurobiological research. An important reminder, however, is that the calibrated severity metric is based on a relatively brief, office-based observation with a clinician, and thus is only one part of a necessarily broader picture of the strengths and difficulties of a child with ASD.