Introduction

The diagnostic label of autism spectrum disorder (ASD) in the recently published Diagnostic and Statistical Manual of Mental Disorders-5 (DSM-5; APA 2013) captures the fact that children with ASD vary widely with respect to severity. Children with ASD who are severely impaired are readily identified at a young age by their obvious failure to develop social reciprocity and communicative language as well as by their highly restricted behavioral repertoire. However, young children with ASD who are less severely impaired can be difficult to differentiate from children who do not have ASD but who have other problems, such as language delay, social difficulties, anxiety, attention problems, oppositional behavior, or poor impulse control. For example, this difficulty in differential diagnosis is apparent in studies of the diagnostic overlap between ASD and ADHD, such as those by de Bruin et al. (2007) and by Leyfer et al. (2006), which reported, respectively, that 45 and 31 % of children diagnosed with ASD/PDD also met criteria for ADHD.

The diagnostic assessment of ASD in young children can be a lengthy, involved, and costly process. It often involves administration of the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2; Lord and Rutter 2012) and the Autism Diagnostic Interview-Revised (ADI-R; Rutter et al. 2003), both of which require extensive training to master and time to administer. Therefore, in-depth assessment for ASD is most practical and cost-effective when previous screening with a less expensive assessment tool suggests a high likelihood of ASD. Such screening tools typically take the form of parent-report checklists, which can be completed and scored quickly at minimal expense.

Screening tools have significant advantages of cost and practicality, but they also can result in two major types of erroneous decisions. When a screener misses children who truly have a condition (false negatives), the screener has low sensitivity. To avoid this type of erroneous decision, cutpoints on screeners are often set low enough to identify any child who might have the condition, resulting in the opposite kind of erroneous decision, namely false positives (i.e., flagging children who do not actually have the condition), which is reflected in low specificity. As Barton et al. (2012) state (p. 1167), “In order to detect the maximum number of children with the disorder, the threshold for identification may be set low, which will result in the identification of more children with the disorder (high sensitivity), as well as a significant number of false positives (low specificity).” Thus, most screeners have an inevitable trade-off such that the cost of high sensitivity is reduced specificity. Conversely, setting a cutpoint high enough to avoid false positives (increased specificity) inevitably results in an increase in the number of misses (decreased sensitivity).
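The trade-off described above can be illustrated with a small sketch. The scores are hypothetical and are not taken from any screener or study, and the function name `sens_spec` is likewise only illustrative. Children scoring at or above the cutpoint are treated as flagged.

```python
def sens_spec(case_scores, noncase_scores, cutpoint):
    """Return (sensitivity, specificity) when scores >= cutpoint are flagged."""
    true_pos = sum(s >= cutpoint for s in case_scores)    # cases correctly flagged
    true_neg = sum(s < cutpoint for s in noncase_scores)  # non-cases correctly passed
    return true_pos / len(case_scores), true_neg / len(noncase_scores)

# Hypothetical screener scores (illustrative only, not data from any study).
cases = [3, 5, 6, 7, 8, 9, 9, 10]
noncases = [0, 1, 1, 2, 3, 4, 5, 6]

# Raising the cutpoint lowers sensitivity and raises specificity.
for cut in (3, 5, 7):
    sens, spec = sens_spec(cases, noncases, cut)
    print(f"cutpoint {cut}: SENS = {sens:.3f}, SPEC = {spec:.3f}")
```

With these made-up scores, moving the cutpoint from 3 to 7 lowers SENS from 1.000 to .625 while raising SPEC from .500 to 1.000.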

As noted by Barton et al. (2012) and Norris and Lecavalier (2010), a distinction is commonly made between Level 1 screening, designed to identify potential cases of ASD in unselected populations like preschools and primary care settings, and Level 2 screening, designed to differentiate children with ASD from those with other disorders/developmental disabilities. The costs and benefits associated with the two kinds of erroneous decisions are somewhat different for Level 1 and Level 2 screening. For example, as noted by Barton et al. (2012), the priority in Level 1 screening is to not miss children who may have ASD, even if this means that further evaluation will be recommended for a child who may not turn out to have ASD. Barton et al. (2012) suggest that this outcome is not problematic because children flagged by a Level 1 screening often have some other developmental disability or behavioral/emotional problem, even if they don’t have ASD, and because the benefits of early diagnosis of ASD outweigh the negative effects of some false positives. In contrast, avoiding false positives may be more desirable than avoiding misses when conducting Level 2 screening, given that the next step for cases flagged on the screener may be costly evaluation with the ADOS.

Norris and Lecavalier (2010) reviewed results from 20 studies examining screening accuracy of five instruments they designated as Level 2 screeners, namely the Social Communication Questionnaire (SCQ; Berument et al. 1999), the Gilliam Autism Rating Scale, Second Edition (GARS-2; Gilliam 2002), the Social Responsiveness Scale (SRS; Constantino and Gruber 2005), the Autism Spectrum Screening Questionnaire (ASSQ; Ehlers et al. 1999), and the Asperger Syndrome Diagnostic Scale (ASDS; Myles et al. 2001). The authors concluded that the SCQ had received the most rigorous examination (e.g., 11 studies, with a range of different comparison groups), with 18 sensitivity (SENS) and 18 specificity (SPEC) values reported. Only two of the 18 pairs of SENS and SPEC values were ≥80 %, whereas many showed the typical trade-off of higher SENS and lower SPEC (e.g., 92 vs. 62 %) or lower SENS and higher SPEC (e.g., 54 vs. 92 %). Nevertheless, Norris and Lecavalier concluded that the SCQ performed well. They also concluded that the SRS and the ASSQ “showed promise,” and that the GARS/GARS-2 and ASDS demonstrated poor sensitivity. Poor sensitivity of the GARS was also reported by South et al. (2002), who stated that it identified only 52 % of a sample of 119 children with “unequivocable” diagnoses of autism.

Barton et al. (2012) reviewed results for eight instruments they designated as Level 1 screeners. As Barton et al. (2012) noted, the first ASD screener was the Checklist for Autism in Toddlers (CHAT; Baron-Cohen et al. 1992). The CHAT’s SENS of only .38 (Baird et al. 2000) for children screened at 18 months and seen for diagnosis at age 7 led its authors to develop a revision of the scale. Barton et al.’s review reports several studies using the Modified Checklist for Autism in Toddlers (M-CHAT; Robins et al. 2001), which two of the article’s authors co-authored. They note that the “relatively high false positive rate” of the initial 23-item M-CHAT prompted the authors to add a 15-min follow-up interview with the parents of children flagged by the screening tool. Barton et al. reported “estimated” SENS and SPEC of 87 and 99 % for 1,122 unselected children and 171 children referred for early intervention assessed with the M-CHAT. Robins (2008) reported that the M-CHAT plus interview had positive predictive value (PPV) of 57 %, indicating many false positives. The best SENS and SPEC values reported by Barton et al. (2012) were for the Pervasive Developmental Disorders Screening Test II (PDDST-II; Siegel 2004), namely 92 and 91 % in a clinical sample of 937 children 12–24 months of age. Barton et al. (2012) also reported results from what they call a “broad-band developmental screener,” the Developmental Behavior Checklist-Early Screen (DBC-ES; Gray and Tonge 2005). SENS of 83 % and SPEC of 48 % were reported for a sample of 207 children referred for possible developmental delay (DD).

Several studies have tested the CBCL (Achenbach 1991, Achenbach and Rescorla 2001), a widely used broad-band assessment instrument, as an autism screener for school-age children. For example, Bölte et al. (1999) reported that German school-aged children with autism scored higher on the CBCL Attention Problems, Social Problems, and Thought Problems scales than children in the general population and children with other psychiatric conditions. Duarte et al. (2003) found that the Thought Problems scale on the CBCL 4-18 (Achenbach 1991) differentiated Brazilian children with autistic conditions from schoolchildren with SENS of 94 % and SPEC of 100 %. Additionally, Biederman et al. (2010) reported that the CBCL Withdrawn/Depressed, Social Problems, and Thought Problems scales were the best independent predictors of an ASD diagnosis relative to other diagnoses in a US sample. Receiver operating characteristic (ROC) analysis of this triad of scales, which they called the CBCL-ASD profile, yielded an area under the curve (AUC) of 0.86. Using a sample in Singapore, Ooi et al. (2010) also found that the Withdrawn/Depressed, Social Problems, and Thought Problems scales discriminated best between the ASD group and referred children with ADHD, undiagnosed referred children, and children from a community sample. In a Dutch study, So et al. (2013) constructed an ASD scale based on the items from the CBCL and the Teacher’s Report Form (TRF; Achenbach and Rescorla 2001) that best discriminated 458 children diagnosed with ASD from 1,009 referred children without ASD and 999 children in the general population. The ASD scale yielded positive predictive value (PPV) of 73 % and negative predictive value (NPV) of 95 % in differentiating the ASD group from the non-referred group. 
NPV was 85 % when differentiating the ASD group from the referred group, indicating that the ASD scale was good at identifying children not requiring more in-depth ASD assessment, but at a cost of considerable over-identification (PPV = 49 %).

Of particular relevance to the current study are several studies that have tested the Child Behavior Checklist for Ages 1½–5 (CBCL/1½–5; Achenbach and Rescorla 2000) as an ASD screener. The CBCL/1½–5, which takes about 15 min to complete, has 99 items tapping a wide range of behavioral and emotional problems that are scored on various scales, including the DSM-oriented Pervasive Developmental Problems scale (DSM-PDP). Because the CBCL is not labelled as an ASD screener, a parent’s pre-existing disposition to endorse or deny features of autism may be less likely to influence ratings than on an autism-specific instrument. Additionally, because the CBCL is not just focused on autism, it can identify a wide range of other behavioral and emotional problems in young children that may need clinical attention.

To our knowledge, Rescorla (1988) conducted the first study using the CBCL to identify preschool children with ASD. Her sample of 204 3- to 5-year-old boys had been evaluated before the DSM-III was published. In that clinic, children who appeared to be autistic had been given the diagnostic label of “severe atypical” (24 % of the sample) or “mild atypical” (15 % of the sample), depending on their level of functioning. The rest of the sample was comprised of children given the label “reactive” (40 % of the sample) or labeled as “undiagnosed” (12 % of the sample). An early version of the CBCL was used to rate presenting symptoms reported in case records. Factor analysis of the CBCL ratings for this sample yielded eight factors, including an Autistic/Bizarre factor with five items (e.g., Confused/in a fog, Strange behavior, Withdrawn). Cluster analysis of the children’s syndrome profiles yielded three clusters: one with most of the children diagnosed “severe atypical” or “mild atypical,” a second with most of the “reactive” group (children with a mixture of internalizing and externalizing problems), and a third with children who had low CBCL scores (most of the undiagnosed children).

Several years after this initial study, Sikora et al. (2008) compared the ability of the CBCL/1½–5 (Achenbach and Rescorla 2000) and the Gilliam Autism Rating Scale (GARS; Gilliam 1995) to differentiate among 147 preschool children referred to an autism clinic. Sikora et al. used the ADOS-G (Lord et al. 2002) to identify children with autism (n = 79), ASD (n = 18), or a non-spectrum disorder (n = 50). The non-spectrum children appear to have been developmentally delayed, based on their mean Mullen (1995) Early Learning Composite score of 72, a mean score comparable to that of the ASD group (also mean = 72) but higher than the score of the autistic group (mean = 54). The CBCL/1½–5 Withdrawn syndrome and DSM-PDP scale provided the best discrimination among the groups. The CBCL achieved higher SENS and SPEC in identifying autism and ASD than did the GARS, not only for the sample as a whole but also for boys and girls separately and for higher and lower functioning subgroups separately. For the full sample, SENS was highest for the DSM-PDP scale (80 %), next highest for the Withdrawn syndrome scale (65 %), and lowest for the GARS (53 %). SPEC values were lower than SENS values, namely 42 % for DSM-PDP, 62 % for Withdrawn, and 54 % for the GARS, probably due in part to the low functioning of the non-ASD group.

Pandolfi et al. (2009) evaluated the adequacy of the CBCL/1½–5 factor model in a sample of preschoolers with ASD (N = 128). They found that the CBCL measures the same constructs for both the ASD and the normal population. Children with ASD were found to score higher on all CBCL/1½–5 syndromes except for Anxious/Depressed.

Muratori et al. (2011) used the CBCL/1½–5 with three groups of Italian preschoolers: 101 diagnosed with ASD (by clinical evaluation plus the ADOS), 95 diagnosed with other psychiatric disorders (OPD), and 117 with typical development (TD). The ASD group had significantly higher scores on the Withdrawn, Attention Problems, and DSM-PDP scales than the OPD group. When the ASD group was compared to the TD group, ROC analysis yielded SENS of 85 % and SPEC of 90 % for the DSM-PDP scale and 89 and 92 % for the Withdrawn syndrome. When the ASD group was compared to the OPD group, SPEC was lower for both the DSM-PDP and Withdrawn scales (60 and 65 %, respectively), indicating that some children in the OPD group with high scores on these scales did not have ASD but did have other diagnosable psychiatric problems. However, SENS was unchanged (85 and 89 %, respectively), indicating that both scales identified most of the children who received an ASD diagnosis. Areas under the curve (AUCs) were .85 for the DSM-PDP scale and .89 for the Withdrawn scale.

Georgiades et al. (2011) used the CBCL/1½–5 to compare 335 Canadian preschool children with ASD with children in the US normative sample (Achenbach and Rescorla 2000). The ASD sample had higher scores on all syndrome scales (DSM-oriented scales were not reported), with the largest standardized mean differences found for Withdrawn (2.7), Attention Problems (1.1), and Emotionally Reactive (.8), all large effect sizes according to Cohen’s (1988) benchmarks for d. When CBCL scores, ADI-R scores, and scores on a scale of repetitive behavior were submitted to principal components analysis, the Withdrawn scale had a loading of .60 on a Social Communication Deficits (SCD) factor, only slightly lower than the ADI-R Communication domain nonverbal score loading of .69. The other CBCL scales had significant loadings of .45 to .75 on an Emotional Behavioral Repetitive Problems (EBRP) factor. The SCD factor was significantly associated with measures of language, intellectual, and adaptive functioning, whereas the EBRP factor was not. Both factors were significantly associated with parenting stress. Georgiades et al. (2011) concluded that their results supported “a joint social-communication domain,” now codified in the DSM-5. They also concluded that the EBRP factor suggested that repetitive behaviors and fixated interests are part of a broader constellation of emotional and behavioral problems, such as anxiety, emotional reactivity, attention problems, and aggressive behavior, manifested by children with ASD but also by other maladjusted preschoolers. An implication they draw from their findings is that the SCD and EBRP dimensions may have “different etiological mechanisms.”

Myers et al. (2013) recently reported CBCL/1½–5 findings for 70 preschoolers who had ASD and 85 clinically referred preschoolers who did not have ASD. ASD was diagnosed based on a clinical interview conducted by one psychologist, with a follow-up ADI-R administered if ASD was suspected. The non-ASD group had communicative or cognitive delays and/or behavioral concerns, but all received a diagnosis of “Developmentally Delayed.” Children who had ASD had significantly higher scores on the Withdrawn and DSM-PDP scales (d = .99 for Withdrawn and .81 for DSM-PDP). ROC analyses yielded AUCs of .75 for Withdrawn and .74 for DSM-PDP. A cutpoint of T ≥ 65 yielded SENS of 89 % for Withdrawn and 79 % for DSM-PDP, with SPEC of 52 and 48 %, respectively. NPV was 85 and 69 %, whereas PPV was 60 and 51 %, for Withdrawn and DSM-PDP, respectively.

In summary, from the three studies reporting decision statistics for the CBCL/1½–5, SENS has ranged from 79 to 85 % for the DSM-PDP scale and from 65 to 89 % for the Withdrawn scale, whereas SPEC has ranged from 42 to 90 % for the DSM-PDP scale and from 52 to 92 % for the Withdrawn scale. SPEC varied depending on the group to which children who have ASD were being compared. That is, SPEC was highest when the contrast group was typically developing, moderate when the contrast group had OPD, and lowest when the contrast group had a DD. This suggests that “false positives” are not typically developing children who have no problems. Rather, they are children who do not have ASD but who have other behavioral, emotional, or developmental problems that may also need to be addressed.

Despite some positive findings regarding the CBCL/1½–5 as a screening tool for ASD in preschoolers, research is still quite limited in this area. Furthermore, all studies to date have been done in Western countries (the US, Italy, and Canada). It remains to be seen how well the CBCL/1½–5 can screen for ASD in non-Western countries. The present study, which was conducted in Korea, examines this issue. That our study was conducted in Korea is particularly important given Kim et al.’s (2011) finding that 1.9 % of a large sample of Korean 7- to 12-year-olds in regular classrooms (N = 55,266) were diagnosable with ASD, most of whom had never previously been diagnosed. The children who had ASD showed less male predominance and less intellectual disability than usually found in autism prevalence studies, with 12 % having superior IQs (Kim et al. 2011). Probably because they were relatively high functioning, most of these children had not been identified with ASD as preschoolers, the time period in which many children are first diagnosed with ASD and offered intervention. It thus appears that an efficient method for screening for ASD in the general population during the preschool period would be especially useful in Korea.

Questions Addressed by the Current Study

The central aim of our study was to test the ability of the Korean CBCL/1½–5 to screen for ASD. To accomplish this aim, we compared 46 children in an ASD group diagnosed with PDD based on DSM-IV criteria with children from three other groups: 111 children with DD, 71 children with OPD, and 228 non-referred children (NR) from the Korean CBCL/1½–5 normative sample (N = 2,327) matched to the 228 referred children in age and gender. Four specific questions were examined: (1) On which CBCL/1½–5 scales did the ASD group differ significantly from each of the other groups? (2) Which CBCL/1½–5 items best discriminated the ASD group from the other groups? (3) What AUCs were generated by ROC analyses when the ASD group was compared to each of the other three groups? (4) What decision statistics values were obtained when the CBCL was used to discriminate the ASD group from each of the other groups?

Methods

Participants

The 228 children in the three clinical groups were evaluated in a child psychiatry department of a large public hospital in Seoul, South Korea: 46 children diagnosed with ASD (72 % boys, mean age 3.54 years, SD = 1.09 years); 111 children diagnosed with DD but not ASD (75 % boys, mean age 3.43 years, SD = 1.13 years); and 71 children diagnosed with OPD but not PDD or DD (52 % boys, mean age 3.37 years, SD = 1.20 years) (see Table 1). Overall, the clinical group had 67 % boys (153 boys and 75 girls), with a mean age of 3.4 years (SD = 1.4).

Table 1 Characteristics of the sample groups

The group of 228 NR children was drawn using stratified random sampling from the Korean CBCL/1½–5 normative sample of 2,327 children so that it would also contain 67 % boys (153 boys and 75 girls), with a mean age of 3.4 years (SD = 1.4). The full normative sample, which was used in one of our analyses, was collected in 13 preschools and daycares in four regions of Korea (Seoul/Kyung-gi Province, Choongchung-Gangwon Province, Chulla-Jesu Province, and Busan-Kyung-sang Province). The CBCL/1½–5 was sent home from school and forms were returned to school when completed, with an overall return rate of 92 %; 95 % of the forms were completed by mothers (Oh and Kim 2009). Self-reported socioeconomic status (SES) of the families was predominantly middle class (57 %) and lower-middle class (32 %) on a 5-level scale. The mean family income was 3,590,000 KW/month (roughly $3,485/month), slightly higher than the mean family income of 3,220,000 KW/month for Korea as a whole (roughly $3,126/month). Children were excluded from the NR group if there was any history of referral for behavioral, emotional, intellectual, or developmental problems in the past year.
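The stratified draw described above might be sketched as follows. This is a minimal illustration, not the procedure actually used: the pool is synthetic, its gender split is an assumption (only the total of 2,327 and the quotas of 153 boys and 75 girls come from the text), and the sketch stratifies on gender only, whereas the NR group was also matched on age.

```python
import random

def stratified_sample(pool, stratum_of, quotas, seed=0):
    """Draw quotas[s] records at random, without replacement, from each stratum s."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    sample = []
    for stratum, n in quotas.items():
        members = [rec for rec in pool if stratum_of(rec) == stratum]
        sample.extend(rng.sample(members, n))
    return sample

# Synthetic normative pool. The 1,200 / 1,127 gender split is an assumption;
# only the total (2,327) is taken from the text.
pool = [{"id": i, "gender": "boy" if i < 1200 else "girl"} for i in range(2327)]

# Quotas from the text: 153 boys and 75 girls (67 % boys).
nr_group = stratified_sample(pool, lambda rec: rec["gender"], {"boy": 153, "girl": 75})
print(len(nr_group), sum(rec["gender"] == "boy" for rec in nr_group))
```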

Placement in Diagnostic Groups

The diagnostic procedure for the clinical sample consisted of a clinical interview by a child psychiatrist and psychological testing by a clinical psychologist. The psychiatrist interviewed parents using a set of questions based on DSM-IV diagnostic criteria for various disorders and also observed the child’s behavior during the office visit. For example, to diagnose ASD, the psychiatrist asked questions about social interaction, play with peers, communication problems, and repetitive behaviors. Following the psychiatric interview, the doctor entered the child’s diagnosis in the medical chart. Independently, the psychologist evaluated the child with the Korean versions of either the Bayley Scales of Infant Development or the Wechsler Preschool and Primary Scale of Intelligence, depending on the child’s age and/or developmental level, and the child’s mother also completed Korean versions of the Parenting Stress Inventory, the Vineland Adaptive Behavior Scale, and the Personality Inventory for Children. The psychologist used these evaluation data to make an independent diagnosis. The CBCL/1½–5 was completed for research purposes, was not part of the formal evaluation procedure, and was not reviewed by either the psychiatrist or the psychologist prior to making their diagnoses.

Because the children in this study were evaluated prior to publication of the DSM-5 (APA 2013), diagnoses were made using the DSM-IV (APA 1994), which used the term pervasive developmental disorder (PDD) to diagnose children who are now given the diagnostic label of ASD. Although the name of the disorder was modified with publication of the DSM-5 to better reflect the disorder’s dimensional character, the symptom criteria of problems in social interaction and social communication, as well as restricted/repetitive interests, were similar in both versions of the DSM. Children were classified as ASD for the purpose of this research if both the psychiatrist and the psychologist gave a diagnosis of PDD based on DSM-IV criteria. Thus, interrater agreement for ASD diagnoses was 100 %. Many but not all the children in the ASD group also manifested DDs/low IQs, language delays, and delays in adaptive behavior, but these features were not criterial for the ASD diagnosis. Unfortunately, specific IQ, language, and adaptive behavior scores could not be obtained for purposes of this report. To be placed in the DD group, the children were required to manifest IQ and adaptive behavior scores <70 and not be diagnosed with ASD/PDD. A few children (about 5 per group) in the ASD and DD groups had a comorbid diagnosis (e.g., an anxiety diagnosis). The OPD group was comprised of children with IQs and adaptive behavior scores >70, no ASD/PDD diagnosis, and some other DSM-IV diagnosis (e.g., ADHD, separation anxiety, ODD).

Because the percentage of boys was lower in the OPD group than in the other two clinical groups and in the NR comparison group, we calculated a 4 × 2 χ2 for gender, which was significant at p < .05. The 4-group one-way ANOVA for age was not significant.
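The 4 × 2 χ2 test for gender can be sketched as below. The cell counts are an assumption: they are reconstructed from the group sizes and rounded percentages of boys given under Participants (the exact counts belong to Table 1, which is not reproduced here). The function `chi_square` is an illustrative helper, and 7.815 is the standard critical value of χ2 for 3 degrees of freedom at p = .05.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: ASD, DD, OPD, NR; columns: boys, girls.
# Reconstructed from rounded percentages (an assumption, not reported counts).
gender_counts = [[33, 13], [83, 28], [37, 34], [153, 75]]

stat, df = chi_square(gender_counts)
CRITICAL_05_DF3 = 7.815  # chi-square critical value, df = 3, alpha = .05
print(f"chi2({df}) = {stat:.2f}, significant at .05: {stat > CRITICAL_05_DF3}")
```

Under these reconstructed counts the statistic exceeds the critical value, which agrees with the significant result reported in the text.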

Measure

Each CBCL item is rated 0 = not true (as far as you know), 1 = somewhat or sometimes true, and 2 = very true or often true, based on the preceding 2 months. These item ratings are used to compute scores for seven syndromes derived by exploratory and confirmatory factor analyses (Emotionally Reactive, Anxious/Depressed, Somatic Complaints, Withdrawn, Sleep Problems, Attention Problems, and Aggressive Behavior); five DSM-oriented scales (Affective Problems, Anxiety Problems, Pervasive Developmental Problems, Attention Deficit/Hyperactivity Problems, and Oppositional Defiant Problems) identified by an international group of experts as being very consistent with diagnostic categories of the DSM-IV and DSM-5 (American Psychiatric Association 1994, 2013); and three broad-band scales (Internalizing, Externalizing, and Total Problems) (Achenbach and Rescorla 2000).
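As a rough illustration of how item ratings become a scale score, the sketch below assumes that a scale's raw score is the simple sum of the 0/1/2 ratings of its items. The item labels and their assignment to a scale are hypothetical placeholders, not the instrument's actual scoring key.

```python
VALID_RATINGS = {0, 1, 2}  # 0 = not true, 1 = somewhat/sometimes true, 2 = very/often true

def raw_scale_score(ratings, scale_items):
    """Sum the 0/1/2 ratings of the items assigned to one scale."""
    for item in scale_items:
        if ratings[item] not in VALID_RATINGS:
            raise ValueError(f"item {item!r} has invalid rating {ratings[item]!r}")
    return sum(ratings[item] for item in scale_items)

# Hypothetical item labels and scale membership (placeholders only).
example_scale_items = ["item_a", "item_b", "item_c", "item_d"]
ratings = {"item_a": 2, "item_b": 1, "item_c": 0, "item_d": 2, "item_e": 1}

print(raw_scale_score(ratings, example_scale_items))
```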

To develop the Korean CBCL/1½–5, the authors created a translated version of the US instrument, arranged for an independent back-translation, and then made revisions until the translation captured the meaning of the original. Like the US version, the Korean CBCL is written at about a fifth grade reading level and takes about 15 min to complete. Confirmatory factor analyses (CFAs) indicated excellent fit of Korean data to the US syndrome model (RMSEA = .04). Cronbach’s alphas were similar to those for the US version (.56–.94).

Lee et al. (2009) reported test–retest reliability and discriminant validity of the CBCL/1½–5 based on a clinically referred sample of 228 children as well as 228 children from the normative sample. Most of the children in the Lee et al. (2009) study also participated in the current study, but a small number of children differed across the two studies. In Lee et al. (2009), the clinical sample was not subdivided by diagnostic group, but the main diagnoses were anxiety, ADHD, language and/or DD, and ASD. CBCLs were obtained from parents of the clinic children at the time of their intake for assessment and were sent home from school for the normative sample. Test–retest reliability over a 2-week period ranged from .67 to .85 across scales. The CBCL attained 83 % accuracy in discriminating the referred and non-referred children when a T score ≥65 on Total Problems was used as a cutoff for deviance (93rd percentile). Odds ratios (ORs) for discriminating the two groups ranged from 2.30 to 11.12 for the various CBCL syndromes and DSM-oriented scales. The only scales that did not discriminate the two groups were Somatic Complaints (OR = .90) and Sleep Problems (OR = 1.04). The scales with the highest ORs were Withdrawn (OR = 7.77), Attention Problems (OR = 11.12), and the DSM-PDP scale (OR = 9.76).
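For readers less familiar with odds ratios, the sketch below shows one common way an OR is obtained from a 2 × 2 table of deviant versus non-deviant scores by referral status. The counts are hypothetical, and the cross-product formula is an assumption about how such ORs are typically computed rather than a description of the Lee et al. (2009) analysis.

```python
def odds_ratio(dev_ref, nondev_ref, dev_nonref, nondev_nonref):
    """Cross-product odds ratio: odds of a deviant score if referred vs. if non-referred."""
    return (dev_ref * nondev_nonref) / (nondev_ref * dev_nonref)

# Hypothetical counts (illustrative only).
# A scale that separates the groups: 40 of 50 referred vs. 20 of 50 non-referred are deviant.
print(odds_ratio(dev_ref=40, nondev_ref=10, dev_nonref=20, nondev_nonref=30))

# A scale that does not separate the groups gives an OR of 1.
print(odds_ratio(dev_ref=25, nondev_ref=25, dev_nonref=25, nondev_nonref=25))
```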

Data Analyses

For the 228 children in the clinical samples and the 228 children in the NR sample, all CBCLs were completed by mothers. Our first analyses, which tested for group differences in CBCL/1½–5 scale scores, involved a multivariate analysis of variance (MANOVA) performed on the seven syndrome scales, a MANOVA performed on the five DSM-oriented scales, and ANOVAs performed on the three broad-band scales, followed by Student–Newman–Keuls (S–N–K) post-hoc tests for scales with significant effects. Because of the large number of comparisons, we used an alpha level of p < .001 for the (M)ANOVAs. Effect sizes (ESs) were measured by eta squared (η2), which were interpreted using Cohen’s (1988) criteria (small = 1 to 5.9 %, medium = 6 to 13.9 %, and large ≥ 14 %). To verify that the slight gender difference in our OPD group relative to our other groups had no effect on our results, we redid our (M)ANOVAs on all CBCL scales with gender as an additional factor. No gender effects were significant at our Bonferroni-corrected alpha level of p < .001 (two were significant at p < .05, namely Aggressive Behavior and Externalizing, higher in boys), and no diagnosis × gender interactions had p values <.20. The second analyses, conducted to determine which CBCL/1½–5 items best discriminated the ASD group from the other groups, involved a MANOVA on all 99 items. For all items with a significant overall group effect (p < .001), we conducted S–N–K post-hoc tests to identify the items on which the ASD group differed significantly from the other three groups. In the third analyses, we used ROC analyses to compare the ASD group with the other three groups using continuous scores on the CBCL scales that best differentiated the ASD group in the ANOVAs. These ROC analyses also yielded SENS and SPEC values for various cutpoints on the scales.
Finally, in the fourth analyses, we dichotomized scores on the CBCL scales that best differentiated the ASD group using the cutpoint for a T score ≥65 (≥93rd percentile, borderline range). Cutpoints to dichotomize the sample into deviant versus non-deviant sub-groups were obtained from the CBCL/1½–5’s multicultural norm group appropriate for Korea (Achenbach and Rescorla 2010). For each scale analyzed, we obtained SENS, SPEC, PPV, and NPV values.
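The effect size used in the first analyses can be sketched as follows: for a one-way design, eta squared is the between-groups sum of squares divided by the total sum of squares. The scores are hypothetical, the helper names are illustrative, and the labels follow the percentage bands stated in the text.

```python
def eta_squared(groups):
    """Eta squared for a one-way design: SS_between / SS_total."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

def es_label(eta2):
    """Label an effect size using the bands given in the text (1, 6, and 14 %)."""
    pct = eta2 * 100
    if pct >= 14:
        return "large"
    if pct >= 6:
        return "medium"
    if pct >= 1:
        return "small"
    return "negligible"

# Hypothetical scale scores for three groups (illustrative only).
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
eta2 = eta_squared(groups)
print(f"eta squared = {eta2:.3f} ({es_label(eta2)})")
```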

Results

Group Differences in Mean Scores

As presented in Table 2, significant effects for group were found for 13 of the 15 scales. The largest ESs were for the Withdrawn syndrome (34 %) and the DSM-PDP scale (37 %), with the ES for Attention Problems also very large (30 %) by Cohen’s (1988) benchmarks. The other ESs ranged from 4 % for DSM-oriented Anxiety Problems to 18 % for Total Problems. As indicated in Table 2, the group effect was not significant for Somatic Complaints or Sleep Problems. Whereas the p values for all the other CBCL scales were <.001, the p values were .74 for Somatic Complaints and .46 for Sleep Problems. As shown in Table 2, the mean scores on these scales were very similar across the four groups, and the ASD group did not have the highest mean score on either scale.

Table 2 CBCL/1½–5 Means and Standard Deviations (SDs) for ASD, DD, OPD, and NR Groups

Student–Newman–Keuls pair-wise comparisons between groups indicated that the ASD group had a higher mean score than the NR group on 12 of the 13 scales showing a significant group effect (all except DSM-oriented Anxiety Problems) (see Table 2). The ASD group had significantly higher mean scores than the DD and OPD groups on the Withdrawn (p < .001), DSM-PDP (p < .001), and DSM-oriented Affective Problems (p < .01) scales. Cohen’s d values for the Withdrawn and DSM-PDP scales were medium-to-large by Cohen’s (1988) benchmarks for both the ASD-DD and ASD-OPD comparisons (DD: d = .76 for Withdrawn and d = .63 for DSM-PDP; OPD: d = .80 for Withdrawn and d = .79 for DSM-PDP).
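A standardized mean difference of the kind reported above can be sketched as follows. The text does not state which variant of d was computed, so the pooled-standard-deviation form used here is an assumption, and the scores are hypothetical.

```python
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation of the two groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Hypothetical raw scale scores for two groups (illustrative only).
group_1 = [4, 5, 6, 7, 8]
group_2 = [2, 3, 4, 5, 6]
print(f"d = {cohens_d(group_1, group_2):.2f}")
```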

Group Differences in Item Ratings

Although our ANOVAs indicated that the Withdrawn syndrome and the DSM-PDP scale were the aggregates that best discriminated between the ASD group and all other groups, we also wanted to determine which CBCL/1½–5 items (corresponding to “symptoms”) best discriminated the ASD group from the other groups. To accomplish this goal, we ran a MANOVA on all 99 items. For all items with a significant group effect (p < .001), we used successive S–N–K post-hoc tests to identify items on which the ASD group differed significantly from the other three groups at increasingly stringent p levels (.05, .01, .001). Table 3 presents the 15 items on which the ASD group had significantly higher mean ratings than all other groups at p < .05, along with their ESs. For three items reflecting general DDs (2. Acts too young for age; 65. Resists toilet training; 75. Speech problem), the ASD group scored higher than all other groups (including the DD group) at the p < .01 level. For seven of the 15 significant items, the ASD group scored higher than all other groups at the p < .001 level. These seven items would seem to be the most “autism-specific” problems on the CBCL (4. Avoids looking others in the eye; 23. Doesn’t answer when people talk to him/her; 25. Doesn’t get along with other children; 27. Doesn’t seem to feel guilty after misbehaving; 70. Shows little affection toward people; 71. Shows little interest in things around him/her; 88. Uncooperative).

Table 3 Mean rating and ES for CBCL/1½–5 Items Differentiating ASD Group from DD, OPD, and NR Groups

ROC Results

We next conducted ROC analyses to test the ability of the Withdrawn and DSM-PDP scales to differentiate the ASD group from the other three groups. As shown in Table 4, AUCs for differentiating the ASD group from the NR group were .94 for Withdrawn and .93 for DSM-PDP. Table 4 also shows that using a higher cutpoint reduced SENS but increased SPEC. For example, for the Withdrawn scale, SENS was 96 % and SPEC was 74 % for a cutpoint of 2.5, whereas SENS was 89 % and SPEC was 84 % for a cutpoint of 3.5. For the DSM-PDP scale, SENS was 96 % and SPEC was 75 % for a cutpoint of 4.5, whereas SENS was 91 % and SPEC was 81 % for a cutpoint of 5.5. ROC analyses yielded weaker differentiation when the ASD group was compared to the OPD group, with AUCs of .74 for the Withdrawn scale and .70 for the DSM-PDP scale. A cutpoint of 3.5 on the Withdrawn scale yielded SENS of 89 % and SPEC of 44 %. Similarly, a cutpoint of 6.5 on the DSM-PDP scale yielded SENS of 80 % and SPEC of 44 %. For the ASD-DD comparisons, AUCs were .63 (Withdrawn) and .68 (DSM-PDP). A cutpoint of 3.5 on Withdrawn yielded SENS of 89 % and SPEC of 39 %, whereas a cutpoint of 5.5 on the DSM-PDP scale yielded SENS of 91 % and SPEC of 27 %. Thus, there was the greatest overlap between the ASD and DD groups in their scores on the Withdrawn and DSM-PDP scales, with most of the ASD children identified (high SENS) but at a cost of over-identification of DD children (low SPEC).
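The trade-off between SENS and SPEC across cutpoints can be illustrated with a short sketch. The code below assumes that a child screens positive when the raw scale score exceeds the cutpoint (so a cutpoint of 2.5 flags scores of 3 or higher), and it computes the AUC as the proportion of case-comparison pairs in which the case scores higher, with ties counted as one half. The scores are invented for illustration and are not data from this study; the function names are hypothetical.

```python
def sens_spec(case_scores, control_scores, cutpoint):
    """Return (sensitivity, specificity); scores above the cutpoint screen positive."""
    true_pos = sum(1 for s in case_scores if s > cutpoint)
    true_neg = sum(1 for s in control_scores if s <= cutpoint)
    return true_pos / len(case_scores), true_neg / len(control_scores)


def auc(case_scores, control_scores):
    """Area under the ROC curve: probability that a randomly chosen case
    outscores a randomly chosen control (ties count one half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical raw scale scores, invented for illustration (not study data).
asd_scores = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
comparison_scores = [0, 0, 1, 1, 2, 2, 3, 3, 4, 6]

for cut in (2.5, 3.5):
    sens, spec = sens_spec(asd_scores, comparison_scores, cut)
    print(f"cutpoint {cut}: SENS = {sens:.0%}, SPEC = {spec:.0%}")
print(f"AUC = {auc(asd_scores, comparison_scores):.2f}")
```

With these invented scores, raising the cutpoint from 2.5 to 3.5 lowers SENS and raises SPEC, the same direction of change reported for the Withdrawn and DSM-PDP scales.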

Table 4 CBCL ROC and Decision Statistics Results

Decision Statistics

For the final analyses, we obtained decision statistics using a deviance cutpoint of T score ≥65 (≥93rd percentile) based on the CBCL/1½–5’s multicultural norm group appropriate for Korea (Achenbach and Rescorla 2010). These analyses were conducted using both the Withdrawn scale (T score of 65 = 5) and the DSM-PDP scale (T score of 65 = 7). As shown in Table 4, results were quite similar for the two scales. Most of the children in the ASD group scored above the cutpoint (SENS = 78 % for Withdrawn and 80 % for DSM-PDP), and most of the children in the NR group scored below the cutpoint (SPEC = 89 % for Withdrawn and 87 % for DSM-PDP). Furthermore, negative predictive value (NPV) was 95 % for Withdrawn and 96 % for DSM-PDP, indicating that virtually all children scoring below the cutpoint were from the NR group. However, positive predictive value (PPV) was only 58 % for Withdrawn and 55 % for DSM-PDP, indicating that 42-45 % of children scoring above the cutpoint were in the NR group.

When the ASD group was compared with the OPD group, SENS was 78 % for Withdrawn and 80 % for DSM-PDP, whereas SPEC was 63 % for Withdrawn and 55 % for DSM-PDP, indicating more false positives than found in the ASD-NR comparison. PPV was 58 % for Withdrawn and 54 % for DSM-PDP, indicating that 42-46 % of children scoring above the cutpoint received psychiatric diagnoses other than ASD. NPV was 82 % for Withdrawn and 81 % for DSM-PDP.

When the ASD and DD groups were compared, SENS was 78 % for Withdrawn and 80 % for DSM-PDP, similar to results obtained for the NR and OPD comparisons. SPEC was 53 % for Withdrawn and 60 % for DSM-PDP, indicating that 47 and 40 % of the DD group scored above the cutpoint. NPV was comparable to findings for the ASD-OPD comparison (86 % for Withdrawn and 83 % for DSM-PDP), whereas PPV was lower (41 % for Withdrawn and 36 % for DSM-PDP), indicating the greatest degree of overlap between the ASD and DD groups, consistent with the ROC results.

In a final analysis, we recomputed our decision statistics using the full Korean normative sample of 2,327 children, which better approximates the base rate of ASD in the general population (i.e., 46/(2,327 + 46) = 1.9 %). As shown in Table 4, ROC analyses yielded AUCs of .94 for both Withdrawn and DSM-PDP. On Withdrawn, SENS values were .96 and .89 for the 2.5 and 3.5 cutpoints, with SPEC values of .73 and .84, respectively. On DSM-PDP, SENS values were .96 and .91 for the 4.5 and 5.5 cutpoints, with SPEC values of .74 and .83, respectively. Using the T 65 cutpoint of 5 for Withdrawn yielded 78 % SENS, 91 % SPEC, 15 % PPV, and 99.5 % NPV. Using the T 65 cutpoint of 7 for DSM-PDP yielded 80 % SENS, 89 % SPEC, 12 % PPV, and 99.6 % NPV.
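The dependence of PPV and NPV on base rate can be shown with a brief sketch that derives both values from SENS, SPEC, and prevalence using Bayes’ rule. The function name is hypothetical, and the inputs are the rounded Withdrawn values reported above, so the output only approximates the tabled statistics (which would have been computed from exact cell counts).

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and base rate."""
    true_pos = sens * prevalence
    false_neg = (1 - sens) * prevalence
    true_neg = spec * (1 - prevalence)
    false_pos = (1 - spec) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Base rate when 46 ASD cases are combined with the 2,327-child normative sample.
prevalence = 46 / (2327 + 46)

# Rounded Withdrawn values at the T >= 65 cutpoint, taken from the text above.
ppv, npv = predictive_values(sens=0.78, spec=0.91, prevalence=prevalence)
print(f"prevalence = {prevalence:.1%}, PPV = {ppv:.0%}, NPV = {npv:.1%}")
```

With a base rate near 2 %, even a SPEC of about 90 % leaves false positives far outnumbering true positives, which is why PPV drops to roughly 15 % while NPV approaches 100 %.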

Discussion

The current study adds to a growing body of research testing the utility of the CBCL/1½–5 to screen for ASD. Overall, as discussed below, the results of our study were quite consistent with results of previous studies, indicating generalizability of the findings. Furthermore, our study was conducted in Korea, hence demonstrating replication in a non-Western sample of findings obtained in Western samples in previous studies.

Consistent with Georgiades et al. (2011), Muratori et al. (2011), and Myers et al. (2013), our Korean ASD group scored significantly higher than the NR group on most CBCL scales and scored significantly higher than the DD and OPD groups on the Withdrawn and DSM-PDP scales, with ds for the DSM-PDP scale of 2.0, .79, and .63 relative to the NR, OPD, and DD groups, respectively. The items on which the ASD group differed from all other groups at the p < .001 level were 4. Avoids looking others in the eye; 23. Doesn’t answer when people talk to him/her; 25. Doesn’t get along with other children; 27. Doesn’t seem to feel guilty after misbehaving; 70. Shows little affection toward people; 71. Shows little interest in things around him/her; and 88. Uncooperative. These items seem to capture the social interaction and social communication symptoms of ASD quite well. While none of these items refers specifically to repetitive behaviors, item 71 appears to reflect restricted interests and item 88 captures the rigidity manifested by children with ASD, which frequently makes them uncooperative. To our knowledge, previous studies have not determined which CBCL/1½–5 items are best able to differentiate children with ASD from children in other groups.

Our results suggest that the CBCL/1½–5 shows its greatest utility for Level 1 screening, namely to identify possible ASD cases in unselected samples, rather than for Level 2 screening, which seeks to differentiate children with ASD from children with other developmental disabilities or OPD. As noted by Barton et al. (2012), Level 1 screening aims to flag all children who might have ASD, even at the cost of some over-identification, given that children flagged in the screening will be further evaluated and may turn out to have other disorders. Cicchetti (2001) has proposed useful benchmarks for evaluating decision statistics, namely 90–100 % = excellent, 80–89 % = good, 70–79 % = fair, and <70 % = poor. Our AUCs of .93-.94 for the Withdrawn and DSM-PDP scales when we compared our ASD group with our NR group and with the full Korean normative sample were thus all in the “excellent” range. In the ROC analyses, SENS was 96 % (“excellent”) and SPEC was 75 % (“fair”) for a cutpoint of 4.5 on the DSM-PDP scale, whereas SENS was 91 % (“excellent”) and SPEC was 81 % (“good”) for the slightly higher cutpoint of 5.5. The same pattern was seen when we compared our ASD sample with the full Korean normative sample, namely SENS values were 96 % (SPEC = 74 %) and 91 % (SPEC = 83 %) for the 4.5 and 5.5 cutpoints. Using T ≥ 65, which is a cutpoint of 7 on the DSM-PDP scale, SENS fell to 80 % but SPEC rose to 87 % (both “good”) when we compared the ASD and NR groups, and SENS and SPEC were 80 and 89 % (both “good”) for the full normative sample comparison. The same pattern of results was found for the Withdrawn scale. All of these results are similar to those reported for Muratori et al.’s (2011) ASD-TD comparison.
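For readers who wish to apply the Cicchetti (2001) benchmarks to other decision statistics, the bands quoted above can be expressed as a simple lookup. This is an illustrative sketch only; the function name is hypothetical, and values falling between the stated bands (e.g., 89.5 %) are assigned to the lower band as an assumption.

```python
def cicchetti_label(percent):
    """Classify a decision statistic (in percent) using the bands quoted from
    Cicchetti (2001): 90-100 excellent, 80-89 good, 70-79 fair, below 70 poor."""
    if percent >= 90:
        return "excellent"
    if percent >= 80:
        return "good"
    if percent >= 70:
        return "fair"
    return "poor"


# DSM-PDP values for the ASD-NR comparison, as reported in the text.
for name, value in [("SENS at 4.5", 96), ("SPEC at 4.5", 75),
                    ("SENS at 5.5", 91), ("SPEC at 5.5", 81)]:
    print(f"{name}: {value} % = {cicchetti_label(value)}")
```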

The CBCL/1½–5 was less successful at differentiating the ASD group from the OPD and DD groups, namely for Level 2 screening. For example, our AUCs for differentiating ASD from OPD were .74 (Withdrawn) and .70 (DSM-PDP), in the “fair” range and comparable to those reported by Myers et al. (2013) but lower than those obtained by Muratori et al. (2011) for Italian children (.89 for Withdrawn and .85 for DSM-PDP). When SENS was high (91 % for a cutpoint of 5.5 on DSM-PDP), SPEC was quite low (32 %). Using a higher cutpoint (6.5 for DSM-PDP) led to the expected decrease in SENS (80 %) and increase in SPEC (44 %), but SPEC was still quite low. When the T ≥ 65 cutpoint was used for DSM-PDP, SENS was 80 % and SPEC was 55 %, close to the values reported by Muratori et al. (2011) for the DSM-PDP scale (i.e., SENS = 85 %, SPEC = 60 %). For the ASD-DD comparison, AUCs were .73 for Withdrawn and .68 for DSM-PDP (“fair” and “poor,” respectively). Using a 5.5 cutpoint on the DSM-PDP scale, SENS and SPEC were 91 and 27 %. Raising the cutpoint to 6.5 yielded SENS of 80 % and SPEC of 37 %. Using the T ≥ 65 cutpoint, SENS was 80 % and SPEC was 60 %.

In summary, results from our study indicate that the CBCL/1½–5 demonstrated excellent SENS and good SPEC for Level 1 screening, namely when differentiating children with ASD from typically developing children. Good-to-excellent SENS could be attained when the contrast group had OPD or DD, but only at the cost of low SPEC. However, these “false positives” were children who did not have ASD but who had other behavioral, emotional, or developmental problems that may also need to be addressed.

The best settings in which to use the CBCL/1½–5 for Level 1 screening are primary medical care or early child care settings, where most preschool children in the general population receive services at one time or another. The CBCL/1½–5 could be used routinely as part of a regular health assessment (e.g., at the 24-month check-up) or as part of an early child care center screening process, or it could be used only when parents, caregivers, teachers, or medical personnel express some concern about a child’s development. Based on our results, cutpoints of 5 or 6 on the DSM-PDP scale and 3 or 4 on the Withdrawn scale are likely to yield excellent SENS and good SPEC in Level 1 screening. “Positive screen” cases who do not turn out to have ASD on further assessment are very likely to manifest other behavioral, emotional, or developmental problems.

Because the CBCL/1½–5 is quick to complete and score, requires no professional time to administer, and yields printed output that is easy to interpret, its use for Level 1 screening is a practical way to identify children with possible ASD as well as with many other kinds of problems who may need further assessment. Furthermore, the CBCL/1½–5 (Achenbach and Rescorla 2000) has a strong normative foundation from epidemiological research. Specifically, normative data were obtained from a US national probability sample of 738 children ages 18–35 months representative of the population in terms of SES, race, and ethnicity assessed via parental home interviews in 40 states, with a 94 % completion rate. In addition, hundreds of published studies have reported CBCL/1½–5 findings, giving it a strong research base.

Limitations and Conclusions

The most significant limitation of this study is that the ASD diagnoses were not made using the ADOS and/or the ADI-R. However, the diagnoses were made independently by two mental health professionals using DSM-IV criteria, with 100 % agreement on diagnosis for ASD cases. An additional limitation is that the ASD group was smaller than all the other groups, thereby limiting the power for our comparisons. A third limitation is that the clinical groups were all obtained from a psychiatric clinic, whereas the non-referred, general population group was drawn from the normative sample for the CBCL/1½–5 in Korea. Recruiting all participants from a general population sample would have provided a stricter test of the CBCL’s ability to differentiate among the four groups studied here (ASD, OPD, DD, and NR). Conducting such a study using an epidemiological sample would be prohibitively expensive due to the relatively low base rates of the clinical conditions. However, to simulate use of the CBCL/1½–5 for ASD Level 1 screening, we re-calculated decision statistics using the full Korean normative sample of 2,327 children as the comparison group. When compared with the results for the smaller NR group, SENS was similar, SPEC and NPV were much higher, and PPV was much lower, consistent with the larger sample size and hence lower “base rate” of ASD in the sample.

Despite these limitations, our study had two major strengths. First, we were able to compare an ASD group with three other groups, namely a large non-referred sample, a group with OPD, and a group with DD. Second, our study was conducted in Korea, allowing us to demonstrate the generalizability of CBCL/1½–5 findings from Western samples to an Asian sample. Overall, our study provides further evidence that the CBCL/1½–5 provides good differentiation between children with ASD and typically developing preschoolers and hence shows promise as a Level 1 screening tool. Decision statistic results were less satisfactory when differentiating the ASD group from the OPD and DD groups, with sharper trade-offs between SENS and SPEC. However, the ASD children scored more than a half SD above the DD and OPD groups on the Withdrawn and DSM-PDP scales, the two signature CBCL scales for identifying ASD. Furthermore, the mean item ratings for seven items highly characteristic of ASD problems were significantly higher for the ASD group than for the OPD and DD groups at a very stringent level of significance (p < .001). In conclusion, additional studies using the methods employed in the Italian and Korean studies are needed to further demonstrate the utility of the CBCL/1½–5 as a screening tool for ASD in preschoolers.