Individuals with autism spectrum disorder (ASD) demonstrate many challenging behavioral patterns that can interrupt day-to-day functioning and classroom instruction (Dawson, Matson, and Cherry 1998; Hill and Furniss 2006; Matson and Shoemaker 2009). The difficulties that these individuals face in communicating their thoughts (Aldred, Green, and Adams 2004), interpreting the expressions and communication of others (Howlin, Mawhood, and Rutter 2000), and understanding what is happening in their environment in light of sensory impairments or sensitivities (Rogers, Hepburn, and Wehner 2003; Tomchek and Dunn 2007) can have a negative effect on learning new skills and on adaptive functioning in their environment. These difficulties, compounded over several years of life, can create a skills gap so great that these individuals lag behind their peers (e.g., learning disabilities in reading, mathematics, or overall intellectual and adaptive functioning). Early intensive behavioral interventions have been shown to have a positive influence on developmental outcomes (Matson and Smith 2008).

These early interventions tend to focus on improving interpersonal social interactions (e.g., Aldred et al. 2004; Bauminger 2002) and improving responses to aversive stimuli within the environment (e.g., Baranek 2002; Pfeiffer, Koenig, Kinnealey, Sheppard, and Henderson 2011). With increased instruction and individually tailored interventions, individuals with ASD can achieve higher levels of adaptive functioning in their daily lives, including academic settings. Diagnosing ASD and providing intensive early interventions are costly and time consuming to public schools (Hess, Morrier, Heflin, and Ivey 2008). Because of this, screeners such as the Social Communication Questionnaire (SCQ; Berument, Rutter, Lord, Pickles, and Bailey 1999) have been developed to allow parents, guardians, and teachers to help school psychologists and clinicians understand where they should focus their efforts for instruction and whether the child should be referred for a comprehensive diagnostic evaluation of ASD.

Before a full clinical evaluation is requested, screeners that caregivers (e.g., parents, guardians, teachers) can complete are typically used to help determine the likelihood that a child has ASD. In order for these screeners to be accessible to those not trained in special education or diagnostics, they are typically behavioral checklists that allow the respondents to report whether the child has demonstrated behavioral patterns that are characteristic of ASD. Each of these observable, behavioral items contributes to an overall score that is compared with a cutoff score used to determine whether further diagnostic evaluation and instructional resources should be allocated to the student. A few screeners for ASD have been developed in the last 20 years (e.g., the SCQ, Berument et al. 1999; Modified Checklist for Autism in Toddlers [M-CHAT], Robins, Fein, Barton, and Green 2001); however, the SCQ has been one of the most widely adopted (Wilkinson 2010). There are two versions of the questionnaire: SCQ Lifetime, which is completed with reference to the individual’s entire developmental history, and SCQ Current, which is completed with reference to the individual’s behavior during the last three months. The SCQ has found popularity not only in helping children receive the services and attention needed to intervene in skills development, but also among researchers studying a wide range of pervasive developmental disabilities.

The SCQ was developed using a sample of 200 participants ranging from 4 to 40 years of age (Berument et al. 1999). This norming sample represented a wide variety of individuals with and without pervasive developmental disabilities (e.g., ASD). With an established cutoff score of 15, Berument et al. (1999) reported a sensitivity of 0.96 and a specificity of 0.80 when comparing ASD to other diagnoses. However, subsequent validity studies did not confirm such psychometric properties of the SCQ when applied to different populations (e.g., individuals with intellectual disabilities, individuals with Down syndrome). At the cut-off of 15, most of these studies reported lower values (< 0.80) and an unsatisfactory balance of sensitivity and specificity (e.g., Allen, Silove, Williams, and Hutchins 2007; Brooks and Benson 2013; Eaves, Wingert, Ho, and Mickelson 2006; Oosterling et al. 2010; Snow and Lecavalier 2008; Wiggins, Bakeman, Adamson, and Robins 2007; Witwer and Lecavalier 2007), and quite a few of them suggested lowering the cut-off to improve the discriminant utility of the SCQ (e.g., Corsello et al. 2007; Johnson et al. 2011; Lee, David, Rusyniak, Landa, and Newschaffer 2007; Schanding, Nowell, and Goin-Kochel 2012; Wiggins et al. 2007). Therefore, the unpredictable performance of the SCQ has caused some researchers and clinicians to question its validity. Additionally, it is not always clear when researchers should utilize the Current version over the Lifetime version or whether either provides more reliable results. While the SCQ Current was designed to measure change over time for assessing the effectiveness of therapeutic or educational interventions (Rutter et al. 2003a, b), many researchers actually rely on it for children under 5 years old (see Brooks and Benson 2013; Corsello et al. 2007; Lee et al. 2007). The few studies that explicitly evaluated the SCQ Current version (Corsello et al. 2007; Oosterling et al. 2010) found particularly low specificities among very young (approximately 2 to 4 years old) children. As such, it remains questionable whether the SCQ Current might be a qualified alternative to the Lifetime form for screening ASD in young children.

To date, the majority of the studies examining the appropriateness of the SCQ as a screener have focused on its external validity through receiver operating characteristic (ROC) curve analyses (e.g., Allen et al. 2007; Corsello et al. 2007; Snow and Lecavalier 2008; Wiggins et al. 2007). There is a paucity of research that carefully analyzes the internal validity of the SCQ since its initial development (Berument et al. 1999). This gap warrants attention because understanding merely the scale-level characteristics of the SCQ may limit the practical implications for the use of the scale. The purpose of this study was to evaluate the SCQ using the Item Response Theory (IRT; Lord and Novick 1968) framework. IRT offers three primary advantages over some traditional approaches to examining the internal validity of psychometric measures: (1) the estimate of an examinee’s latent trait is independent of the particular sample of items that are administered; (2) the item characteristics are independent of the particular sample of respondents; and (3) a statistic indicating the precision with which each respondent’s latent trait is estimated is provided, and this statistic is free to vary among respondents (Hambleton and Swaminathan 1985). As such, the IRT analysis is capable of yielding item-level characteristics of the SCQ, which may provide valuable insight into its psychometric properties, suggestions for revisions, and directions for future uses.

In the context of health disparities research and individual assessment, researchers have also become increasingly concerned about measurement equivalence (Teresi 2006). An instrument is not fair when two groups with equal levels of a latent trait earn different scores on the same item (Gall, Gall, and Borg 2006); such measurement bias is also termed differential item functioning (DIF; Holland and Wainer 1993). While no issues regarding biased outcomes of the SCQ have yet been published, the fact that diagnoses of ASD are four to five times more prevalent among boys than girls (Blumberg et al. 2013) warrants further examination of the inventory as a screener.

Thus, the purpose of the current study was to examine the internal validity of the SCQ as a screening instrument for ASD. To achieve this purpose, the psychometric properties of the SCQ were evaluated using both classical true score theory procedures, through confirmatory factor analyses, and item response theory techniques. IRT techniques were then used to test for differential item functioning by gender within each form of the SCQ and across forms of the SCQ (i.e., Current versus Lifetime).

Method

Participants

Our samples consisted of a total of 2,134 individuals with the SCQ item-level scores available from the National Database for Autism Research (NDAR). Among them, 769 individuals were surveyed using the SCQ Current form (Sample 1) and 1,660 individuals using the SCQ Lifetime form (Sample 2), with an overlap of 295 individuals having valid scores of both forms. Sample 1 contained 81.4 % males (n = 626) and 18.6 % females (n = 143), with ages ranging from 1 year and 5 months to 21 years and 8 months (M = 8 years and 10 months, SD = 5 years and 6 months). Sample 2 contained 80.8 % males (n = 1,341) and 19.2 % females (n = 319), with ages ranging from 1 year and 3 months to 57 years and 10 months (M = 10 years and 9 months, SD = 6 years and 11 months). Sample 1 was used for the IRT and DIF analyses of the SCQ Current, and Sample 2 was used for the analyses of the SCQ Lifetime. When analyzing the DIF of SCQ Current versus SCQ Lifetime, we combined the two samples and excluded the overlapping 295 individuals with both scores available to avoid correlated errors.

Measures

The Social Communication Questionnaire (SCQ; Rutter et al. 2003a, b) is a 40-item, parent-report measure for screening symptomatology associated with autism spectrum disorder (ASD). All 40 items are administered in a dichotomous format (i.e., yes/no), with Item 1 simply documenting whether or not the child is able to speak with short phrases or sentences, and Items 2 through 40 used for scoring. For Items 2, 9, and 19 through 40, an individual scores 1 for response option no and 0 for yes; for the other items (i.e., Items 3 to 8 and 10 to 18), the individual scores 1 for yes and 0 for no. A score of 1 as opposed to 0 on an item always indicates a higher risk for ASD, and there is no need to reverse score. In the current study, we use “response option 1” to refer to a score of 1 on the item regardless of the verbal response (yes/no).
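The scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the publisher's scoring procedure: the function name `score_scq` and the dictionary input format are assumptions introduced for the example.

```python
# Items scored 1 when the caregiver answers "no" (Items 2, 9, and 19-40).
NO_KEYED_ITEMS = {2, 9} | set(range(19, 41))
# Items scored 1 when the caregiver answers "yes" (Items 3-8 and 10-18).
YES_KEYED_ITEMS = set(range(3, 9)) | set(range(10, 19))


def score_scq(responses):
    """Return (item_scores, total) for Items 2 through 40.

    `responses` is a hypothetical mapping {item_number: "yes" or "no"}.
    Item 1 only documents phrase speech, so it is not scored.
    """
    item_scores = {}
    for item in range(2, 41):
        answer = responses[item].strip().lower()
        if answer not in ("yes", "no"):
            raise ValueError(f"Item {item}: expected 'yes' or 'no', got {answer!r}")
        keyed_answer = "no" if item in NO_KEYED_ITEMS else "yes"
        # A score of 1 ("response option 1") always indicates higher ASD risk.
        item_scores[item] = int(answer == keyed_answer)
    return item_scores, sum(item_scores.values())
```

Under this keying, a form answered "yes" throughout yields a total of 15 (the 15 yes-keyed items), and a form answered "no" throughout yields 24.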

The standardization study by Berument et al. (1999) took four steps to assess the validity of the SCQ. First, a factor analysis was performed to determine whether the scale reflected the differentiated conceptualizations of three main domains of abnormality found in autism, and the results indicated that all 39 items (Items 2 through 40) gauge four domains of symptomatology: Social Interaction (k = 20; α = .91), Communication (k = 6; α = .71), Abnormal Language (k = 5; α = .79), and Stereotyped Behavior (k = 8; α = .67). Second, the combination of individual items was assessed by noting the item-total correlations and how well the items differentiated ASDs from other diagnoses, and 33 out of the 39 items (85 %) showed statistically significant differentiation. Third, the correlations between the SCQ and the Autism Diagnostic Interview-Revised (ADI-R; Rutter et al. 2003a, b) were examined for the total scores as well as the ADI domain totals, and the correlation coefficients were statistically significant for all comparisons. Fourth, ROC curves were applied to determine how the SCQ differentiated ASD from other diagnoses, and a sensitivity of 0.96 and a specificity of 0.80 were reported when comparing ASD to other diagnoses at a cut-off of 15. In addition, convergent validity and discriminant validity of the SCQ were assessed using a multitrait-multimethod matrix, and it was reported that the validation support was moderately strong for the Reciprocal Social Interaction and Communication domains, while weaker for the Restricted, Repetitive, and Stereotyped Patterns of Behavior domain.
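To make the cut-off statistics concrete, the following Python sketch computes sensitivity and specificity for a given cut-off. It assumes that total scores at or above the cut-off count as screen-positive; the function name and the toy data are hypothetical and are not drawn from Berument et al. (1999) or from the present samples.

```python
def sensitivity_specificity(total_scores, has_asd, cutoff=15):
    """Sensitivity and specificity of a screener at a given cut-off.

    Assumption: a total score >= cutoff is treated as screen-positive.
    """
    tp = fn = tn = fp = 0
    for score, diagnosed in zip(total_scores, has_asd):
        screen_positive = score >= cutoff
        if diagnosed and screen_positive:
            tp += 1
        elif diagnosed:
            fn += 1
        elif screen_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("Need at least one ASD and one non-ASD case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: four diagnosed and four non-diagnosed individuals.
toy_scores = [22, 17, 15, 9, 16, 8, 5, 12]
toy_asd = [True, True, True, True, False, False, False, False]
sens, spec = sensitivity_specificity(toy_scores, toy_asd)
```

Repeating the calculation over a range of cut-offs traces out the points of a ROC curve, which is why lowering the cut-off trades specificity for sensitivity.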

Analyses

A major assumption of IRT analyses is unidimensionality, which refers to the notion that a set of items measures a single latent construct (Lord and Novick 1968). To test this assumption, confirmatory factor analyses (CFA) were performed separately on the four subscales of both forms using Mplus v. 6.0 (Muthén and Muthén 2010) with robust weighted least squares (WLS) estimation. Chi-square (χ²) tests of model fit were consulted along with approximate fit indices (AFIs) for determining acceptable model fit. Because the χ² test is well known to be highly sensitive to sample size (Schmitt 2011), the following AFI criteria were consulted when the χ² test was significant: Root Mean Square Error of Approximation (RMSEA) ≤ .06, Comparative Fit Index (CFI) ≥ .95, and Weighted Root Mean Square Residual (WRMR) ≤ 1.0 (Bentler 2007; Hu and Bentler 1999; Yu 2002). A CFA model with fit indices close to these cutoff values is considered to fit the data well, while a model with fit indices well above or below these values (e.g., CFI < .90, RMSEA > .10) is considered to fit the data poorly.
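The decision rule above can be expressed as a small helper. This is a minimal sketch for illustration; the function name `classify_cfa_fit` is an assumption, and the judgment of whether indices are "close to" the cutoffs remains with the analyst rather than the code.

```python
def classify_cfa_fit(rmsea, cfi, wrmr):
    """Check approximate fit indices against the cutoffs cited in the text.

    Returns (meets, clearly_poor): `meets` maps each index to whether it
    satisfies its cutoff; `clearly_poor` flags the poor-fit examples given
    in the text (CFI < .90 or RMSEA > .10).
    """
    meets = {
        "RMSEA": rmsea <= 0.06,
        "CFI": cfi >= 0.95,
        "WRMR": wrmr <= 1.0,
    }
    clearly_poor = cfi < 0.90 or rmsea > 0.10
    return meets, clearly_poor


# Hypothetical values, not taken from Table 1.
meets, clearly_poor = classify_cfa_fit(rmsea=0.05, cfi=0.97, wrmr=0.90)
```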

Given the dichotomous response format of the SCQ items, Birnbaum’s (1968) three-parameter logistic (3PL) model was employed for the IRT estimates in IRTPRO v. 2.1 (Cai, Thissen, and du Toit 2011). The item characteristic curve (ICC) for item i is written as:

$$ P_i = c_i + \left(1 - c_i\right)\frac{e^{a_i\left(\theta - b_i\right)}}{1 + e^{a_i\left(\theta - b_i\right)}} \tag{1} $$

where \(P_i\) is the probability of a person’s endorsement of response option 1 for item i, the mathematical constant e is the base of the natural logarithm, θ (theta) is the latent trait, \(a_i\) is the discrimination parameter for item i, and \(b_i\) is the threshold parameter (de Ayala 2009). The threshold parameter \(b_i\) is the value of θ at the inflection point of the ICC. When \(c_i = 0\), an individual randomly selected from all persons with this level of the latent trait has a probability of 0.5 of endorsing option 1; more generally, the probability at \(\theta = b_i\) is \((1 + c_i)/2\). The discrimination parameter \(a_i\) is proportional to the slope of the ICC at its inflection point: the slope at \(b_i\) is \(0.25 \times a_i \times (1 - c_i)\), which reduces to \(0.25 \times a_i\) when \(c_i = 0\). According to Baker (1985, 2001), item discrimination is considered to be “very low” for a ≤ .34, “low” for .35 ≤ a ≤ .64, “moderate” for .65 ≤ a ≤ 1.34, “high” for 1.35 ≤ a ≤ 1.69, and “very high” for a ≥ 1.70. For item discrimination to be reasonably good, it should also not exceed 2.50 (de Ayala 2009). Finally, the pseudo-guessing parameter of item i is denoted by \(c_i\), the probability of a response of 1 as θ approaches negative infinity (\(-\infty\)). In other words, it reflects that some respondents with very low thetas may endorse option 1 when they would not be expected to (de Ayala 2009).
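Equation 1 can be evaluated directly. The sketch below is a plain Python transcription of the 3PL curve, written in the algebraically equivalent form 1 / (1 + e^(−a(θ − b))); the function name `icc_3pl` and the parameter values in the usage lines are hypothetical and are not estimates from Table 2.

```python
import math


def icc_3pl(theta, a, b, c=0.0):
    """Probability of endorsing response option 1 under the 3PL model (Eq. 1)."""
    # Equivalent to e^{a(theta-b)} / (1 + e^{a(theta-b)}).
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# Hypothetical item: a = 1.5, b = 0.0.
p_no_guessing = icc_3pl(theta=0.0, a=1.5, b=0.0, c=0.0)    # value at theta = b with c = 0
p_with_guessing = icc_3pl(theta=0.0, a=1.5, b=0.0, c=0.2)  # value at theta = b with c = 0.2
p_lower_tail = icc_3pl(theta=-50.0, a=1.5, b=0.0, c=0.2)   # approaches the lower asymptote c
```

With c = 0 the curve passes through 0.5 at θ = b; with c = 0.2 it passes through (1 + c)/2 = 0.6 at θ = b and flattens toward 0.2 for very low θ, which is how a nonzero pseudo-guessing parameter shows up in the ICC.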

The analysis of DIF was performed using DIFAS v. 5.0 (Penfield 2012). A DIF analysis compares a focal group (usually the minority) against a reference group (usually the majority). The Mantel-Haenszel chi-square (MH χ²) procedure was used for detecting DIF in the dichotomous items (Penfield 2012) with α = .05. The MH χ² statistic is distributed as χ² with one degree of freedom, with a higher value for a particular item indicating stronger evidence that the item displays DIF (Mantel 1963). To reduce Type I error, we used the Benjamini and Hochberg false discovery rate (BH-FDR; Benjamini and Hochberg 1995) to adjust each item’s p value of MH χ². According to Kim and Oshima (2012), the BH-FDR is the most balanced adjustment in lowering the Type I error rate as compared with the Bonferroni correction and Holm’s procedure. In terms of effect size, DIFAS v. 5.0 provides the Mantel-Haenszel Common Log-Odds Ratio (MH LOR; Mantel and Haenszel 1959) for estimating the magnitude and direction of DIF. Positive MH LOR values indicate DIF in favor of the reference group, while negative values indicate DIF in favor of the focal group (Penfield 2012). In addition, Breslow-Day chi-square (BD; Breslow and Day 1980) statistics were consulted for detecting nonuniform DIF. Finally, the impact of DIF at the subscale level was examined by Differential Test Functioning (DTF) analyses. DTF represents the aggregated DIF across the items of a test or subscale and is essentially the variance estimator of item-level DIF effects. Thus, it can provide information concerning the overall impact of DIF effects. DIF effect variance is considered to be small for τ² (tau-square) < 0.07, medium for 0.07 ≤ τ² ≤ 0.14, and large for τ² > 0.14 (Penfield and Algina 2006).
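As a rough illustration of the statistics named above, the following Python sketch computes a continuity-corrected MH χ², the MH common log-odds ratio, and BH-FDR adjusted p values. It is not a reimplementation of DIFAS: the function names are hypothetical, the use of the continuity correction is an assumption, and the counts in the usage lines are invented for the example. Each stratum is a 2 × 2 table at one matched score level, given as (a, b, c, d) = (reference scoring 1, reference scoring 0, focal scoring 1, focal scoring 0).

```python
import math


def mantel_haenszel(strata):
    """Return (MH chi-square, MH common log-odds ratio) across strata.

    Assumes the pooled table is non-degenerate (nonzero variance and odds terms).
    """
    sum_a = sum_expected = sum_var = 0.0
    odds_num = odds_den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        sum_a += a
        sum_expected += (a + b) * (a + c) / n
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        odds_num += a * d / n
        odds_den += b * c / n
    # Continuity-corrected chi-square with one degree of freedom (assumed form).
    chi2 = max(0.0, abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    # Positive values: reference group more likely to score 1 at matched levels.
    log_odds_ratio = math.log(odds_num / odds_den)
    return chi2, log_odds_ratio


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented counts for two score strata and invented p values for four items.
chi2, lor = mantel_haenszel([(30, 10, 20, 20), (25, 15, 15, 25)])
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.20])
```

An item would then be flagged when its adjusted p value falls below .05, with the sign of the log-odds ratio indicating which group the DIF favors.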

Results

Unidimensionality

Table 1 summarizes the fit indices of the one-factor CFA models along with the internal consistency reliabilities of each subscale: Social Interaction (SI), Abnormal Language (AL), Communication (COMM), and Stereotyped Behavior (SB). While results indicated acceptable to excellent reliabilities for SCQ Lifetime, the subscales of SCQ Current appeared to have lower internal consistencies. In addition, internal consistencies were similar within each pair of comparison groups for subsequent DIF analyses (see Table 1). In terms of unidimensionality, three subscales (i.e., Abnormal Language, Communication, and Stereotyped Behavior) showed sufficient unidimensionality for both SCQ Lifetime and Current, while Social Interaction of the Current form demonstrated notably poorer fit.

Table 1 CFA fit indices for subscales of the SCQ current and lifetime versions

Item Response Theory Analyses

Table 2 summarizes IRT parameter estimates for the items of all four subscales for both SCQ versions.

Table 2 IRT parameter estimates (a, b, c) by subscale for the SCQ current and lifetime forms

Based on Baker’s (1985, 2001) guidelines, 18 out of the 39 items (46 %) demonstrated high to very high discrimination (a ≥ 1.35) in the Current form, while 33 items (85 %) demonstrated high to very high discrimination in the Lifetime form. Some items warranted particular attention because they demonstrated low to very low discrimination (a ≤ 0.64) in either SCQ version (i.e., Items 4 [inappropriate questions], 9 [inappropriate facial expression], 15 [hand and finger mannerisms], 17 [deliberate self-injury], 19 [friends], 23 [gestures], and 39 [imaginative play with peers]). In general, items of the Lifetime form showed higher and more consistent discriminations within subscales than those of the Current form. For example, a ranged from 0.99 to 2.96 (M = 1.57, SD = 0.70) for SB items in the Current form, whereas it ranged from 1.69 to 2.98 (M = 2.30, SD = 0.49) in the Lifetime form. Such comparisons are illustrated by the ICCs in Figs. 1 and 2.

Fig. 1 Item characteristic curves (ICCs) of SCQ current by subscale

Fig. 2 Item characteristic curves (ICCs) of SCQ lifetime by subscale

In terms of item thresholds, the Lifetime form subscales generally demonstrated higher average thresholds than the Current form subscales, with SB as the exception. The average threshold (difficulty) was −0.09 (SD = 0.46) for AL (Current) and 0.19 (SD = 0.22) for AL (Lifetime), 0.00 (SD = 0.48) for COMM (Current) and 0.69 (SD = 0.44) for COMM (Lifetime), 0.11 (SD = 0.58) for SB (Current) and 0.10 (SD = 0.25) for SB (Lifetime), and −0.04 (SD = 0.58) for SI (Current) and 0.26 (SD = 0.37) for SI (Lifetime). Finally, the Lifetime form also appeared to be less influenced by pseudo-guessing effects because only two items (5 %) had significant nonzero c values, as compared with the Current form, in which seven items (18 %) had significant c values.

We also compared the test information (Hambleton and Swaminathan 1985) curves among the subscales. The test information function is defined for a particular set of items (e.g., a subscale) at each point along the continuum of the latent factor (Studts 2008). The contribution of each item to the total test information function is additive. As shown in Fig. 3, items of the Lifetime form altogether provided more information for the measurement of children’s social communication skills than those of the Current form.
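The additive property can be illustrated with the standard 3PL item information function, I(θ) = a² · [(P − c)/(1 − c)]² · (1 − P)/P, summed over the items of a subscale. The sketch below is illustrative only: the function names and the item parameters are hypothetical, and the values are not taken from Table 2 or Fig. 3.

```python
import math


def item_information_3pl(theta, a, b, c=0.0):
    """Fisher information contributed by one 3PL item at a given theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


def subscale_information(theta, items):
    """Test information at theta: the sum of the item information values.

    `items` is a list of (a, b, c) tuples.
    """
    return sum(item_information_3pl(theta, a, b, c) for a, b, c in items)


# Two hypothetical items located at b = 0: one highly and one moderately discriminating.
info_high = item_information_3pl(0.0, a=2.0, b=0.0, c=0.0)
info_total = subscale_information(0.0, [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

Because information scales with a², subscales with higher discriminations yield taller information curves, and a nonzero c lowers the information an item provides.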

Fig. 3 Test information functions of SCQ current and lifetime

Differential Item Functioning

The analyses of differential item functioning (DIF) were first conducted separately on the two SCQ versions, both with males as the reference group and females as the focal group. Next, the two samples were combined for a DIF analysis contrasting the Current form and the Lifetime form. Table 3 presents results of the three sets of DIF analyses. For SCQ Lifetime, six out of the 39 items (15.4 %) demonstrated significant DIF across gender: Items 19 (friends) and 34 (imitative social play) of SI, Item 9 (inappropriate facial expression) of COMM, Item 6 (neologism) of AL, and Items 10 (use of other’s body to communicate) and 13 (circumscribed interests) of SB. The Differential Test Functioning (DTF) effect variances were small for SI (τ² = 0.03) and AL (τ² = 0.06), medium for COMM (τ² = 0.14), and large for SB (τ² = 0.19).

Table 3 DIF results by subscale and item

For the SCQ Current, while only five out of the 39 items (12.8 %) demonstrated significant DIF across gender, four of them (i.e., Items 11 [unusual preoccupations], 12 [repetitive use of objects], 13 [circumscribed interests], and 18 [carry objects around]) belonged to the SB subscale, meaning that 50 % of the SB items showed measurement bias. In other words, even given the same severity of stereotyped behaviors, parents of boys were more likely than those of girls to endorse option 1 (yes) of the statements “preoccupied with odd interests (e.g., traffic lights, drainpipes, or timetables)” (MH χ² = 7.10, BH-FDR corrected p = .02, MH LOR = +0.73), “odd interest in an object (e.g., spinning the wheels of a car)” (MH χ² = 10.52, p = .005, MH LOR = +0.83), and “interests that are unusual in intensity (e.g., trains or dinosaurs)” (MH χ² = 6.16, p = .03, MH LOR = +0.57). On the other hand, parents of girls were more likely to endorse option 1 (yes) of the statement “objects that she/he has to carry around (other than a soft toy or comfort blanket)” (MH χ² = 23.70, p < .0001, MH LOR = −1.17). The DTF effect variances were small for SI (τ² = 0.01), medium for AL (τ² = 0.08), and large for COMM (τ² = 0.15) and SB (τ² = 0.35). Finally, only one item (i.e., Item 13 [circumscribed interests] of SCQ Lifetime) showed significant nonuniform DIF (BD = 8.93, BH-FDR corrected p = .02).

In terms of the DIF results contrasting the SCQ Current and Lifetime, 16 out of the 39 items (41 %) demonstrated significant DIF and 14 of them were of SI (i.e., Items 21, 26 to 30, 32 to 35, and 37 to 40), meaning that parents responded differently to 70 % of the SI items according to whether they were administered in a Lifetime or Current version. Moreover, 50 % of the COMM items (k = 3) had significant nonuniform DIF as indicated by the BD statistics. The DTF effect variances were small for AL (τ² = −0.01) and SB (τ² = 0.03) and large for COMM (τ² = 0.17) and SI (τ² = 0.61).

Discussion

Whereas the majority of studies in the literature examined the external, utility validity (e.g., ROC curve analyses) of the Social Communication Questionnaire, the current study is a comprehensive review of the internal validity issues of both SCQ forms (i.e., Lifetime and Current). We could locate only one other study (Magyar, Pandolfi, and Dill 2012) that examined the internal validity of the SCQ. In Magyar et al. (2012), the sample consisted of approximately 70 participants and the analysis was limited to classical true score theory and factor analysis. In the current study, classical true score theory techniques were first utilized via the examination of the internal consistencies. Then, IRT techniques were utilized to examine item-level characteristics and measurement equivalence across items. In view of extant literature, the current study represents the largest sample among the validity studies on the SCQ. With this large sample available from NDAR, we consider our findings to offer valuable information for research and clinical practice on autism spectrum disorder.

Our findings indicate sufficient psychometric properties of the SCQ Lifetime form, but several measurement issues emerged with respect to the Current form. First, the SCQ Current subscales demonstrated lower internal consistencies and a weaker factor structure. In particular, the unidimensionality of the Current form subscales does not appear to be strong as indicated by the fit indices (Bentler 2007; Hu and Bentler 1999; Yu 2002). We need to note that when Berument et al. (1999) established the four-factor structure of the SCQ, no data were available for the evaluation of the Current form. Therefore, researchers need to carefully examine the factor structure of the SCQ when necessary in their studies. Second, the IRT-based examination indicated that the same items in the SCQ Lifetime may not perform as well when utilized in the Current form. That is, certain behaviors observed within a shorter time frame (i.e., the past 3 months for the Current form), rather than in one’s lifetime, may not be sufficient indicators of an individual’s potential for being diagnosed with ASD. In terms of the IRT parameters, the superiority of the Lifetime form mainly concerns item discrimination (a), in that the majority of the 39 items (92 %) demonstrated higher discrimination in the Lifetime form (see Table 2 and Figs. 1 and 2). When administered in the Current form, less than 50 % of the items showed reasonably high discrimination, and some (e.g., Items 17 [deliberate self-injury] and 19 [friends]) were low in discrimination. Lastly, there appear to be more significant c (pseudo-guessing) values in the Current form as compared with the Lifetime form, suggesting that parents are more likely to endorse option 1 on certain items in the Current form (e.g., Items 21 [imitation], 22 [pointing to express interest], and 23 [gestures]) when they would not be expected to given their otherwise low trait values. In addition, the higher incidence of the pseudo-guessing effect in the Current form items (18 % as compared with 5 % in the Lifetime items) also indicates that the reliability of some measures may be impaired when a narrow time frame is specified in the Current form.

In terms of DIF, our analyses reveal that the Lifetime form was established with sufficient measurement equivalence across gender because the percentage of items showing significant DIF was limited to 25 % or below within each subscale (Teresi 2006). However, we must note that the DIF in some Lifetime items (i.e., Items 6, 9, 10, 13, 19, and 34) will likely confound any item-level comparisons even though the subscale-level scores can be trusted. For instance, we do not recommend using the raw scores on Item 9 to examine gender differences in facial expression because the DIF results show that a parent is more likely to indicate that the child’s facial expression has been inappropriate if the child is a girl rather than a boy. On the other hand, the Current form warrants some concern with regard to the measure of stereotyped behaviors because 50 % of the SB items demonstrated significant DIF across gender. It is possible that the wording or examples of specific items are gender-stereotypical (e.g., “drainpipes,” “spinning the wheels of a car,” “soft toy,” “comfort blanket”) and may interfere with a parent’s or caregiver’s responding according to the child’s gender. While this observation is preliminary and certainly requires further investigation, there is empirical evidence that children’s gender-role development begins as early as 3 or 4 years old (Bussey and Bandura 1992, 2004). We recommend using more gender-neutral activities as examples in future revisions of the SCQ items.

It is worth particular attention that the DIF analysis contrasting the Lifetime and Current forms yielded significant and substantial DIF, particularly in the Social Interaction subscale. That is, a parent will likely respond differently according to which form is administered, even given the same level of social interaction skills of the child. Although the SI items were phrased differently in SCQ Current and Lifetime, in that an age limit was specified in the Lifetime version (e.g., “does she/he ever . . .” in the Current form versus “when she/he was 4 to 5, did she/he ever. . .” in the Lifetime form), the DIF is unlikely to be a function of this difference in wording given that six of the 14 items had significant DIF in favor of the reference group (MH LOR > 0) while the other eight had DIF in favor of the focal group (MH LOR < 0). Along with our finding that the unidimensionality of the SI subscale is particularly weak in the Current form, this suggests that measuring individuals’ social interaction skills for the purpose of ASD screening using the Current form may be problematic.

In summary, the IRT-based examination and analyses of measurement equivalence do not indicate good psychometric properties of the Current form, which aligns with the findings from the external validity studies of the SCQ. Our findings, however, indicate that simply lowering the cut-off may not solve the measurement issues because the instrument suffers from an insufficient and unbalanced distribution of item discriminations, as well as some degree of pseudo guessing effects. We caution researchers and clinicians about the administration and interpretation of the SCQ Current form. In particular, it may not be appropriate for some studies to use the Current form among children below 5 years old (e.g., Corsello et al. 2007; Lee et al. 2007; Oosterling et al. 2010) or other special situations (e.g., teacher-report; Schanding et al. 2012). First, the Current form was not intended as an alternative to the Lifetime form (Rutter et al. 2003a, b). Second, the SCQ was developed for individuals from 4 to 40 years old, and the downward extension to 2 years old needs to be undertaken with caution (Rutter et al. 2003a, b).

In Lee et al. (2007), researchers were advised via personal communication with SCQ publishers to use the current form of the SCQ over the lifetime form for individuals younger than 4 years old. For future research where the SCQ is to be used among children below 4 years old, we recommend modifying the wording of the Lifetime items (i.e., removing the timeframe “when she/he was 4 to 5”) rather than switching to the Current form where a 3-month time frame is specified. Given the psychometric properties revealed by this study, it is difficult to rely on the Current form to accurately measure change over time for evaluating the effectiveness of interventions, as intended by the test developers (Rutter et al. 2003a, b). We consider it necessary to administer the SCQ Current form along with the modified (i.e., time frame removed) Lifetime form for an examination of concurrent validity in such program evaluations. It is recommended that future studies investigate the association between the temporality (e.g., is the past 3 months an appropriate timeframe for the measure?) of certain behaviors and the potential of being diagnosed with ASD.

Finally, an investigation of age neutrality may be another promising direction for future research considering the wide age range (i.e., 4 to 40 years) that the SCQ purports to accurately measure. The age neutrality of the SCQ is questionable based upon previous research (e.g., Barnard-Brak et al., in progress; Corsello et al. 2007; Allen et al. 2007; Wiggins et al. 2007) given this broad range of ages. We suggest that some items, such as those concerning pronoun reversal, would not be developmentally appropriate at some ages but would be considered so at older ages. Other items would appear to be more resistant to item parameter drift (Wells, Subkoviak, and Serlin 2002) or the presence of longitudinal DIF, such as items concerning self-injurious behavior, which would ostensibly be infrequent at any age in the typically developing population. Future research should examine SCQ scores from a longitudinal perspective to discern the presence of DIF across time. The lack of non-uniform DIF would appear to support the ability of researchers to examine the presence of DIF across time without having to be concerned about the interaction of variables indicating the presence of cross-sectional DIF.

Limitations that emerged as part of conducting this study need to be noted. First, while our analyses benefit from the large samples available from the National Database for Autism Research (NDAR), too little demographic information was available to consider the data set nationally representative of individuals who may receive the SCQ. Second, the 295 participants we excluded from the DIF analysis contrasting the Lifetime and Current forms may contain valuable information because they have valid data for both forms. Future studies may examine, for example, how a caregiver of the same individual may respond differently to the Lifetime and Current forms. Interestingly, caregivers could rate an individual as having had a behavior on the Lifetime form but not endorse this same behavior on the Current form. This discrepancy could indicate that certain behaviors are a function of the developmental course of autism spectrum disorder, suggesting a lack of age neutrality of the SCQ. Finally, because our data were drawn from a set of studies from NDAR (e.g., “Computer Adaptive Testing of Adaptive Behavior of Children and Youth with Autism,” “Biological and Information Processing Mechanisms Underlying Autism”), there may be study-level characteristics that influence the data analysis and inferences. Future studies utilizing this data set may consider adopting multilevel analytic techniques if the nested nature of the data appears to be influential.

In conclusion, the current study provides a comprehensive examination of the internal validity of the SCQ according to both classical true score and item response theory techniques using the largest sample of SCQ responses in the extant, peer-reviewed literature. Item response theory techniques were emphasized in the discussion of implications because they allowed item-level characteristics and their interactions with respondents to be examined. We conclude that the Lifetime form of the SCQ appears to have sufficient psychometric properties, while the Current form of the SCQ appears to have more questionable qualities. We suggest that the Current form of the SCQ be used with particular caution given the evidence of its questionable internal validity.