Introduction

There is a critical need for the development of outcome measures that assess changes in social communication behaviors. Though most treatments for autism spectrum disorder (ASD) focus on improvements of social communication behaviors (Rogers and Vismara 2008; Wolery and Garfinkle 2002), the field of ASD intervention research in particular has struggled to find measures of treatment response that adequately capture changes in these behaviors (Anagnostou et al. 2015; Fletcher-Watson and McConachie 2015; McConachie et al. 2015). This is in part because changes in social communication behaviors are often subtle, making it difficult to find measures that are sensitive enough to capture small, though potentially meaningful, changes (Anagnostou et al. 2015; Cunningham 2012; Matson 2007; McConachie et al. 2015; Yoder et al. 2013).

Moreover, few of the measures available are flexible or standardized enough to be used across sites and studies, resulting in numerous studies that use a variety of different measures (Cunningham 2012; McConachie et al. 2015). A recent review noted that out of 195 behavioral intervention trials for ASD, over 200 different measurement tools were used to assess treatment response (Bolte and Diehl 2013). Sixty percent of these tools were used in only a single study, with only three tools used in more than 2 % of studies (Bolte and Diehl 2013). In addition, another study noted that 75 commonly used tools have little validity as outcome measures (McConachie et al. 2015).

Recently, a panel of ASD experts determined that only a handful of existing measures are appropriate for identifying treatment response in ASD and even these recommended instruments have significant limitations (Anagnostou et al. 2015; Scahill et al. 2015). For example, researchers often measure skills, such as cognitive or adaptive skills, which are not directly targeted in intervention (Matson 2007; Spence and Thurm 2010; Wolery and Garfinkle 2002). Alternatively, researchers use treatment response measures that are study-specific (Green et al. 2010; Kaale et al. 2012; Kasari et al. 2012; Rogers et al. 2012; Yoder et al. 2014). For example, researchers create a measure that captures the frequency of a specific operationalized behavior that is targeted in treatment, such as joint attention (Kaale et al. 2012) or imitation (Rogers et al. 2012). Although these measures may be helpful to identify changes in single behaviors in particular studies, they do not capture broader social communication changes (Spence and Thurm 2010). Even though researchers often measure similar social communication behaviors across different studies, these behaviors are often operationalized differently across measures, which interferes with interpretation of results beyond single studies (Wolery and Garfinkle 2002). Furthermore, focusing outcome measures on behaviors that are highly specific to a particular treatment could accentuate treatment effects that may not generalize beyond these specific measures (Yoder et al. 2013).

Other measures used are intended for screening, diagnosis, or measuring symptom severity (Anagnostou et al. 2015; Scahill et al. 2015). As such, these measures are usually not sufficiently sensitive for measuring change over short periods of time (e.g., months rather than years). For example, the Autism Diagnostic Observation Schedule (ADOS-2; Lord et al. 2012b, c), a measure intended for diagnostic purposes, has frequently been applied as an outcome measure. Using raw scores from the ADOS-2 has generally been unsuccessful in assessing changes (Owley et al. 2001), perhaps because ADOS-2 raw scores are not intended for use as interval data or for measuring change. When changes have been identified with ADOS-2 raw scores, the clinical significance of these changes may be limited since changes are also present in treatment-as-usual conditions when comparison groups are presented (Green et al. 2010; Gutstein et al. 2007).

The ADOS-2 Calibrated Severity Score (CSS; Esler et al. 2015; Gotham et al. 2009) has been useful in identifying changes over the course of years (Gotham et al. 2012; Lord et al. 2012a), but has been less successful in identifying changes over shorter periods of time (Brian et al. 2015; Dawson et al. 2010; Estes et al. 2015; Shumway et al. 2012; Thurm et al. 2015). Analyses of the Autism Diagnostic Interview, Revised (ADI-R; Lord et al. 1994), a parent interview used for diagnostic assessment, have also proven useful in identifying trajectories of change over the course of years (Gutstein et al. 2007; Lord et al. 2015; Sallows and Graupner 2005), but the measure's utility over shorter periods of time is unclear. A further hindrance to using measures such as the ADOS-2 and ADI-R is that they require significant training to administer and score reliably, as well as substantial time from patients and clinicians, limiting feasibility in large-scale, multi-site studies. Given these limitations, a recent review (Anagnostou et al. 2015) recommended against using ASD diagnostic measures, like the ADOS-2, as outcome measures, though use of these tools was previously encouraged for this purpose (Cunningham 2012; Matson 2007).

An additional limitation to measures commonly used in clinical trials is the reliance on caregiver or clinician report (Anagnostou et al. 2015; Bolte and Diehl 2013), such as Clinical Global Impressions (CGI; Busner and Targum 2007), because placebo effects are particularly strong for caregiver or clinician report measures. These effects may outweigh more subtle changes that occur over time or in response to interventions (Guastella et al. 2015; Lord et al. 2012a; Owley et al. 2001). In a recent paper, caregiver-report measures of response to treatment were more related to the caregiver’s belief that the child was receiving the experimental treatment than to the treatment itself (Guastella et al. 2015). A second, related issue that limits measurement of treatment effects is “unblinding,” which is often inherent in caregiver or clinician reports. For example, in treatments that have significant side effects, caregivers and clinicians are frequently aware if the child is experiencing these other changes. Bias associated with unblinding is often inherent since few studies use reporters of outcome measures who are blind to the child’s treatment status (Wolery and Garfinkle 2002). Last, measures used to capture a broad range of social-communication behaviors are often confounded by co-occurring intellectual deficits and behavior and/or language problems (Hus et al. 2013). The influence of these confounds may make it difficult to disentangle meaningful changes in ASD-specific social-communication behaviors from other non-ASD-specific behaviors.

The limitations of currently used measures interfere with the ability of clinicians and researchers to measure the effectiveness of interventions, perhaps contributing to the fact that few ASD interventions have met standard criteria for efficacy (Chambless and Hollon 1998; Danial and Wood 2013; Levy et al. 2009; Rogers and Vismara 2008; Spreckley and Boyd 2009). Given this critical need, researchers have begun to focus efforts on developing measures that are sensitive to change (Fletcher-Watson and McConachie 2015; McConachie et al. 2015). The Brief Observation of Social Communication Change (BOSCC) is an initial attempt by our group to address the limitations of commonly used measures. The BOSCC is a new measure consisting of specific items that were developed to identify changes in social-communication behaviors over relatively short periods of time by quantifying subtleties in both the frequency and the quality of specific behaviors. The goal of the BOSCC is to provide researchers and clinicians with an outcome measure that is flexible, easy to code, and minimally biased by caregiver or clinician report. The BOSCC is flexible enough to be used across a variety of settings (e.g., across multi-site studies, in clinics, or at home) and to be coded by a clinician or researcher who is blind to the child's treatment status and who may be relatively new to ASD research (e.g., a research assistant).

The BOSCC described in this work is applicable to minimally-verbal, young children. The BOSCC is a coding scheme that was developed by modifying and expanding codes from the ADOS-2 (Lord et al. 2012b) to capture more subtle variations in behaviors. In this initial paper, we use the BOSCC to measure behaviors in children with ASD, although applications of the BOSCC may extend to other disorders with deficits in social-communication (e.g., language impairments, social/pragmatic communication disorder, and social anxiety disorder). The goal of this paper is to provide preliminary evidence for the utility of the BOSCC as a treatment outcome measure. The specific aims are to (1) determine items for inclusion in the final BOSCC coding scheme through exploration of item correlations, (2) explore the relationship between items using factor analysis, (3) examine its psychometric properties, including inter-rater and test–retest reliability, and (4) provide initial evidence for validity through explorations of changes in BOSCC scores over time compared with changes in scores from other standard measures. In order to provide an example of how BOSCC data could be used in a clinical trial, we present data in a similar manner to existing early intervention research.

Method

Participants

Fifty-six children (44 males) with a Best Estimate Clinical Diagnosis (BEC; Anderson et al. 2014) of ASD were included in this study. Diagnoses of ASD were determined based on thorough diagnostic evaluations, including administration of the ADI-R (Lord et al. 1994) and the ADOS-2 (Lord et al. 2012b, c). All participants had elected to join various treatment studies (Kasari et al. 2010; Rogers et al. 2012; Wetherby et al. 2014) depending on which studies were available at the time and were then randomized into a treatment condition at the University of Michigan Autism and Communication Disorders Center (UMACC; n = 49) or the Center for Autism and the Developing Brain (CADB; n = 6), with the exception of one participant. Data from this one child was extracted from an existing database of children whose parents had provided written informed consent for their child’s clinical information/assessments to be included in an Institutional Review Board (IRB)-approved database. All children included in this study with the exception of one child were receiving some form of intervention while participating, either through the treatment condition in the clinical trials or elsewhere, though the interventions varied in frequency and type (see Kasari et al. 2010; Rogers et al. 2012; Wetherby et al. 2014 for details regarding intervention trials). For the one child who was not receiving any form of intervention, only one BOSCC observation was available; accordingly, this child was not included in analyses of change over time. For the purposes of this initial work, which focuses on the validity and reliability of the BOSCC, the effects of specific treatment conditions are not explored; future work will address this question. All children included in the study were between 1 and 5 years of age with minimal spontaneous language (simple phrase speech or less, equivalent to ADOS-2 Module Toddler, 1, or 2), as is appropriate for the current BOSCC coding scheme (described below). See Table 1 for demographic and initial observation information.

Table 1 Background and first observation information (n = 56)

Primary Measure (BOSCC)

For the purposes of assessing the initial psychometric properties and validity, the BOSCC coding scheme was applied to 10-min videos of free-play interactions between a caregiver and a child, gathered over the course of the child’s participation in an intervention trial. A parent was the play partner for the majority of the BOSCC observations (97 %, n = 171), with 94 % (n = 160) of these conducted with mothers. For the remaining videos, the interaction was gathered with the child and another caregiver (e.g., grandparent). The majority of observations were gathered in the clinic setting (n = 147, 83 %) while the remaining observations were gathered at the child’s home. These play samples were determined to be adequate for applying the BOSCC coding scheme as they comprised many of the elements recommended for BOSCC observations including minimal structure, a variety of toys (such as cause and effect and pretend objects), and duplicates of toys (in order to promote interactive play). Caregivers were given minimal instruction and simply told to play “how you typically would” with the child.

Between one and eight videos were available per child. Two or more videos were available for a subset of children (n = 50) with an average of 5.9 months (SD = 3.1) from first to last video observation. Children were between the ages of 12 and 56 months at their first observation (M = 29, SD = 11) and between the ages of 18 and 62 months at their last observation (M = 35, SD = 11).

The original BOSCC coding scheme consisted of 16 items coded on a 6-point scale from 0 (abnormality is not present) to 5 (abnormality is present and significantly impairs functioning). Nine items related to social communication behaviors; one of these items was subsequently eliminated (see “Preliminary Analyses” section below). One item related to play and three items related to restricted, repetitive behaviors/interests seen in ASD. Three items were used as markers of Other Abnormal Behaviors often seen in ASD, although these behaviors were rarely observed in this sample of children playing with their caregiver(s). Nevertheless, these items were deemed relevant for the BOSCC because they may be useful in the future for determining whether a BOSCC observation is valid (e.g., high scores on these items may suggest that the BOSCC observation was not representative of the child’s typical behavior).

Each BOSCC item is coded using a novel, empirically-based decision tree, which captures detailed information about specific behaviors, including, for example, information about a behavior’s frequency and quality (see Supplementary Figure 1 for example item). At each branch of the decision tree, the coder answers a question about the child’s behavior before proceeding on to the next question or arriving at a code. For example, the Directed Vocalizations item first asks whether the child directs vocalizations to another person (branch 1), then asks whether this ever occurs beyond directed echoed or highly routinized speech (branch 2), how often these more flexible directed vocalizations occur (branch 3), in what pragmatic contexts these occur (branches 4 and 5), and in how many activities (branch 6). The BOSCC is coded in two 5-min segments of a 10-min video (first 5-min segment = Segment A, second 5-min segment = Segment B), with codes averaged across segments. The initial coding process relied on viewing each video segment (5-min) one time and then coding. Over the course of development, this process was modified such that each video segment was watched and coded twice, with the second codes deemed final and used for analyses in this study. Observing and coding each segment twice resulted in greater accuracy in capturing behaviors, higher reliability amongst coders on individual items, and greater confidence in coding decisions. Coding a BOSCC video takes a trained coder about 30 min to complete.
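To make the decision-tree logic concrete, the sketch below implements a simplified, hypothetical version of the Directed Vocalizations item in Python. The branch wording, thresholds, and score mapping are illustrative assumptions for exposition only and do not reproduce the actual BOSCC coding rules.

```python
# A minimal sketch of a decision-tree item, assuming simplified, hypothetical
# branch questions and score mappings; the actual BOSCC rules are defined in
# the coding manual and differ from this illustration.

def code_directed_vocalizations(directs_vocalizations: bool,
                                beyond_routinized_speech: bool,
                                flexible_frequency: str,
                                pragmatic_contexts: int,
                                num_activities: int) -> int:
    """Return a code from 0 (abnormality not present) to 5 (marked abnormality)."""
    # Branch 1: does the child ever direct vocalizations to another person?
    if not directs_vocalizations:
        return 5
    # Branch 2: does this ever go beyond echoed or highly routinized speech?
    if not beyond_routinized_speech:
        return 4
    # Branch 3: how often do flexible directed vocalizations occur?
    if flexible_frequency == "rare":
        return 3
    # Branches 4-5: in how many pragmatic contexts do they occur?
    if pragmatic_contexts <= 1:
        return 2
    # Branch 6: across how many activities?
    return 0 if num_activities >= 3 else 1

# Example: occasional flexible vocalizations in two pragmatic contexts across
# two activities would receive a code of 1 under this illustrative mapping.
print(code_directed_vocalizations(True, True, "occasional", 2, 2))
```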

Coders of data presented here were one psychologist, one psychiatrist, one clinical psychology graduate student, and several research assistants. All coders were blind to the child’s treatment status as well as the treatment time point. Before coding independently, coders obtained inter-rater agreement standards that the authors deemed adequate across both segments A and B: no more than three items with more than one point disagreement AND within three points across summed totals for all items, across three consecutive videos. Training involved review of the BOSCC coding scheme, practice watching and coding video observations, and participation in coding discussions with reliable coders. How quickly trainees reached these inter-rater agreement standards varied though most met standards after practice coding approximately ten to twelve videos. Codes from coders that were “in training” (had not yet met the above inter-rater agreement standards) were never included in datasets. Most coders of the BOSCC used in this study (September 2015 version) had been involved in coding that used previous versions of the BOSCC coding scheme while it was under development; as such, these coders, though many were bachelor-level assistants with limited previous ASD experience, had had exposure to the BOSCC measure over several months. In addition, coders began training on the BOSCC at different points in the study, each participating in coding and consensus discussions of videos. Codes were only used in this study from coders who had attained the inter-rater agreement standards described above.
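As a rough illustration, the sketch below checks the two per-video agreement criteria described above for a pair of codings. The requirement that the standard be met across three consecutive videos would be applied on top of this check, and the data layout is assumed for the example.

```python
# Minimal sketch of the per-video inter-rater agreement check, assuming each
# coding is an equal-length list of item codes summed across segments A and B.

def meets_agreement_standard(trainee_codes, reference_codes,
                             max_items_off_by_more_than_one=3,
                             max_total_difference=3):
    """Return True if the two codings meet the per-video agreement criteria."""
    items_off = sum(1 for t, r in zip(trainee_codes, reference_codes)
                    if abs(t - r) > 1)
    total_difference = abs(sum(trainee_codes) - sum(reference_codes))
    return (items_off <= max_items_off_by_more_than_one
            and total_difference <= max_total_difference)

# Example: one item differs by two points and the totals differ by two points,
# so this pair of codings passes both criteria.
print(meets_agreement_standard([2, 3, 1, 0, 4], [2, 1, 1, 0, 4]))  # True
```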

A random sub-sample of videos (approximately every 6th video) was chosen for coding by multiple coders (ranging from 2 coders to 7 coders, depending on coder availability) in order to ensure that inter-rater agreement standards were retained over time and to assess inter-rater reliability (see below). During consensus meetings for these multiply-coded videos, coders determined consensus codes; validity data presented here uses consensus codes (16 %, 28 videos) when applicable (but these codes were not used for inter-rater reliability, see below).

Additional Measures

As part of participation in the intervention trials, children completed several assessments, including assessments of cognitive functioning, adaptive functioning, and diagnostic assessments. These additional measures provided an opportunity to explore the convergent validity of the BOSCC. See Table 2 for a summary of measures included.

Table 2 Information about assessments gathered

Adaptive Functioning

The Vineland Adaptive Behavior Scales (VABS; Sparrow et al. 2005) was completed with the caregiver(s) of a subset of children (n = 31) at two or more time points. The VABS is a caregiver interview of adaptive functioning that provides standard scores in the domains of socialization, communication, daily living, and motor skills as well as an overall adaptive behavior composite standard score (ABC). See Table 1 for information about VABS Domain scores at the initial observation.

Cognitive Functioning

Children were administered either the Mullen Scales of Early Learning (MSEL; Mullen 1995) or the Differential Ability Scales (DAS; Elliott 2007), depending on the child's ability level. The MSEL (collected from 36 children at two or more time points) provides standard scores in the domains of expressive language, receptive language, visual reception (non-verbal problem-solving), and fine motor skills. The DAS provides standard scores in the domains of verbal and nonverbal cognition. Ratio IQs were calculated for some children for whom norm-referenced standard scores could not be derived because the child's age exceeded standard cut-offs and/or the child's developmental level was too low for standard metrics (see Bishop et al. 2011). None of the children received the DAS at more than one time point. As a result, only the participants with multiple MSEL scores were included in analyses addressing change in cognitive scores. See Table 1 for information about cognitive functioning at the first observation.

ASD Symptoms

The Autism Diagnostic Observation Schedule, 2nd Edition (ADOS-2; Lord et al. 2012b, c) was administered to 55 children at one time point with most children receiving ADOS-2 Module 1 or the Toddler Module (85 %, n = 47), while the remaining children (n = 8) received ADOS-2 module 2. A subset of children (n = 41) received an ADOS-2 at two or more time points, allowing for exploration of change over time. The ADOS-2 obtains information about a diagnosis of ASD through direct observation by a clinician. All clinicians involved in administering the ADOS-2 established research reliability on the measure prior to administration. None of the clinicians involved in administering/scoring of the ADOS were involved in coding of the BOSCC. The ADOS-2 provides CSS for the algorithm total (CSS Overall) and domain severity scores in the areas of Social Affect (CSS SA) and Restricted and Repetitive Behavior (CSS RRB; Esler et al. 2015; Gotham et al. 2009). These scores provide a cross-module comparison that takes into account language level and age. See Table 1 for information about the ADOS-2 CSS at the first observation.

Clinical Global Impression-Improvement (CGI)

The CGI is a measure used by clinicians to evaluate whether an individual is responding to treatment (Busner and Targum 2007). Clinicians rate the participant’s level of improvement on a 7-point scale ranging from “very much improved” (1) to “very much worse” (7). The CGI was collected on six children who participated in an intervention trial at CADB, for whom we also had BOSCCs at multiple time points. None of the clinicians who rated the CGI also coded the BOSCC.

Data Analysis

Preliminary Analyses

Over several versions of the BOSCC coding scheme, numerous codes and coding structures were generated and tested. Given the goals of the BOSCC, a uniform distribution over the coding range for items was desirable, although not always attained. Item codes were re-written over several versions to better achieve this distribution. As changes were made to the BOSCC while under development, videos were re-coded to reflect these changes. Other studies have used a preliminary version of the BOSCC (from February 2014; Fletcher-Watson et al. 2015; Kitzerow et al. 2015). All data in this study used an updated version of the BOSCC coding scheme (September 2015 version).

Using the final version of the BOSCC described in this paper (September 2015 version), Fig. 1 depicts the averaged (across segment A and B) item distributions for the 12 BOSCC items (BOSCC Core). Since many children with ASD do not show all of the coding ranges for RRBs (Kim and Lord 2010), we did not expect normal or uniform distributions for the three items related to these behaviors, namely Sensory Interests, Hand/Finger Mannerisms, and Restricted/Repetitive Behaviors/Interests. Few children were scored as having Other Abnormal Behaviors (Supplementary Figure 2) and these items were therefore not included in subsequent analyses; however, these items were retained in the BOSCC coding scheme as they may provide valuable information when determining whether the BOSCC observation provides a valid representation of the child’s behavior (e.g., if child was very irritable, this observation may not be representative). In addition, in order to ensure that no coder was coding any item significantly differently than other coders, ongoing reliability checks of individual coders were conducted; overall, there were no significant coding discrepancies between coders except for one coder who consistently under-scored behaviors in the RRB domain. As such, Sensory Interests, Hand/Finger Mannerisms, and Restricted/Repetitive Behaviors/Interests that were coded by this coder were re-coded by other coders.

Fig. 1
figure 1

Distributions for 12 core BOSCC items (averaged across segments A and B). Note Solid red represents items in the social communication domain; striped blue represents items in the restricted, repetitive behavior domain (Color figure online)

A correlation matrix of the BOSCC items was constructed, which indicated that the correlation between the Shared Enjoyment and Facial Expressions items exceeded 0.7, suggesting substantial overlap in the behaviors captured by these codes. Facial Expressions had a more uniform distribution across the coding range (0–5) and was thus retained, while Shared Enjoyment was eliminated from the measure and subsequent analyses.
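A minimal sketch of this item-redundancy screen is shown below, assuming the item codes are held in a pandas DataFrame with one (hypothetical) column per item; the 0.7 threshold matches the criterion described above.

```python
import pandas as pd

def flag_redundant_item_pairs(items: pd.DataFrame, threshold: float = 0.7):
    """Return item pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = items.corr()
    pairs = []
    columns = list(corr.columns)
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            if abs(corr.loc[first, second]) > threshold:
                pairs.append((first, second, round(corr.loc[first, second], 2)))
    return pairs
```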

In order to determine domain scores, exploratory factor analyses (EFA) were conducted for the 12 Core BOSCC items (Shared Enjoyment removed, see above). For the EFA, scores for the three RRB items with skewed distributions (Sensory Interests, Hand/Finger Mannerisms, and Restricted/Repetitive Behaviors/Interests) were collapsed to 3 or 4 categories based on the item distribution and treated as ordinal scores. EFA was conducted using all available codings, which included multiple codings from different coders of the same video (308 total available codings). Analyses were undertaken in Mplus (Muthén and Muthén 1998–2012) using a promax oblique rotation, taking into account the multiple codings by using the complex survey adjustment with the child as the cluster-level unit. EFA of the 12 Core items, with the last three items treated as categorical, gave eigenvalues of 5.48, 1.58, and 1.05 and RMSEA values of 0.107, 0.067, and 0.037 for the one-, two-, and three-factor solutions, respectively (see Tables 3, 4).
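For readers who wish to explore comparable solutions outside Mplus, the sketch below shows a simplified exploratory factor analysis with a promax rotation using the Python factor_analyzer package. It does not reproduce the features used here (categorical treatment of the three RRB items, the complex survey adjustment clustering by child, or RMSEA computation), and the column layout is assumed.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Simplified EFA sketch; `codes` is assumed to be a DataFrame with one row per
# coding and one (hypothetical) column per Core BOSCC item.
def run_efa(codes: pd.DataFrame, n_factors: int = 2):
    fa = FactorAnalyzer(n_factors=n_factors, rotation="promax")
    fa.fit(codes)
    eigenvalues, _ = fa.get_eigenvalues()  # eigenvalues of the correlation matrix
    loadings = pd.DataFrame(fa.loadings_, index=codes.columns)
    return eigenvalues, loadings
```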

Table 3 Brief Observation of Social Communication Change (BOSCC) exploratory factor analysis model comparison
Table 4 1, 2, and 3-Factor model factor loadings for Brief Observation of Social Communication Change (BOSCC) items

For subsequent analyses, the two-factor solution was chosen as a plausible, parsimonious fit for the data: its eigenvalues were substantially greater than 1, its RMSEA value was under 0.07 (Browne and Cudeck 1993), and it overlapped theoretically with other two-factor solutions reported in the ASD literature (Guthrie et al. 2013; Mandy et al. 2012; Shuster et al. 2014). Factor 1, the social communication domain, consisted of items 1–8 (SC domain). Although some studies suggest that play is a separate factor (Boomsma et al. 2008; van Lang et al. 2006), the play item, which cross-loaded on factors 1 and 2, was placed in the RRB domain (items 9–12) for subsequent analyses because its content relates most closely to play with materials rather than social aspects of play. Subsequent analyses refer to these two domains (SC and RRB) as well as the Core total (items 1–12; see Fig. 2). As described above, the three items related to Other Abnormal Behaviors were not included due to the rare presentation of these behaviors in this sample of children.

Fig. 2
figure 2

BOSCC items, domains, and total. Note RRB restricted, repetitive behavior/interest

Primary Statistical Analyses

Inter-rater Reliability

Sums for items in the factors (domains) defined by the EFA results were calculated as well as a sum for Core items (1–12). As described above, approximately every 6th video (28 videos) was coded by multiple coders; these double codings were used to obtain estimates for inter-rater reliability by randomly selecting two coders when more than two coders (up to 7 coders) coded a video. Consensus codes (mutually agreed upon codes for multiply-coded videos) for these 28 (16 %) videos were not used for inter-rater reliability. Rather, original codes from two randomly selected coders were used for this purpose. Intraclass correlations (ICCs) for inter-rater reliability on domains (SC and RRB defined from the EFA) and Core total (items 1–12) were obtained from linear mixed models (xtmixed in Stata 14). ICCs were the square root of the ratio of common variance component to the sum of common and error variances, with confidence intervals obtained using the delta-method. For the three skewed items (Sensory Interests, Hand/Finger Mannerisms, and Restricted/Repetitive Behaviors/Interests) these results should be interpreted cautiously. ICCs for individual items (summed from segment A and B) were also calculated.
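A simplified variance-components version of this calculation is sketched below using Python's statsmodels, with hypothetical column names. It returns the usual between-video/(between-video plus error) variance ratio and does not reproduce the Stata-based square-root formulation or the delta-method confidence intervals described above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Sketch of an ICC from a random-intercept model; `df` is assumed to be a
# long-format DataFrame with columns "video_id", "coder", and "core_total".
def icc_from_mixed_model(df: pd.DataFrame, score_col: str = "core_total") -> float:
    model = smf.mixedlm(f"{score_col} ~ 1", df, groups=df["video_id"]).fit()
    var_between = float(model.cov_re.iloc[0, 0])  # common (video-level) variance
    var_error = float(model.scale)                # residual (coder/error) variance
    return var_between / (var_between + var_error)
```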

Test–Retest Reliability

For estimation of test–retest reliability, a test–retest sub-sample of 40 videos from 20 individuals, gathered on two occasions less than one month apart, was randomly assigned to available coders. ICCs on domains (defined from the EFA) and the Core total were estimated as described above. ICCs were also estimated for individual items (summed from segment A and B).

Validity

To assess the validity of the BOSCC as a measure of relevant change, first, paired t tests with α = 0.05 and effect sizes of change (Cohen's d; the mean difference between first and last observation divided by the pooled standard deviation) were used to examine whether significant changes in BOSCC and ADOS-2 CSS scores were present from the first to last observation. Second, individual change models were fitted to all the available data on each child for the BOSCC Core total (items 1–12), the VABS communication score, the MSEL receptive language score, and the ADOS-2 CSS (treated as a 10-point ordinal scale) in order to include the multiple observations available on the same individual (see Table 2). These analyses were also conducted on the BOSCC SC domain (items 1–8). Specifically, for each participant in turn, a linear regression was fitted and the coefficient associated with age at assessment was used as that participant's average rate of change. To assist comparison across measures, we standardized the expected change over 6 months by each measure's standard deviation at baseline. This can be thought of as the effect size (Cohen's d) that would have been obtained with each measure had these intervention children all been followed for 6 months from baseline and been compared to a randomized control group showing no change. A comparison of these effect sizes for the ADOS-2 CSS and BOSCC was constructed using the delta-method following a multivariable regression. Third, correlations of cross-sectional and change scores across these measures were conducted to assess convergent validity. Fourth, discriminant validity and coding contamination from maternal education and family income were tested by examining their association with BOSCC scores when included as fixed predictors within a mixed effects model for the repeated BOSCCs.
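The sketch below illustrates the individual change models and the standardized 6-month change in simplified form, assuming a long-format DataFrame with hypothetical column names; it omits the delta-method comparison of effect sizes across measures.

```python
import numpy as np
import pandas as pd

# Sketch of the per-participant change models; `df` is assumed to have columns
# "child_id", "age_months", and "score" for one measure at a time.
def standardized_six_month_change(df: pd.DataFrame) -> float:
    slopes, baselines = [], []
    for _, child in df.groupby("child_id"):
        child = child.sort_values("age_months")
        if len(child) < 2:
            continue  # a slope requires at least two observations
        slope, _intercept = np.polyfit(child["age_months"], child["score"], 1)
        slopes.append(slope)                      # average rate of change per month
        baselines.append(child["score"].iloc[0])  # score at first observation
    # Expected 6-month change divided by the baseline standard deviation
    # (an analogue of Cohen's d against a no-change comparison group).
    return 6 * float(np.mean(slopes)) / float(np.std(baselines, ddof=1))
```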

Post-hoc Analyses

Given the phenotypic heterogeneity of ASD, it was expected that not all children would respond to treatments (Rogers and Vismara 2008; Spence and Thurm 2010). Therefore, responders and non-responders were identified based on changes from first to last observation on other measures of social and communication skills used as outcomes in previous studies (MSEL, VABS, ADOS-2; Dawson et al. 2010; Wetherby et al. 2014). First, responders were defined based on MSEL Receptive Language and, second, based on VABS Communication Standard Scores, consistent with changes observed in recent intervention trials (Dawson et al. 2010; Wetherby et al. 2014). Specifically, children who demonstrated an increase in MSEL Receptive Language standard score of ≥5 points (1/2 standard deviation) were defined as responders (n = 15) while the remaining children were defined as non-responders (n = 21). Using the VABS Communication standard score, children were defined as responders if they demonstrated an increase of ≥8 points (1/2 standard deviation; n = 16), while the remaining children were defined as non-responders (n = 15). Third, children were defined as responders if they demonstrated an ADOS-2 CSS decrease of ≥1 point (1 standard deviation; n = 16), while the remaining children were defined as non-responders (n = 25). Convergent validity was assessed using t tests comparing the amount of change in BOSCC SC and RRB domains and Core Totals between responder and non-responder groups as defined by these measures.
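A minimal sketch of these responder definitions is shown below; the dictionary keys are hypothetical and the thresholds follow the criteria described above.

```python
# Sketch of responder classification from first- and last-observation scores;
# `first` and `last` are dicts of scores with hypothetical keys.
def classify_responder(first: dict, last: dict, criterion: str) -> bool:
    if criterion == "MSEL":   # gain of >= 5 points on MSEL Receptive Language standard score
        return last["msel_receptive"] - first["msel_receptive"] >= 5
    if criterion == "VABS":   # gain of >= 8 points on VABS Communication standard score
        return last["vabs_communication"] - first["vabs_communication"] >= 8
    if criterion == "ADOS":   # decrease of >= 1 point on the ADOS-2 CSS
        return first["ados_css"] - last["ados_css"] >= 1
    raise ValueError(f"Unknown criterion: {criterion}")
```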

Finally, in order to explore whether decreases in the BOSCC domain scores align with clinician’s impressions of improvement, BOSCC scores for six children participating in an early intervention trial at CADB (DeGeorge et al. in preparation) were separated into responders and non-responders based on CGI scores. Specifically, four children received CGI scores of “much improved” (responders) while two children received CGI scores of “no change” (non-responders). No statistical analyses were conducted on these six children due to small sample size.

Results

Inter-rater Reliability

The estimated inter-rater reliability from the 28 videos that were coded by multiple coders (two randomly selected coders) was excellent for SC and RRB domains, as well as for the Core Total, with ICCs ranging from 0.97, 95 % CI [0.94–0.99], to 0.98, 95 % CI [0.96–0.99]. ICCs of individual items (sums across segment A and B) ranged from 0.72, 95 % CI [0.53–0.91] to 0.96, 95 % CI [0.93–0.99] (see Supplementary Table 1).

Test–Retest Reliability

Using a sub-set of children (n = 20) with two video observations separated by less than one month (40 videos total), the estimated test–retest reliabilities (ICCs) were high: 0.89, 95 % CI [0.77, 0.98], for the social-communication domain, 0.79, 95 % CI [0.62, 0.96], for the RRB domain, and 0.90, 95 % CI [0.81, 0.98], for the Core total. ICCs of individual items (sums across segment A and B) ranged from 0.44, 95 % CI [−0.05, 0.92] to 0.89, 95 % CI [0.81, 0.98] (see Supplementary Table 2).

Validity

First, results of paired t tests indicated that from first to last BOSCC observation (n = 50), statistically significant changes were found in the Core total (M = −2.53, SD = 8.01), t(49) = 2.23, p < 0.05, corresponding to an effect size of 0.26, although changes in the separate SC and RRB domains were not statistically significant. Paired t tests from first to last ADOS-2 observation (n = 41) indicated that there were no statistically significant changes in ADOS-2 CSS (M = −0.29, SD = 1.75, d = 0.15), ADOS-2 SA CSS (M = −0.42, SD = 1.91, d = 0.21), or ADOS-2 RRB CSS (M = 0.42, SD = 1.84, d = 0.20). See Table 1 for amount of time between first and last ADOS-2 and BOSCC observations.

Second, results from individual growth curve models indicated that the average rate of change in the ADOS-2 CSS score over 6 months was 0.33, which corresponded to an effect size of −0.15, 95 % CI [−0.44, 0.15]. The average rate of change in the BOSCC Core Total over 6 months was −4.2, corresponding to an effect size of −0.37, 95 % CI [−0.73, −0.01]. Corresponding values for the BOSCC SC domain score were −3.4 with an effect size of −0.38, 95 % CI [−0.81, 0.05]. Though the effect sizes were larger for the BOSCC, a comparison of the difference in effect sizes between changes in BOSCC Core Total and changes in ADOS-2 CSS indicated no statistically significant difference (p = 0.35). However, the effect size of the BOSCC Core total was statistically different from a no change alternative (p < 0.05) while the effect sizes of the ADOS-2 CSS and BOSCC SC domain were not statistically different from a no change alternative (p = 0.33 and p = 0.08, respectively).

Third, in cross-sectional correlations, the BOSCC Core total and the ADOS-2 CSS score were strongly associated (Pearson correlation of 0.48, cluster robust p < 0.001). When correlating change scores to assess convergent validity, the MSEL Receptive Language and VABS Communication Standard scores showed highly correlated change scores (r = 0.69, p < 0.001). For the ADOS-2 CSS change score, evidence for convergent validity with the MSEL Receptive Language and the VABS Communication Standard score was neither significant nor consistent, while for the BOSCC Core total, correlations were in the expected direction and, in the case of the MSEL Receptive Language, approached significance (r = −0.35, p = 0.05). The correlation of ADOS-2 CSS to change in ADOS-2 CSS was 0.28 (p = 0.08) and of the BOSCC Core total to change in BOSCC Core total was −0.37 (p = 0.08).

Fourth, results of discriminant validity analyses indicated no associations of maternal education or family income with the BOSCC social communication domain (χ2(2) = 1.94, p = 0.38), RRB (χ2(2) = 1.75, p = 0.42) domain or the BOSCC Core Total (χ2(2) = 1.53, p = 0.47). There was also no association of maternal education and family income with the ADOS-2 CSS (χ2(2) = 3.40, p = 0.18).

Post-hoc Analyses

T tests comparing the amount of change in BOSCC scores between the groups indicated that the MSEL responder group demonstrated significantly more change in the BOSCC SC domain (t(34) = 3.04, p < 0.01) and the BOSCC Core total (t(34) = 3.58, p < 0.01) than the MSEL non-responder group (see Fig. 3). Results of t tests also indicated that the VABS responder group demonstrated significantly more change in the BOSCC RRB domain (t(29) = 2.51, p < 0.05) and the BOSCC Core Total (t(29) = 2.40, p < 0.05) than the VABS non-responder group. In contrast, changes in the BOSCC domains and the BOSCC Core total did not differ significantly between the ADOS-2 CSS responder and non-responder groups.

Fig. 3
figure 3

Responder groups defined by MSEL, VABS, or ADOS-2 in early intervention studies. Note *p < 0.05, **p < 0.01; ADOS-2 Autism Diagnostic Observation Schedule, 2nd Edition, BOSCC Brief Observation of Social Communication Change, CSS ADOS Calibrated Severity Score, MSEL Mullen Scales of Early Learning, n.s. not significant, RRB Restricted and Repetitive Behaviors BOSCC Domain, SC Social Communication BOSCC domain, VABS Vineland Adaptive Behavior Scales

As shown in Fig. 4, with the exception of the BOSCC RRB domain, from first to last time point, BOSCC scores for the CGI responders consistently decreased more than the CGI non-responders. Figure 4 is provided for illustrative purposes since no statistical analyses were conducted on these groups given the small sample size.

Fig. 4
figure 4

Responder groups defined by Clinical Global Impression (CGI) in community-based intervention study. BOSCC Brief Observation of Social Communication Change, CGI Clinical Global Impression-Improvement, RRB Restricted and Repetitive Behaviors BOSCC domain, SC Social Communication BOSCC domain

Discussion

Results of these initial analyses suggest that the BOSCC is a promising outcome measure that is sensitive to subtle changes in social communication behaviors over time. To our knowledge, the BOSCC is the first brief, observation-based measure of treatment response specific to a broad range of social communication behaviors. A two-factor model consistent with other models of ASD symptoms, supporting a social communication domain separate from RRBs (Guthrie et al. 2013; Mandy et al. 2012; Shuster et al. 2014), fit the item data satisfactorily. The separation of the two domains allows future researchers to explore changes in social communication skills in children with social-communication impairments who do not necessarily have RRBs or meet criteria for ASD. Analyses of the psychometric properties of the BOSCC indicate that the BOSCC has excellent inter-rater reliability and high test–retest reliability, meeting recommended standards (Cunningham 2012) and consistent with other work using an earlier version of the measure (Kitzerow et al. 2015).

Results indicate that changes in BOSCC scores over a 6-month period demonstrated small to medium effect sizes, though the effect size varied a little depending on the statistical method used. Although these changes were not statistically different than the effect sizes of change seen in the ADOS-2 CSS, the effect size itself, considering the small sample size, is promising. In addition, the BOSCC scores demonstrated statistically significant changes over time while the ADOS-2 CSS scores did not, when compared to a no change alternative. The BOSCC may be more sensitive to changes in social communication behavior than the ADOS-2 CSS, and hence more successful in identifying changes in response to treatments over shorter periods of time (Brian et al. 2015; Dawson et al. 2010; Shumway et al. 2012; Thurm et al. 2015). Additional work is clearly needed to confirm this hypothesis.

This work is the first indication that the BOSCC has convergent validity with social communication changes seen in other measures, including a caregiver report measure (VABS) and a standardized cognitive measure (MSEL). There is also some preliminary evidence of convergent validity with a clinician’s impression of improvement (CGI) in a very small sample. The BOSCC Core total and the ADOS-2 CSS score were highly correlated with each other, although there was not a significant correlation between change in the BOSCC Core total and change in the ADOS-2 CSS score. These findings suggest that the BOSCC may be measuring behaviors, especially subtle behaviors that are improving, differently than the ADOS-2 CSS. Alternatively, this finding may be related to the limited range of change or limited range of scores overall in the ADOS-2 CSS scores, consistent with other studies (Dawson et al. 2010).

In contrast, correlations of BOSCC change scores with changes in receptive language (MSEL) and communication skills (VABS) were, though not statistically significant, in the expected direction, indicating some evidence for convergent validity. The lack of a statistically significant correlation with changes in the MSEL or VABS is not discouraging because we would not necessarily expect the BOSCC to correlate highly with the MSEL and VABS; the BOSCC is a more global measure of social-communicative skills than either the MSEL Receptive Language domain or the VABS Communication domain. Yet, when children were defined as either responders or non-responders based on the VABS and MSEL, BOSCC scores decreased significantly more in responders than non-responders. This further suggests that, although correlations of the BOSCC with other measures of communication did not reach statistical significance, some convergence with measures of change is likely to emerge when samples are larger.

It should be noted that, despite the significant correlation between the BOSCC and ADOS-2 CSS score, the BOSCC is not intended to be a measure of diagnostic classification. Rather, the BOSCC was developed to capture nuanced social communication behaviors that may change over relatively brief periods of time. This distinction is important to prevent misuse of this new measure. This also highlights that BOSCC scores at any single time point are only meaningful in relation to another time point; a BOSCC score at one time point cannot stand on its own.

When considering the importance of the two BOSCC domains, improvements (decreases) in the BOSCC Core total (items 1–12, combining social communication and RRB domains) most consistently converged with improvement (increases) in other standard measures of communication (VABS, MSEL), while changes in the separate BOSCC SC and RRB domains were less consistent. Although the separate SC and RRB domains may prove useful in non-ASD populations or when assessing change specific to one domain, this work suggests that the BOSCC Core total may be the most appropriate score for identifying improvement in young, minimally verbal children with ASD. This needs to be confirmed in future work with larger samples.

Of note, only three items on the BOSCC attempt to capture RRB behaviors across a continuum. Item distributions indicated that obtaining a continuum for these behaviors was challenging. It may be that these behaviors are either clearly present or not present at all (with little variation in between) or that subtle variations in these behaviors are difficult to capture, especially within a 5-min time frame. Though still adequate, the RRB domain score demonstrated lower test–retest reliability than the SC domain, consistent with earlier iterations of the BOSCC (Kitzerow et al. 2015) and the ADOS, from which initial drafts of these items were developed. As mentioned, it was the BOSCC Core total (combining SC and RRB domains) that was most successful in identifying changes, indicating the importance of these behaviors in combination with the SC behaviors, at least in this ASD sample. Perhaps this is a result of the strong relationship between these domains in the ASD population (Richler et al. 2010). The BOSCC RRB domain may not prove to be a useful domain in which to measure change on its own, but additional studies are needed. In the meantime, it may be helpful to use other measures of RRB behaviors to complement the BOSCC, such as the Repetitive Behavior Scale-Revised (RBS-R; Lam and Aman 2007). Although reliance on such a caregiver-report measure, which is likely to be unblinded, carries its own biases, concordance with the BOSCC may prove useful both in providing validity for the BOSCC and in confirming the presence of meaningful change in caregiver reports.

In line with the goals for development, research assistants can reliably code the BOSCC (coding does not require a highly experienced or credentialed coder), unlike other commonly used measures (Bolte and Diehl 2013). In fact, our group has been successful at training several undergraduate-level research assistants as well as one highly motivated high school student to code the BOSCC reliably. Since the BOSCC measures changes within an individual, high levels of agreement amongst coders in a coding team are particularly crucial, though agreement across sites is less important (unlike reliability training for the ADOS, for example). The high inter-rater agreements in our group suggest that this level of agreement is attainable regardless of a coder's level of experience, though one experienced coder (a child psychiatry fellow) in our group tended to consistently under-score behaviors in the RRB domain. The BOSCC may initially be more challenging for someone who has more advanced training or experience, particularly in a specific framework, though this remains to be thoroughly explored. Also in line with goals of development, the BOSCC does not rely on caregiver report of symptoms, minimizing measurement bias (Anagnostou et al. 2015; Bolte and Diehl 2013; Guastella et al. 2015) and allowing truly "blinded" coding. In addition, the BOSCC's minimally structured, naturalistic context places little demand on administration and contributes to the measure's ecological validity. Through the use of video coding, coders have more time to consider behaviors without the pressure of assessing every behavior quickly, as in live coding situations. Despite the advantage of video coding, our group eventually aims to explore the utility of the BOSCC in live coding situations, as this method would not require video cameras or adequate audio/visual recordings.

Given the subtlety of social communication behaviors that the BOSCC measures, it is currently recommended that each BOSCC video segment (5 min) be viewed twice and the second set of codes be considered final for interpretation. This method takes approximately 30 min per video. Although our group found little difference in averaged totals between the first and second set of codes (data not presented), changes at the item level were present. In addition, coders reported having more confidence in their coding after their second viewing, suggesting the need to continue this practice.

Another aspect to consider in relation to the BOSCC is that any changes in a child's behavior during an interaction with a caregiver must be considered in light of changes in the caregiver's behavior. Parent–child interaction is often described as bi-directional: the child's behaviors impact the parent and vice versa (Ginn et al. 2015; Rutgers et al. 2004; Siller and Sigman 2008; Slaughter and Ong 2014; Zhou and Yi 2014). A recent parent-focused intervention study found that changes in ASD symptoms, as measured by the ADOS-2 CSS, were mediated by parental synchrony (Pickles et al. 2015). Similarly, work has also shown that children's language development may be influenced by a parent's responsiveness during play interactions (Siller and Sigman 2008). Another study found a high correlation between the quality of the parent–child interaction and the child's ASD severity (using the ADOS-2 CSS; Hobson et al. 2015). Our study did not assess whether the caregiver's behavior significantly impacted the child's BOSCC scores or if the child's severity of ASD or other behaviors impacted the caregiver's behavior. Given these potential confounds, some researchers may choose to have an examiner who is blind to the child's treatment status interact with the child during the BOSCC. If the caregiver is chosen as a BOSCC partner, researchers should consider collecting additional measures of generalization and/or caregiver behaviors that may contribute to observed changes in the child's behavior (Pickles et al. 2015). Previous work has emphasized the importance of the context in which changes are assessed (Yoder et al. 2013); therefore, whichever social and environmental context is chosen for the BOSCC observation, it should be kept as consistent as possible (e.g., same play partner, same materials, same location) in order to ensure the validity of the observations gathered. At the same time, measures that go beyond a single context are clearly necessary to ensure generalization of skills gained in treatment. Future work assessing the impact of the context in which the BOSCC is gathered is clearly warranted.

Although the initial results of the BOSCC are promising, they should be interpreted in light of several limitations of this project, including the small sample size. This study focuses on a sample of 56 young children with ASD, with even smaller samples of children with multiple observations of other measures (e.g., VABS, MSEL, ADOS-2) used for convergent validity. Our small sample also did not allow for analyses of differences by sex, race, or ethnicity. All children included in this paper used simple phrase speech or less and the majority had completed the Toddler Module or Module 1 of the ADOS-2. A subsample of eight children completed module 2 of the ADOS-2. It may be that this version of the BOSCC coding scheme is not maximally effective at capturing change in this more verbal (module 2, phrase speech) group. Future work will address whether modifications to the BOSCC coding scheme are necessary to capture adequate change in children using phrase speech. Also, test–retest reliability ICCs were high in domain and total scores, but there was some variability amongst item-level reliabilities. Though we do not recommend the use of BOSCC items individually, it is possible that one month between observations may be too long to adequately assess test–retest reliability on the BOSCC.

In addition, this paper did not explore the effects of specific treatment or control conditions. We hope to expand this work to a larger sample comparing different treatment conditions, employing the BOSCC as an independent measure of treatment response. It is also important to consider the limited endorsement of other abnormal behaviors in this context of free-play with a parent. Nevertheless, other researchers may want to consider these items in future analyses; these behaviors may impact social communication and RRB behaviors captured in other codes or be more common in other contexts. These behaviors may also be useful in determining whether the BOSCC observation is a valid representation of the child’s behavior.

Although this study focused on a sample of children with diagnoses of ASD, future work should also address whether the BOSCC can capture changes in children with social communication deficits who do not have ASD (e.g., social/pragmatic communication disorder, social anxiety disorder). Our group is also working on several lines of research related to the development of the BOSCC, including applying the BOSCC to school-age children who have limited speech and expanding the BOSCC to individuals with verbal fluency. Researchers outside our group have successfully applied an earlier version of the BOSCC to segments of ADOS-2 videos in a small sample (Kitzerow et al. 2015). We aim to confirm the validity of this method in future research, which would allow researchers to explore pre- and post-treatment ADOS-2 videos from previously collected data.

Our ongoing work and the work of other researchers (Fletcher-Watson et al. 2015; Kitzerow et al. 2015) will continue to provide larger samples across multiple sites in order to contribute to our continued understanding of the value and limitations of the BOSCC. Because the BOSCC is new and additional testing of its ability to capture meaningful change needs to be completed, we recommend that the BOSCC be used in combination with other measures of change. This is consistent with recommendations from other researchers endorsing multiple means of assessing treatment outcome (Cunningham 2012). The utility of the BOSCC, which measures broad ASD symptoms, may be strengthened when used in combination with other measures that assess more specific behaviors (e.g., joint attention) in detail. Also, the BOSCC may be useful in clarifying potential placebo effects often found in caregiver reports, allowing for more effective use of parent report measures. As the field focuses efforts on finding appropriate outcome measures for longitudinal studies and randomized controlled trials, we look forward to the continued validation of measures such as the BOSCC that will hopefully provide unique, objective observational data to aid in assessing the efficacy and course of treatments aimed at improving social communication skills.