Autism spectrum disorder (ASD) is a type of developmental disability characterized by deficits in social interaction and communication skills and by the presence of repetitive/ritualistic behaviors and a restricted range of interests (American Psychiatric Association 2013). Although ASD was once considered to be a relatively rare condition (Simpson 2004), its prevalence is now estimated at up to 1 in every 59 individuals (Baio et al. 2018). With increasing numbers of children being diagnosed with ASD, there is a corresponding need for provision of effective intervention (Woods and Wetherby 2003).

Various intervention approaches have been developed and evaluated to assess their efficacy and/or effectiveness for promoting improved adaptive behavior functioning and reducing ASD symptoms. Efficacy research refers to the evaluation of intervention effects when the intervention is conducted under ideal or controlled conditions, such as when delivered by researchers in a clinical setting. In contrast, effectiveness research refers to the evaluation of intervention effects when the intervention is conducted under real-world conditions, such as when delivered in preschool settings by the usual teaching personnel (Singal et al. 2014).

The range of interventions that has been evaluated in efficacy or effectiveness research includes pharmacological agents, dietary interventions, occupational and speech-language therapies, interventions based on the principles of applied behavior analysis, and developmentally orientated interventions (Ospina et al. 2008). In addition, hybrid intervention approaches have been evaluated, such as early intervention programs that make use of behavior analytic instructional tactics (e.g., reinforcement, response prompting) within naturalistic and developmentally appropriate social interactions and play routines (Debodinance et al. 2017; Odom et al. 2010). Interventions that fall into this latter, behavioral-developmental, category appear to be the most widely used in contemporary practice, perhaps due to the growing evidence base supporting their efficacy (Ospina et al. 2008).

Positive outcomes across a range of domains of functioning (e.g., social, play, cognitive, and communication skills) from such behavioral/developmentally oriented early intervention programs have been documented in a number of studies (e.g., Dawson 2008; Debodinance et al. 2017; Drahota et al. 2012; Estes et al. 2015; Keenan and Dillenburger 2011; Reichow 2012). However, in most existing studies, the treatment procedures were implemented by researchers or therapists in a one-on-one format or within specialist small-group arrangements, rather than by usual personnel (e.g., teachers) in more naturalistic/inclusive group settings (Young et al. 2016). That is, most research to date has examined efficacy rather than effectiveness. To advance evidence-based practice with respect to the implementation of early intervention programs, it would seem critical to evaluate outcomes from effectiveness studies, that is, studies in which the treatment procedures were implemented under more natural, real-world conditions, such as in inclusive preschool settings with teachers serving as the intervention agents.

An inclusive preschool-based approach to the delivery of interventions would seem to offer several potential advantages over delivery of intervention by experts in more structured/clinical settings. First, international guidelines recommend delivery of interventions in settings that (a) provide ongoing opportunities for interaction with typically developing peers and (b) are the least restrictive environment for meeting the individual’s needs (Broderick 2017; United Nations 2006). Inclusive preschools are perhaps more likely to meet these requirements than clinical settings as they offer the opportunity for children with ASD to be part of the same learning environment as their typically developing peers and to practice the social behaviors needed to interact with these peers (Vivanti et al. 2017). Opportunities for interaction with peers would seem especially important for young children with ASD because of the associated social communication impairments, which are likely to interfere with forming positive peer relationships. The presence of typically developing peers in an inclusive preschool setting might be useful for promoting more positive peer relations and offering models of age-appropriate play, communication, and social behavior (Koegel et al. 2001).

Providing early intervention in an inclusive preschool setting would also seem to offer some potential cost saving, as intervention could be delivered to more than one child at a time using existing resources (e.g., existing teaching staff, physical spaces, and equipment). Additionally, there is a documented disparity in children’s access to early intervention services (Thomas et al. 2007) and healthcare (Kramer et al. 2017). Specifically, ethnicity, parental education, and geographic location may all affect families’ ability to access ASD-related services (Thomas et al. 2007). Addressing this disparity has been cited as an important and understudied area of ASD intervention research (Interagency Autism Coordinating Committee (IACC) 2013). Delivering intervention in an inclusive preschool environment might help to address this disparity to some extent, particularly if the young child with ASD is already attending an inclusive preschool where early intervention could be provided.

Existing teaching staff would seem to be the most logical intervention agents for interventions conducted in inclusive preschool settings because they are likely to have knowledge of early development and learning as well as familiarity with the unique needs and interests of the children they teach (Lawton and Kasari 2012). It is important to note, however, that some research suggests that typical pre-service training may not equip teachers with the knowledge, skills, and/or levels of confidence required to meet the needs of children with ASD (Mitchell and Hegde 2007). Thus, it seems important to examine how teaching staff in inclusive preschools might be trained to meet the needs of the young children with ASD that they teach.

Given the increasing need for, and potential positive outcomes from, early intervention programs for young children with ASD—and the potential advantages of inclusive preschool-based delivery of interventions by teaching staff—it seems important to explore whether such interventions can be delivered in this type of setting and whether doing so is likely to produce positive outcomes for the child. In this review, we sought to identify studies involving the provision of early intervention to children with ASD who were attending inclusive preschool settings. We also sought to appraise the quality of the identified studies and evaluate their effects on child outcomes. The strategies used in training teaching staff to implement these interventions with fidelity were a particular focus of the review as well. The specific questions addressed in this review are:

  1. What types of intervention procedures/programs have been used in the included studies?
  2. What are the design characteristics and rigor of the included studies?
  3. What were the characteristics of the participants involved in the studies?
  4. How were teaching staff trained to implement the intervention procedures and to what extent did this training enable the staff to implement the interventions with fidelity?
  5. What was the range of outcomes for the children with ASD who received intervention?
  6. To what extent were the interventions/staff training procedures perceived by stakeholders to be effective and acceptable (i.e., socially valid)?

Method

Search and Screening Procedures

Searches were carried out by the first author using the PsycINFO, ERIC, Scopus, PubMed, and ProQuest databases. For all databases, the following search terms were entered into the “Anywhere” field: Autis* OR ASD AND “Teacher led” OR “teacher implemented” AND Intervention OR program* OR treatment AND “Early intervention” OR preschool OR “early childhood.” Results were limited to journal articles published in English between 2000 and 2017. These initial electronic searches returned 351 articles.

The titles and abstracts for these 351 articles were then reviewed to screen them for their potential eligibility for inclusion. At this stage, 16 articles were deemed eligible for inclusion, and consequently, the full text of each of these articles was reviewed to ensure that each study met all of the inclusion criteria. Ten articles met all of the inclusion criteria. An ancestral search of the reference lists of the included articles from the database search produced a further four articles for inclusion. Finally, author searches were conducted on the authors of the included articles from the database and ancestral searches. These author searches produced an additional five articles for inclusion. At this stage, three articles were excluded because they involved both inclusive and non-inclusive preschool settings and the data from non-inclusive preschools could not be separated from the data from the inclusive preschools. In total, 16 articles met the inclusion criteria.
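The screening counts reported above can be tallied in a few lines. The sketch below is only an illustrative check of the arithmetic; the variable and function names are ours and are not part of the review protocol.

```python
# Counts taken from the search and screening description.
# All names below are illustrative labels, not terms from the review.
database_included = 10        # met all criteria after full-text review of 16 articles
ancestral_included = 4        # added from reference lists of included articles
author_search_included = 5    # added from author searches
excluded_mixed_settings = 3   # inclusive and non-inclusive data could not be separated


def final_included(database, ancestral, author, excluded):
    """Return the number of articles retained after all search stages."""
    return database + ancestral + author - excluded


total = final_included(
    database_included,
    ancestral_included,
    author_search_included,
    excluded_mixed_settings,
)
print(total)  # 16
```

The result matches the 16 articles reported as meeting the inclusion criteria.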

Inclusion Criteria

To be included in the review, studies needed to meet several criteria. First, the study had to have evaluated outcomes from interventions that were conducted in inclusive preschool settings. An inclusive preschool setting was defined as an educational and/or care setting for preschool-aged children (typically aged 1 to 6 years) that included both children with and without disabilities. In addition, the interventions had to have been implemented by the staff who regularly worked in that setting. Staff could include teachers, paraeducators, teaching assistants, or similar personnel. The study also had to include at least one child participant who (a) was aged between 12 and 72 months or was attending the preschool and (b) had a diagnosis of ASD, Autism, Asperger’s Syndrome, or Pervasive Developmental Disorder-Not Otherwise Specified (PDD-NOS). If a study also included one or more participants who did not meet these inclusion criteria, only data from the eligible participants were extracted and analyzed. If data from eligible participants could not be separated out, the study was excluded. Studies set in special education classrooms and studies focused on interventions implemented by parents, researchers, or specialists who were not regular preschool staff were excluded, as were case studies, studies with qualitative designs, and theses or dissertations.

Data Extraction

The following data were extracted from each included study: (a) type of study design, (b) child characteristics (number, age, and diagnoses), (c) intervention characteristics (intervention type, frequency, and duration), (d) teacher training (method, frequency, and duration), (e) research quality/rigor, (f) child outcomes (type of outcome, method of measurement, and results), (g) teacher outcomes/behavior (type of outcome/behavior, method of measurement, and results), and (h) social validity (method of measurement and results).

Results were classified as positive, mixed, or no effect/negative. For studies with a single-case design, a positive result was coded when positive changes for all primary dependent variables (DV), all participants, and all intervention phases were reported. Mixed results referred to situations where authors reported minimal or no improvement or highly variable results for one or more participants, primary DVs, and/or intervention phases. Finally, a coding of no effect or a negative result meant that the intervention was not associated with any positive results for any of the participants or for any of the primary DVs. For studies using group designs, results for each DV were reported separately. A positive result was coded when authors reported significant improvements for the experimental group (EG) for a given DV. Where control group (CG) data were reported, improvements observed in the EG needed to be significantly better than those observed in the CG for the result to be coded positive. Conversely, a code of no effect or negative result was used when no significant improvements were reported for the EG and/or reported improvements for the EG were not significantly better than those reported for the CG.
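The coding rules described above can be expressed as a small decision procedure. The following is a minimal sketch under simplifying assumptions: each participant, primary DV, and phase combination in a single-case study is reduced to one true/false flag (so "highly variable" results would be entered as false), and statistical significance in group designs is supplied by the coder as a true/false value. The function names are hypothetical and are not taken from the review.

```python
from typing import Iterable, Optional


def code_single_case(improved: Iterable[bool]) -> str:
    """Code a single-case study.

    `improved` holds one flag per participant x primary DV x phase
    combination (a simplification of the coder's judgment).
    """
    flags = list(improved)
    if not flags:
        raise ValueError("at least one observation is required")
    if all(flags):
        return "positive"            # positive change everywhere
    if not any(flags):
        return "no effect/negative"  # no positive change anywhere
    return "mixed"                   # some, but not all, positive


def code_group_dv(eg_improved: bool,
                  eg_better_than_cg: Optional[bool] = None) -> str:
    """Code one DV in a group-design study.

    `eg_better_than_cg` is None when no control group data were reported.
    """
    if eg_improved and eg_better_than_cg is not False:
        return "positive"
    return "no effect/negative"


print(code_single_case([True, True, False]))  # mixed
print(code_group_dv(True, False))             # no effect/negative
```

Note that, as in the text, group-design results are coded separately for each DV, so a single study can receive different codes for different outcomes.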

A modified version of Goldstein et al.’s (2014) framework was used to evaluate each included study for its quality/rigor. The quality criteria used in this framework are consistent with those proposed by Cook et al. (2009) and the School Psychology Division of APA (Kratochwill and Stoiber 2002). This framework was chosen because it is suitable for use with both single-case and group-design studies and allows for a quantification of the quality/rigor of a study across a comprehensive set of specific quality indicators (Snyder et al. 2015). This framework was also selected because it allows readers to graphically view the strengths and weaknesses of studies and thus to quickly assess studies (individually and collectively) across the variables that they are most interested in.

Using this framework, included studies were evaluated across four broad areas: (a) design characteristics and internal reliability, (b) measurement features, (c) general characteristics, and (d) results and external validity. Thirteen quality criteria, across the four aforementioned areas, were used to evaluate all studies (including group designs and single-case research designs). These 13 criteria are (a) design characteristics, (b) measurement, (c) reliability, (d) intervention fidelity, (e) training fidelity, (f) rationale, (g) robust treatment effects, (h) statistics, (i) maintenance and generalization, (j) implementation site, (k) participant selection, (l) consumer satisfaction, and (m) social validity. For single-case research designs, two additional criteria (quality of baseline and visual analysis) were used. Definitions and guidelines for assigning ratings for each criterion are included in Goldstein et al.’s framework.

The procedures that we used for evaluating the quality of the studies involved having the first author examine each article and assign a rating of between 1 and 4 for each applicable category. A score of 4 represented exemplary performance, 3 represented acceptable performance, 2 represented minimal performance, and a score of 1 represented unacceptable performance. A total mean score was then calculated for each study by dividing the sum of the study’s scores for each category by the number of categories scored.
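As an illustration of the scoring rule, the sketch below computes a study's total mean score from its category ratings. The ratings shown are hypothetical and do not correspond to any included study; treating a non-applicable category as `None` is our assumption about how unscored categories would be represented.

```python
def mean_quality_score(ratings):
    """Return the mean of the 1-4 ratings over the categories that were scored.

    `ratings` maps category name -> rating, with None for categories that
    do not apply (e.g., single-case-only criteria in a group design).
    """
    scored = [r for r in ratings.values() if r is not None]
    if not scored:
        raise ValueError("no categories were scored")
    if any(not 1 <= r <= 4 for r in scored):
        raise ValueError("ratings must be between 1 and 4")
    return sum(scored) / len(scored)


# Hypothetical ratings for one group-design study.
example_ratings = {
    "design characteristics": 4,
    "measurement": 3,
    "reliability": 3,
    "intervention fidelity": 2,
    "quality of baseline": None,  # single-case criterion, not scored here
}
print(mean_quality_score(example_ratings))  # 3.0
```

Because the divisor is the number of categories actually scored, group and single-case studies remain comparable even though the latter are rated on two additional criteria.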

Interrater Agreement

A second reviewer independently reviewed the full text of all articles identified during the database, ancestral, and author searches to check their eligibility against the inclusion criteria. Before undertaking any agreement checks, the primary rater (first author) and the independent rater (second author) discussed the inclusion criteria and the types of data to be extracted from each article. Interrater agreement on whether each identified article met the inclusion criteria was calculated by dividing the number of agreements by the number of agreements plus disagreements and multiplying by 100. Overall agreement for all searches was 88.3% (range = 80 to 100%). Agreement on the accuracy of data extraction was also assessed for all included studies and for all variables. Agreement ranged from 96 to 99% with a mean of 97%. Disagreements were discussed to obtain consensus.
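Written as a formula, with A denoting the number of agreements and D the number of disagreements (symbols introduced here for illustration), the agreement calculation is:

```latex
\[
\text{Agreement (\%)} = \frac{A}{A + D} \times 100
\]
```

For example, a hypothetical check with 18 agreements and 2 disagreements would give 18 / (18 + 2) × 100 = 90%. These counts are illustrative only and are not data from the review.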

Results

The 16 included studies are summarized in Tables 1, 2, 3, and 4. Table 1 summarizes design, child characteristics, and intervention characteristics. Table 2 summarizes child outcomes, Table 3 summarizes teacher outcomes and social validity, and Table 4 provides an evaluation of each study’s design characteristics and the presence of specific quality indicators.

Table 1 Summary of child characteristics, intervention characteristics, and teacher training for included studies
Table 2 Summary of child outcomes
Table 3 Summary of teacher and social validity outcomes for included studies
Table 4 Experimental design and quality indicators for included studies

Child Characteristics

A total of 809 children participated across the 16 included studies. We classified children in terms of the diagnoses they had been assigned in the original research reports. Of these participating children, 734 (91%) had a diagnosis of autism/ASD, 25 (3%) had a diagnosis of PDD-NOS, 1 child (< 1%) had a diagnosis of Asperger’s Syndrome, and 1 (< 1%) child had no formal diagnosis, but was reported to have displayed autism-like symptoms. A further 48 children (6%) from one study (Schwartz et al. 2004) had a diagnosis of either ASD or PDD-NOS; however, the authors did not specify which diagnosis each child had. The mean age across studies was 45.9 months. This mean does not include the participants from Schwartz et al. (2004) because these authors only provided the range of participants’ ages, not the mean. Early intervention was provided to 517 (64%) of the participating children with the remaining 292 children (36%) assigned to CGs that received treatment as usual.

Setting

The early interventions being evaluated in these studies were implemented in the regular classroom environments within inclusive preschool settings for all 16 included studies. However, in one study, part of the intervention was delivered on a one-on-one basis in a designated treatment room at the preschool (Eikeseth et al. 2012). Twelve studies (75%) took place in US-based preschools (Boulware et al. 2006; Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Gibson et al. 2010; Harjusola-Webb and Robbins 2012; Kern et al. 2007; McBride and Schwartz 2003; Olive et al. 2007; Schwartz et al. 2004; Strain and Bovey 2011; VanDerHeyden et al. 2002; Young et al. 2016). The remaining four studies (25%) were conducted in Italy (D’Elia et al. 2014), Norway (Eldevik et al. 2012), Germany (Kern and Aldridge 2006), and Sweden (Eikeseth et al. 2012).

Intervention Approaches

Various intervention approaches were used across the 16 included studies. Seven studies (44%) delivered some type of comprehensive intervention that targeted a range of developmental areas (Boulware et al. 2006; D’Elia et al. 2014; Eikeseth et al. 2012; Eldevik et al. 2012; Schwartz et al. 2004; Strain and Bovey 2011; Young et al. 2016). These seven studies involved the use of one of six different comprehensive intervention programs, specifically: (a) Developmentally Appropriate Treatment for Autism (DATA; Boulware et al. 2006; Schwartz et al. 2004), (b) Treatment and Education of Autistic and Related Communication Handicapped Children (TEACCH; D’Elia et al. 2014), (c) Early Intensive Behavioral Intervention (EIBI; Eldevik et al. 2012), (d) Learning Experiences and Alternative Program for Preschoolers (LEAP; Strain and Bovey 2011), (e) Comprehensive Autism Program (CAP; Young et al. 2016), and (f) an EIBI intervention described as being based on Lovaas’ UCLA model (Eikeseth et al. 2012), which in fact appears to have been based on similar principles to the EIBI intervention reported in Eldevik et al. (2012).

The remaining nine studies (56%; Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Gibson et al. 2010; Harjusola-Webb and Robbins 2012; Kern and Aldridge 2006; Kern et al. 2007; McBride and Schwartz 2003; Olive et al. 2007; VanDerHeyden et al. 2002) focused on interventions that we classified as more targeted because the intervention focused on changing a less comprehensive (more specific) set of skills or focused on a smaller number of specific developmental areas than did the comprehensive programs listed above. The specific skills and developmental areas targeted in these interventions included (a) communication (Gibson et al. 2010; Harjusola-Webb and Robbins 2012; Kern et al. 2007; McBride and Schwartz 2003; Olive et al. 2007), (b) play skills (Kern and Aldridge 2006; VanDerHeyden et al. 2002), (c) peer interaction (Garfinkle and Schwartz 2002; Kern and Aldridge 2006; McBride and Schwartz 2003), and (d) reading skills (Fleury and Schwartz 2017).

Frequency of Intervention Sessions

Frequency of intervention refers to the number of sessions per week. Seven studies (44%) included details of the frequency of intervention sessions (Boulware et al. 2006; Eldevik et al. 2012a, 2012b; Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Olive et al. 2007; VanDerHeyden et al. 2002); the mean number of sessions per week across these studies was 4.2 (range: 2 to 5). A further four studies (25%; Gibson et al. 2010; Kern and Aldridge 2006; Strain and Bovey 2011; Young et al. 2016) provided some information on intervention frequency, but not enough for the mean number of sessions per week to be calculated.

Intensity of Intervention Sessions

Intervention intensity, that is, the mean number of hours of intervention per week, was not specified for one study (Kern and Aldridge 2006) and was unclear for a further study (VanDerHeyden et al. 2002). Another three studies (Gibson et al. 2010; Kern et al. 2007; McBride and Schwartz 2003) provided the number of minutes per intervention session, but not enough information for the mean number of hours per week to be calculated. Two studies (Harjusola-Webb and Robbins 2012; Young et al. 2016) reported the number of hours per week that children were enrolled to attend preschool but not the intensity of the intervention received. For the remaining nine studies (56%), mean intensity of intervention that was delivered in the preschool setting was 9.6 h/week (range 0.3 to 23 h/week).

For three studies (19%), the intervention also included a family/home component that involved an additional number of hours delivered by the child’s family and/or in the child’s home (Boulware et al. 2006; D’Elia et al. 2014; Eikeseth et al. 2012). The number of hours per week was not specified by Eikeseth et al. (2012), but for the Boulware et al. (2006) study, the family/home component involved an additional 7 h/week of intervention, and for the D’Elia et al. (2014) study, participants received an additional 2 h of intervention per week delivered in their home.

Duration of Intervention Sessions

Duration refers to the amount of time (in months) over which the intervention was conducted. The mean duration of intervention was reported in 10 studies (63%) and ranged from 1.2 to 25 months with a mean of 15 months. The duration of intervention was not specified or was not clearly specified in four studies (25%; Garfinkle and Schwartz 2002; Harjusola-Webb and Robbins 2012; Olive et al. 2007; VanDerHeyden et al. 2002). Two studies (13%; Gibson et al. 2010; McBride and Schwartz 2003) specified the total number of sessions of intervention received, but not the period of time over which the sessions were delivered.

Staff Training

Five studies (31%) did not report any details on the approach used for training teachers to implement the intervention procedures (Boulware et al. 2006; D’Elia et al. 2014; Eikeseth et al. 2012; Schwartz et al. 2004; VanDerHeyden et al. 2002). For the remaining studies (69%), a range of methods was used to train teaching staff, including (a) providing a formal graduate course in communication intervention (Olive et al. 2007); (b) didactic coaching, mentoring, and/or training (Eldevik et al. 2012; Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Kern and Aldridge 2006; McBride and Schwartz 2003; Young et al. 2016); (c) modeling (Eldevik et al. 2012; Fleury and Schwartz 2017; Gibson et al. 2010; Kern et al. 2007; Strain and Bovey 2011); (d) use of role play (Fleury and Schwartz 2017; Gibson et al. 2010); (e) individual feedback (Fleury and Schwartz 2017; Gibson et al. 2010; Harjusola-Webb and Robbins 2012; McBride and Schwartz 2003; Strain and Bovey 2011); (f) written instruction and/or feedback (Fleury and Schwartz 2017; Gibson et al. 2010; McBride and Schwartz 2003; Strain and Bovey 2011); (g) group instruction and/or coaching (Fleury and Schwartz 2017; Young et al. 2016); (h) workshops (Young et al. 2016); and (i) in vivo practice and/or coaching (McBride and Schwartz 2003; Strain and Bovey 2011). In one study, training was delivered via videoconferencing (Gibson et al. 2010), another study used video modeling (Fleury and Schwartz 2017), and a written training manual was provided to staff in the Harjusola-Webb and Robbins (2012) study. Finally, in the study by Kern et al. (2007), staff training began with an initial consultation meeting to establish intervention goals and procedures for each participating child.

The frequency, intensity, and duration of staff/teacher training varied across the included studies. Seven studies (44%) did not specify the frequency, intensity, or duration of training (Boulware et al. 2006; D’Elia et al. 2014; Eikeseth et al. 2012; Garfinkle and Schwartz 2002; Olive et al. 2007; Schwartz et al. 2004; VanDerHeyden et al. 2002), and the authors of two studies (13%) did not provide enough information for the exact frequency, intensity, and duration of training to be calculated (Fleury and Schwartz 2017; McBride and Schwartz 2003). For one study (6%; Gibson et al. 2010), training consisted of one 45-min session; this study is therefore not included in the calculations of training frequency, intensity, or duration in this section. Three studies (19%) provided details of the frequency of teacher training (Eldevik et al. 2012; Harjusola-Webb and Robbins 2012; McBride and Schwartz 2003). Training was delivered weekly in two of these studies (Eldevik et al. 2012; Harjusola-Webb and Robbins 2012), and training sessions decreased to bi-weekly for one study as the intervention progressed (Eldevik et al. 2012). The intensity of teacher training was specified in two studies (Eldevik et al. 2012; Harjusola-Webb and Robbins 2012) and ranged from 0.17 to 6 h/week. A further two studies (McBride and Schwartz 2003; Strain and Bovey 2011) provided some information on the intensity of teacher training, but not enough to calculate the mean number of hours per week.

The duration of teacher training was reported in five studies (31%; Eldevik et al. 2012; Kern and Aldridge 2006; Kern et al. 2007; McBride and Schwartz 2003; Strain and Bovey 2011). In two of these studies (Kern and Aldridge 2006; McBride and Schwartz 2003), training ended once staff reached a pre-determined level of fidelity or indicated that they were confident with using the intervention. For the remaining three studies (19%), the mean durations of teacher training were 25 months (Eldevik et al. 2012); 8 months (Kern et al. 2007); and 0.5 months (Strain and Bovey 2011).

Child Outcomes

Table 2 indicates that all 16 studies reported on child outcomes. Outcomes have been grouped according to study design. Specifically, the nine studies (56%) with single-case designs primarily used direct observation to measure child outcomes so these studies have been grouped together. The remaining seven studies (44%), which all had group designs, used a range of instruments other than direct observation to assess child outcomes. The findings from these studies have been grouped according to domains of functioning.

Child Outcomes for Single-Case Design Studies

The nine studies (56%) with a single-case design included at least one child outcome measured by in vivo or video observation of child behavior (Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Gibson et al. 2010; Harjusola-Webb and Robbins 2012; Kern and Aldridge 2006; Kern et al. 2007; McBride and Schwartz 2003; Olive et al. 2007; VanDerHeyden et al. 2002). The specific outcomes measured varied across studies and included frequency of elopement (Gibson et al. 2010); verbal participation in the target reading activity (Fleury and Schwartz 2017); social initiations (Garfinkle and Schwartz 2002); verbal responses (Garfinkle and Schwartz 2002; McBride and Schwartz 2003); imitation of peers (Garfinkle and Schwartz 2002); engagement and physical proximity to peers (Garfinkle and Schwartz 2002; McBride and Schwartz 2003); peer imitation of participating children (Garfinkle and Schwartz 2002); frequency of expressive communicative acts (Harjusola-Webb and Robbins 2012); positive peer interactions, play, and engagement with materials and equipment (Kern and Aldridge 2006); number of correctly completed steps of the morning arrival routine (Kern et al. 2007); use of a speech-generating device for communication (Olive et al. 2007); and contact with target activity centers, toy play, and disruptive behavior (VanDerHeyden et al. 2002). One study (Fleury and Schwartz 2017) also included a researcher-developed assessment of book-specific vocabulary.

Due to the wide variety of outcomes measured across the nine single-case design studies, it is not possible to make direct comparisons. Results were coded as positive for five studies (Gibson et al. 2010; Harjusola-Webb and Robbins 2012; Kern et al. 2007; McBride and Schwartz 2003; Olive et al. 2007). Mixed results or minimal improvements were reported in four studies (Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Kern and Aldridge 2006; VanDerHeyden et al. 2002). For example, Kern and Aldridge (2006) reported positive results across all participants, but for only two of the three intervention phases, and Fleury and Schwartz (2017) reported minimal improvement in child verbal initiations, but positive results for all other measured child outcomes.

Child Outcomes for Group Design Studies

For the seven group design studies, a range of instruments was used to measure child outcomes. Six (86%) of these group studies (Boulware et al. 2006; D’Elia et al. 2014; Eikeseth et al. 2012; Eldevik et al. 2012; Strain and Bovey 2011; Young et al. 2016) measured children’s adaptive/maladaptive behavior using a range of instruments including (a) Temperament and Atypical Behavior Scale (TABS; Bagnato et al. 1999), (b) Vineland Adaptive Behavior Scales (VABS; Sparrow et al. 1984), (c) Child Behavior Checklist (CBC; Achenbach 1991), and (d) the Social Skills Rating System (SSRS; Gresham and Elliott 1990). Five of the six studies that measured adaptive/maladaptive behavior (71% of the seven group design studies) reported positive results (Boulware et al. 2006; D’Elia et al. 2014; Eikeseth et al. 2012; Eldevik et al. 2012; Strain and Bovey 2011), while the remaining study was coded as having no effect because there were no significant changes in participant scores for the EG (Young et al. 2016).

Functional skills were measured as outcomes in two (29%) of the group studies (Boulware et al. 2006; Schwartz et al. 2004) and were assessed using (a) Bayley Scales of Infant Development (Bayley 2006); (b) Assessment, Evaluation, and Programming System for Infants and Children (AEPS; Bricker 1994); (c) a researcher-developed functional outcomes index (Schwartz et al. 2004); and (d) a researcher-developed functional outcomes scale (Boulware et al. 2006). Participating children from both studies demonstrated gains across at least one functional outcome, and participants from the Schwartz et al. (2004) study made gains across all six of the functional outcomes measured.

Four (57%) of the seven group studies reported on measures of autism severity and/or symptoms (D’Elia et al. 2014; Eikeseth et al. 2012; Strain and Bovey 2011; Young et al. 2016) using the Autism Diagnostic Observation Schedule (ADOS; Lord et al. 2008) or the Childhood Autism Rating Scale (CARS; Schopler et al. 2002). D’Elia et al. (2014) reported decreases in autism diagnoses across both EGs and CGs, as measured by the ADOS, with a larger decrease observed in the EG. Similarly, in the study by Strain and Bovey (2011), the EG demonstrated a greater decrease in CARS scores than the CG. Eikeseth et al. (2012) reported a significant decrease in CARS scores for the EG, but did not report comparison data for the CG. The authors of the final study (Young et al. 2016) did not report any significant change in CARS scores.

Child communication and/or language was measured in five (71%) of the group studies (Boulware et al. 2006; D’Elia et al. 2014; Fleury and Schwartz 2017; Strain and Bovey 2011; Young et al. 2016) via a range of different instruments including (a) Communication, Social, and Symbolic Behavior Scales (CSBS; Wetherby and Prizant 2002); (b) MacArthur Communication Developmental Inventories (CDI; Fenson et al. 1993; Fenson et al. 1994); (c) Preschool Language Scale (PLS; Zimmerman et al. 1991); (d) Expressive One Word Picture Vocabulary Test (EOWPVT; Brownell 2000a); (e) Receptive One Word Picture Vocabulary Test (ROWPVT; Brownell 2000b); and (f) a researcher-delivered book vocabulary assessment (Fleury and Schwartz 2017). Participants demonstrated improvement on at least one communication/language outcome across all five of these studies.

Two (29%) of the group studies (Strain and Bovey 2011; Young et al. 2016) reported on social skills, which were measured via the Social Skills Rating System (SSRS; Gresham and Elliott 1990) and the Autism Screening Instrument for Educational Planning (ASIEP; Krug et al. 2008). Both studies reported positive results, with the EG making greater improvements than the CG in both cases.

Aspects of child cognition or educational strengths and weaknesses were reported as outcomes in four (57%) of the group studies (D’Elia et al. 2014; Eldevik et al. 2012; Strain and Bovey 2011; Young et al. 2016). Intellectual functioning was measured in one study (Eldevik et al. 2012) using the Bayley Scales of Infant Development (BSID; Bayley 2006) for participants younger than 42 months of age, and the Stanford-Binet Intelligence Scale (Thorndike et al. 1986) for participants older than 42 months. Overall, the EG made significantly greater gains than the CG on composite scores for both instruments. Another study (D’Elia et al. 2014) measured psycho-educational skills using the Psychoeducational Profile: Third Edition (PEP-3; Schopler et al. 2005) and found that EG participants made significant improvements over time across most categories. Finally, child cognitive development was measured in two (29%) of the seven group studies (Strain and Bovey 2011; Young et al. 2016) using the Mullen Scales of Early Learning (Mullen 1995) and the cognitive domain of the Bayley Scales of Infant Development (BSID; Bayley 2006). In the Strain and Bovey (2011) study, EG scores were significantly higher than CG scores after intervention; however, no significant change in scores was reported in the Young et al. (2016) study.

Teaching Staff Outcomes

Table 3 shows that four (25%) of the included studies did not report any measures of teacher outcomes (Boulware et al. 2006; Eikeseth et al. 2012; Eldevik et al. 2012; Schwartz et al. 2004). For the purposes of this review, the term “implementation fidelity” has been used as an umbrella term to describe the extent to which interventions were delivered as intended and in line with the program model or prescribed procedures. The included studies used a range of terms to refer to this concept including fidelity of intervention implementation (D’Elia et al. 2014; Olive et al. 2007; Strain and Bovey 2011; Young et al. 2016); procedural fidelity (Fleury and Schwartz 2017; VanDerHeyden et al. 2002); treatment fidelity (Garfinkle and Schwartz 2002); and fidelity of treatment (Kern and Aldridge 2006).

Implementation fidelity was measured as a teacher outcome in 9 (56%) of the 16 included studies (D’Elia et al. 2014; Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Gibson et al. 2010; Kern and Aldridge 2006; Olive et al. 2007; Strain and Bovey 2011; VanDerHeyden et al. 2002; Young et al. 2016). However, in one of these nine studies (D’Elia et al. 2014), the authors did not actually provide the results of their fidelity checks. Implementation fidelity was measured via direct observation in eight of these studies (Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Gibson et al. 2010; Kern and Aldridge 2006; Olive et al. 2007; Strain and Bovey 2011; VanDerHeyden et al. 2002; Young et al. 2016), and through review of participating children’s individual education plans (IEPs) for the ninth study (D’Elia et al. 2014). Young et al. (2016) also included data on teachers’ rate of attendance at training workshops and responses from teacher exit interviews to support data from the in-class observations. Six studies (38%) reported high levels of implementation fidelity (M = 90%, range of means = 73 to 100%; Fleury and Schwartz 2017; Gibson et al. 2010; Olive et al. 2007; Strain and Bovey 2011; VanDerHeyden et al. 2002; Young et al. 2016). Two studies (13%) reported mixed results with some participating teachers failing to reach high fidelity levels (Kern and Aldridge 2006) or requiring additional coaching to reach required fidelity levels (Garfinkle and Schwartz 2002). Young et al. (2016) also reported high rates of attendance at teacher-training workshops.

Other teacher outcomes related to the use of specific teaching techniques include (a) rate of dialogic prompt use (Fleury and Schwartz 2017); (b) use of communication-promoting strategies (Harjusola-Webb and Robbins 2012); (c) use of prompting (Kern et al. 2007); (d) rate of instruction, use of physical prompts, and interactions with the target child (McBride and Schwartz 2003); and (e) use of prompts and attention (VanDerHeyden et al. 2002). For most of these studies, outcomes were assessed via direct observation (Harjusola-Webb and Robbins 2012; Kern et al. 2007; VanDerHeyden et al. 2002) or video observation (McBride and Schwartz 2003). However, Fleury and Schwartz (2017) did not specify the method of measurement used. Two studies reported positive results (Harjusola-Webb and Robbins 2012; VanDerHeyden et al. 2002), one study reported mixed results (McBride and Schwartz 2003), and two studies did not report the results (Fleury and Schwartz 2017; Kern et al. 2007).

Social Validity

As displayed in Table 3, nine studies (56%) did not appear to have assessed social validity (Boulware et al. 2006; D’Elia et al. 2014; Eikeseth et al. 2012; Eldevik et al. 2012; Harjusola-Webb and Robbins 2012; Kern and Aldridge 2006; Kern et al. 2007; Olive et al. 2007; VanDerHeyden et al. 2002). Among the seven studies (44%) that did include a measure of social validity, five administered a questionnaire that was completed by the teaching staff (Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Gibson et al. 2010; Strain and Bovey 2011; Young et al. 2016), whereas one study used teacher interviews (McBride and Schwartz 2003), and one study reported what would appear to have been more anecdotal evidence on social validity that was provided by participants’ families (Schwartz et al. 2004). Due to the range of different measures used to evaluate social validity across these seven studies, results cannot be summarized or compared. However, the reported results were generally positive, with the interventions rated highly by teachers (Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Gibson et al. 2010; McBride and Schwartz 2003; Strain and Bovey 2011; Young et al. 2016) and parents (Schwartz et al. 2004). Strain and Bovey (2011) reported a strong correlation (r = .89) between teachers’ implementation fidelity and their ratings of the social validity of the intervention.

Design Characteristics and Quality Ratings

A detailed analysis of each study’s design characteristics and the presence of specific quality indicators is presented in Table 4. Nine (56%) of the included articles used single-case research designs (Fleury and Schwartz 2017; Garfinkle and Schwartz 2002; Gibson et al. 2010; Harjusola-Webb and Robbins 2012; Kern and Aldridge 2006; Kern et al. 2007; McBride and Schwartz 2003; Olive et al. 2007; VanDerHeyden et al. 2002). The remaining seven articles (44%) used group designs including a single group pre-test and post-test design (Boulware et al. 2006; Schwartz et al. 2004), nonequivalent comparison-group design (D’Elia et al. 2014; Eikeseth et al. 2012; Eldevik et al. 2012), randomized controlled trial (Strain and Bovey 2011), and cluster-randomized trial (Young et al. 2016).

The quality of each study was assessed using the quality-rating framework developed by Goldstein et al. (2014), and the results are displayed in Table 4. The mean quality rating across all of the included studies was 2.5 out of 4 (range = 1.8 to 3.4). This mean score can be loosely translated as indicating a minimal acceptable level of quality (Goldstein et al. 2014). In order to receive a rating of “minimal quality,” studies needed to have an overall mean score of at least 2. Boulware et al.’s (2006) study received a mean rating of 1.8 and was the only included study that did not meet the standards for minimal quality. Only one of the included studies (Strain and Bovey 2011) demonstrated an “acceptable” level of quality, with a mean score of 3.4. The mean scores for the remaining 14 studies ranged from 2.1 to 2.9. With respect to scores for each category, studies tended to score highly for study rationale (M = 3.3), robust treatment effects (M = 3.1), and the external validity of the implementation site (M = 4), and studies with single-case designs tended to score highly for quality of baseline (M = 3.4) and visual analysis (M = 3.3). Lower mean scores were obtained for social validity (M = 1.5), consumer satisfaction (M = 1.9), maintenance and generalization (M = 1.8), and training fidelity (M = 1.9). The mean scores for the remaining categories ranged from 2 to 2.6.
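The minimal-quality rule described above can be illustrated with a short sketch. It is not part of the Goldstein et al. (2014) framework itself. The threshold of 2 and the two overall means (1.8 and 3.4) come from the text, whereas the function names and the example category ratings are hypothetical and serve only to show the arithmetic. No cut-off for an “acceptable” rating is encoded, because none is stated here.

```python
from statistics import mean

# Threshold stated in the text: an overall mean of at least 2 counts as "minimal quality".
MINIMAL_QUALITY_THRESHOLD = 2.0


def overall_mean(category_scores):
    """Average the per-category ratings (each rated out of 4), rounded to one decimal."""
    return round(mean(category_scores), 1)


def meets_minimal_quality(mean_score):
    """Return True when the overall mean reaches the minimal-quality threshold."""
    return mean_score >= MINIMAL_QUALITY_THRESHOLD


# Overall means reported in the text for two of the included studies.
reported_means = {
    "Boulware et al. (2006)": 1.8,
    "Strain and Bovey (2011)": 3.4,
}

# Studies whose overall mean falls below the minimal-quality threshold.
below_threshold = [
    study for study, score in reported_means.items()
    if not meets_minimal_quality(score)
]
print(below_threshold)

# Hypothetical category ratings, for illustration only (not taken from Table 4).
example_scores = [3, 2, 1, 2]
example_mean = overall_mean(example_scores)
print(example_mean, meets_minimal_quality(example_mean))
```

Under these assumptions, a study is flagged only when its overall mean falls below 2, regardless of how low any single category rating is.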

Discussion

The purpose of this review was to gather, summarize, and evaluate the empirical literature regarding teacher-implemented early interventions for young children with ASD. The review was limited to studies conducted in inclusive preschool settings to assess intervention effectiveness, that is, the effects of interventions when implemented under “real-world” conditions. A systematic search of the literature produced 16 articles that met the inclusion criteria. Nine (56%) of the included studies used a single-case research design, three studies (19%) used a nonequivalent comparison-group design, two (13%) used a single-group pre-test post-test design, and two studies (13%) were randomized controlled trials. Various intervention approaches were evaluated across the studies, including six different comprehensive interventions and a range of targeted interventions focused on specific skills or developmental areas. Intervention dosage also varied widely across studies, as did the method, frequency, and duration of teacher training.

Child Outcomes

Overall, the results from the present review suggest that interventions delivered by teaching staff in an inclusive preschool setting can be effective in improving outcomes for young children with ASD. For 14 (88%) of the 16 included studies, the participating children were reported to have made gains in at least one primary outcome variable. For the remaining two studies (13%), the participating children showed minimal improvement or variable gains across the primary outcome measures (D’Elia et al. 2014; Garfinkle and Schwartz 2002). Interestingly, both of these studies reported positive results on secondary or collateral outcomes. The Garfinkle and Schwartz (2002) study also reported some generalization of child behaviors to nontarget activities and/or peers and highly favorable teacher ratings regarding the benefit of the intervention to target children. For the D’Elia et al. study, mean improvements for the EG were only slightly higher than those observed in the CG across most primary outcomes; however, no implementation fidelity data were reported and so it is difficult to know whether teaching staff implemented the intervention with integrity.

More than 20 different child outcomes were measured across the 16 included studies via a range of different methods and instruments. Direct observation (either in vivo or via video) was the most commonly used method. The specific outcomes measured via direct observation varied widely across studies, possibly because of the wide variety of outcomes targeted by different interventions (particularly interventions focused on specific skills/developmental areas). However, there was also a lack of consistency in the way that broader outcomes (e.g., adaptive behavior, autism severity, and communication skills) were measured. Twenty different instruments were used to measure broad child outcomes across six (38%) of the included studies, but only four of these instruments were used in more than one study. This variation in the approach to the measurement of outcomes across studies would seem to hinder cross-study comparisons of interventions, which is important for informing decisions regarding which treatment/treatments should be considered best practice (Ospina et al. 2008).

It is also important to determine whether a given intervention has a more positive effect than treatment as usual (TAU). Thus, it is useful for intervention research to compare the focus intervention with the treatment(s) that participants would typically receive. Furthermore, studies need to include a clear, detailed description of TAU conditions (Dingfelder and Mandell 2011). Four (25%) of the studies included in this review compared specific interventions with TAU and found that the intervention group performed significantly better than the CG on at least one primary outcome (Eikeseth et al. 2012; Eldevik et al. 2012; Strain and Bovey 2011; Young et al. 2016). This suggests that the interventions being evaluated in these studies appeared to be more effective than the treatment that the target children would have ordinarily received, had they not participated in the study. It is important to note that for the Eldevik et al. (2012) study, the participants were not randomly allocated to a treatment condition and this may have compromised the validity of the comparison between the EG and the CG. However, the authors reported group equivalence with respect to age, gender, duration of intervention, diagnosis, and level of intellectual disability prior to the beginning of the study. A further study (D’Elia et al. 2014) also compared intervention with TAU and reported that EG improvements on primary outcomes were not significantly better than those recorded for the CG.

Research on intervention effectiveness should also include measures of the generalization and maintenance of observed results (Koegel and Rincover 1977). However, only two studies (13%) reported on generalization of child behavior (Garfinkle and Schwartz 2002; VanDerHeyden et al. 2002). VanDerHeyden et al. reported on generalization of child behavior to nontarget activity centers, and the study by Garfinkle and Schwartz included data on generalization of child behavior to a nontarget activity and nontarget peers. The study by Garfinkle and Schwartz also included a 2- to 4-week follow-up phase and was the only included study to report on maintenance of child behavior. A further five studies (31%) did not include a follow-up phase, but measured child outcomes across a relatively long-term (> 12 months) intervention phase (D’Elia et al. 2014; Eldevik et al. 2012; Strain and Bovey 2011; Young et al. 2016). However, these data do not show whether target behavior(s) occurred in the absence of the intervention, an important consideration given that it is possible for improvements made through early intervention to decline after the intervention ends (Estes et al. 2015).

Teacher Outcomes

It is generally accepted that intervention research should include data on implementation fidelity. Such data can help to establish the extent to which the intervention was implemented as intended and thus whether its correct implementation was likely to have been responsible for any positive intervention effect (Ospina et al. 2008). The measurement of fidelity in community-based settings is especially important as it is common for fidelity to be compromised when an intervention is transferred from a more controlled clinical setting to a more natural, applied setting (Breitenstein et al. 2010; Chang et al. 2016). Research has demonstrated that child outcomes are impacted by implementation fidelity (Stahmer et al. 2017). Indeed, two of the included studies highlighted decreases in child performance corresponding with decreased levels of teacher fidelity (Kern and Aldridge 2006; VanDerHeyden et al. 2002).

With respect to reporting on implementation fidelity, eight studies (50%) reported data related to this quality indicator. This number suggests a reporting improvement over time in that an earlier review by Wheeler et al. (2006) reported that only 18% of studies reported on implementation fidelity. Of the eight studies from the current review that did report on implementation fidelity, six studies reported positive results with teachers reaching and maintaining high levels of fidelity during the intervention. For the remaining two studies (Garfinkle and Schwartz 2002; Kern and Aldridge 2006), results were mixed with some teachers failing to reach high levels of fidelity and others requiring extra coaching to meet acceptable fidelity levels. Interestingly, although studies referred to acceptable and high levels of fidelity, there did not appear to be any clear consensus across studies regarding the level of performance required to reach each level. For example, in the study by Young et al. (2016), a mean fidelity rate of 73% was deemed to be high, while in the study by VanDerHeyden et al. (2002), a mean fidelity rate of 72% was described as poor. Further, none of these eight studies provided a minimum standard of fidelity, that is, the level of fidelity required to establish experimental control and/or ensure the effectiveness of the intervention.

Also of note was the absence of generalization and maintenance data for teacher fidelity across the included studies. Only one included study (6%) reported on the generalization of teacher behavior to nontarget activities and children (McBride and Schwartz 2003). Results suggested that teacher fidelity improved from baseline but to a lesser extent than the improvement seen with target activities and children. None of the included studies reported follow-up data for teacher fidelity. This type of data may be important in establishing teacher-delivered intervention as a feasible long-term option for early intervention for children with ASD. This is because a potentially valuable goal would be for community-based intervention to become largely self-sustaining, with teachers able to implement intervention with only minimal input from outside experts such as researchers. Indeed, the long-term feasibility and sustainability of interventions is an important consideration for any mental health–related intervention research study (Proctor et al. 2009) as current research suggests that many community-implemented interventions do not sustain over time (Dingfelder and Mandell 2011).

Social Validity

Based on the quality rating scale used in this review (Goldstein et al. 2014), social validity refers to the educational or clinical significance of a study while consumer satisfaction refers to stakeholders’ perceptions of the acceptability of the treatment or intervention. We use the term social validity as an umbrella term to encompass both of these aspects. Social validity is an important consideration in the implementation of interventions in community settings because of the well-documented research-to-practice gap (Drahota et al. 2012) and the recognized role of stakeholder perceptions in bridging this gap (Stahmer et al. 2017). Findings from the included study by Strain and Bovey (2011) suggested a high correlation (r = .89) between teachers’ ratings of the social validity of the intervention and their implementation fidelity, further highlighting the potential importance of social validity. Indeed, in the Strain and Bovey study, teachers who viewed the intervention as socially valid were more likely to implement it with high levels of fidelity.

It is concerning then that only six studies (38%) included in this review met the quality requirements for acceptable measurement of social validity. All six of these studies reported overall positive results, suggesting that the interventions were viewed favorably by teaching staff. Of the remaining 10 studies, one study included a measure of social validity that was rated as “minimal” quality and the remaining 9 studies did not include any measure of social validity. Clearly, future research could be strengthened by greater attention to the assessment of social validity.

Limitations

Several limitations need to be considered with this review. Firstly, we limited included studies to peer-reviewed articles published in English, and as a result, we may have missed studies that might have otherwise met the criteria for inclusion. Also, the exclusion of unpublished work, such as theses and dissertations, may have decreased the likelihood of identifying studies with no effects or negative results, owing to publication bias. Finally, included studies may have received different quality ratings if we had used an alternative framework to assess quality.

Implications

The results of this review would seem to suggest that teaching staff might be able to learn how to effectively deliver early intervention, with a reasonable degree of fidelity, to young children with ASD in inclusive preschool environments. Further, the delivery of early intervention in these settings may improve outcomes for the participating children. However, many of the reviewed studies had minimally acceptable levels of quality based on the Goldstein et al. (2014) rating framework. Given the need for high-quality studies to guide evidence-based practice, the results of this review point to the need for additional and higher-quality studies. At the present time, any statements regarding the effectiveness of the early interventions included in this review must be viewed as tentative. Although these results suggest that a range of early intervention programs can be effective when implemented in inclusive preschool settings, further research is needed to establish the generality of the findings of this review. Specifically, there is a need for more high-quality studies that evaluate the long-term effectiveness of interventions and the long-term maintenance of both child and teacher outcomes. These studies should also include assessment of the social validity as well as the long-term feasibility of use of the intervention for different communities (Dingfelder and Mandell 2011). Alongside this broad long-term research agenda, smaller studies that examine the active components of intervention could be informative and may enable providers to make use of only those intervention components that are most likely to be necessary. Eliminating the use of inactive treatment components may reduce the costs and increase the efficiency of intervention efforts. Single-case research designs would seem well-geared towards identifying the active ingredients of a given intervention, as well as for evaluating the generalization and maintenance of treatment effects (Lord et al. 2005; Ward-Horner and Sturmey 2010).

There may also be value in future comparative research to determine which intervention approach or package, if any, is the most effective when delivered in an inclusive preschool setting. This type of research should also include a thorough assessment of the initial and ongoing costs of different intervention approaches as well as their acceptability to stakeholders (Dingfelder and Mandell 2011). It may also be valuable to compare inclusive preschool-based delivery of interventions with one-on-one or specialist preschool delivery to determine which mode of delivery is most effective. Future research should also include clearly defined child outcomes that are explicitly linked to the expected outcomes of the intervention being studied (Lord et al. 2005). It would be valuable for these outcomes to be measured with a consistent set of instruments across studies, to allow for future cross-study comparisons. Finally, future studies should include a clearly defined measure of implementation fidelity. Researchers might consider specifying a minimum level of acceptable fidelity. In line with this, future research that more closely examines possible links between implementation fidelity and child outcomes would seem of some value.