Introduction

The last few decades have witnessed an increase in the prevalence of autism spectrum disorder (ASD), currently estimated to affect between 1.5 and 2% of the population (Baio et al., 2018; Blumberg et al., 2013). This growth necessitates that appropriate educational interventions become more mainstream, as few, if any, educational systems can adequately support such a large proportion of their population in separate systems. It has been suggested by those diagnosed with autism that there is a need for an increase in research focused upon translational benefits for this community, to enhance the lives of those with autism and those of their families (Pellicano et al., 2014). Together, these factors provide an impetus to better develop evidence-based practice around interventions and support for those with autism that can be delivered through mainstream support systems.

Among the defining features of ASD are deficits in social and communication skills. These have profound effects on the development and functioning of individuals across a range of psychological and educational domains (American Psychiatric Association [APA], 2013; Williams-White et al., 2007). While social impairments are diverse, commonly identified problems include difficulties initiating social interactions, understanding linguistic conventions, and interpreting both verbal and nonverbal cues (Rao et al., 2008; Williams-White et al., 2007). These deficits negatively impact children’s learning, play skills, and friendship development (Rogers et al., 2005), as well as general social behaviour within both social and educational settings (Carter et al., 2005). Persisting into adulthood, these impairments have been linked to a lack of social support (George & Stokes, 2018), academic (Gurbuz et al., 2019), and occupational under-achievement (Hayward et al., 2018), as well as poorer long-term prognosis (Howlin, 2005; Spain & Blainey, 2015). Consequently, individuals with ASD not only face many difficulties achieving developmental milestones, but also experience considerable behavioural and educational challenges.

Evidence suggests that early interventions are the most effective means of attenuating the long-term impact that problems with social function have on individuals with ASD (Spain & Blainey, 2015). These interventions are typically designed to develop and enhance the functional and communication skills necessary to negotiate social interactions and decrease problematic behaviours (Rao et al., 2008). Such programmes have been reported to be effective for a range of childhood conditions (see Spain & Blainey, 2015, for review), with group-based interventions that facilitate greater opportunities for peer support viewed as highly advantageous within autism (Williams-White et al., 2007). Children with ASD spend much of their time in the classroom and also represent a rapidly growing group of school-aged children with specialized needs (Roux et al., 2015a, 2015b). Hence, it would be difficult to overstate the importance of implementing interventions that identify and cater to the needs of students with ASD, particularly within mainstream classrooms (McLeskey & Waldron, 2007).

A considerable number of academic, behavioural, and social interventions have been published in the literature, with interventions varying on a number of factors including participant age groups, methodologies, and targeted behaviour (Ozonoff & Miller, 1995; Rogers, 2000). Of the few studies that have specifically focused on early interventions for school-aged children, the majority have been conducted in settings that are not educationally based (i.e. laboratories, participants’ homes, community, or clinical settings; Ospina et al., 2008), while others have often been practice-based and constrained by serious methodological limitations, small sample sizes, and a short-term focus. Few control for the time spent in school environments.

An examination of the literature suggests that there is a shortage of research that investigates interventions and educational approaches delivered in school settings by teachers and other practitioners for learners with ASD (Stokes et al., 2017). Recent meta-analyses have overlooked this (i.e. Grynszpan et al., 2014; Virues-Ortega et al., 2013; Wang et al., 2013), the exceptions being meta-analyses by Kokina and Kern (2010) and Whalon et al. (2015). However, while the majority of interventions analysed by Kokina and Kern (2010) were applied within the school setting, the focus was on the use of social stories specifically and included participants of all intellectual abilities. Furthermore, although Whalon et al. (2015) focused on school-based interventions, their search criteria were restricted to papers that focused on single cases, inconsistently captured persons with pervasive developmental disorders (PDD; i.e. used search terms “autism” and “Asperger” only), focused solely on schooling within North America (e.g. used search term “elementary”), were restricted to participants 12 years and under, and did not control for intelligence; nor did the meta-analysis include dissertations and other grey literature, international literature, or higher levels of schooling. Educational interventions have been reviewed within The National Standards Project (National Autism Center, 2015) and the National Professional Development Center of Autism (Wong et al., 2013). However, in these comprehensive reviews of literature published up to 2011 and 2012 respectively, educational interventions were presented alongside a variety of other behavioural and developmental interventions applied to a variety of children with ASD and young adults. Consequently, clear recommendations concerning educational best practice have yet to be adequately developed (Delmolino & Harris, 2012). Potentially, this could lead to ill-informed attempts to support students with ASD and their education experience (Smith, 2008). Thus, despite ASD being a widely studied condition, it is evident that there remains little clarity as to which educational interventions are most appropriate.

The heterogeneous presentation of autism highlights the potential limitations of simpler, reductive, nomothetic approaches to research. Individual learning needs of children with ASD have become more widely recognized (Harrower & Dunlap, 2001), suggesting that most children require specialized support to succeed within educational contexts, which for some time have been argued as paramount to the success of teaching practice (Harrower & Dunlap, 2001). However, as many interventions are not appropriate or feasible at all times in the educational environment, it is important to identify which education-based interventions are most useful in reducing core deficits, as well as how they can best be delivered to optimize outcomes (Spain & Blainey, 2015) within the constraints of the educational system.

The purpose of this systematic review was to synthesize the existing empirical knowledge of school-based interventions and therapies that have been delivered to students with ASD and build on previous research and reviews. This review aimed to identify effective interventions, thereby assisting educators in managing specific deficits experienced by children with ASD.

Method

This review involved the systematic analysis of literature published from 2000 to June 2019 that empirically examined the impact of education-based interventions on students with ASD. Literature included journal articles, dissertations, theses, and books written in the English language, with no imposed restrictions on publication status.

Study Eligibility Criteria

Studies were included in this review based on six criteria. Included studies (a) had participants who attended kindergarten/pre-school, primary, or secondary school; (b) who had a formal diagnosis of autism or ASD; and (c) who did not have comorbid intellectual disability. The absence of intellectual disability was supported by participants either having a diagnosis of autism with a confirmed full-scale IQ of 70 or higher, being recruited into a study that specifically excluded participants with full-scale IQ under 70, or having an autism-related diagnosis that implied an absence of intellectual disability, such as Asperger’s syndrome (AS) or high-functioning autism (HFA). AS and HFA were included in the absence of full-scale IQ given the requirement of AS that intelligence be within or above typical range, and HFA being operationally defined as a person with autism whose intelligence mirrors AS (Attwood, 2007). We limited studies to students lacking a comorbid intellectual disability to avoid confounding across conditions, as interventions including individuals with intellectual disability may be focused more toward supporting intellectual function than autism, and therefore, the intervention or the measured outcome may not directly relate to autism. Included studies also (d) applied an intervention aimed at teaching skills to children, (e) conducted the intervention within the context of a school, and (f) included a pre- to post-intervention comparison. Lastly, in association with the changes to diagnostic criteria introduced in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision (DSM-IV-TR), we restricted publications to those published between January 2000 and June 2019 inclusive. There were a small number of exceptions. First, articles were included that focused on teaching academic skills such as written language and mathematical ability and took place outside of schools, since these studies focused on behaviours that are directly related to academic skills essential for success in the classroom and that could be adapted to the classroom. Second, if there was a mixture of eligible and ineligible participants within a study, it was retained if data extraction for eligible participants was possible.

Regarding intelligence, the decision to include only cases with full-scale IQ of 70 or more could result in a number of cases being excluded where there is other evidence suggesting retention. There has been some debate over whether, and at what level, the Verbal IQ – Performance IQ (VIQ-PIQ) combination indicated typical range intellectual function (Charman et al., 2011; Lincoln et al., 1995). Given the debate and the desire to be as inclusive as reasonable, it was decided that, given the norming procedure of VIQ and PIQ (Wechsler, 2003), it would be unlikely that participants with a combination of VIQ and PIQ that both exceeded 70 points would have a full-scale IQ less than 70 points; thus, such cases would be retained in the analysis but the IQ scores would not be used as part of the mean IQ calculation.
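The intelligence-related rules described above can be sketched as a small screening function. This is an illustrative sketch only, not the procedure the review team used: the `Participant` fields, the function names, and the order in which the rules are checked are all assumptions, and a confirmed full-scale IQ is assumed to take precedence over a diagnostic label.

```python
from dataclasses import dataclass
from typing import Optional

IQ_CUTOFF = 70  # full-scale IQ threshold stated in the eligibility criteria


@dataclass
class Participant:
    # All field names are hypothetical and chosen for illustration.
    diagnosis: str                       # e.g. "autism", "ASD", "AS", "HFA"
    fsiq: Optional[float] = None         # full-scale IQ, if reported
    viq: Optional[float] = None          # verbal IQ, if reported
    piq: Optional[float] = None          # performance IQ, if reported
    study_excluded_low_iq: bool = False  # study excluded FSIQ under 70


def meets_iq_criterion(p: Participant) -> bool:
    """Return True if the participant would be retained on the IQ rules."""
    if p.fsiq is not None:
        # Assumption: a reported full-scale IQ overrides every other rule.
        return p.fsiq >= IQ_CUTOFF
    if p.study_excluded_low_iq:
        return True
    if p.diagnosis.upper() in {"AS", "HFA"}:
        return True
    if p.viq is not None and p.piq is not None:
        # The text requires that both scores exceed 70 points.
        return p.viq > IQ_CUTOFF and p.piq > IQ_CUTOFF
    return False


def counts_toward_mean_iq(p: Participant) -> bool:
    """Cases retained on VIQ and PIQ alone are left out of the mean IQ."""
    return p.fsiq is not None
```

For example, `meets_iq_criterion(Participant("autism", viq=80, piq=75))` returns `True`, while `counts_toward_mean_iq` returns `False` for the same case because no full-scale IQ is available.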

Regarding methodology, in order to be as robust as possible, this analysis reports on standardized measures and student response measures; variables based on proxy reports from teachers and parents using unstandardized measures were excluded from the analysis. This is due to the elasticity often present between parent reports, teacher observation, and clinical judgement (Clionsk et al., 2012; Lemler, 2012). Publications were also to be excluded if ceiling effects presented during baseline or there was inconsistent intervention delivery. Within such studies, analyses not impacted by these issues were to be retained.

The autism community has indicated that the terms low and high functioning are simplistic, with the terms high or low needs being suggested as better descriptors (Parliament of Victoria, Family and Community Development Committee, 2014). However, the term HFA has been confounded with diagnosis in much published research. Consequently, it is difficult to remove the term yet refer transparently and appropriately to much of the literature. Hence, we have retained this term throughout, though we recognize, with respect, its limitations.

Information Sources

An electronic search was completed using ERIC, INFORMIT, OpenGrey, MEDLINE Complete, PsycEXTRA, PsycINFO, PubMed, Scopus, Web of Science, TROVE, and Proquest Dissertations and Theses Global. These sources were used as they provided a comprehensive overview of the psychological, educational, and grey literature. The following search terms were used: (Autis* or Asper* or PDD*) and (high function* or HFA) and (educat* or class* or teach* or academ* or learn* or school* or kinder*) and (interv* or therap* or instruct* or treat* or procedur* or manag* or effect*). Reference lists and citations were also screened within publications as were other relevant review publications (e.g. Grynszpan et al., 2014; Kokina & Kern, 2010; Southall & Campbell, 2015; Virues-Ortega et al., 2013; Wang et al., 2013; Whalon et al., 2015; Wong et al., 2013). To prevent data inflation, papers were removed if they were pilot studies, or dissertations that were then later published in the peer-reviewed literature that we had otherwise included. Additionally, studies were excluded where the same participants had been included within another publication focused on similar interventions and outcomes and where these cases could not be separated (e.g. as detailed in Bauminger, 2007a).

Study Selection and Data Extraction

Four studies where VIQ and PIQ were provided asserted 33 participants as having typical range IQ based on combined VIQ and PIQ (e.g. Agrawal, 2013; Roux et al., 2015b; Scott, 2013; Wiegand, 2003). These cases were included and the remaining participants within these four publications were excluded for having VIQ or PIQ less than 70, while one was excluded for absence of a formal diagnosis.

Inter-rater Reliability

Three researchers (also members of the authorship team) were trained by a senior author and screened all papers that presented in the literature search, first based on title and abstract. Papers flagged by one or more of the three researchers during this phase underwent full-text review by two of the original three researchers and were screened according to the inclusion and exclusion criteria. Inter-rater reliability between the two researchers was 88.5%. Instances of discrepancy were resolved by referral back to the senior researcher who conducted all screening training, where the paper was discussed as a group to determine inclusion or exclusion.
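The text does not state how the inter-rater reliability figure was computed. One common option is simple percentage agreement, sketched below as an assumption rather than as the authors' method; the function name and the example decisions are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of papers on which two raters made the same decision.

    Each argument is a sequence of include/exclude decisions, one per paper,
    in the same paper order for both raters.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must rate the same non-empty set of papers")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical decisions for four papers (True = include).
example = percent_agreement([True, True, False, False],
                            [True, False, False, False])
```

Percentage agreement does not correct for chance agreement; a chance-corrected statistic would need the full table of decisions, which is not given in the text.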

Summary Measures and Statistical Analysis

Each publication that met the inclusion criteria was examined, with all relevant descriptive information, demographic characteristics, study results, and effect sizes manually collected and summarized in tabular form (refer to Supplementary Tables 1 to 6). When a comparison group was available, both between-group and within-group effects are reported. As the variables of interest varied widely between studies, often being incompatible, individual analyses were conducted on each of the main skill areas targeted by interventions.

The magnitude of effect sizes was interpreted in accordance with Cohen’s (1969) guidelines, which describe d values as demonstrating no change (d: 0.0–0.20), small change (d: 0.20–0.50), moderate change (d: 0.50–0.80), or large change (d > 0.80). When d values were not given, but could be calculated, effect size was calculated by two researchers. When change between pre- and post-intervention was measured only as a percentage, or when a measure included a maximum score allowing for fractions, such as an unstandardized questionnaire, and effect sizes could not be calculated, change was calculated as percentage difference. In the event that there were a number of data collection points at pre- and/or post-intervention, percentage difference was calculated by subtracting pre-intervention mean percentage from post-intervention mean percentage. If the measure was a count, it was treated as unit change, whereby pre-intervention scores were subtracted from post-intervention scores. In the absence of guidelines to assist in interpretation of percentage difference, this was interpreted as demonstrating no, small, moderate, and large change based on the following bands: 0 to 20%, 20 to 50%, 50 to 80%, and greater than 80%, respectively. Percentage difference is typically reported as an absolute value (Wenning, 2014); however, this omits valence, and thus, this analysis retains the use of valence symbols in order to maintain directional information. In instances where only a proportion of participants of the overall sample were eligible for inclusion, but effect sizes were only available for the complete sample, results were listed as “not reported” (nr). At times, this occurred when only one, or a proportion, of the sample met eligibility criteria, but only group results were available.
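The interpretation bands and change scores described above can be sketched as follows. The function and constant names are illustrative. Because the stated bands share their end points (for example, 0.20 closes one band and opens the next), the sketch assumes that a value falling exactly on 0.20 or 0.50 (20% or 50%) belongs to the higher band, and that 0.80 (80%) is still moderate, since large is defined as greater than 0.80.

```python
from statistics import mean

# Band edges taken from the text: 0.20 / 0.50 / 0.80 for Cohen's d
# and 20 / 50 / 80 for percentage difference.
D_BOUNDS = (0.20, 0.50, 0.80)
PCT_BOUNDS = (20.0, 50.0, 80.0)


def classify_magnitude(value, bounds):
    """Label the size of a change, ignoring its sign.

    Assumption: shared end points go to the higher band, except the top
    edge, which stays "moderate" because "large" is strictly greater.
    """
    small_from, moderate_from, large_above = bounds
    size = abs(value)
    if size > large_above:
        return "large"
    if size >= moderate_from:
        return "moderate"
    if size >= small_from:
        return "small"
    return "no change"


def percentage_difference(pre_percentages, post_percentages):
    """Post-intervention mean percentage minus pre-intervention mean.

    The sign is kept so that the direction of change is not lost.
    """
    return mean(post_percentages) - mean(pre_percentages)


def unit_change(pre_count, post_count):
    """Change in a count measure: post-intervention minus pre-intervention."""
    return post_count - pre_count
```

For instance, with hypothetical pre-intervention points of 20% and 30% and post-intervention points of 60% and 70%, `percentage_difference` gives +40.0, which `classify_magnitude(40.0, PCT_BOUNDS)` labels as a small change.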

When results were presented as figures only, data were manually extracted by measurement and converted into percentage difference or unit change as appropriate. For instance, the data figure was enlarged, and a grid applied to the figure using the measurement units provided by each paper’s author, and each data point was manually measured. Measurement was completed by two members of the research team. Measurements were only included where agreement was obtained between independent measurements. Where these were not obtained, the process was discussed and repeated from the beginning and agreement sought. Effect size could not be calculated using this method. This method was required for a number of studies (see Supplementary Tables 1–6); therefore, despite steps taken to ensure the reliability of results, it is advised they be interpreted with caution.
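The authors extracted figure data by hand using an overlaid grid. The arithmetic behind reading a value off a linear axis can be sketched as below. This is an analogy to the manual procedure, not a description of it: the function names, the pixel coordinates, and the agreement tolerance are all assumptions, since the text does not say how close two measurements had to be to count as agreeing.

```python
def pixel_to_value(pixel, pixel_ref0, pixel_ref1, value_ref0, value_ref1):
    """Convert a position on a linear axis to a data value.

    pixel_ref0 and pixel_ref1 are the positions of two labelled axis ticks,
    and value_ref0 and value_ref1 are the values printed at those ticks.
    """
    if pixel_ref0 == pixel_ref1:
        raise ValueError("Reference ticks must be at different positions")
    span_pixels = pixel_ref1 - pixel_ref0
    span_values = value_ref1 - value_ref0
    return value_ref0 + (pixel - pixel_ref0) * span_values / span_pixels


def measurements_agree(first, second, tolerance):
    """True if two independent readings differ by no more than tolerance.

    The tolerance is a hypothetical parameter; the text gives no threshold.
    """
    return abs(first - second) <= tolerance


# Hypothetical figure: the 0% tick sits at pixel row 400, the 100% tick at row 100.
reading = pixel_to_value(250, 400, 100, 0.0, 100.0)
```

A point halfway between the two ticks maps to the midpoint of the axis range, and a pair of readings that fails `measurements_agree` would be re-measured, in line with the repeat-until-agreement step described above.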

Studies that employed single case designs applied visual analyses to outcomes and interpretations. This presented a number of issues for calculating effect size, such as accuracy, calculating standard error, and formulating confidence intervals. Given the few papers that adopted comparable protocols and measures, single case designs were included in the review, although often only the direction of the effect or the percentage change could be determined. However, we did endeavour to assess the strength of the evidence based on the numerous single case designs using the Council for Exceptional Children’s “standards for evidence-based practice in special education” to identify interventions that could be considered evidence-based or potentially evidence-based, or that would, according to the tool, be supported by mixed evidence, insufficient evidence, or evidence of negative effect (Cook et al., 2014). The variability in intervention types and outcomes investigated and the descriptive nature of many studies prevented meeting the tool’s criteria for identifying any evidence-based, potentially evidence-based, or negative impact school-based interventions included in this review.

Results

Study Selection

A total of 14,998 records were identified through searches. From these, 146 publications that investigated 166 interventions were retained for the analysis (see Fig. 1 for the PRISMA flow diagram).

Fig. 1 Flow diagram of study selection investigating education-based intervention among students with ASD and the included participants

Participants

Following application of exclusion criteria, 589 participants remained within the review (see Fig. 1); 492 (83.5%) of whom were male and 80 (13.6%) female, while the sex of the remaining participants was unspecified (Jacquez, 2018; Roux et al., 2015a, 2015b). These were distributed over 146 publications (including theses and dissertations), which between them applied 165 manipulations or interventions and evaluated over 400 outcome behaviours.

While at the time of this analysis ASD had become a mainstream term, incorporating various pervasive developmental disorders (APA, 2013), the majority of trials were published prior to this change and therefore reported AS, autism, and pervasive developmental disorder, not otherwise specified (PDDNOS). Diagnoses were 213 (36.2%) AS, 170 (28.9%) autism, 37 (6.3%) PDDNOS, 117 (19.9%) HFA, and 52 (8.8%) were classified using the encapsulating ASD diagnosis, consistent with current diagnostic regulation (APA, 2013). Changes in diagnostic criteria may have contributed to the reduced number of publications meeting criteria within recent years. Publications post 2013 trended towards ASD diagnoses, without providing information about the level of functioning. Consequently, these publications were often excluded. No participants were included with a diagnosis of Rett’s syndrome.
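As a quick arithmetic check, the diagnosis counts reported above sum to the 589 retained participants, and the stated percentages follow from dividing each count by that total. The variable names below are illustrative.

```python
# Diagnosis counts as reported in the text.
diagnosis_counts = {
    "AS": 213,
    "autism": 170,
    "PDDNOS": 37,
    "HFA": 117,
    "ASD": 52,
}

total_participants = sum(diagnosis_counts.values())

# Percentage of the retained sample per diagnosis, to one decimal place.
diagnosis_percentages = {
    label: round(100 * count / total_participants, 1)
    for label, count in diagnosis_counts.items()
}
```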

Participants were aged 3 to 18 years (MAge = 10.28, SD = 5.16). One study’s eligibility criteria included students aged up to 21; however, specific participant ages were not given (Copeland, 2011). When publications provided only the school year level of the participant (e.g. Haskins, 2012; Laushey et al., 2009; McDaid, 2007; Potter, 2014), or the participants’ age range (e.g. Gal et al., 2009; Lorenzo et al., 2013; Lovett, 2012; Tsao, 2009), cases were excluded from age-related calculations. While age estimates can be made from a student’s year level, this assumption was not made due to the elasticity between chronological and mental age often present among students with ASD (Pellicano et al., 2014), potentially influencing their educational progression. Participants’ IQ was reported for 235 participants, which ranged from 70 to 140 (MIQ = 96.75, SD = 8.09).

Race was rarely reported, being available for 112 (19.2%) participants in 41 publications. The majority of those were Caucasian (n = 65, 58.0%); ten different racial backgrounds were ascribed to 50 participants, while the remaining were listed by the authors as an “other” minority background. For a breakdown per study, refer to Supplementary Tables 1 to 6. Of the 105 publications missing this information, only two reflected on this limitation (Copeland, 2011; Cunningham, 2009).

Interventions

Interventions within eligible studies were overwhelmingly found to target one of four skills: academic, on-task behaviour, verbal, and social. The majority of outcomes targeted social skills and could be further broken down into three themes: social cognition, interaction, and play. These were therefore the main areas of the review. It was far less common for studies to target other behaviours, and due to the heterogeneity of these behaviours, they were excluded from the review. Other target areas included inappropriate self-soothing (Deaton, 2007), stereotypy (Conroy et al., 2005; Southern, 2004; Sterkin, 2012), relaxation levels (Kampfer-Bohach, 2008), perceived loneliness (Bauminger, 2007a), psychological distress (Pahnke et al., 2014), and functional skills, which included eating skills (Bledsoe et al., 2003), dance execution (Gies, 2012), object retrieval (Ogle, 2012), and repetitive and restrictive behaviours (Waugh & Peskin, 2015). Results are presented based on the six core outcome themes. The numerous unique interventions and lack of componentry information that would allow for commonalities to be identified meant that it was not possible to categorize based on intervention type.

Interventions that targeted academic skills focused on theoretical and practical outcomes directly related to academic work tasks and grading. On-task behaviour-based interventions focused on increasing behaviours that facilitate learning (such as engagement, independence, and appropriate classroom behaviours) and/or reducing behaviours that inhibit learning (such as disruptive classroom behaviour, task intolerance, and non-compliance). Verbal skill-based interventions focused on the skills required for verbal communication, as well as interventions that targeted other verbal behaviours associated with ASD, such as echolalia and selective mutism. Social behaviours were separated due to the large number of studies that focused on social behaviour outcomes; thus, differences were often subtle. Social cognition interventions focused on improving the mental process of and acquisition of knowledge relating to fundamental socialization skills such as recognizing emotional cues. Interventions that focused on social interaction targeted reciprocal behaviours with others for which social skills would be the prerequisite. This included improving interactions including appropriate social initiation and responding. Play-based interventions specifically focused on increasing skills for appropriately engaging and/or cooperating in collaborative and imaginative play.

Herein, studies and publications refer to each individual report; intervention refers to the manipulation undertaken, wherein a single study may have one or more interventions; and behaviours refer to the dependent variables measured, wherein each intervention may be measured over several behaviours and therefore appear a number of times across the review’s areas of investigation. Many publications (n = 47) investigated effects of one or more interventions over a range of dependent variables and therefore appear in multiple categories. An example is that of Adams (2003) who investigated the effects of participation in dance classes on areas of play, social interaction, and verbal skills. Dependent variables identified during full-text screening that did not fall into any of the six categories were not included in the review (e.g. undesirable behaviours specific to the subject of interest, parent satisfaction with intervention); therefore, a publication may have been retained without mention of all variables investigated.

Academic Task Skills

Academic skills were the most popular area of investigation: 35 publications investigated the impact of 46 interventions on 81 behaviour outcomes (Supplementary Table 1). These focused on theoretical and practical outcomes directly related to academic work tasks and grading (test scores, etc.). Single case and small sample designs were common, with 17 behaviour outcomes assessed against single participant samples. The mean sample size reported was 3 participants (SD = 5.57), with only four interventions assessed within a sample of 10 or more participants. Data were available for 108 individuals who met eligibility criteria, some of whom participated in multiple interventions (O’Connor & Klein, 2004; Rago, 2013; Stringfield et al., 2011; Valentine, 2001; Whitby, 2009). Of these participants, only 18 (13.7%) were female. Compared to the overall sample, those contributing to the academic skill subset were slightly younger (t(605) = 0.67, p = 0.50, d = 0.05; MAge = 9.91 years, SD = 2.88) with an IQ score similar to the overall average IQ (t(268) = 0.01, p = 1.00, d = 0.00; MIQ = 96.74, SD = 15.78).

Examination of the effectiveness of academic skill interventions on students with ASD revealed generally positive outcomes. Three studies, reporting on five interventions, either provided an effect size or sufficient data to calculate Cohen’s d, two of which were randomized controlled trials. When there was a control group, average between-group effects were moderate, d = 0.74 (SD = 0.49); when comparison was pre- to post-intervention, effects were large on average, d = 1.29 (SD = 0.18). Results suggest that academic skills, in particular reading skills, may be improved through intervention.

Twenty interventions were measured based on percentage difference or data that could be translated to percentage difference over 30 behaviours, and of these, many reported considerable effects. Percentage difference ranged from no change (Whitby, 2009) to 96.00% (Dixon et al., 2016). The mean percentage difference across studies was 43.80% (SD = 28.40). Six studies reported results as counts over 19 behaviours, measured as unit change. As the definition of one unit differs depending on the researcher’s question, these could not be appropriately compared but are displayed per intervention, per behaviour in Supplementary Table 1. Further, maximum possible count values were frequently unavailable, preventing conversion to percentage change.

All but two trials reported positive or mixed (some positive and some negative results) change. The only study found to produce no change in academic skills post-intervention was that of Chen (2000) who found no change in reading performance following a strategy focusing on learning Chinese phonetic symbols. Cayce (2012) found mixed results for the one student who met eligibility criteria; however, upon averaging pre- and post-scores, change in academic achievement was found to only vary by 1%.

On-Task Behaviour

On-task behaviour was evaluated in 34 publications that investigated the impact of 40 interventions on 76 behaviour outcomes (Supplementary Table 2), involving 109 individuals, some of whom were exposed to multiple interventions (e.g. Blakeley-Smith et al., 2009; Finn, 2013; Fondacaro, 2001; Groot, 2014; Shogren et al., 2011). Of these participants, only 16 were female (14.7%). The mean sample size reported was 2.95 participants (SD = 3.83). Only five interventions were measured against a sample of 10 or more participants. Compared to the overall sample, those contributing to the on-task behaviour subset were about 2 years and 1 month older in age (t(622) = 1.22, p = 0.22, d = 0.01; MAge = 9.66 years, SD = 2.49) with an IQ score 11.05 points lower (t(110) = 8.78, p < 0.001, d = 1.67; MIQ = 85.70, SD = 17.03).

Cohen’s d was provided or could be calculated for 5 interventions, on 8 behavioural outcomes (M = 0.81, SD = 0.77). Of the 13 studies for which percentage difference was reported or calculable, 34 behavioural outcomes were assessed, and difference ranged from no change (Grey et al., 2007) to a large positive change of 98.47% (Cale et al., 2009; experiment 1), with the mean percentage difference across studies being 41.21% (SD = 28.97). In two of three experiments conducted by Cale et al. (2009), the combination of visual schedules, verbal warnings, environmental rearrangements, and cue cards increased the ability of students to complete individual tasks and complete teaching sessions.

When outcomes were measured in units (e.g. time), or as a count, these were reported in Supplementary Table 2 as unit change. These represented the number of occurrences of a behaviour or number of minutes engaged in a task. As mentioned, units were not comparable and therefore average unit change is not an informative measure of change. In total, nine studies provided data that could be calculated into units, measured based on 19 behaviour outcomes; all studies found the intervention to produce positive effects on at least one behaviour; however, for Ko (2002), the positive direction of change was, on average, less than 1 disruptive behavioural occurrence (0.6).

Play

Play skills involved the reciprocal act of play, imaginative play, and the specific skills used when engaging in play. Eighteen interventions in 15 publications assessed 31 behaviour outcomes (Supplementary Table 3), involving 47 individuals, some of whom were represented multiple times where multiple interventions were applied (Lydon et al., 2011; Reinecke, 2005). Of these participants, only 3 were female (from 3 studies). The mean sample size was 3.13 students per intervention (SD = 3.22), with only one study including 10 or more participants. Compared to the overall sample, those contributing to the play skills subset were younger in age (t(543) = 3.24, p < 0.05, d = 0.28; MAge = 7.20 years, SD = 2.63) with an IQ score slightly lower where IQ could be extracted (n = 15; t(15) = 0.46, p = 0.65, d = 0.24; MIQ = 95.00, SD = 14.52).

Effect sizes could only be calculated based on available data for one trial (Gal et al., 2016), which found the program StoryTable (Zancanaro et al., 2007) to have a strong positive effect on engagement in collaborative play involving puzzles (d = 1.43) and a craft activity (collage; d = 1.31). This was based on the largest sample within the play category and included 14 male students aged 7 to 12 years.

Six studies reported percentage difference over 10 behaviours, or in a form that could be translated into this. Average percentage difference across studies was 15.04% (SD = 38.56). The lowest value was a moderate negative change (Reinecke, 2005) when play activity choice was measured after a toy (−16.72%); only when a toy was paired with an edible reinforcement did play activity choice increase (46%). The greatest percentage change was found by Bock (2007b), who found that engaging in organized sports games increased by a moderate 69.7% after social–behavioural learning strategy training. Four studies reported change in units of observed actions over 14 behaviours.

Social Cognition

Social cognition was evaluated in 35 publications that investigated the impact of 38 interventions on over 150 behaviour outcomes (Supplementary Table 4), involving 286 individuals, some of whom participated in multiple interventions (Lopata et al., 2008; Tartaro, 2011; Wilkinson, 2010). Of these participants, 35 were female, from 17 studies. Single case and small sample designs were used in most of these studies. The sample size ranged from 1 to 26, and the mean reported was 8.71 (SD = 6.88) participants, with 12 interventions measured using a sample of 10 or more participants. Compared to the overall sample, those contributing to the social cognition subset were slightly older in age (t(705) = 0.65, p = 0.52, d = 0.05; MAge = 10.67 years, SD = 10.79) with an IQ score 3.06 points higher (t(345) = 2.35, p = 0.02, d = 0.25, MIQ = 99.81, SD = 16.13).

Examination of the effectiveness of interventions targeting social cognition among students with ASD revealed generally positive outcomes. Cohen’s d was provided or could be calculated for 14 interventions, on over 90 behavioural outcomes. When there was a control group, average between-group effects were moderately negative, d = − 0.29 (SD = 2.07), and when comparison was pre- to post-intervention, on average, effects were large and positive, d = 1.44 (SD = 1.29). It should be noted that while some studies appeared to report sufficient detail, examination revealed in several instances that this was not the case (e.g. Ko, 2002, missing error df). Of the results reporting Cohen’s d or where sufficient information was provided to derive this, large effects were found for a cognitive behavioural intervention that taught friendship, emotions, and social interpersonal problem solving (Bauminger, 2002). When students were required to define emotions and give examples of experiences of the emotion, Bauminger (2002) found a large positive effect on students’ ability to provide a specific example, based on their ability to associate the emotion with a prior experience (basic emotion: d = 1.52, complex emotion: d = 3.94). An increase in knowledge of complex emotion relating to an audience was also found (basic emotions: d = 2.45, complex emotions: d = 4.10), suggesting students with ASD increased their ability to recognize and identify people’s emotions. Eight studies reported percentage difference over 13 behaviours or data that could be translated to this. Although Pyle (2018) found no change (2.00%) in social engagement when a daily report card was used, Leaf et al. (2009) found a moderate increase (71.90%) in social skills after a teaching interaction procedure with reinforcement and priming. The mean percentage change across studies was 26.75% (SD = 27.09). Five studies reported units of change or data that could be converted to this for 13 behaviours.
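Where a study reports only group means, standard deviations, and sample sizes, Cohen’s d can often be derived from those summary statistics. The sketch below uses the common pooled standard deviation form for two independent groups. It is an illustration under that assumption and not necessarily the procedure applied to every study in this review; pre- to post-intervention designs in particular admit several alternative standardizers. The function name and example values are hypothetical.

```python
import math


def cohens_d(mean_1: float, sd_1: float, n_1: int,
             mean_2: float, sd_2: float, n_2: int) -> float:
    """Standardized mean difference (group 1 minus group 2) using the pooled SD."""
    if n_1 + n_2 <= 2:
        raise ValueError("need more than two observations in total")
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean_1 - mean_2) / math.sqrt(pooled_var)


# Hypothetical intervention vs. control summary statistics; not data from the review.
d_example = cohens_d(mean_1=12.0, sd_1=2.0, n_1=10, mean_2=10.0, sd_2=2.0, n_2=10)
```

With very small samples, such as the single-case designs common in this literature, estimates of d from this formula are unstable, which is one reason the averages reported here carry large standard deviations.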

Six interventions resulted in negative change. In one trial, Lopata et al. (2008) found that giving feedback to 18 students on specific behaviours had a small negative impact on emotion recognition in children (d = − 0.21) and adults (d = − 0.25), as did giving feedback that was not based on operationally defined behaviours (child d = − 0.17, adult d = − 0.06), although the latter is operationally defined as no change (Cohen, 1969). They also found giving feedback on operationally defined behaviours increased parent-reported social skills (d = 0.20) but that teachers reported a decrease (d = − 0.39). In contrast, Cunningham (2009) found a peer-mediated intervention had a positive effect on teacher-reported social skills (+ 7.50%), but social skills decreased according to parents (− 20%). Negative results were also found for social stories on peer interaction (Holmes, 2008).
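The labels attached to d values above can be made explicit with a small helper. The sketch below assumes the conventional benchmarks attributed to Cohen (0.2 small, 0.5 medium, 0.8 large), with anything below 0.2 in absolute value treated as negligible, consistent with the "no change" reading above. The function and label names are hypothetical, and the descriptive labels used elsewhere in this review may not follow these cut-offs exactly.

```python
def classify_effect(d: float) -> str:
    """Label the magnitude of Cohen's d using conventional benchmark cut-offs."""
    magnitude = abs(d)  # the sign gives direction, not size
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


# d values quoted in the text, used here only to show the labelling.
labels = {d: classify_effect(d) for d in (-0.06, -0.17, -0.21, -0.25, 1.43)}
```

These cut-offs are conventions rather than fixed rules, so a label should always be read alongside the sample size and design of the study that produced the estimate.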

Social Interaction

Social interaction was the second most commonly investigated area. The impact of 47 interventions on over 80 behaviour outcomes was assessed in 38 publications (Supplementary Table 5), involving 189 individuals, with many participating in multiple interventions (Apple et al., 2005; Banner, 2007; Lopata et al., 2008; Talebi, 2007). Of these participants, only 24 were female, from 10 studies. Samples ranged from 1 to 29, and the mean sample size reported was 4.97 participants (SD = 6.83), with only 4 interventions and behavioural outcomes measured based on a sample of 10 or more participants. Compared to the overall sample, those contributing to the social interaction subset were younger in age (t(588) = 1.01, p = 0.29, d = 0.09; MAge = 9.54 years, SD = 8.22) with an IQ score 0.90 points lower (t(322) = 0.78, p = 0.44, d = 0.09; MIQ = 95.85, SD = 11.92).

Overall, intervention effects were positive. Effect size was available or could be calculated for 8 studies on 40 social behavioural outcomes. d values ranged from no change (d = − 0.06; Segura, 2012) to very large changes when cognitive behavioural training (Bauminger, 2002, 2007b) and peer-mediated training (Cunningham, 2009) were applied. When there was a control group, average between-group effects were moderate, d = 0.23 (SD = 0.34), and effects were large, although more varied, when comparison was pre- to post-intervention, d = 1.80 (SD = 2.61).

Fourteen social behaviours were reported as percentage difference over nine studies, or as data that could be translated to this, and of these, only one reported a large effect (Koegel et al., 2012b). This was for engagement with peers following an intervention involving a social club themed around the 3 participants’ preferred hobbies, which also increased the frequency of unprompted peer initiations. At the lowest end of the range, Banner (2007) found their social skill intervention produced a slightly negative (− 1.17) effect on inappropriate social interaction; however, the slight increase in negative interaction was also accompanied by a 16.85% increase in appropriate social interaction. Mean percentage difference within social interaction was 24.04% (SD = 29.62). A moderate change between pre- and post-intervention was also recorded by Bock (2007b), who used a social–behavioural learning strategy with four students aged 9 and 10 years and found this significantly improved cooperation by 55.33% (p < 0.05), together with peer social interactions, which increased by 41.81%, although the latter was not significant. Most studies measured their intervention(s) on a single baseline and post-treatment comparison.

The outcomes of 19 behaviours were reported as units of change by nine studies; the unit of which was the number of behavioural occurrences. LaCava (2007) reported change based on a scoring system and found that when a mind-reading guide for emotions was applied, students’ empathy and emotional recognition increased for both positive and negative interactions with adults by 5.5 and 2.86 occurrences respectively. However, negative effects were recorded for empathy and emotion recognition when interactions were with peers, regardless of whether the interaction was positive (2.17 fewer positive interactions) or negative (2.20 more negative interactions). Negative effects were reported for five interventions on eight social interaction behaviours.

Verbal Skills

Verbal skills were a focus of 20 interventions, from 15 publications that reported on 36 verbal behaviours. They involved the smallest sample of 29 students, some of whom appeared in multiple interventions (Abraham, 2008; Cook, 2002; Kagohara et al., 2013; Lydon et al., 2011; Valentine, 2001; see Supplementary Table 6). Interventions that targeted verbal skills focused on the ability of students to successfully convey information, particularly through speech. There was some overlap with the social interaction domain. When the communication focused on a reciprocal dialogue, the trial was retained within social interaction. Results were based on small samples within this group ranging from one to four students.

On average, participants were slightly older (MAge = 10.68, SD = 3.13, t(36) = 0.63, p = 0.53, d = 0.21) and had an IQ score 1.58 points higher than the overall average (MIQ = 98.33, SD = 16.57, t(248) = 0.68, p = 0.50, d = 0.09), based on the participants for whom IQ was reported (n = 15).

Positive results were largely reported, although samples were small. Interventions were based on an average of 1.93 students (SD = 1.03). No studies contained sufficient information to calculate Cohen’s d. Five studies reported percentage difference or data that could be translated to this across 9 behaviours. At most, the change was small at 22.05%, and non-significant (Davis et al., 2010) when conversational skills were assessed after power cards were applied to two adolescent students. Mean percentage difference across studies was 9.70% (SD = 6.94). Unit change was reported in five studies, for 10 verbal behaviours. Cook’s (2002) dissertation investigated the effects of two interventions on verbal skills, one of which found positive results. Cook found social skills training produced a larger positive effect compared to disability awareness training and that this was more pronounced on verbal responding, which increased by an average of 22.33 occurrences, compared to verbal initiation which only increased by an average of 2.95 occurrences.

Maintenance and Generalizability

For details on maintenance and generalizability for each intervention, see Supplementary Tables 1 to 6. Notably, within on-task behaviour, only one intervention, Angell (2005), was found to be both maintained and generalized across settings. Specifically, Angell (2005) found verbal cues and a 5-min grace period reduced off-task behaviour among five young male students. Within social interaction, both generalized and maintained effects were reported for three different interventions, all of which were based on samples of 1: video self-modelling (Buggey, 2005), combined theory of mind testing and social skills training (Feng et al., 2008), and a reading comprehension intervention (Reutebuch et al., 2015). Within verbal skills, Swaine (2004) investigated the effects of social stories and role play, and while maintenance was found for talking in a classroom (as a positive behaviour), the long-term effects of the intervention were not clear for the remaining variables. The two verbal skill interventions that reported generalization (video modelling and pivotal response training) were both investigated by Lydon et al. (2011). The effect was only found in a single male student, aged 5 years old.

Discussion

This systematic review aimed to synthesize and build on the existing research concerning school-based interventions for students diagnosed with ASD. Specifically, the review sought to understand the efficacy, generalizability, and maintenance of these interventions and to identify those interventions that have been found most useful in managing core deficits and, in turn, identify the most appropriate strategies that will assist educators in managing specific deficits experienced by children with ASD. While the aims of all included studies were to target and improve a range of ASD-specific deficits, there were marked differences between studies in the structure, content, and duration of the interventions and in the outcome measures used. Consequently, it is difficult to compare the effect of the interventions on the wide range of behaviour outcomes, or the maintenance and generalization of the intervention effects. It is important to note that significant improvements were reported across all symptom domains following the administration of various targeted interventions, and that the result of all quantitative syntheses provides support for the effectiveness of school-based interventions for students diagnosed with ASD.

Of the 165 interventions that were investigated, the majority targeted social impairments that persons with ASD can experience. The importance of improved social interaction, likely mediated by improved social cognition, verbal skills, and play skills, cannot be exaggerated, with the literature consistently demonstrating that positive social interaction contributes to improved physical and mental health among people of all age groups (Monshouwer et al., 2013; Umberson & Montez, 2010). The positive outcomes of the school-based interventions targeting social skills found here are broadly consistent with the findings in three systematic reviews that specifically examined the effectiveness of group-based social skill interventions within ASD (Cappadocia & Weiss, 2011; Reichow et al., 2012; Spain & Blainey, 2015). The three reviews examined outcomes in both youth and adults and found improvements in the domains of communication, quality of reciprocity, and quality of friendships. The addition of our findings further strengthens the evidence to support social skill interventions in reducing impairments and improving quality of life for students with ASD whose IQ is above 70.

The results of studies testing interventions designed to improve academic and on-task skills indicated significant positive outcomes. Although a proportion of the interventions identified no change in targeted behaviours, overall conclusions provide preliminary support that school-based interventions can be successfully used to improve adaptive skills in students with ASD whose IQ is above 70. Most notably, Blakeley-Smith et al. (2009) found that adjusting the academic environment so that curricular demands did not exceed student competency improved students’ essay and handwriting considerably.

Maintenance and generalization were not widely studied, but where included, the measures of maintenance and generalization varied widely. Few studies assessed generalization, with some indicating a skill applied in an alternative setting, while others indicated generalization to alternate tasks. Maintenance was difficult to compare given variability in the interval between cessation of the intervention and post-testing. Some maintenance effects were measured over a matter of days (e.g. Stringfield et al., 2011), weeks (e.g. Carnahan et al., 2016; Cunningham, 2009; Delano, 2007b; Paterson & Arco, 2007; Reutebuch et al., 2015; Schatz, 2017; Songlee et al., 2008), or months (e.g. Grindle et al., 2013; Gutman et al., 2010; Leakan, 2012; Roux et al., 2015a, 2015b). In some instances, length of follow-up varied between participants within the same study (e.g. Howorth et al., 2016; Kagohara et al., 2012, 2013; Price et al., 2017) or for some participants was not measured (e.g. Schatz, 2017).

Limitations

The decision to focus on school-based interventions among students who had a formal diagnosis necessarily excluded evaluation of some early interventions designed to support children with ASD during the important preschool years (Prior & Roberts, 2012; Zwaigenbaum et al., 2015). There was no restriction on age; however, it is possible that requiring participants to have a formal diagnosis meant pre-school and kindergarten students were under-represented in this review, with only nine studies found to include at least one eligible participant in early education (≤ 6 years). Furthermore, restricting sample eligibility criteria to students with ASD with a recorded IQ of 70 or more restricted this review’s focus to “higher-level” interventions that require prerequisite skills such as communication and self-regulation. These limitations necessarily reduce the generalizability of results, and the level of intellectual functioning and age at which school-based interventions can be confidently introduced, as effects may be contingent upon the students’ developmental level. For example, Segura (2012) found an increase in social skills for most participants. However, adverse effects were recorded for the youngest participant. Additionally, samples within each study were generally small and at times reduced further once ineligible study participants were removed, and females were under-represented.

In most studies, reliability and efficacy were favoured over ecological validity and effectiveness respectively, whereby study protocols frequently excluded participants with various comorbidities or those using various medications. While excluding these participants increases confidence when determining effect, the low external validity of findings from such restricted samples may limit the potential for the results to inform the development of evidence-based interventions. An example of the importance of including representative, rather than controlled, samples of people with ASD is provided by LaCava (2007) who found persons with a diagnosis of ASD experienced a positive outcome following intervention, while those with comorbid diagnoses experienced either no or negative effects.

Although the included studies provide useful evidence about a wide range of interventions designed to improve school children’s academic achievements, the quality of evidence selected for inclusion in this review was generally poor. This is due largely to the preference for single case follow-up (pre- and post-comparison within each individual) and the nature of the subject, which constrains the participant pool at the level of diagnosis, suitability, proficiency, accessibility, cooperation, etc., to create small, relatively homogeneous samples that do not necessarily represent the population or the context in which the population lives. For example, the under-representation of girls in some studies and the omission of girls in many represents a serious bias that needs to be addressed in future studies. The same can be said for children from non-Caucasian racial backgrounds, who it appears were also under-represented, although this was unclear because the majority of studies lacked this level of demographic detail.

Future Directions

The majority of interventions that focused on younger persons (< 7 years) targeted behavioural and social outcomes, followed by academic performance. It is important to recognize that behavioural difficulties are not a symptom of ASD, but rather a response that reflects the difficulty in managing the core challenges of ASD. These challenges include managing sensory reactivity and deficits in communication and emotional understanding, all of which are key diagnostic criteria of ASD (APA, 2013). Indeed, the focus on behaviour and social interventions may be to address the challenges faced by the carers of children with ASD rather than those of the child with ASD.

This review reveals a deficit in evidence supporting maintenance of an intervention’s effectiveness over time and generalization of the intervention’s outcomes across settings. This finding should not be interpreted as a failure to demonstrate maintenance and generalization in studies, but rather a reticence to test maintenance and generalization: Approximately half of studies investigated maintenance and only one-third investigated generalization. Two-thirds of studies that tested maintenance or generalization found support, but very few applied rigorous statistical criteria or used sufficiently large samples to provide a strong test. While a lot of time and resources are involved in retaining participants for these following stages of evaluation, the investment is necessary to evaluate the long-term potential and usefulness of any intervention.

Future research is needed to investigate the secondary effects of interventions and whether improvements are also made on other aspects of student life, notably mental health. Individuals diagnosed with ASD often suffer co-morbid mental illness, such as anxiety and depression, yet this was rarely a focus. For example, positive secondary effects were described by Valentine (2001), who found that distance technology, such as phone and email peer tutoring, had a positive effect on the participants’ academic skills and social communication, as well as affect. McDaid (2007) found a modified form of inoculation training had a positive effect on on-task behaviours as well as both crying and self-injurious behaviour. Negative secondary effects have also been reported; for example, Gal et al. (2009) found that story narration with an enforced collaboration paradigm led to positive impacts on social interaction but also elevated frustration and increased autistic behaviours for one of six participants. Time of intervention exposure or participation also needs to be considered, with Holmes (2008) reporting that temper control perhaps worsened as the study progressed despite social stories’ positive effect on task completion.

Valentine (2001) and Groot (2014) were the only eligible studies to investigate technology-based interventions. This was unexpected due to the progressive use of technology in schools, and its wide use to assist students with ASD. It is anticipated that this will become a focus in future research.

Conclusions

This review found many small studies investigating a variety of school-based interventions. We found that students with ASD (with formal diagnoses and IQ > 70) appear to respond well to school-based interventions to the extent we can assess from the evidence reviewed herein. The behaviour domains targeted by studies included here reflect the academic and social environment of schools and the identified need to improve children’s fit and achievement levels within schools (i.e. academic skills, on-task and play behaviours, social cognition and interactions, and verbal skills). The value of the school-based intervention is the partnership between the child, interventionists, and the school, to create a school environment that is aware of and can support and maintain the intervention and its effects. Given this potential, there may be opportunities to broaden the scope of current interventions to address children’s additional challenges in sensory reactivity, communication, and emotional understanding, which underlie many of their more overt behaviour problems. The evaluation of ASD interventions, school-based or otherwise, constitutes only a small proportion of the corpus of ASD research (Kutchel, 2015; Pellicano et al., 2014). The consensus is that research needs to focus less on the pure and more on the applied science to help those with the condition by establishing interventions that are both therapeutically and educationally beneficial (Pellicano et al., 2014).