Recently, we sought to examine current research (i.e., publications between 2001 and 2005) on interventions for young children with autism (i.e., children with autism who were less than 8 years old) to determine whether any intervention had accumulated the empirical evidence needed to be considered an evidence-based practice (EBP; Reichow et al. 2007). We quickly realized that previous methods for determining EBP did not meet our needs and that a new method for evaluating the empirical evidence was required. To meet this need, the Evaluative Method for Determining EBP in Autism was developed; it is described in this paper.

We begin with an overview of previous definitions of EBP and the history and relevance of EBP for young children with autism. An outline of the evaluative method is then provided, including an overview of the rubrics for evaluating research report rigor, the guidelines for evaluating the strength of research reports, and the criteria for determining whether a practice meets the standards of EBP. Results from the field test of the evaluative method, with respect to reliability and validity, are then described. We conclude with a discussion of the challenges faced in developing the evaluative method and suggestions for its future use.

Evidence-Based Practice

The use of research to inform practice began in medicine, where it was termed evidence-based medicine (Cutspec 2004). The idea of using research to inform practice quickly spread to other disciplines, including the social sciences. Currently, most organizations representing social scientists have guidelines and definitions for using evidence to inform practice. Multiple terms have been used to denote the use of empirical evidence to inform practice, including EBP (e.g., Odom et al. 2005; Shavelson and Towne 2002), evidence-based psychotherapies (Goodheart et al. 2006; Kazdin and Weisz 2003), empirically supported interventions (e.g., Wampold et al. 2002), and empirically supported treatments (e.g., Lonigan et al. 1998).

Although a universally accepted definition of EBP does not exist, evidence from two independent randomized clinical trials conducted by separate research teams usually meets the criteria for EBP. While this criterion is used exclusively by some organizations (e.g., Hamilton 2005; Shavelson and Towne 2002), many organizations have outlined criteria for using other research methods to support the empirical evidence of a practice (e.g., Horner et al. 2005; Kratochwill and Stoiber 2002; Lonigan et al. 1998). The inclusion of additional research methods has been the cause of much debate (see Norcross et al. 2006, for a discussion of issues debated on EBP in the mental health field), and even when the same research methods are included, differences often exist in the strength and amount of evidence required for a practice to be considered evidence-based. For a field such as autism, which draws on several independent bodies of research (e.g., medical, psychological, educational) with distinct purposes, orientations, theories, and research methods, the need for common criteria for EBP becomes evident.

EBP in Autism

To address the need for an evaluation of treatments for young children with autism, the National Research Council (NRC) formed the Committee on Educational Interventions for Children with Autism (Lord et al. 2001). The committee's review of the empirical evidence on interventions for young children with autism did not identify a single practice as an EBP. Additional reviews have produced similar results (Baker and Feinfield 2003; Bristol et al. 1996; Francis 2005; Gresham et al. 1999; Rogers 1998). Many definitions of EBP were used across these reviews; however, we determined that these definitions would not be sufficiently sensitive for evaluating interventions for young children with autism. Limitations of existing methods include (a) the lack of an operational method for evaluating evidence (Francis 2005; Lonigan et al. 1998; Lord et al. 2001; Shavelson and Towne 2002), (b) the lack of an operational method for determining whether a treatment was an EBP (Kratochwill and Stoiber 2002; Lonigan et al. 1998; Lord et al. 2001), (c) the often narrow interpretation of what constituted evidence (Francis 2005; Lord et al. 2001; Shavelson and Towne 2002), and (d) the treatment of single subject research.¹ Given these problems with existing definitions of EBP and the pressing need to identify EBP for young children with autism, we felt a new method for evaluating evidence was needed.

Development of the Evaluative Method for Determining EBP in Autism

To assist researchers and practitioners in the determination of EBP for young children with autism, we created the Evaluative Method for Determining EBP in Autism. The evaluative method contains three instruments: (1) rubrics for the evaluation of research report rigor, (2) guidelines for the evaluation of research report strength, and (3) criteria for the determination of EBP. These instruments provide a standardized method for researchers, clinicians, and practitioners to evaluate the empirical evidence on autism interventions. The evaluative method also yields an individual rating for each study reviewed. By assigning individual ratings to each research report, we sought to create a method that makes it possible to combine different research methodologies (e.g., group research and single subject research) in the evaluation of EBP. To our knowledge, these are the first guidelines for evaluating the efficacy of research in the social sciences to include an operationalized method of combining evidence across research methodologies.

Rubrics for the Evaluation of Research Report Rigor

To evaluate the rigor of research reports, two rubrics were developed, one for group research and one for single subject research. These rubrics provide a method for evaluating the quality (rigor) of the methodological elements of research reports. Two levels of methodological elements are included in the rubrics: primary quality indicators and secondary quality indicators. Primary quality indicators are elements of research design deemed critical for demonstrating the validity of a study. These are operationally defined on a trichotomous ordinal scale (high quality, acceptable quality, and unacceptable quality). The secondary quality indicators are elements of research design that, although important, are not deemed necessary for establishing the validity of a study. These are operationally defined on a dichotomous scale (evidence or no evidence). Because high-integrity experiments using group designs and single subject research designs share many characteristics, every attempt was made to retain similar definitions across rubrics. Common primary quality indicators include (a) participant characteristics, (b) dependent measures, and (c) independent variables. Common secondary quality indicators include (a) interobserver agreement, (b) blind raters, (c) procedural fidelity, (d) generalization and maintenance, and (e) social validity. Indicators specific to one type of research method are also needed due to the differences in research methodologies. Definitions for the quality indicators of the group research rubric and the single subject research rubric are presented in Tables 1 and 2, respectively. Operational definitions for the trichotomous scale of the primary indicators and the dichotomous scale of the secondary indicators can be obtained by contacting the authors.

Table 1 Definition of group research quality indicators
Table 2 Definition of single subject research quality indicators
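
As an illustration of how the two rating scales fit together, the following minimal Python sketch represents one report's rubric ratings. The indicator names and scale levels follow the text and Tables 1 and 2, but the data structure itself is our illustrative assumption, not part of the published instrument.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict

class PrimaryRating(Enum):        # trichotomous ordinal scale for primary indicators
    HIGH = "high quality"
    ACCEPTABLE = "acceptable quality"
    UNACCEPTABLE = "unacceptable quality"

class SecondaryRating(Enum):      # dichotomous scale for secondary indicators
    EVIDENCE = "evidence"
    NO_EVIDENCE = "no evidence"

@dataclass
class RubricRating:
    """Rubric ratings for one research report (group or single subject design)."""
    design: str                                   # "group" or "single subject"
    primary: Dict[str, PrimaryRating] = field(default_factory=dict)
    secondary: Dict[str, SecondaryRating] = field(default_factory=dict)

# Hypothetical rating of a single subject report, using the common indicators
report = RubricRating(
    design="single subject",
    primary={
        "participant characteristics": PrimaryRating.HIGH,
        "dependent measures": PrimaryRating.ACCEPTABLE,
        "independent variables": PrimaryRating.HIGH,
    },
    secondary={
        "interobserver agreement": SecondaryRating.EVIDENCE,
        "blind raters": SecondaryRating.NO_EVIDENCE,
        "procedural fidelity": SecondaryRating.EVIDENCE,
        "generalization and maintenance": SecondaryRating.EVIDENCE,
        "social validity": SecondaryRating.NO_EVIDENCE,
    },
)
```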

Guidelines for the Evaluation of Research Report Strength

The second instrument provides a method for synthesizing the ratings from the rubrics into a strength of research report rating. Three levels of research report strength are operationalized: strong research report strength, adequate research report strength, and weak research report strength. Research reports meeting the criteria for the strong level demonstrate concrete evidence of high quality. Adequate research report strength designates reports showing strong evidence in most, but not all, areas. Weak research report strength indicates that the report has many missing elements and/or fatal flaws. The guidelines for determining research report strength are shown in Table 3.

Table 3 Guidelines for the determination of research report strength
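
Because Table 3 is not reproduced here, the sketch below only illustrates the general logic of synthesizing rubric ratings into one of the three strength levels; the numeric cut-offs are hypothetical placeholders, not the published guidelines.

```python
def report_strength(n_primary: int, n_primary_high: int,
                    n_primary_unacceptable: int, n_secondary_evidence: int) -> str:
    """Map counts of rubric ratings to a strength of research report level.

    The cut-offs below are illustrative assumptions only; the actual
    guidelines are those given in Table 3.
    """
    if n_primary_high == n_primary and n_secondary_evidence >= 3:
        return "strong research report strength"
    if (n_primary_unacceptable == 0 and n_primary_high >= n_primary - 1
            and n_secondary_evidence >= 2):
        return "adequate research report strength"
    return "weak research report strength"

# Example: all six primary indicators rated high quality, four secondary indicators shown
print(report_strength(n_primary=6, n_primary_high=6,
                      n_primary_unacceptable=0, n_secondary_evidence=4))
```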

Criteria for EBP

The final instrument provides criteria for aggregating research report strength ratings across studies to determine whether a practice has amassed enough empirical support to be classified as an EBP. These criteria are informed by those used in other fields of the social sciences. Two categories of EBP were created: established EBP and promising EBP.

An established EBP is a treatment shown to be effective across multiple methodologically sound studies conducted by at least two independent research groups. Practices meeting this requirement have demonstrated enough evidence to warrant confidence in the treatment's efficacy. A promising EBP is a treatment that has also been shown to be effective across multiple studies; however, the evidence as a whole is limited by weak methodological rigor, few replications, or an inadequate number of independent researchers demonstrating the effects. Promising EBPs should be employed with caution and closely monitored until a greater accumulation of evidence is present. Definitions for each category of EBP are presented in Table 4.

Table 4 Criteria for treatments to be considered EBP
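
To make the aggregation step concrete, the sketch below shows one way individual report strength ratings might be rolled up into the two EBP categories. The requirement of at least two independent research groups for an established EBP comes from the text; the study counts are hypothetical placeholders standing in for the actual criteria in Table 4.

```python
def classify_ebp(strong_reports: int, adequate_reports: int,
                 independent_groups: int) -> str:
    """Illustrative aggregation of report strength ratings into an EBP category.

    The thresholds on numbers of reports are assumptions made for
    illustration; only the two-independent-research-groups requirement
    for an established EBP is stated in the text (see Table 4 for the
    actual criteria).
    """
    if strong_reports >= 2 and independent_groups >= 2:
        return "established EBP"
    if strong_reports + adequate_reports >= 2:
        return "promising EBP"
    return "insufficient evidence"

# Example: three strong reports and one adequate report from two research groups
print(classify_ebp(strong_reports=3, adequate_reports=1, independent_groups=2))
```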

Reliability and Validity

Reliability of the Rubrics

We conducted a field trial during the development of the evaluative method to evaluate the reliability of the rubrics. The exemplars for the field trial were drawn from the articles identified in a review of interventions for young children with autism published between 2001 and 2005 (Reichow et al. 2007). The literature searches for that review were conducted using PsycINFO and MedLine and yielded 124 reports published in peer-reviewed journals. These reports were numbered in the order in which they appeared in the searches and then assigned a random number using a computerized random number generator (Urbaniak and Plous 2006). From the randomized list, 8 group research reports and 10 single subject research reports were rated by two independent raters. From these evaluations, reliability statistics were calculated; they are shown in Table 5. Using established criteria for inter-rater agreement, reliability was good (.60–.74) to excellent (.75–1.00) (Cicchetti 2001) and substantial (.61–.80) to almost perfect (.81–1.00) (Landis and Koch 1977).

Table 5 Kappa statistics for reliability field trial
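
For readers who wish to reproduce this kind of agreement analysis on their own ratings, a minimal Python sketch of Cohen's kappa follows; the example ratings are hypothetical and are not the data behind Table 5.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of ten quality indicators by two independent raters
a = ["high", "high", "acceptable", "high", "unacceptable",
     "high", "acceptable", "acceptable", "high", "high"]
b = ["high", "high", "acceptable", "high", "acceptable",
     "high", "acceptable", "acceptable", "high", "high"]
print(round(cohens_kappa(a, b), 2))  # ≈ .81, excellent by Cicchetti's (2001) benchmarks
```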

Since the field trial, the rubrics have been used to evaluate communication interventions (Doehring et al. 2007) and to complete the review of interventions for young children with autism (Reichow et al. 2007). Across these analyses, the percentage of agreement between the reviewer and an experienced rater ranged between 84% and 100%, with corresponding kappas ranging between .65 (good) and 1.00 (excellent/perfect). The high agreement across applications and individuals supports the evaluative method as a tool for reliably reviewing autism intervention research.

Validity of the Evaluative Method

Several forms of validity support the definitions used in the proposed evaluative method, including concurrent validity, content validity, and face validity. By reviewing and aligning the definitions used for the proposed rubrics with previous definitions of EBP (e.g., Kratochwill and Stoiber 2002; Lonigan et al. 1998; Lord et al. 2001; Odom et al. 2005), the rubrics demonstrated concurrent validity. Content validity was established through the operationalization of what is needed for a practice to be considered an EBP (see Table 4). Finally, because the definitions were aligned with common usage, they demonstrated face validity.

Further validity can be demonstrated through the process of training inexperienced clinical examiners to use these rubrics reliably. This process is currently the model used to train clinical examiners to meet standards for diagnosing the presence or absence of autism (e.g., see the work of Klin et al. 2000). When training inexperienced clinical examiners, the ratings of experienced examiners serve as the gold-standard criterion judgment, and the extent to which inexperienced examiners agree with these evaluations at a designated criterion provides evidence of validity. Demonstrations of the validity of this method occurred with a clinical psychologist (Doehring et al. 2007) and a special education doctoral candidate (Reichow et al. 2007). Although these analyses demonstrated good to excellent reliability, further investigation, including evaluation by more diverse evaluators (e.g., teachers, policy makers, related service providers), is still required.

Discussion

The creation of our evaluative method was not without challenges, including how to combine group research and single subject research, how the method might help close the research-to-practice gap, and how well the method generalizes. Empirical questions needing further analysis are raised within the context of these challenges. A summary, including future directions, concludes this report.

Integration of Results from Group and Single Subject Research Designs

We believe our criteria for the determination of EBP are among the first to provide an operationalized method for this purpose. From a philosophical-theoretical point of view, we believe that group research (the nomothetic approach to science) and single subject research (the idiographic approach to science) are both valuable approaches to furthering knowledge in any field of scientific inquiry. Our criteria allow us to include well-supported studies and exclude poorly supported ones, regardless of experimental design. Finally, and again from a theoretical stance, the idiographic-nomothetic approach to science remains of interest to behavioral scientists (e.g., the recent scholarly work of Grice et al. 2006). Because psychology, medicine, and education are all fields that focus on the individual, we propose that these sciences need not be based solely upon the results of group-design studies.

One difficulty we encountered when determining how to synthesize group design and single subject design research reports was quantifying the amount of single subject research comparable to one group study. We determined that eight participants from single subject experiments were equivalent to one group study. This figure was determined by averaging two previous quantifications from definitions of EBP containing group research and single subject research (Lonigan et al. 1998; Odom et al. 2005). Because this definition is the first attempt at synthesizing group research and single subject research, empirical validation of this criterion is required.
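
As a concrete reading of this conversion, the short sketch below expresses the 8-to-1 equivalence as a function. The eight-participant figure comes from the text; treating the converted total as additive with the count of group studies is our illustrative assumption.

```python
# Average of the two prior quantifications (Lonigan et al. 1998; Odom et al. 2005)
SINGLE_SUBJECT_PARTICIPANTS_PER_GROUP_STUDY = 8

def group_study_equivalents(n_group_studies: int, n_single_subject_participants: int) -> float:
    """Convert a body of evidence into group-study equivalents.

    Eight single subject participants are treated as equivalent to one group
    study; combining the two counts additively is an assumption made here
    for illustration only.
    """
    return n_group_studies + n_single_subject_participants / SINGLE_SUBJECT_PARTICIPANTS_PER_GROUP_STUDY

# Example: two group studies plus 16 single subject participants -> 4.0 equivalents
print(group_study_equivalents(2, 16))
```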

Addressing the Research to Practice Gap

One important challenge of the EBP movement has been the translation of research to practice (Chorpita 2003; Kazdin 2001). Our definition was created with the intent that practitioners would be able to use our scales and apply the judgment rules to evidence they locate in journals or other sources (e.g., books, the internet) to inform the choices they make in practice. This decision led us to keep the definition from becoming too technical while trying to maintain its overall merit. Hence, some aspects of the current definition of EBP may be perceived by some as overly simplistic. To balance this, the guidelines for determining research report strength are stringent and should only allow studies with high methodological rigor to receive strong research report strength ratings. The recent review of 124 intervention studies for young children with autism conducted by Reichow et al. (2007) confirmed the stringency of the ratings and was consistent with the findings of the NRC report (Lord et al. 2001). Using the definition of EBP provided here, practitioners should be able to make informed decisions about which interventions are EBPs and, therefore, which to use in their classrooms.

However, the creation of a practitioner-friendly method of evaluating EBP might not be enough to close the research-to-practice gap. One remaining barrier between research and practice is access to, and comprehension of, research reports. We propose three ways to combat this element of the gap: a standardized method of locating and retrieving research reports, better instruction and guidance on interpreting research reports, and a standardized method of evaluating research reports. Excellent resources exist that describe methods for locating and retrieving evidence (see Lucas and Cutspec 2005; McHugo and Drake 2003; Petticrew and Roberts 2006). With respect to improving instruction and guidance on interpreting research reports, the field of medicine has published guidelines for medical practitioners on how to read research reports and assess evidence (Greenhalgh 2001). Although similar attempts have been made in the social sciences (e.g., Levin 2002; Weisz and Hawley 1998), the guidelines are not as thorough or user-friendly. A resource such as the one provided by Greenhalgh (2001) would be very helpful in the social sciences, particularly in psychology and education, where evidence comes from many different research methodologies, including some with which the practitioner might not be familiar. Finally, it has been suggested that a systematic, standardized method of reviewing evidence is needed (Weisz et al. 2000). We feel the evaluative method presented here provides such a standardized method and that, with the aforementioned resources for locating and comprehending research, practitioners have all the tools necessary to conduct thorough, systematic reviews.

A final aid for closing the research-to-practice gap would be the inclusion of research report strength ratings in published reviews. Two examples of reporting strength ratings can be seen in the review of comprehensive treatment programs in autism by Kasari (2002) and the review of EBP in diagnostic techniques for autism by the American Academy of Neurology (Filipek et al. 2000). If research report strength ratings were included, practitioners would not have to review these studies again, reducing the time they would need to spend reviewing research to determine EBP.

Generality of Evaluative Method

We have outlined a new method of evaluating empirical evidence designed specifically for application to interventions for young children with autism. The method enables researchers and practitioners to review the evidence on treatments for young children with autism and, for the first time, to synthesize that review into a determination of which treatments can be considered EBPs. The evaluative method was developed in response to the perceived inadequacy of current definitions of EBP in relation to research conducted on children with autism and the continued (and growing) need to define EBPs in autism. Although these instruments were developed and field tested on interventions for young children with autism, there is no reason the definitions cannot be modified for use by professionals who work with other populations (e.g., older individuals with autism, individuals with mental retardation, individuals with Down syndrome). Future work using the evaluative method is needed to provide evidence of its generality.

Conclusion

In principle, the goal of identifying educational practices based on scientific evidence is admirable; using scientific evidence to inform practice should increase the likelihood of providing effective treatments. However, researchers have yet to establish EBP for young children with autism. To address this need, the evaluative method reviewed in this paper was created. Our intention is not to lower the standards of evidence needed for a practice to be considered an EBP, but rather to create an evaluative method that accommodates common challenges faced when conducting research on young children with autism. Early assessments of reliability and validity suggest the evaluative method is reliable and provides a valid assessment of the empirical evidence on practices for young children with autism. Further application of these tools should allow researchers, clinicians, and practitioners to identify, for the first time, interventions for young children with autism that can be considered EBP.