Introduction

Systematically identifying effective interventions for students is of paramount importance to educators. The evidence-based practice movement has gained momentum in the past decade with developments in psychology, education, and prevention science. Federal law defines scientifically based research as “rigorous, systematic and objective procedures to obtain valid knowledge” (No Child Left Behind Act of 2001, p. 126), which includes research that is “evaluated using experimental or quasi-experimental designs” (p. 541). Recent Race to the Top initiatives confirm the importance of using evidence to adopt educational practices and policies, as reflected in the priority goals outlined in the Department of Education’s performance plan (Analytical Perspectives 2010). The adoption and implementation of evidence-based practices is a responsibility not only of practitioners but also of researchers, who must evaluate the feasibility and effectiveness of interventions integrated into practice settings (Kratochwill and Shernoff 2003).

The identification of effective interventions is increasingly based on the outcomes of meta-analyses (Cooper et al. 2009), and practitioners should examine meta-analytic studies to inform policy decisions (Kavale and Forness 2000). One problem with relying on meta-analyses is that they often exclude data generated from small-N and single-case design (SCD) studies evaluating interventions. SCDs can provide a strong basis for establishing causal inference, and these designs are widely used in applied and clinical disciplines in psychology and education (Kratochwill et al. 2010). The traditional approach to analysis of SCD data involves visual comparison within and across conditions of a study (Parsonson and Baer 1978); however, when aggregating SCD data in meta-analysis, statistical approaches to calculating effect sizes are needed.

As yet, there is no agreed-upon effect size (ES) metric from SCDs for use in quantitative syntheses of intervention studies. Researchers have proposed a variety of ES metrics for synthesizing single-case research, including percentage of non-overlapping data (PND; Scruggs et al. 1987), percentage of all non-overlapping data (PAND; Parker et al. 2007), and percentage exceeding the median (PEM; Ma 2006). Each of these ES metrics uses non-overlapping data between phases as an indicator of performance differences. Although potentially useful, weaknesses of existing non-overlap indices cited by Parker and Vannest (2009) include “(a) lack of a known underlying distribution (PND); (b) weak relationship with other established effect sizes (PEM); (c) low ability to discriminate among published studies (PEM, PND); (d) low statistical power for small N studies (PND, PAND, PEM); and (e) open to human error in hand calculations from graphs (PND, PAND, PEM)” (p. 357).

In an attempt to address these weaknesses (with the possible exception of human error), Parker and Vannest (2009) proposed a fourth index of non-overlapping data, non-overlap of all pairs (NAP). NAP calculates a percentage of non-overlapping data by examining the extent to which each data point in phase A (baseline) overlaps with each data point in phase B (intervention). Initial research on NAP is promising; however, little is known about interpreting NAP values in relation to detecting and evaluating treatment effects, specifically within multiple baseline designs (MBDs). Preliminary evaluations of NAP used AB contrasts drawn from published articles employing a variety of designs, including AB, ABAB, and MBDs (Parker and Vannest 2009). However, a recent review found that SCDs are increasing in prevalence overall, with a growing emphasis on MBDs (Hammond and Gast 2010). As discussed above, the evidence-based movement urges the identification of effective educational interventions in the field. A majority of academic intervention research in education is conducted with MBDs, perhaps because most academic skills are nonreversible (Bramlett et al. 2010). An ES that allows educators to accurately interpret the aggregated results of MBD studies is needed to identify academic interventions as evidence-based on a larger scale. When behaviors are nonreversible, intervention data are less likely to overlap with baseline data because performance is not expected to return to baseline levels, potentially inflating non-overlap indices within MBDs relative to the AB case study and ABAB withdrawal designs examined by Parker and Vannest (2009).

MBDs are considered more complex than withdrawal or reversal designs because a number of responses are identified and measured over time by varying the length of baseline observations across two or more behaviors, settings, or participants (Baer et al. 1968). Because observations in MBDs extend across two or more baselines, the design typically yields more data points (e.g., 60–80 total; Parker et al. 2007), also contributing to its complexity relative to withdrawal or reversal designs. Moreover, specific parameters of effect detection (sensitivity, specificity) are rarely investigated in SCD ES evaluation studies and, to the authors’ knowledge, have never been investigated specifically for MBDs. In this context, sensitivity refers to how accurately an approach detects a true effect, while specificity refers to the ability of an approach to correctly identify the absence of an effect (McNeil and Hanley 1982).

The purpose of the current study was to investigate the use of NAP with MBDs due to the increasing use of MBDs to evaluate the effectiveness of academic interventions, the need for ESs to allow for quantification of intervention effects, and the potential for SCD non-overlap ESs to differ in MBDs. The following research questions guided the analyses: (a) What are typical NAP values within MBDs?, (b) To what extent is NAP sensitive and specific in detecting intervention effects in SCD studies using MBDs?, (c) What constitutes a large and small NAP effect size in MBDs?, and (d) To what extent does a framework for interpreting NAP ES estimates within MBDs agree with interpretations based on visual analysis?

Method

Data Collection

The PsycINFO, ERIC, and Academic Search Premier databases were searched for articles on March 26, 2011, using the terms multiple baseline AND reading intervention, multiple baseline AND math intervention, multiple baseline AND writing intervention, multiple probe AND reading intervention, multiple probe AND math intervention, and multiple probe AND writing intervention. The following criteria were used to select articles to include in the current meta-analysis:

  1. The study was published in a peer-reviewed journal between 2000 and 2011; this date range was selected because a previous study aggregated academic intervention research prior to 2000 (Swanson et al. 1999);

  2. The study investigated an intervention to enhance reading, writing, or math performance;

  3. The study used a multiple baseline or multiple probe design with three or more participants, settings, behaviors, or sets of materials;

  4. The study provided sufficient data to compute NAP;

  5. The study’s participants were school-age (3–21 years); and

  6. The study was written in English.

The search identified 127 articles, 85 of which met the above criteria and were included in the current study. Forty-two studies were excluded, most often because of an insufficient number of baselines, outcomes focused on social-emotional rather than academic behaviors, or duplicate publication. When a study reported multiple outcomes or multiple comparison sets (e.g., multiple subjects in a multiple baseline across materials study), each outcome or comparison set was included separately, resulting in 176 comparison sets.

Coding of Articles

Studies that met inclusion criteria were systematically reviewed and coded using a coding form created in a spreadsheet program. The authors coded study design characteristics and a decision about the effect based on visual analysis.

Study Design Characteristics

Design characteristics included the type of multiple baseline employed and the number of participants, settings, or behaviors included. The type of multiple baseline was coded as follows: across participants, across settings, across behaviors, a combination of the three aforementioned conditions, across materials, and an “other” category.

Visual Analysis

Visual analysis was also conducted to decide whether an effect existed. As suggested by the single-case technical documentation of the What Works Clearinghouse, two of the authors evaluated changes in level, trend, variability, and immediacy of change between the baseline and first treatment phase (Kratochwill et al. 2010). Change in level was determined by comparing the last data point in the baseline phase with the first data point of the intervention phase. Change in trend refers to a difference in the slope of the data path between phases. Change in variability refers to the fluctuation of the data around the mean from the baseline to the intervention phase. Immediacy of change refers to the rapidity of the effect after the onset and/or withdrawal of the intervention, that is, the extent to which the level, trend, and variability of the last three data points in the baseline phase differ from those of the first three data points in the intervention phase. If changes in at least two of these four elements were detected between phases, the authors coded an intervention effect. Agreement was established by rating two studies collaboratively; the authors then coded the remaining studies independently and convened to reconcile coding discrepancies in order to minimize interobserver error throughout the coding process.

The data for each study were judged to demonstrate a large intervention effect if 75% or more of the baselines within the MBD showed an intervention effect as described above. (If a study had fewer than four baselines, each baseline needed to demonstrate an effect for the study to be coded as a large effect.) Multiple baseline data in which at least 50% but fewer than 75% of the baselines demonstrated an effect were judged to show a small effect.
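
To make this decision rule concrete, the sketch below restates it in Python. It is an illustration only, not part of the original coding procedure; the function name and its arguments are hypothetical.

```python
def classify_effect(baselines_with_effect, total_baselines):
    """Restates the classification rule described above: 'large' when at
    least 75% of baselines show an effect (all of them when there are
    fewer than four baselines), 'small' when at least 50% but fewer than
    75% do, and 'none' otherwise."""
    proportion = baselines_with_effect / total_baselines
    if total_baselines < 4:
        if baselines_with_effect == total_baselines:
            return "large"
        return "small" if proportion >= 0.5 else "none"
    if proportion >= 0.75:
        return "large"
    if proportion >= 0.5:
        return "small"
    return "none"

print(classify_effect(3, 4))  # 75% of baselines show an effect -> "large"
print(classify_effect(2, 4))  # 50% of baselines show an effect -> "small"
```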

Effect Size Calculation

NAP (Parker and Vannest 2009) was used to estimate an ES for each set of participants. As mentioned above, NAP calculates a percentage of non-overlapping data by examining the extent to which all possible pairs of data points across phases overlap. Each data point in the baseline phase (A) was compared with each data point in the intervention phase (B) to determine whether overlap occurred. Each overlapping pair of data points was assigned a value of 1, and each tied pair was assigned a value of 0.5. NAP for a given baseline-to-intervention phase change is then derived by summing these overlap values, subtracting the sum from the total number of possible comparison pairs (i.e., the number of data points in phase A multiplied by the number of data points in phase B), and dividing the result by the total number of possible comparison pairs. For example, a study with 5 baseline data points and 11 intervention data points has 55 possible pairs. If one baseline data point overlaps with two intervention data points and ties with one, the overlap sum is 2.5. Subtracting 2.5 from 55 gives 52.5, and 52.5 divided by 55 equals 0.95, the NAP value for this contrast. Readers are directed to Parker and Vannest (2009) for further information on how to calculate NAP as well as other SCD ESs (PND, PEM, and PAND). NAP was then averaged across the baselines of each outcome within each study.
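
For readers who prefer a computational statement of this procedure, the following Python sketch computes NAP for a single baseline-to-intervention contrast. It is not the authors’ original code; the example data are invented to reproduce the 5-by-11 example above and assume that higher scores reflect improvement.

```python
import numpy as np

def nap(baseline, intervention, increase_is_improvement=True):
    """Non-overlap of All Pairs (NAP) for one baseline-to-intervention
    contrast: overlapping pairs count 1, ties count 0.5; the overlap sum
    is subtracted from the total number of pairs and divided by that total."""
    a = np.asarray(baseline, dtype=float)
    b = np.asarray(intervention, dtype=float)
    overlap = 0.0
    for x in a:
        for y in b:
            worse = (y < x) if increase_is_improvement else (y > x)
            if worse:
                overlap += 1.0   # intervention point overlaps the baseline point
            elif y == x:
                overlap += 0.5   # a tie counts as half an overlap
    total_pairs = len(a) * len(b)
    return (total_pairs - overlap) / total_pairs

# Hypothetical data matching the worked example: one baseline point (13)
# overlaps two intervention points and ties one -> (55 - 2.5) / 55 = 0.95
baseline = [10, 11, 9, 11.5, 13]
intervention = [12, 12.5, 13, 15, 16, 17, 18, 19, 20, 21, 22]
print(round(nap(baseline, intervention), 2))  # 0.95
```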

Interobserver Agreement

Interobserver agreement was calculated for 25% of the included studies and outcomes. Percentage agreement was calculated as the number of agreements divided by the sum of agreements and disagreements, multiplied by 100, for both coding variables and visual analysis decisions. Percentage agreement between the two raters was 98.9% for coding study variables and 92.2% for decisions from visual analyses.
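
Stated explicitly as code (illustration only, with hypothetical counts rather than the study’s tallies):

```python
def percent_agreement(agreements, disagreements):
    # Agreements divided by the sum of agreements and disagreements, times 100.
    return 100.0 * agreements / (agreements + disagreements)

print(round(percent_agreement(45, 5), 1))  # hypothetical counts -> 90.0
```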

Analyses

Kappa coefficients and receiver operating characteristic (ROC) analysis were used to answer the primary research questions for this study. The kappa coefficient (Cohen 1960) is a measure of observer agreement that takes chance into account: chance agreement is subtracted from observed agreement, and the difference is scaled by the maximum agreement attainable beyond chance. ROC analysis was used to assess the extent to which a measure (in this case, NAP) reaches the same dichotomous outcome as a “gold standard” measure (in this case, visual analysis). ROC analysis has become widely used in psychology to assess the accuracy of diagnostic tests and procedures (Swets et al. 2000).
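
The sketch below shows the kappa computation, where p_o is observed agreement and p_e is the agreement expected by chance from each rater’s marginal proportions. The classifications are hypothetical and are not the study’s data.

```python
import numpy as np

def cohens_kappa(rating_a, rating_b):
    """Cohen's kappa for two raters over the same cases:
    kappa = (p_o - p_e) / (1 - p_e)."""
    a = np.asarray(rating_a)
    b = np.asarray(rating_b)
    categories = np.union1d(a, b)
    p_o = np.mean(a == b)                         # observed agreement
    p_e = sum(np.mean(a == c) * np.mean(b == c)   # chance agreement
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical NAP-based vs. visual-analysis classifications
nap_class = ["large", "large", "none", "small", "none", "large"]
vis_class = ["large", "small", "none", "small", "none", "large"]
print(round(cohens_kappa(nap_class, vis_class), 2))
```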

Sensitivity and specificity are central concepts in ROC analysis. Sensitivity refers to NAP’s ability to detect true effects, while specificity refers to its ability to correctly identify cases in which there is no effect. The two trade off against each other: as the cutoff is made more lenient, sensitivity increases and specificity decreases, and vice versa. Accuracy in a ROC analysis is summarized by the area under the ROC curve (AUC), which reflects how well the measure (NAP) balances false positives and false negatives across possible cutoffs. Possible AUC values range from 0.50 to 1.0, and in accordance with criteria used in past investigations (e.g., Muller et al. 2005), an AUC above 0.80 was taken to indicate a reasonable measure.
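
A minimal sketch of this kind of analysis using scikit-learn is shown below. The arrays are entirely hypothetical (not the study’s data); roc_curve exposes the candidate cutoffs whose sensitivity/specificity trade-off is described above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical data: 1 = visual analysis judged an effect, 0 = no effect,
# paired with the NAP estimate for the same comparison set.
visual_effect = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1])
nap_values = np.array([0.98, 0.96, 0.99, 0.88, 0.97, 0.91,
                       0.95, 0.86, 1.00, 0.94, 0.92, 0.97])

# Area under the ROC curve: 0.50 is chance-level classification, 1.00 is perfect.
auc = roc_auc_score(visual_effect, nap_values)

# Coordinates of the curve: each candidate NAP cutoff trades sensitivity
# (true positive rate) against specificity (1 - false positive rate).
fpr, tpr, thresholds = roc_curve(visual_effect, nap_values)
for cutoff, sens, spec in zip(thresholds[1:], tpr[1:], (1 - fpr)[1:]):
    # thresholds[0] is a sentinel value returned by roc_curve, so it is skipped
    print(f"NAP cutoff {cutoff:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
print(f"AUC = {auc:.2f}")
```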

Results

The first research question addressed typical NAP values within MBDs. The overall mean of NAP estimates across the 176 comparisons was 0.92 (SD = 0.10, range = 0.51–1.00), and the median was 0.96. NAP estimates in the current sample were generally quite high, and the distribution was negatively skewed (skew = -1.69). ESs varied somewhat according to study characteristics (see Table 1) but remained close to 1.00. Median NAP ESs were lower for reading interventions (median = 0.94) than for math and writing interventions (median = 0.99 for each). Medians did not differ considerably by the type of multiple baseline design or the number of baselines included in the design, with one exception: MBDs employing five baselines yielded a lower median NAP of 0.89.

Table 1 Median NAP values by study characteristics

The second research question addressed the sensitivity and specificity of NAP in detecting intervention effects in SCDs using MBDs. The AUC was 0.86 for the large effect criterion (75% or more of the baselines demonstrating an effect in visual analysis) and 0.82 for the small effect criterion (50% or more of the baselines). AUC values range from 1.00 (a perfect measure) to 0.50 (a measure that classifies individual cases at a chance level; Zweig and Campbell 1993). Both AUC values exceeded 0.80, suggesting that NAP is a reasonable measure of effect.

The third research question addressed the identification of NAP values that correspond to large and small effects. Given these favorable AUC estimates, the coordinates of the ROC curve were examined to determine which NAP values would indicate an effect. The visual analysis criterion of at least 75% of baselines showing an effect corresponded to a NAP cut score of 0.96, which yielded a specificity of 0.81 and a sensitivity of 0.73. The criterion of 50% or more of baselines showing an effect corresponded to a NAP cut score of 0.93, which yielded a specificity of 0.83 and a sensitivity of 0.78.

The last research question addressed the agreement between NAP ESs and interpretations based on visual analysis. Based on the results from the ROC analysis, a large effect cutoff of 0.96 and a small effect cutoff of 0.93 were investigated further using a contingency table and kappa coefficient. The contingency table is shown in Table 2. It suggests that NAP effect size cutoffs agreed to a greater extent with visual analysis when identifying studies with a large effect or no effect, but that agreement was much lower when identifying studies with a small effect. The kappa coefficient corresponding to this analysis was 0.45; values between 0.41 and 0.60 are generally considered moderate (although these criteria are arbitrary; Landis and Koch 1977).

Table 2 Classification of effect sizes based on NAP effect size and visual analysis

Discussion

The current analysis supports the suggestion that non-overlap ES metrics are larger when MBDs are used. The median NAP in the current data was 0.96, whereas the median NAP in the initial field test was somewhat lower at 0.84 (Parker and Vannest 2009). Estimates of ES cutoffs also varied between studies: the large and small ES cutoffs suggested in this study were 0.96 and 0.93, respectively, whereas in their introduction of NAP, Parker and Vannest (2009) found that NAP ESs of 0–0.65, 0.66–0.92, and 0.93–1.0 corresponded to small, moderate, and large effects based on expert visual judgment. Based on the initial findings reported here, it would be reasonable to hypothesize that the higher NAP ESs may be due, at least in part, to the nonreversible nature of the academic behaviors investigated through MBDs. However, alternate hypotheses should also be considered, including the possibility that random error led to higher NAP estimates in the current investigation or that the interventions being tested had already been shown to be effective in previous research. Additional research is needed to test these hypotheses directly.

NAP ES calculations appeared to have acceptable sensitivity and specificity in the current study, and NAP ES designations agreed with visual analysis decisions over 80% of the time among the multiple baseline studies in the sample. While these results are promising, the kappa coefficient calculated in this analysis was only moderate, and it may be especially difficult for the NAP effect size to identify a small effect as defined in this study. Additionally, ceiling effects are a potential concern because NAP values in this set of multiple baseline studies were generally quite high; as a result, the difference between the small and large effect cutoffs identified in this study was quite small.

Past research has suggested values that may imply a meaningful effect for a variety of single-case ES metrics (Brossart et al. 2006; Parker et al. 2005; Parker and Hagan-Burke 2007; Parker and Brossart 2003). Parker and Vannest’s (2009) paper introducing NAP also suggested typical values and potential ES cutoffs based on Cohen’s guidelines for d. However, Parker and Vannest (2009) showed that Cohen’s guidelines may not be relevant for single-case designs. The current study lends additional support to the notion that Cohen’s guidelines are incompatible with single-case designs: the NAP cutoffs of 0.93 and 0.96 corresponded to Cohen’s d values of 1.63 and 2.08, well above the small and large cutoffs Cohen suggested. The studies used in Parker and Vannest (2009) included AB comparisons from a variety of designs, whereas the current study investigated published data from MBDs only; the findings suggest that there may be important distinctions between typical ES values for MBD studies and those for the broader universe of SCDs, which includes AB, ABA, ABAB, and multiple baseline designs.

There are three key limitations worth noting. First, only studies that evaluated interventions in reading, math, or writing, were published between 2000 and 2011, and were available in full text online were included; the results are therefore specific to the studies sampled. Second, the effect size (NAP) calculations and the visual analyses were performed by the same two individuals (the first two authors), so the possibility exists that bias influenced the visual inspection (a possibility made less plausible by the fact that study datasets were visually inspected independently and high interobserver agreement was obtained). Third, a limitation common to all non-overlap effect sizes (NAP, PND, PAND) is insensitivity to the magnitude of the discrepancy between baseline and intervention phases; these indices register only whether a discrepancy exists.

Future work should investigate the use of the NAP ES as well as its sensitivity and specificity with a variety of published studies, including studies published prior to 2000, studies of non-academic interventions, and studies that used phase change designs. Additionally, most studies investigating ES indices for single-case research use fabricated data and/or a convenience sample. Further research should attempt to include a larger and more representative sample to investigate ESs. The data from the current study suggest potentially useful interpretive criteria for NAP. However, more research is needed on NAP and other single-subject effect sizes in order for single-subject research to be effectively included in meta-analytic research.