You would be hard-pressed to find an educator who does not have an opinion on Direct Instruction (DI). Whether they tried it in their classroom and found that previously “uneducable” students were now learning, or heard it dismissed as a straw man for “drill and kill” techniques that diminish the “love of learning,” everyone seems to love DI or hate DI (Pondiscio, 2018). Despite the philosophical arguments, there can be little doubt that DI works.

Direct Instruction is perhaps the most heavily researched educational model (Borman et al., 2003; Hattie, 2009). More than 500 individual research reports have been identified on DI, and an entire journal was once dedicated to its exploration (Stockard et al., 2020). This mountain of data has been analyzed, summarized, and scrutinized elsewhere, and it is not our intent to provide another systematic review. Rather, we provide some commentary on the vast amount of research on DI and offer some perspective on the interpretation of its results. The love/hate divergence is the first in a series of dichotomies that we present to help clarify some of the ambiguities inherent in complex statistical analyses. After all, regardless of which side of the aisle you stand on, everyone should have access to the basic facts.

Direct Instruction has been lauded for its careful instructional programming to condition abstract stimulus control (i.e., teaching “big ideas”; Slocum, 2004). It has also been criticized for leading to student overdependence, the use of rigid and inflexible teaching methods, and emphasizing fact accumulation at the expense of critical thinking skills (Edwards, 1981). The polarizing qualities of DI cannot merely stem from its efficacy, because it would be difficult to refute the data demonstrating the variety of skills that can be taught through DI. Table 1 provides a sample of the available data demonstrating the effectiveness of DI across students and academic content.

Table 1 Effect sizes obtained in a representative sample of direct instruction studies: 1972–2011

Direct Instruction—as developed by Siegfried Engelmann, Carl Bereiter, Douglas Carnine, Jean Osborn, Wesley Becker, and colleagues—was initially a preschool program for children from communities of low socioeconomic status. Instructional periods for these children were only 20–30 min long and focused exclusively on math and reading. The teacher’s instructions were carefully programmed and sequenced, ensuring mastery of prerequisite skills to eliminate ambiguity when introducing more advanced concepts. Children in these early studies showed marked improvements, encouraging further research with this mode of instruction (Bereiter & Engelmann, 1966).

Direct Instruction was then expanded to include math, reading, and language for populations outside of preschools. This model became known as the Direct Instruction System for Teaching Arithmetic and Reading (DISTAR; Engelmann, 2007; Wood, 2014). Early success led to DI’s inclusion in Project Follow Through, a longitudinal study from 1968 to 1977 that compared 20 educational programs and school-based interventions across some 700,000 children of low socioeconomic status. Project Follow Through found that DI was the only intervention with a “significant positive impact on all of the outcome measures” (Stockard et al., 2018, p. 482). Although the benefits of DI were clear, it was (and continues to be) criticized for being dogmatic, utilitarian, and authoritarian, with opponents claiming that its tightly structured scope and sequence leave little room for teacher and student creativity (Hattie, 2009).

The academic impact of DI was largely suppressed by the reviewers of Project Follow Through, who recommended every model that showed a purported benefit to students (e.g., on measures of self-concept, attitude, or mental or physical health) or teachers (e.g., instructional fidelity). Watkins (1997) lamented that “it does the public little good to ‘identify’ little-used methods” (p. 44). Nonetheless, DI programs have been developed for teaching a wider variety of curriculum content to increasingly diverse student populations.

Direct Instruction versus direct instruction

Within the literature, discriminative control is established between Direct Instruction (“big DI”) and direct instruction (“little di”). Big DI, as developed by Engelmann and colleagues, is centered on a carefully crafted model of instruction focused on teaching concept formation (Cooper et al., 2020; Engelmann & Carnine, 1992). These instructions have been specifically sequenced, scripted, and programmed to minimize ambiguity, and have demonstrated tremendous success at teaching specified objectives when compared to other modes of instruction (Kennedy, 1978; Stebbins et al., 1977).

Little di, on the other hand, refers to a broader repertoire of teacher behavior that incorporates elements of systematic or explicit instruction. Little di is often exemplified by the work of Rosenshine (1976, 2012), who identified a set of teacher variables that yield significant academic improvements for students, including (1) structuring materials, (2) using clear and concise instructions, (3) promoting active student responding, (4) providing immediate feedback contingent on student answers, and (5) minimizing free time. As with big DI, little di incorporates systematically sequenced instructions along with gradual increases in rigor and difficulty. Centered on direct measures of student behavior, relevant outcome data, and systematic measurement practices, the combination of these instructional technologies has produced favorable cost-benefit ratios when compared with traditional teaching practices (Binder & Watkins, 1990; Lipsey et al., 2012).

Both big DI and little di focus on explicit and systematic instruction. The primary distinction between the two is that little di refers to a set of teacher behaviors, whereas big DI (hereafter, simply “DI”) refers to curricular programs in the Engelmann tradition (e.g., DISTAR, Reading Mastery, Connecting Math Concepts, Language for Learning) and the teacher–student interactions that put the programs in play (e.g., fast paced scripted lessons, choral responding, error correction; Rolf & Slocum, 2021; Slocum & Rolf, 2021).

Effect Sizes

Effect sizes, like those listed in Table 1, measure the magnitude of a phenomenon, and are commonly reported using the indices for standardized mean difference (d), correlation coefficient (r), or coefficient of determination (r²; Reimer & Russell, 2017). When an effect size is expressed as the difference between means in a standard score (i.e., z-value), it represents the number of standard deviations between experimental and control groups. Assuming groups of equal sample size, these various indices may be converted from one to another. For example, by coding group membership with a binary variable (e.g., “0” for the control group, and “1” for the experimental group), we can then convert d to r using the following formula:

$$ r=\frac{d}{\sqrt{d^2+4}} $$

Likewise, we could then calculate the coefficient of determination (r²) by squaring the correlation coefficient to measure the proportion of variance (or percentage of variance if we then multiply by 100) in the dependent variable that can be predicted by the independent variable.
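To make the conversion concrete, here is a minimal Python sketch (ours, for illustration rather than part of any cited analysis) that applies the formula above to the average DI effect size from Table 1, d = 0.79.

```python
# Minimal sketch of the d -> r -> r^2 conversions described above,
# assuming equal group sizes and binary-coded group membership.
import math

def d_to_r(d: float) -> float:
    """Convert a standardized mean difference (d) to a correlation (r)."""
    return d / math.sqrt(d ** 2 + 4)

d = 0.79              # average DI effect size reported in Table 1
r = d_to_r(d)         # ~0.367
r_squared = r ** 2    # ~0.135, i.e., about 13.5% of variance explained

print(f"d = {d:.2f} -> r = {r:.3f}, r^2 = {r_squared:.3f} ({r_squared:.1%})")
```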

The standard convention for describing d and r includes the terms “small” (d = 0.2, r = .1), “medium” (d = 0.5, r = .3), and “large” (d = 0.8, r = .5). Cohen (1988) used differences in the average heights of women as the context for this interpretation. A “small” effect might be interpreted as the difference between the heights of 15- and 16-year-old females in the United States. A “medium” effect is visible to the naked eye and might be interpreted as the difference between the heights of 14- and 18-year-old women. A “large” effect is grossly perceptible, such as the difference between the heights of 13- and 18-year-old women.

However, Cohen (1988) also acknowledged the danger of labeling a change in magnitude as “small,” “medium,” and “large” devoid of context, noting that we can only interpret the effectiveness of a particular intervention in relation to another intervention that seeks to produce the same effect, along with their relative costs and benefits. As a more tractable interpretation for evaluating educational interventions, Glass et al. (1981) proposed that an effect size of 1.00 corresponds to the difference of about a year of schooling on the standardized test scores of elementary school students. Coe (2002) argued that 0.60 is a more accurate estimate of this phenomenon. In other words, an average third-grade student at pretest who received an intervention with an effect size of 0.60 would perform at the average fourth-grade level at posttest.

Binomial Effect-Size Display

Statistically laden effect sizes like d and r are challenging to interpret, especially when there is no indicator of which effect-size index to use when working with multiple datasets (Randolph & Edmondson, 2005; Rosenthal, 1991). For example, in their meta-analysis of more than 500 individual studies on DI, Stockard et al. (2020) used mixed-model regressions to identify multiple effect sizes across various dimensions (e.g., grade level, subject matter, educational setting), along with more extensive analyses and estimates of effectiveness with different levels of exposure and fidelity. Answering the seemingly simple question of “Is DI effective?” required no fewer than 290 pages (with an additional 383 pages of supplemental material available online). The data provided by Stockard and colleagues are so extensive that the nonstatistician cannot see the forest for the trees. As Lipsey et al. (2012) note,

The situation is not unusual—the native statistical representations of the findings of studies of intervention effects often provide little insight into the practical magnitude and meaning of those effects. To communicate that important information to researchers, practitioners, and policymakers, those statistical representations must be translated into some form that makes their practical significance easier to infer. (p. 1)

Though we cannot argue with Stockard et al.’s conclusion on the use of DI, “Our students deserve no less” (p. 165), we also cannot help but wonder whether they might have taken a more direct route to reach it.

To help clarify the real-world importance of an intervention, Rosenthal and Rubin (1982) suggested reducing continuous measures to a simple dichotomy. For example, we may select a given measure, such as the mean or median, and classify each value that falls below that measure as a “failure” (or “0”) and each value above it as a “success” (or “1”). We can then calculate r as the difference in success proportions between the treatment and control groups (Coe, 2002). This conversion, called the binomial effect-size display (BESD), allows for more intuitive and informative data-based decision making by clearly conveying the real-world importance of treatment effects.
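The arithmetic behind the display can be sketched in a few lines of Python (our illustration, assuming the symmetric construction spelled out in the assumptions that follow): the two success rates work out to .50 ± r/2, so their difference equals r.

```python
# Minimal sketch of a binomial effect-size display (BESD), assuming the
# symmetric construction described in the text: success rates of .50 +/- r/2,
# so that the difference between the two rates equals r.
def besd(r: float) -> dict:
    """Return failure/success proportions for control and treatment groups."""
    control, treatment = 0.5 - r / 2, 0.5 + r / 2
    return {"control":   {"failure": 1 - control,   "success": control},
            "treatment": {"failure": 1 - treatment, "success": treatment}}

# The hypothetical medical treatment discussed in a later section (r = .10):
for group, cells in besd(0.10).items():
    print(f"{group:>9}: failure {cells['failure']:.0%}, success {cells['success']:.0%}")
# control:   failure 55%, success 45%
# treatment: failure 45%, success 55%
```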

The BESD helps to interpret exceedingly complex data, which may be useful when evaluating DI outcomes because it is “(a) easily understood by researchers, students, and laypersons; (b) applicable in a wide variety of contexts; and (c) conveniently computed” (Rosenthal & Rubin, 1982, p. 166). In particular, the BESD may be more pragmatic than other effect-size indices, increasing the likelihood that parents and elected officials will select schools, teachers, and curricula based on empirical evidence.

The BESD is premised upon three primary assumptions. Foremost, we assume that samples for both experimental and control groups follow a normal distribution. The sample that forms this normal distribution is better described as a density curve along a continuum of magnitude ranging from minimal to maximal effectiveness. The interpretation of effect sizes is sensitive to violations of this assumption, and it may be difficult to make an accurate comparison between an effect size based on a normal distribution and one derived from a nonnormal distribution (Coe, 2002).

Second, the BESD requires that we draw an imaginary line at the median value—which, in this case, is also the mean—and group our sample into one of two categories (see Fig. 1). The values that fall to the right of this imaginary line are above the mean, and therefore might be labeled “good,” “effective,” or a “success.” The values that fall to the left of this imaginary line are below the mean, and therefore might be labeled “bad,” “ineffective,” or a “failure.”

Fig. 1

A normal distribution showing the dichotomous categorization of effective (right) and ineffective (left) practices

Finally, we must assume a binary dependent variable upon which we can classify an outcome as a success or failure. That is, either the response contacted reinforcement or it did not. Our concern is only that the response meets this minimal criterion and, as before, we disregard the magnitude of any individual effect. Crossing this success/failure dichotomy with the treatment/control dichotomy yields a 2 × 2 table, resembling a Punnett square, through which we can examine a given effect-size estimate.

Life versus Death

As with other analyses of human behavior, selectionism may serve as an appropriate framework for demonstrating the BESD. Table 2 shows the BESD for a hypothetical medical treatment that accounts for just 1% of the variance in survival. An r = .10 is the very definition of “small,” but it is still considered important in the context of a life-saving intervention. Cohen (1988) notes that “‘Death’ tends to concentrate the mind” (p. 534). In absolute terms, an r² of .01 is undeniably “small,” but when it represents a 10-percentage-point increase in the rate of survival (from 45% to 55%), our hypothetical patient might consider it large: “alive, mind you!” (p. 534).

Table 2 BESD for a hypothetical medical treatment that accounts for 1% of variance in survival

Thus, a difference between success rates of .55 and .45 yields r = .10, and accounts for only 1% of the variance in the dependent variable as a function of the independent variable. However, effects that are operationally defined as “small” cannot be overlooked when they marginally increase the rate of survival. Even marginal differences have the potential to shift an entire population (Baum, 2017; Rachlin, 1991).

Rosenthal (1990) described two occasions on which randomized controlled trials on the use of medications to prevent heart attacks were ended prematurely. In both cases, the results from the experimental group were so favorable that the research teams for each study determined it would be unethical to continue to withhold the life-saving drugs from the patients in the control group. The rs for these two independent studies were both well below the .10 threshold, with r = .04 (propranolol) and r = .034 (aspirin). As Rosenthal explained,

Behavioral researchers are not used to thinking of rs of .04 as reflecting effect sizes of practical importance. But when we think of an r of .04 as reflecting a 4% decrease in heart attacks, the interpretation given r in a BESD, the r does not appear to be quite so small, especially if we can count ourselves among the 4 per 100 who manage to survive. (p. 775)

Table 3 provides a conversion from Cohen’s d effect sizes, ranging from zero to three in intervals of 0.10, to their corresponding r and r² values. The differences in success rates increase systematically along with the effect sizes.

Table 3 BESD corresponding to various values of d, r, and r²

As can be seen, the difference in success rates is always identical to r. Cohen (1988) explains that “the fact that the difference in proportions equals the r is not a coincidence, but a necessity when the table is symmetrical (i.e., when the two values are equidistant from .50)” (p. 534). Even a d as small as 0.1, accounting for less than 1% of the variance, increases the probability of selection from 48% to 53%. As a result, the chance of survival—whether of a species, an individual, or a behavior—has increased by 5 percentage points.
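A short Python loop (ours, for illustration) regenerates the structure of Table 3 by stepping d from 0 to 3 in increments of 0.10; rounding in the final digit may differ slightly from the published table.

```python
# Minimal sketch regenerating the structure of Table 3: d from 0 to 3 in
# steps of 0.10, converted to r, r^2, and the symmetric BESD success rates.
import math

print(f"{'d':>4}  {'r':>6}  {'r^2':>6}  {'control':>8}  {'treatment':>9}")
for step in range(31):
    d = step / 10
    r = d / math.sqrt(d ** 2 + 4)
    control, treatment = 0.5 - r / 2, 0.5 + r / 2
    print(f"{d:4.1f}  {r:6.3f}  {r * r:6.3f}  {control:8.3f}  {treatment:9.3f}")
```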

Effective Schools versus Ineffective Schools

“Small” effects become even more important when they are cumulative (Cohen, 1988). More than a one-time intervention, education is designed to last 12 or more years. Coe (2002) explains that few educational interventions have effects that are anything other than “small,” due to the wide variation found within the population as a whole. Barbash (2012) and Tallmadge (1977) agreed that d = 0.25 is the conventional threshold for an educationally significant outcome, though Lipsey et al. (2012) felt that even this was too high. (In contrast, recall that the average effect size for DI in Table 1 is d = 0.79.)

Despite the longitudinal efforts of our educational system, Coleman et al. (1966) found that schooling accounts for only 10% of the variance in student achievement, with the other 90% attributed to students’ home environment. It would seem that Hart and Risley (1995) were correct! However, we can once again employ a BESD to clarify the importance of 10% of variance (r² = .10). Table 4 shows the impact of a mere “moderate” effect size on the potential for student success.

Table 4 BESD that accounts for 10% of variance in student achievement

Although Coleman et al.’s (1966) findings state that schools account for only 10% of the variance in student achievement, the BESD for this finding paints a drastically different picture. The success rate for students increases from 34.2% to 65.8% when they are placed in an effective school. In other words, effective schools benefit an additional 32 per 100 students, regardless of the background from which they step into the classroom. Almost two thirds of the students in an effective school are likely to succeed, compared to just over one third of students in an ineffective school.

In contrast to the findings of Coleman et al. (1966), more recent work concludes that schools may actually account for as much as 20% of the variance in student achievement, with the remaining 80% attributed to students’ background environment. According to a meta-analysis by Marzano (2003), Coleman et al.’s use of verbal acuity as the primary dependent measure underestimated the impact of schools. Table 5 shows the revised impact of schools on the potential for student success.

Table 5 BESD that accounts for 20% of variance in student achievement

Although Marzano’s (2003) findings suggest that schools account for an adjusted 20% of the variance in student achievement, the BESD for this finding demonstrates that a student’s success rate increases from 27.6% to 72.4% when placed in an effective school. In other words, effective schools benefit an additional 45 per 100 students, regardless of the background from which they step into the classroom. Almost three fourths of the students in an effective school are likely to succeed, compared to just over one quarter of students in an ineffective school.
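Because these school-effect estimates are reported as proportions of variance, the conversion runs in the opposite direction: take the square root of r² to recover r, and then apply the same .50 ± r/2 split. The following sketch (ours, for illustration) reproduces the figures behind Tables 4 and 5.

```python
# Minimal sketch: from a proportion of variance (r^2) back to a BESD.
import math

def besd_from_variance(r_squared: float) -> tuple:
    """Return (lower, upper) success rates for a symmetric BESD."""
    r = math.sqrt(r_squared)
    return 0.5 - r / 2, 0.5 + r / 2

for label, r2 in [("Coleman et al. (1966): 10% of variance", 0.10),
                  ("Marzano (2003): 20% of variance",        0.20)]:
    low, high = besd_from_variance(r2)
    print(f"{label:40s} {low:.1%} -> {high:.1%}")
```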

Direct Instruction versus Other Models

The model provided by Marzano (2003) represents the absolute best that schools can offer to students. Effective schools can increase the academic achievement for almost half of their student population, regardless of socioeconomic status, English proficiency, and other risk factors.

Since its inception, DI has focused on teaching students from academically disadvantaged backgrounds. By providing teachers with an organized curriculum and explicit instruction, DI has been shown to effectively control for the outside variables that have traditionally impeded academic achievement. These include (1) student-level factors, such as frequent tardiness or absences; (2) parent-level factors, such as little interest in the child’s educational outcomes; and (3) teacher-level factors, such as a lack of focus on basic skills (Butler, 2020).

Adams and Engelmann (1996) reviewed the then-current literature on DI and found an overall effect size per variable of d = 0.97 across 173 total comparisons (72 general education and 101 special education). By conventional standards, this would be considered a “large” magnitude of effect. Using the formula for converting d to r above, we can then create a BESD for DI.

Table 6 shows that d = 0.97 converts to r = .436, which falls within the “medium” range between .3 and .5. Already we note a discrepancy in the interpretation of effect size based on the different indices. However, the r² = .19 tells us that DI accounts for 19% of the variance in student achievement.

Table 6 BESD that accounts for 19% of variance in student achievement

Adams and Engelmann’s (1996) 19% of variance in student achievement approximates Marzano’s (2003) 20% model of maximum school effectiveness. In other words, DI accounts for nearly all of the variance that schools themselves can be expected to contribute. The BESD for this finding demonstrates that with the use of DI, student success rates increase by 43.6 percentage points. Put differently, DI benefits an additional 44 per 100 students, regardless of background. Compared to the other curricula that were examined in the meta-analysis, DI increased academic achievement from 28% to 72%.
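As a quick arithmetic check (ours), running Adams and Engelmann’s overall d = 0.97 through the same conversions reproduces the figures reported in Table 6.

```python
# Minimal check of Table 6: Adams and Engelmann's (1996) overall d = 0.97.
import math

d = 0.97
r = d / math.sqrt(d ** 2 + 4)         # ~0.436
low, high = 0.5 - r / 2, 0.5 + r / 2  # ~28.2% vs. ~71.8%
print(f"r = {r:.3f}, r^2 = {r * r:.2f}, success rates: {low:.1%} -> {high:.1%}")
```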

The conventional use of effect sizes provides thresholds by which we can refer to the size of a phenomenon as “small,” “medium,” “large,” etc. As previously stated, Cohen (1988) acknowledged the danger of describing effect-size estimates in such terms, stressing that the effectiveness of a particular intervention can only be interpreted within the context of another intervention that seeks to produce the same effect. The use of absolute values to define “small” and “large” stems from a perspective of realism indicative of methodological behaviorism (Baum, 2017). However, such absolute thresholds force conceptual arguments about whether to interpret the same phenomenon as “large” (i.e., d = 0.97) or “medium” (i.e., r = .436).

In contrast, the BESD provides a more pragmatic view of the magnitude of an experimental effect by placing the data in context. It is one thing to say that DI has a “large” effect. It is another to say that DI increases the rate of student success by 44 percentage points.

Curriculum versus Instruction

Allow us to present one last dichotomy to clarify the distinction between curriculum and instruction. Although the parallel is superficial, this is essentially the same distinction as that between DI, the published curricular materials, and di, the process of systematic and explicit instruction. To demonstrate the importance of this distinction, we pose the following hypothetical scenario.

Imagine that we rank-ordered all of the teachers in the United States according to effectiveness. Given that there are about 3.7 million teachers in the United States (National Center for Education Statistics, 2019), we can assume that their effectiveness approximates a normal distribution. Let us also imagine that we rank-ordered all of the various curricula according to effectiveness, forming another approximately normal distribution. As before, we will draw an imaginary line at the mean of each distribution and call everything to the right of that line “good” and everything to the left of it “bad.” The result is a Punnett square showing the various combinations of teachers and curricula (see Table 7).

Table 7 Punnett square for evaluating the intersection of curriculum and instruction

It is safe to assume that everyone wants their child in a classroom with a good teacher using a good curriculum, and that nobody wants their child in a classroom with a bad teacher using a bad curriculum. This leads us to the following question (visualized in Fig. 2): Would you rather place your child in a classroom with a good teacher using a bad curriculum, or a bad teacher using a good curriculum?

Fig. 2

Would you rather your child were in a classroom with a good teacher using a bad curriculum, or a bad teacher using a good curriculum?

Can effective instruction overcome the deficits of a poor curriculum? Can a systematic curriculum make up for the faults of poor instruction? To answer these questions, Marzano et al. (2003) conducted a meta-analysis to tease out these effects. The results of this analysis are displayed in Fig. 3.

Fig. 3

A normal distribution showing the interaction of curriculum and instruction. Note. Constructed from data provided by Marzano et al. (2003). (1) A student who begins at the 50th percentile and receives poor instruction from a poor curriculum will rank at the 3rd percentile after 2 years. (2) A student who begins at the 50th percentile and receives poor instruction from a good curriculum will rank at the 37th percentile after 2 years. (3) A student who begins at the 50th percentile and receives average instruction from an average curriculum will remain at the 50th percentile after 2 years. (4) A student who begins at the 50th percentile and receives good instruction from a poor curriculum will rank at the 63rd percentile after 2 years. (5) A student who begins at the 50th percentile and receives good instruction from a good curriculum will rank at the 96th percentile after 2 years.

According to Marzano and colleagues, the student who begins at the 50th percentile and has an average teacher who uses an average curriculum will still be at the 50th percentile at the end of 2 years. This student has learned enough to keep pace with other students in the same grade.

But what happens to the student who has an ineffective teacher who uses an ineffective curriculum? After 2 years of school, this student has dropped from the 50th percentile to the 3rd percentile. Although this student may have acquired some basic skills, their learning is so sporadic and disorganized that they have lost considerable ground compared to other students in the same grade.

Suppose that the student who enters at the 50th percentile is given an ineffective teacher who uses a curriculum with a strong evidence base. Despite the class-wide implementation of a research-validated program, after 2 years this student has fallen to the 37th percentile when measured against other students in the same grade.

In contrast to the previous scenarios, which featured poor instruction from the classroom teacher, the next two highlight the benefits of effective instruction. Consider the student who begins at the 50th percentile, and is fortunate enough to be assigned to an effective teacher who uses an evidence-based curriculum. After 2 years, this student now performs at the 96th percentile.

Finally, let us suppose the student who enters school at the 50th percentile is given an effective teacher who uses a curriculum without empirical support. After 2 years, the student would still have gained 13 percentile points, ranking at the 63rd percentile. The importance of effective instruction can easily be seen when we compare it against a poor curriculum. Marzano et al. (2003) found that even when using ineffective curricula, effective teachers can produce meaningful gains in student achievement (Table 8).

Table 8 BESD that accounts for 6.7% of variance in student achievement

By juxtaposing the effects of curriculum and instruction, we see that instruction accounts for 6.7% of variance in student success. The BESD for this finding demonstrates that a student’s success rate increases from 37% to 63% when placed in a classroom with effective instruction. In other words, effective teachers benefit an additional 26 per 100 students, compared to the use of effective curricula alone.

Understanding the substantial effect schools can have on student achievement highlights the importance of evidence-based classroom practices (Slocum et al., 2014; Smith, 2013). When displayed in a BESD, Marzano et al.’s (2003) data force us to reconsider the idea of evidence-based practices as off-the-shelf curricula. Government mandates for the use of evidence-based practices seemingly emphasize purchasing the right curricular products. As we can see from Table 8, this focus on curriculum alone lowers student success rates by 26 percentage points relative to a focus on effective instruction.

The term “evidence-based” is more often explanatory than descriptive, as in, the curriculum is evidence-based, and therefore effective. Selecting a program because it is marketed as research-based and then implementing it without monitoring ongoing progress is at best the result of good marketing, and at worst malpractice. To paraphrase the tautological reasoning described by Vargas (2013), “How do you know the curriculum works? Because it is evidence based. How do you know it is evidence based? Because it works.” Moreover, research on fidelity of implementation tells us that maintaining treatment integrity is critical to student achievement (Mathews et al., 2018; Travers et al., 2016). What is “instruction” if not the implementation of a teaching program as designed?

Stockard et al. (2018) conducted a meta-regression of research on DI curricular materials that emphasized “a more precise measure of teacher preparation, including fidelity to all the various technical elements of the programs and training specific to the programs taught” (p. 501). The estimated effect size for the total sample was d = 0.60, which, by conventional standards, would be a “medium” effect. Using the formula for converting d to r above, we can then create a BESD for DI.

Table 9 shows that d = 0.60 converts to r = .287, accounting for 8.2% of the variance in student achievement. Stockard et al.’s (2018) findings on DI curricula surpass Marzano et al.’s (2003) 6.7% of variance resulting from effective instruction. Once again, DI bridges the gap between effective and ineffective instruction. The BESD for this finding demonstrates that the use of DI curricular materials increased student success rates from 35.6% to 64.4%. Compared to the other curricula that were examined in the meta-regression, DI increased academic achievement by 28.7 percentage points.

Table 9 BESD that accounts for 8.2% of variance in student achievement
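The same arithmetic (ours, for illustration) places the Table 9 figures alongside Marzano et al.’s (2003) instruction effect from Table 8, expressing both as BESD success rates.

```python
# Minimal comparison: Stockard et al.'s (2018) d = 0.60 for DI curricula
# versus Marzano et al.'s (2003) 6.7% of variance for effective instruction.
import math

r_di = 0.60 / math.sqrt(0.60 ** 2 + 4)   # ~0.287
r_instruction = math.sqrt(0.067)         # ~0.259

for label, r in [("DI curricular materials (d = 0.60)", r_di),
                 ("Effective instruction (6.7% of variance)", r_instruction)]:
    print(f"{label:45s} {0.5 - r / 2:.1%} -> {0.5 + r / 2:.1%}")
```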

How Effective Is DI?

Chubb and Moe (1990) assert that,

All things being equal, a student in an effectively organized school achieves at least a half-year more than a student in an ineffectively organized school over the last two years of high school. If this difference can be extrapolated to the normal four-year high school experience, an effectively organized school may increase the achievement of its students by more than one full year. That is a substantial school effect indeed. (p. 140)

Over the past half-century, no other curriculum model has been more heavily researched than DI. Syntheses of this research tell us that when teachers are equipped with DI curricular materials and implementation fidelity is emphasized, the effects of schools can be substantial.

Glass et al. (1981) stated that an effect size of d = 1.00 equates to the difference of about a year of schooling. In their meta-analysis of DI, Adams and Engelmann (1996) found a similar overall effect size of d = 0.97. The BESD for Adams and Engelmann, which showed that DI accounted for 19% of the variance in student achievement, is consistent with Marzano’s (2003) model showing that schools can account for as much as 20% of the variance in student success.

Coe (2002) argued that the difference of a year of schooling is closer to an effect size of d = 0.60. In their meta-regression on DI, Stockard et al. (2018) found an effect size of d = 0.60. The BESD for Stockard et al., which showed that DI accounted for 8.2% of the variance in student achievement, surpasses Marzano et al.’s (2003) finding that effective instruction can account for up to 6.7% of the variance in student success. Stockard et al.’s data also fit the model of effective schooling determined by Coleman et al. (1966).

Regardless of whether you want to describe the effect as “medium” (d = 0.60) or “large” (d = 0.97), the research shows that DI can effectively double the academic success rate of students: 2 years of gains in 1 year of instruction. Hunter and Schmidt (1990) describe the limitations of various effect-size indices such as d, r, and r², noting that,

The percent of variance accounted for is statistically correct, but substantively erroneous. It leads to severe underestimates of the practical and theoretical significance of relationships between variables. . . . The problem with all percent variance accounted for indices of effect size is that variables that account for small percentages of the variance often have very important effects on the dependent variable. (pp. 199–200)

Although there are multiple ways to talk about effect sizes, any scientific approach to educational research is premised upon the identification of small, yet meaningful, variables. More than 50 years of research on DI have illustrated that rather than immediately solving all of our education needs, research-validated instructional practices point to the ongoing and systematic development of effective teacher behavior. In explaining longitudinal results, Ferster et al. (1975) declared that “The successful results of [education], even though we sometimes think of them in terms of a final dramatic outcome, actually occur in small increments of behavior” (p. 89). However, the nomenclature used to describe the magnitude of change fails to describe the pragmatic difference in student achievement resulting from DI.

In contrast to other effect-size indices, the BESD provides a real-world context for interpretation. Dichotomizing the independent variable in terms of experimental or control and dichotomizing the dependent variable in terms of effective or ineffective eliminates ambiguities to allow for easier and simpler data-based decision making. To reiterate the conclusion of Stockard et al. (2018), “Researchers and practitioners cannot afford to ignore the effectiveness [of] research on DI” (p. 503).

The BESD is not without limitations, however. McGraw (1991) argues that the BESD so distorts the original data that the exercise is misleading, whereas Strahan (1991) calls BESD a “what if” statistical technique. In addition, the assumptions of normality on which BESDs rely have been noted as a primary limitation (Hsu, 2004; Thompson & Schumacker, 1997). However, perhaps the most important limitation is raised by Skinner (1968), who warns us about the Idols of the School.

The Idol of the Good Teacher is the belief that what a good teacher can do any teacher can do. Some teachers are, of course, unusually effective. . . . The Idol of the Good Student is the belief that what a good student can learn, any student can learn. Because they have superior ability or have been exposed to fortunate early environments, some students learn without being taught. (pp. 260–261)

The dichotomous nature of the BESD treats minimally effective teachers the same as maximally effective teachers, just as it treats minimally ineffective teachers the same as maximally ineffective teachers. One of Stockard et al.’s (2020) many findings was that even with limited instructional time and relatively low procedural fidelity, DI was more effective than other curricula. Student outcomes are even more dramatic with greater exposure and higher fidelity. In the classroom and other applied settings, outcomes are continuous rather than binary. Individual results may vary.

Instructional design is not among the many responsibilities with which the classroom teacher is charged (Pondiscio, 2018). Although we have posited that a good teacher can correct a bad curriculum, few teachers ever receive training in the systematic development and careful sequencing of content within the comprehensive scope of a given content area. Skinner (1968) noted that the central source of error is the teacher’s belief that their own personal experience is the primary basis of pedagogical wisdom. By teaching big ideas, scripting lessons, programming for generalization, and assessing mastery, DI shifts the teacher’s focus from personal experience to student-generated data.

“There is no universal guideline or rule of thumb for judging the practical importance or substantive significance of a standardized effect size estimate for an intervention,” argue Hill et al. (2008). “Instead one must develop empirical benchmarks of comparison that reflect the nature of the intervention being evaluated, its target population, and the outcome measure or measures being used” (p. 172). Although DI has never been widely embraced or implemented, the large body of evidence supporting its effectiveness has elevated it to education’s empirical benchmark of comparison. Perhaps it is time to acknowledge that DI has situated itself as the Idol of the Good Curriculum.