Introduction

There is increased focus on the use of universal school-based interventions to promote a range of academic and behavioral outcomes for students and possibly staff; however, it is common for there to be variation in the extent to which teachers fully comply with the intended implementation model (Domitrovich et al. 2009). Most intervention studies examining the effects of school-based programs have taken an intent-to-treat (ITT) approach, whereby the researchers estimate the effect of being assigned to the treatment condition (Schochet et al. 2014). However, there are typically small or null effects for those participants who do not fully implement, and thus the ITT estimates may understate the effect of intervention (Stuart et al. 2008).

An alternative to traditional ITT analysis is the complier-average causal effect (CACE) analysis approach, which has been successfully used to estimate treatment effects accounting for compliance (see Angrist et al. 1996; Little and Yau 1998; Jo 2002; Stuart et al. 2008). Both ITT and CACE approaches are useful for gaining a more complete understanding of intervention effects (Jo 2002; Stuart et al. 2008). The CACE method has been applied to the estimation of treatment effects with noncompliance within several randomized interventions serving families and youth (e.g., Barnard et al. 2003; Connell et al. 2007; Stanger et al. 2011); however, there has been less focus on classroom-based preventive interventions implemented by teachers. In the current study, we used the CACE method to estimate the impacts of a commonly used classroom-based preventive intervention called the good behavior game (GBG) (Bradshaw et al. 2009b; Ialongo et al. 1999, 2001; Kellam et al. 2008) on teachers’ self-efficacy and burnout over the course of a school year. This universal, behavioral management model was combined with a social-emotional learning intervention and implemented in elementary schools. The overall goal of the current study was to provide an example of CACE analysis as applied to a classroom-based intervention. This study represents a novel extension and application of the CACE analytic approach to better understand the impact of the intervention, which had varying levels of implementation across trained teachers.

Teachers’ Compliance with School-Based Program Implementation

Teachers are often the primary implementers of classroom-based preventive interventions, yet the degree to which they opt to implement the various components of the intervention often varies (Domitrovich et al. 2009). In fact, many of the efficacy and effectiveness studies of school-based prevention programs have noted considerable variation in implementation quality, which in turn attenuates program impacts on student and staff outcomes (Durlak and DuPre 2008; Durlak et al. 2011; Ringeisen et al. 2003). There is emerging evidence that certain characteristics of implementers, including teachers’ attitudes and beliefs about themselves or the school environment, are associated with variation in implementation (Domitrovich et al. 2015; Payne and Eckert 2010). Much of the exploration into the effects associated with variation in implementation has been descriptive and post hoc, with limited use of causal inference approaches and a lack of clarity about the direction of effects. Nevertheless, this line of research suggests that poor implementation and characteristics systematically associated with variation in implementation are typically unmeasured or not accounted for in traditional randomized trials employing an ITT approach. This source of bias may in turn result in an under-estimation of intervention effects. Additional research is needed to demonstrate a causal impact of programs, while taking into consideration program implementation. In fact, the assumption has been that increased compliance with intervention implementation translates into better outcomes (Botvin et al. 1995; Derzon et al. 2005; Durlak and DuPre 2008; Rohrbach et al. 1993). The current study focused on implementation, or compliance, which is defined as “the discrepancy between what is planned and what is actually delivered when an intervention is conducted” (Domitrovich et al. 2008; also see Chen 1998; Hulleman and Cordray 2009; O’Donnell 2008). A common indicator of compliance of school-based interventions is program dosage, which includes the frequency with which the program is implemented, with the expectation that a higher dosage is associated with better outcomes.

Background on the Interventions

The current study tested two evidence-based elementary school prevention programs: the PAX version of the good behavior game (PAX GBG; Embry et al. 2003) and Promoting Alternative Thinking Strategies (PATHS; Greenberg, Kusché and Conduct Problems Prevention Research Group 2011; Kusché, Greenberg, and Conduct Problems Prevention Research Group 2011). Specifically, a randomized controlled trial (RCT) design was used to compare these two intervention models against a control group. The first intervention model was the PAX GBG alone. The second model was the integration of the PAX GBG and the PATHS program (Domitrovich et al. 2010). PAX GBG focuses on providing teachers with an efficient way of reinforcing the inhibition of aggressive/disruptive and off-task behavior in a “game” like context (Embry et al. 2003). The PATHS curriculum trains teachers to provide explicit instruction to students to promote the development of emotional awareness and communication, self-regulation, social problem solving, and relationship management skills (e.g., interpersonal skills, conflict management) through didactic lessons that take place weekly across the school year (Greenberg and Kusché 2006). Several large RCTs of GBG have demonstrated positive effects on student peer relations, aggressive/off-task behavior, substance use, and academic outcomes (e.g., Bradshaw et al. 2009b; Ialongo et al. 1999, 2001; Kellam et al. 2008). Similarly, prior RCTs of PATHS have yielded positive effects on student social-emotional skills, peer relations, prosocial cognitive functioning, socially-competent behaviors, and behavioral adjustment (e.g., Conduct Problems Prevention Research Group 1999; Greenberg and Kusché 2006; Greenberg et al. 1995).

Although most of the focus has been on student impacts of PAX GBG and PATHS, as well as other whole-school social-emotional programs, teachers implementing these programs may experience benefits such as increased efficacy in managing their classrooms and reduced emotional exhaustion and other forms of burnout (Bradshaw et al. 2008; Bradshaw et al. 2009a; Han and Weiss 2005). On the other hand, the additional burden placed on teachers to implement the program may unintentionally cause some teachers to experience increased burnout and stress. Impacts on teachers, whether positive or negative, may be secondary effects of the program’s impacts on students, may stem from teachers’ involvement in the training component of the intervention, or may be a function of the supports accompanying the intervention (see Domitrovich et al. 2015 for a more extensive discussion). Further, impacts on students may be a function of how engaged teachers are in implementing the components of the intervention. Greater involvement could result in positive effects as teachers learn to better manage their classrooms, or in negative effects as teachers’ burden increases. The effects of such school-based interventions, either positive or negative, on the teachers who implement them likely have important implications for how effective the interventions are at producing positive effects on students. Yet few studies have specifically tested the impacts of student-focused classroom-based interventions on teacher outcomes (Domitrovich et al. 2016).

Study Design

Data for this study came from 27 elementary schools in a large urban, east coast public school district. Schools were recruited and principals agreed to participate in a randomized controlled trial of two intervention models and to potentially receive one year of training and coaching. Schools were then randomized (i.e., cluster randomized trial) to one of three conditions: the PAX GBG only (nine schools), the integration of PAX GBG and PATHS (referred to as PATHS to PAX [P2P]) (nine schools), and a control condition (nine schools) where teachers conducted their usual practice. The study took place over the course of one school year. A novel aspect of the design of the current study was the plan to contrast the PAX GBG classroom management model when implemented alone with an integrated training, combining PAX GBG with the PATHS social-emotional learning program. In the current study, we were particularly interested in impacts on teacher outcomes, rather than the traditional impacts solely on students. In fact, as noted above, both PAX GBG and PATHS have the potential to positively impact teacher outcomes of burnout and efficacy, as a function of their positive impact on classroom management and student behavior; however, no studies have taken into consideration compliance when examining these effects. Specifically, our prior analysis of data from this trial using an ITT approach suggested that teachers in the integrated condition reported feeling more efficacious and feeling more personal accomplishment relative to control teachers; however, they did not report reduced levels of emotional exhaustion or depersonalization (Domitrovich et al. 2015).

In the current study, we operationalized implementation compliance as the teachers’ use of the PAX GBG “games” in the classroom using records of how many games they played throughout the school year and for how long they played each game. More specifically, compliance was defined as being above a cut point on both the number of games played and the total number of minutes of games played. Based on our prior ITT findings (Domitrovich et al. 2016), we expected to find stronger effects on teacher efficacy and personal accomplishment among intervention teachers who sufficiently complied with the program components because these were the teachers who stood to gain the most from the intervention. Furthermore, we anticipated that these effects would be most pronounced in the integrated PATHS to PAX condition. We also expected to find intervention effects on emotional exhaustion and depersonalization, although the direction of effects was less clear to us. On the one hand, teachers who were provided the tools to handle behavior management challenges and to improve children’s social skills and who felt more efficacious in doing so could in turn experience reductions in emotional exhaustion and depersonalization. However, the potential burden of implementing a new program could put additional strains on teachers and increase burnout (Domitrovich et al. 2010; Han and Weiss 2005), particularly among teachers who spent the most time integrating program components into their daily practice. Thus it is especially important to examine the program impacts on teachers using a CACE analysis, in light of the potential added burden of implementing the multicomponent PATHS to PAX program (Domitrovich et al. 2010).

Overview of the CACE Approach

The overall goal of the current study was to estimate the effects of the interventions on teachers while accounting for compliance with assigned intervention. In order to do this, we needed to compare outcomes for teachers in the treatment group who implemented the intervention to the outcomes for teachers in the control group who would potentially do the same if assigned to the intervention group. Angrist et al. (1996) provided a framework for this approach, which outlined a process for a two-arm trial with binary compliance in the potential outcomes framework (also see Frangakis and Rubin 2002; Holland 1986). They defined four compliance types on the basis of individuals’ treatment assignment status (1 = treatment, 0 = control) and potential treatment receipt status (1 = received/participated, 0 = not received/not participated). These groups are important because we assume that the treatment and control groups are likely to have the same proportion of each compliance type due to the fact that the groups were randomly assigned. Therefore, the difference between the treatment and control condition within each compliance type can be interpreted as a causal effect (Frangakis and Rubin 2002). A nuance of the application of CACE to the current study is that the participants here are the teachers who are in effect delivering the intervention, rather than receiving it from another source. Specifically, in the current application of the CACE framework, compliers are those who participate in the treatment (i.e., implement) when assigned to the treatment group and do not participate/implement when assigned to the control group. For compliers, the effect of treatment assignment estimated in an ITT analysis is the same as the effect of full participation. Always-takers are those who will always implement the treatment, no matter what group they are assigned to. Never-takers are those who will never implement, regardless of the treatment assignment. In contrast, defiers are those who will not implement if assigned to the treatment group and will implement if assigned to the control group. Similar to Angrist et al. (1996), our primary interest in this study was the causal treatment effect for compliers (i.e., CACE). This focus on implementation, coupled with the use of a continuous compliance indicator for which we set a threshold of high and low compliance (as compared to a traditional categorical approach to compliance), makes this application of CACE distinctive.
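The four compliance types described above can be written as a lookup on the pair of potential treatment receipt values. The following is a minimal illustrative sketch; the names `ComplianceType` and `compliance_type` are hypothetical and do not come from the study materials.

```python
from enum import Enum


class ComplianceType(Enum):
    COMPLIER = "complier"
    ALWAYS_TAKER = "always-taker"
    NEVER_TAKER = "never-taker"
    DEFIER = "defier"


def compliance_type(receipt_if_treatment: int, receipt_if_control: int) -> ComplianceType:
    """Classify a unit by its potential treatment receipt (1 = implements, 0 = does not)
    under assignment to treatment and under assignment to control."""
    table = {
        (1, 0): ComplianceType.COMPLIER,      # implements only when assigned to treatment
        (1, 1): ComplianceType.ALWAYS_TAKER,  # implements under either assignment
        (0, 0): ComplianceType.NEVER_TAKER,   # never implements
        (0, 1): ComplianceType.DEFIER,        # does the opposite of the assignment
    }
    return table[(receipt_if_treatment, receipt_if_control)]
```

In practice only one of the two potential receipt values is observed for any teacher, which is why the type is partly latent and has to be inferred.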

Assumptions to Identify CACE

Given that implementation behavior of each participating teacher can be observed only under the condition the teacher is assigned to, CACE cannot be calculated by directly comparing the same teacher’s outcomes under the treatment and under the control conditions. Holland (1986) called this the fundamental problem of causal inference. However, under certain conditions, we are able to identify the causal effect at the average level. In particular, the set of conditions (assumptions) used in Angrist et al. (1996) has been widely used to identify CACE. One core assumption is ignorable treatment assignment, which provides the basis for causal inference as it guarantees the comparability between treatment arms. In our case, this assumption is automatically satisfied as schools are randomized to intervention and control conditions. A second assumption is the stable unit treatment value assumption, which means that the potential outcome of each individual is not affected by the treatment assignment status of other individuals. This is a questionable assumption in school settings because teachers in the same school are highly likely to interact with one another. To minimize this interaction across different treatment arms, we employed cluster randomization, where the unit of randomization is the school. Previous studies suggested that by employing cluster randomized trials, interaction or contamination among individuals becomes a more manageable problem (Jo et al. 2008; Sobel 2006). That is, now we only need to worry about interactions among teachers within schools, which can be handled using statistical techniques such as multilevel analysis or generalized estimating equations. We cannot prevent interactions among teachers across different intervention arms, but the likelihood of these interactions will remain about the same as that observed when no nesting exists. A third assumption is monotonicity, which assumes that there are no defiers. This is a reasonable assumption in our case because teachers in the control group did not have access to the intervention. It is also assumed that there are at least some compliers, meaning that the offer of the intervention induces at least some teachers to participate. This assumption is also reasonable in our study.

Finally, the exclusion restriction assumes that always-takers and never-takers in the control group will not benefit from the program and therefore the distribution of outcomes is the same in the treatment and control groups for these two types. In our context, this means that there is no effect of assignment for never-takers. Since teachers in the control group did not have access to the intervention, the stratum of always-takers does not apply to our study. Given our simplified setting with only compliers and never-takers, we will use non-compliers to refer to never-takers. The assumption of the exclusion restriction may need to be relaxed in school-based interventions where it is quite possible that teachers are affected by the intervention assignment even if they do not participate. For example, teachers may be affected by the training at the beginning of the year even if they do not end up implementing the program in their classrooms according to our definition of implementation. Since compliance in our case is not a simple dichotomy of whether a teacher participates or not (i.e., teachers can vary in their frequency and quality of program implementation), the cut-off for determining participation will affect the likelihood that the exclusion restriction is met. Given the possible deviation from the exclusion restriction, we additionally conducted CACE estimation under the alternative assumption that the intervention effects are additive (Jo et al. 2008), meaning that the intervention effects do not change depending on the values of covariates. In addition, we conducted CACE estimation using two different cut points of our original continuous implementation indicator of program dosage. Comparing the CACE estimates obtained under the exclusion restriction with those obtained under the additivity assumption, and across the two cut points, served as sensitivity analyses because we cannot ensure that we met the exclusion restriction.
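To make the role of these assumptions concrete: with no always-takers, no defiers, and the exclusion restriction, the framework of Angrist et al. (1996) implies that the complier effect equals the ITT effect divided by the proportion of compliers. The sketch below illustrates this simple moment-based ratio. It is not the ML-EM mixture estimation used in this study, it ignores covariates and clustering, and the function name and toy numbers are hypothetical.

```python
from statistics import mean


def cace_ratio(y_treatment, complied_treatment, y_control):
    """Moment-based CACE under one-sided noncompliance.

    y_treatment:        outcomes of units assigned to treatment
    complied_treatment: 1/0 compliance indicators for those same units
    y_control:          outcomes of units assigned to control
    """
    itt = mean(y_treatment) - mean(y_control)      # effect of assignment
    p_complier = mean(complied_treatment)          # share of compliers (observed in treatment arm)
    if p_complier == 0:
        raise ValueError("no compliers: CACE is not identified")
    return itt / p_complier


# Toy illustration (made-up numbers): half of the treated units comply,
# and only compliers show a change of 2 units.
toy_cace = cace_ratio([2.0, 2.0, 0.0, 0.0], [1, 1, 0, 0], [0.0, 0.0, 0.0, 0.0])
```

In the toy data the ITT effect is 1.0 and the complier share is 0.5, so the ratio recovers the complier effect of 2.0, which shows how ITT can understate the effect among those who implement.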

Method

Sample

The current study sample included 350 K-5 teachers across 27 schools. Schools, and therefore teachers, were enrolled in three cohorts (i.e., for 1 year each, in three consecutive years) and provided consent for their voluntary participation. The sample was generally evenly split across the three cohorts (31 % cohort 1, 34 % cohort 2, and 35 % cohort 3) and across the three conditions (25 % PAX GBG, 29 % PATHS to PAX, 37 % control). The majority of students in the schools were African American (88 % on average) and received free and reduced meals (i.e., FARMs; 85 %). The vast majority of the teacher sample was female (i.e., 88 %). Fewer than half of the teachers were 30 or younger (41.4 %), and fewer than half taught students in grades three through five (44.1 %). Just over half of the teachers had a graduate degree (56.4 %). See Table 1 for additional details on the sample as well as average scores on the key measures administered in this study.

Table 1 Descriptive information on teacher participants and schools

Measures

All outcome measures were assessed using a teacher self-report measure administered four times (i.e., fall baseline and three follow-ups) over the course of the school year.

Teacher Burnout

Teachers were asked to report on their level of emotional exhaustion (nine items, e.g., I feel used up at the end of the workday, α = 0.92), personal accomplishment (eight items, e.g., I deal very effectively with the problems of my students, α = 0.85), and depersonalization (three items, e.g., I’ve become more callous towards people since I took this job, α = 0.64) from the Maslach Burnout Inventory (Maslach et al. 1997). Responses were rated on a 7-point Likert scale from never to every day, with higher scores indicating greater burnout.

Teacher Efficacy

Teachers reported on a 5-point scale their self-efficacy in two domains. The Behavior Management Self-Efficacy Scale (Main and Hammond 2008) assessed teachers’ self-efficacy in promoting classroom behavior management (14 items; e.g., I am able to use a variety of behavior management techniques; α = 0.94). The Social-Emotional Learning Efficacy Scale (Domitrovich and Poduska 2008) assessed teachers’ self-efficacy in promoting social-emotional skills in students (eight items; e.g., I am able to teach children to show empathy and compassion for each other; α = 0.93).

Compliance

Teachers in the intervention groups completed weekly logs in which they recorded the number of games they played and the number of minutes they spent on each game. These were the indicators of compliance. The number of games and the number of minutes played were each summed, for a total score for each measure across the school year. The compliance cut point will affect the exclusion restriction (Stuart et al. 2008); therefore there is a trade-off when deciding where to set the cut point. For example, if the cut point is five games, the assumption is that teachers who led fewer than five games would not be affected by the intervention. Setting the cut point too low may lead to great variation in the degree to which compliers implemented the program. With a higher cut point, however, CACE is assumed to be larger, but the sample size among compliers becomes small, thereby reducing the quality of CACE estimates. Therefore, participation, or compliance, was defined in two ways: a medium compliance cut point identified teachers who fell above the 50th percentile on both the number of games played and the minutes played (n = 81 total treatment teachers), and a high compliance cut point identified teachers who fell above the 75th percentile on both the number of games and the minutes played (n = 29 total treatment teachers).
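The two-indicator cut point rule can be sketched as follows. This is an illustration only: the function name is hypothetical, the example totals are made up, and the percentile convention (linear interpolation over the treatment-arm totals) is an assumption, since the text does not state how percentiles were computed.

```python
from statistics import quantiles


def classify_compliers(total_games, total_minutes, percentile):
    """Flag teachers who fall above the given percentile on BOTH indicators.

    total_games / total_minutes: yearly totals per treatment teacher.
    percentile: integer from 1 to 99 (e.g., 50 for medium, 75 for high).
    """
    # quantiles(..., n=100) returns the 1st..99th percentile cut points;
    # method="inclusive" interpolates linearly within the observed range (assumed convention).
    games_cut = quantiles(total_games, n=100, method="inclusive")[percentile - 1]
    minutes_cut = quantiles(total_minutes, n=100, method="inclusive")[percentile - 1]
    return [g > games_cut and m > minutes_cut
            for g, m in zip(total_games, total_minutes)]


# Made-up totals for four teachers.
games = [10, 20, 30, 40]
minutes = [100, 400, 300, 200]
medium = classify_compliers(games, minutes, 50)
high = classify_compliers(games, minutes, 75)
```

Because a teacher must exceed the cut point on both indicators, the share classified as compliers is generally smaller than the share above the percentile on either indicator alone, which is consistent with the shrinking complier counts reported for the higher cut point.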

Covariates

Several baseline variables that were correlated with whether or not treatment teachers were classified as compliers were included as covariates in all models (Domitrovich et al. 2015). A teacher information form was completed at baseline to collect information on teacher demographics (e.g., gender, age, education, degree attained), professional development experiences, and information regarding other social-emotional and classroom management interventions being used by the teacher. Teacher gender, age, graduate degree attainment, the grade level taught, cohort, and school mobility were included in the current study. In addition, several other baseline scales were included as covariates. A total mindfulness score was computed as the mean of 20 items (e.g., When I am in the classroom I have difficulty staying focused on what is happening in the present; α = 0.84) from the Mindfulness in Teaching Scale (Frank et al. 2016). The Openness to Innovation subscale from the Trust in Schools measure (Bryk and Schneider 2002) was computed as the mean score of three items (e.g., take responsibility for improving the school; α = 0.84). Baseline depersonalization and emotional exhaustion were also included as covariates in all models where they were not the outcome.

Estimation of CACE

CACE models were estimated separately for each of the treatment conditions relative to control (i.e., PATHS to PAX integrated vs. control and PAX GBG vs. control). Linear growth curve models with intercept and slope parameters were used to estimate the initial level and change of each outcome over the school year (see Fig. 1). In this longitudinal framework, CACE was defined as the effect of intervention assignment for compliers on the change (slope) in each outcome (Jo and Muthén 2003; Jo et al. 2009). As described above, compliance was defined in the current study using two different cut points (50th and 75th percentile), as a sensitivity analysis for both the cut point and the potential deviation from the exclusion restriction. Specifically, the CACE models were identified in two different ways. First, we assumed the exclusion restriction. In this model, the slope was regressed on treatment assignment in the complier class but not in the non-complier class. In this case, we are assuming that non-compliers are not affected by treatment assignment. However, as discussed earlier, this assumption might have been violated in our trial. Given these possibilities of deviation from the exclusion restriction, we additionally conducted CACE estimation assuming that the intervention effects are additive (Jo et al. 2008). That is, we assumed that the intervention effects do not change depending on the covariates. In the model with the additive treatment effect assumption (instead of the exclusion restriction), the slope was regressed on treatment in both the complier and non-complier classes. In both models, the intercept and slope were regressed on the pre-treatment covariates.
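One way to write the growth model described above is sketched below. The notation is an assumption introduced for illustration and is not taken from the original model specification: $y_{it}$ is the outcome for teacher $i$ at assessment $t$, $Z_i$ is treatment assignment, $x_i$ is the vector of pre-treatment covariates, and $c$ indexes the latent compliance class (complier or non-complier). Whether intercepts and covariate coefficients were allowed to differ by class is also an assumption here.

```latex
\begin{align}
  y_{it}     &= \eta_{0i} + \eta_{1i}\, t + \varepsilon_{it}
               && \text{(linear growth; time scores } t \text{ assumed)} \\
  \eta_{0i}  &= \alpha_{0c} + \beta_{0}^{\top} x_i + \zeta_{0i}
               && \text{(initial level)} \\
  \eta_{1i}  &= \alpha_{1c} + \gamma_{c}\, Z_i + \beta_{1}^{\top} x_i + \zeta_{1i}
               && \text{(change over the school year)}
\end{align}
% Exclusion restriction: \gamma_{\text{non-complier}} = 0, so CACE = \gamma_{\text{complier}}.
% Additive treatment effect: \gamma_{\text{complier}} and \gamma_{\text{non-complier}} are both
% estimated, with no Z_i-by-x_i interaction terms.
```

Under either identification strategy, the CACE corresponds to the coefficient of treatment assignment on the slope in the complier class.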

Fig. 1 Compliance average causal effect model with covariate

Missing data on the compliance measure caused 32 cases (19 in the integrated condition and 13 cases in the PAX GBG condition) to drop out (through listwise deletion), resulting in a total sample size of 318 teachers. An additional set of teachers was excluded in the analysis phase due to missing data on the covariates. Specifically, in the models comparing the PATHS to PAX condition to the control condition, 25 teachers were dropped, resulting in a sample size of 185 teachers. In the models comparing the PAX GBG condition to the control condition, 34 teachers were dropped, resulting in a total sample size of 202. In principle, we could incorporate all cases including the ones with incomplete information. However, we employed listwise deletion, given that there is little research which provides guidelines for handling simultaneous complications of noncompliance, clustering, and missing data. In this study, we focused on handling of noncompliance and clustering. We ignored biases introduced by dropping teachers with missing data, which is a limitation of the study. We used maximum likelihood (ML) estimation with the expectation maximization (EM) algorithm (Little and Rubin 2002) for CACE estimation, conveniently implemented using the mixture modeling feature in Mplus (Muthén and Muthén 1998–2015). In this framework, compliance status was defined by a categorical latent variable, with one class referring to the compliers and the other class referring to the non-compliers. Given our simplified setting where there are only compliers and never-takers, the compliance class membership was completely observed in the treatment group whereas it was completely unobserved in the control group. The unknown compliance type of individuals in the control condition was handled as missing data via the EM algorithm. Characteristics of teachers were used as predictors of the latent complier class membership (Domitrovich et al. 2015).
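To illustrate how an EM algorithm can treat the unobserved compliance class in the control group as missing data, the following is a deliberately simplified sketch. It assumes a single normally distributed outcome with a common variance, no covariates, no clustering, and the exclusion restriction. It is not the growth mixture model estimated in Mplus for this study, and the function names are hypothetical.

```python
import math
from statistics import pstdev


def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_cace(y_treat_compliers, y_treat_never, y_control, n_iter=200):
    """EM estimate of CACE with one-sided noncompliance.

    Treatment arm: compliance class is observed.
    Control arm:   compliance class is missing and handled in the E-step.
    Exclusion restriction: never-takers share one mean across both arms.
    Returns (cace, estimated proportion of compliers).
    """
    n_tc, n_tn, n_c = len(y_treat_compliers), len(y_treat_never), len(y_control)
    n = n_tc + n_tn + n_c
    # Mean of treated compliers is fully observed, so it needs no iteration.
    mu_c1 = sum(y_treat_compliers) / n_tc
    # Starting values taken from the treatment arm.
    pi = n_tc / (n_tc + n_tn)
    mu_c0 = mu_c1
    mu_n = sum(y_treat_never) / n_tn
    sigma = pstdev(list(y_treat_compliers) + list(y_treat_never) + list(y_control))

    for _ in range(n_iter):
        # E-step: posterior probability that each control unit is a complier.
        w = []
        for y in y_control:
            a = pi * normal_pdf(y, mu_c0, sigma)
            b = (1.0 - pi) * normal_pdf(y, mu_n, sigma)
            w.append(a / (a + b))
        sw = sum(w)

        # M-step: update parameters using observed and expected class memberships.
        pi = (n_tc + sw) / n
        mu_c0 = sum(wi * y for wi, y in zip(w, y_control)) / sw
        mu_n = (sum(y_treat_never)
                + sum((1.0 - wi) * y for wi, y in zip(w, y_control))) / (n_tn + n_c - sw)
        ss = (sum((y - mu_c1) ** 2 for y in y_treat_compliers)
              + sum((y - mu_n) ** 2 for y in y_treat_never)
              + sum(wi * (y - mu_c0) ** 2 + (1.0 - wi) * (y - mu_n) ** 2
                    for wi, y in zip(w, y_control)))
        sigma = math.sqrt(ss / n)

    return mu_c1 - mu_c0, pi
```

The sketch relies on the treatment arm to anchor the complier proportion and the never-taker mean, which is what allows the control-arm mixture to be separated. Adding covariates that predict class membership, as was done in the study, would sharpen that separation further.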

In principle, between-school and within-school level parameters can be formally modeled taking into account compliance in the context of cluster randomized trials (i.e., multilevel modeling). However, in practice, the number of clusters is often small (nine schools per condition in our study). Fairly large numbers of clusters (preferably 50 or more) are necessary to yield accurate CACE estimates when taking a formal multilevel approach (Jo et al. 2008). Instead, we used the sandwich estimator in conjunction with the ML-EM mixture (type = mixture complex) to adjust the standard errors for the clustering of teachers within schools. In order to interpret the magnitude of effects across the different models, effect sizes (ES) were calculated by dividing the outcome difference across the two conditions by the square root of the total variance obtained from a fully unconditional model. We also applied a Bonferroni correction for multiple tests, which set the p value for significance at 0.01. Because of the conservative nature of the Bonferroni correction, effects with p values between 0.01 and 0.05 were also noted and referred to as trend-level effects.
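The effect size and the corrected significance threshold described above amount to two short calculations, sketched here. The function names are hypothetical, the numeric inputs are made up, and the number of tests (five, one per teacher outcome) is an assumption that happens to reproduce the stated threshold of 0.01 from a family-wise alpha of 0.05.

```python
import math


def effect_size(outcome_difference, total_variance):
    """Difference between conditions divided by the square root of the
    total variance from a fully unconditional model."""
    return outcome_difference / math.sqrt(total_variance)


def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests


# Made-up inputs for illustration only.
example_es = effect_size(outcome_difference=0.5, total_variance=4.0)
per_test_alpha = bonferroni_alpha(family_alpha=0.05, n_tests=5)  # n_tests is assumed
```

Standardizing by the total variance from the unconditional model keeps effect sizes comparable across the ITT and CACE models, because the denominator does not change with the covariates or compliance classes included in a given model.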

Results

Descriptive statistics on the study variables for the two sample conditions are reported in Table 1, whereas Table 2 indicates baseline differences on the covariates between compliers (i.e., high implementer) and non-compliers (i.e., low implementer). Complier teachers in the integrated PATHS to PAX condition were less burnt out at baseline compared to non-compliers, using either the medium or high fidelity/compliance cut point. Complier teachers in the integrated PATHS to PAX condition also had higher mindfulness scores at baseline (high compliance cut point only). Complier teachers in the PAX GBG condition were less likely to have a graduate degree compared to non-compliers. Complier teachers in both conditions were in schools with less mobility regardless of the compliance cut point.

Table 2 Baseline differences between compliers and non-compliers randomized to the treatment condition

ITT Estimates

Teachers in the integrated PATHS to PAX condition reported increases in SEL efficacy (ES = 0.15, p < 0.001) and BM efficacy (ES = 0.11, p < 0.001) relative to teachers in the control condition. In addition, teachers in the integrated PATHS to PAX condition reported increases in personal accomplishment, one dimension of burnout, relative to teachers in the control condition (ES = 0.09, p < 0.01). There were no significant impacts on change in depersonalization (ES = 0.01, p > 0.01) or emotional exhaustion (ES = 0.02, p > 0.01). The PAX GBG condition did not impact teachers’ SEL efficacy (ES = 0.04, p > 0.01), BM efficacy (ES = 0.02, p > 0.01), personal accomplishment (ES = 0.03, p > 0.01), depersonalization (ES = 0.01, p > 0.01), or emotional exhaustion (ES = 0.01, p > 0.01). Results are presented in the right-hand column of Table 3.

Table 3 CACE effects with covariates

CACE Estimates

Effects of each intervention condition relative to the control condition with compliance are reported in the left-hand columns of Table 3, with the left-most columns reporting the medium compliance cut point with and without the exclusion restriction. Using the medium compliance cut point, complier teachers in the integrated PATHS to PAX condition showed statistically significant increases in SEL efficacy (ES = 0.13, p < 0.01) and trend-level increases in depersonalization (ES = 0.11–0.13, p < 0.05) across the school year (with and without the exclusion restriction). Teachers in the PAX GBG condition also showed trend-level increases in depersonalization, but only without the exclusion restriction (ES = 0.13, p < 0.05). Complier teachers showed trend-level increases in emotional exhaustion in both conditions, but only without the exclusion restriction (PAX GBG condition ES = 0.10, p < 0.05; integrated PATHS to PAX condition ES = 0.13, p < 0.05). Without the exclusion restriction, the effects of treatment assignment on the slopes of the outcomes were also estimated for the non-compliers. Non-complier teachers in the integrated PATHS to PAX condition reported decreases in emotional exhaustion (ES = 0.32, p < 0.001). Non-compliers in the PAX GBG condition reported trend-level decreases in emotional exhaustion (PAX GBG ES = 0.22, p < 0.05). Non-compliers in the integrated condition reported increases in personal accomplishment (ES = 0.25, p < 0.001). The effect of being assigned to treatment among non-compliers on these outcomes was stronger than the effect of treatment among compliers. In addition, non-compliers in the integrated condition reported trend-level decreases in depersonalization (ES = 0.17, p < 0.05), whereas the effect among compliers was positive, indicating a trend towards an increase in depersonalization.

The right-hand columns show results from the models using the high compliance cut point with and without the exclusion restriction. In most instances the results were in the same direction, with the exception of personal accomplishment when comparing PAX GBG to the control condition. Using the high compliance cut point, complier teachers in both conditions showed increases in BM efficacy (PAX GBG ES = 0.32, p < 0.01; integrated ES = 0.39, p < 0.001) across the school year with and without the exclusion restriction. Those in the integrated PATHS to PAX condition showed trend-level increases in emotional exhaustion with and without the exclusion restriction (ES = 0.24–0.26, ps < 0.05). Personal accomplishment increased among high complier teachers in the PAX GBG condition (ES = 0.69, p < 0.001), and decreased among those in the integrated condition (ES = 0.20, p < 0.001) without the exclusion restriction. In addition, in contrast to compliers, non-compliers in the integrated condition reported increases in SEL efficacy (ES = 0.10, p < 0.01) and personal accomplishment (ES = 0.14, p < 0.01). Finally, results were similar when models were estimated using a sandwich estimator, and were therefore not sensitive to adjusting the standard errors to account for the clustering of teachers in schools.

Covariate Associations with Compliance

Table 4 shows results from the logistic regression models predicting high compliance from the models without the exclusion restriction. When comparing the integrated condition to the control condition, gender (odds ratio [OR] = 5.38), cohort (OR = 3.94), mobility (OR = 1.15), and mindfulness (OR = 0.03) significantly predicted compliance when personal accomplishment was the outcome (all ps < 0.05). Gender (OR = 21.19), cohort (OR = 13.10), mobility (OR = 1.30), and depersonalization (OR = 2.59) predicted compliance when emotional exhaustion was the outcome (all ps < 0.05). Mindfulness predicted compliance with regard to BM efficacy (OR = 0.24), such that compliers were more likely to be higher on mindfulness at baseline. The covariates did not significantly predict compliance for the SEL efficacy or depersonalization outcomes. When comparing the PAX GBG condition to the control condition, mobility predicted compliance for the SEL efficacy outcome (OR = 1.14). Cohort predicted compliance for the BM efficacy (OR = 0.55) and depersonalization (OR = 0.38) outcomes (all ps < 0.05).
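As an aid to reading these odds ratios, the following minimal sketch converts an OR into a logistic regression coefficient and shows how it would shift a probability of high compliance. The baseline probability of 0.30 is a purely hypothetical value chosen for illustration, the function names are ours, and the scaling of each covariate (what a "one-unit" change means) is not specified here, so the printed probabilities are illustrative rather than estimates from the study.

```python
import math


def odds_ratio_to_coefficient(odds_ratio: float) -> float:
    """Return the logistic regression coefficient (log-odds) for an odds ratio."""
    return math.log(odds_ratio)


def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability after multiplying the baseline odds by the odds ratio."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline probability of being a high complier (an assumption).
BASELINE_PROB = 0.30

# Odds ratios as reported in the text above.
examples = [
    ("mobility, integrated vs. control", 1.15),
    ("cohort, integrated vs. control", 3.94),
    ("cohort, PAX GBG vs. control", 0.55),
]

for label, odds_ratio in examples:
    coef = odds_ratio_to_coefficient(odds_ratio)
    prob = apply_odds_ratio(BASELINE_PROB, odds_ratio)
    print(f"{label}: OR = {odds_ratio:.2f}, log-odds = {coef:+.2f}, "
          f"probability {BASELINE_PROB:.2f} -> {prob:.2f}")
```

An OR above 1 raises the odds of high compliance for a one-unit increase in the covariate, and an OR below 1 lowers them; an OR of exactly 1 leaves the probability unchanged.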

Table 4 Logistic regression of high compliance on baseline covariates (compliers vs. never-takers)

Discussion

Many district, school, structural, training, and teacher factors can facilitate or impede the implementation of school-based prevention programs, particularly those that are dependent on teachers’ use in the classroom (Domitrovich et al. 2009; Han and Weiss 2005). Comparing average outcomes of schools randomized to an intervention group and a control group should produce unbiased estimates of a program’s impacts, assuming that randomization was successful in creating equivalent groups, but the contrasts between the conditions are diminished as a result of variation in treatment received (Weiss et al. 2013). In the current study, average outcomes across treatment conditions should have produced unbiased estimates of the impacts on teacher efficacy and burnout of two evidence-based prevention programs (PAX GBG and PATHS) intended to build students’ social-emotional skills, reduce aggressive behaviors, and help teachers manage their classrooms. But teachers varied in the degree to which they implemented the programs, which is a common occurrence in school-based programming (Domitrovich et al. 2008; Fixsen et al. 2005). Ignoring this variation in implementation (i.e., compliance) may result in decreased power to detect average effects (Jo et al. 2008). This source of variation likely diminishes the contrasts between treatment conditions and, in turn, attenuates program impacts. CACE estimation is one approach to accounting for this source of variation. In this study, we applied the CACE framework to account for teachers’ implementation dosage of the PAX GBG program. The interventions tested within this study are in fact similar in many ways to other classroom-based prevention programs that largely rely on teachers for implementation (e.g., 4R’s, Second Step).
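The attenuation described above can be illustrated with the simplest instrumental-variable form of the CACE (Angrist et al. 1996). The sketch below is not the mixture-modeling estimator used in this study. It assumes that control teachers have no access to the program and that treatment assignment is monotone in its effect on compliance, and all numeric inputs are hypothetical. Under those assumptions, the ITT effect is a compliance-weighted average of the complier and non-complier effects, and when the exclusion restriction holds (the non-complier effect is zero), the CACE equals the ITT effect divided by the share of compliers.

```python
def itt_effect(cace: float, complier_share: float,
               noncomplier_effect: float = 0.0) -> float:
    """ITT effect as a compliance-weighted average of subgroup effects.

    noncomplier_effect = 0.0 corresponds to the exclusion restriction.
    """
    if not 0.0 <= complier_share <= 1.0:
        raise ValueError("complier_share must lie in [0, 1]")
    return complier_share * cace + (1.0 - complier_share) * noncomplier_effect


def cace_from_itt(itt: float, complier_share: float) -> float:
    """Recover the CACE from the ITT effect, assuming the exclusion restriction."""
    if not 0.0 < complier_share <= 1.0:
        raise ValueError("complier_share must lie in (0, 1]")
    return itt / complier_share


# Hypothetical values: a complier effect of 0.40 with half of teachers complying.
attenuated = itt_effect(cace=0.40, complier_share=0.50)
print(f"ITT with exclusion restriction: {attenuated:.2f}")
print(f"CACE recovered from ITT: {cace_from_itt(attenuated, 0.50):.2f}")

# Relaxing the exclusion restriction: an opposing non-complier effect
# (hypothetical -0.20) pulls the average effect further toward zero.
masked = itt_effect(cace=0.40, complier_share=0.50, noncomplier_effect=-0.20)
print(f"ITT with opposing non-complier effect: {masked:.2f}")
```

The second call shows why opposing effects in the two compliance groups can leave an average effect that is small even when both subgroup effects are not.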

Overall, the CACE estimation approach was helpful in understanding treatment-control differences when accounting for variation in treatment conditions due to teacher implementation. This approach revealed impacts on teachers that were different from those produced using an ITT approach. First, some intervention effects on teacher efficacy and burnout were stronger among higher implementing teachers compared to teachers overall. Specifically, we found positive effects on social-emotional and behavioral management efficacy among higher implementing teachers in both of the intervention conditions. Teachers on average had greater increases in efficacy in the integrated condition than in the control condition. In the case of behavior management efficacy, these effects seemed to be concentrated among higher implementing teachers. In the case of social-emotional efficacy, the effects among higher implementers and among the teacher sample overall were similar.

The estimation of program impacts on burnout (i.e., personal accomplishment, depersonalization, and emotional exhaustion) while accounting for implementation revealed a different story than the estimation of average impacts across all teachers. Specifically, program effects on personal accomplishment were stronger among teachers most likely to comply in both intervention conditions. On average, there were no significant differences between treatment conditions and the control condition in growth or change of emotional exhaustion or depersonalization across the school year. However, accounting for implementation showed some evidence for opposing findings among higher and lower implementers and for some increases in burnout among higher implementers. Specifically, being in the higher implementing group in the integrated condition led to greater reports of emotional exhaustion, whereas being in the lower implementing group within an intervention school was associated with somewhat reduced emotional exhaustion. Furthermore, being a higher implementer led to slightly greater reports of depersonalization in both conditions, while being a lower implementer in an integrated school was associated with slightly reduced depersonalization. Overall program impacts on emotional exhaustion and depersonalization may have been masked by these opposing findings. In addition, the trends using the medium and high implementation cut points were somewhat similar, but there were several differences. The effects using the high implementation cut point were notably stronger for all the outcomes except depersonalization. In addition, there were a few cases where the direction of the effect was different (e.g., SEL efficacy in the integrated condition and emotional exhaustion in the PAX GBG condition).

Limitations

The initial sample size was small, and the sample of teachers became smaller when split into implementation groups. Estimation of program impacts using the high implementation cut point yielded an especially small sample size and likely rendered the estimates unstable. In addition, despite the fact that schools and not teachers were randomized, the small sample of schools in each condition prevented us from employing a multilevel modeling approach. Multilevel mixture modeling using the EM estimation approach is computationally demanding, and treatment effects accounting for compliance are poorly estimated when the number of clusters is small (Jo et al. 2008). Therefore, we were not able to accommodate both implementation and the clustering of teachers in schools.

The interpretation of our findings is limited by our measure of implementation. In most prior applications of the CACE approach, full implementation was known, either based on a theoretical idea of what level of implementation is needed to benefit from the program (e.g., Stuart et al. 2008) or because implementation was defined as electing to receive the intervention or not (e.g., Connell et al. 2007; Cowen 2008). In our case, we did not have a target level of dosage to measure perfect implementation and implementation was measured on a continuum from low to high. As evidenced by the findings, the cut point we used for implementation made a difference in terms of the pattern of findings. Thus, when using the CACE approach to account for service delivery by teachers rather than program uptake of participants, as the approach has most often been used, it would be useful to have a more precise threshold for full implementation. In addition, CACE estimation relies on a set of assumptions, some of which are difficult to meet when applied to school-randomized trials. As discussed earlier, a violation of the exclusion restriction is imaginable in the current study because teachers are likely affected by the intervention even if they did not participate and because our implementation measure is a continuous variable from which we established artificial cut points. The models assuming additivity that relax the exclusion restriction are more likely to suffer from a violation of normality, however. Given these trade-offs, we conducted the CACE estimation with two different cut points and with and without the assumption of the exclusion restriction. We gained more confidence in our findings because our results generally held through sensitivity testing in which we relaxed the exclusion restriction (Jo 2002). However, the results were somewhat sensitive to the cut point used for implementation. 
This is not surprising given that implementation was higher and more concentrated using the high implementation cut point rather than the medium implementation cut point. On the other hand, the sample size was substantially reduced using the high implementation cut point. Further tests of violations of the identifying assumptions are not possible. It is important to keep in mind that bias from violation of these assumptions will be problematic in any application of CACE estimation (Jo 2002). Although a strength of this study was the relatively large number of outcomes we examined, it did result in a rather large number of tests conducted. We applied a Bonferroni correction to adjust for the multiple tests; interestingly, many of the findings that dropped to trend-level significance after applying this correction tended to be those which were suggestive of an iatrogenic effect of high implementation. As a result, we are cautious in our interpretation of this finding.
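To make the Bonferroni logic concrete, the following minimal sketch divides the nominal alpha by the number of tests and labels each p-value accordingly. The family size of five tests, the example p-values, and the rule that a result below the nominal alpha but above the corrected threshold is labeled "trend-level" are assumptions made for illustration; the exact family of tests used for the correction is not stated in this section.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def classify_p_value(p_value: float, alpha: float, n_tests: int) -> str:
    """Label a p-value relative to the corrected and nominal thresholds."""
    if p_value < bonferroni_threshold(alpha, n_tests):
        return "significant"
    if p_value < alpha:
        return "trend-level"
    return "not significant"


# Hypothetical family of five tests at a nominal alpha of 0.05.
ALPHA = 0.05
N_TESTS = 5

for p in (0.004, 0.03, 0.20):  # hypothetical p-values
    print(f"p = {p:.3f}: {classify_p_value(p, ALPHA, N_TESTS)}")
```

A result that would pass the nominal 0.05 threshold can therefore fall short of the corrected threshold, which is the sense in which a finding drops to trend-level significance.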

Conclusion and Implications

Most universal school-based interventions are tested using an ITT approach. Average treatment effects are useful for understanding whether school-based interventions can work under real-world conditions. Estimating variation in implementation and impacts can help unpack under what conditions and for whom programs are effective. This can be helpful in targeting interventions and informing design and implementation of evidence-based interventions (Schochet et al. 2014). CACE estimation is one approach for taking into account implementation or compliance when estimating causal treatment impacts. Applied to a case study of two evidence-based interventions implemented in the classroom, we believe that this approach was helpful in uncovering effects that differed from the traditional ITT approach. The findings suggest the possibility that the highest implementing teachers benefited from the interventions in that they felt more efficacious in their instruction across the school year than teachers in the control group. On the other hand, the results raise the possibility that the increased demand put on teachers in the intervention schools may have increased burnout for some teachers over the year. The current findings suggest the possibility that the implementation of an intervention may increase stress and burnout for certain teachers, even as the design intends for the intervention to be integrated into the regular curriculum and seeks to minimize the amount of additional burden placed on the teacher. Another possibility is that certain teachers who are engaging most in the intervention are becoming more aware and learning to recognize their own emotional responses. As a result of this increased emotional awareness, they may be perceiving and reporting greater feelings of burnout. However, we do not know the extent to which the level of burnout reported may translate into significant or clinical impairment.
Regardless, the extent to which these emotional symptoms may be affecting students still needs to be addressed. Given previous research on the negative associations between teacher burnout and student outcomes (Maslach et al. 1997; Pas and Bradshaw 2014), it is possible that any increased burden placed on teachers could attenuate the effects of the interventions on children (for further discussion of the effects of teacher burnout on students, see Abenavoli et al. 2013). Additional research is needed to better understand what potential elements of programs teachers find stressful to implement. With widespread concern for the effective dissemination of evidence-based programs, these data provide some insight into why some teachers may experience resistance to their implementation.

The question also becomes how we can provide the necessary supports for teachers to implement classroom-based interventions without the generation or perception of increased stress. It is possible that some teachers are better equipped to implement these types of programs with high compliance and experience better outcomes than others. Some teachers respond to and benefit from programs without additional supports; other teachers may benefit from implementation supports such as coaching. In this case, program implementation should be somewhat tailored to the implementing teacher, including adapting a program to decrease implementing burden when burden is a concern. This line of research also provides some insight regarding why we might see variation in program impacts. Variation in impacts might be somewhat attributable to variation in teacher response.

Finally, teachers’ implementation of social-emotional learning programs involves the development of a similar set of skills among adults. One possibility is that prevention programs that foster social-emotional skill-building in children could also provide the supports and skill-building for teachers to manage and develop their own emotional responses. This is particularly important as greater demands are placed on teachers to incorporate these lessons into their everyday classroom routines. Additional research is needed to further explore the predictive validity of teacher compliance as it relates to student outcomes achieved, over and above the teacher impacts examined in the current study.