More than half a century has passed since foundational social psychological research revealed evidence of a “bystander effect” in which witnesses to emergency situations fail to take action to aid those in need (Darley and Latane 1968; Latane and Darley 1969). Contemporary research suggests that the bystander effect may play a role in the prevalence of sexual assault among adolescents and college students, as young people are often unlikely to act when witnessing signs of sexual violence (Banyard 2008; Bennett et al. 2014; Burn 2009; Casey and Ohler 2012; Exner and Cummings 2011; McCauley et al. 2013; McMahon 2010; Noonan and Charles 2009).

One possible solution to this problem is the implementation of bystander sexual assault prevention programs, which are designed to encourage young people to intervene when witnessing signs of sexual violence. Targeting bystanders may be an effective strategy for combatting sexual assault among young people, as past research has produced minimal evidence that traditional prevention programs, which target potential victims and/or perpetrators, are effective in preventing sexual violence (DeGue et al. 2014; De Koker et al. 2014). The present study is a systematic review and meta-analysis examining the effects of bystander programs on attitudes and behaviors pertinent to sexual assault among adolescents and college students. We synthesize high-quality studies assessing the effects of bystander programs on fostering requisite skills believed to combat the bystander effect as it pertains to sexual assault (i.e., noticing sexual assault, identifying a situation as appropriate for intervention, taking responsibility for acting, and knowing strategies for intervening, see Burn 2009) as well as actual bystander intervention behavior.

Sexual assault prevention among adolescents and college students

Sexual assault is a significant problem among adolescents and college students in the USA and across the globe. An analysis of data from three nationally representative telephone surveys of adolescents indicated that 17.8% of 17-year-old females and 3.1% of 17-year-old males had experienced sexual assault (i.e., nonconsensual genital touching or sex) perpetrated by another juvenile at some point in their lifetime (Finkelhor et al. 2014). Additionally, findings from the Association of American Universities Campus Climate Survey found that 26.1% of responding female college seniors and 6.3% of responding male college seniors experienced completed sexual assault (i.e., nonconsensual sexual penetration or sexual touching) as a result of physical force or incapacitation since entering college (Cantor et al. 2015). Similar rates have been reported in Australia (Australian Human Rights Commission 2017), Chile (Lehrer et al. 2013), China (Su et al. 2011), Finland (Bjorklund et al. 2010), Poland (Tomaszewska and Krahé 2018), Rwanda (Van Decraen et al. 2012), Spain (Vázquez et al., 2012), and in a global survey of countries in Africa, Asia, and the Americas (Pengpid and Peltzer 2016).

These rates are problematic, as sexual assault in adolescence and/or young adulthood is associated with numerous adverse outcomes, including risk of repeated victimization, depressive symptomology, heavy drinking, and suicidal ideation (Exner-Cortens et al. 2013; Cui et al. 2013; Halpern et al. 2009). Importantly, there is evidence indicating experiences of sexual assault during these two life phases are related, as victimization and perpetration during adolescence are, respectively, associated with increased risk of victimization and perpetration during young adulthood (Cui et al. 2013). Thus, early prevention efforts are of paramount importance.

Reviews of research on the effectiveness of programs designed to prevent sexual assault among adolescents and college students have noted both a dearth of high-quality studies, such as randomized controlled trials (RCTs), and minimal evidence that these prevention programs have meaningful effects on young people’s behavior (DeGue et al. 2014; De Koker et al. 2014). Concerning the latter point, evaluations of such programs tend to measure attitudinal outcomes (e.g., rape supportive attitudes, rape myth acceptance) more frequently than behavioral outcomes (e.g., perpetration or victimization) (Anderson and Whiston 2005; Cornelius and Resseguie 2007; DeGue et al. 2014). Additionally, findings from a meta-analysis of studies assessing outcomes of college sexual assault prevention programs suggested that effects are larger for attitudinal outcomes than for outcomes related to the actual incidence of sexual assault (Anderson and Whiston 2005).

Bystander sexual assault prevention programs

Given this paucity of evidence regarding behavior change, it is imperative to identify effective strategies for preventing sexual assault among adolescents and young adults. One promising strategy is the implementation of bystander sexual assault prevention programs, which encourage young people to intervene when witnessing incidents or warning signs of sexual assault (e.g., intervening when witnessing controlling behavior or witnessing a would-be perpetrator leading an intoxicated person into an isolated area). Many of these programs are implemented with large groups of adolescents or college students in the format of a single training/education session (e.g., as part of college orientation). However, some programs use broader implementation strategies, such as advertising campaigns where signs are posted across college campuses to encourage students to act when witnessing signs of violence.

By treating young people as potential allies in preventing sexual assault, bystander programs have the potential to be less threatening than traditional sexual assault prevention programs, which tend to approach young people as either potential perpetrators or victims of sexual violence (Burn 2009; Messner 2015; [Jackson] Katz 1995). Instead of placing emphasis on how young people may modify their individual behavior to either respect the sexual boundaries of others or reduce their personal risk for being sexually assaulted, bystander programs aim to foster prerequisite knowledge and skills for intervening on behalf of victims. Thus, by treating young people as part of the solution to sexual assault, rather than part of the problem, bystander programs limit the risk of defensiveness or backlash among participants (e.g., decreased empathy for victims, increased rape myth acceptance) (Banyard et al. 2004; Katz 1995).

Skill-based approach to combatting the bystander effect

Bystander sexual assault prevention programs are designed to combat a general “bystander effect” in which responsibility for action is diffused in group settings (Darley and Latane 1968). To intervene as a witness to sexual assault, individuals must notice the event (or its warning signs), define the event as warranting action/intervention, take responsibility for acting (i.e., feel a sense of personal duty), and demonstrate a sufficient level of self-efficacy (i.e., perceived competence to successfully intervene) (Latane and Darley 1969). Studies have indicated that, as witnesses to sexual assault, young people often fail to meet these criteria (Banyard 2008; Bennett et al. 2014; Burn 2009; Casey and Ohler 2012; Exner and Cummings 2011; McCauley et al. 2013; McMahon 2010; Noonan and Charles 2009), with males being less likely than females to intervene (Banyard 2008; Burn 2009; Edwards et al. 2015; Exner and Cummings 2011; McMahon 2010).

Thus, bystander programs seek to sensitize young people to warning signs of sexual assault, create attitudinal change that fosters bystander responsibility for intervening (e.g., creating empathy for victims), and build requisite skills/tactics for taking action (Banyard 2011; Banyard et al. 2004; Burn 2009; McMahon and Banyard 2012). As outlined by Burn (2009) in her “situational model” of sexual assault intervention, bystander programs should promote the following requisite skills for intervention: noticing an event, identifying a situation as warranting intervention, taking responsibility for acting, and deciding how to help. In a correlational study using a sample of 588 male and female undergraduate students, Burn found that each of these aforementioned prerequisites was positively related to students’ self-reported intervention behavior.

Although the situational model is widely cited in the literature, it is not the only theoretical model underlying sexual assault bystander programs. For example, some programs aim to foster a sense of community responsibility for ending sexual assault (e.g., Bringing in the Bystander, see Banyard et al. 2009), while others aim to deconstruct gender norms that may promote men’s sexual violence against women (e.g., Mentors in Violence Prevention, see [Jackson] Katz 1995). It is important to note that the situational model is not theoretically exclusive of the aforementioned models. In fact, many bystander sexual assault prevention programs aim to foster prerequisite skills for intervention as well as promote community responsibility for ending sexual assault or challenge gender norms that may promote violence. Here, we examine the effects of bystander sexual assault prevention programs on outcomes of theoretical relevance to the situational model (i.e., requisite skills believed to combat the bystander effect as it specifically applies to sexual assault). Findings from this analysis may shed some light on the importance of the proposed mechanisms believed to prepare young people to act as prosocial bystanders when witnessing sexual assault.

The current study

This systematic review and meta-analysis is part of a larger synthesis of research examining the effects of bystander programs on attitudes and behaviors pertinent to sexual assault among adolescents and college students. The protocol for the larger project was pre-registered with the Campbell Collaboration (see Kettrey and Tanner-Smith 2017). The present report synthesizes high-quality studies assessing the effects of bystander programs on requisite intervention skills as proposed by Burn (2009) as well as actual intervention behavior. This includes the following specific outcomes: noticing a sexual assault, identifying a situation as appropriate for intervention, taking responsibility for acting, knowing strategies for intervening, and bystander intervention behavior.

It is important to note that two other meta-analyses of research on sexual assault bystander programs exist. In what they called an “initial” meta-analysis of experimental and quasi-experimental studies published through 2011, [Jennifer] Katz and Moore (2013) found moderate beneficial effects of bystander programs on participants’ self-efficacy and intentions to intervene, and small (but significant) effects on bystander behavior, rape-supportive attitudes, and rape proclivity (but not perpetration). However, Katz and Moore did not synthesize data representing program effects on the requisite skills outlined in Burn’s (2009) situational model. Additionally, their sample was composed exclusively of studies of college students (i.e., no adolescent samples) and, although their inclusion criteria specified that studies must contain a comparison group, they imposed no other research design criteria. As a result, Katz and Moore’s analysis included reports that failed to meet the methodological criteria for the present analysis.

In a more recent meta-analysis of experimental and quasi-experimental studies published through August 2017, Jouriles and colleagues found small but significant effects on attitudes/knowledge and bystander behavior (Jouriles et al. 2018). The methods of this particular meta-analysis have been critiqued in detail elsewhere (Mindthoff et al. 2019). With reference to the current systematic review and meta-analysis, it is important to note that Jouriles et al. aggregated effect sizes into two general categories (i.e., attitudes/knowledge and behavior), limiting the ability to interpret findings in a substantively meaningful way. In this report, we relay meta-analytic findings as they pertain to four distinct prerequisite skills for bystander intervention as well as for incidents of bystander intervention, offering a more granular understanding of the impacts of bystander sexual assault prevention programs.

Inclusion criteria

To be included in the present review, studies had to assess the effects of a bystander sexual assault prevention program on bystander intervention prerequisites and/or bystander intervention behavior among adolescents or college students.

Participants

Eligible participants included adolescents enrolled in grades 7 through 12 and college students enrolled in any type of undergraduate postsecondary educational institution. Eligible studies included those reporting on general samples of adolescents and/or college students as well as those using specialized samples, such as samples consisting primarily of college athletes, fraternity/sorority members, or a single sex. Study samples consisting primarily of post-graduate students were ineligible for inclusion; the mean age of a sample could be no less than 12 and no greater than 25. Study samples also needed to include a minimum of 10 participants.

Settings

The review focused on studies that examined outcomes of bystander programs targeting sexual assault and implemented with adolescents and/or college students in educational settings. Eligible educational settings included secondary schools (i.e., grades 7–12) and colleges or universities. Studies that assessed bystander programs implemented with adolescents and young adults outside of educational institutions (e.g., in community or military settings) were ineligible for inclusion in the review. There were no geographical limitations on eligibility.

Interventions

Eligible intervention programs were those that approached participants as allies in preventing and/or alleviating sexual assault among adolescents and/or college students. Some part of the program had to focus on cultivating participants’ willingness to respond to others who are at risk for sexual assault. All delivery formats were eligible for inclusion (e.g., in-person training sessions, video programs, web-based training, advertising/poster campaigns). There were no intervention duration criteria for inclusion.

Studies that reported bystander outcomes but did not meet the aforementioned intervention inclusion criterion were not eligible for inclusion. Additionally, studies that assessed outcomes of programs that aimed to facilitate prosocial bystander behavior, but that did not explicitly include a component addressing sexual assault (e.g., programs to prevent bullying) were not eligible for inclusion, as they do not constitute bystander sexual assault prevention programs.

Eligible comparison groups must have received no intervention services targeting bystander attitudes/behavior or sexual assault. Thus, treatment-treatment studies that compared outcomes of individuals assigned to a bystander program versus those assigned to a general sexual assault prevention program were not eligible for inclusion. Eligible comparison groups may have received an attention-placebo control expected to have no effect on bystander outcomes or attitudes/behaviors regarding sexual assault.

Research design

To be eligible for inclusion in the review, studies must have used an experimental or controlled quasi-experimental research design to compare an intervention group (i.e., students assigned to a bystander program) with a comparison group (e.g., students not assigned to a bystander program). We limited our review to such study designs because they typically have lower risk of bias relative to other study designs (e.g., single group designs). More specifically, we included the following designs:

  1. Randomized controlled trials: studies in which individuals, classrooms, schools, or other groups were randomly assigned to intervention and comparison conditions.

  2. Quasi-randomized controlled trials: studies in which assignment to conditions was quasi-random, for example, by birth date, day of week, student identification number, month, or some other alternation method.

  3. Controlled quasi-experimental designs (QEDs): studies in which participants were not assigned to conditions randomly or quasi-randomly (e.g., participants self-selected into groups). Given the potential selection biases inherent in these designs, we only included those that also met one of the following criteria:

     a. Regression discontinuity designs: studies that used a cutoff on a forcing variable to assign participants to intervention and comparison groups and assessed program impacts around the cutoff of the forcing variable.

     b. Studies that used propensity score or other matching procedures to create a matched sample of participants in the intervention and comparison groups. To be eligible for inclusion, these studies must have also provided enough statistical information to permit estimation of pretest effect sizes for the matched groups.

     c. For studies in which participants in the intervention and comparison groups were not matched, enough statistical information must have been reported to permit estimation of pretest effect sizes for at least one outcome measure.

Outcome measures

We included studies that measured the effects of bystander programs on at least one of the following outcome domains: (1) intervention requisites as defined by Burn (2009) (i.e., noticing a sexual assault, identifying a situation as appropriate for intervention, taking responsibility for acting, knowing strategies for intervening) and (2) bystander intervention behavior when witnessing instances or warning signs of sexual assault. An example measure of the latter is Banyard et al.’s (2007) Bystander Behaviors Scale, which asks whether participants have engaged in bystander behaviors such as walking a friend home from a bar or party where they had too much to drink, calling 911 out of suspicion that a friend or stranger had been drugged, or speaking up against friends or peers who discussed forcing a friend or peer to engage in sexual behavior.

Any outcome falling into these two domains was eligible for inclusion. This included outcomes measured with any form of assessment (e.g., self-report, official/administrative report, observation, etc.) that could be summarized by any type of quantitative score (e.g., percentage, continuous variable, count variable, categorical variable, etc.). In the event that a particular study included multiple measures of a single construct category (e.g., two measures of bystander intervention within a given study), we only included one outcome per study for that construct. We selected the most similar outcomes for synthesis within a construct category. For example, Senn and Forrest (2016) reported bystander intervention separately for situations involving a stranger and situations involving a friend. After lengthy discussion, we included bystander intervention involving strangers, as this was more similar to measures reported in other studies. Additionally, we believed this would yield more conservative measures of bystander outcomes, as young people are less likely to intervene on behalf of strangers than on behalf of friends (McMahon 2010).

Follow-up duration

Studies reporting follow-ups of any duration were eligible for inclusion. When studies reported more than one follow-up wave, we coded each wave and identified its reported duration. Outcome data from all follow-up waves were included in analyses, using robust variance estimation (RVE) to account for dependence between effect sizes (see “Data Synthesis” description below).

Search strategy

In October 2016, we conducted an initial search of the literature, which we updated in June 2017. We identified candidate studies through searches of electronic databases, relevant academic journals, and gray literature sources. We also contacted leading authors and experts on bystander programs to identify any current/ongoing research that might be eligible for the review. Additionally, we screened the bibliographies of eligible studies and relevant reviews to identify additional candidate studies. We conducted forward citation searches (searches for reports citing eligible studies) using the website Google Scholar, as this database produces similar results to other search engines (e.g., Web of Science; Tanner-Smith and Polanin 2015) and is also more likely to locate existing gray literature.

Electronic database searches

The prevention of sexual assault among adolescents and college students is a topic that spans multiple disciplines (e.g., sociology, psychology, education, public health). Thus, we searched a variety of databases relevant to these fields. Search terms varied by database but generally included two blocks of terms, one addressing the intervention and one addressing the outcomes, combined with appropriate Boolean or proximity operators when allowed. We specifically searched the following electronic databases (hosts): Cochrane Central Register of Controlled Trials (CENTRAL), Cochrane Database of Abstracts of Reviews of Effects (DARE), Education Resources Information Center (ERIC, via ProQuest), Education Database (via ProQuest), International Bibliography of the Social Sciences (IBSS, via ProQuest), PsycINFO (via ProQuest), PsycARTICLES (via ProQuest), PubMed, Social Services Abstracts (via ProQuest), and Sociological Abstracts (via ProQuest).

The strategy for searching electronic databases involved the use of search terms specific to the types of interventions and outcomes eligible for inclusion. Search terms for types of interventions included general terms for bystander programs as well as names of specific bystander programs (e.g., Mentors in Violence Prevention). Search terms for types of outcomes included terms that are specific to measures of sexual violence (e.g., sexual assault) as well as more general terms that have the potential to identify studies that measure physical and/or sexual violence. Due to the overwhelming focus of bystander programs on adolescents and college students (aside from a few implementations with military samples), search terms did not limit initial results by the age or general characteristics of the target population.

The search terms and strategy for PsycINFO via ProQuest were as follows (terms were modified for other databases): (ABSTRACT, TITLE(“bystander”)) AND (ABSTRACT, TITLE(“education” OR “program” OR “training” OR “intervention” OR “behavior” OR “attitude” OR “intention” OR “efficacy” OR “prosocial” OR “pro-social” OR “empowered” OR “Bringing in the Bystander” OR “Green Dot” OR “Step Up” OR “Mentors in Violence Prevention” OR “MVP” OR “Know Your Power” OR “Hollaback” OR “Circle of 6” OR “That’s Not Cool” OR “Red Flag Campaign” OR “Where Do You Stand” OR “White Ribbon Campaign” OR “Men Can Stop Rape” OR “The Men’s Program” OR “The Women’s Program” OR “The Men’s Project” OR “Coaching Boys into Men” OR “Campus Violence Prevention Program” OR “Real Men Respect” OR “Speak Up Speak Out” OR “How to Help a Sexual Assault Survivor” OR “InterACT”)) AND (ABSTRACT, TITLE(“sexual assault” OR “rape” OR “violence” OR “victimization”)).

Gray literature and other searches

We conducted gray literature searches to identify unpublished studies that met inclusion criteria. This included searching electronic databases that catalog dissertations and theses, searching conference proceedings, and searching websites with content relevant to sexual assault and/or violence against women. We specifically searched the following gray literature sources: ProQuest Dissertations and Theses Global, Clinical Trials Register (https://clinicaltrials.gov), End Violence Against Women International website, National Sexual Violence Resource Center website, National Online Resource Center on Violence Against Women website, US Department of Justice Office on Violence Against Women website, and Center for Changing our Campus Culture website.

We also searched the tables of contents of current issues of journals that publish research on sexual violence. This included the following journals: Journal of Adolescent Health, Journal of Community Psychology, Journal of Family Violence, Journal of Interpersonal Violence, Psychology of Violence, Violence Against Women, and Violence and Victims. Additionally, we searched reference lists of previous systematic reviews and meta-analyses, CVs and websites of primary authors of eligible studies, and reference lists of all eligible studies. Finally, we conducted forward citation searches of all eligible studies.

Eligibility screening

After identifying candidate studies, the two authors independently screened each study title and abstract and recorded our separate eligibility recommendations (i.e., ineligible or eligible for full-text screening). We resolved disagreements by discussion and consensus, and recorded the final abstract screening decision. We then retrieved potentially eligible studies in full text and independently reviewed them for eligibility, recording our separate eligibility recommendations (and, when applicable, rationale for an ineligibility recommendation). Again, we resolved disagreements via discussion and consensus and recorded the final eligibility decision. In cases where we could not determine eligibility due to missing information in a report, we contacted study authors for this information.

Study coding

The two authors independently double-coded all included studies, using a piloted codebook (see Kettrey et al. 2019 for the published codebook). If data needed to calculate an effect size were missing from a report, we contacted the primary study authors for this information. We entered all coding into an electronic database, with a separate record maintained for each independent coding of each study. Following methods recommended by Lipsey and Wilson (2001), the two authors met regularly to check reliability and reconcile studies on an ongoing basis. This allowed us to refine our coding as we moved through the process. During our regular reconciliation meetings, we resolved coding disagreements via discussion and consensus with final coding decisions maintained in a separate record.

The primary coding categories reported in this meta-analysis are as follows: participant demographics and characteristics (e.g., gender, education level, race/ethnicity); intervention setting (e.g., secondary or post-secondary institution); study characteristics (e.g., study design, duration of follow-up, sample N); outcome construct (e.g., type, description of measure); and outcome results (e.g., follow-up means and standard deviations or proportions).

Calculation of effect sizes

We extracted relevant summary statistics (e.g., means and standard deviations, proportions, observed sample sizes) and used David Wilson’s (2013) online effect size calculator to calculate effect sizes. The overwhelming majority of studies reported continuous measures of treatment effects, so we used a standardized mean difference (SMD) effect size metric with a small sample correction (i.e., Hedges’ g). For cases in which binary outcome measures were reported, we deemed these measures to represent the same underlying construct as continuous measures, as they typically relied on the same measurement tools as the relevant continuous measures. Thus, we transformed any log odds ratio effect sizes available from binary measures into standardized mean difference effect sizes by entering the observed proportions and sample sizes into Wilson’s (2013) online effect size calculator. All standardized mean difference effect sizes were coded such that positive values (i.e., greater than 0) indicate a beneficial outcome for the intervention group.
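
To make these computations concrete, the following sketch (our illustration, not the authors’ archived code; all numeric inputs are hypothetical) implements the standard Hedges’ g formulas and one common logit-based conversion from a log odds ratio to the SMD metric:

```r
# Hedges' g: standardized mean difference with a small-sample correction.
hedges_g <- function(m1, sd1, n1, m2, sd2, n2) {
  sp <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))  # pooled SD
  d  <- (m1 - m2) / sp                     # Cohen's d
  J  <- 1 - 3 / (4 * (n1 + n2 - 2) - 1)    # small-sample correction factor
  c(g = J * d,
    var = J^2 * ((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))))  # approximate variance
}

# Binary outcomes: compute a log odds ratio from observed proportions, then
# rescale it to the SMD metric (the logit method; one common conversion).
lor_to_smd <- function(p1, n1, p2, n2) {
  lor  <- log((p1 / (1 - p1)) / (p2 / (1 - p2)))
  vlor <- 1 / (n1 * p1 * (1 - p1)) + 1 / (n2 * p2 * (1 - p2))
  c(g = lor * sqrt(3) / pi, var = vlor * 3 / pi^2)
}

hedges_g(m1 = 32.5, sd1 = 19.0, n1 = 120, m2 = 28.0, sd2 = 19.0, n2 = 118)
lor_to_smd(p1 = 0.40, n1 = 120, p2 = 0.30, n2 = 118)
```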

The unit of analysis of interest for this review was the individual (i.e., individual-level behaviors). Some of the included studies used cluster randomized trial designs in which participants were assigned to the intervention or comparison conditions at the group level (e.g., entire schools were assigned to a single condition), while inferences were made at the individual level. To correct for these unit of analysis errors, we followed the procedures outlined in the Cochrane Handbook (Higgins and Green 2011) and inflated the standard errors of the effect sizes from these nine studies by multiplying them by the square root of the design effect:

1 + (M – 1) ICC,

where M is the average cluster size for a given study and ICC is the intracluster correlation coefficient for a given outcome. In cases where study authors did not report ICCs, we used a liberal assumed value of .10, as indicated by Hedges and Hedberg’s (2007) research on ICCs in cluster randomized trials conducted in educational settings.
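
As an illustration, a minimal sketch of this correction (our code, with invented inputs) is:

```r
# Inflate a standard error for cluster assignment via the design effect.
adjust_for_clustering <- function(se, avg_cluster_size, icc = 0.10) {
  deff <- 1 + (avg_cluster_size - 1) * icc  # design effect, 1 + (M - 1) * ICC
  se * sqrt(deff)                           # SE multiplied by sqrt(design effect)
}

# e.g., a school-randomized study with ~25 students per cluster, assumed ICC = .10:
adjust_for_clustering(se = 0.08, avg_cluster_size = 25)  # returns ~0.15
```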

Data synthesis

To allow for the inclusion of dependent effect sizes (e.g., estimates from multiple follow-up waves or multiple treatment arms within a single study), we used the robust variance estimation (RVE) meta-analytic method. This method allows researchers to estimate meta-regression models with dependent effect sizes; however, when used with small samples, it can produce narrow confidence intervals and, thus, increase the chances of a type I error (Hedges et al. 2010; Tanner-Smith and Tipton 2014; Tipton 2013). To avoid this problem, Tanner-Smith and Tipton (2014) recommend researchers use samples with a minimum of (1) 10 studies to estimate average effect sizes and (2) 40 studies to estimate meta-regression coefficients (i.e., moderator coefficients).

Tipton (2015) recently developed a method for conducting RVE with a small sample correction. This method can be implemented with the robumeta package in R (Fisher and Tipton 2015). Because the final sample for this meta-analysis failed to meet Tanner-Smith and Tipton’s (2014) recommendations, we used the robumeta package to conduct RVE correlated effects modeling using inverse variance weighting and Tipton’s small sample adjustment (statistical code is available from the first author upon request). To minimize any potential bias in the meta-analysis results due to effect size outliers, we Winsorized all effect sizes that fell more than two standard deviations away from the mean of the effect size distribution (Lipsey and Wilson 2001), replacing them with the value that fell exactly two standard deviations from the mean of the distribution of effect sizes.
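
The following sketch illustrates these steps (our illustration, not the authors’ code; it assumes a hypothetical data frame dat with one row per effect size and columns g, vg, and study):

```r
library(robumeta)

# Winsorize effect sizes falling more than 2 SDs from the mean of the distribution.
m <- mean(dat$g); s <- sd(dat$g)
dat$g <- pmin(pmax(dat$g, m - 2 * s), m + 2 * s)

# RVE correlated-effects model with inverse variance weights and Tipton's
# small-sample adjustment (small = TRUE).
res <- robu(g ~ 1, data = dat, studynum = study, var.eff.size = vg,
            modelweights = "CORR", rho = 0.8, small = TRUE)
print(res)  # average SMD with robust SE and CI; output also reports tau-squared and I-squared
```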

We assessed heterogeneity using the I² and τ² statistics, and we implemented meta-regression using small-sample RVE estimators to conduct moderator analyses. We report SMDs for the aggregate sample of effect sizes for each outcome (summarized in separate forest plots) as well as for subsamples of (1) RCTs and (2) effect sizes broken down by timing of follow-up wave for each outcome (follow-up wave categories varied by outcome). For the aggregate analyses, we examined the moderating effects of study attrition and risk of bias (ROB). Attrition is the proportion of respondents who dropped out of the study by the first follow-up. ROB is a dichotomous measure indicating whether a study was rated as having high risk of bias on one or more of the following criteria: randomization procedure, blinding of participants, handling of incomplete data, or reporting bias. This measure was dichotomized due to the low number of studies exhibiting each individual ROB criterion.
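
Continuing the sketch above, the moderator and sensitivity analyses can be run as follows (attrition and rob are hypothetical column names):

```r
# Bivariate RVE meta-regressions for the two moderators.
mod_attrition <- robu(g ~ attrition, data = dat, studynum = study,
                      var.eff.size = vg, modelweights = "CORR", small = TRUE)
mod_rob <- robu(g ~ rob, data = dat, studynum = study,
                var.eff.size = vg, modelweights = "CORR", small = TRUE)
print(mod_attrition); print(mod_rob)  # robust slopes with small-sample dfs

# robumeta's sensitivity() re-estimates a correlated-effects model across a grid
# of assumed within-study correlations (rho), as in our sensitivity checks.
sensitivity(res)
```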

Results

Search results

We conducted an initial literature search in October 2016 and an updated search in June 2017. Figure 1 outlines the flow of studies through the search and screening process. Through our initial and updated searches, we identified 797 reports: 738 from searches of electronic databases, 20 from reference lists of review articles, 19 from ClinicalTrials.gov, 9 from CVs and websites of primary authors of eligible studies, 5 from website searches, 3 from reference lists of eligible reports, 1 from conference proceedings, 1 from tables of contents searches, and 1 from forward citation searching of eligible studies. After removing duplicate reports and reports deemed ineligible during abstract screening, 154 reports were eligible for full-text screening. Three of these reports could not be located; thus, we screened a total of 151 full-text reports for eligibility.

Fig. 1 PRISMA study flow diagram

Twenty independent studies summarized in 28 reports met inclusion criteria. One of these studies met the review’s methodological standards but did not report codable outcomes (Darlington 2014): it reported only within-group pre-post effect sizes, no between-group data were reported, and our attempts to obtain these data from the study author were unsuccessful. As a result, our meta-analysis includes 19 independent studies summarized in 27 reports. Among these 27 reports, one (Jouriles et al. 2016) presented findings from two independent studies, which we coded separately under the Jouriles et al. (2016) label.

Characteristics of the 19 included studies are summarized in Table 1. Nine studies utilized random assignment at the individual level, five utilized random assignment at the group level, and five used an eligible quasi-experimental research design. Two studies (Banyard et al. 2007 and Jouriles et al. n.d.) included two eligible intervention arms. Banyard et al. (2007) randomly assigned participants to one of three groups: (1) a no-treatment control that only completed pre- and post-intervention surveys, (2) an intervention group that was assigned to complete one 90-min bystander program session, or (3) an intervention group that was assigned to complete three 90-min bystander program sessions. Jouriles et al. (n.d.) randomly assigned participants to one of three groups: (1) a comparison attention-placebo treatment condition that presented material on study skills, (2) a computer-delivered bystander program that participants completed independently, or (3) a computer-delivered bystander program that participants completed in a lab under supervision. The use of RVE methods, which allows for the inclusion of dependent effect sizes within a single analysis, permitted us to include effects from multiple treatment arms of a single study in the same analysis.

Table 1 Characteristics of included studies (k = 19)

Noticing sexual assault

Four studies reported a total of seven effects of bystander programs on noticing sexual assault. Studies operationalized this outcome in a number of ways, including a single item developed by Burn (2009): “At a party or bar, I am probably too busy to be aware of whether someone is at risk for sexually assaulting someone.” All but one of these studies (Senn and Forrest 2016) involved random assignment of participants to conditions, and all but one (Brokenshire 2015) was published in a peer-reviewed outlet; the latter report was an unpublished master’s thesis.

The forest plot in Fig. 2 summarizes seven effects reported across the four studies measuring respondents’ ability to notice a sexual assault event. The average intervention effect was non-significant (g = 0.01, 95% CI [− 0.40, 0.42]) with no evidence of significant heterogeneity (I² = 0.00%, τ² = 0). A sensitivity analysis to determine whether the average effect size differed when assuming a .00, .20, .40, .60, or .80 correlation between within-study effect sizes revealed that the average effect size was the same for each of these assumed correlations (g = 0.01). Bivariate RVE meta-regression indicated attrition (b = − 0.35, 95% CI [− 5.78, 5.08]) and ROB (b = 0.07, 95% CI [− 0.88, 1.02]) were not significantly related to effect size.

Fig. 2 Forest plot of bystander intervention effects on noticing sexual assault

Subsample analysis of the four effect sizes reported across the three RCTs indicated that the average intervention effect was non-significant (g = − 0.11, 95% CI [− 0.50, 0.27]) and demonstrated no evidence of heterogeneity (I² = 0.00%, τ² = 0). To conduct subsample analyses based on timing of follow-up wave, we collapsed effect sizes for this outcome into three intervals: (1) immediate posttest (i.e., 0 weeks to 1 week), (2) 1- to 4-month follow-up, and (3) 1-year follow-up. The average intervention effect among the two effect sizes reported across two studies at immediate posttest was non-significant (g = 0.17, 95% CI [− 1.96, 2.30]) and demonstrated no evidence of heterogeneity (I² = 0.00%, τ² = 0). The average intervention effect among the three effect sizes reported across three studies at 1- to 4-month follow-up was non-significant (g = 0.01, 95% CI [− 0.42, 0.43]) with no evidence of heterogeneity (I² = 0.00%, τ² = 0). One study (Miller et al. 2013) reported noticing sexual assault as an intervention outcome 1 year post-intervention; its effect size was non-significant (g = − 0.06, 95% CI [− 0.13, 0.41]).

Identifying a situation as appropriate for intervention

Six studies reported a total of 11 effects of bystander programs on participants’ identification of a situation as appropriate for intervention. Studies frequently operationalized this concept using some adaptation of Burn’s (2009) Failure to Identify Situation as High Risk scale (e.g., “In a party or bar situation, I think I might be uncertain as to whether someone is at-risk for being sexually assaulted”). All but three of these studies (Amar et al. 2015; Baker et al. 2014; Senn and Forrest 2016) involved random assignment of participants to conditions, and all but one (Addison 2015) was published in a peer-reviewed outlet; the latter report was an unpublished doctoral dissertation.

The forest plot in Fig. 3 summarizes 11 effects reported across the 6 studies measuring respondents’ ability to identify a situation as appropriate for intervention. The average intervention effect was significant and positive (g = 0.77, 95% CI [0.06, 1.47]) with evidence of heterogeneity (I² = 60.46%, τ² = 0.28). A sensitivity analysis to determine whether the average effect size differed when assuming a .00, .20, .40, .60, or .80 correlation between within-study effect sizes revealed that the average effect size was the same for each of these assumed correlations (g = 0.77). Bivariate RVE meta-regression indicated attrition (b = 2.55, 95% CI [− 1.80, 6.89]) and ROB (b = − 0.25, 95% CI [− 1.99, 1.49]) were not significantly related to effect size.

Fig. 3 Forest plot of bystander intervention effects on identifying a situation as appropriate for intervention

Subsample analysis of the six effect sizes reported across the three RCTs indicated that the average intervention effect was significant and positive (g = 1.15, 95% CI [0.21, 2.09]) and demonstrated some evidence of heterogeneity (I² = 23.85%, τ² = 0.05). To conduct subsample analyses based on timing of follow-up wave, we collapsed effect sizes for identifying a situation into two intervals: (1) immediate posttest (i.e., 0 weeks to 1 week) and (2) 1- to 4-month follow-up. The average intervention effect among the six effect sizes reported across five studies at immediate posttest was positive but non-significant (g = 0.67, 95% CI [− 0.08, 1.42]) and demonstrated evidence of heterogeneity (I² = 51.98%, τ² = 0.20). The average intervention effect among the four effect sizes reported across three studies at 1- to 4-month follow-up was non-significant (g = 0.56, 95% CI [− 0.93, 2.05]) with some evidence of heterogeneity (I² = 17.30%, τ² = 0.04).

Taking responsibility for acting

Four studies reported a total of eight effects of bystander programs on participants’ taking responsibility for acting/intervening when witnessing violence or its warning signs. Studies operationalized this outcome using either the full or an adapted version of the Failure to Take Intervention Responsibility scale developed by Burn (2009), which consists of items such as “I am less likely to intervene to reduce a person’s risk of sexual assault if I think she/he made choices that increased their risk.” Two studies (Banyard et al. 2007; Moynihan et al. 2011) involved random assignment of participants to conditions and two (Amar et al. 2015; Senn and Forrest 2016) did not. All four studies were published in peer-reviewed outlets.

The forest plot in Fig. 4 summarizes eight effects reported across the four studies measuring taking responsibility for acting. The average intervention effect was non-significant (g = 0.29, 95% CI [− 0.70, 1.27]) with evidence of heterogeneity (I² = 57.09%, τ² = 0.23). A sensitivity analysis to determine whether the average effect size differed when assuming a .00, .20, .40, .60, or .80 correlation between within-study effect sizes revealed that the average effect size was the same for each of these assumed correlations (g = 0.29). Bivariate RVE meta-regression analyses indicated attrition (b = − 10.59, 95% CI [− 115.64, 94.50]) and ROB (b = − 0.95, 95% CI [− 2.13, 0.24]) were not significantly related to effect size.

Fig. 4 Forest plot of bystander intervention effects on taking responsibility to act/intervene

Subsample analysis of the five effect sizes reported across the two RCTs indicated that the average intervention effect was non-significant (g = 0.71, 95% CI [− 3.66, 5.07]) with evidence of heterogeneity (I² = 41.68%, τ² = 0.18). To conduct subsample analyses based on timing of follow-up wave, we collapsed effect sizes for this outcome into two intervals: (1) immediate posttest (i.e., 0 weeks to 1 week) and (2) 1- to 4-month follow-up. The average intervention effect among the four effect sizes reported across three studies at immediate posttest was non-significant (g = 0.39, 95% CI [− 1.66, 2.45]) and demonstrated evidence of heterogeneity (I² = 77.75%, τ² = 0.51). The average intervention effect among the four effect sizes reported across three studies at 1- to 4-month follow-up was non-significant (g = 0.38, 95% CI [− 0.75, 1.51]) with no evidence of heterogeneity (I² = 0.00%, τ² = 0).

Knowing strategies for intervening

Four studies reported a total of five effects of bystander programs on participants’ knowledge of strategies for helping/intervening. Studies operationalized this concept using Burn’s (2009) two-item Failure to Intervene Due to Skills Deficit scale, which asked respondents to rate their agreement with statements such as “Although I would like to intervene when a guy’s sexual conduct is questionable, I am not sure I would know what to say or do.” Two studies (Brokenshire 2015; Potter et al. 2008) involved random assignment of participants to conditions and two (Amar et al. 2015; Senn and Forrest 2016) did not. All but one study (Brokenshire 2015) was published in a peer-reviewed outlet; the latter report was an unpublished master’s thesis.

The forest plot in Fig. 5 summarizes five effects reported across the four studies measuring knowledge of strategies. The average intervention effect was non-significant (g = 0.51, 95% CI [− 0.52, 1.54]) with evidence of heterogeneity (I² = 33.53%, τ² = 0.12). A sensitivity analysis to determine whether the average effect size differed when assuming a .00, .20, .40, .60, or .80 correlation between within-study effect sizes revealed that the average effect size was substantively similar for each of these assumed correlations (g = 0.52 for correlations of .00, .20, and .40; g = 0.51 for correlations of .60 and .80). Bivariate RVE meta-regression analyses indicated that attrition (b = 4.56, 95% CI [− 12.04, 21.15]) was not significantly related to effect size. Lack of variation in ROB prohibited moderator analysis, as all studies reporting knowing strategies for helping were rated as having high risk of bias.

Fig. 5 Forest plot of bystander intervention effects on knowing strategies for intervening

Subsample analysis of the two effect sizes reported across the two RCTs indicated that the average intervention effect was non-significant (g = 0.84, 95% CI [− 4.73, 6.41]) with no evidence of heterogeneity (I² = 0.00%, τ² = 0). To conduct subsample analyses based on timing of follow-up wave, we collapsed effect sizes for this outcome into two intervals: (1) immediate posttest (i.e., 0 weeks to 1 week) and (2) 1- to 4-month follow-up. The average intervention effect among the three effect sizes reported across three studies at immediate posttest was non-significant (g = 0.59, 95% CI [− 1.16, 2.35]) and demonstrated evidence of heterogeneity (I² = 58.23%, τ² = 0.27). The average intervention effect among the two effect sizes reported across two studies at 1- to 4-month follow-up was non-significant (g = 0.58, 95% CI [− 0.91, 2.06]) with no evidence of heterogeneity (I² = 0.00%, τ² = 0).

Bystander intervention behavior

Thirteen studies reported a total of 17 effects of bystander programs on bystander intervention behavior. All studies used a form of the Bystander Behaviors Scale (Banyard et al. 2007), which asked participants to indicate the extent to which they had engaged in bystander behavior (e.g., “walked a friend home from a party who has had too much to drink”). All but three of these studies (Miller et al. 2014; Peterson et al. 2016; Senn and Forrest 2016) involved random assignment of participants to conditions, and all but one (Jouriles et al. n.d.) was published in a peer-reviewed outlet; the latter report was a manuscript draft under review for publication at the time of this review.

Inspection of the distribution of effect sizes revealed that the effect size of one of the non-randomized studies (Peterson et al. 2016) fell more than two standard deviations above the mean of the distribution. We therefore Winsorized this effect size (g = 0.63) by replacing it with the value that fell exactly two standard deviations above the mean (g = 0.57). The forest plot in Fig. 6 summarizes 17 effects reported across the 13 studies measuring bystander intervention. The average intervention effect was significant and positive (g = 0.22, 95% CI [0.13, 0.32]) with no evidence of heterogeneity (I² = 0.00%, τ² = 0).

Fig. 6 Forest plot of bystander intervention effects on bystander intervention

A sensitivity analysis to determine whether the average effect size differed when assuming a .00, .20, .40, .60, or .80 correlation between within-study effect sizes revealed that the average effect size was the same for each of these assumed correlations (g = 0.22). An additional sensitivity analysis replacing the Winsorized effect size from the Peterson et al. study (g = 0.57) with the original effect size (g = 0.62) demonstrated that the findings were substantively similar to those from the main analysis (g = 0.23, 95% CI [0.13, 0.32]) with no evidence of heterogeneity (I² = 0.00%, τ² = 0). Bivariate RVE meta-regression analyses indicated that attrition (b = − 0.35, 95% CI [− 0.94, 0.25]) and ROB (b = 0.08, 95% CI [− 0.16, 0.32]) were not significantly related to effect size.

Subsample analysis of the 14 effect sizes reported across the 10 RCTs indicated that the average intervention effect was significant and positive (g = 0.20, 95% CI [0.12, 0.28]) with no evidence of heterogeneity (I² = 0.00%, τ² = 0). To conduct subsample analyses based on timing of follow-up wave, we collapsed effect sizes for bystander intervention behavior into two intervals: (1) 1- to 4-month follow-up and (2) 6-month to 1-year follow-up. The average intervention effect among the ten effect sizes reported across eight studies at 1- to 4-month follow-up was significant and positive (g = 0.24, 95% CI [0.13, 0.36]) and demonstrated no evidence of heterogeneity (I² = 0.00%, τ² = 0). The average intervention effect among the seven effect sizes reported across six studies at 6-month to 1-year follow-up was non-significant (g = 0.19, 95% CI [− 0.02, 0.40]) with no evidence of heterogeneity (I² = 0.00%, τ² = 0).

Discussion

This systematic review and meta-analysis synthesized findings from high-quality studies examining the effects of bystander sexual assault programs on requisite skills for prosocial bystander intervention as outlined by Burn (2009; i.e., noticing a sexual assault event, identifying a situation as appropriate for intervention, taking responsibility to intervene, knowing strategies for intervening) as well as actual bystander intervention behavior.

Effects of bystander sexual assault programs on participants’ ability to take notice of sexual assault or its warning signs were non-significant. This was the case in the aggregate analysis as well as at each individual follow-up wave.

Program effects on identifying a situation as appropriate for intervention were significant in the aggregate analysis but dissipated across follow-up waves. At immediate post-test, measures of this outcome were almost seven-tenths of a standard deviation higher for bystander program participants than for the comparison group. By 1- to 4-month follow-up, this difference had shrunk to approximately half a standard deviation and was statistically non-significant. To put the immediate post-test findings into context, one study in our meta-analytic sample (Addison 2015) reported a pre-intervention mean rating of 5.94 (SD = 1.11) regarding responses to a vignette depicting a scenario in which a young woman was too inebriated to consent to sexual activity. The total possible score was 7, with higher scores indicating stronger agreement that the scenario depicted in the vignette warranted intervention. Given that the average effect size for this outcome was 0.67 at immediate post-test, extrapolation suggests that participating in a bystander program would increase the mean rating from 5.94 to 6.68 (i.e., 5.94 + [0.67 × 1.11] = 6.68), a change of approximately three-fourths of a point on a 7-point scale.

Effects on taking responsibility to intervene and knowing strategies for intervening were non-significant in the aggregate analysis as well as at each individual follow-up wave.

Our results indicated that bystander programs have a desirable effect on bystander intervention behavior. At 1- to 4-month follow-up, measures of bystander intervention were approximately one-fourth of a standard deviation higher for bystander program participants than for the comparison group. To contextualize, one study in our meta-analytic sample (Jouriles et al. 2016) reported a pre-intervention mean score of 27.95 (SD = 19.02) on the Bystander Intervention Scale, with a total possible score of 49 and higher scores indicating greater intervention behavior over the past month. Given that the average effect size for bystander intervention at 1- to 4-month follow-up was 0.24, extrapolation suggests that participation in a bystander program would increase bystander intervention scores from 27.95 to 32.51 (i.e., 27.95 + [0.24 × 19.02] = 32.51), indicating that participants engaged in approximately five additional acts of intervention in the past month (relative to pre-test behavior). However, this effect was non-significant at the 6-month to 1-year follow-up.
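
For readers who wish to reproduce these back-of-the-envelope extrapolations, the arithmetic is simply the pre-intervention mean shifted by the average effect size times the pre-intervention standard deviation (a sketch in R):

```r
# Shift a pre-intervention mean by (average g) * (pre-intervention SD).
extrapolate <- function(pre_mean, pre_sd, g) pre_mean + g * pre_sd

extrapolate(5.94, 1.11, 0.67)    # identifying a situation: 5.94 -> ~6.68 (7-point scale)
extrapolate(27.95, 19.02, 0.24)  # bystander behavior: 27.95 -> ~32.51 (49-point maximum)
```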

Conclusion

According to one prominent logic model, bystander sexual assault prevention programs promote bystander intervention by fostering requisite knowledge and attitudes (Burn 2009). Our meta-analysis provides inconsistent evidence of the effects of bystander programs on the proposed requisite knowledge and attitudes, but promising evidence of short-term effects on bystander intervention. Effects on requisite skills varied, with significant favorable effects on identifying a situation as warranting intervention and non-significant effects on noticing the event, taking responsibility for acting, and knowing strategies for helping. Bystander programs did, however, have a significant favorable effect on bystander intervention behavior.

The desirable effect on bystander behavior, paired with the inconsistent effects on requisite skills, casts some uncertainty on the relationship between the proposed requisite knowledge/attitudes and bystander behavior. Although Burn (2009) found that each skill outlined in her situational model was significantly associated with prosocial bystander behavior, we found that bystander programs only had a significant effect on identifying a situation as warranting intervention. This could be interpreted to suggest that noticing an event, taking responsibility for acting, and knowing strategies for helping may not play an important role in promoting bystander intervention, especially since programs did not have a significant effect on these three requisites but did have a significant effect on bystander intervention. Importantly, Burn’s study measured all four requisite skills and prosocial bystander behavior retrospectively and at a single point in time. Her findings are therefore correlational and do not necessarily demonstrate a causal relationship between the proposed requisite skills and bystander behavior; thus, we do not know whether each of the proposed requisite skills actually combats the bystander effect as it pertains to sexual assault among young people.

Importantly, the nature of the studies in our meta-analytic sample did not allow us to explore relationships between program effects on each of the four requisite skills and bystander intervention behavior. Only one study (Senn and Forrest 2016) reported effects for all five of these outcomes; the remaining studies tended to report either a selection of requisite skills or bystander intervention behavior. This highlights an important direction for future research. Our understanding of the causal mechanisms of program effects on bystander behavior would benefit from further analysis, especially path analysis mapping relationships between specific knowledge/attitude effects and bystander intervention. This is especially important considering the heterogeneity in the theoretical foundations of bystander programs, with some curricula emphasizing community responsibility for preventing sexual assault and others aiming to deconstruct gender norms believed to foster violence against women. Future research should therefore also explore any differential effects of these two approaches on bystander intervention behavior. In the meantime, the available evidence indicates that bystander sexual assault prevention programs have a desirable short-term effect on bystander behavior: regardless of the underlying mechanisms, these programs succeed in encouraging young people to intervene on behalf of potential victims of sexual assault.