Introduction

Young people may engage in sexual behavior that puts them at elevated risk for pregnancy, HIV, and other sexually transmitted infections (STIs). The Centers for Disease Control and Prevention (CDC) found in its 2013 survey of US high school students (generally 13–19 years old) that nearly half (47%) reported having had sexual intercourse, including 15% with four or more lifetime partners. About one third of students were sexually active, defined as having had sex in the past 3 months. Of these, 14% reported that neither they nor their partner had used a pregnancy prevention method at last intercourse.

Close to 300,000 young women (15–19 years old) gave birth in the USA in 2013 (Martin et al. 2015), and over 80% of teen pregnancies are unintended (Kost and Henshaw 2014). Pregnancy and birth rates in adolescents have declined significantly, and in every year, since their peak of 61.8 per 1000 live births in 1991 (CDC 2013, 2014). The rate in the USA, 24 per 1000 live births in women aged 15–19 in 2014, is still very high when compared with other industrialized nations, most of which have rates below 10 (World Bank 2014). There are also large disparities in rates among socioeconomic and ethnic groups in the USA. Birth rates in non-Hispanic black, Hispanic, and American Indian/Alaska Native teens are approximately double those of non-Hispanic white teens, at 4.4, 4.6, 3.5, and 2.0%, respectively (CDC 2013).

Rationale for Systematic Review

Several systematic reviews have examined the evidence for programs to reduce risky sexual behaviors among young people. Most of these have focused on HIV and STI prevention outcomes (Fonner et al. 2014; Jamil et al. 2014; Johnson et al. 2011; Kang et al. 2010; Lazarus et al. 2010; Mavedzenge et al. 2014; Michielsen et al. 2010; Mullen et al. 2002; Naranbhai et al. 2011; Picot et al. 2012; Shepherd et al. 2010; Underhill et al. 2007, 2008), while a few have addressed pregnancy outcomes (Blank et al. 2010; DiCenso et al. 2002; Harden et al. 2009; Oringanje et al. 2016; Scher et al. 2006). Some have examined pregnancy prevention in addition to HIV and STI prevention (Cardoza et al. 2012; Chin et al. 2012; Goesling et al. 2014; Mason-Jones et al. 2012, 2016; Tolli 2012).

None of the existing reviews is designed to address our primary research question, which is the effect of school-based programs to reduce pregnancy in the USA. Past reviews suggest that school-based programs may reduce sexual risk behavior in young people (Chin et al. 2012; Goesling et al. 2014; Mavedzenge et al. 2014; Underhill et al. 2008), though there is much variation in reported effect size and high risk of bias in many studies. Of these five review articles, only three include pregnancy as an outcome (Chin et al. 2012; Goesling et al. 2014; Oringanje et al. 2016). One of these, Oringanje et al. (2016), had no country exclusion criterion and focused primarily on low- and middle-income countries. Its included studies covered school-based, community/home-based, clinic-based, and faith-based programs, and there was no substudy for school-based programs. Only a small number of included studies had true control groups, and only five of the randomized controlled trials (RCTs) were of school programs.

The remaining two reviews that reported on pregnancy were confined to US programs. Of these, one is restricted to reporting only the findings of programs with evidence of effectiveness (Goesling et al. 2014). After applying exclusion criteria, the authors identified 88 studies comprising 78 unique program models (inclusion end date, January 2011). Among these, 34 were considered to have null findings for full sample or subgroups and 13 had positive impacts for subgroups defined by sexual activity at follow-up. The authors presented the results only for the remaining 31 programs, which had evidence of positive effect, defined as a statistically significant positive impact on at least one outcome and no adverse effects. Because papers with evidence of no effectiveness were excluded, it is impossible to critically evaluate the papers that are included, as it is unclear whether the “ineffective” programs lacked evidence to measure effectiveness or whether they were actually ineffective for achieving the desired outcomes. The third review reporting on pregnancy, Chin et al. (2012), focused on group-based comprehensive risk reduction and abstinence-only programs, including both school and community-based programs. The search period for this review ended on August 31, 2007. It reached no conclusion on abstinence-only program effectiveness. For comprehensive programs, it found statistically significant reductions in risk behaviors. The 11% reduction in pregnancy risk was not statistically significant. No subgroup analysis was performed on community versus school-based programs regarding pregnancy outcomes. Thus, none of the previous reviews focus specifically on the pregnancy outcomes of school-based risk reduction programs in the USA. In 2016 and 2017, we conducted that review, reported here. In addition, we examined the effectiveness of these programs on three secondary outcomes: condom use, oral contraceptive pill (OCP) use, and sexual initiation. Our study was not powered to identify which specific approaches, e.g., service learning versus peer-led interventions, may be most effective.

Methods

Our methods are generally based on recommendations of the Cochrane Collaboration (Higgins and Green 2011). For the purposes of the meta-analysis, we assigned each study to one of two broad categories: “RCT” or “non-RCT.” There is some debate about whether it is defensible to include both types in the same pooled analysis. The Cochrane Collaboration discourages this practice (Deeks et al. 2011). However, other analysts argue that the potential problems are exaggerated, and that in many cases, the two types can be combined without introducing significant bias (Shrier et al. 2007). Because there were frequent problems with randomization, blinding, and other methodological issues associated with the RCTs, the difference between non-RCTs and RCTs may be smaller in the current analysis than in other studies. We present results for each outcome with non-RCTs and RCTs both combined and stratified. Our methods are consistent with PRISMA guidelines (Moher et al. 2009); the completed checklist is available from the corresponding author.

Inclusion and Exclusion Criteria

To be included, studies had to report data from programs in the USA or Canada conducted in elementary, middle, or high schools, and report pregnancy risk for the intervention and a control condition (another group or time). Any sexual risk reduction intervention delivered to young people in a school setting, including after school hours, that reported on pregnancy was included. We excluded interventions with one or more components external to the school context for which outcomes were not stratified by component; interventions that included young adults or “any age” for which outcomes were not stratified by age range; interventions focused on secondary prevention; interventions not specifically addressing youth pregnancy prevention outcomes; and interventions without comparators. RCTs, prospective or retrospective observational cohorts, serial cross-sectional studies, and other longitudinal analyses published in any language were eligible. Studies in peer-reviewed journals, reported at scientific conferences, in doctoral dissertations, and in other contexts were eligible. Precise search terms are shown in the search protocol (Appendix A).

Searches and Study Selection

Using a range of keywords and Medical Subject Heading (MeSH) terms, we developed a comprehensive search strategy as described in our protocol (Appendix A). The date range was from January 1, 1985 to our search date of May 17, 2017. We searched bibliographic databases including the Cochrane Central Register of Controlled Trials (CENTRAL), Education Resources Information Center (ERIC), PubMed, PsycINFO, Scopus, and Web of Science.

We also searched “gray literature” to obtain data reported in conferences, dissertations, or other contexts outside peer-reviewed journals. We searched the New York Academy of Medicine’s Gray Literature Report, abstract archives of the American Public Health Association (APHA), doctoral dissertations through ProQuest Dissertations, and Google and Google Scholar using advanced targeted search syntax. We reviewed study bibliographies and contacted authors of included studies and other experts to learn of studies in progress or missed.

We imported all resulting records into EndNote version X7 (Thomson Reuters 2013). One reviewer removed duplicate records and those that were clearly irrelevant. Two reviewers working independently then screened citations by titles, abstracts, and keywords to identify records for full-text review. A third reviewer reconciled any disagreement. Two reviewers then examined the full text of each article to determine which satisfied inclusion criteria.

Data Extraction

To address variations in the precise Population, Intervention, Comparator, and Outcome (PICO) (Counsell 1997; Schardt et al. 2007) scope within our systematic review, we extracted data using an “Intervention-Outcome-Population Trio” (IOPT) structure. Each data point describes the effect of a specified intervention (I) on a specified outcome (O) in a specified population (P). Because studies typically report more than one outcome or population, we extracted multiple IOPTs from each study. We grouped follow-up periods into three categories: < 13, 13–23, and ≥ 24 months; if a study had multiple outcomes within one period, we used the latest one. Two reviewers working independently extracted data into a piloted data extraction form and reconciled any discrepancies. Extracted data included study design characteristics, study setting, and details necessary for risk of bias assessment. Appendix B provides a detailed description of data extraction procedures.
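The IOPT structure and the follow-up grouping rule can be sketched in a few lines of Python. This is a minimal illustration, not the extraction form actually used: the class name, field names, and period labels are assumptions, and fractional follow-up times between 23 and 24 months are assigned to the middle category by assumption.

```python
from dataclasses import dataclass

@dataclass
class IOPT:
    """One Intervention-Outcome-Population Trio (field names are hypothetical)."""
    study: str
    intervention: str
    outcome: str
    population: str
    followup_months: float

def followup_period(months):
    """Assign a follow-up time to one of the three categories used in the review."""
    if months < 13:
        return "<13"
    if months < 24:  # assumption: 23 < months < 24 falls in the middle category
        return "13-23"
    return ">=24"

def latest_per_period(iopts):
    """Within each study/intervention/outcome/population/period, keep the latest follow-up."""
    kept = {}
    for r in iopts:
        key = (r.study, r.intervention, r.outcome, r.population,
               followup_period(r.followup_months))
        if key not in kept or r.followup_months > kept[key].followup_months:
            kept[key] = r
    return list(kept.values())

# Hypothetical records for illustration only (not from any included study)
records = [
    IOPT("Study A", "Program X", "pregnancy", "females", 6),
    IOPT("Study A", "Program X", "pregnancy", "females", 12),
    IOPT("Study A", "Program X", "pregnancy", "females", 18),
]
kept = latest_per_period(records)
```

In this hypothetical example, the 6- and 12-month measurements fall in the same period, so only the 12-month one is retained alongside the 18-month measurement.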

Risk of Bias Assessment

We used the Cochrane Collaboration tool (Higgins et al. 2011) for assessing risk of bias for each IOPT. For RCTs, risk of bias in individual studies includes seven domains: sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment, incomplete outcome data, selective outcome reporting, and other potential biases. For non-RCTs, we also used indices recommended by the Grades of Recommendation Assessment, Development and Evaluation (GRADE) Working Group (Guyatt et al. 2011; Holger et al. 2013), checking whether eligibility criteria were appropriately developed and applied, exposures and outcomes appropriately measured, and evaluating the adequacy of measures to adjust for confounding and adequacy of follow-up time. We reduced the potential for publication bias by comprehensively searching multiple databases and gray literature.

Additional Data Inquiries

We contacted the corresponding authors of 16 of the 21 included studies to inquire about unpublished data or subgroup analyses pertinent to our systematic review and about any additional studies that we may have missed. We received responses from six authors, none of which led to additional or revised effectiveness estimates.

Quality of the Evidence

We graded the quality of evidence for each IOPT following the GRADE approach (Guyatt et al. 2011), using GRADEpro software version 3.2 to perform analyses (Brozek et al. 2008). GRADE ranks the quality of evidence on four levels: “high,” “moderate,” “low,” and “very low.” Evidence quality from RCT data is initially presumed to be “high,” but can be downgraded based on study limitations, inconsistency of results, indirectness of evidence, imprecision, or reporting bias. Evidence quality from non-RCT study data starts “low” but can be upgraded if the magnitude of treatment effect is very large, if there is a significant dose-response relation or if all possible confounders would decrease the magnitude of an apparent treatment effect (Guyatt et al. 2011). Evidence from non-RCTs can also be downgraded.

Measures of Treatment Effect

We used Review Manager 5.2 (The Cochrane Collaboration 2014) for preparing the review and for statistical analysis. From the data extraction file (available on request from the corresponding author), we selected the quantitative information required to perform the meta-analysis. These data include sample size and risk for both intervention and control groups, at baseline and at follow-up. From these figures, we calculated the number of favorable (e.g., no pregnancy) and unfavorable (e.g., pregnancy) events. “Pregnancy” included reports by females and combined male and female reports if female-only results were not reported. These intermediate results were then used to calculate risk ratios (RRs) and a 95% confidence interval for each IOPT (Higgins and Green 2011) as shown in Appendix C.
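The calculation from sample sizes and risks to a risk ratio with its confidence interval can be sketched as follows. This is a minimal illustration that assumes the standard log-scale (Wald) interval for a risk ratio; Appendix C is not reproduced here and may differ in detail. The function names and the example counts are hypothetical.

```python
import math

def events_from_risk(n, risk):
    """Reconstruct the number of unfavorable events from sample size and reported risk."""
    return round(n * risk)

def risk_ratio(events_int, n_int, events_ctl, n_ctl, z=1.96):
    """Risk ratio (intervention vs. control) with a 95% CI computed on the log scale."""
    rr = (events_int / n_int) / (events_ctl / n_ctl)
    # Standard error of log(RR) for two independent groups
    se_log_rr = math.sqrt(1 / events_int - 1 / n_int + 1 / events_ctl - 1 / n_ctl)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical figures for illustration only (not from any included study)
e_int = events_from_risk(200, 0.10)   # intervention arm: 10% pregnancy risk
e_ctl = events_from_risk(200, 0.15)   # control arm: 15% pregnancy risk
rr, lower, upper = risk_ratio(e_int, 200, e_ctl, 200)
```

An RR below 1 favors the intervention; the result is statistically significant at the 5% level only if the interval excludes 1. The formula is undefined when either arm has zero events, a case this sketch does not handle.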

A few IOPTs lacked data needed to calculate risk ratios. In one instance, the total sample size was reported, but not the number of subjects in the control and intervention arms (Kirby et al. 1997a, b); we assumed that 50% of the subjects were in each arm, an assumption that is also consistent with the format of Table 1 of that paper. In another instance (Coyle et al. 2006), calculation of an RR required an estimate of pregnancy risk in the control group, a figure we could not derive from the paper. In our calculation of relative risk for this IOPT, we used the highest pregnancy risk reported in the other included studies, namely 18.5% per O’Donnell et al. (2002). This high figure would favor detection of statistically significant benefit, and is thus “conservative” in the context of the preponderance of results indicating no intervention benefit. The final figures were entered into Stata® (version 13) for generation of forest plots using Stata’s metan command.

Table 1 Summary description of 21 studies included in this systematic review and meta-analysis

Pregnancy was the primary outcome of interest and was a study inclusion criterion. Once studies had been identified, we reviewed them for three secondary outcomes thought to be associated with pregnancy, namely condom use, oral contraceptive pill use, and sexual initiation. Secondary outcome measures for condom use and OCP use varied somewhat. Some studies relied on self-reported condom use at last sex, generating a proportion across all respondents. Others reported frequency of condom use over varying recall periods. We combined these measures into one measure of relative risk of non-condom use. Questions on OCP use were sometimes framed as “always use birth control” and other times as “OCP use at last sex,” and we combined them.

Meta-Analyses

We used a random-effects model because the programs studied were diverse in design, and performed by different researchers using different evaluation methods in varying populations. The assumption of a fixed effect is therefore implausible (Borenstein 2009). In addition to stratifying by RCTs and non-RCTs, we stratified, prior to analysis, according to other variables we hypothesized might affect outcomes: abstinence-only versus non-abstinence-only; package of activities versus narrowly focused activities; pregnancy reported from females only versus “caused pregnancy” (males) also included; analysis of results from youth who were sexually active at baseline only versus both sexually active and inactive at baseline; and mixed school and community setting versus school setting only. These variables are further described in Appendix D. We used standard funnel plot and Egger’s test methods to test for publication bias (Egger et al. 1997).
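Random-effects pooling of study-level risk ratios can be sketched as below. The text does not state which between-study variance estimator was used; this sketch assumes the DerSimonian-Laird estimator, a common choice, and the function name and inputs are hypothetical.

```python
import math

def pool_random_effects(log_rrs, ses, z=1.96):
    """Pool log risk ratios with inverse-variance weights and a
    DerSimonian-Laird estimate of between-study variance (tau^2)."""
    w = [1 / se ** 2 for se in ses]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_rrs)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_rrs))
    df = len(log_rrs) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each study's variance
    w_re = [1 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_rrs)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    return (math.exp(pooled),
            math.exp(pooled - z * se_pooled),
            math.exp(pooled + z * se_pooled),
            tau2)

# Hypothetical inputs for illustration only (not from any included study)
homog = pool_random_effects([math.log(0.8)] * 3, [0.2, 0.3, 0.4])
heterog = pool_random_effects([math.log(0.5), math.log(2.0)], [0.1, 0.1])
```

When studies agree closely, tau^2 is estimated as zero and the result coincides with a fixed-effect analysis; when they disagree, tau^2 is positive, the weights become more equal, and the confidence interval widens.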

Adjustment for Baseline Values

Eight studies counted new pregnancies starting at baseline (Coyle et al. 2006; Hawkins et al. 1999; Howard and McCabe 1990; Kirby et al. 1997a, b; Lieberman et al. 2000; O'Donnell et al. 2002; Smith et al. 2000). Thus, the baseline risk of pregnancy was zero for both intervention and control groups. For the other 13 studies, pregnancy was reported as either “ever” or “any” pregnancy, so the control and intervention groups could differ on prior pregnancies at baseline (Allen et al. 1994, 1997; Anderson et al. 1999; Gelfond et al. 2016; Handler 1987; Kirby et al. 1991; Kisker and Brown 1996; LaChausse 2016; Mitchell-DiCenso et al. 1997; Paine-Andrews et al. 1999; Thomas et al. 1992; Vincent et al. 1987; Walsh-Buhi et al. 2016). To correct for these baseline differences, we used the pregnancy risk difference between intervention and control groups at baseline to adjust the post-intervention risk of pregnancy in controls, and then compared post-intervention risk of pregnancy in the intervention group to this adjusted risk in the control group.
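One reading of this baseline adjustment is an additive shift of the control group's post-intervention risk by the baseline risk difference. The sketch below implements that reading only; the exact form of the adjustment is an assumption, and the function name and example risks are hypothetical.

```python
def adjusted_risk_ratio(int_base, ctl_base, int_post, ctl_post):
    """Risk ratio after shifting the control post-intervention risk by the
    baseline risk difference (intervention minus control).
    Assumption: the adjustment is additive on the risk scale."""
    baseline_diff = int_base - ctl_base
    ctl_post_adj = ctl_post + baseline_diff
    if ctl_post_adj <= 0:
        raise ValueError("adjusted control risk must be positive")
    return int_post / ctl_post_adj

# Hypothetical risks for illustration only (not from any included study)
rr_new_pregnancies = adjusted_risk_ratio(0.0, 0.0, 0.05, 0.10)   # baseline risk zero in both arms
rr_ever_pregnant = adjusted_risk_ratio(0.04, 0.02, 0.10, 0.10)   # intervention arm started higher
```

With zero baseline risk in both groups, as in the eight studies that counted new pregnancies only, the adjustment has no effect. In the second hypothetical case, the intervention group began with a higher risk, so equal post-intervention risks translate into an adjusted RR below 1.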

Results

Search Results

Our searches yielded 4867 unique citations, including 94 in the gray literature (Fig. 1). Screening of titles and abstracts identified 222 citations for full-text review, of which 21 ultimately met inclusion criteria and proceeded to data extraction. See Appendix E for details on exclusion, by study. Key information on the 21 included studies is shown in Table 1. Ten (48%) were RCTs, and 11 (52%) were non-RCTs. Five (24%) were set in rural areas, 11 in urban, one in suburban, one in mixed urban-rural settings, and three were indeterminate. These studies were conducted in eight states; one was conducted in Ontario, Canada, one in unspecified “southern states,” and two had “nationwide” scope. The 13 (62%) studies from which we could extract SES information indicated low income or low educational levels, although these terms were defined in various ways. For example, low SES was defined using specific tangible criteria such as “eligible for free lunch” in one study (Hawkins et al. 1999) and “paid less than standard low-income fee at last hospital visit” in another (Howard and McCabe 1990). Other studies classified the target population as low SES based on broad terms such as “economically disadvantaged” (O’Donnell et al. 2002), “blue collar” (Thomas et al. 1992), and “primarily low income” (Handler 1987). By contrast, five studies that provided information on educational attainment consistently defined it in specific, measurable terms such as mean educational level reported on a scale from “didn’t graduate high school” to “college graduate” (Allen et al. 1994) or “mother’s finished high school: 74%; finished college 47%” (Kirby et al. 1991). As shown in Table 1, the average age (or targeted age when actual age was not reported) of study subjects at baseline was as low as 10.6 years old in one abstinence-only based program in Los Angeles County; four others had an average age of 12, and the rest ranged from 14 to 17 years old.
The study population was disproportionately African-American and Hispanic, except for one curriculum-based sexuality education program in multiple California settings that was 62% white (Kirby et al. 1991). Two studies had 99–100% African-American subjects (Handler 1987; Howard and McCabe 1990).

Fig. 1 Flowchart for systematic review. From: Moher et al. (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Med 6(7), e1000097. https://doi.org/10.1371/journal.pmed1000097

Two (10%) of the studies evaluated abstinence-only programs. As shown in the right-hand column of Table 1, the remaining 19 included a wide range of education modalities, including service learning, positive youth development, peer-led programs, and other pedagogical models including cognitive behavioral theory and social learning theory. These categories overlap, and we are aware of no definitive typology for characterizing school-based risk reduction programs.

Risk of Bias in Included Studies

Overall, risk of bias was high in our included studies. Among RCTs, randomization methods were poor in one study (Allen et al. 1997) and unclear in five (Handler 1987; Kirby et al. 1997a; LaChausse 2016; O'Donnell et al. 2002; Walsh-Buhi et al. 2016). No trials were blinded to participants, personnel, or outcome assessors as this is infeasible in a school-based study. Allocation concealment in RCTs was either not done (Handler 1987; Kirby et al. 1997b; Thomas et al. 1992; Walsh-Buhi et al. 2016) or was not reported clearly (Allen et al. 1997; Coyle et al. 2006; Kirby et al. 1997a; Mitchell-DiCenso et al. 1997; O’Donnell et al. 2002). Four RCTs (Coyle et al. 2006; Kirby et al. 1997a; O'Donnell et al. 2002; Thomas et al. 1992) and three non-RCTs (Gelfond et al. 2016; Kirby et al. 1991; Lieberman et al. 2000) lost more than 20% of participants at follow-up, placing them at high risk of attrition bias. We suspected selective outcome reporting in two studies: one excluded certain sites from analysis (Allen et al. 1997) and one excluded 6% of responses as “incompatible” as well as participants who did not complete follow-up surveys (Kirby et al. 1997b). Among non-RCTs, two were at high risk of bias because the SES characteristics of study groups lacked baseline equivalence (Howard and McCabe 1990; Smith et al. 2000). Four of the 21 studies had high risk of contamination from the control group, thereby biasing the estimated intervention effects toward null (Kisker and Brown 1996; Lieberman et al. 2000; Paine-Andrews et al. 1999; Smith et al. 2000). Seven of the 21 studies did not adjust outcomes for confounding (Anderson et al. 1999; Handler 1987; Hawkins et al. 1999; Howard and McCabe 1990; Lieberman et al. 2000; Smith et al. 2000; Vincent et al. 1987). Outcome adjustment was unclear in five studies (Kirby et al. 1991, 1997a; Mitchell-DiCenso et al. 1997; O'Donnell et al. 2002; Thomas et al. 1992). 
Egger’s test for small-study effects and funnel plot asymmetry suggested no publication bias. P values were 0.53, 0.81, and 0.06 for results at < 13, 13–23, and ≥ 24 months, respectively (details on risk of bias for each IOPT and funnel plots are available on request from the corresponding author).
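For readers unfamiliar with Egger’s test, its core is a regression of each study's standardized effect (effect divided by its standard error) on its precision (the reciprocal of the standard error); an intercept far from zero indicates funnel plot asymmetry. The sketch below is a minimal plain-Python version with hypothetical names and inputs, not the routine used for the analysis reported here.

```python
import math

def egger_intercept(effects, ses):
    """Ordinary least squares of (effect / se) on (1 / se).
    Returns the intercept and its standard error. Requires at least three
    studies with differing standard errors."""
    x = [1 / se for se in ses]
    y = [e / se for e, se in zip(effects, ses)]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = math.sqrt(resid_ss / (n - 2) * (1 / n + mx ** 2 / sxx))
    return intercept, se_intercept

# Hypothetical log risk ratios constructed so that smaller studies (larger SE)
# show systematically larger effects: effect = -0.1 + 0.5 * se
ses = [0.1, 0.2, 0.3, 0.4]
effects = [-0.1 + 0.5 * se for se in ses]
intercept, se_intercept = egger_intercept(effects, ses)
```

The test statistic is the intercept divided by its standard error, referred to a t distribution with n − 2 degrees of freedom. In this constructed example the intercept recovers the built-in asymmetry of 0.5.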

Meta-Analysis Results

The 21 included studies generated 28 pregnancy RR study-level IOPTs, which fell into one of the three follow-up periods (12 < 13 months, 6 13–23 months, and 10 ≥ 24 months). The studies also provided 22, 15, and 11 computable study-level RRs for “no sexual initiation,” “no condom use,” and “no OCP use,” respectively. Our results are presented in forest plots in Fig. 2 for pregnancy risk and in Appendix F for secondary outcomes. Results stratified by selected variables are presented in Table 2 for pregnancy risk and in Appendix G for secondary outcomes. Meta-analysis results with statistically significant pooled findings are summarized in Table 3, and discussed below.

Fig. 2 Forest plot of meta-analysis results for pregnancy risk ratio using random-effects model

Table 2 Meta-analysis results of pregnancy incidence, stratified by selected variables
Table 3 Statistically significant pooled relative risk and 95% CI for pregnancy and secondary outcomes

Primary Outcome—Pregnancy

Stratification by Follow-Up Period Only

Our analysis yielded an RR for pregnancy of 0.82 (95% CI 0.63–1.29), 1.3 (95% CI 1.02–1.65), and 0.96 (95% CI 0.81–1.13), for < 13, 13–23, and ≥ 24 months of follow-up, respectively (Fig. 2). The result for 13–23 months thus just crossed the threshold of statistical significance for increased risk of pregnancy. Individual study IOPTs with statistically significant results were Allen et al. (1994, 1997) for < 13-month follow-up (reduced pregnancy risk), Kirby et al. (1997b) for 13–23 months (increased risk), and Hawkins et al. (1999) (full intervention) for ≥ 24 months of follow-up (reduced risk).

Stratification by Follow-Up Period and by Study and Program Features

Table 2 shows the 36 pooled results for the three follow-up periods and seven intervention or study features that could plausibly be correlated with program outcomes. (Five pooled results appear two or three times in Table 2; thus, there are only 30 unique results.) Broadly, the pooled outcomes did not show a statistically significant effect on pregnancy risk, and this is true for both the RCTs and non-RCTs considered separately. The six pooled results stratified by follow-up period and intervention or study features with statistically significant estimates include two with increased pregnancy risk and four with decreased risk (Tables 2 and 3).

Secondary Outcomes—Sexual Initiation, Condom Use, and OCP Use

Sexual Initiation

One of the pooled results showed a statistically significant outcome when stratified by the three follow-up periods only (Appendix F). At < 13-month follow-up, the pooled risk ratio was 0.87 (95% CI 0.78–0.97); however, this result was not apparent in the other two follow-up periods, 0.99 (95% CI 0.88–1.10) and 0.95 (95% CI 0.90–1.01) for 13–23 and ≥ 24 months, respectively. Six of 21 unique study or program-type pooled comparisons were statistically significant, with point estimate RRs between 0.80 (95% CI 0.66–0.99) and 0.93 (95% CI 0.88–0.98) (Table 3 and Appendix G).

Condom Use

When stratified by the three follow-up periods only, the pooled risk ratio for the < 13-month follow-up period showed a statistically significant effect, 0.84 (95% CI 0.75–0.95); however, the studies for which we could calculate a risk ratio for the 13–23-month period showed no statistically significant benefit, 1.04 (95% CI 0.92–1.18) (Appendix F). Of 19 unique pooled results by study or intervention characteristics, four were statistically significant and showed decreased risk, ranging from RR 0.79 (95% CI 0.62–0.95) to 0.86 (95% CI 0.75–0.98) (Table 3 and Appendix G).

OCP Use

None of the pooled results showed a statistically significant outcome when stratified by the three follow-up periods only (Appendix F). Of nine unique pooled results by study or intervention characteristics, only one was statistically significant, with increased risk, RR 1.12 (95% CI 1.02–1.22) (Table 3 and Appendix G).

Quality of the Evidence: GRADE Results

Overall, low to very-low-quality evidence suggests that school-based pregnancy prevention programs have no effect in reducing pregnancy rates in adolescents in the USA. The GRADE analysis is detailed in Appendix H, and summarized below. In evidence from RCTs, four trials contributing very-low-quality evidence found no difference in reported pregnancies at times ranging from 5 to 12 months (Allen et al. 1997; Coyle et al. 2006; Handler 1987; Kirby et al. 1997a). Evidence quality for this outcome was graded down for very serious risk of bias (among other issues, no trial was blinded and randomization methods were poor), serious inconsistency (wide range in point estimates, no trial achieved statistical significance), and serious imprecision (few outcome events). In six trials with longer follow-up, negative findings and evidence quality were similar. The longest trials had a low (not very low) evidence rating. Quality of the evidence from the non-RCTs was similar. Four studies provide very-low-quality evidence for no effect at 6 to 12 months (Allen et al. 1994; Howard and McCabe 1990; Kirby et al. 1991; Smith et al. 2000). Evidence quality was graded down for serious risk of bias and serious imprecision. In one study assessing outcomes at 18 months, there was also no difference in pregnancies (Kirby et al. 1991).

Discussion

We undertook a systematic review and meta-analysis of assessments of the specific effect of school-based programs in the USA to reduce pregnancy in adolescents among programs that measured pregnancy as an outcome. No such review has previously been published. Broadly, we found insufficient evidence to conclude that the studied programs were effective in reducing pregnancy, the primary study outcome. For one of the three follow-up periods into which results were stratified, we report a statistically significant increase in pregnancy risk. We also saw no consistent evidence of increasing condom or OCP use, or delaying sexual initiation, our secondary outcomes. However, there were statistically significant decreases in sexual initiation and lack of condom use for one of the three follow-up time strata, < 13 months. Because the literature includes varied study designs, intervention approaches, and populations, we conducted seven subgroup analyses on variables that might affect outcomes. None provided consistent evidence of effectiveness: For pregnancy, the majority of these subgroup analyses yielded risk reduction ratios which were not statistically significant. Of those that were statistically significant, four were in the direction of decreased risk and two indicated an increased risk of pregnancy. Regarding the secondary outcomes, the majority of the pooled risk reduction ratios were not statistically significant. The six that were statistically significant for sexual initiation showed a reduced risk of sexual initiation as did the four for no condom use. However, the one statistically significant subgroup analysis for OCP use showed an increased risk of no OCP use.

Our findings are consistent with other systematic reviews that have examined the effectiveness of programs aimed at preventing teen pregnancy and found no statistically significant effect on pregnancy (DiCenso et al. 2002; Underhill et al. 2007; Mason-Jones et al. 2016; Scher et al. 2006). Oringanje et al. (2016) reviewed 53 RCTs from low- and middle-income countries that included school-based and community-based interventions. They found that interventions with multiple components (educational and contraceptive promoting) had a significant effect in preventing pregnancy. Subgroup analysis by educational interventions alone and by cluster RCTs showed no significant effect in preventing pregnancy. None of the four effective interventions were school-based, the modality of interest for our review. Other reviews that have found reduced pregnancy risk have relied on studies of poor quality and included community-based programs (Chin et al. 2012). Our findings are consistent with those of a companion article in this issue of Prevention Science by Mirzazadeh et al., which examined the effect of school-based programs to prevent HIV and other STIs in teens. That review found no consistent reductions in disease incidence. Since the risk behaviors for STI transmission and pregnancy are similar, the findings of the two papers tend to be mutually affirming.

Our review included abstinence-only interventions despite earlier reviews suggesting lack of effectiveness (Chin et al. 2012; Underhill et al. 2007). We did so because of the importance of this issue, the fact that their effectiveness remains contested (Weed 2012), and the possibility that recently published studies could suggest a different result. Our meta-analysis affirms earlier findings and did not include new studies on abstinence-only programs. However, that we also found no pattern of effectiveness in the comprehensive programs suggests that reasons for lack of benefit extend beyond the nature of the curriculum. Unfortunately, the four pooled RR results that showed statistically significant reductions in pregnancy, from a total of 30 unique pooled comparisons, are too few to test hypotheses regarding the correlates of program effectiveness. Similarly, the four individual study (unpooled) IOPTs that indicated a statistically significant decrease in pregnancy evinced no particular pattern of intervention design. All four were comprehensive rather than abstinence-only and were adult-led. One was implemented in a mixed setting (i.e., school-based program that included activities in the community or in which community members visited the school) (Allen et al. 1997), and two were strictly school-based (Allen et al. 1994; Hawkins et al. 1999) (Appendix D). Finally, 3 of the 21 studies we evaluated were published in 2016, and none of these showed statistically significant reductions in pregnancy risk. This limited evidence does not support a hypothesis that recent improvements in program design or implementation make for greater efficacy. Therefore, an important unanswered question is, “What are the determinants of effectiveness in school-based pregnancy prevention programs?”

Some investigators have questioned the premise that teen pregnancy is a cause of poor health and economic outcomes (Melissa and Levine 2012; Schalet et al. 2014; Sisson 2012). They suggest instead that, all else equal, poor life prospects increase pregnancy risks. If this is true, it would help explain why a program lasting only a few hours might not have a marked effect on pregnancy rates.

While teen pregnancy prevention programs aim to improve a range of outcomes, the focus of this study was on pregnancy and three of its proximate causal predicates. Despite discouraging findings based on limited data, there may be specific intervention approaches that are effective. Future research may identify effective behavior change models or may establish, for example, the efficacy of programs that begin earlier or extend over many grades. Continued evaluation, including well-powered, rigorous studies that minimize risk of bias, is needed to identify what types of school-based programs can reduce adolescent pregnancy rates.

Limitations

Our finding of no consistent pattern of statistically significant effectiveness in reducing the risk represented by the secondary outcomes (sexual initiation, no condom, and OCP use) should be treated with caution. Our review was restricted to studies that reported on pregnancy and excluded those that measured secondary outcomes only. We may therefore have analyzed a biased sample of studies; interventions with studies that report on pregnancy may be systematically different from those that do not. Although we believe that the current analysis is comprehensive, confidence in substantive conclusions must be tempered by the poor quality of available evidence. The nature of school-based programs renders blinding impossible and true randomization very difficult or impossible. Imprecision is also inevitable with rare events such as pregnancy, and a certain amount of contamination and cross-over must be expected in an uncontrolled setting such as schools, where students may be transferred or may move for any number of reasons. Thus, the low quality of evidence rating should be understood in comparison with more easily controlled clinical research. Beyond these inherent difficulties, many studies failed to adjust for confounding or had high loss to follow-up. It is therefore conceivable that more rigorous studies might have yielded different and more positive results. As in all systematic reviews, we only evaluated studies that met inclusion criteria. Broader criteria, such as acceptance of earlier studies, might have yielded a different result. Furthermore, while we made every attempt to search comprehensively, it is possible that we missed high-quality studies that found better effectiveness. Finally, our classification of programs requires judgments about which informed reviewers can disagree. However, while a different classification of IOPTs would affect a subset of the calculated pooled RRs and confidence intervals, the basic finding of no consistent evidence of reduced pregnancy would be unaffected by any such re-classification.

Conclusion

This review is the first to assess the effectiveness of school-based interventions in reducing pregnancy in the USA. The data from included studies provide no consistent evidence that evaluated programs were effective in reducing pregnancy or in improving results in the secondary outcomes analyzed. Our study was not designed to identify specific approaches that may be effective. There were too few studies of any particular approach, such as service learning, peer-led interventions, and approaches based on cognitive behavioral theory and social learning theory, to identify the relative effectiveness of these or other approaches. Nor were we able to assess the relative effectiveness of programs that begin earlier and may extend over many grades versus those that start later. Continued evaluation is needed to identify what specific types of school-based interventions can successfully reduce youth pregnancy rates.