Adolescence is a developmental period characterized by both opportunity and risk. The cognitive, physical, social, and emotional advances observed in this decade are important foundations for adult responsibility and independence. At the same time, emotional and behavioral problems increase and represent the primary cause of adolescent mortality and morbidity. Notably, the dramatic neurological changes that occur create a window of vulnerability and are associated with the onset of major mental illnesses such as anxiety, depression, and substance abuse (Dahl 2004; Greenberg and Lippold 2013). Experience shapes adolescents’ sensitive neural systems and may affect long-term mental health (Casey 2015).

Mindfulness-based programs (MBPs) show some promise for supporting healthy adolescent development and well-being. Defined as programs that train attentional skills in a certain way: with intention, present-moment focus, and acceptance (Kabat-Zinn 1990), MBPs teach and practice an approach to experience that is open, curious, and non-reactive.

Systematic reviews have concluded that MBPs are feasible within school settings and probably efficacious for anxiety, depression, and stress reduction (Black and Slavich 2016; Felver et al. 2016). Maynard et al. (2017) concluded that there are mixed effects of MBPs in schools, with some indication that MBPs can improve cognitive and socio-emotional outcomes but no support for improvement in behavior or academic achievement. They note that despite the growing support of MBPs for adults, youth may not benefit in the same ways or to the same extent as adults. Another meta-analysis of school-based studies (Carsley et al. 2018) found greater effects for older vs. younger adolescents and overall effects for well-being were found only for MBPs delivered by trained teachers rather than outside facilitators. However, there is a dearth of research on effects of MBPs implemented by actual classroom teachers in real-world settings.

Researchers have identified several competencies necessary for the effective delivery of MBPs including adequate coverage, pacing and organization of curriculum, interpersonal relational skills, skillful guiding of formal meditation practices, effective interactive inquiry, group dialogue and didactic teaching, and management of the learning environment. One school of thought suggests that effective mindfulness teachers share a common commitment to mindfulness through regular daily practice, which over time, manifests itself as “embodiment” or the authentic expression of the skills and dispositions necessary for effective mindfulness instruction (Crane et al. 2016; Roeser 2016). A key challenge to the portability and sustainability of MBPs is whether classroom teachers can be adequately trained to deliver such curricula in the context of typical classroom settings and constraints.

Despite the challenges involved, adolescents may be particularly receptive to the benefits of mindfulness training. Current evidence suggests that the plasticity of the adolescent brain makes this a sensitive period of development, maximally receptive to environmental inputs (Blakemore and Mills 2014). The prefrontal cortex (PFC), which integrates emotional and contextual information and has important regulatory functions, is being fine-tuned, with neuromaturation continuing into adulthood (Caballero et al. 2016). Subcortical brain regions related to mood, threat assessment (Davidson 2000), and reward-seeking (e.g., nucleus accumbens and ventral striatum; Gottfried 2011) also continue to develop and increase in reactivity. One frequently cited implication of these protracted developmental processes is that the executive and regulatory capacities of the adolescent brain have great difficulty overriding strong emotionality and reward-seeking, a model of adolescent brain development known as a “developmental mismatch” or “maturational imbalance” (Casey 2015; Somerville et al. 2010).

Experimental evidence shows that, compared to children and adults, adolescents display greater amygdala arousal (Hare et al. 2008), report more daily experience of negative affect from age 10 to 18 (Larson et al. 2002), are more inclined to maintain negative affect (Riediger et al. 2009), and are more likely to act impulsively when they perceive cues as threatening (Dreyfuss et al. 2014). Adolescents, compared to other age groups, are also less effective at maintaining attention on task in the presence of emotional stimuli, particularly when the stimuli are distressing (Cohen-Gilbert and Thomas 2013). The executive functions (EFs) of shifting and monitoring of attention, and the planning, initiating, and carrying out of goal-directed behavior are still developing (Casey 2015). Consequently, lower levels of top-down regulation (EFs) in the context of heightened emotionality create an imbalance that can predispose adolescents to risky behavior and maladaptive choices.

Emotion regulation (ER), defined as processes used to moderate affective experiences in order to meet situational demands, includes modulating arousal, inhibiting automatic responses, persisting during stressful activities, and delaying gratification (Gross and Thompson 2007). Not surprisingly, difficulties in ER represent a core feature of many emotional and behavioral symptoms and disorders that emerge in adolescence (Powers and Casey 2015).

Although EF and ER are considered separate constructs, it is important to recognize that brain circuitry related to both emotion and executive processing is completely intertwined and that cognitive and affective functions are not dissociable. Poorly controlled emotions can directly impair information processing and other goal-directed EFs (Best et al. 2009) and degrade academic performance (Rothbart et al. 2007). Similarly, poor cognitive control limits the power to regulate emotions and may increase rumination, increasing risk for internalizing disorders in adolescents (Snyder and Hankin 2016).

Cognitive control is central to theories of ER. Strategies such as selecting and modifying situations, anticipating and planning outcomes, attention management, and cognitive reappraisal rely heavily on EF capacities. While top-down regulatory strategies are typically available to adults, they may be far less so for adolescents. Adolescents’ heightened emotionality and contextual sensitivity may compromise EFs and make strategy selection and implementation more difficult. Furthermore, adolescents’ ability to identify and differentiate negative emotions (angry, disgusted, sad, scared, and upset) is at the low point of a U-shaped developmental curve (Nook et al. 2018). Therefore, intervention approaches for adolescents need to take the normative developmental conditions of high arousal, greater negative affect, underdeveloped EF skills, and context-sensitivity into account.

The fundamental practice of mindfulness involves paying attention, either to a specific focus of attention or to the array of phenomenal experience, with an attitude of curiosity, nonjudgment, and acceptance (Kabat-Zinn 1990). Awareness and non-reactive experiencing of thoughts, emotions, and sensations, without suppression or avoidance, may minimize the risk of behavioral consequences like rumination or acting out (Chambers et al. 2009). Mindfulness practice exercises both executive (maintaining and shifting attention, inhibiting distractions) and affective (strengthening limbic and prefrontal regulatory circuitry) systems (Tang et al. 2015). Iterative reprocessing of information, as is done by reflectively returning attention again and again to a certain object of focus, promotes cognitive control and flexibility (Zelazo 2015). Therefore, mindfulness practice may be a promising means of scaffolding regulation and reflection during a period of rapid neurobiological change.

Learning to BREATHE (L2B) is an MBP for adolescents designed to strengthen ER and EF (Broderick 2013; see Fig. 1). The core curriculum provides developmentally adapted training in several core practices of mindfulness-based stress reduction (MBSR; Kabat-Zinn 1990) including body scan, awareness of thoughts and feelings, mindful movement, and loving-kindness/compassion practices.

Fig. 1 Learning to Breathe (L2B) logic model

L2B outcomes on student well-being and learning have been explored in several studies. Early quasi-experimental studies showed reduced negative affect and increased well-being in 12th grade private school girls following L2B (Broderick and Metz 2009) and, in public high school students, reduced stress and somatic complaints and higher perceived emotion regulation efficacy compared to controls (Metz et al. 2013). Three pilot randomized controlled studies have expanded this work. Ethnically diverse alternative high school students assigned to L2B (total sample was 27) demonstrated lower levels of depression than students assigned to a substance abuse prevention class, but no significant effects on anxiety, mindfulness, or perceived stress (Bluth et al. 2016). Fung et al. (2019) found improvements in stress, internalizing problems, and emotion regulation for predominantly Asian-American and Latino-American 9th graders with elevated mood symptoms after L2B and at 3-month follow-up. L2B participation did not lead to improvements in externalizing problems, attention problems, or expressive suppression. In a study involving at-risk high school students (Felver et al. 2018), levels of resilience in the L2B group were maintained over time compared to controls, but there were no intervention effects on self-reported problem behavior, school attendance, or quarterly academic grades.

Although the results of these studies are promising, most outcomes are based on self-report measures and L2B has yet to be examined in a large independent trial with classroom teachers as instructors. The goal of the present study was to assess the effectiveness of L2B, delivered by trained high school teachers, in authentic educational settings using a quasi-experimental control-group trial design. Furthermore, we utilized both student self-report measures and direct assessment of EF outcomes at post-test. We hypothesized that consistent with the L2B logic model, L2B would significantly reduce adolescents’ perceived stress, decrease their symptoms of anxiety and depression, improve their EFs, and enhance their well-being.

Method

Participants

Informed consent was obtained from all participants. As shown in the CONSORT diagram (Fig. 2), letters of consent were sent to parents of 260 students recruited from the classrooms of four teachers (two intervention and two control). Teachers taught multiple independent course sections (intervention = 6 classes, and control = 5 classes). Our study design blocked on school, and therefore, there was one intervention teacher and one control teacher nested within each school. During recruitment, five parents opted out of participation, and 255 students were randomized to L2B or business-as-usual control conditions. A total of 131 students were enrolled in six classrooms of the teachers assigned to implement L2B (intervention group), and 124 students were enrolled in the five classrooms of the control teachers (control group). Of the 255 randomized students, four were absent at both pre- and post-data collection; thus, the final analytic sample was 251 students. The racial/ethnic composition was diverse: 50% White, 16% Black, 9% Hispanic, and 5.6% Asian. About 57% were male, and 23% received free lunch. The mean age was 16 years. About 55% resided in two-biological-parent families, 19% with stepparents, and 19.5% with single parents.

Fig. 2 CONSORT diagram of participant flow through the phases of the study

Procedure

The study took place in two suburban high schools in the Northeast US. L2B was implemented during required health education classes in 11th grade. Two teachers volunteered to be trained in L2B, and two other health teachers and their classrooms served as a business-as-usual control. After blocking on school, students were randomly assigned to one of 11 classrooms by administrators (intervention classrooms n = 6, control classrooms n = 5) taught by these four health teachers. Students assigned to control were exposed to the approved high school health curriculum, which included units related to mental health (e.g., dimensions of wellness, disorders), social health (e.g., healthy and abusive relationships, bullying), human sexuality (e.g., human reproduction, contraception, sexually transmitted infections), substance abuse (e.g., effects of alcohol, tobacco, and other drugs), and nutrition and fitness (e.g., principles of nutrition, eating disorders, and body image).

Intervention

The L2B program is organized according to a scope and sequence of content units built around the “BREATHE” acronym (body, reflections, emotions, attention) and, in the current study, was delivered in 12 sessions (two sessions for each unit). Table 1 provides a brief overview of each unit theme and associated procedures, and Table 2 provides an overview of core intervention components including training and discussion topics, experiential activities, specific mindfulness-based practices, and homework practices. The first unit provides an introduction to mindfulness and practice in somatic awareness. The second unit includes activities to understand and identify automatic self-talk and to use mindfulness-based strategies to approach it. The third unit explores how emotions affect thoughts and somatic sensations. The fourth unit focuses on understanding stress and stress reactivity. The fifth unit focuses on practices designed to cultivate compassion towards the self and others through loving-kindness practice. The sixth unit includes a wrap-up of prior sessions and discussion of how to integrate mindfulness into one’s daily life. Each L2B lesson follows a predictable format including a short introduction, activities for group participation and discussion to engage students in the lesson, and in-class mindfulness practice. Workbooks and CDs for home mindfulness practice are provided as a supplement. L2B content includes the core practices of MBSR, including body scan, awareness of thoughts and feelings, mindful movement, and loving-kindness practice, developmentally adapted for adolescents.

Table 1 Overview of L2B curriculum sessions and procedures
Table 2 L2B core intervention components

Teacher Training

Prior to training teachers in L2B, the program developer and on-site coach met with intervention-condition teachers for four weekly individual training sessions (6 h total) to orient teachers to the concept of mindfulness and help them to establish a personal practice prior to teaching L2B. The sessions included practice in mindfulness strategies, weekly practice assignments, and journal writing to promote a personal practice. Subsequently, teachers attended a 2-day training (14 h total), led by the program developer, on implementing L2B curriculum with fidelity in their classrooms. The L2B training included practice teaching each lesson segment in small groups and sharing peer feedback, and discussion of the pedagogy of embodied mindful teaching. Materials were shared to assist teachers to recognize key transitions in each lesson and to balance use of time between activities, discussion, and mindfulness practice.

Coaching

During L2B implementation, five weekly coaching calls (60 min) occurred during the 12-lesson sequence over a 6-week period. Prior to the weekly calls, coaches (program developer and second coach who had several years of prior experience in delivering mindfulness-based training and support to teachers in school settings) viewed lesson videotapes. Coaching sessions focused on L2B implementation, including lesson content and delivery, inquiry and experiential practices, interpersonal relational skills, guidance of mindfulness practices, managing and holding the space for student and teacher experiences, and embodying the qualities and characteristics associated with mindfulness (Crane et al. 2016). In addition to calls, coaches made one classroom visit with each teacher during the study.

Thus, teachers were given background practice in mindfulness, and received a workshop on implementation of L2B and on-site and phone coaching support during implementation. One intervention group teacher had prior experience and training in MBP strategies for adults and had a regular meditation practice (> once per week) that was established for over a year. The remaining intervention and control group teachers did not have prior experience or training in MBPs and did not meditate on a regular basis prior to starting the L2B program.

Fidelity

Intervention fidelity (i.e., adherence) to manualized L2B lesson components was assessed by independent coders (n = 7) who were randomly assigned to code each lesson for fidelity. The intervention fidelity coding measure and manual were created by the program developer and included a point-by-point listing of all required actions and activities for each individual intervention lesson. All coders independently completed a partial-day training on fidelity coding procedures led by the intervention developer and were required to achieve 80% or better reliability with master codes on training videos before coding study videos. All videos were coded by at least two coders. Average inter-coder agreement was 92%, and disagreements were resolved by a third independent coder (study PI) in consultation with the program developer. The overall fidelity of implementation across sessions and teachers was 78.60%.

Measures

Measures were selected based on previous research and the L2B logic model which hypothesizes that L2B has direct immediate effects on measures of mindfulness, EF and ER skills, and student mental and physical health. Students completed a 30-min online battery of self-report measures and a 30-min battery of computer-administered EF measures. Pre-test assessments were administered 1 week prior to the intervention in the fall semester. Post-test assessments were completed 1 week after the program ended at the end of the fall semester. Scale scores were computed for respondents providing at least 80% of the items. Cronbach alphas were used to assess measure reliability at baseline.
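The scoring rule and reliability index described above might be sketched as follows. The text does not say how scale scores were formed when some items were missing, so prorating from the answered-item mean is an assumption, as are the function names. Cronbach's alpha is computed with the standard formula on complete cases.

```python
from statistics import pvariance


def scale_score(responses, min_fraction=0.8):
    """Sum score for one respondent; None marks an unanswered item.

    Returns None when fewer than min_fraction of the items were answered.
    Assumption: missing items are prorated from the answered-item mean.
    """
    answered = [r for r in responses if r is not None]
    if len(answered) < min_fraction * len(responses):
        return None
    return sum(answered) / len(answered) * len(responses)


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one complete list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

For example, a respondent who answered four of five items meets the 80% threshold and receives a prorated sum, whereas one who answered three of five receives no scale score.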

Mindfulness

Mindfulness was measured using the Child and Adolescent Mindfulness Measure (CAMM; Greco et al. 2011). The CAMM is a 10-item measure to assess mindfulness skills in children and adolescents. The CAMM asks respondents to rate on a 5-point scale (1 = never true to 5 = always true) the frequency with which they experience feelings, thoughts, and behaviors that reflect a lack of mindfulness (e.g., I think about things that happened in the past instead of thinking about things that are happening right now). Items were summed and higher scores represent greater mindfulness (α = .89).

Self-Compassion

Self-compassion was measured with the Self-Compassion Scale - Short Form (SCS-SF; Raes et al. 2011). The SCS-SF is a 12-item measure designed to assess the ability to be compassionate to oneself. The SCS-SF asks respondents to rate on a 5-point scale (1 = almost never to 5 = almost always) the frequency with which they behave in a manner that reflects self-compassion (e.g., When I’m going through a very hard time, I give myself the caring and tenderness I need.). The SCS-SF includes six dimensions of self-compassion: self-kindness, self-judgment, common humanity, isolation, mindfulness, and over-identification. Items are summed to create a composite in which higher scores reflect greater self-compassion (α = .77).

Emotion Regulation

Emotion regulation was measured with the Difficulties in Emotion Regulation Scale (DERS; Gratz and Roemer 2004). The DERS is a 41-item questionnaire designed to assess five dimensions of emotion regulation difficulties: emotional awareness, emotional clarity, impulse control, access to emotion regulation strategies, and engaging in goal-directed behaviors. The DERS asks respondents to rate on a 5-point scale (1 = almost never to 5 = almost always) the frequency with which they experience emotion regulation difficulties (e.g., When I’m upset, I have difficulty getting work done). Items are summed to form subscales and an overall composite, with higher scores reflecting greater difficulties in emotion regulation. Coefficient alpha for the composite scale was .86, and subscales ranged from .50 to .87. The impulse control difficulties (α = .51) and limited access to emotion regulation strategies (α = .65) subscales had internal consistencies less than .80 and should be interpreted with caution.

Depression

Depressive symptoms were measured with the Patient Health Questionnaire (PHQ-8; Kroenke et al. 2009). The PHQ-8 is an 8-item measure which asks respondents to rate on a 5-point scale (0 = not at all to 4 = nearly every day) the frequency with which they experience depression-related symptoms (e.g., Little interest or pleasure in doing things). Items were summed, with higher scores indicating higher levels of depression (α = .83).

Anxiety

Anxiety-related symptoms were measured using the Generalized Anxiety Disorder Scale (GAD-7; Spitzer et al. 2006). The GAD-7 asks respondents to report the frequency with which they experience generalized anxiety symptoms (e.g., Trouble relaxing) on a 4-point scale (0 = not at all to 4 = nearly every day). Items were summed, with higher scores indicating higher levels of anxiety (α = .89).

Rumination

Rumination was measured with the Rumination subscale of the Rumination and Reflection Questionnaire (RRQ; Trapnell and Campbell 1999). The RRQ rumination items are rated on a 5-point Likert scale (1 = strongly disagree to 5 = strongly agree). A priori, 9 of 12 rumination items (e.g., dwell over things, thinking back over embarrassing moments, reevaluating) were chosen that were most closely aligned to the L2B logic model. Items were summed with higher scores indicating higher levels of rumination (α = .85).

Stress

Stress was measured with two subscales of the Adolescent Stress Questionnaire (ASQ; Caballero et al. 2016), which measures the level of distress experienced in relation to common sources of adolescent stress. We used the subset of the ASQ items assessing two dimensions of adolescent stress: stress of school performance (6 items) and stress of peer pressure (5 items). The items are rated on a 5-point Likert scale (1 = not at all stressful to 5 = very stressful) and summed for each dimension (α > .90 for both scales).

Somatization

Somatization was measured with the short-form of the Children’s Somatization Inventory (CSI; Walker et al. 2008) which consists of 13 items that assess the frequency of stress-related physical symptoms (e.g., headaches, nausea) on a scale from 1 (very slightly) to 5 (almost always). Items were summed to create a total scale score (α = .92).

Sleep

The Adolescent Sleep-Wake Scale (ASWS; LeBourgeois et al. 2005) assesses overall sleep quality, including falling asleep (3 items), maintaining sleep (6 items), and reinstating sleep (3 items), during the last month using a six-point, Likert-type scale (1 = always to 6 = never), with higher scores representing better sleep quality (α = .86).

Social Connectedness

The Social Connectedness Scale-Revised (SCC-R; Lee et al. 2001) measures the extent to which individuals feel socially connected to individuals and social groups. Items are rated on a six-point scale (1 = strongly disagree to 6 = strongly agree) with higher scores representing greater social connectedness. From the original 20-item scale, we utilized 12 items with the highest factor loadings to form a single composite score (α = .90).

Mind Wandering

The Mind Wandering Questionnaire (MWQ; Mrazek et al. 2013) is a five-item scale measuring the frequency of interruption of task focus by unrelated thoughts. Items are measured on a six-point scale (α = .91).

Growth Mindset

Growth mindset was measured with the Implicit Theories of Intelligence Scale for Children (IT; Dweck 1999). The IT assesses individual beliefs regarding the fixed vs. malleable nature of their intelligence (e.g., You can always greatly change how intelligent you are). Items are rated on a six-point scale with higher scores indicating endorsement of an incremental theory of intelligence (α = .80).

Substance Use

Substance use was assessed using the Substance Initiation Index (Spoth et al. 2007). Gateway substance use was measured by summing the responses to six binary items asking about lifetime experience of having a drink of alcohol, drinking more than a few sips of alcohol, having been drunk, smoking a cigarette, smoking marijuana or hash, and sniffing glue or gas. Illicit substance use was measured by summing the ratings of four binary items asking about lifetime use of methamphetamine, ecstasy, drugs or medications prescribed to someone else, and other drugs not prescribed by a doctor including Vicodin, Percocet, or Oxycontin. Coefficient alphas were .79 and .76 for gateway and illicit substance use, respectively.

Negative Substance Use Consequences

Negative substance use consequences were measured with the Young Adult Alcohol Problems Screening Test (YAAPST; Hurlbut and Sher 1992). The YAAPST includes 12 items which assess the frequency with which individuals have experienced various negative consequences due to substance use (e.g., I have said or done embarrassing things). We modified the original scale to assess the frequency (0 = never to 4 = four or more times) with which these consequences occurred in the past month due to alcohol, nicotine, marijuana, cocaine, inhalants, or other types of drugs (α = .91).

Inhibitory Control and Attention

A modified, computerized version of the Stroop Task (Siegrist 1995; MacLeod 1991) was created in E-Prime 3.0 (Psychology Software Tools 2016, Pittsburgh, PA) and presented on Acer netbooks. For the Stroop task, a series of color words (“red,” “blue,” or “yellow”) were sequentially presented for 2 s in the middle of a black screen. Participants were instructed to respond via button press to each word with the font color that it appeared in, not the name of the word itself. Colored stickers (yellow, blue, red) were placed on keyboard response keys (x, c, and v) to facilitate responding by reducing working memory load. Two trial types were presented: congruent and incongruent. On congruent trials, the word and the font color that it appeared in were the same (e.g., “red” appearing in red font). For incongruent trials, the word and font color were not the same (e.g., “red” appearing in blue font). Incongruent trials required participants to overcome their prepotent inclination to read the name of the word (thus engaging inhibitory control) and instead focus on and respond with the font color (attention). Sixty trials (thirty congruent and thirty incongruent) were randomly mixed and presented during the session. Ten practice trials were presented (prior to the appearance of color words) during which participants were asked to respond via button press with the correct color of a non-word prompt (“XXX”) to familiarize participants with response mode and mitigate practice effects. Outcome variables included reaction times (milliseconds) for congruent and incongruent trials and number of errors for congruent and incongruent trials.
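The trial structure and outcome variables described above can be illustrated with a short sketch. This is not the E-Prime implementation used in the study. The function names, the data layout, and the choice to average reaction times over correct responses only are assumptions made for illustration.

```python
import random
from statistics import mean

COLORS = ["red", "blue", "yellow"]


def build_trials(n_per_type=30, seed=None):
    """Return a shuffled list of (word, font_color, is_congruent) trials."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_per_type):
        word = rng.choice(COLORS)
        trials.append((word, word, True))  # congruent: word matches font color
        font = rng.choice([c for c in COLORS if c != word])
        trials.append((word, font, False))  # incongruent: word and font differ
    rng.shuffle(trials)
    return trials


def score(trials, responses):
    """Summarize mean reaction time (ms) and error count by trial type.

    responses holds one (chosen_color, rt_ms) pair per trial.
    Assumption: mean reaction time is taken over correct responses only.
    """
    summary = {}
    for label, flag in (("congruent", True), ("incongruent", False)):
        pairs = [(t, r) for t, r in zip(trials, responses) if t[2] is flag]
        correct_rts = [r[1] for t, r in pairs if r[0] == t[1]]
        summary[label] = {
            "mean_rt_ms": mean(correct_rts) if correct_rts else None,
            "errors": len(pairs) - len(correct_rts),
        }
    return summary
```

Calling `build_trials()` with the default argument yields the 60 mixed trials described in the text (30 congruent and 30 incongruent).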

Risk Taking

A modified version of the Balloon Analogue Risk Task (BART; Lejuez et al. 2002) coded in E-Prime (Psychology Software Tools, Pittsburgh, PA) was also administered to assess risk-taking propensity. During this task, participants were asked to inflate a virtual balloon (via keyboard button press) for potential monetary reward. Participants had the opportunity after each pump to stop and “bank” the earnings and move on to the next trial or continue to inflate the balloon. With each successive pump, the potential amount earned increased; however, so did the risk of the balloon popping. We used parameters for the BART as described in White et al. (2008): 30 balloons per run with an average break point of 64 out of a possible 128 pumps. For each pump, participants earned a “virtual” $10. No monetary reward was actually delivered to the students. To incentivize performance, however, students earning the highest 3 total dollar amounts in a given classroom earned a tangible reward (i.e., assortment of university sports bags, travel mugs, candy). Standard outcome variables for the BART were collected, including the adjusted mean number of pumps (the average number of pumps on trials when the balloon did not explode), mean total monetary reward accrued, and mean number of balloon explosions (White et al. 2008).
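The per-participant BART outcomes can be computed as in the following sketch. The function name and data layout are hypothetical. The sketch also assumes, as in the standard BART, that earnings on a balloon are lost when it explodes, which the text does not state explicitly.

```python
from statistics import mean


def bart_summary(balloons, dollars_per_pump=10):
    """Summarize one participant's run.

    balloons holds one (pumps, exploded) tuple per balloon.
    Assumption: an exploded balloon earns nothing.
    """
    banked = [pumps for pumps, exploded in balloons if not exploded]
    return {
        # Mean pumps on balloons that did not explode.
        "adjusted_mean_pumps": mean(banked) if banked else None,
        "total_earned": dollars_per_pump * sum(banked),
        "explosions": sum(1 for _, exploded in balloons if exploded),
    }


# Hypothetical three-balloon run (not study data).
example = bart_summary([(10, False), (40, True), (20, False)])
```

Group means of these per-participant values would correspond to the outcome variables listed above.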

Working Memory, Attention, and Emotion Regulation

A modified Emotional Faces N-back Task (EFN-back), written in E-Prime (Psychology Software Tools, Pittsburgh, PA), was also administered as a means to assess working memory, attention, and emotion regulation (Ladouceur et al. 2005). During EFN-back administration, participants viewed a series of letters presented sequentially in the middle of the screen. Students were asked to respond via button press (spacebar) when the letter matched either a designated target letter (0-back condition) or matched the letter displayed 2 letters previous (2-back condition). On designated blocks, participants were presented with the series of letters flanked by pictures of emotional faces. Participants viewed eight blocks of trials flanked by emotional (happy, angry, or neutral) faces, as well as a block of trials with no faces present. Each block contained 12 trials, for a total of 96 trials (88 trials with faces). Outcome variables included the proportion of hits (correctly pressing the space bar when target letter appears) and false alarms (erroneously pressing the space bar in response to a non-target letter) across memory load (0-back, 2-back) and emotional valence conditions.
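The hit and false-alarm proportions can be illustrated for a single block as follows. This is a minimal sketch with a hypothetical function name and data layout. It ignores the flanking faces, which in practice would only determine how blocks are grouped by emotional valence before the proportions are averaged.

```python
def nback_rates(letters, presses, n, target=None):
    """Return (hit proportion, false-alarm proportion) for one block.

    letters: the letter shown on each trial.
    presses: True where the participant pressed the space bar.
    n: 0 compares each letter to the fixed target; 2 compares it
       to the letter shown two trials earlier.
    """
    hits = false_alarms = n_targets = n_nontargets = 0
    for i, (letter, pressed) in enumerate(zip(letters, presses)):
        if n == 0:
            is_target = letter == target
        else:
            is_target = i >= n and letter == letters[i - n]
        if is_target:
            n_targets += 1
            hits += pressed
        else:
            n_nontargets += 1
            false_alarms += pressed
    hit_rate = hits / n_targets if n_targets else None
    fa_rate = false_alarms / n_nontargets if n_nontargets else None
    return hit_rate, fa_rate
```

In the 2-back condition, the first two letters of a block can never be targets, so any press on them counts as a false alarm under this scoring.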

Engagement in Practice

At post-test, students in the L2B condition were asked how often they practiced each of the seven program components since the beginning of the L2B program: body scan, three mindful breaths, mindful eating, mindfulness of thoughts, mindfulness of emotions, mindful movement and/or mindful walking, and loving-kindness practice. Each item was rated on an 8-point response scale from never to multiple times per day. The summary score was calculated by averaging over the 7 items (α = .96).

Data Analyses

Main effects were analyzed using linear regression models for normally distributed continuous outcomes and generalized linear models for non-normally distributed count outcomes. All outcome analyses were performed in Mplus 7.2 (Muthén and Muthén 1998-2015). For self-report and neurocognitive/behavioral models and subsequent tests of moderation, only pre-test scores and a dummy variable representing school affiliation were included as covariates. Effect size, d, was calculated by dividing the adjusted group mean difference by the pooled standard deviation. To correct for type I error due to multiple pairwise contrasts, p values were adjusted using a Benjamini-Hochberg correction (Benjamini and Hochberg 1995). As recommended, adjustments were made according to hypothesis families, with corrections made to measures grouped as follows: internalizing symptoms, measures of ER, measures of mindfulness and self-compassion, sleep, and EF sequentially (see Table 3).
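The effect size and the multiple-comparison adjustment described above can be sketched as follows. The text does not specify which standard deviations entered the pooled estimate, so the inputs to `pooled_sd` are an assumption, and all function names are hypothetical. The adjustment follows the standard Benjamini-Hochberg step-up procedure and would be applied separately within each hypothesis family.

```python
from math import sqrt


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def effect_size_d(adjusted_mean_diff, sd1, n1, sd2, n2):
    """Adjusted group mean difference divided by the pooled SD."""
    return adjusted_mean_diff / pooled_sd(sd1, n1, sd2, n2)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An outcome is then declared significant when its adjusted p value falls below the chosen false discovery rate, for example .05.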

Table 3 Student self-report scales by intervention status

Attrition and Missing Data

Total attrition levels were low (n = 3; 1.6%). Examination of intervention by attrition interactions yielded no statistically significant differences on any pre-test variables. All missing data were handled using full-information maximum likelihood estimation (FIML) for each analysis under the assumption that data are missing at random (Little and Rubin 2002).

Baseline Equivalence

Subsequent to assignment, we assessed the baseline equivalence of intervention and control groups and found no significant group differences in demographic characteristics. Only one baseline group difference was found for student self-report; impulse control difficulties were slightly higher in the intervention group, t = −2.41, p < .05. There were few differences in baseline neurocognitive outcomes; mean earnings on the BART task were higher in the control group (t = 2.44, p < .05); in the EFN-back task, the proportion of false alarms was higher in the intervention group for the combination of 0-back and 2-back (t = −3.02, p < .01), for 0-back only (t = −2.19, p < .05), and for 2-back only (t = −2.52, p < .05; see Table 4).

Table 4 Neurocognitive (EF) scales by intervention status

Examining Clustering

As students are nested in classrooms, we examined the intraclass correlation coefficient (ICC) values of each outcome and found that ICC values were in the trivial range (.01 to .05) (Muthen and Satorra 1995). Given the limited number of schools (n = 2), we include a dummy indicator of schools as a fixed effect to hold constant all unobserved characteristics that may vary between schools.

Results

Preliminary Analyses

Descriptive and Distributional Characteristics

Prior to analyses, variables were examined using SPSS 25.0 to confirm approximate normality and homogeneity of variance and to identify outliers or any unusual patterns of missing data at the group or item level. No unusual missing item patterns or distributional problems were detected.

As expected, self-report measures that include dichotomous or frequency counts of substance use were strongly skewed with many zero counts. We assessed over-dispersion by testing a dispersion parameter (α) and comparing goodness of fit between Poisson and negative binomial (NB) models (Atkins and Gallop 2007; Cameron and Trivedi 2013). If the Poisson assumption of equidispersion was violated, we estimated the NB model. We also assessed zero-inflated negative binomial (ZINB) models by examining frequency statistics and comparing model fit between the standard Poisson or NB models and their zero-inflated counterparts. ZINB models assume two different origins of zero counts: one structural and the other due to sampling. As a result, ZINB models consist of two parts, one predicting membership in a zero vs. a nonzero class (i.e., the logistic portion) and the other modeling predicted counts for the nonzero class (i.e., the counts portion). We used likelihood ratio (LR) tests to compare nested models, e.g., Poisson vs. NB models, and relied on the Bayesian information criterion (BIC) to compare non-nested models, e.g., Poisson vs. zero-inflated models (Green 1994; Hilbe 2011). Based on these analyses, we determined the most appropriate model for each of the count outcomes.
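The model-comparison logic can be illustrated with a short Python sketch that takes fitted log-likelihoods as input. It is a simplified illustration under stated assumptions, not the procedure as run in Mplus: the function names and the log-likelihood values are hypothetical, and the p value is the plain one-degree-of-freedom chi-square tail, which tends to be conservative when the tested parameter (here the dispersion α) lies on the boundary of its range.

```python
from math import erfc, log, sqrt


def lr_test_1df(loglik_restricted, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    # Survival function of a chi-square with 1 df, written via erfc
    p_value = erfc(sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate better fit."""
    return -2.0 * loglik + n_params * log(n_obs)


# Hypothetical log-likelihoods for a Poisson model and an NB model
stat, p_value = lr_test_1df(loglik_restricted=-112.0, loglik_full=-100.0)

# Non-nested comparison: choose the model with the smaller BIC
candidates = {"NB": 2179.0, "ZINB": 2164.0}
preferred = min(candidates, key=candidates.get)
```

A significant LR statistic favors the NB model over the Poisson model, while the BIC comparison is used where the models are not nested and the LR test does not apply.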

For gateway substance use, the standard Poisson provided the best fit; the dispersion parameter was not different from zero (α = 0, p > .05). For illicit substance use, the standard NB model was the best model with the LR test favoring the NB over the Poisson (χ2(1) = 23.99, p < .001). For substance use consequences, we found evidence of significant overdispersion (α = 2.34, p < .001) and the LR test favored the NB vs. the Poisson model, χ2(1) = 552.28, p < .001. The comparison of BIC fit statistics indicated stronger statistical support for the ZINB (BIC = 2164) over the standard NB (BIC = 2179). It was also substantively adequate to assume that zero observations in this measure have structural (i.e., those who reported zero because they have not been involved in substance use) and sampling origins (i.e., those who might be substance users but had no consequences).

L2B Impact on Student Self-Report Measures

Table 3 presents the means, standard deviations, and main effect results for self-report scales. On nearly all the self-report measures, we found no significant effect of the intervention. Only three comparisons yielded significant differences. At post-test, intervention students reported significantly lower levels of lack of emotional awareness (t = −2.58, p = 0.01, d = −0.28), an effect which remained significant after adjusting for multiple pairwise contrasts (Adj-p = 0.03). Contrary to our expectations, however, students in the intervention group reported higher levels of rumination (t = 2.35, p = 0.02, d = 0.23) and higher levels of difficulties engaging in goal-directed behavior (t = 3.26, p = 0.01, d = 0.37) at post-test. Although group differences in rumination were no longer significant after adjusting for multiple pairwise contrasts (Adj-p = 0.11), reported group differences related to difficulties engaging in goal-directed behavior remained statistically significant (Adj-p = 0.01).

Analyses of count models revealed no significant main effects of the intervention on substance use outcomes. Although the difference did not reach statistical significance in pre-post group comparisons, illicit substance use was considerably lower among students in the intervention group than among those in the control group (t = −1.30, p = 0.19). The estimated coefficient corresponds to an odds ratio of 0.57, that is, 43% lower illicit substance use in the intervention group, an estimate that should be read cautiously given the non-significant test.
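The conversion from a ratio to a percent change is simple arithmetic, sketched below in Python. The helper name is hypothetical, and the log-scale coefficient is back-calculated from the reported ratio of 0.57 rather than taken from model output. In Poisson and NB count models, exponentiated coefficients are commonly read as rate ratios; the arithmetic is the same either way.

```python
from math import exp, log


def percent_change_from_ratio(ratio):
    """Percent change implied by a ratio: 0.57 -> about -43, 2.51 -> about +151."""
    return (ratio - 1.0) * 100.0


coef = log(0.57)    # log-scale coefficient implied by the reported ratio
ratio = exp(coef)   # exponentiating returns the ratio itself
change = percent_change_from_ratio(ratio)
```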

L2B Impact on Behavioral Measures of EF

Table 4 presents the means, standard deviations, and main effect results for the behavioral assessments of EF. Several statistically significant effects emerged favoring the intervention group. Although no statistically significant effects were found on BART measures, the intervention group outperformed the comparison group on several dimensions of both the Stroop and N-back tasks.

On the Stroop task, reaction times for correctly executed congruent and incongruent trials were lower for intervention students compared to controls with corresponding effect sizes of −0.24 (t = −2.58, p = 0.01) and −0.19 (t = −2.16, p = 0.03), respectively. After adjusting for multiple pairwise contrasts, the reaction time on congruent trials remained significant (Adj-p = 0.04) and the reaction time on incongruent trials was marginally significant (Adj-p = 0.06). There were no significant intervention effects related to the number of errors on the Stroop task.

On the emotional faces N-Back test, no significant effects were observed with respect to face valence (happy, angry, neutral) across groups. Consequently, we collapsed across valence and assessed the effect of memory load (0-back vs. 2-back) across groups. Intervention students showed a significantly lower proportion of false alarms, with corresponding effect sizes of −0.35 for all trials (t = −2.92, Adj-p = 0.02), −0.24 for 0-back trials (t = −1.69, Adj-p = 0.03), and −0.27 for 2-back trials (t = −2.47, Adj-p = 0.15). There were no condition effects on proportion of hits.

Moderating Effects of Program Practice

We conducted exploratory tests of moderation to determine whether different levels of out-of-class program practice (i.e., dosage) impacted outcomes. The literature on dose-response relationships in adolescent MBPs is limited. Therefore, based on theory and in consultation with the program developer, we categorized practice into two discrete levels. Students with practice scale scores lower than 2 were categorized as an inadequate practice group (66% of intervention students), which corresponded to practicing less than once a month. Students with practice scale scores of 2 or higher were categorized as an adequate practice group (34% of intervention students), which corresponded to practicing at least once a month. The moderation analyses added the interaction term between group status and program practice and removed the main effect of program practice, thereby allowing the levels of the outcomes to differ by program practice only among the intervention group. We chose an alpha level of p < .05 for main effects and of p < .10 for interaction effects due to the reduced power to detect interaction effects, as suggested by Aguinis (1995).
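The coding scheme for this moderation model can be sketched as follows. This Python fragment is a hedged illustration of how the predictors might be constructed, not the authors' Mplus specification: the function names, dictionary keys, and 0/1 coding of group and school are assumptions made for the example.

```python
def practice_level(score, cutoff=2.0):
    """1 = adequate practice (score >= cutoff), 0 = inadequate practice."""
    return 1 if score >= cutoff else 0


def design_row(pretest, school, group, practice_score=None):
    """Predictors for: outcome ~ pretest + school + group + group x adequate.

    group: 1 = intervention, 0 = control. Control students have no practice
    score, so their interaction term is always 0. No main effect of practice
    is included, so practice can shift outcomes only within the intervention
    group.
    """
    adequate = 0
    if group == 1 and practice_score is not None:
        adequate = practice_level(practice_score)
    return {
        "pretest": pretest,
        "school": school,
        "group": group,
        "group_x_adequate": group * adequate,
    }


# Hypothetical students (not study data)
adequate_student = design_row(pretest=3.0, school=0, group=1, practice_score=2.5)
inadequate_student = design_row(pretest=3.0, school=0, group=1, practice_score=1.5)
control_student = design_row(pretest=3.0, school=1, group=0)
```

Under this coding, the coefficient on `group` compares inadequate-practice intervention students with controls, and the coefficient on `group_x_adequate` captures the additional difference for students in the adequate practice group.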

Table 5 presents the moderation results of program practice on self-report outcomes. More frequent program practice tended to be associated with better outcomes for a variety of measures. As displayed in Fig. 3, students in the adequate practice group showed significantly lower levels of overall difficulties in ER (d = 0.36), lack of emotional awareness (d = 0.35), lack of emotional clarity (d = 0.32), impulse control difficulties (d = 0.29), and mind-wandering (d = 0.31), and significantly higher levels of social connectedness (d = 0.28). An odds ratio (OR) of 0.75 indicated a 25% reduction in the odds of students engaging in gateway substance use. Although falling just short of the .10 level of significance, there was a similar trend for practice effects for stress of school performance (d = −0.27), self-compassion (d = 0.19), and substance use consequences. To assess whether practice effects were due to pre-existing differences, we compared adequate and inadequate practice groups on all pre-test self-report measures and found fewer differences than would be expected by chance. Thus, practice effects do not appear to be related to pre-test differences in “risk.”

Table 5 Tests of moderation at adequate vs. inadequate practice levels
Fig. 3 Differential intervention impacts by program practice levels, * p < .05, ** p < .01

The analysis of the ZINB model showed a trend for practice effects on substance use consequences for both count and logit portions of the model. The expected number of substance use consequences, among those likely in the non-zero group, was 34% lower for the adequate vs. inadequate practice groups (OR = 0.66). The probability of having no substance use consequences was 2.5 times higher for the adequate vs. inadequate practice groups (OR = 2.51). There were no significant interaction effects with practice for neurocognitive outcomes, indicating that the intervention impacts on neurocognitive outcomes were not significantly different by program practice levels.

Discussion

Our goal was to assess the efficacy of L2B as implemented by typical high school health teachers on measures of youth self-reports of emotion regulation, stress, anxiety and depression, substance use, and indicators of well-being. In addition, we conducted performance-based assessments of neurocognitive abilities using standardized EF measures. Prior studies have found youth exposed to the L2B program report significant improvements in emotion regulation, and concurrent decreases in negative affect, perceived stress, and stress-related somatic symptoms (Metz et al. 2013). Our study sought to extend these findings by conducting an independent study of L2B delivered by teachers in a real-world, ethnically diverse school setting.

Contrary to our expectations, we did not find intervention main effects on most of the adolescent self-report measures. Thus, we were not able to replicate the effects on depression (negative affect), emotion regulation, or perceived stress shown in some previous smaller studies of L2B. However, we did find that students exposed to L2B showed significant improvements on some components of 2 of the 3 measures of EF. Intervention students showed a significantly higher level of selective attention and inhibitory control as reflected by performance on the Stroop and EFN-back tasks, respectively. Across most variants of the Stroop task, slower and/or less accurate responding during the incongruent vs. congruent condition reflects a greater internal processing demand to resolve stimulus/response conflict and inhibit prepotent responding (Wolf et al. 2014). Both intervention and control groups showed this robust effect of greater demand on incongruent trials. However, students in the intervention group, but not the control group, generated significantly faster correct responses during the post-test on both congruent and incongruent trials, controlling for pre-test latencies. These data suggest that students who participated in L2B showed an overall higher level of selective attention to the task regardless of trial type. It may be the case that an increased focus on the task at hand, coupled with decreased attention to distracters (e.g., noise/movement in the room) and promoted by intervention training, lessened the overall cognitive load on students’ interference- and inhibitory control-related brain circuitry, thus improving task performance.

On the emotional faces N-back task, we found no effect of face valence on task performance. When collapsed across valence type to assess the effect of memory load, however, we found that the intervention group generated a lower proportion of false alarms on 0-back trials and a trend (after adjusting for multiple tests) on the 2-back trials. A false alarm on the 2-back task occurs when a participant incorrectly presses the response button (i.e., current letter does not match the letter shown two screens back); it is sometimes referred to as an error of commission. This measure can be interpreted as failure to pay attention to or encode/update the internal representation of the target letter. Intervention participants thus appear to be paying more attention to changes in stimuli characteristics (i.e., the varying letters). Notably, this occurs in the absence of significant changes in correct response rate (i.e., “hits”) or, by extension, the number of misses (inverse of hits). Previous literature suggests that errors of omission and commission represent distinct error types supported by different psychological processes and likely different underlying neural circuitry (Meule 2017). Thus, it may be the case that the intervention was particularly effective for a specific dimension of attentional control related to commission errors. It is also particularly notable that this difference was observed during 2-back trials, as previous work suggests that attention can be disrupted under conditions of higher working memory load (e.g., Judah et al. 2013). Considered together, across the Stroop and EFN-back tasks, the L2B intervention led to small but consistent improvements in tests of selective attention.

In addition, the results of the moderation analyses suggest that beneficial program effects within the treatment group are influenced by the degree to which students report actively engaging in mindfulness practices outside of the classroom setting. Specifically, we found that students who reported higher rates of practice (greater than once per month) showed significant small-to-moderate interaction effects on measures of emotional awareness, emotional clarity, impulse control difficulties, social connectedness, mind-wandering, substance use, stress reduction, and self-compassion. A similar but non-significant trend was found for stress of school performance and self-compassion. On substance use measures, those in the adequate as compared to the inadequate practice group had an odds ratio indicating that they were 2.5 times more likely to report having no substance use-related consequences. Unlike self-report measures, we found no significant interaction effects with practice for EF outcomes, suggesting that direct effects were not conditional on dosage in the same manner as self-reported social-emotional and behavioral outcomes. Although no baseline characteristics predicted practice time, we note that practice time was not experimentally varied and thus these findings should be subject to replication.

In considering the mixed results of this trial in light of prior findings, several design considerations are worth noting. First, this is one of the first studies of mindfulness-based interventions in education in which the intervention was presented by existing teachers rather than highly trained mindfulness experts. Our attempt here was to train and utilize regular classroom teachers who were relatively naïve regarding mindfulness at the onset of the study. The question was whether or not we could provide regular classroom teachers with both the background training in personal mindfulness as well as training in the implementation of L2B for students necessary to impact social-emotional and behavioral outcomes. Given the fact that L2B is a relatively well-structured and manualized curriculum, it represents an ideal program in which to explore these research questions.

We provided extensive supervision with both occasional live observations as well as weekly feedback calls after reviewing videos of previous lessons. However, none of the intervention teachers achieved highly proficient levels of implementation fidelity despite these supports. Although fidelity of implementation was not perfect, it likely provides an “average case” test of probable outcomes when utilizing teachers currently working in a typical American high school setting which has significant implications for scaling potential. Our experience suggests that it is not easy to train high school teachers to deliver this kind of mindfulness intervention without several years of support, and that this represents a key on-going question of implementation for such programs. In addition, detecting change in clinical symptoms in universal populations can be challenging given the vast majority of students participating in such interventions are unlikely to be symptomatic (Greenberg and Abenavoli 2017).

Second, there is little literature to guide program developers regarding the dosage needed to have a significant impact on typical adolescents. Studies are needed that vary both dosage (number of sessions) and density of sessions (number of times per week) to assess how to optimize outcomes. Third, as reported above, most students did not report high levels of practice outside the classroom even though we supplied audio-guided practices that could be used on a regular basis. We concur with Bailey et al.’s (2018) commentary regarding the practical challenges of implementing mindfulness-based practices at scale in school settings and that many students have limited “capacity to take on optional extras” such as mindfulness.

Although the present study did integrate mindfulness practice into the regular school day as recommended by Bailey et al. (2018), our findings suggest that adolescents’ own choices and willingness to engage in practices outside of the regular school setting may be key to the success of universal school-based approaches. As amount of practice outside of the structured school setting showed significant effects on outcomes, creating higher motivation to use practices on a regular basis should be considered a high priority for program development. Likewise, students may need direction and support on how to fit mindfulness-based practices into their busy schedules in order to form a regular practice habit.

Limitations and Directions for Future Research

Although this study had several strengths, including objective behavioral measures, authentic intervention implementation by trained classroom teachers, and a diverse sample of students, it is not without limitations. Not all subscale measures demonstrated optimal reliability in this field trial, and therefore, findings related to these subscales should be interpreted with caution. Not all measures were specifically designed for use with adolescents, which presents another potential limitation. Although objective behavioral data were collected (e.g., EF), the study relied on youth self-report of social-emotional functioning and behavior as opposed to parent or teacher report, which may yield alternative findings. Although there was a relatively small amount of missing data and it was not related to condition, our missing data analysis assumes that it was missing at random. Moreover, finding new ways to measure practice that do not rely upon participant self-reports, including direct measures of usage through i-App delivered home practices, may be one way to improve our understanding of dose-response relations in these kinds of programs for adolescents. In addition, given that the intervention was delivered in a universal classroom setting, we were unable to control for the number of sessions students were exposed to due to student absences. We relied on a relatively small, but well-trained and supervised, sample of teachers to examine the potential effectiveness of the L2B program on adolescent outcomes. It is quite possible that specific teacher or classroom qualities may differentially impact the effectiveness of MBPs delivered in this manner. Finally, the present study focused exclusively on pre- and post-intervention effects, and did not include a longer-term longitudinal investigation of effects.

This study contributes to the growing body of research on the effects of mindfulness programs delivered in school settings. While some studies have shown significant effects (Kuyken et al. 2013; Metz et al. 2013; Raes et al. 2014; Sibinga et al. 2016), others have not (Johnson et al. 2016, 2017). Given the mixed nature of findings in this study, we propose that a logical next step in the evaluation of such programs is to conduct larger effectiveness trials with a closer examination of dosage and practice effects (Kuyken et al. 2017). Examination of contextual, attitudinal, and motivational factors predicting adolescent adoption of mindfulness intervention strategies outside of school settings, and qualitative and mixed methods research investigating how best to support this, is a particularly important goal for the next wave of school-based mindfulness interventions.

Despite these limitations, this study provides mixed support regarding the potential effectiveness of a universal mindfulness program for high school students. The absence of direct effects on self-report measures implies that simply exposing adolescents to a mindfulness curriculum within the context of typical instruction, in the absence of supports for implementation, is unlikely to substantially impact youth self-report of social-emotional well-being or behavior. However, changes on EF favoring the intervention group were noted, suggesting possible benefits on tasks related to susceptibility to cognitive interference and selective attention. Tests of moderation revealed dosage effects such that students who adopt the mindfulness practices they are taught and use them somewhat regularly can indeed benefit on multiple fronts. Greater effects may be possible with higher levels of dosage and utilization of practices outside of the immediate school setting. As such, examining optimal dosage and strategies to increase the utilization of practices is a key priority for future research.