The construct of mindfulness has become increasingly visible in psychology in recent decades. Mindfulness-based interventions, such as mindfulness-based stress reduction (MBSR; Kabat-Zinn 1990) and mindfulness-based cognitive therapy (MBCT; Segal et al. 2002), are being used to treat a wide variety of psychological and medical conditions (Goldberg et al. 2018; Goyal et al. 2014; Zoogman et al. 2015). In addition, dispositional mindfulness has been associated with a host of psychological characteristics, including psychiatric symptoms, well-being (Baer et al. 2008), and personality traits (Giluk 2009), as well as with neurobiological and behavioral markers (Brown et al. 2013; Creswell et al. 2007; Garland et al. 2011).

As mindfulness is incorporated into the psychological canon, it becomes vital that reliable and valid measures of this construct are available (Lutz et al. 2015). To date, several self-report measures of mindfulness have been developed. Two of the most popular measures of this kind are the Five Facet Mindfulness Questionnaire (FFMQ; Baer et al. 2006) and the Mindful Attention Awareness Scale (MAAS; Brown and Ryan 2003). Despite the widespread use of measures like the FFMQ and the MAAS, some have questioned the validity of self-report measures of mindfulness (Lutz et al. 2015). Among others, Grossman (2008) has raised several such concerns, calling for more rigorous assessment of these measures’ psychometric properties. In particular, concerns have been raised regarding their construct validity (Goldberg et al. 2016; Van Dam et al. 2018), defined as the extent to which they measure what they are intended to measure (Crocker and Algina 2008).

Construct validity inquiries seek to establish evidence that score variance reflects variance on the construct of interest and to rule out the possibility that scores contain construct-irrelevant variance (Crocker and Algina 2008). Given that most measures of mindfulness are self-report (although not all; e.g., Levinson et al. 2014), there are reasons to be skeptical about whether people accurately report their levels of mindfulness. If respondents are not generally aware of, or accurate in, their self-perceptions (as is likely to be the case when an individual has a low level of mindfulness; Davidson and Kaszniak 2015; Grossman 2008), scores on the measures may instead reflect response biases such as social desirability (Tracey 2016) or may reflect variance in conceptually distinct but psychologically related constructs (e.g., positive or negative mood).

One test of construct validity recommended by Cronbach and Meehl (1955) is to examine whether a measure behaves as predicted in response to experimental manipulation. We define responsiveness as the tendency of scores to change in response to experimental treatment. A basic test of construct validity for mindfulness measures, then, is their responsiveness to manipulations intended to enhance mindfulness. In a meta-analytic context, at the most basic level, we can ask whether the responsiveness to mindfulness-based interventions (comparing pre- and post-treatment means for participants in a mindfulness condition) differs significantly from 0.

Randomized clinical trials (RCTs) include other design features that invite more sophisticated tests of construct validity, especially RCTs testing mindfulness in clinical populations. Notably, RCTs involving mindfulness-based interventions conducted in clinical samples typically include both (a) comparison conditions and (b) measures of clinical outcomes. This suggests two additional critical tests of the validity of mindfulness self-reports in this experimental context: one that compares responsiveness within mindfulness measures and between conditions and one that derives an effect size reflecting comparative responsiveness between mindfulness measures and clinical outcome measures within conditions.

RCTs of mindfulness-based interventions include one or more comparison conditions, which allows assessment of relative responsiveness within mindfulness measures and between conditions. Broadly, comparison conditions can be classified as (a) specific active control conditions (i.e., bona fide treatments that are intended to be therapeutic; Wampold and Imel 2015), (b) non-specific active controls (i.e., placebo treatments that are not intended to be therapeutic), or (c) waitlist controls. While bona fide comparison conditions can be defined by their inclusion of ingredients that are intended to be therapeutic, placebo control conditions can vary considerably from study to study (Baskin et al. 2003), which makes comparisons with non-specific active controls difficult to interpret. In the current study, non-specific active controls (k = 4) were excluded for this reason. Bona fide comparison conditions, however, provide an especially informative comparison, given that they not only control for the non-specific factors contributing to the efficacy of psychological treatments (Wampold and Imel 2015) but also include specific therapeutic techniques (such as challenging irrational beliefs, in the case of cognitive behavioral therapy). Waitlist control conditions provide no treatment (or, in some cases, treatment-as-usual) and are intended to control for history and maturation effects on the outcome variable (Shadish et al. 2002). By conducting meta-analyses using a mixture of non-mindfulness-based bona fide comparison conditions and waitlist control conditions, the effects of instruction in mindfulness can be experimentally isolated.

An initial test of the validity of mindfulness self-report measures examines whether responsiveness of mindfulness measures is significantly greater for participants exposed to a mindfulness-based intervention compared with those exposed to specific active controls or a waitlist control condition. Even though bona fide comparison conditions do not directly teach mindfulness-enhancing techniques (e.g., mindfulness meditation), these treatments may target some features that could reasonably increase mindfulness (e.g., awareness of one’s inner experience through cognitive behavioral therapy); thus, we do not predict that changes in mindfulness will be absent in these conditions. Waitlist controls, in contrast, should not show increases in mindfulness over time.

RCTs conducted in clinical samples also typically include outcome measures targeted to the disorder under study. These measures can be used to assess the differential responsiveness between mindfulness measures and clinical outcomes, within conditions. For the RCTs considered here, all studies focused on some specific psychological problem (e.g., depression or anxiety) and included at least one measure of symptoms characterizing this problem (e.g., the Beck Depression Inventory) along with a self-report measure of mindfulness. With multiple measures in each arm of the study, it is possible to examine the degree to which each type of outcome (measures of mindfulness and measures of clinical outcomes) is responsive to each of the two intervention types (i.e., to mindfulness or bona fide treatment interventions). We quantify the differential responsiveness of mindfulness measures and clinical outcome measures (in response to a particular experimental condition) as the difference between the effect size reflecting responsiveness of the mindfulness measure (expressed as a within-group d; Becker 1988) and that for the clinical outcome measure. Thus, differential responsiveness is conceptualized as a comparison between mindfulness measures and clinical outcomes within conditions.

In the bona fide comparison conditions, the treatment targets psychological symptoms and any mindfulness effects are incidental; thus, for these treatments, we expected change on targeted symptom measures to exceed change on measures of mindfulness. For the waitlist control conditions, no significant change was expected on either the measures of mindfulness or the clinical outcomes (although some regression to the mean can be expected on measures of clinical symptoms in clinical samples; Barnett et al. 2004). In the mindfulness-based treatment condition, we expected to see improvement on both the mindfulness measure and the targeted symptom measure, with no a priori expectations regarding which would increase more.

While concerns regarding the construct validity of mindfulness measures have been raised previously (Grossman 2008; Goldberg et al. 2016; Van Dam et al. 2018), to our knowledge, no prior work has used meta-analytic methods to assess the discriminant validity of mindfulness measures using the differential responsiveness comparisons just described. However, prior RCTs and one meta-analysis have assessed between-group effects on measures of mindfulness (comparing relative responsiveness of mindfulness measures between conditions).

Using data from an RCT of MBSR, Goldberg et al. (2016) examined relative changes in FFMQ scores for participants assigned to MBSR, a bona fide comparison condition intended to be therapeutic (the Health Enhancement Program [HEP]; MacCoon et al. 2012), or a waitlist condition. Goldberg et al. failed to find evidence for specific responsiveness to the mindfulness intervention: FFMQ scores demonstrated equivalent improvement over time for individuals receiving MBSR or HEP, with at least some of the FFMQ subscales showing larger gains in the MBSR and HEP conditions relative to the waitlist control.

A recent meta-analysis also examined the degree to which changes in measures of mindfulness (e.g., FFMQ, MAAS) were differentially influenced by experimental manipulation. Across 88 studies, Quaglia et al. (2016) found evidence suggesting mindfulness-based interventions produce larger changes in self-report measures of mindfulness relative to both active and inactive (i.e., waitlist) control conditions across a range of mindfulness facets (i.e., attention, description, non-judgment, non-reactivity, observation). In contrast to Goldberg et al. (2016), Quaglia et al.’s results support the notion that mindfulness measures show greater responsiveness to interventions involving mindfulness, compared with other active treatment conditions.

The aim of the present study is to establish whether self-report mindfulness measures are responsive to mindfulness interventions, whether they respond specifically to the mindfulness-enhancing techniques in these interventions (as opposed to factors common to other psychotherapeutic treatments), and whether they show discriminant validity from measures of psychological symptoms. Thus, we sought to extend Quaglia et al.’s (2016) findings by testing not only specificity of relative responsiveness to experimental manipulation (as examined by Quaglia et al.) but also differential responsiveness (i.e., discriminant validity) compared with measures of clinical outcomes. In order to evaluate differential responsiveness, we restricted our search to randomized trials of clinical interventions using clinical samples. In addition, we included as mindfulness treatments only interventions based on mindfulness meditation, allowing assessment of a more homogeneous family of therapies (e.g., MBCT, MBSR), and excluded interventions (e.g., Acceptance and Commitment Therapy; Hayes et al. 1999) that are grounded in mindfulness theory but do not teach formal mindfulness meditation practices (i.e., sitting meditation). Finally, we examined changes in total scores rather than subscales of mindfulness measures, based on factor analytic evidence suggesting an overall mindfulness factor in commonly used measures (e.g., Baer et al. 2006; Brown and Ryan 2003), and in order to reduce the number of analyses and increase the power of the statistical tests conducted.

Based on past findings, we made the following hypotheses. In regard to the relative responsiveness of mindfulness measures between conditions, we had three hypotheses. First (H1), given the focus of mindfulness-based interventions on the cultivation of mindfulness, we expected significant pre- to post-intervention and pre- to follow-up changes in mindfulness for participants in the mindfulness conditions. Second (H2), we expected pre- to post-intervention and pre- to follow-up changes in mindfulness to be larger in the mindfulness condition than in the alternative treatment and waitlist control conditions. However, many bona fide psychotherapeutic interventions emphasize mindfulness-relevant treatment elements such as introspection and self-awareness. Thus, for our third hypothesis (H3), we expected the mindfulness-to-waitlist comparison to be larger (reflecting greater changes in mindfulness scores) than the mindfulness-to-alternative-treatment comparison. In addition to assessing relative responsiveness, we derived differential responsiveness indices for each condition by subtracting the pre-post change effect size for the clinical outcome measure from that for the mindfulness measure. For our fourth hypothesis (H4), we expected differential responsiveness (within conditions) to be negative (greater change for the clinical outcome measure) in the alternative treatment condition and 0 (no difference in responsiveness) in the waitlist control condition. We had no hypothesis regarding differential responsiveness in the mindfulness condition, as both mindfulness and clinical outcomes were expected to change in response to the treatment.

Method

Eligibility Criteria

We included RCTs of mindfulness-based interventions for adult patients with psychiatric and medical diagnoses that appear on the American Psychological Association’s (APA) Division 12 (Society of Clinical Psychology; see Supplemental Materials Table 1, APA 2017) list of disorders with known evidence-based treatments. To be eligible, samples had to have either a formal diagnosis or elevated symptoms of a given disorder. Samples receiving treatment within a facility focused on a specific disorder (e.g., substance abuse treatment) were included. Elevated stress levels alone were not considered to reflect a clinical condition.

Table 1 Within-group responsiveness, by condition and type of outcome

To qualify, the mindfulness interventions had to have mindfulness meditation as a core component with home meditation practice as a treatment ingredient. While interventions combining mindfulness with other modalities (e.g., mindfulness and cognitive techniques as in MBCT; Segal et al. 2002) were included, therapies emphasizing the attitudinal stance of mindfulness (rather than the formal practice of mindfulness meditation) were excluded (e.g., acceptance and commitment therapy, dialectical behavior therapy [DBT]; Hayes et al. 1999; Linehan 1993). Other non-mindfulness forms of meditation (e.g., mantram repetition) were excluded. Interventions had to be delivered in real time (i.e., not provided through pre-recorded video instruction) and had to include more than one session (to allow for home meditation practice). Studies were also excluded for the following reasons: (1) not published in English, (2) not a peer-reviewed article, (3) data unavailable to compute standardized effect sizes (even after contacting study authors), (4) no disorder-specific (i.e., targeted) outcomes reported, (5) no measure of mindfulness included, (6) data redundant with other included studies, (7) no non-mindfulness-based intervention or condition included (i.e., the trial compared only two or more mindfulness-based interventions), and (8) no waitlist (or TAU that was provided to both the mindfulness and control condition) or bona fide comparison condition included.

Information Sources

We searched the following databases: PubMed, PsycInfo, Scopus, and Web of Science. We also searched a publicly available, comprehensive repository of mindfulness studies that is updated monthly (Black 2012) and included citations from recent meta-analyses and systematic reviews. Citations were included from the first available date (i.e., 1966) until January 2, 2017.

Search

We used the search terms “mindfulness” and “random*.” When a database allowed (e.g., PsycInfo), we restricted our search to clinical trials.

Study Selection

Titles and/or abstracts of potential studies were independently coded by the first author and a second co-author. Disagreements were discussed with a senior author until a consensus was reached.

Data Collection Process

Standardized spreadsheets were developed for coding both study-level and effect size-level data. Coders were trained by the first author, coding an initial sample of studies (k = 10) in order to achieve reliability. Data were extracted independently by the first author and a second co-author, and disagreements were discussed with a senior author. Inter-rater reliabilities in the current study were in the good to excellent range (Cicchetti 1994): kappas > .60 and ICCs > .80. When sufficient data for computing standardized effect sizes were unavailable, study authors were contacted.
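For reference, inter-rater agreement of this kind can be computed in R with the irr package; the sketch below assumes a hypothetical data layout (one column per independent coder) rather than the study's actual coding spreadsheets.

```r
library(irr)

# Cohen's kappa for categorical codes (e.g., comparison-condition type),
# given a data frame with one column per independent coder.
kappa2(condition_codes[, c("coder1", "coder2")])

# Intraclass correlation for continuous codes (e.g., extracted means),
# using a two-way agreement model.
icc(extracted_means[, c("coder1", "coder2")], model = "twoway", type = "agreement")
```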

Data Items

Along with data necessary for computing standardized effect sizes, the following data were extracted: (1) disorder, (2) intent-to-treat (ITT) sample size, (3) whether an ITT analysis was reported, (4) sample demographics (mean age, percentage female, percentage with some college education), (5) country of origin, and (6) type of comparison condition.

Type of comparison condition was coded using a two-tier system: waitlist conditions and bona fide comparison conditions. Waitlist conditions included waitlist controls as well as treatment-as-usual (TAU) conditions in which both the mindfulness and non-mindfulness arms received this treatment (i.e., no additional treatment was provided to the TAU group). The bona fide treatment conditions included comparisons that were based on actual therapies and included specific treatment ingredients and mechanisms of change (Wampold and Imel 2015). The decision to code using this scheme was based on evidence that whether a comparison group represents a bona fide comparison condition significantly influences the relative efficacy of mindfulness-based interventions (Goldberg et al. 2018). Some studies included both bona fide and waitlist comparison conditions (k = 8). In order to avoid duplicated data (i.e., comparing the mindfulness condition to both controls), we included only the bona fide comparison condition in between-group analyses.

Risk of Bias in Individual Studies

Considerations for minimizing bias in individual studies were drawn from both Jadad’s criteria (Jadad et al. 1996) and the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) system (Atkins et al. 2004). Based on the GRADE recommendation to select relevant study characteristics to quantify (Agency for Healthcare Research and Quality 2014), and given the large number of potential study characteristics for assessing quality in psychotherapy trials (e.g., n = 185 quality criteria; Liebherz et al. 2016), we restricted our analysis to randomized trials, employed intent-to-treat samples (when available), and coded the strength of the comparison condition.

Effect Size Computation

For each research hypothesis, we developed an effect size for the comparison of interest as described below. When multiple outcomes of the same type (mindfulness or clinical symptoms) were included in the same study, data were aggregated within studies using the ‘MAd’ package (Del Re and Hoyt 2010), following procedures described by Borenstein et al. (2009).
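For instance, aggregation of this kind might be carried out as follows; the column names here are hypothetical, and the exact MAd interface may vary by package version.

```r
library(MAd)

# Collapse multiple effect sizes per study into one composite per study,
# following the Borenstein et al. (2009) procedure ("BHHR") and assuming
# a correlation of .50 among same-type outcomes within a study.
agg_dat <- agg(id = study_id, es = g, var = var_g,
               cor = 0.50, method = "BHHR", data = effect_data)
```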

Effect Size Calculation for Relative Responsiveness Hypotheses

We quantified responsiveness of mindfulness scores (within conditions) by computing d_within for each experimental condition.

$$ {d}_{\mathrm{within}}=\frac{M_{\mathrm{post}}-{M}_{\mathrm{pre}}}{SD_{\mathrm{pooled}}} $$
(1a)
$$ \operatorname{var}\left({d}_{\mathrm{within}}\right)=\left(\frac{1}{n}+\frac{d^2}{2n}\right)\bullet 2\left(1-r\right) $$
(1b)

where r is the correlation between pre- and post-scores on mindfulness. As is typically the case in meta-analyses of clinical trials, the primary studies did not report r, so we imputed a correlation of r = .50 between time points (somewhat lower than a typical test-retest correlation, to account for intervention effects; see Hoyt and Del Re 2018). These effect sizes were corrected for small-sample bias, converting them to Hedges’ g_within, as recommended by Borenstein et al. (2009). Within-condition effect sizes were computed from pre- to post-treatment (or the time point closest to post-treatment) as well as from pre-treatment to the last available follow-up time point.
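To make the computation concrete, the following minimal R sketch implements Eqs. 1a and 1b together with the small-sample correction factor to Hedges’ g (J = 1 − 3/(4df − 1); Borenstein et al. 2009). The function name is ours, and pooling the pre- and post-treatment variances is an assumption, as Eq. 1a leaves SD_pooled unspecified.

```r
# Within-group standardized mean change (Eqs. 1a-1b), bias-corrected to Hedges' g.
g_within <- function(m_pre, m_post, sd_pre, sd_post, n, r = 0.50) {
  sd_pooled <- sqrt((sd_pre^2 + sd_post^2) / 2)      # assumed pooling of pre/post SDs
  d     <- (m_post - m_pre) / sd_pooled              # Eq. 1a
  var_d <- (1 / n + d^2 / (2 * n)) * 2 * (1 - r)     # Eq. 1b, with r imputed at .50
  J     <- 1 - 3 / (4 * (n - 1) - 1)                 # small-sample correction factor
  list(g = J * d, var_g = J^2 * var_d)
}

# Example: a modest pre-post gain in mindfulness scores for n = 40.
g_within(m_pre = 3.1, m_post = 3.5, sd_pre = 0.8, sd_post = 0.9, n = 40)
```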

We then quantified relative responsiveness (to the mindfulness intervention compared with the two comparison conditions) as the difference in the pre-post effects (i.e., change scores). The resulting effect size (called Δ, following Becker 1988) represents the amount by which change in mindfulness in the mindfulness condition exceeds change in mindfulness in the comparison condition, in standard deviation units.

$$ \Delta ={g}_{\mathrm{within}}^M-{g}_{\mathrm{within}}^C $$
(2a)
$$ \operatorname{var}\left(\Delta \right)=\operatorname{var}\left({g}_{\mathrm{within}}^M\right)+\operatorname{var}\left({g}_{\mathrm{within}}^C\right) $$
(2b)

where the M and C superscripts refer to the mindfulness and comparison conditions, respectively.
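Because the two conditions involve independent samples, Eqs. 2a and 2b reduce to a difference of the two within-group estimates with summed variances; a brief sketch (continuing the hypothetical g_within output above):

```r
# Relative responsiveness (Eqs. 2a-2b): mindfulness (M) vs. comparison (C) condition.
delta_between <- function(g_m, var_m, g_c, var_c) {
  list(delta = g_m - g_c,        # Eq. 2a
       var   = var_m + var_c)    # Eq. 2b: variances add for independent samples
}
```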

Effect Size Calculation for Differential Responsiveness Hypotheses

In the second set of hypotheses, we quantified differential responsiveness (i.e., for the mindfulness measure compared with the clinical outcome measure) by computing a dependent-samples Δ_dep for each condition. Because this effect size is a difference between dependent estimates (i.e., two estimates derived from the same sample), the variance formula needs to take into account the correlation between the mindfulness and the clinical symptom effect sizes.

$$ {\Delta }_{\mathrm{dep}}={g}_{\mathrm{within}}^{\mathrm{mindful}}-{g}_{\mathrm{within}}^{\mathrm{clinical}} $$
(3a)
$$ \operatorname{var}\left({\Delta }_{\mathrm{dep}}\right)=\operatorname{var}\left({g}_{\mathrm{within}}^{\mathrm{mindful}}\right)+\operatorname{var}\left({g}_{\mathrm{within}}^{\mathrm{clinical}}\right)-2\bullet r\bullet \sqrt{\operatorname{var}\left({g}_{\mathrm{within}}^{\mathrm{mindful}}\right)\bullet \operatorname{var}\left({g}_{\mathrm{within}}^{\mathrm{clinical}}\right)} $$
(3b)

Correlations between mindfulness and clinical measures were often not reported in the primary studies. Consequently, we used an imputed value of r = .50, based on meta-analytic estimates of the association between dispositional mindfulness and neuroticism (Giluk 2009). (The sign of the correlation coefficient is positive because we reverse-scored clinical outcomes, so that positive effect sizes indicate improvement over time for both outcome variables.) Positive values of Δ_dep reflect greater responsiveness for the mindfulness measure (compared with the clinical measure) in the condition under study.
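A sketch of Eqs. 3a and 3b, paralleling the between-condition function above; the covariance term uses the imputed r = .50.

```r
# Differential responsiveness (Eqs. 3a-3b): two dependent within-group g's
# from the same sample, so their covariance is subtracted from the variance.
delta_dep <- function(g_mind, var_mind, g_clin, var_clin, r = 0.50) {
  list(delta = g_mind - g_clin,                      # Eq. 3a
       var   = var_mind + var_clin -
               2 * r * sqrt(var_mind * var_clin))    # Eq. 3b
}
```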

Analyses were conducted using the R statistical software with the ‘metafor’ and ‘MAd’ packages (Del Re and Hoyt 2010; Viechtbauer 2010). Random-effects models were estimated using restricted maximum likelihood, with studies weighted by the inverse of the variance. Heterogeneity was assessed using the Q-statistic and quantified using I².
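In metafor, a model of this kind is fit with rma(); the data object and variable names below are placeholders for the per-comparison effect sizes and variances described above.

```r
library(metafor)

# Inverse-variance-weighted random-effects model with REML estimation.
res <- rma(yi = delta, vi = var_delta, method = "REML", data = dat)
summary(res)   # omnibus effect, Q-test of heterogeneity, and I^2
```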

Risk of Bias Across Studies

We assessed publication bias by visually inspecting funnel plots for asymmetry within the comparison of interest. In addition, primary models were re-estimated using trim-and-fill methods that account for the asymmetric distribution of studies around an omnibus effect (Viechtbauer 2010).
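Both diagnostics are available in metafor for a fitted model such as the hypothetical res above:

```r
funnel(res)      # visual check of the funnel plot for asymmetry
trimfill(res)    # model re-estimated with imputed "missing" studies
```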

Results

Study Selection

A total of 9067 citations were retrieved. After 3485 duplicates were removed, 5582 unique titles and/or abstracts were coded. Following application of the exclusion criteria (see flow diagram in Supplemental Materials), 69 articles comprising 55 unique samples and 4743 participants were retained for analysis.

Study Characteristics

Effect sizes in standardized units (i.e., d) reflecting within-group and between-group changes on mindfulness, as well as the relative responsiveness of mindfulness and clinical outcomes, are shown in the Supplemental Materials along with other study characteristics (Table 2). Participants were on average 44.20 years old and 61.48% female, with 63.67% having some post-secondary education. The largest percentage of trials was conducted in the USA (52.73%). Approximately half of the studies included waitlist control conditions (45.45%) and half included bona fide comparison conditions (54.55%). The most commonly studied disorder was depression (23.64%), followed by pain (21.82%), anxiety (16.36%), and addiction (9.09%). The majority of studies (58.18%) used either the FFMQ or the Kentucky Inventory of Mindfulness Skills (KIMS; Baer et al. 2004) to assess self-reported mindfulness; another 18.18% used the MAAS (with one study including both the FFMQ and the MAAS); the remaining studies (k = 12) used other self-report mindfulness measures.

Table 2 Relative responsiveness (mindfulness versus comparison conditions), by outcome type

Risk of Bias Within Studies

All included studies used randomized designs. More than half of the studies reported at least one ITT analysis (63.64%). When available, results from the ITT analysis were used.

Results of Individual Studies

For each included study, treatment effects on self-report measures of mindfulness and clinical outcomes are reported in Supplemental Materials.

Mindfulness Measures: Responsiveness to Intervention

The top half of Table 1 shows pre- to post-intervention and pre- to follow-up effect sizes by condition, for both mindfulness and clinical outcome measures. As expected (H1), there was evidence of significant changes in self-reported mindfulness in response to mindfulness interventions (g = 0.49 [0.39, 0.58] from pre- to post-treatment; g = 0.31 [0.17, 0.45] from pre- to follow-up). The parallel effect sizes for mindfulness responsiveness were close to 0 (and not significantly different from 0) in the waitlist conditions and were intermediate (and significantly different from 0) in the alternative treatment conditions.

Relative Responsiveness of Mindfulness Measures Across Experimental Conditions

The top half of Table 2 summarizes effect sizes (Becker’s Δ) comparing responsiveness in mindfulness scores between conditions (see Supplemental Materials for forest plots). As expected (H2), mindfulness measures demonstrated enhanced responsiveness to mindfulness-based interventions relative to waitlist controls (Δ = 0.52 [0.40, 0.64] pre- to post-treatment; Δ = 0.52 [0.20, 0.84] pre- to follow-up) and relative to alternative, non-mindfulness-based bona fide comparison conditions (Δ = 0.25 [0.11, 0.38] pre- to post-treatment); however, the latter comparison was no longer significant at follow-up (Δ = 0.10 [−0.08, 0.28]). Also in accordance with our predictions (H3), responsiveness effect sizes relative to waitlist conditions were larger than those relative to bona fide treatment comparisons at both time points (p < .05), although the robustness of the follow-up finding was called into question in the sensitivity analysis, as discussed in the later section on risk of bias.

Differential Responsiveness Between Mindfulness Measures and Clinical Outcomes

Our final set of hypotheses examined discriminant validity of mindfulness measures and clinical outcome measures in the context of experimental manipulation. Differential responsiveness effect sizes were computed within conditions as the difference between within-group ds for mindfulness and clinical outcome measures (Δ_dep), then meta-analyzed across studies, with the results summarized in Table 3. We predicted (H4) that differential responsiveness would be negative (reflecting greater responsiveness for the clinical outcome measure) in the alternative treatment condition and near 0 for the waitlist condition. We made no prediction regarding whether clinical outcomes or measures of mindfulness would change more in the mindfulness conditions.

Table 3 Differential responsiveness of mindfulness and clinical outcomes, by condition

As shown in Table 3, we found negative differential responsiveness (i.e., change in mindfulness was smaller than change in clinical symptoms) in all three conditions. This difference in responsiveness was statistically significant (i.e., the 95% CI excluded 0) for five of the six tests (three conditions, each at post-treatment and follow-up); the exception was the pre- to follow-up change in the waitlist condition (Δ_dep = −0.32 [−0.65, 0.01]), which had the smallest amount of data available (k = 8) and therefore the lowest statistical precision (and power). This result supported our prediction for the bona fide comparison conditions, although the negative differential responsiveness was not predicted in the waitlist condition. We consider possible explanations for this unexpected finding in the “Discussion” section.

Risk of Bias Across Studies

Bias in the above analyses was assessed through funnel plots and trim-and-fill analyses. Asymmetric funnel plots suggested publication bias for several models (see Supplemental Materials for funnel plots). Trim-and-fill analyses yielded adjusted effect sizes, although the direction of adjustment varied (i.e., some effects became larger). The sensitivity analyses called into question one effect that appeared significant in the main analyses: the pre- to follow-up between-group relative responsiveness on mindfulness measures in mindfulness versus waitlist control conditions (adjusted Δ = 0.35 [−0.03, 0.72]; Table 2).

Discussion

Our goal in this study was to examine evidence for construct validity of self-report measures of mindfulness derived from clinical trials that included a mindfulness intervention condition. These RCTs allow for robust examination of responsiveness to experimental manipulation, as described by Cronbach and Meehl (1955). Our meta-analytic findings provided support for the predictions (H1 to H3) that scores on mindfulness measures are responsive to experimental intervention: These measures registered moderate amounts of change in response to mindfulness interventions, little or no change in waitlist conditions, and intermediate levels of change in conditions implementing a non-mindfulness-based alternative treatment.

While these results mirror those of previous reports (Quaglia et al. 2016), it is worth noting explicitly here that patients report changes in mindfulness in both mindfulness and non-mindfulness-based interventions (albeit to a smaller degree in non-mindfulness-based interventions). Changes in mindfulness induced by non-mindfulness-based interventions could be due to a number of factors. This effect might indicate that the non-mindfulness-based interventions are implicitly or explicitly teaching mindfulness skills (e.g., meta-cognitive skills in the case of cognitive behavioral therapy). Alternatively, the responsiveness of mindfulness measures to non-mindfulness interventions may reflect construct-irrelevant variance (Hoyt et al. 2006), such as general negative affect, that contributes to variance in mindfulness scores—a limitation in the construct validity of self-report measures of mindfulness (Goldberg et al. 2016; Grossman 2008). Further research examining measures of mindfulness in the context of non-mindfulness-based interventions, as well as research employing multimethod assessment of mindfulness, can be helpful for clarifying what sources of variance contribute to scores on self-report measures of mindfulness (cf. Cronbach and Meehl 1955).

A second set of hypotheses examined differential responsiveness of mindfulness and clinical outcome measures. These analyses used meta-analytic methods to examine a type of discriminant validity in the experimental context. We predicted (H4) that responsiveness (i.e., change) for mindfulness measures should be smaller than responsiveness of clinical outcome measures in the bona fide (non-mindfulness) intervention condition and should be similar (and near 0) in the waitlist control condition. Given that we expected change on both measures of mindfulness and measures of clinical outcomes in the mindfulness condition, no hypothesis was made about differential responsiveness in this group.

Of our two directional hypotheses, only the hypothesis relating to bona fide comparison conditions was supported. As predicted, changes in clinical outcomes exceeded those of changes in measures of mindfulness, supporting the prediction of discriminant responsiveness to bona fide non-mindfulness-based mental health interventions.

Interestingly, the same pattern was observed for the waitlist and mindfulness comparisons as well. The presence of relatively larger effects on clinical outcomes than measures of mindfulness in the waitlist condition underscores a challenge for differential responsiveness predictions based on clinical trials data: the possibility of differential improvement in the absence of treatment. Although we predicted equivalent (and near 0) improvement for both sets of outcomes in the waitlist condition, there are at least three reasons that one might expect clinical symptoms to improve in the waitlist condition: regression to the mean, benefits of “treatment-as-usual” (given that it is generally not possible to prohibit control group participants from seeking assistance outside the study), and remoralization effects of the decision to seek treatment through participating in a research study (which may include seeking non-professional support and taking other actions outside the treatment context to ameliorate symptoms).

The presence of relatively larger effects on clinical outcomes than measures of mindfulness in the mindfulness condition is intriguing. While we did not have an a priori hypothesis related to this comparison, it is notable that the effect of mindfulness-based interventions on clinical outcomes is larger than that observed on measures of mindfulness, one of the key putative mediators of treatment effects in mindfulness interventions (Gu et al. 2015). In theory, one might expect effects on mediators to be similar to or larger than effects on clinical outcomes, because the intervention is the proximal cause of the mediator variable and a distal cause (to the extent that the mediator explains the relation between intervention and outcome) of symptom reduction. Indeed, there is a strong consensus among mediation researchers that it is reasonable to search for mediated (indirect) effects even in the absence of a bivariate relation between the predictor variable and the outcome (Kenny et al. 1998; MacKinnon 2008; Shrout and Bolger 2002), which reinforces the notion that relations between the predictor and mediator may often be more robust than those between the predictor and outcome (the “total effect” in mediator models; Baron and Kenny 1986; MacKinnon 2008). In their meta-analysis of mindfulness as a mediator in mindfulness-based interventions, Gu et al. (2015) reported that intervention effects on mindfulness were somewhat larger than those on clinical outcomes (rs = .34 and .27 for effects on mindfulness and clinical outcomes, respectively). Our finding of a small but statistically significant difference in effect size favoring the clinical outcome measures may be attributable to the restriction of our review to clinical samples and likely reflects additional pathways (i.e., beyond the mediated effect through changes in mindfulness) by which mindfulness-based interventions reduce clinical symptoms (e.g., therapeutic alliance; Goldberg et al. 2013).

Limitations

Several limitations are worth acknowledging. The first is that our results were limited to published studies. Given the extensive nature of our literature search, we chose to exclude unpublished studies. However, publication bias is an increasing concern in psychology (DeCoster et al. 2015), and our sensitivity analyses (trim-and-fill, funnel plots) suggest the presence of publication bias in our sample. As null results have historically been more difficult to publish (or have been intentionally omitted from published studies; DeCoster et al. 2015), it is likely that the treatment differences we observed on self-report measures of mindfulness overestimate the true differences. A second limitation was not disaggregating by mindfulness component (i.e., measure or subscale). This was done to limit the number of analyses and increase statistical power, but it may have limited our ability to detect differences in measure performance across specific aspects of mindfulness. A third limitation was not separating analyses by disorder, which would have allowed assessment of how changes in mindfulness compared with changes in outcomes for different disorders. We chose not to explore this possibility due to the small number of studies of certain disorder types (e.g., ADHD), particularly when crossed with comparison group type. Future studies, presumably using trials that are yet to be published, could explore some of these possibilities. A final limitation was the possibility of limited statistical power, particularly for certain analyses (e.g., those involving comparisons with waitlist conditions at follow-up). It is conceivable that certain effects were not detected due to Type II error.

Taken together, results from the current study provide partial support for the construct validity of self-report measures of mindfulness. Although responsive to mindfulness training, these measures appear to also change through other bona fide treatments, albeit to a lesser degree. Effects of mindfulness interventions on measures of mindfulness are also smaller than their effects on targeted outcomes, at least within the clinical samples included here.

As Cronbach and Meehl (1955) pointed out, instances of uncertain construct validity could implicate the measures used and/or the theory underlying the measures. This underscores the value in continued work on the measurement of mindfulness as well as efforts to untangle the mechanisms at play in mindfulness interventions. Future studies of mindfulness-based interventions will ideally include behavioral and neurobiological assessment of mindfulness and characteristics putatively related to mindfulness, along with self-report measures of mindfulness. Results from RCTs using these measures, particularly when also using comparison conditions that are intended to be therapeutic (Goldberg et al. 2017), can help assess the degree to which specific effects related to training in mindfulness are present. The development of novel assessment methods (e.g., significant-other ratings, observer ratings, mindfulness teacher ratings) may provide valuable alternatives to self-report measures of mindfulness in future studies.