1 Introduction

Gratitude is a state of affirming the goodness or good things in one’s life, accompanied by a recognition that the sources of this goodness lie at least partially outside the self, such as with the good intentions of another person (Emmons and Stern 2013). Gratitude may be elicited by another person when he or she provides some aid or benefit, but it may also stem from noninterpersonal sources, such as a feeling of thanks for waking up in the morning (Wood et al. 2010). A number of studies have demonstrated that gratitude has strong associations with measures of well-being, including positive correlations with positive affect, life satisfaction, extraversion, and forgiveness, and negative associations with neuroticism and substance abuse (see Watkins 2014; Wood et al. 2010 for reviews).

Additionally, the relationship between gratitude and symptoms of depression and anxiety has been the focus of a number of empirical studies (see Petrocchi and Couyoumdjian 2016 for a review). For example, both Stoeckel et al. (2014) and Watkins et al. (2003) found an inverse correlation of moderate strength between gratitude and symptoms of depression (r = − .48 and r = − .56, respectively). Likewise, Krumrei and Pargament (2008) reported an inverse association of moderate strength between gratitude and symptoms of anxiety (r = − .46). Additionally, Kendler et al. (2003) reported a reduced lifetime prevalence of generalized anxiety disorder among those scoring higher in thankfulness, OR = .82. Although investigators are just beginning to uncover the specific mechanisms by which gratitude relates to depression and anxiety, there are several possible explanations for the inverse association between these variables. First, as Wood et al. (2010) noted, gratitude is associated with interpreting various stimuli and life events in positive terms, which contrasts with the selective attention to negative qualities of the self, the world, and the future that is characteristic of depression and anxiety (Mogg and Bradley 2005; Peckham et al. 2010). Consistent with Wood et al.’s idea, Petrocchi and Couyoumdjian (2016) found the inverse relationship between gratitude and symptoms of depression and anxiety is accounted for by a less critical, less punishing, and more compassionate view of oneself. Researchers have also found gratitude is associated with greater relationship connection and satisfaction (Algoe et al. 2010), which are well-established buffers against psychopathology (Seppala et al. 2013). Finally, another way gratitude may guard against anxiety, in particular, is its relationship with uncertainty. A well-known characteristic of the worry observed in anxiety disorders is an intolerance of uncertainty (Carleton et al. 2012). 
Practicing gratitude may train one to be content in his or her present circumstances, whatever they may be, thus attenuating a fear of uncertain outcomes.

Given that gratitude is associated with a number of positive qualities, researchers have designed and tested several interventions to increase gratitude. For example, Emmons and McCullough’s (2003) seminal study asked participants to record five things for which they felt grateful each week for 10 weeks. They found significant increases in positive affect and hours spent exercising among the gratitude group, as well as improved sleep quality and reduced physical symptoms. Several researchers have also reported significant effects of gratitude interventions on symptoms of depression and anxiety. For instance, Seligman et al. (2005) compared the effects of two gratitude interventions to a control condition (journaling about early memories) for reducing depressive symptoms. In the first gratitude condition, named “three good things,” participants were instructed to keep a daily record of three good things and explain why they happened for one full week. In the second gratitude condition, named the “gratitude visit,” participants were asked to write and personally deliver a letter to someone they had never properly thanked. Seligman et al. found the three good things exercise led to reduced depressive symptoms for up to 6 months compared to the control group, and the gratitude visit reduced depressive symptoms for up to 1 month (although it had the largest positive improvements at post-test).

In a study on how gratitude interventions affect anxiety within a clinical sample, Kerr et al. (2015) recruited individuals on a waitlist for an outpatient psychology clinic. Client difficulties included depression, anxiety, and PTSD, as well as substance use and eating disorders. Individuals were randomized to either record up to five things they felt grateful for in the past day for 14 days, or to keep a daily mood diary in the control condition. The researchers found the gratitude intervention—but not the control task—significantly reduced anxiety over the 14-day period. Geraghty et al. (2010) obtained similar results. Using an online community sample, they randomized participants to a waitlist condition or to a gratitude condition in which participants listed six items for which they were grateful each day for 14 days. The gratitude condition led to significant reductions in worry, whereas the waitlist group experienced little change from baseline. Based on results such as these, researchers have suggested gratitude interventions may be an effective, low-cost, and easily implementable psychotherapeutic tool (Duckworth et al. 2005; Seligman et al. 2006).

However, in spite of these seemingly promising findings, a qualitative review by Wood et al. (2010) questioned the efficacy of gratitude interventions for a variety of outcomes, including depression and anxiety (see Footnote 1). The authors argued many of the intervention studies included control groups that made inferences about the efficacy of gratitude interventions ambiguous. For example, they pointed out studies that, instead of including a neutral control task, included comparisons that may actually have harmful effects, such as listing hassles (Emmons and McCullough 2003) or things one was unable to accomplish over the summer (Watkins et al. 2003). Relying on control conditions that may have effects in the opposite direction of the gratitude intervention (i.e., increasing symptoms of distress or reducing well-being) may inflate the effect size estimate. Conversely, a study by Lyubomirsky and colleagues with a more neutral control condition (writing about one’s weekly schedule) found the gratitude intervention to have little effect on symptoms of depression (Lyubomirsky et al. 2011). Given the heterogeneity of these control groups, Wood et al. (2010) cautioned against a premature declaration of the success of gratitude interventions until more rigorous investigations could be conducted. Indeed, they noted of the 12 interventions included in their review, “only a very small number show that gratitude interventions are more effective than genuine controls” (Wood et al. 2010, p. 898).

Responding to Wood et al.’s (2010) critique, Davis et al. (2016) recently conducted a meta-analysis investigating the effects of gratitude interventions on gratitude, anxiety, and psychological well-being. They found gratitude interventions had generally limited effects, with effect sizes ranging from d = 0.31 for a measurement-only control (i.e., waitlist) to d = − 0.03 for a “psychologically active” comparison (i.e., one that might be reasonably expected to promote psychological well-being, such as an automatic thought record). Based on these results, the authors concluded the evidence for the efficacy of gratitude interventions on psychological well-being, anxiety, and even gratitude itself is weak.

Although the conclusions reached by Wood et al. (2010), and later by Davis et al. (2016), question the benefits of gratitude interventions, these exercises have grown quite prominent in popular culture as a means of self-help. Paid-subscription smart-phone applications like Happify (2016), best-selling books like The Gratitude Diaries (Kaplan 2015), and magazine editorials (Graff 2016) all claim that gratitude interventions will enhance one’s life. Additionally, university wellness centers have begun to advocate for the use of gratitude interventions as therapeutic tools (Emmons 2013; “Heart Centered Practices”). With such widespread use, it is important to determine whether gratitude interventions are indeed efficacious for specific psychological symptoms, such as depression and anxiety, which are among the most common mental health problems (Kendrick and Pilling 2012).

Our meta-analysis extends the work of past reviews in several ways. First, new studies on gratitude interventions have been published since the prior reviews in this area. Our meta-analysis updates the literature for all studies published before May 17th, 2018 (see methods). Second, past meta-analyses of gratitude interventions have focused on positive psychological functioning, such as positive affect (Sin and Lyubomirsky 2009) and life satisfaction (Davis et al. 2016). To date, ours is the first to conduct focused analyses on effects for symptoms of depression and anxiety. Sin and Lyubomirsky’s (2009) meta-analysis assessed the effects of positive psychology interventions (PPIs) on symptoms of depression, but only four of the included studies used a gratitude intervention; thus, it was not possible to disentangle the effect of PPIs in general from the specific effect of gratitude interventions. Davis et al. (2016) also included measures of depression and anxiety. However, depression and life satisfaction scores were aggregated to create a “psychological well-being” outcome, and only 10 of the 21 studies for this aggregate outcome included a depression measure; thus, the unique effect for symptoms of depression could not be obtained. Their anxiety effect size also included a measure of marital satisfaction (Snyder 1998), which may capture marital distress rather than anxiety-specific symptoms, such as worry (Meyer et al. 1990). Additionally, their inclusion of comparison groups with demonstrated therapeutic value makes conclusions drawn from their anxiety analysis ambiguous. For example, evidence suggests that comparison conditions such as progressive muscle relaxation (Cheung et al. 2003) and automatic thought records (Persons and Burns 1985) reduce symptoms of depression and anxiety. 
Therefore, in Davis et al.’s meta-analysis, it is uncertain whether gratitude interventions are ineffective for anxiety symptoms, or if they only improve symptoms to the same degree as other therapeutic interventions. Similarly, they did not distinguish between neutral and therapeutic controls in their moderator analyses of psychological well-being. As articulated by Chambless and Hollon (1998), such comparisons are inherently difficult to interpret, as null results may stem from a lack of power rather than equivalency between groups. Therefore, to eliminate this ambiguity, we only included studies with no-treatment (waitlist) and neutral comparison groups, i.e., active control tasks that were not intended to be therapeutic interventions and did not have empirical evidence of benefit for depression and/or anxiety.

Most recently, Dickens (2017) published a meta-analysis that found gratitude interventions have a small (d = 0.13) effect on depression when compared to waitlist or active control tasks. However, she did not report specific effect sizes for each of these two control types, but combined both under the label of “neutral conditions”. Additionally, she applied a number of exclusion criteria that limit the generalizability of her results, i.e., excluding studies involving daily gratitude journals, studies lasting 3 days or less, and studies that involved multiple gratitude interventions. In the current meta-analysis, we did not apply such exclusions, but instead included study duration and type of intervention as moderators of the effect size. For studies with multiple gratitude interventions, we aggregated the effect sizes and ran our models with and without these aggregate scores to include as much information as possible (see details under methods). With these broader inclusion criteria, we were able to include the largest number of studies examining depressive symptoms of any published meta-analysis to date. Additionally, the Dickens (2017) meta-analysis did not include symptoms of anxiety.

Additionally, neither Davis et al. (2016) nor Dickens (2017) performed a risk-of-bias assessment (see Footnote 2). Such an assessment can reveal how study features such as participants’ awareness of the condition, dropout, and baseline differences may influence effect size estimates. For example, Bolier et al. (2013) found a larger effect size for PPIs with a greater risk of bias. Therefore, to assess for influences of bias, we conducted a risk-of-bias assessment using guidelines developed by the Cochrane Collaboration (Higgins et al. 2017).

Finally, for any variable assessed in a meta-analysis, there will be sources of error affecting its accuracy, such as measurement error. The influence of measurement error is particularly important for the outcome measure(s). For any given study in a meta-analysis, the effect size estimate will be attenuated to the extent that there is measurement error (unreliability) in the outcome measure(s) (Hedges and Olkin 1985). If measurement error is present across studies, then the overall effect size will be attenuated by the cumulative impact of the unreliability across studies. Past meta-analyses of gratitude interventions have not controlled for measurement error in the outcome(s). Therefore, we conducted a correction for attenuation to minimize the influence of unreliability on our effect size estimates. The method for this correction is described in the methods section below.

With this meta-analysis, we examined the effects of gratitude interventions on symptoms of depression and anxiety at both immediate post-test and follow-up periods. Additionally, we assessed the influence of several moderator variables using meta-regression. Specifically, we were interested in determining: (1) the effect size of gratitude interventions on symptoms of depression and anxiety, considered separately; (2) the overall aggregate effect size on symptoms of depression and anxiety, considered together; and (3) whether the type of control group and other study characteristics (e.g., risk of bias, duration of intervention, type of intervention) moderated the effects.

2 Methods

2.1 Literature Search

We searched four databases (Cochrane Libraries, PsycINFO, PubMed, and Web of Science) for studies investigating the effects of gratitude interventions on symptoms of depression and anxiety. Additionally, to maximize the number of potential studies included, we manually searched the reference sections of published review articles that discussed gratitude interventions (Carl et al. 2013; Davis et al. 2016; Sin and Lyubomirsky 2009; Wood et al. 2010). Initial searches were conducted on June 1–2, 2016. A second search was conducted on May 17–18, 2018, to update the literature. See Online Resource 1 for detailed keyword profiles and filters applied to each database.

2.2 Inclusion Criteria

We used the following inclusion criteria:

  1. Study was a scientific article in a relevant field and topic. Excluded studies were non-scientific essays from the humanities or scientific studies not focused on gratitude, such as miscellaneous biology or medical research.

  2. Study used an experimental design with random assignment and a waitlist (measurement-only control) or neutral control condition. Excluded studies involved correlational or qualitative research, or solely included comparisons with treatments of demonstrated efficacy for anxiety and/or depression (i.e., solely including other active treatments without a neutral control). A comparison activity was considered efficacious if it was intended to be therapeutic by the study authors and it had empirical evidence demonstrating some benefit for depression and/or anxiety symptoms in past research (e.g., automatic thought records; Persons and Burns 1985). If a study included both an active treatment and a neutral/waitlist control, we used the neutral control as the comparison group.

  3. Adequate statistical information was available to compute an effect size. If the study did not contain adequate data, we contacted the corresponding or first author to retrieve the necessary information. If he or she could or would not provide it, we excluded that study.

  4. Study included at least one measure of symptoms of either anxiety or depression. Excluded studies exclusively measured some other outcome(s), such as life satisfaction or physical health. A full-text review was conducted for all studies excluded for this reason to ensure that no depression or anxiety measures were reported in the manuscript.

  5. Intervention was designed to induce or increase gratitude. Excluded studies used a non-gratitude intervention (such as mindfulness) or combined gratitude with another technique into a single group (such as a gratitude and best-possible-self exercise conducted together).

In total, we screened 1277 abstracts for inclusion (953 unique studies across databases after removing duplicates). See Fig. 1 for a flowchart of the screening process.

Fig. 1. Flowchart of the study inclusion process. Template adapted from “The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials,” by D. Moher, K. Schulz, and D. Altman, 2001, Lancet, 357, p. 1193. Journals publishing the original CONSORT flowchart have waived copyright protection

2.3 Data Extraction

Data extraction and a risk-of-bias assessment were conducted by the first author (DC) and independently checked by a second research assistant. If there was a discrepancy, each reviewer returned to the original paper to double check the data extraction. Disagreements were resolved by discussion. Data were collected on the study design, control group, intervention type, study duration, participant characteristics, baseline depressive symptoms, outcomes measured, post-test and follow-up raw data, sample size (see Footnote 3), publication status and year, the presence of a compliance or adherence check, and the risk-of-bias criteria outlined below. Raw data for each study (means, SDs, and reliability coefficients) can be found in Online Resource 2. For studies containing a depression measure with a published threshold score for clinically-relevant symptoms of depression, we also coded whether the sample’s baseline depressive symptoms met the recommended threshold. See Online Resource 4 for a list of these threshold scores.

Additionally, we extracted reliability (Cronbach’s alpha) coefficients at baseline for each outcome within each study in order to correct effect sizes for attenuation. If the specific timepoint was not reported (e.g., “reliabilities ranged from .91 to .93 across timepoints”), then we used the median value. If the reliability was not reported for each subscale of an outcome (e.g., the DASS-21), we used the reliability for the total scale. If the alpha value was not reported, we used the reliability coefficient from the original publication of the scale. For one study (Smullen 2012), the reliability coefficient was not reported in either the study or the original publication of the scale. Therefore, we used the mean reliability value for all depression measures in our meta-analysis (0.86).

We used seven categories for the risk-of-bias assessment based on criteria from the Cochrane Handbook for Systematic Reviews of Interventions (Higgins et al. 2017) and a previous meta-analysis of PPIs by Bolier et al. (2013). Each criterion was coded as 1 (meets criterion), 0 (fails to meet criterion), or N/A (insufficient information to determine criterion). A composite summary score was calculated for each study by adding the number of criteria met. We also created categorical groupings of bias risk. A study was categorized as low risk if 5–7 criteria were met, medium risk if 3–4 criteria were met, and high risk if 0–2 criteria were met. Both the summary score and categorical groupings were included as moderators in the meta-regression analyses. Additionally, because creating summary scores carries the limitation of assigning equal weight to all criteria (Higgins et al. 2017), we also analyzed each bias criterion as an independent categorical moderator in meta-regression, using the low-risk studies as the reference group.

The seven risk-of-bias criteria were: (1) Random sequence generation: did study authors describe a method for ensuring random assignment to groups; (2) Randomization concealment: were investigators unaware of assignment to groups and/or the randomization sequence; (3) Participants kept unaware of study condition; (4) Baseline comparability: were baseline values of depression and/or anxiety equivalent between groups, or were appropriate adjustments made to correct for baseline differences (e.g., including baseline depression as a covariate); (5) Participant attrition: if attrition occurred, was the attrition rate reported and analyzed, was attrition less than 50% of the initial randomized sample at post-test and follow-up, and was the attrition rate comparable between groups (no more than a 10% difference); (6) Correcting for missing data: if data were missing, did investigators make an attempt to correct for the missing data using an intention-to-treat analysis or other means of imputing data; (7) Miscellaneous: any idiosyncratic risk of bias not captured in the other categories (e.g., excluding participants with psychiatric conditions or reporting a deviation from the pre-specified study protocol).
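The summary score and risk categories described above can be sketched in Python as follows. This is an illustrative sketch rather than the coding procedure itself: the function name is hypothetical, N/A is represented as `None`, and treating N/A as "not met" is an assumption that follows from the score being the number of criteria met.

```python
def risk_of_bias_summary(codes):
    """Summarize the seven risk-of-bias codes for one study.

    codes: sequence of seven entries, each 1 (meets criterion),
    0 (fails to meet criterion), or None (N/A, insufficient information).
    Returns (summary_score, category).
    """
    if len(codes) != 7:
        raise ValueError("expected one code per criterion (seven in total)")

    # Summary score = number of criteria met; 0 and None add nothing (assumption).
    score = sum(1 for c in codes if c == 1)

    # Categorical grouping: 5-7 low risk, 3-4 medium risk, 0-2 high risk.
    if score >= 5:
        category = "low"
    elif score >= 3:
        category = "medium"
    else:
        category = "high"
    return score, category
```

For example, a study meeting five criteria, failing one, and lacking information on one would receive a score of 5 and be grouped as low risk.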

2.4 Statistical Analyses

We performed the primary calculations and analyses using the meta-analysis software OpenMEE (Dietz et al. 2015). We computed a standardized mean difference for each study using the formula for Hedges’ g (Hedges and Olkin 1985). Hedges’ g is a form of Cohen’s d that corrects for small-sample bias and shares its interpretation guidelines: absolute values of 0.2, 0.5, and 0.8 or greater indicate small, medium, and large effects, respectively. Absolute values less than 0.2 can be considered trivial (Cohen 1988). The formula for Hedges’ g is:

\(g = \frac{\bar{\Delta}_{exp} - \bar{\Delta}_{ctl}}{\sqrt{\frac{(n_{exp} - 1)SD_{exp}^{2} + (n_{ctl} - 1)SD_{ctl}^{2}}{n_{total} - 2}}} \times \left(1 - \frac{3}{4(n_{exp} + n_{ctl}) - 9}\right)\), where “exp” denotes the experimental group and “ctl” denotes the control group. The variance for Hedges’ g is calculated by the formula \(V_{g} = J^{2} \times V_{d}\). In this equation, \(V_{g}\) is the variance of g, \(J\) is a correction factor defined by \(1 - \frac{3}{4(n_{exp} + n_{ctl} - 2) - 1}\), and \(V_{d}\) is the variance for Cohen’s d defined by \(\frac{n_{exp} + n_{ctl}}{n_{exp} \times n_{ctl}} + \frac{d^{2}}{2(n_{exp} + n_{ctl})}\). For studies using a pre/post-test design, we calculated a change score for each group by subtracting the pre-test mean from the post-test mean, thus obtaining a negative value if a reduction in symptoms occurred. If pre-test measures were not reported, we calculated the effect size using only post-test means. Only four studies did not include pre-test data (see Table 4). However, work by McKenzie et al. (2016) has demonstrated that mixing change scores with post-test data in meta-analyses gives an unbiased estimate of the effect size when heterogeneity is present and a random effects model is used, as was the case for our study. For follow-up data, we subtracted the pre-test mean from the follow-up mean (see Footnote 4). If multiple follow-up assessments were reported, we used the timepoint closest to 1 month, as this was the most common design across studies. Following the suggestion of Becker (1988), where possible we used the pre-test standard deviation to control for any intervention or practice effects that might affect the post-test variance. For studies reporting post-test only data, we used the post-test standard deviation.
For studies reporting an intention-to-treat analysis, we used the means and standard deviations from the intention-to-treat data. If the intention-to-treat data were not reported for all groups (i.e., Southwell and Gould 2017), we used the data from study completers. All studies were weighted by the inverse of their variance, with studies of smaller variance receiving a greater weight in analyses (see Table 2 for weights of all studies for the overall meta-analysis at post-test).
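For readers who wish to check an individual effect size by hand, the formulas for Hedges’ g and its variance can be sketched in Python as follows. This is an illustrative sketch only: the analyses reported here were run in OpenMEE, and the function and argument names are hypothetical.

```python
import math


def hedges_g(delta_exp, delta_ctl, sd_exp, sd_ctl, n_exp, n_ctl):
    """Return (g, V_g) for one study.

    delta_exp, delta_ctl: mean change scores (or post-test means when no
    pre-test is available) for the experimental and control groups.
    sd_exp, sd_ctl: the standard deviations used for standardization.
    """
    df = n_exp + n_ctl - 2

    # Pooled standard deviation across the two groups.
    sd_pooled = math.sqrt(
        ((n_exp - 1) * sd_exp ** 2 + (n_ctl - 1) * sd_ctl ** 2) / df
    )

    # Cohen's d and the correction factor J; note 4*df - 1 == 4*(n_exp + n_ctl) - 9.
    d = (delta_exp - delta_ctl) / sd_pooled
    j = 1 - 3 / (4 * df - 1)

    # Variance of d, then variance of g as J^2 * V_d.
    v_d = (n_exp + n_ctl) / (n_exp * n_ctl) + d ** 2 / (2 * (n_exp + n_ctl))
    return j * d, j ** 2 * v_d
```

With change scores coded as post-test minus pre-test, a negative g indicates a larger symptom reduction in the gratitude group than in the control group.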

We conducted separate meta-analyses for depression and anxiety, as well as an overall aggregated analysis including both outcomes. For studies that assessed anxiety and depressive symptoms within the same sample, we treated each outcome as independent for the separate depression and anxiety analyses. However, for the overall analysis we aggregated their effect sizes and variances into a single outcome using the method described by Borenstein et al. (2009). This technique ensured we did not double-count these studies. We assumed a correlation of .65 between depression and anxiety scales based on average correlations reported in past research (Dobson 1985). Finally, for studies that included multiple independent groups of gratitude interventions compared to the same control group, we combined the means and SDs across intervention groups into a single effect size for that study, again using the method outlined in Borenstein et al. (2009); see Footnote 5.
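The two aggregation steps can be sketched in Python as follows. The sketch assumes the standard formulas for the mean of two correlated outcomes and for merging two independent groups into one; the function names are hypothetical, and the code is an illustration rather than the procedure run in OpenMEE.

```python
import math


def composite_effect(g_dep, v_dep, g_anx, v_anx, r=0.65):
    """Average a depression and an anxiety effect from the same sample.

    r is the assumed correlation between the two outcomes (.65 here).
    Returns (composite g, composite variance).
    """
    g = (g_dep + g_anx) / 2
    v = 0.25 * (v_dep + v_anx + 2 * r * math.sqrt(v_dep * v_anx))
    return g, v


def combine_groups(n1, mean1, sd1, n2, mean2, sd2):
    """Merge two independent intervention groups into a single group.

    Returns (n, mean, sd) of the pooled sample.
    """
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Within-group sums of squares plus the between-group component.
    ss = (
        (n1 - 1) * sd1 ** 2
        + (n2 - 1) * sd2 ** 2
        + (n1 * n2 / n) * (mean1 - mean2) ** 2
    )
    return n, mean, math.sqrt(ss / (n - 1))
```

Because the two outcomes are positively correlated, the composite variance is larger than it would be if the outcomes were treated as independent, which prevents studies with both measures from receiving undue weight.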

Presumably, gratitude interventions have differing effect sizes based on unique study characteristics such as the duration and type of intervention. Therefore, we decided to use the more conservative random-effects model in our meta-analysis. The random effects model accounts for heterogeneity in effect sizes across studies when calculating a pooled effect. We used the DerSimonian and Laird (1986) estimator to adjust for this heterogeneity in our analyses.
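The random-effects pooling step can be sketched in Python as follows, using the DerSimonian and Laird method-of-moments estimate of the between-study variance. The function name and return values are hypothetical, and the sketch is meant to illustrate the estimator rather than to reproduce OpenMEE’s implementation.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool study effects under a random-effects model.

    effects, variances: per-study effect sizes and their variances.
    Returns (pooled effect, standard error, tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)

    # Fixed-effect mean and Cochran's Q.
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sum_w
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))

    # Method-of-moments between-study variance, truncated at zero.
    if k < 2:
        tau2 = 0.0
    else:
        c = sum_w - sum(wi ** 2 for wi in w) / sum_w
        tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

When the studies are homogeneous the estimated between-study variance is zero and the result coincides with the fixed-effect estimate; otherwise the weights become more equal across studies and the confidence interval widens.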

2.4.1 Corrections for Attenuation

If the reliability coefficient of the outcome measure is known for each study, then the individual effect sizes can be corrected for attenuation due to unreliability prior to conducting the meta-analysis. These corrections will then provide an estimate of the disattenuated population effect size. Following the procedure outlined in Hedges and Olkin (1985), we corrected individual effect sizes by dividing Hedges’ g by the square root of the Cronbach’s alpha value. We then corrected the corresponding variance for each study by dividing the variance by the Cronbach’s alpha value. We report the corrected (disattenuated) effect size estimates in the results section below. Additionally, corrected estimates for moderator analyses can be found in Online Resource 5.
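The correction described above amounts to two divisions per study and can be sketched in Python as follows. The function name is hypothetical and the input check is an added safeguard, not part of the published procedure.

```python
import math


def disattenuate(g, v_g, alpha):
    """Correct one effect size and its variance for outcome unreliability.

    g, v_g: Hedges' g and its variance.
    alpha: Cronbach's alpha for the outcome measure in that study.
    Returns (corrected g, corrected variance).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in the interval (0, 1]")
    return g / math.sqrt(alpha), v_g / alpha
```

Because alpha is at most 1, the corrected effect size is never smaller in magnitude than the observed one, and the corrected variance is never smaller than the observed variance.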

2.5 Description of Moderators

We included several moderator variables for meta-regression analyses. Reference groups for categorical moderators are listed in Table 4. Consistent with the suggestion of Wood et al. (2010), we coded control groups as either waitlist or active controls. Participants in waitlist control groups completed no activities other than submitting symptom measures. Participants in active control groups completed non-gratitude tasks matched to the gratitude interventions in terms of time. If a study contained a waitlist and an active control group, we selected the active control as the comparison. If there were two active control groups, we selected the more neutral of the two (e.g., using early childhood memories rather than early positive childhood memories; Mongrain and Anselmo-Matthews 2012). We predicted that comparisons against active controls would yield smaller effect sizes than comparisons against waitlist controls, as performing some structured activity may confer a greater expectation of benefit (placebo effect) than just completing measures in a waitlist group.

We coded the intervention type according to whether it was primarily interpersonal or intrapersonal in nature, or a combination of the two, following the suggestion made by Davis et al. (2016). Interpersonal interventions involved written and/or verbal expressions of gratitude to another person, such as the gratitude visit. Intrapersonal interventions involved personal reflections on things one has to be thankful for in life, but without instructions to direct the expression of gratitude toward a particular individual (such as gratitude journals and guided gratitude meditations). Combined interventions had participants complete both types of activities in a single group. We expected combined and interpersonal interventions to have larger effects than intrapersonal interventions at post-test (based on results from Seligman et al. 2005). If studies included multiple gratitude intervention groups, such as a separate interpersonal and intrapersonal condition, we excluded them from the moderator analysis for intervention type.

Additional planned moderators for the meta-regression were: (1) Online implementation: whether the study was conducted online or offline (i.e., in-person); (2) Publication status: was the study published or unpublished (i.e., a dissertation or thesis); (3) Depressive symptoms threshold: for depression measures that have published interpretation guidelines, did the sample’s average baseline depressive symptoms meet the published thresholds for a clinically relevant level of depressive symptomatology; (4) Baseline CES-D: the sample’s average baseline score on the Center for Epidemiological Studies Depression scale (the most commonly used depression instrument in our meta-analysis; Radloff 1977); (5) Year of publication; (6) Percentage of female participants; (7) Mean age of participants; (8) Duration of the intervention period, defined by both number of weeks and the number of days on which an activity was actually performed; and (9) whether the researchers included some form of an adherence or compliance check.

Based on the results of previous research (Harbaugh and Vasey 2014; Sin and Lyubomirsky 2009), we expected a larger effect size for more depressed samples, i.e., those meeting the depressive symptoms threshold and with a higher baseline level of depressive symptoms. We also expected a larger effect among samples with older adults, again based on past research by Sin and Lyubomirsky (2009). Finally, we expected a larger effect for published articles, offline studies, interventions of longer duration, and studies that included an adherence check. We had no a priori hypotheses for year of publication or percentage of females. Furthermore, because risk of bias in studies can influence effect sizes in either direction (underestimating or overestimating effects; Higgins et al. 2017), we did not specify directional hypotheses for the risk-of-bias analyses.

We assessed the influence of moderator variables using OpenMEE’s meta-regression feature. For all moderator analyses, we used the overall aggregate effect size (k = 27) to maximize power. The only exceptions were depressive symptoms threshold and baseline CES-D, for which the depression-specific effect size was used.

2.6 Outliers

We identified potential outliers as studies whose 95% confidence intervals lay entirely outside the pooled confidence interval for all studies for each outcome (depression, anxiety, and the overall analysis). The statistical analyses were then repeated with the outliers excluded. As a double check on the identification of outliers, we also used the “influence” procedure contained within the R package “metafor” (Viechtbauer 2010). The influence procedure computes various diagnostic tests to identify outliers by multiple criteria (e.g., Cook’s distance and the influence of each study on the heterogeneity of variance). The influence procedure exactly replicated the results reported below, i.e., it identified the same studies as outliers with no additional outliers identified.
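The confidence-interval rule for flagging outliers can be sketched in Python as follows. The function name is hypothetical, and the use of normal-theory intervals (z = 1.96) for both the study-level and pooled estimates is an assumption of this sketch.

```python
import math


def find_outliers(effects, variances, pooled, pooled_se, z=1.96):
    """Return indices of studies whose CI lies entirely outside the pooled CI.

    effects, variances: per-study effect sizes and their variances.
    pooled, pooled_se: pooled effect size and its standard error.
    """
    pooled_lo = pooled - z * pooled_se
    pooled_hi = pooled + z * pooled_se

    outliers = []
    for i, (g, v) in enumerate(zip(effects, variances)):
        lo = g - z * math.sqrt(v)
        hi = g + z * math.sqrt(v)
        # No overlap: the study interval sits wholly below or wholly above.
        if hi < pooled_lo or lo > pooled_hi:
            outliers.append(i)
    return outliers
```

A study whose interval overlaps the pooled interval at all is retained, so the rule flags only studies that are clearly inconsistent with the pooled estimate.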

2.7 Estimates of Publication Bias

In addition to the risk-of-bias assessment, we used Rosenthal’s (1979) fail-safe N to assess the potential for publication bias. The fail-safe N is the number of additional studies with null results that would be needed to raise the observed p value above a specified alpha level, at which point the effect would no longer be statistically significant. It can be conceived of as an assessment of the “file-drawer problem,” that is, the likelihood that unpublished studies with non-significant results were excluded from the meta-analysis. If a large number of such studies potentially exist, the overall effect size is likely an overestimate of the true effect. A disadvantage of this method is its exclusive focus on p values, which only indirectly account for the effect size that is of primary interest. However, this method has clearly defined guidelines for interpretation and is suitable for meta-analyses of any size. Rosenthal (1979) suggested a fail-safe N value above 5k + 10 reflects results that are tolerant to contradicting studies, where k is the number of studies included in the meta-analysis. Rosenthal noted this is a conservative threshold, meaning that if the fail-safe N is well above this value, there is increased confidence that the observed effect size estimate is trustworthy.
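For readers unfamiliar with the statistic, the following Python sketch shows one common way to compute Rosenthal's fail-safe N from study-level z scores, using the formula N = (Σz)² / z_α² − k with a one-tailed critical value. The function name and the z scores are hypothetical and are not taken from the included studies; the software used for the analysis may differ in detail.

```python
from statistics import NormalDist


def fail_safe_n(z_scores: list[float], alpha: float = 0.05) -> float:
    """Rosenthal's fail-safe N: null studies needed to push the combined
    (Stouffer) one-tailed p value above alpha."""
    if not z_scores:
        raise ValueError("at least one z score is required")
    k = len(z_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = .05
    return sum(z_scores) ** 2 / z_crit ** 2 - k


# Hypothetical z scores for five studies (illustration only).
example_z = [2.1, 1.3, 0.4, 2.8, 1.9]
n_fs = fail_safe_n(example_z)
print(round(n_fs, 2))
```

Because the statistic depends only on the summed z scores and k, it says nothing directly about the magnitude of the pooled effect, which is the disadvantage noted above.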

We did not include a funnel plot test in our study, as funnel plots do not provide valid estimates of publication bias when fewer than 30 studies are included (Lau et al. 2006). Similarly, due to the highly subjective nature of interpreting these plots, the Cochrane Collaboration has advised caution in their use (Sterne et al. 2017). However, interested readers may find the funnel plots for the overall meta-analysis at post-test and follow-up in Online Resource 3. Instead of funnel plots, we conducted a cumulative meta-analysis for the overall effect size at post-test and follow-up. For the cumulative meta-analyses, we followed the procedure described in chapter 13 of Schmidt and Hunter (2015), in which studies are sequentially added to the meta-analysis in order from largest to smallest sample size (i.e., the most precise studies to the least precise). Publication bias is evident if adding the smaller-sample studies causes an increase in the effect size’s magnitude. Forest plots for the cumulative meta-analyses are also included in Online Resource 3.
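The logic of the cumulative procedure can be sketched in Python as follows. This is an illustrative sketch under simplifying assumptions: it pools with fixed-effect inverse-variance weights rather than whatever model the analysis software applied, and the `Study` records are hypothetical values chosen to show the pattern that would signal bias.

```python
from dataclasses import dataclass


@dataclass
class Study:
    label: str
    n: int      # total sample size
    g: float    # effect size (Hedges' g)
    se: float   # standard error of g


def cumulative_pooled(studies: list[Study]) -> list[tuple[str, float]]:
    """Add studies from largest to smallest n and re-pool after each addition."""
    ordered = sorted(studies, key=lambda s: s.n, reverse=True)
    sum_w = 0.0
    sum_wg = 0.0
    results = []
    for s in ordered:
        w = 1.0 / s.se ** 2  # inverse-variance weight (fixed-effect simplification)
        sum_w += w
        sum_wg += w * s.g
        results.append((s.label, sum_wg / sum_w))
    return results


# Hypothetical studies (illustration only): smaller samples report larger effects.
studies = [
    Study("C", 60, -0.50, 0.26),
    Study("A", 500, -0.10, 0.09),
    Study("D", 30, -0.90, 0.37),
    Study("B", 200, -0.20, 0.14),
]
steps = cumulative_pooled(studies)
for label, estimate in steps:
    print(label, round(estimate, 3))
```

In this invented example the pooled estimate grows in magnitude with each smaller study added, which is the pattern interpreted as evidence of publication bias.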

3 Results

3.1 Reliability of Data Extraction

We assessed inter-rater reliability for the extracted data (prior to any discussion of discrepancies) using intra-class correlations (ICCs) for continuous variables and kappas for categorical variables. Reliability for the continuous variables was excellent, with ICCs ranging from .91 to 1.00 across post-test and follow-up periods, with a mean of .99. Reliability for the categorical variables other than the risk-of-bias assessment was likewise good, with kappas ranging from .70 to 1.00 with a mean of .91. Reliability for the risk-of-bias variables ranged from moderate to good, with kappas ranging from .42 to .76 with a mean of .63.
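As a point of reference for the categorical reliability values, the sketch below computes a kappa coefficient in Python. It assumes Cohen's kappa for two raters, which is an assumption about the specific statistic used; the function name and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on categorical codes."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten studies (illustration only); raters disagree once.
rater_a = ["online", "online", "offline", "online", "offline",
           "offline", "online", "online", "offline", "online"]
rater_b = ["online", "online", "online", "online", "offline",
           "offline", "online", "online", "offline", "online"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

Here raw agreement is 90%, but kappa is lower because part of that agreement would be expected by chance given how often each rater used each category.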

3.2 Study Attributes

3.2.1 Sample Characteristics

Full details of study characteristics are presented in Table 1. Twenty-seven studies were included in the meta-analysis, with a grand total of 3675 participants at post-test and 2318 participants at follow-up. Of the total participants at post-test, 2030 participants were in the experimental (gratitude) group, and 1645 were in the control group. The combined sample size for individual studies ranged from 22 to 514. Eighteen studies included an active control group (totaling 1206 participants) and nine studies included a waitlist control (totaling 439 participants). Except for Ozimkowski (2007), all studies included a majority female sample, with the total percentage of females ranging from 55 to 100%. Mean age ranged from 19 to 69, with an average age of 32 years across studies. Thirteen studies (48% of the included studies) included some form of an adherence or compliance check.

Table 1 Characteristics of included studies

Twenty-one studies included only a depression measure, two studies included only an anxiety measure, and four studies included measures of both depression and anxiety. Only two studies included a clinical sample (participants with a diagnosed disorder or seeking treatment for a psychological condition; Kerr et al. 2015; Southwell and Gould 2017). However, of the 18 studies for which published interpretation guidelines were available, 13 (over half of the studies with depression data) met the recommended threshold for a clinically relevant level of depressive symptoms, whereas only five did not meet this criterion. For the 10 studies that reported baseline CES-D data, average scores ranged from 13.53 to 34.15. The average CES-D score across studies was 20.31, which is 4.3 points above the recommended threshold of 16 for a clinically relevant level of depressive symptoms (Lewinsohn et al. 1997).

The overall level of reliability (Cronbach’s alpha) for outcomes was high. Reliability estimates for items on depression measures ranged from .60 to .94 with a mean of .86 (SD = .08). Reliability estimates for items on anxiety measures ranged from .80 to .95 with a mean of .88 (SD = .06).

3.2.2 Intervention Characteristics

The majority of studies used a pre/post-test design, with only four studies reporting post-test only data. Thirteen studies reported follow-up data, with the majority of studies (k = 9) using a 1-month follow-up. Assessing duration by days (on which an activity was performed), the intervention period ranged from 1 to 28 days across studies. Assessing duration by total weeks, the intervention period ranged from less than 1 week to 8 weeks. The majority of studies (k = 19) used an intrapersonal gratitude intervention, four used an interpersonal intervention, two combined both types of interventions into a single condition, and two studies included both types of activities in separate groups and were thus not included in the moderator analysis. Twenty studies were conducted online and seven were conducted offline. The majority of studies (k = 22) were published articles, with only five unpublished theses or dissertations.

3.2.3 Risk of Bias Characteristics

See Online Resource 2 for the full risk-of-bias assessment. The summary score ranged from 0 to 5, with an average of 3.07 (SD = 1.44). Six studies were categorized as low risk, 12 as medium risk, and nine as high risk. No study met all seven criteria. The majority of studies (21/27) met criteria for baseline comparability. About half of the studies met criteria for keeping participants unaware of the condition (14/27 studies). Fewer than half of the studies passed the criteria for attrition (10/27), sequence generation (9/27), or missing data (7/27). Only two studies reported sufficient information about randomization concealment, with the majority of studies (19/27) categorized as N/A. Finally, only seven studies contained a miscellaneous risk of bias (e.g., excluding participants older than 45; Jackowska et al. 2016).

3.3 Post-test Main Effects

Main effects for post-test outcomes are presented in Table 3, and effect sizes for each study are listed in Table 2 (for the overall meta-analysis) and plotted in Figs. 2, 3, and 4. Gratitude interventions had a small but statistically significant effect on depressive symptoms, k = 24, g = − 0.23, SE = 0.05, p < .01, τ² = 0.02. The corrected (disattenuated) depression effect size was g = − 0.24, SE = 0.05, p < .01, τ² = 0.03. For anxiety (k = 5), gratitude interventions had a medium effect that approached statistical significance, g = − 0.52, SE = 0.30, p = .09, τ² = 0.41. The corrected anxiety effect size was g = − 0.55, SE = 0.32, p = .09, τ² = 0.45. For the overall meta-analysis (k = 27), gratitude interventions had a small, statistically significant effect on symptoms of depression and anxiety, g = − 0.29, SE = 0.06, p < .01, τ² = 0.07. The corrected overall effect size was g = − 0.31, SE = 0.07, p < .01, τ² = 0.08.

Table 2 Weights and effect sizes of included studies for the overall meta-analysis at post-test

Fig. 2 Forest plot of included studies for the overall post-test analysis. Squares are individual effect sizes with their corresponding 95% CI indicated by the horizontal lines. The diamond is the overall 95% CI for all studies

Fig. 3 Forest plot of included studies for the depression post-test analysis. Squares are individual effect sizes with their corresponding 95% CI indicated by the horizontal lines. The diamond is the overall 95% CI for all studies

Fig. 4 Forest plot of included studies for the anxiety post-test analysis. Squares are individual effect sizes with their corresponding 95% CI indicated by the horizontal lines. The diamond is the overall 95% CI for all studies

Tests for heterogeneity of effect sizes were significant for all outcomes (depression, anxiety, and the overall effect), suggesting significant variation in effect sizes across studies. For depression, effect sizes ranged from − 0.96 to 0.07. For anxiety, effect sizes ranged from − 1.64 to − 0.02. For the overall analysis, effect sizes ranged from − 1.64 to 0.07. We identified two outliers for post-test outcomes: Geraghty et al. (2010), with an anxiety effect size of g = − 1.64, 95% CI [− 2.08, − 1.20]; and Ki (2009), with a depression effect size of g = − 0.96, 95% CI [− 1.28, − 0.63]. These two studies were outliers for their respective outcomes as well as the overall meta-analysis (see Figs. 2, 3, 4).

After removing these outliers from the dataset, the depression effect size was reduced to a trivial value, but remained statistically significant, g = − 0.17, SE = 0.04, p < .01; this equates to a 26% reduction in magnitude of the effect size. The anxiety effect size was reduced by 69% to a trivial value and was statistically non-significant, g = − 0.16, SE = 0.11, p = .13. For the overall meta-analysis, the effect size was trivial but remained statistically significant, g = − 0.18, SE = 0.04, p < .01, equating to a 38% reduction in magnitude. Notably, after the removal of outliers, all heterogeneity tests became non-significant, suggesting the outliers accounted for the heterogeneity in effect sizes. Excluding these outliers for the corrected effect sizes resulted in the same findings: the corrected g was − 0.18 for depression, − 0.17 for anxiety, and − 0.19 for the overall analysis. Again, all heterogeneity tests became non-significant.
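The percentage reductions quoted above follow from simple arithmetic on the rounded effect sizes, as the short Python check below shows. The function name is hypothetical; the input values are the uncorrected post-test estimates reported in this section before and after outlier removal.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop in effect-size magnitude, in percent."""
    return (abs(before) - abs(after)) / abs(before) * 100


# (g with outliers, g without outliers), as reported for post-test outcomes.
reported = {
    "depression": (-0.23, -0.17),
    "anxiety": (-0.52, -0.16),
    "overall": (-0.29, -0.18),
}
reductions = {
    outcome: round(percent_reduction(before, after))
    for outcome, (before, after) in reported.items()
}
print(reductions)
```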

3.4 Follow-Up Main Effects

Main effects for follow-up outcomes are presented in Table 3, and effect sizes for individual studies are plotted in Fig. 5. Because only two studies reported anxiety follow-up data, we did not calculate an independent effect size for anxiety at follow-up. However, the available anxiety follow-up data were incorporated into the overall meta-analysis. Gratitude interventions had a small, statistically significant effect on depressive symptoms at follow-up, k = 12, g = − 0.24, SE = 0.06, p < .01, τ² = 0.01. The corrected depression effect size was g = − 0.25, SE = 0.06, p < .01, τ² = 0.02. For the overall meta-analysis (k = 13), gratitude interventions also had a small, statistically significant effect, g = − 0.23, SE = 0.06, p < .01, τ² = 0.01. The corrected overall effect size was g = − 0.24, SE = 0.06, p < .01, τ² = 0.02.

Table 3 Main effects of meta-analysis

Fig. 5 Forest plot of included studies for the overall follow-up analysis. Squares are individual effect sizes with their corresponding 95% CI indicated by the horizontal lines. The diamond is the overall 95% CI for all studies

Heterogeneity tests were non-significant. However, a trend was observed for heterogeneity for depression, τ² = 0.01, Q(11) = 17.19, p = .10. For both depression and the overall analysis, effect sizes ranged from − 0.86 to 0.06. We identified one outlier for follow-up outcomes: Cheng et al. (2015) reported a depression effect size of g = − 0.86, 95% CI [− 1.35, − 0.36].

After removing this outlier from the dataset, the depression effect size remained small and statistically significant (p < .01), and it was reduced to a value of g = − 0.20, SE = 0.05, a 17% reduction in magnitude. For the overall meta-analysis, the effect size also remained small and statistically significant (p < .01), and it was reduced to a value of g = − 0.19, SE = 0.05, a 17% reduction in magnitude. After removal of Cheng et al. (2015), the trend toward a significant heterogeneity test for depression was eliminated. Excluding this outlier for the corrected effect sizes resulted in the same findings: the corrected g was − 0.21 for depression and − 0.20 for the overall analysis. Again, the trend toward a significant heterogeneity test was eliminated.

3.5 Meta-regression Analyses

Results of all meta-regression analyses are presented in Table 4. For both the post-test and follow-up time periods, the effect size was significantly moderated by type of control group. The pooled effect size was larger for studies with waitlist control groups than for those using active controls. No other moderators were significant for post-test or follow-up.

Table 4 Meta-regression analyses

Because the effect size differed by type of control group, we also tested interactions between each continuous moderator and the type of control group at post-test and follow-up. No significant interactions were found.

We repeated all moderator analyses with the corrected effect sizes. In all cases, the results were identical to the uncorrected analyses (see Online Resource 5).

3.5.1 Risk-of-Bias Assessment

See Table 4 for results of the risk-of-bias analyses. Neither the summary bias score nor the categorical risk groupings (low, medium, and high risk) significantly moderated the effect size at post-test or follow-up. When we examined individual risk categories, participants’ awareness of condition significantly moderated the effect size at follow-up only. Studies judged to be at high risk of bias for threats to participants’ awareness of condition (k = 4) had a larger pooled effect size (g = − 0.51, SE = 0.15) than studies judged to be at low risk (k = 7; g = − 0.17, SE = 0.06) and studies with insufficient information (k = 2; g = − 0.15, SE = 0.08). No other individual risk categories had significant effects for either post-test or follow-up periods. Repeating analyses with the corrected effect sizes did not change the results.

3.6 Results of Publication Bias Estimates

3.6.1 Fail-Safe N

We computed the fail-safe N using an alpha of .05 for both the overall post-test and follow-up effect sizes. The post-test fail-safe N was 560, and the follow-up fail-safe N was 122. Both of these values are well above the 5 k + 10 guidelines of 145 and 75 for post-test and follow-up, respectively. Therefore, it is unlikely the overall effect sizes of − 0.29 for post-test and − 0.23 for follow-up are overestimations resulting from the file drawer problem.
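The comparison against Rosenthal's guideline can be verified with a few lines of Python. The function names are hypothetical; the fail-safe N values and study counts are those reported above.

```python
def rosenthal_threshold(k: int) -> int:
    """Rosenthal's (1979) tolerance guideline of 5k + 10."""
    return 5 * k + 10


def is_tolerant(fail_safe_n: float, k: int) -> bool:
    """True when the fail-safe N exceeds the 5k + 10 guideline."""
    return fail_safe_n > rosenthal_threshold(k)


# (fail-safe N, number of studies k) as reported for each time point.
reported = {"post-test": (560, 27), "follow-up": (122, 13)}

for period, (n_fs, k) in reported.items():
    print(period, rosenthal_threshold(k), is_tolerant(n_fs, k))
```

The post-test value exceeds its guideline by a wide margin (560 versus 145), whereas the follow-up margin is narrower (122 versus 75).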

3.6.2 Cumulative Meta-analyses

As can be observed in the forest plots in Online Resource 3, there was a trend for the effect size to increase with smaller samples at post-test. This result suggests that there is a possible bias toward publishing less precise studies (i.e., smaller samples) that report larger effects of gratitude interventions at post-test. There was no discernible pattern of bias at follow-up, though the smaller number of studies (13) limits the ability to detect visual patterns.

4 Discussion

The primary aim of our meta-analysis was to determine the effectiveness of gratitude interventions for symptoms of depression and anxiety. Considered altogether, our analyses suggest gratitude interventions are of limited efficacy for reducing these symptoms. The overall effect sizes at post-test (g = − 0.29) and follow-up (g = − 0.23) suggest a small effect according to Cohen’s (1988) guidelines.

The fail-safe N values also suggest these effect sizes are unlikely to be influenced by potential unpublished studies with null results. However, if such studies exist, they would only serve to diminish the effect sizes even further. This argument is supported by the cumulative meta-analyses, which suggest that if publication bias was present in our results, it was likely overestimating the effect size at post-test. Furthermore, excluding outliers reduced the effect sizes to still-smaller values, particularly at post-test.

Additionally, and as predicted, the effect size was smaller when gratitude interventions were compared to active control conditions. Consistent with past reviews (Davis et al. 2016; Lyubomirsky et al. 2009), we found gratitude interventions had a medium effect when compared with waitlist-only conditions, but only a trivial effect when compared with putatively inert control conditions involving any kind of activity.

Finally, it should be noted that the reliability of the dependent variables was generally high. Effect-size estimates are only substantively attenuated when the level of reliability is low (Schmidt and Hunter 2015), and we confirmed this in our dataset. In all cases, the analyses conducted with corrected effect sizes were consistent with the uncorrected estimates, i.e., results changed by only a few hundredths. Therefore, there is no evidence that the small effects obtained in this meta-analysis are a result of attenuation from unreliability. Based on our results, we agree with Davis et al. (2016) that a parsimonious explanation of gratitude interventions may be that they operate primarily through placebo effects, at least for depression and anxiety symptoms.
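To show why high reliability leaves little room for attenuation, the Python sketch below applies the basic correction for unreliability in the outcome measure, dividing the observed effect by the square root of the reliability coefficient. This is a simplified single-study illustration with hypothetical inputs; it is not the exact correction procedure applied to the studies in this meta-analysis.

```python
import math


def disattenuate(g_observed: float, reliability: float) -> float:
    """Correct an effect size for measurement error in the dependent variable."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return g_observed / math.sqrt(reliability)


# Hypothetical observed effect evaluated at two reliability levels.
g_obs = -0.30
high = disattenuate(g_obs, 0.90)  # high reliability: small change
low = disattenuate(g_obs, 0.60)   # low reliability: larger change
print(round(high, 3), round(low, 3))
```

With reliability of .90 the corrected value differs from the observed one by less than two hundredths, whereas reliability of .60 shifts it by almost nine hundredths.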

Excluding our prediction for the type of control group, we found little evidence to support our other hypotheses about moderator variables. Indeed, with one exception, none of the other tests of moderation were significant. The only exception was for participant awareness of condition, which significantly moderated effects at follow-up (but not at post-test). It is possible that among studies containing threats to participants’ awareness of condition, researchers may have unwittingly influenced participants in ways that would favor gratitude interventions, such as communicating an expectation of benefit to the gratitude but not to the control group. However, given the lack of consistency between the post-test and follow-up effects for this criterion, we interpret these results very cautiously. We also advise caution in interpreting the null results of the other categorical moderators, particularly where there was an uneven number of studies in each category. The differential representation of studies between categories may create limited power to detect group differences. For example, only four studies included an interpersonal intervention, whereas 19 studies included an intrapersonal intervention. The substantially lower number of interpersonal interventions makes it difficult to draw conclusions about whether interpersonal interventions are truly equivalent to intrapersonal interventions in their effects. Indeed, evidence from Seligman et al. (2005) would suggest that they are not equivalent, as the interpersonal intervention had a more substantial impact on depression at post-test than the intrapersonal intervention in that study.

5 Limitations

Our results should be interpreted in light of several important limitations. First, and most crucially, the samples included in our meta-analysis were composed mostly of unselected participants, i.e., individuals who were not selected based on severity of depressive or anxiety symptoms. Only two studies included clinical samples (Kerr et al. 2015; Southwell and Gould 2017). Prior research suggests treatment effects may increase with depressive symptom severity. For example, Sin and Lyubomirsky (2009) found PPIs have a greater effect on reducing depressive symptoms among those who meet diagnostic status for a depressive disorder. Likewise, Harbaugh and Vasey (2014) found their gratitude intervention was effective only among those high in baseline depressive symptoms. Therefore, one objection to the results of our meta-analysis may be that the range of depressive symptoms was too restricted for gratitude interventions to have an effect. We think this explanation is unlikely for several reasons. First, over half of the studies in which symptoms of depression were assessed had a sample with baseline depressive symptoms meeting the recommended thresholds for clinically relevant symptoms. We did not find evidence of a stronger effect for those samples in which participants, on average, met the established thresholds. Second, for studies including baseline CES-D data, the average scores ranged from 13.53 to 34.15. Six of the 10 studies reporting baseline CES-D data exceeded the threshold for clinically relevant symptoms of depression (i.e., 16), with an average value across all studies of 20.31. However, we again did not find evidence that the depression effect size was moderated by baseline CES-D severity. Third, the post-test depression effect sizes for the two studies with clinical samples were among the smallest in our meta-analysis (− 0.12 and − 0.06 for Kerr et al. 2015 and Southwell and Gould 2017, respectively). Based on these considerations, it appears that a sufficient range of depressive symptoms was present in our meta-analysis, but the effect of gratitude interventions did not increase with greater depressive symptomatology. Consequently, it seems unlikely that range restriction in our study accounts for the small effect size of gratitude interventions.

That said, we certainly acknowledge that though depressive symptoms exist on a continuum, meeting a threshold for clinically relevant symptoms on a depression measure does not equate to a diagnosis of a depressive disorder made by a mental health professional (American Psychiatric Association 2013). Therefore, our meta-analysis should be interpreted with this distinction between symptoms and diagnostic classes in mind. It is possible that future researchers may find a greater benefit of gratitude interventions for depressive symptoms by limiting recruitment to those meeting diagnostic status for a depressive disorder, which would allow for testing gratitude interventions in a more severely impaired population. However, to date, efforts along this line have yielded mixed results (Celano et al. 2017; Taylor et al. 2017).

Likewise, there may be other moderators that could influence gratitude interventions’ effectiveness that we did not assess, such as one’s level of self-criticism or emotional neediness (Sergeant and Mongrain 2011). The instructions given to participants could also moderate effects. For example, previous research by Sheldon et al. (2012) suggests that variety is an important moderator of PPIs. Thus, if participants were instructed to list three new things they are grateful for each day, it could reduce some of the hedonic adaptation that may occur from repeating the same gratitude list daily. Additionally, it is possible that some participants perceived the instructions differently or performed the activity differently within the same condition of a study, i.e., there may be unreliability in the treatment variable (Schmidt and Hunter 2015). However, the studies in our meta-analysis did not report an interrater reliability coefficient for the treatment variable, thus making it impossible to estimate the effect of treatment reliability on outcomes. Understanding whether instructions, treatment reliability, or other potential moderators increase the effectiveness of gratitude interventions is an important direction for future research. That said, if we base our conclusions on the currently available data, there is little evidence to suggest gratitude interventions are efficacious for reducing symptoms of depression or anxiety. Accordingly, suggestions by researchers to use gratitude interventions as a psychotherapeutic tool (e.g., Emmons and Stern 2013) should be taken cautiously until more substantial benefits can be demonstrated. Consequently, we recommend individuals seek better-established treatments for difficulties with depression or anxiety symptoms until stronger benefits of gratitude interventions are found. For example, meta-analytic evidence suggests that computerized treatments for depression and anxiety, which are relatively low-cost and easily accessible, have strong effect sizes across comparison groups (Andrews et al. 2010).

A second limitation of our meta-analysis is the small number of studies (k = 5) for the anxiety effect size, which leads us to interpret this effect size cautiously. Although the effect size was of a medium magnitude (g = − 0.52), it was statistically non-significant with a wide confidence interval ranging from a strong, beneficial effect to a weak, harmful effect (− 1.11 to 0.08). Indeed, removing the outlier of Geraghty et al. (2010) eliminated the heterogeneity in the effect size and dropped its value to a trivial level of g = − 0.16. Future investigations with measures of anxiety are needed so that meta-analytic estimates will be better powered and less influenced by outliers. Including more anxiety studies would also allow investigators to examine if the anxiety effect is moderated by symptom severity, as we were unable to examine this possibility with the small number of studies with anxiety measures currently available. Nevertheless, based on the current data, it appears gratitude interventions have limited efficacy for anxiety symptoms.

Third, this meta-analysis applies only to gratitude interventions’ specific effects on symptoms of depression and anxiety as standalone interventions. It is not our intention to dismiss the value of gratitude interventions in general. It is entirely possible these interventions are more powerful for anxiety or depressive symptoms when they are integrated into a larger treatment package, as suggested by prior randomized trials with positive psychology exercises (Seligman et al. 2006; Taylor et al. 2017). Additionally, gratitude interventions may have stronger effects for constructs like relationship quality or general well-being, as Dickens’ (2017) meta-analysis would suggest. However, it is important to understand the outcomes for which gratitude interventions are the most efficacious, and then recommend these interventions only when individuals seek to impact those particular outcomes. Indeed, our meta-analysis suggests that whatever merits gratitude interventions have for other outcomes, they are not efficacious for symptoms of depression or anxiety as standalone interventions.

Fourth, there may be limitations for generalizability of our results based on other sample characteristics such as age and sex. Though we did not find evidence that age and sex moderated effects, all but one study included a majority female sample, and only five studies contained a sample with a mean age of 40 or above. Prior research suggests women (Kashdan et al. 2009) and older adults (Sin and Lyubomirsky 2009) experience gratitude in unique ways. Therefore, it is unclear if our results generalize to samples with fewer women and a greater number of older adults.

Finally, our results only apply to gratitude interventions. They do not inform us about the association of gratitude as a general disposition with depression and anxiety. As we mentioned in the introduction, there is strong evidence that higher trait gratitude is associated with reduced psychopathology and greater well-being (Wood et al. 2010). Therefore, gratitude as a general disposition may still be a vital factor in human flourishing, and the efficacy (or lack thereof) of gratitude interventions should in no way be interpreted to mean gratitude is not an important element of well-being and the good life.

6 Conclusion

Based on the currently available data, we find limited evidence for the efficacy of gratitude interventions in reducing symptoms of depression and anxiety. They have a medium-sized effect when compared to no intervention at all, but a small effect when compared with any active control task. Nevertheless, future investigators may discover individual differences that moderate the effectiveness of gratitude interventions, such as severity of psychopathology or qualities like self-criticism and emotional neediness. Such distinctions will be crucial to uncover, especially as exercises like gratitude journals have begun to permeate into popular culture as a means of self-help. We believe that until gratitude interventions are shown to be more powerful, the suggestions to use the existing approaches as tools for reducing symptoms of depression or anxiety should be considered with caution.