Introduction

Extinction learning was first described by Ivan P. Pavlov in the early stages of the experimental study of classical conditioning (Pavlov, 1927). In his extinction experiments, Pavlov found that after conditioned salivation (i.e., conditioned response, CR) to an auditory cue had been established by pairing the auditory stimulus (i.e., the conditioned stimulus, CS) with food (i.e., unconditioned stimulus, US), this conditioned responding decreased when the auditory cue was repeatedly presented in the absence of food (i.e., extinction took place). Since then, the generality of extinction and its related phenomena have been widely studied for theoretical (e.g., Bouton, 1993; Bouton et al., 2021; Delamater, 2004; Delamater & Westbrook, 2014; McConnell et al., 2013; McConnell & Miller, 2014; Miguez et al., 2014a, b; Miller & Laborda, 2011; Nihei et al., 2023; Uengoer et al., 2020), translational, and applied reasons (e.g., Bowers & Ressler, 2015; Crombie et al., 2021; Lipp et al., 2020; O'Malley & Waters, 2018; Quezada et al., 2018; San Martín et al., 2018; Scheveneels et al., 2019; Spix et al., 2021; Waters et al., 2021; Zbozinek & Craske, 2018). In the first case, research has been conducted in order to determine the mechanisms, for instance, behind contextual control (e.g., Harris et al., 2000; Miguez et al., 2014a, b); in the second case, the interest lies in applying its results to potential treatment for several psychological conditions (e.g., Quezada et al., 2018; Shiban et al., 2015).

Regarding the potential applications of extinction, in the last couple of decades, a good deal of attention has been given to the study of extinction as a model for exposure therapy, since extinction learning is likely the most relevant basic phenomenon involved in this type of treatment (e.g., Craske et al., 2018; Waters et al., 2021). Just as with extinction in the laboratory, conditioned responding decreases in exposure therapy when patients are repeatedly confronted with situations and stimuli that, for example, have been previously associated with aversive consequences or with the effect of drugs in the organism (e.g., Laborda & Miguez, 2015; Mellentin et al., 2017). In the case of anxiety-related disorders (Craske et al., 2018), fear and anxiety responses decrease as patients are systematically confronted with their feared stimuli and situations, in the absence of their expected aversive consequences (e.g., exposure to the image of a dog for a phobic patient in the absence of its feared biting). Exposure therapy is the most effective therapeutic approach to most anxiety disorders (e.g., Chambless & Ollendick, 2001; Parker et al., 2018). However, in certain situations, we can expect relapse after treatment, a situation that in the case of fear and anxiety has been termed return of fear (e.g., Rachman, 1989).

In the experimental study of extinction, several situations that resemble and model the return of extinguished responses after exposure therapy have been described (e.g., Bouton, 2002; Vervliet et al., 2013a, b; Vurbic & Bouton, 2014). In a spontaneous recovery situation, extinguished responses return when tested after a post-extinction time interval (e.g., Pavlov, 1927; Rescorla, 2004), and in a renewal situation, extinguished responses return when they are tested out of the context in which extinction took place (e.g., Alfaro et al., 2019; Bouton & Bolles, 1979; Bouton & King, 1983; Polack et al., 2013). For example, ABA renewal occurs if the acquisition of an association occurs in one context (A), extinction in a second context (B), and testing takes place back in the acquisition context (A), while in ABC renewal testing occurs in a third context (context C). If both acquisition and extinction are conducted in the same context and testing in a different one, the design is called AAB renewal. In reinstatement (Rescorla & Heth, 1975), recovery occurs after post-extinction presentations of the US alone, and in rapid reacquisition (Ricker & Bouton, 1996), reacquisition of the extinguished response proceeds faster compared to a newly acquired association.

Many behavioral manipulations have been developed to prevent response recovery after experimental extinction, with the expectation that these manipulations could also hold translational value and are potentially able to reduce relapse after exposure therapy (e.g., Bouton, Woods, et al., 2006b; Laborda et al., 2011; Lipp et al., 2020; see also Laborda et al., 2014, for similar manipulations applied to fear prevention). Among these, research has been focused on the use of retrieval cues for extinction (e.g., Brooks & Bouton, 1993, 1994), which consist of the presentation of a discrete cue during extinction and the recovery test. Massive extinction (i.e., the use of a large number of extinction trials; Denniston et al., 2003; Díaz et al., 2017) has also been examined extensively; other potential manipulations have been comparatively less investigated (e.g., the use of occasional reinforcement during extinction; Bouton et al., 2004). However, not all reports have been positive concerning their effectiveness (see, e.g., Bustamante et al., 2019, for an example in retrieval cues, and Thomas et al., 2009, for massive extinction); thus, it is still of interest whether these techniques are effective, and what factors might affect their results.

Based on the human learning literature, in which training an association in multiple contexts has been shown to improve retrieval (e.g., Smith, 1982, 1985; Smith et al., 1978; for an example in conditioning without extinction see Miguez et al., 2014a, b), Bouton (1991) suggested that training extinction in several different contexts may encourage the potential generalization of extinction learning to other circumstances or contexts, and thus help decrease post-extinction recovery and relapse. Gunther et al. (1998) were the first to empirically evaluate whether extinction in multiple contexts does reduce recovery after extinction training, in a study conducted with rats. In their first experiment, subjects in the experimental group received fear conditioning in one context (A), extinction in three different contexts (BCD), and were finally tested for fear responding in yet another new context (E), that is, an ABC renewal design with extinction in multiple contexts. This group showed significantly less fear recovery than subjects in the control group, which received the same number of extinction trials in only one context (B), demonstrating for the first time that extinction in multiple contexts can reduce the recovery of extinguished responding, in this case preventing the ABC renewal of extinguished conditioned fear. In humans, this technique was first reported by Pineño & Miller (2004), using a predictive learning task. In their first experiment, participants had to avoid “mines” that were predicted by different lights as they were driving a truck with war refugees. Different town names were used as different contexts. For one group, the predictive relationship was trained in one context (A), extinguished in a different one (B), and then tested in a third context (C), while a second group received similar acquisition in one context (A), extinction in three different ones (BCD), and finally testing in a different context (E). 
Finally, a third group received extinction in only one context (A). The results showed that there was ABC renewal in humans, as well as attenuation of recovery after the extinction of predictive learning in multiple contexts, complementing the work that had been developed in non-human animals.

Since then, numerous studies have evaluated whether extinction in multiple contexts could prevent the recovery of extinguished conditioned responses, in different preparations and across different research groups and laboratories, with several reporting positive results (e.g., Bernal-Gamboa et al., 2020; Chelonis et al., 1999; Glautier et al., 2013; González et al., 2016; Olatunji et al., 2017; Shiban et al., 2013). However, some studies have also failed to observe an effect of the treatment on response recovery (e.g., Bouton et al., 2006a, b; Hermann et al., 2020; Neumann, 2006). Thus, it is of interest to integrate and review the existing evidence in order to examine whether extinction in multiple contexts is an effective tool for preventing relapse, and which elements present in the different studies might affect this effectiveness.

Although some researchers have presented partial narrative syntheses concerning this and other techniques to reduce recovery after extinction (e.g., Laborda et al., 2011; Pittig et al., 2016; Weisman & Rodebaugh, 2018), to date, no meta-analysis or systematic review of this effect has been conducted. Recently, Chao et al. (2021) posted a preprint of a protocol aimed at systematically reviewing and integrating the effect of extinction in multiple contexts, but focusing solely on fear recovery; to our knowledge, the results have not yet been published.

Therefore, the aim of the present study was to review and integrate the evidence regarding the effect of conducting extinction in multiple contexts across all available experimental preparations, settings, and experimental subjects and participants. The analysis was conducted by estimating and integrating the effect sizes of the studies and calculating confidence intervals, using a multilevel meta-analytic approach (Van den Noortgate et al., 2015; Fernández-Castilla et al., 2020). We hypothesized, based on the available evidence, that extinction in multiple contexts would be effective at reducing response recovery or relapse, although potentially sensitive to differences in the experimental preparations used to examine it.

The protocol for the present study was not published before the review was conducted, but it is now available as Online Supplementary Material (OSM).

Method

The present research synthesis was performed following PRISMA and APA recommendations for conducting and reporting meta-analytical studies (Appelbaum et al., 2018; Cooper, 2017, 2018; Moher et al., 2009; Page et al., 2021).

Search strategy

The following electronic databases were searched: Web of Knowledge (formerly Web of Science), Scopus, and APA PsycInfo. Additionally, two further searches were implemented in the search for grey literature at PsyArXiv Preprints and the ProQuest Dissertations and Theses Global Database. Finally, the reference lists and previous citations of all selected articles were searched, and all corresponding authors were contacted via email and asked for both unpublished and published results. The final search was performed in June 2023.

Search terms were derived from the existing literature on extinction in multiple contexts that was already known to the authors. The obtained search terms were compared and extended using the online APA Thesaurus of Psychological Index Terms (available at https://www.apa.org/pubs/databases/training/thesaurus), in order to assess their accuracy and obtain potential synonyms. The critical search terms were: “extinction” AND “multiple contexts” AND “recovery”, but we also included variants on each of these terms (Extinction: “exposure” OR “confrontation” OR “outcome interference”; multiple contexts: “several contexts” OR “many contexts”; recovery: “ABC” OR “ABA” OR “AAB” OR “AAC” OR “renewal” OR “spontaneous recovery” OR “reinstatement” OR “rapid reacquisition” OR “relapse” OR “return” OR “resurgence” OR “reoccurrence” OR “context shift” OR “context change”). Truncated search terms were used where appropriate (e.g., “extin*” instead of “extinction”). For the specific search terms used in each database see the OSM.
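The Boolean structure of the search described above can be sketched as follows. This is only an illustration of how the three term groups combine: the helper names are hypothetical, the generic AND/OR syntax is an assumption, and the exact strings used in each database are those reported in the OSM, not the output of this sketch.

```python
# Illustrative sketch only: combines the term groups listed in the text into
# one generic Boolean string. Real database syntax differs per provider.

def or_group(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

# Term groups copied from the search strategy described in the text.
EXTINCTION = ["extinction", "exposure", "confrontation", "outcome interference"]
CONTEXTS = ["multiple contexts", "several contexts", "many contexts"]
RECOVERY = [
    "recovery", "ABC", "ABA", "AAB", "AAC", "renewal", "spontaneous recovery",
    "reinstatement", "rapid reacquisition", "relapse", "return", "resurgence",
    "reoccurrence", "context shift", "context change",
]

def build_query():
    """AND together the three OR-groups (hypothetical helper)."""
    return " AND ".join(or_group(g) for g in (EXTINCTION, CONTEXTS, RECOVERY))

if __name__ == "__main__":
    print(build_query())
```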

Study selection and eligibility criteria

Two independent researchers (JB and MS) selected the studies for the analysis, screening titles and abstracts against the eligibility criteria. In the cases that were deemed relevant or ambiguous, the whole text was screened. After the first independent screening, both reviewers met to compare and resolve discrepancies. In all cases, the final eligibility decision was made by four researchers (JB, MS, GM, and ML).

Studies were selected for the present meta-analysis if they: (a) presented extinction training in multiple contexts; (b) were written in either English or Spanish; (c) presented a post-extinction recovery assessment (e.g., renewal, spontaneous recovery); (d) used an experimental design (i.e., presence of experimental and control conditions, manipulation of an independent variable, and random assignment of subjects or participants to the different conditions); (e) compared training in multiple contexts with training in one context; and (f) reported effect size statistics or, otherwise, enough descriptive data to calculate them. In the cases in which the presence of one or more of these elements was unclear, the authors examined the articles to reach a consensus for inclusion or exclusion. These criteria for study selection also serve to control for bias and/or quality of the individual studies, since they also assess the method and design with which the studies were conducted.

Data collection

Data from all selected articles were extracted and codified by two independent reviewers (JB and MS). For the codification process, a data extraction sheet was developed by four of the researchers (JB, MS, GM, and ML), according to the following: (a) Title; (b) Publication date; (c) Research group and/or laboratory in which the experiment was performed; (d) Experimental subjects (i.e., rodents, humans, etc.); (e) Type of sample (e.g., clinical, pre-clinical, etc.); (f) Experimental tasks (e.g., fear conditioning, taste aversion, etc.); (g) Dependent variables (e.g., number of responses per time unit, latency of response, etc.); and (h) Other elements of the experimental design and/or the task: CS, US, number of trials and sessions, number of extinction contexts, type of recovery, and statistical data. Importantly, an “experiment” was defined, according to the eligibility criteria, as any individual comparison (i.e., measure of recovery) between a multiple and a single context condition; thus, a factorial design that compared, for instance, the effect of multiple contexts extinction on both ABA and ABC renewal, would be analyzed as two different experiments instead of one.

Measures and data analysis

Statistical analyses were performed using Cohen’s d, which is a standardized measure of the difference between two group means (Cohen, 1988). Values between 0.2 and 0.5 are conventionally considered to indicate a small effect size, between 0.5 and 0.8 a medium effect size, and 0.8 and larger indicate a large effect size. When effect size data were not directly available in the study, they were estimated from descriptive data using the online calculator made available by Wilson (n.d.; available on https://campbellcollaboration.org/research-resources/effect-size-calculator.html). When descriptive data were not reported in the text, effect sizes were either calculated from the means and standard errors of the mean (SEM) extracted from the figures using the WebPlotDigitizer online tool (Rohatgi, 2020; available on https://automeris.io/WebPlotDigitizer) or obtained through direct communication with the corresponding author.
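As a minimal sketch of the kind of computation involved when only means and SEMs are available, the standard formulas can be written as below. The function names, the example numbers, and the sign convention (single-context mean minus multiple-context mean, so that a positive value means less recovery after multiple-context extinction) are assumptions made for illustration; the small-sample correction to Hedges' g is the usual approximation and is included only because the Results report Hedges' g.

```python
import math

def sd_from_sem(sem, n):
    """Recover a standard deviation from a standard error of the mean."""
    return sem * math.sqrt(n)

def cohens_d(mean_single, sd_single, n_single, mean_multi, sd_multi, n_multi):
    """Standardized mean difference using the pooled standard deviation.

    Assumed sign convention: single-context minus multiple-context.
    """
    pooled_var = (
        (n_single - 1) * sd_single ** 2 + (n_multi - 1) * sd_multi ** 2
    ) / (n_single + n_multi - 2)
    return (mean_single - mean_multi) / math.sqrt(pooled_var)

def hedges_g(d, n_single, n_multi):
    """Apply the usual small-sample correction factor to Cohen's d."""
    correction = 1 - 3 / (4 * (n_single + n_multi - 2) - 1)
    return d * correction

def label(d):
    """Conventional labels from the text; 'negligible' below 0.2 is an assumption."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

if __name__ == "__main__":
    # Hypothetical values read off a figure: mean response and SEM per group.
    sd_a = sd_from_sem(1.0, 16)
    sd_b = sd_from_sem(1.0, 16)
    d = cohens_d(10.0, sd_a, 16, 6.0, sd_b, 16)
    print(d, hedges_g(d, 16, 16), label(d))
```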

Based on the extracted data, an estimation of Cohen’s d for each critical comparison was calculated using the different dependent variables available for each experiment, that is, any measure that provided a direct comparison between performance after extinction in single and multiple contexts. Thus, all critical comparisons consisted of the standardized difference during a recovery test, of a condition with extinction conducted in multiple contexts, against a condition where extinction was conducted in a single context. All dependent variables were codified, and their effect size computed for the analyses.

Analyses were then performed based on a multilevel approach using the MetaForest package for R (which includes the commonly used metafor package; Viechtbauer, 2010; Van Lissa, 2017), fitting a three-level model to the data. A multilevel meta-analysis is the recommended technique when the effect sizes are dependent and/or nested, and the correlation between variables is unknown (e.g., Fernández-Castilla et al., 2020). This approach is required in this case since experiments in this field usually examine the effect of the manipulation across several dependent variables.

Heterogeneity was examined using Q (Hedges, 1982); a significant Q-value indicates that the studies in the meta-analysis are unlikely to have a common effect size. Alpha was set to .05. An exploratory analysis of theoretically relevant moderators was also conducted using the MetaForest package (Van Lissa, 2017), which applies a machine learning approach based on a tree algorithm to the raw effect sizes. A “tree algorithm” divides the data into groups by selecting one moderator, and finding the value that leads to the most homogeneous post-split group. A homogeneous post-split group is taken to indicate that the division criterion is relevant, and should be retained. This division is repeated after each split, until a criterion is reached or the number of cases is too small to continue. MetaForest conducts the analysis by drawing a number of bootstrap samples on which a tree grows. Then, all predictions by the trees are averaged, thus selecting the categorical moderators which show a consistent effect on the replications. In this case, the retained moderators had to have an effect on more than 50% of the replications (Van Lissa, 2017, 2020).
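The two ideas in this paragraph, Q as a heterogeneity index and a tree split that favors homogeneous post-split groups, can be illustrated with a deliberately simplified sketch. It is not MetaForest: it performs a single deterministic split with fixed-effect weights, without bootstrap samples or averaging over trees. All function names and the four toy records are hypothetical and are not data from the included studies.

```python
def cochran_q(effects, variances):
    """Weighted sum of squared deviations from the inverse-variance pooled effect."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))

def within_group_q(records, moderator):
    """Total Q left inside the groups formed by splitting on one moderator."""
    groups = {}
    for record in records:
        groups.setdefault(record[moderator], []).append(record)
    return sum(
        cochran_q([r["g"] for r in group], [r["v"] for r in group])
        for group in groups.values()
    )

def best_split(records, moderators):
    """Pick the moderator whose split yields the most homogeneous groups."""
    return min(moderators, key=lambda m: within_group_q(records, m))

# Toy records (hypothetical): effect size g, sampling variance v, two moderators.
TOY = [
    {"g": 1.2, "v": 0.1, "subject": "human", "task": "fear"},
    {"g": 1.1, "v": 0.1, "subject": "human", "task": "predictive"},
    {"g": 0.5, "v": 0.1, "subject": "rodent", "task": "fear"},
    {"g": 0.6, "v": 0.1, "subject": "rodent", "task": "predictive"},
]

if __name__ == "__main__":
    print("overall Q:", cochran_q([r["g"] for r in TOY], [r["v"] for r in TOY]))
    print("best split:", best_split(TOY, ["subject", "task"]))
```

In the toy data the effect sizes differ by subject but not by task, so splitting on subject leaves far less heterogeneity within groups. A forest repeats this kind of search over many resampled data sets, which is what makes a moderator's importance a proportion of replications rather than a single yes-or-no decision.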

Following this analysis, the partial dependence of each selected moderator was estimated. “Partial dependence” in this case refers to the relationship of the individual moderator to the effect size, averaged over the values of all other predictors or moderators (Van Lissa, 2017, 2020), and illustrates the impact that the model estimates for the individual moderator on the pooled effect size.

Publication bias

Publication bias was assessed with Egger’s test (Egger et al., 1997), and Trim and Fill (Duval & Tweedie, 2000). Egger’s test examines the potential asymmetry in the distribution of the observed effect sizes using the standard error as predictor; significant asymmetry is taken as indicating publication bias. Trim and Fill, on the other hand, estimates the potential number of non-published studies that would be needed to obtain a symmetric distribution of effect sizes, and calculates a new pooled effect size after filling the distribution. If the new calculated effect size is different from the original one, it indicates the existence of publication bias.
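The regression form of Egger's test described here, with the standard error as predictor, can be sketched as an inverse-variance weighted regression. This is a simplified fixed-effect version that treats every effect size as independent, so it ignores the multilevel structure; the software call the authors actually used is not stated, and the function name and toy data below are hypothetical.

```python
import math
from statistics import NormalDist

def egger_regression(effects, std_errors):
    """Weighted regression of effect size on its standard error.

    Weights are 1 / SE^2. A slope reliably different from zero suggests
    funnel-plot asymmetry. Returns (slope, intercept, z, two-sided p).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    mean_se = sum(w * se for w, se in zip(weights, std_errors)) / total_w
    mean_es = sum(w * es for w, es in zip(weights, effects)) / total_w
    sxx = sum(w * (se - mean_se) ** 2 for w, se in zip(weights, std_errors))
    sxy = sum(
        w * (se - mean_se) * (es - mean_es)
        for w, se, es in zip(weights, std_errors, effects)
    )
    slope = sxy / sxx
    intercept = mean_es - slope * mean_se
    # With known sampling variances, Var(slope) = 1 / sxx.
    z = slope * math.sqrt(sxx)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return slope, intercept, z, p

if __name__ == "__main__":
    # Toy data (hypothetical): smaller studies (larger SE) show larger effects.
    ses = [0.10, 0.20, 0.30, 0.40]
    gs = [0.2 + 1.5 * se for se in ses]
    print(egger_regression(gs, ses))
```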

It is worth noting, however, that traditional publication bias analyses present several limitations within a multilevel meta-analysis (for a thorough analysis, see Fernández-Castilla et al., 2021); thus, the potential informative value of these analyses might also be rather limited.

Results

Study selection

Results of the search can be seen in the following PRISMA flow diagram (Fig. 1). A total of 402 studies were found across all sources, of which 101 were first found on Web of Knowledge, 40 on Scopus, 45 on APA PsycInfo, and 215 on ProQuest Dissertations & Theses Global, and one (consisting of unpublished data) was received directly from one researcher. After eliminating duplicates, 332 studies remained: 101 obtained on Web of Knowledge, two on Scopus, 14 on APA PsycInfo, 214 on ProQuest Dissertations & Theses Global, and one from a researcher. No studies were found in the reference lists. Afterward, these 332 studies were screened by title and abstract; 302 were discarded because no extinction training was included, and the remaining 30 studies were screened by text. Of these 30, four were discarded because effect size statistics were not reported, and it was not possible to extract them from figures or descriptive data; one study was discarded because it did not compare multiple contexts extinction with a single context condition. Thus, the final sample consisted of 25 studies, in English, published between 1998 and 2023, and from which 37 individual experiments (or critical comparisons), with 57 effect sizes, were included in the analyses.

Fig. 1

PRISMA flow diagram showing article search and selection

A summary of the selected studies can be seen in Table 1. Of the 25 studies, seven were conducted in rodents, and 18 in human subjects. Of these human studies, 12 were conducted in the general population, four in pre-clinical samples, and two in diagnosed participants (clinical samples). Regarding the experimental paradigm, 14 studies were conducted using some variant of fear conditioning; the other experimental paradigms present in the sample were predictive learning (four studies), alcohol reactivity or tolerance (three studies), instrumental learning (two studies), conditioned taste aversion (one study), and conditioned disgust (one study). Fifteen studies examined response recovery using an ABC renewal procedure, nine did so using ABA renewal, and three using ABC renewal plus spontaneous recovery. Note that these counts sum to more than 25 because several studies examined more than one recovery procedure.

Table 1 Characteristics of the studies included in the meta-analysis

Effect of extinction in multiple contexts on response recovery

Effect sizes and the corresponding confidence intervals for each experiment are depicted in Fig. 2. The results of the multilevel analysis showed a large aggregate effect size for the 37 experiments and 57 effect sizes, Hedges’ g = 0.92, 95% CI [0.68–1.16]; this means that extinction in multiple contexts effectively reduces post-extinction response recovery, compared to extinction in a single context. The Q-value obtained was 257.71, which is higher than the critical chi-square value (df = 56), indicating that the sample was heterogeneous. The variance distribution across the three levels of the analysis was 18.82% in Level 1 (participants), 32.74% in Level 2 (within-studies), and 48.42% in Level 3 (between-studies).
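The comparison of Q against the critical chi-square value can be checked with a short sketch. The critical value is approximated here with the Wilson-Hilferty formula so that only the standard library is needed; this approximation and the function name are choices made for illustration, not part of the reported analysis. The Q value, degrees of freedom, and alpha level are those given in the text.

```python
from statistics import NormalDist

def chi2_critical(df, alpha=0.05):
    """Approximate upper-tail chi-square critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1 - alpha)
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * h ** 0.5) ** 3

Q_OBSERVED = 257.71  # reported in the text
DF = 56              # 57 effect sizes minus 1

if __name__ == "__main__":
    critical = chi2_critical(DF)
    print(f"critical value (approx.): {critical:.2f}")
    print("heterogeneous:", Q_OBSERVED > critical)
```

For df = 56 the approximation gives a critical value of roughly 74.5, far below the observed Q, which is consistent with the conclusion that the sample was heterogeneous.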

Fig. 2

Estimated effect sizes and confidence intervals of each critical comparison. Note. Lines indicate the effect sizes and the corresponding 95% confidence intervals. The last line represents the integrated effect size. Effect sizes for each dependent variable are reported separately in the same order as in Table 1

Moderator analysis

The exploratory moderator analysis conducted permutations using the categories of “CS exposure,” which divided the studies according to “long,” “medium,” or “short” exposure times based on the CS duration multiplied by the number of extinction trials (“long” consisted of studies with more than 900 s of total exposure; “medium” between 300 s and 900 s of exposure, and “short” under 300 s; “NA” depicts those studies for which exposure time could not be estimated); “experimental subject” (rodents or humans); “type of sample” (clinical, pre-clinical or non-pathological); “experimental task” (fear conditioning, predictive learning, alcohol reactivity/tolerance or instrumental conditioning), and “recovery type” (ABA renewal, ABC renewal, or ABC renewal plus spontaneous recovery). The results of the analyses (shown in Fig. 3A) indicated that “CS exposure” had a recursive variable importance of 86%, that is, it had an effect in 86% of the replications; “experimental subject” had a variable importance of 83%, and “recovery type” of 66%. Both “type of sample” and “experimental task” failed to reach the 50% threshold, with recursive variable importance of 48% and 32%, respectively.

Fig. 3

Moderator analysis (variable importance and partial dependence). (A) The recursive variable importance of each predictor, with the percentage of impact on replications. Greater percentages indicate more impact or variable importance. (B) The partial dependence of each subset of the preselected moderators, indicating the effect size of each category. Points indicate individual effect sizes, and brackets indicate the 95% percentile interval of the predictions of individual trees of the model

A partial dependence analysis further explored the direction of the effect within each of the preselected moderators (Fig. 3B). The analyses regarding CS exposure showed that on average both short (Hedges’ g = 1.14) and moderate exposure times (Hedges’ g = 1.06) were associated with somewhat greater effect sizes than long exposure times (Hedges’ g = 0.97); in the case of experimental subjects, experiments conducted with human participants had overall a greater effect size (Hedges’ g = 1.14) than experiments with rodents (Hedges’ g = 0.75). Finally, regarding the type of recovery, experiments that examined ABC renewal plus spontaneous recovery (Hedges’ g = 0.98) or ABA renewal (Hedges’ g = 1.11) showed greater effect sizes than those conducted in ABC renewal (Hedges’ g = 0.87). It is worth noting, however, that a partial dependence analysis does not conduct hypothesis testing; thus, it does not show whether these differences are significant, only that they are on average numerically different. Moreover, partial dependence analysis is sensitive to sample size, since it reflects the overall impact of a given category on the model, as shown in the individual data points of each moderator (Van Lissa, 2017, 2020).

Publication bias

Egger’s test showed a significant asymmetry in the distribution of observed effect sizes, z = 4.45, p < .0001, indicating potential publication bias. Trim and Fill did not confirm these results; the post-analysis effect size was larger than the original one, Hedges’ g = 1.02, 95% CI [0.80–1.24].

Discussion

The present multilevel meta-analysis examined the effect of extinction in multiple contexts on response recovery. The findings show that, overall, extinction in multiple contexts is effective in reducing response recovery compared to a single context extinction, as expressed in the large pooled effect size. Heterogeneity, measured with Q, was also significant. The exploratory moderator analysis indicated that the categories with impact on the effect sizes were exposure time to the CS, the experimental subject, and the type of recovery.

The main relevance of the present meta-analysis lies in its multi-level approach, which integrated effect sizes obtained from different dependent variables, across different experimental preparations, and with different experimental subjects. The results show that extinction in multiple contexts effectively reduces response recovery compared to extinction in a single context, and with a high level of confidence as shown by the confidence interval around the average effect size. Thus, the present results indicate that extinction (or exposure) in multiple contexts might be a useful tool for clinicians to prevent or reduce relapse after treatment.

These results are, however, qualified by the heterogeneity of the sample, as shown by Q. The high level of heterogeneity in this case means that there are differences between studies that are not explained by either sampling error or by the assumed distribution of the effect sizes (Cooper, 2017), and thus do not allow the assumption that the effect is similar in every context. In this regard, the moderator analysis suggested several sources for this heterogeneity. The moderators with the largest impact on the effect sizes were experimental subject (with a greater effect in humans compared to rodents); type of recovery (with the effect on ABC renewal being smaller than on both ABA renewal and ABC renewal plus spontaneous recovery), and exposure time (with a smaller effect in long exposure time than in both moderate and short exposure). The first result is somewhat counterintuitive in that animal studies should allow more complete experimental control, and less variability in outcomes, than human studies. This should in turn lead to more consistent and larger effects. This assumption seems to be unfounded, at least according to the present data, which suggest that human studies are equally useful in detecting large effect sizes.

Regarding the effect of exposure time, at first glance it is inconsistent with previous evidence suggesting that extinction in multiple contexts with longer extinction or exposure times might be particularly effective in reducing recovery (e.g., Laborda & Miller, 2013; Thomas et al., 2009); however, the present results show that shorter exposure times are also associated with larger effect sizes; that is, both small and moderate amounts of extinction training in multiple contexts are similarly effective. Why this is the case is not clear; further research might aim at examining this issue more systematically. Finally, the moderator analysis also showed that the effect of extinction in multiple contexts was larger when it was tested on ABC renewal plus spontaneous recovery (SR), or on ABA renewal, compared to ABC renewal alone. This is consistent with previous evidence showing that, typically, ABC renewal is weaker than ABA renewal, and sometimes even not observed (e.g., Harris et al., 2000; Neumann, 2006; Üngör & Lachnit, 2006), but summation of two or more recovery phenomena can lead to a greater recovery in responding (e.g., Laborda & Miller, 2013), which would be the case when testing ABC renewal plus SR. Based on these data, one could hypothesize that the treatment should have a greater impact on weaker types of recovery such as ABC renewal, perhaps even eliminating it; however, the present results are more consistent with the opposite view. It is likely that a stronger recovery effect (such as ABA renewal) makes for a better comparison, and makes it easier to observe any decrease in responding after multiple contexts extinction compared to weaker renewal types. From an associative viewpoint there should be a larger effect of multiple contexts extinction on stronger recovery types, as a result of a stronger error correction (e.g., Rescorla & Wagner, 1972).
In other words, a stronger renewal effect would be the result of a stronger recall of the acquisition context compared to extinction; regardless of the associative mechanisms behind it, if extinction in multiple contexts counteracts this recall, its effect should be more pronounced with a stronger renewal effect than with a smaller one, which was observed in these data.

No clear recommendations for clinicians are evident from these analyses. From the comparison between human and animal samples, it can only be surmised that there is a reliable effect of extinction in multiple contexts in human participants across different preparations, and not only in animal studies. The length of exposure time does not offer a clear guideline, since both short and moderate exposure times are associated with a larger effect; neither does the recovery type, considering that in therapy there is usually no report of the type of recovery being treated, and as such, it is not a relevant factor (e.g., Polack et al., 2013).

One relevant point to consider for the moderator analysis lies in the particular statistical tools reported in this meta-analysis. A forest (or tree) algorithm such as the one implemented in these analyses operates by partitioning the data using the categorical moderators as criteria. If a post-split group is homogeneous, it means that the categorical predictor used to split the data is relevant; conversely, if the resulting groups are heterogeneous, it is assumed that the predictor is not relevant and can be discarded. This approach solves several potential limitations of linear models (Van Lissa, 2017, 2020), since it does not assume a particular distribution of the data (i.e., it is non-parametric) and allows more flexibility in the specification of any model. However, these models are also sensitive to sample size and small changes in the data, which can make their interpretation more difficult. For example, of the 25 studies integrated in this meta-analysis, seven were conducted with rodents, and 18 with humans. The moderator analysis reflected this disparity by assigning more impact within the model to human than animal studies, or conversely, less impact to the smallest subset of the data. This might explain to some extent the reported partial dependence, at least in this case. Thus, a direct interpretation of the reported outcomes should be cautious, and consider the different sample sizes (which in some categories were small) and confidence intervals of the predictions.

In addition to the last point, it is worth considering that the effect sizes are not extremely different between subsets of the data, and that the confidence intervals of the predictions tend to overlap; thus, the model predicts (in a qualitative sense) similar effect sizes for all categorical moderators. This can be taken as the manipulation having a general effect regardless of the heterogeneity of the sample; conversely, any experiment aiming at examining extinction in multiple contexts is highly likely to find an effect, as long as it follows some basic recommendations regarding the experimental design and procedure, which were indeed followed by all studies included in this analysis.

On the other hand, a qualitative assessment of the variability within the evidence might help shed light on several issues not considered so far in this discussion. The present meta-analysis integrates experiments conducted in both rodents and humans, with a large diversity of experimental tasks and approaches to the study of extinction and relapse. Although only one of the moderators (experimental subject) can be considered a procedural variable, the coding of the experiments revealed a high variability within each moderator, and also within studies. For instance, although all experiments in fear conditioning with human participants correspond theoretically to the same procedure, all differ in variables such as the dependent measures used, type and length of CS, number and duration of trials, etc. As a result, several categories were not useful for the analysis because of small sample sizes, which might introduce variability that is not necessarily expressed in the present analysis.

On a conceptual level, one final element to consider is that, empirically speaking, there is no certainty that the present meta-analysis integrated a single effect; it is possible that when we compare, for instance, extinction of fear conditioning and of magazine approach, we are effectively comparing different response systems (e.g., Fanselow, 1994; Fanselow & Wassum, 2016) with different features and mechanisms. Arguably, the main approach to this issue in historical terms has been to consider the various learning tasks as producing a single phenomenon with similar underlying mechanisms, which is reflected in the highly diverse studies integrated into this synthesis. One way to improve future syntheses of the evidence in this field would be to standardize the different experimental tasks and procedures to some extent (e.g., Vervliet et al., 2013a, b); whenever possible, the different studies using a given learning task should implement similar parameters and procedures. This would probably improve replicability and comparability in the field and make the assessment of the evidence much easier for both theoretical and applied purposes. Such a standardization effort might take the form, for instance, of a pre-registered set of experiments conducted in different laboratories with similar experimental subjects and parameters, with the aim of assessing the effectiveness of manipulations such as extinction in multiple contexts on recovery and/or relapse in a consistent and probably more definitive manner.

Regarding the limitations of the present synthesis, at least two relevant elements are related to the overall quality and accessibility of the evidence. The first one is the nature of the statistical analyses reported in the literature. Older studies rely almost exclusively on a report of the p-value, while estimates of statistical power and reports of effect size are usually absent, although this changes gradually with more recent studies (see, e.g., Bouton et al., 2006a, b; Gunther et al., 1998, for examples of the first case, and Krisch et al., 2018, for a more recent and thorough statistical report). As a result, most effect sizes had to be estimated from descriptive data either provided in the studies or extracted from their figures. These estimations show, as depicted in Fig. 2, that even in studies that reported a statistically significant effect, confidence intervals often include zero, meaning that in such cases a null or very small effect cannot be ruled out. Researchers in this field and others should strive to report comprehensive descriptive data and statistical analyses, including at least effect sizes with their corresponding confidence intervals.
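To make the estimation step concrete, the following sketch shows one common way of deriving an effect size and its confidence interval from descriptive data. It assumes a standardized mean difference (Cohen's d) for two independent groups and the usual large-sample approximation of its variance; the source does not specify that this exact estimator was used, and the function name `smd_with_ci` and the numbers are hypothetical.

```python
from math import sqrt


def smd_with_ci(m1, sd1, n1, m2, sd2, n2, z=1.96):
    """Standardized mean difference with an approximate 95% CI.

    Assumes two independent groups. The variance formula is the standard
    large-sample approximation, so the interval is only approximate for
    the small groups typical of conditioning experiments.
    """
    pooled_sd = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    half_width = z * sqrt(var_d)
    return d, d - half_width, d + half_width


# Hypothetical descriptive data (e.g., read from a figure); not from any study.
d, ci_low, ci_high = smd_with_ci(m1=10.0, sd1=4.0, n1=8, m2=8.0, sd2=4.0, n2=8)
includes_zero = ci_low <= 0.0 <= ci_high
```

With eight subjects per group, a medium-sized difference still produces an interval that spans zero, which illustrates why small samples leave the presence of an effect uncertain.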

Second, the evidence analyzed in the present synthesis is also limited in the sense that several sources (e.g., conference or congress abstracts) were not included, and the search was limited to electronic databases, even in the case of gray literature. An effort to contact each corresponding author for unpublished data yielded only one result. The results of the search strategy thus do not include most of the potential unpublished data.

Third, the publication bias analysis represents a further limitation of the present meta-analysis. Of the two tests applied to the sample, one (Egger’s test) detected publication bias, while the other (Trim and Fill) did not. Thus, it is not possible to conclude whether there is publication bias in this field. The limitations of traditional bias analyses in multi-level or multivariate meta-analysis are known, and might explain these results. According to Fernández-Castilla et al. (2021), the most used bias analyses (e.g., Sterne & Egger, Funnel plot Test, and Trim and Fill) are overall inadequate, unless several different conditions are met for each technique. For example, their simulations showed that Trim and Fill worked well when the population effect size was moderate to large, there was a high variability among effect sizes, and many effect sizes were included in the analysis. Although it is unclear to what degree the present meta-analysis meets all these conditions, it is likely that this sample does not include enough effect sizes for bias analysis to be reliable.
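For readers unfamiliar with Egger's test, a minimal sketch of the classical regression version follows. It regresses each standardized effect (effect divided by its standard error) on its precision (one divided by the standard error); an intercept far from zero indicates funnel plot asymmetry. The sketch treats all effect sizes as independent, which is exactly the assumption that is violated in multi-level data, and it is not the implementation used in the present analyses. The function name `egger_regression` and the data are hypothetical.

```python
from math import sqrt
from statistics import fmean


def egger_regression(effects, std_errors):
    """Classical Egger regression fitted by ordinary least squares.

    Returns (intercept, slope, t_intercept). Effect sizes are treated as
    independent, so dependency among effects from the same study is ignored.
    """
    x = [1.0 / se for se in std_errors]                   # precision
    y = [e / se for e, se in zip(effects, std_errors)]    # standardized effect
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = sqrt(sse / (n - 2) * (1.0 / n + x_bar ** 2 / sxx))
    # Guard against a perfect fit, where the standard error is zero.
    t_intercept = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, slope, t_intercept


# Hypothetical asymmetric data: less precise studies report larger effects.
std_errors = [0.1, 0.2, 0.3, 0.4, 0.5]
effects = [0.2 + 1.0 * se for se in std_errors]
intercept, slope, t_intercept = egger_regression(effects, std_errors)
```

In these constructed data the effect grows with the standard error, so the intercept is clearly different from zero. With few effect sizes the intercept is estimated imprecisely, which is consistent with the caution expressed above about the reliability of bias analyses in small samples.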

Overall, the data suggest that extinction in multiple contexts should be an effective manipulation for reducing relapse in exposure therapy. Even if we take the heterogeneity of the sample into account, in all sub-samples the effect size varied from moderate to large, and with reliable confidence intervals; thus, the present results suggest that the effect of extinction in multiple contexts is general and that the technique should be recommended to clinicians.