1 Introduction

Systematic reviews and meta-analyses of well-conducted randomized controlled trials (RCTs) that address the same clinical question(s) can provide the highest level of evidence for decision-making on interventions and are vital in the practice of evidence-based medicine. Although meta-analysis constitutes a valuable tool to summarize study-specific results and may reduce both bias and uncertainty from individual studies, its validity depends heavily on the quality, homogeneity, and freedom from bias of the available studies. The two main threats to the validity of a meta-analysis are:

  1.

    The between-study variability beyond random error, termed heterogeneity

  2.

    The phenomenon that small RCTs suggest different, often larger, intervention effects than large RCTs, termed “small-study effects” [13]

A certain degree of variability in study-specific intervention effects is almost always present due to chance, but additional variability may arise for many reasons, including differences in how the studies are conducted and how the intervention effect estimates are measured. There are three different types of heterogeneity:

  1.

    Clinical heterogeneity, which refers to the variability in the participants, interventions, and outcomes

  2.

    Methodological heterogeneity, which reflects the variability in study design and risk of bias

  3.

    Statistical heterogeneity, which refers to the variability in the intervention effects

Statistical heterogeneity is usually a consequence of clinical or methodological variability, or both, among trials and is often called “heterogeneity,” omitting the term “statistical.” The estimation of heterogeneity is an additional aim in meta-analysis, as it improves the interpretation of results and can provide insights into predictions of the summary intervention effect. One of the most widely used statistical methods in meta-analysis is the inverse-variance method, which uses the reciprocal of the within-study variances as study weights. The presence of heterogeneity affects the estimation of the study weights and hence the estimated uncertainty of the summary intervention effect.

A commonly encountered association in meta-analysis is the one between the estimated study-specific intervention effects and the size of the studies; it can have several causes. One possible explanation is that small studies with nonsignificant results are less likely to be published, because journals may preferentially publish, and authors preferentially submit, small studies with significant results. Other explanations include selective outcome reporting (e.g., reporting outcomes with statistically significant results), heterogeneity between small and large studies (e.g., small studies recruiting patients at high baseline risk who would benefit most from the intervention), a mathematical artifact linking the two factors, or simply coincidence.

Several approaches have been proposed to estimate the between-study heterogeneity and small-study effects as a result of selection bias (including publication bias, language bias, citation bias, and reporting bias) [4–6]. This chapter includes a review of the graphical methods, statistical tests, and statistical measures used in pairwise meta-analysis to evaluate homogeneity and selection bias.

2 Approaches for Assessing the Between-Study Heterogeneity

A key aim in meta-analysis is to make inferences about the between-study heterogeneity as its presence can have a considerable impact on the meta-analysis conclusions. There are multiple approaches available to evaluate heterogeneity in meta-analysis, including graphical methods and statistical tests to assess its presence, statistical measures to quantify heterogeneity, and methods to estimate its magnitude. This section discusses several alternatives to appraise between-study heterogeneity in meta-analysis.

2.1 Graphical Representation of the Between-Study Heterogeneity

A visual inspection of graphical representations is commonly the first approach researchers select to assess the variation between study-specific effects due to heterogeneity, beyond what is expected by chance. This is an informal approach but a very useful way to indicate outlier studies, as well as those that might be responsible for the between-study heterogeneity. In the next subsections, we present the graphical displays that have most commonly been used in the meta-analysis literature [7, 8].

2.1.1 Forest Plot

Forest plots (Fig. 12.1) are the most popular plots in meta-analysis; they display the study-specific effect estimates along with their confidence intervals, and at the bottom of the plot, the meta-analysis result is provided [10–12]. The effect measure (e.g., odds ratio) is usually presented on the horizontal axis, allowing detailed study data, such as the number of events and sample size for each study arm, to be plotted alongside the results. However, some authors argue that the effect measure should be presented on the vertical axis, as dependent variables commonly are in statistics [13]. The size of the plotting symbol used to represent the intervention effect is usually selected to be proportional to the inverse of the variance of the study effect estimate. Therefore, more precise estimates (i.e., with smaller variance) are represented by larger plot symbols, highlighting the amount of information that they contribute to the meta-analysis.

Fig. 12.1 Forest plot. Meta-analysis of three randomized controlled trials of histamine H2 receptor antagonists (H2 blockers) in conjunction with acetylsalicylic acid (ASA) therapy for outcome of peptic ulcer (Reproduced with permission [9])

A greater variation in the study-specific intervention effects than would be expected by chance alone suggests there is evidence of between-study heterogeneity. In a forest plot, this is usually indicated by poor overlap of the intervention effects’ confidence intervals.

2.1.2 Galbraith Plots

Galbraith (or radial) plots (Fig. 12.2) are often used to present the results of studies in a meta-analysis and to informally assess between-study heterogeneity [15, 16]. The plot is a scatter plot of the standardized study-specific intervention effects, i.e., the estimated effect measures (e.g., log-odds ratios) divided by their standard errors (SE) (or equivalently the z-scores), on the y-axis, against their inverse SEs (1/SE) on the x-axis. Each study is represented by a single point, and a regression line is drawn corresponding to the pooled fixed-effect meta-analysis estimate. Therefore, the slope of the regression line serves as an estimate of the intervention effect when there are no small-study effects. In addition, the 95 % confidence region of the through-the-origin regression line is depicted by the area between two lines drawn at a vertical distance of \( \pm 2 \) above and below the regression line. Under the assumption that all studies estimate a common (fixed) intervention effect, we expect the majority (95 %) of study points to lie within this confidence region.

Fig. 12.2 Galbraith plot. Log-odds ratio for ischemic heart disease in trials of serum cholesterol reduction by type of intervention (Reproduced with permission [14])

In this representation, studies lying outside the confidence region contribute to between-study heterogeneity. Imprecise intervention effects (small 1/SE, i.e., large SE, typically from small studies) lie close to the y-axis, whereas precise intervention effects are situated further away.
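To make the construction concrete, the following is a minimal sketch in Python (numpy and matplotlib assumed available; the data and variable names are hypothetical, not from the chapter) of how a Galbraith plot can be drawn. Note that an unweighted through-the-origin regression of the z-scores on the precisions has the fixed-effect estimate as its slope.

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([0.35, -0.10, 0.62, 0.20, 0.45])   # hypothetical log-odds ratios
se = np.array([0.12, 0.30, 0.45, 0.18, 0.25])   # hypothetical standard errors

x = 1.0 / se          # precision (1/SE) on the x-axis
z = y / se            # standardized effects (z-scores) on the y-axis

# The fixed-effect estimate equals the slope of the through-the-origin
# regression of z on x: sum(x*z)/sum(x**2) = sum(y/v)/sum(1/v).
m_fe = np.sum(x * z) / np.sum(x ** 2)

grid = np.linspace(0.0, x.max() * 1.05, 100)
plt.scatter(x, z)
plt.plot(grid, m_fe * grid, label="fixed-effect regression line")
plt.plot(grid, m_fe * grid + 2, "--", label="95 % confidence region")
plt.plot(grid, m_fe * grid - 2, "--")
plt.xlabel("1/SE")
plt.ylabel("standardized effect (z-score)")
plt.legend()
plt.show()
```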

2.1.3 L’Abbé Plot

L’Abbé plots (Fig. 12.3) facilitate the examination of whether the intervention effects across studies are homogeneous, but they can be used for dichotomous outcome data only [18]. This type of plot presents the risks (or odds) in the intervention group on the y-axis against those of the control group on the x-axis and often includes the diagonal line of equality and a regression line. The diagonal line of equality indicates that the risks in the control and intervention groups are equal within trials, and the regression line represents the risk ratio (or odds ratio), which is estimated by pooling the results in the meta-analysis. It is advisable that the study points are presented according to the precision of the intervention effect estimates (or study size) to make the plot more informative [7].

Fig. 12.3 L’Abbé plot. Rates of smoking cessation in the intervention and control group (Reproduced with permission [17])

The plot can be used to infer the presence of heterogeneity, specifically where trials are widely spread around the regression line. In the absence of heterogeneity, study points should lie closely around the regression line.

2.1.4 Baujat Plot

Baujat plots (Fig. 12.4) are used to identify studies that influence the overall intervention effect and have an impact on the magnitude of the heterogeneity [19]. The rationale is that excluding an influential study will affect the meta-analytic estimate, and hence this plot assesses which studies cause the between-study heterogeneity and the greatest shifts in the overall intervention effect. The plot presents, on the x-axis, the contribution of each study to the Cochran Q-statistic (see Sect. 12.2.2.1) and, on the y-axis, the influence of each study, defined as the standardized squared difference between the overall intervention effects estimated with and without the ith study under the fixed-effect model. Studies lying in the upper right corner of the plot are the most influential, with the highest contribution to the total heterogeneity.

Fig. 12.4 Baujat plot for a meta-analysis of chemotherapy in head and neck cancer (Reproduced with permission [19])
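A brief sketch of the two Baujat coordinates, again in Python with hypothetical data; q_contrib and influence are our own variable names. Plotting one against the other (e.g., with plt.scatter) gives the Baujat plot.

```python
import numpy as np

y = np.array([0.35, -0.10, 0.62, 0.20, 0.45])   # hypothetical effects
v = np.array([0.02, 0.09, 0.20, 0.03, 0.06])    # within-study variances

w = 1.0 / v
m_fe = np.sum(w * y) / np.sum(w)                # fixed-effect pooled estimate

# x-axis: each study's contribution to Cochran's Q
q_contrib = w * (y - m_fe) ** 2

# y-axis: standardized squared shift of the pooled effect when study i is omitted
influence = np.empty(len(y))
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    m_loo = np.sum(w[keep] * y[keep]) / np.sum(w[keep])   # leave-one-out estimate
    influence[i] = (m_fe - m_loo) ** 2 / (1.0 / np.sum(w[keep]))
```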

2.2 Statistical Tests for the Evaluation of the Between-Study Variance

The most commonly used method to assess the homogeneity assumption in meta-analysis is to carry out a statistical test. Several tests for this evaluation have been suggested in the literature, including the “generalized Cochran Q,” Wald, likelihood ratio, and score tests [20, 21]. A popular choice for the between-study homogeneity assessment in meta-analyses is the Cochran Q-statistic (see Sect. 12.2.2.1) [22]. It has been suggested that among the aforementioned homogeneity tests, the Cochran Q-statistic performs best in terms of type I error for meta-analyses with large studies (e.g., with arm size greater than 640) [21]. The Cochran Q-statistic belongs to the “generalized Cochran between-study variance statistics” family [23], with

$$ Q_a=\sum a_i\left(y_i-\widehat{m}_a\right)^2, $$

where \( y_i \) is the observed intervention effect (e.g., log-odds ratio), index \( i \) refers to the \( i \)th study with \( i=1,\dots,k \), \( a_i \) is the weight assigned to each study, and \( \widehat{m}_a=\left(\sum a_i y_i\right)/\sum a_i \) is the overall intervention effect. Jackson derived the distribution of \( Q_a \) as a linear combination of independent central \( \chi_1^2 \) random variables [24].

2.2.1 Cochran Q-Statistic

The standard test widely used in meta-analysis is the Cochran Q-statistic, testing the hypothesis that all studies share a common true effect (μ) or, equivalently, that the between-study variance (\( \tau^2 \)) is zero [22]. The Cochran Q-statistic is a special form of the “generalized Cochran between-study variance statistic” for \( a_i=1/v_i \), with \( v_i \) the within-study variance in study \( i=1,\dots,k \). Hence, the Q-statistic is the weighted sum of squared differences between the observed study-specific effects and the overall effect across studies derived under the fixed-effect model. Under the null hypothesis, \( H_0: \tau^2=0 \), the Q-statistic approximately follows a \( \chi^2 \)-distribution with \( k-1 \) degrees of freedom and has critical region \( Q>\chi_{k-1,\,1-\alpha/2}^2 \), where \( \alpha \) is the significance level. Several efforts have been made to characterize the distribution of the Q-statistic, including Biggerstaff and Tweedie approximating Q with a gamma distribution, and Biggerstaff and Jackson deriving its exact distribution when \( \tau^2\ne 0 \) [25, 26].
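As a minimal illustration, the Q-statistic and its approximate P-value can be computed as follows (a Python sketch using numpy and scipy; y and v are hypothetical effect estimates and within-study variances, not from the chapter):

```python
import numpy as np
from scipy.stats import chi2

def cochran_q(y, v):
    """Cochran's Q: weighted sum of squared deviations from the
    fixed-effect pooled estimate, with weights a_i = 1/v_i."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v
    m_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - m_fe) ** 2)
    p = chi2.sf(q, df=len(y) - 1)   # approximate chi2(k-1) null distribution
    return q, p

q, p = cochran_q([0.35, -0.10, 0.62, 0.20], [0.02, 0.09, 0.20, 0.03])
# Compare p with the conventional 0.10 threshold rather than 0.05 (see below).
```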

It has been shown that the power of the test to detect heterogeneity depends on the number and size of the studies, as well as on the magnitude of the true between-study variance [21]. Simulation studies suggest that the test has low power when the total information available in the meta-analysis is low (e.g., sparse data, small size and number of studies), and hence a nonsignificant result might erroneously be interpreted as absence of between-study heterogeneity [21, 27]. It is therefore recommended that reviewers use 0.10 as the cutoff level of significance instead of the usual 0.05 [28, 29]. However, a higher cutoff value increases the type I error rate and the risk of drawing false-positive conclusions. The Q-statistic may suggest significant heterogeneity when many studies are included in the meta-analysis, particularly when their sample sizes are very large (see, e.g., Barbui et al., which included over 15,000 participants from 135 studies) [30]. The power of the test may also be limited when the study sizes differ substantially or a single study is much larger than the others in the analysis [27].

2.2.2 Generalized Q-Statistic

Similarly to the Cochran Q, the generalized Q-statistic (\( Q_{\mathrm{gen}} \)) is a special form of the “generalized Cochran” between-study variance statistic for \( a_i=1/\left(v_i+\tau^2\right) \). The \( Q_{\mathrm{gen}} \)-statistic is the weighted sum of squared differences between the observed study-specific effects and the overall effect derived under the random-effects model. Under the null hypothesis that the true between-study variance is equal to a certain amount \( \left(\tau_0^2\ge 0\right) \), \( Q_{\mathrm{gen}} \) follows a \( \chi^2 \)-distribution with \( k-1 \) degrees of freedom and has critical region \( Q_{\mathrm{gen}}>\chi_{k-1,\,1-\alpha/2}^2 \).

To the best of our knowledge, the properties of the test have not been examined, providing an avenue for further work.
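Computationally, \( Q_{\mathrm{gen}} \) differs from the Cochran Q sketch above only in its weights; a minimal sketch with hypothetical data and a hypothesized value \( \tau_0^2 \):

```python
import numpy as np

y = np.array([0.35, -0.10, 0.62, 0.20])   # hypothetical effects
v = np.array([0.02, 0.09, 0.20, 0.03])    # within-study variances
tau0_sq = 0.05                            # hypothesized between-study variance

w = 1.0 / (v + tau0_sq)                   # random-effects-style weights
m_re = np.sum(w * y) / np.sum(w)
q_gen = np.sum(w * (y - m_re) ** 2)       # compare with chi2(k-1) quantiles
```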

2.2.3 Cochran Q-Statistic Adjusted for Small-Study Effects

Rücker et al. extended the Cochran Q-statistic by adjusting for small-study effects [31]. “Small-study effects” refers to the tendency of small studies to show different, often larger, intervention effects than the larger studies (see also Sect. 12.4). The adjusted statistic can be derived as

$$ Q_a^{\mathrm{Adj}}=\sum a_i\left(y_i-\widehat{m}_a^{\mathrm{Adj}}-\frac{\widehat{s}_a}{\sqrt{a_i}}\right)^2, $$

where \( \widehat{m}_a^{\mathrm{Adj}} \) is the summary intervention effect adjusted for small-study effects with \( a_i=1/v_i \), and \( \widehat{s}_a \) represents a potential small-study effect. \( Q_a^{\mathrm{Adj}} \) measures the residual variation with respect to a fixed-effect model allowing for small-study effects, and compared to the Cochran Q-statistic, it holds that \( Q_a^{\mathrm{Adj}}\le Q \). Under the null hypothesis of no between-study heterogeneity, \( Q_a^{\mathrm{Adj}} \) follows a \( \chi^2 \)-distribution with \( k-2 \) degrees of freedom and has critical region \( Q_a^{\mathrm{Adj}}>\chi_{k-2,\,1-\alpha/2}^2 \).

In the presence of small-study effects, it is suggested to use \( Q_a^{\mathrm{Adj}} \) to assess the remaining between-study heterogeneity [31]. The main limitation of the Cochran Q-statistic adjusted for small-study effects is that it depends on the choice of the estimation method for \( \tau^2 \) (see Sect. 12.2.4).

2.3 Statistical Measures to Quantify Between-Study Variance

The statistical tests discussed in Sect. 12.2.2 are only useful for testing the existence of heterogeneity; they do not quantify its extent. To date, several statistical measures have been suggested for quantifying the degree of variability in a meta-analysis that is explained by between-study differences rather than by random error [32–34]. As for every point estimate, apart from quantifying between-study heterogeneity using a statistical measure, it is important to quantify its corresponding uncertainty too. Confidence intervals provide information on the precision and the range of values compatible with the statistical measure for heterogeneity. Methods for constructing the confidence intervals include the variance estimates recovery method [35, 36], the method using the distribution of the \( Q_a \)-statistic [24–26, 32], the method based on the statistical significance of Q [32], the method based on the between-study variance estimator (see Sect. 12.2.4) [5, 32, 37], and the method using a nonparametric bootstrap approach [32].

2.3.1 H² Index

The H² index (also known as the Birge ratio) [38] was presented by Higgins and Thompson [32] and reflects the excess of the observed Q over its expected value, \( E(Q)=k-1 \). The measure reflects the relationship between the between-study and within-study variance and can be obtained by

$$ H^2=\frac{Q}{k-1}=\frac{\widehat{\tau}_{\mathrm{DL}}^2+\widehat{s}^2}{\widehat{s}^2}, $$

where \( \widehat{\tau}_{\mathrm{DL}}^2 \) is the estimated between-study variance using the DerSimonian and Laird [39] estimator and \( \widehat{s}^2 \) is the “typical” within-study variance:

$$ \widehat{s}^2=\frac{\left(k-1\right)\sum \frac{1}{v_i}}{{\left(\sum \frac{1}{v_i}\right)}^2-\sum {\left(\frac{1}{v_i}\right)}^2}. $$

The statistic takes values in the range \( \left[1,\infty \right) \), and in the absence of between-study heterogeneity, it equals 1. Higgins and Thompson [32] note that there is no universal rule to define thresholds for ‘low,’ ‘moderate,’ and ‘high’ heterogeneity for H². However, they suggest that values greater than 1.5 may indicate considerable heterogeneity, whereas values lower than 1.2 may indicate moderate to low heterogeneity.
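The two expressions for H² above can be checked against each other numerically; a short Python sketch with hypothetical data (the equivalence holds as long as the DL estimate is not truncated at zero):

```python
import numpy as np

y = np.array([0.35, -0.10, 0.62, 0.20, 0.45])   # hypothetical effects
v = np.array([0.02, 0.09, 0.20, 0.03, 0.06])    # within-study variances
k, w = len(y), 1.0 / v

m_fe = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - m_fe) ** 2)
H2 = Q / (k - 1)

# Equivalent form via the DL tau^2 and the "typical" within-study variance s^2.
s2 = (k - 1) * np.sum(w) / (np.sum(w) ** 2 - np.sum(w ** 2))
tau2_dl = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
H2_alt = (tau2_dl + s2) / s2    # equals H2 when tau2_dl > 0
```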

2.3.2 I² Index

The I² index reflects the percentage of the total variability in a set of effect measures that is due to between-study variability beyond what is expected from within-study error. The “generalized I² statistics” family [37] can be expressed as

$$ I^2=\frac{\widehat{\tau}^2}{\widehat{\tau}^2+\widehat{s}^2}, $$

where \( \widehat{\tau}^2 \) is the estimated between-study variance using one of the methods suggested in the literature (see Sect. 12.2.4) [5]. The I² index can be expressed as a percentage ranging from 0 to 100 %, where a value of 0 % indicates no observed heterogeneity. The Cochrane Handbook advises avoiding the use of specific thresholds for the interpretation of the I² statistic, as they may be misleading. A general guideline to its interpretation is the following [3]:

  • From 0 to 40 %, may not be important.

  • From 30 to 60 %, may represent moderate heterogeneity.

  • From 50 to 90 %, may represent substantial heterogeneity.

  • From 75 to 100 %, may represent considerable heterogeneity.

  • Note that these guidelines should be used with caution, and the I² index should always be interpreted along with its confidence interval.

2.3.2.1 I² Index Based on the Cochran Q-Statistic

The I² based on the Cochran Q-statistic is the most popular statistic and is usually the default method to quantify heterogeneity in meta-analysis software. The method is a special form of the “generalized I² statistics” using the DerSimonian and Laird approach [39] (see Sect. 12.2.4.1):

$$ I_{\mathrm{DL}}^2=\frac{\widehat{\tau}_{\mathrm{DL}}^2}{\widehat{\tau}_{\mathrm{DL}}^2+\widehat{s}^2}. $$

Alternatively, the method can be presented as

$$ I_{\mathrm{DL}}^2=\frac{H^2-1}{H^2}=\frac{Q-\left(k-1\right)}{Q} $$

in terms of either H² or the Cochran Q-statistic and its degrees of freedom \( \left(k-1\right) \). The I² statistic should be interpreted with caution when the number and size of the studies in the meta-analysis are small (e.g., fewer than ten studies, or studies with fewer than 100 participants) [34, 40, 41]. Simulation studies have shown that \( I_{\mathrm{DL}}^2 \) increases with increasing study size [40, 41] and that it is associated with low power when a small number of studies is included in the meta-analysis [34]. Empirical evidence suggests that care is also needed in interpreting \( I_{\mathrm{DL}}^2 \) when a meta-analysis includes roughly fewer than 500 events, although 95 % confidence intervals for \( I_{\mathrm{DL}}^2 \) have on average good coverage [42].
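A minimal sketch of \( I_{\mathrm{DL}}^2 \) computed directly from Q (Python, hypothetical data); the truncation at zero mirrors the truncation of the DL estimator:

```python
import numpy as np

def i2_dl(y, v):
    """I^2 based on Cochran's Q, as a percentage, truncated at 0."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    k, w = len(y), 1.0 / v
    m_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - m_fe) ** 2)
    return 100.0 * max(0.0, (q - (k - 1)) / q)

print(i2_dl([0.35, -0.10, 0.62, 0.20, 0.45], [0.02, 0.09, 0.20, 0.03, 0.06]))
```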

2.3.2.2 I² Index Based on the Generalized Q-Statistic

The I² based on the generalized Q-statistic is a special form of the “generalized I² statistics” expressed as [37]

$$ I_{\mathrm{PM}}^2=\frac{\widehat{\tau}_{\mathrm{PM}}^2}{\widehat{\tau}_{\mathrm{PM}}^2+\widehat{s}^2}, $$

where \( \widehat{\tau}_{\mathrm{PM}}^2 \) is the estimated between-study variance using the Paule and Mandel estimator (see Sect. 12.2.4.1) [5, 43]. A simulation study suggested that the confidence interval for \( I_{\mathrm{PM}}^2 \) is wider than that for \( I_{\mathrm{DL}}^2 \) and that \( I_{\mathrm{PM}}^2 \) maintains coverage close to the nominal level, in contrast to the \( I_{\mathrm{DL}}^2 \) method [37].

2.3.3 R² Index

An alternative to the H² and I² measures is the R² statistic, which describes the quadratic inflation of the confidence interval for the summary intervention effect under the random-effects model compared to that from the fixed-effect model:

$$ R^2=\frac{\mathrm{Var}\left(\widehat{m}_{\mathrm{RE}}\right)}{\mathrm{Var}\left(\widehat{m}_{\mathrm{FE}}\right)}, $$

where \( \widehat{m}_{\mathrm{RE}} \) is the overall intervention effect under the random-effects model with weights \( a_i=1/\left(v_i+\widehat{\tau}^2\right) \) and \( \widehat{m}_{\mathrm{FE}} \) the overall intervention effect under the fixed-effect model with weights \( a_i=1/v_i \). The statistic takes values in the range \( \left[1,\infty \right) \), and a value of 1 suggests identical inferences under the two meta-analysis models and hence homogeneity across the study-specific effects. It should be noted that R² and H² are equal when all study-specific estimates have equal precision. Since R² is a function of \( \widehat{\tau}^2 \) alone (the weights are assumed to be known), one approach to estimating the confidence interval for R² is via the calculation of the confidence interval for \( \tau^2 \). However, note that approaches based on the Cochran Q-statistic may not be applicable for constructing confidence intervals for R².

2.3.4 D² Index

Wetterslev et al. proposed the D² statistic to quantify the relative change in variance when moving from the random-effects model to the fixed-effect model [33]. The statistic is interpreted as the proportion of the between-study heterogeneity in a meta-analysis relative to the total model variance of the included studies and is given by

$$ D^2=\frac{\mathrm{Var}\left(\widehat{m}_{\mathrm{RE}}\right)-\mathrm{Var}\left(\widehat{m}_{\mathrm{FE}}\right)}{\mathrm{Var}\left(\widehat{m}_{\mathrm{RE}}\right)}=1-\frac{1}{R^2} $$

or equivalently

$$ D^2=\frac{\widehat{\tau}^2}{\widehat{\tau}^2+\widehat{s}_D^2}, $$

where

$$ \widehat{s}_D^2=\frac{\widehat{\tau}^2\,\mathrm{Var}\left(\widehat{m}_{\mathrm{FE}}\right)}{\mathrm{Var}\left(\widehat{m}_{\mathrm{RE}}\right)-\mathrm{Var}\left(\widehat{m}_{\mathrm{FE}}\right)} $$

is the sampling error variance. Although D², similar to I², is interpreted as a percentage (taking values between 0 and 1), a simulation study suggested that D² is equal to or greater than I², irrespective of the chosen effect measure and the number of studies in the meta-analysis [33].
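Both R² and D² follow directly from the variances of the pooled estimates under the two models; a brief Python sketch using the DL estimate of \( \tau^2 \) (any estimator from Sect. 12.2.4 could be substituted), with hypothetical data:

```python
import numpy as np

y = np.array([0.35, -0.10, 0.62, 0.20, 0.45])   # hypothetical effects
v = np.array([0.02, 0.09, 0.20, 0.03, 0.06])    # within-study variances
k, w_fe = len(y), 1.0 / v

m_fe = np.sum(w_fe * y) / np.sum(w_fe)
Q = np.sum(w_fe * (y - m_fe) ** 2)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)))

var_fe = 1.0 / np.sum(w_fe)                # Var of pooled effect, fixed effect
var_re = 1.0 / np.sum(1.0 / (v + tau2))    # Var of pooled effect, random effects

R2 = var_re / var_fe
D2 = 1.0 - 1.0 / R2    # identical to (var_re - var_fe) / var_re
```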

2.3.5 G² Index

Rücker et al. proposed an alternative statistic, called G², to measure between-study heterogeneity while adjusting for small-study effects (see also Sect. 12.4) [31]. The statistic can be obtained by

$$ G^2=1-\frac{{\left[\sum a_i y_i^{\mathrm{Adj}}-\frac{1}{k}\left(\sum \sqrt{a_i}\right)\left(\sum \sqrt{a_i}\,y_i^{\mathrm{Adj}}\right)\right]}^2}{\left[\sum a_i-\frac{1}{k}{\left(\sum \sqrt{a_i}\right)}^2\right]\left[\sum a_i{\left(y_i^{\mathrm{Adj}}\right)}^2-\frac{1}{k}{\left(\sum \sqrt{a_i}\,y_i^{\mathrm{Adj}}\right)}^2\right]}, $$

where \( y_i^{\mathrm{Adj}} \) are the study-specific intervention effect estimates adjusted for small-study effects, \( y_i^{\mathrm{Adj}}=\widehat{m}_{\mathrm{RE}}^{\mathrm{Adj}}+\sqrt{\widehat{\tau}^2/\left(v_i+\widehat{\tau}^2\right)}\left(y_i-\widehat{m}_{\mathrm{RE}}^{\mathrm{Adj}}\right) \), with \( \widehat{m}_{\mathrm{RE}}^{\mathrm{Adj}} \) the summary intervention effect under the random-effects model adjusted for small-study effects, and \( a_i=1/v_i \).

The G² statistic is closely related to the Q-statistic adjusted for small-study effects (see Sect. 12.2.2.3) and is suggested for quantifying heterogeneity in the presence of small-study effects [31]. Similarly to I² and D², G² is interpreted as a percentage (taking values between 0 and 1) and reflects the proportion of the variability in the intervention effects that is not explained by a fixed-effect model allowing for the presence of small-study effects.

2.4 Estimating the Between-Study Variance

An important aspect of meta-analysis is quantifying the extent of between-study heterogeneity. The DerSimonian and Laird (DL) between-study variance estimator is the most commonly implemented approach and is the default in many statistical software packages (e.g., RevMan) [39, 44]. However, its use has often been criticized because the method may underestimate the true between-study variance, thereby producing overly narrow confidence intervals (CIs) for the overall intervention effect, especially for a small number of studies (e.g., \( k<10 \)) [45]. Hence, several alternative methods have been proposed that vary in popularity and complexity. The estimators for \( \tau^2 \) can be categorized as closed-form and iterative methods, and the families presented in the literature to date are:

  1.

    The method of moments estimators (e.g., DL and Paule and Mandel (PM)) [39, 43]

  2.

    The maximum likelihood estimators (e.g., maximum likelihood (ML) [20, 46] and restricted maximum likelihood (REML) [46])

  3.

    The model error variance estimators (e.g., Sidik and Jonkman method) [47]

  4.

    The Bayes estimators (e.g., Rukhin Bayes, full Bayes) [48, 49]

  5.

    The bootstrap estimators [50]

It has been shown that estimating the between-study variance in meta-analyses including only a few studies is particularly inaccurate [50–52]. Therefore, it is recommended to quantify the uncertainty around the point estimates to avoid misleading results. Again, several options exist to quantify the uncertainty in the estimated amount of the between-study variance [20, 24, 53].

In this chapter, we briefly describe the most popular estimators for the between-study variance, as well as those recommended for the most frequently encountered meta-analysis settings. For a comprehensive overview of methods used for estimating the between-study variance and its uncertainty, see Veroniki et al. [5].

2.4.1 Approaches for the Between-Study Variance Point Estimate

2.4.1.1 Method of Moments Estimators

The generalized method of moments (GMM) estimator [23] can be derived by equating \( Q_a \) (see Sect. 12.2.2) with its expected value:

$$ E\left(Q_a\right)=\left(\sum a_i v_i-\frac{\sum a_i^2 v_i}{\sum a_i}\right)+\tau^2\left(\sum a_i-\frac{\sum a_i^2}{\sum a_i}\right) $$

Then, solving for \( \tau^2 \), we obtain

$$ \widehat{\tau}_{\mathrm{GMM}}^2=\max \left\{0,\;\frac{Q_a-\left(\sum a_i v_i-\frac{\sum a_i^2 v_i}{\sum a_i}\right)}{\sum a_i-\frac{\sum a_i^2}{\sum a_i}}\right\} $$

The method of moments estimators presented in the following subsections are special cases of the GMM estimator with varying weights \( a_i \).

2.4.1.1.1 DerSimonian and Laird (DL)

This method is the most frequently used approach for the estimation of the between-study variance, and many software programs have DL as the default method. The DL estimator is a non-iterative method and is a special case of the GMM estimators with study weights \( {a}_i=1\kern0.1em /\kern0.1em {v}_i \).

Simulation studies have suggested that the DL method performs well when the true between-study variance is small or close to zero and the number of studies in the meta-analysis is large, whereas when \( \tau^2 \) is large, DL produces estimates with substantial negative bias [37, 47, 52, 54–56]. The negative bias reported for the DL estimator appears to be related to using effect size measures based on 2 × 2 table data (e.g., odds ratios, risk ratios), where problems arise when very large \( \tau^2 \) values are used in simulation studies. In particular, very large \( \tau^2 \) can lead to extreme values of the effect size measure, at which point many tables will include zero cells, and the accuracy and applicability of the inverse-variance method become questionable. Jackson et al. evaluated the efficiency of the DL estimator asymptotically and showed that DL is inefficient when the studies included in the meta-analysis are of different sizes, particularly when \( \tau^2 \) is large [57]. However, they suggested that the DL estimator performs well and can be efficient for inference on the summary effect when the number of studies included in the meta-analysis is large. The confidence interval for the between-study variance when using the DL method can ideally be estimated using Jackson’s method [24], as the two are based on the same statistical principle and are naturally paired.
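A compact Python sketch of the GMM formula of Sect. 12.2.4.1; with the default weights \( a_i=1/v_i \) it reduces to the DL estimator (the function name and data are ours):

```python
import numpy as np

def tau2_gmm(y, v, a=None):
    """Generalized method-of-moments tau^2 for arbitrary weights a_i;
    a_i = 1/v_i (the default) gives the DerSimonian-Laird estimator."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    a = 1.0 / v if a is None else np.asarray(a, float)
    m_a = np.sum(a * y) / np.sum(a)
    q_a = np.sum(a * (y - m_a) ** 2)
    num = q_a - (np.sum(a * v) - np.sum(a**2 * v) / np.sum(a))
    den = np.sum(a) - np.sum(a**2) / np.sum(a)
    return max(0.0, num / den)   # truncated at zero

tau2_dl = tau2_gmm([0.35, -0.10, 0.62, 0.20], [0.02, 0.09, 0.20, 0.03])
```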

2.4.1.1.2 Paule and Mandel (PM)

Paule and Mandel [43] proposed to profile the generalized Q-statistic (see Sect. 12.2.2.2) until \( Q_{\mathrm{gen}} \) equals its expected value (i.e., \( E\left(Q_{\mathrm{gen}}\right)=k-1 \)). The PM estimator is an iterative method and a special case of the GMM estimator with \( a_i=1/\left(\tau^2+v_i\right) \).

Rukhin et al. showed that, when the assumptions underlying the methods do not hold, the PM method is more robust than the DL estimator, which depends on large sample sizes [58]. It has been shown that the PM method has upward bias when the number of studies and the heterogeneity are small and downward bias when both are large [52], but generally the method is less biased than its alternatives. One simulation study suggested that PM outperforms the DL and REML (see below) estimators in terms of bias [59]. Panityakul et al. [59] showed that the PM estimator is approximately unbiased for large sample sizes, and Bowden et al. [37] showed in their empirical study that, as heterogeneity increases, \( \widehat{\tau}_{\mathrm{PM}}^2 \) becomes greater than \( \widehat{\tau}_{\mathrm{DL}}^2 \). The uncertainty around the between-study variance using the PM method can ideally be estimated using the Q-profile method [53], as the two are based on the same statistical principle and are naturally paired.
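Since \( Q_{\mathrm{gen}} \) is monotonically decreasing in \( \tau^2 \), the PM estimate can be found by simple root-finding; a sketch using scipy's brentq (the bracketing strategy is our own choice):

```python
import numpy as np
from scipy.optimize import brentq

def tau2_pm(y, v):
    """Paule-Mandel tau^2: solve Q_gen(tau^2) = k - 1."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    k = len(y)

    def excess(tau2):   # Q_gen(tau2) - (k - 1), decreasing in tau2
        w = 1.0 / (v + tau2)
        m = np.sum(w * y) / np.sum(w)
        return np.sum(w * (y - m) ** 2) - (k - 1)

    if excess(0.0) <= 0.0:          # no heterogeneity beyond chance
        return 0.0
    upper = max(np.var(y), 1e-6)    # crude upper bracket, enlarged as needed
    while excess(upper) > 0.0:
        upper *= 2.0
    return brentq(excess, 0.0, upper)
```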

2.4.1.2 Maximum Likelihood Estimators

The maximum likelihood estimators are iterative methods and are derived by maximizing the (restricted) log-likelihood function [20, 60]. A limitation of these methods is that their convergence to a solution depends on the choice of maximization technique (e.g., Newton-Raphson, expectation-maximization algorithm).

2.4.1.2.1 Maximum Likelihood (ML)

The method is asymptotically efficient and can be obtained by iterating

$$ \widehat{\tau}_{\mathrm{ML}}^2=\max \left\{0,\;\frac{\sum w_{i,\mathrm{RE}}^2\left({\left(y_i-\widehat{m}_{\mathrm{RE}}\left(\widehat{\tau}_{\mathrm{ML}}^2\right)\right)}^2-v_i\right)}{\sum w_{i,\mathrm{RE}}^2}\right\} $$

and

$$ \widehat{m}_{\mathrm{RE}}\left(\widehat{\tau}_{\mathrm{ML}}^2\right)=\frac{\sum w_{i,\mathrm{RE}}\,y_i}{\sum w_{i,\mathrm{RE}}} $$

until the estimates converge, i.e., they no longer change from one iteration to the next. The study weights are derived under the random-effects model, \( w_{i,\mathrm{RE}}=1/\left(v_i+\widehat{\tau}_{\mathrm{ML}}^2\right) \). An initial estimate of \( \widehat{\tau}_{\mathrm{ML}}^2 \) can be chosen a priori as a plausible value of the heterogeneity variance, or it can be obtained from any non-iterative estimation method. Each iteration step requires truncation at zero to ensure nonnegativity.

Simulation studies have suggested that, although the ML estimator is efficient, it exhibits large negative bias for large \( \tau^2 \) when the number and size of the studies are small (e.g., fewer than 10 studies and fewer than 80 participants in each study) [50–52, 56, 59]. It has been shown that the ML method is more efficient than the PM and REML methods but exhibits the largest amount of bias [51, 52, 60, 61]. Because of this large bias, it is recommended to avoid the ML estimator [56, 59]. The confidence interval for the between-study variance when using the ML method can ideally be computed using the profile likelihood method [1], as the two are based on the same statistical principle and are naturally paired.
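A sketch of the ML fixed-point iteration described above (Python; the starting value and tolerance are our own choices):

```python
import numpy as np

def tau2_ml(y, v, tol=1e-8, max_iter=500):
    """ML tau^2 via fixed-point iteration, truncating at zero each step."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    tau2 = np.var(y)                     # any nonnegative starting value
    for _ in range(max_iter):
        w = 1.0 / (v + tau2)
        m_re = np.sum(w * y) / np.sum(w)
        new = max(0.0, np.sum(w**2 * ((y - m_re) ** 2 - v)) / np.sum(w**2))
        if abs(new - tau2) < tol:        # converged
            return new
        tau2 = new
    return tau2
```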

2.4.1.2.2 Restricted Maximum Likelihood (REML)

The REML method is often used to correct for the negative bias produced by the ML method and can be obtained by

$$ \widehat{\tau}_{\mathrm{REML}}^2=\max \left\{0,\;\frac{\sum w_{i,\mathrm{RE}}^2\left({\left(y_i-\widehat{m}_{\mathrm{RE}}\left(\widehat{\tau}_{\mathrm{REML}}^2\right)\right)}^2-v_i\right)}{\sum w_{i,\mathrm{RE}}^2}+\frac{1}{\sum w_{i,\mathrm{RE}}}\right\}, $$

with study weights derived under the random-effects model, \( w_{i,\mathrm{RE}}=1/\left(v_i+\widehat{\tau}_{\mathrm{REML}}^2\right) \) [39, 52]. The estimator is calculated by an iterative process with a nonnegative initial estimate and, again, each iteration step requires truncation at zero.

Simulation studies suggested that the REML method underestimates the true between-study variance, especially when the data are sparse [47, 52, 54, 56, 62]. For dichotomous outcome data, it has been shown that the REML estimator is less biased, but also less efficient, than the DL estimator [51, 52]. For continuous data, it has been suggested that the REML estimator is less efficient than the ML estimator and comparable to the DL estimator [56]. An empirical study [63] with dichotomous outcome data showed that the REML estimate can be smaller or larger in magnitude than the DL estimate. REML is recommended when large studies are included in the meta-analysis [56]. The uncertainty around the between-study variance when using the REML estimator can ideally be estimated using the profile likelihood method [20].
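The REML iteration mirrors the ML sketch above; only the update changes, through the additional correction term \( 1/\sum w_{i,\mathrm{RE}} \):

```python
import numpy as np

def tau2_reml(y, v, tol=1e-8, max_iter=500):
    """REML tau^2: the ML update plus the bias-correction term 1/sum(w)."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    tau2 = np.var(y)                     # nonnegative starting value
    for _ in range(max_iter):
        w = 1.0 / (v + tau2)
        m_re = np.sum(w * y) / np.sum(w)
        new = max(0.0, np.sum(w**2 * ((y - m_re) ** 2 - v)) / np.sum(w**2)
                  + 1.0 / np.sum(w))
        if abs(new - tau2) < tol:
            return new
        tau2 = new
    return tau2
```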

2.4.1.3 Bayes Estimators
2.4.1.3.1 Full Bayes (FB)

The FB approach takes into account the uncertainty of all parameters (including \( \tau^2 \)) in the results. Several investigators claim that in practice the differences between frequentist and Bayesian approaches appear to be small [60, 64]. The FB method uses non-informative priors to approximate a likelihood-based analysis. When the number of studies is large, the choice of the prior does not have a major influence on the results, since they are data driven. The choice of prior is particularly important, though, when the number of studies is small, as it may impact the estimated between-study variance and hence the overall intervention effect [65, 66].

A simulation study compared 13 different prior distributions for the heterogeneity variance and suggested that the results might vary substantially when the number of studies is small [65]. The study showed that, in terms of bias, none of the distributions considered performed best across all meta-analysis scenarios. More specifically, inverse-gamma, uniform, and Wishart distributions for the between-study variance all perform poorly when the number of studies is small (<10) and produce estimates with substantial bias. An inverse-gamma prior with small hyper-parameters is often considered to be an approximately non-informative prior, but it has been shown that inferences can be sensitive to the choice of hyper-parameters [67, 68]. Informative priors were recently proposed for the between-study variance for the log-odds ratio and standardized mean difference effect measures, and these might considerably improve estimation when few studies are included in the meta-analysis [69–71]. The uncertainty around the between-study variance when using the FB estimator can ideally be described using Bayesian credible intervals.

3 Possible Causes and Approaches to Deal with Heterogeneity

Despite the best efforts of investigators to construct a dataset of carefully selected studies in which the homogeneity assumption would hold, an imbalance in the distribution of effect modifiers might arise, resulting in between-study heterogeneity. The identification of the causes of heterogeneity may help to account for such variation in the results, thereby aiding the interpretation of existing data, as well as the planning of future studies. Between-study heterogeneity may be due to clinical and/or methodological heterogeneity, biases, and chance [3, 72]. Clinical heterogeneity means that variability in intervention or patient-level characteristics, or in the outcomes studied, can influence the intervention effect. Methodological heterogeneity refers to the variability across studies due to study design or quality (e.g., inadequate randomization or allocation concealment, high dropout rates, intention-to-treat versus per-protocol analyses). In addition to biases captured by methodological heterogeneity, there are other biases that might cause between-study heterogeneity, including selection or funding biases. It is also possible that outlier studies show extreme results due to chance (e.g., studies with small sizes and/or event rates).

Quantifying the amount of between-study heterogeneity and exploring its sources are among the most important aspects of meta-analysis. When heterogeneity is identified, the first step researchers should take is to check the data included in the meta-analysis for potential data abstraction errors. If no errors are found and between-study variability beyond chance is still evident, a different choice of effect measure may improve homogeneity. Empirical studies have shown that relative measures (e.g., odds ratio, risk ratio) are associated with less heterogeneity than absolute measures (e.g., risk difference) [73–75]. Heterogeneity might also be due to intervention effect modifiers, whose exploration might include applying subgroup or meta-regression analyses and adjusting the estimated intervention effects accordingly. It should be noted that the use of individual patient data in meta-analysis allows for a thorough investigation of potential sources of heterogeneity and a better evaluation of both within- and between-study heterogeneity, avoiding the assumption that a relationship observed between groups also holds between individuals [76, 77]. For a small to moderate amount of heterogeneity (for a general guideline, see Sect. 12.2.3.2), one can apply the random-effects model, assuming that the true study-specific effects are not identical but come from the same distribution. Under the random-effects model, the between-study variation is taken into account in the meta-analysis results, but this is not a remedy for heterogeneity, as it still exists.

To facilitate the interpretation of the meta-analysis result by capturing both the between-study variance and the variance of the summary intervention effect, a prediction interval for the possible intervention effect in an individual setting can be calculated [78–80]. A prediction interval indicates the range of values within which the true intervention effect of a future study is expected to lie and can be obtained by

$$ \widehat{m}_{\mathrm{RE}}\pm t_{1-\frac{\alpha}{2},\,k-2}\sqrt{\widehat{\tau}^2+\mathrm{Var}\left(\widehat{m}_{\mathrm{RE}}\right)} $$

where \( t_{1-\alpha/2,\,k-2} \) is the 100\( \left(1-\alpha/2\right) \) % quantile of the \( t_{k-2} \) distribution. A prediction interval can be calculated when at least three studies are included in the meta-analysis.
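A direct transcription of this interval into Python (scipy's t distribution; \( \tau^2 \) may come from any estimator in Sect. 12.2.4):

```python
import numpy as np
from scipy.stats import t as t_dist

def prediction_interval(y, v, tau2, alpha=0.05):
    """Prediction interval for the effect in a new setting (requires k >= 3)."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    k = len(y)
    w = 1.0 / (v + tau2)                     # random-effects weights
    m_re = np.sum(w * y) / np.sum(w)
    var_m = 1.0 / np.sum(w)                  # Var of the summary effect
    half = t_dist.ppf(1.0 - alpha / 2.0, df=k - 2) * np.sqrt(tau2 + var_m)
    return m_re - half, m_re + half
```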

4 Methods to Appraise Small-Study Effects

The association between the size and the effect of the studies included in a meta-analysis should be explored, as the presence of selection bias and small-study effects may lead to misleading conclusions. Funnel plots and statistical tests based on funnel plot asymmetry are popular in meta-analysis for assessing small-study effects. Several methods have been suggested to adjust for small-study effects, including the trim-and-fill method, the Copas selection model, and various regression-based approaches (for a review, see Mavridis and Salanti) [6].

4.1 Graphical Representation of Small-Study Effects

Funnel plots facilitate the visual examination of the data for detecting bias or heterogeneity, although it is often not possible to distinguish between the two. A funnel plot (see Fig. 12.5) is a scatter plot of the study-specific intervention effect estimates against a measure of precision or study size. In agreement with forest plots (see Sect. 12.2.1.1) and in contrast to conventional scatter plots, the intervention effect estimates are usually plotted on the x-axis, whereas the study size or precision is plotted on the y-axis [82–84]. It is recommended to plot the SE (or 1/SE) of the intervention effect on the vertical axis, rather than study size, as study power depends on several factors apart from sample size alone (e.g., number of events, standard deviation) [84], and these are summarized by the SE. The plot usually includes a triangular 95 % confidence region and a vertical line corresponding to the summary intervention effect under the fixed-effect model. In the absence of bias and heterogeneity, 95 % of the studies are expected to lie within the triangular region and to be scattered symmetrically around the summary intervention effect. In such a case, the plot resembles a symmetrical inverted funnel. Small studies are expected to lie at the bottom of the graph, more widely spread around the summary intervention effect than larger studies. It is advisable to draw funnel plots when ten or more studies are available in the meta-analysis [7].
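A minimal funnel plot sketch in Python (matplotlib; hypothetical data), with the inverted SE axis and the triangular pseudo 95 % confidence region described above:

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([0.35, -0.10, 0.62, 0.20, 0.45, 0.05])   # hypothetical effects
se = np.array([0.12, 0.30, 0.45, 0.18, 0.25, 0.08])

w = 1.0 / se**2
m_fe = np.sum(w * y) / np.sum(w)             # fixed-effect summary estimate

se_grid = np.linspace(0.0, se.max() * 1.1, 100)
plt.scatter(y, se)
plt.axvline(m_fe, label="fixed-effect estimate")
plt.plot(m_fe - 1.96 * se_grid, se_grid, "--", label="pseudo 95 % limits")
plt.plot(m_fe + 1.96 * se_grid, se_grid, "--")
plt.gca().invert_yaxis()                     # precise studies at the top
plt.xlabel("intervention effect (e.g., log-odds ratio)")
plt.ylabel("SE")
plt.legend()
plt.show()
```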

Fig. 12.5 Funnel plot. Example of symmetrical funnel plot (Reproduced with permission [81])

An asymmetric funnel plot suggests a relationship between the study-specific effect estimates and their precision, which might be due to selection bias (including publication bias, language bias, citation bias, and reporting bias), small-study effects, heterogeneity, sampling variation, or chance [10]. An inappropriate choice of effect measure might also result in an asymmetrical funnel plot. It should be noted that some effect measures (e.g., log-odds ratios and standardized mean differences) are correlated with their SEs, and this may produce artificial funnel plot asymmetry. In the presence of small-study effects, the funnel plot will be asymmetrical, with small studies missing at the bottom right corner for an efficacy outcome (or at the bottom left corner for a safety outcome), i.e., those suggesting an unfavorable effect. Some argue that the visual interpretation of a funnel plot is subjective, and it is sometimes difficult to distinguish between symmetry and asymmetry [85, 86].

Peters et al. proposed a modified version of the conventional funnel plot, in which extra contours representing the statistical significance of each study are added (see Fig. 12.6) [87]. This may aid visual interpretation: if the missing studies seem to come from a “nonsignificance area,” then asymmetry may be due to selection bias. However, if the missing studies come from a “significance area,” or there is a certain direction of the intervention effect, then asymmetry is probably due to factors other than selection bias [81].

Fig. 12.6 Contour-enhanced funnel plot for trials of the effect of intravenous magnesium on mortality after myocardial infarction. Example of asymmetrical funnel plot (Reproduced with permission [81])

4.2 Tests for Small-Study Effects and Selection Bias

4.2.1 Funnel Plot-Based Tests

Apart from assessing small-study effects through visual inspection of funnel plots, several tests have been suggested to statistically assess funnel plot asymmetry. The tests are categorized as (1) rank-correlation tests or (2) linear regression tests. Begg and Mazumdar used a nonparametric rank-correlation method to examine the association between the standardized intervention effect estimates and their SEs [88]. When small studies (with large SEs) tend to have larger intervention effect estimates than the larger studies, the test identifies a correlation between the two factors. However, the test is associated with low power, and Begg suggests using a very liberal significance level (such as 0.10) [89]. Gjerdevik and Heuch suggested modifications of the Begg test based on Spearman’s rho and Kendall’s tau to improve type I and II error rates; they suggested that the test based on Spearman’s rho is preferred for small datasets [90]. Egger et al. proposed a test that is more powerful than the Begg test for assessing funnel plot asymmetry, based on a regression analysis of the Galbraith plot (see also Sect. 12.2.1.2) [83]. The test is based on the weighted linear regression of the standardized intervention effect (z-score) against study precision, with weights equal to the inverse of the variance. The intercept of the regression is used to measure asymmetry; specifically, if it is estimated to be statistically significantly different from 0, then there is evidence of selection bias, and a negative intercept suggests that small-study effects are present. Tang and Liu suggested an alternative test using a linear regression of the intervention effect estimate on 1/\( \sqrt{n} \), with the study sizes n as weights [91].
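A sketch of the Egger test as described above: least squares of the z-scores on the precisions, with the intercept tested against a t distribution with k − 2 degrees of freedom (Python with numpy/scipy; implementation details are ours):

```python
import numpy as np
from scipy.stats import t as t_dist

def egger_test(y, se):
    """Egger's regression: z = y/se regressed on precision 1/se;
    a nonzero intercept indicates funnel plot asymmetry."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    z, prec = y / se, 1.0 / se
    X = np.column_stack([np.ones_like(prec), prec])
    beta, _, _, _ = np.linalg.lstsq(X, z, rcond=None)
    k = len(y)
    resid = z - X @ beta
    sigma2 = np.sum(resid**2) / (k - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stat = beta[0] / np.sqrt(cov[0, 0])        # intercept / its SE
    p = 2.0 * t_dist.sf(abs(t_stat), df=k - 2)
    return beta[0], p   # the slope beta[1] estimates the intervention effect
```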

Several modifications of these tests have been presented in the literature, which apply to dichotomous outcome data only. More specifically, for rank correlation, the test by Schwarzer et al. could be used [92]. For linear regression, several modifications have been proposed, including those by Macaskill et al. [93], Harbord et al. [94], Peters et al. [95], and the “arcsine” test by Rücker et al. [96]. For all aforementioned tests, a cutoff P-value of 0.10 is considered to infer asymmetry in the funnel plot.

More specifically, the test proposed by Macaskill et al. is a linear regression of the intervention effect estimate on n, with weights \( m_{\mathrm{E}}m_{\mathrm{NE}}/n \), where \( m_{\mathrm{E}} \) and \( m_{\mathrm{NE}} \) represent the total number of events and nonevents, respectively [93]. Harbord et al. [94] presented a modified version of the test proposed by Egger et al. [83], based on the efficient score (\( Z=a-m_{\mathrm{E}}n_{\mathrm{E}}/n \)) and its variance (\( V=n_{\mathrm{E}}n_{\mathrm{C}}m_{\mathrm{E}}m_{\mathrm{NE}}/\left(n^2\left(n-1\right)\right) \)) of the log-odds ratio, where \( n_{\mathrm{E}} \) and \( n_{\mathrm{C}} \) are the sample sizes of the experimental and control groups, respectively. Peters et al. [95] suggested a slightly modified test compared to the Macaskill et al. [93] test, using the log-odds ratio effect measure and a linear regression of the intervention effect estimate on 1/n, with weights \( m_{\mathrm{E}}m_{\mathrm{NE}}/n \), for better control of the type I error. Schwarzer et al. [92] suggested a rank-correlation test for sparse data, using the mean and variance of the noncentral hypergeometric distribution and avoiding the correlation between the log-odds ratio and its SE. However, for large between-study heterogeneity, this test has low power compared to the other tests [92]. Although the tests by Harbord et al. [94], Peters et al. [95], and Schwarzer et al. [92] have been presented using the odds ratio effect measure, they can be applied to other effect measures too. However, for a dichotomous outcome and the log-odds ratio or log-risk ratio, the intervention effect is statistically dependent on its variance, and hence tests based on these two factors might erroneously suggest the presence of small-study effects. Rücker et al. [96] suggested a test based on the arcsine transformation of the observed risks, avoiding false-positive results when a large intervention effect or substantial between-study heterogeneity is present. In contrast to the other tests, the one suggested by Rücker et al. [96] can accommodate studies with zero events in both arms.
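For concreteness, a sketch of the Peters variant described above (weighted regression of the log-odds ratio on 1/n, with weights \( m_{\mathrm{E}}m_{\mathrm{NE}}/n \), the slope being tested); the weighted least squares machinery is our own implementation:

```python
import numpy as np
from scipy.stats import t as t_dist

def peters_test(y, n, m_e, m_ne):
    """Peters' test: WLS of effect (log-OR) on 1/n with weights m_E*m_NE/n;
    a nonzero slope indicates small-study effects."""
    y, n = np.asarray(y, float), np.asarray(n, float)
    w = np.asarray(m_e, float) * np.asarray(m_ne, float) / n
    X = np.column_stack([np.ones_like(n), 1.0 / n])
    root_w = np.sqrt(w)
    Xw, yw = root_w[:, None] * X, root_w * y          # scale rows by sqrt(w)
    beta, _, _, _ = np.linalg.lstsq(Xw, yw, rcond=None)
    k = len(y)
    sigma2 = np.sum((yw - Xw @ beta) ** 2) / (k - 2)
    cov = sigma2 * np.linalg.inv(Xw.T @ Xw)           # sigma^2 (X'WX)^(-1)
    t_stat = beta[1] / np.sqrt(cov[1, 1])
    p = 2.0 * t_dist.sf(abs(t_stat), df=k - 2)
    return beta[1], p
```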

Sterne et al. [81] advise using regression tests to assess selection bias and small-study effects, as they have greater power than rank tests, and avoiding tests for funnel plot asymmetry altogether if all studies are of similar size and hence of similar SE. The Egger test has greater power for continuous outcomes than for dichotomous outcomes and is suggested for testing funnel plot asymmetry with continuous outcomes. For dichotomous outcomes, the Harbord, Peters, and Rücker tests are suggested, as they have greater power than the other tests and avoid the mathematical association between the log-odds ratio and its SE. It should be noted, though, that the performance of the tests deteriorates as the between-study heterogeneity increases. A general recommendation is to select one of the Harbord, Peters, or Rücker tests for small heterogeneity (\( \tau^2<0.1 \)) and to use the Rücker test for large heterogeneity (\( \tau^2>0.1 \)) [3, 81].

4.3 Adjusting Intervention Effect Estimates for Small-Study Effects

4.3.1 Trim-and-Fill Method

The trim-and-fill method is a nonparametric method and aims to correct for funnel plot asymmetry due to small-study effects. The method is a four-step process:

  1.

    The smaller studies are “trimmed” (i.e., removed) so that a symmetrical funnel plot is produced.

  2.

    The summary intervention effect from the “trimmed” funnel plot is estimated.

  3.

    The omitted studies are returned to the funnel plot and their “missing counterparts” are imputed or “filled” as their mirror images.

  4.

    An adjusted overall intervention effect with its corresponding confidence interval is estimated using the complete set of studies [97, 98].

The method provides an estimate of both the number of missing studies and the summary intervention effect adjusted for selection bias. Although no assumptions are required about the mechanism leading to selection bias, the trim-and-fill method assumes that the small-study effects are solely caused by selection bias and that in truth the funnel plot should be symmetric. The adjusted intervention effect should therefore be interpreted with caution, as it is not necessarily the intervention effect that would have been observed in the absence of selection bias.
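The following is a simplified Python sketch of the trim-and-fill iteration, using an L0-type estimator of the number of missing studies; the handling of the missing side, variable names, and iteration limits are our own simplifying assumptions, and a validated implementation in standard meta-analysis software should be preferred in practice.

```python
import numpy as np
from scipy.stats import rankdata

def trim_and_fill(y, v, missing="left", max_iter=100):
    """Simplified trim-and-fill with an L0-type estimator of the number k0
    of studies presumed missing on the `missing` side of the funnel."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    k, w = len(y), 1.0 / v
    sign = 1.0 if missing == "left" else -1.0
    z = sign * y                          # flip so the excess is on the right
    k0 = 0
    for _ in range(max_iter):
        order = np.argsort(z)
        keep = order[: k - k0]            # "trim" the k0 most extreme studies
        mu = np.sum(w[keep] * z[keep]) / np.sum(w[keep])
        d = z - mu
        T = rankdata(np.abs(d))[d > 0].sum()
        k0_new = max(0, int(round((4.0 * T - k * (k + 1)) / (2.0 * k - 1))))
        k0_new = min(k0_new, k - 1)       # keep at least one study
        if k0_new == k0:
            break
        k0 = k0_new
    # "fill": mirror the k0 trimmed studies about the trimmed estimate
    mirror = order[k - k0:]
    z_all = np.concatenate([z, 2.0 * mu - z[mirror]])
    v_all = np.concatenate([v, v[mirror]])
    w_all = 1.0 / v_all
    mu_adj = np.sum(w_all * z_all) / np.sum(w_all)
    se_adj = np.sqrt(1.0 / np.sum(w_all))
    return sign * mu_adj, se_adj, k0      # adjusted effect, its SE, and k0
```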

Simulation studies have shown that the method performs well in the presence of selection bias, but it underestimates the intervention effect when there is large between-study heterogeneity and no selection bias [99, 100].

4.3.2 Selection Models

To evaluate the potential impact of missing studies on the results of a meta-analysis, selection models have been suggested that account for the mechanism by which studies are published. Selection models assume that missing studies are not missing at random and that certain characteristics of the observed studies (e.g., sample size, quality of design) increase their propensity for publication. These models associate each observed study with an a priori probability of being published and then estimate the summary intervention effect from the distribution of the observed sample.

A popular selection model in meta-analysis is the one developed by Copas [101], in which the probability that a study is observed depends on its SE. Although selection models correct effect estimates for selection bias, they have not been widely used, probably because of their complexity, the large number of studies needed, and the strong modeling assumptions about the severity of selection bias (i.e., that the factor causing small-study effects is selection bias). Copas [101] suggested applying a sensitivity analysis so that the researcher has a full picture of the estimated intervention effect (and its uncertainty) under a range of assumptions about the severity of selection bias. It has alternatively been suggested to use expert opinion to inform the probabilities of publication [102]. The Copas selection model accounts for the correlation between the observed intervention effect and the probability that a study is published, which is:

  1.

    Zero in the absence of selection bias.

  2.

    Positive for a large intervention effect and large propensity for publication (e.g., for safety outcomes).

  3.

    Negative for a large intervention effect and small propensity for publication (e.g., for efficacy outcomes; harms are less likely to be studied in trials and hence less likely to be published) [101, 103, 104].

Empirical studies using large collections of meta-analyses with dichotomous data suggest that the Copas selection model is preferable to the trim-and-fill method, as the latter produces systematically larger SEs and P-values [105, 106].

4.3.3 Extrapolation Methods

Extrapolation approaches model the relationship between the observed intervention effects and a measure of their uncertainty (e.g., SE). Stanley [107] and Copas and Malley [108] are early proponents of the regression-based approaches, with Stanley [107] adjusting the estimated intervention effect and Copas and Malley [108] adjusting the P-values for small-study effects. The approach suggested by Moreno et al. [109, 110] regresses the study-specific effects against their precision and computes the “unbiased” intervention effect as the extrapolation of the regression line to predict the intervention effect in a study with infinite sample size (or zero SE). The slope of the meta-regression is used to test for funnel plot asymmetry (see also Sect. 12.4.2.1), and the intercept is interpreted as the estimated intervention effect of a study with infinite sample size and hence infinite precision, adjusted for selection bias.

A key concern with these methods, as already stated in Sect. 12.4.2.1, is the mathematical association between some effect measures (e.g., the log-odds ratio) and their SEs, which might erroneously suggest the presence of small-study effects. Also, the performance of these methods depends on the variability of the study sizes in the meta-analysis; if, for example, all studies are small, then the methods will not perform well. The regression-based methods, like any meta-regression model, suffer from lack of power to detect existing associations when few studies are available and in the presence of substantial heterogeneity. Simulation studies suggest that extrapolation within funnel plots outperforms the trim-and-fill method, but the adjusted effect estimates should still be interpreted with caution [4, 109].
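A minimal sketch of one regression-based variant in the spirit of Moreno et al.: weighted regression of the effects on their variances (weights 1/v), with the intercept read off as the effect of a hypothetical study of infinite size. The exact regressor and weighting scheme vary between proposals, and ours is only one illustrative choice.

```python
import numpy as np

def limit_estimate(y, se):
    """Extrapolation sketch: WLS of effect on variance; the intercept
    predicts the effect at SE -> 0 (an 'infinitely large' study)."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    v = se ** 2
    root_w = np.sqrt(1.0 / v)
    X = np.column_stack([np.ones_like(v), v])
    beta, _, _, _ = np.linalg.lstsq(root_w[:, None] * X, root_w * y, rcond=None)
    return beta[0]   # adjusted summary effect; beta[1] reflects small-study effects
```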

5 Moderators and Confounders

The impact of moderators and confounders is best viewed in light of the prior sections on heterogeneity and small-study effects, as any meaningfully important moderator or confounder is likely to have an impact on the homogeneity and symmetry of effects. The typical approaches to moderators and confounders include subgroup analyses and regression methods, which can be undertaken in the context of a meta-analysis as well as in more comprehensive overviews of reviews. As always, it remains important to recognize the presence of clustering and to minimize, especially in umbrella reviews, the risk of duplicate entry of trials with multiple arms, as this may bias the accuracy and precision of the overall estimates.

6 Discussion

This chapter has illustrated a wide range of approaches to evaluate the presence and estimate the magnitude of between-study heterogeneity, as well as a variety of methods to test and adjust for small-study effects, which can readily be extended to the analysis of key moderators and confounders. Heterogeneity and selection bias are two of the greatest threats to the validity of meta-analysis and may lead to misleading and/or overoptimistic intervention effect estimates. Researchers should routinely address and explore reasons for their presence and assess the extent to which these may influence the meta-analysis results.

Recent methodological research supports the use of the random-effects model when conducting a meta-analysis, because it accounts for between-study heterogeneity [3, 111, 112]. The random-effects model is considered more realistic than the fixed-effect model in most contexts. Newer methodologies in meta-analysis allow heterogeneity, small-study effects, and general funnel plot asymmetries to be incorporated and adjusted for as part of the modeling, so that they are also reflected in the results. As presented in this chapter, both heterogeneity and selection bias can be examined using graphical methods, statistical tests, and subgroup and meta-regression analyses.

When selection bias is present, it is advisable that researchers make efforts to reduce (or, if possible, eliminate) it, for example, by identifying unpublished or difficult-to-locate material from the “gray” literature for potential inclusion in the meta-analysis [113]. Also, exploration of heterogeneity should always take place when conducting a meta-analysis, but its results should be interpreted with caution if individual participant data are not used in the statistical modeling. When few studies are included in a meta-analysis, we suggest conducting sensitivity analyses using a variety of methods for addressing heterogeneity and small-study effects before reaching definitive conclusions.