Introduction

Recent work has considerably advanced the definition, identification and estimation of different controlled or natural effects in causal mediation analysis. As a result, various estimation methods such as linear structural equation modeling, outcome and mediator regression-based methods, parametric g-computation, inverse-probability-weighted (IPW) fitting of marginal structural models (MSMs), and sequential g-estimation [1, 2] have been applied to mediation settings. With some notable exceptions [35], relatively few approaches can incorporate general types of mediator and outcome variables. The simulation-based approach introduced by Imai et al. [3] and the regression-based approach proposed by Valeri and VanderWeele [5] are increasingly popular approaches; the former may be unfamiliar to epidemiologists, and the latter requires different regressions or approximations for different mediator and outcome types. Meanwhile, g-computation [1], as an alternative for computing marginal effects over IPW fitting of MSM [6], holds promise to provide a unified framework for effect estimation in causal mediation analysis, especially given its capability in dealing with time-varying exposure and confounding [4, 7]. However, a didactic demonstration of the application of g-computation in causal mediation analysis, reflecting recent decompositions for mediation and interaction, is lacking.

In this paper, we demonstrate the utility of (parametric) g-computation in estimating various components of the total effect, such as (1) natural direct and indirect effects, (2) standard and stochastic controlled direct effects, and (3) reference and mediated interactions, using standard statistical software. We focus on marginal effects (standardized over covariates). The current approach extends the previous work on g-computation demonstration for total effect [6] and the gformula package (mediation option) in Stata [4] by showing the actual steps in estimation and incorporating estimation for various component effects under 2-, 3- and 4-way effect decomposition [8, 9]. The paper is organized as follows: we will first review the definition and identification criteria for different effects in mediation context. Then we will review the g-computation algorithm and introduce steps for mediation analysis. After an illustrative example using a partially simulated data set, we will discuss the strengths and limitations of g-computation and its relation to other existing estimation procedures. Readers familiar with the background material on mediation analysis under the potential outcomes framework can go directly to g-computation steps section.

Notation and definitions

Let X denote the exposure of interest, Y the outcome of interest, M the mediator of interest, and Z a set of covariates not affected by the exposure but which are assumed to be sufficient for confounding control for total, direct and indirect effects estimation. Throughout, we assumed X preceded M, which preceded Y. Let Y x and M x denote respectively the potential values of the outcome and mediator that would have occurred had exposure X been set, possibly counter to fact, to a specific value x. Similarly, let Y xm denote the potential value of Y that would have occurred had X and M been set, possibly counter to fact, to x and m respectively. We use \(Y_{{xM_{{x^{*} }}}}\) to express potential outcome value had the exposure X been set to x and M to \(M_{{x^{*} }}\). Let x (index) and \(x^{*}\) (reference) denote two values of the exposure we wish to compare, and m (index) and \(m^{*}\) (reference) denote two controlled values of the mediator we wish to compare.

Total effect (TE) compares exposure level x to \(x^{*}\), while allowing the mediator to obtain its natural value under each exposure level. Thus, by assuming generalized consistency [10], TE can also be defined as \({\text{E}}\left[ {Y_{{xM_{x} }} - Y_{{x^{*} M_{{x^{*} }} }} } \right]\). TE can be decomposed into different types of component effects. There are two-way, three-way, and four-way decompositions of the total effect as presented in Table 1 [8, 9, 11, 12]. The counterfactual definitions are listed in Table 2 (left column). The choice of effect decomposition should be guided by substantive research questions (Table 3).

Table 1 Summary of effect decomposition in causal mediation analysis
Table 2 Definition and empirical analog of total effect and component effects
Table 3 Potential research questions related to various effects under effect decomposition

There are four types of “direct” effects. The standard controlled direct effect (CDE) compares exposure level x to \(x^{*}\) while fixing the mediator to a specific level. The CDE estimates the effect of X on Y while fixing M to m for every individual in the population and it can be different for different levels of m [11, 12]. The stochastic CDE (CDEsto) compares exposure level x to \(x^{*}\) while randomizing the mediator to a pre-specified distribution \(M^{\prime }\). Accordingly, the stochastic CDE subsumes both the standard CDE and the stochastic mediation contrast [13] in that the standard CDE corresponds to \(M^{\prime }\) being a constant m for the total population (i.e., full intervention) while the stochastic mediation contrast corresponds to \(M^{\prime }\) being a constant for a subset of the population and being the observed distribution for the rest (i.e., partial intervention).

The average pure direct effect (PDE) compares exposure level x to \(x^{*}\) while the mediator M is set to the natural value it would have attained under the reference level \(x^{*}\) of exposure (i.e., \(M_{{x^{*} }}\)). Accordingly, the average total direct effect (TDE) differs from PDE in that the mediator M is set to the natural value it would have attained under the index level x of exposure (i.e., M x ). Direct effects are relative to the mediator M of interest, that is, they are effects through alternative pathways other than through M.

The average pure indirect effect (PIE) compares mediator level M x to \(M_{{x^{*} }}\) while setting exposure to reference level \(x^{*}\). The average total indirect effect (TIE) differs from the PIE in that the exposure is set to index level x.

Reference interaction effect (RIE) and mediated interaction effect (MIE) capture the effect of X on Y due to interaction only and the effect of X on Y due to both interaction and mediation respectively [9]. RIE and MIE are sometimes combined to reflect the effect of X on Y due to overall interaction and termed “portion attributable to interaction” (PAI) [9].

Assumptions for identification and estimation

To identify and estimate the effect decomposition quantities, we invoke the stable unit treatment value assumption (SUTVA) [1, 15], and assumptions of consistency [16], conditional exchangeability (no-uncontrolled-confounding), and positivity [17]. The conditional exchangeability assumption for mediation analysis includes the following [12, 18]: (1) the effect of the exposure X on the outcome Y is unconfounded conditional on a set Z of measured covariates; (2) the effect of the mediator M on Y is unconfounded conditional on both X and Z; (3) the effect of X on M is unconfounded conditional on Z. For identifying standard CDE, (1) and (2) are sufficient but for stochastic CDE, all three assumptions are needed. Successful randomization of the exposure will support the assumption of no uncontrolled confounding of the exposure–mediator and exposure–outcome relations but will not guarantee the absence of uncontrolled confounding of the mediator–outcome relation [19]. In the presence of possible violation of this assumption, sensitivity analysis are needed [3, 20]. To identify natural effects, a fourth conditional exchangeability assumption is needed: (4) none of the mediator–outcome confounders are affected by exposure. Assumption (4) is known as the cross-world independence assumption [21] because it requires that, conditional on Z, the mediator that would have been observed in a world under \(X = x^{*}\) is independent of the outcome that would have been observed in a world under X = x (i.e., ). This assumption is problematic because these two variables \(M_{{x^{*} }}\) and \(Y_{{xM_{x} }}\) can never be observed together [18, 22]. Recent research proposed identification criteria [23] and effect decomposition [18] aimed at circumventing the violation of this fourth assumption. However, this issue is beyond the scope of this article. In addition, we assume no selection bias and measurement error. Under the above assumptions, different types of effect can be nonparametrically identified and estimated using the empirical analogs listed in Table 2 (right column). For the current paper, we adopted a fully parametric approach, i.e., positing parametric (regression) models for each expectation in these empirical analogs to estimate their parameters from the observed data and then integrating over the covariate and/or mediator distribution [24]. We further assumed no model misspecification.

G-computation

G-computation algorithm was first introduced by Robins [1] to estimate the causal effect of a time-varying exposure in the presence of time-varying confounders that are affected by exposure, a scenario where traditional regression-based methods would fail. In recent years, several didactic examples were given in the literature [6, 25, 26], promoting the use of this causal analytic technique. Increasingly, more studies have applied this methodology in estimating the effect of dynamic treatment regimes [27, 28] or projecting the impact of hypothetical interventions [2933]. In a simple setting with a single-time exposure and an outcome, g-computation can be seen as the generalization of standardization. Accessible examples of g-computation with detailed discussion of its strengths and limitations can be found elsewhere [2, 4, 6, 30].

G-computation steps in causal mediation analysis

The g-computation steps have been summarized in Fig. 1. Step 1 involves obtaining the parameters of the assumed covariate distributions, and fitting the assumed models for the mediator (M model) and the outcome (Y model) using observed data. The key covariates needed for the M model are confounders of the exposure–mediator relation. The key covariates needed for the Y model are confounders of the exposure–outcome relation and mediator–outcome relation. To avoid simulating the covariate set Z, one can replace steps 1a and 2a with resampling (Fig. 1). Step 2 entails simulating the covariate set Z (step 2a), an exposure intervention variable X (step 2b), and the potential mediator (step 2c), and outcome Y (step 2d) sequentially for J (J can be as large as computationally feasible) copies of the original sample. The simulation repetition done here is to reduce Monte Carlo simulation error. The intervention variable X should be distinguished from the observed exposure variable as the intervention X and the simulated covariates are marginally independent of each other. In step 2c, we simulate each potential mediator as a function of the simulated covariates and intervention from the previous steps (2a and 2b), using the parameters obtained from the M model in step 1b. Similarly, in step 2d, we simulate a potential outcome variable that corresponds to each specific type of effect as a function of the simulated covariates, exposure intervention, and mediator from the previous steps, using the parameters obtained from the Y model in step 1c. Step 3 involves regressing each different potential outcome variable on the intervention variable X to obtain estimates of each marginal effect using the pooled data with J copies of the original sample. Repeat step 2–3 on K (usually 200 or more) bootstrapped samples taken at random with replacement from the original data. The Wald type 95 % confidence interval (CI) was calculated as: point estimate ±1.96 × SD, where SD was the standard deviation of the K resultant point estimates from the final regression in the third step.

Fig. 1
figure 1

Steps for G-computation marginal structural model (G-comp MSM) in causal mediation analysis. Let Z denote a set of covariates, X denote exposure, M denote mediator, and Y denote outcome of interest. Variables in step 1 are observed variables whereas variables in step 2 are all simulated variables. Step 1a and 2a combined can be replaced by resampling

Illustration

We used a directed acyclic graph (DAG) [34] to represent the data generating process for the illustration example (Fig. 2). We used the India sample from the World Health Survey (WHS) [35]. We used all the observed covariates to simulate exposure smoking, mediator body mass index (BMI, 5-unit increase), and outcome composite health score (0–100) sequentially according to the data generating process shown in Fig. 2. Covariates for confounding control included age, gender, education, urbanicity, and depression. Detailed description for generating this partially simulated data set can be found in the supplementary materials. In the illustrative example, we will focus on the interpretations for PDE and TIE from the most common decomposition used in epidemiology and the interpretations for component effects based on the recently introduced 4-way decomposition. Since smoking was binary, we used 1 to represent “yes” and 0 “no”. A second illustrative example using binary exposure, mediator and outcome was included in the supplementary materials. All analyses were done using SAS version 9.4 (SAS Institute Inc., Cary, NC) and the accompanying SAS code can be found in the supplementary materials.

Fig. 2
figure 2

Directed acyclic graph (DAG) representing the data generating process for the first illustrative example. X, M, and Y represent the exposure smoking, mediator body mass index, and outcome composite health score. Z represents a set of exposure–mediator, exposure–outcome and mediator–outcome confounders that includes age, gender, education, urbanicity, and depression

To estimate different component effects of smoking on health, we implemented the following steps:

  • Step 1 Obtaining empirical parameters

    • (1a) We obtained the marginal expectation of each variable except the outcome, and the standard deviation for the continuous age variable.

    • (1b) The mediator BMI was regressed on smoking, age, gender, education, urbanicity and depression to obtain the regression coefficients and root mean square error (RMSE) for the linear M model.

    • (1c) The outcome, overall health score, was then regressed on smoking, BMI, smoking × BMI, age, gender, education and depression to obtain the regression coefficients and RMSE for the linear Y model.

  • Step 2 Simulating the potential mediators and outcomes

    • (2a) We created 1000 copies of the original sample and simulated age, gender, education, urbanicity, and depression that followed the same distribution as the observed variables.

    • (2b) We simulated a smoking intervention variable (X) that followed the observed smoking prevalence but was marginally independent of all simulated covariates.

    • (2c) We simulated each potential BMI variable as a function of the smoking intervention, age, gender, education, urbanicity and depression (Table 4, simulating M), using the regression coefficients and RMSE from the M model fit in (1b).

      Table 4 Equations used to simulate potential mediators and outcomes in step 2 of the G-computation marginal structural model for the illustrative example
    • (2d) We simulated a potential health score variable for each type of effect as a function of the smoking intervention, potential BMI from (2c), product term between smoking intervention and potential BMI, age, gender, education and depression (Table 4, simulating Y), using the regression coefficients and RMSE from the Y model fit in (1c).

We will use PDE and TIE as examples to explain steps (2c) and (2d) further. For binary exposure, recall that PDE compares X = 1 to X = 0 while setting the mediator M to M 0 (natural value under reference exposure). In (2c), we simulated the potential BMI variable (M 0) as a function of non-smoking (setting X = 0 in the equation for simulating M) and other determinants of BMI. Next, we simulated the potential health score variable (Y PDE ) as a function of the smoking intervention variable (X), potential BMI variable (M 0) from (2c), and other determinants of health in (2d). On the other hand, TIE compares mediator level M 1 to M 0 while setting exposure to index level 1. In (2c), we simulated the potential BMI variable (M x ) as a function of smoking intervention (setting X = x in the equation for simulating M) and other determinants of BMI. Then, we simulated the potential health score variable (Y TIE ) as a function of smoking (setting X = 1 in the equation for simulating Y), potential BMI variable (M x ) from (2c), and other determinants of health in (2d). In this way, the potential BMI variable (M x ) transmitted the effect of smoking intervention to health.

  • Step 3 Fitting final marginal structural models (MSMs)

We regressed each different potential health score variable on the smoking intervention to obtain point estimates of each marginal effect using the pooled sample. We repeated step 2–3 on 200 bootstrapped samples of the same size taken at random with replacement from the original data to obtain Wald type 95 % CIs.

Results from the illustrative example were presented in Table 5. This example is for illustration purpose and thus the results were not intended for quotation. Estimates followed the effect decompositions as described in Table 1. Smoking had an overall negative impact on health (TE: −0.96, 95 % CI −1.79, −0.13), but the majority of this impact was through pathways other than changing BMI (PDE: −0.70, 95 % CI −1.54, 0.14). When BMI was fixed at 24 for everyone, smoking did not appear to affect health directly (CDE: 0.27, 95 % CI −0.95, 1.50). In a hypothetical intervention where BMI was no longer affected by smoking and other covariates, smoking still had a negative impact on health (CDEsto: −0.81, 95 % CI −1.63, 0.02). The impact of smoking on health that was due to interaction with BMI (RIE: −0.99, 95 % CI −1.71, −0.26) was much larger than the part that was due to both mediation and interaction (MIE: −0.14, 95 % CI −0.27, −0.01). The presence of such smoking-BMI interaction contributed to the difference seen when comparing CDE to PDE and CDEsto.

Table 5 Effect estimate (95 % CI) for the illustrative example using g-computation of marginal structure models (N = 5326)

Discussion

In this article we demonstrated the utility of parametric g-computation in estimating various marginal effects under different and detailed effect decompositions as well as stochastic controlled direct effects, using Monte Carlo simulations in standard statistical software. To our knowledge, this is the first use of g-computation for 3- and 4-way decomposition of effects recently introduced by VanderWeele [9]. Our approach yielded similar results as those obtained from VanderWeele’s approach [9] (online supplementary materials). However, marginal (standardized) effect measures obtained via g-computation approach are not conceptually equal to the conditional (on covariates) effect measures obtained from the latter approach and results from the two approaches may differ [36]. Alternative imputation [37] or simulation [3] based methods are also available for common mediation parameters.

G-computation has several strengths. It uses models for the outcome and the mediator, which produces more efficient estimates (with narrower confidence intervals) than the weighting approaches that use models for the mediator and/or the exposure [37, 38]. It can be used to estimate various types of effect of interest, incorporate nonlinearities and exposure-covariate and mediator-covariate interactions, and deal with general types of outcome, exposure and mediator. This simulation-based approach can be used to estimate various effects on both difference and ratio scales.

However, g-computation is not without its limitations. The parametric g-computation method applied in mediation settings, like the general g-formula, relies on a correctly specified model for the outcome. For natural effects specifically, it additionally requires that the model for the mediator is correctly specified, as with other approaches published previously [3842]. When such parametric distribution for M is in doubt, a distribution-free approach with regard to the mediator [43] can be used. Alternative approaches are to use non-parametric g-computation method that combines bootstrapping and simulation as suggested in Imai et al. [3], or implement a doubly robust estimator that requires at least one of the model for exposure and mediator being correctly specified [37]. In addition, the computation time depends on the sample size and the number of covariates. As these two numbers increase, a random subset of the sample can be selected to perform the Monte Carlo simulation [4].

G-computation in mediation analysis, especially parametric g-formula implemented via Monte Carlo simulation, can be seen as a special application of the longitudinal time-varying g-computation formula [1, 7]. In this case, both post-baseline exposure and confounders that are affected by exposure can be seen as mediators. The steps are similar in simulating the baseline confounders and exposure intervention first and then the consequences of the exposure following the data structure represented in a specific DAG, though additional assumptions are needed in the mediation setting. G-computation as a unifying and flexible framework will gain popularity with the increase in applications of mediation analysis to answer mechanistic questions about either contextual or individual level causes [13, 44, 45].

By showing the steps for g-computation in estimating different quantities of interest in causal mediation analysis, we hope to encourage a wider audience of applied researchers to implement this framework, using software packages of their choice. Given the growing interest in adopting and applying complex systems approaches to examine complex disease etiologies, this method, especially with its simulation component, will be an important intermediate step towards this journey [46].