Introduction

When treatment effect modifiers have a different distribution among participants in a randomized trial than in the target population of substantive interest, the average treatment effect estimate from the trial is not directly applicable to the target population. A growing literature describes methods for extending (transporting or generalizing [1, 2]) inferences for the average treatment effect from the trial to the target population [3–7]. These methods critically depend on adjusting for a large number of covariates to ensure that the trial and target population are conditionally exchangeable, allowing estimation of the target population average treatment effect.

Yet, the target population average treatment effect may not be sufficient for guiding treatment or policy decisions in the presence of strong effect modification [8], especially when a small set of strong effect modifiers can be identified on the basis of background knowledge. In such cases, the target population conditional average treatment effect (CATE) as a function of these strong effect modifiers may be a more useful estimand [9]. For example, investigators may be able to identify a few “causal” effect modifiers of primary interest and choose to estimate the target population CATE given these causal effect modifiers to guide clinical or policy decisions, while also recognizing that they need to adjust for many additional candidate effect modifiers, including “surrogate” ones, to render the trial and target populations exchangeable [10, 11].

Recent work [12, 13] has described methods for estimating target population CATEs over distinct subgroups formed by a few key discrete (or discretized) covariates (i.e., “subgroup-specific average treatment effects”), when extending inferences from the trial to the target population. For example, a g-formula approach (outcome model-based) involves estimating an outcome model for each treatment group in the trial, conditional on the large number of covariates needed for exchangeability, obtaining predictions under each model for the individuals in the target population, and averaging the difference between the predictions under each model within levels of the subgroup variable to estimate CATEs in the target population (see [13] for details). This g-formula approach and related weighting and augmented (doubly robust) weighting methods [13], however, cannot be used to estimate the target population CATE over continuous covariates or multiple discrete covariates, because the number of observations at each covariate level is not adequate for estimation, even in large data sets [11, 14].
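The final averaging step of the g-formula approach described above can be sketched as follows. This is only an illustration, not the implementation in [13]: the function name `subgroup_cate_gformula` and its arguments are hypothetical, and the predictions `g1_hat` and `g0_hat` are assumed to come from outcome models that were already fitted in each treatment group of the trial and then applied to every individual in the target population sample.

```python
import numpy as np

def subgroup_cate_gformula(g1_hat, g0_hat, subgroup):
    """Average predicted outcome differences within levels of a discrete subgroup.

    g1_hat, g0_hat: predictions under treatment 1 and 0 for each cohort member
    (hypothetical inputs; obtained from trial-fitted outcome models).
    subgroup: discrete subgroup label for each cohort member.
    """
    diff = np.asarray(g1_hat, dtype=float) - np.asarray(g0_hat, dtype=float)
    subgroup = np.asarray(subgroup)
    # One subgroup-specific average treatment effect per observed level.
    return {
        level.item(): float(diff[subgroup == level].mean())
        for level in np.unique(subgroup)
    }
```

With a continuous subgroup variable, each level would typically contain a single observation, which is the limitation noted above.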

In this paper, we describe methods for estimating the target population CATE given a small set of continuous or discrete effect modifiers. Specifically, we build on recent advances in estimating CATEs in observational studies [15–21] and efficient and robust methods for generalizability and transportability analyses [6, 7], to propose a flexible two-step regression procedure for estimating the target population CATEs when a trial is embedded in a sample from the target population (i.e., in nested trial designs [22]). In the first step of the procedure, a pseudo-outcome is formed using models for the conditional probability of trial participation and the conditional expectation of the outcome in the trial. In the second step, the pseudo-outcome is regressed on the key effect modifiers. The procedure can support valid inference about the CATE when using data-adaptive (e.g., machine learning) approaches to estimate the probability of trial participation and the expectation of the outcome (in the first step), and allows flexible modeling of the treatment effect as a function of the effect modifiers (in the second step). Thus, the procedure facilitates the examination of heterogeneity over a low-dimensional set of key effect modifiers, while allowing adjustment for a potentially high-dimensional set of covariates that is sufficient to render the trial and target population exchangeable (i.e., adjust for selective participation). We show how to construct pointwise confidence intervals for the CATE at a specific value of the key effect modifiers and uniform confidence bands for the CATE function. We illustrate the proposed methods using data from the Coronary Artery Surgery Study (CASS) to estimate CATEs given history of myocardial infarction and baseline ejection fraction value in the target population of trial-eligible patients with stable ischemic heart disease.

Study design, data, and causal estimands

Study design


We consider a nested trial design [22], where the trial is embedded within a cohort sampled from the target population of substantive interest. The nesting can be achieved by designing a prospective cohort study of individuals from the target population and inviting some of the cohort members to participate in the trial, while collecting information on baseline covariates on all cohort members, including those who are not invited or do not agree to participate in the trial. Nesting can also be achieved by retrospectively linking records from a completed trial with records from a cohort that is sampled from the target population. Regardless of how nesting is achieved, we assume that the cohort in which the trial is nested can be viewed as a simple random sample from the target population [23]. Nested trial designs can be used for generalizability analyses, when the target population represented by the cohort meets the trial eligibility criteria, as well as for transportability analyses, when the target population represented by the cohort is broader than the population defined by the trial’s eligibility criteria (see [1, 2] for details regarding the definitions of the terms generalizability and transportability that we use in this paper; these terms are not used consistently in the literature, e.g., reference [4] suggests different definitions).

Data


In the nested trial design, data on a vector of baseline covariates, X, are available from all individuals in the cohort, regardless of participation in the trial. Data on the assigned treatment, A, and the outcome, Y, need only be available among trial participants. We use S as the indicator for trial participation (\(S = 1\) for randomized individuals; \(S=0\) for non-randomized individuals). Thus, the observed data are realizations of independent random draws of the tuple \(O_i = (X_i, S_i, S_{i}A_{i}, S_{i}Y_{i})\), for \(i = 1, \ldots , n\), where n is the total number of individuals in the cohort (both randomized and non-randomized). For simplicity, we assume throughout that the treatment is binary; extensions to address multi-valued treatments are similar to the binary case but require a slightly different set of identification conditions and modifications to the estimation methods (e.g., to use a “generalized” propensity score approach [15] to estimate the probability of treatment; see [21] for related results using multi-valued treatments in observational studies). We also assume that the outcome is measured at the end of the study (binary, continuous, or count; we do not consider failure-time outcomes in this paper). Throughout, italic upper-case letters denote random variables and corresponding lower-case letters denote realizations. We use \(f(\cdot )\) to generically denote densities.
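To make the data structure concrete, the following sketch simulates a small nested trial cohort. Everything in it is a hypothetical assumption made for illustration (the function name `simulate_nested_trial`, the three covariates, and all coefficients); it is not based on any real data. Unobserved treatment and outcome values for non-participants are stored as NaN rather than as the zero products \(S_i A_i\) and \(S_i Y_i\), so that they cannot be mistaken for observed zeros.

```python
import numpy as np

def simulate_nested_trial(n=1000, seed=0):
    """Simulate (X, S, A, Y) for a cohort with an embedded randomized trial."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 3))                       # baseline covariates X
    p = 1 / (1 + np.exp(-(-0.5 + 0.8 * x[:, 0])))     # Pr[S = 1 | X], assumed form
    s = rng.binomial(1, p)                            # trial participation S
    a = rng.binomial(1, 0.5, size=n).astype(float)    # 1:1 randomized treatment A
    # Binary outcome whose treatment effect varies with the first covariate.
    risk = 1 / (1 + np.exp(-(-1.0 + 0.5 * x[:, 1] + a * (-0.5 + 0.7 * x[:, 0]))))
    y = rng.binomial(1, risk).astype(float)
    # Treatment and outcome are only observed among trial participants.
    a[s == 0] = np.nan
    y[s == 0] = np.nan
    return x, s, a, y
```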

Causal estimands


Let \(Y^a\) denote the potential (counterfactual) outcome under each treatment \(a \in \{0,1\}\), that is, the outcome that would be observed under intervention to set treatment to a [24–26]. Furthermore, let \({{\widetilde{X}}}\) denote a vector that contains a small subset of the covariates in X that, on the basis of mechanistic understanding and prior empirical evidence, are a priori considered as the “key” effect modifiers under consideration. We are interested in the target population CATE given \({{\widetilde{X}}} = {{\widetilde{x}}}\),

$$\begin{aligned} \hbox {E}\big [ Y^1 - Y^0 \big | {{\widetilde{X}}} = {{\widetilde{x}}} \big ] = \hbox {E}\big [ Y^1 \big | \widetilde{X} = {{\widetilde{x}}} \big ] - \hbox {E}\big [ Y^0 \big | {{\widetilde{X}}} = {{\widetilde{x}}} \big ]. \end{aligned}$$

The CATE at some specific value of the key effect modifier, \({{\widetilde{x}}}\), is sometimes referred to as the “group-average treatment effect” [17].

In generalizability and transportability analyses, the covariate vector X may be high-dimensional because investigators collect information on multiple covariates in the hope of rendering the trial and the target population exchangeable (in the sense formalized in the next section). In contrast, the vector of key effect modifiers \({{\widetilde{X}}}\) is typically low-dimensional, containing only the small subset of baseline covariates that are deemed to be key effect modifiers of interest. For instance, in our illustrative example, the CASS investigators identified history of myocardial infarction and abnormal left ventricular function (defined as ejection fraction <50%) as key effect modifiers and examined them in subgroup analyses using data from trial participants [27, 28]. Thus, we may want to examine the association between the treatment effect in the target population and history of myocardial infarction and ejection fraction. Such examination, however, is likely to first require conditioning on many additional covariates to render the trial participants exchangeable with the target population.

Identification

Identifiability conditions


The following conditions, which are sufficient for identifying the average treatment effect in the target population [6, 29], are also sufficient for identifying the CATE in the target population:

  1. Consistency of potential outcomes: if \(A_i = a\), then \(Y_i = Y^{a}_i\), for each \(a \in \{0, 1\}\) and for every individual i in the target population.

  2. Mean exchangeability over A in the trial: for each \(a \in \{0,1\}\) and every x with positive density in the trial \(f(x, S = 1 ) \ne 0\), \(\hbox {E}[ Y^{a} | X = x , S = 1, A =a] = \hbox {E}[ Y^{a} | X = x, S = 1]\).

  3. Positivity of the treatment probability in the trial: \(\Pr [A=a | X = x, S=1] > 0\) for each \(a \in \{0,1\}\) and every x with positive density in the trial \(f(x , S = 1) \ne 0\).

  4. Exchangeability in effect measure over S: \(\hbox {E}[ Y^1 - Y^0 | X = x , S = 1] = \hbox {E}[ Y^1 - Y^0 | X = x ]\), for every x with positive density in the target population \(f(x) \ne 0\).

  5. Positivity of trial participation: \(\Pr [S=1 | X = x] >0,\) for every x with positive density in the target population \(f(x) \ne 0\).

The consistency condition over all individuals in the target population implicitly requires the absence of “hidden” versions of treatment [30–32], trial engagement effects [2, 29], and interference [30, 33]. These conditions are largely untestable and need to be considered on the basis of substantive knowledge. The conditions of mean exchangeability and positivity of the treatment probability are expected to hold in (marginally or conditionally) randomized trials comparing well-defined interventions [11]. The condition of “exchangeability in effect measure over S” reflects an untestable assumption of lack of effect measure modification by trial participation, conditional on baseline covariates, and needs to be examined in light of substantive knowledge and subjected to sensitivity analyses [34]. Stronger assumptions of exchangeability in expectation or in distribution between the trial and target population allow identification of other estimands that are not identifiable under exchangeability in effect measure (e.g., potential outcome means) [3, 6, 35]. Last, positivity of trial participation is in principle testable, but its assessment can be challenging when X is high-dimensional [36].

To focus on issues related to extending inferences from trials, we assume that there are no missing data, no losses to follow-up, and complete adherence to treatment. The methods we describe can be extended to address these complications, under additional assumptions [29] and provided additional data are collected [37]. Furthermore, our results also apply to generalizing or transporting inferences from an observational study nested in a broader cohort, provided we are willing to assume that conditions (2) and (3) hold in the observational study (i.e., no unmeasured baseline confounding and positivity of treatment within levels of the measured confounders) [38].

Identification of CATEs


In Web Appendix 1, we show that, under conditions (1) through (5), the target population CATE given \({{\widetilde{X}}} = {{\widetilde{x}}}\), that is, \(\hbox {E}\big [ Y^1 - Y^0 \big | {{\widetilde{X}}} = {{\widetilde{x}}} \big ]\), is identified by

$$\begin{aligned} \delta ( {{\widetilde{x}}}) \equiv \hbox {E}\big [ \phi (O) \big | {{\widetilde{X}}} = {{\widetilde{x}}} \big ]; \end{aligned}$$
(1)

where the pseudo-outcome \(\phi (O)\) is defined as

$$\begin{aligned} \phi (O) = \phi _{1}(O)- \phi _{0}(O) , \end{aligned}$$

with

$$\begin{aligned} \phi _{a}(O) = \dfrac{S I(A=a)}{ p(X) e_a(X)} \big \{ Y - g_a(X) \big \} + g_a(X), \text{ for } a = 0, 1; \end{aligned}$$

\(p(X) = \Pr [S=1|X]\) is the conditional probability of trial participation given covariates; \(e_a(X) = \Pr [A = a | X, S=1]\) is the conditional probability of treatment a in the trial given covariates; and \(g_a(X) = \hbox {E}[Y | X, S = 1, A = a]\) is the conditional expectation of the outcome in the trial given covariates, among individuals assigned to treatment \(a \in \{0,1\}\). We refer to the functions p(X), \(e_a(X)\), and \(g_{a}(X)\) as “nuisance functions” because they are useful in identifying and estimating CATEs, but, in our setup, are not of scientific interest per se. We refer to \(\phi (O)\) generically as a “pseudo-outcome” because it is a constructed variable (using the models for the probability of participation, the probability of treatment, and the expectation of the outcome) that is used as a “response” (i.e., “left-hand-side” variable) in the regression of equation (1). Because \(\phi (O)\) involves the observed data O and nuisance functions that are identifiable from the observed data under the nested trial design [22], we conclude that CATEs given \({{\widetilde{X}}}\) are identifiable. Of note, \(\phi (O)\) is the (uncentered) influence function [39] of the functional that identifies the target population average treatment effect under a nonparametric model for the observed data that obeys conditions (1) through (5); see reference [6] for details.
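The full identification argument is given in Web Appendix 1, which is not reproduced here; what follows is only an informal sketch of its main steps under conditions (1) through (5). First, because \(\Pr [S=1, A=a | X] = p(X) e_a(X)\), the weighted residual term in \(\phi _{a}(O)\) has conditional mean zero given X:

$$\begin{aligned} \hbox {E}\left[ \dfrac{S I(A=a)}{ p(X) e_a(X)} \big \{ Y - g_a(X) \big \} \Big | X \right] = \dfrac{\Pr [S=1, A=a | X]}{ p(X) e_a(X)} \big \{ \hbox {E}[Y | X, S=1, A=a] - g_a(X) \big \} = 0, \end{aligned}$$

so that \(\hbox {E}[\phi _{a}(O) | X] = g_a(X)\). Second, by conditions (1), (2), and (4),

$$\begin{aligned} g_1(X) - g_0(X) = \hbox {E}\big [ Y^1 - Y^0 \big | X, S = 1 \big ] = \hbox {E}\big [ Y^1 - Y^0 \big | X \big ]. \end{aligned}$$

Third, because \({{\widetilde{X}}}\) is a subvector of X, iterated expectations give

$$\begin{aligned} \hbox {E}\big [ \phi (O) \big | {{\widetilde{X}}} = {{\widetilde{x}}} \big ] = \hbox {E}\Big [ \hbox {E}\big [ Y^1 - Y^0 \big | X \big ] \Big | {{\widetilde{X}}} = {{\widetilde{x}}} \Big ] = \hbox {E}\big [ Y^1 - Y^0 \big | {{\widetilde{X}}} = {{\widetilde{x}}} \big ]. \end{aligned}$$

The positivity conditions (3) and (5) ensure that the weights and conditional expectations above are well defined.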

An inverse probability weighted pseudo-outcome is formed by setting the \(g_{1}(X)\) and \(g_{0}(X)\) terms to zero in the expression for \(\phi (O)\) (see Web Appendix 1 and Web Appendix 2 for details). For the remainder of this paper, we do not consider this simpler pseudo-outcome because it does not allow for valid inference when using machine learning to estimate the nuisance functions [40].

Estimation & inference

Two-step estimation procedure

To extend causal inferences about CATEs given key effect modifiers from the trial to the target population, we propose a two-step procedure, similar to approaches for estimating CATEs in observational analyses with baseline confounding by measured variables [19, 20]. In the first step, we create the pseudo-outcome using models for the probability of trial participation, the expectation of the outcome, and (optionally) the probability of treatment in the trial, to account for differences between the trial and the target population by conditioning on a large set of effect modifiers. In the second step, we regress the pseudo-outcome on the key effect modifiers to estimate the CATE function.


Step 1: Estimation of the nuisance functions to form the pseudo-outcome: Forming the pseudo-outcome for each observation \(i=1,\ldots ,n\) in the data follows the identification results for \(\phi (O)\) by using estimators for the nuisance functions (denoted by “hats”):

$$\begin{aligned} {\widehat{\phi }}(O_i) = {\widehat{\phi }}_{1}(O_i)- {\widehat{\phi }}_{0}(O_i), \end{aligned}$$
(2)

with

$$\begin{aligned} {\widehat{\phi }}_{a}(O_i) = \dfrac{S_i I(A_i=a)}{ {\widehat{p}}(X_i) {\widehat{e}}_a(X_i)} \big \{ Y_i - {\widehat{g}}_a(X_i) \big \} + {\widehat{g}}_a(X_i), \text{ for } a = 0, 1. \end{aligned}$$

Of note, the average of \({\widehat{\phi }}(O_i)\) over the n observations in the sample, that is, \(n^{-1} \sum _{i=1}^n {\widehat{\phi }}(O_i)\) gives a “doubly robust” or “augmented inverse probability weighted” [41] estimator of the average treatment effect in the target population (to see this, compare \({\widehat{\phi }}_{a}(O_i)\) with the summand in equation (5) of reference [6]).

When calculating the pseudo-outcomes, there are several options for estimating the nuisance functions for the probability of participation, expectation of the outcome, and the probability of treatment, that is, p(X), \(g_{a}(X)\), and \(e_a(X)\), respectively. The probability of treatment in the trial, \(e_a(X)\), is typically known by design, so its estimation is straightforward using simple parametric models (e.g., logistic regression) [42, 43]. Alternatively, the known probability of treatment can be used. In contrast, the functions p(X) and \(g_{a}(X)\) are unknown, involve conditioning on the high-dimensional baseline covariates that are necessary for exchangeability to hold between the trial and the target population, and may be complex, perhaps including nonlinearities and interactions. In practice, parametric models are commonly used to estimate the probability of participation, p(X), or the expectation of the outcome, \(g_{a}(X)\). When parametric models are used, the procedure is “model doubly robust” [41], in the sense that the CATE function can be consistently estimated when at least one of the parametric models for participation or the outcome is correctly specified. Nevertheless, parametric models may poorly approximate both p(X) and \(g_{a}(X)\). At the same time, the high dimensionality of X precludes fully nonparametric estimation of these models [14] (e.g., non-smooth nonparametric (frequency) estimation of the expectation of Y given X is infeasible if X is high-dimensional or has continuous components [44]). To make progress, investigators can instead use machine learning (data-adaptive) methods to reduce model misspecification and allow more flexible modeling of the nuisance functions.

The cost of using data-adaptive approaches is that they converge to the true underlying nuisance function at a slower rate than parametric models. Informally, for valid inference, a fast enough rate of convergence of the data-adaptive approaches to the underlying true function is important to ensure that the bias of the estimated CATE function is “small” relative to its standard error [45]. Without a fast enough rate of convergence for the nuisance functions, bias remains, resulting in an inconsistent estimator without optimal coverage. Combining models for the probability of participation and the expectation of the outcome in the construction of the pseudo-outcome mitigates this problem: we can rely on estimators of the nuisance functions that converge at a “fast enough,” even if slower than parametric, combined rate (i.e., the estimator of the pseudo-outcome in equation (2) has a “rate-robustness property” [46]). Several data-adaptive methods can have rates that are fast enough (e.g., the highly adaptive least absolute shrinkage and selection operator (HAL) [47] and generalized additive models (GAMs) [48, 49]). When using data-adaptive approaches to estimate the nuisance functions for the probability of participation and the expectation of the outcome, we assume that the chosen approaches are consistent for the true underlying functions.

Background knowledge about aspects of the data-generating process can be used to select approaches that produce good approximations of the nuisance functions. For example, if we expect the relationship between trial participation and covariates, or the outcome and covariates, to be highly nonlinear or involve statistical interactions among covariates, random forest methods [50] may be a good choice to estimate the nuisance functions. If we expect sparsity, the least absolute shrinkage and selection operator [51] or other sparsity-appropriate modeling approaches may be preferred.

Regardless of the estimation approach (i.e., whether using parametric or data-adaptive approaches), the estimated functions are used to calculate the pseudo-outcomes in equation (2). Importantly, the participation and outcome models should include the high-dimensional set of variables needed to satisfy condition (4) to make the trial and target population conditionally exchangeable.
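Once the nuisance functions have been estimated by any of the approaches above, the pseudo-outcome in equation (2) is a simple arithmetic transformation of the data. The sketch below is illustrative only: the function name `pseudo_outcome` and its argument names are hypothetical, the fitted values are assumed to be supplied as arrays over the whole cohort, and the treatment is assumed binary so that \({\widehat{e}}_0(X) = 1 - {\widehat{e}}_1(X)\).

```python
import numpy as np

def pseudo_outcome(s, a, y, p_hat, e1_hat, g1_hat, g0_hat):
    """Estimated pseudo-outcome phi_hat = phi_hat_1 - phi_hat_0, as in equation (2).

    All arguments are length-n arrays over the whole cohort. Entries of `a` and
    `y` for non-participants (s == 0) are ignored and may be NaN.
    """
    s, p_hat, e1_hat, g1_hat, g0_hat = (
        np.asarray(v, dtype=float) for v in (s, p_hat, e1_hat, g1_hat, g0_hat)
    )
    in_trial = s == 1
    # Replace unobserved A and Y by placeholders so NaN cannot propagate;
    # the factor S * I(A = a) zeroes the residual term for non-participants.
    a = np.where(in_trial, np.asarray(a, dtype=float), -1.0)
    y = np.where(in_trial, np.asarray(y, dtype=float), 0.0)
    e0_hat = 1.0 - e1_hat  # binary treatment: e_0(X) = 1 - e_1(X)
    phi1 = in_trial * (a == 1) / (p_hat * e1_hat) * (y - g1_hat) + g1_hat
    phi0 = in_trial * (a == 0) / (p_hat * e0_hat) * (y - g0_hat) + g0_hat
    return phi1 - phi0
```

Averaging the returned values over the cohort gives the doubly robust estimator of the target population average treatment effect mentioned above; the second step instead regresses them on the key effect modifiers.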


Step 2: Pseudo-outcome regression: We fit a regression of the estimated pseudo-outcome on the key effect modifiers, \({{\widetilde{X}}}\), to estimate the target population CATE as a function of \({{\widetilde{x}}}\),

$$\begin{aligned} {\widehat{\delta }}( {{\widetilde{x}}}) = {\widehat{\hbox {E}}}\big [\widehat{\phi }(O) \big | {{\widetilde{X}}} = {{\widetilde{x}}} \big ]. \end{aligned}$$
(3)

We refer to this second step of the procedure as a “pseudo-outcome regression.” To consistently estimate the target population CATE function, we need to correctly specify the regression model in the second step, and have consistent estimators of the nuisance functions in the first step. One approach is to use a parametric model (e.g., least squares regression) to model the relationship between the pseudo-outcome and the key effect modifiers \({{\widetilde{X}}}\), but correct model specification may be challenging if \({{\widetilde{X}}}\) contains continuous components that are not guaranteed to be linear and may have complex functional forms. To mitigate the risk of model misspecification, given that in our setup \({{\widetilde{X}}}\) is low dimensional, it will often be possible to use nonparametric regression to flexibly model the relationship between the pseudo-outcome and the key effect modifiers, \({{\widetilde{X}}}\). For example, when \({{\widetilde{X}}}\) contains discrete covariates, it is simple to split the data into subgroups defined by the different levels of the covariates and estimate the mean of the pseudo-outcome within each subgroup – a non-smooth nonparametric “regression” approach [44]. When \({{\widetilde{X}}}\) also contains continuous components, we can use smoothing nonparametric techniques such as series [19] or kernel (local linear) regression [20] methods. For example, if \({{\widetilde{X}}}\) consists of a single continuous covariate, we can use a series estimator in the second step by fitting an ordinary least squares regression of the pseudo-outcome on a flexible polynomial (alternatively, we can use splines or other basis functions) [19]. The goal is to approximate the CATE function with a flexible model that is easy to understand and graph.
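For a single continuous effect modifier, the series estimator described above amounts to ordinary least squares on a polynomial basis. The following is a minimal sketch under that assumption; the helper names `poly_basis`, `fit_series_cate`, and `predict_cate` are hypothetical, and `phi_hat` is assumed to hold pseudo-outcomes already computed in the first step.

```python
import numpy as np

def poly_basis(x, degree=3):
    """Polynomial series basis m(x) = (1, x, ..., x^degree), one row per value."""
    return np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)

def fit_series_cate(phi_hat, x_tilde, degree=3):
    """Least squares coefficients from regressing the pseudo-outcome on m(x)."""
    M = poly_basis(x_tilde, degree)
    beta, *_ = np.linalg.lstsq(M, np.asarray(phi_hat, dtype=float), rcond=None)
    return beta

def predict_cate(beta, x_grid):
    """Estimated CATE function m(x) beta evaluated at the grid values."""
    return poly_basis(x_grid, degree=len(beta) - 1) @ beta
```

In practice it may help to center and scale the effect modifier before forming powers, since raw polynomials of a variable on a scale such as 40 to 80 can be poorly conditioned; splines or other bases can be substituted by replacing `poly_basis`.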

Inference

We now discuss how to obtain both pointwise confidence intervals and uniform confidence bands. Pointwise confidence intervals are appropriate for the estimated CATE at a specific value of the key effect modifiers. These intervals capture the uncertainty at a specific point (i.e., at a specific value of the covariates \({{\widetilde{X}}}\)) but are too narrow (will have significant undercoverage) if inferences are drawn over multiple points, as is the case when examining the entire domain of the CATE function. If investigators are interested in examining heterogeneity across the entire domain of the CATE function, uniform confidence bands reflect uncertainty over multiple points. The inference strategies we describe are appropriate when the CATE function in the second step of the two-step estimation procedure is estimated using least squares, series, or kernel regression, provided appropriate technical requirements are met (e.g., regularity conditions, undersmoothing in the kernel regression, etc.; see [19] for details regarding ordinary least squares or series regression, and [20] for kernel regression). Statistical inference when using other approaches in the second step of the two-step estimation procedure would require case-by-case examination.


Pointwise inference: Pointwise confidence intervals for the CATE at a specific \({{\widetilde{x}}}\) value can be obtained using standard approaches, under reasonable technical conditions [19]. Specifically, a \(100(1- \alpha )\%\) pointwise confidence interval at \({{\widetilde{x}}}\) is given by \(({\widehat{\delta }}({{\widetilde{x}}}) \pm z_{1-\alpha /2} \times {\widehat{\sigma }}({{\widetilde{x}}})),\) where \(z_{1-\alpha /2}\) is the \((1- \alpha /2)\) quantile of the standard normal distribution and \({\widehat{\sigma }}({{\widetilde{x}}})\) is the estimated standard error of the CATE at \({{\widetilde{x}}}\), which can be obtained using the nonparametric bootstrap [52]. Alternatively, when using ordinary least squares or series regression in the second step of the CATE estimation procedure, the robust variance estimator (i.e., the Huber-White sandwich estimator) [53, 54], which is readily available in standard software packages, can be used to obtain \({\widehat{\sigma }}({{\widetilde{x}}})\). When using the pseudo-outcomes defined in equation (2), it is not necessary to account for the uncertainty in the fitted nuisance models in the first step when estimating the robust variance; inference can be carried out as if the nuisance functions were known [19].
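When the second step is a least squares or series regression, the pointwise interval can be computed directly from the design matrix. The sketch below assumes the basic (HC0) form of the Huber-White sandwich estimator, which is one of several variants offered by standard software; the function name `pointwise_ci` and its arguments are hypothetical, with `M` the basis matrix evaluated at the observed effect modifiers and `M_grid` the same basis evaluated at the values of interest.

```python
import numpy as np
from statistics import NormalDist

def pointwise_ci(M, phi_hat, M_grid, alpha=0.05):
    """OLS fit of the pseudo-outcome with HC0 sandwich pointwise intervals."""
    M = np.asarray(M, dtype=float)
    phi_hat = np.asarray(phi_hat, dtype=float)
    M_grid = np.asarray(M_grid, dtype=float)

    bread = np.linalg.inv(M.T @ M)
    beta = bread @ M.T @ phi_hat
    resid = phi_hat - M @ beta
    meat = (M * resid[:, None] ** 2).T @ M      # sum of resid_i^2 * m_i m_i'
    cov = bread @ meat @ bread                  # HC0 covariance of beta_hat

    est = M_grid @ beta
    # Standard error of m(x)' beta_hat at each grid point.
    se = np.sqrt(np.einsum("ij,jk,ik->i", M_grid, cov, M_grid))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return est, est - z * se, est + z * se
```

The nuisance estimates enter only through `phi_hat`, reflecting the statement above that the variance can be computed as if the nuisance functions were known.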


Uniform inference: Uniform confidence bands are needed to obtain valid coverage over the set of values of the key effect modifiers that we want to examine. Suppose, for concreteness, that we are using series regression in the second step of the CATE estimation procedure. Series regression involves an estimator of the CATE as a function of \({{\widetilde{x}}}\) with the form \({\widehat{\delta }} ({{\widetilde{x}}}) = m({{\widetilde{x}}}){\widehat{\beta }}\), where \(m({{\widetilde{x}}})\) is a vector of series or sieve basis functions (e.g., polynomials, splines, or wavelets) of \({{\widetilde{x}}}\), and \({\widehat{\beta }}\) is the least squares estimator of the regression coefficients. We will evaluate \({\widehat{\delta }} ({{\widetilde{x}}})\) over a set of grid points \({\mathcal {P}}\), where \({\mathcal {P}}\) is a subset of the possible values of the effect modifiers \({{\widetilde{X}}}\). We work with the grid points in \({\mathcal {P}}\) instead of all possible values of \({{\widetilde{X}}}\) to allow for the possibility that some components of \({{\widetilde{X}}}\) are continuous. One option is to use as grid points all the unique values of \({{\widetilde{x}}}\) observed in the data; another is to choose grid points that capture the “interesting” values of \({{\widetilde{x}}}\). To obtain uniform inference over the set \({\mathcal {P}}\), following [19, 20], we use a weighted bootstrap procedure [55–57] (sometimes referred to as the wild or multiplier bootstrap). We describe the weighted bootstrap procedure in Web Appendix 3; additional considerations for data-adaptive approaches are discussed in Web Appendix 4.
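Web Appendix 3 is not reproduced here, so the following is only a generic sketch of a multiplier bootstrap for a series regression, and it may differ in its details from the procedure the appendix describes. It assumes standard normal multipliers and a band based on the maximum absolute t-statistic over the grid; the function name `uniform_band` and its arguments are hypothetical.

```python
import numpy as np

def uniform_band(M, phi_hat, M_grid, alpha=0.05, n_boot=2000, seed=0):
    """Uniform band over grid points via a Gaussian multiplier bootstrap."""
    M = np.asarray(M, dtype=float)
    phi_hat = np.asarray(phi_hat, dtype=float)
    M_grid = np.asarray(M_grid, dtype=float)
    rng = np.random.default_rng(seed)

    bread = np.linalg.inv(M.T @ M)
    beta = bread @ M.T @ phi_hat
    scores = M * (phi_hat - M @ beta)[:, None]   # m(x_i) * residual_i
    cov = bread @ (scores.T @ scores) @ bread    # HC0 covariance of beta_hat
    est = M_grid @ beta
    se = np.sqrt(np.einsum("ij,jk,ik->i", M_grid, cov, M_grid))

    # Each bootstrap draw perturbs the scores with independent mean-zero,
    # unit-variance multipliers (assumed standard normal here).
    xi = rng.standard_normal((n_boot, M.shape[0]))
    beta_star = xi @ scores @ bread              # draws of beta_hat - beta
    t_star = np.abs(beta_star @ M_grid.T) / se   # |t| at every grid point
    # Critical value: (1 - alpha) quantile of the maximum |t| over the grid.
    crit = np.quantile(t_star.max(axis=1), 1 - alpha)
    return est, est - crit * se, est + crit * se, crit
```

Because the critical value is taken from the maximum over all grid points, it is typically larger than the pointwise normal quantile, so the band is wider than the pointwise intervals.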


Using a two-step approach to model conditional potential outcome means and other CATE measures: See Web Appendix 5 for how the two-step procedure can be modified to model conditional potential outcome means (e.g., conditional counterfactual risks) and treatment effect measures other than the mean/risk difference.

Examining heterogeneity in CASS

CASS design and data

The Coronary Artery Surgery Study (CASS) [27, 58] compared coronary artery bypass grafting surgery plus medical therapy (hereafter, “surgery”) versus medical therapy alone in a randomized trial that was nested within a cohort of trial-eligible patients with stable ischemic heart disease. Patients were enrolled from August 1975 to May 1979 and followed up for death up to December 1996. The cohort consisted of 2099 trial-eligible patients of whom 780 participated in the trial. For the first 10 years of follow-up, there was no censoring among trial participants.

The original CASS analysis prespecified variables that the investigators believed to be important effect modifiers and risk factors for the outcome of mortality [27]. These variables included history of myocardial infarction and abnormal left ventricular function (defined as ejection fraction value less than 50%). One analysis of the trial participants in CASS at 10 years of follow-up [59] found no difference in survival probability between treatment groups in the overall trial sample, but found that patients with an ejection fraction less than 50% had significantly improved survival with surgery (surgery 79%, medical therapy 61%). No other subgroup-specific benefits were found in the trial [28, 59]. A re-analysis of both randomized and observational data from CASS found heterogeneity on the risk difference scale for mortality at 10 years of follow-up among subgroups defined by history of myocardial infarction and abnormal left ventricular function [60]. Furthermore, a meta-analysis of 7 early trials (including CASS) comparing surgery versus medical therapy found that abnormal left ventricular function was an important modifier for the effect of treatment on mean survival time and that patients with abnormal left ventricular function derived greater absolute benefit from surgery [61]. A more recent randomized trial reported that among patients with ischemic cardiomyopathy and low ejection fraction (<35%), surgery was more beneficial than medical therapy [62]. Thus, we decided to use the methods described above to explore whether history of myocardial infarction and ejection fraction (treated as a continuous variable) were indeed effect modifiers in the target population of all trial-eligible patients.

Table 1 Baseline characteristics in CASS (August 1975 to December 1996). \(S=1\) indicates the randomized group; \(S=0\) indicates the non-randomized group; \(A=1\) indicates surgery; \(A=0\) indicates medical therapy

A total of 1686 patients had complete data on the baseline covariates we used in our analysis (731 randomized, 368 to surgery and 363 to medical therapy; 955 non-randomized, 430 receiving surgery and 525 medical therapy). Table 1 summarizes the basic descriptive statistics for the baseline covariates. In general, randomized and non-randomized patients had similar characteristics, but non-randomized patients were more likely to have taken a beta-blocker regularly, have a higher left main coronary percent obstruction, and have a higher left ventricular wall score.

Statistical methods

We evaluated the target population CATEs (risk differences) for mortality at 10 years of follow-up, conditional on history of myocardial infarction and baseline ejection fraction value. We analyzed the 986 patients with a history of myocardial infarction and the 700 patients without a history of myocardial infarction separately, when estimating the nuisance functions and when estimating the pseudo-outcome regression. We obtained pointwise confidence intervals and uniform confidence bands within the subgroups defined by history of myocardial infarction.

In the first step to estimate the nuisance functions for the outcome and participation, we used parametric models (logistic regression) and included the main effects of all baseline covariates listed in Table 1, except we modeled age and ejection fraction using B-splines (basis splines) of order 3 (degree 2) with an interior knot placed at the median of age or ejection fraction. We modeled the outcome separately in each treatment group in the trial. Because the model for treatment in the trial cannot be misspecified, to estimate \(e_{a}(X)\), we used a simple logistic model that included the main effects of age, severity of angina, ejection fraction value, systolic blood pressure, proximal left anterior artery percent obstruction, and left ventricular wall score [6].

In the second step, we fit the regression of the pseudo-outcome on ejection fraction. We used ordinary least squares regression with either a B-spline or polynomial of ejection fraction. We did not want to assume that the CATE function has a linear form, so we chose to use B-splines of order 3 with an interior knot placed at the median of ejection fraction. For the polynomial of ejection fraction, we set the degree to 3. These models are flexible enough to capture most reasonable nonlinearities in the treatment effect over ejection fraction. When forming the pointwise confidence intervals, we used the robust variance estimator. We obtained uniform confidence bands using the weighted bootstrap procedure with 200 replicates.
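The second step can be sketched as follows, assuming the pseudo-outcome from the first step is already available. The sketch fits a cubic polynomial by least squares, computes an HC0 sandwich ("robust") standard error at each grid point, and obtains a uniform band from the maximal t-statistic over the grid across 200 weighted bootstrap replicates with exponential weights. The simulated data, the helper names (`solve`, `poly_row`, `wls`, `dot`), the centering and scaling of ejection fraction, the choice of exponential weights, and the choice to reweight only the second step are all assumptions for illustration; the authors' procedure may differ in these details.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def poly_row(ef, degree=3):
    z = (ef - 60.0) / 20.0  # centering/scaling is our choice, for numerical stability
    return [z ** j for j in range(degree + 1)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def wls(X, y, w):
    """Weighted least squares; returns (coefficients, X'WX)."""
    p = len(X[0])
    XtWX = [[sum(wi * r[a] * r[b] for r, wi in zip(X, w)) for b in range(p)] for a in range(p)]
    XtWy = [sum(wi * r[a] * yi for r, yi, wi in zip(X, y, w)) for a in range(p)]
    return solve(XtWX, XtWy), XtWX

random.seed(1)
n, B = 300, 200
ef = [random.uniform(40, 80) for _ in range(n)]
# Simulated stand-in for the first-step pseudo-outcome (NOT CASS data).
pseudo = [-0.25 + 0.01 * (e - 40) + random.gauss(0, 0.5) for e in ef]

X = [poly_row(e) for e in ef]
beta, XtX = wls(X, pseudo, [1.0] * n)
resid = [yi - dot(r, beta) for r, yi in zip(X, pseudo)]

grid = list(range(40, 81))                    # ejection fraction 40% to 80%, step 1%
G = [poly_row(g) for g in grid]
fit = [dot(g, beta) for g in G]

# HC0 sandwich standard error of the fitted curve at each grid point.
se = []
for g in G:
    h = solve(XtX, g)                         # (X'X)^{-1} g
    se.append(sum((e * dot(r, h)) ** 2 for r, e in zip(X, resid)) ** 0.5)

# Weighted bootstrap: refit with random Exp(1) weights, track the maximal |t|.
sup_t = []
for _rep in range(B):
    w = [random.expovariate(1.0) for _i in range(n)]
    beta_b, _unused = wls(X, pseudo, w)
    sup_t.append(max(abs(dot(g, beta_b) - f) / s for g, f, s in zip(G, fit, se)))
crit = sorted(sup_t)[(95 * B + 99) // 100 - 1]  # empirical 95th percentile

pointwise = [(f - 1.96 * s, f + 1.96 * s) for f, s in zip(fit, se)]
uniform = [(f - crit * s, f + crit * s) for f, s in zip(fit, se)]
```

The uniform band replaces the pointwise multiplier 1.96 with the bootstrap critical value `crit`, which accounts for looking at all grid points simultaneously.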

To evaluate the robustness of our results to model specification of the nuisance functions, we repeated our analyses using generalized additive models (GAMs) instead of parametric models in the first step of the procedure. For comparison, we also estimated trial-only CATEs, by modifying the procedure to only use the trial pseudo-outcome (which does not include the participation weight) and to fit the second step regression only among trial-participants [18].

Results

Figure 1 shows the estimated target population CATE function stratified by history of myocardial infarction over a set of ejection fraction values, from 40 to 80, when using parametric models in the first step and splines in the second step. The CATE functions for patients with and without a history of myocardial infarction show different patterns, but both could be reasonably well approximated by a linear fit. For patients with a history of myocardial infarction, the CATE function increases approximately linearly from a risk difference of approximately -0.25 for patients with an ejection fraction of 40% to a risk difference of approximately 0.15 for patients with an ejection fraction of 80%. The uniform confidence band suggests that the data are incompatible with the hypothesis that the CATE function is constant at 0 (no effect) over the set of ejection fraction values we considered. In other words, among patients with a history of myocardial infarction, the treatment effect appears to vary over ejection fraction, suggesting that surgery (compared to medical therapy) may be more beneficial for patients with lower ejection fraction than for those with higher ejection fraction.

Fig. 1 Target population CATE function estimated using parametric models in the first step and spline regression in the second step. CATE = conditional average treatment effect; MI = myocardial infarction. The black line indicates the estimated CATE function; dashed gray lines connect 95% pointwise upper and lower confidence limits; solid gray lines depict the uniform 95% confidence band. The CATE function was evaluated on a grid of evenly spaced ejection fraction values, in steps of 1%, up to 80%. In each panel, the confidence bands are uniform over ejection fraction (conditional on history of MI).

For patients without a history of myocardial infarction, the CATE function decreases from a risk difference of approximately 0.15 for patients with an ejection fraction of 40% to slightly less than 0 for patients with an ejection fraction of 70%; it then increases to approximately 0.20 for patients with an ejection fraction of 80%. Because the uniform confidence band contains zero at all levels of ejection fraction, the data are not incompatible with the hypothesis that the CATE function is constant over ejection fraction for patients without a history of myocardial infarction.
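The way a uniform band is read in the two paragraphs above can be sketched in a few lines. A constant CATE function at value c is compatible with the band only if c lies inside the band at every grid point; some constant fits only if the largest lower limit does not exceed the smallest upper limit. The band values and function names below are hypothetical, chosen to mimic the two qualitative patterns, and are not the estimates shown in Fig. 1.

```python
def compatible_with_constant(band, c=0.0):
    """True if the constant function c lies inside the band at every grid point."""
    return all(lo <= c <= hi for lo, hi in band)

def admits_some_constant(band):
    """True if at least one horizontal line fits entirely inside the band."""
    return max(lo for lo, _ in band) <= min(hi for _, hi in band)

# Hypothetical (lower, upper) uniform band limits at three grid points.
band_excluding_zero = [(-0.45, -0.05), (-0.20, 0.10), (0.02, 0.30)]
band_containing_zero = [(-0.10, 0.40), (-0.25, 0.20), (-0.05, 0.45)]

print(compatible_with_constant(band_excluding_zero))   # False
print(admits_some_constant(band_excluding_zero))       # False
print(compatible_with_constant(band_containing_zero))  # True
print(admits_some_constant(band_containing_zero))      # True
```

Because the band is uniform, these checks can be made over the whole grid at once without further adjustment for multiplicity; the same checks applied to pointwise intervals would not carry the same joint coverage.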

In Web Appendix 6, we provide additional results for the CASS analysis. Web Appendix Figure 1 shows that the second step regression using polynomials instead of splines yielded similar results. The estimated trial-only CATE functions, provided in Web Appendix Figures 2 and 3, were similar to the corresponding estimated target population CATE functions. We also found that repeating the analysis with GAMs in the first step resulted in similar CATE functions; see Web Appendix Figures 4 through 7. We have provided a simulated dataset and \(\texttt {R}\) code [63] that implements the two-step estimation procedure and produces graphs of the CATE function on GitHub (https://github.com/serobertson/GeneralizabilityCATE). We provide additional details about the code in Web Appendix 7.

Discussion

We described a two-step pseudo-outcome regression procedure for estimating target population CATEs in nested trial designs used to extend inferences from a randomized trial to a target population. We also described how to obtain pointwise confidence intervals for the CATE at specific effect modifier values and uniform confidence bands for the CATE function. This two-step procedure provides a regression-based framework for examining CATEs given discrete as well as continuous covariates, whereas previously proposed methods only allow the estimation of CATEs within subgroups defined by discrete covariates [12, 13]. Even when all covariates of interest are discrete, working within a regression framework may be advantageous because it allows the representation of smoothness or homogeneity assumptions by omitting covariate-by-covariate product terms from the regression specification; such assumptions are not as easy to represent with previously proposed methods [12, 13].

We note the different roles of the baseline covariates in the two steps of the procedure: the first step “controls” for enough variables to address selective trial participation; the second step focuses on a much smaller set of key effect modifiers. This duality is analogous to the difference between the variables needed to address confounding and effect modifiers in previous work on estimating CATEs in observational studies [15,16,17,18,19,20,21, 64]. In applications, examining heterogeneity over a lower dimensional set of covariates may be motivated by scientific or policy considerations. For example, key effect modifiers may be identified on the basis of previous investigations, and the methods described here can be used in confirmatory assessments of heterogeneity. Or, it may be desirable to base treatment decisions on only a subset of potential effect modifiers while ignoring unacceptable ones (e.g., even if insurance status were a strong effect modifier, we might prefer to not use it to make treatment decisions; instead, we might choose to examine heterogeneity only over lab measurements or past medical history).

Our methods are motivated by applications in which a few key effect modifiers of interest can be identified by the investigators (e.g., on the basis of prior studies). When the effect modifiers of interest are more numerous, it may be possible to summarize them into a score (e.g., using an outcome or treatment effect model obtained from external data) and use that score in the second step of our procedure. It is also possible to extend our methods to settings where \({{\widetilde{X}}}\) is of moderate to high dimensionality, or even to substitute X for \({{\widetilde{X}}}\), as is often the case in discovery-oriented investigations [18, 65–67] (we briefly touch on these approaches in Web Appendix 1 and Web Appendix 5). In such investigations, the study goal is to predict individualized responses for members of the target population, and the second step of the procedure is typically modified to use data-adaptive approaches appropriate for high-dimensional covariates [18]. The development of methods for valid inference in this context is an area of active research. Broadly speaking, when examining heterogeneity over high-dimensional covariates, there exists a trade-off between the flexibility of the model specification and the strength of the technical assumptions needed for valid estimation and inference [68].

In summary, we proposed a two-step estimation procedure for estimating the target population CATE as a function of key effect modifiers in nested trial designs. This procedure is useful for examining the dependence of the CATE on a small set of key effect modifiers, while adjusting for a large set of covariates needed to ensure the exchangeability of the trial and the target population.