Introduction

When identifying the most relevant information for policy makers or clinicians looking to make a decision about how to act in a particular population or for a particular patient, both the actions being considered and the context to which they will be applied matter [1, 2]. One hierarchy of study designs places results from randomized controlled trials (RCTs) at the pinnacle of the pyramid of evidence because randomization minimizes, by design, internal bias due to confounding [3]. Setting aside the fact that RCTs may still suffer from internal biases other than confounding bias, RCTs are often conducted in highly selected study samples that may yield a very different context than the target population in which the decision is being made. This mismatch in the context and composition of the trial sample relative to the target population is a key component of external bias, which is undervalued in this evidence hierarchy.

Lest we forget how much the target population matters, we present two examples: (1) estimates of the effect of medication assisted therapy (buprenorphine/naloxone), motivational interviewing, and motivational incentives on substance use would have been very different—typically less effective and no longer statistically significant—had trials testing these interventions been conducted in samples that were more representative of all treatment-eligible persons in the US [4]. (2) Estimates of the (adverse) effect of antidepressants on suicidal ideation or behaviors in depressed youth may have been overstated in trials that under-enrolled or explicitly excluded youth at the highest risk for these outcomes [5, 6].

The term target validity has been proposed to describe the total difference between the estimate of association obtained in a particular study sample and the true effect in the target population of interest [7••]. Target bias is the sum of internal bias and external bias. We loosely define internal bias as the difference between the estimate of association in the study sample and the true effect in the study sample and external bias as the difference between the true effect in the study sample and the true effect in the target population. Although the moniker “target validity” is new, the concept has been previously described in the education, social sciences, and policy literature [1, 6, 8, 9]. The concept of target validity encourages epidemiologists to take threats to external validity as seriously as they have traditionally taken threats to internal validity, to integrate consideration of both internal and external validity when evaluating the strength of available evidence for informing a particular decision in a particular population, and to better evaluate the tradeoffs between experimental and observational studies [9]. This is in contrast to the common view that external validity is secondary to, or contingent on, internal validity [10].
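Written out schematically, the loose definitions above give the following decomposition (a restatement of [7••], with \( \widehat{\theta} \) denoting the estimate of association in the study sample, \( \theta_S \) the true effect in the study sample, and \( \theta_T \) the true effect in the target population):

\[
\underbrace{\widehat{\theta} - \theta_T}_{\text{target bias}} \;=\; \underbrace{\left(\widehat{\theta} - \theta_S\right)}_{\text{internal bias}} \;+\; \underbrace{\left(\theta_S - \theta_T\right)}_{\text{external bias}}.
\]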

Although threats to internal validity are well-known to epidemiologists (e.g., confounding bias, information bias), threats to external validity are less well-understood. Focusing on the external validity of RCTs is a useful heuristic in that (1) the sampling mechanism into a trial, including inclusion and exclusion criteria, is often explicit and thus so are differences between the trial sample and the target population; (2) we know quite a bit about the implicit sampling mechanisms into trials (e.g., under-sampling of older people and people of minority race/ethnicity) thanks to previous research [11,12,13,14]; and (3) we can pretend that bias due to confounding is negligible and can thus assume that the majority of the difference (bias) between the estimate of association in the sample and the target (population) average treatment effect (TATE) of interest is due to lack of external validity. That is, we assume the estimate of association in the sample is a good estimate of the sample average treatment effect (SATE), but the SATE is a poor approximation of the TATE. Despite the relative inattention paid to external validity, the assumptions required for external validity may be quite strong [15•].

Despite the specter of unmeasured confounding, observational studies have generally (with some notable exceptions) returned results similar to those of subsequent randomized trials investigating the same exposures [16]. That is, for many exposures, particularly those that could be randomized, the internal validity of observational studies may be better than is often assumed. Additionally, observational studies do not tend to have inclusion and exclusion criteria as strict as those of trials, making their samples potentially more similar to the target populations we might be interested in. However, the external validity of observational studies is still potentially of concern, given the increasing use of “big data,” administrative databases, and pooled or collaborative cohort studies, which rely on samples arising from sampling mechanisms that are myriad and unclear [17].

Formal Frameworks for External Validity

Threats to target validity associated with internal validity (e.g., confounding bias) have been extensively described. We define internal bias as the difference between the association measured in the sample, \( E[Y \mid A=1, S=1] - E[Y \mid A=0, S=1] \), and the SATE, \( E(Y^{a=1} - Y^{a=0} \mid S=1) \). Here, we use \( Y \) to denote the outcome, \( A \) to denote treatment received, \( S=1 \) to denote membership in the study sample, and \( Y^{a} \) to denote the outcome \( Y \) that would occur if treatment \( a \) were assigned (the potential outcome).

Next, we informally define and describe key threats to external validity, since these threats have been less frequently explored in the literature. There are at least three reasons that the SATE may not equal the TATE (we will wait to formally define this latter quantity until later in this section, for reasons that will become clear), including:

  • There are modifiers, Z, of the effect of treatment, and the distribution of those modifiers is different in the study sample and the target population [1, 8, 18,19,20, 21•, 22,23,24];

  • The version of treatment (including details of how the treatment is delivered) impacts the effect of treatment, and the distribution of the versions of treatment is different in the study sample and the target population [25]; and

  • There is interference (one person’s exposure impacts another person’s outcome), and the patterns of interference differ between the two populations.

It is (typically implicitly) assumed that sample membership or trial participation itself, S = 1, does not have a direct effect on the outcome [21•]; that is, if sample membership were itself an intervention (e.g., if the act of being observed as part of being in the study changes participants’ behavior in a way that changes the outcome other than through receipt of the intervention), the “versions” of treatment in the study sample and the target population would differ, and reason 2 above would lead to differences between the SATE and the TATE [25].

The majority of work done on external validity of study results has focused on differences in the distribution of effect modifiers—that is, external biases related to sample composition. The magnitude of the external biases related to sample composition is a function of the probability of selection into the sample, the heterogeneity of treatment effects, and the association between sample membership and effect modifiers [1, 18, 24]. Existing frameworks for describing this problem, determining identifiability of the TATE, and defining estimators of the TATE are more similar to one another than they are different. A key feature of all of these frameworks for defining external validity, however, is that the target population needs to be well-characterized (theoretically enumerable). Moving forward, for mainly logistical but sometimes theoretical reasons, we split external validity into “generalizability” and “transportability.”

“Generalizability” refers to the situation in which our study sample is a proper subset of the target population, but the study sample may or may not be a simple random sample from the target population. That is, the TATE of interest is \( E(Y^{a=1} - Y^{a=0}) \), the effect in a target population of which the study sample members are a part. If the study sample is a simple random sample of the target population, results from that study are generalizable to the target population in expectation. Multiple methods are available to adjust for the situation in which the study sample is not a simple random sample of the target population under two key assumptions: mean exchangeability between the sample and the target population, \( E[Y^{a} \mid Z, S=1] = E[Y^{a} \mid Z] \) for every \( a\in \mathcal{A} \); and positivity of trial participation, \( P(S=1 \mid Z=z) > 0 \) for all \( z \) such that \( P(Z=z) > 0 \) [21•, 22]. Again, \( Z \) is the set of modifiers that are associated with sample membership; however, it is worth noting that all covariates that are associated with the outcome will be effect measure modifiers on at least one scale. The positivity assumption implies that tractable generalizability problems (i.e., situations in which estimation of a TATE is possible from a study sample because identifiability can conceivably be met) are those that do not require extrapolation beyond the characteristics of the persons in the study sample [21•]. The practical implication of the positivity assumption is that, for trials to provide good information about the TATE, they must enroll a full spectrum of patients; for example, trials conducted only in adults < 50 years old cannot provide information about the effect of interest in a target population that includes all adults without making the (strong) assumption that age does not modify that effect.

In contrast, “transportability” has been used to refer to the situation in which our study sample and target population are not overlapping [25,26,27]. That is, the TATE of interest is \( E(Y^{a=1} - Y^{a=0} \mid S=0) \); we are interested in the effect in a set of persons who were NOT members of the study sample, that is, in the complement of the study sample within the set of individuals formed by combining the study sample and the target population. Transportability, then, may involve extrapolation to a different context [26]. Another definition of transportability that has been put forward is the “extension of inferences from [a] trial to a target population that includes participants who are not part of the trial-eligible population” [28]. Here, the implication is that the TATE of interest is \( E(Y^{a=1} - Y^{a=0}) \), but the positivity assumption is not met. This definition of transportability explicitly involves extrapolation beyond the characteristics of the persons in the study. Transportability, then, appears to require stronger assumptions than generalizability in a theoretical sense.

Confusingly, the term “transportability” (or “transportability weights”) has also been used occasionally to describe methods for extending inference from a study sample to a target population to which the study sample belongs (a “generalizability” problem, as defined above), when data on the entire target population are unavailable or the particular subset of the target that participated in the study is not enumerable [27]. For example, we may have a trial of antihypertensive treatment conducted in a sample of adults with hypertension living in the United States (US) but no data on the full target population (all adults with hypertension living in the US). Instead, we may have data on a random or representative sample of the target population. Large governmental surveys often serve this purpose (e.g., NHANES). In this situation, we do not want to “generalize” to the population represented by the union of the data from the study sample, denoted S = 1, and the target population (or a sample of the target population), denoted S = 0. The TATE of interest is no longer \( E[Y^{a=1} - Y^{a=0}] \), but rather \( E[Y^{a=1} - Y^{a=0} \mid S=0] \). Thus, sometimes even when we are theoretically generalizing results, we might use methods that were designed for “transportability” [19].

Assumptions and Identifiability

A set of assumptions sufficient to identify the TATE parallels a sufficient set of assumptions for identification of the SATE. In addition to mean exchangeability between the sampled and unsampled members of the target population (perhaps conditional on covariates) and positivity, we assume: treatment version irrelevance (also called causal consistency), or the same distribution of versions of treatment in the study sample and the target population; no interference, or the same patterns of interference in the study sample and the target population; no measurement error, including of all Z variables; and correct specification of the causal model(s) [22].
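Under these assumptions (and randomization of treatment within the trial), the TATE can be written in terms of observable quantities; for a discrete set of modifiers \( Z \), the identification results implied by the assumptions above can be written as

\[
E[Y^{a}] = \sum_{z} E[Y \mid A=a, Z=z, S=1] \, P(Z=z)
\]

for a generalizability problem, and

\[
E[Y^{a} \mid S=0] = \sum_{z} E[Y \mid A=a, Z=z, S=1] \, P(Z=z \mid S=0)
\]

for a transportability problem, where the inner expectations are estimable from the study sample and the covariate distributions are estimable from data on the target population.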

Bareinboim and Pearl proposed the use of “selection diagrams,” an extension of directed acyclic graphs (DAGs), for encoding assumptions about causal relationships in the study sample and in a distinct target population and then determining whether a TATE is identifiable from the available data [26]. As long as the characteristics that differ between the two populations are all pre-treatment covariates, the assumptions sufficient for generalizability and transportability, and the sufficient sets of covariates for mean exchangeability between the sample and the target population, coincide [29].

Assessing the Generalizability or Transportability of Effects

Important limitations of existing study results for guiding policy or treatment decisions, related to the inclusion (or exclusion) of key populations in public health and medical research, have been qualitatively recognized for some time [14, 30,31,32,33,34,35,36,37,38]. Quantitative assessments of the differences between a study sample and target population improve the rigor of such exercises and include, for example, reweighting the sample by the inverse probability of membership in the sample and then comparing the characteristics of the weighted sample and the target population using standardized mean differences [39, 40]. However, for any (qualitatively or quantitatively observed) differences in sample composition to result in external bias, the characteristics that differ between the study sample and target population(s) of interest must also modify the treatment effect [5].
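As a concrete illustration of the quantitative approach just described, the sketch below (in Python, with hypothetical variable and column names) assumes two pandas data frames, trial and target, that share a set of covariates; it reweights trial participants by the inverse of their estimated probability of sample membership and compares the weighted trial covariate means with the target population using standardized mean differences.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical covariates believed to modify the treatment effect
    covariates = ["age", "female", "baseline_severity"]

    # Stack trial (S = 1) and target-population (S = 0) records
    stacked = pd.concat([trial.assign(S=1), target.assign(S=0)], ignore_index=True)

    # Model the probability of sample membership given covariates
    s_model = LogisticRegression().fit(stacked[covariates], stacked["S"])
    p_sample = s_model.predict_proba(stacked[covariates])[:, 1]

    # Inverse-probability-of-sampling weights (nonzero for trial members only)
    stacked["w"] = np.where(stacked["S"] == 1, 1 / p_sample, 0.0)

    def weighted_smd(x_trial, w_trial, x_target):
        """Standardized mean difference: weighted trial vs. target population."""
        diff = np.average(x_trial, weights=w_trial) - x_target.mean()
        pooled_sd = np.sqrt((x_trial.var(ddof=1) + x_target.var(ddof=1)) / 2)
        return diff / pooled_sd

    in_trial = stacked["S"] == 1
    for z in covariates:
        print(z, weighted_smd(stacked.loc[in_trial, z],
                              stacked.loc[in_trial, "w"],
                              stacked.loc[~in_trial, z]))

Small post-weighting standardized mean differences indicate that, at least with respect to the measured covariates, the reweighted trial sample resembles the target population.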

Any predictor of the outcome is likely to modify the treatment effect on at least one scale. This implies that assessing the generalizability or transportability of effects requires specifying not only the target population to whom one would like to make inference but also the scale on which one would like to report results. There are mathematical arguments supporting the idea that odds ratios are the least heterogeneous measure of association, while risk differences are the most heterogeneous; however, absolute measures of effect are arguably the most meaningful from both a public health and an etiologic perspective.
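A small hypothetical example illustrates this scale dependence. Suppose risks under treatment and control are 0.20 and 0.10 in one stratum of a covariate and 0.40 and 0.20 in another:

\[
\text{Stratum 1: } RR = \tfrac{0.20}{0.10} = 2.0,\; RD = 0.10; \qquad \text{Stratum 2: } RR = \tfrac{0.40}{0.20} = 2.0,\; RD = 0.20.
\]

Here the covariate does not modify the risk ratio but does modify the risk difference, so whether differences in its distribution between the sample and the target produce external bias depends on the scale on which the TATE is reported.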

Methods to Account for External Bias

In Design

The best way to ensure target validity in expectation would be to randomly sample the study sample from the target population (ensuring external validity in expectation) and then randomly assign treatment to members of the study sample. Random sampling of the target population would ensure the sample is representative of the target population on both measured and unmeasured covariates, in expectation. However, random sampling of the target population is often not possible for logistical, ethical, or practical reasons [41,42,43,44]. Design options to improve the generalizability of trial results include purposive stratified sampling [45, 46] and pragmatic or practical clinical trials, which tend to have less restrictive inclusion and exclusion criteria [47, 48]. However, while pragmatic trials are more likely to be generalizable than traditional efficacy trials, they are still not expected to yield study samples that perfectly reflect the target population. Furthermore, many questions about the generalizability and transportability of study results arise after the research has been conducted, with reference to a new target population. It is far more efficient to use existing study results to estimate or approximate the TATE in each of these new target populations than it would be to conduct new, separate trials in all possible target populations of interest.

In Analysis

Just as methods exist to account for non-random treatment assignment (e.g., regression adjustment, propensity score methods including weighting, g-computation or standardization, and doubly robust methods), methods exist to account for non-random sampling into the study sample. Most of these methods are analogous to those used to account for confounding and selection bias. Broadly, these methods can be grouped into methods based on modeling the probability of the outcome, methods based on modeling the probability of sample membership, and doubly robust methods that combine the two approaches [21•, 24, 49, 50].

Outcome model-based approaches to account for the sample composition of a trial typically involve estimating subgroup-specific treatment effects from the sample and then averaging them by the proportion of the target population in each of those subgroups. Pearl termed this the “post-stratification formula.” Fundamentally, the formula looks just like Robins’s g-formula [51], where rather than estimating the average treatment effect by averaging over the distribution of covariates in the sample, we average over the distribution of covariates in the target population. One outcome model-based approach to generalizing study results is to model the outcome as a flexible function of observed covariates using data from the study sample and then predict outcomes for all members of the target population under each treatment of interest. This has been done using parametric regression models [52] and machine learning methods [53, 54].
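A minimal sketch of this outcome-model-based approach, assuming individual-level trial data (treatment column A, outcome Y) and covariate data on every member of the target population; column names are hypothetical, and a flexible regression stands in for whatever outcome model is preferred.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    covariates = ["age", "female", "baseline_severity"]  # hypothetical modifiers Z

    # 1. Fit an outcome model E[Y | A, Z] in the trial sample (S = 1)
    outcome_model = GradientBoostingRegressor().fit(
        trial[covariates + ["A"]], trial["Y"])

    # 2. Predict potential outcomes for every member of the target population
    #    under each treatment, then average over the target covariate distribution
    #    (the g-formula / post-stratification step)
    pred_a1 = outcome_model.predict(target[covariates].assign(A=1)[covariates + ["A"]])
    pred_a0 = outcome_model.predict(target[covariates].assign(A=0)[covariates + ["A"]])
    tate_hat = pred_a1.mean() - pred_a0.mean()
    print("Outcome-model-based estimate of the TATE:", tate_hat)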

Alternatively, the treatment effect can be estimated in the study sample after it has been weighted to look like the target population. The specific details of constructing these sample membership weights depend on whether we envision the problem as one of transportability or generalizability, both theoretically and practically. If we have a dataset enumerating the target population and also the specific members of the target population who were selected into the study sample (a generalizability problem both theoretically and practically), the weights are simply the inverse of the probability of sample membership for everyone in the sample and zero for everyone else. If, on the other hand, the study sample is not a subset of the target population (a transportability problem theoretically) or data on the target population and the study sample are not linkable (a transportability problem practically), we turn to a different set of weights. Data on the study sample and the target population may not be linkable if we only have data on a sample of the target population (e.g., from a sample survey), or if we do not have data on which members of the target population were included in the sample (e.g., from administrative records) [55•]. “Transportability” weights are the inverse of the odds of sample membership for everyone in the sample and zero for everyone else [19, 27].
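The following sketch illustrates the weighting approach for the second case, where trial data (S = 1) and a separate, unlinkable dataset on the target population (S = 0) are stacked; names are hypothetical, and a weighted difference in arm-specific means stands in for a full weighted outcome analysis.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    covariates = ["age", "female", "baseline_severity"]  # hypothetical modifiers Z
    stacked = pd.concat([trial.assign(S=1), target.assign(S=0)], ignore_index=True)

    # Model P(S = 1 | Z) from the stacked data
    s_model = LogisticRegression().fit(stacked[covariates], stacked["S"])
    p = s_model.predict_proba(stacked[covariates])[:, 1]

    # Inverse of the odds of sample membership, for trial rows; zero otherwise
    stacked["w"] = np.where(stacked["S"] == 1, (1 - p) / p, 0.0)

    # Weighted arm-specific means in the reweighted trial; their difference
    # estimates the TATE in the population represented by the S = 0 data
    trial_w = stacked.loc[stacked["S"] == 1]
    mean_a1 = np.average(trial_w.loc[trial_w["A"] == 1, "Y"],
                         weights=trial_w.loc[trial_w["A"] == 1, "w"])
    mean_a0 = np.average(trial_w.loc[trial_w["A"] == 0, "Y"],
                         weights=trial_w.loc[trial_w["A"] == 0, "w"])
    print("Inverse-odds-weighted estimate of the TATE:", mean_a1 - mean_a0)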

In both generalizability and transportability problems, weights are typically estimated using predicted probabilities from a sample membership model, with generalizability weights akin to propensity score weights (inverse probability of exposure weights) and transportability weights resembling those used to estimate the average treatment effect in the treated (ATT) [56]. Applications of sample membership weighting methods have been more prevalent in the literature than applications of outcome model-based methods [18, 24, 57].

In order to implement the methods described above, one must find an appropriate secondary data source on the target population of interest, which can be quite challenging to do in practice [58]. First, the data must include comparable measures on a sufficient set of covariates, such that the assumption of mean exchangeability between the sample and the target is met. Sensitivity analyses are possible if data on effect modifiers are missing in the target population, or in both data sets, for example by specifying plausible distributions of those effect modifiers and the strength of the effect modification [59, 60]. Second, one must assume that the population data are either a random sample of the true target population of interest or a complete census. However, many promising sources of publicly available data come from complex surveys, like the National Survey on Drug Use and Health (NSDUH), where it is known that study participants were not in fact randomly sampled [61]. Simply transporting RCT results to survey samples like NSDUH without properly accounting for the sampling methodology will result in biased estimates of the TATE. In other words, generalizations may be accurate for the NSDUH study sample but not for the population that the NSDUH sample represents. Recent methodological work has addressed this by determining how to incorporate survey weights from population data into existing generalizability methods [62].
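When the target data come from a complex survey, one illustrative way to carry the survey design into the weighting sketch above is to let each survey respondent contribute to the sample-membership model in proportion to their survey weight; this is an assumption on our part for the purpose of illustration, not necessarily the specific estimator developed in [62].

    # Continuing the weighting sketch above; 'svy_wt' is a hypothetical column of
    # survey weights on the target (S = 0) rows, and trial rows contribute weight 1.
    fit_wts = np.where(stacked["S"] == 1, 1.0, stacked["svy_wt"])
    s_model = LogisticRegression().fit(stacked[covariates], stacked["S"],
                                       sample_weight=fit_wts)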

Novel Study Designs to Account for External Bias

Two particularly novel study designs have been demonstrated in real-world data to account for external bias and explicitly assess the plausibility of some of the key assumptions (detailed above).

Nesting Trials Within Clinical Cohorts

If trials obtain permission from participants to link to their medical record data or if trials are nested within medical systems such that trial participants are identifiable within the population of patients who would have been eligible to participate in the trial, there are unique study design possibilities. Most basic is the potential for the methods described above to be used to generalize trial results to the broader target population [19, 22, 28, 49]. Alternatively, if the treatment under study is available outside of the trial, trial results that have been generalized to the cohort could be compared with the estimated effect of treatment in the cohort based on observational data [63]. Generalized trial results would be expected to differ from the truth if the adjustment set did not include a sufficient set of modifiers, while the association between exposure and outcome estimated in the target population directly using non-randomized treatment would be expected to differ from the truth if the adjustment set did not include a sufficient set of confounders [63,64,65,66,67]. If results from the two approaches are similar, we can have more confidence in the estimate of the TATE [21•, 63]. This is an example of triangulation of study results [68].

Leveraging Lack of Treatment Availability Outside a Trial

If at least one arm of the trial (treatment A = a) is currently available in the target population, the assumption of “mean generalizability” or “mean exchangeability over S” [21•] can be partially evaluated by comparing the generalized outcomes from the trial under treatment a to the observed outcomes under treatment a in the target population [69]. A major difference between the generalized and observed outcomes implies that the generalized treatment effect for treatment a versus a′ is likely to differ from the true TATE. This type of analysis is particularly useful when studying a novel treatment, where the placebo or standard-of-care arm of the trial is the only (currently) available treatment in the target population. In such cases, comparing the generalized outcomes in the placebo arm of the trial to the observed outcomes in the target population gives a sense of whether the (partial) conditional exchangeability assumption over S is likely to have been met. This approach has been demonstrated in education [24] and in health using contemporaneous controls in the general population before broad availability of the treatment under study [70]. It has also been demonstrated using historical controls when estimating the effects of experimental medical treatments for terminally ill patients available under “right to try” laws [71]. Critically, this method only tests for a failure to transport or generalize. Even if the generalized outcomes and observed outcomes in the target population are identical under a, there is no guarantee that the generalized outcomes under treatment a′ will equal the unobserved outcomes in the target population under a′.
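A brief sketch of this benchmarking check, reusing the inverse-odds weights (trial_w, w) from the weighting sketch above and assuming the control condition (A = 0) is the treatment available in the target population, whose outcomes are recorded in a hypothetical column Y_obs.

    import numpy as np

    # Generalized (weighted) mean outcome under control, from the trial control arm
    ctrl = trial_w.loc[trial_w["A"] == 0]
    generalized_ctrl_mean = np.average(ctrl["Y"], weights=ctrl["w"])

    # Observed mean outcome under the same control/standard-of-care condition
    # in the target population
    observed_ctrl_mean = target["Y_obs"].mean()

    # A large discrepancy flags a likely violation of exchangeability over S;
    # close agreement does not guarantee the treated-arm outcomes would transport.
    print(generalized_ctrl_mean - observed_ctrl_mean)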

Conclusions

The utility of an estimate for informing a public health decision is a function of how accurately it maps to the causal effect of interest in the relevant target population [2]. Recent work on target validity has focused on increasing awareness of the impact that external bias has on overall bias. Valid estimates of effect for relevant target populations are attainable, under the set of assumptions described above, given rich descriptive data on target populations and new methods for extending results from one study sample to another population. The strength of the assumptions under which such an extension is possible is a function of how different the study sample and target population are from one another with respect to covariates that modify the effect of treatment. Although representativeness (of a study sample with respect to the target population) may not be necessary for all studies [41,42,43,44], the distribution of covariates in the sample and the target population should, at a very minimum, be considered when answering policy-relevant questions. Methods are available to account for differences in measured covariates between a study sample and target population and should be carefully implemented when drawing population inferences from non-representative samples.