Introduction

Many pharmacoepidemiology studies compare safety and effectiveness across treatment options that have not been randomly assigned. Treatment groups may differ in terms of prognostic factors, and crude comparisons will often lack a causal interpretation. Such bias arises because health professionals rightly use clinical parameters to recommend treatment when they anticipate potential benefit and withhold it when concerned about adverse events. This phenomenon and its variants are referred to as confounding by indication, channeling, or protopathic bias [1]. Propensity score methods were designed to confront such confounding. They do so by modeling how prognostic factors (henceforth covariates, X) guide treatment decisions and using this knowledge to construct treatment groups with similar covariate distributions. Given their reliance on a treatment model instead of (or in addition to) an outcome model, propensity scores may be particularly relevant for pharmacoepidemiology, where rare events and potentially large numbers of confounders often make adequate outcome models difficult to build. Moreover, in medicine, treatment decisions are well understood, and investigators can leverage clinical expertise and guidelines to build plausible models. Because they draw on the familiar concept of balance from randomized trials, propensity score analyses are also very accessible to clinical, industry, and regulatory stakeholders.

In this review, we discuss critical aspects in the use of propensity scores in pharmacoepidemiologic research. We address study design, covariate choice, model selection, using the propensity score, and strategies for dealing with unmeasured bias. For each, we highlight current understanding, recent developments, and opportunities for progress.

Data and Design in Pharmacoepidemiology

Treatment Comparisons

Pharmacoepidemiology often employs routinely collected healthcare data generated to support payment and reimbursement for medical services and products (e.g., insurance claims) and the delivery of medical care (e.g., electronic medical records) [2]. To estimate causal effects from these data, investigators must carve out a sample of patient experience that resembles a well-designed study that is conceivable in the real world, ethics and practicalities aside [3••]. Many design flaws at this stage can be avoided by anchoring study entry, treatment group assignment, and start of follow-up around a change in drug use.

The “new user” or “incident user” design anchors a study at the initial prescription dispensing (henceforth “use”) of a drug after some period of non-use, comparing initiators either with initiators of an active comparator or with non-users, with treatment group membership held fixed. More elaborate permutations anchor at other treatment decisions, e.g., a dose intensification [4, 5, 6•]. These designs, which can be incorporated in distributed data network studies, are intended to mitigate pernicious confounding, selection, and immortal-time biases and to answer clinically relevant questions [7,8,9]. Unless noted otherwise, our review will consider a comparison of incident use vs. non-use but extends to other treatment decisions.

Targets of Inference

Another key aspect of design is the choice of a target population, which hinges on the underlying clinical question. When interested in treatment effects among the study sample, the focus of this review, one may seek to estimate effects among the entire sample (the sample average treatment effect (SATE)) or among a treatment group (the sample average effect of treatment on the treated (SATT)) [10]. These estimands will differ when effects are heterogeneous and the distributions of effect modifiers vary across treatment groups [11]. When the distributions are grossly dissimilar (an extreme being ubiquitously prescribed or withheld treatment for certain types of patients), lack of overlap precludes estimating the SATE and, in severe cases, the SATT [12, 13]. In a later section, we discuss possible strategies to estimate the SATT or effects among an “overlap” population that has shared common support [12, 13].

Causal Inference, Potential Outcomes, and the Propensity Score

Potential Outcomes

Propensity score methods draw on the potential outcomes framework that was developed for randomized trials [14, 15]. Therein, we consider potential outcomes \( Y_i(a) \) under treatment, \( Y_i(1) \), and no treatment, \( Y_i(0) \), observing only one for a given individual \( i \). Under some circumstances, we can expect individuals to share the same distribution of potential outcomes regardless of their actual treatment status, i.e., \( Y(a) \perp A \) for all \( a \), allowing a contrast of outcomes among those actually treated vs. not, \( E[Y \mid A=1] - E[Y \mid A=0] \), to stand in for a causal contrast of outcomes had everyone been treated vs. not, \( E[Y(1)] - E[Y(0)] \). This contrast is unbiased in a randomized trial, where we know that treatment assignment is independent of potential outcomes. When treatment is merely observed, that assumption is not guaranteed. We attempt to measure enough covariates X such that the potential outcomes are rendered conditionally independent of treatment, i.e., \( Y(a) \perp A \mid X \), and then estimate the conditional average causal effect \( E[Y \mid A=1, X] - E[Y \mid A=0, X] \). Independence between potential outcomes and treatment is often referred to as “ignorability” or “exchangeability” [15, 16]. Moving from potential to observed outcomes often relies on the stable unit treatment value assumption (SUTVA), which encodes that (i) the treatment of one individual does not affect the outcome of another (“no interference”) and (ii) the outcome observed under actual treatment and under a hypothetical intervention assigning that treatment are equivalent (“treatment-version irrelevance” and consistency) [17, 18]. Some methods have been extended to relax SUTVA [19, 20].
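
For intuition, the identification argument can be written out explicitly; this is the standard derivation under conditional ignorability, consistency, and positivity (each covariate pattern having a nonzero probability of each treatment):

\[
E[Y(a)] = E_X\left\{ E[Y(a) \mid X] \right\} = E_X\left\{ E[Y(a) \mid A = a, X] \right\} = E_X\left\{ E[Y \mid A = a, X] \right\},
\]

where the second equality uses \( Y(a) \perp A \mid X \) and the third uses consistency, so the causal contrast \( E[Y(1)] - E[Y(0)] \) can be recovered from observed data by averaging the covariate-conditional treated-vs.-untreated contrasts over the distribution of X.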

The Propensity Score for Binary and Categorical Treatments

For a binary treatment A of use (A = 1) vs. non-use (A = 0), the propensity score e(X) is the probability of use given the covariates X, i.e., P(A = 1| X). Rosenbaum and Rubin proved that (i) among patients with the same propensity score e(X), the covariates X will be balanced; and (ii) if one can estimate causal treatment effects by adjustment for X (i.e., ignorability holds given X), then one can estimate causal treatment effects by adjusting for the propensity score alone (i.e., ignorability holds given e(X)) [15].

To compare treatments 1 (A = 1) and 2 (A = 2) with a common referent of non-use (A = 3), one needs to balance covariates across all three groups. This can be achieved by defining a propensity score function—the generalized propensity score—that describes how the distribution of treatment A depends on covariates X. This is a set of probabilities P(A = a| X) that sum to one, in this case P(A = 1| X), P(A = 2| X), and P(A = 3| X) [21, 22]. Among individuals with the same propensity score function, the covariate distributions of all three treatment groups are balanced. The generalized propensity score can be extended to continuous and ordinal treatments such as dose [22, 23].
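
To make this concrete, the following minimal sketch estimates the full set of probabilities P(A = a| X) with a multinomial logistic regression. The data frame `df`, treatment column "A", and covariate list `covs` are hypothetical placeholders, and a real analysis would go on to assess balance within levels of the estimated score.

```python
# A minimal sketch, assuming a pandas DataFrame `df` with a three-level
# treatment column and a list of covariate columns `covs`.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def generalized_ps(df: pd.DataFrame, treatment: str, covs: list[str]) -> pd.DataFrame:
    # Multinomial logistic regression yields one probability per treatment
    # level; by construction, the probabilities in each row sum to one.
    model = LogisticRegression(max_iter=1000).fit(df[covs], df[treatment])
    probs = model.predict_proba(df[covs])
    return pd.DataFrame(probs, columns=[f"e_{a}" for a in model.classes_], index=df.index)
```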

Building the Propensity Score Model

In modeling the propensity score, the goal is not to perfectly predict treatment; indeed, perfect prediction would imply intractable confounding. Rather, it is to reduce confounding by (i) selecting enough covariates to render potential outcomes independent of treatment; (ii) producing estimates of the probability of treatment that accurately reflect how treatment is related to the covariates; and (iii) using those probabilities to create treatment and comparison groups with similar covariate distributions. In this sense, measures of the discriminatory power of a predictive model, such as the c-statistic, should guide neither covariate choice nor model specification [24, 25].

Covariate Definition and Selection

Choosing covariates to measure is challenging when the underlying causal structure is unknown. For a thorough overview of relevant issues, see the review by Sauer et al. [26]. A safe strategy is to include risk factors for the outcome (these will improve statistical precision) [27]. Covariates that predict treatment but not the outcome (true instruments) are best avoided, as these can not only reduce precision [27] but also amplify bias from any remaining confounding, a phenomenon described as Z-bias [28]. For all other covariates, simulation and theoretical results favor adjusting for the covariate to reduce potential confounding [28,29,30, 31•].

In databases where it is possible to measure hundreds of covariates, algorithms have been proposed to help investigators choose among them. The high-dimensional propensity score (hdPS) approach selects all covariates that are associated with treatment and outcome, in the hope that adjusting for a rich set of proxy covariates will protect against unmeasured confounding [32]. Recent work on hdPS warns against pre-screening covariates by their prevalence and suggests that its performance in studies with few exposed outcomes can be enhanced by incorporating machine learning tools and Bayesian estimation of the covariate-outcome association [33,34,35]. A variety of other algorithms have been developed to improve the treatment effect estimate in high-dimensional settings while limiting the selected covariates to a set that suffices to control for confounding. In one, a backwards selection algorithm sequentially discards variables that are independent of the outcome given treatment and the remaining covariates [36]. In another, the “least absolute shrinkage and selection operator” (lasso) is applied to an outcome regression model to select covariates for inclusion in the propensity score [34]. Several methods seek to also optimize the mean square error of the treatment effect, including procedures that iteratively select variables for candidate outcome and propensity score models (Collaborative Targeted Maximum Likelihood (C-TMLE) and Bayesian Adjustment for Confounding (BAC)) and modified stepwise “change-in-estimate” selection strategies [37,38,39,40,41]. All of these strategies presume a single outcome. When several are of interest, simulation results suggest that a generic propensity score model based on their shared confounders performs nearly as well as separate models built for each outcome [42]. These algorithms appear promising for variable selection but have not been studied in depth.
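
As an illustration of the outcome-lasso idea, the sketch below selects outcome predictors with an L1-penalized regression and then fits the propensity score on the retained set. This is our schematic reading of the approach rather than the reference implementation in [34]; `X`, `y`, and `a` are hypothetical arrays, and for a binary outcome one could swap in an L1-penalized logistic model in the first step.

```python
# A hedged sketch: lasso on the outcome model chooses covariates,
# then an (essentially unpenalized) logistic propensity score model is fit.
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegression

def lasso_selected_ps(X: np.ndarray, y: np.ndarray, a: np.ndarray) -> np.ndarray:
    lasso = LassoCV(cv=5).fit(X, y)      # the outcome, not treatment, drives selection
    keep = np.flatnonzero(lasso.coef_)   # covariates with nonzero coefficients
    ps_model = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, keep], a)
    return ps_model.predict_proba(X[:, keep])[:, 1]
```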

In defining and measuring covariates, one must ensure that they are truly pre-treatment covariates. This follows naturally in a design that aligns cohort entry, the start of follow-up, and treatment definitions at a change in treatment, e.g., the index date. If covariates are not stable attributes over the study period, such as a measure of symptoms, care must be taken to ensure that such time-varying covariates are not effects of treatment. One should assess these before the index date and determine whether the immediate pre-treatment value or a richer summary of its history is more predictive of the outcome and the indexing treatment decision [43]. Assessments should also consider sensitivity to whether covariates are defined using all available data or a fixed look-back window [44]. Though some exploratory studies have examined these strategies empirically and through simulation, there is not unequivocal evidence regarding their merits and shortfalls [45]. An excellent review by Brookhart et al. describes additional concerns that arise when using medical billing and service codes to assess health status [46].

Modeling the Propensity Score

In observational studies, the true propensity score is unknown and must be estimated. For binary treatments, this is typically accomplished through a logistic regression model with at least main effects. The impact of iteratively adding interaction and higher-order terms can be evaluated by how well the resulting propensity score balances covariates (discussed later). Some alternatives seek to automate the process of covariate and model selection by leveraging machine learning tools, regularization, or loss-based estimation. For example, implementations of ensemble methods of bagged/boosted classification and regression trees and random forests appear to balance covariates better than logistic regression in simulation studies, though the reverse has been seen with empirical data [25, 47]. Other approaches seek to optimize not prediction error but covariate selection and the propensity score’s performance in constructing comparable treatment groups. These include C-TMLE, mentioned above [39, 40]; the outcome-adaptive lasso, which uses shrinkage to deselect variables that predict exposure but not the outcome [48]; generalized boosted models, which combine piecewise regression trees to capture interactions [49, 50]; and the covariate balancing propensity score, which uses a generalized method of moments approach [51, 52]. Early simulations suggest that these methods perform well, though their limits and tradeoffs have yet to be fully characterized [53].
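
As a schematic of these modeling choices, the sketch below fits both a main-effects logistic regression and a boosted-tree alternative. The data frame `df` and column names are hypothetical, and either fit should ultimately be judged by the covariate balance it produces, not by its predictive accuracy.

```python
# A minimal sketch of two common propensity score estimators.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def fit_ps(df: pd.DataFrame, treatment: str, covs: list[str]) -> pd.DataFrame:
    X, a = df[covs], df[treatment]
    logit = LogisticRegression(max_iter=1000).fit(X, a)  # main effects only
    gbm = GradientBoostingClassifier().fit(X, a)         # trees capture interactions
    return pd.DataFrame(
        {"ps_logit": logit.predict_proba(X)[:, 1],
         "ps_gbm": gbm.predict_proba(X)[:, 1]},
        index=df.index,
    )
```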

Some complexities are worth mentioning. Measurement error in covariates is common in administrative healthcare data and can lead to residual confounding [2, 46]. Recent theoretical work on leveraging prior knowledge and external validation samples to correct the propensity score could be explored as a solution [54,55,56]. Another complexity is that treatments may be administered differently across time, physicians, health institutions, and systems, reflecting gradients in practice or quality of care. Some argue against estimating propensity scores within clusters when the clusters have no effect on outcomes (i.e., the clustering variables are instruments) [57]. Others point out that failing to reflect heterogeneous relationships between covariates and treatment can lead to a mis-specified propensity score model [58] and propose a model-fitting strategy to confront this [59]. These concerns can be evaluated empirically by checking whether propensity scores that ignore clustering still produce comparable treatment groups. Future empirical work on this issue should consider therapeutic examples where there is considerable clinical uncertainty and discretion (e.g., psychiatry), complex trade-offs between benefits and risks (e.g., anticoagulant therapy), and where treatment rules are less established (e.g., newly marketed medications).

Evaluating the Propensity Score Approach

The performance of the propensity score approach should be assessed in terms of how well it has balanced covariates across treatment groups. Technically speaking, perfect balance would imply the same multivariate covariate distributions across treatment groups, though this is likely impossible to achieve, let alone diagnose. A less ambitious goal is to only balance covariates in ways that reflect their role in a hypothesized model for the outcome [60•]. For example, an additive outcome model would suggest that balance of marginal means is sufficient, whereas a non-additive model would suggest that relevant interactions also be balanced. Moreover, balance on covariate transformations (such as a log-transformation or higher-order terms) should also be achieved if these are related to the outcome [61]. Therefore, aggregate measures of balance, such as overlap in treatment densities, the Mahalanobis distance, or the average standardized mean difference, may be useful in detecting gross imbalance but can still mask important differences [62•, 63].

Balance should thus be assessed for each covariate. Although hypothesis tests of equality of means can flag residual imbalances that threaten causal inference, they are generally avoided in propensity score analyses because balance is an “in-sample” property and hypothesis tests conflate substantive differences with sample size [10, 64]; in large healthcare datasets, even clinically irrelevant differences can manifest as statistically significant. The standardized mean difference for each covariate can be reported by dividing the difference in covariate means by the pooled standard deviation in the original population (e.g., unmatched/unweighted). In terms of benchmarks, absolute standardized mean differences of less than 0.25 or 0.1 have been put forth as rules of thumb, but ideally imbalances should be minimized without limit [10, 60]. One can go beyond the means to diagnose differences in covariate distributions using ratios of variances, the Kolmogorov-Smirnov distance, or box plots. It is worth pointing out that covariate balance is merely a sufficient condition for comparability of treatment groups [65]. The predicted counterfactual outcome among the referent treatment group can be leveraged to assess “prognostic” balance that summarizes over multiple covariates [66]. With all balance measures, their assessment should align with how the propensity score is to be used: they should be applied in matched samples, after inverse probability weighting, or within levels of propensity score subclasses [67, 68•, 69]. Residual imbalances can be tackled by including the unbalanced covariates in the outcome model, a form of double robustness [70].
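
A minimal sketch of the standardized mean difference just described, with optional weights so the same diagnostic can be reused after weighting; following the text, the pooled standard deviation is taken from the original (unweighted) sample. `x`, `a`, and `w` are hypothetical arrays.

```python
import numpy as np

def smd(x: np.ndarray, a: np.ndarray, w: np.ndarray | None = None) -> float:
    # Difference in (possibly weighted) covariate means between groups...
    if w is None:
        w = np.ones_like(x, dtype=float)
    m1 = np.average(x[a == 1], weights=w[a == 1])
    m0 = np.average(x[a == 0], weights=w[a == 0])
    # ...divided by the pooled SD of the ORIGINAL sample, so the denominator
    # stays fixed when comparing balance before and after adjustment.
    pooled_sd = np.sqrt((x[a == 1].var(ddof=1) + x[a == 0].var(ddof=1)) / 2)
    return float((m1 - m0) / pooled_sd)
```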

Estimating Treatment Effects in Observational Studies

Matching

Matching is the most popular use of the propensity score in pharmacoepidemiology and will generally estimate the SATT [63]. Here, the propensity score is used as a measure of distance between treated and comparison units. The simplest algorithm uses the propensity score to find a “matched” comparison for each treated unit (“one-to-one matching”). Differences in the outcome between treated and comparison groups in the matched sample, estimated non-parametrically or otherwise, provide estimates of the treatment effect. A frequent modification to this nearest-neighbor approach is to only select comparisons that fall within a certain distance of the propensity score (a caliper). Though this leads to better balance, it can cause some treated units to be discarded, such that the target reflects an “overlap” population rather than the full treated group [71]. Most implementations of matching use a “greedy” algorithm that can exhaust the best matches early in the process without regard for the overall similarity of the treated and comparison groups. An alternative, optimal matching, seeks to minimize the average distance across pairs. Another variation finds more than one match for each treated unit (1:k matching, variable ratio matching). While improving precision, this can increase bias due to the inclusion of more distant comparisons [72]. See Stuart for a discussion of these tradeoffs [67]. It may be possible to apply a weight of 1/k after variable ratio matching to decrease the influence of larger matched sets, but the implications and utility of this proposal should be studied further. Extending these matching approaches to the generalized propensity score requires choosing an appropriate distance measure targeting a particular SATT or a population with common support, but success will depend on the degree and nature of overlap [73]. Exciting new methods such as cardinality matching bypass estimation of the propensity score entirely while optimizing sample size given specified covariate balance constraints (even for categorical treatments) [60]. Future empirical research might compare their performance with existing approaches, especially when the propensity score and prognostic score are integrated as distance measures.
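
The greedy nearest-neighbor algorithm described above can be sketched in a few lines. `ps` and `a` are hypothetical arrays; a common (but not mandated) caliper choice is a fraction of the standard deviation of the score or its logit, and a production analysis should use a vetted matching package.

```python
import numpy as np

def greedy_caliper_match(ps: np.ndarray, a: np.ndarray, caliper: float) -> list[tuple[int, int]]:
    treated = np.flatnonzero(a == 1)
    controls = set(np.flatnonzero(a == 0))
    pairs = []
    for t in treated:  # greedy: each treated unit takes the closest remaining control
        if not controls:
            break
        c = min(controls, key=lambda j: abs(ps[j] - ps[t]))
        if abs(ps[c] - ps[t]) <= caliper:
            pairs.append((int(t), int(c)))
            controls.remove(c)  # matching without replacement
        # treated units with no control inside the caliper are discarded,
        # shifting the target toward an "overlap" population
    return pairs
```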

Subclassification

An alternative to matching is to divide the population into subclasses according to the propensity score distribution in the overall population or a particular treatment group [74]. Subclass indicators and their interactions with treatment can be entered as covariates in a regression model; alternatively, marginal estimates of the SATE or SATT can be obtained by averaging over the subclass-specific effects through weighting [75]. The subclasses may be based on quantiles (e.g., quintiles or deciles) or some other scheme. The optimal classification strategy and its granularity may depend on whether treatment and outcome are rare, e.g., as in the case of safety outcomes for newly marketed medications [76]. Nevertheless, the working assumption is that within chosen subclasses, the treated and comparisons have similar covariate distributions, which can be confirmed empirically [66]. It has been shown that creating just five to ten subclasses can remove at least 90% of the bias attributable to the covariates used to construct the propensity score [77]. With very fine strata, subclassification is akin to full matching; at its limit, it implies inverse probability weighting [78]. Subclassification can also be used with the generalized propensity score [23].
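
The averaging-over-subclasses step can be sketched as follows; weighting each subclass by its count of treated units targets the SATT, whereas weighting by its share of the full sample would target the SATE instead. `ps`, `a`, and `y` are hypothetical arrays.

```python
import numpy as np
import pandas as pd

def subclass_satt(ps: np.ndarray, a: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    strata = pd.qcut(ps, q=k, labels=False)  # quantile-based subclasses
    effects, weights = [], []
    for s in range(k):
        in_s = strata == s
        # subclass-specific treated-minus-comparison mean difference
        effects.append(y[in_s & (a == 1)].mean() - y[in_s & (a == 0)].mean())
        weights.append(int((in_s & (a == 1)).sum()))  # SATT: weight by treated count
    return float(np.average(effects, weights=weights))
```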

Weighting

In both matching and subclassification, the values of the propensity score are used to create sets in which the treated and comparisons have similar propensity scores (though not exactly equal) and thus similar covariate distributions. In contrast, weighting uses the propensity score values directly. A weighting approach that estimates the SATE defines weights as the inverse of the probability of the treatment received, \( \frac{1}{P\left[A=a|X\right]} \), mirroring weights in survey sampling [79]. The weights can then be used in non-parametric or parametric analyses of the SATE. Hypothesis tests and 95% confidence intervals can be constructed with a robust sandwich variance estimator [80]. To improve precision, the weights can be stabilized by including the unconditional probability of the treatment received in the numerator, \( \frac{P\left[A=a\right]}{P\left[A=a|X\right]} \). Assessments of effect heterogeneity across a moderator V should include it within the conditioning event of both the numerator and denominator, i.e., \( \frac{P\left[A=a|V\right]}{P\left[A=a|X,V\right]} \) [81]. This formulation can also be extended to continuous treatments [16]. If interest lies in the effect among a particular treatment group (i.e., the SATT), then a weighting strategy sometimes called “weighting by the odds” or “SMR weighting” uses that treatment’s conditional probability as the numerator, i.e., \( \frac{P\left[A={a}^{\prime }|X\right]}{P\left[A=a|X\right]} \), where \( a^{\prime} \) is the reference treatment level of interest and \( a \) denotes the unit’s actual treatment condition [82]. By definition, the weights for individuals in the reference treatment group of interest reduce to 1.
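
For the binary case, the three weighting schemes above reduce to a few lines; `ps` denotes the estimated P(A = 1| X) and `a` the observed treatment, both hypothetical arrays.

```python
import numpy as np

def iptw_weights(ps: np.ndarray, a: np.ndarray) -> dict[str, np.ndarray]:
    p_received = np.where(a == 1, ps, 1 - ps)              # P(A = a | X)
    p_marginal = np.where(a == 1, a.mean(), 1 - a.mean())  # P(A = a)
    return {
        "unstabilized": 1.0 / p_received,
        "stabilized": p_marginal / p_received,
        # SMR weights with the treated as referent: 1 for the treated,
        # e(X) / (1 - e(X)) ("the odds") for the comparisons.
        "smr": np.where(a == 1, 1.0, ps / (1 - ps)),
    }
```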

A limitation of weighting is that areas of limited overlap can result in low treatment probabilities and extremely large weights for certain individuals. These produce wider 95% confidence intervals, though if the propensity score model and weight specifications are correct, this level of influence and uncertainty is appropriate. In practice, though, it is common to sacrifice a little validity for precision by truncating the weights at the 99th or another percentile [83]. The impact of this procedure can be assessed by inspecting covariate balance [68•, 84••].
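
Truncation itself is a one-liner; as noted, balance should be re-inspected afterwards (e.g., with the smd() sketch above).

```python
import numpy as np

def truncate_weights(w: np.ndarray, upper_pct: float = 99.0) -> np.ndarray:
    # Cap weights at the chosen percentile; trades a little validity for precision.
    return np.minimum(w, np.percentile(w, upper_pct))
```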

To some extent, areas of non-overlap can be addressed by “trimming” the sample to remove treated observations in the tails of the propensity score distribution that may lack comparison or treated units (as with caliper matching) [12, 13]. This can be particularly relevant, and important, in pharmacoepidemiology contexts where there is a group of people who would never be prescribed one of the drugs (treatments) of interest, because of contraindications or some other reason. Restricting attention to individuals in the area of common support can be thought of as focusing on the individuals for whom there is some clinical equipoise in terms of which treatment to provide [85]. An exciting advance is the development of weighting strategies that target not the SATE or SATT but the treatment effect among the population with common support. These include the “matching weight” which asymptotically targets a one-to-one matched sample by replacing the numerator with the minimum of the conditional treatment probabilities: e.g., \( \frac{\min \left(P\left[A=0|X\right],P\left[A=1|X\right]\right)}{P\left[A=a|X\right]} \) in the binary case [86]. An alternative is the “overlap weight” which targets the entire population of common support by replacing the numerator with the product of the conditional treatment probabilities: e.g., \( \frac{P\left[A=0|X\right]\times P\left[A=1|X\right]}{P\left[A=a|X\right]} \) [87]. Like standard inverse probability weighting, they can accommodate categorical treatments but have the added benefit of being bounded between 0 and 1 [87, 88]. For categorical treatments, they do, however, presuppose sufficient overlap across all (and not just pairwise) comparisons. Early studies suggest that, in the context of up to three unequally sized treatment groups and rare binary outcomes, matching weights exhibit better balance and lower bias and mean square error compared to standard inverse probability weighting and matching [88, 89]. But their performance in simulated or empirical cases with more than three treatment groups (a realistic setting in pharmacoepidemiology) has yet to be evaluated.
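
The matching and overlap weights follow directly from the formulas above in the binary case; as before, `ps` is the estimated P(A = 1| X) and `a` the observed treatment.

```python
import numpy as np

def overlap_family_weights(ps: np.ndarray, a: np.ndarray) -> dict[str, np.ndarray]:
    p_received = np.where(a == 1, ps, 1 - ps)
    return {
        # min(P[A=0|X], P[A=1|X]) in the numerator: targets a 1:1-matched population
        "matching": np.minimum(ps, 1 - ps) / p_received,
        # P[A=0|X] * P[A=1|X] in the numerator: targets the overlap population;
        # weights are bounded (treated get 1 - e(X), comparisons get e(X))
        "overlap": ps * (1 - ps) / p_received,
    }
```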

Regression

A final approach is to regress the outcome on treatment and the estimated propensity score. In pharmacoepidemiology, where outcomes are often non-linear and treatment effects heterogeneous, estimating unbiased treatment effects will require that such models are correctly specified (which might involve higher-order or spline terms for the propensity score and its interaction with treatment) [90]. For this reason, other uses of the propensity score are generally favored. Nonetheless, some recent theoretical insights have been developed [90]. If the propensity score model is correct, in large samples, one can obtain valid tests of the null hypothesis of no treatment effect (on the difference or ratio scale) provided a robust variance estimator is used [90]. It remains unclear how to diagnose the balancing property of the estimated propensity score when it is to be used as a regressor.
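
A hedged sketch of this regression use, with spline terms for the score and a treatment interaction as suggested above; the column names (`y`, `a`, `ps`) are assumptions, and a robust variance is requested for the reason just noted.

```python
import statsmodels.formula.api as smf

def ps_regression(df):
    # bs() expands the estimated score into a B-spline basis; interacting it
    # with treatment lets the effect vary along the score. cov_type="HC1"
    # requests a robust (sandwich) variance estimator.
    return smf.ols("y ~ a * bs(ps, df=4)", data=df).fit(cov_type="HC1")
```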

Joint and Time-Varying Treatments

Our review has focused on causal inference for treatment decisions at a common (single) decision point. But pharmacoepidemiology often explores effects of discontinuing therapy or of adhering to treatment plans. Moreover, treatment decisions involving the addition or withdrawal of a second drug to increase effectiveness, mitigate side effects, or treat an emerging comorbidity are vitally important. With each of these questions, including that of drug-drug interaction, naïve regression on time-varying versions of propensity scores can be inappropriate and induce bias because time-varying covariates (and propensity score summaries of them) may be affected by earlier treatment [91, 92]. Special methods such as marginal structural models, structural nested models, or adaptations of the g-formula are required to adequately adjust for confounding in the presence of such feedback loops [80, 81, 93,94,95,96]. The spirit of design in propensity score approaches can be retained in these analyses by empirically assessing feedback and adapting balance measures to account for treatment history [84••].
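
To give a flavor of these methods, the sketch below computes stabilized weights for a marginal structural model from person-period data. The column names (id, t, a, a_lag, a baseline covariate V, a time-varying covariate L) and the deliberately simple pooled logistic models are assumptions, not a prescription.

```python
import numpy as np
import statsmodels.formula.api as smf

def msm_stabilized_weights(df):
    # Denominator model: treatment given treatment history and time-varying covariates.
    den = smf.logit("a ~ a_lag + V + L + t", data=df).fit(disp=0)
    # Numerator model: treatment given treatment history and baseline covariates only.
    num = smf.logit("a ~ a_lag + V + t", data=df).fit(disp=0)
    p_den = np.where(df["a"] == 1, den.predict(df), 1 - den.predict(df))
    p_num = np.where(df["a"] == 1, num.predict(df), 1 - num.predict(df))
    out = df.assign(ratio=p_num / p_den).sort_values(["id", "t"])
    # Cumulative product over each person's follow-up yields the stabilized weight.
    out["sw"] = out.groupby("id")["ratio"].cumprod()
    return out
```

The outcome model (e.g., a weighted pooled logistic regression of the event indicator on treatment history and baseline covariates) would then be fit with sw as weights and standard errors clustered on id.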

Bias Correction for Unobserved Confounding

So far, our discussion has assumed that there is no unmeasured confounding. When this assumption fails, the treatment effect estimate will be biased. Techniques are available to help quantify the potential bias. If the unmeasured confounder(s) are available in a subset of the study sample or an external sample, a technique known as propensity score calibration has been proposed, along with a two-stage approach that avoids its surrogacy assumption [97,98,99]. An alternative strategy is to carry out a sensitivity analysis (subsumed within “bias analysis”) that models potential bias from unmeasured confounding as a function of how strongly the treatment and outcome are associated with an unmeasured confounder [100,101,102]. Practical implementations of these tools rely on certain no-interaction assumptions, but recent work has provided new bounding formulas that do not require them and are much easier to use [103••]. One can also pursue a sensitivity analysis by defining a bias parameter as the difference in potential outcomes across treatment groups given the propensity score [104]. Uncertainty in the parameters can be incorporated using frequentist or Bayesian frameworks, as in probabilistic bias analysis [105, 106]. The concept of design sensitivity extends the logic of sensitivity analysis to compare the implications of potential unmeasured bias across a range of study design features and can be suited to evaluating effects that are substantial but rare [107••]. Concerning longitudinal treatment effects, many important time-varying confounders are often not available in electronic healthcare databases. Though a few correction and sensitivity-analysis techniques exist, the easier-to-use tools we have described have yet to be translated to these settings [108, 109].
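
As an illustration, and assuming the bounding formulas in [103••] are of the E-value form, the computation is elementary:

```python
import math

def e_value(rr: float) -> float:
    # For an observed risk ratio, the E-value is the minimum strength of
    # association (on the risk-ratio scale) an unmeasured confounder would
    # need with BOTH treatment and outcome to fully explain the estimate away.
    rr = max(rr, 1.0 / rr)  # orient away from the null for protective effects
    return rr + math.sqrt(rr * (rr - 1.0))
```

For example, e_value(1.5) ≈ 2.37, meaning an unmeasured confounder would need risk-ratio associations of roughly 2.4 with both treatment and outcome to fully explain away an observed risk ratio of 1.5.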

Conclusions

Since their adoption into pharmacoepidemiology, propensity score methods have expanded in ways that are important for studies of comparative effectiveness and drug safety. Extensions include ways to simultaneously compare multiple drugs, both after a fixed decision point and over time. Novel matching algorithms have emerged that provide greater control over balance and sample size. Weighting estimators have evolved to target matched and overlap estimands and in doing so avoid potentially harmful extrapolation. Sensitivity and bias analysis techniques have emerged that can leverage validation data or examine robustness to unobserved confounding. And although not covered in detail in this review, extensions for time-varying exposures have also progressed, including our ability to approach such questions in a design-based paradigm. The field is still wrestling with issues of covariate and model selection, as well as variance estimation, and machine-learning algorithms are being re-tuned away from minimizing prediction error towards improving the quality of the treatment effect estimate [110, 111]. Many of these new and exciting developments will need to be explored through further simulation and empirical examples, but together they represent a bright path ahead in overcoming many of the design and analytic challenges in the study of therapeutic effects and harms.