1 Introduction

Randomized controlled trials (RCTs) are regarded as the best research design for causal inference. However, RCTs are not always feasible in behavioral and social science research due to practical or ethical barriers. Consequently, non-RCTs (or observational studies) are often used as an alternative. Unfortunately, the validity of such studies is often called into question because of potential selection bias in observational data (Rosenbaum 2010; Shadish et al. 2002). To increase the validity of observational studies, multiple strategies have been developed to deal with selection bias.

Among these strategies, propensity score methods are the most popular. This popularity stems from the methods’ properties that mimic those of RCTs (Bai 2011b; Pan and Bai 2016b; Rubin 2008). While many researchers tout the advantages of propensity score methods, some methodologists and statisticians have raised concerns about their rationale and applicability (Pearl 2010; King and Nielsen 2016). In this review, we address these concerns by reviewing the development history and the assumptions of propensity score methods, followed by the fundamental techniques of and software packages for propensity score methods. We also discuss the issues in and debates about the use of these methods.

2 Development history of propensity score methods

In their seminal work on propensity score methods, Rosenbaum and Rubin (1983b) introduced the fundamental concept of a propensity score and its basic applications in observational studies. They defined a propensity score as the conditional probability of assignment to a particular treatment given observed covariates. They adopted the causal framework for treatment effect estimation from previous eminent literature (Fisher 1951; Hamilton 1979; Kempthorne 1952; Rubin 1974, 1978). Based on both large- and small-sample theory, Rosenbaum and Rubin (1983b) proved that adjustment using propensity scores calculated from all observed covariates is sufficient to remove the selection bias attributable to those covariates.

The key concept behind propensity score methods is that they can be used to balance the distributions of covariates between the treatment and control groups. This logic is based on Rubin’s (1978) causal effect theory and employs the “ignorability assumption”: the assignment of a subject to the control or treatment group is effectively random when conditioned on observable characteristics, and missing data can likewise be treated as occurring at random. It is a fundamental assumption of propensity score methods.

In Rosenbaum and Rubin’s (1983b) study, they expounded on Cochran and Rubin’s (1973) previous work on the use of a single normally distributed covariate. They tested the use of propensity scores calculated from multiple covariates to adjust the unbalanced distributions of covariates between the treatment and control groups, and made three proposals for standard applications of propensity scores: (a) propensity score matching, (b) subclassification based on propensity scores, and (c) multivariate covariate adjustment.

Propensity score matching creates a set of subjects in the control group matched to those in the treatment group with similar propensity scores. The goal is to mimic random selection and eliminate bias in observational data. They articulated four advantages of propensity score matching over model-based adjustment on random samples: (a) propensity score matching allows researchers to easily analyze matched pairs and adjust for confounding variables; (b) the variance of the estimate of the average treatment effect is smaller in the matched sample than in random samples because the distributions of the covariates in the matched sample are similar; (c) model-based adjustment on matched samples is more robust than adjustment on random samples; and (d) small sample sizes do not allow control of multiple covariates with model-based methods, but propensity score matching does.

In a later study, Rosenbaum and Rubin (1985) developed three different propensity score matching techniques, with an emphasis on multivariate matching: (a) nearest available matching on the estimated propensity scores, (b) Mahalanobis metric matching including propensity scores, and (c) nearest available Mahalanobis metric matching within a caliper defined by propensity scores.

Nearest available matching (nearest neighbor matching or greedy matching) was initially defined as matching the treated and control subjects on their closest propensity scores without replacement, with the treated and control subjects randomly ordered. Later, it was found that propensity score matching would create different matched pairs if the subjects were not chosen randomly, but in order, such as from the smallest propensity score to the largest or from the largest to the smallest. Furthermore, matching with and without replacement also created significantly different matched pairs.

Mahalanobis metric matching with propensity scores is a procedure that includes the estimated propensity scores in the calculation of the Mahalanobis distance and then matches the sample without replacement on that distance, with the treated and control subjects randomly ordered. A variation of this procedure is nearest available Mahalanobis metric matching within a caliper, which creates matched pairs of treated and control subjects using the Mahalanobis distance within a caliper band defined by propensity scores so as to control the difference between matched pairs.

While Rosenbaum and Rubin (1985) clearly described the procedures of propensity score matching in their study and favored caliper matching, they did not examine significant issues with propensity score matching, such as ranking the order of subjects on propensity scores, the size of calipers, and matching with or without replacement. Further, their discussion of propensity score matching was limited to only one sample.

Meanwhile, Rosenbaum and Rubin (1984) developed another propensity score method to reduce selection bias in observational studies using subclassification on propensity scores. Their simulation study showed that subclasses formed using propensity scores can balance all the covariates, and that five subclasses could remove up to 90% of the bias for each of the covariates.

In 1989, Rosenbaum introduced the optimal matching algorithm based on network flow theory, a departure from the previous greedy matching procedures, such as nearest neighbor matching and caliper matching. Optimal matching generally identifies matched pairs that minimize the total distance in propensity scores between the treatment and control groups. In optimal matching, one treated subject can be matched with multiple control subjects; therefore, it has the advantage of keeping all the control subjects. Rosenbaum (1989) claimed that optimal matching is often better than greedy matching, but this should not be taken as an absolute conclusion because the performance of different matching techniques can be data specific.

Following Rosenbaum and Rubin’s pioneering work in the 1970s and 1980s, other propensity score techniques were developed. Recent developments have focused mainly on improving the accuracy of propensity score estimation and the quality of matching. Improving propensity score estimates involves: (a) the classification and regression tree procedure to examine each predictor variable for creating two distinctly different samples (Lemon et al. 2003), (b) boosted regression using regression trees to derive propensity scores (McCaffrey et al. 2004), and (c) bootstrapping propensity scores to account for sampling errors (Bai 2013).

Improvements to matching quality include: (a) kernel matching, which utilizes weighted regression in matching procedures (Heckman et al. 1997), (b) full matching, which incorporates optimal matching with replacement of subjects from both the treatment and control groups (Hansen 2004), (c) genetic matching, a nonparametric procedure using a search algorithm to determine the weight of each covariate so as to maximize the balance of observed potential confounders across the matched treated and control subjects (Diamond and Sekhon 2013), and (d) interval matching, which matches subjects based on confidence intervals rather than point estimates of propensity scores to accommodate estimation errors in propensity scores (Pan and Bai 2015b).

In the meantime, Hirano and Imbens (2001) also proposed model-based direct adjustment using propensity score weighting, which is defined as the inverse probability of treatment weighting (IPTW) using propensity scores. IPTW has grown in popularity because of its capacity to deal with complex data. Austin and Stuart (2015) described IPTW in detail and pointed out that to use the adjustment correctly, researchers must examine whether weighting balances covariates between the treatment and control groups.

3 Assumptions of propensity score methods

Like all other statistical methods, propensity score methods require certain assumptions. These assumptions are the ignorable treatment assignment assumption, the stable unit treatment value assumption, and sufficient common support.

3.1 The ignorable treatment assignment assumption

Suppose each subject (or unit) i (i = 1, …, N) has a treatment condition Zi (1 = treatment or 0 = control), an outcome Y1i (potential outcome for treated unit) or Y0i (potential outcome for control unit), and a covariate value vector Xi = (Xi1,…, XiK)T, where N is the number of subjects and K is the number of covariates. The ignorable treatment assignment assumption requires that assignment to the treatment or control group is independent of the potential outcomes after accounting for observed covariates: (Y1i, Y0i) ⊥ Zi | Xi. Under this assumption, if the distributions of propensity scores are balanced between the treatment and control groups, the distributions of the covariates used for obtaining propensity scores are also balanced. That is, (Y1i, Y0i) ⊥ Zi | Xi ⇒ (Y1i, Y0i) ⊥ Zi | p(Xi), where p(Xi) = Pr(Zi = 1 | Xi) is defined as a propensity score. Therefore, one can assume that selection bias can be removed or substantially reduced after propensity score adjustments if no confounding variables are left unmeasured. This is the foundation of propensity score theory. Accordingly, all influential covariates associated with treatment assignment and outcomes should be included in the propensity score estimation model. In reality, hidden bias often exists because we are unable to include unobserved covariates in the propensity score estimation model; treatment effect estimation will therefore be affected.

One way to address this unobserved confounding issue is to conduct sensitivity analysis. Sensitivity analysis assesses the impact of an unobserved confounding variable under certain assumptions made by the researcher. This technique helps researchers better understand the limitations of propensity score methods due to unobserved confounding, and more and more researchers have been conducting sensitivity analyses (Groenwold et al. 2010; Schneeweiss 2006; Lin et al. 1998; Robins et al. 2000b; Rosenbaum and Rubin 1983a). There are several variants of sensitivity analysis with specific techniques, such as marginal structural models (Robins et al. 2000a), linear programming (MacLehose et al. 2005), Bayesian sensitivity analysis (Greenland 2005), external adjustment (Huesch 2013), a propensity score-based approach (Li et al. 2011), and the robustness index (Pan and Bai 2016a). Among these techniques, Rosenbaum and Rubin’s (1983a) approach is the most frequently used, with a ready-to-implement statistical package in R (Keele 2015).

3.2 The stable unit treatment value assumption

The stable unit treatment value assumption (SUTVA) requires that the treatment effect for each subject be independent of other subjects’ responses; thus, the treatment for each subject is stable or the same (Rosenbaum and Rubin 1983b). To apply propensity score methods, such as propensity score matching, SUTVA assumes that: (a) the potential outcome of the selected subjects in the treatment group should not be affected by the treatment status of other subjects in the study groups and (b) each subject receives the same amount of treatment as the others in the treatment group who were selected through propensity score matching.

In practice, SUTVA can easily be violated if subjects in the treatment group interact with subjects in the control group. For example, a subject in the treatment group of a weight loss exercise program may share his or her experience of the treatment with friends who happen to be in the control group. Such information sharing between the treatment and control groups can make the potential outcomes dependent on each other; alternatively, if the treatment is not given consistently to each subject in the treatment condition, the treatment effect becomes unstable. Therefore, an appropriate research design followed by a rigorous procedure for complying with the research protocol is needed to reduce the likelihood of violating SUTVA.

3.3 Sufficient common support

The third assumption concerns common support (or overlap) between the distributions of propensity scores for the treatment and control groups: the propensity scores of the two groups should overlap. This allows researchers to make a reasonable (or unbiased) comparison between the two groups. Common support can be improved by having more variability in the propensity scores of the control group than in those of the treatment group (Pan and Bai 2015a). In practice, this can be achieved by having proportionally more participants in the control group who can be matched to those in the treatment group.

There are several ways of checking common support. First, we can visually inspect the propensity score distributions with histograms or density graphs. Second, we can use hypothesis testing, such as the Kolmogorov–Smirnov test, to determine whether the propensity score distributions are significantly different from each other. Third, we can compute the standardized difference score \(d = \frac{M_t - M_c}{s_p}\) to compare the means of the propensity scores for the treatment (\(M_t\)) and control (\(M_c\)) groups, where \(s_p\) is the pooled standard deviation of the propensity scores. A small standardized difference score (e.g., d < 0.5) indicates sufficient common support. Last, we can trim the minimum and maximum values of propensity scores in each group by removing the subjects whose propensity scores are smaller than the minimum or larger than the maximum in the opposite group (Caliendo and Kopeinig 2008; Pan and Bai 2015a; Smith and Todd 2005).
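To make these checks concrete, the following is a minimal Python sketch, assuming NumPy and two arrays of estimated propensity scores (ps_treat and ps_control, names we introduce for illustration), of the standardized difference score and the trimming rule; it is an illustrative sketch, not a definitive implementation.

```python
import numpy as np

def standardized_difference(ps_treat, ps_control):
    """Standardized difference d = (M_t - M_c) / s_p of the propensity scores."""
    # Pooled standard deviation taken as the root of the average group variance.
    s_p = np.sqrt((ps_treat.var(ddof=1) + ps_control.var(ddof=1)) / 2)
    return (ps_treat.mean() - ps_control.mean()) / s_p

def trim_to_common_support(ps_treat, ps_control):
    """Drop subjects whose scores fall outside the opposite group's range."""
    low = max(ps_treat.min(), ps_control.min())
    high = min(ps_treat.max(), ps_control.max())
    keep_t = (ps_treat >= low) & (ps_treat <= high)
    keep_c = (ps_control >= low) & (ps_control <= high)
    return ps_treat[keep_t], ps_control[keep_c]
```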

4 Fundamental techniques of propensity score methods

The fundamental techniques of propensity score methods can be generally classified into five categories: propensity score matching (e.g., nearest neighbor matching, caliper matching, Mahalanobis metric matching with propensity scores) (Rosenbaum and Rubin 1985), subclassification on propensity scores (Rosenbaum and Rubin 1984), propensity score weighting (Hirano and Imbens 2001), covariate adjustment with propensity scores, and doubly robust estimation. To better understand these fundamental techniques, we begin with the underlying theoretical framework of causal inference.

4.1 Causal inference and propensity score methods

In the counterfactual framework for causal inference, the quantity of interest is the treatment effect for each subject i, defined as Δi = Y1i − Y0i (Holland 1986; Rubin 1974; Winship and Morgan 1999), where Y1i is the potential outcome for an individual subject exposed to the treatment and Y0i is the potential outcome for the same individual without receiving any treatment at the same time. Unfortunately, for each subject i, only Y1i or Y0i is observable, but not both, because the same subject cannot be simultaneously assigned to both the treatment and control conditions. Alternatively, one can estimate the average treatment effect (ATE) for the population, defined as ATE = E(Y1 − Y0) = E(Y1) − E(Y0), where E(Y1) is the expected value of outcome Y1 for all the subjects in the treatment group and E(Y0) is the expected value of Y0 for all the subjects in the control group. In RCTs, ATE is an unbiased estimate of the treatment effect Δi because, due to randomization, the treatment group does not, on average, differ systematically from the control group on observed and unobserved background characteristics. In observational studies, treatment effect estimation can be biased because, without randomization, the treatment and control groups may not be comparable due to potential group selection bias.

Selection bias can be overt, hidden, or both (Rosenbaum 2010). Fortunately, the propensity score methods set forth by Rosenbaum and Rubin (1983b) can reduce overt bias in observational studies by balancing the distributions of observed characteristics (or covariates) between the treatment and control groups. Therefore, propensity score methods allow one to obtain an unbiased estimate of ATE from observational studies under the ignorable treatment assignment assumption.

ATE is not always the quantity of interest (Heckman et al. 1997; Rubin 1977). For instance, one may be interested in the treatment effect of a smoking cessation program for the smokers who volunteered to participate in the program, not necessarily for all people in the population. In this case, one wants to estimate the average treatment effect for the treated (ATT), defined as ATT = E(Y1 − Y0 | Z = 1) = E(Y1 | Z = 1) − E(Y0 | Z = 1). This still encounters the counterfactual problem: one can observe the outcome Y1 for people who receive the treatment, but not the outcome Y0 for those same participants had they not been treated. To deal with this problem, one can analyze data matched on propensity scores. The matched subjects in the control group have probabilities of Z = 1 similar to those of the corresponding subjects in the treatment group; therefore, propensity score methods allow one to estimate ATT.

4.2 Covariates selection and propensity score estimation

Before implementing the techniques of propensity score methods, we first need to estimate propensity scores from given covariates. It is essential to include all influential covariates in the propensity score estimation model so that the propensity scores used for balancing the distributions of the covariates will be accurate. While we can use statistics, such as correlations, to guide covariate selection, covariates should be selected based on theory or the existing literature about the relationships of the covariates to the outcome variables and treatment assignment conditions (Rubin 2001).

In general, there are three types of relationships of covariates to the treatment conditions and the outcome variables. First, the best covariates to be selected are those related to both the treatment conditions and outcome variables, since the relationships indicate that the covariates may both alter the treatment and influence the treatment effect. Second, if a covariate is associated with the outcome variable, but not with the treatment, it may still change the outcome estimation; therefore, it needs to be included in the propensity score model (Rubin and Thomas 1996; Brookhart et al. 2006). Third, if a variable is related to the treatment, but not outcome, the decision to include or exclude that variable in the propensity score model depends on the direction of the relationship between the potential covariate and the treatment condition (Brookhart et al. 2006). If the covariate has an impact on the treatment, it should be used for the propensity score estimation, as it could alter the treatment. However, if the covariate is associated with the treatment conditions, but does not have an impact on the treatment (nor is it related to the outcome variable), it should not be included in the model because this type of variable will not affect treatment effect.

After establishing a set of influential covariates, the next step is to estimate propensity scores. Rosenbaum and Rubin (1983b) first recommended using logistic regression or discriminant analysis to estimate propensity scores. It is worth noting that in logistic regression or discriminant analysis, we only need to model an assignment probability given covariates without assuming any functional form of the distributions of the probability. That is, propensity score estimation is semiparametric and thus robust. Nevertheless, more advanced techniques, such as classification and regression trees, neural networks, and bootstrap techniques, were later adopted for obtaining more accurate propensity score estimation (Westreich et al. 2010). Propensity scores can be obtained by running these models using statistical software, such as R, SAS, or STATA. There are many statistical packages for implementation of propensity score methods. It is common for these statistical packages to include propensity score estimation in the propensity score adjustment procedures. In the next section, we will introduce commonly used propensity score methods.
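As a minimal illustration of the first recommended approach, the following Python sketch (assuming scikit-learn and NumPy; the function name is ours) fits a logistic regression of the treatment indicator on the covariates and returns the predicted probabilities as propensity scores. Note that scikit-learn applies an L2 penalty by default, so we set a large C to approximate classic unpenalized logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_propensity_scores(X, z):
    """Estimate p(X) = Pr(Z = 1 | X) by logistic regression.

    X: (N, K) covariate matrix; z: (N,) binary treatment indicator.
    """
    # Large C weakens scikit-learn's default L2 regularization so the fit
    # approaches ordinary maximum likelihood logistic regression.
    model = LogisticRegression(C=1e6, max_iter=1000)
    model.fit(X, z)
    return model.predict_proba(X)[:, 1]  # probability of treatment assignment
```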

4.3 Propensity score matching

Before the inception of propensity score matching, exact matching was the traditional matching technique used in quasi-experimental designs for matching on a few categorical variables. Another traditional matching technique, Mahalanobis matching, was used for matching on multivariate continuous variables (Rubin 1980). Then, two sets of propensity score matching algorithms, greedy matching and complex matching, followed. Bai (2013: Fig. 1) defined a propensity score matching typology that depicts the developmental relationships among the propensity score matching techniques.

Nearest neighbor matching (Rosenbaum and Rubin 1985) is the foundation of all propensity score matching techniques. It is based on the greedy matching algorithm that matches each subject i in the treatment group with a subject j in the control group by the smallest absolute distance between their propensity scores: di = minj|p(Xi) − p(Xj)|. To conduct nearest neighbor matching without replacement, we need to decide how to rank the subjects, namely randomly, largest to smallest, or smallest to largest, based on their propensity scores. For matching with replacement, the ranking order is not needed because the same control subject can be used multiple times. An alternative to nearest neighbor matching is caliper matching (Cochran and Rubin 1973), which matches each subject i in the treatment group with a subject j in the control group within a prespecified caliper band b: di = minj{|p(Xi) − p(Xj)| < b}, to reduce the risk of bad matches when the distance between the propensity scores of the matched pair is too great. Based on Cochran and Rubin’s (1973) study, Rosenbaum and Rubin (1985) recommended that the prespecified caliper band be less than or equal to a quarter of the standard deviation of the propensity scores; such a caliper will remove up to 90% of selection bias. In practice, the caliper bandwidth can also be defined by the researcher to make better matched pairs; however, it is usually challenging to select a reasonable data-specific caliper or to detect the tolerance level on the maximum propensity score distance. A variant of caliper matching is radius matching (Dehejia and Wahba 2002), a one-to-many matching in which each subject i in the treatment group is matched with multiple subjects in the control group within a prespecified caliper band: dij = {|p(Xi) − p(Xj)| < b}. Recently, Pan and Bai (2015b) extended caliper matching to interval matching, which matches subjects based on confidence intervals rather than point estimates of propensity scores to accommodate estimation errors. In other words, if the confidence intervals (CI) of propensity scores overlap, CI(p(Xi)) ∩ CI(p(Xj)) ≠ ∅, the two subjects are taken as a matched pair.
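To make the greedy algorithm concrete, here is a minimal Python sketch of 1:1 nearest neighbor matching without replacement, with an optional caliper (NumPy-style arrays assumed; all names are illustrative, and treated subjects are processed in the order given, which, as noted above, affects the result):

```python
import numpy as np

def greedy_nearest_neighbor(ps_treat, ps_control, caliper=None):
    """1:1 greedy nearest neighbor matching on propensity scores, without
    replacement; returns a list of (treated_index, control_index) pairs."""
    available = set(range(len(ps_control)))
    pairs = []
    # Treated subjects are processed in the order given; ordering them
    # randomly, ascending, or descending changes the matched set.
    for i, p in enumerate(ps_treat):
        if not available:
            break
        j = min(available, key=lambda k: abs(p - ps_control[k]))
        # With a caliper, discard the match if the distance exceeds b.
        if caliper is None or abs(p - ps_control[j]) < caliper:
            pairs.append((i, j))
            available.remove(j)  # without replacement: control used once
    return pairs
```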

Another type of greedy matching is Mahalanobis metric matching with propensity scores (Rosenbaum and Rubin 1985), which matches each subject i in the treatment group with a subject j in the control group according to the closest Mahalanobis distance: di = minj{Dij}, where Dij = (Vi − Vj)S−1(Vi − Vj)T, V is a combined data matrix {X, p(X)}, and S is the sample variance–covariance matrix of V for the control group. In practice, this matching method does not perform as well as other propensity score matching techniques such as nearest neighbor matching and caliper matching (Bai 2011a). Mahalanobis caliper matching (Guo et al. 2006) and genetic matching (Diamond and Sekhon 2013) are two variants of Mahalanobis metric matching with propensity scores. Mahalanobis caliper matching uses di = minj{Dij < b}, where Dij = (Xi − Xj)S−1(Xi − Xj)T; and genetic matching uses the weighted Mahalanobis distance Dij = (Xi − Xj)(S−1/2)TWS−1/2(Xi − Xj)T or Dij = (Vi − Vj)(S−1/2)TWS−1/2(Vi − Vj)T, where W is a weighting matrix and S1/2 is the Cholesky decomposition of S.
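Under the definitions above, the distance computation itself is short; the following is a minimal NumPy sketch (V_treat and V_control denote the combined {X, p(X)} data for the two groups; the function name is ours):

```python
import numpy as np

def mahalanobis_distances(V_treat, V_control):
    """Pairwise D_ij = (v_i - v_j) S^{-1} (v_i - v_j)^T, with S the sample
    variance-covariance matrix of V estimated from the control group."""
    S_inv = np.linalg.inv(np.cov(V_control, rowvar=False))
    diffs = V_treat[:, None, :] - V_control[None, :, :]  # shape (n_t, n_c, K+1)
    return np.einsum('ijk,kl,ijl->ij', diffs, S_inv, diffs)
```

Greedy matching then proceeds as before, pairing each treated subject with the nearest available control subject by Dij instead of the propensity score difference.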

It is worth noting that after propensity score matching, treatment effect estimation on the matched data gives us ATT rather than ATE. To estimate ATE, one needs to use the entire original data with propensity score weighting, covariate adjustment with propensity scores, or the propensity score matching-related methods described below.

There are propensity score matching-related or complex matching methods, some of which do not strictly match individual subjects. For example, subclassification (or stratification) (Rosenbaum and Rubin 1984) classifies all the subjects in a sample into several strata based on the corresponding number of percentiles of propensity scores and then matches stratum by stratum instead of individual subjects. Cochran and Rubin (1973) observed that five strata would remove up to 90% of selection bias. Optimal matching (Rosenbaum 1989) is another type of complex matching with an algorithm that is strikingly different from that of greedy matching. In greedy matching, after a match is made, the matched pairs will not be reconsidered. Each pair of matched subjects is considered the best matched pair currently available, whereas in optimal matching, previously matched pairs can be reconsidered to achieve the overall minimal or optimal distance. Optimal matching is particularly useful when there are not many appropriate control subjects to be matched with the treated subjects. An extension of optimal matching is full matching (Hansen 2004), which is also considered a special case of subclassification. Full matching produces subclasses in an optimal way. A fully matched sample consists of matched subsets, in which each matched set can contain one treated subject and one or more control subjects, or one control subject and one or more treated subjects. Full matching is optimal because it minimizes a weighted average of the estimated distance measure between each treated subject and each control subject within each subclass. The last type of complex matching is kernel matching (or local linear matching) (Heckman et al. 1997). It combines matching and outcome analysis into one procedure with one-to-all matching by using nonparametric matching estimators to obtain weighted averages of all subjects in the control group for constructing the counterfactual outcome. A variant of kernel matching is difference-in-differences matching, which calculates the differences between the outcome of the treated subjects and the weighted average differences in outcome for the control subjects (Heckman et al. 1997).
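As one concrete example from this family, the sketch below (our own minimal illustration, assuming NumPy) estimates ATE by subclassification on propensity score quintiles, averaging the within-stratum mean differences weighted by each stratum's share of the sample:

```python
import numpy as np

def subclassification_ate(y, z, ps, n_strata=5):
    """ATE by stratifying on propensity score quantiles (default: quintiles)."""
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    # Assign each subject to a stratum; clip so boundary values stay in range.
    strata = np.clip(np.searchsorted(edges, ps, side='right') - 1, 0, n_strata - 1)
    ate, n = 0.0, len(y)
    for s in range(n_strata):
        mask = strata == s
        if z[mask].sum() == 0 or (1 - z[mask]).sum() == 0:
            continue  # skip strata lacking treated or control subjects
        diff = y[mask][z[mask] == 1].mean() - y[mask][z[mask] == 0].mean()
        ate += diff * mask.sum() / n  # weight by stratum size
    return ate
```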

4.4 Propensity score weighting

Propensity score weighting, such as IPTW (Hirano and Imbens 2001), is another useful propensity score method, one that incorporates propensity scores directly into treatment effect estimation. The IPTW estimator weights the observations of the dependent variable by the inverse of the propensity scores to balance the treatment and control groups. This weighting approach has become increasingly popular because of its capacity to deal with multiple data formats, such as nested data, longitudinal data, and multi-treatment data; it has therefore been widely discussed and applied in the literature (Harder et al. 2010; McCaffrey et al. 2004; Schafer and Kang 2008; Stone and Tang 2013).

IPTW can be used to estimate both ATE and ATT. For ATE, weights are applied to both the treatment and control groups, whereas for ATT only the control group is weighted. To estimate ATE, one first weights the observations for the treatment group by the inverse of the propensity score: \(Y_{wt_i} = \frac{1}{{p({\mathbf{X}}_{i})}}Y_{t_i}\), and then weights the observations for the control group by the inverse of 1 minus the propensity score: \(Y_{wc_i} = \frac{1}{{1 - p({\mathbf{X}}_{i})}}Y_{c_i}\), where \(Y_{wt_i}\) and \(Y_{wc_i}\) are the weighted observations of the dependent variable for each subject in the treatment group and each subject in the control group, respectively; \(Y_{t_i}\) and \(Y_{c_i}\) are the original observations for those subjects in the treatment and control groups, respectively. The weighted observations are then summed and divided by the total sample size (N = nt + nc, where nt and nc are the sample sizes of the treatment and control groups, respectively). Lastly, ATE is the difference between the two averages: \({\text{ATE}} = \frac{1}{N}\left( {\mathop \sum \nolimits_{i = 1}^{{n_t}} Y_{wt_i} - \mathop \sum \nolimits_{i = 1}^{{n_c}} Y_{wc_i}} \right)\). As opposed to ATE, ATT does not weight the observations in the treatment group; it weights only the control group, by the ratio of the propensity score to 1 minus the propensity score: \(Y_{wc_i} = \frac{{p({\mathbf{X}}_{i} )}}{{1 - p({\mathbf{X}}_{i} )}}Y_{c_i}\); thus, ATT is computed as follows: \({\text{ATT}} = \frac{1}{n_t}\mathop \sum \nolimits_{i = 1}^{n_t } Y_{t_i} - \frac{1}{n_c}\mathop \sum \nolimits_{i = 1}^{n_c} Y_{wc_i}\).
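These formulas translate directly into Python; in the minimal sketch below (NumPy assumed), y, z, and ps denote the outcome vector, binary treatment indicator, and estimated propensity scores:

```python
import numpy as np

def iptw_ate(y, z, ps):
    """ATE: weight treated outcomes by 1/p and control outcomes by 1/(1 - p),
    then take the difference of the summed weighted outcomes over N."""
    weighted_treated = np.sum(z * y / ps)
    weighted_control = np.sum((1 - z) * y / (1 - ps))
    return (weighted_treated - weighted_control) / len(y)

def iptw_att(y, z, ps):
    """ATT: treated outcomes unweighted; controls weighted by p/(1 - p)."""
    n_t, n_c = z.sum(), (1 - z).sum()
    weighted_control = np.sum((1 - z) * (ps / (1 - ps)) * y)
    return np.sum(z * y) / n_t - weighted_control / n_c
```

Note that propensity scores very close to 0 or 1 produce extreme weights and unstable estimates, which is one reason the weighted covariate balance checks recommended by Austin and Stuart (2015) matter in practice.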

4.5 Covariate adjustment with propensity scores

To control for confounding in observational studies, analysis of covariance is commonly used to partial out confounding effects from the treatment effect by including influential covariates in the statistical model. These covariates are assumed to affect outcomes and, therefore, to confound treatment effect estimation. Although traditional covariate analysis can control for confounding factors to some extent, it only decomposes the variance in the outcome into variance explained by the covariates, variance explained by the treatment conditions, and residual variance. Moreover, it is difficult to determine whether the covariate analysis model correctly specifies the relationships of treatment selection and baseline covariates to the outcome (Austin 2011). Covariate analysis thus does not model selection bias directly, since it does not analyze the confounding that results from the unbalanced covariate distributions of the treatment and control groups. The simplest way to use propensity scores instead is to include them in the regression model: the propensity score, as a composite of all the covariates, adjusts the treatment effect estimate for each subject’s probability of assignment to the treatment condition.

Covariate adjustment with propensity scores is usually applied to the entire original data to obtain ATE, which is the estimated regression coefficient β1 from the following multiple regression model with propensity score adjustment on the entire original data: Yi = β0 + β1Zi + β2p(Xi) + β3Zip(Xi) + εi, where Zi is the treatment condition, p(Xi) is the propensity score, and Zip(Xi) is the interaction between the treatment condition and the propensity score.
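A minimal sketch of fitting this model by ordinary least squares follows (NumPy assumed; the function name is ours, and inference on β1 would in practice require standard errors, which we omit):

```python
import numpy as np

def ps_adjusted_regression(y, z, ps):
    """Fit Y = b0 + b1*Z + b2*p(X) + b3*Z*p(X) by ordinary least squares
    and return the coefficient vector (b0, b1, b2, b3)."""
    design = np.column_stack([np.ones(len(y)), z, ps, z * ps])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta  # beta[1] is the treatment coefficient of interest
```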

4.6 Doubly robust methods

In practice, these propensity score methods may not sufficiently reduce selection bias when the propensity score estimation model is misspecified (Schafer and Kang 2008). Model misspecification may occur if any influential covariates are not included in the propensity score estimation model or if any covariate forms are misspecified, such as interaction effects, higher-order terms, or nonlinear trends. Model misspecification can happen not only in the propensity score estimation model but also in the outcome regression model. In this situation, using doubly robust methods will increase the accuracy of outcome estimation after propensity score adjustments. Doubly robust estimation incorporates both the outcome regression model and the propensity score model into treatment effect estimation and is robust to misspecification of one of the two models (either the regression model or the propensity score model) (Bang and Robins 2005; Li et al. 2017). Doubly robust procedures have been found to reduce more bias than a single propensity score procedure alone (Shadish et al. 2008). Therefore, doubly robust estimation is increasingly used when implementing propensity score methods.

Doubly robust procedures can be used with many types of propensity score adjustment methods. Schafer and Kang (2008) suggest a doubly robust procedure in which the individual covariates are still included in the treatment effect estimation model after the propensity score adjustments. Imai and Ratkovic (2014) proposed the covariate balancing propensity score, which exploits the dual characteristics of the propensity score as a covariate balancing score and as the conditional probability of treatment assignment, using a generalized method-of-moments or empirical likelihood framework. The method has been found to improve the performance of propensity score weighting and can be extended to non-binary treatment conditions, longitudinal data, and the generalization of experimental and instrumental variable estimates (Imai and Ratkovic 2014). Although doubly robust procedures are appealing, they can still produce biased estimates if both the regression model and the propensity score model are misspecified (Funk et al. 2001; Li et al. 2017).
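For illustration, here is a minimal sketch of one common doubly robust estimator, the augmented inverse probability weighting (AIPW) estimator, which combines group-specific outcome regressions with propensity score weighting (NumPy and scikit-learn assumed; we use linear outcome regressions purely for simplicity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aipw_ate(y, z, X, ps):
    """Doubly robust ATE: augment the group-specific outcome regressions
    with inverse-probability-weighted residuals."""
    m1 = LinearRegression().fit(X[z == 1], y[z == 1]).predict(X)
    m0 = LinearRegression().fit(X[z == 0], y[z == 0]).predict(X)
    # Each potential-outcome mean is consistent if either the outcome model
    # or the propensity score model is correctly specified, but not if both fail.
    mu1 = np.mean(z * (y - m1) / ps + m1)
    mu0 = np.mean((1 - z) * (y - m0) / (1 - ps) + m0)
    return mu1 - mu0
```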

4.7 Evaluation of covariate balance

It is important to evaluate covariate balance before and after propensity score matching. Prior to evaluating covariate balance, the common support of the propensity score distributions should be assessed. As discussed in Sect. 3.3, common support should be sufficient in propensity score matching to create good matched pairs, which ensures that the matched subjects in the control group are similar to those in the treatment group in their probability of assignment to the treatment group. Thus, propensity score matching with sufficient common support can approximate the random assignment of RCTs. Although the existing literature (e.g., Caliendo and Kopeinig 2008; Heckman et al. 1997; Rubin 2001) discusses the importance of sufficient common support, most studies do not include a standard for how much common support is sufficient for propensity score matching. Based on Bai (2015), we recommend an overlap of at least 75% of the propensity scores, which may provide a better sample pool for finding matched pairs.

Before implementing propensity score matching, it is essential to evaluate covariate balance to understand the extent of selection bias. If the treatment and control groups are well balanced on all the covariates, there is no selection bias and no need to conduct propensity score matching, subclassification, or weighting procedures. After propensity score matching, it is important to evaluate covariate balance again to see whether selection bias has been sufficiently reduced. If not, further model-based covariate adjustment should be considered.

There are three criteria for evaluating covariate balance to see whether selection bias exists in any of the covariates. First, selection bias in the kth covariate is defined as the mean difference in the covariate between the treatment and control groups: \(B_{k} = M_{t_k} - M_{c_k}\), where \(M_{t_k}\) and \(M_{c_k}\) are the means of the covariate for the treatment and control groups, respectively. Intuitively, an independent-samples t test for continuous covariates or a chi-square test for categorical covariates could readily be applied to test for selection bias. However, researchers should be cautious about using significance tests as the only means of evaluating covariate balance, because test results are affected by factors such as sample size and variance rather than covariate imbalance alone (Pan and Bai 2016b). Second, we can examine the standardized bias (SB) for a covariate: \({\text{SB}}_{k} = \frac{B_{k}}{s_{p_k}} = \frac{M_{t_k} - M_{c_k}}{s_{p_k}}\), where \(s_{p_k}\) is the pooled standard deviation of the covariate (Rosenbaum and Rubin 1985). If the absolute value of SBk is less than 0.05 (or 5%), the matching method is considered effective in balancing the covariate (Caliendo and Kopeinig 2008). The third criterion is the percent bias reduction (PBR) on the covariate: \({\text{PBR}}_k = \frac{B_{k,\,\text{before matching}} - B_{k,\,\text{after matching}}}{B_{k,\,\text{before matching}}}\), with a PBRk larger than 0.80 (or 80%) indicating effective bias reduction (Cochran and Rubin 1973).
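These criteria translate directly into code; the following is a minimal NumPy sketch for a single covariate (function names are ours):

```python
import numpy as np

def standardized_bias(x_treat, x_control):
    """SB_k = B_k / s_p_k; |SB_k| < 0.05 suggests the covariate is balanced."""
    # Pooled standard deviation as the root of the average group variance.
    s_p = np.sqrt((x_treat.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treat.mean() - x_control.mean()) / s_p

def percent_bias_reduction(bias_before, bias_after):
    """PBR_k = (B_before - B_after) / B_before; a value above 0.80
    indicates effective bias reduction."""
    return (bias_before - bias_after) / bias_before
```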

In addition, graphics such as Q–Q plots, histograms, and Love plots are a means of evaluating and visualizing covariate balance (Ahmed et al. 2006; Cochran and Rubin 1973; Pan and Bai 2015a; Pattanayak 2015; Rosenbaum and Rubin 1985).

5 Issues in and debates about propensity score methods

As researchers embrace the advantages of using propensity score methods, some significant issues should be noted. It is also beneficial for the researchers to understand the concerns and debates about the use of these methods.

5.1 Issues in propensity score methods

The most conspicuous issue in propensity score methods is hidden bias. The fundamental theory of propensity score methods assumes that all the confounding variables are observable so that propensity scores calculated from the observed covariates can accurately represent the distributions of all confounding variables. Unfortunately, hidden bias due to unobservable confounding variables often exists because we cannot observe all confounding variables. To mitigate this issue, the selection of covariates should first be guided by theory, and researchers should then include in the propensity score estimation model all potential covariates that can be observed (Pan and Bai 2016b). It is equally or more important to conduct sensitivity analysis to test the model’s sensitivity to hidden bias from unobserved confounding variables.

Other important issues in propensity score methods relate mainly to propensity score matching, such as matching with or without replacement and sample reduction after matching. Whether to match with or without replacement should be considered whenever conducting propensity score matching, because the two approaches are likely to produce significantly different matched data, especially when the sample size is small (Pan and Bai 2015a). The choice between the two matching approaches also affects treatment effect estimation after matching. For example, if matching with replacement is used, the analysis for estimating the treatment effect after matching should incorporate weighted scores so as to balance the subjects who appear multiple times in the matched data.

Propensity score matching usually removes unmatched subjects due to selection bias or covariate imbalance. Such sample reduction after matching may result in a matched sample unrepresentative of the target population. To combat this problem, large samples are preferable for implementing propensity score matching because large samples can produce more reliable results (Bai 2011b; Hirano et al. 2003; Månsson et al. 2007; Rubin 1997). Otherwise, different propensity score methods such as propensity score weighting should be considered.

5.2 Debates about propensity score methods

Just as many other statistical methods have been criticized, propensity score methods have also been questioned. There are two sides to the debates about propensity score methods (e.g., Pearl 2010; King and Nielsen 2016).

Pearl (2010) first argued that associational concepts can be defined in terms of the joint distribution of observed variables, such as the distribution of propensity scores, but causal concepts cannot be. On that view, the propensity score is not a causal concept, and only experimental control can verify causal assumptions. Pearl’s argument seems correct, but it does not mean that propensity score methods ignore this issue. In fact, propensity score methods rest on the strongly ignorable treatment assignment assumption, which addresses the causal problem of unobserved confounding variables. In practice, it is possible to estimate causality using observational data as long as all important confounding variables are well controlled. Furthermore, sensitivity analysis can test how sensitive the treatment effect estimate with propensity scores is to uncontrolled, but less influential, confounding variables, so as to safeguard the causal claims made using propensity score methods. As Rubin (2009) pointed out, Pearl’s argument might be irrelevant to propensity scores. Propensity score methods are intended to be outcome free, and the strongly ignorable treatment assignment assumption is designed to be conditional on all observed values; therefore, it is assumed to control the influence of all possible confounding variables.

King and Nielsen (2016) hold a different opinion with regard to propensity score matching and argue that (a) propensity score matching cannot approximate a completely randomized experiment, (b) it is not comparable to a fully blocked randomized experiment, and (c) it is problematic because some observations increase imbalance and model dependency. These concerns seem reasonable when researchers overlook the assumptions of propensity score matching, which happens often. Thus, researchers are strongly urged to review the literature (e.g., King and Nielsen 2016; Pan and Bai 2016b; Rubin 2009) when implementing propensity score matching in their research.

6 Available software packages for propensity score methods

There is a variety of software packages available for implementing propensity score methods, including SAS, R, STATA, and SPSS. Some packages also have functions that combine propensity score procedures with treatment effect estimation. All the packages have advantages and disadvantages for specific propensity score methods. For example, MatchIt in R (Ho et al. 2011) offers most types of propensity score matching techniques, including nearest neighbor matching, caliper matching, optimal matching, full matching, and genetic matching, as well as subclassification. MatchIt also allows researchers to conduct matching with or without replacement and 1-to-1 or 1-to-many matching, and the package is easy to use. Some STATA modules for propensity score methods (e.g., Leuven and Sianesi 2012) are also straightforward to use and include treatment effect estimation functions, an advantage over other packages. SAS (SAS Institute Inc. 2017a, b) recently developed two procedures for treatment effect estimation (PROC CAUSALTRT) and propensity score matching (PROC PSMATCH). The two procedures are available in SAS/STAT 14.2 or newer, or in the free SAS University Edition. SPSS modules for propensity score matching are available in its pull-down menu. In SPSS, one can also use the Python-based FUZZY extension, or install add-on packages that implement MatchIt (Thoemmes 2012). Schuler (2015) provided a comprehensive survey of the software packages for propensity score methods, along with useful code and examples.

7 Conclusion

Propensity score methods are popular and effective statistical techniques for reducing selection bias in observational data, and they increase the validity of causal inference based on observational studies. Some researchers have raised concerns about the rationale and applicability of propensity score methods. We addressed these concerns by reviewing the development history and the assumptions of propensity score methods, followed by the fundamental techniques of and software packages for propensity score methods. Our aim is to provide information about propensity score methods from a historical point of view, to emphasize the importance of checking assumptions, and to help researchers select the best methods for their observational studies.