1 Introduction

We develop a comprehensive methodological framework that provides consistent estimates of treatment effects under multiple endogeneity and response heterogeneity in a non-randomized observational study design. Our application is one of the first to exploit the panel data structure to mitigate the bias from time-varying self-selection and heterogeneity of treatment effects using control function (CF) techniques. Before developing the methodological framework, we explain the sources of selection bias and treatment effect heterogeneity using the case of comparative effectiveness of biologic disease-modifying anti-rheumatic drugs (DMARDs) used in rheumatoid arthritis (RA). We briefly outline the limitations of the extant literature and present the econometric issues that need to be addressed for consistent estimation of treatment effects. We then develop the framework and estimation strategy after explaining the structural parameters of interest.

In observational studies, threats to the internal validity of causal inferences stem from non-random assignment to treatment and individual heterogeneity in the response to the treatments. Causal relationships cannot be estimated unless these sources of bias are controlled. Additionally, in healthcare, decision makers are generally interested in the simultaneous comparison of multiple treatments available as treatment choices. These issues complicate instrumental variables (IV) analyses, which traditionally have compared only two options/treatments at a time while ignoring response heterogeneity. While panel data approaches to IV estimation are more complicated, they offer enhanced control over selection bias and heterogeneity and, more importantly, can provide valid excluded instruments that satisfy the order and rank conditions for identification under multiple endogeneity (Kawatkar et al. 2012a). Using the case of biologic DMARDs used in RA, this study describes the detailed assumptions and practical application of advanced panel data econometric methodology to overcome time-varying endogeneity in treatment assignment, as well as heterogeneity in treatment effects.

Comparisons among the individual DMARDs using a non-experimental study design are intricate, since the choice of a specific DMARD depends on disease severity, physician and patient preference, treatment guidelines, financial considerations, contraindications, co-morbidities, concomitant medications, and other factors unique to each patient (Kvien 2004; Cush 2005; Michaud and Wolfe 2007; Saag et al. 2008). Most of these treatment selection variables are not observable in secondary data, and these omitted variables are correlated with the treatment choice. This correlation between the treatment choice and the structural error term biases the population-level treatment effect parameters, since these parameters incorrectly absorb the variance associated with the correlated, yet omitted, variables. Econometric methods to correct the endogeneity caused by this so-called “selection on unobservables” problem have relied on instrumental variables and control function approaches (Heckman and Robb 1985; Heckman and Navarro-Lozano 2004). Identification in these techniques is contingent on the existence of excluded instrumental variable(s), which are factors that are uncorrelated with the error, (strongly) correlated with the endogenous variable(s), and have no direct effect on the outcome of interest.

The majority of currently published studies on the comparative effectiveness of DMARDs ignore the issue of selection on unobservables and heterogeneity of the treatment effect (Bullano et al. 2006; Ollendorf et al. 2009; Wu et al. 2007; Michaud et al. 2003). When adoption/rejection of treatment is based on an individual’s idiosyncratic gains/losses, this heterogeneity in treatment effects needs to be controlled to avoid biased parameter estimates. In general, heterogeneity reflects patient diversity in risk of disease, responsiveness to treatment, and utility for associated outcomes (Kravitz et al. 2004). In this framework, implementation of econometric techniques, including instrumental variables and control function based approaches, requires a careful appreciation of the assumptions under which these estimators are consistent. Conditional on their assumptions, control function approaches are more efficient than instrumental variables techniques, especially in the presence of multiple treatment effects, heterogeneity, and self-selection (Heckman and Urzua 2010). In the case of control function approaches, estimation involves specification of a treatment selection model, which is generally a probit model for the binary endogenous case and a multinomial logit for polychotomous endogenous treatments. Given the multiple DMARD options available for the treatment of RA, we focus on issues involving polychotomous endogenous treatments.

A point that distinctly stands out in the treatment selection of DMARDs is that some of these second line drugs are viewed as closer substitutes for one another than other drugs, e.g. substitution amongst biologic DMARDs as compared to substitution between standard DMARDs and biologic DMARDs. It is reasonable to assume that when selecting a drug, physicians view drugs within a class to be closer substitutes for each other as compared to drugs from another classification category. Failure to account for this correlation amongst the alternatives may give rise to the independence of irrelevant alternatives (IIA) problem (McFadden 1981). Hence, a discrete choice modeling procedure that relaxes the IIA assumption of multinomial logit and accounts for correlation amongst the treatment choice of biologic DMARDs is more appropriate in the current study. In contrast to the multinomial logit, the multinomial probit allows the random components of the utility of the different treatment choice alternatives to be non-independent and non-identical and thus, avoids the IIA problem. Comparing the multinomial probit to the multinomial logit as a treatment choice model will be one of the aims of this study.

The reason for varying the discrete choice treatment selection model is that the selectivity bias correction terms may be sensitive to the specific probability models even though there may be only slight differences in the probability models themselves (Lee 1982). Lee’s (1982) approach provides a way to generate a large class of models with selectivity. By specifying different transformations, we can allow different implicit distributions on the error, and thus, any specific probability choice model need not dictate the method of correcting the selectivity bias term (Lee 1982). In contrast to Lee’s (1983) approach for selectivity bias correction in the polychotomous treatments, other published methods for correction of selection bias differ either in the assumptions imposed on the error covariance structure, or in the assumptions of linearity of the error terms (Dahl 2002; Dubin and McFadden 1984; Lee 1983; Hay 1980, 1984). In recent years, Trivedi and co-authors have addressed the problem of multinomial treatments and a continuous outcome using a range of approaches including finite mixture models, maximum simulated likelihood, Markov chain Monte Carlo methods and copulas (Deb et al. 2006a, b; Deb and Trivedi 2006; Zimmer and Trivedi 2006).

The primary objective of this study is to assess the incremental quarterly total expenditure of adalimumab, etanercept, and leflunomide treatments as compared to standard of care (i.e. methotrexate treatment) while controlling time varying endogeneity and allowing for heterogeneity of treatment effects. The secondary objectives are to assess the impact on estimated treatment effects when the restrictive IIA assumptions on treatment choice models are relaxed, thus allowing more flexibility, and to check the sensitivity to different selection bias correction techniques for the control functions in the presence of multiple endogenous treatments. The methodological framework presented has broad applicability to any disease treatment/intervention comparative effectiveness study involving non-experimental retrospective study design comparing multiple classifications of treatments.

2 Treatment effects framework

According to Heckman and Robb (1985), two different definitions are associated with the notion of a selection bias free estimate of the impact of treatment on the outcome, which in this study is total quarterly healthcare expenditure. The first notion defines the structural parameter of interest as the impact of treatment on quarterly expenditure if RA patients are randomly assigned to DMARD treatment, also known as the average treatment effect (ATE). In contrast, the average treatment effect in the treated (ATT) defines the structural parameter of interest as the difference between the post-treatment expenditure of those treated with biologic DMARDs and leflunomide and what the post-treatment expenditure of these same patients would have been in the absence of treatment. Similarly, we can define the average treatment effect in the untreated (ATU), which is the mean gain that nonparticipants would have if they received the treatment. ATE, ATT, and ATU coincide only when treatment has an equal impact on everyone (homogeneous treatment effect), or when assignment to treatment is random and attention centers on estimating the mean response to treatment. Heckman and Robb (1985) argue that ATT is most useful for forecasting future treatment effects when the treatment assignment rules used in the available samples also characterize future treatment, in which case ATT is sufficient to estimate the future treatment effect.

2.1 Conceptual framework for heterogeneous treatment effects

Suppose a health policy is proposed for formulary expansion of biologic DMARDs i.e., to shift patients currently on standard DMARDs to any of the biologic DMARDs. These drugs have been tried in some patients and we know their outcomes. The outcome Y in this study is quarterly total expenditure. We also know expenditure in patients where biologic DMARDs treatment was not adopted. What can we conclude about the likely effectiveness of this policy in patients who are currently not on biologic DMARDs?

To answer the question on policy effects, we build a model of counterfactuals.

In the current analysis, the DMARD treatment (d) can take on multiple values, i.e., it is polychotomous:

$${\text{D}} = ({\text{d}}_{1} ,{\text{d}}_{2} , \ldots {\text{d}}_{\text{m}} )$$
(1)

We want to define potential outcomes Ymt for every possible treatment in D at time t. Keeping t implicit, Ym is the outcome of a patient under treatment dm while Yn is the outcome under dn, for DMARD treatments in D. Then (Ym − Yn) is the treatment effect, which may vary amongst patients (hence the individual-level notation Ym,i). We observe characteristics X of various patients. We can decompose Ym,i into its mean given X, μm(Xi), and a deviation from the mean, Um,i, where E(Um,i | Xi) = 0 (Heckman et al. 2006). One way to define the treatment effects for individual i is to make pairwise comparisons of the form:

$$\Delta {\mathbf{Y}}_{{\mathbf{i}}} = \left( {{\mathbf{Y}}_{{\mathbf{m}}} - {\mathbf{Y}}_{{\mathbf{n}}} } \right)$$
(2)

However, we only observe the potential outcome corresponding to the treatment obtained.

To formalize this, let

$${\text{K}}_{\text{m}} = \left\{ {\begin{array}{*{20}l} {1\quad {\text{iff}}\;{\text{treatment}} = {\text{d}}_{\text{m}} } \hfill \\ {0 \quad {\text{otherwise}}} \hfill \\ \end{array} } \right.$$
(3)

We can define the observed outcome for individual i as Yi:

$${\text{Y}}_{\text{i}} = \mathop \sum \limits_{{{\text{m}}\, = \,1}}^{\text{M}} {\text{Y}}_{\text{m}} {\text{K}}_{\text{m}}$$
(4)

Since the individual-level treatment effect ΔYi is never identified, it is common to define average effects. We can define treatment effects conditioned on a vector of covariates (X) and unobservable factors (U) affecting participation, as follows:

$${\text{Average}}\;{\text{treatment}}\;{\text{effect}}\; ( {\text{ATE)}} = {\text{E[}}({\text{Y}}_{\text{m}} - {\text{Y}}_{\text{n}} )|{\text{X]}}$$
(5a)
$${\text{Average}}\;{\text{treatment}}\;{\text{effect}}\;{\text{in}}\;{\text{the}}\;{\text{treated (ATT)}} = {\text{E[}}({\text{Y}}_{\text{m}} - {\text{Y}}_{\text{n}} )|{\text{X}},{\text{K}}_{\text{m}} = 1,{\text{ K}}_{\text{n}} = 0 ]$$
(5b)
$${\text{Average}}\;{\text{treatment}}\;{\text{effect}}\;{\text{in}}\;{\text{the}}\;{\text{untreated}}\;({\text{ATU}}) = {\text{E}}[ ( {\text{Y}}_{\text{m}} - {\text{Y}}_{\text{n}} )|{\text{X}},{\text{K}}_{\text{m}} = 0 ,\;{\text{K}}_{\text{n}} = 1]$$
(5c)
$${\text{Marginal}}\;{\text{treatment}}\;{\text{effect}}\;({\text{MTE}}) = {\text{MTE}}\left( {\text{u}} \right) \equiv {\text{E[}}({\text{Y}}_{\text{m}} - {\text{Y}}_{\text{n}} )|{\text{X}},{\text{U}} = {\text{u]}}$$
(5d)

If the treatment is independent of the potential outcomes, so that

$${\text{d}}_{\text{m}} \bot {\text{Y}}_{\text{n}}$$
(6)

then \({\text{E}}[{\text{Y}}_{\text{m}} ] = {\text{E}}[{\text{Y}}_{\text{m}} |{\text{X}},{\text{K}}_{\text{m}} = 1]\) and estimation of the treatment effects would be trivial using multivariate or propensity score analyses. However, the assumption in Eq. (6) is very strong and never supported in observational studies, because agents self-select into a treatment based on their gains from that treatment, which are unobserved by the analyst. Heckman et al. (2006) term these models with essential heterogeneity. Under essential heterogeneity, the traditional IV approach fails to identify the treatment effects: the estimates depend on the choice of instrument and apply to an unknown population.
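The definitions in Eqs. (5a)–(5c) and the consequence of self-selection on gains can be illustrated with a small simulation. The sketch below is not the study's model: the data-generating process (normally distributed gains with mean 1, and a patient choosing treatment m when a noisy signal of the gain exceeds its mean) and the function name `simulate_effects` are assumptions made purely for illustration.

```python
import random
from statistics import mean


def simulate_effects(n=20000, seed=0):
    """Simulate individual gains (Y_m - Y_n) with selection on those gains.

    Hypothetical data-generating process, not taken from the study:
    gain ~ N(1, 1); the patient observes gain + noise and chooses
    treatment m (K_m = 1) when that signal exceeds 1.
    """
    rng = random.Random(seed)
    gains_all, gains_treated, gains_untreated = [], [], []
    for _ in range(n):
        gain = rng.gauss(1.0, 1.0)            # individual effect Y_m - Y_n
        signal = gain + rng.gauss(0.0, 1.0)   # partial knowledge of the gain
        k_m = 1 if signal > 1.0 else 0        # treatment indicator as in Eq. (3)
        gains_all.append(gain)
        (gains_treated if k_m else gains_untreated).append(gain)
    return {
        "ATE": mean(gains_all),        # Eq. (5a), averaged over everyone
        "ATT": mean(gains_treated),    # Eq. (5b), averaged over K_m = 1
        "ATU": mean(gains_untreated),  # Eq. (5c), averaged over K_m = 0
    }


effects = simulate_effects()
```

Because patients with larger gains are more likely to select treatment m, the simulated ATT exceeds the ATE, which in turn exceeds the ATU; the three coincide only if the gain is the same for everyone or if selection is unrelated to the gain.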

Under essential heterogeneity, a model for potential outcomes can be written as

$${\text{Y}}_{\text{i}} =\upmu_{\text{m}} ({\text{X}}_{\text{i}} ) + [\upmu_{\text{n}} ({\text{X}}_{\text{i}} ){-}\upmu_{\text{m}} ({\text{X}}_{\text{i}} ) + {\text{U}}_{\text{n,i}} {-}{\text{U}}_{\text{m,i}} ]*{\text{K}}_{\text{m}} + {\text{U}}_{\text{m,i}}$$
(7)
$${\text{Y}}_{\text{i}} =\upmu_{\text{m}} ( {\text{X}}_{\text{i}} )+ [\upmu_{\text{n}} ( {\text{X}}_{\text{i}} ){-}\upmu_{\text{m}} ( {\text{X}}_{\text{i}} )]*{\text{K}}_{\text{m}} + [{\text{U}}_{\text{n,i}} {-}{\text{U}}_{\text{m,i}} ]*{\text{K}}_{\text{m}} + {\text{U}}_{\text{m,i}}$$
(8)

By assumption, an excluded instrument Z is independent of Um,i and (Un,i − Um,i). However, to identify the treatment effects, Z needs to be independent of (Un,i − Um,i)*Km. If patients self-select into treatment based on partial or full knowledge of (Un,i − Um,i), then the simple IV approach does not identify the treatment effects.

For values of u close to zero, the marginal treatment effect is the expected effect of treatment on individuals who have unobservables that make them most likely to participate in treatment and who would participate even if the mean scale utility μ(Z) is small. A general correlated random coefficients (CRC) model for these treatment effects can be expressed as

$${\text{E}}\left[ {({\text{Y}})|.} \right] = {\text{f}}\left( {{\text{D}}{\varvec{\upalpha}},{\text{X}}{\varvec{\upbeta}},\left[ {{\text{D}}*({\text{X}} - {\ddot{\text{X}}})} \right]{\varvec{\upzeta}},\;\left[ {{\text{h}}(\uplambda_{\text{j}} )} \right]{\varvec{\uprho}},\;\left[ {{\text{D}}*({\text{h}}(\uplambda_{\text{j}} ))} \right]{\varvec{\upxi}}} \right)$$
(9)

and thus \({\text{E}}\left[ {({\text{Y}}_{\text{m}} - {\text{Y}}_{\text{n}} )|.} \right] = {\text{D}}{\varvec{\upalpha}} + \left[ {{\text{D}}*({\text{X}} - {\ddot{\text{X}}})} \right]{\varvec{\upzeta}} + \left[ {{\text{D}}*({\text{h}}(\uplambda_{\text{j}} ))} \right]{\varvec{\upxi}}\) captures the heterogeneous treatment effects of each individual DMARD. In Eq. (9), (X − Ẍ) mean-centers the covariate vector so that the interaction D*(X − Ẍ) accounts for observed heterogeneity in treatment effects. The model also accounts for unobserved heterogeneity in treatment effects through the interaction D*[h(λj)] of the treatment and the control function. The effect of multiple endogeneity is mitigated through the addition of the control function h(λj).

By using these control functions in a correlated random coefficients model (Heckman and Vytlacil 1998), our proposed estimators provide a realistic model of clinical/medical outcomes and consistently identify the heterogeneous comparative effectiveness parameters in a non-experimental observational study design. This makes these parameters important and informative for health policy and clinical decision making.

2.2 Choice theoretical models to define the control function [h(λj)]

Under essential heterogeneity, simple IV needs to be supplemented with explicit choice theory to answer many interesting questions, including questions of benefits of introducing a policy as well as distributional questions such as the percentage of persons harmed by a policy (Heckman et al. 2006). There remain two distinct approaches to identify the treatment effects defined in Eq. (5) (Heckman and Navarro-Lozano 2004). The control function approach is one of the prominent approaches for dealing with selection bias in the CRC model. An alternative to the control functions approach is the Local Instrumental Variable (LIV) described by Heckman and Vytlacil using the MTE framework (Heckman and Vytlacil 2007). The MTE framework employs a latent index for treatment selection and using the output from a two-step procedure we can estimate the treatment effects. To motivate the justification for such a treatment selection model and the resulting control function, we will first define a policy invariant structural model for treatment choice.

The decision to undertake treatment (dm) may be determined by the patient, the physician, or both. Whatever the specific content of the rule, it can be described in terms of an index function framework. Let Sm be an index of benefits to the appropriate decision-maker from treatment dm. It is a function of observed (Zm) and unobserved (Vm) variables. Thus,

$${\text{S}}_{\text{m}} = {\text{Z}}_{\text{m}} + {\text{V}}_{\text{m}}$$
(10)

In terms of this function, Eq. (3) can be written as

$${\text{K}}_{\text{m}} \left\{ {\begin{array}{*{20}l} { = 1\quad {\text{if}}\; {\text{S}}_{\text{m }} > 0} \hfill \\ { = 0 \quad {\text{Otherwise}}} \hfill \\ \end{array} } \right.$$
(11)

i.e. patients undertake a particular DMARD treatment if they experience net utility from it. Letting Sm denote the index function in a decision rule and further assuming that Zm is distributed independently of Vm makes Eq. (11) a standard discrete choice model, which is consistent with the random utility model of utility maximization (McFadden 1981). Separability between Vm and Zm in the choice equation plays an important role in the properties of instrumental variable estimators in models with essential heterogeneity (Heckman et al. 2006). It also implies the monotonicity condition considered by Imbens and Angrist (1994).

Under random utility theory, the random utility function of individual i for choice m is decomposed into deterministic and stochastic components:

$${\text{S}}_{\text{im}} = {\text{Z}}_{\text{im}} + {\text{V}}_{\text{im}}$$
(12)

where Zim is a deterministic utility function, assumed to be linear in the explanatory variables, and Vim is an unobserved random variable. Different assumptions on the distribution of the error components give rise to different classes of models.
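As one concrete instance of Eq. (12), independent type I extreme value errors Vim yield the multinomial logit, whose choice probabilities have a closed form and exhibit the IIA property discussed in the introduction. The sketch below uses made-up utility values and a hypothetical helper name (`mnl_probabilities`); it shows only that the odds between two options are unchanged when a third option is removed.

```python
import math


def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities from deterministic utilities Z_im.

    `utilities` maps each treatment option to its deterministic utility.
    """
    shift = max(utilities.values())  # subtract the maximum for numerical stability
    exp_u = {k: math.exp(v - shift) for k, v in utilities.items()}
    total = sum(exp_u.values())
    return {k: e / total for k, e in exp_u.items()}


# Hypothetical utilities, for illustration only.
full_set = {"methotrexate": 0.0, "leflunomide": 0.2,
            "etanercept": 0.5, "adalimumab": 0.5}
reduced_set = {k: v for k, v in full_set.items() if k != "adalimumab"}

p_full = mnl_probabilities(full_set)
p_reduced = mnl_probabilities(reduced_set)

# IIA: the etanercept/methotrexate odds do not depend on whether
# adalimumab is available, even though the two biologics are close substitutes.
odds_full = p_full["etanercept"] / p_full["methotrexate"]
odds_reduced = p_reduced["etanercept"] / p_reduced["methotrexate"]
```

A multinomial probit with correlated errors does not impose this invariance, which is the motivation for comparing the two selection models.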

2.3 Estimation

For a random draw i from the population at time t, the outcome model is assumed to be linear

$${\text{Y}}_{\text{it}} = {\text{a}}_{\text{i}} + {\text{X}}_{\text{it}} {\varvec{\upbeta}}_{{\mathbf{i}}} + {\text{D}}_{\text{it}} {\varvec{\upgamma}}_{{\mathbf{i}}} + {\text{U}}_{\text{it}} ,\quad {\text{t}} = 1, \ldots ,{\text{T}}$$
(13)

where Yit is the log-transformed total quarterly expenditure, ai is a J × 1 vector of individual-specific fixed effects, Xit is a 1 × M vector of exogenous covariates that change across time, βi is an M × 1 vector of individual-specific slopes associated with Xit, Dit is a 1 × M vector of endogenous treatments (DMARDs) that change across time, γi is an M × 1 vector of associated treatment effects, and Uit is an idiosyncratic error which may be correlated with ai and Dit. We allow the heterogeneity to be correlated with the endogenous treatments as well as with the observed covariates in Xit.

To estimate the treatment effects γi, we propose panel data endogeneity-corrected correlated random coefficients models. We allow for the fact that individual response to treatment can deviate from the mean and that these idiosyncrasies are correlated with treatment choice. Traditional fixed effects panel data models allow time-invariant heterogeneity, in the form of individual intercepts, to be correlated with the error. However, they treat heterogeneity as a nuisance parameter (Heckman and Robb 1985; Wooldridge 2010). The average treatment effect (ATE) parameter is thus identifiable, since we purge the heterogeneity if it manifests only as time-invariant intercept effects. A more realistic model allows for individual slopes, which in turn are correlated with the endogenous treatment and hence with the error. Heckman and Vytlacil (1998) term this a correlated random coefficient (CRC) model. We describe a two-step procedure for consistent estimation of ATE, ATT and ATU using the CRC model with bias correction.

In the first step, to model the DMARD treatment choice, the multinomial logit and the multinomial probit specifications are compared. As described in the introduction, the reason for varying between these two models is that drugs within a therapeutic class will be closer substitutes for each other than drugs from another classification category, and failure to account for this correlation could give rise to the IIA problem. These discrete choice models are estimated as pooled models for two reasons. First, as pointed out by Fernández-Val and Vella (2011), estimation of a fixed effects non-linear selection equation will generally be plagued by the incidental parameters problem. Second, the individual fixed effects, if controlled by dummies, will create a bias in the control function used in the second stage outcome equation (Fernández-Val and Vella 2011). The next step is to create a control function based on the correlation between the error terms from the treatment selection and outcome models. We apply three different bias correction techniques to handle selectivity/endogeneity using the multinomial logit selection model, and a generalization of Heckman’s approach for the multinomial probit model described in Terza (1985) (Lee 1983; Dubin and McFadden 1984; Dahl 2002; Terza 1985; Heckman 1976).

First, we obtain the predicted probabilities from each treatment model and construct the selection bias correction term (λm) for each treatment m using the methods described by Lee (1983) (MNL_LEE), Dubin and McFadden (1984) (MNL_DMF), Dahl (2002) (MNL_DAHL), and Terza (1985) (MNP_IMR), respectively:

$$\uplambda_{\text{m}}^{\text{lee}} = - \frac{{\varphi ({\text{J}}({\text{p}}_{\text{m}} ))}}{{{\text{p}}_{\text{m}} }}$$
(14a)
$$\uplambda_{\text{m}}^{\text{dubin}} = \frac{{{\text{p}}_{\text{m}} \ln ({\text{p}}_{\text{m}} )}}{{1 - {\text{p}}_{\text{m}} }} + \ln ({\text{p}}_{1} )\quad {\text{for}}\;{\text{m}} > 1$$
(14b)
$$\uplambda_{\text{m}}^{\text{dahl}} = {\text{f}}({\text{p}}_{\text{m}} )$$
(14c)
$$\uplambda_{\text{m}}^{\text{imr}} = \frac{{\varphi ({\text{p}}_{\text{m}} )}}{{\varPhi ({\text{p}}_{\text{m}} )}}$$
(14d)

where \({\text{J}}({\text{p}}_{\text{m}} ) = \varPhi^{ - 1} ({\text{p}}_{\text{m}} )\) involves the inverse of the cumulative standard normal distribution.

φ is the standard normal probability density function, pm is the predicted probability from the first step selection model, ln(pm) is the natural logarithm of pm, and f(pm) is a polynomial in pm (a squared polynomial expansion is used in our analysis) for individual i at time t, where subscripts i and t are suppressed for clarity. Lastly, for the multinomial probit selection model, an inverse Mills ratio (\(\uplambda_{\text{m}}^{\text{imr}}\)) was defined as in (14d), where φ is the standard normal probability density function and Φ is the standard normal cumulative distribution function.
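The four correction terms in Eq. (14) can be computed directly from predicted probabilities. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the Dahl term is assumed to consist of pm and its square (one reading of the "squared polynomial expansion"), and the inverse Mills ratio is coded literally as printed in (14d), with pm as the argument.

```python
from math import log
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: pdf = phi, cdf = Phi


def lambda_lee(p_m):
    """Eq. (14a): -phi(J(p_m)) / p_m with J = inverse standard normal CDF."""
    j = _STD_NORMAL.inv_cdf(p_m)  # requires 0 < p_m < 1
    return -_STD_NORMAL.pdf(j) / p_m


def lambda_dubin(p_m, p_1):
    """Eq. (14b): p_m * ln(p_m) / (1 - p_m) + ln(p_1), for options m > 1."""
    return p_m * log(p_m) / (1.0 - p_m) + log(p_1)


def lambda_dahl(p_m, degree=2):
    """Eq. (14c): polynomial terms of p_m (assumed here: p_m and p_m squared)."""
    return [p_m ** k for k in range(1, degree + 1)]


def lambda_imr(p_m):
    """Eq. (14d) as printed: phi(p_m) / Phi(p_m)."""
    return _STD_NORMAL.pdf(p_m) / _STD_NORMAL.cdf(p_m)
```

For example, `lambda_lee(0.5)` evaluates to −φ(0)/0.5, and `lambda_dubin(0.5, 0.5)` to 2·ln(0.5). Predicted probabilities of exactly 0 or 1 are not handled and would need to be trimmed before use.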

In the second step, we employ a fixed effects regression function to estimate the conditional expectation of Yit:

$${\text{E}}\left[ {{\text{Y}}_{\text{it}} |{\text{X}}_{\text{it}} ,{\text{D}}_{\text{it}} } \right] = {\varvec{\upeta}}_{{\mathbf{i}}} + {\text{X}}_{\text{it}} {\varvec{\upbeta}} + {\text{D}}_{\text{it}} {\varvec{\upgamma}} + \left( {{\text{D}}_{\text{it}} *\left( {{\text{X}}_{\text{it}} - {\ddot{\text{X}}}_{\text{i}} } \right)} \right){\varvec{\upzeta}} + \left( {{\text{h}}\left( {\uplambda_{\text{it}} } \right)} \right){\varvec{\uprho}} + \left( {{\text{D}}_{\text{it}} *\left( {{\text{h}}\left( {\uplambda_{\text{it}} } \right)} \right)} \right){\varvec{\upxi}}$$
(15)

where Ẍi is the expected value of Xit and thus (Xit − Ẍi) mean-centers the exogenous variables. The last two terms are the control function and the interaction of the control function with the endogenous treatment indicators, respectively, with h(λm) defined as

$${\text{h}}(\uplambda_{\text{m}} )^{\text{lee/dahl/imr}} = ({\text{D}}_{\text{m}} *(\uplambda_{\text{m}} ) + {\text{D}}_{\text{n}} *(\uplambda_{\text{n}} ))$$
(16a)
$${\text{h}}(\uplambda_{\text{m}} )^{\text{dubin}} = (\uplambda_{\text{m}} ) + (\uplambda_{\text{n}} )\quad {\text{for}}\;{\text{j}} > 1$$
(16b)

where m ≠ n are the j treatment options and λm is as defined in Eq. (14).
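A minimal sketch of how the regressors in Eqs. (15) and (16) might be assembled for a single patient-quarter is given below. It reads Eq. (16a) as an indicator-weighted sum over the treatment options and Eq. (16b) as an unweighted sum over the non-reference options; both readings, the function names, and the example values are assumptions for illustration rather than the authors' implementation.

```python
def control_function(d, lam, method="lee"):
    """Combine correction terms into h(lambda) for one patient-quarter.

    d   : dict mapping treatment -> 0/1 indicator D
    lam : dict mapping treatment -> correction term lambda from Eq. (14)
          (for "dubin", pass only the non-reference options, j > 1)
    """
    if method == "dubin":
        # Eq. (16b): terms enter additively, without treatment indicators.
        return sum(lam.values())
    # Eq. (16a): indicator-weighted sum, i.e. the term of the chosen treatment.
    return sum(d[k] * lam[k] for k in lam)


def second_step_terms(d, x, x_bar, h):
    """Interaction regressors of Eq. (15) for one patient-quarter."""
    centered = [xi - xb for xi, xb in zip(x, x_bar)]  # X_it - mean of X
    return {
        "D*(X - Xbar)": {k: [d[k] * c for c in centered] for k in d},
        "h": h,
        "D*h": {k: d[k] * h for k in d},
    }


# Hypothetical example: patient on etanercept, two covariates.
d = {"etanercept": 1, "adalimumab": 0}
lam = {"etanercept": -0.8, "adalimumab": -1.2}
h = control_function(d, lam)
terms = second_step_terms(d, x=[50.0, 2.0], x_bar=[45.0, 1.0], h=h)
```

These terms would then enter the fixed effects regression alongside Xit and Dit.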

Identification of the treatment effects in Eq. (15) relies on exclusion restrictions (instruments) in Xit and the vector of covariates (Zit) from the treatment selection model in step 1. These exclusion restrictions satisfy three properties: (a) they are (strongly) correlated with the treatment (exposure), (b) they are uncorrelated with the error term of the outcome model, and (c) they do not exert a direct impact on the outcome and act only through the endogenous treatments. Furthermore, in the presence of multiple endogenous treatments, we need at least as many excluded instruments as the number of endogenous variables in the model (Heckman et al. 2008). These requirements are extremely demanding since observational data generally do not provide many options for defining multiple valid instruments. We overcome this issue by exploiting the longitudinal nature of our data and making use of established approaches from the dynamic panel data literature, which employ lagged values of variables as instruments (Arellano and Bover 1995; Arellano and Bond 1991; Blundell and Bond 1998). Specifically, for each endogenous DMARD treatment at time t, the first lag of observed treatment (treatment at time t − 1) serves as the excluded instrument vector. Thus, we rely on the physician recommending a certain treatment after observing the patient’s experience and outcomes based on past DMARD choice. This is generally the way most physicians treat chronic diseases, and hence the first lag is a theoretically as well as clinically valid instrument. Each treatment has its own unique lag value and hence we always satisfy the order condition. Since we allow for these exclusion conditions necessary for identification, we do not have to rely on the joint distributional assumption for identification. To allow for the use of lag values as valid instruments, we augment the model with a sequential exogeneity assumption on the observed treatments. We assume that Xis is uncorrelated with Uit for all s and t, but that Uit is uncorrelated with Dis only for s < t.

We further assume that

$${\text{E}}\left( {{\text{U}}_{\text{it}} |{\text{X}}_{\text{it}} ,{\text{D}}_{{{\text{i,t}} - 1}} ,{\text{D}}_{{{\text{i,t}} - 2}} , \ldots ,{\text{D}}_{\text{i1}} ,{\text{a}}_{\text{i}} } \right) = 0,\quad {\text{t}} = 1, \ldots ,{\text{T}}$$
(17)

thus, Xit are strictly exogenous while the Dit are only sequentially exogenous, conditional on the unobserved effect ai, and only have a contemporaneous effect on Yit.
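The lagged-treatment instrument can be built by carrying each patient's previous observed treatment forward by one period. The sketch below assumes a long-format list of records with hypothetical field names (`patient`, `quarter`, `treatment`) and consecutive quarters per patient; gaps between quarters are not handled.

```python
def add_lagged_treatment(rows):
    """Attach the first lag of observed treatment to each patient-quarter.

    rows: iterable of dicts with keys 'patient', 'quarter', 'treatment'.
    Returns new dicts, sorted by patient and quarter, with 'lag_treatment'
    set to the treatment at t - 1 (None for a patient's first quarter).
    """
    previous = {}  # patient -> treatment observed in the preceding quarter
    out = []
    for row in sorted(rows, key=lambda r: (r["patient"], r["quarter"])):
        new_row = dict(row)
        new_row["lag_treatment"] = previous.get(row["patient"])
        previous[row["patient"]] = row["treatment"]
        out.append(new_row)
    return out


# Hypothetical records, for illustration only.
records = [
    {"patient": 1, "quarter": 2, "treatment": "etanercept"},
    {"patient": 1, "quarter": 1, "treatment": "methotrexate"},
    {"patient": 2, "quarter": 1, "treatment": "leflunomide"},
]
with_lags = add_lagged_treatment(records)
```

Observations without a lag (each patient's first quarter) cannot contribute to the first-step selection model when the lag is used as an instrument.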

Thus, in the correlated random coefficients model estimated as in Eq. (15), mean-centering the exogenous variables and interacting them with the endogenous regressors ensures that γm is the estimated average treatment effect of treatment m compared to methotrexate. The addition of the control function described in Eq. (16) controls for the time-varying endogeneity in treatment choice, and the interaction of this generalized residual with the endogenous treatment allows for the correlated random coefficients due to unobserved response heterogeneity. The standard error on the control function can serve as a test of the endogeneity assumption of treatment choice. However, several issues call into question the validity of analytic standard errors. First, using estimated values for λm creates heteroskedastic errors. Second, we are dealing with generated regressors due to the two-step procedure. Lastly, mean-centering exogenous variables by sample averages instead of population expectations also invalidates the standard errors. Bootstrapping the errors while accounting for the panel nature of the data provides asymptotically consistent standard errors and avoids the aforesaid issues.
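A bootstrap that respects the panel structure resamples whole patients, keeping all of a patient's quarters together. The following is a minimal sketch of such a cluster bootstrap with percentile intervals; the function name and data layout are assumptions, and in the full two-step procedure the `statistic` callable would re-estimate both the selection model and the outcome model within each replication.

```python
import random
from statistics import mean


def panel_bootstrap_ci(panel, statistic, reps=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from a patient-level (cluster) bootstrap.

    panel     : dict mapping patient id -> list of that patient's observations
    statistic : function taking a flat list of observations, returning a number
    """
    rng = random.Random(seed)
    ids = list(panel)
    draws = []
    for _ in range(reps):
        # Resample patients with replacement; keep each patient's quarters intact.
        sampled_ids = [rng.choice(ids) for _ in ids]
        sample = [obs for pid in sampled_ids for obs in panel[pid]]
        draws.append(statistic(sample))
    draws.sort()
    lower = draws[max(0, round(alpha / 2 * reps))]
    upper = draws[min(reps - 1, round((1 - alpha / 2) * reps) - 1)]
    return lower, upper


# Hypothetical toy panel, for illustration only.
toy_panel = {1: [1.0, 2.0], 2: [5.0], 3: [3.0, 4.0]}
ci = panel_bootstrap_ci(toy_panel, mean, reps=200)
```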

We log transformed the total quarterly expenditure outcome since expenditure data generally has non-negative values, high zero mass, heteroskedasticity, heavy skewness in the right tail, and is leptokurtic (Manning 2006). We used Duan’s smearing retransformation specific to each treatment to obtain estimates on the scale of interest (Duan 1983). Treatment effects were evaluated by averaging the individual marginal effects to avoid the problem of reintroduction of covariate imbalance (Greene 2002). Confidence intervals were based on non-parametric bootstrapped (1000 repetitions) percentiles with alpha of 5%.
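Duan's smearing estimator rescales the exponentiated log-scale prediction by the mean of the exponentiated residuals. The sketch below computes one smearing factor per treatment group, which is one way to read "specific to each treatment"; the function names and data layout are assumptions.

```python
from collections import defaultdict
from math import exp
from statistics import mean


def smearing_factors(log_y, fitted_log_y, treatment):
    """Duan smearing factor per treatment: mean of exp(residual) within group."""
    residuals = defaultdict(list)
    for y, fitted, d in zip(log_y, fitted_log_y, treatment):
        residuals[d].append(y - fitted)  # residual on the log scale
    return {d: mean([exp(r) for r in res]) for d, res in residuals.items()}


def retransform(fitted_log, treatment, factors):
    """Predicted expenditure on the original scale for one observation."""
    return exp(fitted_log) * factors[treatment]
```

When all residuals in a group are zero the factor is 1 and the retransformed prediction is simply the exponentiated fitted value; otherwise the factor corrects for the fact that the exponential of a mean is not the mean of the exponentials.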

2.4 Data

We used 100% of the fee-for-service portion of California Medicaid (Medi-Cal) paid claims and eligibility files for enrollees with RA between 01/01/98 and 12/31/05. Medi-Cal covers outpatient, inpatient, and prescription drugs for poor or disabled Californians. Paid claims files included information from institutional claims at the claim level, professional services claims at the service level, and pharmacy claims at the specific drug level. Eligibility files include the enrollment status of each month, in addition to enrollee’s demographic information.

2.5 Study cohort

The study cohort included enrollees between 18 and 100 years of age who had a diagnosis code for RA and filled a prescription for a biologic (adalimumab or etanercept) or traditional DMARD (leflunomide or methotrexate) during the study period. The RA diagnosis was identified using International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) codes 714.xx. For each patient, a “first-index date” was defined as the first date that an RA patient filled any DMARD prescription. An “incident case” of DMARD utilization was defined as a patient with at least a 12-month eligibility period prior to the first-index date without any DMARD prescription. To ensure a minimum follow-up period of 3 months, we required patients to have at least 90 days of continuous eligibility after the “first-index date”.
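The eligibility conditions for an incident case can be expressed as a simple date check. The sketch below makes several simplifying assumptions that the text does not settle: eligibility is represented as a single continuous span, 12 months is taken as 365 days, and the first-index date is the first DMARD fill in the data, so the look-back window contains no DMARD fill by construction. The function and parameter names are hypothetical.

```python
from datetime import date, timedelta


def is_incident_case(first_index, eligible_from, eligible_to,
                     washout_days=365, followup_days=90):
    """Check the look-back and follow-up eligibility windows around first_index."""
    has_washout = eligible_from <= first_index - timedelta(days=washout_days)
    has_followup = eligible_to >= first_index + timedelta(days=followup_days)
    return has_washout and has_followup


# Hypothetical patient: enrolled for all of 2001-2003, first DMARD fill mid-2002.
example = is_incident_case(first_index=date(2002, 6, 1),
                           eligible_from=date(2001, 1, 1),
                           eligible_to=date(2003, 12, 31))
```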

Analyses were conducted at the panel level. We started with the first known prescription claim for a DMARD (identified by the “first-index date”) and followed all costs incurred over the following 90 days, including the day of the prescription fill. Using an intention-to-treat approach, any new claim that occurred during that quarter was attributed to the treatment started on that quarter’s index date. Subsequently, a prescription claim any time after the 90th day following the “first-index date” triggered a new episode, which was followed for a subsequent 90 days. Thus, each person could appear multiple times, depending on the number of eligible quarters in which he/she was observed in the claims files. Based on prior literature, we excluded patients who had Crohn’s disease, psoriasis or psoriatic arthritis, ankylosing spondylitis, solid organ transplantation, HIV/AIDS, or any indication of cancer, or who visited a mental health institution in the 12-month period prior to the first-index date (Smedstad et al. 1996a, b; Grijalva et al. 2007).
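The episode construction described above can be sketched as follows. This is a simplified illustration, not the code used in the study: the function and field names are hypothetical, the 90-day window is assumed to run from the index date through index date plus 89 days (so that it includes the day of the fill), and eligibility checks are omitted.

```python
from datetime import date, timedelta

EPISODE_DAYS = 90  # follow-up window, counting the day of the fill


def build_episodes(fills):
    """Turn (fill_date, drug) pairs into 90-day intention-to-treat episodes.

    A fill inside an open window does not change the episode's treatment;
    the first fill after the window closes starts a new episode.
    """
    episodes = []
    window_end = None
    for fill_date, drug in sorted(fills):
        if window_end is None or fill_date > window_end:
            window_end = fill_date + timedelta(days=EPISODE_DAYS - 1)
            episodes.append(
                {"index_date": fill_date, "end_date": window_end, "treatment": drug}
            )
    return episodes


def episode_cost(episode, claims):
    """Sum all (claim_date, amount) claims that fall inside the episode window."""
    return sum(
        amount
        for claim_date, amount in claims
        if episode["index_date"] <= claim_date <= episode["end_date"]
    )


# Hypothetical fills and claims for one patient.
fills = [
    (date(2003, 1, 1), "methotrexate"),
    (date(2003, 2, 15), "etanercept"),  # inside the first window: ignored under ITT
    (date(2003, 4, 1), "etanercept"),   # after the window: starts a new episode
]
claims = [(date(2003, 1, 1), 100.0), (date(2003, 3, 31), 50.0), (date(2003, 4, 1), 25.0)]

episodes = build_episodes(fills)
costs = [episode_cost(ep, claims) for ep in episodes]
```

Under these assumptions the mid-quarter switch to etanercept is still attributed to methotrexate, which is the behavior that the intention-to-treat rule implies.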

2.6 Expenditure and covariates

The primary outcome was total quarterly expenditure, which included expenditures on pharmacy, outpatient visits, long-term care, inpatient stays, and emergency department visits. Long-term care included the cost of services from skilled nursing and intermediate care facilities. Inpatient stays and emergency department visits included all provider-related and facility-related costs of services incurred during inpatient stays and emergency department visits. All expenditures were adjusted to 2008 U.S. dollars using the medical component of the consumer price index. Contractual amounts reimbursed by Medi-Cal were used to calculate treatment “costs” as opposed to “charges”. Three expenditure variables were constructed in the final panel-level data. The log-transformed total health-care costs in the 90 days following each episode’s index date was the dependent variable. The log-transformed pre-episode total pharmacy costs consisted of total expenditures on all pharmaceutical utilization in the 6 months prior to the start of the episode. The log-transformed pre-episode total non-pharmacy costs included total expenditures in the 6 months prior to the start of the episode, excluding pharmaceutical utilization. These two pre-episode expenditure covariates served as a proxy for disease severity. Age was calculated as of each episode’s index date. Additionally, major preexisting comorbidities, captured by the Elixhauser comorbidity index, were calculated from the diagnosis codes on claims during the 6 months prior to each episode’s index date, excluding RA as a comorbidity (Elixhauser et al. 1998).
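The price adjustment amounts to rescaling each nominal amount by the ratio of the 2008 index to the index in the year the cost was incurred. The sketch below is illustrative only: the function name is hypothetical and the index values are round placeholders, not actual medical CPI figures.

```python
def to_2008_dollars(amount, year, medical_cpi):
    """Rescale a nominal amount to 2008 dollars using a year -> index mapping."""
    return amount * medical_cpi[2008] / medical_cpi[year]


# Placeholder index values for illustration only (NOT real medical CPI data).
PLACEHOLDER_CPI = {2000: 80.0, 2008: 100.0}

adjusted = to_2008_dollars(400.0, 2000, PLACEHOLDER_CPI)
```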

3 Results

The final analysis compared patients on adalimumab, etanercept, leflunomide and methotrexate. The data contained 3014 individual patients with a mean of 5 and a maximum of 24 quarters per patient, resulting in 14,158 total panel observations. Mean age of the sample was 58.5 (± 14.3) years. The majority of the sample was female (76.5%), and 33.4% were Caucasian. Adalimumab and etanercept users were slightly younger than methotrexate and leflunomide users (Table 1). On average, adalimumab and etanercept users had higher pre-episode total pharmacy expenditures, but lower pre-episode total non-pharmacy expenditures, than methotrexate and leflunomide users. The primary outcome, average total quarterly expenditure, was much higher for adalimumab [$7579 (± 5645)] and etanercept [$7431 (± 9640)] users than for methotrexate [$4057 (± 7598)] and leflunomide [$4564 (± 7794)] users. The large standard deviation in total quarterly expenditure illustrates the extreme skewness that is typical of medical expenditure data. To reduce the effect of outliers on the estimated means and to avoid negative predicted expenditures, we log-transformed all expenditure variables.

Table 1 Distribution of socio-demographics and other covariates

We performed a generalized Hausman test to evaluate the validity of the IIA assumption under the two specifications of the selection model. The test, which assesses whether the choice between outcome J and outcome K is independent of the other alternatives, rejected the IIA assumption, indicating that the multinomial probit might be the preferred selection model.

To check for time-varying endogeneity in treatment assignment, we tested the statistical significance of the control function using t-tests and found endogeneity to be significant under both specifications of the selection model and under all bias correction methods. The statistical significance of the interaction between the control function and the treatment indicators rejected the assumption of homogeneous treatment effects. Lastly, the Hausman specification test strongly rejected the random effects assumption for the outcome equation in favor of the fixed effects model.
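For reference, the textbook form of the Hausman statistic for the fixed versus random effects comparison is given below. This is the standard expression rather than a formula reported in this paper; the symbols are introduced here for illustration, with k denoting the number of coefficients on time-varying regressors that are compared.

```latex
H = \left(\hat{\beta}_{FE} - \hat{\beta}_{RE}\right)'
    \left[\widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE})\right]^{-1}
    \left(\hat{\beta}_{FE} - \hat{\beta}_{RE}\right)
    \;\overset{a}{\sim}\; \chi^{2}_{k}
```

A large value of H indicates that the two estimators differ systematically, which is evidence against the random effects assumption that the individual effects are uncorrelated with the regressors.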

The bias correction approaches proposed by Lee (1983) and Dahl (2002) make stronger assumptions but avoid the risk of multicollinearity as compared to the approach proposed by Dubin and McFadden (1984). We assessed the presence of multicollinearity using the Belsley, Kuh, and Welsch condition index (Belsley et al. 1980). Based on the condition index of the model described in Eq. (15), we found that multicollinearity was much stronger in the Dubin and McFadden (1984) approach (condition index = 44). However, the other three bias correction approaches also displayed some evidence of multicollinearity (Table 2), which could inflate the standard errors.
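For readers unfamiliar with the diagnostic, the condition indexes of Belsley et al. (1980) are usually defined from the singular values of the regressor matrix after its columns have been scaled to unit length. The notation below is introduced here for illustration and is not taken from Eq. (15).

```latex
\eta_k = \frac{\mu_{\max}}{\mu_k}, \qquad k = 1, \dots, p,
\qquad
\kappa(X) = \max_k \eta_k = \frac{\mu_{\max}}{\mu_{\min}}
```

Here the \mu_k are the singular values of the column-scaled X with p columns. A common rule of thumb reads values above roughly 30 as a sign of strong collinearity, which is the range in which the reported index of 44 falls.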

Table 2 Condition index to assess multicollinearity

3.1 Average treatment effect

The ATEs comparing adalimumab [$1852 (372–3860)] and etanercept [$1856 (597–4008)] to methotrexate were statistically significant only under Lee’s (1983) bias correction approach (Table 3). However, compared with leflunomide, the incremental quarterly expenditures of adalimumab and etanercept were significantly higher under both Lee’s (1983) and Dahl’s (2002) bias correction approaches. When the selection model was changed to the multinomial probit and the bias correction was based on a generalized Heckman-type correction, the ATEs comparing adalimumab and etanercept to methotrexate and leflunomide, respectively, were statistically significant, showed a larger incremental difference than any of the three multinomial logit based correction approaches, and had higher variance than under Lee’s (1983) approach (Table 4). The incremental differences between leflunomide and methotrexate, as well as between etanercept and adalimumab, were not significantly different from zero.
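The intervals in brackets are bootstrap percentile intervals. A minimal sketch of how such an interval can be formed is shown below. It makes two simplifying assumptions that are not stated in the paper: whole patients (rather than single quarters) are resampled, and the individual effects are treated as fixed numbers instead of re-estimating the full model on every resample. All names and values are hypothetical.

```python
import math
import random


def percentile_interval(draws, alpha=0.05):
    """Lower and upper alpha/2 percentiles of a list of bootstrap draws."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    lower = ordered[math.floor((alpha / 2) * last)]
    upper = ordered[math.ceil((1 - alpha / 2) * last)]
    return lower, upper


def bootstrap_mean_ci(effects_by_patient, reps=1000, alpha=0.05, seed=12345):
    """Percentile CI for the mean effect, resampling patients with replacement."""
    rng = random.Random(seed)
    patient_ids = sorted(effects_by_patient)
    draws = []
    for _ in range(reps):
        resampled = rng.choices(patient_ids, k=len(patient_ids))
        pooled = [e for pid in resampled for e in effects_by_patient[pid]]
        draws.append(sum(pooled) / len(pooled))
    return percentile_interval(draws, alpha)


# Hypothetical per-quarter incremental effects (dollars) for four patients.
effects = {
    "p1": [1800.0, 2100.0],
    "p2": [900.0],
    "p3": [2600.0, 2400.0, 2500.0],
    "p4": [1500.0],
}
low, high = bootstrap_mean_ci(effects)
```

Percentile intervals built this way need not be symmetric around the point estimate, which matches the asymmetric intervals reported for the retransformed dollar-scale effects.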

Table 3 Varying the bias correction method in the multinomial logit selection equation
Table 4 Varying the selection equation

3.2 Average treatment effect in the treated

The average treatment effect in the treated (ATT) was statistically significant for the incremental difference between etanercept and methotrexate, as well as between adalimumab and leflunomide, based on the MNL_LEE approach (Table 3). The incremental difference between etanercept and leflunomide was statistically significant based on the MNL_LEE as well as the MNL_DAHL approach. The ATTs obtained under the MNP_IMR approach were statistically significant for the biologics compared with methotrexate and leflunomide, respectively. As in the case of the ATE, the ATTs between leflunomide and methotrexate, as well as between etanercept and adalimumab, were not statistically significant under any bias correction or selection model specification. Additionally, the ATT obtained under the MNL_DMF approach was dramatically lower in magnitude than the ATTs obtained under the MNL_LEE, MNL_DAHL and MNP_IMR approaches. Under most approaches, and excluding the treatment effects associated with etanercept, the ATT was lower than the ATE and the ATU.

3.3 Average treatment effect in the untreated

The average treatment effect in the untreated (ATU) was significantly higher for both biologics compared with methotrexate and leflunomide, respectively (Table 3), under the MNL_LEE and MNL_DAHL approaches. These results held when the selection equation was changed from the multinomial logit to the multinomial probit, with a larger difference under the MNP_IMR approach but substantial overlap with the confidence intervals of the three logit-based approaches. The ATU comparing etanercept to leflunomide was the only treatment effect that was statistically significant under the MNL_DMF approach. In general, except for the treatment effects associated with etanercept, the ATU was higher than the ATT under the different bias correction approaches and selection model specifications.
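The relationship among the three parameters can be made concrete with a small sketch that averages individual-level incremental effects over different subpopulations. It assumes that the ATT conditions on patients who received the treatment of interest and that the ATU conditions on all patients who did not; other definitions are possible with multiple treatments. The function names and numbers are hypothetical and do not come from Tables 3 or 4.

```python
def mean(values):
    """Arithmetic mean of a non-empty iterable."""
    values = list(values)
    return sum(values) / len(values)


def treatment_effects(records, treatment):
    """ATE, ATT and ATU from (treatment_received, individual_effect) pairs.

    individual_effect is each observation's predicted incremental expenditure
    of `treatment` versus the comparator.
    """
    everyone = [effect for _, effect in records]
    treated = [effect for arm, effect in records if arm == treatment]
    untreated = [effect for arm, effect in records if arm != treatment]
    return {"ATE": mean(everyone), "ATT": mean(treated), "ATU": mean(untreated)}


# Hypothetical individual effects of etanercept versus methotrexate (dollars).
records = [
    ("etanercept", 1500.0),
    ("etanercept", 1700.0),
    ("methotrexate", 2400.0),
    ("leflunomide", 2200.0),
]
effects = treatment_effects(records, "etanercept")
```

In this toy example the ATT is below the ATE, which in turn is below the ATU. This is the ordering expected when patients with the smallest incremental expenditure are the ones selected into treatment, and the ATE is the share-weighted average of the other two.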

4 Discussion

We provide a modeling framework for simultaneous comparison of multiple treatment options using control functions with panel data models to control for the bias introduced by response heterogeneity as well as endogeneity. The results of the study identify the average treatment effect, the treatment effect in the treated, and the treatment effect in the untreated associated with biologic DMARDs in rheumatoid arthritis. RA presents an enormous economic burden on society in terms of the direct medical costs, the indirect costs which include lost wages and a caregiver’s time, and the intangible costs of pain, fatigue, lowered self-esteem, or other psychological problems. The incremental economic burden of RA is a staggering $22.3 billion (in 2008 USD) annually on U.S. healthcare (Kawatkar et al. 2012b). Furthermore, in less than a decade, the primary driver of this incremental expenditure in RA has shifted from hospital expenditure to pharmacy expenditure. The additional pharmacy expenditure accounts for approximately 66% of the incremental total expenditure of RA (Kawatkar et al. 2012b). Escalation in pharmacy expenditures has increased the interest in the comparative effectiveness of biologic and traditional DMARDs. Not surprisingly, comparative effectiveness of biologic therapy in RA is in the first quartile of the Institute of Medicine’s Initial National Priorities for comparative effectiveness research (IOM 2009). Hence, our results are critical to inform public authorities and decision makers for improving policy making on the formulary expansion of biologic DMARDs and their reimbursement. In answering this question, our findings imply that if a formulary expansion policy for biologic DMARDs was considered, the incremental acquisition expenditure associated with adalimumab, etanercept, and leflunomide may not be offset by commensurate reductions in non-pharmacy routine and catastrophic resource utilization within the first year of treatment. 
Hence, judicious prescribing of these agents may be warranted unless gains in health-related quality of life make these DMARDs cost-effective.

On a broader level, our framework has a much wider application to future comparative effectiveness studies and health technology assessments (HTAs) involving multiple treatments/interventions. Our study framework provides a relatively simple approach to generic HTA questions using inexpensive observational data. In addition, the simultaneous evaluation of multiple treatments makes the approach much more meaningful to healthcare decision and policy makers.

For our methodological framework, we follow the contemporary marginal treatment effect approach and specify a latent index treatment selection model. The latent index for treatment selection was varied from the restrictive multinomial logit to the flexible multinomial probit. A control function was created to minimize the influence of time-varying confounding in a correlated random coefficients model. The sensitivity of the control function to the various assumptions and restrictions of several selection bias correction approaches was assessed. By employing a time-varying selection-bias-corrected correlated random coefficients model, we allowed for a very general and clinically realistic model to simultaneously evaluate the comparative effectiveness of multiple treatments on expenditure outcomes using an observational study design. We add to the literature by describing a framework that simultaneously evaluates the comparative effectiveness of multiple treatments in a panel data setting while acknowledging the heterogeneity that characterizes real-world medical outcomes.

The presence of time-varying endogeneity in the data biases the treatment effects obtained from a naïve fixed effects model. To mitigate this omitted variable bias, a control function was added to the fixed effects model. Intuitively, the control function represents the probability of not receiving a particular treatment, given that the individual was ‘at risk’ of receiving that treatment. The control function approach comes with the additional assumption that the latent index/selection model is correctly specified in terms of the functional form of the exogenous regressors, as well as the specification of the model. Given this assumption, it is important to understand how the treatment effect varies as a function of the latent index model’s specification. We confirmed that the multinomial probit model was preferred over the multinomial logit model, since the restrictive IIA assumption implicit in the latter was rejected. Future studies estimating heterogeneous treatment effects should select the index model with care, especially when dealing with multiple treatments.

Since the multinomial probit was the preferred choice model, primary inferences on study objectives are drawn from the MNP_IMR approach. Based on the MNP_IMR, the ATE of adalimumab [vs. methotrexate $2081 ($585–$4342); vs. leflunomide $2115 ($624–$4412)] and etanercept [vs. methotrexate $2061 ($726–$4390); vs. leflunomide $2094 ($664–$4551)] offers insight into the impact of formulary expansion of biologic DMARDs, and provides an estimate of the average effect if RA patients were randomly assigned to biologic DMARDs. In the current study, excluding etanercept, the treatment effect in the treated was always lower than the ATE. This indicates that those who received a particular DMARD treatment experienced above-average gains from that treatment and hence could be considered to have been assigned to the treatment correctly. The treatment effects associated with etanercept, however, indicate that not everyone who received etanercept benefitted from that treatment. This could be associated with the fact that etanercept was introduced about 4 years prior to adalimumab, and thus there may have been some inexperience amongst rheumatologists with regard to who could benefit from its use. The late market entry of adalimumab, which is a direct competitor of the subcutaneously injected etanercept, may also have had an effect on the pricing of both biologics, which is a key driver of total expenditure for biologic DMARD users. Another key result indicates that substituting leflunomide for the low-cost methotrexate in patients with a suboptimal response or a contraindication to methotrexate may not result in an overall increase in total expenditures, even though leflunomide is relatively more expensive to procure. Similarly, substitution between etanercept and adalimumab may not increase overall per-patient total expenditures.
Substitution of methotrexate or leflunomide with either biologic DMARD could potentially add approximately $8000 annually to overall expenditures, and for such substitutions, the benefit risk trade off needs to consider health related quality of life and productivity gains from the biologics.
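The annual figure is consistent with multiplying the quarterly MNP_IMR average treatment effects reported above by four. This is a back-of-the-envelope check that assumes the quarterly increment persists unchanged over four quarters, which the text does not state explicitly.

```latex
4 \times \$2061 = \$8244, \qquad
4 \times \$2081 = \$8324, \qquad
4 \times \$2094 = \$8376, \qquad
4 \times \$2115 = \$8460
```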

In clinical practice, physicians attempt to prescribe treatments that are more likely to work for specific patients. Ignoring this heterogeneity reduces the clinical relevance and generalizability of the inferences. Moreover, a priori, we cannot eliminate the presence of this heterogeneity, and the conservative approach is to allow for such differences through model specification (Heckman et al. 2006). In this study, heterogeneity manifested in terms of the magnitude of the parameters. Furthermore, in the case of models with essential heterogeneity and polychotomous discrete treatment choice, the misspecification of the indicator function (either its functional form or its arguments), generally produces biased estimates of the parameters of the model under the control function approaches (Heckman et al. 2006). Moreover, for the polychotomous choice, identification requires a unique (non-overlapping) instrument/exclusion restriction for each treatment choice, unless identification at infinity is invoked (Heckman et al. 2006). Bourguignon et al. (2007) have reported that selection bias correction based on the multinomial logit model can provide a fairly good correction for the outcome equation, even when the IIA hypothesis is violated (Bourguignon et al. 2007). Their study also contrasted the underlying assumptions made by the different methods available for selection bias correction, when selection is specified as a multinomial logit model. They report that in many cases, the approach initiated by Dubin and McFadden (1984), as well as the semi-parametric alternative proposed by Dahl (2002), may be preferable to the approach proposed by Lee (1983) (Dubin and McFadden 1984; Dahl 2002; Lee 1983). In a counterfactual treatment effect analysis, a few additional issues arise, namely, multicollinearity and relevance of each method to the counterfactual treatment effect analysis. 
In this study, multicollinearity was quite severe in the MNL_DMF approach, which could be responsible for the statistically non-significant results obtained by this approach, even when other approaches indicated significant differences. Additionally, intercepts are not identified in Dahl’s (2002) approach or in misspecified parametric approaches, which could make them weak for counterfactual analysis (Bourguignon et al. 2007). In this study, we find that differences in the bias correction techniques used to construct the control function had a significant impact on the estimated treatment effect parameters. Additionally, when estimating heterogeneous treatment effects, especially when treatment selection is over a discrete choice set, the specification of the latent index model matters. These are key issues to consider when estimating polychotomous treatment effects.
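To see why the Dubin and McFadden (1984) correction is more exposed to multicollinearity than Lee’s (1983), it helps to compare how many correction terms each one adds per observation. The sketch below uses commonly cited forms of the two corrections as summarized in the selection literature. It is an illustration under stated assumptions, not a reproduction of Eq. (15): sign and scaling conventions vary across papers, and the function names and probabilities are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def lee_term(probs, chosen):
    """Single Lee-type term: phi(Phi^{-1}(P_chosen)) / P_chosen."""
    p = probs[chosen]
    return _STD_NORMAL.pdf(_STD_NORMAL.inv_cdf(p)) / p


def dmf_terms(probs, chosen):
    """One Dubin-McFadden-type term per non-chosen alternative (M - 1 terms).

    Assumed form: P_j * ln(P_j) / (1 - P_j) + ln(P_chosen) for each j != chosen.
    """
    log_p_chosen = math.log(probs[chosen])
    return [
        p * math.log(p) / (1.0 - p) + log_p_chosen
        for j, p in enumerate(probs)
        if j != chosen
    ]


# Hypothetical predicted choice probabilities for three alternatives.
probs = [0.5, 0.3, 0.2]
lee = lee_term(probs, chosen=0)
dmf = dmf_terms(probs, chosen=0)
```

With M alternatives, the Lee-type correction adds one regressor while the Dubin-McFadden-type correction adds M − 1, all of which are functions of the same predicted probabilities. Regressors built from a shared set of probabilities tend to be highly correlated with one another, which is one plausible source of the high condition index observed for MNL_DMF.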

4.1 Limitations

Our study has not exhausted all the bias correction approaches in the published literature. For example, Hay (1980) was one of the pioneers in generalization of selectivity bias correction to the case of polychotomous treatment choice (Hay 1980). Hay’s (1980) logit based selectivity bias correction approach applies a slightly different framework by specifying the polychotomous treatment choice problem with multiple binary-choice rules on partial observations; however, even that framework may still be sensitive to the IIA problem. More importantly, similar to Dubin and McFadden (1984), the key issue in Hay’s approach could be multicollinearity arising from implementation in counterfactual analysis since, for each treatment choice j, there are M − 1 correction terms. Barrios (2004) provides a generalized sample selection bias correction method under every random utility maximization compatible specification for the selected sample using a mixed logit selection equation (Barrios 2004). Although Barrios’ approach relaxes the IIA assumption by the mixed logit selection specification, it requires much richer data as is generally obtained from a choice experiment on drug attributes. As compared to the control function based approach in this paper, recent advances in treatment effect estimation using the “Local Instrumental Variables” framework proposed by Heckman and colleagues, make significantly fewer assumptions about identification and functional form requirements, but do require the identification at infinity argument in the case of polychotomous treatment choices (Heckman and Urzua 2010; Heckman and Navarro-Lozano 2004).

Some limitations apply to the secondary claims data used in this study. Claims data collected for administrative purposes may contain errors or omissions in coding which could lead to an incomplete or biased assessment of costs. Furthermore, we did not discount the drug costs reported in the claims data to reflect the various rebates and discounts paid by pharmaceutical manufacturers. Hence, these costs may represent an upper bound of true costs to the payer. Lastly, in this dataset, we did not have the ability to quantify the potential health related quality of life and productivity benefits of biologic DMARD treatments. From a societal perspective, these non-monetary gains could more than offset the incremental costs associated with these therapies.

4.2 Conclusion

Our results highlight important sources of bias in the estimation of comparative effectiveness. In particular, they illustrate the need to control for time-varying selection bias and heterogeneity of treatment effects in panel data models. The large differences among the ATE, ATT, and ATU are a convincing reason not to assume homogeneous treatment effects. Sorting on gains is an important source of bias in comparative effectiveness studies involving medical outcomes. In the presence of heterogeneity and multiple treatments, the specification of the latent index model should be chosen carefully, along with selection bias correction techniques appropriate to that latent index model.

These issues have an important impact on policy. Under one set of assumptions (e.g. MNL_DMF), we may accept a formulary expansion policy on biologic DMARDs to be cost-neutral, while rejecting the same policy as not cost-saving under another set of assumptions (e.g. MNP_IMR). Models need to be realistic to mimic contemporary clinical decisions in order to be helpful in policy decision-making. We have shown that the panel data correlated random coefficients model with selection-bias correction is a practical and realistic tool to assess polychotomous treatment effects in non-experimental studies.