Instrumental variable estimation of causal effects with applying some model selection procedures under binary outcomes

Orihara, Shunichiro; Goto, Atsushi; Taguri, Masataka

doi:10.1007/s41237-022-00177-9


Original Paper
Published: 09 July 2022

Volume 50, pages 241–262, (2023)

Behaviormetrika


Abstract

In observational studies, unmeasured covariates are an important problem. In the presence of some unmeasured covariates, some instrumental variable methods, such as the two-stage residual inclusion (2SRI) estimator or limited-information maximum likelihood (LIML) estimator, can still obtain an unbiased estimate for causal effects despite the existence of nonlinear models, such as logistic regression and probit models. However, not only a correct outcome model but also a correct treatment model needs to be specified. Therefore, it is important to identify the correct models. In this paper, we consider model selection procedures for 2SRI and LIML, and confirm their properties through simulation and real datasets. Specifically, we confirm the model selection procedures can detect the correct treatment and outcome models, and unbiased causal effects can be estimated. The model selection properties are confirmed through simulation datasets and GENEVA Diabetes Study datasets. From the simulation and data analysis results, we recommend that LIML with any model selection procedures is a good choice when there are binary outcomes and any concerns about unmeasured covariates.


1 Introduction

Observational studies are usually concerned with estimating the causal effects of treatments on outcomes. When all covariates or confounders (hereafter referred to as “covariates”) are observed, they can be adjusted for and an unbiased estimator of the causal effects can be obtained; this is the case of “no unmeasured confounding” (cf. Hernán and Robins 2020). No unmeasured confounding is a sufficient assumption for obtaining an unbiased estimator of causal effects. However, there is a serious risk of obtaining biased estimates of causal effects unless the covariates are adjusted for appropriately. When some covariates are not observed (i.e., there are unmeasured covariates), an unbiased estimator usually cannot be obtained. Unmeasured covariates are an important problem in causal inference because the no-unmeasured-confounding assumption no longer holds. Therefore, different estimation methods should be applied.

In this study, the focus is on instrumental variable (IV) methods. The two-stage least squares (2SLS) estimator is one of the most important two-step procedures for estimating causal effects with IVs when there are unmeasured covariates (Wooldridge 2010). 2SLS is useful, but it requires the key assumption that there is a linear relationship between the treatment variable and the outcome variable. If this assumption is violated, a biased estimate of the causal effect of interest may be obtained. Terza et al. (2008) introduced a two-stage residual inclusion (2SRI) estimator, which is similar to the control function approach (Wooldridge 2010). 2SRI is another two-step procedure that extends to nonlinear models, such as logistic regression and probit models, so that an unbiased estimate of the causal effect can be obtained even under nonlinear models. Although 2SRI overcomes this problem of 2SLS, it may still yield biased estimates of causal effects, as mentioned in Basu et al. (2017) and Wan et al. (2018). According to the simulation results of Basu et al. (2017), a full-likelihood approach yields more accurate estimates than 2SRI (see also Section 5 of Burgess et al. 2017). Therefore, in this manuscript, a limited-information maximum likelihood (LIML) estimator (Wooldridge 2014) is also considered. The LIML estimator uses a full-likelihood approach, but has features similar to those of 2SRI and the control function approach. Both 2SRI and LIML can be used for nonlinear models; however, not only the correct outcome model but also the correct treatment model needs to be specified (Basu et al. 2017). Therefore, detecting the correct models is an important process when using 2SRI and LIML.

In model selection, information criteria are commonly used to select the “correct” model. The Akaike information criterion (AIC; Akaike 1974) is the best-known information criterion for selecting the best model for predicting future outcomes. The Bayesian information criterion (BIC) proposed by Schwarz (1978) is another well-known information criterion; it has model selection consistency (i.e., it selects the correct model with probability tending to 1 as the sample size increases) under certain assumptions (Nishii 1984; Shao 1997). In the field of causal inference, some previous studies on model selection exist (Brookhart and van der Laan 2006; Vansteelandt et al. 2012; Taguri et al. 2014). Although the procedures considered vary among the studies, the motivation is the same: to estimate unbiased causal effects, a valid model needs to be considered. However, to the best of our knowledge, there are no previous reports on model selection procedures for 2SRI and LIML.

In this paper, we consider model selection procedures for 2SRI and LIML, and confirm their properties through simulation and real datasets. Specifically, we confirm that the model selection procedures can detect the correct treatment and outcome models, and that unbiased causal effects can be estimated. Since previous studies have considered model selection procedures neither for 2SRI (the control function approach) nor for LIML in this context, the contribution of this study may be considered significant for these estimation procedures. In Sect. 2, a motivating example is introduced and the model considered in this study is presented. Two situations are considered: continuous and dichotomous treatments. In addition, we introduce AIC-type and BIC-type information criteria. In Sect. 3, the properties of 2SRI and LIML with model selection are confirmed using simulation datasets. In the simulation, we consider a case in which the distribution of unmeasured covariates is correctly specified. In Sect. 4, data analysis is performed using the GENEVA Diabetes Study dataset. Supplementary information on the simulations and the GENEVA Diabetes Study datasets is found in the Appendix. Some calculations and supplemental simulations are found in the Web Appendix.

2 Motivation example and IV methods

First, a motivating example is introduced: the GENEVA Diabetes Study datasets, which store subjects’ demographic information (phenotype), genetic information (genotype), and outcomes (presence or absence of diabetes). In this study, the causal effect of body mass index (BMI) on the incidence of diabetes is investigated. As is well known, diabetes affects several parts of the body, such as the eyes, kidneys, and heart. There are more than 400 million diabetic patients worldwide (Cheng et al. 2019). In addition, BMI and the incidence of diabetes have a positive relationship, such that a high BMI increases the likelihood of developing diabetes. To estimate the causal effect correctly when the datasets are derived from observational studies, the covariates need to be adjusted for, regardless of whether they are observed. Cheng et al. (2019) and Richardson et al. (2020) used an instrumental variable approach with genetic information as the instrumental variables. This analysis strategy is called “Mendelian randomization” (Burgess et al. 2017). In this study, Mendelian randomization was also conducted using the genetic information included in the GENEVA Diabetes Study datasets.

Herein, a more general formulation is considered. Let $n$ be the sample size and assume that $i = 1,\ 2,\dots , n$ are i.i.d. samples. $\varvec{X}\in {\mathbb {R}}^{p}$ and $\varvec{Z}\in {\mathbb {R}}^{K}$ denote vectors of covariates and IVs, respectively. The following relationship is assumed for the unmeasured variables:

$$\begin{aligned} \left( \begin{array}{c} V\\ U \end{array} \right) \sim F(v,u;\xi ),\ \ \left( \begin{array}{c} V\\ U \end{array} \right) \mathop {\perp \!\!\!\!\!\perp }\left( \begin{array}{c} \varvec{X}\\ \varvec{Z} \end{array} \right) , \end{aligned}$$

(2.1)

where $\xi$ is the parameter of the joint distribution $F$ of $(V,U)\in {\mathbb {R}}^{2}$; $(V,U)$ are referred to as “unmeasured covariates” in this study. These assumptions are similar to those of Wooldridge (2014); (2.1) suggests a LIML estimation procedure. Next, the models considered in this paper are introduced.

Treatment model (continuous treatment)
$$\begin{aligned} W=\varphi _{1}(\varvec{Z},\varvec{X};\varvec{\alpha })+V \end{aligned}$$
(2.2)
Treatment model (dichotomous treatment)
$$\begin{aligned} W=\varvec{1}\left\{ \varphi _{1}(\varvec{Z},\varvec{X};\varvec{\alpha })+V\ge 0\right\} \end{aligned}$$
Outcome model
$$\begin{aligned} Y=\varvec{1}\left\{ \varphi _{2}(W,\varvec{X};\varvec{\beta })+U\ge 0\right\} , \end{aligned}$$
(2.3)
where $\varphi _{1}$ and $\varphi _{2}$ are twice differentiable predictors with respect to parameters $\varvec{\alpha }$ and $\varvec{\beta }$, respectively. For instance, $\varphi _{1}$ and $\varphi _{2}$ can be selected as a linear model:
$$\begin{aligned} \varphi _{1}(\varvec{Z},\varvec{X};\varvec{\alpha })=\varvec{Z}^{\top }\varvec{\alpha }_{z}+\varvec{X}^{\top }\varvec{\alpha }_{x},\ \ \ \varphi _{2}(W,\varvec{X};\varvec{\beta })=W\beta _{w}+\varvec{X}^{\top }\varvec{\beta }_{x}. \end{aligned}$$
In addition, the parameter spaces of $\varvec{\alpha }$ and $\varvec{\beta }$ are denoted by $\Theta _{\varvec{\alpha }}$ and $\Theta _{\varvec{\beta }}$, respectively.

An explanation of the above variables using directed acyclic graphs (DAGs; cf. Hernán and Robins 2020) is given in Web Appendix A.

The pair of models (2.2) and (2.3) is called the Rivers-Vuong model (RV model; Rivers and Vuong 1988), where

$$\begin{aligned} F(v,u;\xi )=N_{2}\left( \varvec{0}_{2}, \left( \begin{array}{cc} \sigma _{v}^2&{}\rho \sigma _{v}\\ \rho \sigma _{v}&{}1 \end{array} \right) \right) ,\ \ \xi =(\sigma _{v},\rho )^{\top },\, \sigma _{v}>0,\ \rho \ne 0. \end{aligned}$$

Under the RV model, (2.3) becomes a probit model. Note that the following discussion is not limited to the above treatment and outcome models but applies to other parametric models as well. Note also that the IVs $\varvec{Z}$ satisfy three IV conditions (see Baiocchi et al. 2014): (1) a causal association with the treatment variable $W$, (2) no association with the unobserved variables ($V$, $U$), and (3) no direct causal association with the outcome variable $Y$. The first and third conditions are expressed by the above treatment and outcome models, respectively. The second condition is expressed by (2.1).

To estimate the parameters $\varvec{\theta }=\left( \varvec{\alpha }^{\top },\varvec{\beta }^{\top },\varvec{\xi }^{\top }\right) ^{\top }\in \Theta =\Theta _{\varvec{\alpha }}\times \Theta _{\varvec{\beta }}\times \Theta _{\varvec{\xi }}$, two IV estimators are introduced: a 2SRI estimator and a LIML estimator.

2.1 Two-stage residual inclusion

The 2SRI estimator estimates the causal effects in two steps. In the first step, the treatment variable is regressed on the instrumental variables and covariates to construct the residuals of the treatment variable. Specifically, under (2.2) and (2.3), consider the ordinary least squares estimator of $\varvec{\alpha }$:

$$\begin{aligned} \hat{\varvec{\alpha }}=\mathop {\mathrm{arg~min}}\limits _{\varvec{\alpha }}\sum _{i=1}^{n}\left( w_{i}-\varphi _{1}(\varvec{z}_{i},\varvec{x}_{i};\varvec{\alpha })\right) ^2. \end{aligned}$$

For each predictor $\varphi _{1}(\varvec{Z}_{i},\varvec{X}_{i};\hat{\varvec{\alpha }})$, the residuals are derived:

$$\begin{aligned} v_{i}(\hat{\varvec{\alpha }})=w_{i}-\varphi _{1}(\varvec{z}_{i},\varvec{x}_{i};\hat{\varvec{\alpha }}). \end{aligned}$$

(2.4)

In the second step, the outcome variable is regressed not only on the treatment variable but also on the residuals from the first step; that is, the residuals are plugged into (2.3). For instance, when $U$ follows a logistic distribution, the outcome model becomes a logistic regression model:

$$\begin{aligned} p_{i}(\varvec{\beta },\gamma )=expit\left\{ \varphi _{2}(w_{i},\varvec{x}_{i};\varvec{\beta })+v_{i}(\hat{\varvec{\alpha }})\gamma \right\} . \end{aligned}$$

(2.5)

In contrast, when $U$ follows a normal distribution, the outcome model becomes a probit model:

$$\begin{aligned} p_{i}(\varvec{\beta },\gamma )=\Phi \left( \varphi _{2}(w_{i},\varvec{x}_{i};\varvec{\beta })+v_{i}(\hat{\varvec{\alpha }})\gamma \right) . \end{aligned}$$

(2.6)

Under (2.5) or (2.6), the maximum likelihood estimator of $(\varvec{\beta },\gamma )$ is considered:

$$\begin{aligned} \left( \begin{array}{c} \hat{\varvec{\beta }}\\ {\hat{\gamma }} \end{array} \right)&=\mathop {\mathrm{arg~max}}\limits _{\varvec{\beta },\, \gamma }\ \log \left[ \prod _{i=1}^{n}p_{i}(\varvec{\beta },\gamma )^{y_{i}}\left( 1-p_{i}(\varvec{\beta },\gamma )\right) ^{1-y_{i}}\right] \nonumber \\&=\mathop {\mathrm{arg~max}}\limits _{\varvec{\beta },\, \gamma }\ \ell _{2SRI}(\varvec{\beta },\gamma ). \end{aligned}$$

(2.7)

Through the above procedure, the 2SRI estimator $\hat{\varvec{\beta }}$ is obtained. Note that for dichotomous treatment the residuals are more complicated than (2.4) (Tchetgen et al. 2015); this is a drawback of 2SRI. As mentioned below, LIML does not require residuals to be constructed; only the full likelihood is necessary.
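The two steps above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions that are not taken from the paper: a single binary instrument, no measured covariates, a probit second stage as in (2.6), and toy data-generating values. The helper names (`fit_probit`, `two_stage_residual_inclusion`, `simulate`) are hypothetical, and the probit fit uses a hand-written Fisher scoring loop rather than any particular statistics library.

```python
import math
import random

def norm_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_probit(rows, y, iters=25):
    """Probit maximum likelihood via Fisher scoring; returns the coefficients."""
    k = len(rows[0])
    coef = [0.0] * k
    for _ in range(iters):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(rows, y):
            eta = sum(c * x for c, x in zip(coef, xi))
            p = min(max(norm_cdf(eta), 1e-10), 1.0 - 1e-10)
            d = norm_pdf(eta)
            for a in range(k):
                score[a] += xi[a] * d * (yi - p) / (p * (1.0 - p))
                for b in range(k):
                    info[a][b] += xi[a] * xi[b] * d * d / (p * (1.0 - p))
        coef = [c + s for c, s in zip(coef, solve(info, score))]
    return coef

def two_stage_residual_inclusion(z, w, y):
    """Step 1: least squares of w on (1, z). Step 2: probit of y on (1, w, residual)."""
    n = len(w)
    z_bar, w_bar = sum(z) / n, sum(w) / n
    slope = (sum((zi - z_bar) * (wi - w_bar) for zi, wi in zip(z, w))
             / sum((zi - z_bar) ** 2 for zi in z))
    intercept = w_bar - slope * z_bar
    resid = [wi - intercept - slope * zi for zi, wi in zip(z, w)]  # residuals as in (2.4)
    rows = [[1.0, wi, vi] for wi, vi in zip(w, resid)]
    return fit_probit(rows, y)  # [intercept, coefficient of w, gamma]

def simulate(n, rho=0.3, seed=0):
    """Toy data (assumed values): W = Z + V, Y = 1{-0.3 + 0.6 W + U >= 0}, corr(V, U) = rho."""
    rng = random.Random(seed)
    z, w, y = [], [], []
    for _ in range(n):
        zi = 1.0 if rng.random() < 0.5 else 0.0
        v = rng.gauss(0.0, 1.0)
        u = rho * v + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        wi = zi + v
        z.append(zi)
        w.append(wi)
        y.append(1.0 if -0.3 + 0.6 * wi + u >= 0.0 else 0.0)
    return z, w, y

z, w, y = simulate(3000)
b0, b_w, gamma = two_stage_residual_inclusion(z, w, y)
print(b_w, gamma)
```

In this toy setting with normal errors, the second-stage probit index has the scaled form that appears in (2.9), so the fitted coefficient of $W$ targets $0.6/\sqrt{1-\rho ^2}$ rather than 0.6 itself. With measured covariates, the first stage would be a multiple regression and the covariates would also enter the second-stage design rows.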

To examine the performance of model selection procedures in Section 3, the following AIC and BIC are considered:

$$\begin{aligned} AIC_{2SRI}&=-2\ell _{2SRI}(\hat{\varvec{\beta }},{\hat{\gamma }})+2(|\hat{\varvec{\beta }}|+1)\\ BIC_{2SRI}&=-2\ell _{2SRI}(\hat{\varvec{\beta }},{\hat{\gamma }})+(|\hat{\varvec{\beta }}|+1)\log (n), \end{aligned}$$

where $|\cdot |$ denotes the number of elements. Note that these criteria are applied to select an outcome model without accounting for the fact that $v$ is a predicted residual (i.e., it is handled in the same way as the other covariates).
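As a small illustration of how these criteria could be computed and used to pick among candidate outcome models, consider the sketch below. The function names and the log-likelihood values are hypothetical and chosen only to show the mechanics; for 2SRI the parameter count passed in would be $|\hat{\varvec{\beta }}|+1$, counting $\gamma$.

```python
import math

def aic(loglik, n_params):
    """AIC = -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """BIC = -2 * log-likelihood + (number of parameters) * log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def select_model(candidates, criterion):
    """candidates maps a model name to (loglik, n_params); the smallest criterion wins."""
    return min(candidates, key=lambda name: criterion(*candidates[name]))

# Made-up maximized log-likelihoods for two candidate outcome models.
candidates = {"small": (-120.0, 3), "large": (-116.5, 5)}

print(select_model(candidates, aic))                          # AIC picks "large"
print(select_model(candidates, lambda ll, k: bic(ll, k, 300)))  # BIC picks "small"
```

With these made-up values the two criteria disagree, because the BIC penalty $\log (n)$ per parameter exceeds the AIC penalty of 2 once $n$ is larger than about 7.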

2.2 Limited-information maximum likelihood

Let us consider the likelihood function $L_{LIML}(\varvec{\theta })=\prod _{i=1}^{n}L_{LIML,i}(\varvec{\theta })$ conditioning on $\varvec{z}$ and $\varvec{x}$:

$$\begin{aligned} L_{LIML}(\varvec{\theta })=\prod _{i=1}^{n}f(y_{i},w_{i}|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta })=\prod _{i=1}^{n}{\mathrm{P}}(y_{i}|w_{i},\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta })f(w_{i}|\varvec{z}_{i},\varvec{x}_{i};\varvec{\alpha }). \end{aligned}$$

(2.8)

In the following, the specific form of the likelihood for the two cases is explicitly defined. In the case of the Rivers-Vuong model, (2.8) becomes

$$\begin{aligned} L_{LIML}(\varvec{\theta })=\prod _{i=1}^{n}\Phi \left( \frac{\varphi _{i2}(\varvec{\beta })+\rho v_{i}(\varvec{\alpha })}{\sqrt{1-\rho ^2}}\right) ^{y_{i}}\left( 1-\Phi \left( \frac{\varphi _{i2}(\varvec{\beta }) +\rho v_{i}(\varvec{\alpha })}{\sqrt{1-\rho ^2}}\right) \right) ^{1-y_{i}}\frac{1}{\sqrt{2\pi \sigma _{v}^2}}\exp \left\{ -\frac{v_{i}^2(\varvec{\alpha })}{2\sigma _{v}^2}\right\} \end{aligned}$$

(2.9)

(see Web Appendix B.1). Therefore, the log-likelihood $\ell _{LIML}(\varvec{\theta })=\log L_{LIML}(\varvec{\theta })$ becomes:

$$\begin{aligned} \ell _{LIML}(\varvec{\theta })&=\sum _{i=1}^{n}\left\{ y_{i}\log \Phi \left( \frac{\varphi _{i2}(\varvec{\beta })+\rho v_{i}(\varvec{\alpha })}{\sqrt{1-\rho ^2}}\right) +(1-y_{i})\log \left( 1-\Phi \left( \frac{\varphi _{i2}(\varvec{\beta })+\rho v_{i}(\varvec{\alpha })}{\sqrt{1-\rho ^2}}\right) \right) \right. \\&\quad \left. -\frac{v_{i}^2(\varvec{\alpha })}{2\sigma _{v}^2}-\log \left( \sqrt{2\pi \sigma _{v}^2}\right) \right\} . \end{aligned}$$
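The log-likelihood above can be evaluated directly once the predictors are computed. The sketch below is one possible transcription, assuming the values $\varphi _{i1}(\varvec{\alpha })$ and $\varphi _{i2}(\varvec{\beta })$ have already been evaluated for each subject; the function name `liml_loglik_rv` and its argument layout are hypothetical. Maximizing it over $\varvec{\theta }$ would require a numerical optimizer, which is not shown.

```python
import math

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def liml_loglik_rv(y, w, phi1, phi2, rho, sigma_v):
    """Log-likelihood of the Rivers-Vuong model for continuous treatment.

    y, w       : observed binary outcomes and continuous treatments
    phi1, phi2 : evaluated predictors phi_1(z_i, x_i; alpha) and phi_2(w_i, x_i; beta)
    rho        : correlation parameter, must satisfy |rho| < 1
    sigma_v    : standard deviation of V, must be positive
    """
    total = 0.0
    for yi, wi, f1, f2 in zip(y, w, phi1, phi2):
        v = wi - f1                                  # residual v_i(alpha)
        index = (f2 + rho * v) / math.sqrt(1.0 - rho ** 2)
        p = min(max(norm_cdf(index), 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
        total += -v * v / (2.0 * sigma_v ** 2) - math.log(math.sqrt(2.0 * math.pi * sigma_v ** 2))
    return total

# One subject with zero predictors, zero residual, rho = 0 and sigma_v = 1:
# the contribution is log(0.5) - 0.5 * log(2 * pi).
print(liml_loglik_rv([1], [0.0], [0.0], [0.0], rho=0.0, sigma_v=1.0))
```

The clipping of the probability is a numerical safeguard added for this sketch and is not part of the model.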

For dichotomous treatment, (2.8) becomes

$$\begin{aligned} L_{LIML}(\varvec{\theta })&=\prod _{i=1}^{n}{\mathrm{P}}\left( y_{i}=1,w_{i}=1|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta }\right) ^{y_{i}w_{i}}{\mathrm{P}}\left( y_{i}=0,w_{i}=1|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta }\right) ^{(1-y_{i})w_{i}}\\&\times {\mathrm{P}}\left( y_{i}=1,w_{i}=0|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta }\right) ^{y_{i}(1-w_{i})}{\mathrm{P}}\left( y_{i}=0,w_{i}=0|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta }\right) ^{(1-y_{i})(1-w_{i})}. \end{aligned}$$

Therefore, the log-likelihood becomes

$$\begin{aligned} \ell _{LIML}(\varvec{\theta })&=\sum _{i=1}^{n}\left\{ y_{i}w_{i}\log \left\{ 1-F(\infty ,-\varphi _{i2}(\varvec{\beta });\xi )-F(-\varphi _{i1}(\varvec{\alpha }),\infty ;\xi )+F(-\varphi _{i1}(\varvec{\alpha }),-\varphi _{i2}(\varvec{\beta });\xi )\right\} \right. \nonumber \\&\quad +(1-y_{i})w_{i}\log \left\{ F(\infty ,-\varphi _{i2}(\varvec{\beta });\xi )-F(-\varphi _{i1}(\varvec{\alpha }),-\varphi _{i2}(\varvec{\beta });\xi )\right\} \nonumber \\&\quad +y_{i}(1-w_{i})\log \left\{ F(-\varphi _{i1}(\varvec{\alpha }),\infty ;\xi )-F(-\varphi _{i1}(\varvec{\alpha }),-\varphi _{i2}(\varvec{\beta });\xi )\right\} \nonumber \\&\quad +\left. (1-y_{i})(1-w_{i})\log \left\{ F(-\varphi _{i1}(\varvec{\alpha }),-\varphi _{i2}(\varvec{\beta });\xi )\right\} \right\} \end{aligned}$$

(2.10)

(see Web Appendix B.2), where

$$\begin{aligned} F(v,\infty ;\xi )=\lim _{u\rightarrow \infty }F(v,u;\xi ),\ \ F(\infty ,u;\xi )=\lim _{v\rightarrow \infty }F(v,u;\xi ). \end{aligned}$$
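The four probabilities inside (2.10) can be computed from any joint distribution function $F$ that accepts infinite arguments. The following sketch makes this explicit; the function names are hypothetical, and the joint distribution used in the check (independent logistic margins) is only a placeholder chosen because its cell probabilities are easy to verify by hand. It implies no dependence between $V$ and $U$, so it is not a realistic choice for the setting of this paper.

```python
import math

def logistic_cdf(t):
    """Numerically stable logistic CDF; also handles +inf and -inf."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def independent_logistic_joint(v, u):
    """Placeholder joint CDF F(v, u): independent logistic margins (assumption)."""
    return logistic_cdf(v) * logistic_cdf(u)

def cell_probabilities(F, phi1, phi2):
    """P(Y = y, W = w) for dichotomous treatment, keyed by (y, w), as in (2.10)."""
    a, b = -phi1, -phi2
    f_v = F(a, math.inf)    # F(-phi_1, inf): marginal of V
    f_u = F(math.inf, b)    # F(inf, -phi_2): marginal of U
    f_vu = F(a, b)
    return {
        (1, 1): 1.0 - f_u - f_v + f_vu,
        (0, 1): f_u - f_vu,
        (1, 0): f_v - f_vu,
        (0, 0): f_vu,
    }

def loglik_contribution(F, y, w, phi1, phi2):
    """Log-likelihood contribution of one subject with observed (y, w)."""
    return math.log(cell_probabilities(F, phi1, phi2)[(y, w)])

probs = cell_probabilities(independent_logistic_joint, phi1=0.4, phi2=-0.2)
print(sum(probs.values()))
```

Replacing `independent_logistic_joint` with a bivariate normal or copula-based distribution function gives the corresponding LIML likelihood for dichotomous treatment.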

By maximizing the likelihood (2.8), a limited-information maximum likelihood estimator can be derived as

$$\begin{aligned} \hat{\varvec{\theta }}=\mathop {\mathrm{arg~max}}\limits _{\varvec{\theta }\in \Theta }\ \ell _{LIML}(\varvec{\theta }). \end{aligned}$$

(2.11)

Note that the joint distribution $F(v,u;\xi )$ has to be specified when using LIML. However, the choice of distribution is somewhat flexible; for instance, parametric copulas can be selected (e.g., Biller and Corlu 2012; Fantazzini 2009). When the marginal distributions of $V$ and $U$ are assumed to be the logistic distributions $F^{logis}_{V}(v)$ and $F^{logis}_{U}(u)$, respectively, and a parametric copula $C(\cdot ,\cdot ;\xi )$, such as the t-copula or Clayton copula, is assumed, the joint distribution becomes

$$\begin{aligned} F(v,u;\xi )=C(F^{logis}_{V}(v),F^{logis}_{U}(u);\xi ). \end{aligned}$$
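As one concrete instance of this construction, the sketch below combines standard logistic margins with the Clayton copula $C(a,b;\xi )=(a^{-\xi }+b^{-\xi }-1)^{-1/\xi }$ for $\xi >0$. The function names are hypothetical, and using standard (location 0, scale 1) logistic margins is an assumption made for brevity.

```python
import math

def logistic_cdf(t):
    """Numerically stable standard logistic CDF; also handles +inf and -inf."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def clayton_copula(a, b, xi):
    """Clayton copula C(a, b; xi) = (a^-xi + b^-xi - 1)^(-1/xi), for xi > 0."""
    if xi <= 0:
        raise ValueError("the Clayton parameter xi must be positive in this sketch")
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return (a ** (-xi) + b ** (-xi) - 1.0) ** (-1.0 / xi)

def joint_cdf(v, u, xi):
    """F(v, u; xi) = C(F_V(v), F_U(u); xi) with standard logistic margins."""
    return clayton_copula(logistic_cdf(v), logistic_cdf(u), xi)

# Letting v go to infinity recovers the marginal distribution of U.
print(joint_cdf(math.inf, 0.3, xi=2.0), logistic_cdf(0.3))
```

Arguments very far in the lower tail can overflow in the power computation; a production implementation would work on the log scale.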

To examine the performance of model selection procedures in Section 3, the following AIC and BIC are considered:

$$\begin{aligned} AIC_{LIML}&=-2\ell _{LIML}(\hat{\varvec{\theta }})+2|\hat{\varvec{\theta }}|\\ BIC_{LIML}&=-2\ell _{LIML}(\hat{\varvec{\theta }})+|\hat{\varvec{\theta }}|\log (n). \end{aligned}$$

2.3 Interpretation under potential outcomes

2SRI and LIML can be used to estimate the average treatment effect (ATE) (Rosenbaum and Rubin 1983). To estimate the ATE with these methods, G-computation (e.g., Hernán and Robins 2020) can be applied:

1. By solving (2.7) or (2.11), the 2SRI or LIML estimates of $\varvec{\beta }$ are obtained.
2. To estimate the probability of the outcome under a particular treatment value (written as $w'$), the predicted probabilities are averaged over all subjects; for instance, when $U$ follows a normal distribution and the outcome model is a probit model:
$$\begin{aligned} \hat{\mathrm{P}}\left( Y_{w'}=1\right) =\frac{1}{n}\sum _{i=1}^{n}\hat{\mathrm{P}}\left( Y=1|w',\varvec{x}_{i}\right) =\frac{1}{n}\sum _{i=1}^{n}\Phi \left( \varphi _{2}\left( w',\varvec{x}_{i};\hat{\varvec{\beta }}\right) \right) , \end{aligned}$$
where $Y_{w'}$ corresponds to the potential outcome under treatment $w'$, and $\varvec{x}_{i}$ are the observed covariates. Regarding 2SRI, $\varvec{x}_{i}$ also includes the residual term of the 1st stage model.

From the above steps, ATE is estimated:

$$\begin{aligned} {\hat{\mathrm{E}}}[Y_{w'}]-{\hat{\mathrm{E}}}[Y_{w''}]={\hat{\mathrm{P}}}\left( Y_{w'}=1\right) -\hat{\mathrm{P}}\left( Y_{w''}=1\right) . \end{aligned}$$
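The G-computation steps above amount to averaging predicted probabilities at two fixed treatment values and taking the difference. A minimal sketch for the probit case with a linear predictor follows; the function names, the explicit intercept, and the coefficient layout `(b0, b_w, b_x...)` are assumptions of this illustration. For 2SRI, each covariate row would also carry the first-stage residual, with $\hat{\gamma }$ appended to the coefficients.

```python
import math

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def linear_predictor(w, x, beta):
    """phi_2(w, x; beta) = b0 + b_w * w + x' b_x, with beta = (b0, b_w, b_x...)."""
    b0, b_w, b_x = beta[0], beta[1], beta[2:]
    return b0 + b_w * w + sum(b * xj for b, xj in zip(b_x, x))

def potential_outcome_prob(w_value, covariates, beta):
    """Estimate P(Y_w = 1) by averaging Phi(phi_2(w, x_i; beta)) over subjects."""
    probs = [norm_cdf(linear_predictor(w_value, x, beta)) for x in covariates]
    return sum(probs) / len(probs)

def average_treatment_effect(w1, w0, covariates, beta):
    """Estimated ATE: P(Y_{w1} = 1) - P(Y_{w0} = 1)."""
    return (potential_outcome_prob(w1, covariates, beta)
            - potential_outcome_prob(w0, covariates, beta))

# Made-up coefficients and covariate rows, for illustration only.
beta_hat = [0.5, 0.6, 0.5, 0.5]
x_rows = [[0.2, 1.0], [-1.1, 0.0], [0.7, 1.0]]
print(average_treatment_effect(1.0, 0.0, x_rows, beta_hat))
```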

3 Simulations

In this section, the properties of model selection procedures and parameter estimates of 2SRI and LIML are confirmed. Because no previous studies have considered model selection procedures for 2SRI (or the control function approach) and LIML, our simulation results may provide some guidance for using these estimation procedures. To confirm these properties, (1) the number of times the true model was selected for each procedure and the corresponding proportions were determined, and (2) descriptive statistics of estimates for each procedure were calculated. The number of iterations for the simulations was 1000.

3.1 Continuous treatment and normal unmeasured covariates

The Rivers–Vuong model was considered. In this setting, it was confirmed that LIML and 2SRI estimator with model selection perform well. Since we can apply 2SLS under this situation, the results of the 2SLS are summarized as well for reference. The simulation settings were as follows:

Covariates: $X_{1}\sim N(0,1),\, X_{2}\sim Ber(0.5),\, X_{3}\sim N(0,1)$
An instrumental variable: $Z\sim Ber(0.5)$
Unmeasured covariates: $\left( \begin{array}{c} V\\ U \end{array} \right) \sim N\left( \varvec{0}_{2},\left( \begin{array}{cc} 1&{}\rho \\ \rho &{}1 \end{array} \right) \right)$
- Weak correlation: $\rho =0.3$
- Strong correlation: $\rho =0.6$
A treatment model: $W=1+\alpha _{z}Z+X_{2}+X_{3}+V$
- Weak instrumental variable: $\alpha _{z}=0.2$ $\Rightarrow$ The correlation between a treatment and IV is approximately 0.06.
- Strong instrumental variable: $\alpha _{z}=1$ $\Rightarrow$ The correlation between a treatment and an IV is approximately 0.3.
An outcome model: $Y=\varvec{1}\left\{ 0.5+0.6W+0.5X_{1}+0.5X_{2}+U\ge 0 \right\}$
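The settings listed above can be turned into a data generator as follows. This is a sketch of one way to draw a dataset, not the authors' simulation code; the function names and the returned dictionary layout are hypothetical. The unmeasured $V$ and $U$ are returned only so that the generated correlation can be checked.

```python
import math
import random

def simulate_setting(n, alpha_z=1.0, rho=0.3, seed=1):
    """Draw n subjects from the continuous-treatment setting of Sect. 3.1."""
    rng = random.Random(seed)
    keys = ("x1", "x2", "x3", "z", "v", "u", "w", "y")
    data = {k: [] for k in keys}
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 1.0 if rng.random() < 0.5 else 0.0
        x3 = rng.gauss(0.0, 1.0)
        z = 1.0 if rng.random() < 0.5 else 0.0
        # (V, U) bivariate normal with unit variances and correlation rho
        v = rng.gauss(0.0, 1.0)
        u = rho * v + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        w = 1.0 + alpha_z * z + x2 + x3 + v
        y = 1 if 0.5 + 0.6 * w + 0.5 * x1 + 0.5 * x2 + u >= 0.0 else 0
        for k, val in zip(keys, (x1, x2, x3, z, v, u, w, y)):
            data[k].append(val)
    return data

def correlation(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

strong = simulate_setting(20000, alpha_z=1.0, rho=0.3)
weak = simulate_setting(20000, alpha_z=0.2, rho=0.3)
print(correlation(strong["w"], strong["z"]), correlation(weak["w"], weak["z"]))
```

Printing the sample correlation between $W$ and $Z$ for the two values of $\alpha _{z}$ gives a quick check of how strong the instrument is in each setting.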

To select a treatment model and outcome model, candidate models were prepared. The supplemental information is provided in Appendix A.

The simulation results are summarized in tables and supplemental figures. The results of model selection are summarized in Table 1, where “2SRI: AIC” and “2SRI: BIC” denote 2SRI with each model selection procedure, and “LIML: AIC” and “LIML: BIC” denote LIML with each model selection procedure. The column “True model” shows the number of times the selected model was the true model (i.e., the pair of models a4 and b2; see Appendix). The column “Including true model” shows the number of times the selected model was the true model or a larger model (i.e., not a misspecified model; see Appendix). The column “Both true model” shows the number of times the 2SRI estimator selected the true model in both the first and second steps. Throughout the simulations, “(1) Weak correlation and Strong IV” was used as the reference setting. In terms of the selection probabilities of the “True model,” BIC did not display a high probability for small samples ($N=100$); however, it was the best of the three selection procedures in all cases for large samples ($N=300$). This result agrees with previous theoretical results (model selection consistency; see Nishii 1984 and Shao 1997). For the selection probabilities of “Including true model,” both 2SRI and LIML displayed high probabilities even for small samples. This also matches the known feature of AIC. Regarding 2SRI, the selection probabilities of “Both true model” were also high. Settings (2) and (3) correspond to the weak IV situation and the strongly correlated unmeasured covariates situation, and are labelled “(2) Weak correlation and Weak IV” and “(3) Strong correlation and Strong IV,” respectively. In these settings, the selection probabilities of both “True model” and “Including true model” are somewhat different from those in (1); however, there is no remarkable difference as far as the model selection results alone are concerned. Therefore, model selection gives stable results regardless of the situation and the model selection procedure.

Table 1 Summary of the results of model selection for each estimator (continuous treatment and normal unmeasured covariates)


The estimated coefficients of treatment $W$ in the outcome model are summarized in Table 2 and Figs. 1 and 2. In the table, “2SLS” is 2SLS (without model selection), “2SRI: Full model” is 2SRI with the largest model among the candidates, and “LIML: Full model” is LIML with the largest model among the candidates; in the figures, the red line denotes the true value.

With model selection vs. without model selection: Comparing the results with and without model selection procedures, the estimates with model selection procedures appear to be more efficient and less biased, especially for large samples. In particular, the results of 2SRI without model selection are unstable. These results indicate that using a model selection procedure is important for estimating the causal effects correctly.
2SRI vs. LIML: In (1), for both the small and large sample cases, the LIML estimator with BIC gave the most efficient result among the three results with model selection procedures. The LIML estimator with AIC also worked well in terms of bias. The 2SRI estimator displayed a large bias and low efficiency; however, both improved for large samples. In (2), surprisingly, the LIML estimator yielded more accurate and less biased results than in (1) for small samples. As mentioned in Burgess et al. (2017), the LIML estimator is more robust than other well-known IV methods under weak IV situations. Nevertheless, the simulation result is notable because the weak IV results are more accurate than those in (1). Table 1 shows that the model selection probabilities of both “True model” and “Including true model” are smaller than in the other situations; simpler and more accurate models tend to be selected over the true model. Therefore, the LIML estimator can estimate the causal effects correctly using model selection procedures. In contrast, the 2SRI estimator suffers from the weak IV problem (cf. Burgess et al. 2017). In (3), the LIML estimator yielded results similar to those in (1). However, the 2SRI estimator, both with and without model selection, displayed a large bias and low efficiency for small samples. For large samples, the efficiency improved somewhat; however, one important point is that 2SRI is still biased. These results are similar to those reported by Basu et al. (2017) and Wan et al. (2018). The bias arises from the construction of 2SRI, in which the residual term included in the second step is treated as a fixed covariate (see also Sect. 2.1). In fact, this bias is present in all the 2SRI results. In particular, strongly correlated unmeasured covariates lead to a large bias (cf. Wan et al. 2018).
2SLS vs. 2SRI & LIML: The 2SLS method showed some bias and instability compared with the other methods. Since the relationship between treatment and outcome is not linear, this result is natural. Therefore, care is needed when using 2SLS.

Thus, LIML with model selection procedures is a good choice; however, the best model selection procedure depends on the specific case, as mentioned in the Introduction. Moreover, overcoming the weak IV problem is an advantage of LIML with model selection because other well-known IV methods do not have this feature. 2SRI with model selection can be used for “valid” causal relationships; however, biased or unstable estimates may be obtained when there are only weak IVs or when there are strong unobserved relationships between treatments and outcomes.

Table 2 Summary of descriptive statistics for each estimator (coefficients of W, continuous treatment and normal unmeasured covariates)


4 Data analysis

In this section, real data are analyzed using 2SRI and LIML with model selection procedures. The simulation results show that the difference between the AIC and BIC is small for large samples. Therefore, the AIC is used to select the valid model.

4.1 Analysis plan

The analysis follows the flow outlined below:

1. Identifying the genetic information used as instrumental variables. According to Cheng et al. (2019), there are 52 SNPs related to BMI; however, only 19 SNPs are included in the GENEVA Diabetes Study datasets. Since each SNP is a weak instrumental variable (i.e., it is weakly correlated with the treatment variable), as many SNPs as possible should be used to increase the efficiency of the estimation.
2. Identifying the risk factors related to the incidence of diabetes. According to Chen et al. (2018) and Narayan et al. (2007), age and sex are two important risk factors. In addition, both factors may have interaction effects on the incidence of diabetes. Therefore, candidate models with interaction terms must be included.
3. Selecting the “valid” model using the AIC and estimating the causal effect. To select a treatment model and an outcome model, candidate models were prepared and the AIC was used to select among them. Supplemental information is provided in Appendix B.

Age categorization was considered as follows:

Age categories: if $Age<50$, age was coded as “0”; if $50\le Age<60$, as “1”; if $60\le Age<70$, as “2”; otherwise, as “3”.
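The coding rule above is easy to get wrong at the boundaries, so a direct transcription may help. The function name `age_category` is hypothetical.

```python
def age_category(age):
    """Code age into the four categories used in the analysis."""
    if age < 50:
        return 0
    if age < 60:
        return 1
    if age < 70:
        return 2
    return 3

# Boundary ages fall into the upper category.
print([age_category(a) for a in (49, 50, 59, 60, 69, 70)])  # [0, 1, 1, 2, 2, 3]
```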

There were 5481 subjects with either demographic or genetic data. In this study, a complete case analysis was conducted; that is, only subjects with no missing data for sex, age, BMI, and the genetic data were included in the analysis. Consequently, 5036 subjects ($100\times 5036/5481=91.9\%$) were included in the analysis. Note that BMI and SNPs are treated as continuous variables in the following analysis.

4.2 Analysis results

First, the participants’ demographic data were confirmed. Table 3 summarizes the mean (SD) for continuous parameters and the number of subjects (%) for categorical parameters.

Table 3 Demographic data


The demographic data differed somewhat between the BMI categories; therefore, there are concerns regarding the confounding effects of age and sex. In addition, the incidence of diabetes differed between the BMI categories.

Table 4 Association of genetic variants with BMI and diabetes


Table 4 summarizes the associations of the genetic variants (SNPs) with BMI and diabetes. The correlation of each SNP (instrumental variable) with BMI (the treatment variable) was quite small, raising a concern about the weak IV problem, as expected.

The estimated causal effects are summarized in Table 5. The result of the logistic regression (“Naive,” without using SNPs) shows some obvious bias. The 2SRI estimate is somewhat questionable compared with the result of Hu et al. (2001) (risk ratio: 2.67). This is likely because the 2SRI estimates are unstable, as in the simulation results. The LIML result, in contrast, may be more plausible since LIML is less sensitive to the weak IV problem, as shown in the simulation results. The estimated causal effects for each sex are summarized in Table 6. Since the cohorts of males and females are different, this supplemental analysis was planned. Unfortunately, the estimates become more unstable because the sample sizes are smaller (male: 2366; female: 2670). The 2SRI results are also unstable. Nevertheless, the direction of the causal effect is the same as in the main result. Thus, the results in Table 6 suggest that the main result (Table 5) is reasonable.

Table 5 Summary of estimates of the causal effects (point estimates (95%CI))


Table 6 Summary of estimates of the causal effects by each sex (point estimates (95%CI))

From the viewpoint of model selection, “Model 51” was selected for both 2SRI and LIML (see also the Appendix). According to Chen et al. (2018), there are interactions between age and BMI; however, the interaction model was not selected. The results of Chen et al. (2018) suggest that younger subjects may display stronger interaction effects than older subjects, whereas our data included only subjects aged 40 years or older. This may be why the interaction term was not selected.

The above analyses have some limitations. First, as mentioned previously, only 19 SNPs were used in our data analysis, whereas Cheng et al. (2019) used 52 SNPs; thus, there may be some concerns about the weak IV problem. Second, the sample size of the dbGaP data was limited. To overcome the weak IV problem, a large sample size is necessary for Mendelian randomization (Burgess et al. 2017). Therefore, the derived estimates may be inefficient, and the results should be interpreted with care.

5 Conclusion and future work

In this study, a binary outcome model with unmeasured covariates was considered. Two-stage residual inclusion (2SRI) can be applied in this situation; however, it may yield biased estimates (Basu et al. 2017). Therefore, limited-information maximum likelihood (LIML), which has features similar to those of 2SRI, was also considered. Since model selection is important for estimating unbiased causal effects, the AIC and BIC for 2SRI and LIML were considered. In the simulation results, LIML with the AIC or BIC worked well compared with using full models when the distribution of the unmeasured covariate was specified correctly. In particular, robustness to the weak IV problem is an advantage of LIML. In contrast, 2SRI may yield biased or unstable estimates when there are only weak IVs or strong unobserved relationships between treatments and outcomes. 2SRI and LIML with the AIC were applied to the GENEVA Diabetes Study as a Mendelian randomization analysis. The results show that the causal effects are similar to those of previous research; however, there may be some concern about weak IV problems. From the above, we recommend LIML with any of the model selection procedures when there are binary outcomes and any concerns about unmeasured covariates.
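As a rough illustration of the 2SRI idea with an information criterion, the following sketch fits a linear first stage, carries the residual into a logistic second stage, and computes the AIC and BIC from the second-stage likelihood only. It is a minimal sketch under stated assumptions, not the estimator or the criterion as implemented in this paper: the function names `fit_logit` and `two_stage_residual_inclusion`, the data-generating values, and the choice to base the criteria on the second stage alone are all assumptions made for the example.

```python
import numpy as np


def fit_logit(D, y, n_iter=50):
    """Logistic regression by Newton-Raphson; returns (coefficients, log-likelihood)."""
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(D @ beta)))
        grad = D.T @ (y - p)
        hess = (D * (p * (1.0 - p))[:, None]).T @ D
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    eta = D @ beta
    loglik = float(np.sum(y * eta - np.logaddexp(0.0, eta)))
    return beta, loglik


def two_stage_residual_inclusion(y, w, Z, X):
    """2SRI sketch: OLS first stage, logistic second stage with the residual included."""
    n = len(y)
    ones = np.ones((n, 1))
    first = np.hstack([ones, Z, X])
    alpha, *_ = np.linalg.lstsq(first, w, rcond=None)
    v_hat = w - first @ alpha                       # first-stage residual
    second = np.hstack([ones, w[:, None], X, v_hat[:, None]])
    beta, loglik = fit_logit(second, y)
    k = second.shape[1]                             # second-stage parameters only (assumption)
    return {
        "beta_w": float(beta[1]),
        "aic": -2.0 * loglik + 2.0 * k,
        "bic": -2.0 * loglik + float(np.log(n)) * k,
    }


# Simulated data (all values hypothetical): v is an unmeasured confounder that
# raises the treatment w and lowers the outcome probability.
rng = np.random.default_rng(42)
n = 20000
z = rng.normal(size=n)
x = rng.normal(size=n)
v = rng.normal(size=n)
w = 1.0 * z + 0.5 * x + v
eta = -0.5 + 0.8 * w + 0.5 * x - 1.0 * v
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

tsri = two_stage_residual_inclusion(y, w, z[:, None], x[:, None])
naive_beta, _ = fit_logit(np.column_stack([np.ones(n), w, x]), y)
naive_beta_w = float(naive_beta[1])
```

In this deliberately favorable setup the first-stage residual recovers the confounder, so the 2SRI coefficient on `w` should be close to the value 0.8 used to generate the data, while the naive logistic fit is pulled toward zero. With weak instruments or a misspecified first stage, this behavior is not guaranteed.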

As mentioned, these results are a significant contribution to settings with unmeasured covariates and nonlinear outcomes, because there has been no research on model selection procedures when both the true treatment model and the true outcome model need to be specified. However, several issues remain for future work. First, only a binary outcome was considered in this study. Because 2SRI uses a likelihood in the second step and LIML uses a full likelihood, the methods can be extended to more complex models, for instance, a more general outcome from an exponential family or a time-to-event outcome (Kianian et al. 2019; Martínez-Camblor et al. 2019). In particular, LIML needs to consider the likelihoods of both the outcome and treatment variables; however, there are few other restrictions. For instance, LIML is not restricted to binary instrumental variables (Wang and Tchetgen 2018; Kianian et al. 2019) or to continuous treatments (Martínez-Camblor et al. 2019). Therefore, LIML with any of the model selection procedures has great potential for extension. Next, the impact of misspecifying the distribution of the unmeasured covariate needs to be examined carefully. As the simulation results in Web Appendix C show, the impact may be limited; however, the estimation behavior in other cases is not clear. Therefore, further simulations covering more varied situations are necessary.

References

Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Baiocchi M, Cheng J, Small DS (2014) Instrumental variable methods for causal inference. Stat Med 33(13):2297–2340
Basu A, Coe N, Chapman CG (2017) Comparing 2SLS vs 2SRI for binary outcomes and binary exposures (No. w23840). National Bureau of Economic Research
Biller B, Corlu CG (2012) Copula-based multivariate input modeling. Surv Oper Res Manag Sci 17(2):69–84
Brookhart MA, van der Laan MJ (2006) A semiparametric model selection criterion with applications to the marginal structural model. Comput Stat Data Anal 50(2):475–498
Burgess S, Small DS, Thompson SG (2017) A review of instrumental variable estimators for Mendelian randomization. Stat Methods Med Res 26(5):2333–2355
Chen Y et al (2018) Association of body mass index and age with incident diabetes in Chinese adults: a population-based cohort study. BMJ Open 8(9):e021768
Cheng L, Zhuang H, Ju H, Yang S, Han J, Tan R, Hu Y (2019) Exposing the causal effect of body mass index on the risk of type 2 diabetes mellitus: a Mendelian randomization study. Front Genet 10:94
Fantazzini D (2009) The effects of misspecified marginals and copulas on computing the value at risk: a Monte Carlo study. Comput Stat Data Anal 53(6):2168–2188
Hernán MA, Robins JM (2020) Causal inference: what if. Chapman & Hall/CRC, Boca Raton
Hu FB, Manson JE, Stampfer MJ, Colditz G, Liu S, Solomon CG, Willett WC (2001) Diet, lifestyle, and the risk of type 2 diabetes mellitus in women. N Engl J Med 345(11):790–797
Kianian B, Kim JI, Fine JP, Peng L (2019) Causal proportional hazards estimation with a binary instrumental variable. arXiv:1901.11050
Martínez-Camblor P, Mackenzie T, Staiger DO, Goodney PP, O’Malley AJ (2019) Adjusting for bias introduced by instrumental variable estimation in the Cox proportional hazards model. Biostatistics 20(1):80–96
Narayan KV, Boyle JP, Thompson TJ, Gregg EW, Williamson DF (2007) Effect of BMI on lifetime risk for diabetes in the US. Diabetes Care 30(6):1562–1566
Nishii R (1984) Asymptotic properties of criteria for selection of variables in multiple regression. Ann Stat 12(2):758–765
Richardson TG, Sanderson E, Elsworth B, Tilling K, Smith GD (2020) Use of genetic variation to separate the effects of early and later life adiposity on disease risk: Mendelian randomisation study. BMJ 369
Rivers D, Vuong QH (1988) Limited information estimators and exogeneity tests for simultaneous probit models. J Econ 39(3):347–366
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shao J (1997) An asymptotic theory for linear model selection. Stat Sin 7:221–242
Taguri M, Matsuyama Y, Ohashi Y (2014) Model selection criterion for causal parameters in structural mean models based on a quasi-likelihood. Biometrics 70(3):721–730
Tchetgen EJT, Walter S, Vansteelandt S, Martinussen T, Glymour M (2015) Instrumental variable estimation in a survival context. Epidemiology (Camb, MA) 26(3):402
Terza JV, Basu A, Rathouz PJ (2008) Two-stage residual inclusion estimation: addressing endogeneity in health econometric modeling. J Health Econ 27(3):531–543
Vansteelandt S, Bekaert M, Claeskens G (2012) On model selection and model misspecification in causal inference. Stat Methods Med Res 21(1):7–30
Wan F, Small D, Mitra N (2018) A general approach to evaluating the bias of 2-stage instrumental variable estimators. Stat Med 37(12):1997–2015
Wang L, Tchetgen ET (2018) Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. J R Stat Soc Ser B Stat Methodol 80(3):531
Wooldridge JM (2010) Econometric analysis of cross section and panel data. MIT Press, Cambridge
Wooldridge JM (2014) Quasi-maximum likelihood estimation and testing for nonlinear models with endogenous explanatory variables. J Econ 182(1):226–234

Acknowledgements

We would like to express our sincere thanks to the editors for their useful comments, which have improved this manuscript and made it more useful. This work was supported by JSPS KAKENHI Grant Number JP21K10500. Assistance with data cleaning was provided by the National Center for Biotechnology Information. Support for collection of datasets and samples was provided by the Collaborative Study on the Genetics of Alcoholism (COGA; U10 AA008401), the Collaborative Genetic Study of Nicotine Dependence (COGEND; P01 CA089392), and the Family Study of Cocaine Dependence (FSCD; R01 DA013423). Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01HG004438), the National Institute on Alcohol Abuse and Alcoholism, the National Institute on Drug Abuse, and the NIH contract “High-throughput genotyping for studying the genetic contributions to human disease” (HHSN268200782096C). The datasets used for the analyses described in this manuscript were obtained from dbGaP from http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000091.v1.p1 through dbGaP accession number phs000091.v1.p.

Author information

Authors and Affiliations

22-2 Seto, Kanazawa-ku, Yokohama City, Kanagawa, Japan
Shunichiro Orihara, Atsushi Goto & Masataka Taguri

Corresponding author

Correspondence to Shunichiro Orihara.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by Maomi Ueno.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 418 kb)

Appendices

A Supplementary information for simulations

Data generating programs, simulation datasets, simulation programs, simulation results, and programs for deriving tables and figures are available at the following URL:

https://drive.google.com/file/d/17nZla3cQDYTvka-260Ib2Qc9XUjn3B5o/view?usp=sharing

A.1 Candidate models

To select a treatment model and an outcome model, the following candidate models are considered (Table 7):

Candidates for treatment models
$$\begin{aligned} W=\alpha _{0}+z\alpha _{1}+x_{2}\alpha _{2}+x_{3}\alpha _{3}+\left( z\times x_{2}\right) \alpha _{4}+\left( z\times x_{3}\right) \alpha _{5}+\left( x_{2}\times x_{3}\right) \alpha _{6}+V \end{aligned}$$
Candidates for outcome models
$$\begin{aligned} Y=\varvec{1}\left\{ \beta _{0}+w\beta _{1}+x_{1}\beta _{2}+x_{2}\beta _{3}+x_{3}\beta _{4}+\left( x_{1}\times x_{2}\right) \beta _{5}+\left( x_{1}\times x_{3}\right) \beta _{6}+\left( x_{2}\times x_{3}\right) \beta _{7}+U\ge 0\right\} \end{aligned}$$
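To illustrate how a set of candidate treatment models of this form can be enumerated and scored, the sketch below keeps the main effects of $z$, $x_2$, and $x_3$ in every model and lets any subset of the three interaction terms enter. This is an assumption made for the example: the actual candidate settings are those of Table 7, which may differ. The Gaussian AIC/BIC used here is likewise a generic stand-in for a linear treatment model, and the simulated coefficients are hypothetical.

```python
from itertools import combinations

import numpy as np

MAIN = ["z", "x2", "x3"]
INTERACTIONS = [("z", "x2"), ("z", "x3"), ("x2", "x3")]


def candidate_treatment_models():
    """All models that keep the main effects and add any subset of interactions."""
    models = []
    for r in range(len(INTERACTIONS) + 1):
        for subset in combinations(INTERACTIONS, r):
            models.append(list(subset))
    return models


def design(data, interactions):
    """Design matrix: intercept, main effects, then the chosen interaction columns."""
    n = len(data["z"])
    cols = [np.ones(n)] + [data[m] for m in MAIN]
    cols += [data[a] * data[b] for a, b in interactions]
    return np.column_stack(cols)


def gaussian_ic(w, D):
    """AIC and BIC of a linear model with normal errors, fitted by least squares."""
    n, k = D.shape
    coef, *_ = np.linalg.lstsq(D, w, rcond=None)
    rss = float(np.sum((w - D @ coef) ** 2))
    sigma2 = rss / n                                   # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    p = k + 1                                          # coefficients plus error variance
    return -2.0 * loglik + 2.0 * p, -2.0 * loglik + float(np.log(n)) * p


# Simulated data (hypothetical): the true model contains only the z x x2 interaction.
rng = np.random.default_rng(1)
n = 2000
data = {name: rng.normal(size=n) for name in MAIN}
w = (1.0 + 0.5 * data["z"] + 0.3 * data["x2"] - 0.2 * data["x3"]
     + 0.4 * data["z"] * data["x2"] + rng.normal(size=n))

models = candidate_treatment_models()
scores = [gaussian_ic(w, design(data, m)) for m in models]
best_by_aic = models[int(np.argmin([s[0] for s in scores]))]
best_by_bic = models[int(np.argmin([s[1] for s in scores]))]
```

The same enumeration pattern would apply to the outcome model candidates, with the second-stage or joint likelihood replacing the Gaussian one.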

Table 7 Settings of candidates for treatment and outcome models

A.2 Supplemental figures for continuous treatment and normal unmeasured covariates

B Supplementary information for data analysis

B.1 Description of the GENEVA Diabetes Study datasets

The details of the GENEVA Diabetes Study are available at the following URL:

https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000091.v2.p1

There are two main genotype datasets and one phenotype dataset, named “phg000036v1,” “phg000048v1,” and “phenotype,” respectively. Note that the genotype datasets contain one dataset per subject, and the data are linked to subject IDs through annotation files. The GENEVA Diabetes Study datasets are encrypted using the NCBI data encryption algorithm. To decrypt the datasets, the “SRA Toolkit” is required. Details of the data encryption can be found at the following URL:

https://www.ncbi.nlm.nih.gov/books/NBK570250/

B.2 Candidate models

See Table 8.

Candidates for treatment models
$$\begin{aligned} BMI=\alpha _{0}+SNPs\, \alpha _{1}+age\, \alpha _{2}+sex\, \alpha _{3}+\left( age\times sex\right) \alpha _{4}+V \end{aligned}$$
Candidates for outcome models
$$\begin{aligned} Diabetes&=\varvec{1}\left\{ \beta _{0}+BMI\beta _{11}+BMI^2\beta _{12}+age\, \beta _{2}+sex\, \beta _{3}+\left( BMI\times age\right) \beta _{4}\right. \\&\left. +\left( BMI\times sex\right) \beta _{5}+\left( age\times sex\right) \beta _{6}+U\ge 0\right\} \end{aligned}$$

Table 8 Candidate settings for analysis models

About this article

Cite this article

Orihara, S., Goto, A. & Taguri, M. Instrumental variable estimation of causal effects with applying some model selection procedures under binary outcomes. Behaviormetrika 50, 241–262 (2023). https://doi.org/10.1007/s41237-022-00177-9

Received: 12 March 2022
Accepted: 10 June 2022
Published: 09 July 2022
Issue Date: January 2023
DOI: https://doi.org/10.1007/s41237-022-00177-9

1 Introduction