1 Introduction

Observational studies usually aim to estimate the causal effects of treatments on outcomes. When all covariates or confounders (hereafter referred to as “covariates”) are observed, they can be adjusted for and an unbiased estimator of the causal effect can be obtained; this is the case of “no unmeasured confounding” (cf. Hernán and Robins 2020). No unmeasured confounding is a sufficient assumption for obtaining an unbiased estimator of causal effects. However, there is a serious risk of obtaining biased estimates unless the covariates are adjusted for appropriately. When some covariates are not observed, an unbiased estimator usually cannot be obtained. Unmeasured covariates therefore constitute an important problem in causal inference, since the assumption of no unmeasured confounding no longer holds, and different estimation methods must be applied.

In this study, the focus is on instrumental variable (IV) methods. A two-stage least squares (2SLS) estimator is one of the most important two-step procedures in the estimation of IV causal effects when there are unmeasured covariates (Wooldridge 2010). 2SLS is useful, but it requires the key assumption that there is a linear relationship between the treatment variable and the outcome variable. If this assumption is violated, a biased estimate of the causal effect of interest may be obtained. Terza et al. (2008) introduced a two-stage residual inclusion (2SRI) estimator similar to the control function approach (Wooldridge 2010). 2SRI is another two-step procedure, extended to include nonlinear models such as logistic regression and probit models, whereby an unbiased estimate of the causal effect can be obtained even when the models are nonlinear. Although 2SRI overcomes this problem of 2SLS, it may still yield biased causal effects, as mentioned in Basu et al. (2017) and Wan et al. (2018). According to the simulation results of Basu et al. (2017), a full-likelihood approach yields more accurate estimates than 2SRI (see also Section 5 of Burgess et al. 2017). Therefore, in this manuscript, a limited-information maximum likelihood (LIML) estimator (Wooldridge 2014) is also considered. The LIML estimator uses a full-likelihood approach, but has features similar to those of 2SRI and the control function approach. Both 2SRI and LIML can be used for nonlinear models; however, not only the outcome model but also the treatment model needs to be specified correctly (Basu et al. 2017). Therefore, identifying the correct models is an important step when using 2SRI and LIML.

In model selection, information criteria are commonly used to select the “correct” model. The Akaike information criterion (AIC; Akaike 1974) is the best-known information criterion for selecting the best model in the prediction of future outcomes. The Bayesian information criterion (BIC) proposed by Schwarz (1978) is another well-known information criterion with model selection consistency (i.e., it selects the correct model with probability tending to 1 as the sample size increases) under certain assumptions (Nishii 1984; Shao 1997). In the field of causal inference, model selection has been considered in some previous studies (Brookhart and van der Laan 2006; Vansteelandt et al. 2012; Taguri et al. 2014). Although the procedures considered vary among the studies, the motivation is the same: to estimate unbiased causal effects, a valid model needs to be considered. However, to the best of our knowledge, there are no previous reports on model selection procedures for 2SRI and LIML.

In this paper, we consider model selection procedures for 2SRI and LIML, and confirm their properties through simulation and real datasets. Specifically, we confirm the model selection procedures can detect the correct treatment and outcome model, and unbiased causal effects can be estimated. Since previous studies have considered model selection procedures neither for 2SRI (the control function approach) nor for LIML in this context, the contribution of this study may be considered significant for these estimation procedures. In Sect. 2, a motivational example is introduced and the model considered in this study is presented. Two situations are considered: continuous and dichotomous treatments. In addition, we introduce AIC-type and BIC-type information criteria. In Sect. 3, the properties of 2SRI and LIML with model selection are confirmed using simulation datasets. In the simulation, we consider a case in which the distribution of unmeasured covariates is correctly specified. In Sect. 4, data analysis is performed using the GENEVA Diabetes Study dataset. Supplementary information on simulations and the GENEVA Diabetes Study datasets are found in the Appendix. Some calculations and supplemental simulations are found in the Web Appendix.

2 Motivation example and IV methods

First, a motivational example, the GENEVA Diabetes Study datasets which store subjects’ demographic information (phenotype), genetic information (genotype), and outcomes (presence or absence of diabetes), is introduced. In this study, the causal effect of body mass index (BMI) on the incidence of diabetes is investigated. As is well known, diabetes affects some parts of the body, such as eyes, kidney, and heart. There are more than 400 million diabetic patients worldwide (Cheng et al. 2019). In addition, BMI and the incidence of diabetes have a positive relationship such that high BMI increases the likelihood of developing diabetes. To estimate the causal effect correctly, the covariates, regardless of whether they are observed or not, need to be adjusted when the datasets are derived from observational studies. Cheng et al. (2019) and Richardson et al. (2020) used an instrumental variable approach with the genetic information constituting the instrumental variables. This analysis strategy is called “Mendelian randomization” (Burgess et al. 2017). In this study, Mendelian randomization was also conducted using the genetic information included in the GENEVA Diabetes Study datasets.

Herein, a more general formulation is considered. Let \(n\) be the sample size and assume that the observations indexed by \(i = 1,\ 2,\dots , n\) are i.i.d. samples. Let \(\varvec{X}\in {\mathbb {R}}^{p}\) and \(\varvec{Z}\in {\mathbb {R}}^{K}\) denote vectors of covariates and IVs, respectively. The following relationship is assumed for the unmeasured variables:

$$\begin{aligned} \left( \begin{array}{c} V\\ U \end{array} \right) \sim F(v,u;\xi ),\ \ \left( \begin{array}{c} V\\ U \end{array} \right) \mathop {\perp \!\!\!\!\!\perp }\left( \begin{array}{c} \varvec{X}\\ \varvec{Z} \end{array} \right) , \end{aligned}$$
(2.1)

where \(\xi\) is a parameter of the joint distribution \((V,U)\in {\mathbb {R}}^{2}\), referred to as “unmeasured covariates” in this study. These assumptions are similar to those of Wooldridge (2014); (2.1) suggests a LIML estimation procedure. Next, the models considered in this paper are introduced.

  • Treatment model (continuous treatment)

    $$\begin{aligned} W=\varphi _{1}(\varvec{Z},\varvec{X};\varvec{\alpha })+V \end{aligned}$$
    (2.2)
  • Treatment model (dichotomous treatment)

    $$\begin{aligned} W=\varvec{1}\left\{ \varphi _{1}(\varvec{Z},\varvec{X};\varvec{\alpha })+V\ge 0\right\} \end{aligned}$$
  • Outcome model

    $$\begin{aligned} Y=\varvec{1}\left\{ \varphi _{2}(W,\varvec{X};\varvec{\beta })+U\ge 0\right\} , \end{aligned}$$
    (2.3)

    where \(\varphi _{1}\) and \(\varphi _{2}\) are twice differentiable predictors with respect to parameters \(\varvec{\alpha }\) and \(\varvec{\beta }\), respectively. For instance, \(\varphi _{1}\) and \(\varphi _{2}\) can be selected as a linear model:

    $$\begin{aligned} \varphi _{1}(\varvec{Z},\varvec{X};\varvec{\alpha })=\varvec{Z}^{\top }\varvec{\alpha }_{z}+\varvec{X}^{\top }\varvec{\alpha }_{x},\ \ \ \varphi _{2}(W,\varvec{X};\varvec{\beta })=W\beta _{w}+\varvec{X}^{\top }\varvec{\beta }_{x}. \end{aligned}$$

    In addition, the parameter spaces of \(\varvec{\alpha }\) and \(\varvec{\beta }\) are denoted by \(\Theta _{\varvec{\alpha }}\) and \(\Theta _{\varvec{\beta }}\), respectively.

The explanation of the above variables by DAGs (cf. Hernán and Robins 2020) is shown in Web Appendix A.

The pair of models (2.2) and (2.3) is called the Rivers-Vuong model (RV model; Rivers and Vuong 1988), where

$$\begin{aligned} F(v,u;\xi )=N_{2}\left( \varvec{0}_{2}, \left( \begin{array}{cc} \sigma _{v}^2&{}\rho \sigma _{v}\\ \rho \sigma _{v}&{}1 \end{array} \right) \right) ,\ \ \xi =(\sigma _{v},\rho )^{\top },\, \sigma _{v}>0,\ \rho \ne 0. \end{aligned}$$

Under the RV model, (2.3) becomes a probit model. Note that the following discussions are not limited to the above treatment and outcome models but apply to other parametric models as well. Note also that the IVs \(\varvec{Z}\) satisfy three IV conditions (see Baiocchi et al. 2014): (1) causal association with the treatment variable \(W\), (2) no association with the unobserved variables (\(V\), \(U\)), and (3) no direct causal association with the outcome variable \(Y\). The first and third conditions are expressed by the above treatment and outcome models, respectively. The second condition is expressed by (2.1).

To estimate the parameters \(\varvec{\theta }=\left( \varvec{\alpha }^{\top },\varvec{\beta }^{\top },\varvec{\xi }^{\top }\right) ^{\top }\in \Theta =\Theta _{\varvec{\alpha }}\times \Theta _{\varvec{\beta }}\times \Theta _{\varvec{\xi }}\), two IV estimators are introduced: a 2SRI estimator and a LIML estimator.

2.1 Two-stage residual inclusion

The 2SRI estimator estimates the causal effects in two steps. In the first step, the treatment variable is regressed onto the instrumental variables to construct the residuals of the treatment variable. Specifically, under (2.2) and (2.3), consider the ordinary least squares estimator of \(\varvec{\alpha }\):

$$\begin{aligned} \hat{\varvec{\alpha }}=\mathop {\mathrm{arg~min}}\limits _{\varvec{\alpha }}\sum _{i=1}^{n}\left( w_{i}-\varphi _{1}(\varvec{z}_{i},\varvec{x}_{i};\varvec{\alpha })\right) ^2. \end{aligned}$$

For each predictor \(\varphi _{1}(\varvec{Z}_{i},\varvec{X}_{i};\hat{\varvec{\alpha }})\), the residuals are derived:

$$\begin{aligned} v_{i}(\hat{\varvec{\alpha }})=w_{i}-\varphi _{1}(\varvec{z}_{i},\varvec{x}_{i};\hat{\varvec{\alpha }}). \end{aligned}$$
(2.4)

In the second step, the outcome variable is regressed not only onto the treatment variable but also onto the residuals of the treatment variable; that is, the residuals are plugged into (2.3). For instance, when \(U\) follows a logistic distribution, the outcome model becomes a logistic regression model:

$$\begin{aligned} p_{i}(\varvec{\beta },\gamma )=\mathrm{expit}\left\{ \varphi _{2}(w_{i},\varvec{x}_{i};\varvec{\beta })+v_{i}(\hat{\varvec{\alpha }})\gamma \right\} . \end{aligned}$$
(2.5)

In contrast, when \(U\) follows a normal distribution, the above outcome model becomes a probit model:

$$\begin{aligned} p_{i}(\varvec{\beta },\gamma )=\Phi \left( \varphi _{2}(w_{i},\varvec{x}_{i};\varvec{\beta })+v_{i}(\hat{\varvec{\alpha }})\gamma \right) . \end{aligned}$$
(2.6)

Under (2.5) or (2.6), the maximum likelihood estimator of (\(\varvec{\beta },\gamma )\) is considered:

$$\begin{aligned} \left( \begin{array}{c} \hat{\varvec{\beta }}\\ {\hat{\gamma }} \end{array} \right)&=\mathop {\mathrm{arg~max}}\limits _{\varvec{\beta },\, \gamma }\ \log \left[ \prod _{i=1}^{n}p_{i}(\varvec{\beta },\gamma )^{y_{i}}\left( 1-p_{i}(\varvec{\beta },\gamma )\right) ^{1-y_{i}}\right] \nonumber \\&=\mathop {\mathrm{arg~max}}\limits _{\varvec{\beta },\, \gamma }\ \ell _{2SRI}(\varvec{\beta },\gamma ). \end{aligned}$$
(2.7)

Through the above procedures, a 2SRI estimator \(\hat{\varvec{\beta }}\) is obtained. Note that for dichotomous treatment the residuals are more complicated than (2.4) (Tchetgen et al. 2015); this is a drawback of 2SRI. As mentioned below, LIML does not require the residuals; only the full likelihood is necessary.
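The two ingredients of 2SRI can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a single instrument, no covariates, and linear predictors, and the function names `first_stage_residuals` and `loglik_2sri` are hypothetical. It computes the residuals in (2.4) and evaluates the probit objective \(\ell _{2SRI}\) in (2.7); the maximization itself is left to any numerical optimizer.

```python
import math
from statistics import NormalDist


def first_stage_residuals(w, z):
    """OLS of w on an intercept and one instrument z; returns residuals as in (2.4).

    Assumption for illustration: phi_1 is linear with a single instrument
    and no covariates.
    """
    n = len(w)
    z_bar = sum(z) / n
    w_bar = sum(w) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    s_zw = sum((zi - z_bar) * (wi - w_bar) for zi, wi in zip(z, w))
    slope = s_zw / s_zz
    intercept = w_bar - slope * z_bar
    return [wi - (intercept + slope * zi) for wi, zi in zip(w, z)]


def loglik_2sri(beta0, beta_w, gamma, y, w, v_hat):
    """Second-stage probit log-likelihood: (2.7) with p_i taken from (2.6)."""
    std_normal = NormalDist()
    eps = 1e-12  # keeps log() finite when p is numerically 0 or 1
    total = 0.0
    for yi, wi, vi in zip(y, w, v_hat):
        p = std_normal.cdf(beta0 + beta_w * wi + gamma * vi)
        p = min(max(p, eps), 1.0 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total
```

Passing `loglik_2sri` (with the sign flipped) to a general-purpose minimizer over `(beta0, beta_w, gamma)` would give the second-stage estimates under these simplifying assumptions.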

In Section 3, to assess the performance of model selection procedures, the following AIC and BIC are considered:

$$\begin{aligned} AIC_{2SRI}&=-2\ell _{2SRI}(\hat{\varvec{\beta }},{\hat{\gamma }})+2(|\hat{\varvec{\beta }}|+1)\\ BIC_{2SRI}&=-2\ell _{2SRI}(\hat{\varvec{\beta }},{\hat{\gamma }})+(|\hat{\varvec{\beta }}|+1)\log (n), \end{aligned}$$

where \(|\cdot |\) is the number of elements. Note that these criteria are applied to select an outcome model without accounting for the fact that \(v\) is a predicted residual (i.e., it is handled in the same way as the other covariates).
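As a small illustration of how the two criteria can disagree, the following sketch computes AIC and BIC from a maximized log-likelihood and a parameter count (for 2SRI, \(|\hat{\varvec{\beta }}|+1\)). The function names and the log-likelihood values in the usage note are hypothetical, chosen only to show the mechanics.

```python
import math


def aic(loglik, n_params):
    """AIC = -2 * loglik + 2 * (number of parameters)."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """BIC = -2 * loglik + (number of parameters) * log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def select_model(candidates, n_obs, criterion="bic"):
    """Return the name of the candidate with the smallest criterion value.

    candidates: dict mapping a model name to a pair
    (maximized log-likelihood, number of parameters).
    """
    def score(item):
        loglik, k = item[1]
        return aic(loglik, k) if criterion == "aic" else bic(loglik, k, n_obs)

    return min(candidates.items(), key=score)[0]
```

With the made-up values `{"small": (-60.0, 3), "large": (-58.5, 4)}` and `n_obs=100`, AIC prefers `"large"` (125 against 126) whereas BIC prefers `"small"`, because the BIC penalty per parameter, \(\log (100)\approx 4.6\), exceeds the AIC penalty of 2.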

2.2 Limited-information maximum likelihood

Let us consider the likelihood function \(L_{LIML}(\varvec{\theta })=\prod _{i=1}^{n}L_{LIML,i}(\varvec{\theta })\) conditioning on \(\varvec{z}\) and \(\varvec{x}\):

$$\begin{aligned} L_{LIML}(\varvec{\theta })=\prod _{i=1}^{n}f(y_{i},w_{i}|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta })=\prod _{i=1}^{n}{\mathrm{P}}(y_{i}|w_{i},\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta })f(w_{i}|\varvec{z}_{i},\varvec{x}_{i};\varvec{\alpha }). \end{aligned}$$
(2.8)

In the following, the specific form of the likelihood for the two cases is explicitly defined. In the case of the Rivers-Vuong model, (2.8) becomes

$$\begin{aligned} L_{LIML}(\varvec{\theta })=\prod _{i=1}^{n}\Phi \left( \frac{\varphi _{i2}(\varvec{\beta })+\rho v_{i}(\varvec{\alpha })}{\sqrt{1-\rho ^2}}\right) ^{y_{i}}\left( 1-\Phi \left( \frac{\varphi _{i2}(\varvec{\beta }) +\rho v_{i}(\varvec{\alpha })}{\sqrt{1-\rho ^2}}\right) \right) ^{1-y_{i}}\frac{1}{\sqrt{2\pi \sigma _{v}^2}}\exp \left\{ -\frac{v_{i}^2(\varvec{\alpha })}{2\sigma _{v}^2}\right\} \end{aligned}$$
(2.9)

(see Web Appendix B.1). Therefore, the log-likelihood \(\ell _{LIML}(\varvec{\theta })=\log L_{LIML}(\varvec{\theta })\) becomes:

$$\begin{aligned} \ell _{LIML}(\varvec{\theta })&=\sum _{i=1}^{n}\left\{ y_{i}\log \Phi \left( \frac{\varphi _{i2}(\varvec{\beta })+\rho v_{i}(\varvec{\alpha })}{\sqrt{1-\rho ^2}}\right) +(1-y_{i})\log \left( 1-\Phi \left( \frac{\varphi _{i2}(\varvec{\beta })+\rho v_{i}(\varvec{\alpha })}{\sqrt{1-\rho ^2}}\right) \right) \right. \\&\quad \left. -\frac{v_{i}^2(\varvec{\alpha })}{2\sigma _{v}^2}-\log \left( \sqrt{2\pi \sigma _{v}^2}\right) \right\} . \end{aligned}$$
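The log-likelihood above can be evaluated directly. The sketch below follows the displayed expression term by term, under the illustrative assumptions of linear predictors, one instrument, and no covariates; the function name and argument layout are hypothetical.

```python
import math
from statistics import NormalDist


def loglik_liml_rv(alpha, beta, sigma_v, rho, y, w, z):
    """LIML log-likelihood under the Rivers-Vuong model, as displayed in (2.9).

    Illustrative assumptions: alpha = (a0, a_z) and beta = (b0, b_w) with
    linear predictors phi_1 = a0 + a_z * z and phi_2 = b0 + b_w * w.
    The index uses rho * v exactly as displayed; with sigma_v = 1 (as in the
    simulation of Section 3.1) this equals the conditional mean of U given V.
    """
    std_normal = NormalDist()
    eps = 1e-12  # keeps log() finite when p is numerically 0 or 1
    total = 0.0
    for yi, wi, zi in zip(y, w, z):
        v = wi - (alpha[0] + alpha[1] * zi)  # v_i(alpha)
        index = (beta[0] + beta[1] * wi + rho * v) / math.sqrt(1.0 - rho ** 2)
        p = min(max(std_normal.cdf(index), eps), 1.0 - eps)
        # outcome part: probit contribution
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
        # treatment part: normal density of the first-stage error
        total += -v ** 2 / (2.0 * sigma_v ** 2)
        total += -0.5 * math.log(2.0 * math.pi * sigma_v ** 2)
    return total
```

Unlike 2SRI, all parameters \((\varvec{\alpha },\varvec{\beta },\sigma _{v},\rho )\) enter a single objective, so one optimizer call over this function would estimate them jointly.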

For dichotomous treatment, (2.8) becomes

$$\begin{aligned} L_{LIML}(\varvec{\theta })&=\prod _{i=1}^{n}{\mathrm{P}}\left( y_{i}=1,w_{i}=1|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta }\right) ^{y_{i}w_{i}}{\mathrm{P}}\left( y_{i}=0,w_{i}=1|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta }\right) ^{(1-y_{i})w_{i}}\\&\times {\mathrm{P}}\left( y_{i}=1,w_{i}=0|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta }\right) ^{y_{i}(1-w_{i})}{\mathrm{P}}\left( y_{i}=0,w_{i}=0|\varvec{z}_{i},\varvec{x}_{i};\varvec{\theta }\right) ^{(1-y_{i})(1-w_{i})}. \end{aligned}$$

Therefore, the log-likelihood becomes

$$\begin{aligned} \ell _{LIML}(\varvec{\theta })&=\sum _{i=1}^{n}\left\{ y_{i}w_{i}\log \left\{ 1-F(\infty ,-\varphi _{i2}(\varvec{\beta });\xi )-F(-\varphi _{i1}(\varvec{\alpha }),\infty ;\xi )+F(-\varphi _{i1}(\varvec{\alpha }),-\varphi _{i2}(\varvec{\beta });\xi )\right\} \right. \nonumber \\&\quad +(1-y_{i})w_{i}\log \left\{ F(\infty ,-\varphi _{i2}(\varvec{\beta });\xi )-F(-\varphi _{i1}(\varvec{\alpha }),-\varphi _{i2}(\varvec{\beta });\xi )\right\} \nonumber \\&\quad +y_{i}(1-w_{i})\log \left\{ F(-\varphi _{i1}(\varvec{\alpha }),\infty ;\xi )-F(-\varphi _{i1}(\varvec{\alpha }),-\varphi _{i2}(\varvec{\beta });\xi )\right\} \nonumber \\&\quad +\left. (1-y_{i})(1-w_{i})\log \left\{ F(-\varphi _{i1}(\varvec{\alpha }),-\varphi _{i2}(\varvec{\beta });\xi )\right\} \right\} \end{aligned}$$
(2.10)

(see Web Appendix B.2), where

$$\begin{aligned} F(v,\infty ;\xi )=\lim _{u\rightarrow \infty }F(v,u;\xi ),\ \ F(\infty ,u;\xi )=\lim _{v\rightarrow \infty }F(v,u;\xi ). \end{aligned}$$
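The four cell probabilities in (2.10) can be written generically for any joint CDF \(F\). In the sketch below, `F` is a hypothetical callable `F(v, u)` with \(\xi\) already fixed inside it, and `independent_normal_cdf` is only a stand-in used to check the bookkeeping; it is not one of the distributions considered in this paper.

```python
import math
from statistics import NormalDist


def cell_probabilities(phi1, phi2, F):
    """Return {(y, w): P(Y=y, W=w | z, x)} following the terms of (2.10).

    phi1, phi2: values of the treatment and outcome predictors.
    F: callable joint CDF of (V, U), evaluated as F(v, u).
    """
    inf = math.inf
    f_v = F(-phi1, inf)      # P(V <= -phi1), marginal of V
    f_u = F(inf, -phi2)      # P(U <= -phi2), marginal of U
    f_vu = F(-phi1, -phi2)   # P(V <= -phi1, U <= -phi2)
    return {
        (1, 1): 1.0 - f_u - f_v + f_vu,
        (0, 1): f_u - f_vu,
        (1, 0): f_v - f_vu,
        (0, 0): f_vu,
    }


def loglik_liml_binary(y, w, phi1, phi2, F):
    """Log-likelihood (2.10): sum of the log probability of each observed cell."""
    total = 0.0
    for yi, wi, p1, p2 in zip(y, w, phi1, phi2):
        total += math.log(cell_probabilities(p1, p2, F)[(yi, wi)])
    return total


def independent_normal_cdf(v, u):
    """Placeholder joint CDF (independent standard normals), for checking only."""
    std_normal = NormalDist()
    return std_normal.cdf(v) * std_normal.cdf(u)
```

Because each observation falls in exactly one cell, the indicator exponents in the likelihood reduce to a dictionary lookup on `(y_i, w_i)`.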

By maximizing the likelihood (2.8), a limited-information maximum likelihood estimator can be derived as

$$\begin{aligned} \hat{\varvec{\theta }}=\mathop {\mathrm{arg~max}}\limits _{\varvec{\theta }\in \Theta }\ \ell _{LIML}(\varvec{\theta }). \end{aligned}$$
(2.11)

Note that the joint distribution \(F(v,u;\xi )\) has to be specified when using LIML. However, the distribution is somewhat flexible; for instance, some parametric copulas can be selected (e.g., Biller and Corlu 2012; Fantazzini 2009). When the marginal distributions of \(V\) and \(U\) are assumed to be the logistic distributions \(F^{logis}_{V}(v)\) and \(F^{logis}_{U}(u)\), respectively, and a parametric copula \(C(\cdot ,\cdot ;\xi )\), such as the t-copula or Clayton copula, is assumed, the joint distribution becomes

$$\begin{aligned} F(v,u;\xi )=C(F^{logis}_{V}(v),F^{logis}_{U}(u);\xi ). \end{aligned}$$
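As a concrete instance of this construction, the sketch below combines standard logistic margins with the Clayton copula, \(C(a,b;\xi )=(a^{-\xi }+b^{-\xi }-1)^{-1/\xi }\) for \(\xi >0\). The choice of standard (location 0, scale 1) margins and the function names are assumptions made for illustration.

```python
import math


def logistic_cdf(x):
    """Standard logistic CDF, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def clayton_copula(a, b, xi):
    """Clayton copula C(a, b; xi) for xi > 0 and a, b in [0, 1]."""
    if a <= 0.0 or b <= 0.0:
        return 0.0
    try:
        s = a ** (-xi) + b ** (-xi) - 1.0
    except OverflowError:
        # a or b is so close to 0 that the copula value is numerically 0
        return 0.0
    return s ** (-1.0 / xi)


def joint_cdf(v, u, xi):
    """F(v, u; xi) = C(F_V(v), F_U(u); xi) with standard logistic margins."""
    return clayton_copula(logistic_cdf(v), logistic_cdf(u), xi)
```

Setting one argument to `math.inf` recovers the corresponding margin, for example `joint_cdf(0.0, math.inf, 2.0)` equals `logistic_cdf(0.0)`, which is the limit \(F(v,\infty ;\xi )\) used in (2.10).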

In Section 3, to assess the performance of model selection procedures, the following AIC and BIC are considered:

$$\begin{aligned} AIC_{LIML}&=-2\ell _{LIML}(\hat{\varvec{\theta }})+2|\hat{\varvec{\theta }}|\\ BIC_{LIML}&=-2\ell _{LIML}(\hat{\varvec{\theta }})+|\hat{\varvec{\theta }}|\log (n). \end{aligned}$$

2.3 Interpretation under potential outcomes

The 2SRI and LIML estimators can be used to estimate the average treatment effect (ATE) (Rosenbaum and Rubin 1983). To estimate the ATE by these methods, G-computation (e.g., Hernán and Robins 2020) can be applied:

  1. By solving (2.7) and (2.11), the 2SRI or LIML estimates of \(\varvec{\beta }\) are obtained.

  2. To estimate the probability of the outcome under a particular treatment value (written as \(w'\)), the predicted probability is averaged over all subjects in the sample; for instance, when \(U\) follows the normal distribution and the outcome model is a probit model:

    $$\begin{aligned} \hat{\mathrm{P}}\left( Y_{w'}=1\right) =\frac{1}{n}\sum _{i=1}^{n}\hat{\mathrm{P}}\left( Y=1|w',\varvec{x}_{i}\right) =\frac{1}{n}\sum _{i=1}^{n}\Phi \left( \varphi _{2}\left( w',\varvec{x}_{i};\hat{\varvec{\beta }}\right) \right) , \end{aligned}$$

    where \(Y_{w'}\) corresponds to the potential outcome under treatment \(w'\), and \(\varvec{x}_{i}\) are the observed covariates. Regarding 2SRI, \(\varvec{x}_{i}\) also includes the residual term of the 1st stage model.

From the above steps, ATE is estimated:

$$\begin{aligned} {\hat{\mathrm{E}}}[Y_{w'}]-{\hat{\mathrm{E}}}[Y_{w''}]={\hat{\mathrm{P}}}\left( Y_{w'}=1\right) -\hat{\mathrm{P}}\left( Y_{w''}=1\right) . \end{aligned}$$
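The G-computation steps above amount to averaging predicted probabilities over the observed covariates at two fixed treatment values. The following sketch assumes a probit outcome model with a linear predictor \(\varphi _{2}\); the function names and argument layout are hypothetical. For 2SRI, each covariate row would also carry the first-stage residual, with \(\hat{\gamma }\) appended to the coefficient vector.

```python
from statistics import NormalDist


def potential_outcome_prob(w_value, x_rows, beta0, beta_w, beta_x):
    """Estimate P(Y_w = 1) by averaging Phi(phi_2(w, x_i; beta)) over all x_i.

    Assumes a linear predictor: phi_2 = beta0 + beta_w * w + x_i' beta_x.
    """
    std_normal = NormalDist()
    total = 0.0
    for x in x_rows:
        index = beta0 + beta_w * w_value
        index += sum(b * xj for b, xj in zip(beta_x, x))
        total += std_normal.cdf(index)
    return total / len(x_rows)


def average_treatment_effect(w1, w0, x_rows, beta0, beta_w, beta_x):
    """ATE estimate: P(Y_{w1} = 1) - P(Y_{w0} = 1)."""
    p1 = potential_outcome_prob(w1, x_rows, beta0, beta_w, beta_x)
    p0 = potential_outcome_prob(w0, x_rows, beta0, beta_w, beta_x)
    return p1 - p0
```

Note that the same covariate rows are used for both treatment values; only the treatment argument changes between the two averages.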

3 Simulations

In this section, the properties of model selection procedures and parameter estimates of 2SRI and LIML are confirmed. Because no previous studies have considered model selection procedures for 2SRI (or the control function approach) and LIML, our simulation results may provide some guidance for using these estimation procedures. To confirm these properties, (1) the number of times the true model was selected for each procedure and the corresponding proportions were determined, and (2) descriptive statistics of estimates for each procedure were calculated. The number of iterations for the simulations was 1000.

3.1 Continuous treatment and normal unmeasured covariates

The Rivers–Vuong model was considered. In this setting, it was confirmed that LIML and 2SRI estimator with model selection perform well. Since we can apply 2SLS under this situation, the results of the 2SLS are summarized as well for reference. The simulation settings were as follows:

  • Covariates: \(X_{1}\sim N(0,1),\, X_{2}\sim Ber(0.5),\, X_{3}\sim N(0,1)\)

  • An instrumental variable: \(Z\sim Ber(0.5)\)

  • Unmeasured covariates: \(\left( \begin{array}{c} V\\ U \end{array} \right) \sim N\left( \varvec{0}_{2},\left( \begin{array}{cc} 1&{}\rho \\ \rho &{}1 \end{array} \right) \right)\)

    • Weak correlation: \(\rho =0.3\)

    • Strong correlation: \(\rho =0.6\)

  • A treatment model: \(W=1+\alpha _{z}Z+X_{2}+X_{3}+V\)

    • Weak instrumental variable: \(\alpha _{z}=0.2\) \(\Rightarrow\) The correlation between the treatment and the IV is approximately 0.06.

    • Strong instrumental variable: \(\alpha _{z}=1\) \(\Rightarrow\) The correlation between the treatment and the IV is approximately 0.3.

  • An outcome model: \(Y=\varvec{1}\left\{ 0.5+0.6W+0.5X_{1}+0.5X_{2}+U\ge 0 \right\}\)
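One way to generate a dataset under the settings listed above is sketched here. The function name, the dictionary layout, and the seed handling are assumptions; \(U\) is built as \(\rho V+\sqrt{1-\rho ^{2}}\,\varepsilon\) with an independent standard normal \(\varepsilon\), a standard construction that gives unit variances and correlation \(\rho\).

```python
import math
import random


def simulate(n, alpha_z=1.0, rho=0.3, seed=0):
    """Draw n observations from the Section 3.1 data-generating process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 1 if rng.random() < 0.5 else 0
        x3 = rng.gauss(0.0, 1.0)
        z = 1 if rng.random() < 0.5 else 0
        # unmeasured covariates (V, U): unit variances, correlation rho
        v = rng.gauss(0.0, 1.0)
        u = rho * v + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # treatment model
        w = 1.0 + alpha_z * z + x2 + x3 + v
        # outcome model (x3 enters the treatment model only)
        y = 1 if 0.5 + 0.6 * w + 0.5 * x1 + 0.5 * x2 + u >= 0 else 0
        rows.append({"x1": x1, "x2": x2, "x3": x3, "z": z, "w": w, "y": y})
    return rows
```

Calling `simulate(100, alpha_z=0.2)` would correspond to a small sample with the weak instrument, and `simulate(300, rho=0.6)` to a large sample with strong correlation between the unmeasured covariates.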

To select a treatment model and outcome model, candidate models were prepared. The supplemental information is provided in Appendix A.

The simulation results are summarized in tables and supplemental figures. The results of model selection are summarized in Table 1, where “2SRI: AIC” and “2SRI: BIC” denote 2SRI with the respective model selection procedure, and “LIML: AIC” and “LIML: BIC” denote LIML with the respective model selection procedure. The column “True model” shows the number of times the selected model was the true model (i.e., the pair of models a4 and b2; see Appendix). The column “Including true model” shows the number of times the selected model was the true model or a larger model (i.e., not a misspecified model; see Appendix). The column “Both true model” shows the number of times the 2SRI estimator selected the true model in both the first and second steps. Throughout the simulations, “(1) Weak correlation and Strong IV” was used as the reference setting. In terms of the selection probability of the “True model,” BIC did not display a high probability for small samples (\(N=100\)); however, it was the best of the three selection procedures in all cases with large samples (\(N=300\)). This agrees with previous theoretical results (model selection consistency; see Nishii 1984 and Shao 1997). For the selection probability of “Including true model,” both 2SRI and LIML displayed a high probability even for small samples. This feature is also the same as that of AIC. Regarding 2SRI, the selection probabilities of “Both true model” were also high. Settings (2) and (3) correspond to the weak IV situation and the strongly correlated unmeasured covariate situation, and are labelled “(2) Weak correlation and Weak IV” and “(3) Strong correlation and Strong IV,” respectively. In these settings, the selection probabilities of both “True model” and “Including true model” differ somewhat from those in (1); however, in terms of model selection alone, there is no remarkable difference. Therefore, model selection gives stable results regardless of the situation and the model selection procedure.

Table 1 Summary of the results of model selection for each estimator (continuous treatment and normal unmeasured covariates)

The estimated coefficients of treatment \(W\) in the outcome model are summarized in Table 2 and Figs. 1 and 2. In the table, “2SLS” is 2SLS (without model selection), “2SRI: Full model” is 2SRI with the largest model among the candidates, and “LIML: Full model” is LIML with the largest model among the candidates; in the figures, the red line denotes the true value.

  • With model selection vs. without model selection Comparing the results with and without model selection procedures, it appears that the estimates with model selection are more efficient and less biased, especially for large samples. In particular, the results of 2SRI without model selection are unstable. These results indicate that using a model selection procedure is important for estimating the causal effects correctly.

  • 2SRI vs. LIML In (1), for both the small-sample and large-sample cases, the LIML estimator with BIC was the most efficient among the three results with model selection procedures. The LIML estimator with AIC also worked well in terms of bias. The 2SRI estimator displayed a large bias and low efficiency; however, both improved for large samples. In (2), surprisingly, the LIML estimator yielded more accurate and less biased results than in (1) when the sample was small. As mentioned in Burgess et al. (2017), the LIML estimator is more robust than other well-known IV methods under weak IV situations. Nevertheless, the simulation result is notable because the weak IV results are more accurate than those in (1). Table 1 shows that the model selection probabilities of both “True model” and “Including true model” are smaller than in the other situations; simpler and more accurate models tend to be selected over the true model. Therefore, the LIML estimator can estimate the causal effects correctly when model selection procedures are used. In contrast, the 2SRI estimator suffers from the weak IV problem (cf. Burgess et al. 2017). In (3), the LIML estimator yielded results similar to those in (1). However, the 2SRI estimator with and without model selection displayed a large bias and low efficiency for small samples. For large samples, the efficiency improved somewhat; however, an important point is that 2SRI remained biased. These results are similar to those reported by Basu et al. (2017) and Wan et al. (2018). The bias arises from the construction of 2SRI, in which the residual term included in the second step is treated as a fixed covariate (see also Section 2.2). In fact, bias is present in all the 2SRI results. In particular, the situation with strongly correlated unmeasured covariates yields a large bias (cf. Wan et al. 2018).

  • 2SLS vs. 2SRI & LIML The 2SLS method showed some bias and instability compared with the other methods. This is natural because there is no linear relationship between treatment and outcome. Therefore, care is needed when using 2SLS.

Thus, LIML with model selection procedures is a good choice; however, the best model selection procedure depends on the specific case, as mentioned in the Introduction. Moreover, overcoming the weak IV problem is an advantage of LIML with model selection, because other well-known IV methods do not have this feature. 2SRI with model selection can be selected for “valid” causal relationships; however, biased or unstable estimates may be obtained when there are only weak IVs or when there are strong unobserved relationships between treatments and outcomes.

Table 2 Summary of descriptive statistics for each estimator (coefficients of W, continuous treatment and normal unmeasured covariates)

4 Data analysis

In this section, real data analysis is performed using 2SRI and LIML with model selection procedures. According to the simulation results, the difference between the AIC and BIC is small for large samples. Therefore, the AIC is used to select the valid model.

4.1 Analysis plan

The analysis follows the flow outlined below:

  1. Detecting the genetic information used as instrumental variables. According to Cheng et al. (2019), there are 52 SNPs related to BMI; however, only 19 SNPs are included in the GENEVA Diabetes Study datasets. Since each SNP is a weak instrumental variable (weak correlation with the treatment variable), as many SNPs as possible should be used to increase the efficiency of the estimation.

  2. Detecting the risk factors related to the incidence of diabetes. According to Chen et al. (2018) and Narayan et al. (2007), age and sex are two important risk factors. In addition, both factors may have interaction effects on the incidence of diabetes. Therefore, a candidate model with interaction terms must be included.

  3. Detecting the “valid” model using the AIC and estimating the causal effect. To select a treatment model and an outcome model, candidate models were prepared and the AIC was used to select the model. The supplemental information is provided in Appendix B.

Age categorization was considered as follows:

  • Age categories: age was coded as “0” if \(Age<50\), “1” if \(50\le Age<60\), “2” if \(60\le Age<70\), and “3” otherwise.
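The coding rule above is a lookup against three cut points. A minimal sketch, with hypothetical names, is:

```python
from bisect import bisect_right

# Lower bounds of categories 1, 2 and 3; ages below 50 fall in category 0.
AGE_CUTS = (50, 60, 70)


def age_category(age):
    """Return 0 for age < 50, 1 for 50-59, 2 for 60-69, and 3 for 70 or older."""
    return bisect_right(AGE_CUTS, age)
```

Using `bisect_right` places a boundary age such as 50 or 60 in the higher category, matching the “\(\le\)” on the lower bound of each interval.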

There were 5481 subjects with either demographic or genetic data. In this study, a complete case analysis was conducted; only subjects with no missing data in sex, age, BMI, and genetic data were included in the analysis. Consequently, 5036 subjects (\(100\times 5036/5481=91.9\%\)) were included in the analysis. Note that BMI and SNPs are treated as continuous variables in the following analysis.

4.2 Analysis results

First, the participants’ demographic data were confirmed. Table 3 summarizes the mean (SD) for continuous parameters and the number of subjects (%) for categorical parameters.

Table 3 Demographic data

Regarding demographic data, there were some differences between the BMI categories. Therefore, there are concerns regarding the confounding effects of age and sex. In addition, the incidence of diabetes differs between the BMI categories.

Table 4 Association of genetic variants with BMI and diabetes

Table 4 summarizes the correlations between the two parameters. The correlation of SNP (instrumental variable) with BMI (a treatment variable) was quite small, raising a concern about the weak IV problem, as expected.

The estimated causal effects are summarized in Table 5. The result of the logistic regression (“Naive,” without using SNPs) has some obvious biases. The 2SRI estimate may be somewhat questionable when compared with the result of Hu et al. (2001) (risk ratio: 2.67). This is likely because the 2SRI estimates are unstable, as in the simulation results. The LIML result, however, may be more plausible, since LIML is less sensitive to the weak IV problem, as shown in the simulation results. The estimated causal effects for each sex are summarized in Table 6. Since the cohorts of males and females are different, a supplemental analysis was planned. Unfortunately, the estimates become more unstable because the sample sizes are smaller (male: 2366 and female: 2670). The 2SRI results are also unstable. Nevertheless, the direction of the causal effect is the same as in the main result. Given the results in Table 6, the main result (Table 5) is quite reasonable.

Table 5 Summary of estimates of the causal effects (point estimates (95%CI))
Table 6 Summary of estimates of the causal effects by each sex (point estimates (95%CI))

From the viewpoint of model selection, “Model 51” was selected for 2SRI and LIML (see also Appendix). According to Chen et al. (2018), there are interactions between age and BMI; however, the interaction model was not selected. According to the results of Chen et al. (2018), younger subjects may display stronger interaction effects than older subjects, whereas our data included only subjects aged 40 years or older. This may explain why the interaction term was not selected.

The above analyses have some limitations. First, as mentioned previously, only 19 of the 52 SNPs used by Cheng et al. (2019) were available for our data analysis; this is a critical limitation of this study and raises concerns about the weak IV problem. Second, the sample size was limited for the dbGaP data. To overcome the weak IV problem, a large sample size is necessary for Mendelian randomization (Burgess et al. 2017). Therefore, the derived result may be inefficient and should be interpreted with care.

5 Conclusion and future work

In this study, a binary outcome model with unmeasured covariates was considered. Two-stage residual inclusion (2SRI) can be applied in this situation; however, biased estimates may be obtained (Basu et al. 2017). Therefore, limited-information maximum likelihood (LIML), which has features similar to those of 2SRI, was also considered. Since model selection is important for estimating unbiased causal effects, the AIC and BIC for 2SRI and LIML were considered. According to the simulation results, LIML with the AIC or BIC works well compared with using full models when the unmeasured covariate distribution is specified correctly; in particular, overcoming the weak IV problem is an advantage of LIML. In contrast, 2SRI may yield biased or unstable estimates when there are only weak IVs or strong unobserved relationships between treatments and outcomes. 2SRI and LIML with the AIC were applied to the GENEVA Diabetes Study as Mendelian randomization. The results show that the causal effects are similar to those of previous research; however, there may be some concern about the weak IV problem. From the above, we recommend LIML with a model selection procedure as a good choice when there are binary outcomes and concerns about unmeasured covariates.

As mentioned, the results are significant contributions in cases of unmeasured covariates and nonlinear outcomes because there has been no research on model selection procedures when both the true treatment model and the true outcome model need to be specified. However, several future studies should be conducted. First, only a binary outcome was considered in this study. Because 2SRI considers a likelihood in the 2nd step and LIML considers a full-likelihood, the method can be expanded to more complex models, for instance, a more general outcome of an exponential family or a time-to-event outcome (Kianian et al. 2019 and Martínez-Camblor et al. 2019). In particular, LIML needs to consider likelihoods of both the outcome and treatment variables; however, the other restrictions are limited. For instance, LIML is not restricted to binary instrumental variables (Wang and Tchetgen 2018 and Kianian et al. 2019) or continuous treatment (Martínez-Camblor et al. 2019). Therefore, LIML with any model selection procedures has great potential expandability. Next, the impact of the misspecification of an unmeasured covariate distribution needs to be carefully confirmed. As the simulation results in Web Appendix C show, the impact may be limited; however, the estimation behavior in other cases is not clear. Therefore, it is necessary to continue with simulations to consider more varied situations.