1 Introduction

“Missing data” is a widespread issue in practical data analysis and frequently appears in many areas of science for various reasons, e.g., survey non-response, data collection conditions, and expensive or long-term experiments. Rubin (1976) formalized missing values through the concept of a missingness mechanism, in which the missingness indicators are treated as random variables described by a distribution. There are three main types of missingness mechanism: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR means that the occurrence of missing values is completely independent of both the variables with missing observations and the variables with no missing observations. MAR means that the missingness is related only to the variables with no missing observations, not to the variables with missing observations. MNAR, which is neither MAR nor MCAR, implies that the missingness is related to both observed and unobserved values. Missing data can lead to biased parameter estimates, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings (Dong and Peng 2013). Various methods for dealing with missing values in regression models have been proposed; see, e.g., Rubin (1976), Little (1992), Zhao and Lipsitz (1992), Wang et al. (2002), Lee et al. (2012), and Lukusa et al. (2016) for more details.
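
As a toy numerical illustration of the three mechanisms (the data-generating choices below are ours, not taken from the references), the following sketch draws missingness indicators for a variable \(X_1\) under MCAR, MAR (missingness depending only on a fully observed \(X_2\)), and MNAR (missingness depending on \(X_1\) itself), and shows the bias MNAR induces in the observed mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)   # variable subject to missingness
x2 = rng.normal(size=n)   # always-observed variable

p_mcar = np.full(n, 0.3)               # MCAR: constant probability
p_mar = 1 / (1 + np.exp(-x2))          # MAR: depends on observed x2 only
p_mnar = 1 / (1 + np.exp(-x1))         # MNAR: depends on x1 itself

miss_mcar = rng.uniform(size=n) < p_mcar
miss_mar = rng.uniform(size=n) < p_mar
miss_mnar = rng.uniform(size=n) < p_mnar

# Under MCAR the observed x1 values remain a representative subsample;
# under MNAR large x1 values are more likely to be missing, so the
# observed mean of x1 is biased downward.
mean_obs_mcar = x1[~miss_mcar].mean()
mean_obs_mnar = x1[~miss_mnar].mean()
```

This bias of the complete cases under a non-MCAR mechanism is exactly what the estimation methods discussed below aim to correct.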

Logistic regression is often applied in studies in which researchers investigate the relationship between a binary response variable and covariates (Hosmer et al. 2013). In practice, logistic regression analyses with one or more covariates MAR arise frequently and pose specific challenges. Several studies have addressed this issue. For instance, Lipsitz et al. (1998) derived a modified conditional logistic regression with covariates MAR. Wang et al. (1997) provided a weighted semiparametric estimation method to investigate the properties of regression parameter estimators when the selection probabilities are estimated by kernel smoothers. Wang et al. (2002) proposed the joint conditional likelihood (JCL) method to estimate the parameters of logistic regression with covariates MAR by combining complete-case (CC) (or validation) data, which include the cases without missing observations, and non-complete (or non-validation) data, which include the cases with missing observations. When both the outcome and covariates in logistic regression are MAR, Lee et al. (2012) presented two semiparametric estimation methods, the validation likelihood (VL) and JCL methods, to estimate the logistic regression parameters. Similarly, Hsieh et al. (2013) applied these approaches to estimate the parameters of logistic regression with the outcome and covariates MAR separately or simultaneously. Jiang et al. (2020) developed a stochastic approximation version of the EM (SAEM) algorithm, based on Metropolis-Hastings sampling, to perform statistical inference for the parameters of logistic regression with missing covariates, and compared their estimators with those of the random forest multiple imputation by chained equations (MICE) method from the mice package in R (Buuren and Groothuis-Oudshoorn 2011). Tran et al.
(2021) recently estimated the parameters of logistic regression with categorical/discrete covariates MAR separately or simultaneously via the JCL estimation method, which uses information from the CC data set and three non-complete data sets to improve estimation efficiency.

The results of Tran et al. (2021) show that, although the JCL estimation method outperforms the CC, semiparametric inverse probability weighting (SIPW), and VL estimation methods, its calculations are more complex and, hence, require longer computing time. In addition, the estimators of the SAEM approach outperform those of the MICE method, but the SAEM approach also takes longer to compute. Moreover, Jiang et al. (2020) show that the MICE estimators underestimate the parameters even when the variables are MCAR. We are therefore highly motivated to develop other estimation methods for logistic regression with covariates MAR separately or simultaneously that are not only comparable to the JCL and MICE estimation methods in terms of efficiency but also simpler and faster to compute.

In this work, based on the ideas of Fay (1996), Wang and Chen (2009), and Lee et al. (2016, 2020), we develop two different types of MI methods to estimate the parameters of logistic regression with two covariate vectors MAR separately or simultaneously, under the assumption that all covariates and surrogates are categorical/discrete. We also compare the two proposed MI methods with the JCL method of Tran et al. (2021) and the MICE method in terms of estimation efficiency and computing time. Our proposed MI methods are two-step procedures following the suggestion of Fay (1996), which makes the calculation simpler and faster than the three-step procedure of Rubin (1987). Firstly, each proposed MI method uses three empirical conditional distribution functions (CDFs) (Wang and Chen 2009) to generate random values for imputing missing data: the first type MI (MI1) method uses only the CC data, whereas the second type MI (MI2) method uses both the CC and non-complete data. Secondly, the estimating equations are solved to obtain estimates of the logistic regression parameters (Fay 1996). These estimating equations are more convenient in practice because they are solved only once, rather than combining M estimating equations, where M is the number of imputations, to obtain pooled estimates as done in Rubin (1987); this shortens computing time. The formulas of Lee et al. (2016, 2020) are applied to estimate the variances of the two proposed MI estimators to improve efficiency in estimation.

Section 2 presents the assumptions and notations used throughout this work. Section 3 reviews the SIPW, VL, JCL, and MICE estimation methods. Two different types of MI estimation methods are proposed in Sect. 4. In Sect. 5, the finite-sample performance of the proposed methods is investigated through extensive simulations under various settings, and two real data sets are used to demonstrate their practical use. Section 6 presents some discussion and conclusions.

2 Assumptions and notations

Let Y be a binary outcome variable denoting whether an event of interest occurs, where \(Y=1\) if the event occurs and \(Y=0\) otherwise. Suppose that \({{{\mathbf {\mathtt{{X}}}}}}_1=(X_1,X_2,\dots ,X_{r_1})^T\) is a vector of \(r_1\) categorical/discrete covariates, and \({{{\mathbf {\mathtt{{X}}}}}}_2=(X_{r_1+1},X_{r_1+2},\dots ,X_p)^T\) is a vector of \(r_2\) categorical/discrete covariates, where \(p=r_1+r_2\). It is assumed that \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) may be missing separately or simultaneously. In this work, we do not consider the case where only some covariates within \({{{\mathbf {\mathtt{{X}}}}}}_1\) or \({{{\mathbf {\mathtt{{X}}}}}}_2\) have missing observations. Thus, when we say that \({{{\mathbf {\mathtt{{X}}}}}}_s\), \(s=1,2\), is missing, we mean that all covariates in \({{{\mathbf {\mathtt{{X}}}}}}_s\) have missing observations simultaneously. Assume that \(\varvec{Z}=(Z_1,Z_2,\dots ,Z_q)^T\) is a vector of q categorical/discrete covariates that are always observed. Let \(\varvec{X}=({{{\mathbf {\mathtt{{X}}}}}}_1^T,{{{\mathbf {\mathtt{{X}}}}}}_2^T)^T\), \({\mathcal {X}}=(1,\varvec{X}^T,\varvec{Z}^T)^T\), and let \(\{(Y_i,{\mathcal {X}}_i): i=1,2,\dots ,n\}\) be a random sample. Assume that \(\varvec{X}\) is MAR. The logistic regression model is considered as follows:

$$\begin{aligned} P(Y_i=1|\varvec{X}_i,\varvec{Z}_i) =H(\beta _0+\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1i}+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2i}+\varvec{\beta }_3^T\varvec{Z}_i) =H({\varvec{\beta }}^T{\mathcal {X}}_i),\ \ i=1,2,\dots ,n, \end{aligned}$$
(1)

where \(H(u)=\{1+\exp (-u)\}^{-1}\) and \(\varvec{\beta }=(\beta _0,\varvec{\beta }_1^T,\varvec{\beta }_2^T,\varvec{\beta }_3^T)^T\) is a vector of parameters associated with \({\mathcal {X}}_i\). The main goal is to estimate \(\varvec{\beta }\) when some of the \(\varvec{X}_i\)s are MAR. Let \(\delta _{ij}\), \(j=1,2,3,4\), denote the missingness statuses of \(\varvec{X}_i=({{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T)^T\), in which \(\delta _{i1}=1\) if both \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are observed, and 0 otherwise; \(\delta _{i2}=1\) if \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) is missing and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) is observed, and 0 otherwise; \(\delta _{i3}=1\) if \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) is observed and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) is missing, and 0 otherwise; \(\delta _{i4}=1\) if both \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are missing, and 0 otherwise. Some studies use a surrogate variable, i.e., an easily measured variable that stands in for one that is difficult or impossible to measure; employing a surrogate for an MAR variable adds information to the model and, hence, enhances the performance of estimation and prediction. See, e.g., Wang et al. (1997, 2002), Hsieh et al. (2010, 2013), and Lee et al. (2011, 2012, 2020) for more details. We also consider the possibility of categorical/discrete surrogate vectors \({{{\mathbf {\mathtt{{W}}}}}}_1\) and \({{{\mathbf {\mathtt{{W}}}}}}_2\) for \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\), respectively, such that \({{{\mathbf {\mathtt{{W}}}}}}_1\) and \({{{\mathbf {\mathtt{{W}}}}}}_2\) are dependent on \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\), respectively, and independent of Y given \(\varvec{X}\) and \(\varvec{Z}\).
That is, \({{{\mathbf {\mathtt{{W}}}}}}_1\) is correlated with \({{{\mathbf {\mathtt{{X}}}}}}_1\), and \({{{\mathbf {\mathtt{{W}}}}}}_2\) with \({{{\mathbf {\mathtt{{X}}}}}}_2\). Hence, we have the logistic regression model \(P(Y_i=1|{{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{Z}_i,{{{\mathbf {\mathtt{{W}}}}}}_{1i},{{{\mathbf {\mathtt{{W}}}}}}_{2i})=P(Y_i=1|{{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{Z}_i) =H(\beta _0+\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1i}+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2i}+\varvec{\beta }_3^T\varvec{Z}_i)\) as given in (1). Let \(\varvec{W}=({{{\mathbf {\mathtt{{W}}}}}}_1^T,{{{\mathbf {\mathtt{{W}}}}}}_2^T)^T\) and \(\varvec{V}_i=(\varvec{Z}_i^T,\varvec{W}_i^T)^T\), \(i=1,\dots ,n\). The CC data set (\(\delta _{i1}=1\)) consists of \((Y_i,\varvec{X}_i,\varvec{V}_i)\), and the three non-complete data sets consist of \((Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i)\), \((Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i)\), and \((Y_i,\varvec{V}_i)\) when \(\delta _{i2}\), \(\delta _{i3}\), and \(\delta _{i4}\) equal 1, respectively. Under the assumption that \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are MAR (Rubin 1976), the selection probability model is

$$\begin{aligned} P(\delta _{ij}=1|Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{Z}_i,\varvec{W}_i)=\pi _j(Y_i,\varvec{Z}_i,\varvec{W}_i) =\pi _j(Y_i,\varvec{V}_i), \ j=1,2,3,4, \end{aligned}$$
(2)

with \(\sum _{j=1}^{4}\pi _j(Y_i,\varvec{V}_i)=1\). The \(\pi _j(Y_i,\varvec{V}_i)\)s are unknown nuisance parameters that need to be estimated. In this study, under the assumption that the \(\varvec{V}_i\)s are categorical/discrete vectors, the nonparametric estimators of \(\pi _j(Y_i,\varvec{V}_i)\) are given as follows:

$$\begin{aligned} {\widehat{\pi }}_j(Y_i,\varvec{V}_i)=\frac{\sum _{k=1}^n\delta _{kj}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^nI(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)},\ j=1,2,3,4, \end{aligned}$$
(3)

where \(I(\cdot )\) is an indicator function. Note that in this work the requirement of categorical/discrete covariates is for mathematical derivation purposes. The \({\widehat{\pi }}_1(Y_i,\varvec{V}_i)\)s serve as weights for the SIPW and VL estimation methods, and the \({\widehat{\pi }}_j(Y_i,\varvec{V}_i)\)s, \(j=1,2,3,4\), serve as weights for the modified conditional probabilities of the JCL estimation method in Sect. 3. If the covariates are continuous, one can instead use kernel estimation following the arguments of Wang and Wang (2001).
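
With categorical \((Y,\varvec{V})\), the estimator in (3) is simply a cell-wise proportion: within each observed \((Y,\varvec{V})\) cell, \({\widehat{\pi }}_j\) is the fraction of cases in that cell with \(\delta _{j}=1\). A minimal sketch (function and variable names are ours):

```python
import numpy as np

def pi_hat(delta_j, y, v):
    """Nonparametric estimator (3) of P(delta_j = 1 | Y, V) for discrete Y, V.

    delta_j : (n,) 0/1 indicators of missingness pattern j
    y       : (n,) binary outcomes
    v       : (n, q) always-observed discrete covariates
    Returns an (n,) array with the estimated probability of each case's cell.
    """
    delta_j, y, v = np.asarray(delta_j), np.asarray(y), np.asarray(v)
    cells = np.column_stack([y, v])          # one (Y, V) row per case
    keys, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)            # guard against NumPy version differences
    # numerator and denominator of (3), accumulated per cell
    num = np.bincount(inverse, weights=delta_j.astype(float), minlength=len(keys))
    den = np.bincount(inverse, minlength=len(keys))
    return (num / den)[inverse]

# toy check: delta depends only on y
y = np.array([1, 1, 1, 0, 0, 0])
v = np.array([[1], [1], [1], [1], [1], [1]])
delta1 = np.array([1, 1, 0, 1, 0, 0])
probs = pi_hat(delta1, y, v)   # 2/3 in the y=1 cell, 1/3 in the y=0 cell
```

The same function applied to each \(\delta _{ij}\), \(j=1,2,3,4\), yields all four sets of estimated selection probabilities.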

3 Review of estimation methods

This section briefly reviews four well-known estimation methods, the SIPW, VL, JCL, and MICE methods, for logistic regression with covariates MAR separately or simultaneously. Their estimation performance is compared with that of the proposed approaches in the following sections.

3.1 SIPW estimation method

Horvitz and Thompson (1952) proposed a weighted estimator that uses inverse probability weighting (IPW) to reduce estimation bias, now known as the H-T estimator. When the selection probabilities are known, Zhao and Lipsitz (1992) extended the H-T estimator to an IPW estimator with improved efficiency in estimation. This approach is, however, limited in practice because the selection probabilities are usually unknown. Therefore, some authors, e.g., Wang et al. (1997) and Wang and Wang (2001), suggested the SIPW approach, which uses nonparametric estimators of the unknown selection probabilities as inverse weights; see, e.g., Hsieh et al. (2010) and Lee et al. (2012) for further details. Considering the logistic regression model (1) when \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are MAR separately or simultaneously, one can obtain the SIPW estimator \(\widehat{\varvec{\beta }}_W\) of \(\varvec{\beta }\) by solving the following estimating equations:

$$\begin{aligned} \varvec{U}_W(\varvec{\beta },\widehat{\varvec{\pi }}_1) =\dfrac{1}{\sqrt{n}}\sum _{i=1}^{n}\dfrac{\delta _{i1}}{{\widehat{\pi }}_1(Y_i,\varvec{V}_i)}{\mathcal {X}}_i(Y_i-H({\varvec{\beta }}^T{\mathcal {X}}_i))=\varvec{0}, \end{aligned}$$
(4)

where \(\widehat{\varvec{\pi }}_1=\left( {\widehat{\pi }}_{11},\dots ,{\widehat{\pi }}_{1n}\right)\) for \({\widehat{\pi }}_{1i}={\widehat{\pi }}_1(Y_i,\varvec{V}_i)\), given in (3), being the estimator of \(\pi _{1i}=\pi _1(Y_i,\varvec{V}_i)\) defined in (2).
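​
Solving (4) amounts to fitting a weighted logistic regression on the complete cases with weights \(\delta _{i1}/{\widehat{\pi }}_1(Y_i,\varvec{V}_i)\). A hedged sketch via Newton-Raphson (all names are ours; x includes the intercept column):

```python
import numpy as np

def sipw_logistic(x, y, delta1, pi1_hat, n_iter=25):
    """Solve the SIPW estimating equations (4) by Newton-Raphson.

    x       : (n, p) design matrix including the intercept column
    y       : (n,) binary outcomes
    delta1  : (n,) complete-case indicators
    pi1_hat : (n,) estimated selection probabilities pi_1(Y_i, V_i)
    """
    w = delta1 / pi1_hat                    # inverse-probability weights
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-x @ beta))    # H(beta^T X_i)
        score = x.T @ (w * (y - mu))        # weighted score of (4)
        hess = x.T @ (x * (w * mu * (1 - mu))[:, None])
        beta = beta + np.linalg.solve(hess, score)
    return beta
```

When there are no missing data (all \(\delta _{i1}=1\) and \({\widehat{\pi }}_1\equiv 1\)), this reduces to the ordinary logistic maximum likelihood estimator.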

3.2 VL estimation method

Breslow and Cain (1988) proposed a conditional maximum likelihood (ML) approach to estimate the parameters of logistic regression for two-stage case-control data. They showed that their estimator of \(\varvec{\beta }\) is not only consistent and asymptotically normal but also useful when covariate information is missing for a large part of the sample. Wang et al. (2002), Lee et al. (2012), and Hsieh et al. (2013) applied this approach to develop the VL estimation method, which uses the CC data, for the issue of missing data in logistic regression. When \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are MAR separately or simultaneously, Tran et al. (2021) estimated \(\varvec{\beta }\) by using the following estimating equations:

$$\begin{aligned} \widehat{\varvec{U}}_V(\varvec{\beta }) =\dfrac{1}{\sqrt{n}}\sum _{i=1}^{n}\delta _{i1} {\mathcal {X}}_i(Y_i-{\widehat{H}}_1({{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta })) =\varvec{0}, \end{aligned}$$
(5)

where

$$\begin{aligned} {\widehat{H}}_1({{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }) =H\left( \beta _0+\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1i}+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2i}+\varvec{\beta }_3^T\varvec{Z}_i +\ln \frac{{\widehat{\pi }}_1(1,\varvec{V}_i)}{{\widehat{\pi }}_1(0,\varvec{V}_i)}\right) \end{aligned}$$
(6)

and \({\widehat{\pi }}_1(Y_i,\varvec{V}_i)\) is given in (3). One can solve \(\widehat{\varvec{U}}_V(\varvec{\beta })=\varvec{0}\) to obtain the VL estimator \(\widehat{\varvec{\beta }}_V\) of \(\varvec{\beta }\).
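
Equation (5) is an ordinary logistic score on the complete cases except that each case carries the fixed offset \(\ln \{{\widehat{\pi }}_1(1,\varvec{V}_i)/{\widehat{\pi }}_1(0,\varvec{V}_i)\}\) inside \(H(\cdot )\). A sketch of a Newton solver with such an offset (names are ours):

```python
import numpy as np

def vl_logistic(x, y, offset, n_iter=25):
    """Solve the VL estimating equations (5): a logistic fit on the complete
    cases with a fixed per-case offset log(pi1_hat(1,V)/pi1_hat(0,V)).

    x      : (m, p) design matrix of the complete cases (intercept included)
    y      : (m,) outcomes of the complete cases
    offset : (m,) fixed offsets added inside H(.)
    """
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-(x @ beta + offset)))   # H_1 of (6)
        score = x.T @ (y - mu)
        hess = x.T @ (x * (mu * (1 - mu))[:, None])
        beta = beta + np.linalg.solve(hess, score)
    return beta
```

A quick sanity check of the offset construction: adding a constant offset c leaves the slope estimates unchanged and shifts the fitted intercept by exactly \(-c\).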

3.3 JCL estimation method

The SIPW and VL approaches use only the CC data set (\(\delta _{i1}=1\)) and, hence, may not achieve maximal efficiency in estimation. To overcome this drawback, Wang et al. (2002) proposed the JCL estimation method, which combines the CC and non-complete data; see, e.g., Lee et al. (2012) and Hsieh et al. (2013) for more details. Tran et al. (2021) extended the JCL method to estimate the parameters of logistic regression with covariates MAR separately or simultaneously.

When \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are MAR separately or simultaneously, one can obtain the JCL estimator \(\widehat{\varvec{\beta }}_J=({\widehat{\beta }}_0,\widehat{\varvec{\beta }}_{J1}^T,\widehat{\varvec{\beta }}_{J2}^T,\widehat{\varvec{\beta }}_{J3}^T)^T\) of \(\varvec{\beta }\) in the logistic regression model (1) by solving the following estimating equations:

$$\begin{aligned} \widehat{\varvec{U}}_J(\varvec{\beta })&=\dfrac{1}{\sqrt{n}}\sum _{i=1}^{n}\big \{\delta _{i1} {\mathcal {X}}_i(Y_i-{\widehat{H}}_1({{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta })) +\delta _{i2}\widehat{{\mathcal {A}}}({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }_1)(Y_i-{\widehat{H}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }))\nonumber \\&\quad +\delta _{i3}\widehat{{\mathcal {B}}}({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2)(Y_i -{\widehat{H}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }))+\delta _{i4} \widehat{{\mathcal {T}}}(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2)(Y_i-{\widehat{H}}_4(\varvec{V}_i;\varvec{\beta }))\big \} \nonumber \\&=\varvec{0}, \end{aligned}$$
(7)

where \({\widehat{H}}_1({{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\,\varvec{\beta })\) is defined in (6), and

$$\begin{aligned} {\widehat{H}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\,\varvec{\beta })&=H\left( \beta _0+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2i}+\varvec{\beta }_3^T\varvec{Z}_i +{\widehat{R}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }_1)+\ln \frac{{\widehat{\pi }}_2(1,\varvec{V}_i)}{{\widehat{\pi }}_2(0,\varvec{V}_i)}\right) , \end{aligned}$$
(8)
$$\begin{aligned} {\widehat{H}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\,\varvec{\beta })&=H\left( \beta _0+\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1i}+\varvec{\beta }_3^T\varvec{Z}_i +{\widehat{R}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2)+\ln \frac{{\widehat{\pi }}_3(1,\varvec{V}_i)}{{\widehat{\pi }}_3(0,\varvec{V}_i)}\right) , \end{aligned}$$
(9)
$$\begin{aligned} {\widehat{H}}_4(\varvec{V}_i;\,\varvec{\beta })&=H\left( \beta _0+\varvec{\beta }_3^T\varvec{Z}_i+{\widehat{R}}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2) +\ln \frac{{\widehat{\pi }}_4(1,\varvec{V}_i)}{{\widehat{\pi }}_4(0,\varvec{V}_i)}\right) , \end{aligned}$$
(10)
$$\begin{aligned} \widehat{{\mathcal {A}}}({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\,\varvec{\beta }_1)&=\left( 1,{\widehat{R}}_2^{(\varvec{\beta }_1)}({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i; \varvec{\beta }_1),{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}_i^T\right) ^T, \end{aligned}$$
(11)
$$\begin{aligned} \widehat{{\mathcal {B}}}({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\,\varvec{\beta }_2)&=\left( 1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{\widehat{R}}^{(\varvec{\beta }_2)}_3 ({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2),\varvec{Z}_i^T\right) ^T, \end{aligned}$$
(12)
$$\begin{aligned} \widehat{{\mathcal {T}}}(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2)&=\left( 1,{\widehat{R}}_4^{(\varvec{\beta }_1)}(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2), {\widehat{R}}^{(\varvec{\beta }_2)}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2),\varvec{Z}_i^T\right) ^T, \end{aligned}$$
(13)

for

$$\begin{aligned} {\widehat{R}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }_1)&=\ln \dfrac{\sum _{j=1}^{n}\delta _{j1} e^{\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1j}}I(Y_j=0,{{{\mathbf {\mathtt{{X}}}}}}_{2j}={{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_j=\varvec{V}_i)}{\sum _{k=1}^{n}\delta _{k1}I(Y_k=0,{{{\mathbf {\mathtt{{X}}}}}}_{2k}={{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_k=\varvec{V}_i)}, \end{aligned}$$
(14)
$$\begin{aligned} {\widehat{R}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2)&=\ln \dfrac{\sum _{j=1}^{n}\delta _{j1} e^{\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2j}}I(Y_j=0,{{{\mathbf {\mathtt{{X}}}}}}_{1j}={{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_j=\varvec{V}_i)}{\sum _{k=1}^{n}\delta _{k1}I(Y_k=0,{{{\mathbf {\mathtt{{X}}}}}}_{1k}={{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_k=\varvec{V}_i)}, \end{aligned}$$
(15)
$$\begin{aligned} {\widehat{R}}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2)&=\ln \dfrac{\sum _{j=1}^{n}\delta _{j1} e^{\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1j}+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2j}}I(Y_j=0,\varvec{V}_j=\varvec{V}_i)}{\sum _{k=1}^{n}\delta _{k1}I(Y_k=0,\varvec{V}_k=\varvec{V}_i)}, \end{aligned}$$
(16)

and

$$\begin{aligned} {\widehat{R}}_2^{(\varvec{\beta }_1)}({{{\mathbf {\mathtt{{X}}}}}}_{2i},V_i;\varvec{\beta }_1)&=\dfrac{\partial }{\partial \varvec{\beta }_1^T} {\widehat{R}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }_1),\ {\widehat{R}}_3^{(\varvec{\beta }_2)}({{{\mathbf {\mathtt{{X}}}}}}_{1i},V_i;\varvec{\beta }_2)=\dfrac{\partial }{\partial \varvec{\beta }_2^T} {\widehat{R}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2),\\ {\widehat{R}}_4^{(\varvec{\beta }_1)}(V_i;\varvec{\beta }_1,\varvec{\beta }_2)&=\dfrac{\partial }{\partial \varvec{\beta }_1^T} {\widehat{R}}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2),\ {\widehat{R}}_4^{(\varvec{\beta }_2)}(V_i;\varvec{\beta }_1,\varvec{\beta }_2)=\dfrac{\partial }{\partial \varvec{\beta }_2^T} {\widehat{R}}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2). \end{aligned}$$
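
Each \({\widehat{R}}\) term in (14)-(16) is the logarithm of a cell-wise average of \(e^{\varvec{\beta }^T{{{\mathbf {\mathtt{{X}}}}}}}\) over the complete cases in the matching cell, which is part of what makes the JCL computations heavy. For instance, \({\widehat{R}}_2\) in (14) can be sketched for a single non-complete case as (names are ours):

```python
import numpy as np

def r2_hat(beta1, x1_cc, mask_cc):
    """Estimate R_2(X_2i, V_i; beta_1) of (14) for one non-complete case.

    beta1   : (r1,) current value of beta_1
    x1_cc   : (m, r1) observed X_1 values among the complete cases
    mask_cc : (m,) True for complete cases j with Y_j = 0 and
              (X_{2j}, V_j) equal to the (X_2i, V_i) of the target case
    """
    vals = np.exp(x1_cc[mask_cc] @ beta1)
    return np.log(vals.mean())   # log of the cell average of e^{beta_1^T X_1}
```

Since the \({\widehat{R}}\) terms and their derivatives must be re-evaluated for every case at every iteration, this illustrates why the JCL method is computationally demanding.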

3.4 MICE estimation method

MI (Rubin 1987, 1996) is a simple and powerful method for dealing with missing data and is available in several commonly used statistical software packages. One can use MI approaches to generate fully imputed (“completed”) data sets by retaining the observed values and replacing the missing data with plausible values from an imputation model. There are various extensions of the MI method of Rubin (1987), which differ mostly in the structural assumptions of the imputation model; see, e.g., Rubin (1987, 1996), Rubin and Schenker (1986), Fay (1996), and Pahel et al. (2011) for details. Their procedures often follow three basic steps, shown in Fig. 1A: Step 1) Imputation: impute missing values M times to obtain M imputed (“completed”) data sets; Step 2) Analysis: analyze each of the M imputed (“completed”) data sets by using the chosen statistical methods; Step 3) Pooling: combine the M analysis results from Step 2 into one result by using the formula of Rubin (1987). MI methods differ mainly in how they impute the missing values in Step 1, e.g., mean, regression, hot-deck, cold-deck, principal-component, and chained-equation imputation (Little and Rubin 2019).

The mice package (Buuren and Groothuis-Oudshoorn 2011) is a popular package that performs MI by chained equations, also known as fully conditional specification or sequential regression MI, in Step 1. Specifically, assume \(\varvec{u}=(u_1,u_2,\dots ,u_k)\) is a vector of k variables with missing observations. Firstly, all missing values are filled in by simple random sampling with replacement from the observed values of each variable. Secondly, construct a regression model of \(u_1\) on \(u_2,\dots ,u_k\) and replace the missing values of \(u_1\) with simulated draws from the corresponding posterior predictive distribution. Then, regress \(u_2\) on the imputed \(u_1\) and on \(u_3,\dots ,u_k\), and again replace the missing values of \(u_2\) with simulated draws from the corresponding posterior predictive distribution. Repeat this process for each of \(u_3,\dots ,u_k\); this constitutes one “cycle”. Repeat the cycle several times (e.g., 10 or 20) to create the first imputed (“completed”) data set, and repeat the whole procedure M times to obtain M imputed (“completed”) data sets. Steps 2 and 3 of this MI method follow the rule of Rubin (1987).
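
A stripped-down sketch of this chained-equations cycle for two incomplete continuous variables, using simple linear regressions with a normal noise draw in place of mice's richer posterior predictive imputation models (purely illustrative; all names and modeling choices are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
u1 = rng.normal(size=n)
u2 = 0.8 * u1 + rng.normal(scale=0.5, size=n)
m1 = rng.uniform(size=n) < 0.3   # missingness masks for u1, u2
m2 = rng.uniform(size=n) < 0.3

def regress_impute(target, predictor, miss, rng):
    """Refill the missing entries of target by a noisy draw from a
    linear regression of target on predictor (observed cases only)."""
    obs = ~miss
    A = np.column_stack([np.ones(obs.sum()), predictor[obs]])
    coef, *_ = np.linalg.lstsq(A, target[obs], rcond=None)
    resid_sd = np.std(target[obs] - A @ coef)
    out = target.copy()
    pred = coef[0] + coef[1] * predictor[miss]
    out[miss] = pred + rng.normal(scale=resid_sd, size=miss.sum())
    return out

# initial fill: random sampling with replacement from observed values
v1 = u1.copy()
v1[m1] = rng.choice(u1[~m1], size=m1.sum())
v2 = u2.copy()
v2[m2] = rng.choice(u2[~m2], size=m2.sum())

for _ in range(10):   # one "cycle", repeated 10 times
    v1 = regress_impute(v1, v2, m1, rng)
    v2 = regress_impute(v2, v1, m2, rng)
```

Running this whole procedure M times with different random draws would produce the M imputed data sets of Step 1; the observed entries are never altered.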

Fig. 1 The main steps of multiple imputation. A Rubin’s type. B Fay’s type

In summary, this section has introduced the four estimation methods used for benchmark comparisons in this work. Although the simulation results of Tran et al. (2021) showed that the JCL method outperforms the CC, VL, and SIPW methods, it is relatively difficult to obtain the JCL estimates of the logistic regression parameters because of its complex calculations and long computing time. In addition, the MICE approach can be applied to both continuous and categorical/discrete data and any missingness mechanism (White et al. 2011), but its estimators for logistic regression with missing covariates underestimate the parameters, even though their standard errors are small (Jiang et al. 2020). Moreover, its procedure can be computationally time-consuming. Therefore, in Sect. 4, we propose two different types of MI methods that not only provide estimation efficiency comparable with the JCL and MICE approaches but also involve more straightforward computations that shorten computing time.

4 Proposed MI estimation methods

This section introduces two different types of MI methods based on the ideas of Fay (1996), Wang and Chen (2009), and Lee et al. (2016, 2020). The procedure of these two proposed MI methods has two steps (Fay’s type, Fig. 1B): Step 1) Impute values for the non-complete data by using the empirical CDFs of the missing values given the observed data, as done in Wang and Chen (2009), to obtain M imputed (“completed”) data sets; Step 2) Solve the estimating equations only once to obtain the estimates of the logistic regression parameters. The estimated variances of the MI estimators are then obtained by using the formulas of Lee et al. (2016, 2020). The main difference between the two MI methods lies in how missing data are imputed in Step 1. As mentioned in Sect. 1, this procedure is more convenient in practice and saves computing time. The details of the two MI methods are given in the following sections.

4.1 Type 1 MI (MI1) method

Let \(F_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \(F_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(F_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\), where \({\mathbf {x}}=({{{\mathbf {\mathtt{{x}}}}}}_1^T,{{{\mathbf {\mathtt{{x}}}}}}_2^T)^T\), be the CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(\varvec{X}_i=({{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T)^T\) given \((Y_i,\varvec{V}_i)\), respectively. To build the MI1 method, we consider the following empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(\varvec{X}_i\) given \((Y_i,\varvec{V}_i)\):

$$\begin{aligned} {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{2k}={{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{2s}={{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_s=\varvec{V}_i)}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1), \nonumber \\ {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{1k}={{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{1s}={{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_s=\varvec{V}_i)}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2), \nonumber \\ {\widetilde{F}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n} \delta _{s1}I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)}\right)\, I(\varvec{X}_k\le {\mathbf {x}}), \end{aligned}$$
(17)

respectively. When \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are MAR separately or simultaneously, their missing values are imputed several times by random values generated from \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \({\widetilde{F}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\). The MI1 procedure is summarized as follows:

Step 1.:

Imputation: Generate the vth imputed (“completed”) data (\(v=1,2,\dots ,M\)) based on the missingness status of \(\varvec{X}_i=({{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T)^T\), \(i=1,2,\dots ,n\).

i):

If \(\delta _{i1}=1\) (\({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are observed), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Set \({\mathcal {X}}_i=(1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}_i^T)^T\) for all v.

ii):

If \(\delta _{i2}=1\) (\({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) is missing and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) is observed), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) and generate \(\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}_{1iv}\) from \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\). Define \(\widetilde{{\mathcal {X}}}_{2iv}=(1,\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}^T_{1iv},{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}^T_i)^T\).

iii):

If \(\delta _{i3}=1\) (\({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) is observed and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) is missing), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and generate \(\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}_{2iv}\) from \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Define \(\widetilde{{\mathcal {X}}}_{3iv}=(1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}^T_{2iv},\varvec{Z}^T_i)^T\).

iv):

If \(\delta _{i4}=1\) (both \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are missing), generate \(\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}_{1iv}\) and \(\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}_{2iv}\) from \({\widetilde{F}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Define \(\widetilde{{\mathcal {X}}}_{4iv}=(1,\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}^T_{1iv},\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}^T_{2iv},\varvec{Z}^T_i)^T\).

Step 2.:

Analysis: Solve the following estimating equations:

$$\begin{aligned} \varvec{U}_{M1}(\varvec{\beta }) =\dfrac{1}{\sqrt{n}} \sum _{i=1}^{n}\left( \delta _{i1}\varvec{S}_i(\varvec{\beta })+\delta _{i2} \widetilde{\varvec{S}}_{2i}(\varvec{\beta })+\delta _{i3}\widetilde{\varvec{S}}_{3i}(\varvec{\beta })+\delta _{i4}\widetilde{\varvec{S}}_{4i}(\varvec{\beta })\right) =\varvec{0}, \end{aligned}$$
(18)

where \(\varvec{S}_i(\varvec{\beta })={\mathcal {X}}_i(Y_i-H(\varvec{\beta }^T{\mathcal {X}}_i))\), \(\widetilde{\varvec{S}}_{ki}(\varvec{\beta })=M^{-1}\sum _{v=1}^{M}\widetilde{{\mathcal {X}}}_{kiv}(Y_i-H(\varvec{\beta }^T\widetilde{{\mathcal {X}}}_{kiv}))\), \(k=2,3,4\), to obtain the MI1 estimator, \(\widehat{\varvec{\beta }}_{M1}\), of \(\varvec{\beta }\). Next, calculate the estimated variance of \(\widehat{\varvec{\beta }}_{M1}\), \(\widehat{{\,\mathrm{Var}\,}}(\widehat{\varvec{\beta }}_{M1})\), by the formula of Lee et al. (2016), which is also a Rubin-type estimated variance (Rubin 1987), as follows:

$$\begin{aligned} \varvec{G}_{M1}^{-1}(\widehat{\varvec{\beta }}_{M1})\left\{ \dfrac{1}{M}\sum _{v=1}^{M}\sum _{i=1}^{n} (\widetilde{\varvec{U}}_{vi}(\widehat{\varvec{\beta }}_{M1}))^{\otimes 2} +\left( 1+\dfrac{1}{M}\right) \dfrac{\sum _{v=1}^{M}(\widetilde{\varvec{U}}_{v} (\widehat{\varvec{\beta }}_{M1}))^{\otimes 2}}{M-1}\right\} (\varvec{G}_{M1}^{-1}(\widehat{\varvec{\beta }}_{M1}))^T, \end{aligned}$$
(19)

where \(\varvec{G}_{M1}(\varvec{\beta })\) is the gradient of \(-M^{-1}\sum _{v=1}^{M}\widetilde{\varvec{U}}_{v}(\varvec{\beta })=-\varvec{U}_{M1}(\varvec{\beta })\), for

$$\begin{aligned} \widetilde{\varvec{U}}_{vi}(\varvec{\beta })&=\dfrac{1}{\sqrt{n}}\left( \delta _{i1}\varvec{S}_i(\varvec{\beta }) +\delta _{i2}\widetilde{\varvec{S}}_{2i}(\varvec{\beta })+\delta _{i3}\widetilde{\varvec{S}}_{3i}(\varvec{\beta })+\delta _{i4}\widetilde{\varvec{S}}_{4i}(\varvec{\beta })\right) , \\ \widetilde{\varvec{U}}_{v}(\varvec{\beta })&=\sum _{i=1}^{n}\widetilde{\varvec{U}}_{vi}(\varvec{\beta }). \end{aligned}$$

Notice that in Step 1, the imputed data sets are generated by random sampling from the empirical CDFs based on the CC data. The indicator variables \(\delta _{ij}\), \(j=1,2,3,4\), \(i=1,2,\dots ,n\), identify exactly which part of the partitioned covariate vector is observed and can therefore serve as conditioning information for the empirical CDFs. For example, when \(\delta _{i2}=1\) (\(\delta _{i3}=1\)), the observed \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) (\({{{\mathbf {\mathtt{{X}}}}}}_{1i}\)), \(Y_i\), and \(\varvec{V}_i\) are used as the condition to create a set of values for the missing \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) (\({{{\mathbf {\mathtt{{X}}}}}}_{2i}\)); when \(\delta _{i4}=1\), both \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are missing, so only \(Y_i\) and \(\varvec{V}_i\) are used as the condition to create a set of values for both missing covariates. In other words, the procedure exploits the exact, and thus richer, information from the observed part of the covariate vector for missing data imputation, and, hence, estimation may be more efficient. In addition, because this method has only two steps and solves the estimating equations only once, it shortens computing time.
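As an illustration, the two-step MI1 procedure above can be sketched for discrete covariates as follows. This is a minimal sketch, not the authors' implementation: the data-generating values and the scalar conditioning variable are assumed for illustration, each incomplete case is imputed M times by sampling from matching complete cases (the empirical conditional CDF), and the estimating equation (18) is solved by Newton–Raphson. The fallback to the full complete-case pool when no exact match exists is an added practical safeguard, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

# --- simulated data with discrete covariates (as MI1 requires) ---
n, M = 600, 20
x1 = rng.choice([-0.3, 0.4, 1.0], size=n)
x2 = rng.choice([-1.0, 0.2, 0.6], size=n)
z = rng.binomial(1, 0.4, size=n)
beta_true = np.array([1.0, -0.5, 1.0, 0.7])       # (intercept, x1, x2, z)
y = rng.binomial(1, sigmoid(np.column_stack([np.ones(n), x1, x2, z]) @ beta_true))
# missingness status: 1 = complete, 2 = x1 missing, 3 = x2 missing, 4 = both
status = rng.choice([1, 2, 3, 4], size=n, p=[0.6, 0.15, 0.15, 0.1])
cc = status == 1                                  # complete cases

def draw(mask, m):
    """Draw m donor indices from matching complete cases (empirical
    conditional CDF); fall back to all complete cases if no match."""
    pool = np.flatnonzero(mask & cc)
    if pool.size == 0:
        pool = np.flatnonzero(cc)
    return rng.choice(pool, size=m)

# --- Step 1 (imputation): M completed design matrices, shape (n, M, 4) ---
X = np.empty((n, M, 4))
for i in range(n):
    x1i, x2i = np.full(M, x1[i]), np.full(M, x2[i])
    if status[i] == 2:      # impute x1 given (x2, y, z)
        k = draw((x2 == x2[i]) & (y == y[i]) & (z == z[i]), M)
        x1i = x1[k]
    elif status[i] == 3:    # impute x2 given (x1, y, z)
        k = draw((x1 == x1[i]) & (y == y[i]) & (z == z[i]), M)
        x2i = x2[k]
    elif status[i] == 4:    # impute (x1, x2) jointly given (y, z)
        k = draw((y == y[i]) & (z == z[i]), M)
        x1i, x2i = x1[k], x2[k]
    X[i] = np.column_stack([np.ones(M), x1i, x2i, np.full(M, z[i])])

# --- Step 2 (analysis): solve sum_i mean_v X_iv (y_i - H(beta'X_iv)) = 0 ---
beta = np.zeros(4)
for _ in range(50):
    p = sigmoid(X @ beta)                                  # (n, M)
    U = np.einsum('imj,im->j', X, y[:, None] - p) / M      # averaged score
    J = -np.einsum('imj,im,imk->jk', X, p * (1 - p), X) / M
    step = np.linalg.solve(J, -U)
    beta += step
    if np.linalg.norm(step) < 1e-10:
        break
```

Because each incomplete case carries its own M imputed design rows, the averaged score is simply a mean over the third array axis, and the whole procedure solves the estimating equations only once.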

4.2 Type 2 MI (MI2) method

The MI2 method is quite similar to the MI1 method except for the formulas of the empirical CDFs, which are used to generate values to fill in the missing data. Consider the empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \((Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \((Y_i,\varvec{V}_i)\), and \(\varvec{X}_i\) given \((Y_i,\varvec{V}_i)\) as follows:

$$\begin{aligned} \widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{(\delta _{k1}+\delta _{k3})I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}(\delta _{s1}+\delta _{s3})I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)}\right) \,I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1),\\ \widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{(\delta _{k1}+\delta _{k2})I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}(\delta _{s1}+\delta _{s2})I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2), \\ \widetilde{{\widetilde{F}}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{\delta _{k1}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)}\right)\, I(\varvec{X}_k\le {\mathbf {x}}), \end{aligned}$$

respectively. When \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are MAR separately or simultaneously, their missing values are imputed several times by random observations generated from \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)\), \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)\), and \(\widetilde{{\widetilde{F}}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\) according to the missingness statuses of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). The procedure of the MI2 method is stated as follows:

Step 1.:

Imputation: Generate the vth imputed (“completed”) data (\(v=1,2,\dots ,M\)) according to the missingness status of \(\varvec{X}_i=({{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T)^T\), \(i=1,2,\dots ,n\).

i):

If \(\delta_{i1}=1\), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Set \({\mathcal {X}}_i=(1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}_i^T)^T\) for all v.

ii):

If \(\delta _{i2}=1\), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) and generate \(\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}_{1iv}\) from \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\). Define \(\widetilde{\widetilde{{\mathcal {X}}}}_{2iv}=(1,\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}^T_{1iv}, {{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}^T_i)^T\).

iii):

If \(\delta _{i3}=1\), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and generate \(\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}_{2iv}\) from \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Define \(\widetilde{\widetilde{{\mathcal {X}}}}_{3iv}=(1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}^T_{2iv},\varvec{Z}^T_i)^T\).

iv):

If \(\delta _{i4}=1\), generate \(\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}_{1iv}\) and \(\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}_{2iv}\) from \(\widetilde{{\widetilde{F}}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Define \(\widetilde{\widetilde{{\mathcal {X}}}}_{4iv}=(1,\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}^T_{1iv},\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}^T_{2iv},\varvec{Z}^T_i)^T\).

Step 2.:

Analysis: Solve the following estimating equations:

$$\begin{aligned} \varvec{U}_{M2}(\varvec{\beta }) =\dfrac{1}{\sqrt{n}}\sum _{i=1}^{n} \left( \delta _{i1}\varvec{S}_i(\varvec{\beta })+\delta _{i2}\widetilde{\widetilde{\varvec{S}}}_{2i}(\varvec{\beta }) +\delta _{i3}\widetilde{\widetilde{\varvec{S}}}_{3i}(\varvec{\beta }) +\delta _{i4}\widetilde{\widetilde{\varvec{S}}}_{4i}(\varvec{\beta })\right) =\varvec{0}, \end{aligned}$$
(20)

where \(\varvec{S}_i(\varvec{\beta })={\mathcal {X}}_i(Y_i-H(\varvec{\beta }^T{\mathcal {X}}_i))\), \(\widetilde{\widetilde{\varvec{S}}}_{ki}(\varvec{\beta })=M^{-1}\sum _{v=1}^{M}\widetilde{\widetilde{{\mathcal {X}}}}_{kiv}(Y_i-H(\varvec{\beta }^T\widetilde{\widetilde{{\mathcal {X}}}}_{kiv}))\), \(k=2,3,4\), to obtain the MI2 estimator, \(\widehat{\varvec{\beta }}_{M2}\), of \(\varvec{\beta }\). Then, calculate the estimated variance of \(\widehat{\varvec{\beta }}_{M2}\), \(\widehat{{\,\mathrm{Var}\,}}(\widehat{\varvec{\beta }}_{M2})\), by the following formulas:

$$\begin{aligned} \varvec{G}_{M2}^{-1}(\widehat{\varvec{\beta }}_{M2})\left\{ \dfrac{1}{M}\sum _{v=1}^{M}\sum _{i=1}^{n} (\widetilde{\widetilde{\varvec{U}}}_{vi}(\widehat{\varvec{\beta }}_{M2}))^{\otimes 2} +\left( 1+\dfrac{1}{M}\right) \dfrac{\sum _{v=1}^{M}(\widetilde{\widetilde{\varvec{U}}}_{v}(\widehat{\varvec{\beta }}_{M2}) )^{\otimes 2}}{M-1}\right\} (\varvec{G}_{M2}^{-1}(\widehat{\varvec{\beta }}_{M2}))^T, \end{aligned}$$
(21)

where \(\varvec{G}_{M2}(\varvec{\beta })\) is the gradient of \(-M^{-1}\sum _{v=1}^{M}\widetilde{\widetilde{\varvec{U}}}_{v}(\varvec{\beta })=-\varvec{U}_{M2}(\varvec{\beta })\), for

$$\begin{aligned} \widetilde{\widetilde{\varvec{U}}}_{vi}(\varvec{\beta })&=\dfrac{1}{\sqrt{n}}\left( \delta _{i1}\varvec{S}_i(\varvec{\beta })+\delta _{i2} \widetilde{\widetilde{\varvec{S}}}_{2i}(\varvec{\beta })+\delta _{i3}\widetilde{\widetilde{\varvec{S}}}_{3i}(\varvec{\beta })+\delta _{i4}\widetilde{\widetilde{\varvec{S}}}_{4i}(\varvec{\beta })\right) , \\ \widetilde{\widetilde{\varvec{U}}}_{v}(\varvec{\beta })&=\sum _{i=1}^{n}\widetilde{\widetilde{\varvec{U}}}_{vi}(\varvec{\beta }). \end{aligned}$$

Notice that each of the two proposed MI methods uses three empirical CDFs of Wang and Chen (2009) to generate random values for \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) MAR separately or simultaneously. The two MI approaches differ slightly, however, in how the first and second empirical CDFs draw information from the observed data to replace the missing data, while the third empirical CDFs are the same. The MI1 method uses only the cases with \(\delta _{i1}=1\), but the information of the observed \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) enters the indicator function in \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), and the information of the observed \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) enters the indicator function in \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\). The MI2 method, in contrast, combines the cases with \(\delta _{i1}=1\) and \(\delta _{i3}=1\) in \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)\), and those with \(\delta _{i1}=1\) and \(\delta _{i2}=1\) in \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)\), but the information of the observed \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) or \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) does not enter the indicator functions \(I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)\). Moreover, by the conditions inside the indicator functions in the empirical CDFs, the MI2 method can be applied to both continuous and categorical/discrete \(\varvec{X}\) data, while the MI1 method can only be applied to categorical/discrete \(\varvec{X}\) data, which limits it in practice.
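The difference in conditioning sets can be made concrete with a small sketch (the data and variable names are illustrative, not from the paper). For a case i with \(X_{1i}\) missing, MI1 restricts the donor pool to complete cases matching on the observed \((X_{2i}, Y_i, V_i)\), whereas MI2 pools all cases with \(X_1\) observed and matches on \((Y_i, V_i)\) only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.choice([-0.3, 0.4, 1.0], size=n)
x2 = rng.choice([-1.0, 0.6], size=n)
y = rng.binomial(1, 0.5, size=n)
v = rng.binomial(1, 0.4, size=n)          # fully observed conditioning variable
status = rng.choice([1, 2, 3, 4], size=n, p=[0.6, 0.15, 0.15, 0.1])

i = np.flatnonzero(status == 2)[0]        # a case with x1 missing, x2 observed

# MI1 pool: complete cases only, matched on the observed (x2, y, v)
w1 = ((status == 1) & (x2 == x2[i]) & (y == y[i]) & (v == v[i])).astype(float)
w1 /= w1.sum()

# MI2 pool: all cases with x1 observed (status 1 or 3), matched on (y, v) only
w2 = (((status == 1) | (status == 3)) & (y == y[i]) & (v == v[i])).astype(float)
w2 /= w2.sum()

# one multiple-imputation draw for the missing x1 under each scheme
draw_mi1 = rng.choice(x1, p=w1)
draw_mi2 = rng.choice(x1, p=w2)
```

Because MI2 drops the exact-matching requirement on the partially observed covariate, it remains usable when \(X\) is continuous, where exact matches on \(X_{2i}\) would almost never occur.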
Despite their methodological differences, both techniques aim to exploit more information for missing data imputation and to maximize estimation efficiency. The next section investigates the finite-sample performances of the proposed methods, compared to the CC, SIPW, VL, JCL, and MICE estimation methods, via extensive simulations.

5 Simulation and real data studies

5.1 Simulation studies

Monte Carlo simulations were conducted to examine the finite-sample performances of the following estimators:

(1) \(\widehat{\varvec{\beta }}_F\): full data ML estimator, used as a benchmark for comparisons
(2) \(\widehat{\varvec{\beta }}_C\): CC estimator
(3) \(\widehat{\varvec{\beta }}_W\): SIPW estimator, the solution of \(\varvec{U}_W(\varvec{\beta },\widehat{\varvec{\pi }}_1)=\varvec{0}\) in (4)
(4) \(\widehat{\varvec{\beta }}_V\): VL estimator, the solution of \(\widehat{\varvec{U}}_V(\varvec{\beta })=\varvec{0}\) in (5)
(5) \(\widehat{\varvec{\beta }}_J\): JCL estimator, the solution of \(\widehat{\varvec{U}}_J(\varvec{\beta })=\varvec{0}\) in (7)
(6) \(\widehat{\varvec{\beta }}_{M1}\): MI1 estimator, the solution of \(\varvec{U}_{M1}(\varvec{\beta })=\varvec{0}\) in (18)
(7) \(\widehat{\varvec{\beta }}_{M2}\): MI2 estimator, the solution of \(\varvec{U}_{M2}(\varvec{\beta })=\varvec{0}\) in (20)
(8) \(\widehat{\varvec{\beta }}_{ME}\): MICE estimator from the mice package in R, used as a benchmark for comparisons.

To evaluate the estimation performances of the proposed methods and compare them with the other estimation methods under various situations, we constructed five scenarios, in which the two covariates \(X_1\) and \(X_2\) were uncorrelated in Scenarios 1–4 and correlated in Scenario 5. Specifically, Scenario 1 studied the impact of the sample size on the performance of all the estimation methods under the same selection probabilities. Three sets of selection probabilities were considered in Scenario 2 to examine the influence of the missing rates on the performances of the estimation methods. Scenario 3 had the same aim as Scenario 2, except that only the value of \(\varvec{\beta }\) was changed, to determine whether the efficiencies of the estimators were altered under different logistic regression models. In Scenario 4, three different numbers of imputations were studied to see how they affected the performances of the two proposed MI estimation methods. Finally, Scenario 5 considered six different correlation coefficients between \(X_1\) and \(X_2\) under the same selection probabilities and sample size to investigate the performances of all the methods in these situations.

For each experimental configuration, 1,000 replications were performed. We set \(M=30\), except in Scenario 4, where \(M=5, 25\), and 45 were considered. We calculated the bias, standard deviation (SD), asymptotic standard error (ASE), and coverage probability (CP) of a 95% confidence interval for each estimator. To evaluate the relative efficiencies (REs) of the estimators, we computed the ratio of the mean square error (MSE) of each of the other estimators, excluding the full data ML estimator, to those of the MI1 and MI2 estimators, respectively, where the MSE of an estimator was defined as the sum of the squared bias and the squared SD, i.e., \(\text {MSE}=\text {bias}^2+\text {SD}^2\).
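The MSE and RE computations amount to a few lines over the Monte Carlo replications; the sketch below uses illustrative helper names, and the use of the sample SD (with \(n-1\) in the denominator) is an assumption:

```python
import numpy as np

def mse(estimates, truth):
    """MSE = squared bias + squared SD over Monte Carlo replications."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    sd = est.std(ddof=1)            # sample SD; the ddof choice is assumed
    return bias**2 + sd**2

def relative_efficiency(other, mi, truth):
    """RE of an MI estimator relative to another estimator;
    values > 1 favor the MI estimator."""
    return mse(other, truth) / mse(mi, truth)
```

For example, `mse([1.0, 1.2, 0.8], 1.0)` gives 0.04 (zero bias, SD 0.2).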

Scenario 1. The goal was to assess how well all the estimation methods performed when the two covariates \(X_1\) and \(X_2\) were independent, using the same observed selection probabilities and the three different sample sizes \(n = 500\), 1,000, and 2,000. The discrete distribution on the four values \((-0.3,-0.1,0.4,1)\) with probabilities (0.2, 0.3, 0.3, 0.2), respectively, was used to generate the data of \(X_1\). The data of \(X_2\) were generated from the discrete distribution on the four values \((-1,-0.4,0.2,0.6)\) with probabilities (0.1, 0.3, 0.3, 0.3), respectively. The Bernoulli distribution with success probability 0.4 was used to generate the data of Z. The surrogate variables of \(X_1\) and \(X_2\) were \(W_k\), where \(W_k=1\) if \(X_k>0\) and \(W_k=0\) if \(X_k\le 0\), \(k=1,2\). The Bernoulli distribution with success probability \(P(Y=1|X_1,X_2,Z)=H(\beta _0+\beta _1X_1+\beta _2 X_2+\beta _3Z)\) was used to generate the data of Y, where \(\varvec{\beta }=(\beta _0,\beta _1,\beta _2,\beta _3)^T=(1,-0.5,1,\log (2))^T\). The following multinomial logistic regression model

$$\begin{aligned} \ln \left( \dfrac{P(\delta _{ij}=1|Y_i,W_{1i},W_{2i},Z_i)}{P(\delta _{i4}=1|Y_i,W_{1i},W_{2i},Z_i)}\right) =\alpha _j+\gamma _1Y_i+\gamma _2W_{1i}+\gamma _3W_{2i}+\gamma _4Z_i \end{aligned}$$
(22)

was used to generate the data of \(\delta _{ij}\) given \((Y_i,W_{1i},W_{2i},Z_i)\), \(i=1,2,\dots ,n\), \(j=1,2,3\), where \(\varvec{\alpha }=(\alpha _1,\alpha _2,\alpha _3)^T=(2,0.6,0.6)^T\) and \(\varvec{\gamma }=(\gamma _1,\gamma _2,\gamma _3,\gamma _4)^T=(0.7,-0.2,0.1,-1.2)^T\). Under the three different sample sizes, the observed selection probabilities were similar and about 0.6, 0.15, 0.15, and 0.1, respectively. This means that the percentages of complete cases, only \(X_1\) missing, only \(X_2\) missing, and both of them missing were 60%, 15%, 15%, and 10%, respectively.
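Generating the missingness indicators from the multinomial logit model (22) can be sketched as follows. This is an illustrative simplification: \(Y, W_1, W_2, Z\) are drawn as independent Bernoulli variables rather than through the full Scenario 1 design, so the resulting complete-case rate only roughly matches the 60% reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
alpha = np.array([2.0, 0.6, 0.6])           # (alpha_1, alpha_2, alpha_3)
gamma = np.array([0.7, -0.2, 0.1, -1.2])    # (gamma_1, ..., gamma_4)

# simplified covariates: independent Bernoulli draws (illustrative only)
y = rng.binomial(1, 0.5, n)
w1 = rng.binomial(1, 0.5, n)
w2 = rng.binomial(1, 0.5, n)
z = rng.binomial(1, 0.4, n)
lin = gamma[0]*y + gamma[1]*w1 + gamma[2]*w2 + gamma[3]*z

# model (22): log-odds of statuses j = 1, 2, 3 against the baseline j = 4
expo = np.exp(alpha[None, :] + lin[:, None])          # (n, 3)
denom = 1.0 + expo.sum(axis=1, keepdims=True)
probs = np.hstack([expo / denom, 1.0 / denom])        # (n, 4), rows sum to 1

# draw one status per subject by inverting the per-row CDF
u = rng.random(n)
status = (u[:, None] > probs.cumsum(axis=1)).sum(axis=1) + 1
```

With \(\varvec{\alpha }=(2,0.6,0.6)^T\), the implied complete-case probability is close to 0.6 even under these simplified covariate marginals.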

The simulation results of Scenario 1 are presented in Table 1. The full data ML method overall outperformed the other approaches, but it serves only as a benchmark for comparison because it has the practical disadvantage of requiring fully observed data. The biases of the CC (for \(\beta _0,\beta _3\)) and MICE (for \(\beta _1,\beta _2\)) estimators were the largest, which implies that these two methods performed worst in estimation. The performances of the SIPW and VL methods were similar, and the MI1, MI2, and JCL methods had similar performances, particularly when \(n=1,000\) and 2,000. The SD and ASE of each estimator were similar, except for the MICE estimator for \(\beta _1\) and \(\beta _2\), and decreased as the sample size increased. The ASEs of the MI1 and MI2 estimators were similar and the smallest among all the estimators except the full data ML estimator. The empirical CPs for all the estimation methods were overall close to the nominal probability 95%, except for the CC (for \(\beta _0\) when \(n=2,000\); \(\beta _3\) when \(n=1,000, 2,000\)) and MICE (for \(\beta _0\) when \(n=2,000\); \(\beta _1\) when \(n=500, 2,000\); \(\beta _2\) when \(n=1,000, 2,000\)) methods. In addition, Table 2 shows that the relative efficiency values were larger than 1, except for the JCL (for \(\beta _0\) when \(n=500, 1,000\); \(\beta _1,\beta _2\) when \(n=500, 1,000, 2,000\)) and MICE (for \(\beta _0\) when \(n=500, 1,000\); \(\beta _1\) when \(n=500\) and MICE versus MI1; \(\beta _3\) when \(n=500, 1,000, 2,000\)) estimators, which shows that the two proposed MI estimators were comparable with the JCL and MICE estimators (for \(\beta _0\) when \(n=500, 1,000\); \(\beta _1\) when \(n=500\); \(\beta _3\) when \(n=500, 1,000, 2,000\)) in terms of efficiency.
The relative efficiency values of the MICE estimator to the two proposed MI estimators tended to increase with the sample size and were larger than 1 for \(\beta _1\) and \(\beta _2\), i.e., the two types of MI estimators became more efficient than the MICE estimator for \(\beta _1\) and \(\beta _2\) as the sample size increased.

Table 1 Simulation results of scenario 1 (\(M=30\); \(n=500,1,000,2,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.6, 0.15, 0.15, 0.1)
Table 2 Relative efficiencies in scenario 1 (\(M=30\); \(n=500,1,000,2,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.6, 0.15, 0.15, 0.1)

Scenario 2. In this scenario, we examined the impact of the observed selection probabilities, i.e., of changing the missing rates, on the efficiencies of the estimators with \(n=1,000\) fixed. The values of \(\varvec{\beta }\) and \(\varvec{\gamma }\) and the procedure to generate the data of \(X_1\), \(X_2\), Z, Y, \(W_1\), and \(W_2\) were the same as in Scenario 1. For the multinomial logit model in (22) used to generate the data of \(\delta _{ij}\), we set \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\) and \(\varvec{\alpha }=(0.7,0.5,0.5)^T\), \((1.5,0.6,0.6)^T\), and \((2.5,0.6,0.6)^T\) to obtain the three sets of observed selection probabilities, (0.31, 0.26, 0.26, 0.17), (0.47, 0.20, 0.20, 0.13), and (0.72, 0.11, 0.11, 0.06), respectively.

Table 3 shows the simulation results of this scenario. The biases, SDs, and ASEs of the last seven estimators overall tended to decrease as the CC percentage increased from 31% to 72%. Serious bias still occurred in the CC (for \(\beta _0,\beta _3\)) and MICE (for \(\beta _1,\beta _2\)) estimators. The performances of the MI1 and MI2 estimation methods were essentially the same, and the ASEs of the two proposed MI estimators were overall the smallest among all the estimators except the full data ML estimator. The empirical CPs based on all the estimation methods were overall close to the nominal probability 95%, except for the CC (for \(\beta _0\) when CC percentage = 0.31, 0.47), SIPW (for \(\beta _3\) when CC percentage = 0.31), MI1 and MI2 (for \(\beta _1,\beta _2\) when CC percentage = 0.31), and MICE (for \(\beta _1\) when CC percentage = 0.31; \(\beta _2\) when CC percentage = 0.31, 0.47, 0.72) methods. The relative efficiency values in Table 4 were still greater than 1, except for the JCL (for \(\beta _0,\beta _1,\beta _2\)) and MICE (for \(\beta _0\) when CC percentage = 0.72; \(\beta _3\)) estimators, and tended to decrease as the CC percentage increased, except for the JCL (for \(\beta _0,\beta _1,\beta _2\)) and MICE (for \(\beta _3\)) estimators, indicating that the two types of MI estimators were comparable with the JCL estimator and the most efficient compared to the CC, SIPW, VL, and MICE estimators. When the CC percentage was 0.72, the relative efficiency values of the SIPW, VL, and JCL estimators to the two MI estimators were (very) close to 1.

Table 3 Simulation results of scenario 2 (\(M=30\); \(n= 1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\)

Scenario 3. In this scenario, we wished to know whether changing the values of the logistic regression parameters affects the performances of the estimation methods. Therefore, we kept all the settings of Scenario 2 except that \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\) was changed to \(\varvec{\beta }=(-1,1,0.7,-1)^T\). As shown in Tables 5 and 6, the three sets of observed selection probabilities were (0.29, 0.24, 0.24, 0.23), (0.45, 0.19, 0.19, 0.16), and (0.70, 0.10, 0.10, 0.10), respectively, which were quite similar to those in Scenario 2: the first three observed selection probabilities in the first set, 0.29, 0.24, and 0.24, were slightly smaller than their counterparts 0.31, 0.26, and 0.26 in Scenario 2, while the last one, 0.23, was slightly larger than 0.17. The performances of all the estimation methods in this scenario were quite similar to those in Scenario 2; one possible reason is the similar missing rates in the two scenarios. Hence, in general, changing the value of \(\varvec{\beta }\) had little effect on the selection probabilities and the efficiencies of the proposed estimation methods.

Table 4 Relative efficiencies in scenario 2 (\(M=30\); \(n= 1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\)
Table 5 Simulation results of scenario 3 (\(M=30\); \(n=1,000\)); \(\varvec{\beta }=(-1,1,0.7,-1)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\)
Table 6 Relative efficiencies in scenario 3 (\(M=30\); \(n=1,000\)); \(\varvec{\beta }=(-1,1,0.7,-1)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\)

Scenario 4. The purpose of this scenario was to examine the impact of the number of multiple imputations on the performances of the proposed methods when \(X_1\) and \(X_2\) were independent with a fixed sample size. All the settings were the same as in Scenario 1, except that \(M=5, 25\), and 45 were considered, with \(n=1,000\). The observed selection probabilities were 0.60, 0.15, 0.15, and 0.10.

The simulation results for the MI1, MI2 and MICE methods in Table 7 were essentially the same as those in Table 1 of Scenario 1 when \(n=1,000\). Table 8 shows that the relative efficiency values were also overall the same as those in Table 2 of Scenario 1 for \(n=1,000\). Therefore, this simulation study demonstrated that the two proposed MI methods were not affected by the number of imputations.

Table 9 provides a summary of the computing time for each estimation method. The JCL and MICE methods had the longest and second longest computing times, respectively. The JCL method took an average of 21.88 seconds to perform one simulation, which is approximately 73, 35.3, and 23 times that of the two proposed MI methods when \(M=5, 25\), and 45, respectively. On average, the MICE method took 1.15, 5.73, and 10.30 seconds to perform one simulation when \(M=5, 25\), and 45, which is approximately 3.8, 9.2, and 10.8 times that of the two proposed MI methods, respectively. However, the performances of the JCL and the two proposed MI methods were essentially the same. Therefore, based on the simulation results, the MI1 and MI2 methods overall outperformed the other methods, except the full data ML and JCL methods, when estimating the parameters of logistic regression with covariates MAR.

Table 7 Simulation results of scenario 4 \((n=1,000)\); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.60, 0.15, 0.15, 0.10); \(M=5,25,45\)
Table 8 Relative efficiencies in scenario 4 (\(M=5,25,45\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.60, 0.15, 0.15, 0.10)

Scenario 5. This scenario aimed to examine the performances of all the approaches when \(X_1\) and \(X_2\) were correlated, given fixed observed selection probabilities and sample size. The settings of M, \(\varvec{\beta }\), and \(\varvec{\gamma }\) were the same as in Scenario 1. We set \(\varvec{\alpha }=(1.15,0.6,0.6)^T\) to compare the efficiencies of all the estimation methods in a situation of higher missing rates, i.e., the observed selection probabilities (0.40, 0.23, 0.23, 0.14), given \(n=1,000\). In addition, to generate the correlated data of \(X_1\) and \(X_2\), the distribution on the four values \(-0.3,-0.1,0.4\), and 1 with probabilities 0.2, 0.3, 0.3, and 0.2, respectively, was first used to generate the data of \(X_1\). Then, given each value of \(X_1\), the data of \(X_2\), taking the values \((-1,-0.4,0.2,0.6)\), were generated such that the correlations between \(X_1\) and \(X_2\) were \(\rho =-0.21,-0.53,-0.71,0.21,0.51\), and 0.71, respectively. For example, for the case of \(\rho =-0.21\), if \(X_1=-0.3\), the four values of \(X_2\) were generated with probabilities 0.42, 0.1, 0.2, and 0.28, respectively. Similarly, when \(X_1=-0.1, 0.4\), and 1, we generated the four values of \(X_2\) with probabilities (0.2, 0.1, 0.4, 0.3), (0.2, 0.4, 0.2, 0.2), and (0.65, 0.05, 0.05, 0.25), respectively. Finally, the surrogate variables \(W_1\) and \(W_2\) of \(X_1\) and \(X_2\), respectively, were set as in Scenario 1.
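The conditional-table construction for the \(\rho =-0.21\) case can be checked numerically. The sketch below generates \((X_1,X_2)\) from the probabilities quoted above and verifies the induced correlation; the large n is used only to make the Monte Carlo check tight.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
x1_vals = np.array([-0.3, -0.1, 0.4, 1.0])
x1_probs = np.array([0.2, 0.3, 0.3, 0.2])
x2_vals = np.array([-1.0, -0.4, 0.2, 0.6])
# P(X2 | X1) rows for the rho = -0.21 case, as listed in the text
cond = np.array([
    [0.42, 0.10, 0.20, 0.28],   # X1 = -0.3
    [0.20, 0.10, 0.40, 0.30],   # X1 = -0.1
    [0.20, 0.40, 0.20, 0.20],   # X1 =  0.4
    [0.65, 0.05, 0.05, 0.25],   # X1 =  1.0
])

i1 = rng.choice(4, size=n, p=x1_probs)
x1 = x1_vals[i1]
# invert each row's CDF with one uniform draw per observation
u = rng.random(n)
i2 = (u[:, None] > cond.cumsum(axis=1)[i1]).sum(axis=1)
x2 = x2_vals[i2]
rho = np.corrcoef(x1, x2)[0, 1]
```

A direct calculation from these tables gives a population correlation of about \(-0.212\), consistent with the reported \(\rho =-0.21\).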

Table 9 Summary of computing time (second) for performing one simulation for each estimation method in Scenario 4 (\(M=5,25,45\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.60, 0.15, 0.15, 0.10)

Tables 10 and 11 present the simulation results for \(X_1\) and \(X_2\) with negative and positive correlation values, respectively. The biases of the CC and MICE estimators were overall larger than those of the other estimators. The biases overall did not change much as the absolute correlation increased, except for the MICE estimator (for \(\beta _1\) when \(\rho\) was changed from \(-0.21\) to \(-0.71\); \(\beta _0,\beta _1, \beta _2\) when \(\rho\) was changed from 0.21 to 0.71). The SDs and ASEs of all the estimators, except those for \(\beta _3\), tended to increase with the absolute correlation. The SD and ASE of the CC estimator were the largest. The SD and ASE of the JCL estimator were comparable with or (slightly) larger than those of the MI1 and MI2 estimators. The empirical CPs based on all the estimation methods were overall close to the nominal probability 95%, except for the CC (for \(\beta _0\) when \(\rho =-0.53\); \(\beta _3\) when \(\rho =-0.53,-0.71,0.51, 0.71\)), MI1 (for \(\beta _0\) when \(\rho =0.71\); \(\beta _1,\beta _2\) when \(\rho =-0.71,0.51\)), MI2 (for \(\beta _1\) when \(\rho =-0.71,0.51\)), and MICE (for \(\beta _0\) when \(\rho =-0.21,0.21,0.51,0.71\); \(\beta _1,\beta _2\) for all \(\rho\) values) methods. The relative efficiency values of the CC, SIPW, VL, and MICE estimators to the two types of MI estimators were greater than 1, except for the relative efficiencies of the SIPW to MI2 estimators and of the MICE to the two types of MI estimators (for \(\beta _0\) when \(\rho =-0.53,-0.71\); \(\beta _1\) when \(\rho =-0.21\); \(\beta _3\) when \(\rho =0.21,0.51,0.71\)). The performances of the MI1 and MI2 estimators were comparable with that of the JCL estimator in terms of efficiency (Table 12).

Table 10 Simulation results of scenario 5 (\(M=30\); \(\rho =-0.21,-0.53,-0.71\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(1.15,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.40, 0.23, 0.23, 0.14)
Table 11 Simulation results of scenario 5 (\(M=30\); \(\rho =0.21,0.51,0.71\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(1.15,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.40, 0.23, 0.23, 0.14)
Table 12 Relative efficiencies in scenario 5 (\(M=30\); \(\rho =-0.21,-0.53,-0.71,0.21,0.51,0.71\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(1.15,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.40, 0.23, 0.23, 0.14)

In summary, Scenarios 1–5 show that the performances of the JCL and the two proposed MI methods were comparable, while Scenario 4 demonstrates that the MI1 and MI2 methods required much less computing time than the JCL and MICE methods.

5.2 Examples

Two real data examples were used to demonstrate the practicality of the two proposed MI methods and the CC, SIPW, VL, JCL, and MICE methods. \(M=30\) imputations were used.

5.2.1 Example 1

The first real data example is the data set of the Global Longitudinal Study of Osteoporosis in Women (GLOW500M) (Hosmer et al. 2013). Let Y be a binary outcome variable denoting whether a respondent had any fracture in the first year, where \(Y=1\) if yes and \(Y=0\) if no. Three covariates are considered. \(X_1\), which had missing values, indicates the history of prior fracture, where \(X_1=1\) if yes and 0 if no. \(X_2\), which also had missing values, denotes the self-reported risk of fracture, where \(X_2=1\) if less than others of the same age; 2 if the same as others of the same age; 3 if greater than others of the same age. Z denotes age at enrollment, with integer values from 55 to 90 and no missing values. The sample size is \(n=500\). The rates of only \(X_1\) missing, only \(X_2\) missing, and both missing were 16%, 16%, and 4%, respectively. Thus, the size of the CC data set is 320 (64%). In addition, let W denote the hip fracture of mother, which is 1 if yes and 0 otherwise. Because W is not significant in the logistic regression with response Y, and W is correlated with \(X_2\) (Spearman’s rank correlation coefficient 0.13 with p-value 0.025) under the CC data, W is considered a surrogate variable of \(X_2\) in this study. Moreover, the self-reported risk of fracture was dichotomized into less than others of the same age (\(X_2= 1\)) versus the same as or greater than others of the same age (\(X_2 = 2\) or 3). The dummy variable \(DX_2\) represents the dichotomized self-reported risk of fracture, which is 1 if \(X_2=1\) and 0 otherwise. Age at enrollment (Z) was categorized into three groups, \(Z\le 60\), \(60<Z\le 70\), and \(Z>70\). Let \(DZ_1\) and \(DZ_2\) be dummy variables for the categorized age at enrollment: \(DZ_1=1\) if \(Z\le 60\) and 0 otherwise, and \(DZ_2=1\) if \(60<Z\le 70\) and 0 otherwise. The missingness mechanism of \(X_1\) and \(X_2\) was identified as MAR by Tran et al. (2021).
The following logistic regression model is used to fit the data set:

$$\begin{aligned} P(Y=1|X_1,X_2,Z,W)=H(\beta _0+\beta _1X_1+\beta _2DX_2+\beta _3DZ_1+\beta _4DZ_2). \end{aligned}$$
(23)

The estimates of the \(\beta _k\)s and their corresponding ASEs are given in Table 13. The analysis results indicate that \(\beta _k\), \(k=0,1,2,3,4\), are statistically significantly different from zero under all the estimation methods, except \(\beta _3\) and \(\beta _4\) under the CC method. Based on the SIPW, VL, JCL, MI1, MI2, and MICE methods, the estimates of the \(\beta _k\)s and their ASEs were overall quite similar. According to all the estimation methods, the results of testing \(\beta _1=0\) and \(\beta _2=0\) and their estimates reveal that women with a history of prior fracture were more likely to have fracture(s) in the first year than those without such a history, and that women with self-reported risk of fracture less than others of the same age were less likely to have fracture(s) in the first year than those whose self-reported risk was the same as or greater than others of the same age. The results of testing \(\beta _3=0\) and \(\beta _4=0\) and their estimates based on the SIPW, VL, JCL, MI1, MI2, and MICE estimation methods indicate that women aged 60 years or younger, and those older than 60 but no older than 70 years, were less likely to have fracture(s) in the first year than those older than 70 years.

Table 13 Results of logistic regression analysis of GLOW500M data
Table 14 Results of logistic regression analysis of Cable TV data

5.2.2 Example 2

The two proposed MI methods are applied to analyze the second real data example, the cable television (TV) data set collected from a customer survey study of 1,586 residents in three cities in Taiwan (Lee et al. 2011). The satisfaction level with the cable TV service is the binary outcome variable, denoted by Y (\(1=\) satisfied; \(0=\) neutral or dissatisfied). The two covariates with missing values are the response, denoted by \(X_1\) (\(1=\) yes; \(0=\) no), to the question “Have you been given a discount on cable TV?”, and the response, denoted by \(X_2\) (\(0=\) no children under 12 years old; \(1=\) at least one child under 12 years old), to the question “How many children under the age of 12 live with you?”. Another covariate without missing values is the response, denoted by Z (\(1=\) yes; \(0=\) no), to the question “Are you paying for the fourth channel?”. Because the rates of only \(X_1\) missing, only \(X_2\) missing, and both missing were 17.9%, 1.3%, and 0.4%, respectively, the CC data set consists of 1,274 respondents (80.3%). A surrogate variable of \(X_1\) and \(X_2\) is the response, denoted by W (\(1=\) yes; \(0=\) no), to the question “Would you pay extra money for additional channels?”. We fit the following logistic regression model to the data set:

$$\begin{aligned} P(Y=1|X_1,X_2,Z,W) = H(\beta _0+\beta _1X_1+\beta _2X_2+\beta _3Z). \end{aligned}$$
(24)

The multinomial logistic regression model \(\log \left( \pi _j(Y,W,Z)/\pi _4(Y,W,Z)\right) =\alpha _j+\gamma _{1j}Y+\gamma _{2j}W+\gamma _{3j}Z\), \(j=1,2,3\), is used to examine the effects of Y, W, and Z on the missingness mechanism of \(X_1\) and \(X_2\), i.e., on their selection probabilities. The p-values of Wald chi-squared tests for the effects of Y (i.e., testing \(H_0:\gamma _{1j}=0\), \(j=1,2,3\)) and Z (i.e., testing \(H_0:\gamma _{3j}=0\), \(j=1,2,3\)) are 0.0021 and \(<0.0001\), respectively, so Y and Z are statistically significantly related to the missingness mechanism of \(X_1\) and \(X_2\) and, hence, it is reasonable to assume that \(X_1\) and \(X_2\) are MAR.

Table 14 gives the estimates of \(\beta _k\), \(k=0,1,2,3\), and their corresponding ASEs. The results of testing \(\beta _k=0\), \(k=0,1,2,3\), are statistically significant except for testing \(\beta _1=0\) with the MICE estimation method. The estimates of \(\beta _1\) and \(\beta _3\) are positive, implying that respondents were more likely to be satisfied with the cable TV service when they were offered a discount and when they paid for the fourth channel, respectively. Under all the estimation methods, the estimates of \(\beta _2\) are negative, revealing that respondents were less likely to report satisfaction with the cable TV service when there were children under 12 years old in their family. The parameter estimates based on the SIPW, VL, JCL, MI1, MI2, and MICE methods are quite similar to one another and differ from those based on the CC method. Moreover, the ASEs of the two types of MI estimators of \(\beta _0\), \(\beta _1\), and \(\beta _3\) are the smallest among all the estimation methods, showing the superior performance of the two proposed MI methods, consistent with the simulation results.

Moreover, one can also apply the estimation methods to analyze the cable TV data set with artificial missing values and higher missing rates. To this end, let \((\delta _1^0,\delta _2^0,\delta _3^0,\delta _4^0)\) be indicators for the original missingness statuses of \(X_1\) and \(X_2\). Let \((\delta _1^*,\delta _2^*,\delta _3^*,\delta _4^*)\) be indicators for artificial missingness statuses of \(X_1\) and \(X_2\) under the assumption of MAR mechanism. Let \(({\tilde{\delta }}_1,{\tilde{\delta }}_2,{\tilde{\delta }}_3,{\tilde{\delta }}_4)\) be the combination of indicators for the original and artificial missingness statuses of \(X_1\) and \(X_2\), where \({\tilde{\delta }}_{ij}=\delta _{i1}^0\delta _{ij}^*+(1-\delta _{i1}^0)\delta _{ij}^0\), \(i=1,2,\dots ,n\), \(j=1,2,3,4\). The following multinomial logistic regression model

$$\begin{aligned} \ln \left( \dfrac{P(\delta _{ik}^*=1|Y_i,W_i,Z_i)}{P(\delta _{i4}^*=1|Y_i,W_i,Z_i)}\right) =\alpha _k^*+\gamma _1^*Y_i+\gamma _2^*W_i+\gamma _3^*Z_i,\ k=1,2,3, \end{aligned}$$
(25)

is applied to generate the data of \(\delta _{ij}^*\), \(j=1,2,3,4\), given \((Y_i,W_i,Z_i)\), where \(\varvec{\alpha }^*=(\alpha _1^*,\alpha _2^*,\alpha _3^*)^T=(1.7,-0.5,1)^T\) and \(\varvec{\gamma }^*=(\gamma _1^*,\gamma _2^*,\gamma _3^*)^T=(-0.7,-0.5,-0.3)^T\). One can obtain the selection probabilities of \(({\tilde{\delta }}_1,{\tilde{\delta }}_2,{\tilde{\delta }}_3,{\tilde{\delta }}_4)\) as (0.406, 0.234, 0.231, 0.129), which imply that the percentages of complete cases, only \(X_1\) missing, only \(X_2\) missing, and both of them missing are 40.6%, 23.4%, 23.1%, and 12.9%, respectively.
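Generating the artificial missingness indicators from model (25) with the stated coefficients can be sketched as follows; the helper names are illustrative, and category 4 (both \(X_1\) and \(X_2\) missing) serves as the multinomial baseline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Coefficients of model (25) as given in the text; gamma* is shared across k.
alpha = np.array([1.7, -0.5, 1.0])    # alpha*_k, k = 1, 2, 3
gamma = np.array([-0.7, -0.5, -0.3])  # (gamma*_1, gamma*_2, gamma*_3) for (Y, W, Z)

def missingness_probs(y, w, z):
    """P(delta*_k = 1 | Y, W, Z), k = 1..4, with category 4 as the baseline."""
    shift = gamma @ np.array([y, w, z])
    logits = np.append(alpha + shift, 0.0)  # baseline category has logit 0
    p = np.exp(logits)
    return p / p.sum()

def draw_missingness(y, w, z):
    """Sample one artificial missingness status as a one-hot vector over 4 categories."""
    k = rng.choice(4, p=missingness_probs(y, w, z))
    return np.eye(4, dtype=int)[k]
```

Averaging `missingness_probs` over the observed \((Y_i,W_i,Z_i)\) yields the overall selection probabilities; the figures (0.406, 0.234, 0.231, 0.129) quoted in the text depend on the empirical distribution of \((Y,W,Z)\) in the cable TV data.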

The analysis results of the artificial cable TV data set are given in Table 15. In general, these results change somewhat from those of the original cable TV data set in Table 14 because of the increased missing rates, but they still reflect the performance properties of the proposed estimation methods. Indeed, the main changes concern the results of testing \(\beta _1=0\) and \(\beta _2=0\). Testing \(\beta _1=0\) is not statistically significant under the SIPW, JCL, MI1, MI2, and MICE estimation methods. Testing \(\beta _2=0\) is not statistically significant under the CC, SIPW, and VL estimation methods. In addition, the ASEs of all the estimators increase, especially those of the CC estimators of all the parameters and of the SIPW and VL estimators of \(\beta _1\) and \(\beta _2\). The ASEs of the two proposed MI estimators change less than those of the other estimators. The (absolute) estimates of \(\beta _0\) and \(\beta _3\) increase, and those of \(\beta _2\) decrease, for almost all the estimation methods. The estimates of \(\beta _1\) increase under the CC, SIPW, and VL estimation methods and decrease under the JCL, MI1, MI2, and MICE estimation methods. Again, the parameter estimates based on the MI1 and MI2 estimation methods change less than those based on the JCL and MICE estimation methods.

In summary, the two real data sets are used to evaluate the applicability of the seven estimation methods and to confirm the results of the simulation study section. When the logistic regression model includes only categorical/discrete variables and has two covariates MAR separately or simultaneously, the analytical results are highly consistent with the conclusions of the simulation studies. The biases and ASEs of the MI1, MI2, and JCL estimators are quite similar to one another and smaller than those of the SIPW and VL estimators. The results also imply that one may draw misleading conclusions from the fitted logistic regression model when using the CC or MICE (from the mice package) approach. For instance, in real data example 1, the CC estimation method concludes that \(DZ_1\) and \(DZ_2\) are insignificant; in real data example 2, the MICE estimation method suggests that \(X_1\) is not significant at the 5% level. Furthermore, the analysis of the artificial cable TV data example shows that the two proposed methods provide more stable results than the other estimation methods.

Table 15 Results of logistic regression analysis of cable TV data with artificial missing values

6 Conclusion

Two different types of MI methods have been proposed to estimate the parameters of logistic regression with covariates missing separately or simultaneously. Based on the idea of Wang and Chen (2009), for each type of MI estimation method we have proposed three empirical CDFs to generate random values for the missing data and estimated the logistic regression parameters by using the estimating equations of Fay (1996), which are more convenient in practice than those of Rubin (1987) because they are solved only once. The simulation studies have shown that the performances of the two proposed MI methods were comparable with that of the JCL method, while they had shorter computing times than the JCL and MICE estimation methods and are easily implemented. The two proposed MI methods overall outperformed the CC, SIPW, VL, and MICE methods. Two real data sets have been used to illustrate the practical use of the two proposed MI methods.

Although we have focused on the case where covariates are categorical/discrete, one can also consider the case of continuous covariates by using the nonparametric kernel approach of Wang and Chen (2009) to construct the empirical CDFs. For example, assume that \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are continuous covariate vectors that are MAR separately or simultaneously, \(\varvec{Z}\) is a categorical/discrete covariate vector that is always observed, and \({{{\mathbf {\mathtt{{W}}}}}}_1\) and \({{{\mathbf {\mathtt{{W}}}}}}_2\) are categorical/discrete surrogate vectors of \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\), respectively; hence, \(\varvec{V}=(\varvec{Z}^T,\varvec{W}^T)^T\) is a categorical/discrete covariate vector. The MI2 method still works in this case. For the MI1 method, to impute the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\), one can construct the empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\) as follows:

$$\begin{aligned} {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}^*({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i){\mathcal {K}}_{{\mathcal {H}}_1}({{{\mathbf {\mathtt{{X}}}}}}_{2k}-{{{\mathbf {\mathtt{{X}}}}}}_{2i})}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i){\mathcal {K}}_{{\mathcal {H}}_1}({{{\mathbf {\mathtt{{X}}}}}}_{2s} -{{{\mathbf {\mathtt{{X}}}}}}_{2i})}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1), \\ {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}^*({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i){\mathcal {K}}_{{\mathcal {H}}_2}({{{\mathbf {\mathtt{{X}}}}}}_{1k}-{{{\mathbf {\mathtt{{X}}}}}}_{1i})}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i){\mathcal {K}}_{{\mathcal {H}}_2}({{{\mathbf {\mathtt{{X}}}}}}_{1s} -{{{\mathbf {\mathtt{{X}}}}}}_{1i})}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2). \end{aligned}$$

Here \({\mathcal {H}}_s\) is a \(p_s\times {p_s}\), \(s=1,2\), symmetric positive definite matrix depending on n, where \(p_1=r_2=\text {length}({{{\mathbf {\mathtt{{X}}}}}}_2)\) and \(p_2=r_1=\text {length}({{{\mathbf {\mathtt{{X}}}}}}_1)\). \({\mathcal {K}}_{{\mathcal {H}}_s}({\mathbf {u}})=|{\mathcal {H}}_s|^{-1/2}{\mathcal {K}}_s({\mathcal {H}}_s^{-1/2}{\mathbf {u}})\) for \({\mathcal {K}}_s(\cdot )\) a \(p_s\)-variate kernel with \(\int {\mathcal {K}}_s({\mathbf {u}})d{\mathbf {u}}=1\) and \({\mathcal {H}}^{1/2}_s\) the bandwidth matrix. \({\widetilde{F}}_{\varvec{X}_i}^*({\mathbf {x}}|Y_i,\varvec{V}_i)\) is equal to \({\widetilde{F}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\) in (17). We can use the MI1 procedure in Sect. 4.1 to estimate the parameters. Unreported simulation studies demonstrated that the two proposed MI methods still work well in this case.
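As a minimal sketch of the first smoothed CDF above, the following code takes scalar \(X_1\) and \(X_2\), a univariate Gaussian kernel, and a scalar bandwidth h (a special case of the bandwidth-matrix formulation); all data and function names are hypothetical illustrations.

```python
import numpy as np

def gauss_kernel(u, h):
    """Univariate Gaussian kernel K_h(u) = K(u/h)/h (scalar-bandwidth case)."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def cdf_mi1(x1, i, X1, X2, Y, V, delta1, h):
    """Kernel-smoothed empirical CDF of X1_i given (X2_i, Y_i, V_i), built from
    complete cases only (delta1 == 1), mirroring the MI1 construction above."""
    # Weight each complete case by exact matching on (Y, V) and kernel
    # smoothing in X2, then self-normalize.
    w = delta1 * (Y == Y[i]) * (V == V[i]) * gauss_kernel(X2 - X2[i], h)
    w = w / w.sum()
    return np.sum(w * (X1 <= x1))

# Toy illustration with simulated data.
rng = np.random.default_rng(2)
n = 200
X2 = rng.normal(size=n)
X1 = X2 + rng.normal(size=n)
Y = rng.integers(0, 2, n)
V = rng.integers(0, 2, n)
delta1 = np.ones(n, dtype=int)  # all cases complete, for the illustration only

F_low = cdf_mi1(-10.0, 0, X1, X2, Y, V, delta1, 0.5)
F_high = cdf_mi1(10.0, 0, X1, X2, Y, V, delta1, 0.5)
```

Because the weights are self-normalized, the function is a proper CDF in \(x_1\): it is nondecreasing and runs from 0 to 1.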

Furthermore, to build the MI1 method for the case where both \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are continuous covariate vectors and MAR separately or simultaneously, \(\varvec{Z}\) is a continuous covariate vector, and \({{{\mathbf {\mathtt{{W}}}}}}_1\) and \({{{\mathbf {\mathtt{{W}}}}}}_2\) are continuous surrogate vectors of \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\), respectively, one can construct the following empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(\varvec{X}_i\) given \((Y_i,\varvec{V}_i)\):

$$\begin{aligned} {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}^{**}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}_1}({{{\mathbf {\mathtt{{X}}}}}}_{2k}-{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i) {{\mathcal {K}}_{{\mathcal {H}}_1}({{{\mathbf {\mathtt{{X}}}}}}_{2s}-{{{\mathbf {\mathtt{{X}}}}}}_{2i}, \varvec{V}_s-\varvec{V}_i)}}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1), \nonumber \\ {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}^{**}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}_2}({{{\mathbf {\mathtt{{X}}}}}}_{1k}-{{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}_2}({{{\mathbf {\mathtt{{X}}}}}}_{1s}-{{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_s-\varvec{V}_i)}}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2), \nonumber \\ {\widetilde{F}}_{\varvec{X}_i}^{**}({\mathbf {x}}|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}_3}(\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}_3}(\varvec{V}_s-\varvec{V}_i)}}\right)\, I(\varvec{X}_k\le {\mathbf {x}}). \end{aligned}$$

Here \({\mathcal {H}}_s\) is a \(q_s\times {q_s}\), \(s=1,2,3\), symmetric positive definite matrix depending on n, where \(q_1=r_2+\text {length}(\varvec{V})\), \(q_2=r_1+\text {length}(\varvec{V})\), and \(q_3=\text {length}(\varvec{V})\). \({\mathcal {K}}_{{\mathcal {H}}_s}({\mathbf {u}})=|{\mathcal {H}}_s|^{-1/2}{\mathcal {K}}_s({\mathcal {H}}_s^{-1/2}{\mathbf {u}})\) for \({\mathcal {K}}_s(\cdot )\) a \(q_s\)-variate kernel with \(\int {\mathcal {K}}_s({\mathbf {u}})d{\mathbf {u}}=1\) and \({\mathcal {H}}_s^{1/2}\) the bandwidth matrix. Similarly, to develop the MI2 method for this case, one can form the empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(\varvec{X}_i\) given \((Y_i,\varvec{V}_i)\) as follows:

$$\begin{aligned} \widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}^{**}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{(\delta _{k1}+\delta _{k3})I (Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}}(\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}(\delta _{s1}+\delta _{s3})I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}} (\varvec{V}_s-\varvec{V}_i)}}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1),\\ \widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}^{**}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{(\delta _{k1}+\delta _{k2})I (Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}}(\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}(\delta _{s1}+\delta _{s2})I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}} (\varvec{V}_s-\varvec{V}_i)}}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2), \\ \widetilde{{\widetilde{F}}}_{\varvec{X}_i}^{**}({\mathbf {x}}|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{\delta _{k1}I(Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}} (\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}}(\varvec{V}_s-\varvec{V}_i)}}\right)\, I(\varvec{X}_k\le {\mathbf {x}}). \end{aligned}$$

Here \({\mathcal {H}}\) is a \(q_3\times {q_3}\) symmetric positive definite matrix depending on n. \({\mathcal {K}}_{{\mathcal {H}}}({\mathbf {u}})=|{\mathcal {H}}|^{-1/2}{\mathcal {K}}({\mathcal {H}}^{-1/2}{\mathbf {u}})\) for \({\mathcal {K}}(\cdot )\) a \(q_3\)-variate kernel with \(\int {\mathcal {K}}({\mathbf {u}})d{\mathbf {u}}=1\) and \({\mathcal {H}}^{1/2}\) the bandwidth matrix.

Finally, as an extension, the two proposed MI methods can be applied to estimate the parameters of logistic regression with the outcome and covariates missing separately or simultaneously, and to estimate the parameters of a multinomial logit model with the same missing-data issue. These tasks are left for future research.