1 Introduction

“Missing data” is a widespread issue in practical data analysis and frequently appears in many areas of science for various reasons, e.g., survey non-response, data collection conditions, and expensive or long-term experiments. Rubin (1976) formalized missing values through the concept of a missingness mechanism, in which the missingness indicators are treated as random variables described by a distribution. There are three main types of missingness mechanism: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR means that the occurrence of missing values is completely independent of both the variables with missing observations and the variables with no missing observations. MAR means that the missingness is related only to the variables with no missing observations, not to the variables with missing observations. MNAR, which is neither MAR nor MCAR, implies that the missingness is related to both observed and unobserved values. Missing data can lead to biased parameter estimates, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings (Dong and Peng 2013). Various methods for dealing with missing values in regression models have been proposed; see, e.g., Rubin (1976), Little (1992), Zhao and Lipsitz (1992), Wang et al. (2002), Lee et al. (2012), and Lukusa et al. (2016) for more details.
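
As a toy numerical illustration of the three mechanisms (the data-generating choices below are ours, not taken from the references), the following sketch draws missingness indicators for a variable \(X_1\) under MCAR, MAR (missingness depending only on a fully observed \(X_2\)), and MNAR (missingness depending on \(X_1\) itself), and shows the bias MNAR induces in the observed mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)   # variable subject to missingness
x2 = rng.normal(size=n)   # always-observed variable

p_mcar = np.full(n, 0.3)               # MCAR: constant probability
p_mar = 1 / (1 + np.exp(-x2))          # MAR: depends on observed x2 only
p_mnar = 1 / (1 + np.exp(-x1))         # MNAR: depends on x1 itself

miss_mcar = rng.uniform(size=n) < p_mcar
miss_mar = rng.uniform(size=n) < p_mar
miss_mnar = rng.uniform(size=n) < p_mnar

# Under MCAR the observed x1 values remain a representative subsample;
# under MNAR large x1 values are more likely to be missing, so the
# observed mean of x1 is biased downward.
mean_obs_mcar = x1[~miss_mcar].mean()
mean_obs_mnar = x1[~miss_mnar].mean()
```

This bias of the complete cases under a non-MCAR mechanism is exactly what the estimation methods discussed below aim to correct.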

Logistic regression is often applied in studies in which researchers investigate the relationship between a binary response variable and covariates (Hosmer et al. 2013). In practice, logistic regression analyses with one or more covariates MAR arise frequently and pose specific challenges. Several studies have addressed this issue. For instance, Lipsitz et al. (1998) derived a modified conditional logistic regression with covariates MAR. Wang et al. (1997) provided a weighted semiparametric estimation method to investigate the properties of regression parameter estimators when the selection probabilities are estimated by kernel smoothers. Wang et al. (2002) proposed the joint conditional likelihood (JCL) method to estimate the parameters of logistic regression with covariates MAR by combining complete-case (CC) (or validation) data, which include the cases without missing observations, and non-complete (or non-validation) data, which include the cases with missing observations. When both the outcome and covariates in logistic regression are MAR, Lee et al. (2012) presented two semiparametric estimation methods, the validation likelihood (VL) and JCL methods, to estimate the logistic regression parameters. Similarly, Hsieh et al. (2013) applied these approaches to estimate the parameters of logistic regression with the outcome and covariates MAR separately or simultaneously. Jiang et al. (2020) developed a stochastic approximation version of the EM (SAEM) algorithm, based on Metropolis-Hastings sampling, to perform statistical inference for the parameters of logistic regression with missing covariates, and compared their estimators with those of the random forest multiple imputation by chained equations (MICE) method from the mice package in R (Buuren and Groothuis-Oudshoorn 2011). Tran et al.
(2021) recently estimated the parameters of logistic regression with categorical/discrete covariates MAR separately or simultaneously via the JCL estimation method, which uses information from the CC data set and three non-complete data sets to improve estimation efficiency.

The results of Tran et al. (2021) show that, although the JCL estimation method outperforms the CC, semiparametric inverse probability weighting (SIPW), and VL estimation methods, its calculations are more complex and, hence, require longer computing time. In addition, the estimators of the SAEM approach outperform those of the MICE method, but the SAEM approach also takes longer to compute. Moreover, Jiang et al. (2020) show that the MICE estimators underestimate the parameters even when the variables are MCAR. We are therefore highly motivated to develop other estimation methods for logistic regression with covariates MAR separately or simultaneously that are not only comparable to the JCL and MICE estimation methods in terms of efficiency but also simpler and faster to compute.

In this work, based on the ideas of Fay (1996), Wang and Chen (2009), and Lee et al. (2016, 2020), we develop two different types of MI methods to estimate the parameters of logistic regression with two covariate vectors MAR separately or simultaneously, under the assumption that all covariates and surrogates are categorical/discrete. We also compare the two proposed MI methods with the JCL method of Tran et al. (2021) and the MICE method in terms of estimation efficiency and computing time. Our proposed MI methods are two-step procedures following the suggestion of Fay (1996), which makes the calculation simpler and faster than the three-step procedure of Rubin (1987). Firstly, each proposed MI method uses three empirical conditional distribution functions (CDFs) (Wang and Chen 2009) to generate random values for imputing missing data: the first type MI (MI1) method uses only the CC data, whereas the second type MI (MI2) method uses both the CC and non-complete data. Secondly, the estimating equations are solved to obtain estimates of the logistic regression parameters (Fay 1996). These estimating equations are more convenient in practice because they are solved only once, rather than combining M estimating equations, where M is the number of imputations, to obtain pooled estimates as done in Rubin (1987); this shortens computing time. The formulas of Lee et al. (2016, 2020) are applied to estimate the variances of the two proposed MI estimators to improve efficiency in estimation.

Section 2 presents the assumptions and notations used throughout this work. Section 3 reviews the SIPW, VL, JCL, and MICE estimation methods. Two different types of MI estimation methods are proposed in Sect. 4. In Sect. 5, the finite-sample performance of the proposed methods is investigated through extensive simulations under various settings, and two real data sets are used to demonstrate their practical use. Section 6 presents some discussion and conclusions.

2 Assumptions and notations

Let Y be a binary outcome variable denoting whether an event of interest occurs, where \(Y=1\) if the event occurs and \(Y=0\) otherwise. Suppose that \({{{\mathbf {\mathtt{{X}}}}}}_1=(X_1,X_2,\dots ,X_{r_1})^T\) is a vector of \(r_1\) categorical/discrete covariates, and \({{{\mathbf {\mathtt{{X}}}}}}_2=(X_{r_1+1},X_{r_1+2},\dots ,X_p)^T\) is a vector of \(r_2\) categorical/discrete covariates, where \(p=r_1+r_2\). It is assumed that \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) may be missing separately or simultaneously. In this work, we do not consider the case where only some covariates within \({{{\mathbf {\mathtt{{X}}}}}}_1\) or \({{{\mathbf {\mathtt{{X}}}}}}_2\) have missing observations. Thus, when we say that \({{{\mathbf {\mathtt{{X}}}}}}_s\), \(s=1,2\), is missing, we mean that all covariates in \({{{\mathbf {\mathtt{{X}}}}}}_s\) have missing observations simultaneously. Assume that \(\varvec{Z}=(Z_1,Z_2,\dots ,Z_q)^T\) is a vector of q categorical/discrete covariates that are always observed. Let \(\varvec{X}=({{{\mathbf {\mathtt{{X}}}}}}_1^T,{{{\mathbf {\mathtt{{X}}}}}}_2^T)^T\), \({\mathcal {X}}=(1,\varvec{X}^T,\varvec{Z}^T)^T\), and let \(\{(Y_i,{\mathcal {X}}_i): i=1,2,\dots ,n\}\) be a random sample. Assume that \(\varvec{X}\) is MAR. The logistic regression model is considered as follows:

$$\begin{aligned} P(Y_i=1|\varvec{X}_i,\varvec{Z}_i) =H(\beta _0+\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1i}+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2i}+\varvec{\beta }_3^T\varvec{Z}_i) =H({\varvec{\beta }}^T{\mathcal {X}}_i),\ \ i=1,2,\dots ,n, \end{aligned}$$
(1)

where \(H(u)=\{1+\exp (-u)\}^{-1}\) and \(\varvec{\beta }=(\beta _0,\varvec{\beta }_1^T,\varvec{\beta }_2^T,\varvec{\beta }_3^T)^T\) is a vector of parameters associated with \({\mathcal {X}}_i\). The main goal is to estimate \(\varvec{\beta }\) when some of the \(\varvec{X}_i\)s are MAR. Let \(\delta _{ij}\), \(j=1,2,3,4\), denote the missingness statuses of \(\varvec{X}_i=({{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T)^T\), in which \(\delta _{i1}=1\) if both \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are observed, and 0 otherwise; \(\delta _{i2}=1\) if \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) is missing and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) is observed, and 0 otherwise; \(\delta _{i3}=1\) if \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) is observed and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) is missing, and 0 otherwise; \(\delta _{i4}=1\) if both \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are missing, and 0 otherwise. Some studies use a surrogate variable, i.e., an easily measured variable that stands in for one that is difficult or impossible to measure; employing a surrogate for an MAR variable adds information to the model and, hence, enhances the performance of estimation and prediction. See, e.g., Wang et al. (1997, 2002), Hsieh et al. (2010, 2013), and Lee et al. (2011, 2012, 2020) for more details. We also consider the possibility of categorical/discrete surrogate vectors \({{{\mathbf {\mathtt{{W}}}}}}_1\) and \({{{\mathbf {\mathtt{{W}}}}}}_2\) for \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\), respectively, such that \({{{\mathbf {\mathtt{{W}}}}}}_1\) and \({{{\mathbf {\mathtt{{W}}}}}}_2\) are dependent on \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\), respectively, and independent of Y given \(\varvec{X}\) and \(\varvec{Z}\).
That is, \({{{\mathbf {\mathtt{{W}}}}}}_1\) is correlated with \({{{\mathbf {\mathtt{{X}}}}}}_1\), and \({{{\mathbf {\mathtt{{W}}}}}}_2\) with \({{{\mathbf {\mathtt{{X}}}}}}_2\). Hence, we have the logistic regression model \(P(Y_i=1|{{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{Z}_i,{{{\mathbf {\mathtt{{W}}}}}}_{1i},{{{\mathbf {\mathtt{{W}}}}}}_{2i})=P(Y_i=1|{{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{Z}_i) =H(\beta _0+\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1i}+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2i}+\varvec{\beta }_3^T\varvec{Z}_i)\) as given in (1). Let \(\varvec{W}=({{{\mathbf {\mathtt{{W}}}}}}_1^T,{{{\mathbf {\mathtt{{W}}}}}}_2^T)^T\) and \(\varvec{V}_i=(\varvec{Z}_i^T,\varvec{W}_i^T)^T\), \(i=1,\dots ,n\). The CC data set (\(\delta _{i1}=1\)) consists of \((Y_i,\varvec{X}_i,\varvec{V}_i)\), and the three non-complete data sets consist of \((Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i)\), \((Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i)\), and \((Y_i,\varvec{V}_i)\) when \(\delta _{i2}\), \(\delta _{i3}\), and \(\delta _{i4}\) equal 1, respectively. Under the assumption that \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are MAR (Rubin 1976), the selection probability model is

$$\begin{aligned} P(\delta _{ij}=1|Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{Z}_i,\varvec{W}_i)=\pi _j(Y_i,\varvec{Z}_i,\varvec{W}_i) =\pi _j(Y_i,\varvec{V}_i), \ j=1,2,3,4, \end{aligned}$$
(2)

with \(\sum _{j=1}^{4}\pi _j(Y_i,\varvec{V}_i)=1\). The \(\pi _j(Y_i,\varvec{V}_i)\)s are unknown nuisance parameters that need to be estimated. In this study, under the assumption that the \(\varvec{V}_i\)s are categorical/discrete vectors, the nonparametric estimators of \(\pi _j(Y_i,\varvec{V}_i)\) are given as follows:

$$\begin{aligned} {\widehat{\pi }}_j(Y_i,\varvec{V}_i)=\frac{\sum _{k=1}^n\delta _{kj}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^nI(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)},\ j=1,2,3,4, \end{aligned}$$
(3)

where \(I(\cdot )\) is an indicator function. Note that in this work the requirement of categorical/discrete covariates is for mathematical derivation purposes. The \({\widehat{\pi }}_1(Y_i,\varvec{V}_i)\)s serve as weights for the SIPW and VL estimation methods, and the \({\widehat{\pi }}_j(Y_i,\varvec{V}_i)\)s, \(j=1,2,3,4\), serve as weights for the modified conditional probabilities of the JCL estimation method in Sect. 3. If the covariates are continuous, one can instead use kernel estimation following the arguments of Wang and Wang (2001).
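
With categorical \((Y,\varvec{V})\), the estimator in (3) is simply a cell-wise proportion: within each observed \((Y,\varvec{V})\) cell, \({\widehat{\pi }}_j\) is the fraction of cases in that cell with \(\delta _{j}=1\). A minimal sketch (function and variable names are ours):

```python
import numpy as np

def pi_hat(delta_j, y, v):
    """Nonparametric estimator (3) of P(delta_j = 1 | Y, V) for discrete Y, V.

    delta_j : (n,) 0/1 indicators of missingness pattern j
    y       : (n,) binary outcomes
    v       : (n, q) always-observed discrete covariates
    Returns an (n,) array with the estimated probability of each case's cell.
    """
    delta_j, y, v = np.asarray(delta_j), np.asarray(y), np.asarray(v)
    cells = np.column_stack([y, v])          # one (Y, V) row per case
    keys, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)            # guard against NumPy version differences
    # numerator and denominator of (3), accumulated per cell
    num = np.bincount(inverse, weights=delta_j.astype(float), minlength=len(keys))
    den = np.bincount(inverse, minlength=len(keys))
    return (num / den)[inverse]

# toy check: delta depends only on y
y = np.array([1, 1, 1, 0, 0, 0])
v = np.array([[1], [1], [1], [1], [1], [1]])
delta1 = np.array([1, 1, 0, 1, 0, 0])
probs = pi_hat(delta1, y, v)   # 2/3 in the y=1 cell, 1/3 in the y=0 cell
```

The same function applied to each \(\delta _{ij}\), \(j=1,2,3,4\), yields all four sets of estimated selection probabilities.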

3 Review of estimation methods

This section briefly reviews four well-known estimation methods, the SIPW, VL, JCL, and MICE methods, for logistic regression with covariates MAR separately or simultaneously. Their estimation performance is compared with that of the proposed approaches in the following sections.

3.1 SIPW estimation method

Horvitz and Thompson (1952) proposed a weighted estimator that uses inverse probability weighting (IPW) to reduce estimation bias, now known as the H-T estimator. When the selection probabilities are known, Zhao and Lipsitz (1992) extended the H-T estimator to an IPW estimator with improved efficiency in estimation. This approach is, however, limited in practice because the selection probabilities are usually unknown. Therefore, some authors, e.g., Wang et al. (1997) and Wang and Wang (2001), suggested the SIPW approach, which uses nonparametric estimators of the unknown selection probabilities as inverse weights; see, e.g., Hsieh et al. (2010) and Lee et al. (2012) for further details. Considering the logistic regression model (1) when \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are MAR separately or simultaneously, one can obtain the SIPW estimator \(\widehat{\varvec{\beta }}_W\) of \(\varvec{\beta }\) by solving the following estimating equations:

$$\begin{aligned} \varvec{U}_W(\varvec{\beta },\widehat{\varvec{\pi }}_1) =\dfrac{1}{\sqrt{n}}\sum _{i=1}^{n}\dfrac{\delta _{i1}}{{\widehat{\pi }}_1(Y_i,\varvec{V}_i)}{\mathcal {X}}_i(Y_i-H({\varvec{\beta }}^T{\mathcal {X}}_i))=\varvec{0}, \end{aligned}$$
(4)

where \(\widehat{\varvec{\pi }}_1=\left( {\widehat{\pi }}_{11},\dots ,{\widehat{\pi }}_{1n}\right)\) for \({\widehat{\pi }}_{1i}={\widehat{\pi }}_1(Y_i,\varvec{V}_i)\), given in (3), being the estimator of \(\pi _{1i}=\pi _1(Y_i,\varvec{V}_i)\) defined in (2).
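​
Solving (4) amounts to fitting a weighted logistic regression on the complete cases with weights \(\delta _{i1}/{\widehat{\pi }}_1(Y_i,\varvec{V}_i)\). A hedged sketch via Newton-Raphson (all names are ours; x includes the intercept column):

```python
import numpy as np

def sipw_logistic(x, y, delta1, pi1_hat, n_iter=25):
    """Solve the SIPW estimating equations (4) by Newton-Raphson.

    x       : (n, p) design matrix including the intercept column
    y       : (n,) binary outcomes
    delta1  : (n,) complete-case indicators
    pi1_hat : (n,) estimated selection probabilities pi_1(Y_i, V_i)
    """
    w = delta1 / pi1_hat                    # inverse-probability weights
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-x @ beta))    # H(beta^T X_i)
        score = x.T @ (w * (y - mu))        # weighted score of (4)
        hess = x.T @ (x * (w * mu * (1 - mu))[:, None])
        beta = beta + np.linalg.solve(hess, score)
    return beta
```

When there are no missing data (all \(\delta _{i1}=1\) and \({\widehat{\pi }}_1\equiv 1\)), this reduces to the ordinary logistic maximum likelihood estimator.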

3.2 VL estimation method

Breslow and Cain (1988) proposed a conditional maximum likelihood (ML) approach to estimate the parameters of logistic regression for two-stage case-control data. They showed that their estimator of \(\varvec{\beta }\) is not only consistent and asymptotically normal but also useful when covariate information is missing for a large part of the sample. Wang et al. (2002), Lee et al. (2012), and Hsieh et al. (2013) applied this approach to develop the VL estimation method, which uses the CC data, for the issue of missing data in logistic regression. When \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are MAR separately or simultaneously, Tran et al. (2021) estimated \(\varvec{\beta }\) by using the following estimating equations:

$$\begin{aligned} \widehat{\varvec{U}}_V(\varvec{\beta }) =\dfrac{1}{\sqrt{n}}\sum _{i=1}^{n}\delta _{i1} {\mathcal {X}}_i(Y_i-{\widehat{H}}_1({{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta })) =\varvec{0}, \end{aligned}$$
(5)

where

$$\begin{aligned} {\widehat{H}}_1({{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }) =H\left( \beta _0+\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1i}+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2i}+\varvec{\beta }_3^T\varvec{Z}_i +\ln \frac{{\widehat{\pi }}_1(1,\varvec{V}_i)}{{\widehat{\pi }}_1(0,\varvec{V}_i)}\right) \end{aligned}$$
(6)

and \({\widehat{\pi }}_1(Y_i,\varvec{V}_i)\) is given in (3). One can solve \(\widehat{\varvec{U}}_V(\varvec{\beta })=\varvec{0}\) to obtain the VL estimator \(\widehat{\varvec{\beta }}_V\) of \(\varvec{\beta }\).
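
Equation (5) is an ordinary logistic score on the complete cases except that each case carries the fixed offset \(\ln \{{\widehat{\pi }}_1(1,\varvec{V}_i)/{\widehat{\pi }}_1(0,\varvec{V}_i)\}\) inside \(H(\cdot )\). A sketch of a Newton solver with such an offset (names are ours):

```python
import numpy as np

def vl_logistic(x, y, offset, n_iter=25):
    """Solve the VL estimating equations (5): a logistic fit on the complete
    cases with a fixed per-case offset log(pi1_hat(1,V)/pi1_hat(0,V)).

    x      : (m, p) design matrix of the complete cases (intercept included)
    y      : (m,) outcomes of the complete cases
    offset : (m,) fixed offsets added inside H(.)
    """
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-(x @ beta + offset)))   # H_1 of (6)
        score = x.T @ (y - mu)
        hess = x.T @ (x * (mu * (1 - mu))[:, None])
        beta = beta + np.linalg.solve(hess, score)
    return beta
```

A quick sanity check of the offset construction: adding a constant offset c leaves the slope estimates unchanged and shifts the fitted intercept by exactly \(-c\).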

3.3 JCL estimation method

The SIPW and VL approaches use only the CC data set (\(\delta _{i1}=1\)) and, hence, may not achieve maximal efficiency in estimation. To overcome this drawback, Wang et al. (2002) proposed the JCL estimation method, which combines the CC and non-complete data; see, e.g., Lee et al. (2012) and Hsieh et al. (2013) for more details. Tran et al. (2021) extended the JCL method to estimate the parameters of logistic regression with covariates MAR separately or simultaneously.

When \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are MAR separately or simultaneously, one can obtain the JCL estimator \(\widehat{\varvec{\beta }}_J=({\widehat{\beta }}_0,\widehat{\varvec{\beta }}_{J1}^T,\widehat{\varvec{\beta }}_{J2}^T,\widehat{\varvec{\beta }}_{J3}^T)^T\) of \(\varvec{\beta }\) in the logistic regression model (1) by solving the following estimating equations:

$$\begin{aligned} \widehat{\varvec{U}}_J(\varvec{\beta })&=\dfrac{1}{\sqrt{n}}\sum _{i=1}^{n}\big \{\delta _{i1} {\mathcal {X}}_i(Y_i-{\widehat{H}}_1({{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta })) +\delta _{i2}\widehat{{\mathcal {A}}}({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }_1)(Y_i-{\widehat{H}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }))\nonumber \\&\quad +\delta _{i3}\widehat{{\mathcal {B}}}({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2)(Y_i -{\widehat{H}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }))+\delta _{i4} \widehat{{\mathcal {T}}}(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2)(Y_i-{\widehat{H}}_4(\varvec{V}_i;\varvec{\beta }))\big \} \nonumber \\&=\varvec{0}, \end{aligned}$$
(7)

where \({\widehat{H}}_1({{{\mathbf {\mathtt{{X}}}}}}_{1i},{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\,\varvec{\beta })\) is defined in (6), and

$$\begin{aligned} {\widehat{H}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\,\varvec{\beta })&=H\left( \beta _0+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2i}+\varvec{\beta }_3^T\varvec{Z}_i +{\widehat{R}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }_1)+\ln \frac{{\widehat{\pi }}_2(1,\varvec{V}_i)}{{\widehat{\pi }}_2(0,\varvec{V}_i)}\right) , \end{aligned}$$
(8)
$$\begin{aligned} {\widehat{H}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\,\varvec{\beta })&=H\left( \beta _0+\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1i}+\varvec{\beta }_3^T\varvec{Z}_i +{\widehat{R}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2)+\ln \frac{{\widehat{\pi }}_3(1,\varvec{V}_i)}{{\widehat{\pi }}_3(0,\varvec{V}_i)}\right) , \end{aligned}$$
(9)
$$\begin{aligned} {\widehat{H}}_4(\varvec{V}_i;\,\varvec{\beta })&=H\left( \beta _0+\varvec{\beta }_3^T\varvec{Z}_i+{\widehat{R}}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2) +\ln \frac{{\widehat{\pi }}_4(1,\varvec{V}_i)}{{\widehat{\pi }}_4(0,\varvec{V}_i)}\right) , \end{aligned}$$
(10)
$$\begin{aligned} \widehat{{\mathcal {A}}}({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\,\varvec{\beta }_1)&=\left( 1,{\widehat{R}}_2^{(\varvec{\beta }_1)}({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i; \varvec{\beta }_1),{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}_i^T\right) ^T, \end{aligned}$$
(11)
$$\begin{aligned} \widehat{{\mathcal {B}}}({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\,\varvec{\beta }_2)&=\left( 1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{\widehat{R}}^{(\varvec{\beta }_2)}_3 ({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2),\varvec{Z}_i^T\right) ^T, \end{aligned}$$
(12)
$$\begin{aligned} \widehat{{\mathcal {T}}}(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2)&=\left( 1,{\widehat{R}}_4^{(\varvec{\beta }_1)}(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2), {\widehat{R}}^{(\varvec{\beta }_2)}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2),\varvec{Z}_i^T\right) ^T, \end{aligned}$$
(13)

for

$$\begin{aligned} {\widehat{R}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }_1)&=\ln \dfrac{\sum _{j=1}^{n}\delta _{j1} e^{\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1j}}I(Y_j=0,{{{\mathbf {\mathtt{{X}}}}}}_{2j}={{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_j=\varvec{V}_i)}{\sum _{k=1}^{n}\delta _{k1}I(Y_k=0,{{{\mathbf {\mathtt{{X}}}}}}_{2k}={{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_k=\varvec{V}_i)}, \end{aligned}$$
(14)
$$\begin{aligned} {\widehat{R}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2)&=\ln \dfrac{\sum _{j=1}^{n}\delta _{j1} e^{\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2j}}I(Y_j=0,{{{\mathbf {\mathtt{{X}}}}}}_{1j}={{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_j=\varvec{V}_i)}{\sum _{k=1}^{n}\delta _{k1}I(Y_k=0,{{{\mathbf {\mathtt{{X}}}}}}_{1k}={{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_k=\varvec{V}_i)}, \end{aligned}$$
(15)
$$\begin{aligned} {\widehat{R}}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2)&=\ln \dfrac{\sum _{j=1}^{n}\delta _{j1} e^{\varvec{\beta }_1^T{{{\mathbf {\mathtt{{X}}}}}}_{1j}+\varvec{\beta }_2^T{{{\mathbf {\mathtt{{X}}}}}}_{2j}}I(Y_j=0,\varvec{V}_j=\varvec{V}_i)}{\sum _{k=1}^{n}\delta _{k1}I(Y_k=0,\varvec{V}_k=\varvec{V}_i)}, \end{aligned}$$
(16)

and

$$\begin{aligned} {\widehat{R}}_2^{(\varvec{\beta }_1)}({{{\mathbf {\mathtt{{X}}}}}}_{2i},V_i;\varvec{\beta }_1)&=\dfrac{\partial }{\partial \varvec{\beta }_1^T} {\widehat{R}}_2({{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_i;\varvec{\beta }_1),\ {\widehat{R}}_3^{(\varvec{\beta }_2)}({{{\mathbf {\mathtt{{X}}}}}}_{1i},V_i;\varvec{\beta }_2)=\dfrac{\partial }{\partial \varvec{\beta }_2^T} {\widehat{R}}_3({{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_i;\varvec{\beta }_2),\\ {\widehat{R}}_4^{(\varvec{\beta }_1)}(V_i;\varvec{\beta }_1,\varvec{\beta }_2)&=\dfrac{\partial }{\partial \varvec{\beta }_1^T} {\widehat{R}}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2),\ {\widehat{R}}_4^{(\varvec{\beta }_2)}(V_i;\varvec{\beta }_1,\varvec{\beta }_2)=\dfrac{\partial }{\partial \varvec{\beta }_2^T} {\widehat{R}}_4(\varvec{V}_i;\varvec{\beta }_1,\varvec{\beta }_2). \end{aligned}$$
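
Each \({\widehat{R}}\) term in (14)-(16) is the logarithm of a cell-wise average of \(e^{\varvec{\beta }^T{{{\mathbf {\mathtt{{X}}}}}}}\) over the complete cases in the matching cell, which is part of what makes the JCL computations heavy. For instance, \({\widehat{R}}_2\) in (14) can be sketched for a single non-complete case as (names are ours):

```python
import numpy as np

def r2_hat(beta1, x1_cc, mask_cc):
    """Estimate R_2(X_2i, V_i; beta_1) of (14) for one non-complete case.

    beta1   : (r1,) current value of beta_1
    x1_cc   : (m, r1) observed X_1 values among the complete cases
    mask_cc : (m,) True for complete cases j with Y_j = 0 and
              (X_{2j}, V_j) equal to the (X_2i, V_i) of the target case
    """
    vals = np.exp(x1_cc[mask_cc] @ beta1)
    return np.log(vals.mean())   # log of the cell average of e^{beta_1^T X_1}
```

Since the \({\widehat{R}}\) terms and their derivatives must be re-evaluated for every case at every iteration, this illustrates why the JCL method is computationally demanding.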

3.4 MICE estimation method

MI (Rubin 1987, 1996) is a simple and powerful method for dealing with missing data and is available in several commonly used statistical software packages. One can use MI approaches to generate fully imputed (“completed”) data sets by retaining the observed values and replacing the missing data with plausible values from an imputation model. There are various extensions of the MI method of Rubin (1987), which differ mostly in the structural assumptions of the imputation model; see, e.g., Rubin (1987, 1996), Rubin and Schenker (1986), Fay (1996), and Pahel et al. (2011) for details. Their procedures often follow three basic steps, shown in Fig. 1A: Step 1) Imputation: impute missing values M times to obtain M imputed (“completed”) data sets; Step 2) Analysis: analyze each of the M imputed (“completed”) data sets by using the chosen statistical methods; Step 3) Pooling: combine the M analysis results from Step 2 into one result by using the formula of Rubin (1987). MI methods differ mainly in how they impute the missing values in Step 1, e.g., mean, regression, hot-deck, cold-deck, principal-component, and chained-equation imputation (Little and Rubin 2019).

The mice package (Buuren and Groothuis-Oudshoorn 2011) is a popular package that performs MI by chained equations, also known as fully conditional specification or sequential regression MI, in Step 1. Specifically, assume \(\varvec{u}=(u_1,u_2,\dots ,u_k)\) is a vector of k variables with missing observations. Firstly, all missing values are filled in by simple random sampling with replacement from the observed values of each variable. Secondly, construct a regression model of \(u_1\) on \(u_2,\dots ,u_k\) and replace the missing values of \(u_1\) with simulated draws from the corresponding posterior predictive distribution. Then, regress \(u_2\) on the imputed \(u_1\) and on \(u_3,\dots ,u_k\), and again replace the missing values of \(u_2\) with simulated draws from the corresponding posterior predictive distribution. Repeat this process for each of \(u_3,\dots ,u_k\); this constitutes one “cycle”. Repeat the cycle several times (e.g., 10 or 20) to create the first imputed (“completed”) data set, and repeat the whole procedure M times to obtain M imputed (“completed”) data sets. Steps 2 and 3 of this MI method follow the rule of Rubin (1987).
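
A stripped-down sketch of this chained-equations cycle for two incomplete continuous variables, using simple linear regressions with a normal noise draw in place of mice's richer posterior predictive imputation models (purely illustrative; all names and modeling choices are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
u1 = rng.normal(size=n)
u2 = 0.8 * u1 + rng.normal(scale=0.5, size=n)
m1 = rng.uniform(size=n) < 0.3   # missingness masks for u1, u2
m2 = rng.uniform(size=n) < 0.3

def regress_impute(target, predictor, miss, rng):
    """Refill the missing entries of target by a noisy draw from a
    linear regression of target on predictor (observed cases only)."""
    obs = ~miss
    A = np.column_stack([np.ones(obs.sum()), predictor[obs]])
    coef, *_ = np.linalg.lstsq(A, target[obs], rcond=None)
    resid_sd = np.std(target[obs] - A @ coef)
    out = target.copy()
    pred = coef[0] + coef[1] * predictor[miss]
    out[miss] = pred + rng.normal(scale=resid_sd, size=miss.sum())
    return out

# initial fill: random sampling with replacement from observed values
v1 = u1.copy()
v1[m1] = rng.choice(u1[~m1], size=m1.sum())
v2 = u2.copy()
v2[m2] = rng.choice(u2[~m2], size=m2.sum())

for _ in range(10):   # one "cycle", repeated 10 times
    v1 = regress_impute(v1, v2, m1, rng)
    v2 = regress_impute(v2, v1, m2, rng)
```

Running this whole procedure M times with different random draws would produce the M imputed data sets of Step 1; the observed entries are never altered.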

Fig. 1 The main steps of multiple imputation. A Rubin’s type. B Fay’s type

In summary, this section has introduced the four estimation methods used for benchmark comparisons in this work. Although the simulation results of Tran et al. (2021) showed that the JCL method outperforms the CC, VL, and SIPW methods, it is relatively difficult to obtain the JCL estimates of the logistic regression parameters because of its complex calculations and long computing time. In addition, the MICE approach can be applied to both continuous and categorical/discrete data and any missingness mechanism (White et al. 2011), but its estimators for logistic regression with missing covariates underestimate the parameters, even though their standard errors are small (Jiang et al. 2020). Moreover, its procedure can be computationally time-consuming. Therefore, in Sect. 4, we propose two different types of MI methods that not only provide estimation efficiency comparable with the JCL and MICE approaches but also involve more straightforward computations that shorten computing time.

4 Proposed MI estimation methods

This section introduces two different types of MI methods based on the ideas of Fay (1996), Wang and Chen (2009), and Lee et al. (2016, 2020). The procedure of these two proposed MI methods has two steps (Fay’s type, Fig. 1B): Step 1) Impute values for the non-complete data by using the empirical CDFs of the missing values given the observed data, as done in Wang and Chen (2009), to obtain M imputed (“completed”) data sets; Step 2) Solve the estimating equations only once to obtain the estimates of the logistic regression parameters. The estimated variances of the MI estimators are then obtained by using the formulas of Lee et al. (2016, 2020). The main difference between the two MI methods lies in how missing data are imputed in Step 1. As mentioned in Sect. 1, this procedure is more convenient in practice and saves computing time. The details of the two MI methods are given in the following sections.

4.1 Type 1 MI (MI1) method

Let \(F_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \(F_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(F_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\), where \({\mathbf {x}}=({{{\mathbf {\mathtt{{x}}}}}}_1^T,{{{\mathbf {\mathtt{{x}}}}}}_2^T)^T\), be the CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(\varvec{X}_i=({{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T)^T\) given \((Y_i,\varvec{V}_i)\), respectively. To build the MI1 method, we consider the following empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(\varvec{X}_i\) given \((Y_i,\varvec{V}_i)\):

$$\begin{aligned} {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{2k}={{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{2s}={{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_s=\varvec{V}_i)}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1), \nonumber \\ {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{1k}={{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,{{{\mathbf {\mathtt{{X}}}}}}_{1s}={{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_s=\varvec{V}_i)}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2), \nonumber \\ {\widetilde{F}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n} \delta _{s1}I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)}\right)\, I(\varvec{X}_k\le {\mathbf {x}}), \end{aligned}$$
(17)

respectively. When \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are MAR separately or simultaneously, their missing values are imputed several times by random values generated from \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \({\widetilde{F}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\). The MI1 procedure is summarized as follows:

Step 1.:

Imputation: Generate the vth imputed (“completed”) data (\(v=1,2,\dots ,M\)) based on the missingness status of \(\varvec{X}_i=({{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T)^T\), \(i=1,2,\dots ,n\).

i):

If \(\delta _{i1}=1\) (\({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are observed), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Set \({\mathcal {X}}_i=(1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}_i^T)^T\) for all v.

ii):

If \(\delta _{i2}=1\) (\({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) is missing and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) is observed), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) and generate \(\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}_{1iv}\) from \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\). Define \(\widetilde{{\mathcal {X}}}_{2iv}=(1,\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}^T_{1iv},{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}^T_i)^T\).

iii):

If \(\delta _{i3}=1\) (\({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) is observed and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) is missing), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and generate \(\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}_{2iv}\) from \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Define \(\widetilde{{\mathcal {X}}}_{3iv}=(1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}^T_{2iv},\varvec{Z}^T_i)^T\).

iv):

If \(\delta _{i4}=1\) (both \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are missing), generate \(\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}_{1iv}\) and \(\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}_{2iv}\) from \({\widetilde{F}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Define \(\widetilde{{\mathcal {X}}}_{4iv}=(1,\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}^T_{1iv},\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}^T_{2iv},\varvec{Z}^T_i)^T\).

Step 2.:

Analysis: Solve the following estimating equations:

$$\begin{aligned} \varvec{U}_{M1}(\varvec{\beta }) =\dfrac{1}{\sqrt{n}} \sum _{i=1}^{n}\left( \delta _{i1}\varvec{S}_i(\varvec{\beta })+\delta _{i2} \widetilde{\varvec{S}}_{2i}(\varvec{\beta })+\delta _{i3}\widetilde{\varvec{S}}_{3i}(\varvec{\beta })+\delta _{i4}\widetilde{\varvec{S}}_{4i}(\varvec{\beta })\right) =\varvec{0}, \end{aligned}$$
(18)

where \(\varvec{S}_i(\varvec{\beta })={\mathcal {X}}_i(Y_i-H(\varvec{\beta }^T{\mathcal {X}}_i))\), \(\widetilde{\varvec{S}}_{ki}(\varvec{\beta })=M^{-1}\sum _{v=1}^{M}\widetilde{{\mathcal {X}}}_{kiv}(Y_i-H(\varvec{\beta }^T\widetilde{{\mathcal {X}}}_{kiv}))\), \(k=2,3,4\), to obtain the MI1 estimator, \(\widehat{\varvec{\beta }}_{M1}\), of \(\varvec{\beta }\). Next, calculate the estimated variance of \(\widehat{\varvec{\beta }}_{M1}\), \(\widehat{{\,\mathrm{Var}\,}}(\widehat{\varvec{\beta }}_{M1})\), by the formula of Lee et al. (2016), which is also a Rubin-type estimated variance (Rubin 1987), as follows:

$$\begin{aligned} \varvec{G}_{M1}^{-1}(\widehat{\varvec{\beta }}_{M1})\left\{ \dfrac{1}{M}\sum _{v=1}^{M}\sum _{i=1}^{n} (\widetilde{\varvec{U}}_{vi}(\widehat{\varvec{\beta }}_{M1}))^{\otimes 2} +\left( 1+\dfrac{1}{M}\right) \dfrac{\sum _{v=1}^{M}(\widetilde{\varvec{U}}_{v} (\widehat{\varvec{\beta }}_{M1}))^{\otimes 2}}{M-1}\right\} (\varvec{G}_{M1}^{-1}(\widehat{\varvec{\beta }}_{M1}))^T, \end{aligned}$$
(19)

where \(\varvec{G}_{M1}(\varvec{\beta })\) is the gradient of \(-M^{-1}\sum _{v=1}^{M}\widetilde{\varvec{U}}_{v}(\varvec{\beta })=-\varvec{U}_{M1}(\varvec{\beta })\), for

$$\begin{aligned} \widetilde{\varvec{U}}_{vi}(\varvec{\beta })&=\dfrac{1}{\sqrt{n}}\left( \delta _{i1}\varvec{S}_i(\varvec{\beta }) +\delta _{i2}\widetilde{\varvec{S}}_{2i}(\varvec{\beta })+\delta _{i3}\widetilde{\varvec{S}}_{3i}(\varvec{\beta })+\delta _{i4}\widetilde{\varvec{S}}_{4i}(\varvec{\beta })\right) , \\ \widetilde{\varvec{U}}_{v}(\varvec{\beta })&=\sum _{i=1}^{n}\widetilde{\varvec{U}}_{vi}(\varvec{\beta }). \end{aligned}$$

Notice that in Step 1, the imputed data sets are generated by random sampling from the empirical CDFs based on the CC data. The indicator variables \(\delta _{ij}\), \(j=1,2,3,4\), \(i=1,2,\dots ,n\), identify exactly which part of the partitioned covariate vector is observed and can therefore serve as conditioning information for the empirical CDFs. For example, when \(\delta _{i2}=1\) (\(\delta _{i3}=1\)), the observed \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) (\({{{\mathbf {\mathtt{{X}}}}}}_{1i}\)), \(Y_i\), and \(\varvec{V}_i\) are used as the condition to create a set of values for the missing \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) (\({{{\mathbf {\mathtt{{X}}}}}}_{2i}\)); when \(\delta _{i4}=1\), both \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are missing, so only \(Y_i\) and \(\varvec{V}_i\) are used as the condition to create a set of values for both missing covariates. In other words, the procedure exploits the exact, and thus richer, information from the observed part of the covariate vector for missing data imputation, and, hence, estimation may be more efficient. In addition, because this method has only two steps and solves the estimating equations only once, it shortens computing time.
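As an illustration, the two-step MI1 procedure above can be sketched for discrete covariates as follows. This is a minimal sketch, not the authors' implementation: the data-generating values and the scalar conditioning variable are assumed for illustration, each incomplete case is imputed M times by sampling from matching complete cases (the empirical conditional CDF), and the estimating equation (18) is solved by Newton–Raphson. The fallback to the full complete-case pool when no exact match exists is an added practical safeguard, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

# --- simulated data with discrete covariates (as MI1 requires) ---
n, M = 600, 20
x1 = rng.choice([-0.3, 0.4, 1.0], size=n)
x2 = rng.choice([-1.0, 0.2, 0.6], size=n)
z = rng.binomial(1, 0.4, size=n)
beta_true = np.array([1.0, -0.5, 1.0, 0.7])       # (intercept, x1, x2, z)
y = rng.binomial(1, sigmoid(np.column_stack([np.ones(n), x1, x2, z]) @ beta_true))
# missingness status: 1 = complete, 2 = x1 missing, 3 = x2 missing, 4 = both
status = rng.choice([1, 2, 3, 4], size=n, p=[0.6, 0.15, 0.15, 0.1])
cc = status == 1                                  # complete cases

def draw(mask, m):
    """Draw m donor indices from matching complete cases (empirical
    conditional CDF); fall back to all complete cases if no match."""
    pool = np.flatnonzero(mask & cc)
    if pool.size == 0:
        pool = np.flatnonzero(cc)
    return rng.choice(pool, size=m)

# --- Step 1 (imputation): M completed design matrices, shape (n, M, 4) ---
X = np.empty((n, M, 4))
for i in range(n):
    x1i, x2i = np.full(M, x1[i]), np.full(M, x2[i])
    if status[i] == 2:      # impute x1 given (x2, y, z)
        k = draw((x2 == x2[i]) & (y == y[i]) & (z == z[i]), M)
        x1i = x1[k]
    elif status[i] == 3:    # impute x2 given (x1, y, z)
        k = draw((x1 == x1[i]) & (y == y[i]) & (z == z[i]), M)
        x2i = x2[k]
    elif status[i] == 4:    # impute (x1, x2) jointly given (y, z)
        k = draw((y == y[i]) & (z == z[i]), M)
        x1i, x2i = x1[k], x2[k]
    X[i] = np.column_stack([np.ones(M), x1i, x2i, np.full(M, z[i])])

# --- Step 2 (analysis): solve sum_i mean_v X_iv (y_i - H(beta'X_iv)) = 0 ---
beta = np.zeros(4)
for _ in range(50):
    p = sigmoid(X @ beta)                                  # (n, M)
    U = np.einsum('imj,im->j', X, y[:, None] - p) / M      # averaged score
    J = -np.einsum('imj,im,imk->jk', X, p * (1 - p), X) / M
    step = np.linalg.solve(J, -U)
    beta += step
    if np.linalg.norm(step) < 1e-10:
        break
```

Because each incomplete case carries its own M imputed design rows, the averaged score is simply a mean over the third array axis, and the whole procedure solves the estimating equations only once.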

4.2 Type 2 MI (MI2) method

The MI2 method is quite similar to the MI1 method except for the formulas of the empirical CDFs, which are used to generate values to fill in the missing data. Consider the empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \((Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \((Y_i,\varvec{V}_i)\), and \(\varvec{X}_i\) given \((Y_i,\varvec{V}_i)\) as follows:

$$\begin{aligned} \widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{(\delta _{k1}+\delta _{k3})I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}(\delta _{s1}+\delta _{s3})I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)}\right) \,I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1),\\ \widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{(\delta _{k1}+\delta _{k2})I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}(\delta _{s1}+\delta _{s2})I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2), \\ \widetilde{{\widetilde{F}}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{\delta _{k1}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i)}\right)\, I(\varvec{X}_k\le {\mathbf {x}}), \end{aligned}$$

respectively. When \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) are MAR separately or simultaneously, their missing values are imputed several times by random observations generated from \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)\), \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)\), and \(\widetilde{{\widetilde{F}}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\) according to the missingness statuses of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). The procedure of the MI2 method is stated as follows:

Step 1.:

Imputation: Generate the vth imputed (“completed”) data (\(v=1,2,\dots ,M\)) according to the missingness status of \(\varvec{X}_i=({{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T)^T\), \(i=1,2,\dots ,n\).

i):

If \(\delta_{i1}=1\), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Set \({\mathcal {X}}_i=(1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,{{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}_i^T)^T\) for all v.

ii):

If \(\delta _{i2}=1\), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) and generate \(\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}_{1iv}\) from \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\). Define \(\widetilde{\widetilde{{\mathcal {X}}}}_{2iv}=(1,\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}^T_{1iv}, {{{\mathbf {\mathtt{{X}}}}}}_{2i}^T,\varvec{Z}^T_i)^T\).

iii):

If \(\delta _{i3}=1\), keep the values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and generate \(\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}_{2iv}\) from \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Define \(\widetilde{\widetilde{{\mathcal {X}}}}_{3iv}=(1,{{{\mathbf {\mathtt{{X}}}}}}_{1i}^T,\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}^T_{2iv},\varvec{Z}^T_i)^T\).

iv):

If \(\delta _{i4}=1\), generate \(\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}_{1iv}\) and \(\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}_{2iv}\) from \(\widetilde{{\widetilde{F}}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\) to fill the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\). Define \(\widetilde{\widetilde{{\mathcal {X}}}}_{4iv}=(1,\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}^T_{1iv},\widetilde{\widetilde{{{{\mathbf {\mathtt{{X}}}}}}}}^T_{2iv},\varvec{Z}^T_i)^T\).

Step 2.:

Analysis: Solve the following estimating equations:

$$\begin{aligned} \varvec{U}_{M2}(\varvec{\beta }) =\dfrac{1}{\sqrt{n}}\sum _{i=1}^{n} \left( \delta _{i1}\varvec{S}_i(\varvec{\beta })+\delta _{i2}\widetilde{\widetilde{\varvec{S}}}_{2i}(\varvec{\beta }) +\delta _{i3}\widetilde{\widetilde{\varvec{S}}}_{3i}(\varvec{\beta }) +\delta _{i4}\widetilde{\widetilde{\varvec{S}}}_{4i}(\varvec{\beta })\right) =\varvec{0}, \end{aligned}$$
(20)

where \(\varvec{S}_i(\varvec{\beta })={\mathcal {X}}_i(Y_i-H(\varvec{\beta }^T{\mathcal {X}}_i))\), \(\widetilde{\widetilde{\varvec{S}}}_{ki}(\varvec{\beta })=M^{-1}\sum _{v=1}^{M}\widetilde{\widetilde{{\mathcal {X}}}}_{kiv}(Y_i-H(\varvec{\beta }^T\widetilde{\widetilde{{\mathcal {X}}}}_{kiv}))\), \(k=2,3,4\), to obtain the MI2 estimator, \(\widehat{\varvec{\beta }}_{M2}\), of \(\varvec{\beta }\). Then, calculate the estimated variance of \(\widehat{\varvec{\beta }}_{M2}\), \(\widehat{{\,\mathrm{Var}\,}}(\widehat{\varvec{\beta }}_{M2})\), by the following formulas:

$$\begin{aligned} \varvec{G}_{M2}^{-1}(\widehat{\varvec{\beta }}_{M2})\left\{ \dfrac{1}{M}\sum _{v=1}^{M}\sum _{i=1}^{n} (\widetilde{\widetilde{\varvec{U}}}_{vi}(\widehat{\varvec{\beta }}_{M2}))^{\otimes 2} +\left( 1+\dfrac{1}{M}\right) \dfrac{\sum _{v=1}^{M}(\widetilde{\widetilde{\varvec{U}}}_{v}(\widehat{\varvec{\beta }}_{M2}) )^{\otimes 2}}{M-1}\right\} (\varvec{G}_{M2}^{-1}(\widehat{\varvec{\beta }}_{M2}))^T, \end{aligned}$$
(21)

where \(\varvec{G}_{M2}(\varvec{\beta })\) is the gradient of \(-M^{-1}\sum _{v=1}^{M}\widetilde{\widetilde{\varvec{U}}}_{v}(\varvec{\beta })=-\varvec{U}_{M2}(\varvec{\beta })\), for

$$\begin{aligned} \widetilde{\widetilde{\varvec{U}}}_{vi}(\varvec{\beta })&=\dfrac{1}{\sqrt{n}}\left( \delta _{i1}\varvec{S}_i(\varvec{\beta })+\delta _{i2} \widetilde{\widetilde{\varvec{S}}}_{2i}(\varvec{\beta })+\delta _{i3}\widetilde{\widetilde{\varvec{S}}}_{3i}(\varvec{\beta })+\delta _{i4}\widetilde{\widetilde{\varvec{S}}}_{4i}(\varvec{\beta })\right) , \\ \widetilde{\widetilde{\varvec{U}}}_{v}(\varvec{\beta })&=\sum _{i=1}^{n}\widetilde{\widetilde{\varvec{U}}}_{vi}(\varvec{\beta }). \end{aligned}$$

Notice that each of the two proposed MI methods uses three empirical CDFs of Wang and Chen (2009) to generate random values for \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) MAR separately or simultaneously. The two MI approaches differ slightly, however, in how the first and second empirical CDFs draw information from the observed data to replace the missing data, while the third empirical CDFs are the same. The MI1 method uses only the cases with \(\delta _{i1}=1\), but the information of the observed \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) enters the indicator function in \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), and the information of the observed \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) enters the indicator function in \({\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\). The MI2 method, in contrast, combines the cases with \(\delta _{i1}=1\) and \(\delta _{i3}=1\) in \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)\), and those with \(\delta _{i1}=1\) and \(\delta _{i2}=1\) in \(\widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)\), but the information of the observed \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) or \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) does not enter the indicator functions \(I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i)\). Moreover, by the conditions inside the indicator functions in the empirical CDFs, the MI2 method can be applied to both continuous and categorical/discrete \(\varvec{X}\) data, while the MI1 method can only be applied to categorical/discrete \(\varvec{X}\) data, which limits it in practice.
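The difference in conditioning sets can be made concrete with a small sketch (the data and variable names are illustrative, not from the paper). For a case i with \(X_{1i}\) missing, MI1 restricts the donor pool to complete cases matching on the observed \((X_{2i}, Y_i, V_i)\), whereas MI2 pools all cases with \(X_1\) observed and matches on \((Y_i, V_i)\) only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.choice([-0.3, 0.4, 1.0], size=n)
x2 = rng.choice([-1.0, 0.6], size=n)
y = rng.binomial(1, 0.5, size=n)
v = rng.binomial(1, 0.4, size=n)          # fully observed conditioning variable
status = rng.choice([1, 2, 3, 4], size=n, p=[0.6, 0.15, 0.15, 0.1])

i = np.flatnonzero(status == 2)[0]        # a case with x1 missing, x2 observed

# MI1 pool: complete cases only, matched on the observed (x2, y, v)
w1 = ((status == 1) & (x2 == x2[i]) & (y == y[i]) & (v == v[i])).astype(float)
w1 /= w1.sum()

# MI2 pool: all cases with x1 observed (status 1 or 3), matched on (y, v) only
w2 = (((status == 1) | (status == 3)) & (y == y[i]) & (v == v[i])).astype(float)
w2 /= w2.sum()

# one multiple-imputation draw for the missing x1 under each scheme
draw_mi1 = rng.choice(x1, p=w1)
draw_mi2 = rng.choice(x1, p=w2)
```

Because MI2 drops the exact-matching requirement on the partially observed covariate, it remains usable when \(X\) is continuous, where exact matches on \(X_{2i}\) would almost never occur.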
Despite their methodological differences, both techniques aim to exploit more information for missing data imputation and to maximize estimation efficiency. The next section investigates the finite-sample performances of the proposed methods, compared to the CC, SIPW, VL, JCL, and MICE estimation methods, via extensive simulations.

5 Simulation and real data studies

5.1 Simulation studies

Monte Carlo simulations were conducted to examine the finite-sample performances of the following estimators:

(1) \(\widehat{\varvec{\beta }}_F\): full data ML estimator, used as a benchmark for comparisons
(2) \(\widehat{\varvec{\beta }}_C\): CC estimator
(3) \(\widehat{\varvec{\beta }}_W\): SIPW estimator, the solution of \(\varvec{U}_W(\varvec{\beta },\widehat{\varvec{\pi }}_1)=\varvec{0}\) in (4)
(4) \(\widehat{\varvec{\beta }}_V\): VL estimator, the solution of \(\widehat{\varvec{U}}_V(\varvec{\beta })=\varvec{0}\) in (5)
(5) \(\widehat{\varvec{\beta }}_J\): JCL estimator, the solution of \(\widehat{\varvec{U}}_J(\varvec{\beta })=\varvec{0}\) in (7)
(6) \(\widehat{\varvec{\beta }}_{M1}\): MI1 estimator, the solution of \(\varvec{U}_{M1}(\varvec{\beta })=\varvec{0}\) in (18)
(7) \(\widehat{\varvec{\beta }}_{M2}\): MI2 estimator, the solution of \(\varvec{U}_{M2}(\varvec{\beta })=\varvec{0}\) in (20)
(8) \(\widehat{\varvec{\beta }}_{ME}\): MICE estimator from the mice package in R, used as a benchmark for comparisons.

To evaluate the estimation performances of the proposed methods and compare them with the other estimation methods under various situations, we constructed five scenarios, in which the two covariates \(X_1\) and \(X_2\) were uncorrelated in Scenarios 1–4 and correlated in Scenario 5. Specifically, Scenario 1 studied the impact of the sample size on the performance of all the estimation methods under the same selection probabilities. Three sets of selection probabilities were considered in Scenario 2 to examine the influence of the missing rates on the performances of the estimation methods. Scenario 3 had the same aim as Scenario 2, except that only the value of \(\varvec{\beta }\) was changed, to determine whether the efficiencies of the estimators were altered under different logistic regression models. In Scenario 4, three different numbers of imputations were studied to see how they affected the performances of the two proposed MI estimation methods. Finally, Scenario 5 considered six different correlation coefficients between \(X_1\) and \(X_2\) under the same selection probabilities and sample size to investigate the performances of all the methods in these situations.

For each experimental configuration, 1,000 replications were performed. We set \(M=30\), except in Scenario 4, where \(M=5, 25\), and 45 were considered. We calculated the bias, standard deviation (SD), asymptotic standard error (ASE), and coverage probability (CP) of a 95% confidence interval for each estimator. To evaluate the relative efficiencies (REs) of the estimators, we computed the ratio of the mean square error (MSE) of each of the other estimators, excluding the full data ML estimator, to those of the MI1 and MI2 estimators, respectively, where the MSE of an estimator was defined as the sum of the squared bias and the squared SD, i.e., \(\text {MSE}=\text {bias}^2+\text {SD}^2\).
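The MSE and RE computations amount to a few lines over the Monte Carlo replications; the sketch below uses illustrative helper names, and the use of the sample SD (with \(n-1\) in the denominator) is an assumption:

```python
import numpy as np

def mse(estimates, truth):
    """MSE = squared bias + squared SD over Monte Carlo replications."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    sd = est.std(ddof=1)            # sample SD; the ddof choice is assumed
    return bias**2 + sd**2

def relative_efficiency(other, mi, truth):
    """RE of an MI estimator relative to another estimator;
    values > 1 favor the MI estimator."""
    return mse(other, truth) / mse(mi, truth)
```

For example, `mse([1.0, 1.2, 0.8], 1.0)` gives 0.04 (zero bias, SD 0.2).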

Scenario 1. The goal was to assess how well all the estimation methods performed when the two covariates \(X_1\) and \(X_2\) were independent, using the same observed selection probabilities and the three different sample sizes \(n = 500\), 1,000, and 2,000. The discrete distribution on the four values \((-0.3,-0.1,0.4,1)\) with probabilities (0.2, 0.3, 0.3, 0.2), respectively, was used to generate the data of \(X_1\). The data of \(X_2\) were generated from the discrete distribution on the four values \((-1,-0.4,0.2,0.6)\) with probabilities (0.1, 0.3, 0.3, 0.3), respectively. The Bernoulli distribution with success probability 0.4 was used to generate the data of Z. The surrogate variables of \(X_1\) and \(X_2\) were \(W_k\), where \(W_k=1\) if \(X_k>0\) and \(W_k=0\) if \(X_k\le 0\), \(k=1,2\). The Bernoulli distribution with success probability \(P(Y=1|X_1,X_2,Z)=H(\beta _0+\beta _1X_1+\beta _2 X_2+\beta _3Z)\) was used to generate the data of Y, where \(\varvec{\beta }=(\beta _0,\beta _1,\beta _2,\beta _3)^T=(1,-0.5,1,\log (2))^T\). The following multinomial logistic regression model

$$\begin{aligned} \ln \left( \dfrac{P(\delta _{ij}=1|Y_i,W_{1i},W_{2i},Z_i)}{P(\delta _{i4}=1|Y_i,W_{1i},W_{2i},Z_i)}\right) =\alpha _j+\gamma _1Y_i+\gamma _2W_{1i}+\gamma _3W_{2i}+\gamma _4Z_i \end{aligned}$$
(22)

was used to generate the data of \(\delta _{ij}\) given \((Y_i,W_{1i},W_{2i},Z_i)\), \(i=1,2,\dots ,n\), \(j=1,2,3\), where \(\varvec{\alpha }=(\alpha _1,\alpha _2,\alpha _3)^T=(2,0.6,0.6)^T\) and \(\varvec{\gamma }=(\gamma _1,\gamma _2,\gamma _3,\gamma _4)^T=(0.7,-0.2,0.1,-1.2)^T\). Under the three different sample sizes, the observed selection probabilities were similar and about 0.6, 0.15, 0.15, and 0.1, respectively. This means that the percentages of complete cases, only \(X_1\) missing, only \(X_2\) missing, and both of them missing were 60%, 15%, 15%, and 10%, respectively.
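Generating the missingness indicators from the multinomial logit model (22) can be sketched as follows. This is an illustrative simplification: \(Y, W_1, W_2, Z\) are drawn as independent Bernoulli variables rather than through the full Scenario 1 design, so the resulting complete-case rate only roughly matches the 60% reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
alpha = np.array([2.0, 0.6, 0.6])           # (alpha_1, alpha_2, alpha_3)
gamma = np.array([0.7, -0.2, 0.1, -1.2])    # (gamma_1, ..., gamma_4)

# simplified covariates: independent Bernoulli draws (illustrative only)
y = rng.binomial(1, 0.5, n)
w1 = rng.binomial(1, 0.5, n)
w2 = rng.binomial(1, 0.5, n)
z = rng.binomial(1, 0.4, n)
lin = gamma[0]*y + gamma[1]*w1 + gamma[2]*w2 + gamma[3]*z

# model (22): log-odds of statuses j = 1, 2, 3 against the baseline j = 4
expo = np.exp(alpha[None, :] + lin[:, None])          # (n, 3)
denom = 1.0 + expo.sum(axis=1, keepdims=True)
probs = np.hstack([expo / denom, 1.0 / denom])        # (n, 4), rows sum to 1

# draw one status per subject by inverting the per-row CDF
u = rng.random(n)
status = (u[:, None] > probs.cumsum(axis=1)).sum(axis=1) + 1
```

With \(\varvec{\alpha }=(2,0.6,0.6)^T\), the implied complete-case probability is close to 0.6 even under these simplified covariate marginals.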

The simulation results of Scenario 1 are presented in Table 1. The full data ML method overall outperformed the other approaches, but it serves only as a benchmark for comparison because it has the practical disadvantage of requiring fully observed data. The biases of the CC (for \(\beta _0,\beta _3\)) and MICE (for \(\beta _1,\beta _2\)) estimators were the largest, which implies that these two methods performed worst in estimation. The performances of the SIPW and VL methods were similar, and the MI1, MI2, and JCL methods had similar performances, particularly when \(n=1,000\) and 2,000. The SD and ASE of each estimator were similar, except for the MICE estimator for \(\beta _1\) and \(\beta _2\), and decreased as the sample size increased. The ASEs of the MI1 and MI2 estimators were similar and the smallest among all the estimators except the full data ML estimator. The empirical CPs for all the estimation methods were overall close to the nominal probability 95%, except for the CC (for \(\beta _0\) when \(n=2,000\); \(\beta _3\) when \(n=1,000, 2,000\)) and MICE (for \(\beta _0\) when \(n=2,000\); \(\beta _1\) when \(n=500, 2,000\); \(\beta _2\) when \(n=1,000, 2,000\)) methods. In addition, Table 2 shows that the relative efficiency values were larger than 1, except for the JCL (for \(\beta _0\) when \(n=500, 1,000\); \(\beta _1,\beta _2\) when \(n=500, 1,000, 2,000\)) and MICE (for \(\beta _0\) when \(n=500, 1,000\); \(\beta _1\) when \(n=500\) and MICE versus MI1; \(\beta _3\) when \(n=500, 1,000, 2,000\)) estimators, which shows that the two proposed MI estimators were comparable with the JCL and MICE estimators (for \(\beta _0\) when \(n=500, 1,000\); \(\beta _1\) when \(n=500\); \(\beta _3\) when \(n=500, 1,000, 2,000\)) in terms of efficiency.
The relative efficiency values of the MICE estimator to the two proposed MI estimators tended to increase with the sample size and were larger than 1 for \(\beta _1\) and \(\beta _2\), i.e., the two types of MI estimators became more efficient than the MICE estimator for \(\beta _1\) and \(\beta _2\) as the sample size increased.

Table 1 Simulation results of scenario 1 (\(M=30\); \(n=500,1,000,2,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.6, 0.15, 0.15, 0.1)
Table 2 Relative efficiencies in scenario 1 (\(M=30\); \(n=500,1,000,2,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.6, 0.15, 0.15, 0.1)

Scenario 2. In this scenario, we examined the impact of the observed selection probabilities, i.e., of changing the missing rates, on the efficiencies of the estimators with \(n=1,000\) fixed. The values of \(\varvec{\beta }\) and \(\varvec{\gamma }\) and the procedure to generate the data of \(X_1\), \(X_2\), Z, Y, \(W_1\), and \(W_2\) were the same as in Scenario 1. For the multinomial logit model in (22) used to generate the data of \(\delta _{ij}\), we set \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\) and \(\varvec{\alpha }=(0.7,0.5,0.5)^T\), \((1.5,0.6,0.6)^T\), and \((2.5,0.6,0.6)^T\) to obtain the three sets of observed selection probabilities, (0.31, 0.26, 0.26, 0.17), (0.47, 0.20, 0.20, 0.13), and (0.72, 0.11, 0.11, 0.06), respectively.

Table 3 shows the simulation results of this scenario. The biases, SDs, and ASEs of the last seven estimators overall tended to decrease as the CC percentage increased from 31% to 72%. Serious bias still occurred in the CC (for \(\beta _0,\beta _3\)) and MICE (for \(\beta _1,\beta _2\)) estimators. The performances of the MI1 and MI2 estimation methods were essentially the same, and the ASEs of the two proposed MI estimators were overall the smallest among all the estimators except the full data ML estimator. The empirical CPs based on all the estimation methods were overall close to the nominal probability 95%, except for the CC (for \(\beta _0\) when CC percentage = 0.31, 0.47), SIPW (for \(\beta _3\) when CC percentage = 0.31), MI1 and MI2 (for \(\beta _1,\beta _2\) when CC percentage = 0.31), and MICE (for \(\beta _1\) when CC percentage = 0.31; \(\beta _2\) when CC percentage = 0.31, 0.47, 0.72) methods. The relative efficiency values in Table 4 were still greater than 1, except for the JCL (for \(\beta _0,\beta _1,\beta _2\)) and MICE (for \(\beta _0\) when CC percentage = 0.72; \(\beta _3\)) estimators, and tended to decrease as the CC percentage increased, except for the JCL (for \(\beta _0,\beta _1,\beta _2\)) and MICE (for \(\beta _3\)) estimators, indicating that the two types of MI estimators were comparable with the JCL estimator and the most efficient compared to the CC, SIPW, VL, and MICE estimators. When the CC percentage was 0.72, the relative efficiency values of the SIPW, VL, and JCL estimators to the two MI estimators were (very) close to 1.

Table 3 Simulation results of scenario 2 (\(M=30\); \(n= 1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\)

Scenario 3. In this scenario, we wished to know whether changing the values of the logistic regression parameters affects the performances of the estimation methods. Therefore, we kept all the settings of Scenario 2 except that \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\) was changed to \(\varvec{\beta }=(-1,1,0.7,-1)^T\). As shown in Tables 5 and 6, the three sets of observed selection probabilities were (0.29, 0.24, 0.24, 0.23), (0.45, 0.19, 0.19, 0.16), and (0.70, 0.10, 0.10, 0.10), respectively, which were quite similar to those in Scenario 2: the first three observed selection probabilities in the first set, 0.29, 0.24, and 0.24, were slightly smaller than their counterparts 0.31, 0.26, and 0.26 in Scenario 2, while the last one, 0.23, was slightly larger than 0.17. The performances of all the estimation methods in this scenario were quite similar to those in Scenario 2; one possible reason is the similar missing rates in the two scenarios. Hence, in general, changing the value of \(\varvec{\beta }\) had little effect on the selection probabilities and the efficiencies of the proposed estimation methods.

Table 4 Relative efficiencies in scenario 2 (\(M=30\); \(n= 1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\)
Table 5 Simulation results of scenario 3 (\(M=30\); \(n=1,000\)); \(\varvec{\beta }=(-1,1,0.7,-1)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\)
Table 6 Relative efficiencies in scenario 3 (\(M=30\); \(n=1,000\)); \(\varvec{\beta }=(-1,1,0.7,-1)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\)

Scenario 4. The purpose of this scenario was to examine the impact of the number of multiple imputations on the performances of the proposed methods when \(X_1\) and \(X_2\) were independent with a fixed sample size. All the settings were the same as in Scenario 1, except that \(M=5, 25\), and 45 were considered, with \(n=1,000\). The observed selection probabilities were 0.60, 0.15, 0.15, and 0.10.

The simulation results for the MI1, MI2 and MICE methods in Table 7 were essentially the same as those in Table 1 of Scenario 1 when \(n=1,000\). Table 8 shows that the relative efficiency values were also overall the same as those in Table 2 of Scenario 1 for \(n=1,000\). Therefore, this simulation study demonstrated that the two proposed MI methods were not affected by the number of imputations.

Table 9 provides a summary of the computing time for each estimation method. The JCL and MICE methods had the longest and second longest computing times, respectively. The JCL method took an average of 21.88 seconds to perform one simulation, which is approximately 73, 35.3, and 23 times that of the two proposed MI methods when \(M=5, 25\), and 45, respectively. On average, the MICE method took 1.15, 5.73, and 10.30 seconds to perform one simulation when \(M=5, 25\), and 45, which is approximately 3.8, 9.2, and 10.8 times that of the two proposed MI methods, respectively. However, the performances of the JCL and the two proposed MI methods were essentially the same. Therefore, based on the simulation results, the MI1 and MI2 methods overall outperformed the other methods, except the full data ML and JCL methods, when estimating the parameters of logistic regression with covariates MAR.

Table 7 Simulation results of scenario 4 \((n=1,000)\); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.60, 0.15, 0.15, 0.10); \(M=5,25,45\)
Table 8 Relative efficiencies in scenario 4 (\(M=5,25,45\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.60, 0.15, 0.15, 0.10)

Scenario 5. This scenario aimed to examine the performances of all the approaches when \(X_1\) and \(X_2\) were correlated, given fixed observed selection probabilities and sample size. The settings of M, \(\varvec{\beta }\), and \(\varvec{\gamma }\) were the same as in Scenario 1. We set \(\varvec{\alpha }=(1.15,0.6,0.6)^T\) to compare the efficiencies of all the estimation methods in a situation of higher missing rates, i.e., the observed selection probabilities (0.40, 0.23, 0.23, 0.14), given \(n=1,000\). In addition, to generate the correlated data of \(X_1\) and \(X_2\), the distribution on the four values \(-0.3,-0.1,0.4\), and 1 with probabilities 0.2, 0.3, 0.3, and 0.2, respectively, was first used to generate the data of \(X_1\). Then, given each value of \(X_1\), the data of \(X_2\), taking the values \((-1,-0.4,0.2,0.6)\), were generated such that the correlations between \(X_1\) and \(X_2\) were \(\rho =-0.21,-0.53,-0.71,0.21,0.51\), and 0.71, respectively. For example, for the case of \(\rho =-0.21\), if \(X_1=-0.3\), the four values of \(X_2\) were generated with probabilities 0.42, 0.1, 0.2, and 0.28, respectively. Similarly, when \(X_1=-0.1, 0.4\), and 1, we generated the four values of \(X_2\) with probabilities (0.2, 0.1, 0.4, 0.3), (0.2, 0.4, 0.2, 0.2), and (0.65, 0.05, 0.05, 0.25), respectively. Finally, the surrogate variables \(W_1\) and \(W_2\) of \(X_1\) and \(X_2\), respectively, were set as in Scenario 1.
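The conditional-table construction for the \(\rho =-0.21\) case can be checked numerically. The sketch below generates \((X_1,X_2)\) from the probabilities quoted above and verifies the induced correlation; the large n is used only to make the Monte Carlo check tight.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
x1_vals = np.array([-0.3, -0.1, 0.4, 1.0])
x1_probs = np.array([0.2, 0.3, 0.3, 0.2])
x2_vals = np.array([-1.0, -0.4, 0.2, 0.6])
# P(X2 | X1) rows for the rho = -0.21 case, as listed in the text
cond = np.array([
    [0.42, 0.10, 0.20, 0.28],   # X1 = -0.3
    [0.20, 0.10, 0.40, 0.30],   # X1 = -0.1
    [0.20, 0.40, 0.20, 0.20],   # X1 =  0.4
    [0.65, 0.05, 0.05, 0.25],   # X1 =  1.0
])

i1 = rng.choice(4, size=n, p=x1_probs)
x1 = x1_vals[i1]
# invert each row's CDF with one uniform draw per observation
u = rng.random(n)
i2 = (u[:, None] > cond.cumsum(axis=1)[i1]).sum(axis=1)
x2 = x2_vals[i2]
rho = np.corrcoef(x1, x2)[0, 1]
```

A direct calculation from these tables gives a population correlation of about \(-0.212\), consistent with the reported \(\rho =-0.21\).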

Table 9 Summary of computing time (second) for performing one simulation for each estimation method in Scenario 4 (\(M=5,25,45\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(2,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.60, 0.15, 0.15, 0.10)

Tables 10 and 11 present the simulation results for \(X_1\) and \(X_2\) with negative and positive correlation values, respectively. The biases of the CC and MICE estimators were overall larger than those of the other estimators. The biases overall did not change much as the absolute correlation increased, except for the MICE estimator (for \(\beta _1\) when \(\rho\) was changed from \(-0.21\) to \(-0.71\); \(\beta _0,\beta _1, \beta _2\) when \(\rho\) was changed from 0.21 to 0.71). The SDs and ASEs of all the estimators, except those for \(\beta _3\), tended to increase with the absolute correlation. The SD and ASE of the CC estimator were the largest. The SD and ASE of the JCL estimator were comparable with or (slightly) larger than those of the MI1 and MI2 estimators. The empirical CPs based on all the estimation methods were overall close to the nominal probability 95%, except for the CC (for \(\beta _0\) when \(\rho =-0.53\); \(\beta _3\) when \(\rho =-0.53,-0.71,0.51, 0.71\)), MI1 (for \(\beta _0\) when \(\rho =0.71\); \(\beta _1,\beta _2\) when \(\rho =-0.71,0.51\)), MI2 (for \(\beta _1\) when \(\rho =-0.71,0.51\)), and MICE (for \(\beta _0\) when \(\rho =-0.21,0.21,0.51,0.71\); \(\beta _1,\beta _2\) for all \(\rho\) values) methods. The relative efficiency values of the CC, SIPW, VL, and MICE estimators to the two types of MI estimators were greater than 1, except for the relative efficiencies of the SIPW to MI2 estimators and of the MICE to the two types of MI estimators (for \(\beta _0\) when \(\rho =-0.53,-0.71\); \(\beta _1\) when \(\rho =-0.21\); \(\beta _3\) when \(\rho =0.21,0.51,0.71\)). The performances of the MI1 and MI2 estimators were comparable with that of the JCL estimator in terms of efficiency (Table 12).

Table 10 Simulation results of scenario 5 (\(M=30\); \(\rho =-0.21,-0.53,-0.71\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(1.15,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.40, 0.23, 0.23, 0.14)
Table 11 Simulation results of scenario 5 (\(M=30\); \(\rho =0.21,0.51,0.71\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(1.15,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.40, 0.23, 0.23, 0.14)
Table 12 Relative efficiencies in scenario 5 (\(M=30\); \(\rho =-0.21,-0.53,-0.71,0.21,0.51,0.71\); \(n=1,000\)); \(\varvec{\beta }=(1,-0.5,1,\log (2))^T\); \(\varvec{\alpha }=(1.15,0.6,0.6)^T\); \(\varvec{\gamma }=(0.7,-0.2,0.1,-1.2)^T\); observed selection probabilities: (0.40, 0.23, 0.23, 0.14)

In summary, Scenarios 1–5 show that the performances of the JCL and the two proposed MI methods were comparable, while Scenario 4 demonstrates that the MI1 and MI2 methods required much less computing time than the JCL and MICE methods.

5.2 Examples

Two real data examples were used to demonstrate the practicality of the two proposed MI methods and the CC, SIPW, VL, JCL, and MICE methods. \(M=30\) imputations were used.

5.2.1 Example 1

The first real data example is the data set of the Global Longitudinal Study of Osteoporosis in Women (GLOW500M) (Hosmer et al. 2013). Let Y be a binary outcome variable denoting whether a respondent had any fracture in the first year, where \(Y=1\) if yes and \(Y=0\) if no. Three covariates are considered. \(X_1\), which had missing values, indicates the history of prior fracture, where \(X_1=1\) if yes and 0 if no. \(X_2\), which also had missing values, denotes the self-reported risk of fracture, where \(X_2=1\) if less than others of the same age; 2 if the same as others of the same age; 3 if greater than others of the same age. Z denotes age at enrollment, with integer values from 55 to 90 and no missing values. The sample size is \(n=500\). The rates of only \(X_1\) missing, only \(X_2\) missing, and both missing were 16%, 16%, and 4%, respectively. Thus, the size of the CC data set is 320 (64%). In addition, let W denote the hip fracture of mother, which is 1 if yes and 0 otherwise. Because W is not significant in the logistic regression with response Y, and W is correlated with \(X_2\) (Spearman’s rank correlation coefficient 0.13 with p-value 0.025) under the CC data, W is considered a surrogate variable of \(X_2\) in this study. Moreover, the self-reported risk of fracture was dichotomized into less than others of the same age (\(X_2= 1\)) versus the same as or greater than others of the same age (\(X_2 = 2\) or 3). The dummy variable \(DX_2\) represents the dichotomized self-reported risk of fracture, which is 1 if \(X_2=1\) and 0 otherwise. Age at enrollment (Z) was categorized into three groups, \(Z\le 60\), \(60<Z\le 70\), and \(Z>70\). Let \(DZ_1\) and \(DZ_2\) be dummy variables for the categorized age at enrollment: \(DZ_1=1\) if \(Z\le 60\) and 0 otherwise, and \(DZ_2=1\) if \(60<Z\le 70\) and 0 otherwise. The missingness mechanism of \(X_1\) and \(X_2\) was identified as MAR by Tran et al. (2021).
The following logistic regression model is used to fit the data set:

$$\begin{aligned} P(Y=1|X_1,X_2,Z,W)=H(\beta _0+\beta _1X_1+\beta _2DX_2+\beta _3DZ_1+\beta _4DZ_2). \end{aligned}$$
(23)

The estimates of the \(\beta _k\)s and their corresponding ASEs are given in Table 13. The analysis results indicate that \(\beta _k\), \(k=0,1,2,3,4\), are statistically significantly different from zero under all the estimation methods, except \(\beta _3\) and \(\beta _4\) under the CC method. Based on the SIPW, VL, JCL, MI1, MI2, and MICE methods, the estimates of the \(\beta _k\)s and their ASEs were overall quite similar. According to all the estimation methods, the results of testing \(\beta _1=0\) and \(\beta _2=0\) and their estimates reveal that women with a history of prior fracture were more likely to have fracture(s) in the first year than those without such a history, and that women with self-reported risk of fracture less than others of the same age were less likely to have fracture(s) in the first year than those whose self-reported risk was the same as or greater than others of the same age. The results of testing \(\beta _3=0\) and \(\beta _4=0\) and their estimates based on the SIPW, VL, JCL, MI1, MI2, and MICE estimation methods indicate that women aged 60 years or younger, and those older than 60 but no older than 70 years, were less likely to have fracture(s) in the first year than those older than 70 years.

Table 13 Results of logistic regression analysis of GLOW500M data
Table 14 Results of logistic regression analysis of Cable TV data

5.2.2 Example 2

The two proposed MI methods are applied to analyze the second real data example, the cable television (TV) data set collected from a customer survey study of 1,586 residents in three cities in Taiwan (Lee et al. 2011). The satisfaction level with the cable TV service is the binary outcome variable, denoted by Y (\(1=\) satisfied; \(0=\) neutral or dissatisfied). The two covariates with missing values are the response, denoted by \(X_1\) (\(1=\) yes; \(0=\) no), to the question “Have you been given a discount on cable TV?”, and the response, denoted by \(X_2\) (\(0=\) no children under 12 years old; \(1=\) at least one child under 12 years old), to the question “How many children under the age of 12 live with you?”. Another covariate without missing values is the response, denoted by Z (\(1=\) yes; \(0=\) no), to the question “Are you paying for the fourth channel?”. Because the rates of only \(X_1\) missing, only \(X_2\) missing, and both missing were 17.9%, 1.3%, and 0.4%, respectively, the CC data set consists of 1,274 respondents (80.3%). A surrogate variable of \(X_1\) and \(X_2\) is the response, denoted by W (\(1=\) yes; \(0=\) no), to the question “Would you pay extra money for additional channels?”. We fit the following logistic regression model to the data set:

$$\begin{aligned} P(Y=1|X_1,X_2,Z,W) = H(\beta _0+\beta _1X_1+\beta _2X_2+\beta _3Z). \end{aligned}$$
(24)

The multinomial logistic regression model \(\log \left( \pi _j(Y,W,Z)/\pi _4(Y,W,Z)\right) =\alpha _j+\gamma _{1j}Y+\gamma _{2j}W+\gamma _{3j}Z\), \(j=1,2,3\), is used to examine the effects of Y, W, and Z on the missingness mechanism of \(X_1\) and \(X_2\), i.e., on their selection probabilities. The p-values of Wald chi-squared tests for the effects of Y (i.e., testing \(H_0:\gamma _{1j}=0\), \(j=1,2,3\)) and Z (i.e., testing \(H_0:\gamma _{3j}=0\), \(j=1,2,3\)) are 0.0021 and \(<0.0001\), respectively, so Y and Z are statistically significantly related to the missingness mechanism of \(X_1\) and \(X_2\) and, hence, it is reasonable to assume that \(X_1\) and \(X_2\) are MAR.

Table 14 gives the estimates of \(\beta _k\), \(k=0,1,2,3\), and their corresponding ASEs. The results of testing \(\beta _k=0\), \(k=0,1,2,3\), are statistically significant except for testing \(\beta _1=0\) with the MICE estimation method. The estimates of \(\beta _1\) and \(\beta _3\) are positive, implying that respondents were more likely to be satisfied with the cable TV service when they were offered a discount and when they paid for the fourth channel, respectively. Under all the estimation methods, the estimates of \(\beta _2\) are negative, revealing that respondents were less likely to report satisfaction with the cable TV service when there were children under 12 years old in their family. The parameter estimates based on the SIPW, VL, JCL, MI1, MI2, and MICE methods are quite similar to one another and differ from those based on the CC method. Moreover, the ASEs of the two types of MI estimators of \(\beta _0\), \(\beta _1\), and \(\beta _3\) are the smallest among all the estimation methods, showing the superior performance of the two proposed MI methods, consistent with the simulation results.

Moreover, one can also apply the estimation methods to analyze the cable TV data set with artificial missing values and higher missing rates. To this end, let \((\delta _1^0,\delta _2^0,\delta _3^0,\delta _4^0)\) be indicators for the original missingness statuses of \(X_1\) and \(X_2\). Let \((\delta _1^*,\delta _2^*,\delta _3^*,\delta _4^*)\) be indicators for artificial missingness statuses of \(X_1\) and \(X_2\) under the assumption of MAR mechanism. Let \(({\tilde{\delta }}_1,{\tilde{\delta }}_2,{\tilde{\delta }}_3,{\tilde{\delta }}_4)\) be the combination of indicators for the original and artificial missingness statuses of \(X_1\) and \(X_2\), where \({\tilde{\delta }}_{ij}=\delta _{i1}^0\delta _{ij}^*+(1-\delta _{i1}^0)\delta _{ij}^0\), \(i=1,2,\dots ,n\), \(j=1,2,3,4\). The following multinomial logistic regression model

$$\begin{aligned} \ln \left( \dfrac{P(\delta _{ik}^*=1|Y_i,W_i,Z_i)}{P(\delta _{i4}^*=1|Y_i,W_i,Z_i)}\right) =\alpha _k^*+\gamma _1^*Y_i+\gamma _2^*W_i+\gamma _3^*Z_i,\ k=1,2,3, \end{aligned}$$
(25)

is applied to generate the data of \(\delta _{ij}^*\), \(j=1,2,3,4\), given \((Y_i,W_i,Z_i)\), where \(\varvec{\alpha }^*=(\alpha _1^*,\alpha _2^*,\alpha _3^*)^T=(1.7,-0.5,1)^T\) and \(\varvec{\gamma }^*=(\gamma _1^*,\gamma _2^*,\gamma _3^*)^T=(-0.7,-0.5,-0.3)^T\). One can obtain the selection probabilities of \(({\tilde{\delta }}_1,{\tilde{\delta }}_2,{\tilde{\delta }}_3,{\tilde{\delta }}_4)\) as (0.406, 0.234, 0.231, 0.129), which imply that the percentages of complete cases, only \(X_1\) missing, only \(X_2\) missing, and both of them missing are 40.6%, 23.4%, 23.1%, and 12.9%, respectively.
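Generating the artificial missingness indicators from model (25) with the stated coefficients can be sketched as follows; the helper names are illustrative, and category 4 (both \(X_1\) and \(X_2\) missing) serves as the multinomial baseline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Coefficients of model (25) as given in the text; gamma* is shared across k.
alpha = np.array([1.7, -0.5, 1.0])    # alpha*_k, k = 1, 2, 3
gamma = np.array([-0.7, -0.5, -0.3])  # (gamma*_1, gamma*_2, gamma*_3) for (Y, W, Z)

def missingness_probs(y, w, z):
    """P(delta*_k = 1 | Y, W, Z), k = 1..4, with category 4 as the baseline."""
    shift = gamma @ np.array([y, w, z])
    logits = np.append(alpha + shift, 0.0)  # baseline category has logit 0
    p = np.exp(logits)
    return p / p.sum()

def draw_missingness(y, w, z):
    """Sample one artificial missingness status as a one-hot vector over 4 categories."""
    k = rng.choice(4, p=missingness_probs(y, w, z))
    return np.eye(4, dtype=int)[k]
```

Averaging `missingness_probs` over the observed \((Y_i,W_i,Z_i)\) yields the overall selection probabilities; the figures (0.406, 0.234, 0.231, 0.129) quoted in the text depend on the empirical distribution of \((Y,W,Z)\) in the cable TV data.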

The analysis results of the artificial cable TV data set are given in Table 15. In general, these results change somewhat from those of the original cable TV data set in Table 14 because of the increased missing rates, but they still reflect the performance properties of the proposed estimation methods. Indeed, the main changes concern the results of testing \(\beta _1=0\) and \(\beta _2=0\). Testing \(\beta _1=0\) is not statistically significant under the SIPW, JCL, MI1, MI2, and MICE estimation methods. Testing \(\beta _2=0\) is not statistically significant under the CC, SIPW, and VL estimation methods. In addition, the ASEs of all the estimators increase, especially those of the CC estimators of all the parameters and of the SIPW and VL estimators of \(\beta _1\) and \(\beta _2\). The ASEs of the two proposed MI estimators change less than those of the other estimators. The (absolute) estimates of \(\beta _0\) and \(\beta _3\) increase, and those of \(\beta _2\) decrease, for almost all the estimation methods. The estimates of \(\beta _1\) increase under the CC, SIPW, and VL estimation methods and decrease under the JCL, MI1, MI2, and MICE estimation methods. Again, the parameter estimates based on the MI1 and MI2 estimation methods change less than those based on the JCL and MICE estimation methods.

In summary, the two real data sets are used to evaluate the applicability of the seven estimation methods and to confirm the results of the simulation study section. When the logistic regression model includes only categorical/discrete variables and has two covariates MAR separately or simultaneously, the analytical results are highly consistent with the conclusions of the simulation studies. The biases and ASEs of the MI1, MI2, and JCL estimators are quite similar to one another and smaller than those of the SIPW and VL estimators. The results also imply that one may draw misleading conclusions from the fitted logistic regression model when using the CC or MICE (from the mice package) approach. For instance, in real data example 1, the CC estimation method concludes that \(DZ_1\) and \(DZ_2\) are insignificant; in real data example 2, the MICE estimation method suggests that \(X_1\) is not significant at the 5% level. Furthermore, the analysis of the artificial cable TV data example shows that the two proposed methods provide more stable results than the other estimation methods.

Table 15 Results of logistic regression analysis of cable TV data with artificial missing values

6 Conclusion

Two different types of MI methods have been proposed to estimate the parameters of logistic regression with covariates missing separately or simultaneously. Based on the idea of Wang and Chen (2009), for each type of MI estimation method we have proposed three empirical CDFs to generate random values for the missing data and estimated the logistic regression parameters by using the estimating equations of Fay (1996), which are more convenient in practice than those of Rubin (1987) because they are solved only once. The simulation studies have shown that the performances of the two proposed MI methods were comparable with that of the JCL method, while they had shorter computing times than the JCL and MICE estimation methods and are easily implemented. The two proposed MI methods overall outperformed the CC, SIPW, VL, and MICE methods. Two real data sets have been used to illustrate the practical use of the two proposed MI methods.

Although we have focused on the case where covariates are categorical/discrete, one can also consider the case of continuous covariates by using the nonparametric kernel approach of Wang and Chen (2009) to construct the empirical CDFs. For example, assume that \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are continuous covariate vectors that are MAR separately or simultaneously, \(\varvec{Z}\) is a categorical/discrete covariate vector that is always observed, and \({{{\mathbf {\mathtt{{W}}}}}}_1\) and \({{{\mathbf {\mathtt{{W}}}}}}_2\) are categorical/discrete surrogate vectors of \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\), respectively; hence, \(\varvec{V}=(\varvec{Z}^T,\varvec{W}^T)^T\) is a categorical/discrete covariate vector. The MI2 method still works in this case. For the MI1 method, to impute the missing values of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\), one can construct the empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\) and \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\) as follows:

$$\begin{aligned} {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}^*({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i){\mathcal {K}}_{{\mathcal {H}}_1}({{{\mathbf {\mathtt{{X}}}}}}_{2k}-{{{\mathbf {\mathtt{{X}}}}}}_{2i})}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i){\mathcal {K}}_{{\mathcal {H}}_1}({{{\mathbf {\mathtt{{X}}}}}}_{2s} -{{{\mathbf {\mathtt{{X}}}}}}_{2i})}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1), \\ {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}^*({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i,\varvec{V}_k=\varvec{V}_i){\mathcal {K}}_{{\mathcal {H}}_2}({{{\mathbf {\mathtt{{X}}}}}}_{1k}-{{{\mathbf {\mathtt{{X}}}}}}_{1i})}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i,\varvec{V}_s=\varvec{V}_i){\mathcal {K}}_{{\mathcal {H}}_2}({{{\mathbf {\mathtt{{X}}}}}}_{1s} -{{{\mathbf {\mathtt{{X}}}}}}_{1i})}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2). \end{aligned}$$

Here \({\mathcal {H}}_s\) is a \(p_s\times {p_s}\), \(s=1,2\), symmetric positive definite matrix depending on n, where \(p_1=r_2=\text {length}({{{\mathbf {\mathtt{{X}}}}}}_2)\) and \(p_2=r_1=\text {length}({{{\mathbf {\mathtt{{X}}}}}}_1)\). \({\mathcal {K}}_{{\mathcal {H}}_s}({\mathbf {u}})=|{\mathcal {H}}_s|^{-1/2}{\mathcal {K}}_s({\mathcal {H}}_s^{-1/2}{\mathbf {u}})\) for \({\mathcal {K}}_s(\cdot )\) a \(p_s\)-variate kernel with \(\int {\mathcal {K}}_s({\mathbf {u}})d{\mathbf {u}}=1\) and \({\mathcal {H}}^{1/2}_s\) the bandwidth matrix. \({\widetilde{F}}_{\varvec{X}_i}^*({\mathbf {x}}|Y_i,\varvec{V}_i)\) is equal to \({\widetilde{F}}_{\varvec{X}_i}({\mathbf {x}}|Y_i,\varvec{V}_i)\) in (17). We can use the MI1 procedure in Sect. 4.1 to estimate the parameters. Unreported simulation studies demonstrated that the two proposed MI methods still work well in this case.
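As a minimal sketch of the first smoothed CDF above, the following code takes scalar \(X_1\) and \(X_2\), a univariate Gaussian kernel, and a scalar bandwidth h (a special case of the bandwidth-matrix formulation); all data and function names are hypothetical illustrations.

```python
import numpy as np

def gauss_kernel(u, h):
    """Univariate Gaussian kernel K_h(u) = K(u/h)/h (scalar-bandwidth case)."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def cdf_mi1(x1, i, X1, X2, Y, V, delta1, h):
    """Kernel-smoothed empirical CDF of X1_i given (X2_i, Y_i, V_i), built from
    complete cases only (delta1 == 1), mirroring the MI1 construction above."""
    # Weight each complete case by exact matching on (Y, V) and kernel
    # smoothing in X2, then self-normalize.
    w = delta1 * (Y == Y[i]) * (V == V[i]) * gauss_kernel(X2 - X2[i], h)
    w = w / w.sum()
    return np.sum(w * (X1 <= x1))

# Toy illustration with simulated data.
rng = np.random.default_rng(2)
n = 200
X2 = rng.normal(size=n)
X1 = X2 + rng.normal(size=n)
Y = rng.integers(0, 2, n)
V = rng.integers(0, 2, n)
delta1 = np.ones(n, dtype=int)  # all cases complete, for the illustration only

F_low = cdf_mi1(-10.0, 0, X1, X2, Y, V, delta1, 0.5)
F_high = cdf_mi1(10.0, 0, X1, X2, Y, V, delta1, 0.5)
```

Because the weights are self-normalized, the function is a proper CDF in \(x_1\): it is nondecreasing and runs from 0 to 1.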

Furthermore, to build the MI1 method for the case where both \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\) are continuous covariate vectors and MAR separately or simultaneously, \(\varvec{Z}\) is a continuous covariate vector, and \({{{\mathbf {\mathtt{{W}}}}}}_1\) and \({{{\mathbf {\mathtt{{W}}}}}}_2\) are continuous surrogate vectors of \({{{\mathbf {\mathtt{{X}}}}}}_1\) and \({{{\mathbf {\mathtt{{X}}}}}}_2\), respectively, one can construct the following empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(\varvec{X}_i\) given \((Y_i,\varvec{V}_i)\):

$$\begin{aligned} {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}^{**}({{{\mathbf {\mathtt{{x}}}}}}_1|{{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}_1}({{{\mathbf {\mathtt{{X}}}}}}_{2k}-{{{\mathbf {\mathtt{{X}}}}}}_{2i},\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i) {{\mathcal {K}}_{{\mathcal {H}}_1}({{{\mathbf {\mathtt{{X}}}}}}_{2s}-{{{\mathbf {\mathtt{{X}}}}}}_{2i}, \varvec{V}_s-\varvec{V}_i)}}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1), \nonumber \\ {\widetilde{F}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}^{**}({{{\mathbf {\mathtt{{x}}}}}}_2|{{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}_2}({{{\mathbf {\mathtt{{X}}}}}}_{1k}-{{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}_2}({{{\mathbf {\mathtt{{X}}}}}}_{1s}-{{{\mathbf {\mathtt{{X}}}}}}_{1i},\varvec{V}_s-\varvec{V}_i)}}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2), \nonumber \\ {\widetilde{F}}_{\varvec{X}_i}^{**}({\mathbf {x}}|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n} \left( \dfrac{\delta _{k1}I(Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}_3}(\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}_3}(\varvec{V}_s-\varvec{V}_i)}}\right)\, I(\varvec{X}_k\le {\mathbf {x}}). \end{aligned}$$

Here \({\mathcal {H}}_s\) is a \(q_s\times {q_s}\), \(s=1,2,3\), symmetric positive definite matrix depending on n, where \(q_1=r_2+\text {length}(\varvec{V})\), \(q_2=r_1+\text {length}(\varvec{V})\), and \(q_3=\text {length}(\varvec{V})\). \({\mathcal {K}}_{{\mathcal {H}}_s}({\mathbf {u}})=|{\mathcal {H}}_s|^{-1/2}{\mathcal {K}}_s({\mathcal {H}}_s^{-1/2}{\mathbf {u}})\) for \({\mathcal {K}}_s(\cdot )\) a \(q_s\)-variate kernel with \(\int {\mathcal {K}}_s({\mathbf {u}})d{\mathbf {u}}=1\) and \({\mathcal {H}}_s^{1/2}\) the bandwidth matrix. Similarly, to develop the MI2 method for this case, one can form the empirical CDFs of \({{{\mathbf {\mathtt{{X}}}}}}_{1i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{2i},Y_i,\varvec{V}_i)\), \({{{\mathbf {\mathtt{{X}}}}}}_{2i}\) given \(({{{\mathbf {\mathtt{{X}}}}}}_{1i},Y_i,\varvec{V}_i)\), and \(\varvec{X}_i\) given \((Y_i,\varvec{V}_i)\) as follows:

$$\begin{aligned} \widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{1i}}^{**}({{{\mathbf {\mathtt{{x}}}}}}_1|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{(\delta _{k1}+\delta _{k3})I (Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}}(\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}(\delta _{s1}+\delta _{s3})I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}} (\varvec{V}_s-\varvec{V}_i)}}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{1k}\le {{{\mathbf {\mathtt{{x}}}}}}_1),\\ \widetilde{{\widetilde{F}}}_{{{{\mathbf {\mathtt{{X}}}}}}_{2i}}^{**}({{{\mathbf {\mathtt{{x}}}}}}_2|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{(\delta _{k1}+\delta _{k2})I (Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}}(\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}(\delta _{s1}+\delta _{s2})I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}} (\varvec{V}_s-\varvec{V}_i)}}\right)\, I({{{\mathbf {\mathtt{{X}}}}}}_{2k}\le {{{\mathbf {\mathtt{{x}}}}}}_2), \\ \widetilde{{\widetilde{F}}}_{\varvec{X}_i}^{**}({\mathbf {x}}|Y_i,\varvec{V}_i)&=\sum _{k=1}^{n}\left( \dfrac{\delta _{k1}I(Y_k=Y_i){{\mathcal {K}}_{{\mathcal {H}}} (\varvec{V}_k-\varvec{V}_i)}}{\sum _{s=1}^{n}\delta _{s1}I(Y_s=Y_i){{\mathcal {K}}_{{\mathcal {H}}}(\varvec{V}_s-\varvec{V}_i)}}\right)\, I(\varvec{X}_k\le {\mathbf {x}}). \end{aligned}$$

Here \({\mathcal {H}}\) is a \(q_3\times {q_3}\) symmetric positive definite matrix depending on n. \({\mathcal {K}}_{{\mathcal {H}}}({\mathbf {u}})=|{\mathcal {H}}|^{-1/2}{\mathcal {K}}({\mathcal {H}}^{-1/2}{\mathbf {u}})\) for \({\mathcal {K}}(\cdot )\) a \(q_3\)-variate kernel with \(\int {\mathcal {K}}({\mathbf {u}})d{\mathbf {u}}=1\) and \({\mathcal {H}}^{1/2}\) the bandwidth matrix.

Finally, as an extension, the two proposed MI methods can be applied to estimate the parameters of logistic regression with the outcome and covariates missing separately or simultaneously, and to estimate the parameters of a multinomial logit model with the same missing-data issue. These tasks are left for future research.