1 Introduction

Missing data is a prevalent issue in research areas like biomedicine, the social sciences, and survey sampling. Underlying any missing data problem is the statistical model that would apply had none of the data been missing (Tsiatis 2006). The missingness mechanism plays a crucial role in distinguishing different types of missingness problems. The missingness is termed ignorable if it depends on the observed data only; otherwise, it is termed nonignorable (Little and Rubin 2019; Zhao and Ma 2022). In practice, generalized additive partial linear models, combining interpretability and flexibility, are widely used for modeling different response types. Nonignorable models are underused due to the complexity of the identification and estimation procedures needed to recover parameters of interest as functions of the observed data (Nabi and Bhattacharya 2022).

Identification is generally not attainable under nonignorable missingness without additional assumptions. One approach is to employ the shadow variable strategy (Wang et al. 2014; Zhao and Shao 2015; Miao and Tchetgen Tchetgen 2016). A related method involves using instrumental variables (Tchetgen Tchetgen and Wirth 2017; Sun et al. 2018). Recently, Zhao and Ma (2022) and Li et al. (2022) combined these two approaches, establishing clear identifiability for the model. However, selecting suitable instrumental variables or shadow variables can be difficult, especially when dealing with numerous covariates (Cameron and Trivedi 2005). An alternative approach to addressing identifiability without instrumental variables or shadow variables relies on stronger assumptions regarding the distribution of the response or the response mechanism. Stronger assumptions about the response mechanism allow for the derivation of identifiability based on the distribution of the observed data, as demonstrated by Morikawa and Kim (2021) and Beppu et al. (2022). Miao et al. (2016), Cui et al. (2017), and Du et al. (2023) assumed that the response in the full data follows a specific distribution, such as an exponential family. However, because these stronger assumptions on the response distribution and mechanism risk misspecification, utilizing instrumental variables or shadow variables remains a reasonable alternative.

Further advancements are needed to develop estimation methods when the observed likelihood is identifiable. Extensive research has been conducted in this area, with various approaches proposed. For example, Wang et al. (2014), Shao and Wang (2016), and Wang et al. (2021) employed the generalized estimating equations approach. The empirical likelihood approach was utilized by Tang et al. (2014) and Cui et al. (2022). Calibration was employed by Kott and Chang (2010) and Hamori et al. (2019), while the pseudo likelihood approach was applied by Fang and Shao (2016) and Chen et al. (2021). These studies contribute to the existing literature by providing different estimation methods for addressing this issue.

Limited attention has been given in the existing literature to situations where regression models involve nonparametric functions of interest and the response is affected by nonignorable missingness, despite its prevalence in applied research. Du et al. (2023) tackle the challenge of identifying and estimating generalized additive partial linear models by assuming that the response in the full data follows an exponential family. On the other hand, Shao and Wang (2022) propose estimators for regression models with a single nonparametric function when the data distribution is unknown, but their focus does not specifically address model identifiability. In this paper, generalized additive partial linear models are identified through the imposition of three types of monotone missing data mechanisms: the logistic model, probit model, and complementary log-log model. The logistic and probit models are popular missing data mechanisms (Wang et al. 2014). The complementary log-log model has an important application in the area of survival analysis and hazard modeling (An and Brown 2008). These three models are likely to be most familiar to the target audience. Polynomial spline basis functions are used to approximate the unknown nonparametric functions, and estimating equations for the mean response are formulated based on inverse probability weighting. Our contributions focus on three main aspects.

  1. Our proposed approach identifies generalized additive partial linear models through the imposition of three types of monotone missing data mechanisms: the logistic, probit, and complementary log-log models. Identifiability is achieved either by assuming instrumental variable dependence without additional assumptions, or by assuming a linear relationship between the score function and the response variable without the use of instrumental variables. The mild sufficient conditions for identifiability stem from leveraging the analytical properties of the propensity function.

  2. The parameter estimators of the missingness model are obtained using the conditional score function. To address the curse of dimensionality, we employ dimension reduction techniques to achieve easily attainable univariate kernel estimation. The parameter and nonparametric function estimators in the regression model are obtained using inverse probability weighting. The unknown smooth functions are approximated by linear combinations of regression splines and incorporated into the covariate vector for statistical inference using generalized estimating equations.

  3. Under certain regularity conditions, we establish the asymptotic normality of the proposed estimators of the parametric components and the convergence rate of the estimators of the nonparametric functions. Simulation studies demonstrate the favorable performance of the proposed inference procedure across various settings. We also apply the proposed method to a dataset from the Chinese Household Income Project study conducted in 2013.

The paper is structured as follows. Section 2 establishes the sufficient conditions for the identifiability of the observed likelihood in generalized additive partial linear models under nonignorable missingness. In Sect. 3, we introduce the estimation procedure and establish the consistency and asymptotic normality of the estimators. The performance of the proposed method is evaluated through simulation studies in Sect. 4. Section 5 demonstrates the application of the new method using data from the Chinese Household Income Project 2013. Concluding remarks are provided in Sect. 6. The proofs of Theorems 1-4 can be found in the Supplementary Material.

2 Model settings and identifiability

Let Y be the response variable, and let \(\textbf{X}\) and \(\textbf{Z}\) be fully observed covariates, where \(Y \in \mathbb {R}\), \(\textbf{X}\equiv (1,X_1,\ldots ,X_{ d_1-1})^\top \in \mathbb {R}^{d_1}\) and \(\textbf{Z}\equiv (Z_1,\ldots ,Z_{d_2})^\top \in \mathbb {R}^{d_2}\). Define the binary variable r as the missingness indicator: \(r=1\) if Y is observed, and \(r=0\) otherwise. We assume that the probability \(P(r=1|Y=y,\textbf{X}=\textbf{x}, \textbf{Z}=\textbf{z})\) depends on y, \(\textbf{x}\), and \(\textbf{z}\), and we denote it by \(\pi (y,\textbf{x},\textbf{z};\alpha ,\varvec{\theta })\). We specify it using a logistic, probit, or complementary log-log model as follows:

$$\begin{aligned} \pi (y,\textbf{x},\textbf{z};\alpha ,\varvec{\theta })=\hbox {expit}(\xi )\,\,\hbox {or}\,\,\Phi (\xi ) \,\,\hbox {or}\,\, 1-\exp \{-\exp (\xi )\}, \end{aligned}$$
(1)

where \(\xi =\alpha y + \varvec{\theta }^\top (\textbf{x}^\top ,\textbf{z}^\top )^\top \), \(\hbox {expit}(\cdot )\equiv \exp (\cdot )/\{1+\exp (\cdot )\}\), and \(\Phi (\cdot )\) is the cumulative distribution function of the standard normal distribution. Here \(\alpha \in \mathbb {R}\) is the nonignorability parameter and \(\varvec{\theta }=(\theta _0,\ldots ,\theta _{d_1+d_2-1})^\top \) is an unknown \((d_1+d_2)\)-dimensional parameter vector. In the model described by Eq. (1), the probability of missingness depends on the potentially missing Y through the parameter \(\alpha \). When \(\alpha =0\), the missingness mechanism does not depend on the potentially missing Y, so the data are missing at random. Conversely, \(\alpha \ne 0\) indicates nonignorable missingness.
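For concreteness, the three propensity specifications in (1) can be sketched as follows. This is a minimal illustration; the function and argument names are ours, not from the paper.

```python
import math

def propensity(y, xz, alpha, theta, link="logit"):
    """Missingness probability pi(y, x, z; alpha, theta) under the three
    specifications in (1), with xi = alpha * y + theta' (x', z')'."""
    xi = alpha * y + sum(t * v for t, v in zip(theta, xz))
    if link == "logit":        # expit(xi)
        return 1.0 / (1.0 + math.exp(-xi))
    if link == "probit":       # Phi(xi), the standard normal CDF
        return 0.5 * (1.0 + math.erf(xi / math.sqrt(2.0)))
    if link == "cloglog":      # complementary log-log
        return 1.0 - math.exp(-math.exp(xi))
    raise ValueError(f"unknown link: {link}")
```

At \(\xi =0\) the three links give 0.5, 0.5, and \(1-e^{-1}\approx 0.632\), respectively, which highlights that the complementary log-log model is asymmetric about zero.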

Denoting \(p(y|\textbf{x},\textbf{z})\) as the conditional density function of y given \(\textbf{x}\) and \(\textbf{z}\), the conditional density function of a single sample based on observed data can be expressed as

$$\begin{aligned}&\{p(y,r=1|\textbf{x},\textbf{z})\}^{I(r=1)}\{p(r=0|\textbf{x},\textbf{z})\}^{I(r=0)}\\&\quad =\{p(r=1|y,\textbf{x},\textbf{z})p(y|\textbf{x},\textbf{z})\}^{I(r=1)}[E\{p(r=0|Y,\textbf{X},\textbf{Z})|\textbf{X},\textbf{Z}\}]^{I(r=0)}. \end{aligned}$$

Suppose we have an independent random sample \(\{(Y_i,r_i,\textbf{X}_i,\textbf{Z}_i), i=1,\ldots ,n\}\). The observed likelihood given \(\{\textbf{X}_i, \textbf{Z}_i\}\) can be written as

$$\begin{aligned} L_n&=\prod \limits _{i=1}^n\Big \{\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\alpha ,\varvec{\theta })p(Y_i|\textbf{X}_i,\textbf{Z}_i)\Big \}^{r_i}\nonumber \\&\quad \times \Big [\int \{1- \pi (y,\textbf{X}_i,\textbf{Z}_i;\alpha ,\varvec{\theta })\}p(y|\textbf{X}_i,\textbf{Z}_i) dy \Big ]^{1-r_i}. \end{aligned}$$
(2)

Nonignorable missingness in Y poses challenges to the identifiability of the observed likelihood, as highlighted by Wang et al. (2014). In Sect. 4, we demonstrate an unidentifiable example and discuss the resulting fluctuations in the estimators when the model lacks identifiability. Identifiability of the observed likelihood function (2) depends on the unique determination of \(\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\alpha ,\varvec{\theta })\) and \(p(Y_i|\textbf{X}_i,\textbf{Z}_i)\) given \(\{\textbf{X}_i,\textbf{Z}_i\}\). If there exist two sets of parameters \((\alpha ,\varvec{\theta },p(y|\textbf{x},\textbf{z}))\) and \((\alpha ^*,\varvec{\theta }^*,p^*(y|\textbf{x},\textbf{z}))\) such that

$$\begin{aligned} \pi (y,\textbf{x},\textbf{z};\alpha ,\varvec{\theta })p(y|\textbf{x},\textbf{z}) = \pi (y,\textbf{x},\textbf{z};\alpha ^*,\varvec{\theta }^*)p^*(y|\textbf{x},\textbf{z}), \end{aligned}$$

holds for all \((y,\textbf{x},\textbf{z})\) in an open set of \(\mathbb {R}^{d_1+d_2+1}\), then taking logarithms on both sides gives

$$\begin{aligned} h(\xi )+\log p(y|\textbf{x},\textbf{z}) =h(\xi ^*)+\log p^*(y|\textbf{x},\textbf{z}), \end{aligned}$$
(3)

where \(\xi ^*=\alpha ^* y+\varvec{\theta }^{*\top } (\textbf{x}^\top ,\textbf{z}^\top )^\top \) and \(h(\xi )\) takes one of three forms: \(\log \{\hbox {expit}(\xi )\}\), \(\log \{\Phi (\xi )\}\), or \(\log [1-\exp \{-\exp (\xi )\}]\). The observed likelihood is identifiable if (3) implies that

$$\begin{aligned} \alpha =\alpha ^*,\quad \varvec{\theta }=\varvec{\theta }^*, \quad p(y|\textbf{x},\textbf{z})=p^*(y|\textbf{x},\textbf{z}). \end{aligned}$$

To ensure identifiability, we can adopt the instrumental variable assumption, as defined in Assumption 1, following a similar approach as proposed by Tchetgen Tchetgen and Wirth (2017) and Sun et al. (2018).

Assumption 1

The missing mechanism \(\pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })\) includes a variable that is conditionally independent of Y given the other covariates.

The absence of a direct effect of the instrumental variable on the response echoes an assumption commonly encountered in causal inference. Moreover, in the context of this paper, the instrumental variable is integrated within the missingness mechanism model. Tchetgen Tchetgen and Wirth (2017) introduced homogeneous additive selection bias, making selection bias independent of the instrumental variable, which enables the identifiability of mean functionals of observed covariates. Sun et al. (2018) restricted the ratio \(p(y|\textbf{x},\textbf{z})/p^*(y|\textbf{x},\textbf{z})\) and established the identifiability of \(p(y,r|\textbf{x},\textbf{z})\). By employing three distinct forms of monotone missing data mechanisms (the logistic, probit, and complementary log-log models), we leverage their analytical properties to weaken the requirements for ensuring model identifiability.

Theorem 1

Under Assumption 1, if there is at least one continuous variable in the nonlinear component, the observed likelihood (2) is identifiable.

The proof is provided in the Supplementary Material. Theorem 1 establishes a sufficient condition for the identifiability of the observed likelihood under the instrumental variable assumption. However, determining a reasonable instrumental variable beforehand is often impractical, and detecting its presence from observed data is challenging. In cases where the instrumental variable assumption fails or a reasonable instrumental variable is difficult to choose, stronger assumptions on the response distribution become necessary, beyond what standard statistical methods typically require.

Assumption 2

Let \(\mu (\textbf{X},\textbf{Z})=E(Y|\textbf{X},\textbf{Z})\), and let \(\upsilon (\cdot )\) represent the nuisance parameter. Defining \(S_\mu \{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}=\partial \log p(Y|\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot ))/\partial \mu \), we assume that

$$\begin{aligned} S_\mu \{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}=a\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}Y+b\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}. \end{aligned}$$
(4)

We now illustrate three cases demonstrating the validity of Assumption 2 for various common distributions.

Example 1

(Exponential family case): If the probability density function \(p(Y|\textbf{X},\textbf{Z})\) belongs to the exponential family, then

$$\begin{aligned} S_\mu (\textbf{X},\textbf{Z})=\frac{Y-\mu }{E\{(Y-\mu )^2|\textbf{X},\textbf{Z}\}}, \end{aligned}$$

and \(a\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}=1/E\{(Y-\mu )^2|\textbf{X},\textbf{Z}\}\) and \(b\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}=-\mu (\textbf{X},\textbf{Z})/E\{(Y-\mu )^2|\textbf{X},\textbf{Z}\}\). The exponential family encompasses various common distributions, such as the normal distribution and gamma distribution for continuous responses, the Bernoulli distribution for binary responses, and the Poisson distribution and geometric distribution for discrete responses.
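To make Example 1 concrete, the sketch below checks numerically that the Poisson log-likelihood score in \(\mu \) is linear in Y with \(a=1/E\{(Y-\mu )^2|\textbf{X},\textbf{Z}\}=1/\mu \) and \(b=-\mu /\mu =-1\). This is an illustration only; a finite-difference derivative stands in for the analytic one.

```python
import math

def poisson_loglik(y, mu):
    # log p(Y = y | mu) for the Poisson distribution
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def score_mu(y, mu, eps=1e-6):
    # central finite-difference approximation of d log p / d mu
    return (poisson_loglik(y, mu + eps) - poisson_loglik(y, mu - eps)) / (2 * eps)

mu = 2.5
a, b = 1.0 / mu, -1.0   # since E{(Y - mu)^2 | X, Z} = mu for the Poisson family
for y in range(6):
    # S_mu = (Y - mu)/mu = a*Y + b, matching the linear form in Assumption 2
    assert abs(score_mu(y, mu) - (a * y + b)) < 1e-6
```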

Example 2

(Quasi-likelihood case): For the quasi-Poisson model with nonignorable nonresponse data, we can specify the structure of the probability density function based on assumptions about the conditional mean and variance functions in the following manner:

$$\begin{aligned} p(Y|\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot ))=\frac{\exp \{(Y\log \mu -\mu )/\phi \}}{E[\exp \{(Y\log \mu -\mu )/\phi \}|\textbf{X},\textbf{Z}]}, \end{aligned}$$

then

$$\begin{aligned} S_\mu (\textbf{X},\textbf{Z})=\frac{Y-\mu }{\phi \mu }, \end{aligned}$$

and \(a\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}=1/\{\phi \mu (\textbf{X},\textbf{Z})\}\) and \(b\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}=-1/\phi \). This approach can be generalized to other quasi-likelihood statistical models as well.

Example 3

(Truncated distribution case): We assume the probability density function \(p(Y|\textbf{X},\textbf{Z};\mu ,\nu (\cdot ))\) takes the following form:

$$\begin{aligned} p(Y|\textbf{X},\textbf{Z};\mu ,\nu (\cdot ))=\frac{\frac{1}{\sqrt{2\pi \sigma ^2}}\exp \Big (\frac{-(Y-\mu )^2}{2\sigma ^2}\Big )}{\Phi \{(b-\mu )/\sigma \}-\Phi \{(a-\mu )/\sigma \}}, \quad \quad a\le Y\le b, \end{aligned}$$

where \(\Phi \) is the cumulative distribution function of the standard normal distribution. Often the goal is to make inference back to the original population and not on the truncated population that is sampled (Hattaway 2010). In this case, the inference is focused on estimating \(\mu \), which represents the expectation of the original distribution. Let

$$\begin{aligned} S_\mu (\textbf{X},\textbf{Z})=\frac{Y-\mu }{\sigma ^2}-\frac{\partial \log [\Phi \{(b-\mu )/\sigma \}-\Phi \{(a-\mu )/\sigma \}]}{\partial \mu },\quad \quad a\le Y\le b, \end{aligned}$$

where \(a\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}=1/\sigma ^2\) and \(b\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}=-\mu (\textbf{X},\textbf{Z})/\sigma ^2-\partial \log [\Phi \{(b-\mu (\textbf{X},\textbf{Z}))/\sigma \}-\Phi \{(a-\mu (\textbf{X},\textbf{Z}))/\sigma \}]/\partial \mu \). This approach extends to other truncated distributions as well.

Let \(\mu (\textbf{X},\textbf{Z})=\lambda (\eta )\), where \(\lambda (\cdot )\) is the inverse of the link function relating the response to the linear predictor \(\eta \), which is modeled as an additive partial linear function

$$\begin{aligned} \eta =\varvec{\beta }^\top \textbf{X}+\sum _{k=1}^{d_2}g_{k}(Z_{k}). \end{aligned}$$
(5)

To ensure identifiability, we assume that the additive nonparametric functions in (5) are centered, i.e., \(E[g_{k}(Z_{k})]=0\) for \(k=1,\ldots , d_2\). The inclusion of a linear component \(\varvec{\beta }^\top \textbf{X}\) in model (5) makes it easier to interpret, while the inclusion of the nonparametric component \(\sum _{k=1}^{d_2}g_{k}(Z_{k})\) enhances its flexibility.
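A small simulation sketch of model (5) with centered additive components follows; the particular \(\varvec{\beta }\) and \(g_k\) below are our illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # X = (1, X_1)'
Z = rng.uniform(0.0, 1.0, size=(n, 2))                 # Z = (Z_1, Z_2)'
beta = np.array([1.0, -0.5])

# centered nonparametric components: E[g_k(Z_k)] = 0 for Z_k ~ Uniform(0, 1),
# as required by the identifiability constraint on (5)
def g1(z):
    return np.sin(2.0 * np.pi * z)

def g2(z):
    return z ** 2 - 1.0 / 3.0   # E[Z^2] = 1/3 under Uniform(0, 1)

# additive partial linear predictor eta = beta' X + g_1(Z_1) + g_2(Z_2)
eta = X @ beta + g1(Z[:, 0]) + g2(Z[:, 1])
```

The centering of each \(g_k\) is what separates the intercept in \(\varvec{\beta }^\top \textbf{X}\) from level shifts in the nonparametric parts.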

Theorem 2

Under Assumption 2, if the inverse of the link function \(\lambda (\cdot )\) is a known, one-to-one, first-order differentiable function and there is at least one continuous variable in the nonlinear component, then

  1. (i)

    When Y is a binary variable, \(\log \lambda (x)\) is strictly concave and the sign of the first derivative of the nonlinear component \(\sum _{k=1}^{d_2}g_{k}(z_{k})\) is known at point zero, the observed likelihood (2) is identifiable;

  2. (ii)

    When Y is a discrete variable with at least three values, the observed likelihood (2) is identifiable if the sign of \(\alpha \) is known;

  3. (iii)

When Y is a continuous variable and \(h(\xi ) = \log \{\hbox {expit}(\xi )\}\) is used, the observed likelihood (2) is identifiable when the sign of at least one element of the parameter vector \((\alpha , \varvec{\theta }^\top )^\top \) is known;

  4. (iv)

    When Y is a continuous variable, \(h(\xi )=\log \{\Phi (\xi )\}\,\,\hbox {or}\,\,\log [1-\exp \{-\exp (\xi )\}],\) the observed likelihood (2) is identifiable.

The proof is provided in the Supplementary Material. The three examples above illustrate the wide applicability of Theorem 2. In contrast to Theorem 1 in Du et al. (2023), Theorem 2 expands the range of identifiable models to a more general class and also facilitates the establishment of identifiable pseudo-likelihood functions, all without requiring instrumental variable assumptions. The assumption that the inverse link function \(\lambda (\cdot )\) is known, one-to-one, and first-order differentiable is commonly employed in quasi-likelihood models (Wang et al. 2011). For binary variables, commonly used propensity functions, such as the logistic, probit, and complementary log-log models, satisfy the condition that \(\log \lambda (x)\) is strictly concave. Estimating \(g'_k(x_{k0})\) involves using a local least squares algorithm, with \(x_{k0}\) chosen as a fixed point within a neighborhood where missingness does not occur (of length \(O(n^{-1/5})\)) (Fan et al. 1996). Prior knowledge of the sign of the unknown parameters in the missingness mechanism models is required for parameter identifiability in the case of discrete response variables with at least three values and continuous response variables with a logistic missing data mechanism. According to Krosnick et al. (2002), factors like respondents' cognitive level, motivation, and social status influence nonresponse probability. Based on this, we can speculate on the trend of nonresponse probability and infer the sign of the parameters in the missingness mechanism model. For instance, in a household income survey, high-income individuals might be less likely to disclose their true income, suggesting \(\alpha <0\). The identifiability of the observed likelihood (2) is guaranteed when the model (4) reduces to generalized additive models, as stated in Theorem 1 and Theorem 2.

3 Estimation method

By considering a nonparametric form of \(p(y|\textbf{x},\textbf{z})\) and utilizing the observed likelihood (2), the score function method proposed by Cui and Zhou (2017) yields the following estimating function for the parameters in the missingness model:

$$\begin{aligned} \sum \limits _{i=1}^n \Big \{r_i\frac{\pi '(Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })}{\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })} -(1-r_i)\frac{E\{\pi '(Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })|\textbf{X}_i,\textbf{Z}_i\}}{E[\{1-\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })\}|\textbf{X}_i,\textbf{Z}_i]}\Big \}, \end{aligned}$$
(6)

where \(\varvec{\delta }=(\alpha ,\varvec{\theta }^\top )^\top \), \(\pi '(\cdot )\) denotes the partial derivative of \(\pi (\cdot )\) with respect to \(\varvec{\delta }\). To estimate the parameters in equation (6) using the kernel method, a multivariate kernel is required. However, the standard nonparametric kernel regression estimators face challenges due to the curse of dimensionality. In this paper, we propose an improved formulation of equation (6) as:

$$\begin{aligned} \textbf{V}(\varvec{\delta })=\sum \limits _{i=1}^n \Big \{r_i\frac{\pi '(Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })}{\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })} +(1-r_i)\frac{\omega '(O_i;\varvec{\delta })}{\omega (O_i;\varvec{\delta })}\Big \}, \end{aligned}$$
(7)

where \(\omega (O;\varvec{\delta })=E[\{1-\pi (Y,\textbf{X},\textbf{Z};\varvec{\delta })\}|O]\), \(O=\varvec{\theta }^\top (\textbf{X}^\top ,\textbf{Z}^\top )^\top \), \(\omega '(\cdot )\) denotes the partial derivative of \(\omega ( \cdot )\) with respect to \(\varvec{\delta }\). The score function (7) is unbiased since

$$\begin{aligned} E\{\textbf{V}(\varvec{\delta })\}&=\sum \limits _{i=1}^n E\Big \{r_i\frac{\pi '(Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })}{\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })} +(1-r_i)\frac{\omega '(O_i;\varvec{\delta })}{\omega (O_i;\varvec{\delta })}\Big \}\\&=\sum \limits _{i=1}^n E[E\{\pi '(Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta }) +\omega '(O_i;\varvec{\delta })|O_i\}]\\&=0. \end{aligned}$$

Given observational data, we have

$$\begin{aligned}&\omega (O_i;\varvec{\delta })=E[r_i\{1-\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })\}/\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })|O_i],\\&\omega '(O_i;\varvec{\delta })=-E[r_i\pi '(Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })/\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })|O_i]. \end{aligned}$$

Several existing methods estimate \(\omega (O;\varvec{\delta })\) and \(\omega '(O;\varvec{\delta })\); for details, see Fan et al. (1996). The local constant estimators are given by

$$\begin{aligned}&\hat{\omega }(o;\varvec{\delta })=\frac{\sum _{i=1}^n k_h(O_i-o)[r_i\{1-\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })\}/\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })]}{\sum _{i=1}^n k_h(O_i-o)},\\&\hat{\omega }'(o;\varvec{\delta })=-\frac{\sum _{i=1}^n k_h(O_i-o)\{r_i\pi '(Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })/\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })\}}{\sum _{i=1}^n k_h(O_i-o)}, \end{aligned}$$

where \(k(\cdot )\) is a given kernel function, h represents the bandwidth, and \(k_h(t)\) is defined as k(t/h)/h.
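A minimal local-constant (Nadaraya-Watson) sketch of these estimators follows, with a Gaussian kernel; the names are ours, and the commented lines indicate how \(\hat{\omega }\) and \(\hat{\omega }'\) would be formed once \(r_i\), \(\pi _i\), and \(\pi '_i\) are in hand.

```python
import numpy as np

def k_h(t, h):
    """Scaled Gaussian kernel k(t/h)/h."""
    return np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def local_constant(o, O, v, h):
    """Nadaraya-Watson estimate of E[v | O = o] from samples (O_i, v_i)."""
    w = k_h(O - o, h)
    return np.sum(w * v) / np.sum(w)

# with r, pi_vals = pi(Y_i, X_i, Z_i; delta), and pi_grad = pi'(...) available:
#   omega_hat(o)       = local_constant(o, O, r * (1 - pi_vals) / pi_vals, h)
#   omega_prime_hat(o) = -local_constant(o, O, r * pi_grad / pi_vals, h)
#   (for vector-valued pi', apply local_constant componentwise)
```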

Hence, we formulate the estimation equation for the parameter \(\varvec{\delta }\) in the missing model as follows:

$$\begin{aligned} \hat{\textbf{V}}(\varvec{\delta })=\sum \limits _{i=1}^n \hat{\textbf{V}}_i(\varvec{\delta })=\sum \limits _{i=1}^n \Big \{r_i\frac{\pi '(Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })}{\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })} +(1-r_i)\frac{\hat{\omega }'(O_i;\varvec{\delta })}{\hat{\omega }(O_i;\varvec{\delta })}\Big \}. \end{aligned}$$
(8)

In estimating equation (8), we simplify the problem by projecting the propensity function onto a linear combination of covariates, transforming it into a univariate kernel estimation problem. While this approach may yield less efficient estimators, it offers simplicity and ease of implementation. In practical calculations, the nleqslv package in R can be used to solve equation (8) and obtain the estimator for \(\varvec{\delta }\).
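In place of R's nleqslv, a plain Newton iteration (or a general root finder) can solve a system of score-type equations like (8). For a self-contained illustration we solve the score of a fully observed logistic regression of r on (x, 1), which shares the structure of (8) (a sum of per-observation score terms set to zero); in the real problem one would replace `score` and `jacobian` below with \(\hat{\textbf{V}}(\varvec{\delta })\) and its derivative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
W = np.column_stack([x, np.ones(n)])
delta_true = np.array([0.5, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(W @ delta_true)))
r = rng.binomial(1, p_true)

def score(delta):
    # logistic-regression score: sum_i (r_i - p_i) w_i
    p = 1.0 / (1.0 + np.exp(-(W @ delta)))
    return W.T @ (r - p)

def jacobian(delta):
    # derivative of the score with respect to delta
    p = 1.0 / (1.0 + np.exp(-(W @ delta)))
    return -(W * (p * (1.0 - p))[:, None]).T @ W

def newton(score, jacobian, d0, tol=1e-10, max_iter=50):
    d = np.asarray(d0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jacobian(d), score(d))
        d = d - step
        if np.max(np.abs(step)) < tol:
            break
    return d

delta_hat = newton(score, jacobian, np.zeros(2))
```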

The estimator of the parameter \(\varvec{\delta }\) in the missingness model is consistent and asymptotically normal, as described in Theorem 3. These properties rest on the asymptotic behavior of the estimated function \(\hat{\textbf{V}}(\varvec{\delta })\).

Theorem 3

Under identifiable observed likelihood (2) and the satisfaction of conditions (A)-(E) in the Supplementary Material, if \(nh^2/\log (1/h) \rightarrow \infty \) and \(nh^4\rightarrow 0\), the following holds:

  1. (i)

    \(\hat{\varvec{\delta }}\) converges in probability to the true value \(\varvec{\delta }_0\),

  2. (ii)

    \(\sqrt{n}(\hat{\varvec{\delta }}-\varvec{\delta }_0){\mathop {\longrightarrow }\limits ^{\mathcal {D}}} \mathcal {N}(0, \varvec{\Omega }(\varvec{\delta }_0))\),

where \(\varvec{\Omega }(\varvec{\delta })=A^{-1}(\varvec{\delta })B(\varvec{\delta })\{A^{-1}(\varvec{\delta })\}^\top \),

$$\begin{aligned} A(\varvec{\delta })=E\{\partial \textbf{V}_{i}(\varvec{\delta })/\partial \varvec{\delta }^\top \},\quad B(\varvec{\delta })=E\{\textbf{R}_{i}(\varvec{\delta })\textbf{R}_{i}^\top (\varvec{\delta })\}, \end{aligned}$$
(9)

and

$$\begin{aligned} \textbf{R}_{i}(\varvec{\delta })= \frac{\{1-r_i/\pi (Y_i,\textbf{X}_i,\textbf{Z}_i;\varvec{\delta })\}\omega '(O_i;\varvec{\delta })}{\omega (O_i;\varvec{\delta })}. \end{aligned}$$

The proof of Theorem 3 is provided in the Supplementary Material. Based on Theorem 3, the asymptotic representation for \(\hat{\textbf{V}}(\varvec{\delta })\) can be viewed as a special case of generalized estimating equations as described in Wang et al. (2021). In practical applications, the covariance matrix can be obtained using \(\hat{A}^{-1}(\hat{\varvec{\delta }})\hat{B}(\hat{\varvec{\delta }})\{\hat{A}^{-1}(\hat{\varvec{\delta }})\}^\top \), where

$$\begin{aligned} \hat{A}(\hat{\varvec{\delta }})=n^{-1}\sum _{i=1}^{n}\partial \hat{\textbf{V}}_{i}(\hat{\varvec{\delta }})/\partial \varvec{\delta }^\top ,\quad \hat{B}(\hat{\varvec{\delta }})=n^{-1}\sum _{i=1}^{n}\textbf{R}_{i}(\hat{\varvec{\delta }})\textbf{R}_{i}^\top (\hat{\varvec{\delta }}). \end{aligned}$$
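The plug-in sandwich formula above is mechanical once per-observation Jacobians and influence terms are available; a sketch follows (the array-shape conventions are ours).

```python
import numpy as np

def sandwich_cov(jac_terms, infl_terms):
    """A_hat^{-1} B_hat (A_hat^{-1})' from per-observation pieces:
    jac_terms:  (n, p, p) stack of the Jacobian contributions dV_i/d delta',
    infl_terms: (n, p) stack of the influence terms R_i(delta)."""
    A = jac_terms.mean(axis=0)                                    # A_hat
    B = np.einsum("ni,nj->ij", infl_terms, infl_terms) / infl_terms.shape[0]  # B_hat
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv.T
```

By construction the result is symmetric and positive semidefinite whenever \(\hat{A}\) is invertible.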

Now we consider the estimators of the unknown parameter vector \(\varvec{\beta }\) and the unknown functions \(g_{k}(Z_{k}), k=1,\ldots , d_2\). With the estimated propensity score, one approach to estimation is to plug the estimated \(\hat{\varvec{\delta }}\) from (8) into \(\pi (y,\textbf{x},\textbf{z};\varvec{\delta })\) and use inverse probability weighting of the complete cases. This method employs a generalized estimating equation and estimates the nonparametric components using polynomial splines. Recall that \(\textbf{Z}=(Z_1,\ldots ,Z_{d_2})^\top \) represents a vector of covariates, \(\textbf{Z}_i=(Z_{i1},\ldots ,Z_{id_2})^\top \) is the vector of covariates for the ith observation, and \(\eta _i\) for the ith observation can be expressed as \(\eta _i=\varvec{\beta }^\top \textbf{X}_i+\sum _{k=1}^{d_2}g_{k}(Z_{ik})\). Assuming that \(Z_k\) is distributed on a compact interval \([t_{k}^l,t_k^r], k=1,\ldots , d_2,\) without loss of generality, we can set all intervals to be \([t_{k}^l,t_k^r]=[0,1], k=1,\ldots , d_2.\) Following the approach proposed by Wang and Yang (2007), the smooth unknown functions \(g_k\)'s can be effectively approximated using a linear combination of polynomial spline functions. Let \(\mathcal {S}_n\) be the space of polynomial splines on the interval [0, 1] of order \(q\ge 1\). We define a knot sequence with J interior knots and denote it as

$$\begin{aligned} \tau _{-q}=\cdots =\tau _{-1}=\tau _0=0<\tau _1<\ldots<\tau _J<1=\tau _{J+1}=\cdots =\tau _{J+q+1}, \end{aligned}$$

where \(J \equiv J_n\) is chosen to increase as the sample size n increases, and the specific order is provided in condition (I) in the Supplementary Material. Then \(\mathcal {S}_n\) consists of functions \(\tilde{\omega }\) that satisfy the following properties: (i) \(\tilde{\omega }\) is a polynomial of degree q on each of the subintervals \(I_s=[\tau _s,\tau _{s+1}), s=0,\ldots ,J_n-1, I_{J_n}=[\tau _{J_n},1];\) (ii) for \(q\ge 1,\) \(\tilde{\omega }\) is \((q-1)\) times continuously differentiable on [0, 1]. For the kth covariate \(Z_k\), let \(\{\tilde{b}_{j,k}(Z_k), j=1,\ldots ,J_n+q+2, k=1,\ldots , d_2\}\) be the B-spline basis functions of order q of the space \(\mathcal {S}_n\). Let \(N_n=J_n+q+1\); we adopt the normalized B-spline space \(\mathcal {S}_n^0\) introduced in Xue and Yang (2006), with the normalized basis defined as follows, for \(1\le j\le N_n\) and \(1\le k \le d_2\):

$$\begin{aligned} B_{j,k}(Z_k)=\sqrt{N_n}\Big \{\tilde{b}_{j+1,k}(Z_k)-\frac{E(\tilde{b}_{j+1,k})}{E(\tilde{b}_{1,k})}\tilde{b}_{1,k}(Z_k)\Big \}. \end{aligned}$$
(10)

The normalized B-spline approximation of \(g_k(Z_k)\) can then be expressed as follows:

$$\begin{aligned} g_k(Z_k)\approx \tilde{g}_k(Z_k)=\sum _{j=1}^{N_n}\gamma _{j,k}B_{j,k}(Z_k)-\frac{1}{n}\sum _{i=1}^{n}\sum _{j=1}^{N_n}\gamma _{j,k}B_{j,k}(Z_{ik}). \end{aligned}$$
(11)
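A sketch of the spline building blocks: a B-spline basis on [0, 1] via the Cox-de Boor recursion, followed by the empirical centering used in (11). We center columns by their sample means rather than reproducing the exact normalization (10); the knot placement and order below are illustrative choices.

```python
import numpy as np

def bspline_basis(z, interior, order):
    """B-spline basis of the given order (degree order-1) on [0, 1] with the
    supplied interior knots, computed by the Cox-de Boor recursion."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    t = np.concatenate([np.zeros(order), interior, np.ones(order)])
    B = np.zeros((z.size, t.size - 1))
    for j in range(t.size - 1):                      # order-1 indicator functions
        B[:, j] = (t[j] <= z) & (z < t[j + 1])
    B[z == 1.0, np.searchsorted(t, 1.0) - 1] = 1.0   # include the right endpoint
    for m in range(2, order + 1):                    # raise the order step by step
        nb = np.zeros((z.size, t.size - m))
        for j in range(t.size - m):
            if t[j + m - 1] > t[j]:
                nb[:, j] += (z - t[j]) / (t[j + m - 1] - t[j]) * B[:, j]
            if t[j + m] > t[j + 1]:
                nb[:, j] += (t[j + m] - z) / (t[j + m] - t[j + 1]) * B[:, j + 1]
        B = nb
    return B

z = np.linspace(0.0, 1.0, 101)
B = bspline_basis(z, interior=np.array([0.25, 0.5, 0.75]), order=3)
B_centered = B - B.mean(axis=0)   # empirical centering, mirroring (11)
```

The columns of `B` are nonnegative and sum to one at every point (partition of unity), and centering removes the level that would otherwise be confounded with the intercept.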

Denoting \(\varvec{\gamma }=(\gamma _{1,1},\ldots ,\gamma _{N_n,d_2})^\top \) as a vector of coefficients of dimension \(N_nd_2\), \(\textbf{B}_{i,k}=(B_{1,k}(Z_{ik}),\ldots ,B_{N_n,k}(Z_{ik}))^\top \) and \(\textbf{B}_i=(\textbf{B}_{i,1}^\top ,\ldots ,\textbf{B}_{i,d_2}^\top )^\top \), and writing \(\tilde{g}_k(Z_k)\) for the approximation in (11), we approximate \(\eta _i\) by \(\tilde{\eta }_i=\textbf{X}_i^\top \varvec{\beta }+\tilde{g}(\textbf{Z}_i)=\textbf{X}_i^\top \varvec{\beta }+ \textbf{B}_i^\top \varvec{\gamma }\).

Suppose that the conditional variance function \(\hbox {var}(Y|\textbf{X},\textbf{Z})= \phi V(\mu (\textbf{X},\textbf{Z}))\) for \(\phi >0\) and some known positive function V. Let the quasi-score function be defined as

$$\begin{aligned} q(\varvec{\beta },g)=\frac{Y-\mu }{ V(\mu )}\times \frac{\partial \mu }{\partial \eta }. \end{aligned}$$
(12)

By replacing the unknown smooth function with the approximation given in Equation (11), we can obtain the following estimating equation:

$$\begin{aligned} \sum \limits _{i=1}^n \textbf{U}_{i}(\hat{\varvec{\delta }},\varvec{\beta },\tilde{g})= \begin{pmatrix} \sum \limits _{i=1}^n\textbf{U}_{\varvec{\beta },i}(\hat{\varvec{\delta }},\varvec{\beta },\tilde{g})\\ \sum \limits _{i=1}^n\textbf{U}_{\varvec{\gamma },i}(\hat{\varvec{\delta }},\varvec{\beta },\tilde{g}) \end{pmatrix} =0, \end{aligned}$$
(13)

where

$$\begin{aligned} \textbf{U}_{\varvec{\beta },i}(\hat{\varvec{\delta }},\varvec{\beta },\tilde{g})=\frac{r_i q_{i}(\varvec{\beta },\tilde{g})}{\pi _i(\hat{\varvec{\delta }})}\textbf{X}_{i}, \quad \textbf{U}_{\varvec{\gamma },i}(\hat{\varvec{\delta }},\varvec{\beta },\tilde{g})=\frac{r_i q_{i}(\varvec{\beta },\tilde{g})}{\pi _i(\hat{\varvec{\delta }})}\textbf{B}_{i}. \end{aligned}$$

Then the estimators of \(\varvec{\beta }\) and \(\varvec{\gamma }\) can be obtained by solving (13), and \(\hat{g}=\textbf{B}^\top \hat{\varvec{\gamma }}\).
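For intuition, when \(\lambda \) is the identity link and \(V(\mu )\equiv 1\), (13) reduces to inverse-probability-weighted least squares normal equations, which have a closed form. A toy sketch follows; the propensities here are simulated stand-ins for \(\pi _i(\hat{\varvec{\delta }})\), and `B` stands in for the spline design matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B = rng.uniform(size=(n, 3))            # placeholder for the spline design B_i
W = np.column_stack([X, B])             # combined design (X_i', B_i')'
coef_true = np.array([1.0, -0.5, 0.3, 0.2, -0.1])
Y = W @ coef_true + rng.normal(scale=0.1, size=n)

pi_hat = rng.uniform(0.4, 0.9, size=n)  # stand-in for pi_i(delta_hat)
r = rng.binomial(1, pi_hat)             # response observed with probability pi

# identity link, V = 1: (13) becomes sum_i (r_i / pi_i) (Y_i - W_i'c) W_i = 0,
# i.e. weighted normal equations with weights w_i = r_i / pi_i
w = r / pi_hat
WtW = (W * w[:, None]).T @ W
WtY = (W * w[:, None]).T @ Y
coef_hat = np.linalg.solve(WtW, WtY)
```

Complete cases enter with weight \(1/\pi _i\) and incomplete cases drop out through \(r_i=0\); for non-identity links the same system is solved iteratively.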

We will introduce some notation. Let

$$\begin{aligned} D(\varvec{\delta },\varvec{\beta },g)=E\{\partial \textbf{U}_{\varvec{\beta },i}(\varvec{\delta },\varvec{\beta },g)/\partial \varvec{\beta }^\top \}-G(\varvec{\delta },\varvec{\beta },g)E\{\partial \textbf{U}_{\varvec{\gamma },i}(\varvec{\delta },\varvec{\beta }, g)/\partial \varvec{\beta }^\top \}, \end{aligned}$$

where

$$\begin{aligned} G(\varvec{\delta },\varvec{\beta },g)=E\{\partial \textbf{U}_{\varvec{\beta },i}(\varvec{\delta },\varvec{\beta },g)/\partial \varvec{\gamma }^\top \}[E\{\partial \textbf{U}_{\varvec{\gamma },i}(\varvec{\delta },\varvec{\beta },g)/\partial \varvec{\gamma }^\top \}]^{-1}. \end{aligned}$$

Define \(\Sigma (\varvec{\delta },\varvec{\beta },g)=E\{m_i(\varvec{\delta },\varvec{\beta },g)m_i(\varvec{\delta },\varvec{\beta },g)^\top \}\), where

$$\begin{aligned} m_i(\varvec{\delta },\varvec{\beta },g)= & {} \{D(\varvec{\delta },\varvec{\beta },g)\}^{-1}\frac{1}{\sqrt{n}}\sum \limits _{i=1}^n[\textbf{U}_{\varvec{\beta },i}(\varvec{\delta },\varvec{\beta },g)-G(\varvec{\delta },\varvec{\beta },g)\textbf{U}_{\varvec{\gamma },i}(\varvec{\delta },\varvec{\beta },g) ]\\{} & {} \quad -\{D(\varvec{\delta },\varvec{\beta },g)\}^{-1}[E\{\partial \textbf{U}_{\varvec{\beta },i}(\varvec{\delta },\varvec{\beta },g)/\partial \varvec{\delta }^\top \}\\{} & {} \quad -G(\varvec{\delta },\varvec{\beta },g)E\{\partial \textbf{U}_{\varvec{\gamma },i}(\varvec{\delta },\varvec{\beta },g)/\partial \varvec{\delta }^\top \}] A^{-1}(\varvec{\delta })\frac{1}{\sqrt{n}}\textbf{R}(\varvec{\delta }). \end{aligned}$$

We use \(D_0\) and \(\Sigma _0\) to denote the values of \(D(\varvec{\delta },\varvec{\beta },g)\) and \(\Sigma (\varvec{\delta },\varvec{\beta },g)\) at \(\varvec{\delta }_0,\varvec{\beta }_0,g_0\), respectively. Theorem 4 describes the asymptotic properties of the proposed estimators.

Theorem 4

Under identifiable observed likelihood (2) and conditions (B)-(J), if \(nh^2/\log (1/h) \rightarrow \infty \) and \(nh^4\rightarrow 0\), we have

  (i)

    \(\Vert \hat{g}_k-g_{0k}\Vert =O_p\{(N_n/n)^{1/2}\}, 1\le k\le d_2\),

  (ii)

    \(\sqrt{n}(\hat{\varvec{\beta }}-\varvec{\beta }_0){\mathop {\longrightarrow }\limits ^{\mathcal {D}}}N(0, D_0^{-1}\Sigma _0{D_0^{-1}}^\top )\),

where \(\Vert \hat{g}_k-g_{0k}\Vert ^2=E\{\hat{g}_k(Z_k)-g_{0k}(Z_k)\}^2\).

The proof of Theorem 4 is provided in the Supplementary Material. In practical applications, the covariance matrix can be estimated using \(\hat{D}^{-1}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\hat{\Sigma }(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\{\hat{D}^{-1}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\}^\top \), where

$$\begin{aligned} \hat{D}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})= & {} n^{-1}\sum _{i=1}^{n}\{\partial \textbf{U}_{\varvec{\beta },i}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})/\partial \varvec{\beta }^\top -\hat{ G}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\partial \textbf{U}_{\varvec{\gamma },i}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})/\partial \varvec{\beta }^\top \},\\ \hat{{G}}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})= & {} n^{-1}\sum _{i=1}^{n}\partial \textbf{U}_{\varvec{\beta },i}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})/\partial \varvec{\gamma }^\top \Big [\sum _{i=1}^{n}\partial \textbf{U}_{\varvec{\gamma },i}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})/\partial \varvec{\gamma }^\top \Big ]^{-1},\\ \hat{\Sigma }(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})= & {} n^{-1}\sum _{i=1}^{n}\{m_i(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})m_i^\top (\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\},\\ m_i(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})= & {} \{\hat{D}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\}^{-1} \frac{1}{\sqrt{n}}\sum \limits _{i=1}^n[\textbf{U}_{\varvec{\beta },i}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})-\hat{G}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\textbf{U}_{\varvec{\gamma },i}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g}) ]\\{} & {} -\{\hat{D}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\}^{-1}\Big [\frac{1}{n}\sum \limits _{i=1}^n\partial \textbf{U}_{\varvec{\beta },i}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})/\partial \varvec{\delta }^\top \\{} & {} -\hat{G}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\frac{1}{n}\sum \limits _{i=1}^n\partial \textbf{U}_{\varvec{\gamma },i}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})/\partial \varvec{\delta }^\top \Big ] \hat{A}^{-1}(\hat{\varvec{\delta }})\frac{1}{\sqrt{n}}\hat{\textbf{R}}(\hat{\varvec{\delta }}). \end{aligned}$$
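The plug-in sandwich covariance above can be sketched in Python for the same simplified identity-link setting as before. Because \(\pi _i\) is treated as known in this sketch, the correction term involving \(\partial \textbf{U}/\partial \varvec{\delta }^\top \) and \(\hat{A}^{-1}\) is omitted, and the profiled scores \(\textbf{U}_{\varvec{\beta },i}-\hat{G}\textbf{U}_{\varvec{\gamma },i}\) are used directly; this is an illustration of the construction, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
expit = lambda t: 1 / (1 + np.exp(-t))

# Same toy identity-link design as in the estimation sketch.
X = rng.normal(size=(n, 2))
Z = rng.uniform(size=n)
y = X @ np.array([1.0, -1.0]) + np.sin(2 * np.pi * Z) + rng.normal(scale=0.5, size=n)
pi = expit(0.5 * y + 1.5)
r = rng.binomial(1, pi)
w = r / pi

B = np.column_stack([np.ones(n), Z, Z**2, Z**3]
                    + [np.clip(Z - k, 0, None) ** 3 for k in np.linspace(0.1, 0.9, 9)])
D_full = np.column_stack([X, B])
theta = np.linalg.solve(D_full.T @ (w[:, None] * D_full), D_full.T @ (w * y))
e = y - D_full @ theta  # residuals entering the quasi-score

# Blocks of the estimated Jacobian of (U_beta, U_gamma) in (beta, gamma).
Jbb = -(w[:, None] * X).T @ X / n
Jbg = -(w[:, None] * X).T @ B / n
Jgb = -(w[:, None] * B).T @ X / n
Jgg = -(w[:, None] * B).T @ B / n

G_hat = Jbg @ np.linalg.inv(Jgg)   # G-hat
D_hat = Jbb - G_hat @ Jgb          # D-hat: profiled Jacobian for beta

# Profiled scores and their outer-product average (Sigma-hat).
U_beta = (w * e)[:, None] * X
U_gamma = (w * e)[:, None] * B
m = U_beta - U_gamma @ G_hat.T
Sigma_hat = m.T @ m / n

cov_beta = np.linalg.inv(D_hat) @ Sigma_hat @ np.linalg.inv(D_hat).T / n
se = np.sqrt(np.diag(cov_beta))
print(se)  # standard errors for beta-hat
```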

4 Simulations

In this section, we present simulation results for the proposed estimators introduced in Sect. 3. We consider three types of response models: logistic regression, quasi-Poisson regression, and truncated normal regression. These models are subject to different missing data mechanisms, namely the logistic, probit, and complementary log-log models.

The covariate vector \((\textbf{X}^\top ,\textbf{Z}^\top )^\top =(X_1,X_2,Z_1,Z_2)^\top \), and \(g(\textbf{Z})=g_1(Z_1)+g_2(Z_2)=\sin (4\pi Z_1)+5(Z_2-0.5)^2-5/12\), where \(Z_1\) and \(Z_2\) are independently uniformly distributed on [0, 1]. We assume that \((T_1,T_2)^\top \) follows a bivariate normal distribution with

$$\begin{aligned} \mu =\begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1 &{} 1 \\ 1 &{} 2 \end{pmatrix}. \end{aligned}$$

To account for the dependence between \(\textbf{X}\) and \(\textbf{Z}\), we assume the following relationship: \(X_1=T_1+0.5(Z_1+Z_2)\) and \(X_2=T_2+0.5(Z_1+Z_2)\).
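This covariate design can be generated as follows (an illustrative sketch; seed and sample size are arbitrary). Note that both additive components are centered, since \(E\sin (4\pi Z_1)=0\) and \(E(Z_2-0.5)^2=1/12\), so the \(-5/12\) term removes the mean of \(5(Z_2-0.5)^2\).

```python
import numpy as np

rng = np.random.default_rng(2024)  # illustrative seed
n = 200_000

Z = rng.uniform(size=(n, 2))  # Z1, Z2 independent U[0, 1]
mu = np.array([1.0, 0.0])
Sigma = np.array([[1.0, 1.0], [1.0, 2.0]])
T = rng.multivariate_normal(mu, Sigma, size=n)  # (T1, T2)

# Dependence between X and Z: X_j = T_j + 0.5 (Z1 + Z2)
X = T + 0.5 * Z.sum(axis=1, keepdims=True)

g1 = np.sin(4 * np.pi * Z[:, 0])
g2 = 5 * (Z[:, 1] - 0.5) ** 2 - 5 / 12

# Both components are centered: E g1(Z1) = 0 and E g2(Z2) = 5/12 - 5/12 = 0.
print(round(g1.mean(), 3), round(g2.mean(), 3))
```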

The simulation models are designed as follows:

  • Binary case: The response variable Y follows a Bernoulli distribution

    $$\begin{aligned} P(Y=1|\textbf{X},\textbf{Z}) = \hbox {expit}\{\beta _0+\beta _1X_1+\beta _2X_2+g_1(Z_1)+g_2(Z_2)\}, \end{aligned}$$

    where \((\beta _0,\beta _1,\beta _2)^\top =(-1,1,-1)^\top .\) The missing data mechanism model is of a logistic form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= \hbox {expit}(\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1), \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(0.2,1.8,0.2,-0.2)^\top ,\) or a probit form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= \Phi (\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1), \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(0.5,1.3,-0.2,0.2)^\top ,\) or a complementary log-log form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= 1-\exp \{-\exp (\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1)\}, \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(0.3,0.9,-0.2,0.2)^\top .\) These three mechanisms yield complete-data percentages of about \(90.1\%\), \(90.8\%\), and \(89.4\%\), respectively.

  • Quasi-Poisson case: Y follows a quasi-Poisson model with conditional expectation

    $$\begin{aligned} E(Y|\textbf{X},\textbf{Z}) = \hbox {exp}\{\beta _0+\beta _1X_1+\beta _2X_2+g_1(Z_1)+g_2(Z_2)\}, \end{aligned}$$

    where \((\beta _0,\beta _1,\beta _2)^\top =(0.5,-0.5,0.5)^\top \) and dispersion parameter \(\phi =1.5\). The missing data mechanism model is of a logistic form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= \hbox {expit}(\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1), \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(0.3,1.5,0.2,0.2)^\top ,\) or a probit form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= \Phi (\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1), \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(0.15,0.8,0.2,-0.2)^\top ,\) or a complementary log-log form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= 1-\exp \{-\exp (\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1)\}, \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(0.3,0.8, -0.1, -0.1)^\top .\) These three mechanisms yield complete-data percentages of about \(90.2\%\), \(88.7\%\), and \(88.9\%\), respectively.

  • Truncated normal case: Y is generated according to the following model:

    $$\begin{aligned} Y\sim TN(\beta _0+\beta _1X_1+\beta _2X_2+g_1(Z_1)+g_2(Z_2),\sigma ^2,\mu (\textbf{X},\textbf{Z})-c,\mu (\textbf{X},\textbf{Z})+c), \end{aligned}$$

    where \(\mu (\textbf{X},\textbf{Z})=\beta _0+\beta _1X_1+\beta _2X_2+g_1(Z_1)+g_2(Z_2)\). The parameter vector is set at \((\beta _0,\beta _1,\beta _2,\sigma ^2,c)^\top =(1,2,2,1,2)^\top .\) The indicator variable r is generated from a Bernoulli distribution with success probability specified in a logistic form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= \hbox {expit}(\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1), \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(-0.5,2,1.5,0.5)^\top ,\) or a probit form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= \Phi (\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1), \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(-0.2,1,1,-0.6)^\top ,\) or a complementary log-log form

    $$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= 1-\exp \{-\exp (\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1)\}, \end{aligned}$$

    with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(-0.2,1,0.7,-0.5)^\top .\) These three mechanisms yield complete-data percentages of about \(82.6\%\), \(85.7\%\), and \(86.3\%\), respectively.
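As a sketch, the binary case and its three missingness mechanisms can be generated as follows (the seed and sample size are arbitrary choices, not the paper's settings); the observed complete-data rates should be close to the percentages reported above.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(7)  # illustrative seed
n = 200_000

# Covariate design described earlier in this section.
Z = rng.uniform(size=(n, 2))
T = rng.multivariate_normal([1.0, 0.0], [[1.0, 1.0], [1.0, 2.0]], size=n)
X = T + 0.5 * Z.sum(axis=1, keepdims=True)
g = np.sin(4 * np.pi * Z[:, 0]) + 5 * (Z[:, 1] - 0.5) ** 2 - 5 / 12

expit = lambda t: 1 / (1 + np.exp(-t))
Phi = np.vectorize(lambda t: 0.5 * (1 + erf(t / np.sqrt(2))))  # probit link

# Binary response with the stated coefficients.
Y = rng.binomial(1, expit(-1 + X[:, 0] - X[:, 1] + g))

# The three missingness mechanisms, each depending on the response Y.
lin = lambda a, t0, t1, t2: a * Y + t0 + t1 * X[:, 0] + t2 * Z[:, 0]
pis = [expit(lin(0.2, 1.8, 0.2, -0.2)),                # logistic
       Phi(lin(0.5, 1.3, -0.2, 0.2)),                  # probit
       1 - np.exp(-np.exp(lin(0.3, 0.9, -0.2, 0.2)))]  # complementary log-log

rates = [rng.binomial(1, pi).mean() for pi in pis]
print([round(v, 3) for v in rates])  # each close to the ~90% reported above
```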

The simulation study was conducted with different sample sizes: \(n=2000\) for the binary case, \(n=1000\) for the quasi-Poisson case, and \(n=500\) for the truncated normal case. We use three different sample sizes because, when the population distribution is highly imbalanced, as in the binary scenario, a substantial sample size may be needed for the central limit theorem to take effect and produce sampling distributions that approximate a normal distribution. In general, binary and skewed multi-category discrete predictors demand larger sample sizes than normally distributed continuous predictors (Olvera Astivia et al. 2019).

The number of knots \(N_n\) was determined automatically using the R package mgcv. The proposed estimators were implemented in R using the iteration algorithm described in Sect. 3. The simulation results based on 1000 runs are summarized in Tables 1, 2, and 3. These tables present the bias, standard deviation (SD), approximate 95% confidence intervals (CI), and coverage rate (CR) of the estimated parameters. The confidence intervals were constructed using the formula “estimator ± 1.96SE,” where SE is the square root of the diagonal elements of the matrix \(\hat{A}^{-1}(\hat{\varvec{\delta }})\hat{B}(\hat{\varvec{\delta }}){\hat{A}^{-1}(\hat{\varvec{\delta }})}^\top \) or \(\hat{D}^{-1}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})\hat{\Sigma }(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g}){\hat{D}^{-1}(\hat{\varvec{\delta }},\hat{\varvec{\beta }},\hat{g})}^\top \), as appropriate. Figures 1, 2, and 3 depict the mean of the fitted nonparametric functions and the approximate 95% confidence bands (CB). Overall, in the three examples with different missing data mechanism models, both the parameter estimators and the nonparametric function estimators perform well. As expected, estimators under the same response model with different missing data mechanisms show similar bias and variance because the missing rates are similar. The estimators in the truncated normal case exhibit the highest bias and variance, owing to its smallest sample size and highest missing rate, compared to the binary and quasi-Poisson cases. For the binary case under the missingness mechanism of logistic form, the estimate of \(\theta _2\) has a slightly lower coverage rate.
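As a toy illustration of how these Monte Carlo summaries are computed, the sketch below estimates a simple normal mean over replications and reports the bias, SD, and the coverage rate of the “estimator ± 1.96SE” interval; it is a generic illustration of the summary calculations, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(11)
true_mu, n, reps = 2.0, 100, 1000

est = np.empty(reps)
se = np.empty(reps)
for b in range(reps):
    x = rng.normal(loc=true_mu, scale=1.0, size=n)
    est[b] = x.mean()
    se[b] = x.std(ddof=1) / np.sqrt(n)  # plug-in standard error

bias = est.mean() - true_mu
sd = est.std(ddof=1)
covered = (est - 1.96 * se <= true_mu) & (true_mu <= est + 1.96 * se)
cr = covered.mean()  # coverage rate of "estimator +/- 1.96 SE"
print(round(bias, 4), round(sd, 4), round(cr, 3))
```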

In Du et al. (2023), the analysis utilizes data with nonresponse together with a parametric distribution \(p(y|\textbf{x},\textbf{z})\) belonging to the exponential family, and the optimal estimator is obtained by maximizing the observed likelihood. In this paper, we consider a nonparametric \(p(y|\textbf{x},\textbf{z})\) and construct estimating equations for the mean of the response based on inverse probability weighting. Our method therefore expands the scope of applicability of these models. Moreover, with large samples, even if \(p(y|\textbf{x},\textbf{z})\) belongs to the exponential family, the difference between the two methods almost disappears.

In order to assess the stability of the proposed inference method, we consider scenarios where the missingness type or the missingness mechanism model is misspecified. We assume that the response variable Y is generated according to

$$\begin{aligned} Y\sim N(\beta _0+\beta _1X_1+\beta _2X_2+g_1(Z_1)+g_2(Z_2),\sigma ^2), \end{aligned}$$

where \((\beta _0,\beta _1,\beta _2,\sigma ^2)^\top =(1,2,2,1)^\top .\) The indicator variable r is generated from a Bernoulli distribution with success probability specified in a logistic form

$$\begin{aligned} \pi (Y,\textbf{X},\textbf{Z};\alpha ,\varvec{\theta })= \hbox {expit}(\alpha Y+\theta _0+\theta _1X_1+\theta _2Z_1), \end{aligned}$$

with \((\alpha ,\theta _0,\theta _1,\theta _2)^\top =(-0.5,2,1.5,0.5)^\top \), and the percentage of complete data is approximately 82.4%. Here we mainly consider cases where the missingness type is misspecified as missing completely at random or missing at random, and where the missingness mechanism is misspecified as the probit form. The sample size is 500, and based on 1000 simulation runs, Table 4 presents the bias, standard deviation (SD), and approximate 95% confidence intervals (CI) of the parameters with coverage rate (CR). Figure 4 shows the mean of the fitted nonparametric functions and the approximate 95% confidence bands (CB). The proposed method performs poorly when complete data are assumed or a missing-at-random mechanism is used, and one of these two misspecifications fails completely. However, when the missingness mechanism is misspecified as a probit form, the estimators of both the parameters and the nonparametric functions are less affected. The reason is that the probit and logistic models behave very similarly, and in this case misspecification of the response model is not a serious problem (Morikawa and Kim 2021). Additionally, we observed that the probit and logistic models yielded almost identical outcomes across the three types of response models.

Estimation is not possible when the parameters are non-identifiable. Although we provide simulation results for such a setting below, the estimators are difficult to compute because they fluctuate between distinct solutions.

Table 1 Bias \(\times 10^2\), standard deviation (SD) \(\times 10^2\), confidence interval (CI) and coverage rate (CR) of \(\varvec{\beta }\) and \(\varvec{\delta }\) using three link functions for Binary case
Fig. 1

The left graph is for Logistic, the middle graph is for Probit and the right graph is for Clog-log

Table 2 Bias \(\times 10^2\), standard deviation (SD) \(\times 10^2\), confidence interval (CI) and coverage rate (CR) of \(\varvec{\beta }\) and \(\varvec{\delta }\) using three link functions for quasi-Poisson case
Fig. 2

The left graph is for Logistic, the middle graph is for Probit and the right graph is for Clog-log

Table 3 Bias \(\times 10^2\), standard deviation (SD) \(\times 10^2\), confidence interval (CI) and coverage rate (CR) of \(\varvec{\beta }\) and \(\varvec{\delta }\) using three link functions for truncated normal case
Fig. 3

The left graph is for Logistic, the middle graph is for Probit and the right graph is for Clog-log

Table 4 Bias \(\times 10^2\), standard deviation (SD) \(\times 10^2\), confidence interval (CI) and coverage rate (CR) of \(\varvec{\beta }\) and \(\phi \) for misspecification
Fig. 4

The left graph is based on complete data, the middle graph on missing at random and the right graph on the probit missing data mechanism

The missingness mechanism is assumed to follow a logistic form, that is, \(h(\alpha y +\theta _0+\theta _1 z_1)=\log \{\hbox {expit} (\alpha y + \theta _0+\theta _1z_1)\},\) and the response density is \(p(y|\textbf{x},\textbf{z};\varvec{\beta },g,\phi )=\exp [-y/\{g_1(z_1)+\beta _0\}]/\{g_1(z_1)+\beta _0\}\) with \(g_1(z_1)+\beta _0>0\). The condition given in (3) reduces to

$$\begin{aligned}{} & {} \log \{\hbox {expit}(\alpha y + \theta _0+\theta _1z_1)\}-y/\{g_1(z_1)+\beta _0\}-\log \{g_1(z_1)+\beta _0\}\\{} & {} \quad =\log \{\hbox {expit}(\alpha ^* y + \theta _0^*+\theta _1^*z_1)\}-y/\{g^*_1(z_1)+\beta ^*_0\}-\log \{g^*_1(z_1)+\beta ^*_0\}. \end{aligned}$$

For example, we can take

$$\begin{aligned} (\alpha ,\theta _0,\theta _1,\beta _0,g_1(z_1))^\top&= (1,-1,-1,1-e^{-1},e^{-1}-e^{-z_1-1})^\top ,\\ (\alpha ^*,\theta _0^*,\theta _1^*,\beta _0^*,g_1^*(z_1))^\top&= (-1,1,1,e-1,e^{z_1+1}-e)^\top , \end{aligned}$$

which satisfy the above equality. Since

$$\begin{aligned} (\alpha ,\theta _0,\theta _1,\beta _0,g_1(z))^\top \ne (\alpha ^*,\theta _0^*,\theta _1^*,\beta _0^*,g_1^*(z))^\top , \end{aligned}$$

this model is not identifiable. We generated data from the non-identifiable model with a sample size of 1000. Based on 100 simulation runs, Fig. 5 illustrates how the estimators of \(\beta _0\), \(\alpha \), \(\theta _0\), and \(\theta _1\) fluctuate between the two sets of values.
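The claimed equality of the two observed-data log densities can be checked numerically: the sketch below evaluates \(\log \hbox {expit}(\alpha y+\theta _0+\theta _1 z_1)-y/\{g_1(z_1)+\beta _0\}-\log \{g_1(z_1)+\beta _0\}\) at both parameter sets on a grid of \((y,z_1)\) values (the grid itself is arbitrary).

```python
import numpy as np

expit = lambda t: 1 / (1 + np.exp(-t))

def log_obs(y, z, alpha, th0, th1, beta0, g1):
    # log expit(alpha*y + th0 + th1*z) - y/{g1(z)+beta0} - log{g1(z)+beta0}
    s = g1(z) + beta0
    return np.log(expit(alpha * y + th0 + th1 * z)) - y / s - np.log(s)

y = np.linspace(0.0, 5.0, 7)[:, None]  # grid over the response
z = np.linspace(0.0, 1.0, 5)[None, :]  # grid over z1

lhs = log_obs(y, z, 1, -1, -1, 1 - np.exp(-1),
              lambda t: np.exp(-1) - np.exp(-t - 1))
rhs = log_obs(y, z, -1, 1, 1, np.e - 1,
              lambda t: np.exp(t + 1) - np.e)
print(np.max(np.abs(lhs - rhs)))  # zero up to floating-point error
```

The two logit terms differ by exactly \(y-1-z_1\), and the two exponential-density terms differ by \(-(y-1-z_1)\), so the identity holds for all \(y\) and \(z_1\).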

Fig. 5

The estimators of \(\beta _0,\alpha ,\theta _0\) and \(\theta _1\)

5 Real data analysis

The CHIP survey (2013) aims to measure the distribution of personal income and related economic factors in rural, migrant, and urban areas of China (Sicular et al. 2020). The survey includes data from cities and towns in fifteen provinces, which are representative of different regions in the country. These provinces include Liaoning, Shanxi, Jiangsu, Shandong, Guangdong, Anhui, Henan, Sichuan, Hunan, Hubei, Gansu, Xinjiang, Yunnan, Beijing, and Chongqing. The selected provinces represent the north, eastern coastal areas, interior regions, and western regions of China.

In this study, the analysis focuses on urban data, which consists of a sample of 12,233 individuals. The percentage of missingness in the data is 22.4%. Instead of assuming a linear relationship between work experience and the log of income, a smooth function is used, similar to the Mincer earnings function. The model is specified as follows:

$$\begin{aligned} \log \hbox {E} = \beta _0 + \beta _1 \hbox {S} + g_1(\hbox {Exper}) + \varepsilon . \end{aligned}$$
(14)

In this model, the logarithm of earnings \(\log \hbox {E}\) is related to years of schooling S and work experience (\(\text {Exper}\)), which is calculated as \(\text {age} - \hbox {S} - 6\). The relationship is subject to an unobserved random error (\(\varepsilon \)) with variance \(\phi \). Without considering the cost of education, the rates of return to schooling can be calculated as

$$\begin{aligned} \partial \log \hbox {E}/ \partial \hbox {S} = \beta _1. \end{aligned}$$

The missing data mechanism is modeled using the following model:

$$\begin{aligned}{} & {} P(r=1|\log \hbox {E}, \log \hbox {S},\log \hbox {Exper})\nonumber \\{} & {} \qquad \qquad = \hbox {expit}(\alpha \log \hbox {E} + \theta _0+\theta _1 \log \hbox {S} + \theta _2 \log \hbox {Exper}). \end{aligned}$$
(15)

To account for the large values and fluctuations in schooling and experience, we replace \(\hbox {S}\) and \(\hbox {Exper}\) with \(\log \hbox {S}\) and \(\log \hbox {Exper}\), where \(\log \hbox {S}\) denotes the logarithm of \(1+\) years of education (since some individuals have no education) and \(\log \hbox {Exper}\) denotes the logarithm of work experience.

Table 5 Estimate and standard deviation (SD) for the parameters of (14) under the nonignorable missingness (NIM) and Missing at random (MAR)
Fig. 6

The left graph is obtained under the nonignorable missingness and the right graph is under missing at random

Table 5 presents parameter estimates for models (14) and (15) under nonignorable missingness and under missing at random. The results show that under the nonignorable missingness mechanism, \(\log \)-income has a significant negative effect on the probability of missingness. Moreover, the rate of return to schooling in China is 8.30% under the nonignorable missingness assumption and 9.67% under the assumption of missing at random. These estimates are consistent with the existing literature (Gao and Smyth 2015; Kang and Peng 2012), which suggests that the returns to education in China range from 8% to 10%. It is worth noting that the rates of return to schooling in China have stagnated or even declined after 2005 due to factors such as education expansion and labor mobility (Cai and Wang 2010). The rate of return to schooling under nonignorable missingness is smaller than that under missing at random, which suggests that modeling the missingness mechanism is necessary. Figure 6 illustrates the functional relationship between \(\log \)-income and work experience, revealing an inverted U-shaped pattern. This finding is consistent with the classic hypothesis of the Mincer earnings equation, which suggests that there is an optimal level of work experience that maximizes income.

6 Conclusions

In this study, semiparametric estimators have been developed specifically for handling nonignorable missing data. These estimators accommodate the logistic, probit, and complementary log-log models, which are commonly used to characterize the missing data mechanism. The instrumental variable assumption ensures identifiability without requiring additional assumptions, giving it an advantage over the assumptions proposed by Tchetgen Tchetgen and Wirth (2017) and Sun et al. (2018). If the score function \(S_\mu \{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}\) can be written as \(a\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}Y+b\{\textbf{X},\textbf{Z};\mu ,\upsilon (\cdot )\}\), as is the case when the response distribution belongs to an exponential family, identifiability can be achieved without relying on instrumental variables. By employing the kernel method and the spline method, we extend the generalized linear regression model to the generalized additive partial linear model when the distribution of the response variable is unknown.

The approaches of Cui and Zhou (2017) and Morikawa et al. (2017) for estimating the missingness mechanism model become invalid when dealing with numerous covariates, owing to their lack of dimensionality reduction. In this paper, we utilize dimension reduction techniques so that only univariate kernel estimation, which is readily computable, is required. While univariate kernel estimation may compromise estimator efficiency, it significantly reduces computational complexity. To enhance estimation efficiency, one can employ the estimation method based on the effective score introduced by Morikawa and Kim (2021). Moreover, under the assumption that \(p(y|\textbf{x},\textbf{z})\) belongs to the exponential family, the optimal estimator can be derived by maximizing the observed likelihood (Morikawa and Kim 2021).

There are many directions worthy of further research. A possible extension in this research area involves transforming the identifiability of the observation likelihood into the identifiability of the parameters of interest, such as mean functionals (Li et al. 2021). Indeed, the development of doubly robust estimation methods and efficient estimation techniques for nonignorable missing data is a crucial research area. Furthermore, incorporating more sophisticated structures in the missing mechanism model is another promising research direction.

7 Supplementary Information

Supplementary material is available online at Statistical Papers.