1 INTRODUCTION

Although Poisson and binomial models are the most widely used tools for modeling count data, zero-inflated count data are increasingly common in fields such as economics, biomedical studies, criminology, insurance, sociology, and political science. When the number of observed zeros is greater than that predicted by standard count distributions, zero-inflated (ZI) regression models are an alternative for modeling such data. For more information on ZI regression models, see Lambert [1], Diallo et al. [4, 7], Kouakou et al. [27], and Ali et al. [28]. The zero-inflated Poisson (ZIP) distribution proposed by Lambert [1] has gained popularity, and ZIP regression models have been used successfully in a variety of important applications; see, for example, Dietz et al. [2], Yau et al. [3], and Cheung et al. [5].

However, the ZIP distribution has two sets of regression parameters, one for the probability of being an excess zero and the other for the Poisson mean. These parameters have latent class interpretations: the latent classes are often thought of as a not-at-risk group and an at-risk group, indicating a difference in susceptibility between the two populations. Because parameter interpretations for the entire population are often desired, Long et al. [16] introduced the marginalized zero-inflated Poisson (MZIP) regression model.

In practice, data are most often only partially observed. In this context, the basic approach is the complete-case method, which consists in removing the individuals who have at least one missing value. This method is simple to implement. However, when the proportion of individuals with missing data is higher than 5\(\%\), this method gives poor results. Two alternatives to the complete-case method for the treatment of missing data are the Monte Carlo EM algorithm (MCEM) and multiple imputation (MI). The MCEM and MI methods are efficient but computationally demanding. Finally, the inverse probability weighting (IPW) method, which is often used, requires finding the right model for the selection probability; see, for example, Diallo et al. [7] and Benecha et al. [18]. To circumvent these modeling difficulties while avoiding computationally intensive procedures, Lukusa et al. [8, 9] proposed weighted semiparametric estimators that are suitable when the selection probabilities are expressed in terms of covariates of the same type. However, there is little work on the estimation of the MZIP model in the context of missing data. This work aims to fill this gap. In this article, we propose a semiparametric approach in which the selection probability, a function of continuous, discrete, and categorical covariates, is estimated nonparametrically. The approach consists in discretizing the continuous covariates with Jenks's method to obtain categorical covariates.

The rest of the paper is organized as follows. In Section 2, we present the MZIP regression model and its maximum likelihood estimator. In Section 3, we present the SIPWK and SIPW estimation methods for the MZIP model when the covariates are missing at random (MAR), and we establish the consistency and asymptotic normality of the SIPW estimator. The performance of the estimators is evaluated in Section 4. As an illustration, we apply these methods to real data in Section 5. A discussion and some perspectives are presented in Section 6. The technical proofs are given in the Appendix.

2 MARGINALIZED ZIP MODELS

The ZIP distribution is used to model the count variable of interest \(Y_{i}\), \(i=1,\ldots,n\). With probability \(1-\psi_{i}\), \(Y_{i}\) is drawn from a Poisson distribution with mean \(\mu_{i}\); with probability \(\psi_{i}\), it is an excess zero. The marginal mean of \(Y_{i}\) is therefore \(\nu_{i}=(1-\psi_{i})\mu_{i}\). For example, in dental caries research, the marginal mean caries count \(\nu_{i}\) is often of more interest than the mean caries count \(\mu_{i}\) of a susceptible latent group of individuals; see Preisser [17].

Because parameter interpretations for the entire population are desired, the marginal mean \(\nu_{i}\) can be modeled directly to give overall exposure effect estimates. Since \(\mu_{i}=\nu_{i}/(1-\psi_{i})\), the MZIP distribution can be written as

$$\mathbb{P}(Y_{i}=k)=\begin{cases}\psi_{i}+(1-\psi_{i})\exp(-\nu_{i}/(1-\psi_{i})),\quad k=0\\ \displaystyle(1-\psi_{i})\frac{\exp(-\nu_{i}/(1-\psi_{i}))[\nu_{i}/(1-\psi_{i})]^{k}}{k!},\quad k>0.\end{cases}$$
(2.1)
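As a quick numerical illustration of (2.1), the following minimal Python sketch evaluates the MZIP probability mass function and checks that it sums to one and has mean \(\nu_{i}\). The function name `mzip_pmf` and the values of `nu` and `psi` are illustrative assumptions, not part of the original study.

```python
import math

def mzip_pmf(k, nu, psi):
    """P(Y = k) from (2.1), with marginal mean nu and excess-zero probability psi."""
    mu = nu / (1.0 - psi)  # mean of the Poisson component
    # Poisson(mu) mass at k, computed on the log scale for stability
    poisson = math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    if k == 0:
        return psi + (1.0 - psi) * poisson
    return (1.0 - psi) * poisson

# Illustrative (hypothetical) parameter values
nu, psi = 1.5, 0.4
probs = [mzip_pmf(k, nu, psi) for k in range(200)]  # tail beyond 200 is negligible here
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
print(round(total, 6), round(mean, 6))
```

The computed mean equals `nu` rather than the latent-class mean `mu`, which is the property that lets the regression parameters in the MZIP model be read as effects on the whole population.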

In the MZIP model, Long et al. [16] link regression parameters directly to the marginal mean \(\nu_{i}\), while employing another set of parameters to model the probability of being an excess zero (i.e., \(\psi_{i}\)). The parameters \(\nu_{i}\) and \(\psi_{i}\) of the MZIP model are modeled by

$$\textrm{logit}(\psi_{i})=\mathbf{Z}_{i}^{T}\gamma\quad\textrm{and}\quad\textrm{log}(\nu_{i})=\mathbf{X}_{i}^{T}\alpha,$$
(2.2)

where \(\gamma=(\gamma_{1},\gamma_{2},\ldots,\gamma_{q})^{T}\) is a \((q\times 1)\) vector of zero-inflation regression parameters having the same interpretation as in the ZIP model, \(\alpha=(\alpha_{1},\alpha_{2},\ldots,\alpha_{p})^{T}\) is a \((p\times 1)\) vector of regression parameters for \(\nu_{i}\), interpreted as log-incidence density ratios (IDR) for the entire sample population, and \(\mathbf{X}_{i_{(p\times 1)}}\) and \(\mathbf{Z}_{i_{(q\times 1)}}\) denote the vectors of covariates for the \(i\)th individual. Let \(\theta=(\gamma^{T},\alpha^{T})^{T}\). Suppose that we observe a sample of \(n\) independent copies \((Y_{1},\mathbf{X}_{1},\mathbf{Z}_{1})\), \((Y_{2},\mathbf{X}_{2},\mathbf{Z}_{2}),\ldots,(Y_{n},\mathbf{X}_{n},\mathbf{Z}_{n})\) of \((Y,\mathbf{X},\mathbf{Z})\). Then, the log-likelihood of \(\theta\) is

$$l_{n}(\theta)=\sum_{i=1}^{n}\left\{-\textrm{log}(1+e^{\mathbf{Z}^{T}_{i}\gamma})+J_{i}\textrm{log}\left(e^{\mathbf{Z}^{T}_{i}\gamma}+e^{-(1+\textrm{exp}(\mathbf{Z}^{T}_{i}\gamma))\textrm{exp}(\mathbf{X}^{T}_{i}\alpha)}\right)\right\}$$
$${}+\sum_{i=1}^{n}(1-J_{i})\left(-(1+e^{\mathbf{Z}^{T}_{i}\gamma})e^{\mathbf{X}^{T}_{i}\alpha}+Y_{i}\textrm{log}(1+e^{\mathbf{Z}^{T}_{i}\gamma})+\mathbf{X}^{T}_{i}\alpha Y_{i}-\textrm{log}(Y_{i}!)\right),$$

where \(J_{i}=1_{\{Y_{i}=0\}}\). The maximum likelihood estimator \(\hat{\theta}^{F}_{n}=(\hat{\gamma}^{T}_{n},\hat{\alpha}^{T}_{n})^{T}\) of \(\theta\) is the solution of the equation \(U_{F,n}(\theta)=0\), with

$$U_{F,n}(\theta)=\frac{1}{\sqrt{n}}\frac{\partial l_{n}(\theta)}{\partial\theta}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial l_{i}(\theta)}{\partial\theta}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\dot{l_{i}}(\theta)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\mathbf{Z}_{i}B_{i}(\theta),\mathbf{X}_{i}A_{i}(\theta))^{T},$$
(2.3)

where

$$A_{i}(\theta)=(Y_{i}-e^{\mathbf{X}^{T}_{i}\alpha}(1+e^{\mathbf{Z}^{T}_{i}\gamma}))(1-J_{i})-\frac{e^{\mathbf{X}^{T}_{i}\alpha}(1+e^{\mathbf{Z}^{T}_{i}\gamma})J_{i}}{e^{\mathbf{Z}^{T}_{i}\gamma+h_{i}(\theta)}+1},$$
$$B_{i}(\theta)=\frac{J_{i}e^{\mathbf{Z}^{T}_{i}\gamma}\left(e^{h_{i}(\theta)}-e^{\mathbf{X}^{T}_{i}\alpha}\right)}{e^{\mathbf{Z}^{T}_{i}\gamma+h_{i}(\theta)}+1}+\frac{e^{\mathbf{Z}^{T}_{i}\gamma}(Y_{i}-1)}{1+e^{\mathbf{Z}^{T}_{i}\gamma}}-(1-J_{i})e^{\mathbf{X}^{T}_{i}\alpha+\mathbf{Z}^{T}_{i}\gamma},$$

and

$$h_{i}(\theta)=(1+\textrm{exp}(\mathbf{Z}^{T}_{i}\gamma))\textrm{exp}(\mathbf{X}^{T}_{i}\alpha).$$
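The expressions for \(A_{i}(\theta)\), \(B_{i}(\theta)\), and \(h_{i}(\theta)\) can be checked numerically. The sketch below codes the per-observation log-likelihood and score and compares the score with a central finite-difference gradient. The function names, covariate values, and parameter values are hypothetical choices made only for this illustration.

```python
import math

def _linear(v, coef):
    return sum(a * b for a, b in zip(v, coef))

def loglik_i(y, x, z, gamma, alpha):
    """Contribution l_i(theta) of one observation to the MZIP log-likelihood."""
    zg, xa = _linear(z, gamma), _linear(x, alpha)
    h = (1.0 + math.exp(zg)) * math.exp(xa)   # h_i(theta), i.e. the Poisson mean mu_i
    base = -math.log1p(math.exp(zg))          # log(1 - psi_i)
    if y == 0:
        return base + math.log(math.exp(zg) + math.exp(-h))
    return base - h + y * math.log1p(math.exp(zg)) + xa * y - math.lgamma(y + 1)

def score_i(y, x, z, gamma, alpha):
    """Score (Z_i B_i, X_i A_i), ordered like theta = (gamma, alpha)."""
    zg, xa = _linear(z, gamma), _linear(x, alpha)
    ez, ex = math.exp(zg), math.exp(xa)
    h = (1.0 + ez) * ex
    j = 1.0 if y == 0 else 0.0
    denom = math.exp(zg + h) + 1.0
    a = (y - h) * (1.0 - j) - h * j / denom
    b = (j * ez * (math.exp(h) - ex) / denom
         + ez * (y - 1.0) / (1.0 + ez)
         - (1.0 - j) * ex * ez)
    return [zk * b for zk in z] + [xk * a for xk in x]

def numeric_score(y, x, z, gamma, alpha, eps=1e-6):
    """Central finite-difference gradient of loglik_i with respect to theta."""
    theta, q = list(gamma) + list(alpha), len(gamma)
    grad = []
    for k in range(len(theta)):
        up, down = theta[:], theta[:]
        up[k] += eps
        down[k] -= eps
        f_up = loglik_i(y, x, z, up[:q], up[q:])
        f_down = loglik_i(y, x, z, down[:q], down[q:])
        grad.append((f_up - f_down) / (2.0 * eps))
    return grad

# Hypothetical covariates and parameters, used only for the check
x, z = [1.0, 0.3], [1.0, -0.5, 2.0]
gamma, alpha = [-1.0, 0.4, 0.3], [0.2, -0.7]
max_err = max(
    abs(s - g)
    for y in (0, 3)
    for s, g in zip(score_i(y, x, z, gamma, alpha),
                    numeric_score(y, x, z, gamma, alpha))
)
print(max_err < 1e-6)
```

The check covers one zero and one positive count, so both branches of the indicator \(J_{i}\) are exercised.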

3 ESTIMATING PARAMETERS WITH MISSING COVARIATES

Let \(\mathbf{X}\) and \(\mathbf{Z}\) be the vectors of covariates, some components of which may be missing, and let \(Y\) be always observed. Let \(\Delta_{i}\) be an indicator variable equal to \(1\) when \(\{\mathbf{Z}_{i},\mathbf{X}_{i}\}\) is completely observed and \(0\) otherwise; see Rubin [12] for details. We consider mixed covariates (continuous, discrete, and categorical). Let \(\mathbf{V}=(Y,\mathbf{S}^{D},\mathbf{S}^{C})^{T}\), where \(\mathbf{S}^{D}=(\mathbf{X}^{D(\textrm{obs}),T},\mathbf{Z}^{D(\textrm{obs}),T})\) denotes the vector of discrete variables that are always observed for each individual, \(\mathbf{S}^{C}=(\mathbf{X}^{C(\textrm{obs}),T},\mathbf{Z}^{C(\textrm{obs}),T})\) denotes the vector of continuous variables that are always observed for each individual, and \(\{\mathbf{X}^{(\textrm{miss}),T},\mathbf{Z}^{(\textrm{miss}),T}\}\) denotes the missing components of \(\{\mathbf{X},\mathbf{Z}\}\). Under the MAR mechanism, define the selection probability

$$\pi(\mathbf{V}_{i})=\mathbb{P}(\Delta_{i}=1|Y_{i},\mathbf{X}_{i},\mathbf{Z}_{i})=\mathbb{P}(\Delta_{i}=1|\mathbf{V}_{i}).$$

3.1 Kernel-Based Weighting Estimator of a MZIP Model

Let \(\mathbf{D}=(\mathbf{X}^{(\textrm{obs}),T},\mathbf{Z}^{(\textrm{obs}),T})\) and let \(d\in\{d_{1},d_{2},\ldots,d_{m}\}\) denote the distinct values of \(\mathbf{D}\). We consider a Nadaraya–Watson (N-W) [22, 24] type estimator \(\hat{\pi}(y,d)\) of \(\pi(y,d)\) defined by

$$\hat{\pi}(y,d)=\frac{\sum_{k=1}^{n}\Delta_{k}K_{h}(Y_{k}=y,\mathbf{D}_{k}-d)}{\sum_{i=1}^{n}K_{h}(Y_{i}=y,\mathbf{D}_{i}-d)},$$

where \(K_{h}\) is a kernel function and \(h\) is a bandwidth satisfying some conditions stated in Wang [23]. The resulting semiparametric kernel-assisted weighting (SIPWK) estimator \(\hat{\theta}_{n}^{wsk}\) of \(\theta\) in models (2.1) and (2.2) is the solution of the equation

$$U_{w,n}(\theta,\hat{\pi})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\Delta_{i}}{\hat{\pi}(Y_{i},\mathbf{D}_{i})}\dot{l_{i}}(\theta)=0.$$
(3.1)
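The kernel estimator \(\hat{\pi}(y,d)\) used in (3.1) can be sketched as follows, assuming a product kernel made of an indicator (Dirac) kernel on \(Y\) and on one discrete covariate and a Gaussian kernel on one continuous covariate. The data layout, the function names, and the bandwidth value are assumptions made for this illustration; the bandwidth conditions of Wang [23] are not checked here.

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h, evaluated at distance u."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def nw_selection_prob(y, d_disc, d_cont, data, h):
    """N-W type estimate of pi(y, d).

    data is a list of tuples (delta_k, y_k, disc_k, cont_k); this layout is
    a hypothetical choice for the sketch.
    """
    num = den = 0.0
    for delta, yk, disc, cont in data:
        if yk != y or disc != d_disc:
            continue  # Dirac kernel: only exact matches on Y and the discrete part
        w = gaussian_kernel(cont - d_cont, h)
        num += delta * w
        den += w
    return num / den if den > 0.0 else float("nan")

# Tiny made-up data set: (delta, y, discrete covariate, continuous covariate)
data = [(1, 0, 1, 0.0), (0, 0, 1, 0.0), (1, 0, 1, 5.0), (1, 2, 1, 0.0)]
p_hat = nw_selection_prob(0, 1, 0.0, data, h=0.5)
print(round(p_hat, 6))
```

In this toy example the third observation lies ten bandwidths away from the evaluation point, so it receives almost no weight and the estimate is essentially the share of complete cases among the two matching observations at that point.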

In the following section, we present another weighted semiparametric estimator for the MZIP regression model.

3.2 Semiparametric IPW (SIPW) Estimator of a MZIP Model

We recall that \(\mathbf{S}^{C}=(\mathbf{X}^{C(\textrm{obs}),T},\mathbf{Z}^{C(\textrm{obs}),T})\) is the set of observed continuous covariates. We discretize this set using Jenks's method [26], with the optimal number of classes obtained by Herbert's method [25]. Jenks's method is based on the similarity principle: it chooses class boundaries that minimize the within-class variance. This yields new categorical covariates \(\mathbf{S}^{\prime,D}\).

Let \(s_{1}^{D},s_{2}^{D},\ldots,s_{m}^{D}\) denote the distinct values of the \(\mathbf{S}_{i}^{D}\)s and let \(s_{1}^{\prime,D},s_{2}^{\prime,D},\ldots,s_{m}^{\prime,D}\) denote the distinct values of the \(\mathbf{S}^{\prime,D}\)s. The nonparametric estimator of \(\pi(y,s^{D},s^{\prime,D})\) is given by the following expression:
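To make the discretization step concrete, here is a minimal sketch of a Jenks-type natural-breaks classification for a single continuous covariate, written as a dynamic program that minimizes the total within-class sum of squares. The number of classes is passed in directly (the rule of [25] for choosing it is not reproduced), and the function names and example values are hypothetical.

```python
import bisect

def jenks_upper_bounds(values, n_classes):
    """Upper bound of each class in the partition of the sorted values that
    minimizes the total within-class sum of squared deviations."""
    x = sorted(values)
    n = len(x)
    s1, s2 = [0.0], [0.0]          # prefix sums of x and x^2
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):
        """Within-class sum of squares of x[i:j] (requires j > i)."""
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[c][j]: best total SSE when x[:j] is split into c classes
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    cut = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for c in range(1, n_classes + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + sse(i, j)
                if cand < cost[c][j]:
                    cost[c][j], cut[c][j] = cand, i
    # Backtrack from the last class to recover the class upper bounds
    bounds, j = [], n
    for c in range(n_classes, 0, -1):
        bounds.append(x[j - 1])
        j = cut[c][j]
    return bounds[::-1]

def to_class(v, bounds):
    """Index of the class containing v (values above the last bound go to the last class)."""
    return min(bisect.bisect_left(bounds, v), len(bounds) - 1)

# Made-up values of a continuous covariate
values = [6.6, 6.7, 6.9, 7.4, 7.5, 7.6, 8.8, 9.0]
bounds = jenks_upper_bounds(values, 3)
labels = [to_class(v, bounds) for v in values]
print(bounds, labels)
```

This exhaustive dynamic program is quadratic in the sample size for each class, which is acceptable for a sketch; production implementations of natural breaks typically use faster variants.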

$$\hat{\pi}(y,s^{D},s^{\prime,D})=\frac{\sum_{k=1}^{n}\Delta_{k}I(Y_{k}=y,\mathbf{S}^{D}_{k}=s^{D},\mathbf{S}^{\prime,D}_{k}=s^{\prime,D})}{\sum_{i=1}^{n}I(Y_{i}=y,\mathbf{S}^{D}_{i}=s^{D},\mathbf{S}^{\prime,D}_{i}=s^{\prime,D})},$$

where \(y=0,1,2,\ldots\), \(s^{D}\in\{s_{1}^{D},s_{2}^{D},\ldots,s_{m}^{D}\}\) and \(s^{\prime,D}\in\{s_{1}^{\prime,D},s_{2}^{\prime,D},\ldots,s_{m}^{\prime,D}\}\).
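The estimator \(\hat{\pi}(y,s^{D},s^{\prime,D})\) is simply the proportion of complete cases within each cell defined by \((Y,\mathbf{S}^{D},\mathbf{S}^{\prime,D})\). A minimal sketch follows, with scalar discrete covariates and made-up data; the function name and the data are assumptions for illustration only.

```python
from collections import defaultdict

def cell_selection_probs(delta, y, s_disc, s_prime):
    """Share of complete cases (delta = 1) in each cell (y, s^D, s'^D)."""
    counts = defaultdict(lambda: [0, 0])   # cell -> [complete cases, all cases]
    for d, yy, s, sp in zip(delta, y, s_disc, s_prime):
        cell = (yy, s, sp)
        counts[cell][0] += d
        counts[cell][1] += 1
    return {cell: obs / tot for cell, (obs, tot) in counts.items()}

# Made-up data: six individuals
delta   = [1, 0, 1, 1, 0, 1]
y       = [0, 0, 0, 2, 2, 1]
s_disc  = [1, 1, 1, 0, 0, 1]
s_prime = [0, 0, 0, 1, 1, 2]

pi_hat = cell_selection_probs(delta, y, s_disc, s_prime)
# Inverse probability weights Delta_i / pi_hat for each individual
weights = [d / pi_hat[(yy, s, sp)]
           for d, yy, s, sp in zip(delta, y, s_disc, s_prime)]
print(pi_hat[(0, 1, 0)], sum(weights))
```

When every cell contains at least one complete case, the weights sum to the sample size, which shows how the complete cases are reweighted to stand in for the full sample.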

Thus, in this context, the SIPW estimator \(\hat{\theta}_{n}^{ws}\) of \(\theta\) in models (2.1) and (2.2) is the solution of the equation

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\Delta_{i}}{\hat{\pi}(Y_{i},\mathbf{S}^{D}_{i},\mathbf{S}^{\prime,D}_{i})}\dot{l_{i}}(\theta)=0.$$
(3.2)

We study the asymptotic properties of \(\hat{\theta}_{n}^{F}\) and \(\hat{\theta}_{n}^{ws}\) in the following section.

3.3 Asymptotic Results

To establish the asymptotic properties of \(\hat{\theta}^{F}_{n}\) and \(\hat{\theta}_{n}^{ws}\), we impose the following regularity conditions.

  • \(\mathbf{H1.}\) The true parameter value \(\theta_{0}:=(\gamma_{0}^{T},\alpha_{0}^{T})^{T}\) lies in the interior of some known compact set of \(\mathbb{R}^{q}\times\mathbb{R}^{p}\).

  • \(\mathbf{H2.}\) Let \(\textrm{supp}(\mathbf{S}^{D})\) denote the support of \(\mathbf{S}^{D}\) and \(\textrm{supp}(\mathbf{S}^{\prime,D})\) denote the support of \(\mathbf{S}^{\prime,D}\). Assume that \(\textrm{supp}(\mathbf{S}^{D})\) and \(\textrm{supp}(\mathbf{S}^{\prime,D})\) do not depend on \(\theta\). Furthermore, for any \(y=0,1,\ldots\), any \(s^{D}\in\textrm{supp}(\mathbf{S}^{D})\), and any \(s^{\prime,D}\in\textrm{supp}(\mathbf{S}^{\prime,D})\), the selection probability satisfies \(\pi(y,s^{D},s^{\prime,D})>0\).

  • \(\mathbf{H3.}\) \(\mathbb{E}\left[\frac{\dot{l_{i}}(\theta)\dot{l_{i}}(\theta)^{T}}{\pi(\mathbf{V}_{i})}\right]\) is finite and positive definite in a neighborhood of the true \(\theta\).

  • \(\mathbf{H4.}\) In a neighborhood of the true \(\theta\), the first and second derivatives of \(U_{F,n}(\theta)\) with respect to \(\theta\) exist almost surely and are uniformly bounded above by a function of \((Y,\mathbf{X},\mathbf{Z})\) whose expectation exists.

  • \(\mathbf{H5.}\) The first derivatives of \(U_{w,n}(\theta,\pi)\) with respect to \(\theta\) exist almost surely in a neighborhood of \(\theta_{0}\). Additionally, in such a neighborhood, these first derivatives are uniformly bounded above by a function of \((Y,\mathbf{X},\mathbf{Z})\) whose expectation exists.

The asymptotic properties of \(\hat{\theta}^{F}_{n}\) and \(\hat{\theta}^{ws}_{n}\) are stated in Theorems 1 and 2, respectively. Detailed proofs of Theorems 1 and 2 are given in Appendix A and Appendix B, respectively.

Before studying the asymptotic properties of the estimators, we define

$$\Sigma_{n}(\theta)=-n^{-1/2}\frac{\partial U_{F,n}}{\partial\theta^{T}}=-\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\partial^{2}l_{i}(\theta)}{\partial\theta\partial\theta^{T}}\right\}\;\ \text{and}\;\ Q_{F}(\theta_{0})=\mathbb{E}\left[\dot{l_{1}}(\theta_{0})\dot{l_{1}}(\theta_{0})^{T}\right].$$

Because each component of \(\Sigma_{n}(\theta)\) is a mean of independent and identically distributed random variables, we have \(\mathbb{E}\left[\Sigma_{n}(\theta)\right]=\mathbb{E}\left[-\frac{\partial^{2}l_{1}(\theta)}{\partial\theta\partial\theta^{T}}\right]=\Sigma(\theta).\)

Theorem 1. Assume that conditions (H1), (H2), and (H4) hold. Then \(\hat{\theta}^{F}_{n}\) converges in probability to \(\theta_{0}\), as \(n\rightarrow\infty\) and \(\sqrt{n}(\hat{\theta}_{n}^{F}-\theta_{0})\) has an asymptotic normal distribution with mean zero and covariance matrix \(\Delta_{F}\), with \(\Delta_{F}:=\Sigma(\theta_{0})^{-1}Q_{F}(\theta_{0})[\Sigma(\theta_{0})^{-1}]^{T}\), where \(Q_{F}(\theta)=\mathbb{E}\left[\dot{l_{1}}(\theta)\dot{l_{1}}(\theta)^{T}\right]\).

Since the Fisher information matrix equals the variance of the score function, we have \(\Sigma(\theta_{0})=Q_{F}(\theta_{0})\). Hence \(\Delta_{F}=\Sigma(\theta_{0})^{-1}Q_{F}(\theta_{0})[\Sigma(\theta_{0})^{-1}]^{T}=\Sigma(\theta_{0})^{-1}\).

Theorem 2. Assume that conditions (H1), (H2), and (H4) hold. Then \(\hat{\theta}^{ws}_{n}\) converges in probability to \(\theta_{0}\), as \(n\rightarrow\infty\) and \(\sqrt{n}(\hat{\theta}^{ws}_{n}-\theta_{0})\) has an asymptotic normal distribution with mean zero and covariance matrix \(\Delta_{ws}\), with \(\Delta_{ws}:=\Sigma(\theta_{0})^{-1}\{\Omega_{3}(\theta_{0},\pi)-\left[\Omega_{4}(\theta_{0},\pi)-\Omega_{5}(\theta_{0},\pi)\right]\}[\Sigma(\theta_{0})^{-1}]^{T}\), where \(\Omega_{3}(\theta_{0},\pi)=\mathbb{E}\left[\frac{\dot{l}_{i}(\theta_{0})\dot{l}_{i}(\theta_{0})^{T}}{\pi(Y_{i},\mathbf{S}^{D}_{i},\mathbf{S}^{\prime,D}_{i})}\right]\), \(\Omega_{4}(\theta_{0},\pi)=\mathbb{E}\left[\frac{\dot{l}_{i}^{*}(\theta_{0})\dot{l}_{i}^{*}(\theta_{0})^{T}}{\pi(Y_{i},\mathbf{S}^{D}_{i},\mathbf{S}^{\prime,D}_{i})}\right]\), \(\Omega_{5}(\theta_{0},\pi)=\mathbb{E}\left[\dot{l}_{i}^{*}(\theta_{0})\dot{l}_{i}^{*}(\theta_{0})^{T}\right]\), and \(\dot{l}_{i}^{*}(\theta_{0})=\mathbb{E}\left[\dot{l}_{i}(\theta_{0})|Y_{i},\mathbf{S}^{D}_{i},\mathbf{S}^{\prime,D}_{i}\right]\).

4 SIMULATION STUDY

In this section, we study the performance of the following estimators under various conditions:

  • \(\hat{\theta}^{F}_{n}\), the maximum likelihood estimator obtained by solving the equation \(U_{F,n}(\theta)=0\), where \(U_{F,n}(\theta)\) is defined in (2.3).

  • \(\hat{\theta}^{wsk}_{n}\), the SIPWK estimator obtained by solving Eq. (3.1).

  • \(\hat{\theta}^{ws}_{n}\), the SIPW estimator obtained by solving Eq. (3.2).

In this numerical study, we consider samples of size \(n=500\), \(1000\), and \(2000\), generated from the following MZIP regression model:

$$\textrm{logit}(\psi_{i})=\gamma_{1}Z_{i1}+\gamma_{2}Z_{i2}+\gamma_{3}Z_{i3}+\gamma_{4}Z_{i4},$$
$$\textrm{log}(\nu_{i})=\alpha_{1}X_{i1}+\alpha_{2}X_{i2}+\alpha_{3}X_{i3},$$
(4.1)

where \(X_{i1}=Z_{i1}=1\), \(Z_{i2}=X_{i2}\), and \(Z_{i2}\), \(Z_{i3}\), \(Z_{i4}\), and \(X_{i3}\) follow, respectively, the Gaussian distribution \(N(0,1.7)\), the Poisson distribution \(P(0.5)\), the exponential distribution \(E(1)\), and the binomial distribution \(B(1,0.5)\). The regression parameter \(\alpha\) is set to \(\alpha=(1.2,0.2,-0.7)^{T}\). The regression parameter \(\gamma\) is chosen as follows:

  • case 1: \(\gamma=(-1,0.4,0.3,0.45)^{T}\) ,

  • case 2: \(\gamma=(-1,0.62,0.3,0.8)^{T}\).
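A minimal sketch of how complete data could be generated under design (4.1) is given below. It is not the authors' R code. Two points are assumptions made here: the value 1.7 in \(N(0,1.7)\) is read as a variance, and the Poisson draws use a simple arrival-counting sampler. The missingness mechanism is not specified in this section, so it is not simulated.

```python
import math
import random

def rpois(mu, rng):
    """Poisson(mu) draw: count unit-rate exponential arrivals in [0, mu]."""
    k, t = 0, rng.expovariate(1.0)
    while t <= mu:
        k += 1
        t += rng.expovariate(1.0)
    return k

def simulate_mzip(n, gamma, alpha, rng):
    """Generate n tuples (y, x, z, nu) from the MZIP design (4.1)."""
    sample = []
    for _ in range(n):
        z2 = rng.gauss(0.0, math.sqrt(1.7))      # assumption: 1.7 is the variance
        z3 = float(rpois(0.5, rng))
        z4 = rng.expovariate(1.0)
        x3 = 1.0 if rng.random() < 0.5 else 0.0  # B(1, 0.5)
        z = [1.0, z2, z3, z4]
        x = [1.0, z2, x3]                        # X_i2 = Z_i2
        zg = sum(a * b for a, b in zip(z, gamma))
        xa = sum(a * b for a, b in zip(x, alpha))
        psi = 1.0 / (1.0 + math.exp(-zg))        # excess-zero probability
        nu = math.exp(xa)                        # marginal mean
        mu = nu * (1.0 + math.exp(zg))           # Poisson mean nu / (1 - psi)
        y = 0 if rng.random() < psi else rpois(mu, rng)
        sample.append((y, x, z, nu))
    return sample

rng = random.Random(2024)                        # arbitrary seed
sample = simulate_mzip(5000, [-1.0, 0.4, 0.3, 0.45], [1.2, 0.2, -0.7], rng)
mean_y = sum(s[0] for s in sample) / len(sample)
mean_nu = sum(s[3] for s in sample) / len(sample)
print(round(mean_y, 3), round(mean_nu, 3))
```

The sample mean of \(Y_{i}\) should be close to the average of the \(\nu_{i}\), reflecting the marginal-mean interpretation of \(\alpha\). The parameter values shown correspond to case 1.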

In case 1 (respectively case 2), the average percentage of zero inflation in this simulation is \(41\%\) (respectively \(65\%\)). We assume that the covariate \(Z_{i4}\) is subject to missingness. The average fraction of missing data (AFMD) in the simulated samples is equal to \(15\) or \(30\%\). For the kernel-based weighting estimator of the MZIP model, we used a multiplicative kernel (the Dirac kernel for the discrete variables and the Gaussian kernel for the continuous variable). Finally, for each configuration (sample size, proportion of zero inflation, and proportion of missing data), we simulate \(N=1000\) samples and calculate \(\hat{\theta}_{n}^{ws}\) and \(\hat{\theta}_{n}^{wsk}\). We use the statistical software R 3.5.2 to perform our simulations and the maxLik package (see Henningsen et al. [19]) to solve Eqs. (2.3), (3.1), and (3.2). For each estimator \(\hat{\gamma}_{j,n}\) \((j=1,\ldots,4)\) and \(\hat{\alpha}_{k,n}\) \((k=1,\ldots,3)\), we obtain the bias, the standard deviation (SD), and the root mean square error (RMSE). For comparison purposes, we also provide the results that would be obtained if there were no missing covariates. In this case, the MLE is obtained by solving the score equation (2.3) (FD estimator).

Table 1 presents the results for \(n=500\), with 41\(\%\) (top) and 65\(\%\) (bottom) zero inflation and an average of 15 and 30\(\%\) missing data. Table 2 presents the corresponding results for \(n=1000\), and Table 3 those for \(n=2000\). Tables 1–3 show that both methods perform well, as the results obtained with both methods are close to those of the case without missing data (FD). The results also show that the bias and RMSE of the proposed method are generally smaller than those of the SIPWK method.

Let us now examine the performance of the proposed estimator. The results in Tables 1–3 show that the bias, standard deviation, and RMSE decrease as the sample size increases and as the proportion of individuals with missing covariates decreases. Furthermore, the bias remains reasonable even with 30\(\%\) missing data. The estimator \(\hat{\theta}^{F}_{n}\) is, as expected, better than \(\hat{\theta}^{ws}_{n}\) and \(\hat{\theta}^{wsk}_{n}\), but it is only available in the absence of missing data.
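The summary measures reported for each parameter can be computed as in the following minimal sketch. The use of the \(N-1\) divisor for the SD and the example values are assumptions made for illustration; the original computations were carried out in R.

```python
import math

def summarize(estimates, true_value):
    """Bias, standard deviation and RMSE of Monte Carlo estimates of one parameter."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - true_value
    # Sample SD with the N - 1 divisor (an assumption of this sketch)
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n - 1))
    # RMSE measured around the true value rather than the Monte Carlo mean
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, sd, rmse

# Four made-up replicate estimates of a parameter whose true value is 1.0
bias, sd, rmse = summarize([0.9, 1.1, 1.0, 1.2], 1.0)
print(round(bias, 4), round(sd, 4), round(rmse, 4))
```

Because the RMSE combines the bias and the variability of an estimator, an estimator can have a small bias and still a large RMSE when its SD is large.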

Table 1 Simulation results for \(n=500\), zero inflation: 41\(\%\) (top) and 65\(\%\) (bottom)
Table 2 Simulation results for \(n=1000\), zero inflation: 41\(\%\) (top) and 65\(\%\) (bottom)
Table 3 Simulation results for \(n=2000\), zero inflation: 41\(\%\) (top) and 65\(\%\) (bottom)

5 APPLICATION

In this section, we describe an application of the MZIP model to the NMES1988 data obtained from the National Medical Expenditure Survey (NMES) conducted in 1987–1988. We analyze the variable ofnp (number of consultations with a non-physician health professional in a practice) with the MZIP model. The proportion of zeros in the observations of this variable is equal to 0.6818. This very high proportion suggests zero inflation. For each individual \(i\) \((i=1,\ldots,n\), with \(n=4406)\) in the sample, let \(Y_{i}\) denote the number of consultations with a non-physician health professional in a practice.

  • \(\psi_{i}\) represents the probability that patient \(i\) systematically forgoes consulting a non-physician health professional.

  • \(\nu_{i}\) represents the average number of consultations with a non-physician health professional for patient \(i\).

To model the marginal mean and zero-inflation parameters \(\nu_{i}\) and \(\psi_{i}\) defined in (2.2), where \(Z_{i}\) and \(X_{i}\) are the sets of covariates, we proceeded as follows. First, we fitted an MZIP regression model incorporating all the available covariates in (2.2), i.e., taking \(X_{i}=Z_{i}\) for each \(i\). Next, Wald tests were used to select the relevant covariates in the sub-models (2.2). Through this procedure, we identified three significant predictors for \(\nu_{i}\) (chronic, gender, school) and six significant predictors for \(\psi_{i}\) (chronic, medicaid, age, income, gender, school). The significant covariates are gender (1 for female, 0 for male), age (in years, divided by 10), school (number of years of education), income (in units of 10 000 dollars), chronic (number of chronic conditions such as cancer, arthritis, or diabetes), and medicaid (a binary variable indicating whether the individual is covered by Medicaid). The covariate age was discretized before applying the proposed method. We therefore model \(\psi_{i}\) and \(\nu_{i}\) as follows:

$$\textrm{logit}(\psi_{i})=\gamma_{1}\textrm{inter}+\gamma_{2}\textrm{chronic}+\gamma_{3}\textrm{medicaid}+\gamma_{4}\textrm{age}+\gamma_{5}\textrm{income}+\gamma_{6}\textrm{gender}+\gamma_{7}\textrm{school},$$
(5.1)
$$\textrm{log}(\nu_{i})=\alpha_{1}\textrm{inter}+\alpha_{2}\textrm{chronic}+\alpha_{3}\textrm{gender}+\alpha_{4}\textrm{school}.$$
(5.2)

We simulated \(15\%\) (moderate) and \(30\%\) (high) proportions of missing data in the ‘‘income’’ variable. Among the covariates, ‘‘income’’ is the most likely to have missing data, as it is sensitive and confidential information: respondents are often reluctant to disclose their income, which can lead to higher rates of missing data for this variable. According to Mishra et al. [29], in the National Health and Nutrition Examination Survey, the rate of missing data in the ‘‘income’’ variable is often high, reaching or exceeding 15\(\%\). Tables 4 and 5 show the estimation results for the case with no missing data (FD) alongside those with 15\(\%\) and 30\(\%\) missing data, respectively. The proposed method appears robust: when the percentage of missing data increases, the covariates remain significant and the coefficients keep the same signs as in the reference case (FD). Medicaid status and gender are identified as the most influential factors in the decision to never consult a non-physician health care professional. Medicaid recipients are more likely to forgo consulting a non-physician health care professional in an office setting. One explanation is that patients covered by Medicaid may limit their consultations to those that are necessary, i.e., see only a doctor, given that Medicaid is health insurance for the less well-off.

Table 4 Analysis of health care data with 15\(\%\) missing data
Table 5 Analysis of health care data with 30\(\%\) missing data

The probability of never using a non-physician health care professional decreases with chronic, income, school, and age. It decreases with the level of education because better-informed patients may tend to diversify their use of care. It decreases as health status worsens, in part because patients with worsening health tend to favor visits to health professionals. It decreases with income because patients with higher incomes prefer to visit a health care professional.

The number of chronic illnesses and the level of education are the variables that most influence the average number of consultations with non-physician health care professionals, because patients with chronic conditions and those with higher levels of education visit regularly.

6 CONCLUSIONS

In this article, we have proposed a method for estimating the parameters of the MZIP model with MAR covariates. We compared the performance of this estimator with that of the kernel-assisted weighted estimator. The numerical results indicate that both the proposed estimator \(\hat{\theta}^{ws}_{n}\) and the estimator \(\hat{\theta}^{wsk}_{n}\) perform well. However, the simulation results suggest that the proposed method is more efficient than the kernel-assisted weighting method. The proposed SIPW estimator was used to analyze the U.S. public health economics data set NMES1988. The results of this analysis support the robustness of the proposed SIPW estimator.

In this paper, we assume that the data are MAR. However, in many practical situations the missing data pattern is not monotone. Adapting this approach to non-monotone missing data in MZIP regression deserves further research.