Abstract
In this paper, we estimate the mean of the response in the partially linear single-index errors-in-variables model when some responses are missing. Because the linear covariate is measured with additive error, the responses are not missing at random. We define two estimators, a semiparametric regression imputation estimator and a marginal average estimator, and show that both are asymptotically normal with the same asymptotic variance. A simulation study illustrates the proposed method.
1 Introduction
Semiparametric errors-in-variables (EV) models have attracted broad attention and have been studied in depth over the last two decades. Relevant studies include partially linear EV models (Liang et al. [5], He and Liang [4]), varying coefficient EV models (You et al. [16], Zhao and Xue [17]), partially linear varying coefficient EV models (You and Chen [15], Wei and Mei [13]), and partially linear additive EV models (Wei et al. [11, 12]). Here, we consider the following partially linear single-index EV model:
\[ \begin{cases} Y=X^{{\mathsf{T}}}\boldsymbol{\beta }+g(Z^{{\mathsf{T}}}\boldsymbol{\alpha })+\varepsilon , \\ V=X+e, \end{cases} \tag{1.1} \]
where Y is a response variable, the single-index covariate \(Z \in \mathbb{R}^{p}\) is observed completely, and the linear covariate \(X \in \mathbb{R}^{q}\) is measured with additive error, so that only its surrogate V is observed; \(g(\cdot )\) is an unknown smooth link function; ε is the random error with \(E(\varepsilon |Z,X)=0\) and \(\operatorname{Var}(\varepsilon |Z,X)<\infty \); and \((\boldsymbol{\alpha },\boldsymbol{\beta })\) is an unknown vector in \(\mathbb{R} ^{p} \times \mathbb{R}^{q}\). For identifiability, we require \(\|\boldsymbol{\alpha }\|=1\) and that the first nonzero component of α be positive, where \(\|\cdot \|\) denotes the Euclidean norm. The measurement error e is independent of \((Y,Z,X)\) with \(E(e)=0\) and \(\operatorname{Cov}(e)=\varSigma _{e}\). Here, we assume that \(\varSigma _{e}\) is known. If it is unknown, it can be estimated in a way analogous to the partial replication method of Liang et al. [5] for the partially linear EV model. For complete data, the partially linear single-index EV model has been discussed by Liang and Wang [6] and Chen and Cui [1].
It is well known that the mean \(E(Y)=\boldsymbol{\theta }\) is an important quantity in regression models. If all the responses in the sample are available, the mean of the response can usually be estimated directly. In practice, however, some responses may be missing, for various reasons. For example, it may be too expensive to measure Y for every subject, so that only some of the responses are available. Missing-data problems occur frequently in epidemiological studies, survey sampling, social science, and many other fields. It is therefore necessary to study the estimation of \(E(Y)= \boldsymbol{\theta }\) from data with missing responses.
However, there is little research on the mean of the response in the partially linear single-index model. In this paper, we focus on the mean \(E(Y)=\boldsymbol{\theta }\) when responses are missing in the partially linear single-index EV model (1.1). An indicator variable δ records whether Y is observed: \(\delta =0\) if Y is missing and \(\delta =1\) if Y is observed. Throughout this paper, we assume that, if X were observable, the missing-data mechanism would be
\[ P(\delta =1|Y, Z, X)=P(\delta =1|Z, X)=\pi (Z, X) \tag{1.2} \]
for some unknown \(\pi (Z, X)\). In addition, we assume that the measurement error e is independent of δ, so that \(P(\delta =1|Y, Z, X, V)=\pi (Z, X)\). Since X is observed only with measurement error, Y is not missing at random unless further assumptions are made. Details can be found in Liang et al. [7].
Imputation is a common way of dealing with missing data: a plausible value is filled in for each missing observation, and the completed data set is then analyzed as if it were fully observed. When some responses are missing, Cheng [2] applied kernel regression imputation to estimate θ in a nonparametric model. Following the method of Cheng [2], Wei [10] estimated θ in a partially linear varying-coefficient EV model with missing responses. The marginal average method can also be used in place of imputation. For a partially linear model with missing responses, Wang et al. [9] and Liang et al. [7] used these two methods to estimate the mean of the response when the covariate X is observed and when it is measured with error, respectively. In this paper, we extend the method of Liang et al. [7] to the partially linear single-index EV model and propose two estimators of θ in model (1.1) with missing responses. The estimators are shown to be asymptotically normal with the same asymptotic variance.
The rest of this paper is organized as follows. In Sect. 2, two estimators of θ are proposed and their asymptotic properties are presented. In Sect. 3, some simulation results are reported. All proofs are given in Sect. 4.
2 Methodology and result
2.1 Estimation of the mean \(E(Y)=\boldsymbol{\theta }\)
In order to derive the estimators of θ, we first use the complete method of Qi and Wang [8] to estimate the regression coefficients, the single-index coefficients and the nonparametric function. By the least-squares method and the correction for attenuation technique, an estimator of \(\boldsymbol{\beta }\) can be defined as
where \(\hat{m}_{Y}(t) = \sum_{i=1}^{n} \frac{\delta _{i} K_{h_{1}}(Z _{i}^{{\mathsf{T}}}\boldsymbol{\alpha }-t)}{\sum_{j=1}^{n} \delta _{j} K _{h_{1}}(Z_{j}^{{\mathsf{T}}}\boldsymbol{\alpha }-t)} Y_{i}\) and \(\hat{m}_{V}(t) = \sum_{i=1}^{n} \frac{\delta _{i} K_{h_{1}}(Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }-t)}{\sum_{j=1}^{n} \delta _{j} K_{h_{1}}(Z _{j}^{{\mathsf{T}}}\boldsymbol{\alpha }-t)} V_{i}\) are the estimators of \(m_{Y}(t)=\frac{E(\delta _{i} Y_{i} |Z_{i}^{{\mathsf{T}}} \boldsymbol{\alpha }=t)}{E(\delta _{i} |Z_{i}^{{\mathsf{T}}} \boldsymbol{\alpha }=t)}\) and \(m_{V}(t)=\frac{E(\delta _{i} V_{i} |Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }=t)}{E(\delta _{i} |Z_{i}^{{\mathsf{T}}} \boldsymbol{\alpha }=t)}\), respectively, and \(K_{h_{1}}(t)=\frac{K_{1}(\frac{t}{h_{1}})}{h_{1}}\), with \(K_{1}(\cdot )\) a kernel function and \(h_{1}\) a suitable bandwidth.
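The correction for attenuation can be illustrated with a minimal sketch for scalar X (q = 1) and a fixed index \(Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }\). It assumes the usual form of the corrected least-squares estimator, in which the known error variance is subtracted from the squared residuals of V; the Epanechnikov kernel, the link sin(t), the constant response probability 0.7, and the names `smooth` and `beta_hat` are illustrative assumptions, not taken from the paper.

```python
import math
import random


def epanechnikov(u):
    """Epanechnikov kernel: a bounded symmetric density supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def smooth(index, values, delta, t, h):
    """Kernel-weighted average of `values` at index value t, observed cases only."""
    num = den = 0.0
    for s, val, d in zip(index, values, delta):
        if d == 0:
            continue  # cases with a missing response get zero weight
        w = epanechnikov((s - t) / h) / h
        num += w * val
        den += w
    return num / den


def beta_hat(y, v, index, delta, sigma_e2, h):
    """Attenuation-corrected least-squares estimate of beta for scalar X.

    index[i] plays the role of Z_i^T alpha for a fixed alpha, and sigma_e2
    is the known measurement-error variance.
    """
    num = den = 0.0
    for y_i, v_i, t_i, d_i in zip(y, v, index, delta):
        if d_i == 0:
            continue
        v_res = v_i - smooth(index, v, delta, t_i, h)
        y_res = y_i - smooth(index, y, delta, t_i, h)
        num += v_res * y_res
        den += v_res * v_res - sigma_e2  # remove the inflation caused by e
    return num / den


# Toy data (assumed design): beta = 1, Var(e) = 0.25**2, link g(t) = sin(t).
rng = random.Random(0)
n = 400
index = [sum(rng.random() for _ in range(3)) / math.sqrt(3) for _ in range(n)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
v = [x_i + rng.gauss(0.0, 0.25) for x_i in x]
delta = [1 if rng.random() < 0.7 else 0 for _ in range(n)]
y = [x_i + math.sin(t_i) + rng.gauss(0.0, 0.1) if d_i else None
     for x_i, t_i, d_i in zip(x, index, delta)]

estimate = beta_hat(y, v, index, delta, sigma_e2=0.25 ** 2, h=0.3)
print(round(estimate, 3))
```

Without the `- sigma_e2` term the denominator would include the variance of e, and the estimate would be biased towards zero.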
After obtaining the estimator of β, we can obtain the estimators \(\hat{g}_{n}(\cdot )\) and \(\hat{g}'_{n}(\cdot )\) of \(g(\cdot )\) and \(g'(\cdot )\) for any fixed α. By the locally linear method of Fan and Gijbels [3], we approximate \(g(t)\) within the neighborhood of \(t_{0}\), \(g(t)\approx g(t_{0})+g'(t _{0})(t-t_{0})\) and minimize
where \(K_{h_{2}}(t)=\frac{K_{2}(\frac{t}{h_{2}})}{h_{2}}\), with \(K_{2}(\cdot )\) being a kernel function and \(h_{2}\) being a suitable bandwidth.
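As an illustration of the local linear step, the sketch below solves the weighted least-squares problem in closed form. It assumes the criterion is the δ-weighted sum of \(\{Y_{i}-V_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\beta }}_{n}-a-b(Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }-t_{0})\}^{2}K_{h_{2}}(Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }-t_{0})\), so that the minimizers a and b estimate \(g(t_{0})\) and \(g'(t_{0})\); the function name `local_linear` and the Epanechnikov kernel are assumptions made for the example.

```python
def epanechnikov(u):
    """Epanechnikov kernel: a bounded symmetric density supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear(t0, index, resid, delta, h):
    """Return (a, b), the local linear estimates of g(t0) and g'(t0).

    resid[i] stands for Y_i - V_i^T beta_hat and index[i] for Z_i^T alpha;
    entries with delta[i] == 0 (missing response) are skipped.
    """
    s0 = s1 = s2 = r0 = r1 = 0.0
    for t_i, r_i, d_i in zip(index, resid, delta):
        if d_i == 0:
            continue
        d = t_i - t0
        w = epanechnikov(d / h) / h
        s0 += w
        s1 += w * d
        s2 += w * d * d
        r0 += w * r_i
        r1 += w * d * r_i
    det = s0 * s2 - s1 * s1
    # Solution of the 2 x 2 weighted normal equations.
    a = (s2 * r0 - s1 * r1) / det
    b = (s0 * r1 - s1 * r0) / det
    return a, b


# Check on exactly linear residuals 2 + 3 t with every fourth response missing.
index = [i / 50 for i in range(51)]
delta = [0 if i % 4 == 0 else 1 for i in range(51)]
resid = [2.0 + 3.0 * t if d else None for t, d in zip(index, delta)]
a, b = local_linear(0.5, index, resid, delta, h=0.2)
print(a, b)
```

Because a local linear fit reproduces a linear function exactly, the printed values agree with g(0.5) = 3.5 and g'(0.5) = 3 up to rounding error.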
However, (2.1) and (2.2) cannot be applied directly in practice, since α is unknown. So we need to estimate \(\boldsymbol{\alpha }\) by minimizing
which yields, say, \(\hat{\boldsymbol{\alpha }}_{n}\). Note that \(\hat{\boldsymbol{\beta }}_{n}\) and \(\hat{g}_{n}(\cdot )\) are in turn used to obtain \(\hat{\boldsymbol{\alpha }}_{n}\) in (2.3). The complete estimation procedure is therefore organized as an iterative process with the following steps:
Step 1. Acquire an initial value \(\hat{\boldsymbol{\alpha }}_{0}\), for example, by the method of Xia and Härdle [14], and let \(\hat{\boldsymbol{\alpha }}_{n} = \frac{\hat{\boldsymbol{\alpha }}_{0}}{\| \hat{\boldsymbol{\alpha }}_{0} \| }\).
Step 2. When \(\boldsymbol{\alpha } =\hat{\boldsymbol{\alpha }}_{n}\), we can obtain \(\hat{\boldsymbol{\beta }}_{nk}\), \(\hat{g}_{nk}(\cdot )\) based on (2.1) and (2.2).
Step 3. The solution of (2.3) is denoted as \(\hat{\boldsymbol{\alpha }} _{n(k+1)}\). Let \(\hat{\boldsymbol{\alpha }}_{n}= \frac{\hat{\boldsymbol{\alpha }}_{n(k+1)}}{ \| \hat{\boldsymbol{\alpha }}_{n(k+1)} \| }\).
Step 4. Iterate Steps 2 and 3 until convergence is achieved.
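The alternation in Steps 1 to 4 can be sketched as a short loop. This is only a skeleton under stated assumptions: `fit_beta_g` and `update_alpha` are hypothetical callables standing in for (2.1)–(2.2) and for the minimization in (2.3), and the stopping rule (largest componentwise change below `tol`) is one possible choice of convergence criterion.

```python
import math


def normalize(alpha):
    """Scale alpha to unit Euclidean norm with first nonzero component positive."""
    norm = math.sqrt(sum(a * a for a in alpha))
    unit = [a / norm for a in alpha]
    first = next(a for a in unit if a != 0.0)
    return unit if first > 0 else [-a for a in unit]


def alternate(alpha0, fit_beta_g, update_alpha, tol=1e-6, max_iter=100):
    """Iterate Steps 2 and 3 until the index direction stabilizes."""
    alpha = normalize(alpha0)                        # Step 1
    beta = g = None
    for _ in range(max_iter):
        beta, g = fit_beta_g(alpha)                  # Step 2: beta and g for fixed alpha
        new_alpha = normalize(update_alpha(beta, g, alpha))  # Step 3
        change = max(abs(a - b) for a, b in zip(new_alpha, alpha))
        alpha = new_alpha
        if change < tol:                             # Step 4
            break
    # beta and g were fitted at the previous alpha, which is within tol of alpha.
    return alpha, beta, g


# Toy stand-ins (hypothetical): the update pulls alpha towards a fixed direction.
target = [0.6, 0.8]


def toy_fit(alpha):
    return 1.0, None


def toy_update(beta, g, alpha):
    return [0.5 * (a + t) for a, t in zip(alpha, target)]


alpha_hat, beta_hat_value, _ = alternate([1.0, 0.0], toy_fit, toy_update)
print([round(a, 4) for a in alpha_hat])
```

Renormalizing after every update keeps the iterate on the unit sphere, and the sign convention matches the identifiability constraint imposed on α.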
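Next, we turn to the estimation of the mean \(E(Y)=\boldsymbol{\theta }\). Similar to Wang et al. [9] and Liang et al. [7], we construct two estimators of θ. First, each missing \(Y_{i}\) is imputed by the estimated regression function \(V_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\beta }}_{n}+ \hat{g}_{n}(Z_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\alpha }}_{n})\). Consequently, we obtain the semiparametric regression imputation estimator of θ, which is defined as
\[ \hat{\boldsymbol{\theta }}_{1}=\frac{1}{n}\sum_{i=1}^{n} \bigl\{ \delta _{i}Y_{i}+(1-\delta _{i}) \bigl[V_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\beta }}_{n}+\hat{g}_{n}(Z_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\alpha }}_{n}) \bigr] \bigr\} . \tag{2.4} \]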
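Second, we consider only the sample average of the estimated regression function, that is, every \(Y_{i}\) is ignored, whether observed or not. Accordingly, we get the marginal average estimator of θ, which is defined as
\[ \hat{\boldsymbol{\theta }}_{2}=\frac{1}{n}\sum_{i=1}^{n} \bigl[V_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\beta }}_{n}+\hat{g}_{n}(Z_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\alpha }}_{n}) \bigr]. \tag{2.5} \]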
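Once the fitted values \(V_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\beta }}_{n}+\hat{g}_{n}(Z_{i}^{{\mathsf{T}}}\hat{\boldsymbol{\alpha }}_{n})\) are available, both estimators reduce to simple averages. The following sketch takes the fitted values as given; the function name `mean_estimators` and the toy numbers are hypothetical.

```python
def mean_estimators(y, fitted, delta):
    """Return (theta1, theta2) from responses, fitted values and indicators.

    theta1: regression imputation, observed Y_i kept and missing Y_i replaced
            by the fitted value.
    theta2: marginal average, the mean of the fitted values only.
    """
    n = len(fitted)
    theta1 = sum(y_i if d_i == 1 else f_i
                 for y_i, f_i, d_i in zip(y, fitted, delta)) / n
    theta2 = sum(fitted) / n
    return theta1, theta2


# Toy input: the second and fourth responses are missing.
y = [1.0, None, 3.0, None]
fitted = [2.0, 2.0, 2.5, 5.0]
delta = [1, 0, 1, 0]
theta1, theta2 = mean_estimators(y, fitted, delta)
print(theta1, theta2)
```

The two estimates differ only through the observed cases, where the first uses \(Y_{i}\) itself and the second uses its fitted value.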
2.2 Asymptotic result
In this section, we state the asymptotic normality of \(\hat{\boldsymbol{\theta }}_{1}\) and \(\hat{\boldsymbol{\theta }}_{2}\) and show that the two estimators are asymptotically equivalent.
For a concise representation, let \(P(t_{0},\delta )=\delta /E(\delta |Z^{{\mathsf{T}}} \boldsymbol{\alpha }=t_{0})\) and \(\widetilde{\mathcal{S}}=\mathcal{S}- \frac{E(\delta \mathcal{S}|Z^{{\mathsf{T}}}\boldsymbol{\alpha })}{E(\delta |Z^{{\mathsf{T}}} \boldsymbol{\alpha })}\), for example, \(\widetilde{X}_{i}=X_{i}-\frac{E(\delta X|Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha })}{E(\delta |Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha })}\). Moreover, in order to state the asymptotic results, the following assumptions will be used:
- \((C1)\):
Let \(\varGamma _{\widetilde{X}}=E\{\delta \widetilde{X}^{\otimes 2} \}\), \(\varGamma _{\widetilde{Z}}=E\{\delta [\widetilde{Z}g'(Z^{{\mathsf{T}}}\boldsymbol{\alpha })]^{\otimes 2}\}\) and \(\varGamma _{\widetilde{Z}\widetilde{X}}=E\{\delta \widetilde{Z} \widetilde{X}^{{\mathsf{T}}}g'(Z^{{\mathsf{T}}} \boldsymbol{\alpha })\}\).
- \((C2)\):
The bandwidth satisfies \(h_{1}=h_{0}n^{-\frac{1}{p+4}}\) for some positive constant \(h_{0}\), \(\frac{nh_{2}^{p}}{\log n} \rightarrow \infty \), where p is the dimension of Z.
- \((C3)\):
The kernels \(K_{i}(\cdot )\) (\(i=1,2\)) are bounded symmetric density functions with compact support \((-1,1)\), and they satisfy \(\int uK _{i}(u)\,du=0\), \(\int u^{2}K_{i}(u)\,du\neq 0\).
- \((C4)\):
The density function \(f(t)\) of \(Z^{{\mathsf{T}}} \boldsymbol{\alpha }\) is bounded away from 0 and has two bounded derivatives on its support.
- \((C5)\):
\(g(\cdot )\), \(m_{Y}(\cdot )\), \(m_{V}(\cdot )\) have two bounded, continuous derivatives on their supports.
- \((C6)\):
The probability function \(\pi (Z, X)\) has bounded continuous second partial derivatives, and is bounded away from zero on the support of \((Z,X)\).
- \((C7)\):
\(E(|\varepsilon |^{4})<\infty \), \(E(|e|^{3})<\infty \).
We now state the following asymptotic result.
Theorem 2.1
Assume that conditions \((C1)\)–\((C7)\) are satisfied. Then we obtain
where \(\boldsymbol{\varTheta }_{1}=E [P(Z^{{\mathsf{T}}}\boldsymbol{\alpha }, \delta )\varepsilon +[1-P(Z^{{\mathsf{T}}}\boldsymbol{\alpha },\delta )]e^{{\mathsf{T}}}\boldsymbol{\beta } +E[g'(Z^{{\mathsf{T}}} \boldsymbol{\alpha })\widetilde{Z}^{{\mathsf{T}}}] \cdot \varGamma _{\widetilde{Z}}^{-1}\delta g'(Z^{{\mathsf{T}}} \boldsymbol{\alpha })\widetilde{Z} (\varepsilon -e^{{\mathsf{T}}} \boldsymbol{\beta }) +E[\widetilde{V}^{{\mathsf{T}}}-g'(Z^{{\mathsf{T}}}\boldsymbol{\alpha })\widetilde{Z}^{{\mathsf{T}}} \varGamma _{\widetilde{Z}}^{-1}\varGamma _{\widetilde{Z}\widetilde{X}}]\cdot \varGamma _{\widetilde{X}}^{-1} \{\delta [\widetilde{V} (\varepsilon -e ^{{\mathsf{T}}} \boldsymbol{\beta })+\varSigma _{e}\boldsymbol{\beta }]\} ]^{2}\) and \(\boldsymbol{\varTheta }_{2}=E[X^{{\mathsf{T}}}\boldsymbol{\beta }+g(Z^{{\mathsf{T}}}\boldsymbol{\alpha })-\boldsymbol{\theta }]^{2}\).
3 Simulation
In this section, we present a simulation study of the finite-sample performance of the regression imputation estimator \(\hat{\boldsymbol{\theta }}_{1}\) and the marginal average estimator \(\hat{\boldsymbol{\theta }}_{2}\).
The simulation uses the partially linear single-index EV model (1.1) with a specific link function:
where X is generated from the standard normal distribution, trivariate Z is simulated from the uniform distribution \(U[0,1]\), e is generated from the normal distribution \(N(0,0.25^{2})\), ε is simulated from the normal distribution with mean 0 and variance 0.01, and \(\boldsymbol{\alpha }=(\frac{\sqrt{3}}{3},\frac{\sqrt{3}}{3},\frac{ \sqrt{3}}{3})^{{\mathsf{T}}}\), \(\boldsymbol{\beta }=1\). The kernel functions were taken to be \(K_{i}(t)=\frac{3}{4}(1-t^{2})^{2}\) if \(\|t\|\leq 1 \), and 0 otherwise, \(i=1,2\).
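The data-generating design above can be sketched as follows. The link function is passed in as an argument, and `math.sin` in the usage lines is only a hypothetical placeholder, not the link used in (3.1); the sketch also assumes that the three components of Z are drawn independently.

```python
import math
import random


def simulate(n, link, rng):
    """Draw n tuples (y, z, x, v) from model (1.1) under the stated design."""
    alpha = [1.0 / math.sqrt(3.0)] * 3   # (sqrt(3)/3, sqrt(3)/3, sqrt(3)/3)
    beta = 1.0
    sample = []
    for _ in range(n):
        z = [rng.random() for _ in range(3)]   # Z_j ~ U[0, 1], assumed independent
        x = rng.gauss(0.0, 1.0)                # X ~ N(0, 1)
        e = rng.gauss(0.0, 0.25)               # e ~ N(0, 0.25^2)
        eps = rng.gauss(0.0, 0.1)              # Var(eps) = 0.01
        index = sum(a * z_j for a, z_j in zip(alpha, z))
        y = x * beta + link(index) + eps
        v = x + e                              # only the surrogate V is observed
        sample.append((y, z, x, v))
    return sample


# Usage with a placeholder link (assumption).
rng = random.Random(0)
sample = simulate(1000, math.sin, rng)
print(sample[0])
```

In an estimation routine only (y, z, v) and the missingness indicator would be used; x is returned here so that the response probability, which depends on X, can be evaluated.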
The choice of bandwidths is crucial. In this paper, we use the least-squares leave-one-out cross-validation (CV) method to select the bandwidths: \(\hat{h}_{1}\) and \(\hat{h}_{2}\) are chosen as
where \(\hat{\boldsymbol{\beta }}_{n}^{(-i)}\), \(\hat{g}_{n}^{(-i)}\) and \(\hat{\boldsymbol{\alpha }}_{n}^{(-i)}\) are the “leave-one-out” versions of \(\hat{\boldsymbol{\beta }}_{n}\), \(\hat{g}_{n}\) and \(\hat{\boldsymbol{\alpha }}_{n}\), respectively. However, the \(h_{i}\), \(i=1,2\), obtained from (3.2) may not be the optimal bandwidths, because they may not satisfy the conditions imposed in the theorem. In line with those conditions, we use (3.2) to choose the constant \(h_{0}\) in the bandwidth.
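The leave-one-out idea can be illustrated on a much simpler problem than the full model: a single kernel smoother with one bandwidth chosen from a grid. The sketch below is a simplified stand-in for the criterion in (3.2), not a reproduction of it; the test function, the grid, and the names `loo_cv_score` and `select_bandwidth` are assumptions.

```python
import math
import random


def epanechnikov(u):
    """Epanechnikov kernel: a bounded symmetric density supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def loo_cv_score(t, r, h):
    """Mean squared leave-one-out prediction error of a kernel smoother."""
    total = 0.0
    for i in range(len(t)):
        num = den = 0.0
        for j in range(len(t)):
            if j == i:
                continue  # delete observation i before predicting it
            w = epanechnikov((t[j] - t[i]) / h)
            num += w * r[j]
            den += w
        if den == 0.0:
            return math.inf  # bandwidth too small: some point has no neighbours
        total += (r[i] - num / den) ** 2
    return total / len(t)


def select_bandwidth(t, r, grid):
    """Return the grid value with the smallest leave-one-out score."""
    return min(grid, key=lambda h: loo_cv_score(t, r, h))


# Toy regression (assumed): r = sin(2 pi t) + noise.
rng = random.Random(1)
t = [rng.random() for _ in range(200)]
r = [math.sin(2.0 * math.pi * t_i) + rng.gauss(0.0, 0.2) for t_i in t]
grid = [0.02, 0.05, 0.1, 0.2, 0.4, 0.8]
best_h = select_bandwidth(t, r, grid)
print(best_h)
```

For the model considered here, the prediction of \(Y_{i}\) would instead use the leave-one-out versions of all three estimators, and only cases with \(\delta _{i}=1\) could contribute to the criterion.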
Based on model (3.1), we considered the following four response probabilities:
Case 1: \(P(\delta =1|Z=z, X=x)=\frac{\exp (0.6+z^{{\mathsf{T}}} \phi +\varphi x)}{1+\exp (0.6+z^{{\mathsf{T}}}\phi +\varphi x)}\), where \(\phi =(-0.12,-0.012,-0.12)^{{\mathsf{T}}}\), \(\varphi =0.35\);
Case 2: \(P(\delta =1|Z=z, X=x)=\frac{\exp (0.6+z^{{\mathsf{T}}} \phi +\varphi x)}{1+\exp (0.6+z^{{\mathsf{T}}}\phi +\varphi x)}\), where \(\phi =(0.2,0.2,0.2)^{{\mathsf{T}}}\), \(\varphi =0.45\);
Case 3: \(P(\delta =1|Z=z, X=x)=\frac{\exp (0.6+z^{{\mathsf{T}}} \phi +\varphi x)}{1+\exp (0.6+z^{{\mathsf{T}}}\phi +\varphi x)}\), where \(\phi =(0.65,0.65,0.65)^{{\mathsf{T}}}\), \(\varphi =0.8\);
Case 4: \(P(\delta =1|Z=z, X=x)=0.9\) for all z and x.
The average missing rates were 0.4, 0.3, 0.2, and 0.1, respectively. From the 1000 simulated values of \(\hat{\boldsymbol{\theta }}_{1}\), \(\hat{\boldsymbol{\theta }}_{2}\), we calculated the biases and standard errors (SE) of the two estimators. The simulated results are reported in Table 1.
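The four response mechanisms can be checked by Monte Carlo with a short sketch. It uses the distributions of Z and X stated for the simulation design and assumes independent components of Z; the names `response_prob` and `missing_rate` and the sample size of 20000 are choices made for the example.

```python
import math
import random

# (phi, varphi) for the logistic Cases 1 to 3; Case 4 is the constant 0.9.
CASES = {
    1: ((-0.12, -0.012, -0.12), 0.35),
    2: ((0.2, 0.2, 0.2), 0.45),
    3: ((0.65, 0.65, 0.65), 0.8),
}


def response_prob(case, z, x):
    """P(delta = 1 | Z = z, X = x) for the given case."""
    if case == 4:
        return 0.9
    phi, varphi = CASES[case]
    eta = 0.6 + sum(p * z_j for p, z_j in zip(phi, z)) + varphi * x
    return 1.0 / (1.0 + math.exp(-eta))


def missing_rate(case, n, rng):
    """Monte Carlo estimate of the average probability that Y is missing."""
    missing = 0
    for _ in range(n):
        z = [rng.random() for _ in range(3)]   # Z_j ~ U[0, 1]
        x = rng.gauss(0.0, 1.0)                # X ~ N(0, 1)
        if rng.random() >= response_prob(case, z, x):
            missing += 1                       # delta = 0
    return missing / n


rng = random.Random(2)
rates = {case: missing_rate(case, 20000, rng) for case in (1, 2, 3, 4)}
print(rates)
```

The estimated rates should lie close to the reported averages of 0.4, 0.3, 0.2, and 0.1.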
From Table 1, we observe that
- \((a)\):
Biases and SE decrease as n increases for every fixed missing rate. Also, SE increase as the missing rate increases for every fixed sample size n.
- \((b)\):
The SE of \(\hat{\boldsymbol{\theta }}_{1}\), \(\hat{\boldsymbol{\theta }}_{2}\) are nearly the same for every fixed missing rate and sample size.
4 Proof of the main result
In order to prove the main result, we first give some lemmas.
Lemma 4.1
Under conditions \((C1)\)–\((C7)\), we have
Proof of Lemma 4.1
When \(\boldsymbol{\alpha }= \hat{\boldsymbol{\alpha }}_{n}\), the estimators of \(g(\cdot )\) and \(g'(\cdot )\) can be obtained from (2.2). By a straightforward calculation,
Then focusing on the top equation only and using Taylor expansion, we have
that is,
Note that \(\frac{1}{n}\sum_{i=1}^{n} K_{h_{2}}(Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }-t_{0})=f(t_{0})+o_{p}(1)\). Dividing all terms in (4.2) by \(\frac{1}{n} \sum_{i=1}^{n} K_{h_{2}}(Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }-t_{0})\), we obtain
Noting that \(\frac{\frac{1}{n}\sum_{i=1}^{n} \delta _{i} K_{h_{2}}(Z _{i}^{{\mathsf{T}}}\boldsymbol{\alpha }-t_{0})}{\frac{1}{n}\sum_{i=1} ^{n} K_{h_{2}}(Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }-t_{0})}=E( \delta |Z_{i}^{{\mathsf{T}}}\boldsymbol{\alpha }=t_{0})+o_{p}(1)\), we get
Similarly, we also have
Thus we get equation (4.1). □
Lemma 4.2
Under conditions \((C1)\)–\((C7)\), we have
Proof of Lemma 4.2
The proof is given in Qi and Wang [8]; we omit the details here. □
Lemma 4.3
Under conditions \((C1)\)–\((C7)\), we have
Proof of Lemma 4.3
The proof of Lemma 4.3 is similar to the proof of Theorem 1 in Liang et al. [7]; we omit the details here. □
Proof of Theorem 2.1
Here we only consider the asymptotic normality of \(\hat{\boldsymbol{\theta }}_{1}\). The asymptotic result for \(\hat{\boldsymbol{\theta }}_{2}\) is obtained similarly. □
For \(\hat{\boldsymbol{\theta }}_{1}\), we have
where
From Taylor expansion and the continuity of \(g'(\cdot )\), we obtain that
By Lemma 4.1 and (4.4), it is easy to get
where
We have
Combining Lemma 4.2 and calculating directly, we can get
where
By a straightforward calculation, it follows that
Furthermore, it is easy to get
Combining (4.3), (4.5), (4.6), (4.7), (4.8), (4.9), and Lemma 4.3, one can get
This, together with the central limit theorem, proves Theorem 2.1 for \(\hat{\boldsymbol{\theta }}_{1}\).
References
1. Chen, X., Cui, H.J.: Empirical likelihood for partially linear single-index errors-in-variables model. Commun. Stat., Theory Methods 38(15), 2498–2514 (2009)
2. Cheng, P.E.: Nonparametric estimation of mean functionals with data missing at random. J. Am. Stat. Assoc. 89, 81–87 (1994)
3. Fan, J.Q., Gijbels, I.: Local Polynomial Modelling and Its Applications. Chapman & Hall, London (1996)
4. He, X.M., Liang, H.: Quantile regression estimates for a class of linear and partially linear errors-in-variables models. Stat. Sin. 10, 129–140 (2000)
5. Liang, H., Härdle, W., Carroll, R.J.: Estimation in a semiparametric partially linear errors-in-variables model. Ann. Stat. 27, 1519–1535 (1999)
6. Liang, H., Wang, N.: Partially linear single-index measurement error models. Stat. Sin. 15, 99–116 (2005)
7. Liang, H., Wang, S.J., Carroll, R.J.: Partially linear models with missing response variables and error-prone covariates. Biometrika 94, 185–198 (2007)
8. Qi, X., Wang, D.H.: Estimation in a partially linear single-index model with missing response variables and error-prone covariates. J. Inequal. Appl. 2016, 11 (2016). https://doi.org/10.1186/s13660-015-0941-8
9. Wang, Q.H., Linton, O., Härdle, W.: Semiparametric regression analysis with missing response at random. J. Am. Stat. Assoc. 99, 334–345 (2004)
10. Wei, C.H.: Estimation in partially linear varying-coefficient errors-in-variables models with missing responses (Chinese ed.). Acta Math. Sci. 30, 1042–1054 (2010)
11. Wei, C.H., Jia, X.J., Hu, H.S.: Statistical inference on partially linear additive models with missing response variables and error-prone covariates. Commun. Stat., Theory Methods 44, 872–883 (2015)
12. Wei, C.H., Luo, Y.B., Wu, X.Z.: Empirical likelihood for partially linear additive errors-in-variables models. Stat. Pap. 53(2), 485–496 (2012)
13. Wei, C.H., Mei, C.L.: Empirical likelihood for partially linear varying-coefficient models with missing response variables and error-prone covariates. J. Korean Stat. Soc. 41, 97–103 (2012)
14. Xia, Y.C., Härdle, W.: Semi-parametric estimation of partially linear single-index models. J. Multivar. Anal. 97, 1162–1184 (2006)
15. You, J.H., Chen, G.M.: Estimation of a semiparametric varying-coefficient partially linear errors-in-variables model. J. Multivar. Anal. 97(2), 324–341 (2006)
16. You, J.H., Zhou, Y., Chen, G.M.: Corrected local polynomial estimation in varying-coefficient models with measurement errors. Can. J. Stat. 34(3), 391–410 (2006)
17. Zhao, P.X., Xue, L.G.: Variable selection for varying coefficient models with measurement errors. Metrika 74, 231–245 (2011)
Acknowledgements
The authors thank the two referees and editor(s) for carefully reading the paper and for their valuable suggestions and comments which greatly improved the paper.
Availability of data and materials
The data sets analyzed in the current study can be generated by Monte Carlo experiments.
Funding
This work is supported by the Philosophy and Social Sciences Planning Project of Guangdong Province during the “13th Five-Year” Plan Period (No. GD18CYJ08, GD18XGL26), the National Social Science Foundation of China (18CTQ032) and the Guangdong Polytechnic of Science and Technology Research Project (No. XJPY2018006, XJMS2018006).
Contributions
The authors contributed equally to the writing of this paper. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Qi, X., Yu, Z. Estimation of the mean of the partially linear single-index errors-in-variables model with missing response variables. J Inequal Appl 2020, 18 (2020). https://doi.org/10.1186/s13660-020-2299-9