1 Introduction

Generalized varying coefficient partially linear models (GVCPLM) (Li and Liang 2008) are powerful extensions of generalized partially linear models (GPLM). These models offer additional flexibility compared to GPLM when modeling data with discrete response variables: they relax the model assumptions imposed on GPLM and allow interactions between covariates and unknown functions of other covariates, while keeping some components linear. GVCPLM are also useful generalizations of varying coefficient models (Hastie and Tibshirani 1993), which have been applied to parsimoniously describe data structures and uncover scientific features, and have been studied in the context of the quasi-likelihood principle. As is well known in the literature, several useful semiparametric models can be viewed as special cases of GVCPLM, including GPLM (Hunsberger 1994; Hunsberger et al. 2002; Lin and Carroll 2001; Severini and Staniswalis 1994), partially linear models (PLM; Härdle et al. 2000; Robinson 1988; Speckman 1988), semivarying-coefficient models (Fan and Huang 2005; Xia et al. 2004; Zhang et al. 2002) and varying coefficient models (Cai et al. 2000; Hastie and Tibshirani 1993).

Li and Liang (2008) studied variable selection for GVCPLM, using the SCAD penalty (Fan and Li 2001) to identify parametric components and the generalized likelihood ratio test (Fan et al. 2001) to select nonparametric components. Wang and Xia (2009) proposed a shrinkage method for selecting nonparametric components in varying coefficient models. Wang et al. (2011) developed estimation and variable selection procedures for generalized additive partially linear models, incorporating polynomial spline smoothing to estimate the nonparametric functions and SCAD-penalized quasi-likelihood estimators to select linear covariates. Li et al. (2011) considered variable selection for varying coefficient partially linear models when the numbers of parametric and nonparametric components both diverge at appropriate rates. Wei et al. (2011) further considered variable selection and estimation in the “large p, small n” setting using the group Lasso idea (Yuan and Lin 2006).

Measurement errors are often encountered in biomedical research. Simply ignoring the errors can cause bias in estimation and lead to a loss of power for accurately detecting the relationship among variables. Regression calibration and simulation extrapolation (SIMEX, Cook and Stefanski 1994) are two widely used methods for eliminating or reducing bias caused by measurement errors, but the corresponding estimators are consistent only in special cases such as linear or loglinear regression, and only approximately consistent in general. Alternative methods remedy these consistency concerns by deriving unbiased score functions in the presence of measurement error, for example, the conditional score method (Stefanski and Carroll 1987) and the corrected score method (Stefanski 1989). Both are essentially M-estimation methods, so the usual numerical methods and asymptotic theory for M-estimators apply. Like other methods, however, these two have their own limitations. In particular, conditional scores can be derived only under specific assumptions on the model for the response given the covariates and on the error model for the surrogates, and some conditional score methods may require integration, while corrected scores also impose sufficient assumptions on the error model to ensure unbiased estimation of the true-data estimating function. See Carroll et al. (2006) for more detailed discussions on a variety of estimation and inference methods for nonlinear measurement error models. Ma and Carroll (2006), Ma and Tsiatis (2006) and Tsiatis and Ma (2004) investigated estimation efficiency for semiparametric models with measurement errors. Hall and Ma (2007) studied semiparametric estimators of functional measurement error models. Yi et al. (2012) considered marginal analysis of longitudinal data when errors-in-variables and missing responses occur simultaneously.

Efforts have been made to address various scientific questions using semiparametric models in the presence of measurement errors. For example, Sinha et al. (2010) proposed a semiparametric Bayesian method for handling the measurement errors that commonly appear in nutritional epidemiological studies. Carroll and Wang (2008) studied the effects of measurement errors on microarray data analysis, noticed that a direct application of the simulation extrapolation method leads to inconsistent estimators, and proposed a permutation SIMEX method which yields theoretically consistent estimators. In environmental research, environmental factors are generally measured with error. Lobach et al. (2008, 2010) developed a genotype-based approach for association analysis of case–control studies of gene–environment interactions, using a pseudo-likelihood principle to reduce the bias caused by measurement errors.

Recently, variable selection in semiparametric regressions with measurement errors has been considered. Liang and Li (2009) developed two variable selection procedures, penalized least squares and penalized quantile regression, for PLM with measurement errors. Ma and Li (2010) proposed a penalized estimating equation with the SCAD penalty for a class of parametric and semiparametric measurement error models. As observed in Liang and Li (2009), if measurement errors are ignored, variable selection procedures may falsely select variables and result in a biased final model.

In this article, we study estimation and variable selection for GVCPLM when the covariates are error prone. We consider three problems: first, calibrating the error-prone covariates using ancillary information and nonparametric regression techniques; second, developing quasi-likelihood profile estimation procedures and showing that the corresponding estimators of the parameters of interest are asymptotically normal; third, proposing a penalized quasi-likelihood procedure for selecting significant parameters and a generalized likelihood ratio test for selecting nonzero nonparametric functions. Zhou and Liang (2009) studied the case where the link function is the identity, gave a variety of examples to illustrate the flexibility of the model, and developed a profile-based estimation procedure for the unknown parameters of interest.

It is worth noting that extending the profile estimation procedure of Zhou and Liang (2009) to GVCPLM is by no means trivial. For the identity link, the profile least-squares technique can be used and a closed form of the estimators is available. For GVCPLM with measurement errors, however, only a quasi-likelihood-based objective function is available, and whether the resulting estimators still have nice properties such as asymptotic normality is theoretically difficult to address. For GVCPLM, Li and Liang (2008) proposed a SCAD-type procedure for parametric component selection and theoretically showed its oracle properties under certain assumptions; whether such a procedure can be developed under a measurement error framework has not been investigated in the literature. In the absence of measurement errors, Fan et al. (2001) proposed a generalized quasi-likelihood ratio test (GLRT) to investigate whether the coefficient functions in GVCPLM are constant. In this paper, we investigate the Wilks phenomenon in the error-prone semiparametric setting, and we further propose a bootstrap procedure to estimate the null distribution of the GLRT. To the best of our knowledge, this Wilks phenomenon under error-prone covariates is new, and the findings contribute to the literature on semiparametric modeling.

The remainder of the paper is organized as follows. In Sect. 2, we propose the quasi-likelihood procedure for estimation of the parametric components, and then develop a penalized quasi-likelihood for variable selection. Sampling properties of the proposed procedures are investigated. In Sect. 3, estimation and variable selection procedures are proposed for the nonparametric components, and the null distribution of the GLRT is established. Simulation results and a real data analysis are presented in Sect. 4. Regularity conditions and technical proofs are given in the Appendix.

2 Estimation and variable selection for parametric components

Let \(X=(X_{1}, \ldots , X_{p})^{T}\in \mathbb {R}^{p}\), \(\xi =(\xi _{1}, \ldots , \xi _{d})^{T} \in \mathbb {R}^{d}\), \(W=(W_{1}, \ldots , W_{r})^{T} \in \mathbb {R}^{r}\), \(U\in \mathbb {R}\) be the covariates and Y be the response variable. The GVCPLM is of the form:

$$\begin{aligned} g\{\mu (U,\xi , W, X)\}=\beta ^{T}\xi +\theta ^{T}W+\alpha (U)^{T}X, \end{aligned}$$
(1)

where \(g(\cdot )\) is a known link function, \(\beta \) and \(\theta \) are vectors of unknown regression coefficients and \(\alpha (\cdot )\) is a vector of unknown smooth nonparametric functions of U. The response Y is related to the covariates (U, \(\xi \), W, X) through the unknown mean function \(\mu (U, \xi , W, X)=E(Y|U, \xi , W, X)\), and the conditional variance is determined by a known positive function \(T(\cdot )\), i.e., \(\mathrm{Var}(Y|U, \xi , W, X)=\sigma ^{2}T\{\mu (U, \xi , W, X)\}\). The components \(\xi \) are not observed directly, but ancillary variables \((\eta , V)\) are available to recover \(\xi \); the surrogate \(\eta \) is related to V via

$$\begin{aligned} \eta =\xi (V)+e, \end{aligned}$$
(2)

where e is a measurement error, independent of (X, W, V, U, Y), with a positive finite covariance matrix \(\Sigma _{e}=E(e e^{T})\). We term (1) and (2) generalized varying coefficient partially linear measurement error models (GVCPLMeM).

Let \(\big \{(Y_{i}, U_{i}, \eta _{i}, V_{i}, W_{i}, X_{i})\big \}_{i=1}^{n}\) be an i.i.d. sample from \((Y, U, \eta , V, W, X)\). Since the covariates \(\xi \) are measured with error, we first calibrate \(\xi \) using the observed ancillary sample \(\big \{(\eta _{i}, V_{i})\big \}_{i=1}^{n}\).

2.1 Covariate calibration

We introduce the calibration estimation procedure for \(\xi \) in this section. For notational simplicity, we assume V is univariate throughout this paper. Let \(\eta _{ik}\) be the kth entry of the vector \(\eta _{i}\) for \(i=1, \ldots , n\). To estimate \(\xi _{k}(v)\), the kth component of \(\xi (v)\), we employ the local linear smoothing technique (Fan and Gijbels 1996); that is, we minimize

$$\begin{aligned} \sum _{i=1}^{n}\big \{\eta _{ik}-c_{0k}-c_{1k}(V_{i}-v)\big \}^{2}L_{b_{k}}(V_{i}-v) \end{aligned}$$
(3)

with respect to \(c_{0k}, c_{1k}\), where \(L_{b}(\cdot )=L(\cdot /b)/b\) with \(L(\cdot )\) a kernel function and \(b=b_{k}\) (\(k=1, \ldots , d\)) a bandwidth. Let \(\hat{c}_{0k}, \hat{c}_{1k}\) be the minimizers of (3). Write

$$\begin{aligned} \hat{\xi }_{k}(v)=\hat{c}_{0k}=\frac{D_{20,k}(v)D_{01,k}(v)-D_{10,k}(v)D_{11,k}(v)}{D_{00,k}(v)D_{20,k}(v)-D_{10,k}^{2}(v)}, \end{aligned}$$
(4)

where \(D_{s_{1}s_{2},k}(v)=\sum _{i=1}^{n}L_{b_{k}}(V_{i}-v)(V_{i}-v)^{s_{1}} \eta _{ik}^{s_{2}}\) for \(s_{1}=0,1,2\), \(s_{2}=0,1\), \(k=1, \ldots , d\).
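For a computational view of the calibration step, the closed form (4) can be implemented directly. The following is a minimal Python sketch under our own naming (the function name and the default Epanechnikov kernel are illustrative, not part of the original procedure):

```python
import numpy as np

def local_linear_calibrate(eta_k, V, v, b,
                           L=lambda t: 0.75 * (1.0 - t**2) * (np.abs(t) <= 1)):
    """Local linear estimate hat{xi}_k(v) in (4) from the surrogate sample
    (eta_k, V); L is the kernel and b the bandwidth b_k."""
    w = L((V - v) / b) / b                  # L_{b_k}(V_i - v)
    d = V - v
    D00, D10, D20 = w.sum(), (w * d).sum(), (w * d**2).sum()
    D01, D11 = (w * eta_k).sum(), (w * d * eta_k).sum()
    return (D20 * D01 - D10 * D11) / (D00 * D20 - D10**2)
```

Evaluating this at each observed \(V_{i}\) gives the calibrated covariates \(\hat{\xi }_{i}\) used below.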

We now list the regularity conditions needed for the proposition and theorems that follow.

  1. (C1)

    \(q_{2}(x, y)<0\) for \(x\in \mathbb {R}\) and y in the range of the response variable.

  2. (C2)

    The functions \(T''(\cdot )\) and \(g'''(\cdot )\) are continuous.

  3. (C3)

    The random variable U has bounded support \(\mathcal {U}\). The elements of the function \(\alpha ''(u)\) are continuous in \(u\in \mathcal {U}\).

  4. (C4)

    The density functions \(f_{U}(u)\), \(f_{V}(v)\) of U, V are Lipschitz continuous and bounded away from 0 and infinity on their supports, respectively. Moreover, the joint density function \(f_{U, V}(u, v)\) of (U, V) is continuous on the support \(\mathcal {U}\times \mathcal {V}\).

  5. (C5)

    Let \(Z=\beta ^{T}\xi +\theta ^{T}W+\alpha (U)^{T}X\). Then \(E\big [q_{l}^{s}(Z,Y)N^{\otimes 2}\big |U=u\big ]\), \(E\big [q_{l}^{s}(Z, Y)N^{\otimes 2}\big |V=v\big ]\) and \(E\big [q_{l}^{s}(Z, Y)N^{\otimes 2}\big |U=u, V=v\big ]\) for \(l=1,2\), \(s=1,2\) are Lipschitz continuous and twice differentiable in \(u\in \mathcal {U}\) and \(v\in \mathcal {V}\). Moreover, \(E\{q_{2}^{2}(Z, Y)\}< \infty \), \(E\{q_{1}^{2+\delta }(Z, Y)\}< \infty \) for some \(\delta >2\), and \(E\big [\rho _{2}(Z)N^{\otimes 2}\big |U=u\big ]\) is nonsingular for each \(u\in \mathcal {U}\).

  6. (C6)

    The kernel functions \(K(\cdot )\) and \(L(\cdot )\) are bounded, continuous and symmetric univariate density functions satisfying \(\int t^{2}K(t)\mathrm{d}t \not = 0 \), \(\int t^{2}L(t)\mathrm{d}t\not =0\), and \(\int |t|^{j}K(t)\mathrm{d}t <\infty \), \(\int |t|^{j}L(t)\mathrm{d}t <\infty \) for \(j=1,2,3,4\). Moreover, the second derivatives of \(K(\cdot )\) and \(L(\cdot )\) are bounded on \(\mathbb {R}\).

  7. (C7)

    The bandwidths h and b satisfy:

    1. (i)

      \(b=b_{k}\), \(k=1,\ldots , d\), \(b_{k}\asymp c_{b}h_{o}\) for some constant \(c_{b}>0\); \(h\asymp c_{h}h_{o}\) for some constant \(c_{h}>0\).

    2. (ii)

      \(h_{o}\rightarrow 0\) as \(n\rightarrow \infty \), \(nh_{o}^{2}/(\log h_{o}^{-1})^{4} \rightarrow \infty \), \(nh_{o}^{4}\rightarrow 0\).

  8. (C8)

    For all \(\lambda _{1j}\), \(\lambda _{2s}\), \(j=1, \ldots , d\), \(s=1, \ldots , r\), \(\lambda _{1j}\rightarrow 0\), \(\sqrt{n}\lambda _{1j}\rightarrow \infty \), \(\lambda _{2s}\rightarrow 0\), \(\sqrt{n}\lambda _{2s}\rightarrow \infty \), and \( \displaystyle \lim \inf _{n\rightarrow \infty } \lim \inf _{u\rightarrow 0^{+}}p'_{\lambda _{1j}}(u)/\lambda _{1j}>0\), \( \lim \inf _{n\rightarrow \infty } \lim \inf _{u\rightarrow 0^{+}}p'_{\lambda _{2s}}(u)/\lambda _{2s}>0. \)

Condition (C1) ensures that the local likelihood is concave and the solution unique. Conditions (C2) and (C3) are usual smoothness conditions (Li and Liang 2008). Condition (C4) is a technical condition commonly imposed in conventional nonparametric regression analysis. Condition (C5) is needed for Taylor expansions and ensures a finite asymptotic variance. Condition (C6) is commonly imposed for nonparametric kernel smoothing. Condition (C7) is generally required for the bandwidths h and \(b_{k}\) in the semiparametric setting. Condition (C8) is a technical condition involved in the SCAD variable selection procedure (Fan and Li 2001; Liang and Li 2009).

Proposition 1

Under the conditions (C4), (C6) and (C7), we have

$$\begin{aligned}&\hat{\xi }_{k}(v)-\xi _{k}(v)\nonumber \\&\quad =\frac{\mu _{L2}}{2}b_{k}^{2}\xi _{k}^{(2)}(v)+\frac{1}{nf_{V}(v)}\sum _{i=1}^{n}L_{b_{k}}(V_{i}-v)e_{ki}+o\left( b_{k}^{2} +\log b_{k}^{-1}/\sqrt{nb_{k}}\right) \nonumber \\ \end{aligned}$$
(5)

holds uniformly on \(v \in \mathcal {V}\), where \(\mu _{Lj}=\int u^{j}L(u)\mathrm{d}u\), \(\xi _{k}^{(2)}(v)\) is the second derivative of \(\xi _{k}(v)\), and \(e_{ki}\) is the kth component of \(e_{i}\), \(i=1, \ldots , n\).

The proof of (5) can be completed in a way similar to Zhou and Liang (2009).

2.2 Quasi-likelihood-based estimation

After calibrating \(\xi \), we fit the “synthetic” data \(\{Y_{i}, U_{i}, \hat{\xi }_{i}, W_{i}, X_{i}; 1\le i \le n\}\) using the local likelihood principle (Fan and Gijbels 1996) to estimate \(\beta , \theta , \alpha (\cdot )\) based on the model:

$$\begin{aligned} g\big \{\mu \big (U, \hat{\xi }, W, X\big )\big \}= \beta ^{T}\hat{\xi }+\theta ^{T}W+\alpha (U)^{T}X. \end{aligned}$$
(6)

Specifically, let h be the bandwidth, \(K(\cdot ) \) be a kernel function satisfying Condition (C6), and \(K_{h}(\cdot )=h^{-1}K(\cdot /h)\). For each u in a neighborhood of U, we approximate \(\alpha _{j}(U)\) by \(\alpha _{0j}(u)+\alpha _{0j}'(u)(U-u)\), \(j=1, \ldots , p\). Let \(\alpha (u)=(\alpha _{01}(u),\ldots , \alpha _{0p}(u) )^{T}\), \(b(u)=(\alpha _{01}'(u), \ldots , \alpha _{0p}'(u))^{T}\). The estimators of \(\beta \), \(\theta \), the \(\alpha _{j}(u)\)’s and the \(\alpha _{j}'(u)\)’s are obtained by maximizing the following local quasi-likelihood function with respect to \(\alpha (u)\), b(u), \(\beta \), \(\theta \):

$$\begin{aligned}&\mathcal {L}_{loc}\big (\alpha (u),b(u), \beta , \theta \big )\nonumber \\&\quad =\sum _{i=1}^{n}\mathcal {Q}\left[ g^{-1}\big (\beta ^{T}\hat{\xi }_{i}+\theta ^{T}W_{i} +\alpha (u)^{T}X_{i}+b(u)^{T}X_{i}(U_{i}-u)\big ), Y_{i} \right] K_{h}(U_{i}-u),\nonumber \\ \end{aligned}$$
(7)

where \(\mathcal {Q}(x, y)\) is the quasi-likelihood function defined as \(\mathcal {Q}(x, y)=\int _{y}^{x}\frac{y-u}{T(u)}\hbox {d}u\). Denote the local quasi-likelihood estimators from (7) by \(\hat{\alpha }_{*}(u), \hat{b}_{*}(u), \hat{\beta }_{*}, \hat{\theta }_{*}\). As demonstrated in Lemma A.2 in the Appendix, these estimators are all \(\sqrt{nh}\)-consistent (or \(\sqrt{nh_{o}}\)-consistent, under Condition (C7)).
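As a concrete instance of this quasi-likelihood: for a Bernoulli response with \(T(u)=u(1-u)\), a direct calculation gives, up to an additive term free of x,

$$\begin{aligned} \mathcal {Q}(x, y)=y\log \frac{x}{1-x}+\log (1-x), \end{aligned}$$

which is the Bernoulli log-likelihood, so (7) reduces to a local likelihood for the logistic model used in Sect. 4.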

We now update the estimates of \(\beta \) and \(\theta \) using all the data, via a global quasi-likelihood procedure, to improve efficiency. Define

$$\begin{aligned} \mathcal {L}_{gol}\big (\beta , \theta \big ) =\sum _{i=1}^{n}\mathcal {Q}\left[ g^{-1}\big (\beta ^{T}\hat{\xi }_{i}+\theta ^{T}W_{i} +\hat{\alpha }_{*}(U_{i})^{T}X_{i}\big ), Y_{i} \right] , \end{aligned}$$
(8)

where \(\hat{\alpha }_{*}(u)\) is obtained from (7). We then obtain the global quasi-likelihood estimators \(\hat{\beta }\) and \(\hat{\theta }\) by maximizing \(\mathcal {L}_{gol}(\beta , \theta )\). The corresponding estimators have the same merit as one-step backfitting estimates. One may also consider a fully iterative backfitting algorithm or a profile likelihood approach to obtain estimators of \(\beta \), \(\theta \).
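To make the two-step procedure concrete, the following sketch maximizes (8) for the Bernoulli/logit case used in Sect. 4; it treats the calibrated \(\hat{\xi }_{i}\) and the local-step estimates \(\hat{\alpha }_{*}(U_{i})\) as already computed, and all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def neg_global_qlik(gamma, xi_hat, W, X, alpha_hat_U, Y):
    """Negative global quasi-likelihood (8) for a Bernoulli response with
    logit link; alpha_hat_U[i] holds hat{alpha}_*(U_i)."""
    d = xi_hat.shape[1]
    beta, theta = gamma[:d], gamma[d:]
    lin = xi_hat @ beta + W @ theta + np.sum(alpha_hat_U * X, axis=1)
    # Bernoulli quasi-likelihood: y * lin - log(1 + exp(lin)), stably computed
    return -np.sum(Y * lin - np.logaddexp(0.0, lin))

# gamma_hat = minimize(neg_global_qlik, gamma0,
#                      args=(xi_hat, W, X, alpha_hat_U, Y), method="BFGS").x
```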

In the following, we introduce some notation for presenting the properties of the estimators. Denote \(\mathrm{A}^{\otimes 2}= \mathrm{A}\mathrm{A}^{T}\) for any matrix or vector \(\mathrm{A}\). Let \(q_{\ell }(x,y)=\frac{{{\partial }}^{\ell } }{{\partial } x^{\ell }}\mathcal {Q}\{g^{-1}(x), y\}\) for \(\ell =1, 2\). Then \( q_{1}(x,y)=\{y-g^{-1}(x)\}\rho _{1}(x), q_{2}(x,y)=\{y-g^{-1}(x)\}\rho _{1}'(x)-\rho _{2}(x) \) with \(\rho _{\ell }(x)=\left\{ \frac{d g^{-1}(x)}{dx}\right\} ^{\ell }\big /\big [\sigma ^{2}T\{g^{-1}(x)\}\big ]\). Let \(Z=\beta ^{T}\xi +\theta ^{T}W+\alpha (U)^{T}X\), \(Q=(\xi ^{T}, W^{T})^{T}\), \(N=(\xi ^{T}, W^{T}, X^{T})^{T}\), \(\Sigma =E\left[ \rho _{2} (Z)Q^{\otimes 2}\right] \). Denote by \(\kappa _{k}(u)\) the kth element of \(E\big [\rho _{2}(Z)N^{\otimes 2}\big |U=u\big ]^{-1}{N}\), and by \(\iota _{k}(u,v)\) the kth element of \(E\big [\rho _{2} (Z)N^{\otimes 2}\big |U=u\big ]^{-1}E\big [\rho _{2} (Z)N\big |U=u, V=v\big ]\). Moreover,

$$\begin{aligned} \varGamma (u)= & {} \left\{ Q-\sum _{k=1}^{p}\kappa _{k}(u)E\big [\rho _{2} (Z)Q X_{k}\big |U=u\big ]\right\} q_{1}(Z, Y),\\ \varkappa _{k}(v)= & {} E\left[ Q X_{k}\rho _{2} (Z)\iota _{k}(U,v)\frac{f_{U,V}(U, v)}{f_{U}(U)f_{V}(v)}\right] , \\ \varLambda (v)= & {} \left\{ \sum _{k=1}^{p} \varkappa _{k}(v)-E\left[ \rho _{2} (Z)Q\big |V=v\right] \right\} e^{T}\beta . \end{aligned}$$

We have the following asymptotic results.

Theorem 1

Under Conditions (C1)–(C7) given in the Appendix, we have

$$\begin{aligned}&\sqrt{n}\big (\big (\hat{\beta }-\beta \big )^{T}, \big (\hat{\theta } -\theta \big )^{T} \big )^{T}\\&\quad \mathop {\longrightarrow }\limits ^{\mathcal {L}} N_{q}\left( 0, \Sigma ^{-1}E\left[ \varGamma (U)^{\otimes 2}\right] \Sigma ^{-1}+ \Sigma ^{-1}E\left[ \varLambda (V)^{\otimes 2}\right] \Sigma ^{-1}\right) . \end{aligned}$$

Remark 1

To ensure Theorem 1 holds, undersmoothing is necessary. This strategy concurs with that adopted in modeling GPLM (Severini and Staniswalis 1994). In the asymptotic variance, the first term \(\Sigma ^{-1}E\left[ \varGamma (U)^{\otimes 2}\right] \Sigma ^{-1}\) is similar to that obtained by Li and Liang (2008), while the second term \(\Sigma ^{-1}E\left[ \varLambda (V)^{\otimes 2}\right] \Sigma ^{-1}\) arises from the impact of measurement error and the bias correction based on the ancillary variable V.

Bandwidth selection The proposed procedure involves the bandwidths \(b_{k}\) and h, which need to be selected. As indicated in Zhou and Liang (2009), undersmoothing is necessary when we estimate \(\xi \), so the optimal bandwidth for \(b_{k}\) cannot be used: undersmoothing \(\xi \) keeps the bias small but precludes the optimal bandwidth for \(b_{k}\). As suggested by Carroll et al. (1997), an ad hoc but reasonable choice is \(O(n^{-1/5})\times n^{-2/15}=O(n^{-1/3})\), that is, \(b_{k}=C_{1}n^{-1/3}\) for a positive constant \(C_{1}\). One can use a plug-in rule to estimate the constant \(C_{1}\), e.g., \(b_{k}=\hat{\sigma }_{V}n^{-1/3}\). Alternatively, \(b_{k}\) can be chosen as \(b_{k}=n^{-2/15}\hat{b}_{k*}\), where \(\hat{b}_{k*}=\arg \min _{b_{*}}\mathrm{CV}_{k}(b_{*})\) with \(\mathrm{CV}_{k}(b_{*})=n^{-1}\sum _{i=1}^{n}\left\{ \eta _{ik}-\hat{\xi }^{(-i)}_{k,b_{*}} (V_{i})\right\} ^{2}\), and \(\hat{\xi }^{(-i)}_{k,b_{*}}(V_{i}) \) is computed analogously to (3) with bandwidth \(b_{*}\) from the data with the ith observation \((\eta _{i}, V_{i})\) deleted. To select h, we define the “leave-one-out” criterion \(h_{1}=\arg \min _{h_{*}}\sum _{i=1}^{n}\mathcal {Q}\left[ g^{-1} \big (\hat{\beta }_{-i}^{T}\hat{\xi }_{i}+\hat{\theta }_{-i}^{T}W_{i}+ \hat{\alpha }_{-i,h_{*}}(U_{i})^{T}X_{i}\big ), Y_{i} \right] \), where \(\hat{\beta }_{-i}\), \(\hat{\theta }_{-i}\) are obtained from (8), and \(\hat{\alpha }_{-i,h_{*}}(U_{i})\) is obtained from (7) with fixed bandwidth \(h_{*}\) and the leave-one-out sample \(\{Y_{j}, \hat{\xi }_{j}, W_{j}, X_{j}, U_{j}\}_{1\le j \not = i \le n}\).
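As an illustration of the two-stage choice of \(b_{k}\) (pilot cross-validation followed by the \(n^{-2/15}\) undersmoothing factor), here is a minimal sketch reusing the local_linear_calibrate helper sketched in Sect. 2.1; the names are ours:

```python
import numpy as np

def select_bandwidth_b(eta_k, V, grid):
    """Leave-one-out CV_k(b_*) over a candidate grid, then undersmooth:
    b_k = n^{-2/15} * argmin CV_k."""
    n = len(V)
    cv = []
    for b in grid:
        resid = [eta_k[i] - local_linear_calibrate(np.delete(eta_k, i),
                                                   np.delete(V, i), V[i], b)
                 for i in range(n)]              # hat{xi}_k^{(-i)}(V_i)
        cv.append(np.mean(np.square(resid)))
    return n ** (-2.0 / 15.0) * grid[int(np.argmin(cv))]
```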

2.3 Penalized quasi-likelihood-based variable selection

In this section, we consider the variable selection problem. We define the penalized quasi-likelihood as

$$\begin{aligned} \mathcal {L}_{P}\big (\beta , \theta \big )=\mathcal {L}_{gol}\big (\beta , \theta \big ) -n\sum _{j=1}^{d}p_{\lambda _{1j}}(|\beta _{j}|)-n\sum _{s=1}^{r}p_{\lambda _{2s}}(|\theta _{s}|), \end{aligned}$$
(9)

where \(p_{\lambda _{1j}}(\cdot )\), \(p_{\lambda _{2s}}(\cdot )\) are penalty functions, and \(\lambda _{1j}\) and \(\lambda _{2s}\) are tuning parameters. We choose distinct tuning parameters \(\lambda _{1}\)’s and \(\lambda _{2}\)’s for identifying the nonzero elements of \(\beta \) and \(\theta \). If we are only interested in selecting the W-variables, we set \(p_{\lambda _{1j}}(\cdot )=0\), \(j=1, \ldots , d\). Similarly, we can focus only on the \(\xi \)-variables.

We first briefly discuss the choice of penalty functions, of which there are many in the variable selection literature. An example is the \(L_{0}\)-penalty, \(p_{\lambda _{1j}}(|\beta _{j}|)=0.5 \lambda _{1j}^{2}I\{|\beta _{j}|\not =0\}\), where \(I\{\cdot \}\) is an indicator function. In particular, letting \(\lambda _{1j}=\sigma \sqrt{2/n}\), \(\sigma \sqrt{\log (n)/n}\) or \(\sigma \sqrt{\log (d)/n}\) yields the popular variable selection criteria AIC (Akaike 1973), BIC (Schwarz 1978) and RIC (Foster and George 1994), respectively. We adopt the SCAD penalty (Fan and Li 2001), whose first derivative is

$$\begin{aligned} p'_{\lambda }(\gamma )=\lambda \Big \{I(\gamma \le \lambda )+\frac{(a\lambda -\gamma )_{+}}{(a-1)\lambda }I(\gamma >\lambda )\Big \}, \end{aligned}$$

where \((s)_{+}=sI(s>0)\) denotes the positive part of s and \(a = 3.7\).
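The SCAD derivative is simple to code; a direct, vectorized transcription of the formula above (our own naming) is:

```python
import numpy as np

def scad_deriv(gamma, lam, a=3.7):
    """First derivative p'_lambda(gamma) of the SCAD penalty for gamma >= 0."""
    gamma = np.asarray(gamma, dtype=float)
    return lam * ((gamma <= lam)
                  + np.maximum(a * lam - gamma, 0.0) / ((a - 1.0) * lam)
                  * (gamma > lam))
```

Here lam may be a scalar or an array matched elementwise to gamma, which is convenient for the componentwise tuning parameters below.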

We next study the asymptotic properties of the resulting penalized quasi-likelihood estimates. Without loss of generality, assume that the first \(d_{1}\) components of \(\beta \) and the first \(r_{1}\) components of \(\theta \) are nonzero; that is, \(\beta _{s}\not =0\) for \(s=1,\ldots , d_{1}\), \(\theta _{l}\not =0\) for \(l=1, \ldots , r_{1}\), and \(\beta _{k}\equiv 0\) for \(k=d_{1}+1, \ldots , d\), \(\theta _{t}\equiv 0\) for \(t=r_{1}+1, \ldots , r\).

For notational simplicity, denote \(\mathcal {R}_{n,\lambda _{1}, \lambda _{2}}=(\mathcal {R}_{n,\lambda _{1}}^{T}, \mathcal {R}_{n,\lambda _{2}}^{T})^{T}\) with

$$\begin{aligned} \mathcal {R}_{n,\lambda _{1}}= & {} \{p'_{\lambda _{11}}(|\beta _{1}|)\mathrm{sign} (\beta _{1}), \ldots , p'_{\lambda _{1d_{1}}}(|\beta _{d_{1}}|)\mathrm{sign} (\beta _{d_{1}})\}^{T},\\ \mathcal {R}_{n,\lambda _{2}}= & {} \{ p'_{\lambda _{21}}(|\theta _{1}|)\mathrm{sign} (\theta _{1}), \ldots , p'_{\lambda _{2r_{1}}}(|\theta _{r_{1}}|)\mathrm{sign} (\theta _{r_{1}})\}^{T}, \end{aligned}$$

and we further define

$$\begin{aligned} a^{*}_{n}= & {} \max _{1\le j\le d}\{p'_{\lambda _{1j}}(|\beta _{j}|): \beta _{j}\not = 0\}, \quad b^{*}_{n}=\max _{1\le s\le r}\{ p'_{\lambda _{2s}}(|\theta _{s}|): \theta _{s}\not = 0\}, \\ a^{**}_{n}= & {} \max _{1\le j\le d}\{p''_{\lambda _{1j}}(|\beta _{j}|): \beta _{j}\not = 0\}, \quad b^{**}_{n}=\max _{1\le s\le r}\{p''_{\lambda _{2s}}(|\theta _{s}|): \theta _{s}\not = 0\}, \\ \Sigma _{n,\lambda _{1}, \lambda _{2}}= & {} \mathrm{diag}\{p''_{\lambda _{11}}(|\beta _{1}|), \ldots , p''_{\lambda _{1d_{1}}}(|\beta _{d_{1}}|), p''_{\lambda _{21}}(|\theta _{1}|), \ldots , p''_{\lambda _{2r_{1}}}(|\theta _{r_{1}}|)\}. \end{aligned}$$

Denote the resulting penalized estimators from (9) by \(\hat{\beta }_{\lambda _{1}}\), \(\hat{\theta }_{\lambda _{2}}\). We have the following asymptotic results.

Theorem 2

Under Conditions (C1)–(C8) given in the Appendix, and supposing further that \(a_{n}^{*}=O(n^{-1/2})\), \(b_{n}^{*}=O(n^{-1/2})\), \(a_{n}^{**}\rightarrow 0\) and \(b_{n}^{**}\rightarrow 0\), there exist local maximizers \(\hat{\beta }_{\lambda _{1}}\), \(\hat{\theta }_{\lambda _{2}}\) of (9) such that \(\hat{\beta }_{\lambda _{1}}=\beta +O_{P}(n^{-1/2})\) and \(\hat{\theta }_{\lambda _{2}}=\theta +O_{P}(n^{-1/2})\).

We further introduce notation for presenting the oracle properties of the resulting penalized likelihood estimates. Without loss of generality, write \(\beta =(\beta _{(1)}^{T}, \beta _{(2)}^{T})^{T}\), \(\theta =(\theta _{(1)}^{T}, \theta _{(2)}^{T})^{T}\), where \(\beta _{(1)}\) and \(\theta _{(1)}\) are the \(d_{1}\) and \(r_{1}\) nonzero components of \(\beta \) and \(\theta \), respectively, and \(\beta _{(2)}\) and \(\theta _{(2)}\) are \((d-d_{1})\times 1\) and \((r-r_{1})\times 1\) zero vectors. Accordingly, \(\xi _{(1)}\) and \(W_{(1)}\) are the first \(d_{1}\) covariates of \(\xi \) and the first \(r_{1}\) covariates of W. Let \(Z_{(1)}=\beta _{(1)}^{T}\xi _{(1)}+\theta _{(1)}^{T}W_{(1)}+\alpha (U)^{T}X\), \(Q_{(1)}=(\xi _{(1)}^{T}, W_{(1)}^{T})^{T}\), \(N_{(1)}=(\xi _{(1)}^{T}, W_{(1)}^{T}, X^{T})^{T}\), and let \(e_{(1)}\) be the first \(d_{1}\) components of the error e. Moreover, \(\Sigma _{(1)}\), \(\varGamma _{(1)}(u)\) and \(\varLambda _{(1)}(v)\), and the terms involved in their definitions, are obtained from \(\Sigma , \varGamma (u), \varLambda (v)\) by substituting \(\beta , Z, Q, N, e\) with \(\beta _{(1)}, Z_{(1)}, Q_{(1)}, N_{(1)}, e_{(1)}\), respectively.

Theorem 3

Under Conditions (C1)–(C8), the penalized estimators \(\hat{\beta }_{\lambda _{1}}=\big (\hat{\beta }_{\lambda _{1}(1)}^{T}, \hat{\beta }_{\lambda _{1}(2)}^{T}\big )^{T}\) and \(\hat{\theta }_{\lambda _{2}}= \big (\hat{\theta }_{\lambda _{2}(1)}^{T}, \hat{\theta }_{\lambda _{2}(2)}^{T}\big )^{T}\) satisfy: (a) with probability tending to one, \( \hat{\beta }_{\lambda _{1}(2)}=\mathbf{0}\), \( \hat{\theta }_{\lambda _{2}(2)}=\mathbf{0}\); and (b) \(\hat{\beta }_{\lambda _{1}(1)}\) and \(\hat{\theta }_{\lambda _{2}(1)}\) are asymptotically normal, i.e.,

$$\begin{aligned}&\sqrt{n}\big (\Sigma _{(1)}+\Sigma _{n,\lambda _{1}, \lambda _{2}}\big )\Big \{\big (\big (\hat{\beta }_{\lambda _{1}(1)}-\beta _{(1)}\big )^{T}, \big (\hat{\theta }_{\lambda _{2}(1)}-\theta _{(1)}\big )^{T}\big )^{T}\\&\quad +\big (\Sigma _{(1)}+\Sigma _{n,\lambda _{1}, \lambda _{2}}\big )^{-1}\mathcal {R}_{n, \lambda _{1}, \lambda _{2}}\Big \}\\&\quad \mathop {\longrightarrow }\limits ^{\mathcal {L}} N_{d_{1}+r_{1}}\left( \mathbf{0}, \Sigma _{(1)}^{-1}E\left[ \varGamma _{(1)}(U)^{\otimes 2}\right] \Sigma _{(1)}^{-1}+ \Sigma _{(1)}^{-1}E\left[ \varLambda _{(1)}(V)^{\otimes 2}\right] \Sigma _{(1)}^{-1}\right) . \end{aligned}$$

Remark 2

Theorem 3 indicates that the proposed variable selection procedure possesses the oracle property with proper choices of the tuning parameters \(\lambda _{1j}\)’s, \(\lambda _{2s}\)’s. If we further require that \(\sqrt{n}\mathcal {R}_{n, \lambda _{1}, \lambda _{2}}\rightarrow 0\) and \(\Sigma _{n, \lambda _{1}, \lambda _{2}}\rightarrow 0\), the asymptotic variance simplifies to the sum of \(\Sigma _{(1)}^{-1}E\left[ \varGamma _{(1)}(U)^{\otimes 2}\right] \Sigma _{(1)}^{-1}\) and \(\Sigma _{(1)}^{-1}E\left[ \varLambda _{(1)}(V)^{\otimes 2}\right] \Sigma _{(1)}^{-1}\).

Choice of \(\lambda _{j}\)’s. We adopt the data-driven GCV procedure proposed by Li and Liang (2008) to select the tuning parameters \(\lambda _{1}\)’s, \(\lambda _{2}\)’s in a \((d+r)\)-dimensional space. Let \(\lambda _{1j}=\lambda \times \mathrm{Se}(\hat{\beta }_{j})\), \(\lambda _{2i}=\lambda \times \mathrm{Se}(\hat{\theta }_{i})\), where \(\mathrm{Se}(\hat{\beta }_{j})\) and \(\mathrm{Se}(\hat{\theta }_{i})\) are the estimated standard errors of \(\hat{\beta }_{j}, \hat{\theta }_{i} \). The minimization over the \(\lambda _{1}\)’s, \(\lambda _{2}\)’s is thus simplified to a one-dimensional minimization over \(\lambda \). We first introduce the estimation procedure for the standard errors, which can be obtained from the estimated covariance matrix \(\widehat{\mathrm{Cov}}(\hat{\gamma })\), where \(\hat{\gamma }=(\hat{\beta }^{T}, \hat{\theta }^{T})^{T}\) is obtained from (8). Write \(\ell '(\gamma )= \frac{\partial \mathcal {L}_{gol}({\beta },{ \theta })}{\partial {\gamma }}\), \(\ell ''(\gamma )=\frac{\partial ^{2}\mathcal {L}_{gol}({\beta },{ \theta })}{\partial {\gamma } \partial {\gamma }^{T} }\), \(\gamma =(\beta ^{T}, \theta ^{T})^{T}\) and

$$\begin{aligned} \Sigma _{n,\lambda _{1}, \lambda _{2}}^{*}=\mathrm{diag}\left( \frac{p'_{\lambda _{11}}(|\beta _{1}|)}{|\beta _{1}|}, \ldots , \frac{p'_{\lambda _{1d}}(|\beta _{d}|)}{|\beta _{d}|}, \frac{p'_{\lambda _{21}} (|\theta _{1}|)}{|\theta _{1}|}, \ldots , \frac{p'_{\lambda _{2r}}(|\theta _{r}|)}{|\theta _{r}|}\right) . \end{aligned}$$
(10)

A sandwich formula for the covariance matrix of the estimates \(\hat{\gamma } =\left( \hat{\beta }^{T}, \hat{\theta }^{T}\right) ^{T}\) is given by

$$\begin{aligned} \widehat{\mathrm{Cov}}(\hat{\gamma })=\left\{ \ell ''(\hat{\gamma })-n \Sigma _{n,\lambda _{1}, \lambda _{2}}^{*}\right\} ^{-1}\widehat{\mathrm{Cov}}(\ell '(\hat{\gamma }))\left\{ \ell ''(\hat{\gamma })-n\Sigma _{n, \lambda _{1}, \lambda _{2}}^{*}\right\} ^{-1}. \end{aligned}$$

Write \(e(\lambda )=\mathrm{tr}\left\{ \left\{ \ell ''(\hat{\gamma })- n\Sigma _{n,\lambda , \lambda }^{*}\right\} ^{-1}\ell ''(\hat{\gamma })\right\} , \) where \(\Sigma _{n,\lambda , \lambda }^{*}\) is obtained from (10) by substituting \(\lambda _{1j}, \lambda _{2i}\) with \(\lambda \times \mathrm{Se}(\hat{\beta }_{j}), \lambda \times \mathrm{Se}(\hat{\theta }_{i})\), respectively. The GCV statistic is defined by

$$\begin{aligned} \mathrm{GCV}(\lambda )=\frac{\sum _{i=1}^{n}\mathcal {D}\left\{ Y_{i}, g^{-1}(\hat{\alpha }^{T}(U_{i})X_{i}+\hat{\xi }_{i}^{T}\hat{\beta } (\lambda )+W_{i}^{T}\hat{\theta }(\lambda ))\right\} }{n\{1-e(\lambda )/n\}^{2}}, \end{aligned}$$

where \(\mathcal {D}\{Y, \mu \}\) denotes the deviance of Y corresponding to the model fitted with \(\lambda \). The minimizer of \(\mathrm{GCV} (\lambda )\) with respect to \(\lambda \) can be obtained by a grid search.
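A sketch of the GCV computation under the same illustrative naming (dev is the total deviance of the fit at \(\lambda \), hess the Hessian \(\ell ''(\hat{\gamma })\), se_gamma the estimated standard errors; scad_deriv is the helper sketched above):

```python
import numpy as np

def gcv(lam, gamma_lam, dev, hess, se_gamma, n, eps=1e-8):
    """GCV(lambda) with e(lambda) = tr{[l'' - n Sigma*]^{-1} l''}."""
    lam_vec = lam * se_gamma                          # lambda_{1j}, lambda_{2s}
    Sigma_star = np.diag(scad_deriv(np.abs(gamma_lam), lam_vec)
                         / np.maximum(np.abs(gamma_lam), eps))   # as in (10)
    e_lam = np.trace(np.linalg.solve(hess - n * Sigma_star, hess))
    return dev / (n * (1.0 - e_lam / n) ** 2)
```

The minimizing \(\lambda \) is then found by evaluating gcv over a grid, refitting (9) at each grid point.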

3 Statistical inference for nonparametric components

In this section, we consider a refined estimator of \(\alpha (u)\) and propose a generalized likelihood ratio test to select significant components of X.

3.1 Refined estimator of nonparametric component

After we obtain the final estimators of \(\beta \) and \(\theta \) from Sect. 2.2, the estimator of \(\alpha (u)\) can be refined by maximizing the following local likelihood function:

$$\begin{aligned}&\mathcal {L}^{*}_{loc}\big (\alpha (u),b(u)\big )\nonumber \\&\quad =\sum _{i=1}^{n}\mathcal {Q}\left[ g^{-1}\big (\hat{\beta }^{T}\hat{\xi }_{i}+ \hat{\theta }^{T}W_{i}+\alpha (u)^{T}X_{i}+b(u)^{T}X_{i}(U_{i}-u)\big ), Y_{i} \right] K_{h}(U_{i}-u)\nonumber \\ \end{aligned}$$
(11)

with respect to \(\alpha (u)\) and b(u). Let \(\hat{\alpha }(u)\) be the maximizer of (11). We have the following asymptotic result.

Theorem 4

Under Conditions (C1)–(C7), we have

$$\begin{aligned}&\sqrt{nh}\Bigg (\hat{\alpha }(u)-\alpha (u)-\frac{h^{2}\mu _{K2}}{2}\alpha ''(u) +\frac{b^{2}\mu _{L2} }{2}\Sigma _{X}(u)^{-1}E\left[ \rho _{2}(Z)\xi ^{(2)}(V)^{T}\beta {X} \Big | U=u\right] \Bigg )\\&\quad \mathop {\longrightarrow }\limits ^{\mathcal {L}} N\left( \mathbf{0}, \frac{v_{K_{0}}}{f_{U}(u)}\Sigma _{X}(u)^{-1}\right) , \end{aligned}$$

where \(\Sigma _{X}(u)=E\left[ \rho _{2}(Z){X}^{\otimes 2} \Big |U=u\right] \), and \(\mu _{K_{2}}=\int t^{2}K(t)\mathrm{d}t\), \(\mu _{L_{2}}=\int t^{2}L(t)\mathrm{d}t\), \(v_{K_{0}}=\int K^{2}(t)\mathrm{d}t\).

Remark 3

The second term in the asymptotic bias of \(\hat{\alpha }(u)\) arises from calibrating the error-prone covariates. In fact, both bias terms, of orders \(O(h^{2})\) and \(O(b^{2})\), can be eliminated if we adopt the undersmoothing strategy required for \(\hat{\beta }\), \(\hat{\theta }\) to be root-n consistent. In that case, the bias of \(\hat{\alpha }(u)\) tends to zero and \(\hat{\alpha }(u)\) is \(\sqrt{nh_{o}}\)-consistent.

3.2 Variable selection for nonparametric component

It is of interest to select the nonzero components of \(\alpha (u)\) to improve model prediction. In this section, we adopt the GLRT proposed by Fan et al. (2001) to detect significant components of X, implemented via a backward elimination procedure. In each step, we test \({H_{0}}: \alpha _{j_{1}}(u)=\cdots =\alpha _{j_{k}}(u)=0\) versus \({H_{1}}: \alpha _{j_{l}}(u)\not =0 \mathrm{~for~some~} l.\) For ease of presentation, we consider the following hypotheses:

$$\begin{aligned} {H_{0}}: \alpha _{1}(u)=\cdots =\alpha _{p}(u)=0 \qquad \mathrm{versus} \qquad { H_{1}}: \alpha _{j}(u)\not =0 \mathrm{~for~some~} j. \end{aligned}$$
(12)

Let \(\hat{\alpha }(u)\), \(\hat{\beta }\), \(\hat{\theta }\) be the estimators obtained from (8) and (11) under the alternative hypothesis, and \(\bar{\beta }\) and \(\bar{\theta }\) be the estimators of \(\beta \), \(\theta \) under the null hypothesis. Write

$$\begin{aligned} \mathcal {H}_{1}=\sum _{i=1}^{n}\mathcal {Q}\left\{ g^{-1}\left( \hat{\alpha } (U_{i})^{T}X_{i}+\hat{\beta }^{T}\hat{\xi }_{i}+\hat{\theta }^{T}W_{i}\right) ,Y_{i}\right\} \end{aligned}$$

and

$$\begin{aligned} \mathcal {H}_{0}=\sum _{i=1}^{n}\mathcal {Q}\left\{ g^{-1}\left( \bar{\beta }^{T} \hat{\xi }_{i}+\bar{\theta }^{T}W_{i}\right) ,Y_{i}\right\} . \end{aligned}$$

Following Fan et al. (2001) and Li and Liang (2008), we define the GLRT statistic

$$\begin{aligned} \mathcal {T}_{\mathrm{GLR}}= \mathcal {H}_{1}- \mathcal {H}_{0}. \end{aligned}$$

Define \(v_{L_{0}}=\int L^{2}(t)\mathrm{d}t\), \(v_{K_{0}}=\int K^{2}(t)\mathrm{d}t\) and \(\sigma _{K}^{2}=2 p\left\{ \int [2K(t)-K*K(t)]^{2}\mathrm{d}t\right\} ^{2}|\mathcal {U}|\), with \(|\mathcal {U}|\) the length of the support of U, and let \(\sigma _{L}^{2}=2\left\{ \int [L*L(t)]^{2}\mathrm{d}t\right\} ^{2}E\Big \{\frac{\{E[\rho _{2}({Z})|{V}]\}^{2}}{f_{V}({V})}\Big \}(\beta ^{T}\Sigma _{e}\beta )^{2}\), where \(K*K(t), L*L(t)\) are the convolutions of K(t), L(t), respectively, and \(c_{b}\), \(c_{h}\) are the positive constants in Condition (C7). We have the following theorem.

Theorem 5

Under Conditions (C1)–(C7), \(r_{LK}(\mathcal {T}_{\mathrm{GLR}} -\chi _{df_n}^2)\mathop {\longrightarrow }\limits ^\mathcal{{L}}0\) under the null hypothesis \(H_0\), where

$$\begin{aligned} r_{LK}= & {} \dfrac{8c_{b}^{-1}v_{L_{0}}\beta ^{T}\Sigma _{e}\beta E\left\{ \frac{E[\rho _{2}({Z})|{V}]}{f_{V}({V})}\right\} + 8c_{h}^{-1} p |\mathcal {U}| \left[ K(0)-0.5 v_{K_{0}}\right] }{ c_{b}^{-1} \sigma _{L}^{2} + c_{h}^{-1}\sigma _{K}^{2}}, ~~\text {and} \\ df_{n}= & {} \frac{v_{L_{0}}\beta ^{T}\Sigma _{e}\beta }{b}E\left[ \frac{E[\rho _{2}(Z)|V]}{f_{V}(V)}\right] +\frac{p |\mathcal {U}|}{h}\left[ K(0)-0.5 v_{K_{0}}\right] . \end{aligned}$$

Remark 4

Theorem 5 shows that a Wilks-type phenomenon holds for GVCPLMeM. The first part of \(df_{n}\) reflects the effect of the measurement error and the ancillary variable. As indicated in Li and Liang (2008), this generalized likelihood ratio theory can be justified by an empirical procedure, such as Monte Carlo simulation or a bootstrap procedure, since the degrees of freedom \(df_{n}\) tend to infinity as the sample size n increases. It is worth mentioning that the leading order of the degrees of freedom \(r_{LK}df_{n}\) cannot be obtained as in Fan et al. (2001), because \(\Sigma _{e}\), \(\beta \) and \(E\left[ \frac{E[\rho _{2}(Z)|V]}{f_{V}(V)}\right] \) are usually unknown in practice and need to be estimated from the data, and their estimators may not perform well when the sample size is small or moderate. Moreover, the constants \(c_{b}, c_{h}\) involved in Condition (C7) for the bandwidths h, b are also unknown. If the covariate \(\xi \) can be observed without measurement error, i.e., \(\Sigma _{e}=0\), then \(c_{b}, c_{h}\) vanish from \( r_{LK}\) and \( df_{n}\), and the calibration formulas for degrees of freedom proposed by Zhang (2004) can be applied directly. For these reasons and for practical purposes, one can follow the conditional bootstrap procedure suggested by Zhou and Liang (2009) and Cai et al. (2000) to estimate the null distribution of \(\mathcal {T}_{\mathrm{GLR}}\).
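A minimal sketch of that conditional bootstrap for a Bernoulli response (glr_stat is a placeholder for a routine that refits both hypotheses on a data set and returns \(\mathcal {H}_{1}-\mathcal {H}_{0}\), and mu0_hat holds the fitted null means, with covariates held fixed):

```python
import numpy as np

def bootstrap_pvalue(T_obs, mu0_hat, covariates, glr_stat,
                     n_boot=500, seed=1):
    """Conditional bootstrap null distribution of T_GLR: resample responses
    from the null fit, recompute the statistic, and compare with T_obs."""
    rng = np.random.default_rng(seed)
    T_boot = np.empty(n_boot)
    for b in range(n_boot):
        Y_star = rng.binomial(1, mu0_hat)        # Y* ~ Bernoulli(mu0_hat)
        T_boot[b] = glr_stat(Y_star, covariates)
    return np.mean(T_boot >= T_obs)              # bootstrap p value
```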

Remark 5

Under Conditions (C1)–(C7), the estimators under the null hypothesis satisfy

$$\begin{aligned} \sqrt{n}\big (\big (\bar{\beta }-\beta \big )^{T}, \big (\bar{\theta } -\theta \big )^{T} \big )^{T}\mathop {\longrightarrow }\limits ^{\mathcal {L}} N_{q}\big (0, \bar{\Sigma }^{-1}\bar{\varGamma }\bar{\Sigma }^{-1}+ \bar{\Sigma }^{-1}E\left[ \bar{\varLambda }(V)^{\otimes 2}\right] \bar{\Sigma }^{-1}\big ), \end{aligned}$$
(13)

where \(\bar{\Sigma }=E\left[ \rho _{2}(Z_{*})Q^{\otimes 2}\right] \), \(\bar{\varGamma }=E\left[ q^{2}_{1}(Z_{*},Y)Q^{\otimes 2}\right] \), \(\bar{\varLambda }(v)=E\left[ \rho _{2}(Z_{*})Q|V=v\right] e^{T}\beta \) and \(Z_{*}=\beta ^{T}\xi +\theta ^{T}W\). The asymptotic relative efficiency (ARE) of \(\bar{\beta }, \bar{\theta }\) with respect to \(\hat{\beta }, \hat{\theta }\) obtained in (8) is

$$\begin{aligned} \mathrm{ARE}\left( (\bar{\beta }, \bar{\theta }), (\hat{\beta }, \hat{\theta })\right) = \frac{\Vert {\Sigma }\Vert ^{2/q}_{D}}{\Vert \bar{\Sigma }\Vert ^{2/q}_{D}} \frac{\Vert \bar{\varGamma }+E\left[ \bar{\varLambda }(V)^{\otimes 2}\right] \Vert ^{1/q}_{D}}{\Vert E\left[ {\varGamma }(U)^{\otimes 2}\right] +E\left[ {\varLambda }(V)^{\otimes 2}\right] \Vert ^{1/q}_{D}}, \end{aligned}$$

where \(\Vert \cdot \Vert _{D}\) denotes the determinant of the corresponding covariance matrix.

4 Numerical studies

In this section, we conduct simulation studies to assess the performance of the proposed method, and then apply the method to a real data set from a diabetes study. We used the Epanechnikov kernel function \(L(t)=K(t)=0.75(1-t^{2})_{+}\) in all the numerical studies below. Note that Condition (C7) means the optimal bandwidth cannot be used because undersmoothing is necessary. As such, we used the rule of thumb (Silverman 1986): the smoothing parameter b was chosen as \(\hat{\sigma }_{V}n^{-1/3}\), where \(\hat{\sigma }_{V}\) is the sample standard deviation of V. This choice of b naturally meets Condition (C7). As pointed out in Remark 1, undersmoothing for h is a usual requirement in fitting generalized semiparametric models.

In our simulation studies, we generated 500 data sets, each consisting of \(n=500\) or \(n=1000\) observations, from the semiparametric varying coefficient logistic regression model:

$$\begin{aligned} \text{ logit }\{p(U, \xi , W, X )\} =\xi ^{T}\beta +\theta ^{T}W+\alpha (U)^{T}X \end{aligned}$$
(14)

with covariates, nonparametric functions and parameters being explicitly specified below.

4.1 Simulation studies

Example 1

\(\beta =2\), \(\theta =(3, 1.5, 2)\) or \(\beta =0.2\), \(\theta =(0.3, 0.15, 0.2)\). \(X=(1, X)^{T}\) with \(X\sim N(0,1)\), \(\alpha (u)=(\alpha _{1}(u), \alpha _{2}(u))^{T}\), \(\alpha _{1}(u)=\exp (2u-1)\), \(\alpha _{2}(u)=2\sin ^{2}(2\pi u)\). \(\xi \) is unobserved and recovered through \((\eta , V)\) via \(\eta =\xi (V)+e\) with \(\xi (V)= 3V-\cos (V)\); \(V\sim N(0, 0.5^{2})\) and is independent of (U, W, X); e follows \(N(0, 0.5^{2})\) and is independent of (U, V, W, X). We consider three cases: (i) W is independent of U, \(W\sim N(\mathbf{0}, \Sigma _{W})\), \(\Sigma _{W}=(\sigma _{w,ij})\) with \(\sigma _{w, ij}=0.25^{|i-j|}\), \(U\sim \mathrm{Unif}[0,1]\). (ii) \((W^{T}, U)^{T}\) follows \(\mathrm{Unif}[-1,1]\), and \(\mathrm{Var}((W^{T}, U)^{T})=(\sigma _{ij})\) with \(\sigma _{ij}=0.5^{|i-j|}\). (iii) The first component of W is 0 with probability 0.5 and 1 with probability 0.5, the remaining components of W are normally distributed with mean 0 and \(\mathrm{Var}(W)=(\sigma _{w,i'j'})\) with \(\sigma _{w, i'j'}=0.5^{|i'-j'|}\), and \(U\sim \mathrm{Unif}[0,1]\) is independent of W. In this example, we use the bandwidth \(h=3\times n^{-1/3}\). A data-generating sketch for case (i) is given below.
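For reproducibility, a sketch of the data-generating mechanism of case (i) (our code, matching the specification above):

```python
import numpy as np

def generate_example1(n, rng, beta=2.0, theta=np.array([3.0, 1.5, 2.0])):
    """One data set from model (14) under case (i) of Example 1."""
    U = rng.uniform(0.0, 1.0, n)
    V = rng.normal(0.0, 0.5, n)
    xi = 3.0 * V - np.cos(V)                        # xi(V)
    eta = xi + rng.normal(0.0, 0.5, n)              # surrogate eta = xi(V) + e
    idx = np.arange(3)
    Sigma_W = 0.25 ** np.abs(np.subtract.outer(idx, idx))
    W = rng.multivariate_normal(np.zeros(3), Sigma_W, n)
    X = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])
    alpha = np.column_stack([np.exp(2.0 * U - 1.0),
                             2.0 * np.sin(2.0 * np.pi * U) ** 2])
    lin = beta * xi + W @ theta + np.sum(alpha * X, axis=1)
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))
    return Y, U, eta, V, W, X

# Y, U, eta, V, W, X = generate_example1(500, np.random.default_rng(0))
```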

The simulation results for the benchmark estimator (i.e., all covariates measured exactly), the proposed estimator and the naive estimator (using \(\eta \) directly) are presented in Tables 1 and 2, which report the means and associated standard errors of \((\hat{\beta }, \hat{\theta } )\). The estimates based on the proposed procedure and the benchmark procedure are close to the true values in all three cases, which indicates that our proposed method is promising. The naive estimator, however, is severely biased and performs worse, especially when the sample size is \(n=500\).

Table 1 The simulation results for Example 1
Table 2 The simulation results for Example 1

Example 2

In this example, we examined the performance of the proposed variable selection procedure by comparing it with traditional subset selection criteria such as AIC, BIC and RIC under model (14). Let \(\beta =(-0.5, 0.5)^{T}\). X and \(\alpha (u)\) are the same as in Example 1. \(\xi (V)=(\xi _{1}(V), \xi _{2}(V))^{T}\), \(\xi _{1}(V)=2\cos (V)\), \(\xi _{2}(V)=0.1\exp (V)+3\sin (V)\), and the ancillary variable is \(\eta =(\eta _{1}, \eta _{2})^{T}\) with \(\eta _{1}=\xi _{1}(V)+e_{1}\), \(\eta _{2}=\xi _{2}(V)+e_{2}\). V is independent of \((e_{1}, e_{2})^{T}\) and follows N(0, 1). \((e_{1}, e_{2})^{T}\) follows \(N_{2}((0, 0)^{T}, \Sigma _{e})\) with \(\Sigma _{e}=(\sigma _{e,ij})_{1\le i, j \le 2}\), \(\sigma _{e, ij}=(-0.5)^{|i-j|}\). U follows Unif[0, 1]. Moreover, \(\mathrm{Var}((\xi ^{T}, W^{T})^{T})= (\sigma _{o,ij})_{1\le i,j \le q}\) with \(\sigma _{o,ij}=0.5^{|i-j|}\). We considered two cases: \(\theta =(1, 0, 0, 1, 0)^{T} \in \mathbb {R}^{5}\) and \(\theta =(1, 0, 0, 1, 0, 0, 0 ,0)^{T}\in \mathbb {R}^{8}\).

We examined the following quantities: the median of the squared errors (MedSE) \(\Vert \hat{\gamma }-\gamma \Vert _{2}^{2}\), the average number (labeled “C”) of the three or six true zero coefficients correctly set to zero, and the average number (labeled “I”) of the four true nonzeros incorrectly set to zero. As in Example 1, we considered three estimators: the benchmark estimator, the proposed estimator and the naive estimator. The GCV procedure introduced in Sect. 2.3 was used for selecting the \(\lambda _{j}\)’s, with 30 grid points evenly distributed over the range of \(\lambda \). The simulation results are reported in Table 3.

Table 3 The simulation results for Example 2

We can see that both the benchmark estimator and our proposed estimator perform better as the sample size increases to 1000. The values of “C” and “I” are close to the true values (3 in case 1 or 6 in case 2, and 0, respectively). The performance of the SCAD procedure is similar to that of the oracle procedure and better than best subset variable selection using AIC and RIC. Moreover, the performance of the SCAD is similar to that of BIC, which, however, costs much more computational time. The MedSE of the SCAD and BIC procedures for both the benchmark estimator and the proposed estimator are also close to those obtained from the oracle procedure. As anticipated, the naive procedure has a much higher rate of incorrectly setting nonzero coefficients to zero: the average number of nonzero coefficients incorrectly set to zero by SCAD is close to 1 instead of 0 in the two cases, and even when the sample size is \(n=1000\), the corresponding number for best subset variable selection is at least 0.4 instead of 0. At the same time, the MedSE of the naive estimator is about 0.37 when \(n=500\) and 0.26 when \(n=1000\), even in the oracle setting. This means that ignoring the measurement error e increases the chance of falsely identifying components as significant and can result in an inappropriate model, whereas calibration with \(\hat{\xi }(v)\) performs well for variable selection.

Example 3

In this example, we examined the performance of the estimation procedure for the nonparametric components introduced in Sect. 3.1. \(\beta =(1,1,1)^{T}\), \(\theta =(-1, 0.5)^{T}\), \(X=(1, X)^{T}\), where X follows N(0, 1). \(\alpha (u)=(\alpha _{1}(u), \alpha _{2}(u))^{T}\), \(\alpha _{1}(u)=2\exp (-2 u)\), \(\alpha _{2}(u)=2\sin ^{2}(\pi u)\). U, \(\xi (V)\), V and e are the same as in Example 2. Moreover, \(\mathrm{Var}\{(\xi ^{T}, W^{T})^{T}\}=(\sigma _{o,ij})_{1\le i,j \le q}\) with \(\sigma _{o,ij}=0.5^{|i-j|}\). In this example, we set \(h=0.2\). The performance of the estimator \(\hat{\alpha }(u)=(\hat{\alpha }_{1}(u), \hat{\alpha }_{2}(u))^{T}\) was assessed by the square root of the average squared errors (RASE)

$$\begin{aligned}&\mathrm{RASE}_{1}=\Bigg \{n_{0}^{-1}\sum _{i=1}^{n_{0}}\Vert \hat{\alpha }_{1} (u_{i})-\alpha _{1}(u_{i})\Vert ^{2}\Bigg \}^{1/2},\\&\mathrm{RASE}_{2}=\Bigg \{n_{0}^{-1}\sum _{i=1}^{n_{0}}\Vert \hat{\alpha }_{2}(u_{i})- \alpha _{2}(u_{i})\Vert ^{2}\Bigg \}^{1/2}, \end{aligned}$$

where \(\{u_{1}, \ldots , u_{n_{0}} \}\) are the given grid points, and \(n_{0}=200\) is the number of grid points.

We evaluated the estimation procedure (11) under two scenarios: (i) using the estimated \(\hat{\gamma }=(\hat{\beta }^{T}, \hat{\theta }^{T})^{T}\); (ii) using the true value \({\gamma }=({\beta }^{T}, {\theta }^{T})^{T}\). We report the simulation means and standard deviations of \(\mathrm{RASE}_{1}\) and \(\mathrm{RASE}_{2}\), and the simulation mean and associated standard deviation of \(\Vert \hat{\gamma }-\gamma \Vert ^{2}\), in Table 4. These results indicate that both the benchmark estimator and the proposed estimator work well regardless of whether \(\hat{\gamma }\) or \(\gamma \) is used. This is not surprising because \(\hat{\gamma }\) is root-n consistent, with a higher convergence rate than the nonparametric estimates; as a result, the benchmark estimator and the proposed estimator work satisfactorily under both scenarios in terms of RASE. On the other hand, the naive procedure results in non-ignorable biases in the estimation of \(\gamma \), and the biased estimators \(\hat{\gamma }\) deteriorate the estimation procedure for \(\alpha (\cdot )\), eventually making \(\hat{\alpha }(u)\) more biased. It is worth mentioning that the naive estimator using the true \(\gamma \) works well, since no bias is introduced (see the third row under the “Exact \(\gamma \)” column in Table 4). The estimation of \(\alpha _2(u)\) with the estimated \(\hat{\gamma }\) performs as well as if we knew the true value of \(\gamma \), for both the proposed estimation and the naive estimation. The estimation procedure for \(\alpha _1(u)\) does not have such a nice property: the RASE value for \(\alpha _1(u)\) increases from 0.0527 for the proposed method to 0.7993 for the naive estimation when the sample size is \(n=1000\), whereas the RASE value for \(\alpha _2(u)\) remains on the same scale for the proposed method and the naive estimation. This substantial difference arises because \(\alpha _1(u)\), but not \(\alpha _2(u)\), absorbs the biases caused by ignoring measurement errors. These features can further be observed in Fig. 1 for the benchmark procedure when \(n=1000\), i.e., using \(\xi (v)\), and in Fig. 2 for the proposed procedure, i.e., using \(\hat{\xi }(v)\), where we plot the RASE values based on the true \(\gamma \) against the RASE values based on the estimated \(\hat{\gamma }\) for \(\alpha _1(u)\) (left panels) and \(\alpha _2(u)\) (right panels). It can be seen that the estimation procedure (11) for \(\alpha _2(u)\) with the estimated \(\hat{\gamma }\) performs as well as if we knew the true value of \(\gamma \).

Table 4 The simulation results for Example 3
Fig. 1

Simulation results (\(n=1000\)) for Example 3 under the benchmark procedure, i.e., using \(\xi (v)\): RASE based on the true \(\gamma \) against RASE based on the estimated \(\hat{\gamma }\) for \(\alpha _{1}(u)\) (left panel) and \(\alpha _2(u)\) (right panel)

Fig. 2

Simulation results (\(n=1000\)) for Example 3 under the proposed procedure, i.e., using \(\hat{\xi }(v)\): RASE based on the true \(\gamma \) against RASE based on the estimated \(\hat{\gamma }\) for \(\alpha _{1}(u)\) (left panel) and \(\alpha _2(u)\) (right panel)

Example 4

In this example, we examine the performance of the test procedure proposed in Sect. 3.2. The simulation setting is the same as in Example 3. Consider the hypotheses

$$\begin{aligned} H_{0}: \alpha _{2}(u)=0 \ \text{ vs } \ H_{1}: \alpha _{2}(u)\not =0, \end{aligned}$$
(15)

where \(\alpha _2(u)\) indexes a sequence of alternative models through \(C_{o}\), of the form \(\alpha _{2}(u)=C_{o} \times u(1-u).\) We conducted 400 simulations at four significance levels, 0.01, 0.025, 0.05 and 0.10, for the benchmark procedure and the proposed procedure, with 500 conditional bootstrap samples (Cai et al. 2000) generated in each simulation for the power calculation. The simulation results are reported in Table 5 and Fig. 3. When \(C_{o}=0\), all empirical levels obtained by the two procedures are close to the four nominal levels, which indicates that the bootstrap method gives proper Type I errors. As \(C_{o}\) increases, the power functions increase rapidly. It is worth noting that the simulation results for the benchmark procedure concur with what Li and Liang (2008) observed, and the proposed estimation procedure also performs well. This indicates that the proposed GLRT under the measurement error setting works well numerically and confirms our theoretical findings.

Table 5 The simulation results for Example 4
Fig. 3

Simulation results (\(n=1000\)) for Example 4: power plot for the bootstrap test proposed in Sect. 3.2. The significance levels are 0.01, 0.025, 0.05 and 0.10. The dotted lines represent the power functions for the benchmark procedure directly using \(\xi (v)\); the solid lines represent the proposed procedure using the estimated \(\hat{\xi }(v)\)

4.2 An empirical example

We analyzed a data set with 358 complete observations from a diabetes study conducted among African Americans in central Virginia, which aimed at understanding the relationship between the prevalence of obesity, diabetes, and other cardiovascular risk factors. There are 14 covariates of potential interest: “TC, Total Cholesterol”; “SG, Stabilized Glucose”; “HDL, High-Density Lipoprotein”; “Ratio, Cholesterol/HDL”; “GH, Glycosolated Hemoglobin”; “age”; “gender”; “height”; “weight”; “frame”; “FSBP, First Systolic Blood Pressure”; “FDBP, First Diastolic Blood Pressure”; “waist” and “hip”. A \(\mathtt{GH}\) value over 7.0 usually indicates a positive diagnosis of diabetes, so Y was set to 1 if \(\mathtt{GH} >7.0\) and 0 otherwise. We are interested in the relationship between the probability of having diabetes and the collected covariates. Cambien et al. (1987) found that blood pressure is strongly associated with glucose, and Han et al. (1995) found that Ratio is associated with TC and HDL. On the basis of these preliminary results, we treat \(\eta =(\mathtt{FSBP}, \mathtt{FDBP})^{T}\) and \(V=\mathtt{SG}\) as ancillary variables to recover the unobservable variables \(\xi =(\xi _{1}(V), \xi _{2}(V))^{T}\). We take \(X=(\mathtt{TC}, \mathtt{HDL})^{T}\) and \(U=\mathtt{Ratio}\) to investigate the possibly varying coefficient functions \(\alpha (\cdot )=(\alpha _{1}(U), \alpha _{2}(U))^{T}\). The W-variables are age, gender, height, weight, frame, waist and hip. Gender and frame are discrete variables, coded 1 and 0 for male and female, and 1, 2, 3 for small, medium and large frames, respectively. All continuous covariates were standardized.

We used the proposed quasi-likelihood method and the penalized quasi-likelihood with the SCAD penalty for estimation and variable selection. For comparison, we also considered the AIC, BIC and RIC variable selection procedures. The bandwidth \(h=0.5n^{-1/3}\) was used for the local regression fitting. The results are reported in Table 6. The SCAD procedure agrees with the BIC and RIC procedures: all three methods indicate that only the recovered variables \(\xi =(\xi _{1}(V), \xi _{2}(V))^{T}\) are significant, while none of the W-variables is. The SCAD-based estimates of \(\beta \) are close to those obtained using the unpenalized quasi-likelihood. AIC selects two extra W-variables, \(\mathtt{waist}\) and \(\mathtt{hip}\); recalling the simulation performance in Sect. 4.1, AIC may suggest an over-fitted model. As such, the model selected by SCAD, BIC and RIC may be more appropriate.

Table 6 Estimation and variable selection results of real data analysis

We further considered estimation and variable selection for the X-variables. We conducted 500 bootstrap replications to test \(\alpha _{1}(\cdot )= 0\). The observed GLRT value is 0.0711, larger than the 97.5 % quantile of the 500 bootstrap values, 0.0418, which suggests rejecting the null hypothesis. In the same way, we tested \(\alpha _{2}(\cdot )=0\) and obtained the value 0.3027, much larger than the corresponding 97.5 % quantile of the 500 bootstrap values, 0.0355; this also indicates that we should reject the null hypothesis. The estimated curves with their 95 % pointwise confidence bands are depicted in Fig. 4 and show nonzero and nonlinear patterns. As a result, both \(\alpha _{1}(u)\) and \(\alpha _{2}(u)\) should be included in the final model.

Fig. 4

Results for the real data example: local linear estimators of TC-\(\alpha _{1}(u)\) (left panel) and HDL-\(\alpha _{2}(u)\) (right panel) against \(U=\mathrm{Ratio}\), with the associated 95 % pointwise confidence intervals (dotted lines)