1 Introduction

The semiparametric partially linear varying coefficient model (SPLVCM) is an extension of the partially linear model and the varying coefficient model (Hastie and Tibshirani 1993; Cai et al. 2000; Fan and Zhang 1999; Fan and Zhang 2000). It allows some coefficient functions to vary with a certain covariate, such as a time or age variable. If \(Y\) is a response variable and \((U,\mathbf{X},\mathbf{Z})\) are the associated covariates, then the SPLVCM takes the form

$$\begin{aligned} Y=\mathbf{X}^T{\varvec{\alpha }}(U)+\mathbf{Z}^T{\varvec{\beta }}+\varepsilon , \end{aligned}$$
(1)

where \(U\) is the so-called index variable, which, without loss of generality, we assume ranges over the unit interval [0,1]; \({\varvec{\alpha }}(\cdot )=({\alpha _1(\cdot ), \ldots , \alpha _p(\cdot )})^T\) is an unknown \(p\)-dimensional vector of coefficient functions; \({\varvec{\beta }}=(\beta _1,\ldots ,\beta _d)^T\) is a \(d\)-dimensional vector of unknown regression coefficients; \(\mathbf{Z}=(Z_1, \ldots , Z_d)^T \in \mathbb{R}^d\) and \(\mathbf{X}=(X_1, \ldots , X_p)^T \in \mathbb{R}^p\) are two covariate vectors; and \(\varepsilon \) is a random error with mean zero.

The SPLVCM retains the virtues of both parametric and nonparametric modelling. It is a very flexible model: not only are linear interactions considered, as in parametric models, but general interactions between the index variable \(U\) and the covariates are also explored nonparametrically. Many papers have focused on the SPLVCM. For example, Li et al. (2002) introduced a local least-squares method with a kernel weight function for the SPLVCM; Zhang et al. (2002) studied the SPLVCM based on the local polynomial method (Fan and Gijbels 1996); Lu (2008) discussed the SPLVCM in the framework of generalized linear models based on a two-step estimation procedure; Xia et al. (2004) investigated efficient estimation of the parametric part of the SPLVCM; and Fan and Huang (2005) presented profile likelihood inference for the SPLVCM based on the profile least-squares technique. As an extension of Fan and Huang (2005), a profile likelihood estimation procedure was developed in Lam and Fan (2008) under the generalized linear model framework with a diverging number of covariates.

However, all the papers mentioned above were built on either least-squares or likelihood-based methods, which are expected to be very sensitive to outliers, and whose efficiency may be significantly reduced for many commonly used non-normal errors. Owing to the well-known advantages of quantile regression (QR), researchers have also studied the SPLVCM in the quantile regression framework. For example, Wang et al. (2009) considered quantile regression for the SPLVCM using \(B\)-splines and developed a rank score test; Cai and Xiao (2012) studied the model based on local polynomial smoothing. Although the QR-based method is a robust modeling tool, it has limitations in terms of efficiency and uniqueness of estimation. Specifically, since the check loss function for QR is not strictly convex, its estimate need not be unique in general. Moreover, the quantile method may lose efficiency when there are no outliers or the error distribution is normal.

Recently, Yao et al. (2012) investigated a new estimation method based on local modal regression in a nonparametric model. A distinguishing characteristic of their method is that it introduces an additional tuning parameter, automatically selected from the observed data, to achieve both robustness and efficiency of the resulting estimate. Their estimator is not only robust when the data include outliers or the error distribution has heavy tails, but also asymptotically as efficient as the least-squares-based estimator when there are no outliers and the error distribution is normal. In other words, their proposed estimator is almost as efficient as an omniscient estimator. This fact motivates us to extend the modal regression method to the SPLVCM by borrowing the idea of Yao et al. (2012).

In practice, there are often many covariates in both the parametric part and the nonparametric part of model (1). With high-dimensional covariates, sparse modeling is often considered superior, owing to enhanced model predictability and interpretability. Various powerful penalization methods have been developed for variable selection in parametric models, such as the Lasso (Tibshirani 1996), the SCAD (Fan and Li 2001), the elastic net (Zou and Hastie 2005), the adaptive lasso (Zou 2006), the Dantzig selector (Candes and Tao 2007), one-step sparse estimation (Zou and Li 2008) and, more recently, the MCP (Zhang 2010). As with linear models, variable selection for semiparametric regression is equally important, and even more complex because model (1) involves both nonparametric and parametric parts.

There are only a few papers on variable selection in semiparametric regression models. Li and Liang (2008) considered variable selection for the SPLVCM, where the parametric components are identified via the smoothly clipped absolute deviation (SCAD) procedure and the varying coefficients are selected via the generalized likelihood ratio test. Xie and Huang (2009) discussed SCAD-penalized regression in partially linear models, a special case of the SPLVCM. Zhao and Xue (2009) investigated a SCAD-based procedure that can select parametric and nonparametric components simultaneously using \(B\)-splines for the SPLVCM. In addition, Leng (2009) proposed a simple approach to model selection for varying coefficient models, and Lin and Yuan (2012) studied variable selection for the generalized partially linear varying coefficient model based on basis function approximation. More importantly, Kai et al. (2011) introduced a robust variable selection method for the SPLVCM based on composite quantile regression and the local polynomial method, but they only considered variable selection for the parametric part of model (1). The main goal of this paper is to develop an effective and robust estimation and variable selection procedure based on modal regression to select significant parametric and nonparametric components in model (1), where the nonparametric components are approximated by \(B\)-splines. The proposed procedure possesses the oracle property in the sense of Fan and Li (2001), and its computation is very fast. An important contribution of this paper is thus a new robust and efficient variable selection procedure for the SPLVCM.

The outline of this paper is as follows. In Sect. 2, following the idea of modal regression, we describe a new estimation method for the SPLVCM, where the varying coefficient functions are approximated by \(B\)-splines. In Sect. 3, an efficient and robust variable selection procedure via the SCAD penalty is developed, which can select both the significant parametric components and the significant nonparametric components; we also establish its oracle property for both parts. In Sect. 4, we discuss bandwidth selection both in theory and in practice and propose an EM-type algorithm for the variable selection procedure; moreover, we develop a CV method to select the optimal number of knots for the \(B\)-spline approximation and the optimal adaptive penalty parameter. In Sect. 5, we conduct a simulation study and analyze a real data example to examine the finite-sample performance of the proposed procedures. Finally, some concluding remarks are given in Sect. 6. All the regularity conditions and technical proofs are contained in the Appendix.

2 Robust estimation method

2.1 Modal regression

As measures of center, the mean, the median and the mode are three important numerical characteristics of the error distribution. Among them, the median and the mode share the advantage of robustness: they are resistant to outliers. Moreover, since modal regression focuses on the relationship for the majority of the data and summarizes the “most likely” conditional values, it can provide more meaningful point prediction and larger coverage probability for prediction than the alternatives when the error density is skewed, provided short intervals of the same length, centered around each estimate, are used.

For the linear regression model \(y_i=\mathbf{x}_i^T{\varvec{\beta }}+\varepsilon _i\), Yao and Li (2011) proposed to estimate the regression parameter \({\varvec{\beta }}\) by maximizing

$$\begin{aligned} Q({\varvec{\beta }})\equiv \frac{1}{n}\sum _{i=1}^{n} \phi _{h} \left( y_i-\mathbf{x}_{i}^{T}{\varvec{\beta }}\right) \!, \end{aligned}$$
(2)

where \(\phi _{h}(t)=h^{-1}\phi (t/h),\, \phi (t)\) is a kernel density function and \(h\) is a bandwidth.

To see why (2) can be used to estimate the modal regression, consider the special case where the linear predictor contains only an intercept, \({\varvec{\beta }}=\beta _0\); then (2) simplifies to

$$\begin{aligned} Q(\beta _0)\equiv \frac{1}{n}\sum _{i=1}^n \phi _{h}(y_i-\beta _0). \end{aligned}$$
(3)

As a function of \(\beta _{0}\), \(Q(\beta _{0})\) is the kernel estimate of the density function of \(y\), evaluated at \(\beta _0\). Therefore, the maximizer of (3) is the mode of the kernel density estimate based on \(y_1,\ldots , y_n\). As \(n\rightarrow \infty \) and \(h\rightarrow 0\), the mode of the kernel density estimate converges to the mode of the distribution of \(y\).
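For intuition, the following minimal Python sketch (with hypothetical data, and the standard normal density for \(\phi \) as used later in this paper) verifies numerically that the maximizer of (3) is the mode of the kernel density estimate of \(y\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.standard_normal(200) + 3.0     # hypothetical sample centered near 3

h = 0.5                                # bandwidth for phi_h

def Q(beta0):
    # objective (3): average of phi_h(y_i - beta0), i.e. the kernel
    # density estimate of y evaluated at beta0
    return norm.pdf((y - beta0) / h).mean() / h

# maximize Q over a fine grid; the maximizer is the mode of the KDE of y
grid = np.linspace(y.min(), y.max(), 2001)
beta0_hat = grid[np.argmax([Q(b) for b in grid])]
print(beta0_hat)                       # close to 3, the population mode
```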

For the univariate nonparametric regression model \(y_i=m(x_i)+\varepsilon _i\), Yao et al. (2012) proposed to estimate the nonparametric function \(m(x)\) using local polynomial by maximizing

$$\begin{aligned} Q(\theta )\equiv \frac{1}{n}\sum _{i=1}^{n} K_{\bar{h}}(x_i-x)\phi _{h}\left( y_i-\sum _{j=0}^p\theta _j(x_i-x)^j\right) \!, \end{aligned}$$
(4)

where \(K_{\bar{h}}(\cdot )=K(\cdot /\bar{h})/\bar{h}\) is the kernel \(K(\cdot )\) rescaled with bandwidth \(\bar{h}\) for estimating the nonparametric function, \(h\) is another bandwidth, used in \(\phi (\cdot )\), and \(\theta _j=m^{(j)}(x)/j!\).

Compared with other estimation methods, modal regression uses \(-\phi _{h}(\cdot )\) as the loss function, instead of the quadratic loss of least squares or the check loss of quantile regression. It provides the “most likely” conditional values rather than the conditional average or a conditional quantile. However, despite its usefulness, modal regression has received little attention in the literature. Lee (1989) used the uniform kernel and a fixed \(h\) in (3) to estimate the modal regression. Scott (1992) also proposed it, but gave little methodology on how to actually implement it. Recently, Yao and Li (2011) and Yao et al. (2012) systematically studied modal regression for the linear model and the univariate nonparametric regression model. The main goal of this paper is to extend modal regression to semiparametric models and to study variable selection for the SPLVCM so as to obtain a robust and efficient sparse estimator.

2.2 Estimation method for SPLVCM

Suppose that \(\{\mathbf{X}_i, \mathbf{Z}_i, U_i, Y_i\}_{i=1}^n\) is an independent and identically distributed sample from model (1). Since the \(\alpha _j(U)\, ( j=1,\ldots ,p)\) in (1) are unknown nonparametric functions, following the method of Yao et al. (2012), we can use a local linear approximation of \(\alpha _j(U)\) for \(U\) in a neighborhood of \(u\), i.e.,

$$\begin{aligned} \alpha _j(U)\approx \alpha _j(u)+\alpha _j^{\prime }(u)(U-u)\triangleq a_j+b_j (U-u), \quad j=1,\ldots ,p. \end{aligned}$$

Then we can obtain \(\hat{\varvec{\alpha }}(u)\) and \(\hat{{\varvec{\beta }}}\) by maximizing the local modal objective function

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n \phi _{h} \left( Y_i-\mathbf{X}_i^T(\mathbf{a}+\mathbf{b}(U_i-u))-\mathbf{Z}_i^T{\varvec{\beta }}\right) K_{\bar{h}}(U_i-u) \end{aligned}$$
(5)

with respect to \(\mathbf{a}, \mathbf{b}\) and \({\varvec{\beta }}\), where \(\mathbf{a}=(a_1, \ldots , a_p)^T\) and \(\mathbf{b}=(b_1, \ldots , b_p)^T\).

However, there are two criticisms of local polynomial estimation for semiparametric models. Firstly, since \({\varvec{\beta }}\) is a global parameter, obtaining its optimal \(\sqrt{n}\)-consistent estimate requires a two-step estimation with an undersmoothing technique in the first step. Secondly, the computational burden of local polynomial estimation is heavy, especially for a high-dimensional SPLVCM.

To avoid these drawbacks of local polynomial estimation, we propose to use basis function approximations for the nonparametric functions. More specifically, let \(B(u)=(B_1(u),\ldots ,B_q(u))^T\) be \(B\)-spline basis functions of order \(\hbar \), where \(q=K+\hbar +1\) and \(K\) is the number of interior knots. Then \(\alpha _j(u)\) can be approximated by

$$\begin{aligned} \alpha _j(u)\approx B(u)^T{\varvec{\gamma }}_j, \quad j=1,\ldots ,p. \end{aligned}$$

Then, we obtain \(\hat{{\varvec{\beta }}}\) and \(\hat{{\varvec{\gamma }}}\) by maximizing

$$\begin{aligned} Q({\varvec{\gamma }},{\varvec{\beta }})=\sum _{i=1}^n\phi _h \left( Y_i-\mathbf{W}_i^T{\varvec{\gamma }}-\mathbf{Z}_i^T{\varvec{\beta }} \right) \!, \end{aligned}$$
(6)

with respect to \({\varvec{\beta }}\) and \({\varvec{\gamma }}\), where \(\mathbf{W}_i=\left( I_p\otimes B(U_i)\right) \mathbf{X}_i\) and \({\varvec{\gamma }}=({\varvec{\gamma }}_1^T, \ldots ,{\varvec{\gamma }}_p^T)^T\). According to Yao et al. (2012), the choice of \(\phi (\cdot )\) is not crucial. For ease of computation, we use the standard normal density for \(\phi (t)\) throughout this paper. The bandwidth \(h\) in \(\phi _h(\cdot )\) then acts as a tuning parameter that determines the degree of robustness of the estimator.
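The construction of \(\mathbf{W}_i\) is mechanical but easy to get wrong; the following Python sketch (our own helper names, assuming cubic splines on [0,1] with \(K\) equally spaced interior knots) builds the basis \(B(u)\) and the stacked rows \(\mathbf{W}_i=\mathrm{kron}(\mathbf{X}_i, B(U_i))\):

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, K=5, degree=3):
    """Evaluate the q = K + degree + 1 cubic B-spline basis functions
    on [0, 1] with K equally spaced interior knots, at the points u."""
    interior = np.linspace(0.0, 1.0, K + 2)[1:-1]
    knots = np.concatenate(([0.0] * (degree + 1), interior,
                            [1.0] * (degree + 1)))
    q = K + degree + 1
    return np.column_stack(
        [BSpline(knots, np.eye(q)[j], degree)(u) for j in range(q)])

def design_matrix(X, U, K=5, degree=3):
    """Row i is W_i = (I_p kron B(U_i)) X_i = kron(X_i, B(U_i))."""
    B = bspline_basis(U, K, degree)                      # n x q
    n = X.shape[0]
    return np.einsum('ij,ik->ijk', X, B).reshape(n, -1)  # n x (p*q)
```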

3 Variable selection for SPLVCM

In this section, we develop a robust and efficient variable selection procedure for the SPLVCM via the SCAD penalty and prove its oracle property.

Given \(a>2\) and \(\lambda >0\), the SCAD penalty at \(\theta \) is

$$\begin{aligned} p_{\lambda }(\theta )=\left\{ \begin{array}{ll} \lambda |\theta |, &{}\quad |\theta |\le \lambda , \\ -(\theta ^2-2a\lambda |\theta |+\lambda ^2)/[2(a-1)], &{}\quad \lambda <|\theta |\le a\lambda ,\\ (a+1)\lambda ^2/2, &{}\quad |\theta |>a\lambda . \end{array}\right. \end{aligned}$$

The SCAD penalty is continuously differentiable on \((-\infty , 0)\cup (0, \infty )\) but singular at 0, and its derivative vanishes outside \([-a\lambda , a\lambda ]\). As a consequence, SCAD-penalized regression can produce sparse solutions and unbiased estimates for large coefficients. More details on the penalty can be found in Fan and Li (2001).
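For concreteness, the penalty and its derivative translate directly into Python (a vectorized sketch; the default \(a=3.7\) follows Fan and Li 2001):

```python
import numpy as np

def scad(theta, lam, a=3.7):
    """SCAD penalty p_lambda(theta), elementwise in |theta|."""
    t = np.abs(theta)
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                 (a + 1) * lam**2 / 2))

def scad_deriv(theta, lam, a=3.7):
    """p'_lambda(|theta|); it vanishes for |theta| > a*lam, which is
    what makes large coefficients essentially unbiased."""
    t = np.abs(theta)
    return lam * ((t <= lam)
                  + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))
```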

We define the penalized objective function for the SPLVCM based on modal regression as

$$\begin{aligned} {\mathcal{L }}({\varvec{\gamma }},{\varvec{\beta }})=Q({\varvec{\gamma }},{\varvec{\beta }})-n\sum _{j=1}^p p_{\lambda _{1j}} \left( \Vert B(\cdot )^T{\varvec{\gamma }}_j\Vert \right) -n\sum _{k=1}^d p_{\lambda _{2k}}(|{\beta }_k|), \end{aligned}$$
(7)

where \(\lambda _{1j} (j=1,\ldots ,p)\) and \(\lambda _{2k} (k=1,\ldots ,d)\) are penalized parameters for the \(j\)th varying coefficient function and the \(k\)th parameter component, respectively.

Note that the regularization parameters of the penalty functions in (7) are not necessarily the same across \({\varvec{\gamma }}_j, j=1,\ldots ,p\), and \(\beta _k, k=1, \ldots ,d\), which provides flexibility and adaptivity. With this adaptive strategy, the tuning parameter for a zero coefficient can be larger than that for a nonzero coefficient, so that large coefficients are estimated essentially without bias while small coefficients are shrunk toward zero. By maximizing the above objective function with proper penalty parameters, we can obtain sparse estimators and hence conduct variable selection.

Let \(\hat{{\varvec{\beta }}}\) and \(\hat{{\varvec{\gamma }}}\) be the maximizers of (7). The estimator of \(\alpha _j(u)\) is then \(\hat{\alpha }_j(u)=B(u)^T\hat{{\varvec{\gamma }}}_j, j=1,\ldots ,p\). We call \(\hat{{\varvec{\beta }}}\) and \(\hat{\alpha }_j(u)\) the penalized estimators of \({\varvec{\beta }}\) and \(\alpha _j(u)\) based on \(B\)-splines and robust modal regression (SMR) for the SPLVCM. Next, we discuss the asymptotic properties of the resulting penalized estimators. Denote by \({\varvec{\alpha }}_0(\cdot )\) and \({\varvec{\beta }}_0\) the true values of \({\varvec{\alpha }}(\cdot )\) and \({\varvec{\beta }}\), respectively. Without loss of generality, we assume that \(\alpha _{j0}(\cdot )=0, j=s_1+1,\ldots ,p\), and that \(\alpha _{j0}(\cdot ), j=1,\ldots ,s_1\), are all the nonzero components of \({\varvec{\alpha }}_0(\cdot )\). Furthermore, we assume that \(\beta _{k0}=0, k=s_2+1,\ldots ,d\), and that \(\beta _{k0}, k=1,\ldots ,s_2\), are all the nonzero components of \({\varvec{\beta }}_0\). Let

$$\begin{aligned} F(\mathbf{x,z},u,h)=\mathrm{E} \left\{ \phi ^{\prime \prime }_h(\varepsilon )| \mathbf{X}=\mathbf{x}, \mathbf{Z}=\mathbf{z},U=u \right\} \end{aligned}$$

and

$$\begin{aligned} G(\mathbf{x,z},u,h)=\mathrm{E}\left\{ \phi ^{\prime }_h(\varepsilon )^2|\mathbf{X}=\mathbf{x}, \mathbf{Z}=\mathbf{z},U=u \right\} . \end{aligned}$$

Denote

$$\begin{aligned} a_n=\max _{j,k}\left\{ |p^{\prime }_{\lambda _{1j}}(\Vert {\varvec{\gamma }}_{j0}\Vert _H)|, |p^{\prime }_{\lambda _{2k}}(|\beta _{k0}|)|: {\varvec{\gamma }}_{j0}\ne 0, \beta _{k0}\ne 0\right\} \end{aligned}$$

and

$$\begin{aligned} b_n=\max _{j,k}\left\{ |p^{\prime \prime }_{\lambda _{1j}} (\Vert {\varvec{\gamma }}_{j0}\Vert _H)|, |p^{\prime \prime }_{\lambda _{2k}}(|\beta _{k0}|)|: {\varvec{\gamma }}_{j0}\ne 0, \beta _{k0}\ne 0\right\} , \end{aligned}$$

where \(\Vert {\varvec{\gamma }}_{j0}\Vert _H=\sqrt{{\varvec{\gamma }}_{j0} ^TH{\varvec{\gamma }}_{j0}}\), \(H=\int _0^1 B(u)B^T(u)du\), and \({\varvec{\gamma }}_{j0}\) is the vector of best approximation coefficients of \(\alpha _j(u)\) in the \(B\)-spline space.
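Numerically, \(H\) and \(\Vert \cdot \Vert _H\) are easy to compute; a minimal sketch, reusing the hypothetical bspline_basis helper of Sect. 2.2:

```python
import numpy as np

# H = int_0^1 B(u) B(u)^T du, approximated by the trapezoidal rule
u_grid = np.linspace(0.0, 1.0, 1001)
Bg = bspline_basis(u_grid, K=5, degree=3)                      # 1001 x q
H = np.trapz(Bg[:, :, None] * Bg[:, None, :], u_grid, axis=0)  # q x q

def gamma_H_norm(gamma_j, H):
    # ||gamma_j||_H = sqrt(gamma_j^T H gamma_j)
    return float(np.sqrt(gamma_j @ H @ gamma_j))
```

With this notation in place, the following Theorem 1 gives the consistency of the proposed penalized estimators.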

Theorem 1

Suppose that the regularity conditions (C1)–(C8) in the Appendix hold and that the number of interior knots satisfies \(K=O(n^{1/(2r+1)})\). If \(b_n \rightarrow 0\), then we have

(i) \(\Vert \hat{{\varvec{\beta }}}-{\varvec{\beta }}_0\Vert =O_p\left( n^{\frac{-r}{2r+1}}+a_n\right) \);

(ii) \(\Vert \hat{\alpha }_j(\cdot )-\alpha _{j0}(\cdot )\Vert =O_p\left( n^{\frac{-r}{2r+1}}+a_n\right) ,\quad j=1,\ldots ,p\),

where \(r\) is defined in condition (C2) in the Appendix.

Let \(\lambda _{\max }=\max _{j,k}\{\lambda _{1j}, \lambda _{2k}\}\) and \(\lambda _{\min }=\min _{j,k}\{\lambda _{1j}, \lambda _{2k}\}\). Under some conditions, we can show that the consistent estimators in Theorem 1 possess the sparsity property, stated as follows.

Theorem 2

Suppose that the regularity conditions (C1)–(C8) in the Appendix hold and that the number of interior knots satisfies \(K=O(n^{1/(2r+1)})\). If \(\lambda _{\max }\rightarrow 0\) and \(n^{\frac{r}{2r+1}}\lambda _{\min } \rightarrow \infty \) as \(n \rightarrow \infty \), then, with probability tending to 1, \(\hat{{\varvec{\beta }}}\) and \(\hat{\alpha }_j(\cdot )\) satisfy

(i) \(\hat{\beta }_k=0,\quad k=s_2+1,\ldots ,d\);

(ii) \(\hat{\alpha }_{j}(\cdot )=0,\quad j=s_1+1,\ldots ,p\).

Remark 1

For the SCAD penalty function, if \(\lambda _{\max }\rightarrow 0\) as \(n \rightarrow \infty \), then \(a_n=0\). Therefore, Theorems 1 and 2 show that, with properly chosen tuning parameters, the proposed variable selection method is consistent and the estimators of the nonparametric components achieve the optimal convergence rate, as if the subset of true zero coefficients were already known (Stone 1982).

Next, we show that the estimators of the nonzero parametric coefficients have the same asymptotic distribution as those based on the oracle model. To state this, we need additional notation. Let \({\varvec{\gamma }}_a=({\varvec{\gamma }}_1^T,\ldots , {\varvec{\gamma }}_{s_1}^T)^T\), \({\varvec{\beta }}_a=(\beta _1,\ldots ,\beta _{s_2})^T\), and let \({\varvec{\gamma }}_{a0}\) and \({\varvec{\beta }}_{a0}\) be the true values of \({\varvec{\gamma }}_a\) and \({\varvec{\beta }}_a\), respectively. The corresponding covariates are denoted by \(\mathbf{W}_{a}\) and \(\mathbf{Z}_{a}\). In addition, denote

$$\begin{aligned} \Phi =\mathrm{E}\left( \phi _h^{\prime \prime }(\varepsilon )\mathbf{W}_a\mathbf{W}_a^T \right) =\mathrm{E}\left( F(\mathbf{X,Z},U,h)\mathbf{W}_a\mathbf{W}_a^T \right) \end{aligned}$$

and

$$\begin{aligned} \Psi =\mathrm{E}\left( \phi _h^{\prime \prime }(\varepsilon )\mathbf{W}_a\mathbf{Z}_a^T \right) =\mathrm{E} \left( F(\mathbf{X,Z},U,h)\mathbf{W}_a\mathbf{Z}_a^T \right) . \end{aligned}$$

Then we have the following theorem.

Theorem 3

Under the conditions of Theorem 2, we have

$$\begin{aligned} \sqrt{n} \left( \hat{{\varvec{\beta }}}_a-{\varvec{\beta }}_{a0} \right) \stackrel{\mathrm{d}}{\longrightarrow } N(0, \Sigma ^{-1} \Delta \Sigma ^{-1}), \end{aligned}$$
(8)

where \(\Delta =\mathrm{E}(G(\mathbf{X,Z},U,h){\check{\mathbf{Z}}}_a{ \check{\mathbf{Z}}}_a^T)\), \(\Sigma =\mathrm{E}(F(\mathbf{X,Z},U,h){ \check{\mathbf{Z}}}_{a}{\check{\mathbf{Z}}}_{a}^T)\), \({\check{\mathbf{Z}}}_{a}=\mathbf{Z}_{a}-\Psi ^T\Phi ^{-1}\mathbf{W}_{a}\).

Let \(\tilde{\alpha }_j(u)=B^T(u){\varvec{\gamma }}_{j0}\) for \(j=1,\ldots ,s_1\), and denote \(\tilde{{\varvec{\alpha }}}_a(u)=(\tilde{\alpha }_1(u),\ldots , \tilde{\alpha }_{s_1}(u))^T\) and \(\hat{{\varvec{\alpha }}}_a(u)=(\hat{\alpha }_1(u), \ldots ,\hat{\alpha }_{s_1}(u))^T\). Then the following result holds.

Theorem 4

Under the conditions of Theorem 2, for any \(qs_1\)-dimensional vector \(\mathbf{c}_n\) with components not all 0, we have

$$\begin{aligned} \left\{ \mathbf{c}_n^T\mathrm{var}(\hat{\varvec{\alpha }}_a(u))\mathbf{c}_n \right\} ^{-1/2}\mathbf{c}_n^T \left( \hat{\varvec{\alpha }}_a(u)-\tilde{{\varvec{\alpha }}}_a(u)\right) \stackrel{\mathrm{d}}{\longrightarrow } N(0, 1). \end{aligned}$$
(9)

The proofs of Theorems 1–4 are given in the Appendix.

4 Bandwidth selection and estimation algorithm

In this section, we first discuss the selection of the bandwidth, both in theory and in practice. Then, we develop an estimation procedure for the SPLVCM based on the MEM algorithm (Li et al. 2007) and the LQA algorithm (Fan and Li 2001). Note that the bandwidth selection discussed in this section is not the same as bandwidth selection in local polynomial fitting for the SPLVCM (Li and Palta 2009).

4.1 Optimal bandwidth

In this subsection, we derive the theoretically optimal bandwidth. For simplicity, assume that the error is independent of \(\mathbf{X}\), \(\mathbf{Z}\) and \(U\). Based on (8) and the asymptotic variance of the least-squares \(B\)-spline estimator (LSB) given in Zhao and Xue (2009), we can show that the ratio of the asymptotic variance of the SMR estimator to that of the LSB estimator is given by

$$\begin{aligned} r(h)\triangleq \frac{G(h)F^{-2}(h)}{\sigma ^2}, \end{aligned}$$
(10)

where \(\sigma ^2=\mathrm{E}(\varepsilon ^2)\), \(F(h)=\mathrm{E}\{\phi _h^{\prime \prime }(\varepsilon )\}\) and \(G(h)=\mathrm{E}\{\phi _h^{\prime }(\varepsilon )^2\}\). The ratio \(r(h)\) depends on \(h\) only, and it governs the efficiency and robustness of the estimator. Therefore, the ideal choice of \(h\) is

$$\begin{aligned} h_{\mathrm{opt}}=\mathrm{argmin}_{h}r(h) =\mathrm{argmin}_{h}G(h)F^{-2}(h). \end{aligned}$$
(11)

From (11), we can see that \(h_{\mathrm{opt}}\) does not depend on \(n\) and depends only on the error distribution of \(\varepsilon \).

Remark 2

Based on the expression for the ratio \(r(h)\), we can prove that \(\inf _{h>0} r(h)= 1\) if the error follows a normal distribution, and \(\inf _{h>0} r(h)\le 1\) regardless of the error distribution. Hence, SMR is at least as efficient as LSB. In particular, if the error distribution has heavy tails or a large variance, SMR performs much better than LSB.

4.2 Bandwidth selection in practice

In practice, the error distribution is unknown, so \(F(h)\) and \(G(h)\) cannot be computed directly. A feasible method is to estimate \(F(h)\) and \(G(h)\) by

$$\begin{aligned} \hat{F}(h)=\frac{1}{n}\sum _{i=1}^n\phi ^{\prime \prime }_{h}(\hat{\varepsilon }_i) \quad \mathrm{and} \quad \hat{G}(h)=\frac{1}{n}\sum _{i=1}^n \left\{ \phi ^{\prime }_{h}(\hat{\varepsilon }_i) \right\} ^2, \end{aligned}$$

respectively.

Then \(r(h)\) can be estimated by \(\hat{r}(h)=\hat{G}(h)\hat{F}(h)^{-2}/\hat{\sigma }^2\), where \(\hat{\varepsilon }_i=Y_i-\mathbf{X}_i^T\hat{\varvec{\alpha }}(U_i)-\mathbf{Z}_i^T\hat{{\varvec{\beta }}}\), and \(\hat{\varvec{\alpha }}(\cdot )\), \(\hat{{\varvec{\beta }}}\) and \(\hat{\sigma }\) are pilot estimates. Then, using a grid search, we can easily find the \({h}_{\mathrm{opt}}\) that minimizes \(\hat{r}(h)\). Following the advice of Yao et al. (2012), possible grid points for \(h\) are \(h=0.5\hat{\sigma }\times 1.02^j, j=0,1,\ldots ,k\), for some fixed \(k\) (such as \(k=50\) or 100).
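A Python sketch of this grid search, under the standard normal choice of \(\phi \) (the residuals and \(\hat{\sigma }\) are assumed to come from a pilot fit):

```python
import numpy as np

def select_h(residuals, sigma_hat, k=100):
    """Grid search minimizing r_hat(h); the factor 1/sigma_hat^2 is
    constant in h and does not affect the minimizer."""
    eps = np.asarray(residuals)
    best_h, best_r = None, np.inf
    for j in range(k + 1):
        h = 0.5 * sigma_hat * 1.02**j          # grid of Yao et al. (2012)
        t = eps / h
        phi = np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
        # for Gaussian phi: phi_h'(e)  = -(e / h^3) phi(e / h),
        #                   phi_h''(e) = ((e/h)^2 - 1) phi(e / h) / h^3
        F_hat = np.mean((t**2 - 1.0) * phi) / h**3
        G_hat = np.mean((t * phi)**2) / h**4
        r = G_hat / F_hat**2
        if r < best_r:
            best_h, best_r = h, r
    return best_h
```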

4.3 Algorithm

In this subsection, we extend the modal expectation-maximization (MEM) algorithm (Li et al. 2007) and local quadratic algorithm (LQA, Fan and Li 2001) to maximize (7) for SPLVCM. Here, we assume \(\phi (\cdot )\) is the density function of a standard normal distribution.

Because \({\mathcal{L }}({\varvec{\gamma }},{\varvec{\beta }})\) is irregular at the origin, directly maximizing (7) may be difficult. Following Fan and Li (2001), we first locally approximate the penalty function \(p_\lambda (\cdot )\) by a quadratic function at every step of the iteration. More specifically, in a neighborhood of a given nonzero \(\omega _0\), the penalty can be approximated by

$$\begin{aligned} p_{\lambda }(|\omega |)\approx p_{\lambda }(|\omega _0|) +\frac{1}{2}\left\{ p^{\prime }_{\lambda }(|\omega _0|)/|\omega _0 | \right\} \left( \omega ^2-\omega _0^2 \right) ,\quad \mathrm{for} \; \omega \approx \omega _0. \end{aligned}$$

Hence, if the initial estimators \(\beta _k^{(0)}\) and \({\varvec{\gamma }}_j^{(0)}\) are very close to 0, we set \(\hat{\beta }_k=0\) and \(\hat{{\varvec{\gamma }}}_j=0\); otherwise, for given initial values \(\beta _k^{(0)}\) with \(|\beta _k^{(0)}|>0, k=1,\ldots ,d\), and \({\varvec{\gamma }}_j^{(0)}\) with \(\Vert {\varvec{\gamma }}_j^{(0)}\Vert _H>0, j=1,\ldots ,p\), we have

$$\begin{aligned} p_{\lambda _{2k}}(|\beta _k|)&\approx p_{\lambda _{2k}} \left( |\beta _k^{(0)}| \right) +\frac{1}{2}\frac{p^{\prime }_{\lambda _{2k}} \left( |\beta _k^{(0)}| \right) }{|\beta _k^{(0)} |}\left( |\beta _k|^2-|\beta _k^{(0)}|^2\right) \quad \mathrm{and} \\ p_{\lambda _{1j}}(\Vert {\varvec{\gamma }}_j\Vert _H)&\approx p_{\lambda _{1j}} \left( \Vert {\varvec{\gamma }}_j^{(0)}\Vert _H \right) +\frac{1}{2}\frac{p^{\prime }_{\lambda _{1j}} \left( \Vert {\varvec{\gamma }}_j^{(0)}\Vert _H \right) }{\Vert {\varvec{\gamma }}_j^{(0)}\Vert _H}\left( \Vert {\varvec{\gamma }}_j\Vert _H^2-\Vert {\varvec{\gamma }}_j^{(0)}\Vert _H^2\right) . \end{aligned}$$

Denote \({\varvec{\theta }}=({\varvec{\beta }}^T, {\varvec{\gamma }}^T)^T\), \(\mathbf{Z}_i^*=(\mathbf{Z}_i^T, \mathbf{W}_i^T)^T\) and set \(m=0\). Let

$$\begin{aligned} \Sigma _\lambda ({\varvec{\theta }}^{(m)})&= \mathrm{diag}\left\{ \frac{p^{\prime }_{\lambda _{21}} \left( |\beta _1^{(m)}| \right) }{|\beta _1^{(m)}|}, \ldots , \frac{p^{\prime }_{\lambda _{2d}} \left( |\beta _d^{(m)}| \right) }{|\beta _d^{(m)}|}, \frac{p^{\prime }_{\lambda _{11}} \left( \Vert {\varvec{\gamma }}_1^{(m)}\Vert _H \right) }{\Vert {\varvec{\gamma }}_1^{(m)}\Vert _H}H, \ldots ,\right. \\&\qquad \quad \left. \frac{p^{\prime }_{\lambda _{1p}} \left( \Vert {\varvec{\gamma }}_p^{(m)}\Vert _H \right) }{\Vert {\varvec{\gamma }}_p^{(m)}\Vert _H}H\right\} \!. \end{aligned}$$

With the aid of LQA and MEM algorithm, we can obtain the sparse estimators as follows:

  • Step 1 (E-step): We first update \(\pi (i|{\varvec{\theta }}^{(m)})\) by

    $$\begin{aligned} \pi (i|{\varvec{\theta }}^{(m)})=\frac{ \phi _{h}\left( Y_i-\mathbf{Z}_i^{*T}{\varvec{\theta }}^{(m)} \right) }{\sum _{l=1}^n\phi _{h}\left( Y_l-\mathbf{Z}_l^{*T}{\varvec{\theta }}^{(m)} \right) }\propto \phi _{h}\left( Y_i-\mathbf{Z}_i^{*T}{\varvec{\theta }}^{(m)}\right) , \quad \ i=1,\ldots ,n. \end{aligned}$$
  • Step 2 (M-step): Then, we update \({\varvec{\theta }}\) to obtain \(\hat{\varvec{\theta }}^{(m+1)}\):

    $$\begin{aligned} \hat{\varvec{\theta }}^{(m+1)}&=\mathrm{argmax}_{\varvec{\theta }} \sum \limits _{i=1}^n \pi (i|{\varvec{\theta }}^{(m)}) \log \phi _{h} \left( Y_i-\mathbf{Z}_i^{*T} {\varvec{\theta }} \right) -\frac{n}{2}{\varvec{\theta }}^T \Sigma _\lambda ({\varvec{\theta }}^{(m)}){\varvec{\theta }}\\ &= \left( \mathbf{Z}^{*T}W\mathbf{Z}^*+n\Sigma _\lambda ({\varvec{\theta }}^{(m)}) \right) ^{-1}\mathbf{Z}^{*T}W\mathbf{Y}, \end{aligned}$$

    where \(W\) is an \(n\times n\) diagonal matrix with diagonal elements \(\pi (i|{\varvec{\theta }}^{(m)})\)s.

  • Step 3: Iterate the E-step and the M-step until convergence, and denote the final estimator of \({\varvec{\theta }}\) by \(\hat{\varvec{\theta }}\). Then \(\hat{{\varvec{\beta }}}=(I_{d\times d}, 0_{d\times pq})\hat{\varvec{\theta }}\) and \(\hat{{\varvec{\gamma }}}=(0_{pq\times d},I_{pq\times pq})\hat{\varvec{\theta }}\).

Similar to the EM algorithm, each iteration of the MEM algorithm for the SPLVCM consists of two steps: an E-step and an M-step. The ascending property of the proposed MEM algorithm can be established along the lines of Li et al. (2007).

Note that, as with usual EM algorithms, the converged value may depend on the starting point, and there is no guarantee that the MEM algorithm converges to the global optimum. Therefore, it is prudent to run the algorithm from several starting points and choose the best local optimum found.
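A compact Python sketch of one run of the algorithm (scad_deriv and gamma_H_norm are the hypothetical helpers sketched earlier; for brevity, near-zero components are thresholded elementwise rather than groupwise in \(\Vert \cdot \Vert _H\)):

```python
import numpy as np
from scipy.linalg import block_diag

def mem_lqa(Y, Zstar, H, lam1, lam2, h, d, p, q, a=3.7,
            max_iter=200, tol=1e-6, eps0=1e-4):
    """MEM/LQA iteration for maximizing (7); rows of Zstar are
    (Z_i^T, W_i^T) and theta = (beta^T, gamma^T)^T."""
    n = len(Y)
    theta = np.linalg.lstsq(Zstar, Y, rcond=None)[0]   # unpenalized start
    for _ in range(max_iter):
        # E-step: pi(i | theta^(m)) proportional to phi_h(Y_i - Z*_i^T theta)
        res = Y - Zstar @ theta
        w = np.exp(-0.5 * (res / h)**2)
        w /= w.sum()
        # LQA matrix Sigma_lambda(theta^(m)); eps0 guards division by ~0
        beta, gamma = theta[:d], theta[d:].reshape(p, q)
        D_beta = np.diag([float(scad_deriv(b, lam2, a)) / max(abs(b), eps0)
                          for b in beta])
        D_gamma = [float(scad_deriv(gamma_H_norm(g, H), lam1, a))
                   / max(gamma_H_norm(g, H), eps0) * H for g in gamma]
        Sigma = block_diag(D_beta, *D_gamma)
        # M-step: penalized weighted least-squares update
        ZtW = Zstar.T * w                              # Z*^T W
        theta_new = np.linalg.solve(ZtW @ Zstar + n * Sigma, ZtW @ Y)
        if np.max(np.abs(theta_new - theta)) < tol:
            theta = theta_new
            break
        theta = theta_new
    theta[np.abs(theta) < eps0] = 0.0                  # sparsify tiny values
    return theta[:d], theta[d:]                        # beta_hat, gamma_hat

# as noted above, it is prudent to rerun from several starting points
# and keep the solution with the largest objective (7)
```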

4.4 Selection of tuning parameter

To implement the above estimation procedure, the number of interior knots \(K\) and the tuning parameters \(a\), \(\lambda _{1j}\) and \(\lambda _{2k}\) in the penalty functions should be chosen appropriately. Following the suggestion of Fan and Li (2001) that \(a=3.7\) performs well in a variety of situations, we use this value throughout the paper. Note that there are \(p+d\) penalty parameters (the \(\lambda _{1j}\)’s and \(\lambda _{2k}\)’s) to be selected. To reduce the computational burden, we use the following strategy:

$$\begin{aligned} \lambda _{1j}=\frac{\lambda }{\Vert \hat{{\varvec{\gamma }}}_j^{(0)}\Vert _H} \quad \mathrm{and} \quad \lambda _{2k}=\frac{\lambda }{|\hat{\beta }_k^{(0)}|}, \end{aligned}$$
(12)

where \(\hat{{\varvec{\gamma }}}_j^{(0)}\) and \(\hat{\beta }_k^{(0)}\) are the initial (unpenalized) estimators of \({\varvec{\gamma }}_j\) and \(\beta _k\), respectively. Then we choose \(K\) and \(\lambda \) by maximizing the two-dimensional cross-validation score

$$\begin{aligned} \mathrm{CV}(K, \lambda )=\sum _{i=1}^n \phi _{h}\left( Y_i-\mathbf{W}_i^T\hat{{\varvec{\gamma }}}^{(-i)}-\mathbf{Z}_i^T\hat{{\varvec{\beta }}}^{(-i)} \right) , \end{aligned}$$
(13)

where \(\hat{{\varvec{\beta }}}^{(-i)}\) and \(\hat{{\varvec{\gamma }}}^{(-i)}\) are the solution of (7) after deleting the \(i\)th subject. Then, the optimal \(K_\mathrm{opt}\) and \(\lambda _\mathrm{opt}\) are obtained by

$$\begin{aligned} (K_\mathrm{opt}, \lambda _\mathrm{opt})=\mathrm{argmax}_{K,\lambda }\mathrm{CV}(K,\lambda ). \end{aligned}$$

Note that this strategy of selecting the tuning parameters follows, in some sense, the same rationale as the adaptive Lasso (Zou 2006), and our simulation experience shows that it performs well.
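A sketch of this two-dimensional grid search (for brevity it uses a common \(\lambda \) for all components rather than the adaptive rescaling (12), and the leave-one-out loop makes it expensive for large \(n\); helper names as above):

```python
import numpy as np
from itertools import product

def cv_select(Y, X, Z, U, h, K_grid=(2, 3, 4, 5),
              lam_grid=tuple(np.logspace(-3, 0, 8)), degree=3):
    """Maximize the CV score (13) over (K, lambda)."""
    n, p = X.shape
    d = Z.shape[1]
    const = 1.0 / (h * np.sqrt(2 * np.pi))
    best, best_cv = None, -np.inf
    for K, lam in product(K_grid, lam_grid):
        q = K + degree + 1
        ug = np.linspace(0.0, 1.0, 1001)
        Bg = bspline_basis(ug, K, degree)
        H = np.trapz(Bg[:, :, None] * Bg[:, None, :], ug, axis=0)
        Zstar = np.hstack([Z, design_matrix(X, U, K, degree)])
        cv = 0.0
        for i in range(n):                      # leave-one-out loop
            keep = np.arange(n) != i
            b, g = mem_lqa(Y[keep], Zstar[keep], H, lam, lam, h, d, p, q)
            r = Y[i] - Zstar[i] @ np.concatenate([b, g])
            cv += const * np.exp(-0.5 * (r / h)**2)   # phi_h(residual)
        if cv > best_cv:
            best, best_cv = (K, lam), cv
    return best
```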

5 Numerical properties

In this section, we conduct a simulation study to assess the finite-sample performance of the proposed procedures and illustrate the proposed methodology on a real-world data set from a health study. In all examples, we take the kernel function \(K(\cdot )\) to be the Gaussian kernel.

5.1 Simulation study

In this example, we generate the random samples from the model

$$\begin{aligned} Y_i=\mathbf{X}_{i}^T{\varvec{\alpha }}(U_i)+\mathbf{Z}_i^T{\varvec{\beta }}+\varepsilon _i, \end{aligned}$$

where \({\varvec{\alpha }}(u)=(\alpha _1(u),\ldots ,\alpha _{10}(u))^T\) with \(\alpha _1(u)=2\sin (2\pi u)\), \(\alpha _2(u)=8u(1-u)\) and \(\alpha _j(u)=0, j=3,\ldots ,10\); \({\varvec{\beta }}=(2.5,1.2,0.5,0,0,0,0,0,0,0)^T\); the covariate vector \((\mathbf{X}_i^T, \mathbf{Z}_i^T)^T\) is normally distributed with mean 0, variance 1 and correlation \(0.8^{|k-j|}, 1\le k, j\le p+d\), with \(p=d=10\); and the index variable \(U_i\) is simulated from \(U[0,1]\), independent of \((\mathbf{X}_i^T, \mathbf{Z}_i^T)^T\). In our simulations, we consider five different error distributions: \(N(0, 1)\), the \(t\)-distribution with 3 degrees of freedom \(t(3)\), the Laplace distribution \(Lp(0,1)\), the Laplace mixture \(0.8Lp(0,1)+0.2Lp(0,5)\) and the normal mixture \(0.9N(0, 1) + 0.1N(0, 10)\); the error \(\varepsilon _i\) is independent of all covariates. The sample size \(n\) is set to 200, 400 and 600. A total of 400 simulation replications are conducted for each model setup. In all simulations, we use the cubic \(B\)-spline basis to approximate the varying coefficient functions, with the optimal number of knots and penalty parameter obtained by the CV method of Sect. 4.4.
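For reproducibility, a sketch of the data-generating step in Python (only three of the five error distributions are shown, and the scale in the mixture component \(N(0,10)\) is treated here as a variance, which is an assumption):

```python
import numpy as np

def gen_data(n, rho=0.8, p=10, d=10, error="normal", seed=None):
    """One replicate from the simulation model of this subsection."""
    rng = np.random.default_rng(seed)
    # (X, Z) jointly normal, mean 0, correlation rho^{|k-j|}
    S = rho ** np.abs(np.subtract.outer(np.arange(p + d), np.arange(p + d)))
    XZ = rng.multivariate_normal(np.zeros(p + d), S, size=n)
    X, Z = XZ[:, :p], XZ[:, p:]
    U = rng.uniform(0.0, 1.0, n)
    alpha = np.zeros((n, p))
    alpha[:, 0] = 2.0 * np.sin(2.0 * np.pi * U)   # alpha_1(u)
    alpha[:, 1] = 8.0 * U * (1.0 - U)             # alpha_2(u)
    beta = np.array([2.5, 1.2, 0.5] + [0.0] * (d - 3))
    if error == "normal":
        eps = rng.standard_normal(n)
    elif error == "t3":
        eps = rng.standard_t(3, n)
    else:                                         # 0.9 N(0,1) + 0.1 N(0,10)
        heavy = rng.random(n) < 0.1
        eps = np.where(heavy, rng.normal(0.0, np.sqrt(10.0), n),
                       rng.standard_normal(n))
    Y = np.sum(X * alpha, axis=1) + Z @ beta + eps
    return Y, X, Z, U
```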

The performance of the nonparametric estimator \(\hat{\varvec{\alpha }}(\cdot )\) is assessed using the square root of the average squared errors (RASE):

$$\begin{aligned} \mathrm{RASE}=\left\{ n^{-1}_\mathrm{grid}\sum _{k=1}^{n_\mathrm{grid}}\Vert \hat{\alpha }(u_k)-\alpha (u_k)\Vert ^2\right\} ^{1/2}\!\!, \end{aligned}$$

where \(\{u_k, k=1,\ldots ,n_\mathrm{grid}\}\) are the grid points at which the functions \(\{\hat{\alpha }_j(\cdot )\}\) are evaluated. The generalized mean square error (GMSE), as defined in Li and Liang (2008), is used to evaluate the performance for the parametric part:

$$\begin{aligned} \mathrm{GMSE}=(\hat{{\varvec{\beta }}}-{\varvec{\beta }})^T\mathrm{E}(\mathbf{ZZ}^T)(\hat{{\varvec{\beta }}}-{\varvec{\beta }}). \end{aligned}$$
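Both criteria translate directly into code; in the simulation, \(\mathrm{E}(\mathbf{ZZ}^T)\) is known by design:

```python
import numpy as np

def rase(alpha_hat, alpha_true, u_grid):
    """alpha_hat and alpha_true map a grid point u to the p-vector of
    coefficient functions evaluated at u."""
    err = [np.sum((alpha_hat(u) - alpha_true(u))**2) for u in u_grid]
    return np.sqrt(np.mean(err))

def gmse(beta_hat, beta_true, Sigma_Z):
    """Generalized MSE with Sigma_Z = E(Z Z^T) (known in the simulation)."""
    delta = beta_hat - beta_true
    return float(delta @ Sigma_Z @ delta)
```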

The medians of RASE and GMSE are listed in Table 1. To examine the robustness and efficiency of the proposed procedure (SMR), we also compare the simulation results with the least-squares \(B\)-spline estimator (LSB) (Zhao and Xue 2009). Column “CN” shows the average number of zero coefficients correctly estimated to be zero for the varying coefficient functions, and column “CP” the corresponding number for the parametric part. Column “IN” presents the average number of nonzero coefficients incorrectly estimated to be zero for the varying coefficient part, and column “IP” the corresponding number for the parametric part.

Table 1 Simulation results with different error distributions

Several observations can be made from Table 1. Firstly, for a given sample size, the penalized SMR estimator clearly outperforms the penalized LSB estimator, especially for non-normal error distributions. Secondly, for a given error distribution, the performance of the SMR estimator improves as the sample size increases. Thirdly, even in the normal error case, SMR performs no worse than LSB; in particular, when \(n=600\), there is almost no efficiency loss in RASE and GMSE compared with LSB, and SMR is even slightly better in terms of variable selection. Moreover, it is interesting that the superiority of SMR becomes more pronounced when the error follows a mixture distribution and the sample size is large. The main reason is that the larger the sample size, the more likely the data are to contain outliers; when there are some very large outliers in the data, modal regression puts more weight on the “most likely” data around the true value, leading to a robust and efficient estimator.

To conclude, the penalized SMR estimator performs at least as well as, and often better than, the LSB estimator.

5.2 Real data analysis

As an illustration, we apply the proposed procedures to analyze the plasma beta-carotene level data set collected in a cross-sectional study (Nierenberg et al. 1989). Research has shown a direct relationship between beta-carotene and cancers such as lung, colon, breast and prostate cancer (Fairfield and Fletcher 2002). This data set consists of 315 observations and can be downloaded from the StatLib database via the link “lib.stat.cmu.edu/datasets/Plasma_Retinol”. A brief description of the variables is given in Table 2.

Table 2 Plasma beta-carotene level data

Of interest are the relationships between the plasma beta-carotene level and the following covariates: sex, smoking status, quetelet index (BMI), vitamin use, number of calories, grams of fat, grams of fiber, number of alcoholic drinks, cholesterol and age. We fit the data using the SPLVCM with \(U\) being “Age”. The covariates “smoking status” and “vitamin use” are categorical and are thus replaced with dummy variables. We take these two dummy variables and the two other discrete variables, “sex” and “alcohol”, as covariates of the parametric part. All the other covariates are standardized and used as covariates of the varying coefficient part. The index variable \(U\) is rescaled to the interval [0,1]. We apply the SMR and LSB estimators to fit the SPLVCM. We randomly split the data into two parts: 2/3 of the observations are used as a training set to fit the model and select significant variables, and the remaining 1/3 serve as a test set to evaluate the predictive ability of the selected model. Prediction performance is measured by the median absolute prediction error (MAPE), the median of \(\{|Y_i^\mathrm{test}-\hat{Y}^\mathrm{test}_i|, i=1,\ldots ,105\}\).

Besides the SCAD penalty, to examine the effect of the penalty on the variable selection results for SMR, we also consider two other penalty functions, the LASSO and the MCP. Both the SMR and the LSB estimators select all the variables in the varying coefficient part for all three penalties. The estimated varying coefficient functions for SMR with the SCAD penalty are depicted in Fig. 1. The resulting estimates for the parametric part and the MAPE, together with the optimal bandwidth and penalty parameter, are given in Table 3 (the estimated standard deviations of the parametric components are given in brackets).

Fig. 1

Plots of estimated varying coefficient functions with SCAD penalty, the solid line is the estimated curve and the dot-dashed lines are 95 % pointwise confidence intervals: a intercept, b quetelet, c calories, d fat, e fiber, f cholesterol, g dietary beta-carotene. The histogram for \(Y\) is shown in h

Table 3 Selected parametric components and MAPE with different penalties

As can be seen from Table 3, the performances of the three penalties are very similar, and the SMR estimator is sparser than the LSB estimator. Meanwhile, for all three penalties the MAPE of SMR is smaller than that of LSB, indicating that the SMR model has better prediction performance than the LSB model for the plasma beta-carotene level data. This is because the response \(Y\) is left-skewed for these data, as can be seen in Fig. 1h. In addition, to check whether the selected variables in the nonparametric part are truly relevant, we note that none of the 95 % pointwise confidence intervals (the dot-dashed lines) entirely covers 0, as can be seen from Fig. 1a–g.

Remark 3

Based on Theorem 4, we can construct pointwise confidence intervals for each varying coefficient function if \(\mathrm{var}(\hat{\alpha }_j(u))\) or its estimate \({\widehat{\mathrm{var}}}(\hat{\alpha }_j(u))\) is available. In practice, because \(\mathrm{var}(\hat{\alpha }_j(u))\) is unknown, one can use a sandwich formula to obtain \({\widehat{\mathrm{var}}}(\hat{\alpha }_j(u))\). However, the sandwich formula is very complicated and involves many approximations, and the resulting confidence intervals sometimes do not perform well. Here, we obtain 95 % pointwise confidence intervals for each nonzero varying coefficient function using the bootstrap. With \(B\) independent bootstrap samples, we obtain \(B\) bootstrap estimates of \(\hat{\alpha }_j(u)\); the sample standard error \(\hat{\sigma }_{j,B}(u)\) can then be computed, and a \(1-\alpha \) confidence interval for \(\tilde{\alpha }_j(u)\) based on a normal approximation is

$$\begin{aligned} \hat{\alpha }_j(u)\pm z_{1-\alpha /2}\hat{\sigma }_{j,B}(u), \quad \mathrm{for} \; j=1,\ldots ,s_1, \end{aligned}$$

where \(z_p\) is the \(100p\)th percentile of the standard normal distribution. If the bias \(\tilde{\alpha }_j(u)-{\alpha }_{j0}(u)\) is asymptotically negligible relative to the variance of \(\hat{\alpha }_j(u)\), which can be achieved by choosing a large \(K\), then \(\hat{\alpha }_j(u)\pm z_{1-\alpha /2}\hat{\sigma }_{j,B}(u)\) is also an asymptotic \(1-\alpha \) confidence interval for \(\alpha _{j0}(u)\). For more details, see Huang et al. (2002) and Wang et al. (2008).
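A sketch of this resampling scheme (fit is a hypothetical wrapper around the penalized SMR estimator that returns the estimated curves as a function of \(u\)):

```python
import numpy as np
from scipy.stats import norm

def bootstrap_bands(Y, X, Z, U, u_grid, fit, B=500, alpha=0.05, seed=0):
    """Pointwise normal-approximation bootstrap intervals for the
    estimated varying coefficient functions."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    _, alpha_hat = fit(Y, X, Z, U)             # original-sample curves
    curves = []
    for _ in range(B):
        idx = rng.integers(0, n, n)            # resample (X, Z, U, Y) rows
        _, a_b = fit(Y[idx], X[idx], Z[idx], U[idx])
        curves.append(np.stack([a_b(u) for u in u_grid]))
    se = np.std(curves, axis=0, ddof=1)        # sigma_hat_{j,B}(u) on grid
    z = norm.ppf(1 - alpha / 2)
    center = np.stack([alpha_hat(u) for u in u_grid])
    return center - z * se, center + z * se    # lower, upper bands
```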

6 Concluding remarks

Variable selection for the SPLVCM is an interesting topic. However, most existing methods are built on either least-squares or likelihood-based methods, which are very sensitive to outliers and whose efficiency may be significantly reduced for heavy-tailed error distributions. In this paper, we developed an efficient and robust variable selection procedure for the SPLVCM based on modal regression, where the nonparametric functions are approximated by \(B\)-splines. The proposed procedure can simultaneously estimate and select the important variables in both the nonparametric and the parametric parts at their best convergence rates, and we established the oracle property of the proposed method. A distinguishing characteristic of the new method is that it introduces an additional tuning parameter \(h\), which can be selected automatically from the observed data, to achieve both robustness and efficiency. The simulation study and the plasma beta-carotene level data example confirm that the proposed method outperforms the least-squares-based method.

There is room to improve our method. One limitation is that our variable selection procedure for the SPLVCM is built under the assumption that the varying and constant coefficients can be separated in advance. In practice, this prior information is not available when using the SPLVCM to analyze real data: we do not know whether a coefficient is important, nor whether it should be treated as constant or varying. How to simultaneously identify which predictors are important, which have genuinely varying effects, and which have only constant effects is therefore of practical interest; for more details, see Cheng et al. (2009), Li and Zhang (2011) and Tang et al. (2012). We have embarked on research in this direction. In addition, our method can be applied to other semiparametric models, such as the single-index model, the partially linear single-index model and the varying coefficient single-index model, to obtain robust and efficient estimation and to achieve variable selection. Research in these directions is ongoing.