1 Introduction

Consider the varying coefficient model

$$\begin{aligned} Y={\varvec{X}}^T(U) \varvec{\beta } (U)+ \varepsilon , \end{aligned}$$
(1)

where \(Y\) is the response variable, \(U\) and \(\varvec{X}\) are the covariates, and \(\varvec{\beta }(U)\) is a vector of unknown smooth coefficient functions. The random error \(\varepsilon \) is independent of \(\varvec{X}\) and \(U\), and has a probability density function \(h(\cdot )\) with finite Fisher information. In this paper, it is assumed that \(U\) is a scalar and \(\varvec{X}\) is a \(p\)-dimensional vector which may depend on \(U\). Since its introduction by Hastie and Tibshirani (1993), the varying coefficient model has been widely applied in many scientific areas, such as economics, finance, politics, epidemiology, medical science, and ecology.

Due to its flexibility and interpretability, it has experienced rapid developments in both theory and methodology over the past ten years; see Fan and Zhang (2008) for a comprehensive survey. In general, there are at least three common ways to estimate this model. The first is kernel-based local polynomial smoothing; see, for instance, Wu et al. (1998), Hoover et al. (1998), Fan and Zhang (1999), and Kauermann and Tutz (1999). The second is the polynomial spline; see Huang et al. (2002, 2004) and Huang and Shen (2004). The third is the smoothing spline; see Hastie and Tibshirani (1993), Hoover et al. (1998) and Chiang et al. (2001). Recently, efficient variable selection procedures for the varying coefficient model have been proposed as well. In a typical linear regression setup, it is well understood that ignoring an important predictor can lead to seriously biased results, whereas including spurious covariates can substantially degrade estimation efficiency. Thus, variable selection is important for any regression problem. In a traditional linear regression setting, many selection criteria, e.g., the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), have been extensively used in practice. Recently, various shrinkage methods have been developed, which include but are not limited to the least absolute shrinkage and selection operator (LASSO; cf. Tibshirani 1996; Zou 2006) and the smoothly clipped absolute deviation (SCAD; Fan and Li 2001). Regularized estimation procedures have also been developed for varying coefficient models. Among others, Lin and Zhang (2006) develop the COSSO for component selection and smoothing in smoothing spline ANOVA. Wang et al. (2007) propose a group SCAD method for varying-coefficient model selection. Wang et al. (2008) extend the application of the SCAD penalty to varying coefficient models with longitudinal data. Li and Liang (2008) study variable selection for partially linear varying coefficient models, where the parametric components are identified via the SCAD but the nonparametric components are selected via a generalized likelihood ratio test instead of a shrinkage method. Leng (2009) proposes a penalized likelihood method in the framework of the smoothing spline ANOVA models. Wang and Xia (2009) develop a shrinkage method, called KLASSO (kernel-based LASSO), which combines the ideas of local polynomial smoothing and the LASSO. Tang et al. (2012) develop a unified variable selection approach for both least squares regression and quantile regression models with possibly varying coefficients; their method is carried out by a two-step iterative procedure based on basis expansion and an adaptive-LASSO-type penalty.

The estimation and variable selection procedures in the aforementioned papers are mainly built on least-squares (LS) type methods. Although LS methods are a successful and standard choice in varying coefficient model fitting, they may suffer when the errors follow a heavy-tailed distribution or in the presence of outliers. Thus, some efforts have been devoted to constructing robust estimators for varying coefficient models. Kim (2007) develops a quantile regression procedure for varying coefficient models when the random errors are assumed to have a certain quantile equal to zero. Wang et al. (2009) develop a local rank estimation procedure, which integrates rank regression (Hettmansperger and McKean 1998) and local polynomial smoothing. In traditional linear regression settings, robust variable selection has also drawn much attention. Wang et al. (2007) propose a LASSO-based procedure using least absolute deviation regression. Zou and Yuan (2008) propose the composite quantile regression (CQR) estimator by averaging \(K\) quantile regressions; they show that CQR is selection consistent and can be more robust in various circumstances. Wang and Li (2009) and Leng (2010) independently propose two efficient shrinkage estimators using the idea of rank regression. However, to the best of our knowledge, no appropriate robust variable selection procedure has hitherto been available for the varying coefficient model, which is the focus of this paper.

In this paper, we aim to propose an efficient robust variable selection method for varying coefficient models. Motivated by the local rank inference (Wang et al. 2009), we start by developing a robust rank-based spline estimator. Under some mild conditions, we establish the asymptotic representation of the proposed estimator and further prove its asymptotic normality. We derive the formula of the asymptotic relative efficiency (ARE) of the rank-based spline estimator relative to the LS-based estimator, which has an expression that is closely related to that of the signed-rank Wilcoxon test in comparison with the t test. Further, we extend the application of the SCAD penalty to the rank-based spline estimator. Theoretical analysis reveals that our procedure is consistent in variable selection; that is, the probability that it correctly selects the true model tends to one. Also, we show that our procedure has the so-called oracle property; that is, the asymptotic distribution of an estimated coefficient function is the same as that when it is known a priori which variables are in the model. Simulation studies show that our procedure has better performance than KLASSO (Wang and Xia 2009) and LSSCAD (Wang et al. 2008) when the errors deviate from normality. Even in the most favorable case for KLASSO and LSSCAD, i.e., normal distribution, our procedure does not lose much, which coincides with our theoretical analysis.

This article is organized as follows. Section 2 presents the rank-based spline procedure for estimating the varying coefficient model, and some theoretical properties are provided. In Sect. 3, with the help of the rank-based spline procedure, we propose a new robust variable selection method and study its theoretical properties. Its numerical performance is investigated in Sect. 4. Several remarks draw the paper to its conclusion in Sect. 5. The technical details are provided in the “Appendix”. Some other simulation results are provided in another appendix, which is available online as supplementary material.

2 Methodology

To develop an efficient scheme for variable selection, we choose to consider a polynomial spline smoothing method rather than a local polynomial smoother. The reason is that with the former the varying coefficient model can be reformulated as a traditional multiple regression model, which serves the variable selection purpose more naturally (Wang et al. 2008). In contrast, although local polynomial smoothers also work very well, they require more sophisticated approximations and techniques in the selection procedures and the proofs of oracle properties (Wang and Xia 2009). Therefore, in this section, we develop a rank-based spline method for estimating \(\varvec{\beta }(\cdot )\), which can be regarded as a parallel to the local rank estimator proposed by Wang et al. (2009).

2.1 The estimation procedure

Suppose that \(\{U_i,\varvec{X}_i,Y_i\}_{i=1}^n\) is a random sample from the model (1). Write \(\varvec{X}_i=(X_{i1},\ldots ,X_{ip})^T\) and \(\varvec{\beta }(U)=(\beta _1(U),\ldots ,\beta _p(U))^T\). Suppose that each \(\beta _l(U), l=1,\ldots ,p\), can be approximated by some spline functions, that is

$$\begin{aligned} \beta _l(U)\approx \sum _{k=1}^{K_l} \gamma _{lk} B_{lk}(U), \end{aligned}$$
(2)

where each \(\{B_{lk}(\cdot ),k=1,\ldots ,K_l\}\) is a basis for a linear space \(\mathbb {G}_l\) of spline functions with a fixed degree and knot sequence. In our applications we use the B-spline basis for its good numerical properties. Following (1) and (2), we have

$$\begin{aligned} Y_i \approx \sum _{l=1}^p \sum _{k=1}^{K_l} X_{il} B_{lk}(U_i) \gamma _{lk}+\varepsilon _i. \end{aligned}$$

Define \(\varvec{Y}=(Y_1,\ldots ,Y_n)^T, \mathbf{X}=(\varvec{X}_1,\ldots ,\varvec{X}_n)^T, \varvec{\gamma }_l=(\gamma _{l1},\ldots ,\gamma _{lK_l})^T, \varvec{\gamma }=(\varvec{\gamma }_1^T,\ldots ,\varvec{\gamma }_p^T)^T\),

$$\begin{aligned} \mathbf{B}(u)=\left( \begin{array}{ccccccccc} B_{11}(u) &{} \cdots &{} B_{1K_{1}}(u) &{} 0 &{} \cdots &{} 0 &{} 0 &{} \cdots &{} 0\\ \vdots &{} \ddots &{} \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} 0 &{} 0 &{} \cdots &{} 0 &{} B_{p1}(u) &{} \cdots &{} B_{pK_{p}}(u)\end{array}\right) , \end{aligned}$$

\(\varvec{Z}_i=\varvec{X}_i^{T} \mathbf{B}(U_i)\), and \(\mathbf{Z}=(\varvec{Z}_1^{T},\ldots , \varvec{Z}_n^{T})^{T}\). Based on the above approximation, we obtain the residual at \(U_i\)

$$\begin{aligned} e_i=Y_i-\varvec{Z}_i \varvec{\gamma }. \end{aligned}$$

Motivated by the rank regression (Jaeckel 1972; Hettmansperger and McKean 1998), we define the rank-based spline objective (loss) function

$$\begin{aligned} Q_n(\varvec{\gamma })=\frac{1}{n}\sum _{i<j}|e_i-e_j|. \end{aligned}$$
(3)

An estimator of \(\beta _l(u)\) is obtained as \(\hat{\beta }_l(u)=\sum _{k=1}^{K_l}\hat{\gamma }_{lk}B_{lk}(u)\), where the \(\hat{\gamma }_{lk}\)’s are the minimizers of (3). We term it the rank-based spline estimator because the objective (loss) function is equivalent to the classic rank loss function based on Wilcoxon scores in linear models (Hettmansperger and McKean 1998).
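To fix ideas, the following R sketch illustrates how the design matrix \(\mathbf{Z}\) and the minimizer of (3) can be computed; the helper names, the common basis dimension, and the toy data are our own illustrative choices, not part of the original implementation. It uses the fact, exploited again in Sect. 3.2, that minimizing (3) is equivalent to an \(L_1\) regression on the \(n(n-1)/2\) pairwise differences, so quantreg::rq can be used.

```r
## Illustrative sketch (our own helper names; basis dimension and toy data are
## arbitrary choices). Requires the splines and quantreg packages.
library(splines)
library(quantreg)

## Design matrix Z: row i equals X_i^T B(U_i), using a common cubic B-spline
## basis for every coefficient function.
make_Z <- function(X, U, df = 6, degree = 3) {
  B <- bs(U, df = df, degree = degree, intercept = TRUE)   # n x K basis matrix
  do.call(cbind, lapply(seq_len(ncol(X)), function(l) X[, l] * B))
}

## Rank-based loss (3): (1/n) * sum_{i<j} |e_i - e_j|.
rank_loss <- function(gamma, Y, Z) {
  e <- as.numeric(Y - Z %*% gamma)
  d <- outer(e, e, "-")
  sum(abs(d[upper.tri(d)])) / length(Y)
}

## Minimizing (3) is equivalent to an intercept-free L1 (median) regression on
## the n(n-1)/2 pairwise differences {Z_i - Z_j, Y_i - Y_j} (cf. Sect. 3.2).
fit_rank_spline <- function(Y, Z) {
  ij <- t(combn(length(Y), 2))
  dZ <- Z[ij[, 1], , drop = FALSE] - Z[ij[, 2], , drop = FALSE]
  dY <- Y[ij[, 1]] - Y[ij[, 2]]
  coef(rq(dY ~ dZ - 1, tau = 0.5))
}

## Toy example with p = 2 coefficient functions
set.seed(1)
n <- 200
U <- runif(n)
X <- cbind(1, rnorm(n))
Y <- 2 * sin(2 * pi * U) * X[, 1] + 4 * U * (1 - U) * X[, 2] + rt(n, df = 3)
Z <- make_Z(X, U)
gamma_hat <- fit_rank_spline(Y, Z)   # stacked (gamma_1^T, gamma_2^T)^T
## beta_l(u) is recovered by multiplying the l-th block of gamma_hat by the
## B-spline basis evaluated at u
```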

2.2 Asymptotic properties

In this subsection, we establish the asymptotic properties of the rank-based spline estimator. The main challenge comes from the nonsmoothness of the objective function \(Q_n(\varvec{\gamma })\). To overcome this difficulty, we first derive an asymptotic representation of \(\hat{\varvec{\gamma }}\) via a quadratic approximation of \(Q_n(\varvec{\gamma })\), which holds uniformly in a local neighborhood of the true parameter values. Throughout this manuscript, we use the following notation for ease of exposition. Let \(|\varvec{a}|\) denote the Euclidean norm of a real-valued vector \(\varvec{a}\). For a real-valued function \(g,\,||g||_{\infty }=\sup _u |g(u)|\). For a vector-valued function \({\varvec{g}}=(g_1,\ldots ,g_p)^T\), denote \(||{\varvec{g}}||^2_{L_2}=\sum _{1\le l \le p} ||g_l||^2_{L_2}\) and \(||{\varvec{g}}||_{\infty }=\max _l||g_l||_{\infty }\). Define \(K_n=\max _{1\le l \le p} K_l\) and \(\rho _n=\max _{1\le l \le p} \inf _{{g} \in \mathbb {G}_l}||\beta _l-g||_{{\infty }}\). Let \({\varvec{g}}^*=(g^*_1,\ldots ,g^*_p) \in \mathbb {G}\) be such that \(||{\varvec{g}}^*-\varvec{\beta }||_{\infty }=\rho _n\), where \(\mathbb {G}=\mathbb {G}_1\times \cdots \times \mathbb {G}_p\) and \(\varvec{\beta }\) is the true varying-coefficient function. Then there exists \(\varvec{\gamma }_0\) such that \({\varvec{g}}^*(u)=\mathbf{B}(u)\varvec{\gamma }_0\).

Define \(\theta _n=\sqrt{{K_n}/{n}}, \varvec{\gamma }^{*}=\theta _n^{-1}(\varvec{\gamma }-\varvec{\gamma }_0)\), and \(\varDelta _i=\varvec{X}_i^{T}\varvec{\beta }(U_i)-\varvec{Z}_i \varvec{\gamma }_0\). Let \(\widehat{\varvec{\gamma }^{*}}\) be the value of \(\varvec{\gamma }^{*}\) that minimizes the following reparametrized function

$$\begin{aligned} Q_n^{*}(\varvec{\gamma }^{*})=\frac{1}{n} \sum _{i<j}|\varepsilon _i+\varDelta _i-\varepsilon _j-\varDelta _j-\theta _n (\varvec{Z}_i-\varvec{Z}_j)\varvec{\gamma }^{*}|. \end{aligned}$$

Then it can be easily seen that

$$\begin{aligned} \widehat{\varvec{\gamma }^{*}}=\theta _n^{-1}\left( \hat{\varvec{\gamma }}-\varvec{\gamma }_0\right) . \end{aligned}$$

We use \(\varvec{S}_n(\varvec{\gamma }^{*})\) to denote the gradient function of \(Q_n^{*}(\varvec{\gamma }^{*})\). More specifically,

$$\begin{aligned} \varvec{S}_n(\varvec{\gamma }^{*})=-\frac{\theta _n}{n} \sum _{i<j} \left( \varvec{Z}_i^{T}-\varvec{Z}_j^{T}\right) {\mathrm{sgn}}\left( \varepsilon _i+\varDelta _i-\varepsilon _j-\varDelta _j-\theta _n\left( \varvec{Z}_i-\varvec{Z}_j\right) \varvec{\gamma }^{*}\right) , \end{aligned}$$

where \({\mathrm{sgn}}(\cdot )\) is the sign function. Furthermore, we consider the following quadratic function of \(\varvec{\gamma }^{*}\)

$$\begin{aligned} A_n(\varvec{\gamma }^{*})=\tau \theta _n^2\varvec{\gamma }^{*T} \mathbf{Z}^T \mathbf{Z}\varvec{\gamma }^{*}+\varvec{\gamma }^{*T}\varvec{S}_n(\varvec{0})+Q_n^{*}(\varvec{0}), \end{aligned}$$

where \(\tau =\int h^2(t) dt\) is the well-known Wilcoxon constant, and \(h(\cdot )\) is the density function of the random error \(\varepsilon \).

For the asymptotic analysis, we need the following regularity conditions.

  1. (C1)

    The distribution of \(U_i\) has a Lebesgue density \(f(u)\) which is bounded away from 0 and infinity.

  2. (C2)

    \(E(\varvec{X}_i(u)|u=U_i)=\varvec{0}\), and the eigenvalues \(\lambda _1(u) \le \cdots \le \lambda _p(u)\) of \(\mathbf{\Sigma }(u)=E[{\varvec{X}_i}(u){\varvec{X}_i}(u)^{T}]\) are bounded away from 0 and infinity uniformly; that is, there are positive constants \(W_1\) and \(W_2\) such that \(W_1 \le \lambda _1(u) \le \cdots \le \lambda _p(u) \le W_2\) for all \(u\).

  3. (C3)

    There is a positive constant \(M_1\) such that \(|X_{il}(u)|\le M_1\) for all \(u\) and \(l=1,\ldots ,p,\,i=1,\ldots , n\).

  4. (C4)

    \(\lim \sup _n (\max _l K_l/\min _l K_l)<\infty \).

  5. (C5)

    The error \(\varepsilon \) has finite Fisher information, i.e., \(\int [h^{'}(x)]^2/h(x)dx < \infty \).

Remark 1

Conditions (C1)–(C4) are the same as those in Huang et al. (2004). The assumption on the random errors in (C5) is a standard condition for rank analysis in multiple linear regression (Hettmansperger and McKean 1998). These conditions are mild and can be satisfied in many practical situations.

Lemma 1

Suppose Conditions (C1)–(C5) all hold. Then, for any \(\epsilon >0\) and \(c>0\),

$$\begin{aligned} P\left( \sup _{\sqrt{1/K_n}|\varvec{\gamma }^{*}|\le c}|Q_n^{*}(\varvec{\gamma }^{*})-A_n(\varvec{\gamma }^{*})| \ge \epsilon \right) \rightarrow 0. \end{aligned}$$

Lemma 1 implies that the nonsmooth objective function \(Q_n^{*}(\varvec{\gamma }^{*})\) can be uniformly approximated by the quadratic function \(A_n(\varvec{\gamma }^{*})\) in a neighborhood of \(\varvec{0}\). It is also shown that the minimizer of \(A_n(\varvec{\gamma }^{*})\) lies asymptotically within an \(o_p(\sqrt{K_n})\) neighborhood of \(\widehat{\varvec{\gamma }^{*}}\); that is, \(|\widehat{\varvec{\gamma }^{*}}-(2 \tau )^{-1}(\theta _n^2\mathbf{Z}^{T}\mathbf{Z})^{-1}\varvec{S}_n(\varvec{0})|=o_p(\sqrt{K_n})\) (see “Appendix”). This further allows us to derive the asymptotic distribution.

Let \(\check{ \beta }_l(u)=E[\hat{ \beta }_l(u)\mid \fancyscript{X}]\) be the mean of \(\hat{ \beta }_l(u)\) conditioning on \(\fancyscript{X}=\{(\varvec{X}_i,U_i)\}_{i=1}^n\). It is useful to consider the decomposition \(\hat{\beta }_l(u)- \beta _l(u)=\hat{ \beta }_l(u)-\check{ \beta }_l(u)+\check{ \beta }_l(u)- \beta _l(u)\), where \(\hat{ \beta }_l(u)-\check{ \beta }_l(u)\) and \(\check{ \beta }_l(u)- \beta _l(u)\) contribute to the variance and bias terms, respectively. Denote \( \check{\varvec{\beta }}(u)=(\check{\beta }_1(u),\ldots ,\check{\beta }_p(u))\). The following two theorems establish the consistency and asymptotic normality of the rank-based spline estimator, respectively.

Theorem 1

Suppose Conditions (C1)–(C5) hold. If \(K_n \log K_n/n\rightarrow 0\), then \(||\hat{\varvec{\beta }}-\varvec{\beta }||^2_{L_2}=O_p(\rho _n^2+{K_n}/{n})\); consequently, if \(\rho _n\rightarrow 0\), then \(\hat{\beta _l}, l=1,\ldots ,p\) are consistent.

Theorem 2

Suppose Conditions (C1)–(C5) hold. If \(K_n \log K_n/n\rightarrow 0\), then

$$\begin{aligned} \left\{ {\mathrm{var}}[\hat{\varvec{\beta }}(u)]\right\} ^{-1/2}\left( \hat{\varvec{\beta }}(u)-\check{\varvec{\beta }}(u)\right) \mathop {\rightarrow }\limits ^{d}N\left( 0,\mathbf{I}_p\right) . \end{aligned}$$

The above two theorems are parallel to those in Huang et al. (2004). Theorem 1 implies that the magnitude of the bias is bounded in probability by the best approximation rate attainable by the spline spaces \(\mathbb {G}_l\). Theorem 2 provides the asymptotic normality and can thus be used to construct confidence intervals.

Next, we study the ARE of the rank-based spline estimator with respect to the LS-based polynomial spline estimator [denoted by \(\hat{\varvec{\beta }}_{P}(u)\)] for estimating \(\varvec{\beta }(u)\) in the varying coefficient model, denoted by \(\text{ ARE }(\hat{\varvec{\beta }}(u),\hat{\varvec{\beta }}_{P}(u))\). Unlike the ARE study in Wang et al. (2009), in which the theoretical optimal bandwidth of the local polynomial estimators is used, it seems difficult to plug in theoretically optimal \(K_l\)’s in evaluating \(\text{ ARE }(\hat{\varvec{\beta }}(u),\hat{\varvec{\beta }}_{P}(u))\), because closed-form optimal \(K_l\)’s are not available. Thus, in the following analysis, we consider a common choice of the smoothing parameters for the two spline estimators \( \hat{\varvec{\beta }}(u)\) and \(\hat{\varvec{\beta }}_{P}(u)\).

According to Huang et al. (2004), we know that

$$\begin{aligned} {\mathrm{var}}\left[ \hat{\varvec{\beta }}_{P}(u)\mid \fancyscript{X}\right] =\sigma ^2 \mathbf{B}(u)\left( \mathbf{Z}^{T}\mathbf{Z}\right) ^{-1}\mathbf{B}^{T}(u), \end{aligned}$$

where \(\sigma ^2\) is the variance of \(\varepsilon \). Now, we give the conditional variance of \(\hat{\varvec{\beta }}(u)\). From the proof of Theorem 1 in the Appendix, we have

$$\begin{aligned} {\mathrm{var}}\left[ \hat{\varvec{\beta }}(u)\mid \fancyscript{X}\right]&=\mathbf{B}(u){\mathrm{var}}(\hat{\varvec{\gamma }})\mathbf{B}^{T}(u)\\&=\frac{1}{4\tau ^2}\theta _n^2\mathbf{B}(u)(\theta _n^2\mathbf{Z}^{T}\mathbf{Z})^{-1}{\mathrm{var}}\left[ \varvec{S}_n(\varvec{0})|\fancyscript{X}\right] \left( \theta _n^2\mathbf{Z}^{T}\mathbf{Z}\right) ^{-1}\mathbf{B}^{T}(u), \end{aligned}$$

where

$$\begin{aligned} {\mathrm{var}}[\varvec{S}_n(\varvec{0})\mid \fancyscript{X}]&=n^{-2}\theta _n^2E\left\{ \sum _{i<j}(\varvec{Z}_i^{T}-\varvec{Z}_j^{T}){\mathrm{sgn}}\left( (Y_i-Y_j)-(\varvec{Z}_i-\varvec{Z}_j)\varvec{\gamma }_0\right) \right\} ^2\\&=n^{-2}\theta _n^2\left\{ \sum _{i<j}(\varvec{Z}_i^{T}-\varvec{Z}_j^{T})\right\} ^2 E\left( 2H(\varepsilon )-1\right) ^2+o(1)\\&=\frac{1}{3}\theta _n^2\mathbf{Z}^{T}\mathbf{Z}+o(1), \end{aligned}$$

and \(H(\cdot )\) denotes the distribution function of \(\varepsilon \); the second equality follows from the independence of \(\varepsilon \) and \(({\varvec{X}}, U)\). Thus,

$$\begin{aligned} {\mathrm{var}}\left[ \hat{\varvec{\beta }}(u)\mid \fancyscript{X}\right] =\frac{1}{12\tau ^2}\mathbf{B}(u)\left( \mathbf{Z}^{T}\mathbf{Z}\right) ^{-1}\mathbf{B}^{T}(u)+o(1). \end{aligned}$$

It immediately follows from Theorem 2 that the ARE of \(\hat{\varvec{\beta }}(u)\) with respect to \(\hat{\varvec{\beta }}_{P}(u)\) is

$$\begin{aligned} \text{ ARE }\left( \hat{\varvec{\beta }}(u),\hat{\varvec{\beta }}_{P}(u)\right) =12\sigma ^2 \tau ^2. \end{aligned}$$

Remark 2

This asymptotic relative efficiency is the same as that of the signed-rank Wilcoxon test with respect to the t test. It is well known in the literature of rank analysis that the ARE is as high as 0.955 for normal error distribution, and can be significantly higher than one for many heavier-tailed distributions. For instance, this quantity is 1.5 for the double exponential distribution, and 1.9 for the \(t\) distribution with three degrees of freedom. For symmetric error distributions with finite Fisher information, this asymptotic relative efficiency is known to have a lower bound equal to 0.864.
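As a quick sanity check, the ARE values quoted in this remark can be reproduced by one-dimensional numerical integration of \(\tau =\int h^2(t)dt\); a small R sketch (error densities parametrized as above, with \(\sigma ^2\) the corresponding variance) is as follows.

```r
## Numerical check of ARE = 12 * sigma^2 * tau^2, with tau = int h(t)^2 dt.
are <- function(dens, sigma2) {
  tau <- integrate(function(t) dens(t)^2, -Inf, Inf)$value
  12 * sigma2 * tau^2
}
are(dnorm, sigma2 = 1)                            # normal: ~0.955
are(function(t) 0.5 * exp(-abs(t)), sigma2 = 2)   # double exponential: 1.5
are(function(t) dt(t, df = 3), sigma2 = 3)        # t(3): ~1.90
```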

2.3 Automatic selection of smoothing parameters

Smoothing parameters, such as the degree of the splines and the number and locations of the knots, play an important role in nonparametric models. However, due to the computational complexity, automatically selecting all three smoothing parameters is difficult in practice. In this paper, we select only \(D=D_l\), the common number of knots for the \(\beta _l(\cdot )\)’s, using the data; the knots are equally spaced and the degree of the splines is fixed. We use “leave-one-out” cross-validation to choose \(D\). To be more specific, let \(\hat{\varvec{\beta }}^{(i)}(u)\) be the spline estimator obtained by deleting the \(i\)-th sample. The cross-validation procedure minimizes the target function

$$\begin{aligned} CV(D)=\frac{1}{n}\sum _{i=1}^n\left( Y_i-\varvec{X}_{i}^T \hat{\varvec{\beta }}^{(i)}(U_i)\right) ^2. \end{aligned}$$

In practice, other criteria, such as GCV, fivefold CV, BIC and AIC, can also be used. Our simulation studies show that those procedures are also quite effective, and the variable selection results are hardly affected by the choice of the selection procedure for \(D_l\). Moreover, in this paper we restrict our attention to splines of degree \(d=3\) (cubic splines), which works well for the applications we consider. It might be worthwhile to use the data to decide the knot positions (free-knot splines), which definitely merits future research. Also, one need not use the same number of knots and spline degree for every coefficient function, because different coefficient functions may have different features.
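A brute-force version of this leave-one-out criterion can be sketched as follows. This is our own illustrative code: it reuses fit_rank_spline from the sketch in Sect. 2.1 (with the splines package loaded), assumes the index variable lies in \([0,1]\) with equally spaced interior knots, and refits the model \(n\) times, so it is intended only to make the recipe concrete.

```r
## Leave-one-out CV over the number of interior knots D (cubic splines with
## equally spaced knots); assumes fit_rank_spline() from the earlier sketch.
cv_D <- function(D, Y, X, U, degree = 3) {
  knots <- seq(0, 1, length.out = D + 2)[-c(1, D + 2)]    # D interior knots
  B <- bs(U, knots = knots, degree = degree, intercept = TRUE,
          Boundary.knots = c(0, 1))
  Z <- do.call(cbind, lapply(seq_len(ncol(X)), function(l) X[, l] * B))
  pe <- numeric(length(Y))
  for (i in seq_along(Y)) {
    gam   <- fit_rank_spline(Y[-i], Z[-i, , drop = FALSE])
    pe[i] <- (Y[i] - drop(Z[i, , drop = FALSE] %*% gam))^2
  }
  mean(pe)   # CV(D)
}
## Dcand <- 1:6
## D_hat <- Dcand[which.min(sapply(Dcand, cv_D, Y = Y, X = X, U = U))]
```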

3 Rank-based variable selection and estimation

In this section, in order to conduct variable selection for the varying coefficient model in a computationally efficient manner, we incorporate the SCAD penalty function into the objective function (3) to implement nonparametric estimation and variable selection simultaneously.

3.1 The SCAD-penalty method

Now, suppose that some variables are not relevant in the regression model, so that the corresponding coefficient functions are zero functions. Let \(\mathbf{R}_k=(r_{ij})_{K_k \times K_k}\) be a matrix with entries \(r_{ij}=\int B_{ki}(t)B_{kj}(t) dt\). Then, we define \(||\varvec{\gamma }_k||_{R_k}^2\equiv \varvec{\gamma }_k^{T} \mathbf{R}_k \varvec{\gamma }_k\). The penalized rank-based loss function is then defined as

$$\begin{aligned} PL_n(\varvec{\gamma })=\frac{1}{n}\sum _{i<j}|e_i-e_j|+n \sum _{k=1}^p p_{\lambda _n}\left( ||\varvec{\gamma }_k||_{R_k}\right) \!, \end{aligned}$$
(4)

where \(\lambda _n\) is the tuning parameter and \(p_{\lambda }(\cdot )\) is chosen as the SCAD penalty function of Fan and Li (2001), defined as

$$\begin{aligned} p_{\lambda }(u)=\left\{ \begin{array}{cc}\lambda u, &{} 0\le u \le \lambda ,\\ -\frac{u^2-2a\lambda u+\lambda ^2}{2(a-1)}, &{} \lambda <u<a\lambda ,\\ \frac{(a+1)\lambda ^2}{2}, &{} u\ge a\lambda . \end{array}\right. \end{aligned}$$

where \(a\) is another tuning parameter; here we adopt \(a=3.7\) as suggested by Fan and Li (2001). This penalized loss function takes a form similar to that of Wang et al. (2008), except that the rank-based loss function is used instead of an LS-based one. An estimator of \(\beta _l(u)\) is obtained as \(\bar{\beta }_l(u)=\sum _{k=1}^{K_l} \bar{\gamma }_{lk}B_{lk}(u)\), where the \(\bar{\gamma }_{lk}\)’s are the minimizers of (4). In practice, one can also replace the SCAD in (4) with the adaptive LASSO penalty, and we can expect the resulting procedure to have similar asymptotic properties and comparable finite-sample performance (Zou 2006).
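For reference, the SCAD penalty, its derivative (used in the local quadratic approximation of Sect. 3.2), and the Gram matrices \(\mathbf{R}_k\) can be coded as in the following sketch; the helper names are ours, and the grid-based integration of \(\mathbf{R}_k\) is a numerical approximation.

```r
library(splines)

## SCAD penalty of Fan and Li (2001) and its derivative, with a = 3.7.
p_scad <- function(u, lambda, a = 3.7) {
  ifelse(u <= lambda, lambda * u,
  ifelse(u < a * lambda,
         -(u^2 - 2 * a * lambda * u + lambda^2) / (2 * (a - 1)),
         (a + 1) * lambda^2 / 2))
}
dp_scad <- function(u, lambda, a = 3.7) {
  ifelse(u <= lambda, lambda,
  ifelse(u < a * lambda, (a * lambda - u) / (a - 1), 0))
}

## R_k with entries int B_{ki}(t) B_{kj}(t) dt, approximated on a fine grid
## over [0, 1] (equally spaced interior knots, cubic splines).
gram_Rk <- function(knots, degree = 3, ngrid = 1000) {
  tg <- seq(0, 1, length.out = ngrid)
  Bg <- bs(tg, knots = knots, degree = degree, intercept = TRUE,
           Boundary.knots = c(0, 1))
  crossprod(Bg) / ngrid          # Riemann-sum approximation of the integrals
}
```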

3.2 Computational algorithm

Because the penalized loss (4) is nondifferentiable, commonly used gradient-based optimization methods are not applicable here. In this section, we develop an iterative algorithm using local quadratic approximations of the rank-based objective function \(\sum _{i<j}|e_i-e_j|\) and the nonconvex penalty function \(p_{\lambda _n}(||\varvec{\gamma }_k||_{R_k})\). Let \(R(e_i)\) denote the rank of \(e_i\) among \(\{e_i\}_{i=1}^n\). Following Sievers and Abebe (2004), the objective function is approximated by

$$\begin{aligned} \sum _{i<j}|e_i-e_j|\approx \sum _{i=1}^n w_i\left( e_i-\zeta \right) ^2 \end{aligned}$$
(5)

where \(\zeta \) is the median of \(\{e_i\}_{i=1}^n\) and

$$\begin{aligned} w_i=\left\{ \begin{array}{cc}\frac{\frac{R(e_i)}{n+1}-\frac{1}{2}}{e_i-\zeta }, &{} e_i \not = \zeta ,\\ 0, &{} \text{ otherwise. } \end{array}\right. \end{aligned}$$

Moreover, following Fan and Li (2001), in the neighborhood of a given positive \(u_0 \in R^{+}\),

$$\begin{aligned} p_{\lambda }(u)\approx p_{\lambda }(u_0)+\frac{1}{2}\left[ p_{\lambda }^{'} (u_0)/u_0\right] \left( u^2-u^2_0\right) . \end{aligned}$$

Then, given an initial value, \(\varvec{\gamma }_k^{(0)}\), with \(||\varvec{\gamma }_k^{(0)}||_{R_k}>0\), the corresponding weights \(w_i^{(0)}\) and the median of residuals, \(\zeta ^{(0)}\), can be obtained. Consequently, the penalized loss function (4) can be approximated by a quadratic form

$$\begin{aligned} PL_n(\varvec{\gamma })&\approx \frac{1}{n}\sum _{i=1}^n w_i^{(0)}(e_i-{\zeta }^{(0)})^2+\sum _{k=1}^p p_{\lambda _n}(||\varvec{\gamma }_k^{(0)}||_{R_k})\\&\quad +\frac{1}{2}\sum _{k=1}^p\left\{ \frac{p_{\lambda _n}^{'}(||\varvec{\gamma }_k^{(0)}||_{R_k})}{||\varvec{\gamma }_k^{(0)}||_{R_k}}\right\} \left\{ \varvec{\gamma }_k^{T}\mathbf{R}_k \varvec{\gamma }_k-(\varvec{\gamma }_k^{(0)})^{T}\mathbf{R}_k \varvec{\gamma }_k^{(0)}\right\} . \end{aligned}$$

Consequently, removing an irrelevant constant, the above quadratic form becomes

$$\begin{aligned} PL_n(\varvec{\gamma })\approx \frac{1}{n}\left( \varvec{S}^{(0)}-\mathbf{Z}\varvec{\gamma }\right) ^T\mathbf{W}^{(0)}(\varvec{S}^{(0)}-\mathbf{Z}\varvec{\gamma })+\frac{1}{2}\varvec{\gamma }^{T}\varvec{\Omega }_{\lambda _n}(\varvec{\gamma }^{(0)})\varvec{\gamma }, \end{aligned}$$

where \(\varvec{S}^{(0)}=\varvec{Y}-\zeta ^{(0)}\), \(\mathbf{W}^{(0)}\) is the diagonal matrix with the \(w_i\)’s on its diagonal, and \(\varvec{\Omega }_{\lambda _n}(\varvec{\gamma }^{(0)})\) is the block-diagonal matrix with blocks \(\{p_{\lambda _n}^{'}(||\varvec{\gamma }_k^{(0)}||_{R_k})/||\varvec{\gamma }_k^{(0)}||_{R_k}\} \mathbf{R}_k\). This is a quadratic form whose minimizer satisfies

$$\begin{aligned} \left\{ \mathbf{Z}^{T}\mathbf{W}^{(0)}\mathbf{Z}+\frac{n}{2} \varvec{\Omega }_{\lambda _n}(\varvec{\gamma }^{(0)})\right\} \varvec{\gamma }=\mathbf{Z}^{T}\mathbf{W}^{(0)}\varvec{S}^{(0)}. \end{aligned}$$
(6)

The foregoing discussion leads to the following algorithm:

Step 1: Initialize \(\varvec{\gamma }=\varvec{\gamma }^{(0)}\).

Step 2: Given \(\varvec{\gamma }^{(m)}\), update \(\varvec{\gamma }\) to \(\varvec{\gamma }^{(m+1)}\) by solving (6), where the \(\varvec{\gamma }^{(0)}\) appearing in \(\mathbf{W}^{(0)}\), \(\varvec{S}^{(0)}\) and \(\varvec{\Omega }_{\lambda _n}(\varvec{\gamma }^{(0)})\) is set to \(\varvec{\gamma }^{(m)}\).

Step 3: Iterate Step 2 until convergence of \(\varvec{\gamma }\) is achieved.

Due to the use of the nonconvex SCAD penalty, the global minimizer cannot be guaranteed in general and only local minimizers can be obtained (Fan and Li 2001); all penalized methods based on nonconvex penalties share this drawback. Thus, a suitable initial value is usually required for fast convergence. The initial estimator of \(\varvec{\gamma }\) in Step 1 can be chosen as the unpenalized estimator, which can be obtained by fitting an \(L_1\) regression on the \({n(n-1)}/{2}\) pseudo observations \(\{(\varvec{Z}_i-\varvec{Z}_j, Y_i-Y_j)\}_{i<j}\). In our numerical studies, we use the function rq in the R package quantreg. In our numerical experience, the algorithm converges quickly from the unpenalized estimator, and the resulting solution is reasonably good, as demonstrated in our simulation study.

Note that the iterative algorithm can be unstable when some weights in (5) are too large. As suggested by Sievers and Abebe (2004), the algorithm should be modified to remove residuals with very large weights from the iteration and reinstate them in subsequent iterations when their contribution to the sum \(\sum _{i=1}^n w_i(e_i-\zeta )^2\) becomes significant. Such an algorithm is quite efficient and reliable in practice. The R code for implementing the proposed scheme is available from the authors upon request. It is worth noting that we iteratively approximate both the original target function and the penalty function. In our numerical experience, the algorithm is usually completed in fewer than ten iterations and never fails to converge. For example, it takes less than 1 s per iteration in R on an Intel Core 2.2 GHz CPU for a case with \(n=200\) and \(p=7\), and the entire procedure is generally completed in less than 10 s. Theoretical investigation of the convergence property of the proposed algorithm deserves future research.
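A compact sketch of one pass of Steps 1–3 is given below. It is illustrative only: it assumes the design matrix \(\mathbf{Z}\), a list Rk of the Gram matrices \(\mathbf{R}_k\), dp_scad from the sketches above, and a common basis dimension \(K\) for all coefficient functions; the small constant eps is a pragmatic guard against division by zero and is not part of the algorithm described in the text.

```r
## One update of the iterative algorithm: compute the weights in (5), form the
## block-diagonal Omega, and solve the linear system (6).
update_gamma <- function(gamma0, Y, Z, Rk, lambda, eps = 1e-8) {
  n <- length(Y); p <- length(Rk); K <- ncol(Z) / p
  e <- as.numeric(Y - Z %*% gamma0)
  zeta <- median(e)
  w <- ifelse(abs(e - zeta) > eps,
              (rank(e) / (n + 1) - 0.5) / (e - zeta), 0)   # weights in (5)
  W <- diag(w)
  Omega <- matrix(0, ncol(Z), ncol(Z))                     # block-diagonal
  for (k in seq_len(p)) {
    idx <- ((k - 1) * K + 1):(k * K)
    nrm <- sqrt(drop(crossprod(gamma0[idx], Rk[[k]] %*% gamma0[idx])))
    Omega[idx, idx] <- dp_scad(nrm, lambda) / max(nrm, eps) * Rk[[k]]
  }
  S <- Y - zeta
  ## {Z' W Z + (n/2) Omega} gamma = Z' W S, cf. (6)
  drop(solve(t(Z) %*% W %*% Z + (n / 2) * Omega, t(Z) %*% W %*% S))
}
## Step 1: gamma <- unpenalized fit; Steps 2-3: repeat
##   gamma <- update_gamma(gamma, Y, Z, Rk, lambda)
## until the change in gamma falls below a tolerance.
```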

3.3 Asymptotic properties

Without loss of generality, let \(\beta _k(u), k=1,\ldots , s\), be the nonzero coefficient functions and let \(\beta _k(u)\equiv 0\), for \(k=s+1,\ldots , p\).

Theorem 3

Suppose Conditions (C1)–(C5) hold. If \(K_n \log K_n/n\rightarrow 0, \rho _n\rightarrow 0, \lambda _n \rightarrow 0\), and \(\lambda _n/\max \left\{ \sqrt{{K_n}/{n}},\rho _n\right\} \rightarrow \infty \), we have the following:

  1. (i)

    \(\bar{\beta }_k=0,\,k=s+1,\ldots , p\), with probability approaching 1.

  2. (ii)

    \(||\bar{\beta }_k-\beta _k||_{L_2}=O_p\left( \max \left\{ \sqrt{\frac{K_n}{n}}, \rho _n\right\} \right) , k=1,\ldots ,s\).

Part (i) of Theorem 3 says that the proposed penalized rank-based method is consistent in variable selection; that is, it can identify the zero coefficient functions with probability tending to one. The second part provides the rate of convergence in estimating the nonzero coefficient functions.

Now we consider the asymptotic variance of the proposed estimator. Let \(\varvec{\beta }^{(1)}=(\beta _1,\ldots ,\beta _s)^{T}\) denote the vector of nonzero coefficient functions, and let \( \bar{\varvec{\beta }}^{(1)}=(\bar{\beta }_1,\ldots ,\bar{\beta }_s)^{T}\) denote its estimate obtained by minimizing (4). Let \(\bar{\varvec{\gamma }}^{(1)}=(\bar{\varvec{\gamma }}_1^{T}, \ldots , \bar{\varvec{\gamma }}_s^{T})^T\), and let \(\mathbf{Z}^{(1)}\) denote the columns of \(\mathbf{Z}\) corresponding to \(\varvec{\beta }^{(1)}\). Using Lemma 1 and the quadratic approximation described in the previous subsection, we obtain another approximate loss function

$$\begin{aligned} PL^{'}_n(\varvec{\gamma })=A_n\left( \theta _n^{-1}(\varvec{\gamma }-\varvec{\gamma }_0)\right) +\frac{1}{2}\varvec{\gamma }^{T}\varvec{\Omega }_{\lambda _n}(\varvec{\gamma })\varvec{\gamma }. \end{aligned}$$
(7)

Similarly, let \(\varvec{\Omega }_{\lambda }^{(1)}\) denote the selected diagonal blocks of \(\varvec{\Omega }_{\lambda }\), and \(\varvec{S}^{(1)}_n(\varvec{0})\) denote the selected entries corresponding to \(\varvec{\beta }^{(1)}\). Thus, the minimizer of (7) yields

$$\begin{aligned} \bar{\varvec{\gamma }}^{(1)}=\left\{ 2 \tau (\mathbf{Z}^{(1)})^{T}\mathbf{Z}^{(1)}+\frac{n}{2}\varvec{\Omega }^{(1)}_{\lambda _n}(\bar{\varvec{\gamma }}^{(1)})\right\} ^{-1}\theta _n^{-1}\varvec{S}^{(1)}_n(\varvec{0})+\varvec{\gamma }_0. \end{aligned}$$

Denote \(\mathbf{H}^{(1)}=2 \tau (\mathbf{Z}^{(1)})^{T}\mathbf{Z}^{(1)}+\frac{n}{2}\varvec{\Omega }^{(1)}_{\lambda _n}(\bar{\varvec{\gamma }}^{(1)})\); then the asymptotic variance of \(\bar{\varvec{\gamma }}^{(1)}\) is

$$\begin{aligned} {\mathrm{avar}}(\bar{\varvec{\gamma }}^{(1)})=\theta _n^{-2}(\mathbf{H}^{(1)})^{-1}{\mathrm{var}}(\varvec{S}^{(1)}_n(\varvec{0}))(\mathbf{H}^{(1)})^{-1}. \end{aligned}$$

Since \(\bar{\varvec{\beta }}^{(1)}(u)={\mathbf{B}^{(1)}}(u)\bar{\varvec{\gamma }}^{(1)}\), where \(\mathbf{B}^{(1)}(u)\) is the submatrix of \(\mathbf{B}(u)\) consisting of its first \(s\) rows and the corresponding columns, we have \({\mathrm{avar}}(\bar{\varvec{\beta }}^{(1)})={\mathbf{B}^{(1)}}{\mathrm{avar}}(\bar{\varvec{\gamma }}^{(1)}) ({\mathbf{B}^{(1)}})^{T}\). Let \({\mathrm{var}}^{*}(\bar{\varvec{\beta }}(u))\) denote the modification of \({\mathrm{avar}}(\bar{\varvec{\beta }}^{(1)})\) obtained by replacing \(\varvec{\Omega }^{(1)}_{\lambda _n}\) with \(0\), that is

$$\begin{aligned} {\mathrm{var}}^{*}\left( \bar{\varvec{\beta }}(u)\right)&= (4\tau ^2)^{-1}\theta _n^{-2}{\mathbf{B}^{(1)}}\left( {(\mathbf{Z}^{(1)})^{T}\mathbf{Z}^{(1)}}\right) ^{-1}{\mathrm{var}}\left( \varvec{S}^{(1)}_n(\varvec{0})\right) \\&\quad \times \left( {(\mathbf{Z}^{(1)})^{T}\mathbf{Z}^{(1)}}\right) ^{-1} \left( \mathbf{B}^{(1)}\right) ^{T}. \end{aligned}$$

Accordingly, the diagonal elements of \({\mathrm{var}}^{*}(\bar{\varvec{\beta }}(u))\) can be employed as the asymptotic variances of \(\bar{ \beta }_k(u)\)’s, i.e., \({\mathrm{avar}}(\bar{ \beta }_k(u)),k=1,\ldots ,s\).

Theorem 4

Suppose Conditions (C1)–(C5) hold, \( K_n \log K_n/n\rightarrow 0\), \(\rho _n\rightarrow 0\), \(\lambda _n \rightarrow 0\), and \(\lambda _n/\max \left\{ \sqrt{{K_n}/{n}},\rho _n\right\} \rightarrow \infty \). Then, as \(n\rightarrow \infty \),

$$\begin{aligned} \left\{ {\mathrm{var}}^{*}\left( \bar{\varvec{\beta }}(u)\right) \right\} ^{-1/2}\left( \bar{\varvec{\beta }}(u)-\breve{\varvec{\beta }}(u) \right) \mathop {\rightarrow }\limits ^{d}N(0,\mathbf{I}_s), \end{aligned}$$

where \(\breve{\varvec{\beta }}(u)=E[\bar{\varvec{\beta }}(u)\mid \fancyscript{X}]\) and in particular,

$$\begin{aligned} \left\{ {\mathrm{var}}^{*}\left( \bar{ \beta }_k(u)\right) \right\} ^{-1/2}\left( \bar{ \beta }_k(u)-\breve{ \beta }_k(u) \right) \mathop {\rightarrow }\limits ^{d}N(0,1),\quad k=1,\ldots , s. \end{aligned}$$

Here \({\mathrm{var}}^{*}(\bar{\varvec{\beta }}(u))\) is exactly the asymptotic variance of the unpenalized rank-based estimator that uses only the covariates with nonzero coefficient functions (see Sect. 2). Theorem 4 thus implies that our penalized rank-based estimator has the oracle property, in the sense that the asymptotic distribution of an estimated coefficient function is the same as that obtained when it is known a priori which variables are in the model.

3.4 Selection of tuning parameters

The tuning parameter \(\lambda \) controls the model complexity and plays a critical role in the above procedure. It is desirable to select \(\lambda \) automatically by a data-driven method. Motivated by the Wilcoxon-type generalized BIC of Wang (2009) in which the multiple linear regression model is considered, we propose to select \(\lambda \) by minimizing

$$\begin{aligned} \text{ BIC }_{\lambda }&=12\hat{\tau }n^{-2}\sum _{i<j}|\left( Y_i-\varvec{Z}_i \bar{\varvec{\gamma }}_{\lambda }\right) -\left( Y_j-\varvec{Z}_j \bar{\varvec{\gamma }}_{\lambda }\right) |+df_{\lambda }\log (n/K_n)/(n/K_n)\nonumber \\&=12\hat{\tau }n^{-2}\sum _{i<j}|\left( Y_i-\varvec{X}_i^T \bar{\varvec{\beta }}_{\lambda }\right) -\left( Y_j-\varvec{X}_j^T \bar{\varvec{\beta }}_{\lambda }\right) |+df_{\lambda }\log (n/K_n)/(n/K_n), \end{aligned}$$
(8)

where \(\bar{\varvec{\gamma }}_{\lambda }\) is the penalized rank-based spline estimator with tuning parameter \(\lambda \), \(df_{\lambda }\) is the number of nonzero components in \(\bar{\varvec{\beta }}_{\lambda }=\mathbf{B} \bar{\varvec{\gamma }}_{\lambda }\), and \(\hat{\tau }\) is an estimate of the Wilcoxon constant \(\tau \). The constant \(\tau \) can be robustly estimated using the approach given in Hettmansperger and McKean (1998); \(\hat{\tau }\) is easily calculated from the unpenalized estimates by the function wilcoxontau in the R package of Terpstra and McKean (2005). We refer to this approach as the BIC-selector and denote the selected \(\lambda \) by \(\hat{\lambda }_{BIC}\). Similar to the BIC in Wang (2009), the first term in (8) can be viewed as an “artificial” likelihood, as it shares some essential properties of a parametric log-likelihood. Note that, due to the use of spline smoothing, the effective sample size is \(n/K_n\) rather than the original sample size \(n\): classic parametric estimation methods are \(\sqrt{n}\)-consistent, while the convergence rate of the spline methods is \(\sqrt{n/K_n}\). That is, for each \(u\), the spline estimator performs similarly to a parametric estimator computed as if a sample of size \(n/K_n\) from the model (1) with coefficient \(\varvec{\beta }(u)\) were available. Therefore, the \(\text{ BIC }_{\lambda }\) in (8) replaces the \(\log (n)/n\) in Wang’s (2009) Wilcoxon-type generalized BIC by \(\log (n/K_n)/(n/K_n)\). As can be seen from the proof of Theorem 5, the BIC cannot achieve consistency without this modification.
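For concreteness, the criterion (8) can be evaluated as in the following sketch (our own illustrative code): \(\hat{\tau }\) is assumed to be supplied externally, e.g., computed from the unpenalized residuals as described above, and a coefficient function is counted as nonzero when its spline coefficient block is not numerically zero.

```r
## BIC_lambda in (8) for a fitted penalized coefficient vector gamma_bar.
bic_lambda <- function(gamma_bar, Y, Z, Kn, p, tau_hat, tol = 1e-8) {
  n <- length(Y); K <- ncol(Z) / p
  e <- as.numeric(Y - Z %*% gamma_bar)
  d <- outer(e, e, "-")
  fit_term <- 12 * tau_hat * sum(abs(d[upper.tri(d)])) / n^2
  ## df_lambda: number of coefficient functions not estimated as zero
  df <- sum(sapply(seq_len(p), function(k)
    any(abs(gamma_bar[((k - 1) * K + 1):(k * K)]) > tol)))
  ne <- n / Kn                                  # effective sample size
  fit_term + df * log(ne) / ne
}
## lambda_hat is then the candidate value minimizing bic_lambda over a grid.
```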

Let \(S_T\) denote the true model, \(S_F\) the full model, and \(S_{\lambda }\) the set of indices of the covariates selected by our robust variable selection method with tuning parameter \(\lambda \). For a given candidate model \(S\), let \(\varvec{\beta }_S\) be a vector of coefficients whose \(i\)th coordinate is set to zero if \(i \not \in S\). Further, define \(L_n^S=n^{-2}\sum _{i<j}|(Y_i-\varvec{X}_i^{T}\hat{\varvec{\beta }}_S)-(Y_j-\varvec{X}_j^{T}\hat{\varvec{\beta }}_S)|\), where \(\hat{\varvec{\beta }}_S\) is the unpenalized robust estimator, i.e., the rank-based spline estimator, for model \(S\). We make the same assumptions as Wang and Li (2009):

  1. (1)

    for any \(S \subset S_F,\,L_n^S\mathop {\rightarrow }\limits ^{p}L^S\) for some \(L^S>0\);

  2. (2)

for any \(S \subset S_F\) with \(S_T \not\subseteq S\), we have \(L^S>L^{S_T}\).

The next theorem indicates that \(\hat{\lambda }_{BIC}\) leads to a penalized rank-based estimator which consistently yields the true model.

Theorem 5

Suppose the above assumptions and Conditions (C1)–(C5) hold. Then we have

$$\begin{aligned} P(S_{\hat{\lambda }_{BIC}}=S_T)\rightarrow 1. \end{aligned}$$

4 Numerical studies

4.1 Simulation

We study the finite-sample performance of the proposed rank-based spline SCAD method (abbreviated RSSCAD hereafter) in this section. Wang and Xia (2009) have shown that KLASSO is an efficient procedure in finite samples, and Wang et al. (2008) proposed an efficient procedure based on least squares and the SCAD (abbreviated LSSCAD hereafter). Thus, KLASSO and LSSCAD serve as two ideal benchmarks in our comparison. For a clear comparison, we adopt the settings used in Wang and Xia (2009) for the following two models:

$$\begin{aligned} \text{(I) }&: Y_i=2\sin \left( 2\pi U_i\right) X_{i1}+4U_i(1-U_i)X_{i2}+\theta \varepsilon _i,\\ \text{(II) }&:Y_i=\exp (2U_i-1)X_{i1}+8U_i(1-U_i)X_{i2}+2\cos ^2\left( 2\pi U_i\right) X_{i3}+\theta \varepsilon _i, \end{aligned}$$

where, for the first model, \(X_{i1}=1\) and \((X_{i2},\ldots ,X_{i7})^T\) are generated from a multivariate normal distribution with \({\mathrm{cov}}(X_{ij_1},X_{ij_2})=\rho ^{|j_1-j_2|}\) for any \(2 \le j_1,j_2 \le 7\), while for the second model, \(X_{i1},\ldots ,X_{i7}\) are generated from a multivariate normal distribution with \({\mathrm{cov}}(X_{ij_1},X_{ij_2})=\rho ^{|j_1-j_2|}\) for any \(1 \le j_1,j_2 \le 7\). Three cases of correlation between the covariates are considered: \(\rho =0.3, 0.5\), and \(0.8\). The index variable is simulated from \(\text{ Uniform }(0,1)\), and the value of \(\theta \) is fixed at 1.5. The following model, which is similar to the one used in Wang et al. (2008), is also included in the comparison:

$$\begin{aligned} \text{(III) }: Y_i=\beta _0(U_i)+\sum _{k=1}^{23}\beta _k(U_i)X_{ik}+\varsigma \varepsilon _i, \end{aligned}$$

where

$$\begin{aligned}&\beta _0(U)=15+20\sin (0.5\pi U),\quad \beta _1(U)=2-3\cos \left( \frac{\pi (6U-5)}{3}\right) ,\\&\beta _2(U)=6-6U,\quad \beta _3(U)=-4+\frac{1}{2}(2-3U)^3, \end{aligned}$$

and the remaining coefficients vanish. The index variable is again simulated from \(\text{ Uniform }(0,1)\). In this model, \(\varvec{X}\) depends on \(U\) in the following way. The first three variables are the truly relevant covariates: \(X_{i1}\) is sampled uniformly from \([3U_i,2+3U_i]\) at any given index \(U_i\); \(X_{i2}\), conditioning on \(X_{i1}\), is Gaussian with mean \(0\) and variance \((1+X_{i1})/(2+X_{i1})\); and \(X_{i3}\), independent of \(X_{i1}\) and \(X_{i2}\), is a Bernoulli random variable with success probability \(0.6\). The remaining irrelevant variables are generated from a multivariate normal distribution with \({\mathrm{cov}}(X_{ij_1},X_{ij_2})=4\exp (-|j_1-j_2|)\) for any \(4\le j_1,j_2\le 23\). The parameter \(\varsigma \), which controls the model’s signal-to-noise ratio, is set to 5. For all three models, four error distributions are considered: \(N(0,1)\), \(t(3)\) (Student’s t-distribution with three degrees of freedom), the Tukey contaminated normal \(T(0.10; 5)\) (with cumulative distribution function \(F(x)=0.9\varPhi (x)+0.1\varPhi (x/5)\), where \(\varPhi (x)\) is the distribution function of the standard normal distribution), and the lognormal. In addition, an outlier case is considered, in which the responses of 10 % of the generated samples are shifted by a constant \(c\); we use \(c=5\) for the first two models and \(c=25\) for the third model.
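For reference, one replicate from model (I) can be generated as in the following sketch (illustrative code; MASS::mvrnorm is used for the correlated normal covariates, and the error generator is passed in so that any of the four error distributions can be plugged in).

```r
library(MASS)   # for mvrnorm

## One data set from model (I); rho gives cov(X_{ij1}, X_{ij2}) = rho^|j1-j2|.
gen_model1 <- function(n, rho = 0.5, theta = 1.5,
                       rerr = function(n) rt(n, df = 3)) {
  U     <- runif(n)
  Sigma <- rho^abs(outer(2:7, 2:7, "-"))
  X     <- cbind(1, mvrnorm(n, mu = rep(0, 6), Sigma = Sigma))
  Y     <- 2 * sin(2 * pi * U) * X[, 1] + 4 * U * (1 - U) * X[, 2] +
           theta * rerr(n)
  list(Y = Y, X = X, U = U)
}
dat <- gen_model1(200)   # e.g., n = 200 with t(3) errors
```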

Table 1 The simulation results of variable selection for model (I) with \(\rho =0.5\)
Table 2 The simulation results of estimated errors of \(\varvec{\beta }\) for model (I) with \(\rho =0.5\)
Table 3 The simulation results of variable selection for model (II) with \(\rho =0.5\)
Table 4 The simulation results of estimated errors of \(\varvec{\beta }\) for model (II) with \(\rho =0.5\)
Table 5 The simulation results of variable selection for model (III)
Table 6 The simulation results of estimated errors of \(\varvec{\beta }\) for model (III)

Throughout this section we use B-splines and 1,000 replications for each example considered. For each simulated data set, we first fit an unpenalized varying coefficient estimate \(\hat{\varvec{\beta }}(U_i)\), for which the number of knots, \(D\), is selected via the method in Sect. 2.3. Then the same \(D\) is used for RSSCAD, where the tuning parameter \(\lambda \) in the penalty function is chosen by the BIC (8). We report the average number of correct 0’s (the average number of true zero coefficients that are correctly estimated to be zero) and the average number of incorrect 0’s (the average number of nonzero coefficients that are incorrectly estimated to be zero). Moreover, we also report the proportions of under-fitted models (at least one of the nonzero coefficients is incorrectly estimated to be zero), correctly fitted models (all the coefficients are selected correctly) and over-fitted models (all the nonzero coefficients are selected but at least one of the zero coefficients is incorrectly estimated to be nonzero). In addition, the performance of the estimators in terms of estimation accuracy is assessed via the following two estimated errors, defined by

$$\begin{aligned} \text{ EE1 }(\hat{\varvec{\beta }}_{a})&=\frac{1}{np}{\sum _{i=1}^n \sum _{j=1}^p|\hat{\beta }_{aj}(U_i)-\beta _{0j}(U_i)|},\\ \text{ EE2 }(\varvec{X}\hat{\varvec{\beta }}_{a})&=\frac{1}{np}{\sum _{i=1}^n \sum _{j=1}^p|X_{ij}\hat{\beta }_{aj}(U_i)-X_{ij}\beta _{0j}(U_i)|}, \end{aligned}$$

where \(\hat{\beta }_{aj}(\cdot )\) is an estimator of the true coefficient function \(\beta _{0j}(\cdot )\). The means (denoted MEE1 and MEE2) and standard deviations (in parentheses) of the EE1 and EE2 values are summarized. It is worth noting that, because KLASSO and RSSCAD use different smoothing approaches, we choose not to tabulate the MEE results of KLASSO to avoid misleading conclusions. Moreover, we also include two further unpenalized methods in the comparison, namely the rank-based spline estimator (RS) and the least-squares spline estimator (LSS).
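The two error measures themselves are straightforward to compute; a minimal sketch, in which beta_hat and beta_true denote the \(n\times p\) matrices of fitted and true coefficient values at the observed \(U_i\)'s, is:

```r
## EE1 and EE2 for an n x p matrix of fitted coefficient values beta_hat,
## the corresponding true values beta_true, and the covariate matrix X.
ee1 <- function(beta_hat, beta_true)    mean(abs(beta_hat - beta_true))
ee2 <- function(beta_hat, beta_true, X) mean(abs(X * (beta_hat - beta_true)))
```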

We summarize the simulation results for models (I)–(II) with \(\rho =0.5\) and for model (III) in Tables 1, 2, 3, 4, 5, and 6, respectively. The simulation results for models (I)–(II) with \(\rho =0.3\) or 0.8 are presented in Tables A.1–A.8 of the supplemental file. A few observations can be made from Tables 1, 2, 3, 4, 5, and 6. First, the proposed RSSCAD method is highly efficient for all the distributions under consideration. In terms of the probability of selecting the true model, RSSCAD performs slightly worse than KLASSO and LSSCAD when the random error comes from the normal distribution, as expected, but it performs significantly better than KLASSO and LSSCAD when the error distribution is nonnormal. For instance, when the errors come from the contaminated normal distribution, KLASSO hardly ever selects the true model even for \(n=400\), whereas RSSCAD selects the true model with quite large probability. For the third model, in which the covariate \(\varvec{X}\) depends on the index \(U\), RSSCAD is still quite effective in selecting the true variables. Also, from Tables 1, 3, and 5, we can see that the proposed smoothing parameter selection method and the BIC (8) perform satisfactorily and conform to the asymptotic results shown in Sects. 3.3 and 3.4.

In the literature, it is well documented that BIC tends to identify the true sparse model well but can result in some under-fitting when the sample size is not sufficiently large (as in the cases with \(n=200\)). As shown in Theorem 5, the BIC is still consistent for selecting the variables in the present problem. When the sample size is larger (such as \(n=400\)), our method selects the correctly fitted model with quite large probability, at least \(0.9\). From Tables 2, 4, and 6, we observe that the MEEs of the penalized methods are smaller than those of the corresponding unpenalized methods in all cases. This means that the variable selection procedure evidently increases the efficiency of the estimators. Furthermore, the rank-based methods (RS and RSSCAD) perform better than the corresponding least squares methods (LSS and LSSCAD) when the error deviates from a normal distribution. Even for the normal distribution, the MEEs of the rank-based methods are only slightly larger than those of the least squares methods. This again reflects the robustness of our rank-based method to distributional assumptions. Moreover, when the correlation between the covariates increases (decreases), all three penalized methods become worse (better), but the comparison leads to conclusions similar to those for \(\rho =0.5\) (see Tables A.1–A.8 in the supplemental file). We also examine other error variance magnitudes for both models, and the conclusions are similar.

To examine how well the method estimates the coefficient functions, Fig. 1 shows the estimates of the coefficient functions \(\beta _1(\cdot )\) and \(\beta _2(\cdot )\) for model (I) with the normal and lognormal errors when \(\rho =0.5\) and \(n=200\). It can be seen that the estimates fit the true functions well on average. The lower and upper confidence bands deviate markedly from the true function at the right boundary, especially for \(\beta _2(\cdot )\); this may be caused by the sparsity of data in that region. The curves for the other error distributions, which give similar pictures of the estimated functions, are shown in Figure A.1 of the supplemental file.

4.2 The Boston housing data

To further illustrate the usefulness of RSSCAD, we consider the Boston housing data, which have been analyzed by Wang and Xia (2009) and are publicly available in the R package mlbench (http://cran.r-project.org/). Following Wang and Xia (2009), we take MEDV [median value of owner-occupied homes in 1,000s of United States dollars (USD)] as the response, LSTAT (the percentage of lower-status population) as the index variable, and the following predictors as the \(X\)-variables: CRIM (per capita crime rate by town), RM (average number of rooms per dwelling), PTRATIO (pupil-teacher ratio by town), NOX (nitric oxides concentration, parts per 10 million), TAX (full-value property-tax rate per 10,000 USD), and AGE (proportion of owner-occupied units built prior to 1940). Figure A.2 in the supplemental file shows the normal qq-plot of the residuals obtained from a standard local linear non-penalized varying coefficient fit (Fan and Zhang 2008); it clearly indicates that the errors are not normal. In Wang and Xia (2009), the variables are first transformed so that their marginal distributions are approximately \(N(0, 1)\). In our analysis, we do not take this transformation step, since RSSCAD is designed with robustness in mind. Similar to Wang and Xia (2009), the index variable, LSTAT, is transformed so that its marginal distribution is \(U[0, 1]\).

Fig. 1

Fitted regression coefficient functions of model (I) with the normal and lognormal errors when \(\rho =0.5\) and \(n=200\). The red line is the true coefficient function and the black solid line is the average of the estimated coefficient functions over 1,000 replications. The lower and upper dashed lines form the 95 % confidence bands (color figure online)

Fig. 2

a–c The RSSCAD estimates of the relevant coefficients NOX, RM, and PTRATIO; d–f the unpenalized estimates of the irrelevant coefficients

A standard “leave-one-out” cross-validation suggests an optimal number of knots \(D=5\). The RSSCAD method is then applied to the data with this number of knots. The optimal shrinkage parameter selected by the BIC criterion (8) is \(\hat{\lambda }=0.0284\). The resulting RSSCAD estimate suggests that NOX, RM, and PTRATIO are all relevant variables, whereas CRIM, TAX, and AGE do not appear significant for predicting MEDV. To confirm whether the selected variables (NOX, RM, and PTRATIO) are truly relevant, we provide their RSSCAD estimates with 95 % confidence bands in Fig. 2a–c. They all suggest that these three coefficients are unlikely to be constant zero, because none of them is close to 0. The unpenalized estimates of the eliminated variables CRIM, TAX and AGE are shown in Fig. 2d–f; they are close to zero over the entire range of the index variable LSTAT. Thus, Fig. 2 further confirms that the variables eliminated by RSSCAD are unlikely to be relevant. In contrast, without transforming the data, the KLASSO estimate suggests that all the variables except AGE are relevant. Therefore, the proposed RSSCAD is a reasonable alternative for variable selection in the varying coefficient model, taking its efficiency, convenience and robustness into account.

5 Discussion

It is of interest to extend our proposed methodology to other more complex models, such as varying coefficient partially linear models (Li and Liang 2008; Zhao and Xue 2009); in fact, this amounts to adding further penalty terms to the rank-based loss function. Moreover, it is also of great interest to see whether RSSCAD and its oracle property remain valid in high-dimensional settings in which \(p\) diverges and may even exceed the sample size \(n\). The consistency of the BIC criterion proposed in Sect. 3.4 deserves further study as well. Furthermore, our rank-based spline estimator can also handle the case in which the distribution of the error term varies with the index variable. For example, we can consider the varying coefficient model \(Y=\varvec{X}^T(U)\varvec{\beta }(U)+\sigma (U)\varepsilon \), where \(\sigma (U)\) is a smooth function and the random error \(\varepsilon \) is independent of \(\varvec{X}\) and \(U\). With certain modifications of the proofs and conditions, we are able to establish the consistency of the rank-based methods.