Abstract
The varying coefficient model is widely used as an extension of the linear regression model. Many procedures have been developed for the model estimation, and recently efficient variable selection procedures for the varying coefficient model have been proposed as well. However, those variable selection approaches are mainly built on the least-squares (LS) type method. Although the LS method is a successful and standard choice in the varying coefficient model fitting and variable selection, it may suffer when the errors follow a heavy-tailed distribution or in the presence of outliers. To overcome this issue, we start by developing a novel robust estimator, termed rank-based spline estimator, which combines the ideas of rank inference and polynomial spline. Furthermore, we propose a robust variable selection method, incorporating the smoothly clipped absolute deviation penalty into the rank-based spline loss function. Under mild conditions, we theoretically show that the proposed rank-based spline estimator is highly efficient across a wide spectrum of distributions. Its asymptotic relative efficiency with respect to the LS-based method is closely related to that of the signed-rank Wilcoxon test with respect to the t test. Moreover, the proposed variable selection method can identify the true model consistently, and the resulting estimator can be as efficient as the oracle estimator. Simulation studies show that our procedure has better performance than the LS-based method when the errors deviate from normality.
1 Introduction
Consider the varying coefficient model
where \(Y\) is the response variable, \(U\) and \(\varvec{X}\) are the covariates, and \(\varvec{\beta }(U)\) are some unknown smooth functions. The random error \(\varepsilon \) is independent of \(\varvec{X}\) and \(U\), and has probability density function \(h(\cdot )\) which has finite Fisher information. In this paper, it is assumed that \(U\) is a scalar and \(\varvec{X}\) is a \(p\)-dimensional vector which may depend on \(U\). Since introduced by Hastie and Tibshirani (1993), the varying coefficient model has been widely applied in many scientific areas, such as economics, finance, politics, epidemiology, medical science, ecology, and so on.
Due to its flexibility and interpretability, in the past ten years it has experienced rapid developments in both theory and methodology; see Fan and Zhang (2008) for a comprehensive survey. In general, there are at least three common approaches to estimating this model: kernel-based local polynomial smoothing (see, for instance, Wu et al. 1998; Hoover et al. 1998; Fan and Zhang 1999; Kauermann and Tutz 1999), polynomial splines (see Huang et al. 2002, 2004; Huang and Shen 2004), and smoothing splines (see Hastie and Tibshirani 1993; Hoover et al. 1998; Chiang et al. 2001). Recently, efficient variable selection procedures for the varying coefficient model have been proposed as well. In a typical linear regression setup, it is well understood that ignoring any important predictor can lead to seriously biased results, whereas including spurious covariates can degrade the estimation efficiency substantially. Thus, variable selection is important for any regression problem. In the traditional linear regression setting, many selection criteria, e.g., the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), have been used extensively in practice. Recently, various shrinkage methods have been developed, which include but are not limited to the least absolute shrinkage and selection operator (LASSO; cf. Tibshirani 1996; Zou 2006) and the smoothly clipped absolute deviation (SCAD; Fan and Li 2001). Such regularized estimation procedures have also been developed for varying coefficient models. Among others, Lin and Zhang (2006) develop the COSSO for component selection and smoothing in smoothing spline ANOVA. Wang et al. (2007) propose a group SCAD method for varying coefficient model selection. Wang et al. (2008) extend the application of the SCAD penalty to varying coefficient models with longitudinal data.
Li and Liang (2008) study variable selection for partially linear varying coefficient models, where the parametric components are identified via the SCAD but the nonparametric components are selected via a generalized likelihood ratio test, instead of a shrinkage method. Leng (2009) proposes a penalized likelihood method in the framework of the smoothing spline ANOVA models. Wang and Xia (2009) develop a shrinkage method, called KLASSO (Kernel-based LASSO), which combines the ideas of the local polynomial smoothing and LASSO. Tang et al. (2012) develop a unified variable selection approach for both least squares regression and quantile regression models with possibly varying coefficients. Their method is carried out by using a two-step iterative procedure based on basis expansion and an adaptive-LASSO-type penalty.
The estimation and variable selection procedures in the aforementioned papers are mainly built on least-squares (LS) type methods. Although LS methods are a successful and standard choice in varying coefficient model fitting, they may suffer when the errors follow a heavy-tailed distribution or in the presence of outliers. Thus, some efforts have been devoted to constructing robust estimators for varying coefficient models. Kim (2007) develops a quantile regression procedure for varying coefficient models when the random errors are assumed to have a certain quantile equal to zero. Wang et al. (2009) recently develop a local rank estimation procedure, which integrates rank regression (Hettmansperger and McKean 1998) and local polynomial smoothing. In traditional linear regression settings, robust variable selection has also drawn much attention. Wang et al. (2007) propose a LASSO-based procedure using least absolute deviation regression. Zou and Yuan (2008) propose the composite quantile regression (CQR) estimator by averaging \(K\) quantile regressions; they show that CQR is selection consistent and can be more robust in various circumstances. Wang and Li (2009) and Leng (2010) independently propose two efficient shrinkage estimators using the idea of rank regression. However, to the best of our knowledge, there has hitherto been no appropriate robust variable selection procedure for the varying coefficient model, which is the focus of this paper.
In this paper, we aim to propose an efficient robust variable selection method for varying coefficient models. Motivated by the local rank inference (Wang et al. 2009), we start by developing a robust rank-based spline estimator. Under some mild conditions, we establish the asymptotic representation of the proposed estimator and further prove its asymptotic normality. We derive the formula of the asymptotic relative efficiency (ARE) of the rank-based spline estimator relative to the LS-based estimator, which has an expression that is closely related to that of the signed-rank Wilcoxon test in comparison with the t test. Further, we extend the application of the SCAD penalty to the rank-based spline estimator. Theoretical analysis reveals that our procedure is consistent in variable selection; that is, the probability that it correctly selects the true model tends to one. Also, we show that our procedure has the so-called oracle property; that is, the asymptotic distribution of an estimated coefficient function is the same as that when it is known a priori which variables are in the model. Simulation studies show that our procedure has better performance than KLASSO (Wang and Xia 2009) and LSSCAD (Wang et al. 2008) when the errors deviate from normality. Even in the most favorable case for KLASSO and LSSCAD, i.e., normal distribution, our procedure does not lose much, which coincides with our theoretical analysis.
This article is organized as follows. Section 2 presents the rank-based spline procedure for estimating the varying coefficient model, and some theoretical properties are provided. In Sect. 3, with the help of the rank-based spline procedure, we propose a new robust variable selection method and study its theoretical properties. Its numerical performance is investigated in Sect. 4. Several remarks draw the paper to its conclusion in Sect. 5. The technical details are provided in the “Appendix”. Some other simulation results are provided in another appendix, which is available online as supplementary material.
2 Methodology
To develop an efficient scheme for variable selection, we choose a polynomial spline smoothing method rather than a local polynomial smoother. The reason is that with the former the varying coefficient model can be reformulated as a traditional multiple regression model, which serves the variable selection purpose more naturally (Wang et al. 2008). In contrast, although local polynomial smoothers also work very well, they require more sophisticated approximations and techniques in the selection procedure and in the proofs of the oracle properties (Wang and Xia 2009). Therefore, in this section, we develop a rank-based spline method for estimating \(\varvec{\beta }(\cdot )\), which can be regarded as a parallel to the local rank estimator proposed by Wang et al. (2009).
2.1 The estimation procedure
Suppose that \(\{U_i,\varvec{X}_i,Y_i\}_{i=1}^n\) is a random sample from the model (1). Write \(\varvec{X}_i=(X_{i1},\ldots ,X_{ip})^T\) and \(\varvec{\beta }(U)=(\beta _1(U),\ldots ,\beta _p(U))^T\). Suppose that each \(\beta _l(U), l=1,\ldots ,p\), can be approximated by some spline functions, that is
where each \(\{B_{lk}(\cdot ),k=1,\ldots ,K_l\}\) is a basis for a linear space \(\mathbb {G}_l\) of spline functions with a fixed degree and knot sequence. In our applications we use the B-spline basis for its good numerical properties. Following (1) and (2), we have
Define \(\varvec{Y}=(Y_1,\ldots ,Y_n)^T, \mathbf{X}=(\varvec{X}_1,\ldots ,\varvec{X}_n)^T, \varvec{\gamma }_l=(\gamma _{l1},\ldots ,\gamma _{lK_l})^T, \varvec{\gamma }=(\varvec{\gamma }_1^T,\ldots ,\varvec{\gamma }_p^T)^T\),
\(\varvec{Z}_i=\varvec{X}_i^{T} \mathbf{B}(U_i)\), and \(\mathbf{Z}=(\varvec{Z}_1^{T},\ldots , \varvec{Z}_n^{T})^{T}\). Based on the above approximation, we obtain the residual at \(U_i\)
Motivated by the rank regression (Jaeckel 1972; Hettmansperger and McKean 1998), we define the rank-based spline objective (loss) function
An estimator of \(\beta _l(u)\) is obtained by \(\hat{\beta }_l(u)=\sum _{k=1}^{K_l}\hat{\gamma }_{lk}B_{lk}(u)\), where the \(\hat{\gamma }_{lk}\)’s are the minimizers of (3). We term it as rank-based spline estimator because the objective (loss) function is equivalent to the classic rank loss function in linear models based on Wilcoxon scores (Hettmansperger and McKean 1998).
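To make the construction concrete, the following Python sketch fits the rank-based spline estimator on simulated data with heavy-tailed errors. It exploits the identity \(\sum _{i<j}|e_i-e_j|=\sum _i (2R(e_i)-n-1)e_i\), where \(R(e_i)\) is the rank of \(e_i\), to evaluate the loss in \(O(n\log n)\) time, and minimizes it numerically from an LS start. The data-generating model, knot choice, and optimizer are illustrative assumptions, not the paper's implementation (which is in R).

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

def wilcoxon_loss(gamma, Z, y):
    """Rank-based spline loss sum_{i<j} |e_i - e_j|, computed via the
    equivalent linear-in-ranks form sum_i (2 R(e_i) - n - 1) e_i."""
    e = y - Z @ gamma
    n = len(e)
    ranks = np.argsort(np.argsort(e)) + 1
    return float(np.sum((2 * ranks - n - 1) * e))

# toy data: one covariate with coefficient beta(u) = sin(2*pi*u), t(3) errors
rng = np.random.default_rng(0)
n = 100
U = rng.uniform(0, 1, n)
X = rng.normal(size=n)
y = X * np.sin(2 * np.pi * U) + rng.standard_t(df=3, size=n)

# cubic B-spline basis with three equally spaced interior knots on [0, 1]
t = np.concatenate(([0.0] * 4, [0.25, 0.5, 0.75], [1.0] * 4))
B = BSpline.design_matrix(U, t, 3).toarray()   # n x K basis matrix B(U_i)
Z = X[:, None] * B                             # Z_i = X_i * B(U_i)

gamma_ls, *_ = np.linalg.lstsq(Z, y, rcond=None)          # LS start
fit = minimize(wilcoxon_loss, gamma_ls, args=(Z, y), method="Nelder-Mead",
               options={"maxiter": 20000, "fatol": 1e-8, "xatol": 1e-8})
beta_hat = B @ fit.x                           # fitted beta(U_i) values
```

The nonsmooth loss is handled here with a derivative-free optimizer for simplicity; Sect. 3.2 describes the weighted-least-squares approximation actually advocated for computation.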
2.2 Asymptotic properties
In this subsection, we establish the asymptotic properties of the rank-based spline estimator. The main challenge comes from the nonsmoothness of the objective function \(Q_n(\varvec{\gamma })\). To overcome this difficulty, we first derive an asymptotic representation of \(\hat{\varvec{\gamma }}\) via a quadratic approximation of \(Q_n(\varvec{\gamma })\), which holds uniformly in a local neighborhood of the true parameter values. Throughout this manuscript, we will use the following notation for ease of exposition. Let \(|\varvec{a}|\) denote the Euclidean norm of a real valued vector \(\varvec{a}\). For a real-valued function \(g,\,||g||_{\infty }=\sup _u |g(u)|\). For a vector-valued function \({\varvec{g}}=(g_1,\ldots ,g_p)^T\), denote \(||{\varvec{g}}||_{L_2}=\sum _{1\le l \le p} ||g_l||^2_{L_2}\) and \(||{\varvec{g}}||_{\infty }=\max _l||g_l||_{\infty }\). Define \(K_n=\max _{1\le l \le p} K_l, \rho _n=\max _{1\le l \le p} \inf _{{g} \in \mathbb {G}_l}||\beta _l-g||_{{\infty }}\). Let \({\varvec{g}}^*=(g^*_1,\ldots ,g^*_p) \in \mathbb {G}\) be such that \(||{\varvec{g}}^*-\varvec{\beta }||_{\infty }=\rho _n\), where \(\mathbb {G}=\mathbb {G}_1\times \cdots \times \mathbb {G}_p\) and \(\varvec{\beta }\) is the real varying-coefficient function. Then there exists \(\varvec{\gamma }_0\), such that \({\varvec{g}}^*=\mathbf{B}(u)\varvec{\gamma }_0\).
Define \(\theta _n=\sqrt{{K_n}/{n}}, \varvec{\gamma }^{*}=\theta _n^{-1}(\varvec{\gamma }-\varvec{\gamma }_0)\), and \(\varDelta _i=\varvec{X}_i^{T}\varvec{\beta }(U_i)-\varvec{Z}_i \varvec{\gamma }_0\). Let \(\widehat{\varvec{\gamma }^{*}}\) be the value of \(\varvec{\gamma }^{*}\) that minimizes the following reparametrized function
Then it can be easily seen that
We use \(\varvec{S}_n(\varvec{\gamma }^{*})\) to denote the gradient function of \(Q_n^{*}(\varvec{\gamma }^{*})\). More specifically,
where \({\mathrm{sgn}}(\cdot )\) is the sign function. Furthermore, we consider the following quadratic function of \(\varvec{\gamma }^{*}\)
where \(\tau =\int h^2(t) dt\) is the well-known Wilcoxon constant, and \(h(\cdot )\) is the density function of the random error \(\varepsilon \).
For the asymptotic analysis, we need the following regularity conditions.
-
(C1)
The distribution of \(U_i\) has a Lebesgue density \(f(u)\) which is bounded away from 0 and infinity.
-
(C2)
\(E(\varvec{X}_i(u)|u=U_i)=\varvec{0}\), and the eigenvalues \(\lambda _1(u) \le \cdots \le \lambda _p(u)\) of \(\mathbf{\Sigma }(u)=E[{\varvec{X}_i}(u){\varvec{X}_i}(u)^{T}]\) are bounded away from 0 and infinity uniformly; that is, there are positive constants \(W_1\) and \(W_2\) such that \(W_1 \le \lambda _1(u) \le \cdots \le \lambda _p(u) \le W_2\) for all \(u\).
-
(C3)
There is a positive constant \(M_1\) such that \(|X_{il}(u)|\le M_1\) for all \(u\) and \(l=1,\ldots ,p,\,i=1,\ldots , n\).
-
(C4)
\(\limsup _n (\max _l K_l/\min _l K_l)<\infty \).
-
(C5)
The error \(\varepsilon \) has finite Fisher information, i.e., \(\int [h^{'}(x)]^2/h(x)dx < \infty \).
Remark 1
Conditions (C1)–(C4) are the same as those in Huang et al. (2004). The assumption on the random errors in (C5) is a standard condition for rank analysis in multiple linear regression (Hettmansperger and McKean 1998). These conditions are mild and can be satisfied in many practical situations.
Lemma 1
Suppose Conditions (C1)–(C5) all hold, then for any \(\epsilon >0\) and \(c>0\),
Lemma 1 implies that the nonsmooth objective function \(Q_n^{*}(\varvec{\gamma }^{*})\) can be uniformly approximated by the quadratic function \(A_n(\varvec{\gamma }^{*})\) in a neighborhood of \(\varvec{0}\). It is also shown that the minimizer of \(A_n(\varvec{\gamma }^{*})\) lies asymptotically within an \(o_p(\sqrt{K_n})\) neighborhood of \(\widehat{\varvec{\gamma }^{*}}\); that is, \(|\widehat{\varvec{\gamma }^{*}}-(2 \tau )^{-1}(\theta _n^2\mathbf{Z}^{T}\mathbf{Z})^{-1}\varvec{S}_n(\varvec{0})|=o_p(\sqrt{K_n})\) (see “Appendix”). This further allows us to derive the asymptotic distribution.
Let \(\check{ \beta }_l(u)=E[\hat{ \beta }_l(u)\mid \fancyscript{X}]\) be the mean of \(\hat{ \beta }_l(u)\) conditioning on \(\fancyscript{X}=\{(\varvec{X}_i,U_i)\}_{i=1}^n\). It is useful to consider the decomposition \(\hat{\beta }_l(u)- \beta _l(u)=\hat{ \beta }_l(u)-\check{ \beta }_l(u)+\check{ \beta }_l(u)- \beta _l(u)\), where \(\hat{ \beta }_l(u)-\check{ \beta }_l(u)\) and \(\check{ \beta }_l(u)- \beta _l(u)\) contribute to the variance and bias terms, respectively. Denote \( \check{\varvec{\beta }}(u)=(\check{\beta }_1(u),\ldots ,\check{\beta }_p(u))\). The following two theorems establish the consistency and asymptotic normality of the rank-based spline estimator, respectively.
Theorem 1
Suppose Conditions (C1)–(C5) hold. If \(K_n \log K_n/n\rightarrow 0\), then \(||\hat{\varvec{\beta }}-\varvec{\beta }||^2_{L_2}=O_p(\rho _n^2+{K_n}/{n})\); consequently, if \(\rho _n\rightarrow 0\), then \(\hat{\beta }_l, l=1,\ldots ,p\), are consistent.
Theorem 2
Suppose Conditions (C1)–(C5) hold. If \(K_n \log K_n/n\rightarrow 0\), then
The above two theorems are parallel to those in Huang et al. (2004). Theorem 1 implies that the magnitude of the bias is bounded in probability by the best approximation rates by the spaces \(\mathbb {G}_l\). Theorem 2 provides the asymptotic normality and can thus be used to construct confidence intervals.
Next, we study the ARE of the rank-based spline estimator with respect to the polynomial spline estimator [denoted by \(\hat{\varvec{\beta }}_{P}(u)\)] for estimating \(\varvec{\beta }(u)\) in the varying coefficient model, denoted \(\text{ ARE }(\hat{\varvec{\beta }}(u),\hat{\varvec{\beta }}_{P}(u))\). Unlike the ARE study in Wang et al. (2009), in which the theoretical optimal bandwidth of local polynomial estimators is used, it is difficult to plug theoretical optimal \(K_l\)'s into \(\text{ ARE }(\hat{\varvec{\beta }}(u),\hat{\varvec{\beta }}_{P}(u))\) because closed-form optimal \(K_l\)'s are not available. Thus, in the following analysis, we consider a common choice of the smoothing parameters for both spline estimators \( \hat{\varvec{\beta }}(u)\) and \(\hat{\varvec{\beta }}_{P}(u)\).
According to Huang et al. (2004), we know that
where \(\sigma ^2\) is the variance of \(\varepsilon \). Now, we give the conditional variance of \(\hat{\varvec{\beta }}(u)\). From the proof of Theorem 1 in the Appendix, we have
where
and \(H(\cdot )\) denotes the distribution function of \(\varepsilon \); the second equality follows from the independence of \(\varepsilon \) and \(({\varvec{X}}, U)\). Thus,
It immediately follows from Theorem 2 that the ARE of \(\hat{\varvec{\beta }}(u)\) with respect to \(\hat{\varvec{\beta }}_{P}(u)\) is
Remark 2
This asymptotic relative efficiency is the same as that of the signed-rank Wilcoxon test with respect to the t test. It is well known in the literature of rank analysis that the ARE is as high as 0.955 for normal error distribution, and can be significantly higher than one for many heavier-tailed distributions. For instance, this quantity is 1.5 for the double exponential distribution, and 1.9 for the \(t\) distribution with three degrees of freedom. For symmetric error distributions with finite Fisher information, this asymptotic relative efficiency is known to have a lower bound equal to 0.864.
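The efficiency figures quoted in Remark 2 follow from the formula \(\text{ARE}=12\sigma ^2\tau ^2\) with \(\tau =\int h^2(t)dt\), and can be checked numerically. The short Python sketch below (scipy assumed available) evaluates the formula for the three error distributions discussed above.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def are_wilcoxon(dist):
    """ARE of the rank-based vs. the LS-based fit: 12 * sigma^2 * (int h^2)^2,
    for an error distribution with density h and variance sigma^2."""
    tau, _ = quad(lambda x: dist.pdf(x) ** 2, -np.inf, np.inf)
    return 12 * dist.var() * tau ** 2

print(round(are_wilcoxon(stats.norm()), 3))     # 0.955 = 3/pi (normal errors)
print(round(are_wilcoxon(stats.laplace()), 3))  # 1.5 (double exponential)
print(round(are_wilcoxon(stats.t(3)), 3))       # 1.9 (t with 3 df)
```

The lower bound 0.864 mentioned above is attained at a particular parabolic-shaped density (Hodges and Lehmann); any symmetric density with finite Fisher information yields at least that value.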
2.3 Automatic selection of smoothing parameters
Smoothing parameters, such as the degree of the splines and the number and locations of the knots, play an important role in nonparametric models. However, due to the computational complexity, automatically selecting all three smoothing parameters is difficult in practice. In this paper, we select only \(D=D_l\), the common number of knots for the \(\beta _l(\cdot )\)'s, using the data; the knots are equally spaced and the degree of the splines is fixed. We use “leave-one-out” cross-validation to choose \(D\). To be more specific, let \(\hat{\varvec{\beta }}^{(i)}(u)\) be the spline estimator obtained by deleting the \(i\)-th sample. The cross-validation procedure minimizes the target function
In practice, other criteria, such as GCV, fivefold CV, BIC and AIC, can also be used. Our simulation studies show that those procedures are also quite effective, and the variable selection results are hardly affected by the choice of selection procedure for \(D_l\). Moreover, in this paper we restrict our attention to splines of degree \(d=3\) (cubic splines), which works well for the applications we considered. It might be worthwhile to use the data to decide the knot positions (free-knot splines), which definitely merits future research. Also, one need not use the same number of knots and spline degree for every coefficient function, since different coefficient functions may have different features.
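The leave-one-out selection of \(D\) can be sketched as follows. For brevity this sketch uses a least-squares spline fit and a squared-error criterion in the inner loop; the paper's procedure would instead refit the rank-based estimator at each deletion. The candidate grid and the toy data are our illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def cv_number_of_knots(U, X, y, candidates=(1, 2, 3, 4), degree=3):
    """Choose the number D of equally spaced interior knots on [0, 1]
    by leave-one-out cross-validation (LS inner fit for simplicity)."""
    best_D, best_cv = candidates[0], np.inf
    for D in candidates:
        interior = np.linspace(0, 1, D + 2)[1:-1]
        t = np.concatenate(([0.0] * (degree + 1), interior,
                            [1.0] * (degree + 1)))
        Z = X[:, None] * BSpline.design_matrix(U, t, degree).toarray()
        cv = 0.0
        for i in range(len(y)):
            keep = np.arange(len(y)) != i          # delete the i-th sample
            g, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
            cv += (y[i] - Z[i] @ g) ** 2
        if cv < best_cv:
            best_D, best_cv = D, cv
    return best_D

rng = np.random.default_rng(2)
n = 80
U = rng.uniform(0, 1, n)
X = rng.normal(size=n)
y = X * np.sin(2 * np.pi * U) + 0.5 * rng.normal(size=n)
D_hat = cv_number_of_knots(U, X, y)
```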
3 Rank-based variable selection and estimation
In this section, in order to conduct variable selection for the varying coefficient model in a computationally efficient manner, we incorporate the SCAD penalty function into the objective function (3) to implement nonparametric estimation and variable selection simultaneously.
3.1 The SCAD-penalty method
Now, suppose that some variables are not relevant in the regression model, so that the corresponding coefficient functions are zero functions. Let \(\mathbf{R}_k=(r_{ij})_{K_k \times K_k}\) be a matrix with entries \(r_{ij}=\int B_{ki}(t)B_{kj}(t) dt\). Then, we define \(||\varvec{\gamma }_k||_{R_k}^2\equiv \varvec{\gamma }_k^{T} \mathbf{R}_k \varvec{\gamma }_k\). The penalized rank-based loss function is then defined as
where \(\lambda _n\) is the tuning parameter and \(p_{\lambda }(\cdot )\) is chosen as the SCAD penalty function of Fan and Li (2001), defined as
where \(a\) is another tuning parameter; we adopt \(a=3.7\) as suggested by Fan and Li (2001). This penalized loss function takes a form similar to that of Wang et al. (2008), except that the rank-based loss function is used instead of the LS-based one. An estimator of \(\beta _l(u)\) is obtained by \(\bar{\beta }_l(u)=\sum _{k=1}^{K_l} \bar{\gamma }_{lk}B_{lk}(u)\), where the \(\bar{\gamma }_{lk}\) are the minimizers of (4). In practice, one can also use the adaptive LASSO penalty in place of SCAD in (4), and we expect that the resulting procedure would have similar asymptotic properties and comparable finite-sample performance (Zou 2006).
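The SCAD penalty and its first derivative (the quantity \(p_{\lambda }^{'}\) entering the local quadratic approximation of Sect. 3.2) are straightforward to code. The following Python sketch implements the standard three-branch form of Fan and Li (2001); the vectorization is our convenience.

```python
import numpy as np

def scad(u, lam, a=3.7):
    """SCAD penalty p_lam(u) of Fan and Li (2001), for u >= 0: linear on
    [0, lam], quadratic on (lam, a*lam], constant (a+1)*lam^2/2 beyond."""
    u = np.asarray(u, dtype=float)
    return np.where(u <= lam,
                    lam * u,
                    np.where(u <= a * lam,
                             (2 * a * lam * u - u ** 2 - lam ** 2)
                             / (2 * (a - 1)),
                             (a + 1) * lam ** 2 / 2))

def scad_deriv(u, lam, a=3.7):
    """Derivative p'_lam(u): equals lam on [0, lam], decays linearly to 0
    at a*lam, and vanishes thereafter (no penalty on large coefficients)."""
    u = np.asarray(u, dtype=float)
    return np.where(u <= lam, lam, np.maximum(a * lam - u, 0.0) / (a - 1))
```

The flat tail of the penalty is what gives SCAD its unbiasedness for large coefficients, in contrast to the LASSO's constant shrinkage.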
3.2 Computational algorithm
Because of the nondifferentiability of the penalized loss (4), the commonly used gradient-based optimization methods are not applicable here. In this section we develop an iterative algorithm using local quadratic approximations of the rank-based objective function \(\sum _{i<j}|e_i-e_j|\) and the nonconvex penalty function \(p_{\lambda _n}(||\varvec{\gamma }_k||_{R_k})\). Let \(R(e_i)\) denote the rank of \(e_i\) among \(\{e_i\}_{i=1}^n\). Following Sievers and Abebe (2004), the objective function is approximated by
where \(\zeta \) is the median of \(\{e_i\}_{i=1}^n\) and
Moreover, following Fan and Li (2001), in the neighborhood of a given positive \(u_0 \in R^{+}\),
Then, given an initial value, \(\varvec{\gamma }_k^{(0)}\), with \(||\varvec{\gamma }_k^{(0)}||_{R_k}>0\), the corresponding weights \(w_i^{(0)}\) and the median of residuals, \(\zeta ^{(0)}\), can be obtained. Consequently, the penalized loss function (4) can be approximated by a quadratic form
Then, after removing an irrelevant constant, the above quadratic form becomes
where \(\varvec{S}^{(0)}=\varvec{Y}-\zeta ^{(0)}\), and \(\mathbf{W}^{(0)}\) and \(\varvec{\Omega }_{\lambda _n}(\varvec{\gamma }^{(0)})\) are diagonal weight matrices with \(w_i\), and \(p_{\lambda _n}^{'}(||\varvec{\gamma }_k^{(0)}||_{R_k})/||\varvec{\gamma }_k^{(0)}||_{R_k} \mathbf{R}_k\) on the diagonals, respectively. This is a quadratic form with a minimizer satisfying
The foregoing discussion leads to the following algorithm:
- Step 1::
-
Initialize \(\varvec{\gamma }=\varvec{\gamma }^{(0)}\).
- Step 2::
-
Given \(\varvec{\gamma }^{(m)}\), update \(\varvec{\gamma }\) to \(\varvec{\gamma }^{(m+1)}\) by solving (6), where \(\varvec{\gamma }^{(0)}\) and the \(\varvec{\gamma }^{(0)}\) in \(\mathbf{W}^{(0)}, \varvec{S}^{(0)}\) and \(\varvec{\Omega }_{\lambda _n}(\varvec{\gamma }^{(0)})\) are all set to be \(\varvec{\gamma }^{(m)}\).
- Step 3::
-
Iterate Step 2 until convergence of \(\varvec{\gamma }\) is achieved.
Due to the use of the nonconvex SCAD penalty, the global minimizer cannot be attained in general and only local minimizers can be obtained (Fan and Li 2001); all penalized methods based on nonconvex penalties share this drawback. Thus, a suitable initial value is usually required for fast convergence. The initial estimator of \(\varvec{\gamma }\) in Step 1 can be chosen as the unpenalized estimator, which can be computed by fitting an \(L_1\) regression on the \({n(n-1)}/{2}\) pseudo-observations \(\{(\varvec{Z}_i-\varvec{Z}_j, Y_i-Y_j)\}_{i<j}\). In our numerical studies, we use the function rq in the R package quantreg. In our numerical experience, the algorithm converges quickly when started from the unpenalized estimator, and the resulting solution is reasonably good, as demonstrated in our simulation study.
Note that the iterative algorithm can become unstable when the weights in (5) are too large. As suggested by Sievers and Abebe (2004), the algorithm should be modified so that it removes residuals with very large weights from the iteration and reinstates them in subsequent iterations when their contribution to the sum \(\sum _{i=1}^n w_i(e_i-\zeta )^2\) becomes significant. Such an algorithm is quite efficient and reliable in practice. The R code for implementing the proposed scheme is available from the authors upon request. It is worth noting that we iteratively approximate both the original target function and the penalty function. Our numerical experience shows that the algorithm is usually completed in fewer than ten iterations and never fails to converge. For example, it takes less than 1 s per iteration in R on an Intel Core 2.2 GHz CPU for a case with \(n=200, p=7\), and the entire procedure is generally completed in less than 10 s. Theoretical investigation of the convergence properties of the proposed algorithm deserves future research.
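The steps above can be sketched end-to-end in Python. The weights follow Sievers and Abebe (2004): with \(\zeta \) the median residual, \(\sum _{i<j}|e_i-e_j|=\sum _i w_i(e_i-\zeta )^2\) exactly, where \(w_i=(2R(e_i)-n-1)/(e_i-\zeta )\ge 0\). Caveats: the constants in the linear system are one consistent reading of the quadratic form of this section (equation (6) itself is not reproduced here), \(\mathbf{R}_k\) is taken as the identity, the SCAD derivative is inlined, and the start is an LS fit rather than the pairwise \(L_1\) fit; this is a sketch, not the authors' R implementation.

```python
import numpy as np

def rank_weights(e, trim=True):
    """Sievers-Abebe weights w_i = (2 R(e_i) - n - 1) / (e_i - median(e));
    sum_i w_i (e_i - zeta)^2 reproduces sum_{i<j} |e_i - e_j| exactly."""
    n = len(e)
    zeta = np.median(e)
    r = np.argsort(np.argsort(e)) + 1
    d = e - zeta
    w = np.zeros(n)
    ok = np.abs(d) > 1e-10                 # residuals at the median: weight 0
    w[ok] = (2 * r[ok] - n - 1) / d[ok]
    if trim:                               # guard against exploding weights
        w[w > 20 * np.median(w[ok])] = 0.0
    return w, zeta

def scad_deriv(u, lam, a=3.7):
    return lam if u <= lam else max(a * lam - u, 0.0) / (a - 1)

def rsscad(Z, y, groups, lam, a=3.7, n_iter=100, tol=1e-8):
    """Iterative local-quadratic-approximation fit of the penalized rank loss."""
    groups = np.asarray(groups)
    n, q = Z.shape
    gamma = np.linalg.lstsq(Z, y, rcond=None)[0]   # simple start (LS)
    for _ in range(n_iter):
        w, zeta = rank_weights(y - Z @ gamma)
        Omega = np.zeros(q)                # diagonal of the LQA penalty matrix
        for g in np.unique(groups):
            idx = groups == g
            norm = np.linalg.norm(gamma[idx])
            if norm > 1e-8:
                Omega[idx] = scad_deriv(norm, lam, a) / norm
        lhs = 2 * (Z.T * w) @ Z + n * np.diag(Omega)
        new = np.linalg.solve(lhs, 2 * (Z.T * w) @ (y - zeta))
        if np.linalg.norm(new - gamma) < tol:
            gamma = new
            break
        gamma = new
    gamma[np.abs(gamma) < 1e-4] = 0.0      # threshold tiny coefficients
    return gamma
```

The linear model is recovered as the special case \(K_l=1, B_{l1}\equiv 1\), which is convenient for quick checks of the iteration.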
3.3 Asymptotic properties
Without loss of generality, let \(\beta _k(u), k=1,\ldots , s\), be the nonzero coefficient functions and let \(\beta _k(u)\equiv 0\), for \(k=s+1,\ldots , p\).
Theorem 3
Suppose Conditions (C1)–(C5) hold. If \(K_n \log K_n/n\rightarrow 0, \rho _n\rightarrow 0, \lambda _n \rightarrow 0\), and \(\lambda _n/\max \left\{ \sqrt{{K_n}/{n}},\rho _n\right\} \rightarrow \infty \), we have the following:
-
(i)
\(\bar{\beta }_k=0,\,k=s+1,\ldots , p\), with probability approaching 1.
-
(ii)
\(||\bar{\beta }_k-\beta _k||_{L_2}=O_p\left( \max \left\{ \sqrt{\frac{K_n}{n}}, \rho _n\right\} \right) , k=1,\ldots ,s\).
Part (i) of Theorem 3 says that the proposed penalized rank-based method is consistent in variable selection; that is, it can identify the zero coefficient functions with probability tending to one. The second part provides the rate of convergence in estimating the nonzero coefficient functions.
Now we consider the asymptotic variance of the proposed estimate. Let \(\varvec{\beta }^{(1)}=(\beta _1,\ldots ,\beta _s)^{T}\) denote the vector of nonzero coefficient functions, and let \( \bar{\varvec{\beta }}^{(1)}=(\bar{\beta }_1,\ldots ,\bar{\beta }_s)^{T}\) denote its estimate obtained by minimizing (4). Let \(\bar{\varvec{\gamma }}^{(1)}=(\bar{\varvec{\gamma }}_1^{T}, \ldots , \bar{\varvec{\gamma }}_s^{T})^T\) and \(\mathbf{Z}^{(1)}\) denote the selected columns of \(\mathbf{Z}\) corresponding to \(\varvec{\beta }^{(1)}\). By using Lemma 1 and the quadratic approximation stated in the above subsection, we obtain another approximated loss function
Similarly, let \(\varvec{\Omega }_{\lambda }^{(1)}\) denote the selected diagonal blocks of \(\varvec{\Omega }_{\lambda }\), and \(\varvec{S}^{(1)}_n(\varvec{0})\) denote the selected entries corresponding to \(\varvec{\beta }^{(1)}\). Thus, the minimizer of (7) yields
Denote \(\mathbf{H}^{(1)}=2 \tau (\mathbf{Z}^{(1)})^{T}\mathbf{Z}^{(1)}+\frac{n}{2}\varvec{\Omega }^{(1)}_{\lambda _n}\bar{\varvec{\gamma }}^{(1)}\), so the asymptotic variance of \(\bar{\varvec{\gamma }}^{(1)}\) is
Since \(\bar{\varvec{\beta }}^{(1)}=({\mathbf{B}^{(1)}})^{T}\bar{\varvec{\gamma }}^{(1)}\), where \(\mathbf{B}^{(1)}\) is the first \(s\) rows of \(\mathbf{B}(u)\), we have \({\mathrm{avar}}(\bar{\varvec{\beta }}^{(1)})=({\mathbf{B}^{(1)}})^{T}{\mathrm{avar}}(\bar{\varvec{\gamma }}^{(1)}) \mathbf{B}^{(1)}\). Let \({\mathrm{var}}^{*}(\bar{\varvec{\beta }}(u))\) denote a modification of \({\mathrm{avar}}(\bar{\varvec{\beta }}^{(1)})\) by replacing \(\varvec{\Omega }^{(1)}_{\lambda _n}\) with \(0\), that is
Accordingly, the diagonal elements of \({\mathrm{var}}^{*}(\bar{\varvec{\beta }}(u))\) can be employed as the asymptotic variances of \(\bar{ \beta }_k(u)\)’s, i.e., \({\mathrm{avar}}(\bar{ \beta }_k(u)),k=1,\ldots ,s\).
Theorem 4
Suppose Conditions (C1)–(C5) hold. \( K_n \log K_n/n\rightarrow 0, \rho _n\rightarrow 0, \lambda _n \rightarrow 0\), and \(\lambda _n/\max \left\{ \sqrt{{K_n}/{n}},\rho _n\right\} \rightarrow \infty \). Then, as \(n\rightarrow \infty \),
where \(\breve{\varvec{\beta }}(u)=E[\bar{\varvec{\beta }}(u)\mid \fancyscript{X}]\) and in particular,
Here \({\mathrm{var}}^{*}(\bar{\varvec{\beta }}(u))\) is exactly the asymptotic variance of the nonpenalized rank-based estimator that uses only the covariates corresponding to nonzero coefficient functions (see Sect. 2). Theorem 4 thus implies that our penalized rank-based estimator has the oracle property: the asymptotic distribution of an estimated coefficient function is the same as that obtained when it is known a priori which variables are in the model.
3.4 Selection of tuning parameters
The tuning parameter \(\lambda \) controls the model complexity and plays a critical role in the above procedure. It is desirable to select \(\lambda \) automatically by a data-driven method. Motivated by the Wilcoxon-type generalized BIC of Wang (2009) in which the multiple linear regression model is considered, we propose to select \(\lambda \) by minimizing
where \(\bar{\varvec{\gamma }}_{\lambda }\) is the penalized rank-based spline estimator with tuning parameter \(\lambda \), \(df_{\lambda }\) is the number of nonzero components in \(\bar{\varvec{\beta }}_{\lambda }=\mathbf{B} \bar{\varvec{\gamma }}_{\lambda }\), and \(\hat{\tau }\) is an estimate of the Wilcoxon constant \(\tau \). The constant \(\tau \) can be robustly estimated by the approach given in Hettmansperger and McKean (1998) and easily calculated, from the unpenalized estimates, by the function wilcoxontau in the R package (Terpstra and McKean 2005). We refer to this approach as the BIC-selector and denote the selected \(\lambda \) by \(\hat{\lambda }_{BIC}\). Similar to the BIC in Wang (2009), the first term in (8) can be viewed as an “artificial” likelihood, as it shares some essential properties of a parametric log-likelihood. Note that, due to the use of spline smoothing, the effective sample size is \(n/K_n\) rather than the original sample size \(n\): classic parametric estimation methods are \(\sqrt{n}\)-consistent, while the convergence rate of spline methods is \(\sqrt{n/K_n}\). That is, for each \(u\), the spline estimator behaves like a parametric estimator computed as if a sample of size \(n/K_n\) from model (1) with coefficient \(\varvec{\beta }(u)\) were available. Therefore, the \(\text{ BIC }_{\lambda }\) in (8) replaces \(\log (n)/n\) in Wang's (2009) Wilcoxon-type generalized BIC by \(\log (n/K_n)/(n/K_n)\). As can be seen from the proof of Theorem 5, the BIC cannot achieve selection consistency without this modification.
Let \(S_T\) denote the true model, \(S_F\) the full model, and \(S_{\lambda }\) the set of indices of the covariates selected by our robust variable selection method with tuning parameter \(\lambda \). For a given candidate model \(S\), let \(\varvec{\beta }_S\) be a vector of parameters whose \(i\)th coordinate is set to zero if \(i \not \in S\). Further, define \(L_n^S=n^{-2}\sum _{i<j}|(Y_i-\varvec{X}_i^{T}\hat{\varvec{\beta }}_S)-(Y_j-\varvec{X}_j^{T}\hat{\varvec{\beta }}_S)|\), where \(\hat{\varvec{\beta }}_S\) is the unpenalized robust estimator, i.e., the rank-based spline estimator for model \(S\). We make the same assumptions as Wang and Li (2009):
-
(1)
for any \(S \subset S_F,\,L_n^S\mathop {\rightarrow }\limits ^{p}L^S\) for some \(L^S>0\);
-
(2)
for any \(S \not \supseteq S_T\), we have \(L^S>L^{S_T}\).
The next theorem indicates that \(\hat{\lambda }_{BIC}\) leads to a penalized rank-based estimator which consistently yields the true model.
Theorem 5
Suppose the assumptions above and Conditions (C1)–(C5) hold, then we have
4 Numerical studies
4.1 Simulation
We study the finite-sample performance of the proposed rank-based spline SCAD (RSSCAD hereafter) method in this section. Wang and Xia (2009) showed that KLASSO is an efficient procedure in finite samples, and Wang et al. (2008) also proposed an efficient procedure based on least squares and SCAD (LSSCAD hereafter). Thus, KLASSO and LSSCAD serve as two natural benchmarks in our comparison. For a clear comparison, we adopt the settings used in Wang and Xia (2009) for the following two models:
where for the first model, \(X_{i1}=1\) and \((X_{i2},\ldots ,X_{i7})^T\) are generated from a multivariate normal distribution with \({\mathrm{cov}}(X_{ij_1},X_{ij_2})=\rho ^{|j_1-j_2|}\) for any \(2 \le j_1,j_2 \le 7\), while for the second model, \(X_{i1},\ldots ,X_{i7}\) are generated from a multivariate normal distribution with \({\mathrm{cov}}(X_{ij_1},X_{ij_2})=\rho ^{|j_1-j_2|}\) for any \(1 \le j_1,j_2 \le 7\). Three cases of correlation between the covariates are considered: \(\rho =0.3, 0.5\), and \(0.8\). The index variable is simulated from \(\text{ Uniform }(0,1)\). The value of \(\theta \) is fixed at 1.5. The following model, similar to the one used in Wang et al. (2008), is also included in the comparison:
where
and the remaining coefficients vanish. The index variable is again simulated from \(\text{ Uniform }(0,1)\). In this model, \(\varvec{X}\) depends on \(U\) in the following way. The first three variables are the true relevant covariates: \(X_{i1}\) is sampled uniformly from \([3U_i,2+3U_i]\) at any given index \(U_i\); \(X_{i2}\), conditional on \(X_{i1}\), is Gaussian with mean \(0\) and variance \((1+X_{i1})/(2+X_{i1})\); and \(X_{i3}\), independent of \(X_{i1}\) and \(X_{i2}\), is a Bernoulli random variable with success rate \(0.6\). The other, irrelevant variables are generated from a multivariate normal distribution with \({\mathrm{cov}}(X_{ij_1},X_{ij_2})=4\exp (-|j_1-j_2|)\) for any \(4\le j_1,j_2\le 23\). The parameter \(\varsigma \), which controls the model’s signal-to-noise ratio, is set to 5. For all three models, four error distributions are considered: \(N(0,1)\), \(t(3)\) (Student’s t-distribution with three degrees of freedom), Tukey contaminated normal \(T(0.10; 5)\) (with cumulative distribution function \(F(x)=0.9\varPhi (x)+0.1\varPhi (x/5)\), where \(\varPhi (x)\) is the distribution function of a standard normal distribution), and lognormal. In addition, an outlier case is considered, in which the responses of 10 % of the generated samples are shifted by a constant \(c\). We use \(c=5\) for the first two models and \(c=25\) for the third.
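The covariate design of model (III) described above can be sketched as follows. The model equation, coefficient functions, and error generation are not reproduced here and are therefore omitted; all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(2023)
n, p = 200, 23

U = rng.uniform(0.0, 1.0, n)                       # index variable ~ Uniform(0,1)
X = np.empty((n, p))
X[:, 0] = rng.uniform(3 * U, 2 + 3 * U)            # X1 | U ~ Uniform[3U, 2+3U]
X[:, 1] = rng.normal(0.0, np.sqrt((1 + X[:, 0]) / (2 + X[:, 0])))  # X2 | X1
X[:, 2] = rng.binomial(1, 0.6, n)                  # X3 ~ Bernoulli(0.6)

# irrelevant covariates X4..X23: cov(X_{j1}, X_{j2}) = 4 exp(-|j1 - j2|)
idx = np.arange(4, 24)
Sigma = 4.0 * np.exp(-np.abs(idx[:, None] - idx[None, :]))
X[:, 3:] = rng.multivariate_normal(np.zeros(20), Sigma, size=n)
```

The exponential-decay covariance makes the irrelevant block strongly dependent nearby but nearly independent far apart, a common stress test for selection procedures.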
Throughout this section we use B-splines and 1,000 replications for each example. For each simulated dataset, we first fit an unpenalized varying coefficient estimate \(\hat{\varvec{\beta }}(U_i)\), for which the number of knots, \(D\), is selected via the method in Sect. 2.3. The same \(D\) is then used for RSSCAD, where the tuning parameter \(\lambda \) in the penalty function is chosen by the BIC (8). We report the average number of correct 0’s (true zero coefficients correctly estimated to be zero) and the average number of incorrect 0’s (nonzero coefficients incorrectly estimated to be zero). Moreover, we report the proportions of under-fitted models (at least one nonzero coefficient incorrectly estimated to be zero), correctly fitted models (all coefficients selected correctly), and over-fitted models (all nonzero coefficients selected, but at least one zero coefficient incorrectly estimated to be nonzero). In addition, estimation accuracy is assessed via the following two estimation errors, defined by
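A minimal sketch of the B-spline machinery underlying the fit, assuming cubic splines with equally spaced interior knots on \([0,1]\) and the usual varying coefficient design in which each covariate column is multiplied by the basis evaluated at \(U_i\); the exact layout and knot placement in the paper's implementation may differ.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, n_knots, degree=3):
    """Evaluate a clamped B-spline basis with n_knots interior knots on [0, 1]."""
    interior = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
    # design_matrix returns the sparse basis matrix evaluated at u
    return BSpline.design_matrix(u, knots, degree).toarray()

def vc_design(X, U, n_knots):
    """Spline design matrix Z: block l is X[:, l] times the basis at U."""
    B = bspline_basis(U, n_knots)
    return np.hstack([X[:, [l]] * B for l in range(X.shape[1])])
```

Fitting then reduces to estimating the spline coefficients \(\varvec{\gamma }\) in the loss evaluated on \(\mathbf{Z}\); the clamped basis satisfies the partition-of-unity property (rows of \(B\) sum to one).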
where \(\hat{\varvec{\beta }}_{aj}(\cdot )\) is an estimator of the true coefficient function \(\varvec{\beta }_{0j}(\cdot )\). The means (denoted MEE1 and MEE2) and standard deviations (in parentheses) of the EE1 and EE2 values are summarized. It is worth noting that, because KLASSO and RSSCAD use different smoothing approaches, we choose not to tabulate their MEE results to avoid misleading conclusions. We also include two unpenalized methods in the comparison, namely the rank-based spline estimator (RS) and the least-squares spline estimator (LSS).
We summarize the simulation results for models (I)–(II) with \(\rho =0.5\) and model (III) in Tables 1, 2, 3, 4, 5, and 6, respectively. The results for models (I)–(II) with \(\rho =0.3\) or 0.8 are presented in Tables A.1–A.8 of the supplemental file. A few observations can be made from Tables 1, 2, 3, 4, 5, and 6. First, the proposed RSSCAD method is highly efficient for all the distributions under consideration. In terms of the probability of selecting the true model, RSSCAD performs slightly worse than KLASSO and LSSCAD when the random error is normal, as expected, but performs significantly better than both when the error distribution is nonnormal. For instance, when the errors come from the contaminated normal distribution, KLASSO hardly ever selects the true model even for \(n=400\), whereas RSSCAD selects the true model with quite large probability. For the third model, in which the covariate \(\varvec{X}\) depends on the index \(U\), RSSCAD is still quite effective in selecting the true variables. Also, from Tables 1, 3, and 5, we can see that the proposed smoothing parameter selection method and the BIC (8) perform satisfactorily and conform to the asymptotic results of Sects. 3.3 and 3.4.
It is well documented in the literature that BIC tends to identify the true sparse model well but can result in some under-fitting when the sample size is not sufficiently large (as in the cases with \(n=200\)). As shown in Theorem 5, the BIC is still consistent for variable selection in the present problem. When the sample size is larger (such as \(n=400\)), our method selects the correctly fitted model with quite large probability, at least \(0.9\). From Tables 2, 4, and 6, we observe that the MEEs of the penalized methods are smaller than those of the corresponding unpenalized methods in all cases, which means that the variable selection procedure evidently increases the efficiency of the estimators. Furthermore, the rank-based methods (RS and RSSCAD) perform better than the corresponding least-squares methods (LSS and LSSCAD) when the error deviates from the normal distribution. Even in the normal case, the MEEs of the rank-based methods are only slightly larger than those of the least-squares methods. This again reflects the robustness of our rank-based method to distributional assumptions. Moreover, when the correlation between the covariates increases (decreases), all three penalized methods become worse (better), but the comparative conclusions are similar to those for \(\rho =0.5\) (see Tables A.1–A.8 in the supplemental file). We also examined other error variance magnitudes for these models, and the conclusions are similar.
To examine how well the method estimates the coefficient functions, Fig. 1 shows the estimates of \(\beta _1(\cdot )\) and \(\beta _2(\cdot )\) for model (I) with normal and lognormal errors when \(\rho =0.5\) and \(n=200\). The estimates fit the true functions well on average. The lower and upper confidence bands differ markedly from the true curve at the right boundary, especially for \(\beta _2(\cdot )\); this may be caused by the lack of data in that region. The curves for the other error distributions, which give similar pictures of the estimated functions, are shown in Figure A.1 of the supplemental file.
4.2 The Boston housing data
To further illustrate the usefulness of RSSCAD, we consider the Boston housing data, which were analyzed by Wang and Xia (2009) and are publicly available in the R package mlbench (http://cran.r-project.org/). Following Wang and Xia (2009), we take MEDV [median value of owner-occupied homes in 1,000 United States dollars (USD)] as the response, LSTAT (the percentage of lower-status population) as the index variable, and the following predictors as the \(X\)-variables: CRIM (per capita crime rate by town), RM (average number of rooms per dwelling), PTRATIO (pupil–teacher ratio by town), NOX (nitric oxides concentration in parts per 10 million), TAX (full-value property-tax rate per 10,000 USD), and AGE (proportion of owner-occupied units built prior to 1940). Figure A.2 in the supplemental file shows the normal Q–Q plot of residuals obtained from a standard local linear unpenalized varying coefficient estimation (Fan and Zhang 2008); it clearly indicates that the errors are not normal. In Wang and Xia (2009), the variables are first transformed so that their marginal distributions are approximately \(N(0, 1)\). In our analysis, we skip this transformation step, since RSSCAD is designed with robustness in mind. As in Wang and Xia (2009), the index variable, LSTAT, is transformed so that its marginal distribution is \(U[0, 1]\).
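The transformation of the index variable to a \(U[0,1]\) marginal can be carried out with a simple rank transform. The paper does not spell out the exact recipe, so the sketch below (with made-up LSTAT values) shows one common choice, assuming no ties.

```python
import numpy as np

def to_uniform_index(v):
    """Rank-transform a covariate so its marginal distribution is ~ U[0, 1]."""
    v = np.asarray(v, dtype=float)
    ranks = v.argsort().argsort() + 1          # ranks 1..n (no ties assumed)
    return ranks / (len(v) + 1.0)              # map ranks into (0, 1)

lstat = np.array([4.98, 9.14, 4.03, 2.94, 5.33])   # made-up LSTAT values
u = to_uniform_index(lstat)
```

Dividing by \(n+1\) rather than \(n\) keeps the transformed values strictly inside \((0,1)\), which is convenient for spline bases defined on the open interval.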
A standard leave-one-out cross-validation suggests an optimal number of knots \(D=5\). The RSSCAD method is then applied to the data with this number of knots. The optimal shrinkage parameter selected by the BIC criterion (8) is \(\hat{\lambda }=0.0284\). The resulting RSSCAD estimate suggests that NOX, RM, and PTRATIO are relevant variables, whereas CRIM, TAX, and AGE do not appear significant for predicting MEDV. To confirm that the selected variables (NOX, RM, and PTRATIO) are truly relevant, Fig. 2a–c presents their RSSCAD estimates with 95 % confidence bands. All three suggest that these coefficients are unlikely to be constant zero, because none of them is close to 0. The unpenalized estimates of the eliminated variables, CRIM, TAX, and AGE, are shown in Fig. 2d–f; they stay close to zero over the entire range of the index variable LSTAT. Thus, Fig. 2 further confirms that the variables eliminated by RSSCAD are unlikely to be relevant. In contrast, without transforming the data, the KLASSO estimate suggests that all variables except AGE are relevant. Taking its efficiency, convenience, and robustness into account, the proposed RSSCAD is therefore a reasonable alternative for variable selection in the varying coefficient model.
5 Discussion
It is of interest to extend the proposed methodology to other, more complex models, such as varying coefficient partially linear models (Li and Liang 2008; Zhao and Xue 2009); in fact, this amounts to adding further penalty terms to the rank-based loss function. It is also of great interest to see whether RSSCAD and its oracle property remain valid in high-dimensional settings in which \(p\) diverges and may even exceed the sample size \(n\). The consistency of the BIC criterion proposed in Sect. 3.4 deserves further study as well. Furthermore, our rank-based spline estimator can also handle the case where the distribution of the error term, like the coefficients, varies with the index. For example, we can consider the varying coefficient model \(Y=\varvec{X}^T(U)\varvec{\beta }(U)+\sigma (U)\varepsilon \), where \(\sigma (U)\) is a smooth function and the random error \(\varepsilon \) is independent of \(\varvec{X}\) and \(U\). With certain modifications of the proof and conditions, we are able to establish the consistency of the rank-based methods.
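To make the heteroscedastic extension concrete, the sketch below simulates from \(Y=\varvec{X}^T\varvec{\beta }(U)+\sigma (U)\varepsilon \) with illustrative coefficient and scale functions of our own choosing (the paper does not specify them):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 300, 3

U = rng.uniform(0.0, 1.0, n)
X = rng.standard_normal((n, p))

# illustrative smooth coefficient functions beta(U) and scale sigma(U)
beta = np.column_stack([np.sin(2 * np.pi * U), U**2, np.full(n, 1.5)])
sigma = 0.5 + U                      # sigma(U): smooth and strictly positive
eps = rng.standard_normal(n)         # error independent of X and U

Y = np.sum(X * beta, axis=1) + sigma * eps
```

The error scale grows with the index here, so observations with large \(U\) are noisier; a rank-based fit down-weights exactly the kind of extreme residuals such regions produce.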
References
Chiang CT, Rice JA, Wu CO (2001) Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables. J Am Stat Assoc 96:605–619
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Zhang W (1999) Statistical estimation in varying-coefficient models. Ann Stat 27:1491–1518
Fan J, Zhang W (2008) Statistical methods with varying coefficient models. Stat Interface 1:179–195
Hastie TJ, Tibshirani RJ (1993) Varying-coefficient models. J R Stat Soc Ser B 55:757–796 (with discussion)
Hettmansperger TP, McKean JW (1998) Robust nonparametric statistical methods. Arnold, London
Hoover DR, Rice JA, Wu CO, Yang L-P (1998) Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika 85:809–822
Huang JZ, Shen H (2004) Functional coefficient regression models for nonlinear time series: a polynomial spline approach. Scand J Stat 31:515–534
Huang JZ, Wu CO, Zhou L (2002) Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika 89:111–128
Huang JZ, Wu CO, Zhou L (2004) Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Stat Sinica 14:763–788
Jaeckel LA (1972) Estimating regression coefficients by minimizing the dispersion of residuals. Ann Math Stat 43:1449–1458
Kauermann G, Tutz G (1999) On model diagnostics using varying coefficient models. Biometrika 86:119–128
Koul HL, Sievers GL, McKean JW (1987) An estimator of the scale parameter for the rank analysis of linear models under general score functions. Scand J Stat 14:131–141
Kim M-O (2007) Quantile regression with varying coefficients. Ann Stat 35:92–108
Leng C (2009) A simple approach for varying-coefficient model selection. J Stat Plan Inference 139:2138–2146
Leng C (2010) Variable selection and coefficient estimation via regularized rank regression. Stat Sinica 20:167–181
Li R, Liang H (2008) Variable selection in semi-parametric regression model. Ann Stat 36:261–286
Lin Y, Zhang HH (2006) Component selection and smoothing in smoothing spline analysis of variance models. Ann Stat 34:2272–2297
Sievers G, Abebe A (2004) Rank estimation of regression coefficients using iterated reweighted least squares. J Stat Comput Simul 74:821–831
Tang Y, Wang H, Zhu Z, Song X (2012) A unified variable selection approach for varying coefficient models. Stat Sinica 22:601–628
Terpstra J, McKean J (2005) Rank-based analysis of linear models using R. J Stat Softw 14:1–26
Tibshirani RJ (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58:267–288
Wang L, Chen G, Li H (2007) Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics 23:1486–1494
Wang L, Kai B, Li R (2009) Local rank inference for varying coefficient models. J Am Stat Assoc 104:1631–1645
Wang L, Li H, Huang J (2008) Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J Am Stat Assoc 103:1556–1569
Wang L (2009) Wilcoxon-type generalized Bayesian information criterion. Biometrika 96:163–173
Wang L, Li R (2009) Weighted Wilcoxon-type smoothly clipped absolute deviation method. Biometrics 65:564–571
Wang H, Li G, Jiang G (2007) Robust regression shrinkage and consistent variable selection via the LAD-LASSO. J Bus Econ Stat 25:347–355
Wang H, Xia Y (2009) Shrinkage estimation of the varying coefficient model. J Am Stat Assoc 104:747–757
Wu CO, Chiang C-T, Hoover DR (1998) Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data. J Am Stat Assoc 93:1388–1402
Zhao P, Xue L (2009) Variable selection for semiparametric varying coefficient partially linear models. Stat Probab Lett 79:2148–2157
Zou H (2006) The adaptive LASSO and its oracle properties. J Am Stat Assoc 101:1418–1429
Zou H, Yuan M (2008) Composite quantile regression and the oracle model selection theory. Ann Stat 36:1108–1126
Acknowledgments
The authors thank the editor and two anonymous referees for their many helpful comments that have resulted in significant improvements in the article. This research was supported by the NNSF of China Grants Nos. 11131002, 11101306, 11371202, 71202087, 11271169, the RFDP of China Grant No. 20110031110002, Foundation for the Author of National Excellent Doctoral Dissertation of PR China, New Century Excellent Talents in University NCET-12-0276 and PAPD of Jiangsu Higher Education Institutions.
Appendix: Proofs of Theorems
In order to prove the theorems, we first state a few necessary lemmas. Throughout this appendix, \(M_i, i=1,\ldots ,11\), are positive constants independent of the sample.
Lemma 2
Suppose Conditions (C1)–(C5) all hold and \(\rho _n\rightarrow 0\), then
where \(\mathbf{1}_K\) is a \(K\)-dimensional vector of ones, and \(K=\sum _{i=1}^p K_i\).
Proof
By \(\varDelta _i=O_p(\rho _n)\),
Fixing \((\varvec{X}_i,U_i,\varepsilon _i)\), we define
Note that
and also
so, we obtain \(W_i=2 h(\varepsilon _i) \theta _n \varvec{Z}_i \varvec{\gamma }^{*}(1+o_p(1))\). Thus
According to Lemma A.3 in Huang et al. (2004), \(\theta _n^2\mathbf{Z}^T\mathbf{Z} \varvec{\gamma }^{*}\) is bounded by a positive constant with probability tending to one, so
This completes the proof. \(\square \)
Lemma 3
Suppose Conditions (C1)–(C5) all hold. Then \(|\widehat{\varvec{\gamma }^{*}}-\widetilde{\varvec{\gamma }^{*}}|^2=o_p(K_n)\), where \(\widetilde{\varvec{\gamma }^{*}}=\arg \min A_n(\varvec{\gamma }^{*})\).
Proof
Define \(Q_n^{*}(\varvec{\gamma }^{*})=A_n(\varvec{\gamma }^{*})+r_n(\varvec{\gamma }^{*})\) and for any constant \(c>0\),
If \(\varvec{\gamma }_1\) is outside the ball \(\{\varvec{\gamma } : \sqrt{1/K}|\varvec{\gamma }-\widetilde{\varvec{\gamma }^{*}}|\le c\}\), then \(\varvec{\gamma }_1=\widetilde{\varvec{\gamma }^{*}}+l{\varvec{1}}\), where \({\varvec{1}}\) is a unit vector and \(l>c\). Then,
This implies that if \(R_n \le \frac{1}{2}T_n\), then the minimizer of \(Q^{*}_n\) must be inside the ball. Thus,
Since \(A_n(\varvec{\gamma }^{*})\) is a quadratic form with minimizer \(\widetilde{\varvec{\gamma }^{*}}\), a simple calculation shows that \(A_n(\varvec{\gamma }^{*})\) can be rewritten as
As a consequence, if \(\varvec{\gamma }^{*}=\widetilde{\varvec{\gamma }}^{*}+c\varvec{1}\),
where \(M_5\) is the smallest eigenvalue of \(\frac{K_n}{n}\mathbf{Z}^T \mathbf{Z}\) which is a positive constant with probability tending to one by Lemma A.3 in Huang et al. (2004). This implies that \(T_n\ge \frac{1}{2}M_5c^2\). Hence,
On the other hand, according to Lemma 1, we obtain that \(R_n \mathop {\rightarrow }\limits ^{p}0\). Thus, by condition (C4), \(|\widehat{\varvec{\gamma }^{*}}-\widetilde{\varvec{\gamma }^{*}}|^2=o_p(K)=o_p(K_n)\). \(\square \)
Lemma 4
Suppose Conditions (C1)–(C5) all hold, then
Proof
Write \(\varvec{S}_n(\varvec{\gamma })=(S_{n1}(\varvec{\gamma }),\ldots ,S_{nK}(\varvec{\gamma }))^T\) and denote \(\varDelta _i=\varvec{X}_i^{T}\varvec{\beta }(U_i)-\varvec{Z}_i \varvec{\gamma }_0\). It suffices to show that \(E(S_{nq}(\varvec{0}))^2=O(1), q=1,\ldots ,K\).
We next deal with \(R_1\) and \(R_2\). Firstly,
By Lemma A.3 of Huang et al. (2004), there exists an interval \([M_3,M_4]\), \(0<M_3<M_4<\infty \), such that all eigenvalues of \(\frac{K_n}{n}\mathbf{Z}^{T}\mathbf{Z}\) fall into \([M_3,M_4]\) with probability tending to 1. Hence \(R_1=O(1)\) immediately.
Note that \(\varDelta _i=O_p(\rho _n), i=1,\ldots ,n\). Thus,
By the same argument as for \(R_1\), we have \(R_2=O(1)\), which completes the proof. \(\square \)
Proof of Theorem 1
Note that \(\widetilde{\varvec{\gamma }^{*}}=(2 \tau )^{-1}(\theta _n^2\mathbf{Z}^{T}\mathbf{Z})^{-1}\varvec{S}_n(\varvec{0})\), so
By condition (C4), we obtain that \(|\widetilde{\varvec{\gamma }^{*}}|^2=O_p(K_n)\). By the triangle inequality, we have
and thus, by Lemma 3, \(|\widehat{\varvec{\gamma }^{*}}|^2=O_p(K_n)\). Consequently, \(|\hat{\varvec{\gamma }}-\varvec{\gamma }_0|^2=O_p(n^{-1}{K_n^2})\). By Lemma A.1 in Huang et al. (2004), \(||\hat{\varvec{\beta }}-{\varvec{g}^*}||^2_{L_2}=O_p(|\hat{\varvec{\gamma }}-\varvec{\gamma }_0|^2/K_n)=O_p(n^{-1}K_n)\). Finally, by the Cauchy–Schwarz inequality, \(||\hat{\varvec{\beta }}-\varvec{\beta }||^2_{L_2}=O_p(\rho _n^2+n^{-1}K_n)\). \(\square \)
Lemma 5
Suppose Conditions (C1)–(C5) all hold, then for any \(p\)-variate vector \(\varvec{c}_n\) whose components are not all zero,
Proof
By using \(\widetilde{\varvec{\gamma }^{*}}=(2 \tau )^{-1}(\theta _n^2\mathbf{Z}^{T}\mathbf{Z})^{-1}\varvec{S}_n(\varvec{0})\) again, it suffices to show that for any \(p\)-variate vector \(\varvec{b}_n\) whose components are not all zero, \(\varvec{b}_n^{T} \varvec{S}_n(\varvec{0})\) satisfies the Lindeberg–Feller condition. This can be easily verified by applying the dominated convergence theorem as briefly described below. Define
then we can write \(\varvec{b}_n^{T} \varvec{S}_n(\varvec{\gamma }_0)=\sum _{i=1}^n W_i\). Obviously, by applying Lemma 2, \(EW_i=0,\,{\mathrm{var}}(W_i)=\varsigma _{i}^2< \infty \) and as \(n \rightarrow \infty \)
We only need to check that
for all \(\epsilon >0\), where \(I(\cdot )\) is the indicator function. By applying Lemma 2 and the Cauchy–Schwarz inequality, we have \({\sqrt{K_n/n^3}}\sum _{j=1}^n \varvec{b}_n^{T}(\varvec{Z}_i^{T}-\varvec{Z}_j^{T})=o_p(1)\). Note that the random variable inside the expectation in (A.1) is bounded; hence, by dominated convergence we can interchange the limit and expectation. Since \(I(|W_i|>\epsilon )\rightarrow 0\), the expectation goes to 0, and the assertion of the lemma follows from the central limit theorem. \(\square \)
Proof of Theorem 2
By applying Lemmas 3 and 5, the theorem follows immediately from \(\hat{\beta }_l(u)=\sum _{k=1}^{K_l} \hat{\gamma }_{lk}B_{lk}(u)\). \(\square \)
Lemma 6
Suppose Conditions (C1)–(C5) all hold. If \( \rho _n\rightarrow 0, \lambda _n \rightarrow 0, {K_n}/{n} \rightarrow 0\), and \(\lambda _n/\rho _n \rightarrow \infty \) as \(n \rightarrow \infty \), then \(|\bar{\varvec{\gamma }}-\hat{\varvec{\gamma }}|=O_p({K_n}/{\sqrt{n}}+\sqrt{\lambda _n \rho _n K_n})\).
Proof
Let \(\bar{\varvec{\gamma }}-\varvec{\gamma }_0=\delta _n K^{1/2}\varvec{u}\), with \(\varvec{u}\) a vector satisfying \(|\varvec{u}|=1, \delta _n>0\) and \(\varvec{\gamma }_0=({\varvec{\gamma }^0_1}^{T},\ldots ,{\varvec{\gamma }^0_p}^{T})^{T}\). We first show that \(\delta _n=O_p(\theta _n+\lambda _n )\). Using the identity
which holds for \(z\not =0\), we have
According to Lemma 4, we can show that
Then \(Q_1\ge -M_6\delta _n n \theta _n\) for some positive constant \(M_6\). Following the same argument as for \(W_i\) in Lemma 2, we also obtain \(Q_2=\tau (\bar{\varvec{\gamma }}-\varvec{\gamma }_0)^T\mathbf{Z}^T\mathbf{Z}(\bar{\varvec{\gamma }}-\varvec{\gamma }_0) (1+o_p(1))\). Thus, by Lemma A.3 (Huang et al. 2004), \(Q_2\ge M_3 \delta _n^2 n\). Furthermore,
Thus,
which implies \(\delta _n=O_p(\theta _n+\lambda _n)\).
Next we proceed to improve the obtained rate and show that \(\delta _n=O_p(\theta _n+(\lambda _n\rho _n)^{1/2})\). For \(k=1,\ldots , p\), using properties of B-spline basis functions, we have
where \(A \asymp B\) means \(A/B\) is bounded. Thus, according to the Cauchy–Schwarz inequality, we have
Note that
It follows that \(||\bar{\varvec{\gamma }}_k||_{R_k} \rightarrow || \beta _k||_{L_2}\) and \(||\varvec{\gamma }^0_k||_{R_k} \rightarrow || \beta _k||_{L_2}\) with probability tending to one. Because \(|| \beta _k||_{L_2}>0\) for \(k=1, \ldots , s\) and \(\lambda _n\rightarrow 0\), we obtain that with probability tending to one,
On the other hand, \(|| \beta _k||_{L_2}=0\), for \(k=s+1,\ldots , p\), so \(||\varvec{\gamma }^0_k||_{R_k}=O_p(\rho _n)\). Because \(\lambda _n/\rho _n \rightarrow \infty \), with probability tending to one,
By the definition of \(p_{\lambda }(\cdot )\),
Therefore,
So according to the first part, with probability tending to one,
which in turn implies that \(\delta _n=O_p(\theta _n+(\lambda _n \rho _n)^{1/2})\). Then the lemma follows. \(\square \)
Proof of Theorem 3
To prove the first part, we argue by contradiction. Suppose that, for sufficiently large \(n\), there exists a constant \(\xi > 0\) such that, with probability at least \(\xi \), there exists a \(k_0>s\) with \(\bar{\beta }_{k_0}(u) \ne 0\). Then \(||\bar{\varvec{\gamma }}_{k_0}||_{R_{k_0}}=||\bar{\beta }_{k_0}(u)||_{L_2}>0\). Let \(\bar{\varvec{\gamma }}^*\) be the vector obtained by replacing \(\bar{\varvec{\gamma }}_{k_0}\) with 0 in \(\bar{\varvec{\gamma }}\).
Following the same treatment of \(Q_n\) as in Lemma 6, we obtain
According to Lemma A.3 in Huang et al. (2004) and Lemma 4, we obtain the following inequalities,
Consequently, according to (10), with probability tending to one,
Then,
According to the conditions, the third term dominates both the first and second terms, which contradicts the fact that \(P_n(\bar{\varvec{\gamma }})-P_n(\bar{\varvec{\gamma }}^*)\le 0\). This proves part (i). Next, we prove part (ii). Denote \(\varvec{\beta }=((\varvec{\beta }^{(1)})^T,(\varvec{\beta }^{(2)})^T)^T\), where \(\varvec{\beta }^{(1)}=(\beta _1,\ldots ,\beta _s)^T\) and \(\varvec{\beta }^{(2)}=(\beta _{s+1},\ldots ,\beta _p)^T\), and \(\varvec{\gamma }=((\varvec{\gamma }^{(1)})^T,(\varvec{\gamma }^{(2)})^T)^T\), where \(\varvec{\gamma }^{(1)}=(\varvec{\gamma }_1,\ldots ,\varvec{\gamma }_s)^T\) and \(\varvec{\gamma }^{(2)}=(\varvec{\gamma }_{s+1},\ldots ,\varvec{\gamma }_p)^T\). Similarly, denote \(\varvec{Z}_i=(\varvec{Z}_i^{(1)},\varvec{Z}_i^{(2)})\). Define the oracle version of \(\varvec{\gamma }\),
which is obtained as if the identities of the nonzero components were known; the corresponding vector of coefficient functions is denoted \(\bar{\varvec{\beta }}_{oracle}\). By the above lemmas, we easily obtain \(||\bar{ \beta }_{k,oracle}||_{L_2}\rightarrow ||\beta _k||_{L_2}\) for \(k=1,\ldots , s\), and by definition, \(||\bar{ \beta }_{k,oracle}||_{L_2}=0\) for \(k=s+1,\ldots , p\). By part (i) of the theorem, \(\bar{\varvec{\gamma }}=((\bar{\varvec{\gamma }}^{(1)})^T,\varvec{0}^T)^T\), and
with probability tending to one. Let \(\bar{\varvec{\gamma }}-\bar{\varvec{\gamma }}_{oracle}=\delta _n K_n^{1/2}\varvec{v}\), with \(\varvec{v}=((\varvec{v}^{(1)})^T,\varvec{0}^T)^T\), and \(|\varvec{v}|=1\). Then \(||\bar{\varvec{\beta }}-\bar{\varvec{\beta }}_{oracle}||_{L_2}\asymp K_n^{-1}|\bar{\varvec{\gamma }}-\bar{\varvec{\gamma }}_{oracle}|=\delta _n\). Similar to part (i),
Thus \(||\bar{\varvec{\beta }}-\bar{\varvec{\beta }}_{oracle}||_{L_2}\asymp \delta _n=O_p(\theta _n)\), which implies that \(||\bar{\varvec{\beta }}-\varvec{\beta }||_{L_2}=O_p(\rho _n+\theta _n)\). The desired result follows. \(\square \)
Proof of Theorem 4
By Theorem 3, with probability tending to one, \(\bar{\varvec{\gamma }}=((\bar{\varvec{\gamma }}^{(1)})^T,\varvec{0}^T)^T\) is a local minimizer of \(PL_n(\varvec{\gamma })\). Thus, by the definition of \(PL_n(\varvec{\gamma })\),
According to the proof of Lemma 6, \(||\bar{\varvec{\gamma }}_k||_{R_k}>a\lambda _n\), for \(k=1,\ldots ,s\), so the second part of the above equation is \(0\). Thus, \(\frac{\partial Q_n(\varvec{\gamma })}{\partial \varvec{\gamma }}\bigg |_{\varvec{\gamma }=((\bar{\varvec{\gamma }}^{(1)})^T,\varvec{0}^T)^T}=0\), which implies
Applying Theorem 2, we can easily obtain the result. \(\square \)
Proof of Theorem 5
First, note that \(\hat{\varvec{\beta }}\) is a consistent estimator of \(\varvec{\beta }\) by Theorem 1, so, following the proof of Theorem 1 in Koul et al. (1987), \(\hat{\tau }\) is a consistent estimator of \(\tau \). We may therefore replace \(\hat{\tau }\) with \(\tau \) in the following proof. To establish the consistency of the BIC, we first construct a sequence of reference tuning parameters, \(\lambda _n=\log (n/K_n)/\sqrt{n/K_n}\). By Theorem 4, the penalized estimator \(\bar{\varvec{\beta }}_{\lambda _n}\) is exactly the same as the oracle estimator \(\bar{\varvec{\beta }}_{oracle}\). It follows immediately that \(P(BIC_{\lambda _n}=BIC_{S_T})\rightarrow 1\), which implies \(BIC_{\lambda _n}\mathop {\rightarrow }\limits ^{p}\log (L^{S_T})\). Next, we verify that \(P(\inf _{\lambda \in \varOmega _{-}\cup \varOmega _{+}}BIC_{\lambda }>BIC_{\lambda _n})\rightarrow 1\), where \(\varOmega _{-}\) and \(\varOmega _{+}\) denote the underfitting and overfitting cases, respectively.
Case 1: Underfitted model, i.e., the model misses at least one covariate in the true model. For any \(\lambda \in \varOmega _{-}\), similar to Wang and Li (2009), we have
Case 2: Overfitted model, i.e., the model contains all the covariates in the true model and at least one covariate that does not belong to the true model. For any \(\lambda \in \varOmega _{+}\), by Lemma 1, we have
By Lemma A.3 in Huang et al. (2004) and Lemma 6, we obtain \(Q_n(\bar{\varvec{\gamma }}_{S_T})-Q_n(\bar{\varvec{\gamma }}_{\lambda })=O_p(K_n)\). Thus, with probability tending to one,
This implies that
Since the last term dominates the first term and diverges to \(+\infty \), we obtain \(P(\inf _{\lambda \in \varOmega _{+}}(BIC_{\lambda }-BIC_{\lambda _n})>0)\rightarrow 1\).
Thus, according to the above results, those \(\lambda \)’s that fail to identify the true model cannot be selected by the BIC asymptotically, because the true model identified by \(\lambda _n\) is a better choice. As a result, the optimal value \(\hat{\lambda }_{BIC}\) must be among the \(\lambda \)’s whose corresponding estimators yield the true model. Hence, the theorem follows immediately. \(\square \)
Feng, L., Zou, C., Wang, Z. et al. Robust spline-based variable selection in varying coefficient model. Metrika 78, 85–118 (2015). https://doi.org/10.1007/s00184-014-0491-y