1 Introduction

Consider the following additive regression model

$$\begin{aligned} Y_i=u+ \sum _{j=1}^p { f_{0j}(X_{ij}) } + \varepsilon _i, \end{aligned}$$
(1)

where \(X_i=(X_{i1},X_{i2},\ldots ,X_{ip})^T\) is a p-dimensional covariate, \(\{ f_{0j}(\cdot ), j=1,2,\ldots ,p \}\) are unknown smooth functions satisfying \(\text{ E }\{ f_{0j}(X_{ij}) \}=0\) for model identifiability, and \(\varepsilon _i\) is a random error independent of \(X_i\). Such an additive approximation offers at least two benefits. First, an additive combination of univariate functions is more interpretable and easier to fit than a joint multivariate nonparametric model. Second, the so-called “curse of dimensionality” that besets multivariate nonparametric regression is largely circumvented, because each additive component can be estimated with a univariate smoother in an iterative fashion. Consequently, a large body of work has been devoted to this model; see, for instance, Yu and Lu (2004), Mammen and Park (2006), Yu et al. (2008), Xue (2009) and Lian (2012a, b).

Although model (1) has many appealing properties, Opsomer and Ruppert (1999) noticed that in practice some covariates may have linear or even no effects on the response while others enter nonlinearly, and recommended the semiparametric additive model (SPAM) of the form

$$\begin{aligned} Y_i=u+ \sum _{j=1}^{p_0} { f_{0j}(X_{ij}) } + \sum _{j=p_0+1}^p { X_{ij}\beta _{0j} } + \varepsilon _i. \end{aligned}$$
(2)

Statistically, the SPAM can be more parsimonious than the general additive model and has therefore attracted considerable attention; see Härdle et al. (2004), Deng and Liang (2010), Liu et al. (2011), Wei and Liu (2012) and Wei et al. (2012), among others. Nevertheless, all these works on the SPAM assume that the split between the linear and nonlinear parts is known in advance, which is not always true in practice. A misspecified structure not only increases model complexity but also reduces estimation accuracy: since the optimal parametric estimation rate is \(n^{-1/2}\) while the optimal nonparametric rate is \(n^{-2/5}\), treating a parametric component as nonparametric over-fits the data and leads to a loss of efficiency. Model identification is therefore important for model (1), and it is of great interest to develop efficient methods that distinguish the nonzero components and separate the linear components from the nonlinear ones.

In general, this goal could be achieved by hypothesis testing as in Jiang et al. (2007), but such tests become cumbersome when more than a few predictors must be examined, and the theoretical properties of identification procedures based on hypothesis testing can be hard to analyze. To address this, Huang et al. (2010) presented a new use of the SCAD penalty and related methods, applying it to nonparametric additive models to identify zero components and parametric components. Following a similar idea, Zhang et al. (2011) simultaneously identified the zero and linear components of partially linear models using two penalty functions within an elegant mathematical framework; Lian (2012a) provided a way to determine the linear components of additive models based on least squares (LS) regression; Lian (2012b) identified the nonzero and linear components of model (1) in conditional quantile regression; and Wang and Song (2013) applied the SCAD penalty to identify the model structure in semiparametric varying-coefficient partially linear models. Note that all these papers were built on either LS regression, which is sensitive to outliers and inefficient under many commonly used non-normal errors, or quantile regression, whose efficiency depends on the error density at the chosen quantile. Hence, it is highly desirable to develop an efficient and robust method that simultaneously conducts model identification and estimation.

Recently, Wang et al. (2009) proposed a novel procedure for the varying coefficient model based on rank regression and demonstrated that the new method is highly efficient across a wide class of error distributions while retaining comparable efficiency in the worst-case scenario relative to LS regression. Similar conclusions on rank regression have been confirmed in Leng (2010), Sun and Lin (2014), Feng et al. (2015) and the references therein. To the best of our knowledge, none of these approaches has been studied for the SPAM. Motivated by these observations, this paper extends rank regression to the SPAM for identifying nonzero components as well as linear components. Specifically, we first embed the SPAM into an additive model and use the spline method to approximate the unknown functions. A two-fold SCAD penalty is then employed to discriminate the nonzero components, and the linear components from the nonlinear ones, by penalizing both the component functions and their second derivatives. Furthermore, we establish the theoretical properties of the estimator and, based on the asymptotic theory for the linear components, show that the proposed rank estimate achieves substantial efficiency gains across a wide spectrum of non-normal error distributions while losing little efficiency under normal errors compared with the LS estimate. Even in the worst-case scenario, the asymptotic relative efficiency (ARE) of the proposed rank estimate versus the LS estimate has a lower bound of 0.864. In addition, the ARE of the proposed rank estimate versus LS has an expression closely related to that of the signed-rank Wilcoxon test relative to the t-test.

The rest of this paper is organized as follows. In Sect. 2, we introduce the new penalized rank regression method based on basis expansion and the SCAD penalty. In Sect. 3, the asymptotic properties are established under suitable conditions. The selection of the tuning parameters is discussed in Sect. 4, along with a computational algorithm for implementation. Sect. 5 illustrates the finite sample performance of the proposed procedure via simulation studies, and concluding remarks are given in Sect. 6. All technical proofs are deferred to the Appendix.

2 Rank-based shrinkage regression for additive models

Suppose that \(\{ X_i, Y_i \}_{i=1}^n\) is an independent and identically distributed sample from model (2). Without loss of generality, we assume that the distribution of \(X_i\) is supported on [0,1]. As we do not know in advance which covariates have linear effects, all p components are treated as nonparametric, and polynomial splines are used to approximate them. Let \(0=\xi _0< \xi _1< \cdots< \xi _{K_n} < \xi _{K_n+1}=1\) be a partition of [0,1] into \(K_n+1\) subintervals \([\xi _k, \xi _{k+1}), k=0,1,\ldots ,K_n\), where \(K_n\) denotes the number of interior knots, which increases with the sample size n. A polynomial spline of order q is a function whose restriction to each subinterval is a polynomial of degree \(q-1\) and which is globally \(q-2\) times continuously differentiable on [0,1]. The collection of splines with a fixed sequence of knots has a normalized B-spline basis \(\{ B_1(x),B_2(x),\ldots ,B_{K^\prime }(x) \}\) with \(K^\prime =K_n+q\).

Note that the constraint \(\text{ E }\{ f_{0j}(X_{ij}) \}=0\) is required for model identifiability, so we work in the space of spline functions \(S_j^0:= \{ \hbar : \hbar (x)=\sum _{k=1}^K { \gamma _{jk}B_{jk}(x) },~ \sum _{i=1}^n { \hbar (X_{ij}) }=0 \}\) with the centered basis \(\{ B_{jk}(x)=B_{k}(x)- \sum _{i=1}^n { B_{k}(X_{ij})/n },~ k=1,2,\ldots ,K \}\), where \(K=K^\prime -1\) because the empirical version of the constraint removes one degree of freedom. Then the nonlinear functions in model (1) can be approximated by

$$\begin{aligned} f_{0j}(x) \approx \sum _{k=1}^K { \gamma _{jk}B_{jk}(x) },~~~~j=1,2,\ldots ,p. \end{aligned}$$
(3)

For simplicity, we restrict our attention to equally spaced knots, although other regular knot sequences, such as quasi-uniform or data-driven choices, can be considered. It is also possible to specify a different \(K_n\) for each component. However, using equally spaced knots and the same number of knots for every component allows a much simpler exposition, and as in most of the spline-based literature, similar asymptotic results can be shown to hold for different choices of \(K_n\) and different knots across components. Let \(\gamma _j=(\gamma _{j1}, \gamma _{j2}, \ldots ,\gamma _{jK})^T\) and \(B_j(x)=\big ( B_{j1}(x),B_{j2}(x),\ldots ,B_{jK}(x) \big )^T\). Following the approximation (3), model (1) can be rewritten as

$$\begin{aligned} Y_i \approx u+ \sum _{j=1}^p { \sum _{k=1}^K { \gamma _{jk}B_{jk}(X_{ij}) } } + \varepsilon _i = u+ \sum _{j=1}^p { B_j(X_{ij})^T \gamma _j } + \varepsilon _i . \end{aligned}$$

Accordingly, the residual for estimating \(Y_i\) at \(X_i\) is \(e_i=Y_i-u-\sum _{j=1}^p { B_j(X_{ij})^T \gamma _j }\).
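To make the construction concrete, the following Python sketch (our own illustration, not code from the paper) builds the centered B-spline basis described above; the default knot placement, the cubic default and the choice of which column to drop are our assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_knots=5, degree=3):
    """Evaluate a B-spline basis of order q = degree + 1 with n_knots equally spaced
    interior knots on [0, 1]; returns an (n, K') matrix with K' = n_knots + degree + 1."""
    interior = np.linspace(0, 1, n_knots + 2)[1:-1]
    t = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]  # clamped knot vector
    K_prime = len(t) - degree - 1                                   # = K_n + q
    return np.column_stack([BSpline(t, np.eye(K_prime)[k], degree)(x)
                            for k in range(K_prime)])

def centered_basis(x, n_knots=5, degree=3):
    """Empirically centered basis B_{jk}(x) = B_k(x) - n^{-1} sum_i B_k(X_{ij});
    one column is dropped so that K = K' - 1."""
    B = bspline_basis(x, n_knots, degree)
    return (B - B.mean(axis=0))[:, :-1]
```

Stacking `centered_basis(X[:, j])` over \(j=1,\ldots ,p\) then gives the rows \(\big (B_1(X_{i1})^T,\ldots ,B_p(X_{ip})^T\big )\) that appear in the design matrix used below.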

Applying the rank regression technique, we propose the following minimization problem

$$\begin{aligned} \check{\gamma }=\arg \min _{\gamma }L_n(\gamma ):=\frac{1}{n}\sum _{i<j} { |e_i-e_j| }, \end{aligned}$$
(4)

where \(\gamma =\big ( \gamma _1^T,\gamma _2^T,\ldots ,\gamma _p^T \big )^T\). Thus the estimated component functions are \(\check{f}_j(x)= B_j(x)^T\check{\gamma }_j\). Note that the loss function \(L_n(\gamma )\) is essentially a version of Gini’s mean difference of the residuals, a classical measure of concentration or dispersion; see David (1998) for details. In addition, it is worth mentioning that the above rank-based loss function cannot produce an estimate of the intercept u, because u cancels in \(e_{i}-e_{j}\); this is a unique feature of using this type of estimate in the present problem. As pointed out in Wang et al. (2009), an additional location constraint on the random errors is essential to make the intercept identifiable, and they adopted the commonly used constraint that \(\varepsilon _i\) has median zero. Under the same constraint on \(\varepsilon _i\), a reasonable estimate of u is \(\hat{u}=\sum _{i=1}^n{ Y_i/n }\), which converges at the rate \(1/ \sqrt{n}\), faster than any rate of convergence for nonparametric function estimation. Thus, for notational convenience, we can safely assume \(u=0\), as we do in the sequel.
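The unpenalized loss in (4) can be evaluated in \(O(n\log n)\) time by sorting the residuals; the short sketch below (an illustration under our own naming) uses the identity \(\sum _{i<j}|e_i-e_j|=\sum _{k}(2k-n-1)e_{(k)}\) for the order statistics \(e_{(1)}\le \cdots \le e_{(n)}\).

```python
import numpy as np

def rank_loss(resid):
    """Unpenalized rank loss (4): (1/n) * sum_{i<j} |e_i - e_j|, a scaled Gini mean difference.

    Uses sum_{i<j} |e_i - e_j| = sum_k (2k - n - 1) e_(k) for the sorted residuals, which also
    equals 2(n+1) * sum_i {R(e_i)/(n+1) - 1/2} e_i, the Wilcoxon-score form appearing in (8)."""
    e = np.sort(np.asarray(resid, dtype=float))
    n = len(e)
    return np.sum((2 * np.arange(1, n + 1) - n - 1) * e) / n
```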

Recall that we are interested in finding the zero components and the linear components of model (1). The former can be found by shrinking \(\Vert f_j \Vert \) to zero, and the latter by shrinking the second-derivative norm \(\Vert f_j^{\prime \prime } \Vert \) to zero, because a function is linear if and only if its second derivative is identically zero. Therefore, instead of (4), we consider the following two-fold penalization procedure

$$\begin{aligned} \hat{\gamma }=\arg \min _{\gamma }L_n^{\lambda }(\gamma ):= \frac{1}{n}\sum _{i<j} { |e_i-e_j| } + n \sum _{k=1}^p { p_{\lambda _1}(\Vert f_k \Vert ) } + n \sum _{k=1}^p { p_{\lambda _2}(\Vert f_k^{\prime \prime } \Vert )}, \end{aligned}$$
(5)

where \(p_{\lambda }(\cdot )\) is the SCAD penalty function defined by its first derivative

$$\begin{aligned} p^{\prime }_\lambda (t) = \lambda \left\{ I(t \le \lambda )+\frac{(a\lambda -t)_+}{(a-1)\lambda }I(t > \lambda ) \right\} , \end{aligned}$$

where \(\lambda \) is the penalty parameter and \(a > 2\) is a constant, usually taken to be 3.7 as suggested in Fan and Li (2001). Note that the SCAD penalty is continuously differentiable on \((-\infty ,0) \cup (0,\infty )\) but singular at 0, and its derivative vanishes outside \([-a \lambda ,a \lambda ]\). These features yield a solution with three desirable properties, namely unbiasedness, sparsity and continuity, as defined in Fan and Li (2001).
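For reference, a minimal implementation of the SCAD derivative above, together with the penalty obtained by integrating it from zero, might look as follows (the default a = 3.7 follows Fan and Li 2001; the vectorized form and function names are our own).

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """First derivative p'_lambda(t) of the SCAD penalty for t >= 0."""
    t = np.asarray(t, dtype=float)
    return lam * np.where(t <= lam, 1.0,
                          np.clip(a * lam - t, 0.0, None) / ((a - 1) * lam))

def scad(t, lam, a=3.7):
    """SCAD penalty p_lambda(t), t >= 0, obtained by integrating p'_lambda from 0."""
    t = np.asarray(t, dtype=float)
    mid = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))   # lam < t <= a*lam
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, lam ** 2 * (a + 1) / 2))
```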

Note that \(\Vert f_j \Vert ^2 = \int \big (\sum _{k=1}^K { \gamma _{jk}B_{jk}(x) } \big ) \big ( \sum _{k^{\prime }=1}^K { \gamma _{jk^{\prime }} B_{jk^{\prime }}(x) } \big ){ dx}\) and \(\Vert f_j^{\prime \prime } \Vert ^2 = \int \big ( \sum _{k=1}^K { \gamma _{jk}B_{jk}^{\prime \prime }(x) } \big ) \big ( \sum _{k^{\prime }=1}^K { \gamma _{jk^{\prime }}B_{jk^{\prime }}^{\prime \prime }(x) }\big )dx\), so \(\Vert f_j \Vert \) and \(\Vert f_j^{\prime \prime } \Vert \) can be written as \(\sqrt{ \gamma _j^T D_j \gamma _j }\) and \(\sqrt{ \gamma _j^T E_j \gamma _j }\), respectively, where \(D_j,E_j \in R^{K \times K}\) have \((k,k^{\prime })\) entries \(\int B_{jk}(x) B_{jk^{\prime }}(x)dx\) and \(\int B_{jk}^{\prime \prime }(x) B_{jk^{\prime }}^{\prime \prime }(x)dx\), respectively. The minimization problem (5) is then equivalent to

$$\begin{aligned} \hat{\gamma }= & {} \arg \min _{\gamma }L_n^{\lambda }(\gamma ):= \frac{1}{n}\sum _{i<j} { |e_i-e_j| } + n \sum _{k=1}^p { p_{\lambda _1}\left( \sqrt{ \gamma _k^T D_k \gamma _k } \right) } \nonumber \\&+\, n \sum _{k=1}^p { p_{\lambda _2}\left( \sqrt{ \gamma _k^T E_k \gamma _k } \right) }. \end{aligned}$$
(6)

Consequently, the estimated component functions are given by \(\hat{f}_j(x)= B_j(x)^T \hat{\gamma }_j\).
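The Gram matrices \(D_j\) and \(E_j\) can be precomputed once per component. A possible numerical sketch (our own, using trapezoidal quadrature on a fine grid and centering over that grid as a stand-in for the empirical centering) is given below.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import trapezoid

def penalty_grams(n_knots=5, degree=3, n_grid=2001):
    """Gram matrices D (entries int B_k B_k' dx) and E (entries int B_k'' B_k'' dx) of a
    centered B-spline basis on [0, 1], approximated by trapezoidal quadrature."""
    interior = np.linspace(0, 1, n_knots + 2)[1:-1]
    t = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    K_prime = len(t) - degree - 1
    x = np.linspace(0, 1, n_grid)
    basis = [BSpline(t, np.eye(K_prime)[k], degree) for k in range(K_prime)]
    B0 = np.column_stack([b(x) for b in basis])
    B2 = np.column_stack([b.derivative(2)(x) for b in basis])
    B0 = (B0 - B0.mean(axis=0))[:, :-1]   # center over the grid, drop one column (K = K' - 1)
    B2 = B2[:, :-1]                       # subtracting a constant does not change f''
    D = trapezoid(B0[:, :, None] * B0[:, None, :], x, axis=0)
    E = trapezoid(B2[:, :, None] * B2[:, None, :], x, axis=0)
    return D, E
```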

3 Theoretical properties

3.1 Asymptotic properties

Without loss of generality, assume that \(f_{0j}\) is truly nonparametric for \(j=1,2,\ldots ,p_0\), linear for \(j=p_0+1,p_0+2,\ldots ,s\), with the true slope parameters denoted by \(\beta _0=(\beta _{0,p_0+1},\beta _{0,p_0+2},\ldots ,\beta _{0,s})^T\), and zero for \(j=s+1, s+2,\ldots ,p\). The vectors \(X^{(1)}=(X_1,X_2,\ldots ,X_{p_0})^T\) and \(X^{(2)}=(X_{p_0+1},X_{p_0+2},\ldots , X_{s})^T\) collect the covariates of the nonlinear and linear components, respectively. Denote by \(\mathcal {A}\) the subspace of functions on \(R^{p_0}\) with an additive form

$$\begin{aligned}&\mathcal {A}:=\{ h(x^{(1)}): h(x^{(1)})=h_1(x_1)+h_2(x_2)+\ldots +h_{p_0}(x_{p_0}), E\big ( h_j(X_j) \big )\\&\quad =0~\text{ and }~E\big ( h_j(X_j)^2 \big ) < \infty \}, \end{aligned}$$

and by \(E_{\mathcal {A}}(M)\) the projection of M onto \(\mathcal {A}\) (applied componentwise when M is a vector), in the sense that

$$\begin{aligned} E\big \{ \big ( M-E_{\mathcal {A}}(M) \big )^2 \big \} = \inf _{h \in \mathcal {A}} E \big \{ \big ( M-h(X^{(1)}) \big )^2 \big \}. \end{aligned}$$

Let \(h(X^{(1)})=E_{\mathcal {A}}(X^{(2)})\). Each component of \(h(X^{(1)})=\big ( h_{(1)}(X^{(1)}),\ldots ,h_{(s-p_0)}(X^{(1)}) \big )^T\) can be written in the form \(h_{(u)}(x)= \sum _{j=1}^{p_0}h_{(u)j}(x_j)\) for some \(h_{(u)j}(x_j) \in S_j^0\). To facilitate the asymptotic analysis, we make the following regularity assumptions.

  1. (A1)

    The density function f(x) of X is absolutely continuous and compactly supported. Without loss of generality, assume that the support of X is \([0,1]^p\). Furthermore, there exist constants \(0< c_1 \le c_2 < \infty \) such that \(c_1 \le f(x) \le c_2\) for all \(x \in [0,1]^p\).

  2. (A2)

    For \(g=f_{0j}, 1 \le j \le p_0\) or \(g=h_{(u)j}, 1 \le u \le s-p_0, 1 \le j \le p_0\), g satisfies a Lipschitz condition of order \(r>1/2\); that is, \(| g^{(\lfloor r \rfloor )}(x_1)- g^{(\lfloor r \rfloor )}(x_2) | \le C | x_1-x_2 |^{r-\lfloor r \rfloor }\), where C is a constant, \(\lfloor r \rfloor \) denotes the largest integer strictly smaller than r, and \(g^{(\lfloor r \rfloor )}\) is the \(\lfloor r \rfloor \)th derivative of g. In addition, the order of the B-spline basis satisfies \(q \ge r + 2\).

  3. (A3)

    The matrix \( \Sigma =E\big \{ \big (X^{(2)}-h(X^{(1)})\big )\big (X^{(2)}-h(X^{(1)})\big )^T \big \}\) is positive definite.

  4. (A4)

    The error \(\varepsilon \) has a positive density function h(x) satisfying \(\int [h^{\prime }(x)]^2 / h(x)\, dx <\infty \); that is, \(\varepsilon \) has finite Fisher information.

Assumptions (A1)–(A2) are common in the polynomial spline estimation literature; see, for example, Huang et al. (2010), Wang and Song (2013), Tang (2015) and Li et al. (2015). It was shown in Li (2000) that the positive definiteness of \(\Sigma \) in (A3) is necessary for the identifiability of the model when the linear components are specified. Assumption (A4) is a regularity condition on the random errors, the same as that used in work on rank regression such as Wang et al. (2009), Hettmansperger and McKean (2011), Sun and Lin (2014) and Feng et al. (2015).

Theorem 1

Suppose that assumptions (A1)–(A4) hold. If the number of knots \(K_n \asymp n^{1/(2r+1)}\), then we have

$$\begin{aligned} \Vert \check{f}_{j}-f_{0j} \Vert ^2=O_p \left( n^{\frac{-2r}{2r+1}} \right) ,~~ j=1,2,\ldots ,p, \end{aligned}$$

where \(\check{f}_{j}=B_j^T \check{\gamma }_j\) is the unpenalized estimate of component function \(f_{0j}\) with \(\check{\gamma }\) generated by solving (4).

Theorem 1 indicates that the nonparametric estimates obtained by our proposed method attain the optimal convergence rates. The following theorem will show that if the tuning parameters \(\lambda _1\) and \(\lambda _2\) are appropriately specified, we can identify the zero parts and linear parts consistently.

Theorem 2

Under the same assumptions of Theorem 1, if \(\max \{ \lambda _1,\lambda _2 \} \rightarrow 0\) and \(n^{r/(2r+1)} \min \{ \lambda _1,\lambda _2 \} \rightarrow \infty \), then with probability tending to 1,

  1. (i)

    \(\Vert \hat{f}_{j}-f_{0j} \Vert ^2=O_p \left( n^{\frac{-2r}{2r+1}} \right) \) for \(j=1,2,\ldots ,p,\)

  2. (ii)

    \( \hat{f}_j \) is a linear function for \( j=p_0+1,p_0+2,\ldots ,s \),

  3. (iii)

    \( \hat{f}_j \equiv 0\) for \( j=s+1,s+2,\ldots ,p \),

where \(\hat{f}_{j}=B_j^T \hat{\gamma }_j\) is the penalized estimate of component function with \(\hat{\gamma }\) generated by solving (5).

Finally, for the linear components, we will show that the estimate of the slope parameter is asymptotically normal.

Theorem 3

Under the same assumptions of Theorem 2, we have

$$\begin{aligned} \sqrt{n}(\hat{\beta }-\beta _0)~\mathop \rightarrow \limits ^d~ N \bigg ( 0,~ \frac{1}{12 \tau ^2} \Sigma ^{-1} \bigg ), \end{aligned}$$
(7)

where \(\Sigma \) is defined in assumption (A3) and \(\tau =\int h(x)^2 dx\).

Remark 1

Based on the results of Theorems 2 and 3, we observe that the proposed estimator enjoys an oracle property, in the sense that it is asymptotically equivalent to the oracle estimator obtained when the true model structure is known in advance.

3.2 Asymptotic relative efficiency

Denote by \(\hat{\beta }_{LS}\) and \(\hat{\beta }_{RR}\) the estimates of \(\beta _0\) obtained by the LS regression of Lian (2012a) and by our proposed rank regression, respectively. To measure efficiency, we compare the asymptotic variances of \(\hat{\beta }_{LS}\) and \(\hat{\beta }_{RR}\), since both are asymptotically unbiased. Hence, based on the asymptotic distribution of \(\hat{\beta }_{LS}\) given in Theorem 3 of Lian (2012a) and on (7) in Theorem 3 above, we obtain the following result.

Theorem 4

The ARE of the rank-based estimate \(\hat{\beta }_{RR}\) relative to the LS estimate \(\hat{\beta }_{LS}\) for the linear parameter \(\beta _0\) is

$$\begin{aligned} ARE (\hat{\beta }_{RR},\hat{\beta }_{LS})=\frac{Var (\hat{\beta }_{LS})}{Var (\hat{\beta }_{RR})} = 12 \sigma ^2 \tau ^2, \end{aligned}$$

where \(\sigma ^2=E (\varepsilon ^2)\). This ARE has a lower bound of 0.864 for estimating the parametric components, attained at the random error density \(h(x)=\frac{3}{20\sqrt{5}}(5-x^2)I(|x|\le \sqrt{5})\).

Note that this ARE coincides with that of the signed-rank Wilcoxon test relative to the t-test. It is well known in the rank-analysis literature that the ARE is as high as 0.955 for the normal error distribution and can be substantially larger than 1 for many heavier-tailed distributions; for instance, it equals 1.5 for the double exponential distribution and about 1.9 for the t distribution with three degrees of freedom.
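These ARE values are easy to verify numerically from the formula \(ARE=12\sigma ^2\tau ^2\) of Theorem 4; the following sketch (ours) reproduces 0.955, 1.5, about 1.9 and the lower bound 0.864.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def are_rank_vs_ls(pdf, var, lower=-np.inf, upper=np.inf):
    """ARE(RR, LS) = 12 * sigma^2 * tau^2, with tau the integral of the squared error density."""
    tau, _ = quad(lambda x: pdf(x) ** 2, lower, upper)
    return 12 * var * tau ** 2

print(are_rank_vs_ls(stats.norm.pdf, 1.0))               # normal: 3/pi ~ 0.955
print(are_rank_vs_ls(stats.laplace.pdf, 2.0))            # double exponential: 1.5
print(are_rank_vs_ls(lambda x: stats.t.pdf(x, 3), 3.0))  # t(3): ~ 1.9
b = np.sqrt(5.0)                                         # least favourable density of Theorem 4
print(are_rank_vs_ls(lambda x: 3 / (20 * b) * (5 - x ** 2), 1.0, -b, b))  # 0.864
```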

4 Algorithm implementation and tuning parameters selections

In this section we first present an iterative estimation procedure that applies the local quadratic approximation (LQA; Fan and Li 2001) to the rank-based objective function \(L_n(\gamma )\) as well as to the two penalty functions \(p_{\lambda _1}(\cdot )\) and \(p_{\lambda _2}(\cdot )\). We then discuss the selection of the additional parameters, namely the number of interior knots \(K_n\) and the tuning parameters \(\lambda _1\) and \(\lambda _2\).

4.1 Algorithm implementation

It is worth noting that commonly used gradient-based optimization techniques are not directly applicable to (6) because the objective is not differentiable at the origin. Following Sievers and Abebe (2004), we approximate the unpenalized loss \(L_n(\gamma )\) by

$$\begin{aligned} L_n(\gamma ) \approx \frac{1}{n}\sum _{i=1}^n { w_i(e_i-\varsigma )^2 }, \end{aligned}$$

where \(\varsigma \) is the median of \(\{e_i\}_{i=1}^n\) and

$$\begin{aligned} w_i= \left\{ \begin{array}{ll} \frac{\frac{R(e_i)}{n+1}-\frac{1}{2}}{e_i-\varsigma }, &{}\quad \text{ for }~e_i \ne \varsigma , \\ 0, &{}\quad \text{ otherwise } \end{array} \right. \mathrm{{ }} \end{aligned}$$

with \(R(e_i)\) being the rank of \(e_i\) among \(\{e_i\}_{i=1}^n\).
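A direct implementation of these weights might look as follows (a sketch with our own function name; scipy's rankdata is used for \(R(e_i)\)).

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_weights(resid):
    """Weights w_i of the quadratic approximation (Sievers and Abebe, 2004):
    w_i = {R(e_i)/(n+1) - 1/2} / (e_i - median(e)), and 0 when e_i equals the median."""
    resid = np.asarray(resid, dtype=float)
    n = len(resid)
    med = np.median(resid)
    score = rankdata(resid) / (n + 1) - 0.5
    w = np.zeros(n)
    ok = resid != med
    w[ok] = score[ok] / (resid[ok] - med)
    return w, med
```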

On the other hand, following Fan and Li (2001), we apply the LQA to the two penalty terms. That is, for a given initial estimate \(\hat{\gamma }_j^{(0)}\), the corresponding weights \(w_i^{(0)}\) and the residual median \(\varsigma ^{(0)}\) can be obtained. If \(\hat{f}_j^{(0)}\) (\(\hat{f}_j^{(0)\prime \prime }\)) is very close to 0, then we set \(\hat{f}_j=0\) (\(\hat{f}_j^{\prime \prime }=0\)). Otherwise, we have

$$\begin{aligned} p_{\lambda _1}(\Vert f_j \Vert ) \approx p_{\lambda _1}(\Vert \gamma _j^{(0)}\Vert _{D_j}) + \frac{1}{2}\frac{p_{\lambda _1}^{\prime }(\Vert \gamma _j^{(0)}\Vert _{D_j})}{ \Vert \gamma _j^{(0)}\Vert _{D_j} } \{ \Vert \gamma _j\Vert ^2_{D_j}-\Vert \gamma _j^{(0)}\Vert ^2_{D_j} \}, \end{aligned}$$

and

$$\begin{aligned} p_{\lambda _2}(\Vert f_j^{\prime \prime } \Vert ) \approx p_{\lambda _2}(\Vert \gamma _j^{(0)}\Vert _{E_j}) + \frac{1}{2}\frac{p_{\lambda _2}^{\prime }(\Vert \gamma _j^{(0)}\Vert _{E_j})}{ \Vert \gamma _j^{(0)}\Vert _{E_j} } \{ \Vert \gamma _j\Vert ^2_{E_j}-\Vert \gamma _j^{(0)}\Vert ^2_{E_j} \}, \end{aligned}$$

where \(\Vert \gamma _j \Vert _{D_j}=\sqrt{ \gamma _j^T D_j \gamma _j }\) and \(\Vert \gamma _j \Vert _{E_j}=\sqrt{ \gamma _j^T E_j \gamma _j }\). Ignoring irrelevant constants, minimizing (6) is then equivalent to minimizing the following quadratic function

$$\begin{aligned} Q_n^{\lambda }(\gamma ):= & {} \frac{1}{n}\sum _{i=1}^n { w_i(e_i-\varsigma )^2 }+ \frac{n}{2} \sum _{k=1}^p { \frac{p_{\lambda _1}^{\prime }(\Vert \gamma _k^{(0)}\Vert _{D_k})}{ \Vert \gamma _k^{(0)}\Vert _{D_k} }\gamma _k^T D_k \gamma _k } \\&+ \frac{n}{2} \sum _{k=1}^p { \frac{p_{\lambda _2}^{\prime }(\Vert \gamma _k^{(0)}\Vert _{E_k})}{ \Vert \gamma _k^{(0)}\Vert _{E_k} }\gamma _k^T E_k \gamma _k } . \end{aligned}$$

For notational convenience, we introduce

$$\begin{aligned}&\tilde{Y}^{(m)}=Y-\varsigma ^{(m)},~~~~W^{(m)}=\text{ diag }\left\{ w_1^{(m)},w_2^{(m)},\ldots ,w_n^{(m)} \right\} , \\&\Sigma _{\lambda _1}(\gamma ^{(m)})= \text{ diag }\left\{ \frac{p_{\lambda _1}^{\prime }(\Vert \gamma _1^{(m)}\Vert _{D_1})}{ \Vert \gamma _1^{(m)}\Vert _{D_1} }D_1, \frac{p_{\lambda _1}^{\prime }(\Vert \gamma _2^{(m)}\Vert _{D_2})}{ \Vert \gamma _2^{(m)}\Vert _{D_2} }D_2, \ldots , \frac{p_{\lambda _1}^{\prime }(\Vert \gamma _p^{(m)}\Vert _{D_p})}{ \Vert \gamma _p^{(m)}\Vert _{D_p} }D_p \right\} , \\&\Sigma _{\lambda _2}(\gamma ^{(m)})= \text{ diag }\left\{ \frac{p_{\lambda _2}^{\prime }(\Vert \gamma _1^{(m)}\Vert _{E_1})}{ \Vert \gamma _1^{(m)}\Vert _{E_1} }E_1, \frac{p_{\lambda _2}^{\prime }(\Vert \gamma _2^{(m)}\Vert _{E_2})}{ \Vert \gamma _2^{(m)}\Vert _{E_2} }E_2, \ldots , \frac{p_{\lambda _2}^{\prime }(\Vert \gamma _p^{(m)}\Vert _{E_p})}{ \Vert \gamma _p^{(m)}\Vert _{E_p} }E_p \right\} , \end{aligned}$$

where \(\text{ diag }\{\cdot \}\) in the last two displays denotes a block-diagonal matrix.

Therefore, the computational algorithm can be implemented as follows:

  • Step 0: Choose the unpenalized estimate \(\check{\gamma }\) as the initial value \(\gamma ^{(0)}\) and set \(m=0\).

  • Step 1: Update \(\gamma ^{(m)}\) to obtain \(\gamma ^{(m+1)}\) by

    $$\begin{aligned} \gamma ^{(m+1)} = \arg \min _{\gamma } Q_n^{\lambda }(\gamma ) =\left\{ Z^T W^{(m)} Z + \frac{n^2}{2} \Sigma _{\lambda _1}(\gamma ^{(m)}) + \frac{n^2}{2} \Sigma _{\lambda _2}(\gamma ^{(m)}) \right\} ^{-1} Z^T W^{(m)} \tilde{Y}^{(m)}, \end{aligned}$$

    where \(\tilde{Y}=(\tilde{Y}_1,\ldots ,\tilde{Y}_n)^T\), \(Z=(Z_1,\ldots ,Z_n)^T\) with \(Z_i= \big (B_1(X_{i1})^T,\ldots ,B_p(X_{ip})^T \big )^T\).

  • Step 2: Set \(m=m+1\) and return to Step 1.

  • Step 3: Repeat Steps 1 and 2 until convergence.
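The steps above can be sketched in Python as follows; this is one possible implementation under our own naming, it reuses wilcoxon_weights and scad_deriv from the earlier sketches, uses an ordinary least squares fit in place of the unpenalized rank estimate in Step 0 for brevity, and adds a small constant eps to guard against division by a vanishing norm.

```python
import numpy as np

def penalized_rank_fit(Y, Z_blocks, D, E, lam1, lam2, a=3.7,
                       max_iter=100, tol=1e-6, eps=1e-8):
    """LQA iteration for the penalized rank objective (Steps 0-3).

    Z_blocks : list of p arrays of shape (n, K), the centered spline bases B_j(X_{ij})^T
    D, E     : lists of the K x K Gram matrices D_j and E_j
    """
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean()                          # estimate u by the sample mean, then take u = 0
    n, p, K = len(Y), len(Z_blocks), Z_blocks[0].shape[1]
    Z = np.hstack(Z_blocks)                   # n x (pK) design matrix
    gamma = np.linalg.lstsq(Z, Y, rcond=None)[0]   # crude LS start (stand-in for Step 0)

    for _ in range(max_iter):
        resid = Y - Z @ gamma
        w, med = wilcoxon_weights(resid)      # Sievers-Abebe weights and residual median
        W = np.diag(w)
        Sigma = np.zeros((p * K, p * K))      # block-diagonal LQA penalty matrix
        for j in range(p):
            g = gamma[j * K:(j + 1) * K]
            nD = max(np.sqrt(g @ D[j] @ g), eps)
            nE = max(np.sqrt(g @ E[j] @ g), eps)
            Sigma[j * K:(j + 1) * K, j * K:(j + 1) * K] = (
                scad_deriv(nD, lam1, a) / nD * D[j]
                + scad_deriv(nE, lam2, a) / nE * E[j])
        lhs = Z.T @ W @ Z + 0.5 * n ** 2 * Sigma
        gamma_new = np.linalg.solve(lhs, Z.T @ W @ (Y - med))   # Step 1 update
        if np.max(np.abs(gamma_new - gamma)) < tol:
            return gamma_new
        gamma = gamma_new
    return gamma
```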

Remark 2

As a stopping rule for the above estimation procedure, we terminate the iteration when the change in \(\hat{\gamma }\) between the m-th and \((m+1)\)-th iterations falls below a pre-specified threshold.

4.2 Extra parameters selections

To achieve good numerical performance, one needs to choose the number of interior knots \(K_n\) and the tuning parameters \(\lambda _1\) and \(\lambda _2\) appropriately. Here we fix the spline order at \(q=4\); that is, cubic splines are used in all numerical implementations. We then use 5-fold cross-validation (CV) to select \(K_n\) and \(\lambda =(\lambda _1,\lambda _2)^T\) simultaneously. Specifically, we randomly divide the data into five roughly equal parts, denoted \(\{(X_i^T, Y_i )^T,~i\in S(j)\}\) for \(j = 1, 2, \ldots , 5\), where S(j) is the set of subject indices in the jth part. For each j, we treat \(\{(X_i^T, Y_i )^T,~i\in S(j)\}\) as the validation set and the remaining four parts as the training set. For any candidate \((K_n,\lambda ^T)^T\), we apply the proposed penalized spline procedure to the training set to estimate \(\{f_{0k}(\cdot )\}_{k=1}^p\) by solving (5), and then compute the predictions \(\hat{Y}_i=\sum _{k=1}^p { \hat{f}_{k}(X_{ik}) }\) for all \( i\in S(j) \). The cross-validation error for a fixed \((K_n,\lambda ^T)^T\) is then defined as

$$\begin{aligned} CV_5(K_n,\lambda )=\sum _{j=1}^5{ \sum _{i\in S(j)} { \left\{ \frac{R(e_i(\hat{f}))}{n+1}-\frac{1}{2} \right\} e_i(\hat{f}) }}, \end{aligned}$$
(8)

where \(e_i(\hat{f})=Y_i-\sum _{j=1}^p { \hat{f}_{j}(X_{ij}) }\) and \(R(e_i(\hat{f}))\) represents the rank of \(e_i(\hat{f})\) among \(\{ e_i(\hat{f}) \}_{i=1}^n\). Finally, the optimal \(K_n\) and \(\lambda \) are selected by minimizing the cross validation error \(CV_5(K_n,\lambda )\).
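A possible implementation of this 5-fold search is sketched below; fit_fn and predict_fn are hypothetical placeholders for the penalized rank fit and its prediction rule, and the grids over \(K_n\) and \((\lambda _1,\lambda _2)\) are left to the user.

```python
import numpy as np
from scipy.stats import rankdata

def cv5_select(Y, X, fit_fn, predict_fn, Kn_grid, lam_grid, seed=0):
    """5-fold CV search over (K_n, lambda_1, lambda_2) using criterion (8).

    fit_fn(Y_train, X_train, Kn, lam1, lam2) -> fitted model   (placeholder)
    predict_fn(model, X_valid)               -> predictions    (placeholder)
    lam_grid is an iterable of (lam1, lam2) pairs.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), 5)
    best_cv, best_par = np.inf, None
    for Kn in Kn_grid:
        for lam1, lam2 in lam_grid:
            e = np.empty(n)
            for idx in folds:
                train = np.setdiff1d(np.arange(n), idx)
                model = fit_fn(Y[train], X[train], Kn, lam1, lam2)
                e[idx] = Y[idx] - predict_fn(model, X[idx])
            cv = np.sum((rankdata(e) / (n + 1) - 0.5) * e)   # criterion (8)
            if cv < best_cv:
                best_cv, best_par = cv, (Kn, lam1, lam2)
    return best_par
```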

Remark 3

As stated in Feng et al. (2015), the variable selection results are hardly affected by the selection procedure for \(K_n\). Therefore, to reduce the computational burden, one may first fit the additive model (1) without any penalization and use the above 5-fold cross-validation to select an optimal \(K_n\), and then fix this \(K_n\) in (8) to select the optimal \(\lambda \).

5 Numerical examples

5.1 Monte Carlo simulation

We generate our sample from the following additive model:

$$\begin{aligned} Y_i=\sum _{j=1}^{10} {f_{0j}(X_{ij})}+0.3 \varepsilon _i, \end{aligned}$$
(9)

where \(f_{01}(x)= \sin (2\pi x)\), \(f_{02}(x)=6x(1-x)\), \(f_{03}(x)=2x\), \(f_{04}(x)=x\), \(f_{05}(x)=-x\), \(f_{06}(x)=-2x\) and \(f_{0j}(x) \equiv 0\) for \(j=7,\ldots ,10\). Thus there are 2 nonparametric components and 4 nonzero linear components. The covariates \(X_i=(X_{i1}, X_{i2},\ldots , X_{i10})^T\) are generated from a multivariate normal distribution with standard normal marginals and correlation \(0.5^{|j_1-j_2|}\) between \(X_{ij_1}\) and \(X_{ij_2}\). A similar model setting was used in Lian (2012a), without the last four zero functions, because only identification of the linear components was considered there. Beforehand, we apply the cumulative distribution function of the standard normal distribution to transform each \(X_{ij}\) to be marginally uniform on [0,1]. Four methods are compared in this example: the LS method of Lian (2012a), the method of Lian (2012b) at the 0.5th quantile (QR), composite quantile regression (CQR) of Kai et al. (2010) with nine quantiles, and our proposed rank regression (RR).
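For reproducibility, one way to generate a data set from model (9) is sketched below (our own code; we take 10 in 0.1N(0,10) to be the variance, which is an assumption).

```python
import numpy as np
from scipy.stats import norm

def simulate_model9(n, error="normal", seed=0):
    """Generate one data set from model (9): 2 nonlinear, 4 linear and 4 zero components."""
    rng = np.random.default_rng(seed)
    p = 10
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # corr 0.5^{|j1-j2|}
    X = norm.cdf(rng.multivariate_normal(np.zeros(p), Sigma, size=n))     # marginally U[0, 1]
    if error == "normal":
        eps = rng.standard_normal(n)
    elif error == "t3":
        eps = rng.standard_t(3, size=n)
    elif error == "mixture":                 # 0.9 N(0,1) + 0.1 N(0,10); 10 taken as the variance
        heavy = rng.random(n) < 0.1
        eps = rng.normal(0.0, np.where(heavy, np.sqrt(10.0), 1.0))
    elif error == "lognormal":
        eps = rng.lognormal(size=n)
    else:                                    # Exp(1)
        eps = rng.exponential(size=n)
    f = (np.sin(2 * np.pi * X[:, 0]) + 6 * X[:, 1] * (1 - X[:, 1])
         + 2 * X[:, 2] + X[:, 3] - X[:, 4] - 2 * X[:, 5])
    return X, f + 0.3 * eps
```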

Table 1 Component selection results with \(n=100\)

To examine the robustness and efficiency of the proposed method, five error distributions are considered: the standard normal N(0,1); the heavy-tailed t(3) distribution; the normal mixture 0.9N(0,1) \(+\) 0.1N(0,10) (MN), which generates outliers; and two asymmetric errors, the lognormal (LN) and the exponential Exp(1) distributions. For each scenario, 200 data sets are generated, and the results for \(n=100\) and \(n=200\) are summarized in Tables 1, 2, 3 and 4. Tables 1 and 2 report the average number of nonparametric components selected (NN), the average number of true nonlinear components selected (NNT), the average number of linear components selected (NL), and the average number of true linear components selected (NLT). Tables 3 and 4 present the performance of the estimates of the first six nonzero component functions in terms of the root mean squared error (RMSE), defined by \(\text{ RMSE }_j=\left\{ \frac{1}{n_{grid}} \sum _{i= 1}^{n_{grid}} { (\hat{f}_{j}(u_i)-f_{0j}(u_i)) ^2 } \right\} ^{1/2}\), where \(\{u_i,i=1,2,\ldots ,n_{grid}\}\) are the grid points at which \(f_{0j}(\cdot )\) is evaluated.
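The RMSE criterion can be computed directly from an estimated component evaluated on a grid; a small helper (ours, with the grid size chosen arbitrarily) is shown below.

```python
import numpy as np

def component_rmse(f_hat, f_true, n_grid=200):
    """RMSE_j of an estimated component over n_grid equally spaced points on [0, 1]."""
    u = np.linspace(0, 1, n_grid)
    return np.sqrt(np.mean((f_hat(u) - f_true(u)) ** 2))

# example with a hypothetical (uniformly shifted) estimate of f_01(x) = sin(2*pi*x)
print(component_rmse(lambda u: np.sin(2 * np.pi * u) + 0.05,
                     lambda u: np.sin(2 * np.pi * u)))   # = 0.05
```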

Table 2 Component selection results with \(n=200\)
Table 3 Root mean squared errors for \(f_{01},\ldots ,f_{06}\) with \(n=100\)
Table 4 Root mean squared errors for \(f_{01},\ldots ,f_{06}\) with \(n=200\)

Several observations emerge from Tables 1, 2, 3 and 4: (1) the proposed RR method performs similarly to the CQR method in most situations; (2) for the normal error, the RR and CQR estimators are comparable to the LS estimator in terms of model selection and estimation accuracy, and all three are much superior to the QR estimator; (3) for the other four error types, the LS method performs poorly, whereas RR and CQR are considerably more efficient than QR, although all three are robust to the error structure in comparison with LS; (4) the model identification performance and estimation accuracy of all methods improve as the sample size n increases, which corroborates the theoretical properties. These results indicate that the CQR and RR procedures are highly efficient in estimating and identifying the nonzero components while simultaneously discriminating the linear components from the nonlinear ones, and that they are robust and adaptive to different error distributions. However, in contrast with CQR, whose performance depends on the number of quantiles to combine, a meta parameter that balances the behavior of the LS and absolute-deviation-based methods, the proposed RR procedure requires no such meta parameter, which reduces the computational burden.

Table 5 Estimation and model identification results with LASSO, ALASSO and MCP
Table 6 Component selection results in Boston housing price data
Fig. 1
figure 1

The selected components and their fits for the Boston housing price data based on the LS method

Following the anonymous reviewers’ valuable suggestions, we have added simulations to evaluate the performance of the proposed RR method under the LASSO, adaptive LASSO and MCP penalties. The results based on 200 replications are reported in Table 5, where \(\text{ RMSE }(f)\) stands for the root mean squared error of \(f=\sum _{j=1}^{10} {f_{0j}}\). The adaptive LASSO and MCP perform similarly, and both clearly outperform the LASSO, which performs poorly. This is expected, because the adaptive LASSO and MCP enjoy model selection consistency whereas the LASSO does not. In addition, we have conducted further simulations under heavier sparsity with 21 component functions, in which the first 6 functions are the same as in model (9) and the remaining 15 are zero. The results are similar to those for the setting originally considered in model (9), so we omit them to save space.

Fig. 2
figure 2

The selected components and their fits for the Boston housing price data based on the QR method

Fig. 3
figure 3

The selected components and their fits for the Boston housing price data based on the RR method

5.2 Application to Boston housing price data

In this section, we apply the proposed method to the Boston housing price data, which have been analyzed by Yu and Lu (2004) and Xue (2009), among others. We take the median value of owner-occupied homes in $1000’s (medv) as the response variable. The covariates include per capita crime rate by town (crim), proportion of residential land zoned for lots over 25,000 sq.ft (zn), proportion of non-retail business acres per town (indus), nitric oxides concentration in parts per 10 million (nox), average number of rooms per dwelling (rm), proportion of owner-occupied units built prior to 1940 (age), weighted distances to five Boston employment centers (dis), index of accessibility to radial highways (rad), full-value property tax per $10,000 (tax), pupil-teacher ratio by town (ptratio), a parabolic function of the relative size of the Black population in the town (black), and percentage of lower status of the population (lstat). Beforehand, all covariates are standardized to have mean zero and unit variance, and the cumulative distribution function of the standard normal distribution is used to transform them to be marginally uniform on [0,1]. We then apply the LS, QR and RR methods to analyze the data set via the additive model (1).
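The preprocessing just described amounts to the following sketch; the file name boston.csv and the column labels are assumptions about how the data are stored locally.

```python
import pandas as pd
from scipy.stats import norm

# Load the data; the file name and column labels are assumptions about local storage.
boston = pd.read_csv("boston.csv")
y = boston["medv"].to_numpy()
covariates = ["crim", "zn", "indus", "nox", "rm", "age",
              "dis", "rad", "tax", "ptratio", "black", "lstat"]
X = boston[covariates].to_numpy(dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize to mean zero and unit variance
X = norm.cdf(X)                           # transform to be (approximately) marginally uniform on [0, 1]
```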

Fig. 4
figure 4

a Normal QQ-plot of the residuals from the RR method. b Boxplots of the MAPE for the Boston housing data, where the dashed, dot-dashed and long-dashed horizontal lines represent the average MAPEs based on the LS, QR and RR methods, respectively

The component selection results are presented in Table 6, in which 0, 1 and 2 denote covariates identified as zero, linear and nonlinear components, respectively. As Table 6 shows, all three methods indicate that rm, rad and black have nonnegative effects on house prices, which coincides with common intuition about their effects. In addition, the LS approach removes the three covariates zn, indus and age from the final model as unimportant and identifies the remaining nine covariates as nonlinear components. The QR method identifies the three covariates zn, indus and rad as zero components, the three covariates crim, age and ptratio as linear components, and the remaining six covariates as nonlinear components. The RR method identifies the four covariates zn, indus, age and rad as zero components, the three covariates crim, dis and black as linear components, and the remaining five covariates as nonlinear components. Similar conclusions can be drawn from the corresponding fits presented in Figs. 1, 2 and 3. Evidently, the proposed rank approach yields the most parsimonious model among the three methods.

For a further assessment of the applicability of the RR method, Fig. 4a displays the normal QQ-plot of the residuals from the RR procedure, which suggests that the error term of the Boston housing data probably comes from a non-normal distribution. Moreover, to compare the proposed RR procedure with the LS and QR methods, Fig. 4b gives boxplots of the mean absolute prediction error (MAPE), obtained from 200 replications, in each of which 400 observations are randomly drawn. Clearly, the RR method performs best, having both the smallest mean MAPE and the smallest variance. Consequently, taking into account model complexity and prediction performance, the proposed rank-based regression is a preferred method for analyzing this data set.

6 Concluding remarks

In this paper, a novel and robust procedure based on rank regression and spline approximation was developed for model identification in semiparametric additive models. By adding a two-fold SCAD penalty, the proposed method simultaneously estimates and identifies the nonzero components as well as the linear components. Theoretical properties of the estimators of both the nonparametric parts and the linear parameters were derived under mild conditions. In addition, we showed that the proposed rank estimator is highly efficient across a wide spectrum of error distributions; even in the worst-case scenario, the ARE of the proposed rank estimate versus the least squares estimate, whose expression is closely related to that of the signed-rank Wilcoxon test relative to the t-test, is bounded below by 0.864 for the linear parameters. Furthermore, we presented an efficient algorithm for computation and discussed the selection of the tuning parameters. Extending this work to generalized additive models or other nonparametric models appears to be a promising and useful direction for practitioners, which we leave for future work.