1 Introduction

Consider the following additive regression model

$$\begin{aligned} Y_i=u+ \sum _{j=1}^p { f_{0j}(X_{ij}) } + \varepsilon _i, \end{aligned}$$
(1)

where \(X_i=(X_{i1},X_{i2},\ldots ,X_{ip})^T\) is a p-dimensional covariate, \(\{ f_{0j}(\cdot ), j=1,2,\ldots ,p \}\) are unknown smooth functions satisfying \(\text{ E }\{ f_{0j}(X_{ij}) \}=0\) for model identifiability, and \(\varepsilon _i\) is a random error independent of \(X_i\). Such an additive approximation offers at least two benefits. First, an additive combination of univariate functions is more interpretable and easier to fit than a joint multivariate nonparametric model. Second, the so-called “curse of dimensionality” that besets multivariate nonparametric regression is largely circumvented, because each additive component can be estimated with a univariate smoother in an iterative fashion. Consequently, a large body of work has been devoted to this model; see, for instance, Yu and Lu (2004), Mammen and Park (2006), Yu et al. (2008), Xue (2009) and Lian (2012a, b).

Although model (1) has many appealing properties, Opsomer and Ruppert (1999) noticed that in practice some covariates may have linear or even no effects on the response while others enter nonlinearly, and recommended the semiparametric additive model (SPAM) of the form

$$\begin{aligned} Y_i=u+ \sum _{j=1}^{p_0} { f_{0j}(X_{ij}) } + \sum _{j=p_0+1}^p { X_{ij}\beta _{0j} } + \varepsilon _i. \end{aligned}$$
(2)

Statistically, the SPAM can be more parsimonious than the general additive model and has therefore attracted considerable attention; see Härdle et al. (2004), Deng and Liang (2010), Liu et al. (2011), Wei and Liu (2012) and Wei et al. (2012), among others. Nevertheless, all these works on the SPAM assume that the split between the linear and nonlinear parts is known in advance, which is not always true in practice. A misspecified structure not only increases model complexity but also reduces estimation accuracy: since the optimal parametric estimation rate is \(n^{-1/2}\) while the optimal nonparametric rate is \(n^{-2/5}\), treating a parametric component as nonparametric over-fits the data and leads to a loss of efficiency. Model identification is therefore important for model (1), and it is of great interest to develop efficient methods that distinguish the nonzero components and separate the linear components from the nonlinear ones.

In general, this goal could be achieved by hypothesis testing as in Jiang et al. (2007), but such tests become cumbersome when more than a few predictors must be examined, and the theoretical properties of identification procedures based on hypothesis testing can be hard to analyze. To address this, Huang et al. (2010) presented a new use of the SCAD penalty and related methods, applying it to nonparametric additive models to identify zero components and parametric components. Following a similar idea, Zhang et al. (2011) simultaneously identified the zero and linear components of partially linear models using two penalty functions within an elegant mathematical framework; Lian (2012a) provided a way to determine the linear components of additive models based on least squares (LS) regression; Lian (2012b) identified the nonzero and linear components of model (1) in conditional quantile regression; and Wang and Song (2013) applied the SCAD penalty to identify the model structure in semiparametric varying-coefficient partially linear models. Note that all these papers were built on either LS regression, which is sensitive to outliers and inefficient under many commonly used non-normal errors, or quantile regression, whose efficiency depends on the error density at the chosen quantile. Hence, it is highly desirable to develop an efficient and robust method that simultaneously conducts model identification and estimation.

Recently, Wang et al. (2009) proposed a novel procedure for the varying coefficient model based on rank regression and demonstrated that the new method is highly efficient across a wide class of error distributions while retaining comparable efficiency in the worst-case scenario relative to LS regression. Similar conclusions on rank regression have been confirmed in Leng (2010), Sun and Lin (2014), Feng et al. (2015) and the references therein. To the best of our knowledge, none of these approaches has been studied for the SPAM. Motivated by these observations, this paper extends rank regression to the SPAM for identifying nonzero components as well as linear components. Specifically, we first embed the SPAM into an additive model and use the spline method to approximate the unknown functions. A two-fold SCAD penalty is then employed to discriminate the nonzero components, and the linear components from the nonlinear ones, by penalizing both the component functions and their second derivatives. Furthermore, we establish the theoretical properties of the estimator and, based on the asymptotic theory for the linear components, show that the proposed rank estimate achieves substantial efficiency gains across a wide spectrum of non-normal error distributions while losing little efficiency under normal errors compared with the LS estimate. Even in the worst-case scenario, the asymptotic relative efficiency (ARE) of the proposed rank estimate versus the LS estimate has a lower bound of 0.864. In addition, the ARE of the proposed rank estimate versus LS has an expression closely related to that of the signed-rank Wilcoxon test relative to the t-test.

The rest of this paper is organized as follows. In Sect. 2, we introduce the new penalized rank regression method based on basis expansion and the SCAD penalty. In Sect. 3, the asymptotic properties are established under suitable conditions. The selection of the tuning parameters is discussed in Sect. 4, along with a computational algorithm for implementation. Sect. 5 illustrates the finite sample performance of the proposed procedure via simulation studies, and concluding remarks are given in Sect. 6. All technical proofs are deferred to the Appendix.

2 Rank-based shrinkage regression for additive models

Suppose that \(\{ X_i, Y_i \}_{i=1}^n\) is an independent and identically distributed sample from model (2). Without loss of generality, we assume that the distribution of \(X_i\) is supported on [0,1]. As we do not know in advance which covariates have linear effects, all p components are treated as nonparametric, and polynomial splines are used to approximate them. Let \(0=\xi _0< \xi _1< \cdots< \xi _{K_n} < \xi _{K_n+1}=1\) be a partition of [0,1] into \(K_n+1\) subintervals \([\xi _k, \xi _{k+1}), k=0,1,\ldots ,K_n\), where \(K_n\) denotes the number of interior knots, which increases with the sample size n. A polynomial spline of order q is a function whose restriction to each subinterval is a polynomial of degree \(q-1\) and which is globally \(q-2\) times continuously differentiable on [0,1]. The collection of splines with a fixed sequence of knots has a normalized B-spline basis \(\{ B_1(x),B_2(x),\ldots ,B_{K^\prime }(x) \}\) with \(K^\prime =K_n+q\).

Note that the constraint \(\text{ E }\{ f_{0j}(X_{ij}) \}=0\) is required for model identifiability, so we work in the space of spline functions \(S_j^0:= \{ \hbar : \hbar (x)=\sum _{k=1}^K { \gamma _{jk}B_{jk}(x) },~ \sum _{i=1}^n { \hbar (X_{ij}) }=0 \}\) with the centered basis \(\{ B_{jk}(x)=B_{k}(x)- \sum _{i=1}^n { B_{k}(X_{ij})/n },~ k=1,2,\ldots ,K \}\), where \(K=K^\prime -1\) because the empirical version of the constraint removes one degree of freedom. Then the nonlinear functions in model (1) can be approximated by

$$\begin{aligned} f_{0j}(x) \approx \sum _{k=1}^K { \gamma _{jk}B_{jk}(x) },~~~~j=1,2,\ldots ,p. \end{aligned}$$
(3)

For simplicity, we restrict our attention to equally spaced knots, although other regular knot sequences, such as quasi-uniform or data-driven choices, can be considered. It is also possible to specify a different \(K_n\) for each component. However, using equally spaced knots and the same number of knots for every component allows a much simpler exposition, and as in most of the spline-based literature, similar asymptotic results can be shown to hold for different choices of \(K_n\) and different knots across components. Let \(\gamma _j=(\gamma _{j1}, \gamma _{j2}, \ldots ,\gamma _{jK})^T\) and \(B_j(x)=\big ( B_{j1}(x),B_{j2}(x),\ldots ,B_{jK}(x) \big )^T\). Following the approximation (3), model (1) can be rewritten as

$$\begin{aligned} Y_i \approx u+ \sum _{j=1}^p { \sum _{k=1}^K { \gamma _{jk}B_{jk}(X_{ij}) } } + \varepsilon _i = u+ \sum _{j=1}^p { B_j(X_{ij})^T \gamma _j } + \varepsilon _i . \end{aligned}$$

Accordingly, the residual for estimating \(Y_i\) at \(X_i\) is \(e_i=Y_i-u-\sum _{j=1}^p { B_j(X_{ij})^T \gamma _j }\).
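To make the construction concrete, the following Python sketch (our own illustration, not code from the paper) builds the centered B-spline basis described above; the default knot placement, the cubic default and the choice of which column to drop are our assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_knots=5, degree=3):
    """Evaluate a B-spline basis of order q = degree + 1 with n_knots equally spaced
    interior knots on [0, 1]; returns an (n, K') matrix with K' = n_knots + degree + 1."""
    interior = np.linspace(0, 1, n_knots + 2)[1:-1]
    t = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]  # clamped knot vector
    K_prime = len(t) - degree - 1                                   # = K_n + q
    return np.column_stack([BSpline(t, np.eye(K_prime)[k], degree)(x)
                            for k in range(K_prime)])

def centered_basis(x, n_knots=5, degree=3):
    """Empirically centered basis B_{jk}(x) = B_k(x) - n^{-1} sum_i B_k(X_{ij});
    one column is dropped so that K = K' - 1."""
    B = bspline_basis(x, n_knots, degree)
    return (B - B.mean(axis=0))[:, :-1]
```

Stacking `centered_basis(X[:, j])` over \(j=1,\ldots ,p\) then gives the rows \(\big (B_1(X_{i1})^T,\ldots ,B_p(X_{ip})^T\big )\) that appear in the design matrix used below.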

Applying the rank regression technique, we propose the following minimization problem

$$\begin{aligned} \check{\gamma }=\arg \min _{\gamma }L_n(\gamma ):=\frac{1}{n}\sum _{i<j} { |e_i-e_j| }, \end{aligned}$$
(4)

where \(\gamma =\big ( \gamma _1^T,\gamma _2^T,\ldots ,\gamma _p^T \big )^T\). Thus the estimated component functions are \(\check{f}_j(x)= B_j(x)^T\check{\gamma }_j\). Note that the loss function \(L_n(\gamma )\) is essentially a version of Gini’s mean difference of the residuals, a classical measure of concentration or dispersion; see David (1998) for details. In addition, it is worth mentioning that the above rank-based loss function cannot produce an estimate of the intercept u, because u cancels in \(e_{i}-e_{j}\); this is a unique feature of using this type of estimate in the present problem. As pointed out in Wang et al. (2009), an additional location constraint on the random errors is essential to make the intercept identifiable, and they adopted the commonly used constraint that \(\varepsilon _i\) has median zero. Under the same constraint on \(\varepsilon _i\), a reasonable estimate of u is \(\hat{u}=\sum _{i=1}^n{ Y_i/n }\), which converges at the rate \(1/ \sqrt{n}\), faster than any rate of convergence for nonparametric function estimation. Thus, for notational convenience, we can safely assume \(u=0\), as we do in the sequel.
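The unpenalized loss in (4) can be evaluated in \(O(n\log n)\) time by sorting the residuals; the short sketch below (an illustration under our own naming) uses the identity \(\sum _{i<j}|e_i-e_j|=\sum _{k}(2k-n-1)e_{(k)}\) for the order statistics \(e_{(1)}\le \cdots \le e_{(n)}\).

```python
import numpy as np

def rank_loss(resid):
    """Unpenalized rank loss (4): (1/n) * sum_{i<j} |e_i - e_j|, a scaled Gini mean difference.

    Uses sum_{i<j} |e_i - e_j| = sum_k (2k - n - 1) e_(k) for the sorted residuals, which also
    equals 2(n+1) * sum_i {R(e_i)/(n+1) - 1/2} e_i, the Wilcoxon-score form appearing in (8)."""
    e = np.sort(np.asarray(resid, dtype=float))
    n = len(e)
    return np.sum((2 * np.arange(1, n + 1) - n - 1) * e) / n
```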

Recall that we are interested in finding the zero components and the linear components of model (1). The former can be found by shrinking \(\Vert f_j \Vert \) to zero, and the latter by shrinking the second-derivative norm \(\Vert f_j^{\prime \prime } \Vert \) to zero, because a function is linear if and only if its second derivative is identically zero. Therefore, instead of (4), we consider the following two-fold penalization procedure

$$\begin{aligned} \hat{\gamma }=\arg \min _{\gamma }L_n^{\lambda }(\gamma ):= \frac{1}{n}\sum _{i<j} { |e_i-e_j| } + n \sum _{k=1}^p { p_{\lambda _1}(\Vert f_k \Vert ) } + n \sum _{k=1}^p { p_{\lambda _2}(\Vert f_k^{\prime \prime } \Vert )}, \end{aligned}$$
(5)

where \(p_{\lambda }(\cdot )\) is the SCAD penalty function defined by its first derivative

$$\begin{aligned} p^{\prime }_\lambda (t) = \lambda \left\{ I(t \le \lambda )+\frac{(a\lambda -t)_+}{(a-1)\lambda }I(t > \lambda ) \right\} , \end{aligned}$$

where \(\lambda \) is the penalty parameter and \(a > 2\) is a constant, usually taken to be 3.7 as suggested in Fan and Li (2001). Note that the SCAD penalty is continuously differentiable on \((-\infty ,0) \cup (0,\infty )\) but singular at 0, and its derivative vanishes outside \([-a \lambda ,a \lambda ]\). These features yield a solution with three desirable properties, namely unbiasedness, sparsity and continuity, as defined in Fan and Li (2001).
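For reference, a minimal implementation of the SCAD derivative above, together with the penalty obtained by integrating it from zero, might look as follows (the default a = 3.7 follows Fan and Li 2001; the vectorized form and function names are our own).

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """First derivative p'_lambda(t) of the SCAD penalty for t >= 0."""
    t = np.asarray(t, dtype=float)
    return lam * np.where(t <= lam, 1.0,
                          np.clip(a * lam - t, 0.0, None) / ((a - 1) * lam))

def scad(t, lam, a=3.7):
    """SCAD penalty p_lambda(t), t >= 0, obtained by integrating p'_lambda from 0."""
    t = np.asarray(t, dtype=float)
    mid = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))   # lam < t <= a*lam
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, lam ** 2 * (a + 1) / 2))
```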

Note that \(\Vert f_j \Vert ^2 = \int \big (\sum _{k=1}^K { \gamma _{jk}B_{jk}(x) } \big ) \big ( \sum _{k^{\prime }=1}^K { \gamma _{jk^{\prime }} B_{jk^{\prime }}(x) } \big ){ dx}\) and \(\Vert f_j^{\prime \prime } \Vert ^2 = \int \big ( \sum _{k=1}^K { \gamma _{jk}B_{jk}^{\prime \prime }(x) } \big ) \big ( \sum _{k^{\prime }=1}^K { \gamma _{jk^{\prime }}B_{jk^{\prime }}^{\prime \prime }(x) }\big )dx\), so \(\Vert f_j \Vert \) and \(\Vert f_j^{\prime \prime } \Vert \) can be written as \(\sqrt{ \gamma _j^T D_j \gamma _j }\) and \(\sqrt{ \gamma _j^T E_j \gamma _j }\), respectively, where \(D_j,E_j \in R^{K \times K}\) have \((k,k^{\prime })\) entries \(\int B_{jk}(x) B_{jk^{\prime }}(x)dx\) and \(\int B_{jk}^{\prime \prime }(x) B_{jk^{\prime }}^{\prime \prime }(x)dx\), respectively. The minimization problem (5) is then equivalent to

$$\begin{aligned} \hat{\gamma }= & {} \arg \min _{\gamma }L_n^{\lambda }(\gamma ):= \frac{1}{n}\sum _{i<j} { |e_i-e_j| } + n \sum _{k=1}^p { p_{\lambda _1}\left( \sqrt{ \gamma _k^T D_k \gamma _k } \right) } \nonumber \\&+\, n \sum _{k=1}^p { p_{\lambda _2}\left( \sqrt{ \gamma _k^T E_k \gamma _k } \right) }. \end{aligned}$$
(6)

Consequently, the estimated component functions are given by \(\hat{f}_j(x)= B_j(x)^T \hat{\gamma }_j\).
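The Gram matrices \(D_j\) and \(E_j\) can be precomputed once per component. A possible numerical sketch (our own, using trapezoidal quadrature on a fine grid and centering over that grid as a stand-in for the empirical centering) is given below.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import trapezoid

def penalty_grams(n_knots=5, degree=3, n_grid=2001):
    """Gram matrices D (entries int B_k B_k' dx) and E (entries int B_k'' B_k'' dx) of a
    centered B-spline basis on [0, 1], approximated by trapezoidal quadrature."""
    interior = np.linspace(0, 1, n_knots + 2)[1:-1]
    t = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    K_prime = len(t) - degree - 1
    x = np.linspace(0, 1, n_grid)
    basis = [BSpline(t, np.eye(K_prime)[k], degree) for k in range(K_prime)]
    B0 = np.column_stack([b(x) for b in basis])
    B2 = np.column_stack([b.derivative(2)(x) for b in basis])
    B0 = (B0 - B0.mean(axis=0))[:, :-1]   # center over the grid, drop one column (K = K' - 1)
    B2 = B2[:, :-1]                       # subtracting a constant does not change f''
    D = trapezoid(B0[:, :, None] * B0[:, None, :], x, axis=0)
    E = trapezoid(B2[:, :, None] * B2[:, None, :], x, axis=0)
    return D, E
```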

3 Theoretical properties

3.1 Asymptotic properties

Without loss of generality, assume that \(f_{0j}\) is truly nonparametric for \(j=1,2,\ldots ,p_0\), linear for \(j=p_0+1,p_0+2,\ldots ,s\), with the true slope parameters denoted by \(\beta _0=(\beta _{0,p_0+1},\beta _{0,p_0+2},\ldots ,\beta _{0,s})^T\), and zero for \(j=s+1, s+2,\ldots ,p\). The vectors \(X^{(1)}=(X_1,X_2,\ldots ,X_{p_0})^T\) and \(X^{(2)}=(X_{p_0+1},X_{p_0+2},\ldots , X_{s})^T\) collect the covariates of the nonlinear and linear components, respectively. Denote by \(\mathcal {A}\) the subspace of functions on \(R^{p_0}\) with an additive form

$$\begin{aligned}&\mathcal {A}:=\{ h(x^{(1)}): h(x^{(1)})=h_1(x_1)+h_2(x_2)+\ldots +h_{p_0}(x_{p_0}), E\big ( h_j(X_j) \big )\\&\quad =0~\text{ and }~E\big ( h_j(X_j)^2 \big ) < \infty \}, \end{aligned}$$

and by \(E_{\mathcal {A}}(M)\) the projection of M onto \(\mathcal {A}\) (applied componentwise when M is a vector), in the sense that

$$\begin{aligned} E\big \{ \big ( M-E_{\mathcal {A}}(M) \big )^2 \big \} = \inf _{h \in \mathcal {A}} E \big \{ \big ( M-h(X^{(1)}) \big )^2 \big \}. \end{aligned}$$

Let \(h(X^{(1)})=E_{\mathcal {A}}(X^{(2)})\). Each component of \(h(X^{(1)})=\big ( h_{(1)}(X^{(1)}),\ldots ,h_{(s-p_0)}(X^{(1)}) \big )^T\) can be written in the form \(h_{(u)}(x)= \sum _{j=1}^{p_0}h_{(u)j}(x_j)\) for some \(h_{(u)j}(x_j) \in S_j^0\). To facilitate the asymptotic analysis, we make the following regularity assumptions.

  1. (A1)

    The density function f(x) of X is absolutely continuous and compactly supported. Without loss of generality, assume that the support of X is \([0,1]^p\). Furthermore, there exist constants \(0< c_1 \le c_2 < \infty \) such that \(c_1 \le f(x) \le c_2\) for all \(x \in [0,1]^p\).

  2. (A2)

    For \(g=f_{0j}, 1 \le j \le p_0\) or \(g=h_{(u)j}, 1 \le u \le s-p_0, 1 \le j \le p_0\), g satisfies a Lipschitz condition of order \(r>1/2\); that is, \(| g^{(\lfloor r \rfloor )}(x_1)- g^{(\lfloor r \rfloor )}(x_2) | \le C | x_1-x_2 |^{r-\lfloor r \rfloor }\), where C is a constant, \(\lfloor r \rfloor \) denotes the largest integer strictly smaller than r, and \(g^{(\lfloor r \rfloor )}\) is the \(\lfloor r \rfloor \)th derivative of g. In addition, the order of the B-spline basis satisfies \(q \ge r + 2\).

  3. (A3)

    The matrix \( \Sigma =E\big \{ \big (X^{(2)}-h(X^{(1)})\big )\big (X^{(2)}-h(X^{(1)})\big )^T \big \}\) is positive definite.

  4. (A4)

    The error \(\varepsilon \) has a positive density function h(x) satisfying \(\int [h^{\prime }(x)]^2 / h(x)\, dx <\infty \); that is, \(\varepsilon \) has finite Fisher information.

Assumptions (A1)–(A2) are common in the polynomial spline estimation literature; see, for example, Huang et al. (2010), Wang and Song (2013), Tang (2015) and Li et al. (2015). It was shown in Li (2000) that the positive definiteness of \(\Sigma \) in (A3) is necessary for the identifiability of the model when the linear components are specified. Assumption (A4) is a regularity condition on the random errors, the same as that used in work on rank regression such as Wang et al. (2009), Hettmansperger and McKean (2011), Sun and Lin (2014) and Feng et al. (2015).

Theorem 1

Suppose that assumptions (A1)–(A4) hold. If the number of knots \(K_n \asymp n^{1/(2r+1)}\), then we have

$$\begin{aligned} \Vert \check{f}_{j}-f_{0j} \Vert ^2=O_p \left( n^{\frac{-2r}{2r+1}} \right) ,~~ j=1,2,\ldots ,p, \end{aligned}$$

where \(\check{f}_{j}=B_j^T \check{\gamma }_j\) is the unpenalized estimate of component function \(f_{0j}\) with \(\check{\gamma }\) generated by solving (4).

Theorem 1 indicates that the nonparametric estimates obtained by our proposed method attain the optimal convergence rates. The following theorem will show that if the tuning parameters \(\lambda _1\) and \(\lambda _2\) are appropriately specified, we can identify the zero parts and linear parts consistently.

Theorem 2

Under the same assumptions of Theorem 1, if \(\max \{ \lambda _1,\lambda _2 \} \rightarrow 0\) and \(n^{r/(2r+1)} \min \{ \lambda _1,\lambda _2 \} \rightarrow \infty \), then with probability tending to 1,

  1. (i)

    \(\Vert \hat{f}_{j}-f_{0j} \Vert ^2=O_p \left( n^{\frac{-2r}{2r+1}} \right) \) for \(j=1,2,\ldots ,p,\)

  2. (ii)

    \( \hat{f}_j \) is a linear function for \( j=p_0+1,p_0+2,\ldots ,s \),

  3. (iii)

    \( \hat{f}_j \equiv 0\) for \( j=s+1,s+2,\ldots ,p \),

where \(\hat{f}_{j}=B_j^T \hat{\gamma }_j\) is the penalized estimate of component function with \(\hat{\gamma }\) generated by solving (5).

Finally, for the linear components, we will show that the estimate of the slope parameter is asymptotically normal.

Theorem 3

Under the same assumptions of Theorem 2, we have

$$\begin{aligned} \sqrt{n}(\hat{\beta }-\beta _0)~\mathop \rightarrow \limits ^d~ N \bigg ( 0,~ \frac{1}{12 \tau ^2} \Sigma ^{-1} \bigg ), \end{aligned}$$
(7)

where \(\Sigma \) is defined in assumption (A3) and \(\tau =\int h(x)^2 dx\).

Remark 1

Based on the results of Theorems 2 and 3, we observe that the proposed estimator enjoys an oracle property, in the sense that it is asymptotically equivalent to the oracle estimator obtained when the true model structure is known in advance.

3.2 Asymptotic relative efficiency

Denote by \(\hat{\beta }_{LS}\) and \(\hat{\beta }_{RR}\) the estimates of \(\beta _0\) obtained by the LS regression of Lian (2012a) and by our proposed rank regression, respectively. To measure efficiency, we compare the asymptotic variances of \(\hat{\beta }_{LS}\) and \(\hat{\beta }_{RR}\), since both are asymptotically unbiased. Hence, based on the asymptotic distribution of \(\hat{\beta }_{LS}\) given in Theorem 3 of Lian (2012a) and on (7) in Theorem 3 above, we obtain the following result.

Theorem 4

The ARE of the rank-based estimate \(\hat{\beta }_{RR}\) relative to the LS estimate \(\hat{\beta }_{LS}\) for the linear parameter \(\beta _0\) is

$$\begin{aligned} ARE (\hat{\beta }_{RR},\hat{\beta }_{LS})=\frac{Var (\hat{\beta }_{LS})}{Var (\hat{\beta }_{RR})} = 12 \sigma ^2 \tau ^2, \end{aligned}$$

where \(\sigma ^2=E (\varepsilon ^2)\). This ARE has a lower bound of 0.864 for estimating the parametric components, attained at the random error density \(h(x)=\frac{3}{20\sqrt{5}}(5-x^2)I(|x|\le \sqrt{5})\).

Note that this ARE coincides with that of the signed-rank Wilcoxon test relative to the t-test. It is well known in the rank-analysis literature that the ARE is as high as 0.955 for the normal error distribution and can be substantially larger than 1 for many heavier-tailed distributions; for instance, it equals 1.5 for the double exponential distribution and about 1.9 for the t distribution with three degrees of freedom.
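These ARE values are easy to verify numerically from the formula \(ARE=12\sigma ^2\tau ^2\) of Theorem 4; the following sketch (ours) reproduces 0.955, 1.5, about 1.9 and the lower bound 0.864.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def are_rank_vs_ls(pdf, var, lower=-np.inf, upper=np.inf):
    """ARE(RR, LS) = 12 * sigma^2 * tau^2, with tau the integral of the squared error density."""
    tau, _ = quad(lambda x: pdf(x) ** 2, lower, upper)
    return 12 * var * tau ** 2

print(are_rank_vs_ls(stats.norm.pdf, 1.0))               # normal: 3/pi ~ 0.955
print(are_rank_vs_ls(stats.laplace.pdf, 2.0))            # double exponential: 1.5
print(are_rank_vs_ls(lambda x: stats.t.pdf(x, 3), 3.0))  # t(3): ~ 1.9
b = np.sqrt(5.0)                                         # least favourable density of Theorem 4
print(are_rank_vs_ls(lambda x: 3 / (20 * b) * (5 - x ** 2), 1.0, -b, b))  # 0.864
```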

4 Algorithm implementation and tuning parameters selections

In this section we first present an iterative estimation procedure that applies the local quadratic approximation (LQA; Fan and Li 2001) to the rank-based objective function \(L_n(\gamma )\) as well as to the two penalty functions \(p_{\lambda _1}(\cdot )\) and \(p_{\lambda _2}(\cdot )\). We then discuss the selection of the additional parameters, namely the number of interior knots \(K_n\) and the tuning parameters \(\lambda _1\) and \(\lambda _2\).

4.1 Algorithm implementation

It is worth noting that commonly used gradient-based optimization techniques are not directly applicable to (6) because the objective is not differentiable at the origin. Following Sievers and Abebe (2004), we approximate the unpenalized loss \(L_n(\gamma )\) by

$$\begin{aligned} L_n(\gamma ) \approx \frac{1}{n}\sum _{i=1}^n { w_i(e_i-\varsigma )^2 }, \end{aligned}$$

where \(\varsigma \) is the median of \(\{e_i\}_{i=1}^n\) and

$$\begin{aligned} w_i= \left\{ \begin{array}{ll} \frac{\frac{R(e_i)}{n+1}-\frac{1}{2}}{e_i-\varsigma }, &{}\quad \text{ for }~e_i \ne \varsigma , \\ 0, &{}\quad \text{ otherwise } \end{array} \right. \mathrm{{ }} \end{aligned}$$

with \(R(e_i)\) being the rank of \(e_i\) among \(\{e_i\}_{i=1}^n\).
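A direct implementation of these weights might look as follows (a sketch with our own function name; scipy's rankdata is used for \(R(e_i)\)).

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_weights(resid):
    """Weights w_i of the quadratic approximation (Sievers and Abebe, 2004):
    w_i = {R(e_i)/(n+1) - 1/2} / (e_i - median(e)), and 0 when e_i equals the median."""
    resid = np.asarray(resid, dtype=float)
    n = len(resid)
    med = np.median(resid)
    score = rankdata(resid) / (n + 1) - 0.5
    w = np.zeros(n)
    ok = resid != med
    w[ok] = score[ok] / (resid[ok] - med)
    return w, med
```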

On the other hand, following Fan and Li (2001), we apply the LQA to the two penalty terms. That is, for a given initial estimate \(\hat{\gamma }_j^{(0)}\), the corresponding weights \(w_i^{(0)}\) and the residual median \(\varsigma ^{(0)}\) can be obtained. If \(\hat{f}_j^{(0)}\) (\(\hat{f}_j^{(0)\prime \prime }\)) is very close to 0, then we set \(\hat{f}_j=0\) (\(\hat{f}_j^{\prime \prime }=0\)). Otherwise, we have

$$\begin{aligned} p_{\lambda _1}(\Vert f_j \Vert ) \approx p_{\lambda _1}(\Vert \gamma _j^{(0)}\Vert _{D_j}) + \frac{1}{2}\frac{p_{\lambda _1}^{\prime }(\Vert \gamma _j^{(0)}\Vert _{D_j})}{ \Vert \gamma _j^{(0)}\Vert _{D_j} } \{ \Vert \gamma _j\Vert ^2_{D_j}-\Vert \gamma _j^{(0)}\Vert ^2_{D_j} \}, \end{aligned}$$

and

$$\begin{aligned} p_{\lambda _2}(\Vert f_j^{\prime \prime } \Vert ) \approx p_{\lambda _2}(\Vert \gamma _j^{(0)}\Vert _{E_j}) + \frac{1}{2}\frac{p_{\lambda _2}^{\prime }(\Vert \gamma _j^{(0)}\Vert _{E_j})}{ \Vert \gamma _j^{(0)}\Vert _{E_j} } \{ \Vert \gamma _j\Vert ^2_{E_j}-\Vert \gamma _j^{(0)}\Vert ^2_{E_j} \}, \end{aligned}$$

where \(\Vert \gamma _j \Vert _{D_j}=\sqrt{ \gamma _j^T D_j \gamma _j }\) and \(\Vert \gamma _j \Vert _{E_j}=\sqrt{ \gamma _j^T E_j \gamma _j }\). Ignoring irrelevant constants, minimizing (6) is then equivalent to minimizing the following quadratic function

$$\begin{aligned} Q_n^{\lambda }(\gamma ):= & {} \frac{1}{n}\sum _{i=1}^n { w_i(e_i-\varsigma )^2 }+ \frac{n}{2} \sum _{k=1}^p { \frac{p_{\lambda _1}^{\prime }(\Vert \gamma _k^{(0)}\Vert _{D_k})}{ \Vert \gamma _k^{(0)}\Vert _{D_k} }\gamma _k^T D_k \gamma _k } \\&+ \frac{n}{2} \sum _{k=1}^p { \frac{p_{\lambda _2}^{\prime }(\Vert \gamma _k^{(0)}\Vert _{E_k})}{ \Vert \gamma _k^{(0)}\Vert _{E_k} }\gamma _k^T E_k \gamma _k } . \end{aligned}$$

For notational convenience, we introduce

$$\begin{aligned}&\tilde{Y}^{(m)}=Y-\varsigma ^{(m)},~~~~W^{(m)}=\text{ diag }\left\{ w_1^{(m)},w_2^{(m)},\ldots ,w_n^{(m)} \right\} , \\&\Sigma _{\lambda _1}(\gamma ^{(m)})= \text{ diag }\left\{ \frac{p_{\lambda _1}^{\prime }(\Vert \gamma _1^{(m)}\Vert _{D_1})}{ \Vert \gamma _1^{(m)}\Vert _{D_1} }D_1, \frac{p_{\lambda _1}^{\prime }(\Vert \gamma _2^{(m)}\Vert _{D_2})}{ \Vert \gamma _2^{(m)}\Vert _{D_2} }D_2, \ldots , \frac{p_{\lambda _1}^{\prime }(\Vert \gamma _p^{(m)}\Vert _{D_p})}{ \Vert \gamma _p^{(m)}\Vert _{D_p} }D_p \right\} , \\&\Sigma _{\lambda _2}(\gamma ^{(m)})= \text{ diag }\left\{ \frac{p_{\lambda _2}^{\prime }(\Vert \gamma _1^{(m)}\Vert _{E_1})}{ \Vert \gamma _1^{(m)}\Vert _{E_1} }E_1, \frac{p_{\lambda _2}^{\prime }(\Vert \gamma _2^{(m)}\Vert _{E_2})}{ \Vert \gamma _2^{(m)}\Vert _{E_2} }E_2, \ldots , \frac{p_{\lambda _2}^{\prime }(\Vert \gamma _p^{(m)}\Vert _{E_p})}{ \Vert \gamma _p^{(m)}\Vert _{E_p} }E_p \right\} , \end{aligned}$$

where \(\text{ diag }\{\cdot \}\) in the last two displays denotes a block-diagonal matrix.

Therefore, the computational algorithm can be implemented as follows:

  • Step 0: Choose the unpenalized estimate \(\check{\gamma }\) as the initial value \(\gamma ^{(0)}\) and set \(m=0\).

  • Step 1: Update \(\gamma ^{(m)}\) to obtain \(\gamma ^{(m+1)}\) by

    $$\begin{aligned} \gamma ^{(m+1)} = \arg \min _{\gamma } Q_n^{\lambda }(\gamma ) =\left\{ Z^T W^{(m)} Z + \frac{n^2}{2} \Sigma _{\lambda _1}(\gamma ^{(m)}) + \frac{n^2}{2} \Sigma _{\lambda _2}(\gamma ^{(m)}) \right\} ^{-1} Z^T W^{(m)} \tilde{Y}^{(m)}, \end{aligned}$$

    where \(\tilde{Y}=(\tilde{Y}_1,\ldots ,\tilde{Y}_n)^T\), \(Z=(Z_1,\ldots ,Z_n)^T\) with \(Z_i= \big (B_1(X_{i1})^T,\ldots ,B_p(X_{ip})^T \big )^T\).

  • Step 2: Set \(m=m+1\) and return to Step 1.

  • Step 3: Repeat Steps 1 and 2 until convergence.
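The steps above can be sketched in Python as follows; this is one possible implementation under our own naming, it reuses wilcoxon_weights and scad_deriv from the earlier sketches, uses an ordinary least squares fit in place of the unpenalized rank estimate in Step 0 for brevity, and adds a small constant eps to guard against division by a vanishing norm.

```python
import numpy as np

def penalized_rank_fit(Y, Z_blocks, D, E, lam1, lam2, a=3.7,
                       max_iter=100, tol=1e-6, eps=1e-8):
    """LQA iteration for the penalized rank objective (Steps 0-3).

    Z_blocks : list of p arrays of shape (n, K), the centered spline bases B_j(X_{ij})^T
    D, E     : lists of the K x K Gram matrices D_j and E_j
    """
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean()                          # estimate u by the sample mean, then take u = 0
    n, p, K = len(Y), len(Z_blocks), Z_blocks[0].shape[1]
    Z = np.hstack(Z_blocks)                   # n x (pK) design matrix
    gamma = np.linalg.lstsq(Z, Y, rcond=None)[0]   # crude LS start (stand-in for Step 0)

    for _ in range(max_iter):
        resid = Y - Z @ gamma
        w, med = wilcoxon_weights(resid)      # Sievers-Abebe weights and residual median
        W = np.diag(w)
        Sigma = np.zeros((p * K, p * K))      # block-diagonal LQA penalty matrix
        for j in range(p):
            g = gamma[j * K:(j + 1) * K]
            nD = max(np.sqrt(g @ D[j] @ g), eps)
            nE = max(np.sqrt(g @ E[j] @ g), eps)
            Sigma[j * K:(j + 1) * K, j * K:(j + 1) * K] = (
                scad_deriv(nD, lam1, a) / nD * D[j]
                + scad_deriv(nE, lam2, a) / nE * E[j])
        lhs = Z.T @ W @ Z + 0.5 * n ** 2 * Sigma
        gamma_new = np.linalg.solve(lhs, Z.T @ W @ (Y - med))   # Step 1 update
        if np.max(np.abs(gamma_new - gamma)) < tol:
            return gamma_new
        gamma = gamma_new
    return gamma
```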

Remark 2

As a stopping rule for the above estimation procedure, we terminate the iteration when the change in \(\hat{\gamma }\) between the m-th and \((m+1)\)-th iterations falls below a pre-specified threshold.

4.2 Extra parameters selections

To achieve good numerical performance, one needs to choose the number of interior knots \(K_n\) and the tuning parameters \(\lambda _1\) and \(\lambda _2\) appropriately. Here we fix the spline order at \(q=4\); that is, cubic splines are used in all numerical implementations. We then use 5-fold cross-validation (CV) to select \(K_n\) and \(\lambda =(\lambda _1,\lambda _2)^T\) simultaneously. Specifically, we randomly divide the data into five roughly equal parts, denoted \(\{(X_i^T, Y_i )^T,~i\in S(j)\}\) for \(j = 1, 2, \ldots , 5\), where S(j) is the set of subject indices in the jth part. For each j, we treat \(\{(X_i^T, Y_i )^T,~i\in S(j)\}\) as the validation set and the remaining four parts as the training set. For any candidate \((K_n,\lambda ^T)^T\), we apply the proposed penalized spline procedure to the training set to estimate \(\{f_{0k}(\cdot )\}_{k=1}^p\) by solving (5), and then compute the predictions \(\hat{Y}_i=\sum _{k=1}^p { \hat{f}_{k}(X_{ik}) }\) for all \( i\in S(j) \). The cross-validation error for a fixed \((K_n,\lambda ^T)^T\) is then defined as

$$\begin{aligned} CV_5(K_n,\lambda )=\sum _{j=1}^5{ \sum _{i\in S(j)} { \left\{ \frac{R(e_i(\hat{f}))}{n+1}-\frac{1}{2} \right\} e_i(\hat{f}) }}, \end{aligned}$$
(8)

where \(e_i(\hat{f})=Y_i-\sum _{j=1}^p { \hat{f}_{j}(X_{ij}) }\) and \(R(e_i(\hat{f}))\) represents the rank of \(e_i(\hat{f})\) among \(\{ e_i(\hat{f}) \}_{i=1}^n\). Finally, the optimal \(K_n\) and \(\lambda \) are selected by minimizing the cross validation error \(CV_5(K_n,\lambda )\).
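A possible implementation of this 5-fold search is sketched below; fit_fn and predict_fn are hypothetical placeholders for the penalized rank fit and its prediction rule, and the grids over \(K_n\) and \((\lambda _1,\lambda _2)\) are left to the user.

```python
import numpy as np
from scipy.stats import rankdata

def cv5_select(Y, X, fit_fn, predict_fn, Kn_grid, lam_grid, seed=0):
    """5-fold CV search over (K_n, lambda_1, lambda_2) using criterion (8).

    fit_fn(Y_train, X_train, Kn, lam1, lam2) -> fitted model   (placeholder)
    predict_fn(model, X_valid)               -> predictions    (placeholder)
    lam_grid is an iterable of (lam1, lam2) pairs.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), 5)
    best_cv, best_par = np.inf, None
    for Kn in Kn_grid:
        for lam1, lam2 in lam_grid:
            e = np.empty(n)
            for idx in folds:
                train = np.setdiff1d(np.arange(n), idx)
                model = fit_fn(Y[train], X[train], Kn, lam1, lam2)
                e[idx] = Y[idx] - predict_fn(model, X[idx])
            cv = np.sum((rankdata(e) / (n + 1) - 0.5) * e)   # criterion (8)
            if cv < best_cv:
                best_cv, best_par = cv, (Kn, lam1, lam2)
    return best_par
```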

Remark 3

As stated in Feng et al. (2015), the variable selection results are hardly affected by the selection procedure for \(K_n\). Therefore, to reduce the computational burden, one may first fit the additive model (1) without any penalization and use the above 5-fold cross-validation to select an optimal \(K_n\), and then fix this \(K_n\) in (8) to select the optimal \(\lambda \).

5 Numerical examples

5.1 Monte Carlo simulation

We generate our sample from the following additive model:

$$\begin{aligned} Y_i=\sum _{j=1}^{10} {f_{0j}(X_{ij})}+0.3 \varepsilon _i, \end{aligned}$$
(9)

where \(f_{01}(x)= \sin (2\pi x)\), \(f_{02}(x)=6x(1-x)\), \(f_{03}(x)=2x\), \(f_{04}(x)=x\), \(f_{05}(x)=-x\), \(f_{06}(x)=-2x\) and \(f_{0j}(x) \equiv 0\) for \(j=7,\ldots ,10\). Thus there are 2 nonparametric components and 4 nonzero linear components. The covariates \(X_i=(X_{i1}, X_{i2},\ldots , X_{i10})^T\) are generated from a multivariate normal distribution with standard normal marginals and correlation \(0.5^{|j_1-j_2|}\) between \(X_{ij_1}\) and \(X_{ij_2}\). A similar model setting was used in Lian (2012a), without the last four zero functions, because only identification of the linear components was considered there. Beforehand, we apply the cumulative distribution function of the standard normal distribution to transform each \(X_{ij}\) to be marginally uniform on [0,1]. Four methods are compared in this example: the LS method of Lian (2012a), the method of Lian (2012b) at the 0.5th quantile (QR), composite quantile regression (CQR) of Kai et al. (2010) with nine quantiles, and our proposed rank regression (RR).
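For reproducibility, one way to generate a data set from model (9) is sketched below (our own code; we take 10 in 0.1N(0,10) to be the variance, which is an assumption).

```python
import numpy as np
from scipy.stats import norm

def simulate_model9(n, error="normal", seed=0):
    """Generate one data set from model (9): 2 nonlinear, 4 linear and 4 zero components."""
    rng = np.random.default_rng(seed)
    p = 10
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # corr 0.5^{|j1-j2|}
    X = norm.cdf(rng.multivariate_normal(np.zeros(p), Sigma, size=n))     # marginally U[0, 1]
    if error == "normal":
        eps = rng.standard_normal(n)
    elif error == "t3":
        eps = rng.standard_t(3, size=n)
    elif error == "mixture":                 # 0.9 N(0,1) + 0.1 N(0,10); 10 taken as the variance
        heavy = rng.random(n) < 0.1
        eps = rng.normal(0.0, np.where(heavy, np.sqrt(10.0), 1.0))
    elif error == "lognormal":
        eps = rng.lognormal(size=n)
    else:                                    # Exp(1)
        eps = rng.exponential(size=n)
    f = (np.sin(2 * np.pi * X[:, 0]) + 6 * X[:, 1] * (1 - X[:, 1])
         + 2 * X[:, 2] + X[:, 3] - X[:, 4] - 2 * X[:, 5])
    return X, f + 0.3 * eps
```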

Table 1 Component selection results with \(n=100\)

To examine the robustness and efficiency of the proposed method, five error distributions are considered: the standard normal N(0,1); the heavy-tailed t(3) distribution; the normal mixture 0.9N(0,1) \(+\) 0.1N(0,10) (MN), which generates outliers; and two asymmetric errors, the lognormal (LN) and the exponential Exp(1) distributions. For each scenario, 200 data sets are generated, and the results for \(n=100\) and \(n=200\) are summarized in Tables 1, 2, 3 and 4. Tables 1 and 2 report the average number of nonparametric components selected (NN), the average number of true nonlinear components selected (NNT), the average number of linear components selected (NL), and the average number of true linear components selected (NLT). Tables 3 and 4 present the performance of the estimates of the first six nonzero component functions in terms of the root mean squared error (RMSE), defined by \(\text{ RMSE }_j=\left\{ \frac{1}{n_{grid}} \sum _{i= 1}^{n_{grid}} { (\hat{f}_{j}(u_i)-f_{0j}(u_i)) ^2 } \right\} ^{1/2}\), where \(\{u_i,i=1,2,\ldots ,n_{grid}\}\) are the grid points at which \(f_{0j}(\cdot )\) is evaluated.
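The RMSE criterion can be computed directly from an estimated component evaluated on a grid; a small helper (ours, with the grid size chosen arbitrarily) is shown below.

```python
import numpy as np

def component_rmse(f_hat, f_true, n_grid=200):
    """RMSE_j of an estimated component over n_grid equally spaced points on [0, 1]."""
    u = np.linspace(0, 1, n_grid)
    return np.sqrt(np.mean((f_hat(u) - f_true(u)) ** 2))

# example with a hypothetical (uniformly shifted) estimate of f_01(x) = sin(2*pi*x)
print(component_rmse(lambda u: np.sin(2 * np.pi * u) + 0.05,
                     lambda u: np.sin(2 * np.pi * u)))   # = 0.05
```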

Table 2 Component selection results with \(n=200\)
Table 3 Root mean squared errors for \(f_{01},\ldots ,f_{06}\) with \(n=100\)
Table 4 Root mean squared errors for \(f_{01},\ldots ,f_{06}\) with \(n=200\)

Several observations emerge from Tables 1, 2, 3 and 4: (1) the proposed RR method performs similarly to the CQR method in most situations; (2) for the normal error, the RR and CQR estimators are comparable to the LS estimator in terms of model selection and estimation accuracy, and all three are much superior to the QR estimator; (3) for the other four error types, the LS method performs poorly, whereas RR and CQR are considerably more efficient than QR, although all three are robust to the error structure in comparison with LS; (4) the model identification performance and estimation accuracy of all methods improve as the sample size n increases, which corroborates the theoretical properties. These results indicate that the CQR and RR procedures are highly efficient in estimating and identifying the nonzero components while simultaneously discriminating the linear components from the nonlinear ones, and that they are robust and adaptive to different error distributions. However, in contrast with CQR, whose performance depends on the number of quantiles to combine, a meta parameter that balances the behavior of the LS and absolute-deviation-based methods, the proposed RR procedure requires no such meta parameter, which reduces the computational burden.

Table 5 Estimation and model identification results with LASSO, ALASSO and MCP
Table 6 Component selection results in Boston housing price data
Fig. 1
figure 1

The selected components and their fits for the Boston housing price data based on the LS method

Following the anonymous reviewers’ valuable suggestions, we have added simulations to evaluate the performance of the proposed RR method under the LASSO, adaptive LASSO and MCP penalties. The results based on 200 replications are reported in Table 5, where \(\text{ RMSE }(f)\) stands for the root mean squared error of \(f=\sum _{j=1}^{10} {f_{0j}}\). The adaptive LASSO and MCP perform similarly, and both clearly outperform the LASSO, which performs poorly. This is expected, because the adaptive LASSO and MCP enjoy model selection consistency whereas the LASSO does not. In addition, we have conducted further simulations under heavier sparsity with 21 component functions, in which the first 6 functions are the same as in model (9) and the remaining 15 are zero. The results are similar to those for the setting originally considered in model (9), so we omit them to save space.

Fig. 2
figure 2

The selected components and their fits for the Boston housing price data based on the QR method

Fig. 3
figure 3

The selected components and their fits for the Boston housing price data based on the RR method

5.2 Application to Boston housing price data

In this section, we apply the proposed method to the Boston housing price data, which have been analyzed by Yu and Lu (2004) and Xue (2009), among others. We take the median value of owner-occupied homes in $1000’s (medv) as the response variable. The covariates include per capita crime rate by town (crim), proportion of residential land zoned for lots over 25,000 sq.ft (zn), proportion of non-retail business acres per town (indus), nitric oxides concentration in parts per 10 million (nox), average number of rooms per dwelling (rm), proportion of owner-occupied units built prior to 1940 (age), weighted distances to five Boston employment centers (dis), index of accessibility to radial highways (rad), full-value property tax per $10,000 (tax), pupil-teacher ratio by town (ptratio), a parabolic function of the relative size of the Black population in the town (black), and percentage of lower status of the population (lstat). Beforehand, all covariates are standardized to have mean zero and unit variance, and the cumulative distribution function of the standard normal distribution is used to transform them to be marginally uniform on [0,1]. We then apply the LS, QR and RR methods to analyze the data set via the additive model (1).
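The preprocessing just described amounts to the following sketch; the file name boston.csv and the column labels are assumptions about how the data are stored locally.

```python
import pandas as pd
from scipy.stats import norm

# Load the data; the file name and column labels are assumptions about local storage.
boston = pd.read_csv("boston.csv")
y = boston["medv"].to_numpy()
covariates = ["crim", "zn", "indus", "nox", "rm", "age",
              "dis", "rad", "tax", "ptratio", "black", "lstat"]
X = boston[covariates].to_numpy(dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize to mean zero and unit variance
X = norm.cdf(X)                           # transform to be (approximately) marginally uniform on [0, 1]
```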

Fig. 4
figure 4

a Normal QQ-plot of the residuals from the RR method. b Boxplots of the MAPE for the Boston housing data, where the dashed, dot-dashed and long-dashed horizontal lines represent the average MAPEs based on the LS, QR and RR methods, respectively

The component selection results are presented in Table 6, in which 0, 1 and 2 denote covariates identified as zero, linear and nonlinear components, respectively. As Table 6 shows, all three methods indicate that rm, rad and black have nonnegative effects on house prices, which coincides with common intuition about their effects. In addition, the LS approach removes the three covariates zn, indus and age from the final model as unimportant and identifies the remaining nine covariates as nonlinear components. The QR method identifies the three covariates zn, indus and rad as zero components, the three covariates crim, age and ptratio as linear components, and the remaining six covariates as nonlinear components. The RR method identifies the four covariates zn, indus, age and rad as zero components, the three covariates crim, dis and black as linear components, and the remaining five covariates as nonlinear components. Similar conclusions can be drawn from the corresponding fits presented in Figs. 1, 2 and 3. Evidently, the proposed rank approach yields the most parsimonious model among the three methods.

For a further assessment of the applicability of the RR method, Fig. 4a displays the normal QQ-plot of the residuals from the RR procedure, which suggests that the error term of the Boston housing data probably comes from a non-normal distribution. Moreover, to compare the proposed RR procedure with the LS and QR methods, Fig. 4b gives boxplots of the mean absolute prediction error (MAPE), obtained from 200 replications, in each of which 400 observations are randomly drawn. Clearly, the RR method performs best, having both the smallest mean MAPE and the smallest variance. Consequently, taking into account model complexity and prediction performance, the proposed rank-based regression is a preferred method for analyzing this data set.

6 Concluding remarks

In this paper, a novel and robust procedure based on rank regression and spline approximation was developed for model identification in semiparametric additive models. By adding a two-fold SCAD penalty, the proposed method simultaneously estimates and identifies the nonzero components as well as the linear components. Theoretical properties of the estimators of both the nonparametric parts and the linear parameters were derived under mild conditions. In addition, we showed that the proposed rank estimator is highly efficient across a wide spectrum of error distributions; even in the worst-case scenario, the ARE of the proposed rank estimate versus the least squares estimate, whose expression is closely related to that of the signed-rank Wilcoxon test relative to the t-test, is bounded below by 0.864 for the linear parameters. Furthermore, we presented an efficient algorithm for computation and discussed the selection of the tuning parameters. Extending this work to generalized additive models or other nonparametric models appears to be a promising and useful direction for practitioners, which we leave for future work.