1 Introduction

Besides parametric models, various semi-/nonparametric models have been used to describe longitudinal data. The varying coefficient model (VCM), first proposed by Hastie and Tibshirani (1993), has been a popular approach for modeling longitudinal data due to its flexibility and interpretability: the coefficients are smooth nonparametric functions of the measurement time. The varying coefficient model for longitudinal data can be written as

$$\begin{aligned} y_{ij}=\varvec{X}_{ij}^T\varvec{\alpha }(t_{ij})+\epsilon _{ij},\ i=1,\ldots ,n, \ j=1,\ldots ,m_i, \end{aligned}$$
(1)

where \(\varvec{X}_{ij}=(x_{ij1},\ldots ,x_{ijp})^T\) is a p-dimensional covariate associated with the j-th measurement for the i-th subject at time \(t_{ij}\), \(\varvec{\alpha }(t)=(\alpha _1(t),\ldots , \alpha _p(t))^T\) comprises p unknown nonparametric functions and \(\epsilon _{ij}\) is the random error.

There exist many studies on model (1) in the framework of mean regression; see, for example, Fan and Zhang (1999, 2000), Chiang et al. (2001), Huang et al. (2002), Qu and Li (2006), Şentürk and Müller (2008). Quantile regression (Koenker 2005) is a valuable alternative to least squares-based methods. Kim (2007) and Cai and Xu (2008) proposed quantile VCMs for cross-sectional data using polynomial splines and local polynomial smoothing, respectively. Andriyana et al. (2014) investigated the quantile VCM for longitudinal data through a penalized splines approach. Compared with mean regression for the VCM, quantile regression provides a more comprehensive summary of the response distribution and describes the dynamic functional relationship between covariates and response at different percentiles of the distribution. In addition, quantile regression is more robust than least squares regression.

For longitudinal data analysis, it is important to properly account for the correlation within each subject; ignoring it can yield less efficient estimators. Several strategies have been developed to incorporate this correlation in mean regression. Lin et al. (2007) studied the VCM based on generalized estimating equations (GEE, Liang and Zeger 1986). Wang et al. (2005) and Huang et al. (2007) considered efficient estimation for partially linear semiparametric models, and Lian et al. (2014) investigated partially linear additive models in high dimensions. Although GEE-based estimators are consistent, they lose efficiency if the working correlation is misspecified. To alleviate the impact of correlation misspecification and improve estimation efficiency, Qu and Li (2006) applied quadratic inference functions (QIF, Qu et al. 2000) to the VCM using penalized splines. The QIF approach takes the within-cluster correlation into account and is more efficient than the GEE approach when the working correlation is misspecified. Studies have also been carried out for other nonparametric/semiparametric models using QIF; see, for example, Xue et al. (2010), Li et al. (2014), Ma et al. (2014).

Quantile regression has been widely used for longitudinal data. However, most existing works propose estimators that ignore the within-subject correlation for simplicity, which results in a loss of efficiency in estimation and inference. Compared with mean regression, accounting for correlation in quantile regression is more challenging because of the computational issues pointed out by Leng and Zhang (2014). In view of the flexibility of the VCM and the good performance of QIF, in this paper we study estimation and inference for the quantile VCM. By approximating the nonparametric functions with polynomial splines and taking the correlation into account, we propose a new estimation procedure for longitudinal data and establish the theoretical properties of the resulting estimators. To further improve estimation, we develop a method to select the important variables using the adaptive group SCAD penalty. The theoretical challenge largely lies in the discontinuity of the objective function and the diverging dimensionality resulting from the spline approximation, which we handle using empirical process theory (Van der Vaart 2000). Simulation results and a real data analysis show that the proposed method outperforms existing methods.

The rest of the paper is organized as follows. In Sect. 2, we present the estimation approach to quantile VCM based on QIF, where the nonparametric functions are approximated by polynomial splines. The large sample properties of the proposed estimators are established, and we also develop an estimation procedure using induced smoothing. In Sect. 3, to select the important nonparametric components, we develop a variable selection procedure based on the adaptive group SCAD penalty, and its oracle properties are also investigated. Simulation studies in Sect. 4 and real data analysis in Sect. 5 are used to illustrate the performance of the proposed approach. Finally, some concluding remarks are given in Sect. 6. Technical proofs are contained in Appendix in Supplementary Material.

2 Methodology

2.1 Estimation based on QIF

We assume in (1) that the conditional \(\tau \)-th quantile of error \(\epsilon _{ij}\) given \({\varvec{X}}_{ij}\) and \(t_{ij}\) is zero, i.e.,

$$\begin{aligned} Q_\tau (y_{ij}|\varvec{X}_{ij})=\varvec{X}_{ij}^T\varvec{\alpha }_\tau (t_{ij}), \ i=1,\ldots ,n,\ j=1,\ldots , m_i, \end{aligned}$$

and that the observations are independent across different subjects. For notation simplicity, we omit the subscript \(\tau \) in \(\varvec{\alpha }_\tau (t)\) in the following.

In our estimation procedure, we approximate the smooth functions \(\{\alpha _l(\cdot )\}_{l=1}^p\) by polynomial splines (De Boor 2001; He and Shi 1996). We assume without loss of generality that the covariates \(t_{ij}\) are scaled to take value in the interval [0, 1]. For each \(1\le l \le p\), let

$$\begin{aligned} \xi _{l,-(d-1)}=\cdots =\xi _{l,0}=0<\xi _{l,1}<\cdots<\xi _{l,K_l}<1=\xi _{l,K_l+1}=\cdots =\xi _{l,K_l+d} \end{aligned}$$

be a partition of [0, 1] into subintervals \([\xi _{l,k},\xi _{l,k+1}), k=0,\ldots , K_l\) with \(K_l\) interior knots. A polynomial spline of order d is a function which is a polynomial of degree \(d-1\) in each subinterval and globally \(d-2\) times continuously differentiable on [0, 1]. Denote the d-th order B-spline basis as \(\varvec{B}_l(t)=(B_{l1}(t),\ldots , B_{lJ_l}(t))^T\), \(J_l=K_l+d\), and its normalized basis as \(\sqrt{J_l}\varvec{B}_l(t)=(\sqrt{J_l}B_{l1}(t),\ldots , \sqrt{J_l} B_{lJ_l}(t))^T\). With an abuse of notation, the normalized basis is still denoted by \(\varvec{B}_l(t)\). Then, each unknown function \(\alpha _l(\cdot )\) can be approximated by a linear combination of the normalized basis such that

$$\begin{aligned} \alpha _l(t)\approx s_l(t)=\sum _{k=1}^{J_l}B_{lk}(t)\gamma _{lk},\ \ l=1,\ldots ,p. \end{aligned}$$
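This approximation step can be sketched in Python using `scipy.interpolate.BSpline` (the knot placement, number of knots, and target function below are illustrative choices, not the paper's code):

```python
import numpy as np
from scipy.interpolate import BSpline

d = 4                      # spline order (cubic => degree d - 1 = 3)
K = 5                      # number of interior knots
interior = np.linspace(0, 1, K + 2)[1:-1]
# Knot vector with d-fold boundary knots, matching the partition above.
knots = np.concatenate([np.zeros(d), interior, np.ones(d)])
J = K + d                  # number of basis functions J_l = K_l + d

def basis_matrix(t):
    """Evaluate the J B-spline basis functions at the points t."""
    t = np.atleast_1d(t)
    B = np.empty((t.size, J))
    for k in range(J):
        coef = np.zeros(J)
        coef[k] = 1.0
        B[:, k] = BSpline(knots, coef, d - 1)(t)
    return B

# Least-squares spline approximation of alpha(t) = 3 sin(2 pi t).
t_grid = np.linspace(0, 1, 200)
alpha = 3 * np.sin(2 * np.pi * t_grid)
gamma, *_ = np.linalg.lstsq(basis_matrix(t_grid), alpha, rcond=None)
approx_err = np.max(np.abs(basis_matrix(t_grid) @ gamma - alpha))
```

With a cubic basis and a handful of interior knots, the maximum approximation error for such a smooth function is already small, which is what condition (C3) below formalizes.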

Define \(\varvec{\gamma }_l=(\gamma _{l1},\ldots ,\gamma _{lJ_l})^T\), \(\varvec{\gamma }=(\varvec{\gamma }_1^T,\ldots ,\varvec{\gamma }_p^T)^T\), \(\varvec{U}_{ij}^T=\varvec{X}^T_{ij}\bar{\varvec{B}}(t_{ij})\),

$$\begin{aligned} \bar{\varvec{B}}(t)=\left( \begin{array}{llll} \varvec{B}_1^T(t) &{}\quad {\varvec{0}} &{}\quad \cdots &{}\quad {\varvec{0}} \\ {\varvec{0}} &{}\quad \varvec{B}_2^T(t) &{}\quad \cdots &{}\quad {\varvec{0}} \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ {\varvec{0}} &{}\quad {\varvec{0}} &{}\quad \cdots &{}\quad \varvec{B}_p^T(t) \end{array}\right) , \end{aligned}$$

\(\varvec{U}_i=(\varvec{U}_{i1},\ldots ,\varvec{U}_{im_i})^T\) and \(\varvec{\epsilon }_i=(\epsilon _{i1},\ldots ,\epsilon _{im_i})^T\). If we ignore the correlation within subjects, we can obtain the estimator by minimizing the following objective function (Kim 2007):

$$\begin{aligned} G(\varvec{\gamma })=\sum _{i=1}^n\sum _{j=1}^{m_i} \rho _\tau \left( y_{ij}-\varvec{U}_{ij}^T\varvec{\gamma }\right) , \end{aligned}$$
(2)

where \(\rho _\tau (u)=u(\tau -I(u<0))\) is the check function. The estimating equation derived from (2) is

$$\begin{aligned} \sum _{i=1}^n\varvec{U}_i^T \psi _\tau (\varvec{y}_i-\varvec{U}_{i}\varvec{\gamma })=0, \end{aligned}$$

where \(\varvec{y}_i=(y_{i1},\ldots ,y_{im_i})^T\), \(\psi _\tau (u)=\tau -I(u<0)\) and \(\psi _\tau (\varvec{y}_i-\varvec{U}_{i}\varvec{\gamma })=(\psi _\tau ({y}_{i1}-\varvec{U}_{i1}^T{\varvec{\gamma }}),\ldots , \psi _\tau ({y}_{im_i}-\varvec{U}_{im_i}^T\varvec{\gamma }))^T\).
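For reference, the check function and its subgradient can be sketched as follows (toy values only):

```python
import numpy as np

# The check function rho_tau and the score psi_tau used in the
# working-independence estimating equation.
def rho(u, tau):
    return u * (tau - (u < 0))

def psi(u, tau):
    return tau - (u < 0).astype(float)

u = np.array([-2.0, -0.5, 0.5, 2.0])
tau = 0.25
# rho is piecewise linear: slope tau for u > 0, slope tau - 1 for u < 0.
losses = rho(u, tau)
scores = psi(u, tau)
```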

As in Jung (1996) and Leng and Zhang (2014), a more efficient estimating equation takes the form

$$\begin{aligned} \sum _{i=1}^n\varvec{U}_i^T \varvec{\Gamma }_i\varvec{A}^{-1}_i\psi _\tau (\varvec{y}_i-\varvec{U}_{i}\varvec{\gamma })=0, \end{aligned}$$
(3)

where \(\varvec{\Gamma }_i=\mathrm{diag}(f_{i1}(0),\ldots ,f_{im_i}(0))\) with \(f_{ij}(\cdot )\) being the conditional pdf of \(\epsilon _{ij}\), and \(\varvec{A}_i\) is the working correlation matrix that may depend on some nuisance parameters which, however, may be difficult to estimate. To circumvent this problem, we apply the QIF approach by approximating \(\varvec{A}^{-1}_i\) with a linear combination of basis matrices as

$$\begin{aligned} \varvec{A}^{-1}_i=\sum _{k=1}^\upsilon a_k\varvec{M}_{ki}, \end{aligned}$$

where \(\varvec{M}_{ki}\)’s are known symmetric matrices and \(a_1, \ldots , a_\upsilon \) are unknown constants. As stated in Qu et al. (2000), this family is rich enough to accommodate, or at least approximate, many commonly used correlation structures. For example, if the working correlation has a compound symmetry structure with parameter \(\varpi \), then \(\varvec{A}^{-1}_i\) can be represented as \(a_1\varvec{M}_{1i}+a_2\varvec{M}_{2i}\), where \(a_1=-\{(m_i-2)\varpi +1\}/k_1\), \(a_2=\varpi /k_1\), \(k_1=(m_i-1)\varpi ^2-(m_i-2)\varpi -1\), \(\varvec{M}_{1i}\) is the identity matrix and \(\varvec{M}_{2i}\) has 0 on the diagonal and 1 off the diagonal. A similar linear representation of \(\varvec{A}^{-1}_i\) also holds for the AR(1) correlation structure with appropriate basis matrices. The advantage of the QIF approach is that it does not need to estimate the coefficients \(a_k\).
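As a quick numerical check (a sketch, not from the paper), the inverse of an exchangeable correlation matrix is exactly such a linear combination of the identity and the off-diagonal indicator; the coefficients follow from inverting \((1-\varpi )I+\varpi \varvec{1}\varvec{1}^T\) directly:

```python
import numpy as np

m, w = 4, 0.6                              # cluster size, CS parameter
A = (1 - w) * np.eye(m) + w * np.ones((m, m))
M1 = np.eye(m)                             # identity basis matrix
M2 = np.ones((m, m)) - np.eye(m)           # off-diagonal indicator
k1 = (m - 1) * w**2 - (m - 2) * w - 1
a1 = -((m - 2) * w + 1) / k1
a2 = w / k1
# The linear combination reproduces the exact inverse.
assert np.allclose(np.linalg.inv(A), a1 * M1 + a2 * M2)
```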

In QIF, we use estimating equations defined as

$$\begin{aligned} \varvec{S}(\varvec{\gamma })=\frac{1}{n}\sum _{i=1}^n\varvec{S}_i(\varvec{\gamma }), \end{aligned}$$
(4)

where

$$\begin{aligned} \varvec{S}_i(\varvec{\gamma }) =\left( \begin{array}{c} \varvec{U}_i^T\varvec{\Gamma }_i\varvec{M}_{1i}\psi _\tau (\varvec{y}_i-\varvec{U}_i\varvec{\gamma })\\ \vdots \\ \varvec{U}_i^T\varvec{\Gamma }_i\varvec{M}_{\upsilon i}\psi _\tau (\varvec{y}_i-\varvec{U}_i\varvec{\gamma }) \end{array}\right) . \end{aligned}$$
(5)

Note that there are more estimating equations in (4) than unknown parameters. Following Qu et al. (2000), \(\varvec{\gamma }\) can be estimated as

$$\begin{aligned} \hat{\varvec{\gamma }}=\mathrm{argmin}_{\varvec{\gamma }}\varvec{Q}_n(\varvec{\gamma }), \end{aligned}$$
(6)

where

$$\begin{aligned} \varvec{Q}_n(\varvec{\gamma })= & {} n\varvec{S}^T(\varvec{\gamma })\varvec{\Sigma }_n^{-1}(\varvec{\gamma })\varvec{S}(\varvec{\gamma }),\\ \varvec{\Sigma }_n(\varvec{\gamma })= & {} \frac{1}{n}\sum _{i=1}^n\varvec{S}_i(\varvec{\gamma })\varvec{S}_i^T(\varvec{\gamma }). \end{aligned}$$

As a result, we define the estimator of the unknown functions as

$$\begin{aligned} {\hat{\alpha }}_l(t)=\varvec{B}_l^T(t)\hat{\varvec{\gamma }}_l, \ l=1,\ldots ,p. \end{aligned}$$
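To fix ideas, the following minimal Python sketch (toy data, not the paper's implementation) evaluates the quadratic inference function \(\varvec{Q}_n\), taking \(\varvec{\Gamma }_i\) as the identity and using two basis matrices, the identity and the off-diagonal indicator:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, q = 30, 4, 3                       # subjects, cluster size, dim(gamma)
U = rng.normal(size=(n, m, q))           # stacked spline designs U_i
gamma0 = np.array([1.0, -0.5, 0.25])
y = np.einsum('imq,q->im', U, gamma0) + rng.normal(size=(n, m))
tau = 0.5
M = [np.eye(m), np.ones((m, m)) - np.eye(m)]   # basis matrices

def S_i(gamma, i):
    # extended score for subject i: stack U_i' M_k psi_tau(residuals)
    psi = tau - (y[i] - U[i] @ gamma < 0).astype(float)
    return np.concatenate([U[i].T @ Mk @ psi for Mk in M])

def Q_n(gamma):
    S = np.stack([S_i(gamma, i) for i in range(n)])   # (n, 2q)
    Sbar = S.mean(axis=0)
    Sigma = S.T @ S / n
    return n * Sbar @ np.linalg.solve(Sigma, Sbar)

val = Q_n(gamma0)   # nonnegative; small near the truth
```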

Remark 1

Note that the estimating Eq. (4) involves the unknown error density \(f_{ij}(0)\). In this paper, we adopt the method of Hendricks and Koenker (1992) and estimate \(f_{ij}(0)\) by the difference quotient,

$$\begin{aligned} {\hat{f}}_{ij}(0)=2h_n\left\{ \varvec{X}_{ij}^T\left[ \check{\varvec{\alpha }}(t_{ij},\tau +h_n)- \check{\varvec{\alpha }}(t_{ij},\tau -h_n)\right] \right\} ^{-1}, \end{aligned}$$

where the estimators \(\check{\varvec{\alpha }}(t,\tau )\) are obtained from (6) with the term \(\varvec{\Gamma }_i\) omitted, evaluated at quantile levels \(\tau \pm h_n\), and \(h_n\) is a bandwidth parameter tending to zero as \(n\rightarrow \infty \). In our numerical studies, we choose \(h_n=1.57n^{-1/3} \left( 1.5\phi ^2\{\Phi ^{-1}(\tau )\}/(2\{\Phi ^{-1}(\tau )\}^2+1)\right) ^{2/3}\) following Hall and Sheather (1988), where \(\phi (\cdot )\) and \(\Phi (\cdot )\) are the pdf and cdf of the standard normal distribution. For simplicity of the proofs, we assume in the following that the \(f_{ij}(0)\)'s are known when deriving the asymptotic properties of the estimators. This is similar to the approach adopted in the literature, for example in Leng and Zhang (2014); it appears very hard to establish the asymptotic properties when the density is estimated. On the other hand, we believe a highly accurate estimator of the density is not critical in our context, unlike in classical density estimation problems: even if the density is replaced by 1, the estimating equation remains consistent. It requires further study to see whether this gap in the theory can be closed with more technical advances.
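For concreteness, the Hall–Sheather bandwidth above can be computed as follows (a sketch using the exponent as stated in the text):

```python
import numpy as np
from scipy.stats import norm

def hall_sheather(n, tau):
    """Hall-Sheather bandwidth for the difference-quotient density estimate."""
    z = norm.ppf(tau)
    return 1.57 * n**(-1 / 3) * (1.5 * norm.pdf(z)**2 / (2 * z**2 + 1))**(2 / 3)

h = hall_sheather(200, 0.25)
# tau - h and tau + h must stay inside (0, 1) for the quotient to exist
assert 0 < 0.25 - h and 0.25 + h < 1
```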

2.2 Theoretical properties

In this subsection, we study the rate of convergence for \({\hat{\alpha }}_l(t)\). The following assumptions are needed to derive the asymptotic properties of \({\hat{\alpha }}_l(t)\).

(C1):

The cluster sizes \(m_i\) are uniformly bounded for all \(i=1,\ldots ,n\).

(C2):

The conditional density \(f_{ij}\) of \(\epsilon _{ij}\) is uniformly bounded and bounded away from zero, and has a bounded first derivative in a neighborhood of zero.

(C3):

\(\alpha _l (\cdot )\in {\mathcal {H}}_r\), \(l=1,\ldots ,p\), for some \(r>1/2\), where \({\mathcal {H}}_r\) is the collection of all functions on [0, 1] whose s-th order derivative satisfies the Hölder condition of order \(\vartheta \) with \(r=s+\vartheta \). The spline order \(d\ge r+1\).

(C4):

The covariates \(\varvec{X}_{ij}\) are bounded in probability for all i and j.

(C5):

The conditional distribution of t given \(\varvec{X}=\varvec{x}\) has a density \(f_{t|\varvec{X}}\) which satisfies that \(0<c_1\le f_{t|\varvec{X}}(t|\varvec{x})\le c_2<\infty \) uniformly in \(\varvec{x}\) and t for some positive constants \(c_1\) and \(c_2\).

(C6):

The knot sequences for \(l=1, \ldots , p\) are quasi-uniform. That is, there exists \(c_3>0\) such that \(\max _l\left( \max _k h_{l,k}/\min _k h_{l,k}\right) \le c_3\), where \(h_{l,k}=\xi _{l,k+1}-\xi _{l,k}\) (\(0\le k \le K_l\)) are the distances between neighboring knots. Furthermore, the number of interior knots satisfies \(K_l\asymp n^{1/(2r+1)}\).

(C7):

The eigenvalues for each \(\varvec{M}_{ki}\) are bounded away from 0 and \(\infty \).

These conditions are common in the literature; see, for example, Kim (2007), Huang et al. (2002), Wang et al. (2009) and Xue et al. (2010). Conditions (C1) and (C2) are standard assumptions used in longitudinal studies and quantile regression, respectively. Condition (C3) is a smoothness assumption on the nonparametric functions. The boundedness assumption in (C4) is mainly for convenience of proof. Condition (C5) is commonly used in VCM. Condition (C6) is a mild assumption on the choice of knots. Finally, Condition (C7) is satisfied for commonly used basis matrices.

Theorem 1

Under Conditions (C1)–(C7), there exists a local minimizer of (6) such that

$$\begin{aligned} \Vert {\hat{\alpha }}_l-\alpha _l\Vert _2^2=O_p\left( n^{-2r/(2r+1)}\right) , \ \ l=1, \ldots , p, \end{aligned}$$

where \(\Vert .\Vert _2\) is the \(L_2\) norm for functions on [0, 1].

Remark 2

The rate of convergence here is the same as that in Kim (2007) for independent data (\(m_i\equiv 1\)) and in Wang et al. (2009) for longitudinal data, where, however, the correlation is ignored. In practice, the main advantage of the QIF approach is that it incorporates the within-cluster correlation by optimally combining estimating equations without estimating the correlation parameters.

Remark 3

Because \({\varvec{Q}}_n(\varvec{\gamma })\) is not continuous, we are not aware of a satisfactory algorithm for computing its minimizer. We propose a smoothing method that yields a feasible objective function in the next subsection. The theoretical results here nevertheless provide a benchmark against which the rates of the feasible estimator can be compared. Similarly, in Sect. 3, the infeasible estimator is also studied theoretically when penalization is used for variable selection.

2.3 Induced smoothing

It is difficult to solve the minimization problem (6) directly because \(\varvec{S}(\varvec{\gamma })\) is not continuous. To overcome this difficulty, we apply the induced smoothing method, which has been used in linear quantile regression by Fu and Wang (2012) and Leng and Zhang (2014).

Let

$$\begin{aligned} \tilde{\varvec{S}}(\varvec{\gamma })=\mathrm{E}\left[ \varvec{S}(\varvec{\gamma }+(nh)^{-1/2}\varvec{\Omega }^{1/2}\varvec{\delta })\right] , \end{aligned}$$
(7)

where \(h=n^{-1/(2r+1)}\), the expectation is taken with respect to \(\varvec{\delta }\sim N(0, \varvec{I}_q)\) with \(q=\sum _{l=1}^p J_l\), and \(\varvec{\Omega }\) is a \(q\times q\) positive definite matrix. These choices are certainly not the only possibilities; we simply follow the induced smoothing literature and use a multivariate Gaussian disturbance. The smoothing merely ensures that the objective function is differentiable: as long as the disturbance is small enough, \(\tilde{\varvec{Q}}_n\) should be close to \(\varvec{Q}_n\) and have comparable performance.

By some simple calculations, \(\tilde{\varvec{S}}_i(\varvec{\gamma })\) can be written as

$$\begin{aligned} \tilde{\varvec{S}}_i(\varvec{\gamma })=\mathrm{E}{\varvec{S}}_i(\varvec{\gamma }+(nh)^{-1/2} \varvec{\Omega }^{1/2}\varvec{\delta })=\left( \begin{array}{c} \varvec{U}_i^T\varvec{\Gamma }_i\varvec{M}_{1i}\left[ \Phi \left( \sqrt{nh}\frac{ \varvec{y}_i-\varvec{U}_i\varvec{\gamma }}{\varvec{r}_i}\right) -\varvec{1}(1-\tau )\right] \\ \vdots \\ \varvec{U}_i^T\varvec{\Gamma }_i\varvec{M}_{\upsilon i}\left[ \Phi \left( \sqrt{nh} \frac{\varvec{y}_i-\varvec{U}_i\varvec{\gamma }}{\varvec{r}_i}\right) -\varvec{1}(1-\tau )\right] \end{array} \right) , \end{aligned}$$
(8)

where \(\varvec{r}_i=(r_{i1},\ldots ,r_{im_i})^T\) with \(r_{ij}=\sqrt{\varvec{U}_{ij}^T \varvec{\Omega }\varvec{U}_{ij}}\), \(j=1,\ldots ,m_i\), \(\varvec{1}\) being the \(m_i\)-dimensional column vector with all elements 1, and \(\Phi \left( \sqrt{nh}\frac{ \varvec{y}_i-\varvec{U}_i\varvec{\gamma }}{\varvec{r}_i}\right) \) denotes an \(m_i\)-dimensional vector with j-th element being \(\Phi \left( \sqrt{nh}\frac{ {y}_{ij}-\varvec{U}_{ij}^T\varvec{\gamma }}{r_{ij}}\right) \).
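The effect of the smoothing can be seen in a small sketch: the normal-cdf surrogate reduces to the exact (discontinuous) score \(\psi _\tau \) as the scale factor \(\sqrt{nh}\) grows (toy values; \(\varvec{\Gamma }_i\) and the basis matrices are omitted):

```python
import numpy as np
from scipy.stats import norm

tau, r = 0.25, 1.0

def psi_smooth(u, nh):
    # smoothed score: Phi(sqrt(nh) u / r) - (1 - tau)
    return norm.cdf(np.sqrt(nh) * u / r) - (1 - tau)

def psi_exact(u):
    # exact score: tau - I(u < 0)
    return tau - (u < 0).astype(float)

u = np.array([-0.8, -0.1, 0.1, 0.8])
gap = np.max(np.abs(psi_smooth(u, nh=1e6) - psi_exact(u)))
```

For large \(nh\) the surrogate agrees with \(\psi _\tau \) away from zero, while remaining differentiable everywhere.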

The smoothing estimator \(\tilde{\varvec{\gamma }}\) can be obtained as

$$\begin{aligned} \tilde{\varvec{\gamma }}=\mathrm{argmin}_{\varvec{\gamma }} \tilde{\varvec{Q}}_n(\varvec{\gamma }), \end{aligned}$$
(9)

where

$$\begin{aligned} \tilde{\varvec{Q}}_n(\varvec{\gamma })=n\tilde{\varvec{S}}^T(\varvec{\gamma })\tilde{\varvec{\Sigma }}^{-1}_n(\varvec{\gamma })\tilde{\varvec{S}}(\varvec{\gamma }) \end{aligned}$$

with \(\tilde{\varvec{\Sigma }}_n(\varvec{\gamma })=\frac{1}{n}\sum _{i=1}^n \tilde{\varvec{S}}_i(\varvec{\gamma })\tilde{\varvec{S}}_i^T(\varvec{\gamma })\). Then, we can set

$$\begin{aligned} {\tilde{\alpha }}_l(t)=\varvec{B}_l^T(t)\tilde{\varvec{\gamma }}_l, l=1,\ldots ,p. \end{aligned}$$

Theorem 2 establishes the theoretical property of the estimator after smoothing, which enjoys the same convergence rate.

Theorem 2

Let \(\varvec{\Omega }\) be any symmetric and positive definite matrix with bounded eigenvalues. Under Conditions (C1)–(C7), there exists a local minimizer of (9) such that

$$\begin{aligned} \Vert \tilde{\alpha }_l-\alpha _l\Vert _2^2 =O_p\left( n^{-2r/(2r+1)}\right) , \ \ l=1, \ldots , p. \end{aligned}$$

With the above smoothing, the standard Newton–Raphson method can be used to obtain the estimator. However, the closed-form derivative of \(\tilde{\varvec{Q}}_n(\varvec{\gamma })\) with respect to \(\varvec{\gamma }\) is unwieldy, so we use the two-stage method mentioned in Greene (2011). First, an initial estimator is obtained with the R package quantreg using the weights \(\varvec{\Gamma }_i, i=1,\ldots ,n\), with correlations ignored. Then, the initial estimator is used to compute \(\varvec{\Sigma }_n\), which is held fixed while the standard Newton–Raphson method is applied to obtain the refined estimator. In our experience, the resulting algorithm is very fast and typically converges within 20 iterations. Although a multiple-stage approach could also be used, in which the refined estimator further updates \(\varvec{\Sigma }_n\), we found that this does not improve over the two-stage approach and therefore do not pursue it.
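The two-stage idea can be sketched on toy data as follows; this is only an illustration, with \(\varvec{\Gamma }_i\) taken as the identity, a smoothed check loss standing in for the quantreg initial fit, and a generic quasi-Newton routine standing in for Newton–Raphson:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

n, m, q, tau = 100, 4, 2, 0.5
U = rng.normal(size=(n, m, q))
gamma_true = np.array([1.0, -0.5])
y = np.einsum('imq,q->im', U, gamma_true) + 0.5 * rng.normal(size=(n, m))
M = [np.eye(m), np.ones((m, m)) - np.eye(m)]   # basis matrices
nh = n * n**(-1 / 5)                           # smoothing scale (toy choice)

def S_stack(gamma):
    # smoothed extended scores, one row per subject
    out = []
    for i in range(n):
        z = norm.cdf(np.sqrt(nh) * (y[i] - U[i] @ gamma)) - (1 - tau)
        out.append(np.concatenate([U[i].T @ Mk @ z for Mk in M]))
    return np.stack(out)

# Stage 1: working-independence fit via a smoothed check loss.
def wi_loss(gamma):
    res = y - np.einsum('imq,q->im', U, gamma)
    return np.sum(res * (tau - norm.cdf(-np.sqrt(nh) * res)))

gamma_init = minimize(wi_loss, np.zeros(q)).x

# Stage 2: fix Sigma_n at the initial estimate, then minimize Q_n.
S0 = S_stack(gamma_init)
Sigma0 = S0.T @ S0 / n

def Q_fixed(gamma):
    Sbar = S_stack(gamma).mean(axis=0)
    return n * Sbar @ np.linalg.solve(Sigma0, Sbar)

gamma_hat = minimize(Q_fixed, gamma_init).x
```

Fixing \(\varvec{\Sigma }_n\) in the second stage is what keeps each iteration cheap: only the smoothed scores need to be re-evaluated.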

Remark 4

In practice, the matrix \(\varvec{\Omega }\) can be updated as an estimate of the covariance matrix of the estimators, which can be simply calculated as

$$\begin{aligned} \hat{\varvec{\Omega }}=\left( \{{\tilde{\varvec{S}}}^\prime (\tilde{\varvec{\gamma }})\}^T\tilde{\varvec{\Sigma }}_n^{-1}(\tilde{\varvec{\gamma }}) {\tilde{\varvec{S}}}^\prime (\tilde{\varvec{\gamma }})\right) ^{-1}, \end{aligned}$$

where \({\tilde{\varvec{S}}}^\prime (\tilde{\varvec{\gamma }})\) is the derivative of \({\tilde{\varvec{S}}}(\varvec{\gamma })\) evaluated at \(\tilde{\varvec{\gamma }}\). This is the induced smoothing method proposed by Brown and Wang (2005); Leng and Zhang (2014) adopted the same formula for linear quantile regression with longitudinal data. However, owing to the nonsmooth nature of the objective function and the difficulty of handling the spline approximation bias, we are not able to establish the validity of this formula rigorously. The proposed asymptotic variance formula therefore ignores the bias in the function estimation and applies the formula as if the model were linear after the spline expansion.

Remark 5

Although we are not able to provide rigorous asymptotic normality results for the nonparametric functions, we can informally argue that the QIF estimator is more efficient than the estimator assuming working independence, as follows. Since the QIF estimator is a GMM (generalized method of moments) estimator, its asymptotic variance is at least as small as that obtained by any linear combination of the estimating equations. When one of the basis matrices in (5) is the identity matrix, corresponding to the estimating equations that ignore correlations, QIF is therefore at least as efficient as working independence. In practice, one always sets one of the basis matrices to be the identity matrix.

To implement the above estimation procedure, one needs to select the order of the splines and the number of knots. In all simulation studies and the real data analysis, for simplicity, the nonparametric functions are approximated by cubic splines with the same number of interior knots, i.e., \(K:= K_1=\cdots =K_p\). The number of interior knots is chosen from the interval \(\left[ n^{1/(2d+1)}, 5n^{1/(2d+1)}\right] \) by minimizing the following BIC criterion

$$\begin{aligned} \mathrm{BIC}(K)=\tilde{\varvec{Q}}_n(\tilde{\varvec{\gamma }})+p(K+d){\log }(n). \end{aligned}$$

This range for K satisfies the order assumption on the number of interior knots in Theorem 1 and has also been used in Ma et al. (2014).
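The knot-selection rule can be sketched as follows (toy values; in practice \(\tilde{\varvec{Q}}_n(\tilde{\varvec{\gamma }})\) would come from the fitted model at each candidate K):

```python
import numpy as np

n, d, p = 100, 4, 3   # sample size, spline order (cubic), no. of functions

def candidate_K(n, d):
    # integer K in [n^{1/(2d+1)}, 5 n^{1/(2d+1)}]
    lo = int(np.ceil(n ** (1 / (2 * d + 1))))
    hi = int(np.floor(5 * n ** (1 / (2 * d + 1))))
    return list(range(lo, hi + 1))

def bic(Qn_value, K, n, d, p):
    # BIC(K) = Q_n + p (K + d) log n, as in the text
    return Qn_value + p * (K + d) * np.log(n)

Ks = candidate_K(n, d)
```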

3 Variable selection

In practice, there are often many covariates in model (1). With high-dimensional covariates, sparse modeling is often considered superior, owing to enhanced model predictability and interpretability. In this section, we address the variable selection problem for quantile varying coefficient model based on the QIF method.

There exist some works focusing on the variable selection methods for conditional mean or conditional quantile of VCM. For example, Wang and Xia (2009) proposed the variable selection approach for VCM using kernel smoothing adaptive group LASSO (Tibshirani 1996; Yuan and Lin 2006; Zou 2006), and Zhao et al. (2013) extended the method to the quantile VCM. Noh et al. (2012) applied the polynomial spline approximation and group SCAD with local linear approximation (LLA, Zou and Li 2008) to investigate variable selection for quantile VCM. Verhasselt (2014) discussed the variable selection for the generalized VCM using P-splines. For longitudinal data, Wang et al. (2008) studied the variable selection approach via group SCAD penalty (Fan and Li 2001). However, it ignored the correlation within subjects, which may lead to some efficiency loss in estimation and variable selection. Here, we consider variable selection of quantile VCM that incorporates the correlations within subjects.

To conduct variable selection, we propose the penalized estimator by minimizing the following penalized QIF, defined as

$$\begin{aligned} \hat{\varvec{\gamma }}^P=\mathrm{argmin}_{\varvec{\gamma }}\left\{ \varvec{Q}_n(\varvec{\gamma })+n \sum _{l=1}^p p_{\lambda _l}\left( \Vert \varvec{\gamma }_l\Vert _{\varvec{H}_l}\right) \right\} , \end{aligned}$$
(10)

where \(p_{\lambda _l}(\cdot )\) is a given penalty function depending on the tuning parameter \(\lambda _l\), \(l=1,\ldots ,p\), and \(\Vert \varvec{\gamma }_l\Vert ^2_{\varvec{H}_l} =\varvec{\gamma }_l^T\varvec{H}_l\varvec{\gamma }_l\) with \(\varvec{H}_l=\int \varvec{B}_l(t)\varvec{B}_l^T(t)dt\). Notice that \(\Vert \varvec{\gamma }_l\Vert _{\varvec{H}_l}=\Vert s_l\Vert _2\), the \(L_2\) norm of the spline function \(s_l(\cdot )\). Shrinking \(\Vert s_l\Vert _2\) to 0 entails \(s_l\equiv 0\). In addition, the tuning parameters \(\lambda _l\) for the penalty functions in (10) are not necessarily the same for different l, which can provide further flexibility.

There are several possible choices for the penalty function \(p_{\lambda _l}(\cdot )\), such as LASSO (Tibshirani 1996), MCP (Zhang 2010) and SCAD penalty (Fan and Li 2001). Here, we choose the SCAD penalty, which is defined as

$$\begin{aligned} p_{\lambda }(\theta )=\left\{ \begin{array}{ll} \lambda |\theta |, &{}\quad |\theta |\le \lambda , \\ -(\theta ^2-2a\lambda |\theta |+\lambda ^2)/[2(a-1)], &{}\quad \lambda <|\theta |\le a\lambda ,\\ (a+1)\lambda ^2/2, &{}\quad |\theta |>a\lambda , \end{array}\right. \end{aligned}$$

for given \(a>2\) and \(\lambda >0\). The SCAD penalty is continuously differentiable on \((-\infty , 0)\cup (0, \infty )\) but singular at 0, and its derivative vanishes outside \([-a\lambda , a\lambda ]\), which can produce sparse estimates for small coefficients and unbiased estimates for large coefficients. Therefore, we obtain the penalized estimator of \(\alpha _l(t)\) as

$$\begin{aligned} {\hat{\alpha }}^P_l(t)=\varvec{B}_l^T(t)\hat{{\varvec{\gamma }}}^P_l, \ \ l=1,\ldots ,p. \end{aligned}$$
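The SCAD penalty can be sketched and its key properties checked numerically: it coincides with the lasso penalty near zero and is constant beyond \(a\lambda \):

```python
import numpy as np

def scad(theta, lam, a=3.7):
    """SCAD penalty p_lambda(theta) with the usual default a = 3.7."""
    t = np.abs(theta)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                 (a + 1) * lam**2 / 2))

lam = 1.0
assert np.isclose(scad(0.5, lam), 0.5)               # lasso region
assert np.isclose(scad(10.0, lam), (3.7 + 1) / 2)    # flat region
```

The flat region beyond \(a\lambda \) is what yields (nearly) unbiased estimates for large coefficients, while the lasso region near zero produces exact sparsity.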

We next discuss the asymptotic properties of the resulting penalized estimator. Without loss of generality, we assume that \(\alpha _{l}(\cdot )\equiv 0, l=d_0+1,\ldots ,p\), and that \(\alpha _{l}(\cdot ), l=1,\ldots ,d_0\), are the nonzero components of \(\varvec{\alpha }(\cdot )\). We first show in Theorem 3 that the penalized QIF estimators \(\{{\hat{\alpha }}^P_l(t)\}_{l=1}^p\) have the same convergence rate as the unpenalized estimators \(\{{\hat{\alpha }}_l(t)\}_{l=1}^p\). Moreover, Theorem 4 establishes the sparsity of the penalized estimators, that is, \({\hat{\alpha }}^P_l(t)=0\) with probability approaching one for \(l=d_0+1,\ldots ,p\).

Theorem 3

Under the conditions of Theorem 1, if the tuning parameters satisfy \(\max _l\lambda _l\rightarrow 0\) in probability as \(n\rightarrow \infty \), then there exists a local minimizer of (10) such that

$$\begin{aligned} \Vert {\hat{\alpha }}_l^P-{\alpha }_l\Vert _2^2=O_p\left( n^{-2r/(2r+1)}\right) , \ \ l=1, \ldots , p. \end{aligned}$$

Theorem 4

Under the same conditions of Theorem 3, if the tuning parameters further satisfy \(\min _{d_0+1\le l\le p}\lambda _l n^{r/(2r+1)}\rightarrow \infty \) in probability as \(n\rightarrow \infty \), then, with probability approaching 1, \({\hat{\alpha }}^P_l(t)=0\) for \(l=d_0+1,\ldots ,p\).

Since the penalty function is not differentiable, to obtain the penalized estimator \(\hat{\varvec{\gamma }}^P\) we need to smooth both terms of the penalized objective function in (10). For \(\varvec{Q}_n(\cdot )\), we use the induced smoothing method of Sect. 2.3, and for the penalty \(p_{\lambda _l}(\cdot )\), we use the local quadratic approximation (LQA, Fan and Li 2001). The Newton–Raphson algorithm then yields the penalized smoothed QIF estimator, denoted by \(\tilde{\varvec{\gamma }}^P\). As in Theorem 2, we can establish that \(\tilde{\varvec{\gamma }}^P\) enjoys the same asymptotic properties as \(\hat{\varvec{\gamma }}^P\); the details are omitted.

Unlike our approach, Noh et al. (2012) used the local linear approximation (LLA) for the penalty terms, which allows the optimization problem to be converted to a second-order cone program. We use the simpler LQA approach to make the optimization problem differentiable. Our model is more complicated, and it remains to be seen whether LLA would permit a more efficient algorithm.

Note that there are p tuning parameters to be chosen in conducting the variable selection procedure. To reduce the computational burden, we propose to use the following strategy by setting

$$\begin{aligned} \lambda _{l}=\frac{\lambda }{\Vert \tilde{{\varvec{\gamma }}}_l\Vert _{H_l}},\ \ l=1,\ldots ,p, \end{aligned}$$

where \(\tilde{{\varvec{\gamma }}}_l\) is the initial unpenalized estimator of \({\varvec{\gamma }}_l\) obtained in Sect. 2.3. Note that the above strategy has also been used in Wang and Xia (2009). Then, we use the following criterion to obtain the optimal tuning parameter

$$\begin{aligned} {\hat{\lambda }}^{\mathrm{opt}}=\mathrm{argmin}_\lambda \left\{ \tilde{\varvec{Q}}_n(\tilde{\varvec{\gamma }}^P) + J\cdot {\log }(n)\cdot df_\lambda \right\} , \end{aligned}$$

where \(df_\lambda \) is the number of nonzero coefficient functions for a given tuning parameter \(\lambda \).

4 Simulation studies

In this section, we conduct simulation studies to illustrate the proposed methods. Note that the minimizer of \(\varvec{Q}_n\) is presented merely for theoretical reasons and cannot be computed due to the discontinuity of the objective function, whereas the minimizer of \(\tilde{\varvec{Q}}_n\) is easily found with the two-stage approach of Sect. 2. For each example, we focus on \(\tau =0.25\) and 0.5, and 500 data sets are generated to evaluate the results.

Example 1

In this example, the responses \(y_{ij}\) are generated from

$$\begin{aligned} y_{ij}=\alpha _0(t_{ij})+x_{ij1}\alpha _1(t_{ij})+x_{ij2}\alpha _2(t_{ij})+(1+\kappa |t_{ij}|)\epsilon _{ij}, \ i=1,\ldots ,n, \ j=1,\ldots ,m_i, \end{aligned}$$

where the number of subjects is \(n=50, 100\) or 150. Each subject is scheduled to be measured at the time points \(\{0,0.1,0.2,\ldots ,1\}\), each of which has a 20% probability of being skipped. The actual measurement times are generated by adding a \(U(-0.05, 0.05)\) random variable to the scheduled times. The three nonparametric functions are

$$\begin{aligned} \alpha _0(t)=3\sin (2\pi t), \ \ \alpha _1(t)=8t(1-t) \ \ \mathrm{and} \ \ \alpha _2(t)=2\cos (2\pi t). \end{aligned}$$
Table 1 Simulation results of AIMSE when the true error structure is compound symmetry

The marginal distributions of the two covariates are standard normal with correlation coefficient 0.5. The random error \(\varvec{\epsilon }_i =(\epsilon _{i1},\ldots ,\epsilon _{im_i})^T\) follows a multivariate normal distribution or a multivariate t-distribution (3 degrees of freedom) with location parameter \(-q_\tau \) and covariance matrix \(\Sigma \), where \(q_\tau \) is the \(\tau \)-th quantile of the standard normal or \(t_3\) distribution, so that the \(\tau \)-th quantile of \({\epsilon }_{ij}\) is zero. The covariance matrix \(\Sigma \) follows either the compound symmetry (CS) or the AR(1) structure with parameter \(\rho =0.8\). The quantity \(\kappa \) equals 0 or 1, corresponding to the homoscedastic model (HM) and the heteroscedastic model (HT), respectively.
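Under these settings, the data-generating mechanism can be sketched as follows (a sketch of the Gaussian CS case with \(\kappa =0\) and \(\tau =0.5\); function and parameter choices are those stated above):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n, tau, rho, kappa = 50, 0.5, 0.8, 0

def gen_subject():
    sched = np.arange(0, 1.01, 0.1)
    keep = rng.random(sched.size) > 0.2          # 20% chance of skipping
    t = sched[keep] + rng.uniform(-0.05, 0.05, keep.sum())
    m = t.size
    # two covariates with correlation 0.5
    C = np.array([[1.0, 0.5], [0.5, 1.0]])
    X = rng.multivariate_normal(np.zeros(2), C, size=m)
    # CS error covariance; shift by -q_tau so the tau-quantile of eps is 0
    Sigma = (1 - rho) * np.eye(m) + rho * np.ones((m, m))
    eps = rng.multivariate_normal(np.full(m, -norm.ppf(tau)), Sigma)
    a0 = 3 * np.sin(2 * np.pi * t)
    a1 = 8 * t * (1 - t)
    a2 = 2 * np.cos(2 * np.pi * t)
    y = a0 + X[:, 0] * a1 + X[:, 1] * a2 + (1 + kappa * np.abs(t)) * eps
    return t, X, y

data = [gen_subject() for _ in range(n)]
```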

Table 2 Simulation results of AIMSE when the true error structure is AR(1)

To assess the estimation efficiency for nonparametric functions, we calculate the integrated mean square error (IMSE) defined as

$$\begin{aligned} \mathrm{IMSE}(\alpha _l)=\frac{1}{n_{\mathrm{grid}}}\sum _{k=1}^{n_{\mathrm{grid}}} \left\{ \alpha _l(t_k)-{\hat{\alpha }}_l(t_k)\right\} ^2 \end{aligned}$$

averaged over 500 data sets and report the average of the integrated mean square error (AIMSE)

$$\begin{aligned} \mathrm{AIMSE}=\frac{1}{p}\sum _{l=1}^p \mathrm{IMSE}(\alpha _l), \end{aligned}$$

where \(\{t_k: k=1,\ldots ,n_{\mathrm{grid}}\}\) with \(n_{\mathrm{grid}}=200\) are the grid time points at which the functions \(\{\alpha _l(\cdot )\}\) are evaluated. The simulation results are shown in Tables 1 and 2, where we also report the corresponding results by assuming working independence (WI) for comparison.
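The IMSE and AIMSE computations above amount to averaging squared errors over a time grid and then over components. A short sketch (the helper name `aimse` is ours; true and estimated coefficient functions are passed as callables):

```python
import numpy as np

def aimse(alpha_true, alpha_hat, n_grid=200):
    """Average the integrated mean square errors of the p coefficient
    functions over a grid of n_grid time points (hypothetical helper)."""
    t = np.linspace(0, 1, n_grid)
    imse = [np.mean((f(t) - g(t)) ** 2) for f, g in zip(alpha_true, alpha_hat)]
    return np.mean(imse)
```

In the simulations this quantity would additionally be averaged over the 500 replications to obtain the reported AIMSE.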

Table 1 summarizes the estimation results when the error correlation has the compound symmetry structure. The AIMSEs of each method become smaller as the sample size increases. Moreover, the estimators with the correctly specified CS working correlation have the smallest AIMSE, and even with the misspecified AR(1) working correlation the efficiency gains over WI, which ignores the within-subject correlation, remain obvious. If the model is heteroscedastic and/or the error follows a multivariate t-distribution, the efficiency gain is more pronounced. Similar phenomena are observed when the true error correlation is AR(1), as shown in Table 2. Further simulation studies (not shown) indicate that incorporating \(\varvec{\Gamma }_i\) performs only slightly better than replacing \(\varvec{\Gamma }_i\) with an identity matrix. In addition, we have also run the simulations with lower error correlations \(\rho =0.3\) and \(\rho =0.5\); our proposed approach performs better than, or at least as well as, WI. We omit these results to save space.

Example 2

In this example, the data are generated from the following model:

$$\begin{aligned} y_{ij}=\sum _{l=1}^p x_{ijl}\alpha _l(t_{ij})+(1+\kappa |x_{ij3}|)\epsilon _{ij}, \ i=1,\ldots ,250, \ j=1,\ldots ,m_i. \end{aligned}$$

Each subject is scheduled to be measured at the time points \(\{0,1, 2, \ldots , 20\}\), and each time point has a 50% probability of being skipped. Similar to Example 1, the actual measurement times are generated by adding a \(U(-0.5, 0.5)\) random variable to the scheduled times. The three relevant variables, \(x_{ijl}, l=1,2,3\), are simulated as follows: \(x_{ij1}\) is generated from \(U(t_{ij}/10,2+t_{ij}/10)\); \(x_{ij2}\), conditional on \(x_{ij1}\), is \(N(0,(1+x_{ij1})/(2+x_{ij1}))\); and \(x_{ij3}\), independent of \(x_{ij1}\) and \(x_{ij2}\), is a Bernoulli random variable with success probability 0.8. The remaining \(p-3\) redundant variables, \(x_{ijl}, l=4,\ldots ,p\), are mutually independent, and for each l, \(x_{ijl}\) is generated from a multivariate Gaussian distribution with zero mean and an exponentially decaying covariance

$$\begin{aligned}\mathrm{cov}(x_{ijl},x_{uvl})=\left\{ \begin{array}{ll} 4\exp (-|t_{ij}-t_{uv}|), &{}\quad \mathrm{if} \ i=u\\ 0, &{}\quad \mathrm{if} \ i\ne u \end{array}. \right. \end{aligned}$$

The three nonparametric functions are

$$\begin{aligned} \alpha _1(t)=15+20\sin \left( \frac{\pi t}{40}\right) ,\ \alpha _2(t)=2-3\cos \left( \frac{(t-25)\pi }{15}\right) ,\ \alpha _3(t)=6-0.6 t. \end{aligned}$$

The marginal variance of the random error \(\varvec{\epsilon }_i =(\epsilon _{i1},\ldots ,\epsilon _{im_i})^T\) is 4, the correlation settings are the same as in Example 1, and the cases \(\kappa =0\) and 1 are again denoted by HM and HT, respectively.
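The covariate design of Example 2 can be sketched as follows (a Python/NumPy illustration for a single subject; the helper name `covariates_example2` and the seed are our own, and \(N(\mu ,\sigma ^2)\) is read with its second argument as a variance):

```python
import numpy as np

rng = np.random.default_rng(1)

def covariates_example2(t, p=8):
    """Sketch of the p covariates for one subject in Example 2,
    given that subject's measurement times t (hypothetical helper)."""
    m = t.size
    x1 = rng.uniform(t / 10, 2 + t / 10)                  # relevant, uniform
    x2 = rng.normal(0, np.sqrt((1 + x1) / (2 + x1)))      # relevant, given x1
    x3 = rng.binomial(1, 0.8, m)                          # relevant, Bernoulli(0.8)
    X = [x1, x2, x3]
    # redundant covariates: Gaussian, cov 4*exp(-|t_ij - t_uv|) within subject
    Sigma = 4 * np.exp(-np.abs(np.subtract.outer(t, t)))
    for _ in range(p - 3):
        X.append(rng.multivariate_normal(np.zeros(m), Sigma))
    return np.column_stack(X)
```

Covariates from different subjects are independent, matching the zero covariance for \(i\ne u\) in the display above.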

Table 3 Mean and standard deviation of AIMSE (\(\times 100\)) over 500 replications for Example 2 with \(p=8\)
Table 4 Variable selection results for Example 2 with \(p=8\)

We report the results with \(p=8\) in Tables 3 and 4, where the simulation results obtained by ignoring the correlation are also shown. For comparison, we also report the results with the group LASSO penalty in Tables 3 and 4. In Table 3, the oracle estimator is obtained using only the first three relevant variables. In Table 4, two quantities, the false positive rate (FPR) and the false negative rate (FNR), are used to evaluate the performance of variable selection: FNR denotes the proportion of nonzero varying coefficient functions incorrectly estimated as zero coefficients, while FPR denotes the proportion of zero coefficients incorrectly estimated as nonzero functions, and we present the mean proportions over 500 replications. In addition, following the suggestion of a reviewer, we also report the simulation results for the case of a divergent dimension \(p=[n^{1/2}]\) in the Appendix of the Supplementary Material (Tables 1 and 2).
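The FPR and FNR definitions above can be written compactly as follows (a sketch with a hypothetical helper `fpr_fnr`; `selected` and `truth` are boolean vectors with `True` marking a coefficient function declared, respectively being, nonzero):

```python
import numpy as np

def fpr_fnr(selected, truth):
    """FPR and FNR for one replication (hypothetical helper)."""
    selected, truth = np.asarray(selected), np.asarray(truth)
    fpr = np.mean(selected[~truth])    # zero coefficients estimated as nonzero
    fnr = np.mean(~selected[truth])    # nonzero coefficients estimated as zero
    return fpr, fnr
```

The reported figures would then be these rates averaged over the 500 replications.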

The results in Tables 3 and 4 indicate that the performance of our proposed variable selection approach is satisfactory. Both the nonzero and zero components can be correctly identified in terms of FPR and FNR. For variable selection, the difference between the proposed QIF method and the method assuming working independence is very small for both the homoscedastic and heteroscedastic models, with either the group SCAD or the group LASSO penalty. Moreover, as seen from Table 3, the AIMSEs based on the SCAD-penalized estimates are closer to those of the oracle estimator than the AIMSEs based on the group LASSO penalty. In addition, the QIF-based estimation performs slightly better than working independence even when the working correlation matrix is misspecified. This shows that our penalized QIF estimators can simultaneously estimate and select important variables, gaining estimation accuracy by effectively removing the zero-component variables. These findings confirm our theoretical results and demonstrate the efficiency gains of our proposed approach over ignoring the within-subject correlations.

For the case of the divergent dimension \(p=[n^{1/2}]\), the FNR in Table 2 of the Supplementary Material remains very small, while the FPR for the heteroscedastic model is noticeably larger. Nevertheless, the gains of our method remain obvious in terms of the AIMSEs in Table 1.

5 Real data analysis

In this section, we apply our proposed method to longitudinal AIDS data from the Multi-Center AIDS Cohort Study between 1984 and 1991. This cohort study was first reported in Kaslow et al. (1987), where each individual was scheduled to undergo measurements at semiannual visits. Because many individuals missed some of their scheduled visits and the human immunodeficiency virus (HIV) infections occurred randomly during the study, there were unequal numbers of repeated measurements and different measurement times across individuals. As a subset of the cohort, our analysis focuses on the 283 homosexual men who were infected with HIV during the follow-up period. The main interest in these data is to describe the trend of CD4 percentage depletion over time and to evaluate the effects of cigarette smoking, pre-HIV infection CD4 percentage and age at infection on the CD4 percentage after infection. These data have been studied in several papers, including Huang et al. (2002), Qu and Li (2006), Fan et al. (2007) and Wang et al. (2009). Recently, Wang et al. (2008) considered variable selection in varying coefficient models for these data based on mean regression and quantile regression. However, they did not account for the correlation between measurements on the same individual, which may result in a loss of estimation efficiency.

In the following, our analysis focuses on evaluating the time-dependent effects of smoking status (\(x_{i1}\), equal to 1 for smokers and 0 for nonsmokers), age (\(x_{i2}\)), preCD4 (\(x_{i3}\), pre-HIV infection CD4 percentage) and the interactions among these covariates at different quantile levels. We consider the following varying coefficient model:

$$\begin{aligned} y_{ij}= & {} \alpha _0(t_{ij},\tau )+\alpha _1(t_{ij},\tau )x_{i1} +\alpha _2(t_{ij},\tau )x_{i2}+\alpha _3(t_{ij},\tau )x_{i3}+\alpha _4(t_{ij},\tau )x_{i2}^2\\&\quad +\,\alpha _5(t_{ij},\tau )x_{i3}^2+\alpha _6(t_{ij},\tau )x_{i1}x_{i2}+\alpha _7(t_{ij},\tau )x_{i1}x_{i3} +\alpha _8(t_{ij},\tau )x_{i2}x_{i3}+\epsilon _{ij}(\tau ), \end{aligned}$$

where \(y_{ij}\) is the i-th individual’s CD4 percentage at time \(t_{ij}\) (in years), and \(x_{i2}\) and \(x_{i3}\) are standardized. The baseline function \(\alpha _0(t,\tau )\) represents the \(\tau \)-th quantile of the CD4 percentage t years after infection for a nonsmoker with average preCD4 percentage and average age at HIV infection. Here, we focus on three quantile levels, \(\tau = 0.25\), 0.5 and 0.75, and use the compound symmetry working correlation to fit the data; the results are compared with working independence. The results under the AR(1) working correlation are very similar.

Table 5 Selected components of two different methods at three quantile levels for AIDS data
Fig. 1 Estimated coefficient curves at \(\tau =0.5\). a The baseline coefficient function, b coefficient for preCD4, c coefficient for the interaction of smoking and age and d coefficient for the interaction of smoking and preCD4. The shaded area indicates the 95% point-wise confidence band

Table 5 shows the selected components at different quantile levels. We find that our QIF method selects one or two more variables than the WI approach at quantile levels \(\tau =0.25\) and 0.5. In particular, the smoking effect is selected at \(\tau =0.25\) but not at \(\tau =0.5\) or 0.75, which indicates that smoking affects the CD4 percentage when it is low. The estimated curves at \(\tau =0.5\), together with their 95% point-wise confidence bands, for the four important nonparametric components are shown in Fig. 1. Note that the zero line is not completely contained in any of the 95% point-wise confidence bands, which indicates that the selected variables are reasonable. Figure 1a implies that the baseline coefficient function \(\alpha _0(t,\tau )\) decreases over time. From Fig. 1b, preCD4 has an overall positive effect on the post-infection CD4 percentage, with a rate that decreases at first and increases later. It is noteworthy from Fig. 1c that the interaction between smoking and age is significant, even though the individual variables have no effect on the post-infection CD4 percentage; in particular, older smokers tend to have lower CD4 percentages. Furthermore, Fig. 1d shows that the interaction between smoking and preCD4 increases at first and then decreases, which implies that smokers with high preCD4 tend to have higher CD4 percentages after infection, though this effect diminishes after a period of follow-up. One possible explanation is that a smoking patient with a lower preCD4 may at first quit smoking due to medical concerns and start smoking again once the preCD4 percentage improves.

6 Concluding remarks

In this paper, we have investigated statistical methods and theory for estimation and variable selection in the quantile varying coefficient model for longitudinal data. By using the QIF approach, the within-subject correlation is incorporated, and both the estimation efficiency and the variable selection accuracy are improved compared to quantile regression under working independence. We further propose induced smoothing for estimation, which makes the procedure easy to implement. Simulation studies and real data analysis show that the performance of our proposed approach is encouraging.

Several problems can be investigated in the future. The dimension of the nonparametric components is fixed in this work; it may be extended to the case with a diverging number of covariates. One can also extend this work to the partially linear varying coefficient model, which is more flexible than the VCM. In addition, it is important to distinguish varying from constant coefficients among the nonzero coefficients; for that purpose, one can use a hypothesis testing approach such as that of Wang et al. (2009) after identifying the nonzero coefficients with our method. Finally, since the quantile regression curves are estimated individually, one interesting and important problem is how to avoid crossing of the estimated quantile curves at adjacent quantile levels (Bondell et al. 2010).