1 Introduction

In the last two decades, there has been an increasing interest in regression models for functional variables as more and more data have arisen where the primary unit of observation can be viewed as a curve or in general a function, such as in biology, chemometrics, econometrics, geophysics, the medical sciences, meteorology and neurosciences. As a natural extension of the ordinary regression to the case where predictors include random functions and responses are scalars or functions, functional linear regression analysis provides valuable insights into these problems. The effectively infinite-dimensional character of functional data analysis is a source of many of its differences from more conventional multivariate analysis. The functional linear model has been extensively studied and successfully applied; see Cardot et al. (2003), Ramsay and Silverman (2002, 2005), Cai and Hall (2006), Hall and Horowitz (2007), Reiss and Ogden (2010), Brunel and Roche (2015), Hsing and Eubank (2015), among many others.

It is frequently the case that a response is related to both a vector of finite length and a function-valued random variable as predictor variables. With a square integrable random function X on a compact set \({\mathcal {T}}\) in R and a d-dimensional vector of random variables \(Z=(Z_{1},\ldots ,Z_{d})^{T}\), we suppose that the scalar response Y is linearly related to the predictor variables (X, Z) through the relationship

$$\begin{aligned} Y=\int _{{\mathcal {T}}}\gamma (t)X(t)\mathrm{d}t+Z^{T}\pmb {\beta }_{0}+\varepsilon , \end{aligned}$$
(1.1)

where \(\pmb {\beta }_{0}\) is a \(d\times 1\) vector of regression coefficients for Z, \(\gamma (t)\) is a square integrable function on \({\mathcal {T}}\), and \(\varepsilon \) is a random error. Model (1.1) generalizes both the classical linear regression model and the functional linear regression model, which correspond to the cases \(\gamma (t)=0\) and \(\pmb {\beta }_{0}=0\), respectively. Moreover, the model covers the analysis of covariance with a functional covariate: when the \(Z_{k}\) are scalar-valued indicator variables for subgroups, the model simultaneously represents a functional linear relationship between the scalar response Y and the function-valued random variable X within each group. Zhang et al. (2007) proposed a two-stage functional mixed effects model to deal with measurement error and irregularly spaced time points and estimated the regression coefficient function using a two-stage nonparametric regression calibration method. Shin (2009) and Reiss and Ogden (2010) proposed estimators of \(\pmb {\beta }_{0}\) and \(\gamma (t)\) by generalizing functional principal components estimation for the functional linear regression model, and Shin and Lee (2012) considered prediction of a scalar response based on both a function-valued variable and a finite number of real-valued variables.

In this paper, we propose a new method for estimating the unknown parameters and function in model (1.1). Using functional principal component analysis, the unknown slope function is approximated by an average value that involves the unknown parameters, and the estimators of the unknown parameters are obtained by solving a minimization problem. Although our method clearly differs from those of Shin (2009) and Shin and Lee (2012), simulation and further derivation show that the estimators obtained by the two approaches behave in the same way; our estimators, however, are simpler in expression and require less computation. Under conditions weaker than those of Shin (2009), we derive the asymptotic normality of the estimator of \(\pmb {\beta }_{0}\) and establish the global convergence rate of the estimator of the slope function \(\gamma (t)\). Since our assumptions are weaker than those of Shin (2009), the asymptotic distribution of the estimator of \(\pmb {\beta }_{0}\) differs from that of Shin (2009) and Shin and Lee (2012), and the proofs of our theorems are essentially different from those of Shin (2009). We also establish the convergence rate of the mean squared prediction error for a predictor. Based on the proposed estimation procedure, we further propose a family of variable selection procedures via penalized least squares with concave penalty functions. We show that the proposed penalized regression estimators possess the variable selection consistency and oracle property of Fan and Li (2001).

Variable selection is particularly important when the true underlying model has a sparse representation, since identifying significant predictors enhances the prediction performance of the fitted model. A penalty function generally facilitates variable selection in regression models, and various penalty functions have been used in the literature; well-known examples include bridge regression (Frank and Friedman 1993), the LASSO (Tibshirani 1996), the SCAD (Fan and Li 2001), the adaptive LASSO (Zou 2006) and the MCP (Zhang 2010). Liang and Li (2009) considered variable selection for partially linear models with measurement errors, Wang and Wang (2014) proposed adaptive LASSO estimators for ultrahigh-dimensional generalized linear models, and Aneirosa et al. (2015) investigated variable selection in partial linear regression with a functional covariate. Fan et al. (2014) studied the oracle optimality of folded concave penalized estimation.

The paper is organized as follows. Section 2 describes the estimation method and studies its asymptotic properties. Section 3 investigates an adaptive variable selection method and its asymptotic properties. Section 4 presents the finite sample behavior of the estimators. A real data example on real estate prices is given in Sect. 5. All proofs are relegated to the Appendix.

2 Estimation method and asymptotic results

Let Y be a real-valued random variable defined on a probability space \((\Omega , {\mathcal {B}}, P)\). Let Z be a d-dimensional vector of random variables with finite second moments, and let \(\{X(t): t\in {\mathcal {T}}\}\) be a zero-mean and second-order (i.e., \(EX(t)^{2}<\infty \) for all \(t\in {\mathcal {T}})\) stochastic process defined on \((\Omega , {\mathcal {B}}, P)\) with sample paths in \(L_{2}({\mathcal {T}})\), the set of all square integrable functions on \({\mathcal {T}}\), where \({\mathcal {T}}\) is a bounded closed interval. \(\varepsilon \) is a random error with mean zero and is independent of (X, Z). Let \(\langle \cdot ,\cdot \rangle \) and \(\Vert \cdot \Vert \) represent, respectively, the \(L_{2}({\mathcal {T}})\) inner product and norm. Denote the covariance function of the process X(t) by \(K(s,t)=cov(X(s),X(t))\). We suppose that K(s, t) is positive definite, in which case it admits a spectral decomposition in terms of strictly positive eigenvalues \(\lambda _{j}\),

$$\begin{aligned} K(s, t)=\sum _{j=1}^{\infty }\lambda _{j}\phi _{j}(s)\phi _{j}(t), \quad s,t\in {\mathcal {T}}, \end{aligned}$$
(2.1)

where \((\lambda _{j},\phi _{j})\) are (eigenvalue, eigenfunction) pairs for the linear operator with kernel K, the eigenvalues are ordered so that \(\lambda _{1}>\lambda _{2}>\cdots \) and the functions \(\phi _{1},\phi _{2},\ldots \) form an orthonormal basis for \(L_{2}({\mathcal {T}})\). This leads to the Karhunen–Loève representation

$$\begin{aligned} X(t)=\sum _{j=1}^{\infty }\xi _{j}\phi _{j}(t), \end{aligned}$$

where the \(\xi _{j}=\int _{{\mathcal {T}}}X(t)\phi _{j}(t)\mathrm{d}t\) are uncorrelated random variables with mean 0 and variance \(E\xi _{j}^{2}=\lambda _{j}\). Let \(\gamma (t)=\sum _{j=1}^{\infty }\gamma _{j}\phi _{j}(t)\); then model (1.1) can be written as

$$\begin{aligned} Y=\sum _{j=1}^{\infty }\gamma _{j}\xi _{j}+Z^{T}\pmb {\beta }_{0}+\varepsilon . \end{aligned}$$
(2.2)

By (2.2), we have

$$\begin{aligned} \gamma _{j}=E\{[Y-Z^{T}\pmb {\beta }_{0}]\xi _{j}\}/\lambda _{j}. \end{aligned}$$
(2.3)

Let \((X_{i}(t),Z_{i}, Y_{i}), i=1,\ldots ,n\), be independent realizations of (X(t), Z, Y) generated by model (1.1). Empirical versions of K and of its spectral decomposition are

$$\begin{aligned} \hat{K}(s,t)=\frac{1}{n}\sum _{i=1}^{n}X_{i}(s)X_{i}(t)=\sum _{j=1}^{\infty }\hat{\lambda }_{j}\hat{\phi }_{j}(s)\hat{\phi }_{j}(t), \quad s,t\in {\mathcal {T}}. \end{aligned}$$

Analogously to the case of K, \((\hat{\lambda }_{j},\hat{\phi }_{j})\) are (eigenvalue, eigenfunction) pairs for the linear operator with kernel \(\hat{K}\), ordered such that \(\hat{\lambda }_{1}\ge \hat{\lambda }_{2}\ge \cdots \ge 0\). We take \((\hat{\lambda }_{j},\hat{\phi }_{j})\) and \(\hat{\xi } _{ij}=\langle X_{i},\hat{\phi }_{j}\rangle \) to be the estimators of \((\lambda _{j},\phi _{j})\) and \(\xi _{ij}=\langle X_{i},\phi _{j}\rangle ,\) respectively, and set
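As a concrete illustration, when the curves are recorded on a common equally spaced grid, the empirical eigenpairs and scores can be obtained from an eigendecomposition of the discretized covariance operator. The following is a minimal numerical sketch, not part of the original derivation; the function name, the simple Riemann-sum quadrature and the grid layout are our own assumptions.

```python
import numpy as np

def empirical_fpca(X, t):
    """Eigenpairs of the empirical covariance operator and FPC scores.

    X : (n, p) array with X[i, k] = X_i(t_k) on a common, equally spaced grid t.
    Returns eigenvalues lam_hat (p,), eigenfunctions phi_hat (p, p) as columns,
    and scores xi_hat (n, p) with xi_hat[i, j] = <X_i, phi_hat_j>.
    """
    n, p = X.shape
    h = t[1] - t[0]                            # quadrature weight of the grid
    K_hat = X.T @ X / n                        # K_hat(t_k, t_l) = n^{-1} sum_i X_i(t_k) X_i(t_l)
    lam_hat, vecs = np.linalg.eigh(K_hat * h)  # discretized integral operator
    order = np.argsort(lam_hat)[::-1]          # sort eigenvalues in decreasing order
    lam_hat, vecs = lam_hat[order], vecs[:, order]
    phi_hat = vecs / np.sqrt(h)                # rescale so that int phi_hat_j(t)^2 dt = 1
    xi_hat = X @ phi_hat * h                   # numerical integration for the scores
    return lam_hat, phi_hat, xi_hat
```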

$$\begin{aligned} \tilde{\gamma }_{j}=\frac{1}{n\hat{\lambda }_{j}}\sum _{i=1}^{n}\left( Y_{i}-Z_{i}^{T}\pmb {\beta }_{0}\right) \hat{\xi }_{ij}. \end{aligned}$$
(2.4)

We use \(\sum _{j=1}^{m}\tilde{\gamma }_{j}\hat{\xi }_{j}\) to approximate \(\sum _{j=1}^{\infty }\gamma _{j}\xi _{j}\) in (2.2). Combining (2.2) and (2.4), we then solve the following minimization problem

$$\begin{aligned} \min _{\pmb {\beta }}\sum _{i=1}^{n}\left\{ Y_{i}-\sum _{j=1}^{m}\frac{\hat{\xi }_{ij}}{n\hat{\lambda }_{j}}\sum _{l=1}^{n}\left( Y_{l} -Z_{l}^{T}\pmb {\beta }\right) \hat{\xi }_{lj} -Z_{i}^{T}\pmb {\beta }\right\} ^{2} \end{aligned}$$
(2.5)

to obtain the estimator of \(\pmb {\beta }_{0}\). Define \(\tilde{\xi }_{li}=\sum _{j=1}^{m}\frac{\hat{\xi }_{lj}\hat{\xi }_{ij}}{\hat{\lambda }_{j}}\), \(\tilde{Y}_{i}=Y_{i}-\frac{1}{n}\sum _{l=1}^{n}Y_{l}\tilde{\xi }_{li}\) and \(\tilde{Z}_{i}=Z_{i}-\frac{1}{n}\sum _{l=1}^{n}Z_{l}\tilde{\xi }_{li}.\) Then, (2.5) can be written as

$$\begin{aligned} \min _{\pmb {\beta }}\sum _{i=1}^{n}\left( \tilde{Y}_{i}-\tilde{Z}_{i}^{T}\pmb {\beta }\right) ^{2}. \end{aligned}$$
(2.6)

Let \(\tilde{Y}=(\tilde{Y}_{1},\ldots ,\tilde{Y}_{n})^{T}\) and \(\tilde{Z}=(\tilde{Z}_{1},\ldots ,\tilde{Z}_{n})^{T}\). Then the estimator \(\hat{\pmb {\beta }}\) of \(\pmb {\beta }_{0}\) is given by

$$\begin{aligned} \hat{\pmb {\beta }}=(\tilde{Z}^{T}\tilde{Z})^{-1}\tilde{Z}^{T}\tilde{Y}. \end{aligned}$$
(2.7)

The estimator of \(\gamma (t)\) is given by \(\hat{\gamma }(t)=\sum _{j=1}^{m}\hat{\gamma }_{j}\hat{\phi }_{j}(t)\) with

$$\begin{aligned} \hat{\gamma }_{j}=\frac{1}{n\hat{\lambda }_{j}}\sum _{i=1}^{n}\left( Y_{i}-Z_{i}^{T}\hat{\pmb {\beta }}\right) \hat{\xi }_{ij}. \end{aligned}$$
(2.8)
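Given the estimated eigenpairs and scores, computing (2.7) and (2.8) amounts to an ordinary least squares fit on the transformed data \((\tilde{Y}_{i},\tilde{Z}_{i})\). The following is a minimal sketch, assuming the output of the `empirical_fpca` sketch above and a given truncation level m; all names are our own and not part of the paper.

```python
import numpy as np

def fit_pflm(Y, Z, xi_hat, lam_hat, m):
    """Estimators (2.7) and (2.8) of beta_0 and of the first m Fourier coefficients of gamma.

    Y : (n,) responses; Z : (n, d) scalar covariates;
    xi_hat : (n, J) estimated FPC scores; lam_hat : (J,) estimated eigenvalues.
    """
    n = len(Y)
    S = xi_hat[:, :m]                              # scores of the first m components
    P = (S / lam_hat[:m]) @ S.T / n                # P[i, l] = tilde_xi_{li} / n
    Y_tilde = Y - P @ Y                            # tilde Y_i
    Z_tilde = Z - P @ Z                            # tilde Z_i
    beta_hat, *_ = np.linalg.lstsq(Z_tilde, Y_tilde, rcond=None)   # (2.7)
    gamma_hat = S.T @ (Y - Z @ beta_hat) / (n * lam_hat[:m])       # (2.8)
    return beta_hat, gamma_hat
```

On the observation grid, the slope estimate \(\hat{\gamma }(t)\) can then be recovered as `phi_hat[:, :m] @ gamma_hat`.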

To implement our estimation method, we need to choose m. The value of m can be selected by leave-one-curve-out cross-validation of the prediction error, with the CV function defined as

$$\begin{aligned} \mathrm{CV}(m)=\sum _{i=1}^{n} \left( Y_{i} -\sum _{j=1}^{m}\hat{\gamma }_{j}^{-i}\hat{\xi }_{ij}-Z_{i}^{T}\hat{\pmb {\beta }}^{-i}\right) ^{2}, \end{aligned}$$

where \(\hat{\gamma }_{j}^{-i},j=1,\ldots ,m\), and \(\hat{\pmb {\beta }}^{-i}\) are computed after removing \((X_{i}, Z_{i}, Y_{i})\). As an alternative to cross-validation, m can also be chosen by the BIC information criterion. The BIC criterion as a function of m is given by

$$\begin{aligned} \mathrm{BIC}(m)=\log \left\{ \sum _{i=1}^{n}\left( Y_{i} -\sum _{j=1}^{m}\hat{\gamma }_{j}\hat{\xi }_{ij}-Z_{i}^{T}\hat{\pmb {\beta }}\right) ^{2}\right\} +\frac{\log n}{n}(m+1). \end{aligned}$$

Large values of BIC indicate poor fits, so we choose the value of m that minimizes \(\mathrm{BIC}(m)\).
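A minimal sketch of selecting m by minimizing \(\mathrm{BIC}(m)\), assuming the hypothetical helper `fit_pflm` from the sketch above:

```python
import numpy as np

def select_m_bic(Y, Z, xi_hat, lam_hat, m_max):
    """Choose the truncation level m by minimizing BIC(m); m_max should be modest
    relative to n so that lam_hat[:m] stays bounded away from zero."""
    n = len(Y)
    bic = []
    for m in range(1, m_max + 1):
        beta_hat, gamma_hat = fit_pflm(Y, Z, xi_hat, lam_hat, m)
        resid = Y - xi_hat[:, :m] @ gamma_hat - Z @ beta_hat
        bic.append(np.log(np.sum(resid ** 2)) + np.log(n) / n * (m + 1))
    return int(np.argmin(bic)) + 1        # the m with the smallest BIC value
```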

Remark 2.1

Noting that \(\hat{\xi } _{ij}=\langle X_{i},\hat{\phi }_{j}\rangle \), it can be easily shown that our estimators have the same performance as those given in Shin (2009) and Shin and Lee (2012); however, our estimators are simpler in expression and require less computation.

In the following, we derive asymptotic normality of the estimator \(\hat{\pmb {\beta }}\) and the rate of convergence for the estimator \(\hat{\gamma }(t)\). We make the following assumptions.

Assumption 1

X has finite fourth moment, in that \(\int _{{\mathcal {T}}}E\{X^{4}(t)\}\mathrm{d}t<\infty \), and for each j, \(E(\xi _{j}^{4})<C_{1}\lambda _{j}^{2}\) for some constant \(C_{1}\).

Assumption 2

There exists a convex function \(\varphi \) defined on the interval [0, 1] such that \(\varphi (0) = 0\) and \(\lambda _{j}=\varphi (1/j)\) for \(j\ge 1\).

Assumption 3

For Fourier coefficients \(\gamma _{j}\), there exist constants \(C_{2}>0\) and \(\delta >3/2\) such that \(|\gamma _{j}|\le C_{2}j^{-\delta }\) for all \(j\ge 1\).

Assumption 4

\(m\rightarrow \infty \) and \(n^{-1/2}m\lambda _{m}^{-1}\rightarrow 0\).

Assumption 5

\(E(\Vert Z\Vert ^{4})<+\,\infty \).

Assumptions 1 and 3 are standard conditions for functional linear models; see, e.g., Cai and Hall (2006) and Hall and Horowitz (2007). Assumption 2 is slightly less restrictive than (3.2) of Hall and Horowitz (2007). Assumption 4 can be easily verified and will be further discussed below.

Remark 2.2

Assumptions 2 and 4 are weaker than the assumptions for \(\lambda _{j}\) and m, respectively, in Shin (2009) and Shin and Lee (2012).

We first establish the asymptotic distribution of the estimator \(\hat{\pmb {\beta }}\). To derive the asymptotic normality of the estimator \(\hat{\pmb {\beta }}\), we need to adjust for the dependence of \(Z=(Z_{1},\ldots ,Z_{d})^{T}\) and X(t), which is a common complication in semiparametric models. Let \({\mathcal {G}}\) denote the class of the random variables such that \(G\in {\mathcal {G}}\) if \(G=\sum _{j=1}^{\infty }g_{j}\xi _{j}\) and \(|g_{j}|\le C_{3}j^{-\delta }\) for all \(j\ge 1 \), where \(\delta \) is defined in Assumption 3 and \(C_{3}>0\) is a constant. Note that \({\mathcal {G}}\) is related to the first term on the right side of (2.2). Denote \(G_{r} =\sum _{j=1}^{\infty }g_{rj}\xi _{j}\). Let

$$\begin{aligned} G_{r}^{*}=\text{ arginf }_{G_{r}\in {\mathcal {G}}}E \left[ \left( Z_{r}-\sum _{j=1}^{\infty }g_{rj}\xi _{j}\right) ^{2}\right] . \end{aligned}$$

Since

$$\begin{aligned} E\left[ \left( Z_{r}-\sum _{j=1}^{\infty }g_{rj}\xi _{j}\right) ^{2}\right] =E[(Z_{r}-E(Z_{r}|X))^{2}] +E\left[ \left( E(Z_{r}|X)-\sum _{j=1}^{\infty }g_{rj}\xi _{j}\right) ^{2}\right] , \end{aligned}$$

we have

$$\begin{aligned} G_{r}^{*} =\text{ arginf }_{G_{r}\in {\mathcal {G}}} E\left[ \left( E(Z_{r}|X)-\sum _{j=1}^{\infty }g_{rj}\xi _{j}\right) ^{2}\right] . \end{aligned}$$

Thus, \(G_{r}^{*}\) is the projection of \(E(Z_{r}|X)\) onto the space \({\mathcal {G}}\); in other words, \(G_{r}^{*}\) is the element of \({\mathcal {G}}\) that is closest to \(E(Z_{r}|X)\) among all the random variables in \({\mathcal {G}}\). Let \(H_{r}=Z_{r}-G_{r}^{*}\) for \(r=1,\ldots ,d\), and \(H=(H_{1},\ldots ,H_{d})^{T}\). We then have the following results.

Theorem 2.1

Suppose that Assumptions 1–5 hold and that \(\Omega =E(HH^{T})\) is invertible. Then

$$\begin{aligned} \sqrt{n}(\hat{\pmb {\beta }}-\pmb {\beta }_{0})\rightarrow _{d}N(0,\Omega ^{-1}\sigma ^{2}), \end{aligned}$$
(2.9)

where \(\rightarrow _{d}\) means convergence in distribution.

Remark 2.3

When moving from the functional linear model to the partial functional linear model, the key to deriving the asymptotic normality of the estimator \(\hat{\pmb {\beta }}\) is to handle the dependence between the vector Z and X(t). In our analysis, each \(Z_{r}\), \(r=1,\ldots ,d\), is decomposed into two parts, \(G_{r}^{*}=\sum _{j=1}^{\infty }g_{rj}^{*}\xi _{j}\) and \(H_{r}\). Consequently, (2.2) can be written as

$$\begin{aligned} Y=\sum _{j=1}^{\infty } \left( \gamma _{j}+\sum _{r=1}^{d}g_{rj}^{*}\beta _{0r}\right) \xi _{j}+H^{T}\pmb {\beta }_{0}+\varepsilon , \end{aligned}$$

where \(\pmb {\beta }_{0}=(\beta _{01},\ldots ,\beta _{0d})^{T}\). If \(Z_{r}=\sum _{j=1}^{\infty }\tilde{g}_{rj}\xi _{j}+V_{r}\) and \(V_{r}\) is independent of X(t), then \(G_{r}^{*}=\sum _{j=1}^{\infty }\tilde{g}_{rj}\xi _{j}\) and \(H_{r}=V_{r}\). If \(Z_{r}\) is independent of X(t), then \(G_{r}^{*}=0\) and \(H_{r}=Z_{r}\). If \(E(Z_{r}|X(t))=\sum _{j=1}^{\infty }\bar{g}_{rj}\xi _{j}\), then \(G_{r}^{*}=\sum _{j=1}^{\infty }\bar{g}_{rj}\xi _{j}\) and \(H_{r}=Z_{r}-G_{r}^{*}\). In Shin (2009) and Shin and Lee (2012), it is assumed that \(E(Z_{r}|X(t))=\sum _{j=1}^{\infty }\lambda _{j}^{-1}\langle K_{Z_{r}X},\phi _{j}\rangle \xi _{j}\), where \(K_{Z_{r}X}=cov(Z_{r},X)\) for \(r=1,\ldots ,d\). In this case, \(G_{r}^{*}=\sum _{j=1}^{\infty }\lambda _{j}^{-1}\langle K_{Z_{r}X},\phi _{j}\rangle \xi _{j}\) and \(H_{r}=Z_{r}-G_{r}^{*}\), and the result of our Theorem 2.1 is the same as that of Theorem 3.1 in Shin (2009). Hence, Theorem 3.1 of Shin (2009) is a special case of our Theorem 2.1.

Next, we establish the convergence rate of the estimator \(\hat{\gamma }(t)\).

Theorem 2.2

Assume that Assumptions 1–5 hold and that \(n^{-1}m^{2}\lambda _{m} ^{-1}\log m\rightarrow 0\). Then

$$\begin{aligned} \int _{{\mathcal {T}}}\left\{ \hat{\gamma }(t)-\gamma (t)\right\} ^{2}\mathrm{d}t=O_{p} \left( \frac{m}{n\lambda _{m}}+\frac{m}{n^{2}\lambda _{m}^{2}}\sum _{j=1}^{m}\frac{j^{3}\gamma _{j}^{2}}{\lambda _{j}^{2}}+\frac{1}{n\lambda _{m}}\sum _{j=1}^{m}\frac{\gamma _{j}^{2}}{\lambda _{j}}+m^{-2\delta +1}\right) . \end{aligned}$$
(2.10)

If \(\lambda _{j}\sim j^{-\tau }\), \(\tau >1\), \(m\sim n^{1/(\tau +2\delta )}\), \(\delta >2\) and \(\delta >1+\tau /2\), then \(\sum _{j=1}^{m}j^{3}\gamma _{j} ^{2}\lambda _{j}^{-2}\le C_{4}(\log m+m^{2\tau +4-2\delta })\) and \(\sum _{j=1}^{m}\gamma _{j}^{2}\lambda _{j}^{-1}<\infty \), where \(C_{4}\) is a positive constant. We then have the following corollary.

Corollary 2.1

Under Assumptions 1–5, if \(\lambda _{j}\sim j^{-\tau }\), \(\tau >1\), \(m\sim n^{1/(\tau +2\delta )}\) and \(\delta >\min (2,1+\tau /2)\), then it holds that

$$\begin{aligned} \int _{{\mathcal {T}}}\left\{ \hat{\gamma }(t)-\gamma (t)\right\} ^{2}\mathrm{d}t=O_{p}\left( n^{-(2\delta -1)/(\tau +2\delta )}\right) . \end{aligned}$$
(2.11)

The global convergence result (2.11) indicates that the estimator \(\hat{\gamma }(t)\) attains the same convergence rate as the estimators of Hall and Horowitz (2007), which is optimal in the minimax sense.

Let \({\mathcal {S}}=\{(Z_{i},X_{i},Y_{i}): 1\le i\le n\}\). In the following, for a new pair of predictor variables \((Z_{n+1}, X_{n+1})\) taken from the same population as the data and independent of the data, we derive the convergence rate of the mean squared prediction error (MSPE) given by

$$\begin{aligned} \mathrm{MSPE}=E\left( \left[ \left( \int _{{\mathcal {T}}}\hat{\gamma }(t)X_{n+1}(t)\mathrm{d}t+Z_{n+1}^{T}\hat{\pmb {\beta }}\right) -\left( \int _{{\mathcal {T}}}\gamma (t)X_{n+1}(t)\mathrm{d}t+Z_{n+1}^{T}\pmb {\beta }_{0}\right) \right] ^{2}\,\Big |\,{\mathcal {S}}\right) . \end{aligned}$$

Theorem 2.3

Under Assumptions 1–3 and 5, if \(\lambda _{j}\sim j^{-\tau }\), \(\tau >1\), \(m\sim n^{1/(\tau +2\delta )}\) and \(\delta >\min (2,1+\tau /2)\), then

$$\begin{aligned} \mathrm{MSPE}=O_{p}(n^{-(\tau +2\delta -1)/(\tau +2\delta )}). \end{aligned}$$
(2.12)

Remark 2.4

In practical applications, X(t) is only discretely observed. Without loss of generality, suppose \({\mathcal {T}}=[0,1]\) and, for each \(i=1,\ldots ,n\), \(X_{i}(t)\) is observed at \(n_{i}\) discrete points \(0=t_{i1}<\cdots <t_{in_{i}}=1\). Typically, it is also assumed that \(\max _{i}\max _{1\le j\le n_{i}-1}(t_{i(j+1)}-t_{ij})\rightarrow 0\) as \(n\rightarrow \infty \). Based on the discrete observations, for each \(i=1,\ldots ,n\), linear or spline interpolation can be used to estimate \(X_{i}(t)\). For example, we can use the linear interpolation function

$$\begin{aligned} \hat{X}_{i}(t)&=X_{i}(t_{ij})+\frac{(X_{i}(t_{i(j+1)})-X_{i}(t_{ij}))}{t_{i(j+1)}-t_{ij}}(t-t_{ij}), \\&\quad \text{ for } \ t\in [t_{ij}, t_{i(j+1)}], j=0,\ldots ,n_{i}-1 \end{aligned}$$

as the estimator of \(X_{i}(t)\); a short numerical sketch is given below. It is necessary to point out that if the \(X_{i}(t)\), \(i=1,\ldots ,n\), are replaced by \(\hat{X}_{i}(t)\), \(i=1,\ldots ,n\), the conclusions of Theorems 2.1–2.3 no longer hold. We note that it is difficult to establish the related asymptotic properties with our current approach, and further research is expected.
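The interpolation step itself is elementary; the sketch below uses `numpy.interp`, which implements exactly the piecewise-linear formula displayed above. The observation times and values are simulated here purely for illustration.

```python
import numpy as np

def interpolate_curve(t_obs, x_obs, t_grid):
    """Piecewise-linear reconstruction hat{X}_i of a discretely observed curve.

    t_obs  : increasing observation times with t_obs[0] = 0 and t_obs[-1] = 1;
    x_obs  : observed values X_i(t_obs);
    t_grid : common evaluation grid in [0, 1].
    """
    return np.interp(t_grid, t_obs, x_obs)

# Example: one curve observed at 8 irregular time points, evaluated on 101 grid points.
rng = np.random.default_rng(1)
t_obs = np.sort(np.concatenate(([0.0, 1.0], rng.uniform(0.0, 1.0, 6))))
x_obs = np.sin(2 * np.pi * t_obs) + 0.1 * rng.standard_normal(t_obs.size)
x_hat = interpolate_curve(t_obs, x_obs, np.linspace(0.0, 1.0, 101))
```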

3 Variable selection for partial functional linear model

In the variable selection problem, it is assumed that some components of \(\pmb {\beta }_{0}\) in model (1.1) are equal to zero. The goal is to identify and estimate the subset model. It has been argued that folded concave penalties are preferable to convex penalties such as the \(L_{1}\)-penalty in terms of both model-estimation accuracy and variable selection consistency (Lv and Fan 2009; Fan and Lv 2011). Let \(p_{\nu _{n}}(|u|)=p_{a,\nu _{n}}(|u|)\) be general folded concave penalty functions defined on \(u\in (-\infty ,+\infty )\) satisfying

(a) The \(p_{\nu _{n}}(u)\) are increasing and concave in \(u\in [0,+\infty )\);

(b) The \(p_{\nu _{n}}(u)\) are differentiable in \(u\in (0,+\infty )\) with \(p_{\nu _{n}}^{\prime }(0):=p_{\nu _{n}}^{\prime }(0+)\ge a_{1} \nu _{n}\), \(p_{\nu _{n}}^{\prime }(u)\ge a_{1}\nu _{n}\) for \(u\in (0,a_{2}\nu _{n}]\), \(p_{\nu _{n}}^{\prime }(u)\le a_{3} \nu _{n}\) for \(u\in [0,+\infty )\), and \(p_{\nu _{n}}^{\prime }(u)=0\) for \(u\in [a\nu _{n},+\infty )\) with a prespecified constant \(a>a_{2}\), where \(a_{1}\), \(a_{2}\) and \(a_{3}\) are fixed positive constants.

The above family of general folded concave penalties contains several popular penalties including the SCAD penalty (Fan and Li 2001), the derivative of which is given by

$$\begin{aligned} p_{\nu _{n}}^{\prime }(u)=\nu _{n}I_{\{u\le \nu _{n}\}} +\frac{(a\nu _{n}-u)_{+}}{a-1}I_{\{u>\nu _{n}\}} \quad \text{ for } \text{ some }\ a>2, \end{aligned}$$

and the MCP penalty (Zhang 2010), the derivative of which is given by

$$\begin{aligned} p_{\nu _{n}}^{\prime }(u)=\left( \nu _{n}-\frac{u}{a}\right) _{+} \quad \text{ for } \text{ some }\ a>1. \end{aligned}$$

It is easy to see that \(a_{1}=a_{2}=a_{3}=1\) for the SCAD, and \(a_{1} =1-a^{-1}\), \(a_{2}=a_{3}=1\) for the MCP.
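For concreteness, the two penalty derivatives can be coded directly. The following is only an illustrative sketch (vectorized over u, with our own default values of a):

```python
import numpy as np

def scad_deriv(u, nu, a=3.7):
    """Derivative p'_{nu}(u) of the SCAD penalty (Fan and Li 2001); requires a > 2."""
    u = np.abs(u)
    return nu * (u <= nu) + np.maximum(a * nu - u, 0.0) / (a - 1) * (u > nu)

def mcp_deriv(u, nu, a=3.0):
    """Derivative p'_{nu}(u) of the MCP penalty (Zhang 2010); requires a > 1."""
    return np.maximum(nu - np.abs(u) / a, 0.0)
```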

Based on the above analysis, we define a penalized least squares estimator of \(\pmb {\beta }_{0}\) as

$$\begin{aligned} \hat{\pmb {\beta }}_\mathrm{PLS}=\arg \min _{\pmb {\beta }}(\tilde{Y}-\tilde{Z}\pmb {\beta })^{T}(\tilde{Y}-\tilde{Z}\pmb {\beta })+n\sum _{k=1}^{d}p_{\nu _{n}}^{\prime }(|\beta _{k}^{(0)}|)|\beta _{k}|, \end{aligned}$$
(3.1)

where \({\pmb {\beta }}^{(0)}=(\beta _{1}^{(0)},\ldots ,\beta _{d}^{(0)})^{T}\) is an initial estimator of \(\pmb {\beta }_{0}\). For example, \({\pmb {\beta }}^{(0)}\) can be obtained from (2.7) in Sect. 2.
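Because the penalty in (3.1) is linear in \(|\beta _{k}|\) with weights fixed at the initial estimator, the optimization is a weighted lasso and can be solved, for example, by coordinate descent with soft-thresholding. The following is a minimal sketch of one such solver; it is our own illustration rather than the algorithm used in the paper, and `weights` would hold \(p_{\nu _{n}}^{\prime }(|\beta _{k}^{(0)}|)\), computed, e.g., with the `scad_deriv` sketch above.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def penalized_beta(Y_tilde, Z_tilde, weights, n_iter=500, tol=1e-8):
    """Coordinate descent for (3.1):
    min_beta ||Y_tilde - Z_tilde beta||^2 + n * sum_k weights[k] * |beta_k|,
    where Y_tilde, Z_tilde are the transformed data from Sect. 2 and the
    weights are fixed at an initial estimator beta^(0).
    """
    n, d = Z_tilde.shape
    beta = np.zeros(d)
    col_sq = np.sum(Z_tilde ** 2, axis=0)       # squared column norms of Z_tilde
    for _ in range(n_iter):
        beta_old = beta.copy()
        for k in range(d):
            r_k = Y_tilde - Z_tilde @ beta + Z_tilde[:, k] * beta[k]   # partial residual
            beta[k] = soft_threshold(Z_tilde[:, k] @ r_k, n * weights[k] / 2) / col_sq[k]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```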

In the following, we show that the penalized least squares estimator defined by (3.1) has the oracle property (Fan and Li 2001). Without loss of generality, let \(\pmb {\beta }=({\pmb {\beta }}_{1}^{T},\pmb {\beta }_{2}^{T})^{T}\), where \(\pmb {\beta }_{1}\in \mathbf {R}^{d_{1}}\) and \(\pmb {\beta }_{2}\in \mathbf {R}^{d-d_{1}}\). The vector of true parameters is denoted by \(\pmb {\beta }_{0}=(\pmb {\beta }_{01} ^{T},\pmb {\beta }_{02}^{T})^{T}\) with each element of \(\pmb {\beta }_{01}\) being nonzero and \(\pmb {\beta }_{02}=0\).

Theorem 3.1

Suppose that the conditions of Theorem 2.1 hold. Let \(p_{\nu _{n}}(\cdot )\) be general folded concave penalty functions satisfying assumptions (a) and (b) above and \({\pmb {\beta }}^{(0)}\) be the estimator defined by (2.7). If \(\nu _{n}\rightarrow 0\) and \(\sqrt{n}\nu _{n}\rightarrow \infty \) as \(n\rightarrow \infty \), then the penalized least squares estimator \(\hat{\pmb {\beta }}_\mathrm{PLS} =(\hat{\pmb {\beta }}_{\mathrm{PLS}1}^{T},\hat{\pmb {\beta }}_{\mathrm{PLS}2}^{T})^{T}\) defined by (3.1) satisfies

(1) Sparsity: \(P(\hat{\pmb {\beta }}_{\mathrm{PLS}2}=0)\rightarrow 1.\)

(2) Asymptotic normality:

    $$\begin{aligned} \sqrt{n}(\hat{\pmb {\beta }}_{\mathrm{PLS}1}-\pmb {\beta }_{01})\rightarrow _{d}N(0,\Omega _{1}^{-1}\sigma ^{2}), \end{aligned}$$
    (3.2)

    where \(\Omega _{1}=E[(H_{1},\ldots ,H_{d_{1}})^{T}(H_{1},\ldots ,H_{d_{1}})]\).

Let

$$\begin{aligned} \hat{\gamma }_{\mathrm{PLSj}}=\frac{1}{n\hat{\lambda }_{j}}\sum _{i=1}^{n} \left( Y_{i}-Z_{i}^{T}\hat{\pmb {\beta }}_\mathrm{PLS}\right) \hat{\xi }_{ij} \end{aligned}$$
(3.3)

and \(\hat{\gamma }_\mathrm{PLS}(t)=\sum _{j=1}^{m}\hat{\gamma }_{PLSj}\hat{\phi }_{j}(t)\). We then have the following theorem.

Theorem 3.2

(1) Under the assumptions of Theorems 3.1 and 2.2, the estimator \(\hat{\gamma }_\mathrm{PLS}(t)\) satisfies the conclusions of Theorem 2.2.

(2) Under the assumptions of Theorems 3.1 and 2.3, the conclusions of Theorem 2.3 hold.

4 Simulation results

Since our estimators have the same performance as those of Shin (2009) and Shin and Lee (2012), in this section we only investigate the finite sample performance of the penalized least squares estimators proposed in Sect. 3 through a Monte Carlo study. The data sets were generated from the following model

$$\begin{aligned} Y_{i}=\int _{{\mathcal {T}}}\gamma (t)X_{i}(t)\mathrm{d}t+Z_{i}^{T}\pmb {\beta }_{0}+\varepsilon _{i}, \end{aligned}$$
(4.1)

with \({\mathcal {T}}=[0,1]\) and \(\pmb {\beta }_{0}=(2,0,1.5,0,0.3)^{T}\). We took \(\gamma (t)=\sum _{j=1}^{50}\gamma _{j}\phi _{j}(t)\) and \(X_{i}(t)=\sum _{j=1}^{50}\xi _{ij}\phi _{j}(t)\), where \(\gamma _{1}=0.3\) and \(\gamma _{j}=4(-1)^{j+1}j^{-\delta },j\ge 2\); \(\phi _{1}(t)\equiv 1\) and \(\phi _{j}(t)=2^{1/2}\cos ((j-1)\pi t),j\ge 2\); the \(\xi _{ij}\)’s were independent \(N(0, \lambda _{j})\) variables. Conditional on the \(\xi _{ij}\), we let \(Z_{i}=(Z_{i1},\ldots ,Z_{i5})^{T}\) follow a multivariate normal distribution with mean vector \(((1+\lambda _{1})^{-1/2}\xi _{i1},\ldots , (1+\lambda _{5})^{-1/2}\xi _{i5})^{T}\) and variance-covariance matrix \(V=(v_{kl})\), where \(v_{kk}=(1+\lambda _{k})^{-1}\) and \(v_{kl}= 0.7((1+\lambda _{k})(1+\lambda _{l}))^{-1/2}\) for \(k\ne l\), \(k,l =1,\ldots ,5\); marginally, \(Z_{i}\) then has a multivariate normal distribution with zero-mean vector and a variance-covariance matrix whose diagonal elements are 1 and whose off-diagonal elements are \(v_{kl}\). The errors \(\varepsilon _{i}\) were normally distributed with mean 0 and standard deviation 0.5. Similar to Shin and Lee (2012), we used 4 different sets of eigenvalues \(\{\lambda _{j}\}\), listed below; a data-generation sketch for Setting 1 follows the list. In two of the settings, \(\lambda _{j}=j^{-\tau }\) with different values of \(\tau \). In the other two settings, the eigenvalues are “closely spaced” as in Hall and Horowitz (2007): \(\lambda _{1}= 1\), \(\lambda _{j} =0.2^{2}(1-0.0001j)^{2}\) if \(2 \le j\le 4\), and \(\lambda _{5j+k}= 0.2^{2}\{(5j)^{-\tau /2}-0.0001k\}^{2}\) for \(j\ge 1\) and \(0\le k \le 4\).

1. Set \(\tau =1.1\) and \(\delta =2\) with the well-spaced eigenvalues.

2. Set \(\tau =1.1\) and \(\delta =2\) with the closely spaced eigenvalues.

3. Set \(\tau =3\) and \(\delta =2\) with the well-spaced eigenvalues.

4. Set \(\tau =3\) and \(\delta =2\) with the closely spaced eigenvalues.
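For illustration, one replication of Setting 1 could be generated as in the following sketch. It is our own reconstruction of the design described above; the seed, grid and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, tau, delta = 100, 50, 1.1, 2.0                  # Setting 1: well-spaced eigenvalues
jj = np.arange(1, J + 1)
lam = jj ** (-tau)                                    # lambda_j = j^{-tau}
gamma = 4.0 * (-1.0) ** (jj + 1) * jj ** (-delta)     # gamma_j = 4(-1)^{j+1} j^{-delta}, j >= 2
gamma[0] = 0.3                                        # gamma_1 = 0.3
beta0 = np.array([2.0, 0.0, 1.5, 0.0, 0.3])

xi = rng.standard_normal((n, J)) * np.sqrt(lam)       # xi_ij ~ N(0, lambda_j), independent
d = 1.0 / np.sqrt(1.0 + lam[:5])
V = 0.7 * np.outer(d, d)                              # v_kl = 0.7((1+lam_k)(1+lam_l))^{-1/2}
np.fill_diagonal(V, d ** 2)                           # v_kk = (1 + lambda_k)^{-1}
Z = xi[:, :5] * d + rng.standard_normal((n, 5)) @ np.linalg.cholesky(V).T
eps = rng.normal(0.0, 0.5, n)
Y = xi @ gamma + Z @ beta0 + eps                      # integral term equals sum_j gamma_j xi_ij

# Discretized curves X_i(t) on 101 grid points (for use with the FPCA sketch in Sect. 2).
t = np.linspace(0.0, 1.0, 101)
phi = np.ones((t.size, J))
phi[:, 1:] = np.sqrt(2.0) * np.cos(np.pi * np.outer(t, jj[1:] - 1))
X = xi @ phi.T
```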

Table 1 Results of Monte Carlo experiments for model (4.1)

All the results in this section are based on 500 replications. In all the simulated designs, we used the SCAD penalty function with \(a=3.7\). We set the sample size n to 100 and 200. For each simulated data set, the penalized least squares estimators \(\hat{\pmb {\beta }}_\mathrm{PLS}\) and \(\hat{\gamma }_\mathrm{PLS}(t)\) were computed by the procedure given in Sects. 2 and 3. The tuning parameter m is determined by the BIC criterion described in Sect. 2, and the tuning parameter \(\nu _{n}\) in (3.1) is selected by the method of Fan et al. (2014).

We measured the estimation accuracy of the parametric estimators by the average \(l_{1}\)-losses \(|\hat{\beta }_{1}-\beta _{1}|\), \(|\hat{\beta }_{3}-\beta _{3}|\) and \(|\hat{\beta }_{5}-\beta _{5}|\) over 500 replications. We also evaluated the selection accuracy by the average counts of false positives (FP) and false negatives (FN) over the 500 replications, that is, the number of noise covariates included in the model and the number of signal covariates not included. Table 1 displays the simulation results for model (4.1). We see from Table 1 that the average \(l_{1}\)-loss, FP and FN tend to decrease as n increases, and the average \(l_{1}\)-loss tends to decrease as \(\tau \) increases. Table 1 also shows that the FPs and FNs for Settings 1 and 3 with the well-spaced eigenvalues are smaller than those for Settings 2 and 4 with the closely spaced eigenvalues, while the FP and FN for Setting 4 are smaller than those for Setting 2.

Table 2 reports the integrated squared bias (\(\hbox {Bias}^{2}\)), integrated variance (Var) and mean integrated squared error (MISE) of the estimator \(\hat{\gamma }(t)\), computed on a grid of 100 equally spaced points on \({\mathcal {T}}\). Table 2 shows a general tendency for the MISE to decrease as \(\tau \) increases. We also see from Table 2 that the MISEs for Settings 1 and 3 with the well-spaced eigenvalues are smaller than those for Settings 2 and 4 with the closely spaced eigenvalues.

Table 2 Results of Monte Carlo experiments for model (4.1)
Table 3 Results of Monte Carlo experiments under high-dimensional data

In the following, we investigate variable selection for high-dimensional data. In (4.1), let \(Z_{i}=(Z_{i1},\ldots ,Z_{i30})^{T}\), where \(Z_{i1},\ldots ,Z_{i5}\) are the same as above, \(Z_{i6},\ldots ,Z_{i30}\) are mutually independent, independent of \(Z_{i1},\ldots ,Z_{i5}\), and \(Z_{ij}\sim N(0,1)\) for \(j=6,\ldots ,30\), and \(\pmb {\beta }_{0}=(2,0,1.5,0,0.3,0,\ldots ,0)^{T}\). The simulation results for these high-dimensional data are reported in Tables 3 and 4, which show conclusions similar to those in Tables 1 and 2. Comparing Table 3 with Table 1 and Table 4 with Table 2, we see that our penalized least squares estimators also behave well for high-dimensional data.

Table 4 Results of Monte Carlo experiments under high-dimensional data

5 A real data example

In this section, we analyze a real data set using the proposed methodology. For this purpose, we analyze the real estate data set collected from the statistical yearbooks of various cities, real estate market reports and statistical bulletins on national economic and social development in China. It includes the real estate data for 197 second-, third- and fourth-tier cities in China. The data set contains the average annual income of urban residents from 2000 to 2016; all other variables are for 2016. Our purpose is to study the relationship between urban housing prices and their influencing factors. The response variable Y represents the urban housing price. Since it takes many years of savings for the average resident to buy a house, we choose the average annual income of the residents as the functional covariate. Let \(X_{i}^{*}(t)\) denote the average annual income of the residents of the \(i\hbox {th}\) city for the year t and \(X_{i}(t)=X_{i}^{*}(t)-\bar{X}^{*}(t)\), where \(\bar{X}^{*}(t)=\frac{1}{197}\sum _{i=1}^{197}X_{i}^{*}(t)\). The scalar covariates of primary interest include urban category (\(Z_{2},Z_{3}\)), urban population (\(Z_{4}\)), urban GDP (\(Z_{5}\)), bank interest rate (\(Z_{6}\)), urban livability index (\(Z_{7}\)), urban comprehensive competitiveness (\(Z_{8}\)) and urban development index (\(Z_{9}\)). Because some of these variables, such as \(Z_{4}\) and \(Z_{5}\), take very large values whereas others, such as \(Z_{6}\), take small values, we first rescale them as follows. Let \(\bar{z}_{i4}\), \(i=1,\ldots ,197\), be the observations of \(Z_{4}\), and let \(z_{i4} = \bar{z}_{i4}/\max _{i}\bar{z}_{i4}\), \(i=1,\ldots ,197\), so that the maximum of the rescaled data for \(Z_{4}\) is 1. The data for the variables \(Z_{5},\ldots ,Z_{9}\) are rescaled in the same fashion. We construct the following partial functional linear model:

$$\begin{aligned} \log (Y_{i})=\int _{0}^{17}\gamma (t)X_{i}(t)\mathrm{d}t+Z_{i1}\beta _{01}+\cdots +Z_{i9}\beta _{09}+\varepsilon _{i}, \end{aligned}$$
(5.1)

where \(Z_{i1}\equiv 1\); \(Z_{i2}=1\) and \(Z_{i3}=0\) indicate a second-tier city, \(Z_{i2}=0\) and \(Z_{i3}=1\) indicate a third-tier city, and \(Z_{i2}=0\) and \(Z_{i3}=0\) indicate a fourth-tier city.

The estimators of the unknown parameters and function in model (5.1) are computed by the method given in Sect. 2, with the tuning parameter m determined by the BIC criterion described in Sect. 2. Table 5 exhibits the parametric estimators, and Fig. 1a shows the estimated curve of \(\gamma (t)\) and its 95% confidence interval. We see from Table 5 that urban population, urban GDP, urban livability index, urban comprehensive competitiveness and urban development index have nonnegative effects, while bank interest rate has a negative effect. The fact that \(\beta _{02}>\beta _{03}>0\) in Table 5 indicates that the housing price for a third-tier city is larger than that for a fourth-tier city and the housing price for a second-tier city is larger than that for a third-tier city. We see from Fig. 1a that the estimated curve varies smoothly but rises rapidly in the tail, which shows that the average annual income of residents in recent years has a much stronger effect on housing prices.

Table 5 The parametric estimators for model (5.1)
Fig. 1

The solid lines are the estimated curves of \(\gamma (t)\), and the dotted lines are the corresponding \(95\%\) point-wise confidence intervals. The estimate in (a) is computed by (2.8), and that in (b) by (3.3)

Table 6 exhibits the penalized least squares estimators of the parameters computed by the procedure given in Sect. 3, and Fig. 1b shows the estimated curve of \(\gamma (t)\) computed by (3.3) and its 95% confidence interval. Table 6 shows that urban category, urban GDP, urban livability index and urban development index are important factors affecting housing prices. Comparing Fig. 1b with Fig. 1a, we see that the difference between the two is small.

Table 6 The penalized least squares estimators of the parameters for model (5.1)

To evaluate the prediction performance of our model and methods, we applied leave-one-out cross-validation to the data; that is, when predicting the housing price for the ith city, we omitted the data for this city when fitting the model. Figure 2 displays the boxplots of the absolute prediction errors \(|\widehat{\log (y_{j})}-\log (y_{j})|,\ j=1,\ldots ,197,\) for the method given in Sect. 2 and the penalized method given in Sect. 3. The mean values of these errors for the two methods are 0.2529 and 0.2521, respectively. These observations and Fig. 2 suggest that the penalized method is slightly better than the method given in Sect. 2.

Fig. 2

Boxplots of the absolute prediction errors \(|\widehat{\log (y_{j})}-\log (y_{j})|,\ j=1,\ldots ,197,\) for the two methods. Here, 1 is the boxplot for the method given in Sect. 2 and 2 is the boxplot for the penalized method given in Sect. 3