1 Introduction

Consider the following linear regression model:

$$\begin{aligned} \mathbf{y}= \mathbf{X}\varvec{{\beta }}^* + \varvec{{\varepsilon }}, \end{aligned}$$
(1)

where \(\mathbf{y}=(y_1,\ldots ,y_n)^\mathrm{T}\in \mathbb {R}^n\) is a response vector, \(\varvec{{\beta }}^*=(\beta _1^*,\ldots ,\beta _p^*)^\mathrm{T}\in \mathbb {R}^p\) is the true regression coefficient vector, \(\mathbf{X}=(X_1,\ldots ,X_p)\) is an \(n\times p\) design matrix and \(\varvec{{\varepsilon }}=({\varepsilon }_1,\ldots ,{\varepsilon }_n)^\mathrm{T} \in \mathbb {R}^n\) is a random error vector. Penalized estimation has received much attention for variable selection in the model (1); it estimates the true regression coefficient vector \(\varvec{{\beta }}^*\) by minimizing the penalized sum of squared residuals:

$$\begin{aligned} \Vert \mathbf{y}- \mathbf{X}\varvec{{\beta }}\Vert _2^2/2n + \sum _{j=1}^p J_\lambda (|\beta _j|), \end{aligned}$$
(2)

where \(J_\lambda \) is a penalty in which the tuning parameter \(\lambda \) controls the sparsity of the fitted model. The least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996) has become popular because it performs parameter estimation and variable selection simultaneously with the penalty \(J_{\lambda }^L(|t|) = \lambda |t|\). On the other hand, the LASSO is known to select more variables than necessary (Zou 2006) unless some strong conditions hold (Zhao and Yu 2006; Meinshausen and Yu 2009), and it shrinks large regression coefficients toward zero, introducing unnecessary bias. To improve the LASSO, various non-convex penalties have been proposed, such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001), \(\mathrm{d}J_{\lambda }^S(|t|)/\mathrm{d}|t|=\min \{\lambda ,(a\lambda -|t|)_+/(a-1)\}, a>2\), the Bridge penalty (Huang et al. 2008), \(J_\lambda ^B(|t|)=\lambda |t|^\nu , 0<\nu \le 1\), and the minimax concave penalty (MCP) (Zhang 2010), \(\mathrm{d}J_{\lambda }^M(|t|)/\mathrm{d}|t|=(\lambda -|t|/a)_+, a>1\), where \(x_+=xI(x \ge 0)\). These non-convex penalties are known to have the oracle property: asymptotic equivalence to an ideal non-penalized estimator obtained with the true predictive variables only (Fan and Peng 2004; Kim et al. 2008; Zhang 2010). We refer to Zhang and Zhang (2012) for a well-organized review of penalized estimation for variable selection in high-dimensional linear regression.
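To make these penalties concrete, the following minimal R sketch evaluates the LASSO, SCAD and MCP derivatives and the Bridge penalty exactly as parameterized above; the function names and the default values of \(a\) and \(\nu \) are ours, chosen only for illustration.

```r
# Penalty derivatives dJ/d|t| (and the Bridge penalty value) as parameterized
# in the text; function names and default a, nu are illustrative choices.
lasso_deriv <- function(t, lambda)            rep(lambda, length(t))
scad_deriv  <- function(t, lambda, a = 3.7)   pmin(lambda, pmax(a * lambda - abs(t), 0) / (a - 1))
mcp_deriv   <- function(t, lambda, a = 3)     pmax(lambda - abs(t) / a, 0)
bridge_pen  <- function(t, lambda, nu = 0.5)  lambda * abs(t)^nu

# The SCAD and MCP derivatives vanish for large |t|, which removes the bias
# that the constant LASSO derivative induces on large coefficients.
t <- seq(0, 4, by = 0.5)
cbind(t, lasso = lasso_deriv(t, 1), scad = scad_deriv(t, 1), mcp = mcp_deriv(t, 1))
```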

In this paper, we consider a case where the p covariates can be decomposed into K disjoint groups. In this case, the model (1) can be rewritten as

$$\begin{aligned} \mathbf{y}= \sum _{k=1}^K \mathbf{X}_k\varvec{{\beta }}_k^* + \varvec{{\varepsilon }}, \end{aligned}$$
(3)

where \(\mathbf{X}_k=(\mathbf{X}_{k1},\ldots ,\mathbf{X}_{kp_k})\) and \(\varvec{{\beta }}_k^*=(\beta _{k1}^*,\ldots ,\beta _{kp_k}^*)^\mathrm{T} \in \mathbb {R}^{p_k}\) are an \(n\times p_k\) design matrix and the true regression coefficient vector for the kth group, respectively, with \(p=\sum _{k=1}^Kp_k\).

In high-dimensional linear regression models, it is not uncommon for covariates to be naturally grouped, and hence group selection is of interest. For example, linear regression models with categorical covariates can be expressed as (3) via groups of dummy variables. In nonparametric additive models, each component can be expanded by a set of basis functions, and selecting components is equivalent to selecting groups of basis functions. In practice, selecting both groups and variables within the selected groups is often of greater interest than selecting groups only. An example is a nonparametric additive model with a wavelet basis, where choosing the basis functions within the selected components is as important as selecting the components (Sardy and Tseng 2004).

In recent years, several methods have extended and generalized the existing penalties to group variable selection by minimizing the following \(L_q\)-norm composite criterion:

$$\begin{aligned} \left\| \mathbf{y}- \sum _{k=1}^K \mathbf{X}_k\varvec{{\beta }}_k\right\| _2^2/2n + \sum _{k=1}^K J_{\lambda _k}(\Vert \varvec{{\beta }}_k\Vert _q) \end{aligned}$$
(4)

for \(q\in \{1,2\}\), where the \(\lambda _k\)'s are tuning parameters that control the sparsity of groups. For example, Yuan and Lin (2006) proposed the group LASSO using \(J_{\lambda _k}^L(|t|)= \lambda _k|t|\) with \(\lambda _k =\lambda p_k^{1/2}\) and \(q=2\), which is equivalent to the LASSO when each group consists of a single covariate. Another example is the \(L_2\)-norm MCP proposed by Huang et al. (2012). They used the MCP \(J_{\lambda _k}(|t|)=J_{\lambda _k}^M(|t|)\) with \(\lambda _k= \lambda p_k^{1/2}\) and \(q=2\), which is equivalent to \(J_\lambda ^M(p_k^{1/2}|t|)\) with \(a\) replaced by \(a p_k\). Note that the group LASSO and the \(L_2\)-norm MCP do not achieve sparsity within the selected groups since they apply the penalties to the \(L_2\)-norm of each group's coefficient vector.

On the other hand, Huang et al. (2009) proposed the group Bridge, which selects groups and variables simultaneously by minimizing an \(L_1\)-norm composite criterion with \(J_{\lambda _k}(|t|)= J_{\lambda _k}^B(|t|)\), \(\lambda _k = \lambda p_k^{\nu }\) and \(q=1\), which is equivalent to \(J_\lambda ^B(p_k|t|)\). The group Bridge is an extension of the Bridge penalty to group variable selection, and it includes the hierarchical LASSO proposed by Zhou and Zhu (2010) as a special case with \(\nu =1/2\). Recently, Jiang and Huang (2015) introduced a general \(L_1\)-norm framework that includes the \(L_1\)-norm MCP and SCAD as special cases. For example, the \(L_1\)-norm MCP uses \(J_{\lambda _k}(|t|)=J_{\lambda _k}^M(|t|)\) with \(\lambda _k=\lambda p_k\) and \(q=1\). Another example is the \(L_1\)-norm exponential penalty proposed by Breheny (2015), where \(J_{\lambda _k}(|t|)=(\lambda ^2/\tau )\{1-\exp (-\tau |t|/\lambda )\}\) with an extra tuning parameter \(\tau \) that controls the coupling effect.

The \(L_q\)-norm composite criterion in (4) can be viewed as a special case of the inner–outer composite criterion introduced by Breheny and Huang (2009):

$$\begin{aligned} \left\| \mathbf{y}- \sum _{k=1}^K \mathbf{X}_k\varvec{{\beta }}_k\right\| _2^2/2n + \sum _{k=1}^KJ_{\lambda _k}^O \left( \sum _{j=1}^{p_k}J_{\gamma }^I(|\beta _{kj}|) \right) , \end{aligned}$$

where \(J_{\lambda _k}^O\) is an outer penalty for group selection and \(J_{\gamma }^I\) is an inner penalty for variable selection. For example, the group Bridge uses the outer Bridge penalty \(J_{\lambda _k}^O(|t|)= J_{\lambda _k}^B(|t|)\) and the inner LASSO penalty \(J_{\gamma }^I(|t|)=\gamma |t|\), where the inner tuning parameter is fixed at \(\gamma =1\). Another example is the group LASSO, which uses the same outer penalty as the group Bridge but the inner ridge penalty \(J_{\gamma }^I(|t|)=\gamma |t|^2\), again with the inner tuning parameter fixed at \(\gamma =1\). Breheny and Huang (2009) proposed the composite MCP, which uses the MCP for both the inner and outer penalties with \(\lambda _k=\gamma \) and \(a_k=ap_k\gamma /2\).

There is another approach for group and variable selection which uses the sum of two different penalties:

$$\begin{aligned} \left\| \mathbf{y}- \sum _{k=1}^K \mathbf{X}_k\varvec{{\beta }}_k\right\| _2^2/2n + \sum _{k=1}^KJ_{\lambda _k}^G(\Vert \varvec{{\beta }}_k\Vert _2) + \sum _{k=1}^K\sum _{j=1}^{p_k} J_{\gamma }^V(|\beta _{kj}|), \end{aligned}$$

where \(J_{\lambda _k}^G\) is a penalty for the selection of groups and \(J_\gamma ^V\) for variables. For example, the sparse group LASSO of Simon et al. (2013) adds the \(L_1\)-penalty \(J_{\gamma }^V(|t|) = \gamma |t|\) to the group LASSO penalty \(J_{\lambda _k}^G(|t|)=\lambda _k|t|=\lambda p_k^{1/2}|t|\) to yield sparsity within the selected groups. One unique feature of the sparse group LASSO is the use of two different tuning parameters \(\lambda \) and \(\gamma \) to control sparsity between and within groups, which gives a clearer way of controlling sparsity in practice, although using two tuning parameters may incur a higher computational cost than other methods. We refer to Huang et al. (2012) for a thorough review of penalized approaches for group and variable selection in linear regression models.

As mentioned above, \(L_1\)-norm composite methods such as the group Bridge and the \(L_1\)-norm MCP can handle sparsity both between and within groups. However, it is not clear how to control the two levels of sparsity, especially sparsity within groups, since these methods fix the inner tuning parameter at \(\gamma =1\). An exception is the \(L_1\)-norm exponential penalty proposed by Breheny (2015), which introduces an extra tuning parameter to control the coupling effect. However, controlling the coupling effect focuses on how the relevant and irrelevant variables in a group affect each other rather than on how to distinguish important variables from unimportant ones. Further, these methods lack theoretical results in high-dimensional linear regression models. This is partly because they do not have a closed form of the global minimizer or an oracle estimator that could help us study asymptotic properties of the global minimizer. On the other hand, the \(L_2\)-norm MCP takes advantage of the existence of the oracle group estimator in (10), with which it can be shown to achieve an oracle property (Huang et al. 2012) in group selection when there is no sparsity within groups. Note that for the composite MCP, the oracle estimator in (9) can be an intuitive global minimizer also when there is sparsity within groups. However, the obscurity of the tuning scheme and the lack of theoretical support remain with the choice \(\lambda _k=\gamma \) and \(a_k=ap_k\gamma /2\).

In this paper, we propose a new penalized approach for group and variable selection based on the doubly sparse (DS) penalty. The DS penalty is an \(L_1\)-norm composite penalty with the clipped LASSO proposed by Kwon et al. (2015) as the outer penalty. An important feature of the DS penalty compared to other \(L_1\)-norm composite penalties is that it can control sparsity between and within groups separately. Hence, the principle of selecting groups and variables is much more transparent.

Theoretically, the DS penalized approach achieves group selection consistency while allowing the number of groups K to exceed the sample size n, which is common to other group penalties. However, one benefit of using \(L_1\)-norm composite penalties is that the maximum number of variables in the signal groups, \(p_{\max }^*=\max _{k:\Vert \varvec{\beta }_k^*\Vert _1 \ne 0 }p_k\), is also allowed to exceed the sample size n, which is impossible for \(L_2\)-norm composite penalties such as the \(L_2\)-norm MCP.

To the best of our knowledge, there are no known group selection consistency results for other \(L_1\)-norm composite penalties such as the group Bridge and the hierarchical LASSO when \(p>n\) and both levels of sparsity are assumed. Note that the sparse group LASSO requires a strong irrepresentability condition for group selection consistency, since it selects groups in the same way as the group LASSO (Wei and Huang 2010). Hence, we believe the DS penalty is applicable much more widely than existing methods.

The paper is organized as follows. In Sect. 2, we introduce the DS penalty together with an efficient optimization algorithm. Section 3 gives asymptotic properties, and Sect. 4 presents the results of numerical studies including simulations as well as real data analyses. Concluding remarks and technical details are presented in Sect. 5 and the Appendix, respectively.

2 Doubly sparse penalty

2.1 Definition and solution

Let \(a>0\) be a fixed constant. For given \(\lambda >0\) and \(0\le \gamma \le \lambda \), the DS penalized estimator is defined as the global minimizer of the DS penalized sum of squared residuals,

$$\begin{aligned} Q_{\lambda ,\gamma }(\varvec{{\beta }}) = \left\| \mathbf{y}- \sum _{k=1}^K \mathbf{X}_k\varvec{{\beta }}_k\right\| _2^2/2n + \sum _{k=1}^KJ_{\lambda ,\gamma }^{(k)}(\Vert \varvec{{\beta }}_k\Vert _1), \end{aligned}$$
(5)

where the DS penalty is an \(L_1\)-norm composite penalty with the clipped LASSO (Kwon et al. 2015) as the outer penalty: \(J_{\lambda ,\gamma }^{(k)}(0)=0\),

$$\begin{aligned} \nabla J_{\lambda ,\gamma }^{(k)}(|t|)\equiv & {} d J_{\lambda ,\gamma }^{(k)}(|t|)/d |t|\\= & {} \big (-|t|/a_k + \lambda \big )I\big \{|t| < a_k(\lambda -\gamma )\big \} + \gamma I\big \{|t|\ge a_k(\lambda -\gamma ) \big \} \end{aligned}$$

for \(|t|>0\) and \(a_k=ap_k\). The clipped LASSO is a continuously differentiable quadratic spline interpolation of two penalty functions, the MCP and LASSO:

$$\begin{aligned} J_{\lambda ,\gamma }^{(k)}(|t|)= J_{\lambda }^M(|t|)I\big \{|t| < a_k(\lambda -\gamma )\big \} + \big (J_\gamma ^L(|t|)+ c_{\lambda ,\gamma }^{(k)}\big ) I\big \{|t| \ge a_k(\lambda -\gamma ) \big \}, \end{aligned}$$

where \(c_{\lambda ,\gamma }^{(k)}=a_k(\lambda -\gamma )^2/2\). Hence, the clipped LASSO coincides with the MCP with tuning parameter \(\lambda \) for small \(|t|< a_k(\lambda -\gamma )\), and with the LASSO with tuning parameter \(\gamma \) for large \(|t|\ge a_k(\lambda -\gamma )\). See Fig. 1, which depicts the three penalty functions: the LASSO, the clipped LASSO and the MCP.

When \(\gamma =\lambda \), the DS penalty becomes the LASSO, so it selects variables as the LASSO does, without using group information. When \(\gamma =0\), the DS penalty becomes an \(L_1\)-norm MCP, which can select groups and variables using a single tuning parameter \(\lambda \). The case \(0<\gamma <\lambda \) is new, and we study various properties of the DS penalty in this regime.
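As a concrete illustration, the following R sketch evaluates the per-group clipped LASSO penalty \(J_{\lambda ,\gamma }^{(k)}\) and its derivative as defined above, with \(a_k=ap_k\); the function names are ours, and the last lines check the two limiting cases \(\gamma =\lambda \) (LASSO) and \(\gamma =0\) (\(L_1\)-norm MCP).

```r
# Clipped LASSO building block of the DS penalty for the kth group (a_k = a * p_k):
# MCP with parameter lambda below the knot a_k(lambda - gamma), and LASSO with
# parameter gamma plus the constant c that makes the quadratic spline continuous.
ds_group_penalty <- function(t, lambda, gamma, a, pk) {
  ak   <- a * pk
  knot <- ak * (lambda - gamma)
  mcp  <- lambda * abs(t) - abs(t)^2 / (2 * ak)          # J^M_lambda with parameter a_k
  lin  <- gamma * abs(t) + ak * (lambda - gamma)^2 / 2   # J^L_gamma(|t|) + c^(k)
  ifelse(abs(t) < knot, mcp, lin)
}
ds_group_deriv <- function(t, lambda, gamma, a, pk) {
  ak <- a * pk
  ifelse(abs(t) < ak * (lambda - gamma), lambda - abs(t) / ak, gamma)
}

# gamma = lambda recovers the LASSO and gamma = 0 an L1-norm MCP:
t <- seq(0, 3, by = 0.5)
cbind(t,
      clipped = ds_group_penalty(t, 1, 0.5, 2, 1),
      lasso   = ds_group_penalty(t, 1, 1,   2, 1),
      mcp     = ds_group_penalty(t, 1, 0,   2, 1))
```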

Fig. 1: Three penalty functions: the LASSO (dotted) with \(\lambda = 1\), the MCP (dashed) with \((\lambda ,a)=(1,2)\) and the clipped LASSO (solid) with \((\lambda ,\gamma ,a)=(1,1/2,2)\)

2.2 Roles of two tuning parameters

Let \(\varvec{{\Omega }}_{\lambda ,\gamma }\) be the set of all local minimizers of \(Q_{\lambda ,\gamma }\). Given \(\varvec{{\beta }} \in \mathbb {R}^p\), let \(\mathcal{A}(\varvec{{\beta }})=\mathcal{L}(\varvec{{\beta }}) \cup \mathcal{S}(\varvec{{\beta }})= \{k:\Vert \varvec{{\beta }}_k\Vert _1 \ne 0\}\) and \(\mathcal{N}(\varvec{{\beta }}) = \{k:\Vert \varvec{{\beta }}_k\Vert _1=0\} = \mathcal{A}(\varvec{{\beta }})^c\) be the sets of group indices of coefficient vectors that have nonzero and zero \(L_1\)-norms, respectively, where \(\mathcal{L}(\varvec{{\beta }}) = \{k:\Vert \varvec{{\beta }}_k\Vert _1/p_k > a(\lambda -\gamma )\}\) and \(\mathcal{S}(\varvec{{\beta }}) = \{k:0<\Vert \varvec{{\beta }}_k\Vert _1/p_k \le a(\lambda -\gamma )\}\). Let \(D_{kj}(\varvec{{\beta }})=-\mathbf{X}_{kj}^\mathrm{T}(\mathbf{y}-\mathbf{X}\varvec{{\beta }})/n\) for \(k\le K\) and \(j\le p_k\), which is the sample covariance between \(-\mathbf{X}_{kj}\) and the residual \(\mathbf{y}-\mathbf{X}\varvec{{\beta }}\). Similarly, let \(D_k(\varvec{{\beta }})=-\mathbf{X}_k^\mathrm{T}(\mathbf{y}-\mathbf{X}\varvec{{\beta }})/n\) for \(k\le K\), the vector of sample covariances for the kth group whose jth element is \(D_{kj}(\varvec{{\beta }})\) for \(j\le p_k\). Lemma 1 below follows directly from the first-order optimality conditions in Bertsekas (1999).

Lemma 1

For any \({\varvec{\hat{{\beta }}}} \in \varvec{{\Omega }}_{\lambda ,\gamma }\),

[conditions (6a)–(6e)]

for all \(k\le K\) and \(j\le p_k\), where \(\Delta _{\lambda }(t)=\lambda -t/a\) and \(\mathrm{sign}(t)=(t/|t|)I[t\ne 0]\).

The conditions in Lemma 1 show how the two tuning parameters \(\lambda \) and \(\gamma \) work in the DS penalty. First, Eq. (6e) shows that \(\lambda \) gives a threshold for group sparsity. Second, Eqs. (6a) and (6b) show that \(\gamma \) gives a threshold for variable sparsity within the selected groups, acting as the LASSO in the region \(\mathcal{L}({\varvec{\hat{{\beta }}}})\), since they are exactly the first-order optimality conditions for the LASSO with tuning parameter \(\gamma \). Third, Eqs. (6c) and (6d) show that the DS penalty behaves like the \(L_1\)-norm MCP in the region \(\mathcal{S}({\varvec{\hat{{\beta }}}})\). Note that the events defined by Eqs. (6c) and (6d) are null events in the asymptotic sense under some assumptions. Hence, we may simply understand the DS penalized estimator through the lemma below.

Lemma 2

If \({\varvec{\hat{{\beta }}}} \in \mathbb {R}^p\) satisfies

[conditions (7a)–(7d)]

for all \(k\le K\) and \(j\le p_k\), then \({\varvec{\hat{{\beta }}}} \in \varvec{{\Omega }}_{\lambda ,\gamma }\).

The conditions in Lemma 2 lead us to expect that there is an estimator \({\varvec{\hat{{\beta }}}}\) that first selects groups and then selects variables within the selected groups. Condition (7a) requires that \({\varvec{\hat{{\beta }}}}\) exclude the groups whose maximum sample covariances are smaller than \(\lambda \). Condition (7b) requires that \({\varvec{\hat{{\beta }}}}\) include the groups whose averaged \(L_1\)-norms of coefficient vectors are larger than \(a(\lambda -\gamma )\). The last two conditions, (7c) and (7d), show that \({\varvec{\hat{{\beta }}}}\) becomes the LASSO with tuning parameter \(\gamma \) within the selected groups, deleting the variables whose sample covariances are smaller than \(\gamma \). Hence, in contrast to existing penalties, the DS penalty provides a clear way of controlling the sparsity between and within groups, which also clarifies the roles of the two tuning parameters \(\lambda \) and \(\gamma \).
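These conditions can be checked numerically for any candidate solution. The sketch below implements the checks as they are described verbally above (the exact displays (7a)–(7d) appear in the original article); the strictness of the inequalities, the tolerance and the function name are our assumptions for illustration.

```r
# Rough numerical check of the Lemma 2-type sufficient conditions, following the
# verbal description above; 'group' codes group membership of the p columns of X.
check_ds_conditions <- function(beta_hat, X, y, group, lambda, gamma, a, tol = 1e-6) {
  n  <- length(y)
  D  <- as.numeric(-crossprod(X, y - X %*% beta_hat) / n)   # D_kj(beta_hat)
  ok <- TRUE
  for (k in unique(group)) {
    idx <- which(group == k); pk <- length(idx)
    bk  <- beta_hat[idx];     Dk <- D[idx]
    if (sum(abs(bk)) == 0) {           # excluded group: sample covariances below lambda
      ok <- ok && all(abs(Dk) < lambda - tol)
    } else {                           # selected group: averaged L1-norm above a(lambda - gamma)
      ok <- ok && (sum(abs(bk)) / pk > a * (lambda - gamma) + tol)
      nz <- abs(bk) > 0                # LASSO conditions with gamma within the group
      ok <- ok && all(abs(Dk[nz] + gamma * sign(bk[nz])) < tol)
      ok <- ok && all(abs(Dk[!nz]) < gamma - tol)
    }
  }
  ok
}
```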

An oracle estimator for group and variable selection that satisfies these conditions asymptotically will be introduced in the next subsection; it extends the existence of an oracle estimator for variable selection. This parallels the fundamental idea behind the usual oracle property for the model (1) studied by many researchers (Kim et al. 2008; Zhang 2010; Huang et al. 2012), where sufficient conditions for a given \({\varvec{\hat{{\beta }}}}\) to be a local minimizer of (2) are given by

[conditions (8a)–(8c)]

for all \(j\le p\), for a class of nonconvex penalties (Zhang and Zhang 2012; Kim and Kwon 2012). Note that conditions (8a) and (8b) require that \({\varvec{\hat{{\beta }}}}\) include the variables whose absolute coefficient values are larger than \(a\lambda \) and exclude the variables whose sample covariances are smaller than \(\lambda \), so that \({\varvec{\hat{{\beta }}}}\) becomes an unbiased least squares estimator on the selected variables, as condition (8c) requires.

2.3 Global minimizer

In high-dimensional linear regression models, the existence of an oracle estimator plays an important role in the study of the penalized estimator. For example, the oracle estimator,

$$\begin{aligned} {\varvec{\hat{{\beta }}}}^o = \mathop {\arg \min }\limits _{\beta _j=0,\, j \in \{l:\beta _l^*=0\}}\Vert \mathbf{y}- \mathbf{X}\varvec{{\beta }}\Vert _2^2 /2n , \end{aligned}$$
(9)

is proven to be the global minimizer of the penalized sum of squared residuals in (2), asymptotically, for various penalties such as the SCAD penalty (Kim and Kwon 2012) and the MCP (Zhang 2010). A similar result for group variable selection can be found in Huang et al. (2012) with the oracle group estimator,

$$\begin{aligned} {\varvec{\hat{{\beta }}}}^o = \mathop {\arg \min }\limits _{\Vert {\varvec{\beta }}_{k}\Vert _1=0,k \in \mathcal{N}({\varvec{\beta }}^*)} \left\| \mathbf{y}- \sum _{k=1}^K \mathbf{X}_{k}\varvec{{\beta }}_{k}\right\| _2^{2}/2n, \end{aligned}$$
(10)

which asymptotically becomes the global minimizer of the \(L_2\)-norm composite criterion (4) with the MCP when there is no sparsity within groups. Before introducing an oracle estimator for the DS penalized criterion in (5), we give sufficient conditions under which the global minimizer of \(Q_{\lambda ,\gamma }\) is unique when \(p\le n\). Let \(\lambda _{\min }(\mathbf{D})\) and \(\lambda _{\max }(\mathbf{D})\) be the smallest and largest eigenvalues of a given symmetric matrix \(\mathbf{D}\). The following lemma gives a result analogous to that of Kim and Kwon (2012) for group variable selection.

Lemma 3

Assume that \(\rho _{\min }=\lambda _{\min }(\mathbf{X}^\mathrm{T}\mathbf{X}/n)>0.\) If there exists a local minimizer \({\varvec{\hat{{\beta }}}} \in \varvec{{\Omega }}_{\lambda ,\gamma }\) that satisfies \(\mathcal{S}({\varvec{\hat{{\beta }}}}) = \emptyset \) and

$$\begin{aligned} {\left\{ \begin{array}{ll} {\min }_{k \in \mathcal{L}(\hat{{\varvec{\beta }}})}\Vert {\varvec{\hat{{\beta }}}}_k\Vert _1/p_k> (\lambda -\gamma )\max \{a,1/\rho _{\min }\},\\ {\max }_{k \in \mathcal{N}(\hat{{\varvec{\beta }}})}\Vert D_k({\varvec{\hat{{\beta }}}})\Vert _\infty < (\lambda -\gamma )\min \{a\rho _{\min },1\}+\gamma , \end{array}\right. } \end{aligned}$$

then \(\varvec{{\Omega }}_{\lambda ,\gamma }=\{{\varvec{\hat{{\beta }}}}\}.\)

Lemma 3 plays a key role when we prove the main theorems in the next section under a sparsity condition when \(p>n\). Lemma 3 shows that the global minimizer is unique, even though \(Q_{\lambda ,\gamma }\) is non-convex, under non-asymptotic sufficient conditions: the nonzero group coefficients of the estimator are large enough that \(\mathcal{S}({\varvec{\hat{{\beta }}}})=\emptyset \), while the covariances between the current residuals and the irrelevant covariates are sufficiently small.

From the optimality conditions in Lemmas 2 and 3, it is easy to see that the following estimator,

$$\begin{aligned} {\varvec{\hat{{\beta }}}}^o(\gamma ) = \mathop {\arg \min }\limits _{\Vert {\varvec{\beta }}_k\Vert _1=0,k \in \mathcal{N}({\varvec{\beta }}^*)} \left\{ \left\| \mathbf{y}- \sum _{k=1}^K \mathbf{X}_k\varvec{{\beta }}_k\right\| _2^2/2n + \gamma \sum _{k=1}^K \Vert \varvec{{\beta }}_k\Vert _1 \right\} , \end{aligned}$$
(11)

becomes the unique global minimizer of \(Q_{\lambda ,\gamma }\) if \({\varvec{\hat{{\beta }}}}^o(\gamma )\) satisfies the sufficient conditions in Lemma 3. Note that \({\varvec{\hat{{\beta }}}}^o(\gamma )\) is the LASSO restricted to the signal groups, with \(\gamma \) controlling the sparsity of variables within them. In this sense, we refer to \({\varvec{\hat{{\beta }}}}^o(\gamma )\) as the oracle LASSO. In practice, the oracle LASSO is unavailable since the signal groups are unknown. However, we will show that the oracle LASSO is exactly the DS penalized estimator asymptotically, provided that \(\lambda \) is chosen to recover the correct group sparsity. This result also explains the roles of the two tuning parameters \(\lambda \) and \(\gamma \) clearly: after selecting groups by \(\lambda \), the shrinkage and selection within the chosen groups are controlled by \(\gamma .\)
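When the signal groups are treated as known, the oracle LASSO in (11) is just an ordinary LASSO fit on the columns of the signal groups. A minimal sketch using the R package glmnet (whose Gaussian objective matches (11) when the intercept and standardization are turned off) is given below; the function name and all arguments other than those of glmnet are ours.

```r
library(glmnet)

# Oracle LASSO (11): LASSO with tuning parameter gamma fitted on the columns of
# the signal groups only; 'signal_groups' is assumed known, as the oracle requires.
oracle_lasso <- function(X, y, group, signal_groups, gamma) {
  idx <- which(group %in% signal_groups)
  fit <- glmnet(X[, idx, drop = FALSE], y, family = "gaussian",
                lambda = gamma, standardize = FALSE, intercept = FALSE)
  beta      <- numeric(ncol(X))
  beta[idx] <- as.matrix(coef(fit))[-1, 1]   # drop the (zero) intercept
  beta
}
```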

2.4 Optimization algorithm

Recall that we estimate \(\varvec{{\beta }}\) by minimizing \(Q_{\lambda ,\gamma }\). Let \(\tilde{J}_{\lambda ,\gamma }(\varvec{{\beta }})=\sum _{k=1}^KJ_{\lambda ,\gamma }^{(k)}\big (\Vert \varvec{{\beta }}_k\Vert _1\big )-\lambda \Vert \varvec{{\beta }}\Vert _1\). It is easy to see that we can decompose \(Q_{\lambda ,\gamma }\) into a sum of convex and concave functions as follows:

$$\begin{aligned} Q_{\lambda ,\gamma }(\varvec{{\beta }})= Q_{vex}(\varvec{{\beta }}) + Q_{cav}(\varvec{{\beta }}), \end{aligned}$$

where \(Q_{vex}(\varvec{{\beta }})=\Vert \mathbf{y}- \sum _{k=1}^K \mathbf{X}_k\varvec{{\beta }}_k \Vert _2^2/2n + \lambda \Vert \varvec{{\beta }}\Vert _1\) is a convex function and \(Q_{cav}(\varvec{{\beta }}) = \tilde{J}_{\lambda ,\gamma }(\varvec{{\beta }}) \) is a continuously differentiable concave function. Hence, we can find a local minimizer with the convex–concave procedure (CCCP) of Yuille and Rangarajan (2003) or the difference-of-convex (DC) decomposition procedure of An and Tao (1997), which are powerful algorithms for nonconvex optimization problems. For example, we can apply the CCCP as follows. Let \(\partial \tilde{J}_{\lambda ,\gamma }(\varvec{{\beta }})\) be any subgradient of \(\tilde{J}_{\lambda ,\gamma }(\varvec{{\beta }})\) at \(\varvec{{\beta }}\). For a given solution \({\varvec{\hat{{\beta }}}}^{c},\) we update the solution by minimizing

$$\begin{aligned} \left\| \mathbf{y}- \sum _{k=1}^K \mathbf{X}_k\varvec{{\beta }}_k\right\| _2^2/2n + (\partial \tilde{J}_{\lambda ,\gamma }({\varvec{\hat{{\beta }}}}^{c}))^\mathrm{T}\varvec{{\beta }}+ \lambda \Vert \varvec{{\beta }}\Vert _1, \end{aligned}$$

which can be solved by many efficient optimization algorithms for the LASSO, such as the modified least angle regression algorithm (Efron et al. 2004) and the coordinate descent algorithm (Friedman et al. 2007). We iterate this procedure until the solution converges. Note that the CCCP always converges to a local minimizer (Sriperumbudur and Lanckriet 2009).
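A minimal base-R sketch of this CCCP scheme is given below, assuming the groups are coded by an integer vector group of length p; the inner LASSO-type subproblem, which carries an extra linear term from the linearized concave part, is solved here by a simple coordinate descent routine, and the function names are ours. In practice, one would replace the inner solver by one of the efficient LASSO algorithms cited above.

```r
soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)   # soft-thresholding

# Coordinate descent for the CCCP subproblem
#   || y - X b ||_2^2 / (2n) + v' b + lambda || b ||_1 .
cd_lasso_linear <- function(X, y, v, lambda, beta = rep(0, ncol(X)),
                            maxit = 500, tol = 1e-7) {
  n   <- nrow(X)
  xsq <- colSums(X^2) / n
  r   <- as.numeric(y - X %*% beta)
  for (it in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_len(ncol(X))) {
      zj <- sum(X[, j] * r) / n + xsq[j] * beta[j] - v[j]
      bj <- soft(zj, lambda) / xsq[j]
      r  <- r + X[, j] * (beta[j] - bj)      # keep the residual up to date
      beta[j] <- bj
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}

# CCCP for the DS-penalized criterion (5): linearize the concave part J~ at the
# current solution and solve the resulting LASSO-type problem, until convergence.
ds_cccp <- function(X, y, group, lambda, gamma, a, maxit = 50, tol = 1e-6) {
  beta <- rep(0, ncol(X))
  pk   <- as.numeric(table(group)[as.character(group)])   # p_k repeated per column
  ak   <- a * pk
  for (it in seq_len(maxit)) {
    gl1 <- as.numeric(tapply(abs(beta), group, sum)[as.character(group)])  # ||beta_k||_1 per column
    dJk <- ifelse(gl1 < ak * (lambda - gamma), lambda - gl1 / ak, gamma)   # outer derivative
    v   <- sign(beta) * (dJk - lambda)                                     # gradient of J~ at beta
    beta_new <- cd_lasso_linear(X, y, v, lambda, beta)
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
```

For a single pair \((\lambda ,\gamma )\), a call such as ds_cccp(X, y, group, lambda = 0.1, gamma = 0.05, a = 2) returns one candidate estimate; in the numerical studies below, such fits are computed over a grid of tuning parameters.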

3 Asymptotic properties

In this section, we study statistical properties of the DS penalized estimator when \(p>n\). For any subset \(S\subset \{(k,j):k\le K, j\le p_k\},\) define \(\mathbf{X}_S\) as the \(n\times |S|\) matrix obtained by combining the column vectors \(\mathbf{X}_{kj}\) for \((k,j)\in S\). We assume that the subsets \(A_*=\{(k,j):\beta _{kj}^*\ne 0\}\) and \(G_*=\{(k,j): k \in \mathcal{A}(\varvec{{\beta }}^*),j\le p_k\}\) are nonempty, where \(A_*\) indexes the truly nonzero regression coefficients and \(G_*\) indexes all variables in the signal groups. To study properties of the DS penalized estimator, we need the following assumptions:

  1. (C1)

    The random errors \({\varepsilon }_1,\ldots ,{\varepsilon }_n\) are independently and identically distributed mean zero sub-Gaussian random variables with a positive scale factor \(\sigma _0<\infty \), that is, \(E\exp (t{\varepsilon }_i) \le \exp (\sigma _0^2t^2/2)\) for all \(i\le n\) and \(t>0\).

  2. (C2)

    There exist constants \(\alpha _0>0\) and \(\kappa _{\min }>0\) such that

    $$\begin{aligned} \phi _{\min }(2(\alpha _0+1)|A_*|) \ge \kappa _{\min }, \end{aligned}$$
    (12)

where \(\phi _{\min }(m)=\min _{|B|\le m ,A_* \subset B} \lambda _{\min }(\mathbf{X}_B^\mathrm{T}\mathbf{X}_B/n)\) is the lower sparse eigenvalue in Zhang and Zhang (2012), and

$$\begin{aligned} \phi _{\max }(\alpha _0|A_*|)/\alpha _0 \le (1-3\sqrt{\phi _{\max }(\alpha _0|A_*|)/\alpha _0 \kappa _{\min }})^2/576, \end{aligned}$$
(13)

where \(\phi _{\max }(m)=\max _{|B|\le m ,A_* \subset B} \lambda _{\max }(\mathbf{X}_B^\mathrm{T}\mathbf{X}_B/n)\) is the upper sparse eigenvalue in Zhang and Zhang (2012).

Remark 1

(C1) implies that there exist constants \(c_0>0\) and \(d_0>0\) such that the error vector \(\varvec{{\varepsilon }}\) satisfies

$$\begin{aligned} \mathbf{P}\big (|\mathbf{a}^\mathrm{T}\varvec{{\varepsilon }}|>t\big ) \le c_0 \exp (-d_0t^2/\Vert \mathbf{a}\Vert _2^2), \end{aligned}$$
(14)

for all \(\mathbf{a}\in \mathbb {R}^n\) and \(t>0\). The inequality (12) in (C2) ensures model identifiability and the uniqueness of \({\varvec{\hat{{\beta }}}}^o(\gamma )\) as a local minimizer. In fact, \(\phi _{\min }((\alpha _0+1)|A_*|) \ge \kappa _{\min }\) is sufficient for \({\varvec{\hat{{\beta }}}}^o(\gamma )\) to be one of the local minimizers, which is often referred to as the oracle property. The inequality (13) in (C2) assumes that \(\phi _{\max }(\alpha _0|A_*|)/\alpha _0\) is bounded, which is weaker than similar conditions in Bickel et al. (2009) and Meinshausen and Yu (2009), where it is assumed that \(\phi _{\max }(|A_*|+\min \{n,p\})\) is bounded. A similar condition can be found in Wang et al. (2013). From the results in Zhang and Zhang (2012), (C2) controls the number of nonzero elements of \({\varvec{\hat{{\beta }}}}^o(\gamma )\) up to an order of \(|A_*|\) under the \(\eta \)-null consistency condition of Zhang and Zhang (2012).

First, we prove that, with probability tending to 1 when \(p>n\), the oracle LASSO is one of the local minimizers of \(Q_{\lambda ,\gamma }\) whose number of nonzero elements is at most \((\alpha _0+1)|A_*|\). Given an integer \(s<n\), let

$$\begin{aligned} \varvec{{\Omega }}_{\lambda ,\gamma }(s) = \big \{{\varvec{\hat{{\beta }}}} \in \varvec{{\Omega }}_{\lambda ,\gamma }: \Vert {\varvec{\hat{{\beta }}}}\Vert _0 \le s\big \} \end{aligned}$$

be the set of all local minimizers that have at most s nonzero elements. Further, let \(\xi _{\lambda ,\gamma }^* = m_*-a(\lambda -\gamma )-\gamma \sqrt{(\alpha _0+1)|A_*|/\kappa _{\min }^2}\) and \(\zeta _{\lambda ,\gamma }^*=\lambda - \gamma \sqrt{(\alpha _0+1)|A_*|/\kappa _{\min }}\), where \(m_*=\min _{(k,j)\in A_*}|\beta _{kj}^*|\).

Theorem 1

Assume that (C1) and (C2) hold, then

$$\begin{aligned} \mathbf{P}\big ({\varvec{\hat{{\beta }}}}^o(\gamma ) \in \varvec{{\Omega }}_{\lambda ,\gamma }((\alpha _0+1)|A_*|)\big ) \ge 1- \mathbf{P}_1 -\mathbf{P}_2 -\mathbf{P}_3, \end{aligned}$$

where \(\mathbf{P}_1=c_0|G_*|\exp (-d_0n\gamma ^2/4)\), \(\mathbf{P}_2= c_0(\alpha _0+1)|A_*| \exp \big (-d_0\kappa _{\min } n\xi _{\lambda ,\gamma }^{*2}\big )\) and \(\mathbf{P}_3=c_0(p-|G_*|)\exp \big (-d_0n\zeta _{\lambda ,\gamma }^{*2}\big ).\)

Corollary 1

Assume that (C1) and (C2) hold. If \(n\lambda ^2\rightarrow \infty \), \(\lambda =o(m_*)\) and \(n\gamma ^2 {\rightarrow }\infty \) then

$$\begin{aligned} \mathbf{P}\big ({\varvec{\hat{{\beta }}}}^o(\gamma ) \in \varvec{{\Omega }}_{\lambda ,\gamma }((\alpha _0+1)|A_*|)) {\rightarrow }1, \end{aligned}$$

provided that \(\log p=o(n\lambda ^2)\), \(\log |G_*|=o(n\gamma ^2)\) and \(\gamma =o(\lambda /\sqrt{|A_*|})\) as \(n{\rightarrow }\infty .\)

Remark 2

Theorem 1 and Corollary 1 imply that the oracle LASSO is one of the local minimizers of \(Q_{\lambda ,\gamma }\) even when \(p>n\) under the sub-Gaussian assumption. Note that the total number of variables in the model, \(p=\sum _{k=1}^K p_k\), is allowed to be of an exponential order in n. For example, suppose there exists a constant \(m_0>0\) such that \(m_*\ge m_0>0\) for all n. Then, by letting \(\lambda =n^{-1/3},\) Corollary 1 holds when \(p=\exp (n^{\delta _0})\) for some constant \(0<\delta _0<1/3.\)

Second, we prove that the DS penalized estimator is unique and asymptotically identical to the oracle LASSO, which is the main result of the paper. The next theorem and corollary show that, with probability tending to 1, the unique local minimizer in \(\varvec{{\Omega }}_{\lambda ,\gamma }((\alpha _0+1)|A_*|)\) is the oracle LASSO. Let \(\xi _{\lambda ,\gamma }^{**} = m_*-\max \{a,1/\kappa _{\min }\}(\lambda -\gamma )-\gamma \sqrt{(\alpha _0+1)|A_*|/\kappa _{\min }^2}\) and \(\zeta _{\lambda ,\gamma }^{**}=(\lambda -\gamma )\min \{a\kappa _{\min },1\}- \gamma (\sqrt{(\alpha _0+1)|A_*|/\kappa _{\min }}-1)\).

Theorem 2

Assume that (C1) and (C2) hold, then

$$\begin{aligned} \mathbf{P}\big (\varvec{{\Omega }}_{\lambda ,\gamma }((\alpha _0+1)|A_*|)=\{{\varvec{\hat{{\beta }}}}^o(\gamma )\})\ge 1- \mathbf{P}_1 -\mathbf{P}_2 -\mathbf{P}_3, \end{aligned}$$

where \(\mathbf{P}_1=c_0|G_*|\exp (-d_0n\gamma ^2/4)\), \(\mathbf{P}_2= c_0(\alpha _0+1)|A_*| \exp \big (-d_0\kappa _{\min } n\xi _{\lambda ,\gamma }^{**2}\big )\) and \(\mathbf{P}_3=c_0(p-|G_*|)\exp \big (-d_0n\zeta _{\lambda ,\gamma }^{**2}\big ).\)

Corollary 2

Assume that (C1) and (C2) hold. If \(n\lambda ^2\rightarrow \infty \), \(\lambda =o(m_*)\) and \(n\gamma ^2\rightarrow \infty \), then

$$\begin{aligned} \mathbf{P}(\varvec{{\Omega }}_{\lambda ,\gamma }((\alpha _0+1)|A_*|)=\{{\varvec{\hat{{\beta }}}}^o(\gamma )\}) {\rightarrow }1 \end{aligned}$$

provided that \(\log p=o(n\lambda ^2)\), \(\log |G_*|=o(n\gamma ^2)\) and \(\gamma =o(\lambda /\sqrt{|A_*|})\) as \(n{\rightarrow }\infty .\)

Remark 3

Theorem 2 and Corollary 2 allow \(|G_*|\) to be of an exponential order in n, which is impossible for existing \(L_2\)-norm composite penalties such as the \(L_2\)-norm MCP. For example, suppose there exists a constant \(m_0>0\) such that \(m_*\ge m_0>0\) for all n. Then, by letting \(\lambda =n^{-1/4}\) and \(\gamma =n^{-1/3},\) Corollary 2 holds when \(|G_*|=\exp (n^{\delta _0})\) for some constant \(0<\delta _0<1/3\).

Remark 4

Under similar regularity conditions, Bickel et al. (2009) and Zhang and Zhang (2012) proved that

$$\begin{aligned} \Vert {\varvec{\hat{{\beta }}}}^o(\gamma )-\varvec{{\beta }}^*\Vert _2^2 =O_p\big (|A_*|(\log |G_*|)/{n}\big ) \text{ and } \Vert {\varvec{\hat{{\beta }}}}^o(\gamma )\Vert _0=O_p(|A_*|). \end{aligned}$$

It is easy to see that the convergence rate is faster than the upper bound \(O_p(|A_*|(\log p)/n)\) of the LASSO but the variable selection bound is the same. Since the DS penalized estimator is asymptotically equivalent to the oracle LASSO, we can say that it improves the LASSO by incorporating group information. In addition, a similar argument can be applied to the group LASSO. For a given \(0<\eta <1\), Huang and Zhang (2010) proved that the group LASSO, \({\varvec{\hat{{\beta }}}}^g\), satisfies

$$\begin{aligned} \Vert {\varvec{\hat{{\beta }}}}^g-\varvec{{\beta }}^*\Vert _2^2 = O_p\big (\{|G_*| + |\mathcal{A}(\varvec{{\beta }}^*)| \log (|\mathcal{A}(\varvec{{\beta }}^*)|/\eta )\}/{n}\big ) \text{ and } \Vert {\varvec{\hat{{\beta }}}}^g\Vert _0=O_p(|G_*|) \end{aligned}$$

under the \((|\mathcal{A}(\varvec{{\beta }}^*)|,|G_*|)\)-strong group sparsity. This shows that the group LASSO is inferior to the oracle LASSO in both bounds.

Remark 5

When \(p\le n\), it is easy to check whether a given solution \({\varvec{\hat{{\beta }}}}\) satisfies the conditions in Lemma 3; that is, we can check whether \({\varvec{\hat{{\beta }}}}\) obtained by any algorithm is unique or not. When \(p>n\), we need to check the conditions in Lemma 3 by using \(\kappa _{\min }\) in (C2). However, it is impossible to check the conditions in general since \(\kappa _{\min }\) is computationally unavailable, and \(\alpha _0\) and \(|A_*|\) are unknown. If we assume that \(\alpha _0\) and \(|A_*|\) are known, then we can determine whether the solution \({\varvec{\hat{{\beta }}}}\) is the unique element of \(\varvec{{\Omega }}_{\lambda ,\gamma }((\alpha _0+1)|A_*|)\) by using a rough estimate of \(\kappa _{\min }\) obtained as in Section 5 of Kim and Kwon (2012).

4 Numerical studies

4.1 Simulation study

The performance of the proposed DS method is examined using Monte Carlo simulation. Throughout the simulation study, we use the following model:

$$\begin{aligned} y = \sum _{k=1}^{K}\sum _{j=1}^{p_k} x_{kj} \beta _{kj}^* + \varepsilon , \quad \varepsilon \sim N(0,\sigma ^2), \end{aligned}$$

where \(\sigma =2\). Similar to Huang et al. (2009), we set the covariate vector as follows:

$$\begin{aligned} x_{kj} = (z_k + w_{kj})/\sqrt{2},~k\le K, j \le p_k, \end{aligned}$$

where \(\mathbf{z}=(z_1,\ldots ,z_{K})^\mathrm{T} \sim N_K(\mathbf{0},\varvec{{\Sigma }})\) with \(\Sigma _{k_1k_2}=0.4^{|k_1-k_2|}\), \(k_1,k_2\le K\), and \(\mathbf{w}=(w_{11},\ldots , w_{Kp_K})^\mathrm{T} \sim N_p(\mathbf{0}, \mathbf{I})\), independently of \(\mathbf{z}\). For the true nonzero coefficients, we set \(\beta _{kj}^* = c/(k\times j)\), \(j \le q_k\), \(k\in \mathcal{A}(\varvec{{\beta }}^*)\), where the constant \(c>0\) is chosen so that the signal-to-noise ratio is 10. The group sizes are fixed as \(p_{2k} =12\) and \(p_{2k-1}=6\) for \(k\le K/2\), and the first 6 groups are set to be the signal groups, that is, \(\mathcal{A}(\varvec{{\beta }}^*) = \{1,\ldots ,6\}\). We consider two settings with different degrees of within-group sparsity: \(q_k = p_k/3\) for Example 1 and \(q_k = 2p_k/3\) for Example 2. All settings are replicated 400 times with \(K\in \{18,100\}\) and \(n\in \{200,400\}\). Hence, the total number of variables is \(p\in \{162,900\}\) and the number of signal variables is \(|A_*|\in \{18,36\}.\)
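For reproducibility, the following R sketch generates one data set from this design; the function names are ours, and we calibrate \(c\) under the assumption that the signal-to-noise ratio is defined as \(\mathrm{Var}(\mathbf{x}^\mathrm{T}\varvec{\beta }^*)/\sigma ^2\).

```r
cov_x <- function(group, rho = 0.4) {   # population covariance of x implied by the design
  (rho^abs(outer(group, group, "-")) + diag(length(group))) / 2
}

# One data set from the simulation design above; q_frac = 1/3 for Example 1 and
# 2/3 for Example 2. The SNR-based calibration of c is our reading of the text.
gen_data <- function(n, K, sigma = 2, q_frac = 1/3, rho = 0.4, snr = 10) {
  p_k   <- ifelse(seq_len(K) %% 2 == 0, 12, 6)          # p_{2k} = 12, p_{2k-1} = 6
  group <- rep(seq_len(K), p_k)
  p     <- sum(p_k)
  Sigma <- rho^abs(outer(seq_len(K), seq_len(K), "-"))
  z     <- matrix(rnorm(n * K), n, K) %*% chol(Sigma)   # z ~ N_K(0, Sigma)
  w     <- matrix(rnorm(n * p), n, p)                   # w ~ N_p(0, I), independent of z
  X     <- (z[, group] + w) / sqrt(2)                   # x_kj = (z_k + w_kj) / sqrt(2)
  beta  <- numeric(p)
  for (k in 1:6) {                                      # first 6 groups are signal groups
    j <- seq_len(round(p_k[k] * q_frac))
    beta[which(group == k)[j]] <- 1 / (k * j)
  }
  c0   <- sigma * sqrt(snr / as.numeric(t(beta) %*% cov_x(group, rho) %*% beta))
  beta <- c0 * beta
  y    <- as.numeric(X %*% beta + sigma * rnorm(n))
  list(X = X, y = y, beta = beta, group = group)
}

dat <- gen_data(n = 200, K = 18)   # p = 162 variables, |A_*| = 18 under Example 1
```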

We compared the DS method with the LASSO, the MCP, the group LASSO (gLASSO), the group Bridge (gBridge), the group MCP (gMCP) and the group exponential LASSO (gExp). We also considered five different versions of the DS method: four with \(\gamma \) fixed at \(\gamma =2\lambda ^*\), \(\gamma =\lambda ^*\), \(\gamma =\lambda ^*/2\) and \(\gamma =0\), where \(\lambda ^*\) is the optimal value of \(\lambda \) for the LASSO, and one with an optimal \(\gamma ^*\) chosen over a grid of candidate values. The R packages glmnet, grpreg and grppenalty were used to implement the other estimators. All the tuning parameters of each method were chosen by using an independent validation data set of size n / 2.

Tables 1 and 2 display the squared prediction error (PE), calculated from an independent test data set of size 2n, and the model error (ME), calculated as \(({\varvec{\hat{{\beta }}}}-\varvec{{\beta }}^*)^\mathrm{T}E(\mathbf{x}\mathbf{x}^\mathrm{T})({\varvec{\hat{{\beta }}}} - \varvec{{\beta }}^*)\), where \(\mathbf{x}\) denotes the covariate vector (Fan and Li 2001). The tables also report the numbers of groups correctly and incorrectly selected (G.C and G.IC) and the numbers of variables correctly and incorrectly selected (V.C and V.IC). A method is considered to select a group if at least one coefficient in the group is estimated as nonzero.
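For completeness, a sketch of how the ME and the selection counts can be computed for an estimate under the simulation design is given below; it reuses cov_x() from the data-generation sketch above, and the function names are ours.

```r
# Model error of an estimate under the simulation design (cov_x() as above).
model_error <- function(beta_hat, beta_true, group, rho = 0.4) {
  d <- beta_hat - beta_true
  as.numeric(t(d) %*% cov_x(group, rho) %*% d)
}

# G.C/G.IC and V.C/V.IC; a group counts as selected if at least one of its
# coefficients is estimated as nonzero.
selection_counts <- function(beta_hat, beta_true, group) {
  sel_g  <- tapply(beta_hat  != 0, group, any)
  true_g <- tapply(beta_true != 0, group, any)
  c(G.C  = sum(sel_g &  true_g), G.IC = sum(sel_g & !true_g),
    V.C  = sum(beta_hat != 0 & beta_true != 0),
    V.IC = sum(beta_hat != 0 & beta_true == 0))
}
```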

First, in Example 1, the LASSO outperforms the gLASSO in terms of PE and ME although it does not employ the group structure, and so does the MCP. As expected, the G.IC and V.IC of the LASSO and gLASSO are noticeably large. This indicates that ignoring the group structure results in including many noise groups and variables and in inconsistent group and variable selection. However, the gLASSO often performs better than the MCP in Example 2, where each signal group includes many signal variables, which is analogous to the results of Huang and Zhang (2010). The gLASSO and the MCP behave oppositely in individual variable selection: the MCP has small V.IC and large V.C, whereas the gLASSO shows large V.IC and small V.C.

Second, in terms of prediction, the DS methods, except the DS with \(\gamma =2\lambda ^*\), have smaller PE and ME than the other methods in all cases. This shows that a practical choice of \(\gamma \in \{\lambda ^*,\lambda ^*/2\}\) performs well, which significantly reduces the computational cost. The gBridge often shows the best PE and ME when \(K=18\) but has larger PE and ME than the DS methods when \(K=100\). We found that the algorithm implemented in \(\mathsf{R}\) is often unstable when \(K=100\). The gMCP performs the worst in PE and ME except when \(n=400\), where the gExp performs the worst.

Table 1 Simulation results of Example 1
Table 2 Simulation results of Example 2

Third, in group selection, the DS methods show the largest G.C and G.IC, and the gBridge has the smallest G.C and G.IC. This implies that the DS methods are denser than the other methods, and the gBridge is the most sparse in group selection. The gMCP has the smallest G.C, especially when \(n=100\), which may be a reason for its large PE and ME, and the gExp shows the smallest G.IC. In individual variable selection, the gMCP and gExp show much larger V.IC than the gBridge and the DS methods, which shows that they select groups well but fail to exclude irrelevant variables in the selected groups. The gBridge often shows the best performance when \(K=18\) but is too sparse when \(K=100\). On the other hand, the DS methods have slightly smaller V.C than the others but much smaller V.IC in all cases, which shows that the DS methods perform better than the other methods in variable selection.

To sum up, the DS methods and the gBridge compete with each other, showing the best performance in all measures: the gBridge is best when \(K=18\) and the DS methods are best when \(K=100\). As a practical choice of \(\gamma \) in the DS methods, we recommend using a fixed value, such as \(\gamma \in \{\lambda ^*,\lambda ^*/2\}\) in the simulation, to reduce the computational cost. Too large a value, such as \(\gamma =2\lambda ^*\), does not work well, and thus we recommend using \(\gamma =\lambda ^*/2\) in practice.

4.2 Ways of tuning parameter selection

The performance of penalized estimation depends on the choice of tuning parameters. The theorems and corollaries in Sect. 3 can be used to develop some guidelines for choosing the tuning parameters of the DS method. For example, motivated by the universal regularization parameter proposed by Zhang (2010), we may set \(\lambda =\sigma \{(2/n)\log p\}^{1/2}\) when \(\varepsilon \) is a Gaussian random variable with mean 0 and known variance \(\sigma ^2\). This works as a method of group selection since the DS method is group selection consistent. However, such a choice may perform poorly in practice unless the sample size is sufficiently large. Hence, we need to develop practical ways of selecting tuning parameters that produce reasonable finite sample performance.

For this issue, we consider four different ways of tuning parameter selection. The first is to use independent validation sets, as in our simulation studies, assuming that the whole sample size is moderately large. For comparison, we consider two validation set sizes, 4n and n / 2, denoted by V\(_{4n}\) and V\(_{n/2}\), respectively. The second is k-fold cross-validation with \(k=5\), denoted by CV\(_5\). Validation sets and cross-validation are based on prediction accuracy and hence often produce slightly overfitted models (Wang et al. 2007, 2009), but they are the most practical ways of selecting tuning parameters. The last is the high-dimensional BIC criterion of Wang et al. (2013) for unknown \(\sigma ^2\): \(\text{ BIC }_W=\log (\hat{\sigma }^2) + \{C_n(\log p)/n\}\hat{M}\), where \(\hat{\sigma }^2 = \Vert \mathbf{y}-\mathbf{X}{\varvec{\hat{{\beta }}}}\Vert ^2/n\), \(C_n\) is a slowly diverging sequence, for which we use \(C_n = \log (\log n)\) in this study, and \(\hat{M}\) is the number of variables included in the model.
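A direct R implementation of this criterion is sketched below (the function name is ours); it can be evaluated on each candidate fit, for example on fits from ds_cccp() above over a grid of \((\lambda ,\gamma )\) values, and the fit with the smallest value is selected.

```r
# BIC_W of Wang et al. (2013) with C_n = log(log n), as used in this study.
bic_w <- function(beta_hat, X, y) {
  n <- nrow(X); p <- ncol(X)
  sigma2_hat <- sum((y - X %*% beta_hat)^2) / n     # residual variance estimate
  M_hat      <- sum(beta_hat != 0)                  # number of selected variables
  log(sigma2_hat) + log(log(n)) * (log(p) / n) * M_hat
}
```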

Tables 3 and 4 present the simulation results, where the simulation settings are the same as those used for Tables 1 and 2; we only consider the case \(K=100\) and \(n=400\) for comparison. It is easy to see that the models based on V\(_{4n}\) always show the smallest PE and ME for all the methods, which is reasonable since the validation error is a direct estimate of the test error. The BIC\(_W\) produces the worst results, having slightly larger PE than the other criteria but much larger ME. However, in group and individual variable selection, all the methods have much smaller G.IC and V.IC when the BIC\(_W\) is used, while G.C and V.C are slightly smaller than with the other criteria. Hence, the BIC\(_W\) seems to be the best for group and variable selection although it produces higher PE and ME, while the other criteria show similar performance and produce denser models than the BIC\(_W\). We note that, among the methods, the DS methods have the smallest PE and ME when \(\gamma \in \{\lambda ^*/2,\gamma ^*\}\) for almost all cases, regardless of the criterion, and all the methods are quite insensitive to the choice of tuning parameter selection when the criterion is prediction-based. The V\(_{n/2}\) and CV\(_5\) show the second- and third-best performance in PE and ME, but the differences from V\(_{4n}\) are not substantial. This shows that V\(_{n/2}\) and CV\(_5\) are simple but practical criteria. Further, we note that V\(_{n/2}\) and CV\(_5\) perform similarly to each other for all the methods, which indicates that one may use small validation sets instead of cross-validation to reduce the computational cost. To sum up, we recommend using the BIC\(_W\) for selection and V\(_{n/2}\) or CV\(_5\) for prediction; the DS methods based on these criteria perform quite well with \(\gamma \in \{\lambda ^*/2,\lambda ^*\}\).

Table 3 Comparison of tuning parameter selection methods in Example 1
Table 4 Comparison of tuning parameter selection methods in Example 2

4.3 Real data analysis

We analyze two real data sets:

  • Ozone data: The Ozone data set is a popular real data set that has been analyzed in the literature, including Breiman and Friedman (1985) and Lin and Zhang (2006). The data set is available from the R package ‘\(\mathsf {mlbench}\)’, including short descriptions of the 3 categorical and 9 continuous independent variables for the dependent variable ‘\(\mathsf {daily~maximum~one~hour~average~ozone~reading}\)’. We drop the variable ‘\(\mathsf {temperature~measured~at~El~Monte,~CA}\)’ since it has too many missing values, and we exclude 36 observations with missing values. We keep all the categorical variables except ‘\(\mathsf {day~of~week}\)’.

  • TRIM data: We use the data set in Scheetz et al. (2006), which consists of the expression levels of 18,975 genes obtained from 120 rats. The main objective of the analysis is to find genes that are correlated with the \(\mathsf {TRIM32}\) gene, which is known to cause Bardet–Biedl syndrome. As was done in Huang et al. (2008), we first select the 3000 genes with the largest variance in expression levels and then choose the top 100 genes that have the largest absolute correlation with \(\mathsf {TRIM32}\) among the selected 3000 genes.

Consider a linear regression model with K predictive variables for the analysis:

$$\begin{aligned} y = \varvec{w}^\mathrm{T}\varvec{{\alpha }} + \varepsilon , \end{aligned}$$
(15)

where \(\varvec{w}=(w_1,\dots ,w_K)^\mathrm{T}\) is the vector of predictive variables and \(\varvec{{\alpha }} \in \mathbb {R}^{K}\) is a regression coefficient vector. A problem with the linear model is that the relation of \(w_k\) to y may not be linear, in particular when the empirical distribution of \(w_k\) is highly skewed. An alternative approach is to use an additive model. For the kth component \(w_k\), let \(-\infty<c_{k0}<c_{k1}<\cdots<c_{kp_k}<\infty \) be a given increasing sequence of \(p_k+1\) real numbers such that the support of \(w_k\) is contained in \([c_{k0},c_{kp_k}].\) Then we have an additive model with piecewise linear bases:

$$\begin{aligned} y=\beta _0 + \sum _{k=1}^{K}\sum _{j=1}^{p_k}\beta _{kj}x_{kj} (w_k)+ \varepsilon , \end{aligned}$$
(16)

where \(x_{k1}(w_k) = w_k\) and \(x_{kj}(w_k) = (w_k - c_{k(j-1)})_+\) for \(2\le j\le p_k\). The DS penalty can delete unnecessary components and, at the same time, delete unnecessary basis functions within each component, which is new for the additive model.
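A minimal R sketch of this basis expansion is given below; the function name is ours, and the knots are placed at empirical quantiles so that the numbers of observations between consecutive knots are roughly equal, as done in the analysis below.

```r
# Piecewise linear basis (16) for one component w_k: x_k1(w) = w and
# x_kj(w) = (w - c_{k(j-1)})_+ for j = 2, ..., p_k, with quantile-based knots.
pl_basis <- function(w, p_k = 9) {
  knots <- quantile(w, probs = seq_len(p_k - 1) / p_k, names = FALSE)  # c_{k1}, ..., c_{k(p_k-1)}
  cbind(w, sapply(knots, function(c) pmax(w - c, 0)))
}

# Expanding each continuous covariate column of a matrix W gives the grouped
# design matrix for the additive model (W and X_add are illustrative names):
# X_add <- do.call(cbind, lapply(seq_len(ncol(W)), function(k) pl_basis(W[, k])))
```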

We consider the estimators of the simulation studies for the additive model (16), and add two more estimators for comparison, the LASSO and the MCP for the linear model (15), denoted by LASSO\(_0\) and MCP\(_0\). We only consider the DS methods with \(\gamma \in \{\lambda ^*,\lambda ^*/2\}\), which perform well as shown in the simulations, and compare two ways of selecting tuning parameters, the BIC\(_W\) and CV\(_5\). We let \(p_k\) be either 5 or 9, the knots \(c_{kj}\) are selected so that the numbers of observations in \((c_{k(j-1)},c_{kj}]\) are roughly equal, and each continuous variable is then transformed into \(p_k\) new variables.

Tables 5 and 6 present the averages of the prediction errors (PE) and of the numbers of selected groups (#G) and variables (#V), based on 100 random partitions into training (70 %) and test (30 %) sets. First of all, the estimated additive models show smaller PE than the linear models in all cases. The models selected by the BIC\(_W\) tend to include fewer groups and variables than those selected by CV\(_5\), which may cause lower prediction accuracy for all the methods. We notice that the gLASSO shows the largest PE and #V in all cases, and it selects the null model for the TRIM data when the BIC\(_W\) is used.

Among the group and variable selection methods, the DS methods have the smallest PE and the gBridge is the second best for both data sets. The DS methods produce denser models than the others for the TRIM data, and the difference is much clearer when we use CV\(_5\). However, for the Ozone data, the gMCP and gExp select many more variables than the gBridge and the DS methods, especially when we use CV\(_5\). For example, the gMCP and gExp include about 7 and 5 variables in each selected group, while the DS methods include only 2 variables, keeping a better PE. For reference, Fig. 2 shows the partial fits of the 9 variables selected by the DS with \(\gamma =\lambda ^*/2\) based on CV\(_5\) for the Ozone data. Further, as observed in the simulation studies, the DS methods are less sensitive than the other methods to the choice of tuning parameter selection method. Based on these results, we conclude that the additive model is a useful alternative to the linear model, and the DS penalty is well suited for this model.

Table 5 Results for Ozone data
Table 6 Results for TRIM data
Fig. 2: Partial fits on the selected variables of the Ozone data when \(p_k=9\): month (month), temperature measured (tem), inversion base temperature (ibt), visibility measured (vis), pressure gradient (dpg), pressure height (vh), inversion base height (ibh), humidity (hum), wind speed (wind)

5 Concluding remarks

There are two regularization parameters \(\lambda \) and \(\gamma \) in the DS penalty, and it may be computationally demanding to select them simultaneously. For this issue, we can consider an alternative: fix \(\gamma \) first and then choose \(\lambda \) by CV or a BIC-type criterion. Although we did not pursue the theoretical properties of this two-step procedure in this paper, the DS method performs quite well with it once \(\gamma \) is chosen appropriately. In practice, we recommend, for example, using a \(\gamma \) around the optimal \(\lambda ^*\) obtained from the LASSO.

We have shown empirically that the additive model with sparse penalties is a promising alternative to linear models. However, the number of basis functions per component, \(p_k\), should be chosen carefully. When \(p_k\) is too small, the model may not capture the functional relation properly. On the other hand, when \(p_k\) is too large, the corresponding design matrix may be ill-conditioned. We leave the optimal choice of \(p_k\) for future work.