1 Introduction

The lasso (Tibshirani 1996) is a very popular technique for variable selection with high-dimensional data. Consider the classical linear regression problem where we have a continuous response \(\mathbf{y}\in \mathbb {R}^{n}\) and an \(n \times p\) design matrix \(\mathbf{X}\). Because the intercept is not penalized, we can remove it by first centering \(\mathbf{y}\) and each column of \(\mathbf{X}\) so that all variables have mean zero. Lasso linear regression solves the following \(\ell _1\)-penalized least squares problem:

$$\begin{aligned} \mathop {\mathrm{argmin}}_{\varvec{\beta }} \frac{1}{2} \Vert \mathbf{y}-\mathbf{X}\varvec{\beta }\Vert ^2_2+\lambda \Vert \varvec{\beta } \Vert _1, \qquad \lambda >0. \end{aligned}$$
(1)

The group-lasso (Yuan and Lin 2006) is a generalization of the lasso for group-wise variable selection. Yuan and Lin (2006) motivated the group-wise variable selection problem by two important examples. The first example concerns the multi-factor ANOVA problem where each factor is expressed through a set of dummy variables. In the ANOVA model, deleting an irrelevant factor is equivalent to deleting a group of dummy variables. The second example is the commonly used additive model in which each nonparametric component may be expressed as a linear combination of basis functions of the original variables. Removing a component in the additive model amounts to removing a group of coefficients of the basis functions. In general, suppose that the predictors are put into \(K\) non-overlapping groups such that \((1,2,\ldots ,p)=\bigcup ^K_{k=1} I_k\), where the cardinality of index set \(I_k\) is \(p_k\) and \(I_k \bigcap I_{k^{\prime }}=\emptyset \) for \(k \ne k^{\prime }\). Consider the linear regression model again; the group-lasso linear regression model solves the following penalized least squares problem:

$$\begin{aligned} \mathop {\mathrm{argmin}}_{\varvec{\beta }} \frac{1}{2} \Vert \mathbf{y}-\mathbf{X}\varvec{\beta }\Vert ^2_2+\lambda \sum ^K_{k=1} \sqrt{p_k}\Vert \varvec{\beta }^{(k)}\Vert _2,\qquad \lambda >0, \end{aligned}$$
(2)

where \(\Vert \varvec{\beta }^{(k)}\Vert _2=\sqrt{\sum _{j \in I_k } \beta ^2_j}\). The group-lasso idea has been used in penalized logistic regression (Meier et al. 2008).

The group-lasso is computationally more challenging than the lasso. The entire solution paths of the lasso penalized least squares can be efficiently computed by the least angle regression (LARS) algorithm (Efron et al. 2004); see also the homotopy algorithm of Osborne et al. (2000). However, LARS-type algorithms are not applicable to the group-lasso penalized least squares, because its solution paths are not piecewise linear. Another efficient algorithm for solving the lasso problem is the coordinate descent algorithm (Tseng 2001; Fu 1998; Daubechies et al. 2004; Genkin et al. 2007; Wu and Lange 2008; Friedman et al. 2010). Yuan and Lin (2006) implemented a block-wise descent algorithm for the group-lasso penalized least squares by following the idea of Fu (1998). However, their algorithm requires the group-wise orthonormal condition, i.e., \(\mathbf{X}^{{ \mathsf T}}_{(k)} \mathbf{X}_{(k)}=\mathbf{I}_{p_k}\) where \(\mathbf{X}_{(k)}=[\cdots X_j \cdots ]\), \(j \in I_{k}\). Meier et al. (2008) also developed a block coordinate gradient descent (BCGD) algorithm for solving the group-lasso penalized logistic regression. Meier’s algorithm is implemented in the R package grplasso, available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/packages/grplasso.

From an optimization viewpoint, it is more interesting to solve the group-lasso with a general design matrix. From a statistical perspective, the group-wise orthonormal condition should not be the basis of a good algorithm for solving the group-lasso problem, even though we can transform the predictors within each group to meet the group-wise orthonormal condition. The reason is that even when the group-wise orthonormal condition holds for the observed data, it can easily be violated when we remove a fraction of the data or perturb the dataset as in bootstrap or sub-sampling. In other words, we cannot perform cross-validation, bootstrap or sub-sampling analysis of the group-lasso if the algorithm’s validity depends on the group-wise orthonormal condition. In the popular MATLAB package SLEP, Liu et al. (2009) implemented Nesterov’s method (Nesterov 2004, 2007) for a variety of sparse learning problems. For the group-lasso case, SLEP provides functions for solving the group-lasso penalized least squares and logistic regression. Nesterov’s method can handle general design matrices. The SLEP package is available at http://www.public.asu.edu/~jye02/Software/SLEP.

In this paper we consider a general formulation of group-lasso penalized learning where the learning procedure is defined by minimizing the sum of an empirical loss and the group-lasso penalty. The aforementioned group-lasso penalized least squares and logistic regression are two examples of this general formulation. We propose a simple unified algorithm, groupwise-majorization-descent (GMD), for solving the general group-lasso learning problems under the condition that the loss function satisfies a quadratic majorization (QM) condition. GMD is remarkably simple and has provable numerical convergence properties. We show that the QM condition indeed holds for many popular loss functions used in regression and classification, including the squared error loss, the Huberized hinge loss, the squared hinge loss and the logistic regression loss. It is also important to point out that GMD works for general design matrices, without requiring the group-wise orthonormal condition. We have implemented the proposed algorithm in an R package gglasso which contains functions for fitting the group-lasso penalized least squares, logistic regression, Huberized SVM using the Huberized hinge loss and squared SVM using the squared hinge loss. The Huberized hinge loss and the squared hinge loss are interesting loss functions for classification from a machine learning viewpoint. In fact, there is both theoretical and empirical evidence showing that the Huberized hinge loss is better than the hinge loss (Zhang 2004; Wang et al. 2008). The group-lasso penalized Huberized SVM and squared SVM are not implemented in grplasso and SLEP.

Here we use the breast cancer data (Graham et al. 2010) to demonstrate the speed advantage of gglasso over grplasso and SLEP. This is a binary classification problem with \(n=42\) and \(p=\text {22,283}\). We fit a sparse additive logistic regression model using the group-lasso. Each variable contributes an additive component that is expressed by five B-spline basis functions. The group-lasso penalty is imposed on the coefficients of the five B-spline basis functions for each variable. Therefore, the corresponding group-lasso logistic regression model has 22,283 groups and each group has \(5\) coefficients to be estimated. Displayed in Fig. 1 are three solution path plots produced by grplasso, SLEP and gglasso. We computed the group-lasso solutions at 100 \(\lambda \) values on an Intel Xeon X5560 (Quad-core 2.8 GHz) processor. It took SLEP about 450 seconds and grplasso about 360 seconds to compute the logistic regression paths, while gglasso used only about 10 seconds.

The rest of this article is organized as follows. In Sect. 2 we formulate the general group-lasso learning problem. We introduce the quadratic majorization (QM) condition and show that many popular loss functions for regression and classification satisfy the QM condition. In Sect. 3 we derive the GMD algorithm for solving the group-lasso model satisfying the QM condition and discuss some important implementation issues. Simulation and real data examples are presented in Sect. 4. We end the paper with a few concluding remarks in Sect. 5. We present technical proofs in an Appendix.

Fig. 1

Fit a sparse additive logistic regression model using the group-lasso on the breast cancer data (Graham et al. 2010) with \(n=42\) patients and 22,283 genes (groups). Each gene’s contribution is modeled by 5 B-Spline basis functions. The solution paths are computed at 100 \(\lambda \) values. The vertical dotted lines indicate the selected \(\lambda \) (\(\log \lambda =-3.73\)), which selects 8 genes

2 Group-lasso models and the QM condition

2.1 Group-lasso penalized empirical loss

To define a general group-lasso model, we need to introduce some notation. Throughout this paper we use \(\mathbf{x}\) to denote the generic predictors which are used to fit the group-lasso model. Note that \(\mathbf{x}\) may not be the original variables in the raw data. For example, suppose we use the group-lasso to fit an additive regression model. The original predictors are \(z_1,\ldots ,z_q\), but we generate the \(\mathbf{x}\) variables from basis functions of \(z_1,\ldots ,z_q\); for instance, \(x_1=z_1,x_2=z^2_1,x_3=z^3_1\), \(x_4=z_2,x_5=z^2_2\), and so on. We assume that the user has defined the \(\mathbf{x}\) variables and we only focus on how to compute the group-lasso model defined in terms of the \(\mathbf{x}\) variables.

Let \(\mathbf{X}\) be the design matrix with \(n\) rows and \(p\) columns, where \(n\) is the sample size of the raw data. If an intercept is used in the model, we let the first column of \(\mathbf{X}\) be a vector of ones. Assume that the group membership is already defined such that \((1,2,\ldots ,p)=\bigcup ^K_{k=1} I_k\), the cardinality of index set \(I_k\) is \(p_k\), and \(I_k \bigcap I_{k^{\prime }}=\emptyset \) for \(k \ne k^{\prime }, 1 \le k, k^{\prime } \le K\). Group \(k\) contains \(x_j, j \in I_{k}\), for \(1 \le k \le K.\) If an intercept is included, then \(I_1=\{1\}\). Given the group partition, we use \(\varvec{\beta }^{(k)}\) to denote the segment of \(\varvec{\beta }\) corresponding to group \(k\). This notation is used for any \(p\)-dimensional vector.

Suppose that the statistical model links the predictors to the response variable \(y\) via a linear function \(f=\varvec{\beta }^{{ \mathsf T}} \mathbf{x}\). Let \(\Phi (y,f)\) be the loss function used to fit the model. In this work we primarily focus on statistical methods for regression and binary classification, although our algorithms are developed for a general loss function. For regression, the loss function \(\Phi (y,f)\) is often defined as \(\Phi (y-f)\). For binary classification, we use \(\{+1,-1\}\) to code the class label \(y\) and consider the large margin classifiers where the loss function \(\Phi (y,f)\) is defined as \(\Phi (y f)\). We obtain an estimate of \(\varvec{\beta }\) via the group-lasso penalized empirical loss formulation defined as follows:

$$\begin{aligned} \mathop {\mathrm{argmin}}_{\varvec{\beta }} \frac{1}{n}\sum ^n_{i=1}\tau _i\Phi (y_i,\varvec{\beta }^{{ \mathsf T}} \mathbf{x}_i)+\lambda \sum ^K_{k=1} w_k \Vert \varvec{\beta }^{(k)}\Vert _2, \end{aligned}$$
(3)

where \(\tau _i \ge 0\) and \(w_k \ge 0\) for all \(i,k\).

Note that we have included two kinds of weights in the general group-lasso formulation. The observation weights \(\tau _i\)s are introduced in order to cover methods such as weighted regression and weighted large margin classification. The default choice for \(\tau _i\) is 1 for all \(i\). We have also included penalty weights \(w_k\)s in order to make a more flexible group-lasso model. The default choice for \(w_k\) is \(\sqrt{p_k}\). If we do not want to penalize a group of predictors, simply let the corresponding weight be zero. For example, the intercept is typically not penalized so that \(w_1=0\). Following the adaptive lasso idea (Zou 2006), one could define the adaptively weighted group-lasso which often has better estimation and variable selection performance than the un-weighted group-lasso (Wang and Leng 2008). Our algorithms can easily accommodate both observation and penalty weights.
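To make the formulation concrete, the objective in (3) is easy to evaluate once the loss, the observation weights and the penalty weights are specified. The following R sketch is purely illustrative (the function name and arguments are ours, not part of gglasso); it assumes the loss is supplied as a vectorized function \(\Phi (y,f)\).

    ## Hedged sketch: evaluate the group-lasso objective (3).
    ## `group` is a length-p vector of group labels; `w` follows the order of
    ## sort(unique(group)); the defaults are tau_i = 1 and w_k = sqrt(p_k).
    group_lasso_objective <- function(beta, X, y, group, loss, lambda,
                                      tau = rep(1, nrow(X)),
                                      w = sqrt(as.numeric(table(group)))) {
      f <- drop(X %*% beta)                       # linear predictors beta^T x_i
      empirical_loss <- mean(tau * loss(y, f))    # (1/n) sum_i tau_i Phi(y_i, f_i)
      group_norms <- tapply(beta, group, function(b) sqrt(sum(b^2)))
      empirical_loss + lambda * sum(w * group_norms)
    }

For example, loss = function(y, f) 0.5 * (y - f)^2 recovers the group-lasso penalized least squares criterion (2), up to the \(1/n\) scaling of the loss term.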

2.2 The QM condition

For notational convenience, we use \(\mathbf{D}\) to denote the working data \(\{\mathbf{y},\mathbf{X}\}\) and let \(L(\varvec{\beta }|\mathbf{D})\) be the empirical loss, i.e.,

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})=\frac{1}{n}\sum ^n_{i=1}\tau _i\Phi (y_i,\varvec{\beta }^{{ \mathsf T}} \mathbf{x}_i). \end{aligned}$$

Definition 1

The loss function \(\Phi \) is said to satisfy the quadratic majorization (QM) condition, if and only if the following two assumptions hold:

  (i)

    \(L(\varvec{\beta }\mid \mathbf{D})\) is differentiable as a function of \(\varvec{\beta }\), i.e., \(\nabla L( \varvec{\beta }| \mathbf{D})\) exists everywhere.

  (ii)

    There exists a \(p \times p\) matrix \(\mathbf{H}\), which may only depend on the data \(\mathbf{D}\), such that for all \(\varvec{\beta }, \varvec{\beta }^*\),

    $$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})&\le L(\varvec{\beta }^* \mid \mathbf{D})+(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}}\nabla L(\varvec{\beta }^* | \mathbf{D})\nonumber \\&+\,\frac{1}{2}(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}} \mathbf{H}( \varvec{\beta }- \varvec{\beta }^*). \end{aligned}$$
    (4)

The following lemma characterizes a class of loss functions that satisfies the QM condition.

Lemma 1

Let \(\tau _i\), \(1 \le i \le n\) be the observation weights. Let \(\varvec{\Gamma }\) be a diagonal matrix with \(\varvec{\Gamma }_{ii}=\tau _i\). Assume \(\Phi (y,f)\) is differentiable with respect to \(f\) and write \(\Phi _f^{\prime }=\frac{\partial {\Phi (y,f)}}{\partial {f}}\). Then

$$\begin{aligned} \nabla L( \varvec{\beta }| \mathbf{D})=\frac{1}{n}\sum ^n_{i=1}\tau _i\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i \varvec{\beta })\mathbf{x}_i. \end{aligned}$$
  (1)

    If \(\Phi _f^{\prime }\) is Lipschitz continuous with constant \(C\) such that

    $$\begin{aligned} |\Phi _f^{\prime }(y,f_1)-\Phi _f^{\prime }(y,f_2)| \le C |f_1-f_2| \quad \forall \ y,f_1,f_2, \end{aligned}$$

    then the QM condition holds for \(\Phi \) and \(\mathbf{H}=\frac{2C}{n}\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}\).

  (2)

    If \(\Phi _f^{\prime \prime }=\frac{\partial ^2{\Phi (y,f)}}{\partial {f^2}}\) exists and

    $$\begin{aligned} \Phi _f^{\prime \prime } \le C_2 \quad \forall \ y,f, \end{aligned}$$

    then the QM condition holds for \(\Phi \) and \(\mathbf{H}=\frac{C_2}{n}\mathbf{X}^{{ \mathsf T}}\varvec{\Gamma }\mathbf{X}\).

In what follows we use Lemma 1 to verify that many popular loss functions indeed satisfy the QM condition. The results are summarized in Table 1.

Table 1 The QM condition is verified for the least squares, logistic regression, squared hinge loss and Huberized hinge loss

We begin with the classical squared error loss for regression: \(\Phi (y,f)=\frac{1}{2}(y-f)^2.\) Then we have

$$\begin{aligned} \nabla L( \varvec{\beta }| \mathbf{D})=-\frac{1}{n}\sum ^n_{i=1}\tau _i(y_i-\mathbf{x}^{{ \mathsf T}}_i \varvec{\beta })\mathbf{x}_i. \end{aligned}$$
(5)

Because \(\Phi _f^{\prime \prime }=1\), Lemma 1 part (2) tells us that the QM condition holds with

$$\begin{aligned} \mathbf{H}=\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}/n \equiv \mathbf{H}^{\mathrm {ls}}. \end{aligned}$$
(6)
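To make (5) and (6) concrete, the short R sketch below (illustrative code with names of our choosing) computes the weighted least squares gradient and \(\mathbf{H}^{\mathrm {ls}}\), and checks the majorization inequality (4) numerically; for the squared error loss the quadratic expansion is exact, so the bound holds with equality.

    set.seed(1)
    n <- 50; p <- 10
    X <- matrix(rnorm(n * p), n, p); y <- rnorm(n)
    tau <- rep(1, n); Gamma <- diag(tau)

    L_ls    <- function(beta) mean(tau * 0.5 * (y - X %*% beta)^2)
    grad_ls <- function(beta) -crossprod(X, tau * (y - X %*% beta)) / n   # eq. (5)
    H_ls    <- crossprod(X, Gamma %*% X) / n                              # eq. (6)

    ## Numerical check of the QM inequality (4) at two arbitrary points
    beta <- rnorm(p); beta_star <- rnorm(p); d <- beta - beta_star
    lhs <- L_ls(beta)
    rhs <- L_ls(beta_star) + sum(d * grad_ls(beta_star)) +
           0.5 * drop(t(d) %*% H_ls %*% d)
    all.equal(lhs, rhs)    # TRUE: equality for the squared error loss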

We now discuss several margin-based loss functions for binary classification. We code \(y\) by \(\{+1,-1\}\). The logistic regression loss is defined as \( \Phi (y,f)=\mathrm{Logit}(yf)=\log (1+\exp (-yf)). \) We have \(\Phi _f^{\prime }=-y \frac{1}{1+\exp (yf)}\) and \(\Phi _f^{\prime \prime }=y^2\frac{\exp (yf)}{(1+\exp (yf))^2}=\frac{\exp (yf)}{(1+\exp (yf))^2}\). Then we write

$$\begin{aligned} \nabla L(\varvec{\beta }\mid \mathbf{D})=-\frac{1}{n}\sum _{i=1}^{n}\tau _iy_{i}\mathbf{x}_i \frac{1}{1+\exp (y_i\mathbf{x}^{{ \mathsf T}}_i\varvec{\beta })} . \end{aligned}$$
(7)

Because \(\Phi _f^{\prime \prime } \le 1/4\), by Lemma 1 part (2) the QM condition holds for the logistic regression loss and

$$\begin{aligned} \mathbf{H}=\frac{1}{4}\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}/n \equiv \mathbf{H}^{\mathrm {logit}}. \end{aligned}$$
(8)

The squared hinge loss has the expression \(\Phi (y,f)=\mathrm{sqsvm}(yf)=\left[ (1-yf)_{+}\right] ^{2}\) where

$$\begin{aligned} (1-t)_{+}=\left\{ \begin{array}{ll} 0, &{} \quad t>1\\ 1-t, &{} \quad t\le 1. \end{array}\right. \end{aligned}$$

By direct calculation we have

$$\begin{aligned}&\Phi _f^{\prime } = \left\{ \begin{array}{ll} 0, &{} \quad yf>1\\ -2y(1-yf), &{} \quad yf\le 1 \end{array}\right. \nonumber \\&\nabla L(\varvec{\beta }\mid \mathbf{D})=-\frac{1}{n}\sum _{i=1}^{n}2\tau _iy_{i}\mathbf{x}_i(1-y_i\mathbf{x}^{{ \mathsf T}}_i\varvec{\beta })_{+}. \end{aligned}$$
(9)

We can also verify that \( |\Phi _f^{\prime }(y,f_1)-\Phi _f^{\prime }(y,f_2)| \le 2 |f_1-f_2|\). By Lemma 1 part (1) the QM condition holds for the squared hinge loss and

$$\begin{aligned} \mathbf{H}=4\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}/n \equiv \mathbf{H}^{\mathrm {sqsvm}}. \end{aligned}$$
(10)

The Huberized hinge loss is defined as \(\Phi (y,f)=\mathrm{hsvm}(yf)\) where

$$\begin{aligned} \mathrm{hsvm}(t)=\left\{ \begin{array}{ll} 0, &{}\quad t>1\\ (1-t)^{2}/(2\delta ), &{}\quad 1-\delta <t\le 1\\ 1-t-\delta /2, &{}\quad t\le 1-\delta . \end{array}\right. \end{aligned}$$

By direct calculation we have \(\Phi _f^{\prime }=-y\,\mathrm{hsvm}^{\prime }(yf)\) where

$$\begin{aligned} \mathrm{hsvm}^{\prime }(t)=\left\{ \begin{array}{ll} 0, &{}\quad t>1\\ (1-t)/\delta , &{}\quad 1-\delta <t\le 1\\ 1, &{}\quad t\le 1-\delta , \end{array}\right. \end{aligned}$$
$$\begin{aligned} \nabla L(\varvec{\beta }\mid \mathbf{D})=-\frac{1}{n}\sum _{i=1}^{n}\tau _i y_{i}\mathbf{x}_i\mathrm{hsvm}^{\prime }(y_i\mathbf{x}^{{ \mathsf T}}_i\varvec{\beta }). \end{aligned}$$
(11)

We can also verify that \( |\Phi _f^{\prime }(y,f_1)-\Phi _f^{\prime }(y,f_2)| \le \frac{1}{\delta } |f_1-f_2|. \) By Lemma 1 part (1) the QM condition holds for the Huberized hinge loss and

$$\begin{aligned} \mathbf{H}=\frac{2}{\delta } \mathbf{X}^{{ \mathsf T}}\varvec{\Gamma }\mathbf{X}/n \equiv \mathbf{H}^{\mathrm {hsvm}}. \end{aligned}$$
(12)
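For reference, the derivatives used above can be coded directly. The R sketch below (illustrative names) writes each margin loss derivative as a function of the margin \(t=yf\), so that \(\Phi _f^{\prime }(y,f)=y\,\phi ^{\prime }(yf)\) and the gradient formula of Lemma 1 reproduces (7), (9) and (11).

    ## Derivatives phi'(t) of the margin losses with respect to t = y * f.
    logit_prime <- function(t) -1 / (1 + exp(t))              # d/dt log(1 + exp(-t))
    sqsvm_prime <- function(t) ifelse(t > 1, 0, -2 * (1 - t)) # d/dt [(1 - t)_+]^2
    hsvm_prime  <- function(t, delta) {                       # d/dt hsvm(t)
      ifelse(t > 1, 0, ifelse(t > 1 - delta, -(1 - t) / delta, -1))
    }

    ## nabla L(beta | D) = (1/n) sum_i tau_i y_i phi'(y_i x_i^T beta) x_i,
    ## which reduces to (7), (9) and (11) for the three losses above.
    grad_margin <- function(beta, X, y, tau, phi_prime, ...) {
      t <- y * drop(X %*% beta)
      drop(crossprod(X, tau * y * phi_prime(t, ...))) / length(y)
    }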

3 GMD algorithm

3.1 Derivation

In this section we derive the groupwise-majorization-descent (GMD) algorithm for computing the solution of (3) when the loss function satisfies the QM condition. The objective function is

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})+\lambda \sum ^K_{k=1} w_k \Vert \varvec{\beta }^{(k)}\Vert _2. \end{aligned}$$
(13)

Let \(\widetilde{\varvec{\beta }}\) denote the current solution of \(\varvec{\beta }\). Without loss of generality, let us derive the GMD update of \(\widetilde{\varvec{\beta }}^{(k)}\), the coefficients of group \(k\). Define \(\mathbf{H}^{(k)}\) as the sub-matrix of \(\mathbf{H}\) corresponding to group \(k\). For example, if group 2 is \(\{2,4\}\) then \(\mathbf{H}^{(2)}\) is a \(2 \times 2\) matrix with \(\mathbf{H}^{(2)}_{11}=\mathbf{H}_{2,2},\mathbf{H}^{(2)}_{12}=\mathbf{H}_{2,4},\mathbf{H}^{(2)}_{21}=\mathbf{H}_{4,2},\mathbf{H}^{(2)}_{22}=\mathbf{H}_{4,4}\).

Write \(\varvec{\beta }\) such that \(\varvec{\beta }^{(k^{\prime })}=\widetilde{\varvec{\beta }}^{(k^{\prime })}\) for \(k^{\prime } \ne k\). Given \(\varvec{\beta }^{(k^{\prime })}=\widetilde{\varvec{\beta }}^{(k^{\prime })}\) for \(k^{\prime } \ne k\), the optimal \(\varvec{\beta }^{(k)}\) is defined as

$$\begin{aligned} \mathop {\mathrm{argmin}}_{\varvec{\beta }^{(k)}} L(\varvec{\beta }\mid \mathbf{D})+\lambda w_k \Vert \varvec{\beta }^{(k)}\Vert _2. \end{aligned}$$
(14)

Unfortunately, there is no closed form solution to (14) for a general loss function with a general design matrix. We overcome this computational obstacle by taking advantage of the QM condition. From (4) we have

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})&\le L(\widetilde{\varvec{\beta }} \mid \mathbf{D})+(\varvec{\beta }-\widetilde{\varvec{\beta }})^{{ \mathsf T}}\nabla L(\widetilde{\varvec{\beta }} | \mathbf{D})\\&+\frac{1}{2}(\varvec{\beta }-\widetilde{\varvec{\beta }})^{{ \mathsf T}} \mathbf{H}( \varvec{\beta }- \widetilde{\varvec{\beta }}). \end{aligned}$$

Write \(U(\widetilde{\varvec{\beta }} )=-\nabla L(\widetilde{\varvec{\beta }} | \mathbf{D})\). Using

$$\begin{aligned} \varvec{\beta }-\widetilde{\varvec{\beta }}=(\underbrace{0,\ldots ,0}_{k-1},\varvec{\beta }^{(k)} -\widetilde{\varvec{\beta }}^{(k)},\underbrace{0,\ldots ,0}_{K-k}), \end{aligned}$$

we can write

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})&\le L(\widetilde{\varvec{\beta }} \mid \mathbf{D})-(\varvec{\beta }^{(k)}-\widetilde{\varvec{\beta }}^{(k)})^{{ \mathsf T}} U^{(k)}\nonumber \\&+\frac{1}{2}(\varvec{\beta }^{(k)}-\widetilde{\varvec{\beta }}^{(k)})^{{ \mathsf T}} \mathbf{H}^{(k)} ( \varvec{\beta }^{(k)}- \widetilde{\varvec{\beta }}^{(k)}). \end{aligned}$$
(15)

Let \(\eta _k\) be the largest eigenvalue of \(\mathbf{H}^{(k)}\). We set \(\gamma _k = (1+\varepsilon ^{*}) \eta _k\), where \(\varepsilon ^*=10^{-6}\). Then we can further relax the upper bound in (15) as

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})&\le L(\widetilde{\varvec{\beta }} \mid \mathbf{D})-(\varvec{\beta }^{(k)}-\widetilde{\varvec{\beta }}^{(k)})^{{ \mathsf T}} U^{(k)}\nonumber \\&+\frac{1}{2} \gamma _k (\varvec{\beta }^{(k)}-\widetilde{\varvec{\beta }}^{(k)})^{{ \mathsf T}} ( \varvec{\beta }^{(k)}- \widetilde{\varvec{\beta }}^{(k)}). \end{aligned}$$
(16)

It is important to note that the inequality is strict unless \(\varvec{\beta }^{(k)} = \widetilde{\varvec{\beta }}^{(k)}\). Instead of minimizing (14) we solve

$$\begin{aligned}&\mathop {\mathrm{argmin}}_{\varvec{\beta }^{(k)}} L(\widetilde{\varvec{\beta }} \mid \mathbf{D})-(\varvec{\beta }^{(k)}-\widetilde{\varvec{\beta }}^{(k)})^{{ \mathsf T}} U^{(k)}\nonumber \\&\quad +\frac{1}{2} \gamma _k (\varvec{\beta }^{(k)}-\widetilde{\varvec{\beta }}^{(k)})^{{ \mathsf T}} ( \varvec{\beta }^{(k)}- \widetilde{\varvec{\beta }}^{(k)})+\lambda w_k \Vert \varvec{\beta }^{(k)}\Vert _2. \end{aligned}$$
(17)

Denote by \(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new})\) the solution to (17). It is straightforward to see that \(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new})\) has a simple closed-form expression

$$\begin{aligned} \widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) \!=\! \frac{1}{\gamma _k}\left( U^{(k)}\!+\!\gamma _k \widetilde{\varvec{\beta }}^{(k)} \right) \left( 1-\frac{\lambda w_k}{\Vert U^{(k)}+\gamma _k \widetilde{\varvec{\beta }}^{(k)} \Vert _2}\right) _{\!+}\!. \end{aligned}$$
(18)
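The update (18) is a groupwise soft-thresholding operation: the vector \(U^{(k)}+\gamma _k \widetilde{\varvec{\beta }}^{(k)}\) is scaled by \(1/\gamma _k\) and shrunk toward zero, with the whole group set exactly to zero when its norm falls below \(\lambda w_k\). The following R sketch of one full GMD cycle is purely illustrative (it is not the compiled routine used inside gglasso) and assumes the groups are labeled \(1,\ldots ,K\).

    ## One GMD cycle over all groups; grad_fn(beta) returns nabla L(beta | D)
    ## as a length-p vector, `group` is a length-p vector of labels in 1..K,
    ## and w[k] is the penalty weight of group k.
    gmd_cycle <- function(beta, grad_fn, H, group, w, lambda, eps = 1e-6) {
      for (k in seq_along(w)) {
        idx     <- which(group == k)
        U_k     <- -grad_fn(beta)[idx]            # U^(k) = -[nabla L(beta | D)]^(k)
        H_k     <- H[idx, idx, drop = FALSE]
        gamma_k <- (1 + eps) * max(eigen(H_k, symmetric = TRUE,
                                         only.values = TRUE)$values)
        v       <- U_k + gamma_k * beta[idx]      # U^(k) + gamma_k * beta^(k)
        vnorm   <- sqrt(sum(v^2))
        shrink  <- if (vnorm > 0) max(0, 1 - lambda * w[k] / vnorm) else 0
        beta[idx] <- (v / gamma_k) * shrink       # update (18)
      }
      beta
    }

Iterating such cycles until the coefficients stabilize is exactly the GMD iteration; in the package the gradient is maintained through the residual update described in Sect. 3.2 rather than recomputed from scratch at every step.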

Algorithm 1 summarizes the details of GMD.


We can prove the strict descent property of GMD by using the MM principle (Lange et al. 2000; Hunter and Lange 2004; Wu and Lange 2010). Define

$$\begin{aligned} Q(\varvec{\beta }\mid \mathbf{D})\!&= \!L(\widetilde{\varvec{\beta }} \mid \mathbf{D})\!-\!(\varvec{\beta }^{(k)}\!-\!\widetilde{\varvec{\beta }}^{(k)})^{{ \mathsf T}} U^{(k)}\nonumber \\&\!+\!\frac{1}{2} \gamma _k (\varvec{\beta }^{(k)}\!-\!\widetilde{\varvec{\beta }}^{(k)})^{{ \mathsf T}} ( \varvec{\beta }^{(k)}\!-\! \widetilde{\varvec{\beta }}^{(k)})+\lambda w_k \Vert \varvec{\beta }^{(k)}\Vert _2.\nonumber \\ \end{aligned}$$
(19)

Obviously, \(Q(\varvec{\beta }\mid \mathbf{D})=L(\varvec{\beta }\mid \mathbf{D})+\lambda w_k \Vert \varvec{\beta }^{(k)}\Vert _2\) when \(\varvec{\beta }^{(k)}=\widetilde{\varvec{\beta }}^{(k)}\) and (16) shows that \(Q(\varvec{\beta }\mid \mathbf{D}) > L(\varvec{\beta }\mid \mathbf{D})+\lambda w_k \Vert \varvec{\beta }^{(k)}\Vert _2\) when \(\varvec{\beta }^{(k)} \ne \widetilde{\varvec{\beta }}^{(k)}\). After updating \(\widetilde{\varvec{\beta }}^{(k)}\) using (18), we have

$$\begin{aligned} L(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) \mid \mathbf{D})\!+\!\lambda w_k \Vert \widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) \Vert _2 \!&\le \! Q(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) \mid \mathbf{D})\\ \!&\le \! Q(\widetilde{\varvec{\beta }} \mid \mathbf{D}) \\ \!&= \! L(\widetilde{\varvec{\beta }} \mid \mathbf{D})\!+\!\lambda w_k \Vert \widetilde{\varvec{\beta }}^{(k)}\Vert _2. \end{aligned}$$

Moreover, if \(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) \ne \widetilde{\varvec{\beta }}^{(k)}\), then the first inequality becomes strict:

$$\begin{aligned} L(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) \mid \mathbf{D})+\lambda w_k \Vert \widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) \Vert _2&< Q(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) \mid \mathbf{D}). \end{aligned}$$

Therefore, the objective function is strictly decreased after updating all groups in a cycle, unless the solution does not change after each groupwise update. If this is the case, we can show that the solution must satisfy the KKT conditions, which means that the algorithm converges and finds the right answer. To see this, if \(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new}) = \widetilde{\varvec{\beta }}^{(k)}\) for all \(k\), then by the update formula (18) we have that for all \(k\)

$$\begin{aligned} \widetilde{\varvec{\beta }}^{(k)} =&\frac{1}{\gamma _k}\left( U^{(k)}+\gamma _k \widetilde{\varvec{\beta }}^{(k)} \right) \left( 1-\frac{\lambda w_k}{\Vert U^{(k)}+\gamma _k \widetilde{\varvec{\beta }}^{(k)} \Vert _2}\right) \nonumber \\&\qquad \mathrm{if }\,\Vert U^{(k)}+\gamma _k \widetilde{\varvec{\beta }}^{(k)} \Vert _2 > \lambda w_{k},\end{aligned}$$
(20)
$$\begin{aligned} \widetilde{\varvec{\beta }}^{(k)} =&\varvec{0} \qquad \mathrm{if }\,\Vert U^{(k)}+\gamma _k \widetilde{\varvec{\beta }}^{(k)} \Vert _2 \le \lambda w_{k}. \end{aligned}$$
(21)

By straightforward algebra we obtain the KKT conditions:

$$\begin{aligned} -U^{(k)}+\lambda w_{k}\cdot \frac{\widetilde{\varvec{\beta }}^{(k)} }{\Vert \widetilde{\varvec{\beta }}^{(k)}\Vert _2}=\varvec{0}&\qquad \mathrm{if }\,\widetilde{\varvec{\beta }}^{(k)}\ne \varvec{0},\\ \left\| U^{(k)} \right\| _2 \le \lambda w_{k}&\qquad \mathrm{if }\,\widetilde{\varvec{\beta }}^{(k)}=\varvec{0}, \end{aligned}$$

where \(k=1,2,\ldots ,K\). Therefore, if the objective function stays unchanged after a cycle, the algorithm necessarily converges to the right answer.

3.2 Implementation

We have implemented Algorithm 1 for solving the group-lasso penalized least squares, logistic regression, Huberized SVM and squared SVM. These functions are contained in an R package gglasso publicly available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/packages/gglasso. We always include the intercept term in the model. Without loss of generality we always center the design matrix beforehand.

We solve each group-lasso model for a sequence of \(\lambda \) values from large to small. The default number of points is 100. Let \(\lambda ^{[l]}\) denote these grid points. We use the warm-start trick to compute the solution path, that is, the computed solution at \(\lambda =\lambda ^{[l]}\) is used as the initial value when Algorithm 1 computes the solution at \(\lambda =\lambda ^{[l+1]}\). We define \(\lambda ^{[1]}\) as the smallest \(\lambda \) value such that all predictors have zero coefficients, except the intercept. Let \(\hat{\beta }_{1}\) be the optimal value of the intercept in that case; the solution at \(\lambda ^{[1]}\) is then the null model estimate \(\widehat{\varvec{\beta }}^{[1]}=(\hat{\beta }_{1},0,\ldots ,0)\). By the Karush-Kuhn-Tucker conditions we find that

$$\begin{aligned} \lambda ^{[1]}=\underset{k=1,\ldots ,K}{\max } \left\| \left[ \nabla L( \widehat{\varvec{\beta }}^{[1]} | \mathbf{D})\right] ^{(k)}\right\| _{2} /w_{k}, \quad w_k \ne 0. \end{aligned}$$

For least squares and logistic regression models, \(\hat{\beta }_{1}\) has a simple expression:

$$\begin{aligned} \hat{\beta }_{1}(\mathrm {LS})&= \frac{\sum ^n_{i=1}\tau _i y_i}{\sum ^n_{i=1}\tau _i} \qquad \text {(group-lasso penalized least squares)}, \end{aligned}$$
(22)
$$\begin{aligned} \hat{\beta }_{1}(\mathrm {Logit})&= \log \left( \frac{\sum _{y_i=1}\tau _i}{\sum _{y_i=-1}\tau _i}\right) \qquad \text {(group-lasso penalized logistic regression)}. \end{aligned}$$
(23)
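In R notation, the starting point of the path follows directly from these formulas. The sketch below is illustrative (our own helper names) and assumes the intercept is the first coordinate, the groups are labeled \(1,\ldots ,K\) in the order of \(w\), and grad_fn(beta) returns \(\nabla L(\varvec{\beta }\mid \mathbf{D})\) as a length-\(p\) vector.

    ## Null-model intercepts (22)-(23) and lambda^[1] from the KKT conditions.
    beta1_ls    <- function(y, tau) sum(tau * y) / sum(tau)                    # (22)
    beta1_logit <- function(y, tau) log(sum(tau[y == 1]) / sum(tau[y == -1]))  # (23)

    lambda_max <- function(beta_null, grad_fn, group, w) {
      g <- grad_fn(beta_null)                     # gradient at the null model
      norms <- tapply(g, group, function(v) sqrt(sum(v^2)))
      max(norms[w != 0] / w[w != 0])              # max_k ||[nabla L]^(k)||_2 / w_k
    }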

For the other two models, we use the following iterative procedure to solve for \(\hat{\beta }_{1}\) (a short sketch follows the list):

  1.

    Initialize \(\hat{\beta }_{1}=\hat{\beta }_{1}(\mathrm {Logit})\) in large margin classifiers.

  2.

    Compute \(\hat{\beta }_{1}(\mathrm{new})=\hat{\beta }_{1}-\frac{1}{\gamma _1}\nabla L((\hat{\beta }_{1},0,\ldots ,0)| \mathbf{D})_1\) where \(\gamma _1=\frac{1}{n}\sum ^n_{i=1}\tau _i\).

  3.

    Let \(\hat{\beta }_{1}=\hat{\beta }_{1}(\mathrm{new})\).

  4.

    Repeat 2–3 until convergence.
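This is simply a gradient step on the intercept with step size \(1/\gamma _1\), iterated to convergence. A hedged R sketch (illustrative code; grad_fn(beta) returns \(\nabla L(\varvec{\beta }\mid \mathbf{D})\) and the intercept is the first of \(p\) coordinates):

    ## Solve for the unpenalized intercept of the null model (HSVM and SqSVM).
    solve_intercept <- function(grad_fn, beta1_start, p, tau,
                                tol = 1e-8, maxit = 1000) {
      gamma1 <- mean(tau)                          # gamma_1 = (1/n) sum_i tau_i
      b1 <- beta1_start                            # step 1: start from beta_1(Logit)
      for (it in seq_len(maxit)) {
        beta   <- c(b1, rep(0, p - 1))
        b1_new <- b1 - grad_fn(beta)[1] / gamma1   # step 2
        if (abs(b1_new - b1) < tol) break          # step 4: stop at convergence
        b1 <- b1_new                               # step 3
      }
      b1_new
    }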

For computing the solution at each \(\lambda \) we also utilize the strong rule introduced in Tibshirani et al. (2012). Suppose that we have computed \(\widehat{\varvec{\beta }}(\lambda ^{[l]})\), the solution at \(\lambda ^{[l]}\). To compute the solution at \(\lambda ^{[l+1]}\), before using Algorithm 1 we first check if group \(k\) satisfies the following inequality:

$$\begin{aligned} \left\| [\nabla L(\widehat{\varvec{\beta }}(\lambda ^{[l]}) | \mathbf{D})]^{(k)} \right\| _2 \ge w_k (2\lambda ^{[l+1]}-\lambda ^{[l]}). \end{aligned}$$
(24)

Let \(S=\{I_k:\,\mathrm{group}\, k \,\mathrm{passes\, the\, check\, in\, (24)}\}\) and denote its complement as \(S^{c}\). The strong rule claims that at \(\lambda ^{[l+1]}\) the groups in the set \(S\) are very likely to have nonzero coefficients and the groups in set \(S^c\) are very likely to have zero coefficients. If the strong rule guesses correctly, we only need to use Algorithm 1 to solve the group-lasso model with a reduced data set \(\{\mathbf{y}, \mathbf{X}_{S}\}\) where \(\mathbf{X}_{S}=[\cdots X_{j} \cdots ]\), \(j \in S\) corresponds to the design matrix with only the groups of variables in \(S\). Suppose the solution is \(\widehat{\varvec{\beta }}_{S}\). We then need to check whether the strong rule indeed made correct guesses at \(\lambda ^{[l+1]}\) by checking whether \(\widetilde{\varvec{\beta }}(\lambda ^{[l+1]})=(\widehat{\varvec{\beta }}_{S},\mathbf {0})\) satisfies the KKT conditions. If for each group \(k\) where \(I_{k} \in S^{c}\) the following KKT condition holds:

$$\begin{aligned} \left\| [\nabla L((\widehat{\varvec{\beta }}_{S},\mathbf {0}) \mid \mathbf{D})]^{(k)} \right\| _2 \le \lambda ^{[l+1]}w_{k}, \end{aligned}$$

then \(\widetilde{\varvec{\beta }}(\lambda ^{[l+1]})=(\widehat{\varvec{\beta }}_{S},\mathbf {0})\) is the desired solution at \(\lambda =\lambda ^{[l+1]}\). Otherwise, any group that violates the KKT conditions should be added to \(S\). We update \(S\) by \(S=S\bigcup V\) where

$$\begin{aligned} V=\left\{ I_k: I_k\in S^{c}\ \mathrm {and}\ \left\| [\nabla L((\widehat{\varvec{\beta }}_{S},\mathbf {0}) \mid \mathbf{D})]^{(k)} \right\| _2>\lambda ^{[l+1]}w_{k}\right\} . \end{aligned}$$

Note that the strong rule will eventually make the correct guess since the set \(S\) can only grow larger after each update and hence the strong rule iteration will always stop after a finite number of updates. Algorithm 2 summarizes the details of the strong rule.
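Both the screening step (24) and the KKT verification only require groupwise norms of the gradient. A hedged R sketch (illustrative helper, assuming groups labeled \(1,\ldots ,K\) in the order of \(w\)):

    ## Strong-rule screening at lambda_new given the solution beta_old at lambda_old;
    ## grad_fn(beta) returns nabla L(beta | D). Returns the indices of groups kept in S.
    strong_rule_set <- function(beta_old, grad_fn, group, w, lambda_old, lambda_new) {
      g <- grad_fn(beta_old)
      norms <- tapply(g, group, function(v) sqrt(sum(v^2)))
      which(norms >= w * (2 * lambda_new - lambda_old))   # check (24)
    }

After Algorithm 1 is run on the groups in \(S\), the same groupwise gradient norms are compared against \(\lambda ^{[l+1]}w_k\) to form the violating set \(V\) above.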

Note that not all groups in \(S\) have nonzero coefficients. Therefore, when we apply Algorithm 1 on the reduced data set \(\{\mathbf{y}, \mathbf{X}_{S}\}\) to cyclically update \(\widehat{\varvec{\beta }}_{S}\), we only focus on the subset \(A\) of \(S\) that contains the groups whose current coefficients are nonzero: \(A=\{I_k: I_k \in S \) and \(\widetilde{\varvec{\beta }}^{(k)} \ne \varvec{0}\}\). This subset is referred to as the active set. A similar idea has been adopted in previous work (Tibshirani et al. 2012; Meier et al. 2008; Vogt and Roth 2012). In detail, we first create an active set \(A\) by updating each group belonging to \(S\) once. Next, Algorithm 1 is applied on the active set \(A\) until convergence. We then run a complete cycle again on \(S\) to see if any group is to be included into the active set. If not, the algorithm is stopped. Otherwise, Algorithm 1 is repeated on the updated active set.


In Algorithm 1 we use a simple updating formula to compute \(\nabla L(\widetilde{\varvec{\beta }} | \mathbf{D})\), because the gradient only depends on \(R= \mathbf{y} - \mathbf{X}\widetilde{\varvec{\beta }}\) for regression and on \(R= \mathbf{y} \cdot \mathbf{X}\widetilde{\varvec{\beta }}\) for classification. After updating \(\widetilde{\varvec{\beta }}^{(k)}\), for regression we update \(R\) by \(R-\mathbf{X}^{(k)}(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new})-\widetilde{\varvec{\beta }}^{(k)})\); for classification we update \(R\) by \(R+\mathbf{y} \cdot \mathbf{X}^{(k)}(\widetilde{\varvec{\beta }}^{(k)}(\mathrm{new})-\widetilde{\varvec{\beta }}^{(k)})\).
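In R notation, this bookkeeping is one matrix-vector product per group update (a minimal sketch; the package does this in compiled code):

    ## Update the working quantity R after group k changes (columns `idx` of X),
    ## so that nabla L(beta | D) can be formed without recomputing X %*% beta.
    ## Regression:     R = y - X %*% beta;   classification: R = y * (X %*% beta).
    update_residual <- function(R, X, idx, delta, y = NULL) {
      step <- drop(X[, idx, drop = FALSE] %*% delta)   # delta = beta_new^(k) - beta^(k)
      if (is.null(y)) R - step else R + y * step
    }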

In order to make a fair comparison to grplasso and SLEP, we tested three different convergence criteria in gglasso:

  1. 1.

    \(\underset{j}{\max }\frac{\left| \tilde{\beta _j}(\mathrm {current})-\tilde{\beta _j}(\mathrm {new})\right| }{1+|\tilde{\beta _j}(\mathrm {current})|}<\epsilon \), for \(j=1,2\ldots ,p\).

  2. 2.

    \(\left\| \widetilde{\varvec{\beta }}(\mathrm {current})-\widetilde{\varvec{\beta }}(\mathrm {new})\right\| _2<\epsilon \).

  3. 3.

    \( \max _{k}(\gamma _k) \cdot \underset{j}{\max }\frac{\left| \tilde{\beta _j}(\mathrm {current})-\tilde{\beta _j}(\mathrm {new})\right| }{1+|\tilde{\beta _j}(\mathrm {current})|}<\epsilon , \) for \(j=1,2\ldots ,p\) and \(k=1,2\ldots ,K\).

Convergence criterion 1 is used in grplasso and convergence criterion 2 is used in SLEP. For the group-lasso penalized least squares and logistic regression, we used both convergence criteria 1 and 2 in gglasso. For the group-lasso penalized Huberized SVM and squared SVM, we used convergence criterion 3 in gglasso. Compared to criterion 1, criterion 3 uses an extra factor \(\max _{k}(\gamma _k)\) in order to take into account the observation that \(\widetilde{\varvec{\beta }}^{(k)}(\mathrm {current})-\widetilde{\varvec{\beta }}^{(k)}(\mathrm {new})\) depends on \(\frac{1}{\gamma _k}\). The default value for \(\epsilon \) is \(10^{-4}\).
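As an illustration, criteria 1 and 3 can be written as follows (a minimal sketch with our own function names):

    ## Convergence checks: criterion 1 (as in grplasso) and criterion 3
    ## (criterion 1 scaled by max_k gamma_k); the default tolerance is 1e-4.
    converged1 <- function(b_cur, b_new, eps = 1e-4) {
      max(abs(b_cur - b_new) / (1 + abs(b_cur))) < eps
    }
    converged3 <- function(b_cur, b_new, gamma, eps = 1e-4) {
      max(gamma) * max(abs(b_cur - b_new) / (1 + abs(b_cur))) < eps
    }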

4 Numerical examples

In this section, we use simulation and real data to demonstrate the efficiency of the GMD algorithm in terms of timing performance and solution accuracy. All numerical experiments were carried out on an Intel Xeon X5560 (Quad-core 2.8 GHz) processor. In this section, let gglasso (LS1) and gglasso (LS2) denote the group-lasso penalized least squares solutions computed by gglasso where the convergence criterion is criterion 1 and criterion 2, respectively. Likewise, we define gglasso (Logit1) and gglasso (Logit2) for the group-lasso penalized logistic regression.

4.1 Timing comparison

We design a simulation model by combining the FHT model introduced in Friedman et al. (2010) and simulation model 3 in Yuan and Lin (2006). We generate the original predictors \(X_j\), \(j=1,2,\ldots ,q\), from a multivariate normal distribution with a compound symmetry correlation matrix such that the correlation between \(X_{j}\) and \(X_{j^{\prime }}\) is \(\rho \) for \(j \ne j^{\prime }\). Let

$$\begin{aligned} Y^*=\sum _{j=1}^{q}\left( \frac{2}{3}X_{j}-X^2_j+\frac{1}{3}{X^3_j}\right) \beta _{j}, \end{aligned}$$

where \(\beta _{j}=(-1)^{j}\exp (-(2j-1)/20)\). When fitting a group-lasso model, we treat \(\{X_j,X^2_j,X^3_j\}\) as a group, so the final predictor matrix has \(p = 3q\) variables.

Fig. 2

The FHT model scenario 2. The average running time of 10 independent runs (in the natural logarithm scale) of gglasso for computing solution paths of a least squares; b logistic regression; c Huberized SVM; d squared SVM. In all cases \(n=200\)

For regression data we generate the response \(Y=Y^* + k\cdot e\), where the error term \(e\) is drawn from \(N(0,1)\) and \(k\) is chosen such that the signal-to-noise ratio is 3.0. For classification data we generate the binary response \(Y\) according to

$$\begin{aligned}&\mathrm {Pr(Y=-1)}=1/(1+\exp (-Y^*)), \quad \\&\mathrm {Pr(Y=+1)}=1/(1+\exp (Y^*)). \end{aligned}$$
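A hedged R sketch of this data-generating process follows (illustrative code; the noise scale \(k\) is set from the empirical variance of \(Y^*\) so that the signal-to-noise ratio is approximately 3):

    ## FHT-type grouped data: q original predictors with compound symmetry
    ## correlation rho; each X_j expands to the group {X_j, X_j^2, X_j^3}.
    gen_fht <- function(n, q, rho, snr = 3, classification = FALSE) {
      Z  <- matrix(rnorm(n * q), n, q)
      X0 <- sqrt(rho) * rnorm(n) + sqrt(1 - rho) * Z        # compound symmetry
      beta  <- (-1)^(1:q) * exp(-(2 * (1:q) - 1) / 20)
      Ystar <- drop((2/3 * X0 - X0^2 + 1/3 * X0^3) %*% beta)
      X <- do.call(cbind, lapply(1:q, function(j) cbind(X0[, j], X0[, j]^2, X0[, j]^3)))
      group <- rep(1:q, each = 3)                           # p = 3q columns, q groups
      if (classification) {
        y <- ifelse(runif(n) < 1 / (1 + exp(-Ystar)), -1, 1)   # Pr(Y = -1) as above
      } else {
        k <- sqrt(var(Ystar) / snr)                         # empirical SNR of about 3
        y <- Ystar + k * rnorm(n)
      }
      list(x = X, y = y, group = group)
    }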

We considered the following combinations of \((n,p)\):

  • Scenario 1. \((n , p) = (100, 3000)\) and \((n , p)= (300,\) \( 9000)\).

  • Scenario 2. \(n=200\) and \(p=600, 1200, 3000, 6000,\) \(9000\), shown in Fig. 2.

For each \((n,p,\rho )\) combination we recorded the timing (in seconds) of computing the solution paths at 100 \(\lambda \) values of each group-lasso penalized model by gglasso, SLEP and grplasso. The results were averaged over 10 independent runs.

Table 2 shows results from Scenario 1. We see that gglasso has the best timing performance. In the group-lasso penalized least squares case, gglasso (LS2) is about 12 times faster than SLEP (LS). In the group-lasso penalized logistic regression case, gglasso (Logit2) is about 2-6 times faster than SLEP (Logit) and gglasso (Logit1) is about 5-10 times faster than grplasso (Logit).

Figure 2 shows results from Scenario 2 in which we examine the impact of dimension on the timing of gglasso. We fixed \(n\) at \(200\) and plotted the run time (in log scale) against \(p\) for three correlation levels 0.2, 0.5 and 0.8. We see that higher \(\rho \) increases the timing of gglasso in general. For each fixed correlation level, the timing increases linearly with the dimension.

4.2 Quality comparison

In this section we show that gglasso is also more accurate than grplasso and SLEP under the same convergence criterion. We test the accuracy of solutions by checking their KKT conditions. Theoretically, \(\varvec{\beta }\) is the solution of (3) if and only if the following KKT conditions hold:

$$\begin{aligned}&[\nabla L( \varvec{\beta }| \mathbf{D})]^{(k)}+\lambda w_{k}\cdot \frac{\varvec{\beta }^{(k)}}{\Vert \varvec{\beta }^{(k)}\Vert _2}=\varvec{0}\qquad \mathrm{if}\,\varvec{\beta }^{(k)}\ne \varvec{0},\\&\left\| [\nabla L( \varvec{\beta }| \mathbf{D})]^{(k)} \right\| _2 \le \lambda w_{k}\qquad \mathrm{if }\,\varvec{\beta }^{(k)}=\varvec{0}, \end{aligned}$$

where \(k=1,2,\ldots ,K\). The theoretical solution of the convex optimization problem (3) should be unique and always passes the KKT condition check. However, a numerical solution can only approach this analytical value to within a certain precision and may therefore fail the KKT check. Numerically, we declare that \(\varvec{\beta }^{(k)}\) passes the KKT condition check if

$$\begin{aligned}&\left\| [\nabla L( \varvec{\beta }| \mathbf{D})]^{(k)}+\lambda w_{k}\cdot \frac{\varvec{\beta }^{(k)} }{\Vert \varvec{\beta }^{(k)}\Vert _2} \right\| _2 \le \varepsilon \qquad \mathrm{if }\,\varvec{\beta }^{(k)}\ne \varvec{0},\\&\left\| [\nabla L( \varvec{\beta }| \mathbf{D})]^{(k)} \right\| _2 \le \lambda w_{k} + \varepsilon \qquad \mathrm{if }\,\varvec{\beta }^{(k)}=\varvec{0}, \end{aligned}$$

for a small \(\varepsilon >0\). In this paper we set \(\varepsilon =10^{-4}\).
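The violation counts reported below are based on exactly this check. A hedged R sketch (illustrative helper, assuming groups labeled \(1,\ldots ,K\) and grad_fn(beta) returning \(\nabla L(\varvec{\beta }\mid \mathbf{D})\)):

    ## Count groups violating the numerical KKT conditions at a given lambda.
    kkt_violations <- function(beta, grad_fn, group, w, lambda, eps = 1e-4) {
      g <- grad_fn(beta)
      sum(sapply(seq_along(w), function(k) {
        idx <- which(group == k)
        gk <- g[idx]; bk <- beta[idx]
        if (any(bk != 0)) {
          sqrt(sum((gk + lambda * w[k] * bk / sqrt(sum(bk^2)))^2)) > eps
        } else {
          sqrt(sum(gk^2)) > lambda * w[k] + eps
        }
      }))
    }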

For the solutions of the FHT model scenario 1 computed in Sect. 4.1, we also calculated the number of coefficients that violated the KKT condition check at each \(\lambda \) value. This number was then averaged over the 100 values of \(\lambda \). This process was repeated 10 times on 10 independent datasets. As shown in Table 3, in the group-lasso penalized least squares case, gglasso (LS1) has zero violations; gglasso (LS2) also has fewer violations than SLEP (LS). In the group-lasso penalized classification cases, gglasso (Logit1) has fewer KKT violations than grplasso (Logit) when both use convergence criterion 1, and gglasso (Logit2) has fewer KKT violations than SLEP (Logit) when both use convergence criterion 2. Overall, it is clear that gglasso is numerically more accurate than grplasso and SLEP. gglasso (HSVM) and gglasso (SqSVM) both pass the KKT checks without any violation.

Table 2 The FHT model scenario 1
Table 3 The FHT model scenario 1

4.3 Real data analysis

In this section we compare gglasso, grplasso and SLEP on several real data examples. Table 4 summarizes the datasets used in this section. We fit a sparse additive regression model for the regression-type data and a sparse additive logistic regression model for the classification-type data. The group-lasso penalty is used to select important additive components. All data were standardized in advance such that each original variable has zero mean and unit sample variance. Some datasets contain only numeric variables, while others have both numeric and categorical variables. For any categorical variable with \(M\) levels of measurement, we recoded it using \(M-1\) dummy variables and treated these dummy variables as a group. For each continuous variable, we used five B-spline basis functions to represent its effect in the additive model. Those five basis functions are considered as a group. For example, the colon data originally have 2,000 numeric variables; after basis function expansion there are 10,000 predictors in 2,000 groups.

Table 4 Real datasets
Table 5 Group-lasso penalized regression and classification on real datasets

For each dataset, we report the average timings of 10 independent runs for computing the solution paths at 100 \(\lambda \) values. We also report the average number of the KKT check violations. The results are summarized in Table 5. It is clear that gglasso outperforms both grplasso and SLEP.

Before ending this section we would like to use a real data example to demonstrate why the group-lasso can be advantageous over the lasso. On the sonar data we compared the lasso penalized logistic regression and the group-lasso penalized logistic regression. We randomly split the sonar data into training and test sets in a 4:1 ratio, and found the optimal \(\lambda \) for each method using five-fold cross-validation on the training data. Then we calculated the misclassification error rate on the test set. We used glmnet (Friedman et al. 2010) to compute the lasso penalized logistic regression. The process was repeated 100 times. In the group-lasso model, we define the group-wise \(L_2\) coefficient norm \(\theta _j(\lambda )\) for the \(j\)th variable by \(\theta _j(\lambda )=\sqrt{\sum ^5_{i=1}\hat{\beta }^2_{ji}(\lambda )}\). The \(j\)th variable then enters the final model if and only if \(\theta _j(\lambda ) \ne 0\). Figure 3 shows the solution paths of the tuned lasso and group-lasso logistic models from one run, where in the group-lasso plot we plot \(\theta _j(\lambda )\) against \(\log \lambda \). To make a more direct comparison, we also plot the absolute value of each coefficient in the lasso plot. The fitted lasso logistic regression model selected 40 original variables, while the group-lasso logistic regression model selected 26 original variables. Looking at the average misclassification error over the 100 runs, we see that the group-lasso logistic regression model is significantly more accurate than the lasso logistic regression model. Note that the sample size is 208 in the sonar data, so the misclassification error calculation is meaningful.
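For readers who wish to reproduce this type of comparison, the sketch below outlines the workflow on a generic binary dataset. The gglasso and glmnet call signatures (in particular the loss, pred.loss and type.measure arguments and the coef extraction) are taken from the packages' documented interfaces and should be checked against the installed versions; the data below are placeholders, not the sonar data.

    library(splines); library(gglasso); library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)           # placeholder numeric predictors
    y <- ifelse(x[, 1] + rnorm(100) > 0, 1, -1)     # placeholder labels in {-1, +1}

    ## Expand each variable into five B-spline basis functions; one group per variable.
    x_expand <- do.call(cbind, lapply(seq_len(ncol(x)), function(j) bs(x[, j], df = 5)))
    group    <- rep(seq_len(ncol(x)), each = 5)

    ## Group-lasso penalized logistic regression, lambda tuned by 5-fold CV.
    cv_grp <- cv.gglasso(x_expand, y, group = group, loss = "logit",
                         pred.loss = "misclass", nfolds = 5)

    ## Lasso penalized logistic regression on the original variables via glmnet.
    cv_las <- cv.glmnet(x, factor(y), family = "binomial",
                        type.measure = "class", nfolds = 5)

    ## Variable j enters the group-lasso model iff its group norm theta_j is nonzero.
    b_grp    <- as.numeric(coef(cv_grp, s = "lambda.min"))[-1]
    theta_j  <- tapply(b_grp, group, function(b) sqrt(sum(b^2)))
    selected <- which(theta_j != 0)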

Fig. 3

Compare the lasso penalized logistic regression and the group-lasso penalized logistic regression on Sonar data with \(n=208\) and \(p=60\). a solution paths of lasso penalized logistic regression model, in which each absolute coefficient \(|\beta _j|\) is plotted against \(\log \lambda \) for \(j=1,\ldots ,p\). b solution paths of group-lasso penalized logistic regression model with \(p^*=300\) expanded variables grouped into \(K=60\) groups, in which each group norm \(\Vert \varvec{\beta }^{(k)}\Vert _2\) is plotted against \(\log \lambda \) for \(k=1,\ldots ,K\). The vertical dotted lines indicate the best models chosen by cross-validation

5 Discussion

In this paper we have derived a unified groupwise-majorization-descent algorithm for computing the solution paths of a class of group-lasso penalized models. We have demonstrated the efficiency of Algorithm 1 on four group-lasso models: the group-lasso penalized least squares, the group-lasso penalized logistic regression, the group-lasso penalized HSVM and the group-lasso penalized SqSVM. Algorithm 1 can readily be applied to other interesting group-lasso penalized models. All we need to do is to check the QM condition for the given loss function, and Lemma 1 is a handy tool for that. We also implemented the exact groupwise descent algorithm in which we solved the convex optimization problem defined in (14) for each group. We used the same computational tricks to implement the exact groupwise descent algorithm, including the strong rule, warm-start and the active set. We found that groupwise-majorization-descent is about 10–15 times faster than the exact groupwise descent algorithm. This comparison clearly shows the value of using majorization within the groupwise descent algorithm. For the sake of brevity, we opt not to show the timing comparison here.

grplasso is a popular R package for the group-lasso penalized logistic regression, but the underlying algorithm is limited to twice differentiable loss functions. SLEP implements Nesterov’s method for the group-lasso penalized least squares and logistic regression. In principle, Nesterov’s method can be used to solve other group-lasso penalized models. For the group-lasso penalized least squares and logistic regression cases, our package gglasso is faster than SLEP and grplasso. Although we do not claim that groupwise-majorization-descent is superior to Nesterov’s method, the numerical evidence clearly shows the practical usefulness of groupwise-majorization-descent.

Finally, we should point out that Nesterov’s method is a more general optimization algorithm than groupwise-descent or groupwise-majorization-descent. Note that groupwise-descent and groupwise-majorization-descent only work for groupwise separable penalty functions in general. What we have shown in this paper is that a more general algorithm like Nesterov’s method can be slower than a specific algorithm like GMD for a given set of problems. The same message was reported in a comparison done by Tibshirani, in which the coordinate descent algorithm was shown to outperform Nesterov’s method for the lasso regression. See http://statweb.stanford.edu/~tibs/comparison.txt for more details.

6 Supplementary materials

Our methods have been implemented in an R package gglasso publicly available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/packages/gglasso.