1 Introduction

Variable selection in high-dimensional data analysis is an important issue in various fields of science including genetics, compressed sensing, and machine learning. For instance, if we want to detect a disease from a person's gene expression levels measured by microarray, tens of thousands of genes are possible predictors while only a few of them are relevant to the disease. A systematic method to select such genes is needed.

Statistically, this is a variable or model selection problem in regression. A classical approach is the best subset method, which selects the best model by examining the models formed from all possible combinations of the predictor variables according to a predictive performance criterion such as the Akaike information criterion (AIC) [1], the Bayesian information criterion (BIC) [19], or the adjusted coefficient of determination (adjusted R²). Applied to high-dimensional data, however, the best subset method becomes erratic because of the unavoidable multicollinearity among the predictor variables. A reason for this instability is that the best subset method selects models in a discrete manner, so the selected models may differ greatly in the presence of noise. Furthermore, the number of subsets of variables to be compared grows exponentially with the number of predictor variables. To overcome these difficulties, "soft" versions of variable selection methods, which select models continuously, have been extensively studied. Among these, the Lasso, which penalizes the residual sum of squares or another loss function by the ℓ1 norm of the regression coefficients, has received considerable attention [23]. The ℓ1 penalty makes many of the estimated coefficients exactly zero, which is why this method is widely used for variable selection in high-dimensional data. Furthermore, the Lasso allows the coefficients to vary continuously with the level of regularization, rendering it a stable soft model selector. The Lasso also shrinks the coefficients toward zero and thus often exhibits better predictive performance than the ordinary least squares estimator.

For these reasons, the Lasso has found many applications, including face recognition [26, 28], speech recognition [9], and estimation of gene networks [29].

Computationally, the Lasso problem is to minimize the objective function

$$ f(\beta)= (1/2) \|y-X\beta\|^{2}_{2} +\lambda \|\beta\|_{1}, $$
(1)

where y is an n×1 vector of observations, X is an n×p data matrix of the levels of the predictor variables, and λ>0 is the regularization parameter; β is the p-dimensional vector of regression coefficients, which is the optimization variable. We wish to find the vector \(\hat {\beta }\) that minimizes f(β). The ℓ1 penalty is non-differentiable at the origin, so standard derivative-based optimization methods such as Newton-Raphson and gradient descent cannot be used directly. Various algorithms to compute the Lasso solutions have been studied. Among these, the homotopy algorithm [6], while widely used for problems of small to moderate dimensions, is often inapplicable to high-dimensional problems because it requires repeated Cholesky decompositions. To meet the need for efficient and stable computation, iterative methods, which reduce the original optimization problem to a series of simpler problems and successively refine the solution, have been of interest. Many algorithms have been developed, including majorization-minimization using local quadratic approximation (MM-LQA) [12], the alternating direction method of multipliers (ADMM) [4], and the fast iterative shrinkage-thresholding algorithm (FISTA) [2]. The scalability of each algorithm to high-dimensional data and its applicability to various types of penalties are well studied, and there are also a few comparative studies of the computation times of these algorithms [8, 27, 28]. However, studies that compare the merits and drawbacks of these algorithms in a systematic fashion are scarce.

In this paper, we conduct a study comparing computational algorithms for the high-dimensional Lasso, in order to identify the factors that affect their performance and convergence. In practice, the covariates in a high-dimensional regression analysis usually contain highly correlated variables. While other penalized statistical procedures exist, it is worth investigating the numerical sensitivity of the Lasso algorithms to the correlation between the covariates.

We compare four computational algorithms suitable for the high-dimensional Lasso: the coordinate descent (CD) algorithm (also known as the shooting algorithm), MM-LQA, FISTA, and ADMM. We analyze the sensitivity of these algorithms to the correlation between the covariates to assess the relative merits of each algorithm. We find that MM-LQA and ADMM are quite robust to correlation. These algorithms also show stable convergence for a wide range of the regularization parameter. On the contrary, CD and FISTA may behave unstably when the correlation becomes high; this instability is mitigated as the regularization parameter gets larger. We also report empirical conditions under which the CD algorithm fails to converge.

Section 2 reviews the four algorithms for solving Lasso. In Section 3 we conduct the comparative numerical study. In Section 4, we apply the four algorithms to a study for cancer biomarker discovery, and compare their performance.

2 Preliminaries

In this section, we provide an overview of the iterative optimization methods to be compared in the following sections.

2.1 Coordinate descent algorithm (CD)

CD is devised to minimize an objective function that is separable with respect to each coordinate of the optimization variable. It iteratively updates the variable from β^{(k)} to β^{(k+1)} by choosing a single coordinate and performing a univariate minimization with the other coordinates held fixed at their values from the previous step. In other words, write f(β) = f(β_1, ⋯, β_p). At the k-th iteration, choose an appropriate j ∈ {1, ⋯, p} and set

$$\beta_{j}^{(k+1)}= \operatornamewithlimits{argmin}_{\beta_{j} \in \mathbb{R} } f(\beta_{1}^{(k+1)},...,\beta_{j-1}^{(k+1)},\beta_{j},\beta_{j+1}^{(k)},...,\beta_{p}^{(k)}). $$

For the Lasso, the objective function is convex, and its non-smooth part, the ℓ1 penalty, is separable across the coordinates; this renders the following decomposition possible:

$$ \begin{array}{lcl} f(\beta)& = & ({1}/{2}) \|y-X \beta\|^{2} + \lambda \|\beta\|_{1} \\ & = & ({1}/{2}) {\sum}_{i=1}^{n} (y_{i} -{\sum}_{j=1}^{p} X_{ij}\beta_{j})^{2} +\lambda {\sum}_{j=1}^{p} |\beta_{j}| \\ & = & ({1}/{2}) {\sum}_{i=1}^{n} (y_{i} -{\sum}_{l \neq j} X_{il}\beta_{l}-X_{ij} \beta_{j})^{2} \\ & & +\lambda {\sum}_{l \neq j}|\beta_{l}|+ \lambda |\beta_{j}|. \end{array} $$
(2)

If all the coefficients except β_j are fixed at \(\left \lbrace \tilde {\beta }_{l}, l \neq j\right \rbrace \), (2) becomes a convex, piecewise quadratic function of β_j. This univariate function is minimized by

$$\beta^{*}_{j} =\frac{S_{\lambda}({\sum}_{i=1}^{n} X_{ij} (y_{i} - \tilde{y}_{i(j)}))}{{\sum}_{i=1}^{n} X_{ij}^{2}} $$

where \(\tilde {y}_{i(j)} = {\sum }_{l \neq j} X_{il}\tilde {\beta }_{l}\), and S_λ(⋅) is the soft-thresholding operator:

$$ S_{\lambda}(x) = \text{sign}(x) \cdot \max(|x| - \lambda,0). $$
(3)

This coordinate-wise minimization step is repeated over all j (1 ≤ j ≤ p) to form one full iteration, and the iterations are repeated until the relative error falls below a pre-specified tolerance. The resulting algorithm for the Lasso is summarized in Algorithm 1.

Algorithm 1 Coordinate descent (CD) for the Lasso
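For concreteness, the following MATLAB sketch implements the coordinate-wise update above. The toy data, the variable names, and the simple stopping rule are our own illustrative choices and are not taken from Algorithm 1 itself.

```matlab
% Minimal coordinate-descent sketch for the Lasso (illustrative only).
rng(1); n = 50; p = 20;                                 % toy data, not from the paper
X = randn(n, p); b_true = zeros(p, 1); b_true(1:3) = [2; -1; 1];
y = X * b_true + 0.1 * randn(n, 1); lambda = 1;

soft = @(x, t) sign(x) .* max(abs(x) - t, 0);           % soft-thresholding operator (3)
beta = (X' * y) ./ sum(X.^2, 1)';                       % marginal initial value (cf. Section 3.1.3)
tol = 1e-7; f_old = Inf;
for iter = 1:1000                                       % full iterations
    for j = 1:p                                         % coordinate-wise minimization
        r_j = y - X * beta + X(:, j) * beta(j);         % partial residual excluding x_j
        beta(j) = soft(X(:, j)' * r_j, lambda) / (X(:, j)' * X(:, j));
    end
    f_new = 0.5 * norm(y - X * beta)^2 + lambda * norm(beta, 1);
    if abs(f_new - f_old) / abs(f_old) < tol, break; end % relative-error stopping rule
    f_old = f_new;
end
```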

Since each coordinate-wise minimization step is computed very efficiently, Friedman et al. [8] and Wu and Lange [27] devised path algorithms based on CD with a warm start strategy, which successively uses the previous solution as the initial value for the next solution along the path. They report that CD-based path algorithms outperform LARS [6] in high-dimensional settings.

CD can be accelerated by the active shooting strategy [17], which reduces the computational cost by maintaining a set of nonzero coefficients, called the active set. The detailed steps are described in Algorithm 2.

Algorithm 2 CD with the active shooting strategy
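The following MATLAB sketch is one plausible rendering of the active shooting idea of [17], alternating full sweeps that refresh the active set with inner CD sweeps restricted to the active set; it is our own simplified illustration, not a transcription of Algorithm 2.

```matlab
% Rough sketch of CD with an active-set ("active shooting") strategy (illustrative only).
rng(1); n = 50; p = 20;
X = randn(n, p); b_true = zeros(p, 1); b_true(1:3) = [2; -1; 1];
y = X * b_true + 0.1 * randn(n, 1); lambda = 1;

soft = @(x, t) sign(x) .* max(abs(x) - t, 0);
obj  = @(b) 0.5 * norm(y - X * b)^2 + lambda * norm(b, 1);
cn2  = sum(X.^2, 1)';                               % squared column norms, precomputed
beta = zeros(p, 1); tol = 1e-7; f_old = Inf;
for outer = 1:100
    for j = 1:p                                     % full sweep; also refreshes the active set
        r_j = y - X * beta + X(:, j) * beta(j);
        beta(j) = soft(X(:, j)' * r_j, lambda) / cn2(j);
    end
    active = find(beta ~= 0)';
    for inner = 1:100                               % CD sweeps restricted to the active set
        f_in = obj(beta);
        for j = active
            r_j = y - X * beta + X(:, j) * beta(j);
            beta(j) = soft(X(:, j)' * r_j, lambda) / cn2(j);
        end
        if abs(obj(beta) - f_in) / abs(f_in) < tol, break; end
    end
    f_new = obj(beta);
    if abs(f_new - f_old) / abs(f_old) < tol, break; end
    f_old = f_new;
end
```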

2.2 Majorization-Minimization using Local Quadratic Approximation (MM-LQA)

The majorization-minimization (MM) algorithm is based on the idea of iteratively minimizing a series of surrogate functions, each of which majorizes the objective function f(β) at the current iterate. A surrogate function at the current point β^{(k)}, denoted by f^*(β, β^{(k)}), must majorize f(β). Majorization is defined by the following property: a function f^*(⋅,⋅) majorizes f(⋅) at β^{(k)} if

$$f(\beta) \le f^{*}(\beta,\beta^{(k)}), \forall \beta \in \mathbb{R}^{p} $$

with equality holding when β = β^{(k)}.

Let β^{(k+1)} denote a minimizer of f^*(β, β^{(k)}). The mapping β^{(k)} ↦ β^{(k+1)} generates a sequence {β^{(k)}} for which the sequence of objective function values {f(β^{(k)})} is non-increasing. In particular, we have

$$f(\beta^{(k+1)}) \le f^{*}(\beta^{(k+1)}, \beta^{(k)}) \le f^{*}(\beta^{(k)}, \beta^{(k)}) = f(\beta^{(k)}) $$

This descent property provides a basis for the numerical stability of the MM algorithm. Choosing an appropriate majorizing function is crucial for the success of an MM algorithm [11].

For the Lasso, a majorizing function can be constructed from a local quadratic approximation (LQA) of the ℓ1 norm. Recall that

$$\|\beta\|_{1} = \sum\limits_{j=1}^{p} |\beta_{j}|. $$

Then it can be easily seen that the quadratic function in x

$$p^{*}(x,x_{0})=|x_{0}| +\frac{x^{2} - {x^{2}_{0}}}{2|x_{0}|},\quad x, x_{0} \in \mathbb{R} $$

majorizes |x| at any x_0 ≠ 0; hence \({\sum }_{j=1}^{p} p^{*}(\beta _{j}, \beta _{j}^{(k)})\) majorizes ∥β∥_1. It immediately follows that

$$f^{*}(\beta,\beta^{(k)}) = \frac{1}{2}\|y-X\beta\|_{2}^{2} + \lambda \sum\limits_{j=1}^{p} p^{*}(\beta_{j}, \beta_{j}^{(k)}) $$

majorizes f(β) at any β^{(k)} whose components are all nonzero. Unfortunately, this majorizing function is undefined when some component of β^{(k)} equals zero. To overcome this difficulty, Hunter and Li [12] propose to minimize a slightly perturbed version of (1):

$$f_{\epsilon}(\beta) = \frac{1}{2}\|y-X\beta\|_{2}^{2} + \lambda \sum\limits_{j=1}^{p} p_{\epsilon}(\beta_{j}) $$

where

$$p_{\epsilon}(\beta_{j}) = |\beta_{j}|-\epsilon {\int}_{0}^{|\beta_{j}|} \frac{1}{\epsilon+t}dt, \quad \epsilon>0 $$

It can be shown that

$$f^{**}(\beta,\beta^{(k)}) = \frac{1}{2}\|y-X\beta\|_{2}^{2} +\lambda \sum\limits_{j=1}^{p} p^{**}_{\epsilon}(\beta_{j},\beta_{j}^{(k)}) $$

majorizes f_ε(β) at β^{(k)}, where

$$p^{**}_{\epsilon}(\beta_{j},\beta_{j}^{(k)}) = p_{\epsilon}(\beta_{j}^{(k)}) + \frac{{\beta_{j}^{2}} - (\beta_{j}^{(k)})^{2}}{2(|\beta_{j}^{(k)}|+\epsilon)}. $$

This majorizing function is defined for all \(\beta \in \mathbb {R}^{p}\). Setting the gradient of f^{**}(β, β^{(k)}) with respect to β to zero gives the optimality condition

$$ (X^{T}X+D^{(k)})\beta = X^{T}y, $$
(4)

where D^{(k)} is a diagonal matrix with entries \(D^{(k)}_{ii}={\lambda }/({|\beta _{i}^{(k)}|+\epsilon })\).

To solve this linear system, the preconditioned conjugate gradient (PCG) algorithm is used because of its well-known efficiency. The PCG algorithm is also used for ADMM, introduced in the sequel; see Appendix A for the details of the PCG algorithm.

Algorithm 3 MM-LQA for the Lasso
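To make the update (4) concrete, the following MATLAB sketch runs MM-LQA iterations, solving the linear system with MATLAB's built-in pcg and the diagonal (Jacobi) preconditioner described in Section 3.1.3. The toy data, the iteration limits, and the inner tolerance are our own illustrative choices, not those of Algorithm 3.

```matlab
% Minimal MM-LQA sketch for the Lasso (illustrative only).
rng(1); n = 50; p = 20;                                 % toy data, not from the paper
X = randn(n, p); b_true = zeros(p, 1); b_true(1:3) = [2; -1; 1];
y = X * b_true + 0.1 * randn(n, 1); lambda = 1;

eps_pert = 1e-10;                                       % perturbation constant epsilon
XtX = X' * X; Xty = X' * y;
beta = (X' * y) ./ sum(X.^2, 1)';                       % marginal initial value (Section 3.1.3)
tol = 1e-7; f_old = Inf;
for k = 1:500
    D = diag(lambda ./ (abs(beta) + eps_pert));         % D^(k) in (4)
    A = XtX + D;
    M = diag(diag(A));                                  % diagonal (Jacobi) preconditioner
    [beta, ~] = pcg(A, Xty, 1e-8, 1000, M, [], beta);   % solve (X'X + D^(k)) beta = X'y
    f_new = 0.5 * norm(y - X * beta)^2 + lambda * norm(beta, 1);
    if abs(f_new - f_old) / abs(f_old) < tol, break; end
    f_old = f_new;
end
```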

2.3 Fast iterative shrinkage thresholding algorithm (FISTA)

FISTA is a variant of the proximal gradient algorithm [16] applied to the Lasso [2]. Proximal algorithms minimize an objective function of the form f(β) = g(β) + h(β), \(\beta \in \mathbb {R}^{p}\), where h(β) is typically convex and non-smooth and g(β) is a smooth convex function whose gradient is Lipschitz continuous with Lipschitz constant L. Because they can handle non-smooth functions, these algorithms have been applied to problems that are difficult to solve with standard smooth optimization methods [16].

The proximal gradient algorithm successively applies the proximal operator, defined below, to a quadratic majorizing function Q_L(β, β^{(k)}) of f(β) at β = β^{(k)}:

$$\begin{array}{lcl} Q_{L}(\beta,\beta^{(k)})& = &g(\beta^{(k)}) +\langle\beta-\beta^{(k)}, \nabla g(\beta^{(k)})\rangle\\ & & +{(L/2)} \|\beta-\beta^{(k)}\|^{2} + h(\beta) \end{array} $$

The proximal operator for Q_L is its minimizer in β:

$$\begin{array}{lcl} p_{L}(\beta^{(k)})& = & \operatornamewithlimits{argmin} Q_{L}(\beta,\beta^{(k)}) \\ & = & \operatornamewithlimits{argmin}_{\beta} h(\beta) \\ &&\quad \qquad+ (L/2) \|\beta-(\beta^{(k)}-({1}/{L}) \nabla g(\beta^{(k)}) )\|^{2} \end{array} $$

Applying the map β^{(k+1)} = p_L(β^{(k)}) iteratively generates a sequence of non-increasing objective function values {f(β^{(k)})}, yielding the proximal gradient algorithm [16]. This is another instance of the MM algorithm.

Recall that for the Lasso, the objective function

$$f(\beta) =\frac{1}{2} \|y-X\beta \|^{2} + \lambda \|\beta \|_{1}, $$

can be divided into two parts:

$$g(\beta)= \frac{1}{2} \|y-X\beta \|^{2}, $$

and

$$h(\beta)=\lambda \|\beta \|_{1}. $$

Accordingly, the proximal operator p_L reduces to the soft-thresholding operator, and the update for β is given by

$$ \beta^{(k+1)}=\mathbf{S}_{{\lambda}/{L}}(\beta^{(k)}-({1}/{L}) {X^{T}}(X{\beta^{(k)}}-y)), $$
(5)

where S_{λ/L} is the vector version of the soft-thresholding operator, with (3) applied element-wise. This is the iterative shrinkage-thresholding algorithm (ISTA) [16]. FISTA is a variant of ISTA with an accelerated convergence rate. In this algorithm, a real sequence {t_k} is updated by the formula

$$t_{k+1}=\frac{1+\sqrt{1+4{t_{k}^{2}}}}{2} $$

and β^{(k+1)} is obtained from a point interpolating β^{(k)} and β^{(k−1)}, with weights depending on the value of t_k. In ISTA, the update for β depends only on its previous value β^{(k)}, while in FISTA it depends on the two previous values. In general, ISTA converges at the rate O(1/k), whereas FISTA converges at the rate O(1/k²), achieving faster convergence.

In practice, the Lipschitz constant L is difficult to find or expensive to compute, and backtracking is used to determine a reasonable upper bound. The resulting algorithm is given in Algorithm 4.

Algorithm 4 FISTA for the Lasso with backtracking
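As an illustration, the following MATLAB sketch implements the FISTA recursion built around (5) with a fixed step size 1/L, where L is taken as the largest eigenvalue of X^T X. Algorithm 4 instead determines the step size by backtracking (with factor η=2, Section 3.1.3), so this should be read as a simplified sketch with our own toy data.

```matlab
% Minimal FISTA sketch for the Lasso with a fixed step size (illustrative only).
rng(1); n = 50; p = 20;                                 % toy data, not from the paper
X = randn(n, p); b_true = zeros(p, 1); b_true(1:3) = [2; -1; 1];
y = X * b_true + 0.1 * randn(n, 1); lambda = 1;

soft = @(x, t) sign(x) .* max(abs(x) - t, 0);           % element-wise soft thresholding
L = norm(X)^2;                                          % Lipschitz constant: largest eigenvalue of X'X
beta = (X' * y) ./ sum(X.^2, 1)';                       % marginal initial value (Section 3.1.3)
z = beta; t = 1; tol = 1e-7; f_old = Inf;
for k = 1:5000
    beta_new = soft(z - (1/L) * (X' * (X * z - y)), lambda / L);  % ISTA step (5) at the point z
    t_new = (1 + sqrt(1 + 4 * t^2)) / 2;                          % update of the sequence t_k
    z = beta_new + ((t - 1) / t_new) * (beta_new - beta);         % interpolation of the two previous iterates
    beta = beta_new; t = t_new;
    f_new = 0.5 * norm(y - X * beta)^2 + lambda * norm(beta, 1);
    if abs(f_new - f_old) / abs(f_old) < tol, break; end
    f_old = f_new;
end
```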

2.4 Alternating direction method of multipliers (ADMM)

ADMM is devised to minimize an objective function that can be represented as the sum of two convex functions, similarly to FISTA. In general, one of the two functions is smooth and the other is non-smooth. For the Lasso,

$$f(\beta)= \frac{1}{2} \|y-X\beta \|^{2} + \lambda \|\beta \|_{1} = g(\beta) + h(\beta), $$

where

$$\begin{array}{@{}rcl@{}} g(\beta)&=&\frac{1}{2}\|y-X\beta\|^{2}_{2} = \frac{1}{2}\beta^{T} X^{T} X\beta-y^{T} X\beta+\frac{1}{2} y^{T}y, \\ h(\beta)&=&\lambda \|\beta\|_{1}. \end{array} $$

Minimizing f(β) is equivalent to the following constrained optimization problem

$$\text{minimize } g(\beta)+h(\gamma) \quad \text{subject to } \beta=\gamma $$

To solve this, the augmented Lagrangian

$$L_{\mu}(\alpha,\beta,\gamma) = g(\beta)+h(\gamma)+\alpha^{T}(\gamma-\beta)+\frac{\mu}{2}\|\beta-\gamma\|^{2}_{2} $$

is constructed. Here, μ>0 is a fixed penalty parameter. The quadratic term involving μ enforces the constraint in a smoother fashion than the ordinary Lagrangian (μ=0). The multiplier α is the dual variable.

The saddle point condition is given by

$$\begin{array}{@{}rcl@{}} (\beta^{*},\gamma^{*}) &=& \operatornamewithlimits{argmin}_{\beta, \gamma} L_{\mu}(\alpha^{*},\beta,\gamma) \quad \text{(primal)}\\ \alpha^{*} & =& \operatornamewithlimits{argmax}_{\alpha} L_{\mu}(\alpha,\beta^{*},\gamma^{*}) \quad \text{(dual)} \end{array} $$
(6)

ADMM solves (6) by a fixed-point iteration scheme, for which the dual variable is updated by gradient ascent. Specifically, the scheme consists of

  • β-update

    $$ \begin{array}{lcl} \beta^{(k+1)}& = & \operatornamewithlimits{argmin}_{\beta} L_{\mu}(\alpha^{(k)},\beta,\gamma^{(k)}) \\ & = & \operatornamewithlimits{argmin}_{\beta} \beta^{T} X^{T} X\beta/2 - y^{T} X \beta \\ & & \quad\quad\qquad- \alpha^{(k)T}\beta+ (\mu/2)\|\beta-\gamma^{(k)}\|^{2}_{2} \\ & = & (X^{T}X+ \mu I)^{-1}(X^{T}y+\mu \gamma^{(k)}+\alpha^{(k)}) \end{array} $$
    (7)

    In computing the final equation, the PCG algorithm described in Appendix A can be employed.

  • γ-update

    $$ \begin{array}{lcl} \gamma^{(k+1)}& = & \operatornamewithlimits{argmin}_{\gamma} L_{\mu}(\alpha^{(k)},\beta^{(k+1)},\gamma) \\ & = & \operatornamewithlimits{argmin}_{\gamma} \lambda \|\gamma\|_{1} +\alpha^{(k)T}\gamma \\ & & \quad\quad\qquad +(\mu/2)\|\beta^{(k+1)}-\gamma\|^{2}_{2} \\ & = & \mathbf{S}_{\lambda/\mu}(\beta^{(k+1)}-\alpha^{(k)}/\mu) \end{array} $$
    (8)
  • α-update

    $$ \alpha^{(k+1)}=\alpha^{(k)}+\mu (\gamma^{(k+1)}-\beta^{(k+1)}) $$
    (9)
Algorithm 5 ADMM for the Lasso
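A minimal MATLAB sketch of the updates (7)-(9) follows. For simplicity it solves the β-update with a pre-computed Cholesky factor instead of the PCG method used in the paper, and the toy data and stopping rule are our own illustrative choices rather than those of Algorithm 5.

```matlab
% Minimal ADMM sketch for the Lasso following updates (7)-(9) (illustrative only).
rng(1); n = 50; p = 20;                                 % toy data, not from the paper
X = randn(n, p); b_true = zeros(p, 1); b_true(1:3) = [2; -1; 1];
y = X * b_true + 0.1 * randn(n, 1); lambda = 1;

soft = @(x, t) sign(x) .* max(abs(x) - t, 0);
mu = 10;                                                % augmented Lagrangian parameter (Section 3.1.3)
XtX = X' * X; Xty = X' * y;
R = chol(XtX + mu * eye(p));                            % direct factorization; the paper employs PCG here
gamma = zeros(p, 1); alpha = zeros(p, 1);
tol = 1e-7; f_old = Inf;
for k = 1:5000
    beta  = R \ (R' \ (Xty + mu * gamma + alpha));      % beta-update (7)
    gamma = soft(beta - alpha / mu, lambda / mu);       % gamma-update (8)
    alpha = alpha + mu * (gamma - beta);                % dual ascent (9)
    f_new = 0.5 * norm(y - X * gamma)^2 + lambda * norm(gamma, 1);
    if abs(f_new - f_old) / abs(f_old) < tol, break; end
    f_old = f_new;
end
```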

3 Numerical study

3.1 Method

In this section, we conduct a numerical study comparing the sensitivity of the five algorithms of Section 2 (the four algorithms plus CD with the active shooting strategy) to the correlation between the covariates, to the regularization parameter λ, and to the ratio between the number of variables (p) and the sample size (n). We measure the sensitivity of an algorithm by the number of iterations required for numerical convergence, the corresponding computation time, and the converged objective function value. Note that the warm start strategy is not considered because our primary goal is to compare the algorithms for a fixed value of λ; the warm start strategy can be applied to any algorithm by simply setting the initial value of β for a new λ to the final (optimized) value for the previous λ.

3.1.1 Design of numerical study

To study the effect of correlations between the covariates, we control the condition number of the population covariance matrix of the covariates, denoted by Σ. The condition number of a positive definite matrix is defined as the ratio of its largest eigenvalue to its smallest eigenvalue. The larger the condition number, the more correlated the covariates. The data matrix X, sampled from a p-variate distribution with covariance matrix Σ, is likely to have highly correlated columns if Σ does. For the numerical study, we consider the following four types of population covariance matrices:

  • (Σ1)  Smallest condition number: the identity matrix I_p of size p×p, whose condition number is 1.

  • (Σ2)  Small condition number: a symmetric banded matrix of bandwidth 1,

    $${\Sigma}_{2} = \begin{pmatrix} 1 & 0.45 & 0 & \cdots & 0 \\ 0.45 & 1 & 0.45 & \cdots & 0 \\ 0 & 0.45 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0.45 & 1 \end{pmatrix}. $$

    The condition number of Σ2 is approximately 19 for all the dimensions considered in this study.

  • (Σ3)  Moderate condition number: a modification of Σ2 so that the largest eigenvalue is 100. Specifically,

    $$ {\Sigma}_{3} = \sum\limits_{i=1}^{p-1} \lambda_{(i)} u_{i} {u_{i}^{T}} + \tilde{\lambda} u_{p} {u_{p}^{T}}, $$
    (10)

    where λ_(i) is the i-th smallest eigenvalue of Σ2, u_i is an eigenvector of Σ2 corresponding to λ_(i), and \(\tilde {\lambda } = 100\). The condition number of Σ3 is approximately 1000.

  • (Σ4)  Large condition number: similar to Σ3, but \(\tilde {\lambda }\) is set to 10000 so that the condition number is approximately 100000.

For the dimension, we consider p=100, 500, and 2500, and set the sample size to n=500. Thus, the ratio p/n takes the values 0.2, 1, and 5, respectively. Note that this ratio affects the condition number of the sample covariance matrix X^T X.

To take into account the influence of the regularization parameter λ, we consider λ=0.1,1,10, and 100 for all the cases.

3.1.2 Data generation

For each combination of the two factors p/n and Σ, we generate n=500 observations of a p-dimensional covariate vector x from the multivariate normal distribution with mean 0 and covariance matrix Σ_i, i=1,2,3,4. To simulate sparse regression models, a pre-selected 10 % of the coordinates of the p-dimensional coefficient vector β are set to be nonzero. The nonzero coefficients are sampled from the standard normal distribution. The response y is generated from the linear model

$$y = \textbf{x}^{T}\beta + \epsilon, $$

where 𝜖 is from the standard normal distribution.
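A minimal MATLAB sketch of this data-generating step is shown below; the choices of p, of the banded covariance matrix Σ2, and of the positions of the nonzero coefficients are our own illustrative settings.

```matlab
% Sketch of the simulation design of Section 3.1.2 (illustrative settings).
rng(1); n = 500; p = 100; rho = 0.45;
Sigma = eye(p) + rho * (diag(ones(p-1, 1), 1) + diag(ones(p-1, 1), -1));  % banded Sigma_2
X = randn(n, p) * chol(Sigma);                      % n rows drawn from N(0, Sigma)
beta = zeros(p, 1);
nz = 1:round(0.1 * p);                              % a pre-selected 10% of the coordinates
beta(nz) = randn(numel(nz), 1);                     % nonzero coefficients from N(0, 1)
y = X * beta + randn(n, 1);                         % linear model with standard normal noise
```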

3.1.3 Algorithm parameters

For a fair comparison of the algorithms, we use \(\beta ^{(0)} = \big (({\sum }_{i=1}^{n} X_{ij} y_{i})/({\sum }_{i=1}^{n} X_{ij}^{2})\big )_{1\le j \le p}\) as the common initial value and impose a common convergence criterion. The four algorithms stop when the relative error

$$\frac{{|f(\beta^{(k)}) - f(\beta^{(k-1)})|}}{{| f(\beta^{(k-1)})|}} $$

is below the pre-specified tolerance of 10^{-7}. For each combination of the factors, each algorithm is run 10 times with different random seeds, and the means and standard errors of the objective function values, the numbers of iterations, and the computation times are reported.

For algorithm-specific parameters, we set the perturbation constant ε=10^{-10} for MM-LQA, the backtracking factor η=2 for FISTA, and the augmented Lagrangian parameter μ=10 for ADMM. For the PCG method used within MM-LQA and ADMM, we use as the preconditioner a diagonal matrix whose diagonal entries are those of the matrix to be solved, i.e., diag(X^T X + D^{(k)}) for MM-LQA and diag(X^T X + μI) for ADMM.

All algorithms we consider are implemented in MATLAB, and the computation times are measured on a desktop PC with an Intel i7-4770 CPU (3.40 GHz) and 16.0 GB of RAM. The code implementing the algorithms in this paper is available at https://sites.google.com/site/dhyeonyu/software.

3.2 Results of numerical study

In this section, we summarize the comparison of the sensitivity of the MM-LQA, CD, ADMM, and FISTA algorithms to the correlation between the covariates, the ratio p/n, and the regularization parameter λ. We report the averages and standard errors of the numbers of iterations, the computation times, the objective function values (F-value), and the numbers of nonzero coefficients (sparsity) for the combinations of the covariance matrices and the regularization parameters, for p=100, 500, and 2500, in Tables 1, 2, and 3, respectively. Note that MM-LQA and ADMM use the PCG method to solve the inner linear system; the averages and standard errors of the inner iteration counts are reported for these two algorithms only.

Table 1 Convergence result, p=100
Table 2 Convergence result, p=500
Table 3 Convergence result, p=2500

It is interesting to note that CD with the active shooting strategy (referred to as "CD-act") accelerated CD in all cases, even though CD-act required more iterations than CD. Other than this, CD-act is essentially equivalent to CD except for the order of updates, and we omit further comparisons with CD-act in this section.

To visualize the convergence properties, we plot the relative error of each algorithm against the iteration number for λ=10 and p=100, 500, and 2500 in Figs. 1, 2, and 3, respectively. For a fair comparison, the relative error curves of MM-LQA and ADMM are drawn as step functions to account for the inner iterations within each outer iteration. The plots are taken from one of the 10 simulation runs, as the other nine exhibit similar patterns.

Fig. 1 Comparison of convergence rates among the four methods in the p=100 cases

Fig. 2 Comparison of convergence rates among the four methods in the p=500 cases

Fig. 3 Comparison of convergence rates among the four methods in the p=2500 cases

3.2.1 Sensitivity to the condition number of population covariance matrices

It is worth noting that, overall, MM-LQA and ADMM show relatively stable convergence compared with CD and FISTA. The convergence of CD and FISTA slows down considerably as the condition number of the population covariance matrix increases, which means that these algorithms are sensitive to the correlation between the covariates. Notably, CD fails to converge for p=2500, Σ=Σ4, and λ=0.1, 1, 10 (Table 3). For a relatively high level of regularization (λ=100), however, CD and FISTA show relatively stable performance. While not shown in the tables, for an even larger value of the regularization parameter (λ=1000), CD converges within 124570.6 iterations on average.

In summary, the convergence speed of the MM-LQA and ADMM algorithms is less sensitive to the correlation between the covariates than that of the CD and FISTA algorithms. This can be attributed to the fact that MM-LQA and ADMM employ PCG to solve a linear system involving the sample covariance matrix X^T X. In a high-dimensional setting (p/n ≫ 1), this matrix becomes ill-conditioned and the surface of the objective function becomes flat in some directions. This causes coordinate-wise algorithms such as CD to slow down because of the reduced step size. In FISTA, the step size is inversely proportional to the Lipschitz constant L, which equals the largest eigenvalue of X^T X, so FISTA also needs more iterations to converge. On the contrary, preconditioning alleviates the ill-conditioning of the matrix in MM-LQA and ADMM.

3.2.2 Sensitivity to the ratio p/n

Here we examine the effect of the sample covariance matrix X^T X, which becomes ill-conditioned as the dimension increases. The ill-conditioning is most severe when p/n > 1, because the matrix is then singular. Thus, we expect MM-LQA and ADMM to be more stable than CD and FISTA, and the simulation results meet this expectation. For instance, for Σ=Σ1 and λ=10, as the dimension varies from 100 to 500 to 2500 (Tables 1, 2, and 3), the average number of iterations of the CD algorithm increases sharply from 7.9 to 35.7 to 1070.2. For FISTA the corresponding counts are 30.4, 195.7, and 8620.6. By contrast, the average iteration counts for MM-LQA are 39.4, 77.4, and 317.6.

To examine the effect of the dimension more carefully, we conducted an additional numerical study with p=1000, 1500, and 2000 for Σ=Σ1 and λ=10. We chose this combination of Σ and λ to avoid the effects of the underlying covariance structure and of a too small or too large regularization parameter. We report the averages and standard errors of the number of iterations and the computation time for p/n=0.2, 1, 2, 3, 4, 5 in Table 4. While both the number of iterations and the computation time of MM-LQA, CD, and FISTA increase with the dimension, ADMM does not show this tendency: its iteration counts are 373.0, 433.0, 847.0, 583.5, 406.6, and 379.0. MM-LQA shows an almost linear increase in the number of iterations: the average counts are 39.4, 77.4, 126.1, 195.9, 270.7, and 317.6. Compared with MM-LQA, the performance of CD and FISTA deteriorates much faster. For instance, when the dimension increases from 500 to 1000, the average numbers of iterations for CD are 35.7 and 119.8, and those for FISTA are 195.7 and 959.3, while those for MM-LQA are 77.4 and 126.1.

Table 4 Number of iterations and operation counts with respect to the dimension (p), n=500, Σ1, λ=10

3.2.3 Sensitivity to the regularization parameter

It is interesting to observe that ADMM and FISTA exhibit opposite behaviors with respect to the regularization parameter λ. For large λ, ADMM converges slowly. For example, for p=2500 and Σ=Σ4 (Table 3), the iteration count increases steeply from 342.6 to 1030.3 to 2213.9 as λ grows from 10 to 100 to 1000. In contrast, FISTA needs fewer iterations to converge for large λ: as λ increases from 0.1 to 1000, its iteration counts decrease as 1152791.5, 739028.0, 230375.9, 39686.1, and 13904.6. To understand these phenomena, consider the updating mechanism of ADMM in (8). The part related to the regularization parameter is the soft-thresholding operation that updates the auxiliary primal variable γ, which is meant to equal the original primal variable β. The amount of shrinkage toward zero increases with λ, pulling β and γ apart. Since the dual variable α converges only when the difference between β and γ is extremely small, a large λ hinders the convergence of α. On the contrary, the updating mechanism in (5) involves only β. Since the threshold of the soft-thresholding operator is proportional to λ, a large λ sets many components of β to zero, so fewer iterations are needed.

3.2.4 Accuracy

In terms of the minimized objective function value, MM-LQA attains the smallest value in most of the cases considered, and the other algorithms have slightly larger values. In high dimensions (p=500, 2500), the objective function values of FISTA differ significantly from those of the other algorithms even though the convergence criterion is satisfied. For instance, the objective function value of FISTA is 7.33 % larger than that of MM-LQA for p=2500, Σ=Σ4, and λ=10 (Table 3), and 8.8 % larger for p=500, Σ=Σ4, and λ=0.1 (Table 2). The inaccuracy of FISTA worsens as the condition number of the population covariance matrix grows. In Table 1, the maximum relative differences of FISTA are 1.06 % with Σ1, 1.01 % with Σ2, 4.97 % with Σ3, and 7.33 % with Σ4.

3.2.5 Computation time

In terms of the number of iterations shown in Figs. 1, 2, and 3, in which the inner iterations of MM-LQA and ADMM are taken into account, CD performs best for Σ1 and Σ2 and MM-LQA performs best for Σ3 and Σ4. In terms of computation time, however, MM-LQA is faster than CD for all the covariance matrices and dimensions except for the cases of Σ1 and Σ2 with p=100, 500, 2500 at λ=10, 100 (Tables 1, 2, and 3). Note that even then CD is at most 1.12 seconds faster than MM-LQA, attained for Σ2 and λ=100 (Table 3).

3.2.6 Oscillation of ADMM

A noticeable feature of Figs. 1, 2, and 3 is the oscillation of the relative error of ADMM. This oscillatory pattern arises because ADMM allows β^{(k)} ≠ γ^{(k)}, so the iterates remain infeasible until a sufficient number of iterations have been performed.

3.2.7 Non-convergence

We found that CD does not converge when λ is small and the condition number of the covariance matrix is large (Table 3). For p=2500 and Σ=Σ4, CD does not satisfy the convergence criterion even after 2 million iterations. To see the pattern, we investigated similar cases. For p=2500, Σ=Σ2 and Σ3, and λ=0.1 (Table 3), the CD algorithm requires 189482.4 and 304744.0 iterations, respectively, to converge. Furthermore, for p=2500, Σ=Σ4, and λ=100 (Table 3), it needs 124570.6 iterations. A similar phenomenon is found in lower dimensions (p=100, 500): for p=500, Σ=Σ4, and λ=0.1, 1 (Table 2), 1032160.8 and 231004.6 iterations are needed, and 90007.4 and 39520.6 iterations are needed for p=100, Σ=Σ4, and λ=0.1, 1 (Table 1).

4 Application to cancer biomarker discovery

In this section, we apply the Lasso regression model to discover possible biomarkers for lung cancer. Lung cancer is the leading cause of death from cancer and is notorious for its low survival rate; the 5-year survival rate is approximately 15 % [14]. Moreover, in the early stage, the progression of lung cancer varies greatly between patients. To provide proper treatment, it is important to classify the progression and metastasis of each individual patient. Recently, cancer has come to be viewed as a disease of genomic alterations. Many prognostic or predictive biomarkers have been identified, which allow calculation of risk and precise classification of individual lung cancer patients [22]. For more accurate prognosis or treatment, it is useful to discover genes regulated by known biomarkers of lung cancer [5, 7]. The Lasso can be a useful tool for this purpose. To assess its feasibility, we analyzed the gene expression dataset from the Lung Cancer Consortium [20].

4.1 Method

The dataset measures gene expression levels in 442 lung adenocarcinoma patients using the Affymetrix U133A platform. For preprocessing, we applied the robust multiarray average (RMA) algorithm and quantile-quantile normalization [13], and all gene expression values were log2 transformed. Entrez IDs were used to map genes across microarray platforms, and the expression value of each annotated gene was computed as the average of the values of its associated probe sets.

To focus on survival-related genes, we used univariate Cox regression to identify the genes significantly correlated with a patient's overall survival time. To deal with the multiple comparison issue, we controlled the false discovery rate (FDR) [3] by fitting the p-values with a beta-uniform model [18]. We identified a subgroup of 3093 genes whose expression levels were associated with the patients' overall survival time (p-value < 0.0663, estimated FDR < 0.2).

Among the 3093 identified genes, the ROS1 gene is a known biomarker for lung cancer [5, 7]. We take the expression level of the ROS1 gene as the response variable and the expression levels of the other 3092 genes as covariates. Both the response variable and the covariates were standardized.

We chose the optimal regularization parameter λ that minimizes the BIC [19]

$$\begin{array}{@{}rcl@{}} BIC(\lambda;\hat{\beta}_{\lambda}) &=& n \log \left( \sum\limits_{i=1}^{n} \left( y_{i} - \sum\limits_{j=1}^{p} \hat{\beta}_{\lambda,j} X_{ij}\right)^{2} \right) \\ &&+ \log{n} \sum\limits_{j=1}^{p} I(\hat{\beta}_{\lambda,j} \neq 0), \end{array} $$

where I(⋅) denotes the indicator function, and \(\hat {\beta }_{\lambda }\) refers to the estimated Lasso coefficients with the regularization parameter λ.
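For concreteness, the following MATLAB sketch computes the BIC over a decreasing grid of λ values using a warm-started CD solver; the grid, the inner solver, and the toy data are our own illustrative choices rather than the exact procedure used in the paper.

```matlab
% Sketch of BIC-based selection of lambda with warm starts (illustrative only).
rng(1); n = 100; p = 50;
X = randn(n, p); b_true = zeros(p, 1); b_true(1:5) = randn(5, 1);
y = X * b_true + randn(n, 1);
X = (X - mean(X)) ./ std(X);                        % standardize covariates
y = (y - mean(y)) / std(y);                         % standardize response

soft = @(x, t) sign(x) .* max(abs(x) - t, 0);
cn2  = sum(X.^2, 1)';
lambdas = logspace(2, -1, 30);                      % decreasing grid of lambda values
beta = zeros(p, 1); bic = zeros(size(lambdas));
for m = 1:numel(lambdas)
    lam = lambdas(m); f_old = Inf;
    for iter = 1:1000                               % CD solver, warm-started from the previous lambda
        for j = 1:p
            r_j = y - X * beta + X(:, j) * beta(j);
            beta(j) = soft(X(:, j)' * r_j, lam) / cn2(j);
        end
        f_new = 0.5 * norm(y - X * beta)^2 + lam * norm(beta, 1);
        if abs(f_new - f_old) / abs(f_old) < 1e-7, break; end
        f_old = f_new;
    end
    rss = sum((y - X * beta).^2);
    bic(m) = n * log(rss) + log(n) * nnz(beta);     % BIC as defined above
end
[~, best] = min(bic);
lambda_opt = lambdas(best);
```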

4.2 Results

To compare the computational algorithms, we measured the total computation time, including the selection of λ using the BIC. As mentioned in Section 2.1, the warm start strategy improves overall computational efficiency when the solution path is sought. Table 5 reports the results for both the warm and the cold start strategies. We see that MM-LQA shows the best performance in both convergence and computation time under the cold start strategy. On the other hand, CD with active shooting becomes the fastest when the warm start strategy is employed, because the active shooting strategy greatly reduces the number of variables to update at high levels of regularization.

Table 5 Total computation times and number of iterations for calculating BIC

We also plot the computation time for each tuning parameter in Fig. 4 for both strategies. The warm start strategy makes every algorithm faster. As the sparsity level increases, the speeds of CD, CD-act, and FISTA increase, that of MM-LQA remains almost constant, and that of ADMM deteriorates.

Fig. 4 Computation time vs. tuning parameter for the cold start (a) and the warm start (b), for each algorithm

In computing the BIC, each algorithm produced slightly different values because of differences in convergence behavior; see Fig. 5. Based on these plots, we chose λ=60, since this value is close enough to the respective optima of all the algorithms. CD, FISTA, and ADMM selected exactly the same 24 variables, while MM-LQA selected two additional variables. We report the 24 identified genes and their regression coefficients in Table 6. Among them, the HLF, PTPN13, SFTPD, and SLC34A2 genes have been reported as lung cancer-related or lung injury-related genes in the literature [15, 25, 29]. Interestingly, these genes exhibit relatively stronger interactions with ROS1 than the other genes in Table 6.

Fig. 5 Cross validation based on the BIC criterion

Table 6 List of the identified genes interacting with the gene ROS1 and their estimated coefficients from the LASSO

In addition, we investigated the functionality of the 3092 genes and their association with lung-related diseases using DAVID, a well-known gene annotation tool [10]. From this investigation, we found 149 genes reported to be associated with lung diseases. If we take this list as the gold standard, the recall (sensitivity) and the accuracy of our analysis are 2.68 % and 94.66 %, respectively. However, the full functionality of the 3092 genes is not yet completely understood, and much research effort is currently being devoted to establishing their functions and associations with disease. To the best of our knowledge, the genes reported in our manuscript are the only ones known to be related to lung cancer or lung injuries such as pulmonary embolism. We also remark that our regression analysis can only find genes associated with the known oncogene ROS1, and it is unlikely that all genes associated with lung diseases are also associated with this particular gene. This may explain the low true positive rate of our study.

5 Discussion

We have compared the numerical performance of the CD, MM-LQA, FISTA, and ADMM algorithms for solving high-dimensional Lasso problems. Among these, MM-LQA shows stable performance in both convergence and computation time. Its convergence speed does not depend much on the numerical conditioning of the data matrix or the correlations between the covariates, both of which can be severe in high-dimensional settings. This is because the preconditioner of the PCG method employed in the inner iterations of the algorithm reduces the effect of the ill-conditioning of the matrix. We used a simple preconditioner; a tailored preconditioner specific to the application may yield even better performance. Still, the relative insensitivity to the choice of preconditioner makes MM-LQA an attractive option.

Similar to MM-LQA, ADMM is also insensitive to the covariance structure and the dimension-to-sample-size ratio. However, its lack of the descent property calls for caution when applying it to problems that require high precision.

CD is simple to implement and can be considerably accelerated by the active shooting strategy. With either a relatively high level of regularization or the warm start strategy, CD is competitive when the covariance matrix is well-conditioned, but it may take too long to produce a solution when the covariance matrix is ill-conditioned.

FISTA has the fastest per-iteration computation time but is sensitive to the regularization parameter. In addition, the inaccuracy of its solutions is a weak point, which worsens with ill-conditioned covariance matrices.

5.1 Active set strategies

While the focus of this paper is a comparative study of the considered algorithms per se, in practice these algorithms can be combined with an active set strategy, which reduces the number of variables before an algorithm runs by setting the coefficients of the discarded variables to zero. For example, the CD algorithm with the active shooting strategy (Algorithm 2) can be considered an instance of this approach specialized to CD. The "strong rules" [24], on the other hand, are based on the Karush-Kuhn-Tucker (KKT) optimality conditions and are independent of any particular algorithm. Thus the strong rules can be applied to all of the algorithms considered here to reduce computation time.

To see the potential of this active set strategy, we conducted an additional numerical study with and without a strong rule for p=2500, n=500, and Σ=Σ4 as defined in Section 3.1.1. We consider the "basic strong rule", since our interest is in a fixed value of the tuning parameter λ. The basic strong rule discards the variables satisfying \(|{X_{j}^{T}} y| < 2 \lambda - \lambda _{\max }\), where \(\lambda _{\max } = \max _{i} |{X_{i}^{T}} y|\) is the smallest value of λ for which all the Lasso coefficients are zero. Because the basic strong rule is valid for λ_max/2 < λ < λ_max, we consider two large tuning parameters, 0.75λ_max and 0.9λ_max, which lead to highly sparse estimates. We report the average computation times, objective function values, sparsity levels, and numbers of variables discarded by the basic strong rule in Table 7, which indicates that the basic strong rule indeed improves the computational efficiency of all the algorithms.
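A minimal MATLAB sketch of this screening step (with our own toy data and variable names) is:

```matlab
% Sketch of the basic strong rule screening (illustrative only).
rng(1); n = 500; p = 2500;
X = randn(n, p); b_true = zeros(p, 1); b_true(1:10) = randn(10, 1);
y = X * b_true + randn(n, 1);

corr_xy    = abs(X' * y);                          % |X_j' y| for each variable j
lambda_max = max(corr_xy);                         % smallest lambda giving the all-zero solution
lambda     = 0.75 * lambda_max;                    % a tuning parameter in (lambda_max/2, lambda_max)
keep       = corr_xy >= 2 * lambda - lambda_max;   % variables surviving the basic strong rule
X_reduced  = X(:, keep);                           % run any of the Lasso algorithms on X_reduced only;
                                                   % discarded coefficients are fixed at zero
fprintf('Discarded %d of %d variables.\n', p - nnz(keep), p);
```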

Table 7 Comparison of the algorithms with and without applying the basic strong rule for p=2500, n=500, and Σ=Σ4, \(\lambda = 0.75 \lambda _{\max }\) and \(0.9 \lambda _{\max }\)