Abstract
Variable selection is important in high-dimensional data analysis. The Lasso regression is useful since it possesses sparsity, soft-decision rule, and computational efficiency. However, since the Lasso penalized likelihood contains a nondifferentiable term, standard optimization tools cannot be applied. Many computation algorithms to optimize this Lasso penalized likelihood function in high-dimensional settings have been proposed. To name a few, coordinate descent (CD) algorithm, majorization-minimization using local quadratic approximation, fast iterative shrinkage thresholding algorithm (FISTA) and alternating direction method of multipliers (ADMM). In this paper, we undertake a comparative study that analyzes relative merits of these algorithms. We are especially concerned with numerical sensitivity to the correlation between the covariates. We conduct a simulation study considering factors that affect the condition number of covariance matrix of the covariates, as well as the level of penalization. We apply the algorithms to cancer biomarker discovery, and compare convergence speed and stability.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Variable selection in high-dimensional data analysis is an important issue in various fields of science including genetics, compressed censing, and machine learning. For instance, if we want to detect a disease from a person’s gene expressions measured by microarray, tens of thousands of genes are possible predictors while only a few of them are relevant to the disease. A systematic method to select these genes are in need.
Statistically, this is a variable or model selection problem in regression. A classical approach is to apply the best subset method, which selects the best model by examining the models created from all possible combinations of the predictor variables based on a predictive performance criterion such as the Akaike information criterion (AIC) [1], the Bayesian information criterion (BIC) [19], or the adjusted coefficient of determination (R 2). Applied to high-dimensional data, however, best subset method becomes erratic because of the unavoidable multicollinearity among the predictor variables. A reason for this instability is that the best subset method selects models in a discrete manner, hence the selected models may be very different in the presence of noise. Furthermore, the number of subsets of variables to be compared increases exponentially as the number of predictor variables becomes larger. To overcome these difficulties, “soft” versions of variable selection methods that continuously select models, have been extensively studied. Among these, the Lasso, which penalizes the sum of residual squares or other loss functions by the l 1 norm of the regression coefficients, has received a considerable attention [23]. The l 1 norm penalty makes many of the estimated coefficients exactly zero, which is why this method is broadly used for variable selection in high-dimensional data. Furthermore, the Lasso allows the coefficients to vary continuously with the level of regularization, rendering itself a stable soft model selector. The Lasso also shrinks the coefficients toward zero and thus exhibits a better predictive performance than ordinary least square estimators.
For these reasons, the Lasso has found many applications, including face recognition [26, 28], speech recognition [9], and estimation of gene networks [29].
Computationally, the Lasso problem is to minimize the objective function
where y is an n×1 vector of the observations, X is an n×p data matrix of the levels of the predictor variables, and λ>0 is the regularization parameter; β is a p-dimensional vector of the regression coefficients, which is the optimization variable. We wish to find the vector \(\hat {\beta }\) that minimizes f(β). The l 1 norm penalty function is non-differentiable at the origin, and standard optimization methods based on derivatives such as the Newton-Raphson and the gradient descent methods cannot be used. Various algorithms to compute the Lasso solutions have been studied. Among these, the homotopy algorithm [6], while widely used for problems of small to moderate dimensions, is often inapplicable to high-dimensional problems because it requires repetitive Cholesky decompositions. To meet the needs of efficient and stable computation, iterative methods, which reduce the original optimization problem to a series of simpler problems and successively refine solutions, have been of interest. Many algorithms have been developed, including the majorization-minimization using local quadratic approximation (MM-LQA) algorithm [12], alternating direction method of multipliers (ADMM) [4], fast iterative shrinkage thresholding algorithm (FISTA) [2]. For individual algorithm, its scalability for the high-dimensional data and applicability for various types of penalties are well-studied. There are a few of comparative studies for the computation times of the algorithms as well [8, 27, 28]. However, studies that compare the merits and the drawbacks of these algorithms in a systematic fashion are scarce.
In this paper, we conduct studies that compare computational algorithms for high-dimensional Lasso, to find the factors that affect the performance and convergence of the algorithms. In practice, covariates in regression analysis for the high-dimensional data usually contain highly correlated variables. While there exist other penalized statistical procedures, it is worth investigating the numerical sensitivity of algorithms to the correlation between the covariates.
We compare four computational algorithms suitable for high-dimensional Lasso, that is, the coordinate descent (CD) algorithm (shooting algorithm), MM-LQA, FISTA, and ADMM. We analyze the sensitivity of these algorithms to the correlation between the covariates to see the relative merits of each algorithm. We found that MM-LQA and ADMM are quite robust to correlation. These algorithms also show stable convergence for a wide range of the regularization parameter. On the contrary, CD and FISTA may behave unstably when the correlation becomes high; this unstability is improved as regularization parameter gets larger. We also report empirical conditions for the CD algorithm to fail to converge.
Section 2 reviews the four algorithms for solving Lasso. In Section 3 we conduct the comparative numerical study. In Section 4, we apply the four algorithms to a study for cancer biomarker discovery, and compare their performance.
2 Preliminaries
In this section, we provide an overview of the iterative optimization methods to be compared in the following sections.
2.1 Coordinate descent algorithm (CD)
CD is devised to minimize an objective function that is separable with respect to each coordinate of the variable. It iteratively updates the variable from β (k) to β (k+1) by choosing a single coordinate, and then performs univariate minimization with the other coordinates held fixed at the values of the previous step. In other words, write f(β) = f(β 1,⋯ ,β p ). At the k-th iteration, choose an appropriate j∈{1,⋯p}, and set
For the Lasso, the objective function is convex and separable because so are the sum of residual squared and the l 1 norm, rendering the following decomposition possible,
If all the coefficients except for β j are fixed at \(\left \lbrace \tilde {\beta }_{l}, l \neq j\right \rbrace \), (2) becomes a convex, piecewise quadratic function in β j . This univariate function is minimized by
where \(\tilde {y}_{i(j)} = {\sum }_{l \neq j} X_{il}\tilde {\beta }_{l}\), and S λ (⋅) is the soft-thresholding operator:
This coordinate-wise minimization step is repeated with respect to all j (1≤j≤p) until relative error is below determined tolerance, making a full iteration. The resulting algorithm for the Lasso is as follows.
Since each coordinate-wise minimization step is very efficiently computed, Friedman et al. [8], Wu and Lange [27] devised path algorithms based on CD with the warm start strategy that successively use the previous solution as an initial value of the current solution in the path. They report the CD-based path algorithms show better performance than LARS [6] in high dimensional settings.
CD can be accelerated by the active shooting strategy [17], which reduces the computational cost by maintaining a set of nonzero solutions, called an active set. The detailed steps are described as follows.
2.2 Majorization-Minimization using Local Quadratic Approximation (MM-LQA)
The majorization-minimization (MM) algorithm is based on the idea of iteratively minimizing series of surrogate functions, which majorizes the objective function f(β) at the current iteration point. A surrogate function at the current point β (k), denoted by f ∗(β,β (k)), must majorize f(β). Majorization is defined as the following property: a function f ∗(⋅,⋅) majorizes f(⋅) at β (k) if
with equality holding when β = β (k).
Let β (k+1) denote a minimizer of f ∗(β,β (k)). The mapping β (k)→β (k+1) generates a sequence {β (k)} for which the sequence of objective function values {f(β (k))} is non-increasing. In particular, we have
This descent property provides a basis for the numerical stability of the MM algorithm. Choosing an appropriate majorizing function is crucial for the success of an MM algorithm [11].
For the Lasso, a majorizing function can be constructed from a local quadratic approximation (LQA) of the l 1 norm. Recall that
Then it can be easily seen that the quadratic function in x
majorizes |x| at x 0≠0, hence \({\sum }_{j=1}^{p} p^{*}(\beta _{j}, \beta _{j}^{(k)})\) majorizes ∥β∥1. It immediately follows that
majorizes f(β) at β (k)≠0. Unfortunately, this majorizing function is undefined at β (k)=0. To overcome this difficulty, Hunter and Li [12] propose to minimize a slightly perturbed version of (1):
where
It can be shown that
majorizes f 𝜖 (β) at β (k), where
This majorizing function is defined for all \(\beta \in \mathbb {R}^{p}\). The optimality condition to minimize f ∗∗(β,β (k)) for β is given by
where D (k) is a diagonal matrix, \(D^{(k)}_{ii}={\lambda }/({|\beta _{i}^{(k)}|+\epsilon })\).
To solve this linear equation, the preconditioned conjugate gradient (PCG) algorithm is utilized due to its well-known efficiency. The PCG algorithm is also used for the ADMM introduced in the sequel; see A for the details of the PCG algorithm.
2.3 Fast iterative shrinkage thresholding algorithm (FISTA)
FISTA is a variant of the proximal gradient algorithm [16] applied to the Lasso [2]. Proximal algorithms minimize an objective function of the form of f(β) = g(β) + h(β), \(\beta \in \mathbb {R}^{p}\). Typically, h(β) is convex and non-smooth and g(β) is smooth convex function whose gradient is Lipschitz continuous with Lipschitz constant L. These algorithms have been applied to problems difficult to solve with standard smooth optimization methods since it can deal with non-smooth function [16].
The proximal gradient algorithm successively applies the proximal operator, defined below, to a quadratic majorizing function Q L (β,β (k)) of f(β) at β = β (k):
The proximal operator for Q L is its minimizer in β:
Applying the map β (k+1) = p L (β (k)) iteratively generates a sequence of non-increasing objective function values {f(β (k))}, yielding the proximal gradient algorithm [16]. This is another instance of the MM algorithm.
Recall that for the Lasso, the objective function
can be divided into two parts:
and
Accordingly, the proximal operator p L reduces to the soft-thresholding operator and the updating mechanism for β is given by
where S λ/L is the vector version of the soft-thresholding operator in which (3) is applied element-wisely. This is the iterative shrinkage thresholding algorithm (ISTA) [16]. FISTA is a variation of ISTA with an accelerated convergence rate. In this algorithm, a real sequence {t k } is updated by the formula
and β (k+1) is updated by a point interpolating β (k) and β (k−1) depending on the value of t (k). In ISTA, the update for β depends only upon its previous value β (k), while in FISTA it depends on the two previous values. In general, ISTA iteration converges at the rate of O(1/k), whereas, FISTA converges at the rate of O(1/k 2), achieving faster convergence.
In practice, the Lipschitz constant L is difficult to find or expensive to compute, and backtracking is used to determine a reasonable upper bound. The resulting is given in Algorithm 4.
2.4 Alternating direction method of multipliers (ADMM)
ADMM is devised to minimize objective function that can be represented as a sum of two convex functions, similar to FISTA. In general, one of the two convex functions is smooth and the other is non-smooth. For the Lasso,
where
Minimizing f(β) is equivalent to the following constrained optimization problem
To solve this, the augmented Lagrangian
is constructed. Here μ>0 is a small fixed parameter. The quadratic term involving μ enforces the constraint in a smoother fashion than the ordinary Lagrangian (μ=0). The multiplier α is the dual variable.
The saddle point condition is given by
ADMM solves (6) by a fixed-point iteration scheme, for which the dual variable is updated by gradient ascent. Specifically, the scheme consists of
-
β-update
$$ \begin{array}{lcl} \beta^{(k+1)}& = & \operatornamewithlimits{argmin}_{\beta} L_{\mu}(\alpha^{(k)},\beta,\gamma^{(k)}) \\ & = & \operatornamewithlimits{argmin}_{\beta} \beta^{T} X^{T} X\beta/2 - y^{T} X \beta \\ & & \quad\quad\qquad- \alpha^{(k)T}\beta+ (\mu/2)\|\beta-\gamma^{(k)}\|^{2}_{2} \\ & = & (X^{T}X+ \mu I)^{-1}(X^{T}y+\mu \gamma^{(k)}+\alpha^{(k)}) \end{array} $$(7)In computing the final equation, PCG algorithm in A can be employed.
-
γ-update
$$ \begin{array}{lcl} \gamma^{(k+1)}& = & \operatornamewithlimits{argmin}_{\gamma} L_{\mu}(\alpha^{(k)},\beta^{(k+1)},\gamma) \\ & = & \operatornamewithlimits{argmin}_{\gamma} \lambda \|\gamma\|_{1} +\alpha^{(k)T}\gamma \\ & & \quad\quad\qquad +(\mu/2)\|\beta^{(k+1)}-\gamma\|^{2}_{2} \\ & = & \mathbf{S}_{\lambda/\mu}(\beta^{(k+1)}+\alpha^{(k)}/\mu) \end{array} $$(8) -
α-update
$$ \alpha^{(k+1)}=\alpha^{(k)}+\mu (\beta^{(k+1)}-\gamma^{(k+1)}) $$(9)
3 Numerical study
3.1 Method
In this section, we conduct a numerical study comparing the sensitivity of the five algorithms of Section 2 (including CD with the active shooting strategy) to the correlation between the covariates and to the regularization parameter λ, and the ratio between the number of variables (p) and the sample size (n). We measure the sensitivity of an algorithm by the number of iterations required for numerical convergence, the corresponding computation time, and the converged objective function value. Note that the warm start strategy is not considered because our primary goal is to compare the algorithms for a fixed value of λ; the warm start strategy can be applied to any algorithm by simply setting the initial value of β for the new λ as the final (optimized) value for the previous λ.
3.1.1 Design of numerical study
To study the effect of correlations between the covariates, we control the condition number of population covariance matrix of the covariates, denoted by Σ. The condition number of a positive definite matrix is defined as the ratio of the largest eigenvalue to the smallest eigenvalue of the matrix. The larger the condition number is, the more correlated the covariates are. The data matrix, X, sampled from a p-variate distribution with covariance matrix Σ, is likely to be highly correlated if Σ is so. For the numerical study, we consider the following four types of population covariance matrices:
-
(Σ1) Smallest condition number: an identity matrix (I p ) of size p×p whose condition number is 1.
-
(Σ2) Small condition number: a symmetric banded matrix of bandwidth 1,
$${\Sigma}_{2} = \begin{pmatrix} \begin{array}{ccccc} 1 & 0.45 & 0 & {\cdots} & 0 \\ 0.45 & 1 & 0.45 & {\cdots} & 0 \\ 0 & 0.45 & 1 & {\cdots} & 0 \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} & {\vdots} \\ 0 & 0 & {\cdots} & 0.45 & 1 \end{array} \end{pmatrix}. $$The condition number of Σ2 is approximately 19 for all the dimensions considered in this study.
-
(Σ3) Moderate condition number: a modification of Σ2 so that the largest eigenvalue is 100. Specifically,
$$ {\Sigma}_{3} = \sum\limits_{i=1}^{p-1} \lambda_{(i)} u_{i} {u_{i}^{T}} + \tilde{\lambda} u_{p} {u_{p}^{T}}, $$(10)where λ (i) is the ith smallest eigenvalue of Σ2, u i is an eigenvector of Σ2 corresponding to λ (i), and \(\tilde {\lambda } = 100\). The condition number of Σ3 is approximately 1000.
-
(Σ4) Large condition number: similar to Σ3, but \(\tilde {\lambda }\) is set to 10000 so that the condition number is approximately 100000.
For the dimensions, we consider p=100, 500, and 2500, and set the number of samples as 500. Thus, the ratio p/n takes values of 0.2, 1, and 5, respectively. Note this ratio affects the condition number of the sample covariance matrix X T X.
To take into account the influence of the regularization parameter λ, we consider λ=0.1,1,10, and 100 for all the cases.
3.1.2 Data generation
For each combination of the two factors p/n and Σ, we generate n=500 observations of a p-dimensional covariate vector x from the multivariate normal distribution with mean 0 and covariance matrix Σ i ,i=1,2,3,4. To simulate sparse regression models, pre-selected 10 % of the coordinates of the p-dimensional coefficient vector β are set to be nonzero. Nonzero coefficients in β is sampled from the standard normal distribution. The response y is generated from the linear model
where 𝜖 is from the standard normal distribution.
3.1.3 Algorithm parameters
For a fair comparison of the algorithms, we use \(\beta ^{(0)} = \big (({\sum }_{i=1}^{n} X_{ij} y_{i})/({\sum }_{i=1}^{n} X_{ij}^{2})\big )_{1\le j \le p}\) as the common initial value and impose a common convergence criterion. The four algorithms stop when the relative error
is below the pre-specified tolerance 10−7. For each combinations of the factors, each algorithm is repeated 10 times with different random seeds and then the means and the standard errors of the objective function values, the number of iterations and the computation times are taken.
For algorithm-specified parameters, we set the perturbation constant 𝜖=10−10 for MM-LQA, backtracking factor η=2 for FISTA, and augmented Lagrange multiplier μ=10 for ADMM. For PCG used with MM-LQA and ADMM, we use a diagonal matrix whose diagonal entries are those of the matrix to solve, i.e, diag(X T X + D (k)) for MM-LQA, and diag(X T X + μ I) for ADMM, as the preconditioner.
All algorithms we consider are implemented in MATLAB and the computation times are measured on a desktop PC (Intel i7-4770 (3.40 GHz) and 16.0 GB RAM). The code that implements the algorithms in this paper is available at https://sites.google.com/site/dhyeonyu/software.
3.2 Results of numerical study
In this section, we summarize the results of comparisons of the sensitivity of the MM-LQA, CD, ADMM and FISTA algorithms to the correlations between covariates, the ratio p/n, and the regularization parameter λ. We report the averages and the standard errors of the numbers of iterations, the computation times, the objective function values (F-value), and the number of nonzero solutions (sparsity) for the combinations of the covariance matrices and the regularization parameters with p=100, 500, and 2500 in Tables 1–3, respectively. Note that MM-LQA and ADMM utilize the PCG method to solve the inner linear system; the average and standard errors of the inner iterations are reported for these algorithms only.
It is interesting to note that CD with the active shooting strategy (referred to as “CD-act”) accelerated the CD for all the cases although CD-act needed much larger iterations than CD. Other than this, CD-act is basically equivalent to the CD except for the order of updates and we omit the comparison of the CD-act in this section.
To visualize the convergence properties, we plot the relative error of each algorithm with respect to the iteration for λ=10 and p=100, 500, and 2500 in Figs. 1, 2, and 3. For a fair comparison, the relative error curves for MM-LQA and ADMM are represented as step functions to take into account the inner iterations for each outer iteration. Note that the plots are taken from one of the 10 simulations, as the other 9 exhibit similar patterns.
3.2.1 Sensitivity to the condition number of population covariance matrices
It is worth noting that overall, MM-LQA and ADMM show relatively stable convergence compared to CD and FISTA. The convergence speed of CD and FISTA considerably slows down as the condition number of the population covariance matrix increases, which means that these algorithms are sensitive to the correlation between the covariates. Notably, CD fails to converge for the case of p=2500, Σ=Σ4, and λ=0.1,1,10 (Table 3). For a relatively high level of regularization (λ=100), however, the CD and the FISTA algorithms show relatively stable performance. While not shown in the Tables, for an even larger value of regularization parameter (λ=1000), the CD converges within 124570.6 iterations.
In summary, convergence speed of the MM-LQA and the ADMM algorithms are less sensitive to the correlation between the covariates than those of the CD and the FISTA algorithms. This can be attributed to the fact that the MM-LQA and ADMM algorithms employ PCG to solve a linear system involving the sample covariance matrix X T X. For a high-dimensional setting (p/n≫1), this matrix becomes ill-conditioned, and the surface of objective function is flat. This makes coordinate-wisely moving algorithms like CD to slow down due to the reduced step size. In FISTA, the step size is inversely proportional to the Lipschitz constant L which is related to the maximum eigenvalue of X T X, hence FISTA also needs more iterations to converge. On the contrary, preconditioning alleviates ill-conditioning of the matrix X T X in MM-LQA.
3.2.2 Sensitivity to the ratio p/n
Here we examine the effect of the sample covariance matrix X T X which becomes ill-conditioned as the dimension increases. The ill-conditioning is most severe when p/n>1, because this matrix is singular. Thus, we expect that MM-LQA and ADMM are more stable than CD and FISTA. Simulation results meet the expectation. For instance, for Σ=Σ1 and λ=10, as the dimension varies from 100 to 500 and to 2500 (Tables 1–3), average iterations of the CD algorithm sharply increases from 7.9 to 35.7 and to 1070.2. For FISTA those are 30.4, 195.7, and 8620.6. On the contrary, the average iteration counts for MM-LQA are 39.4, 77.4 and 317.6.
To examine the effect of the dimensions more carefully, we conducted an additional numerical study with p=1000, 1500 and 2000 for Σ=Σ1 and λ=10. We chose this combination of Σ and λ to avoid the effect of the underlying covariance structure and that of too small or too large regularization parameter. We report the averages and standard errors of the the number of iterations and computation time for p/n=0.2,1,2,3,4,5 in Table 4. While both the number of iterations and the computation time for MM-LQA, CD, and FISTA increase as the dimension increases, ADMM does not show this tendency: the number of iterations are 373.0, 433.0, 847.0, 583.5, 406.6, and 379.0. MM-LQA shows an almost linear increase in convergence speed: the average numbers of iterations are 39.4, 77.4, 126.1, 195.9, 270.7, and 317.6. Compared to MM-LQA, CD and FISTA have a fast drop in the performance. For instance, when dimension increases from 500 to 1000, the average number of iterations for CD are 35.7 and 119.8, and that of FISTA are 195.7 and 959.3, while that of MM-LQA are 77.4 and 126.1.
3.2.3 Sensitivity to the regularization parameter
It is interesting to observe that ADMM and FISTA exhibit opposite phenomena with respect to the regularization parameter λ. For large λ, ADMM shows slow convergence. For example, in p=2500, Σ=Σ4 (Table 3), the iteration count increases steeply from 342.6, to 1030.3, and to 2213.9 as λ grows from 10 to 100, and to 1000. In contrast, FISTA needs less iterations to converge for large λ. For instance, as λ increases from 0.1 to 1000, iterations are diminishing as 1152791.5, 739028.0, 230375.9, 39686.1, and 13904.6. To understand these phenomena, look at the updating mechanism for ADMM in (8). The part related to regularization parameter is the soft-thresholding operation for updating the added primal variable γ, meant to be equal to the original primal variable β. However, the amount of shrinkage to zero increases as λ increases, pulling β and γ apart. Since the dual variable α converges only when difference of β and γ is extremely small, large λ hinders convergence of α. On the contrary, updating mechanism in (5) only involves the β. Since a threshold is proportional to λ in soft-thresholding operator, large λ makes many of the components of β to zero. That is, less iteration is needed for large λ.
3.2.4 Accuracy
In terms of the minimized objective function value, MM-LQA obtains the smallest objective function value for most cases considered and the other algorithms have slightly larger values than MM-LQA. In high dimensions (p=500, 2500), the objective function values FISTA significantly differ from those of the other algorithms even though they satisfy the convergence criterion. For instance, the objective function value of FISTA is 7.33 % larger than that of MM-LQA for p=2500, Σ=Σ4, and λ=10 (Table 3), and 8.8 % larger for p=500, Σ=Σ4, and λ=0.1 (Table 2). The inaccuracy of FISTA becomes worse as the condition number of the population covariance matrix gets larger. In Table 1, the maximum relative differences of FISTA are 1.06 % with Σ1, 1.01 % with Σ2, 4.97 % with Σ3, and 7.33 % with Σ4.
3.2.5 Computation time
In lieu of Figs. 1–3, in which the number of inner-iterations for MM-LQA and ADMM is taken into account, CD performs the best for Σ1 and Σ2 and MM-LQA does the best for Σ3 and Σ4. In terms of computation times, however, MM-LQA is faster than CD for all the covariance matrices and dimensions except for the cases of Σ1 and Σ2 and p=100,500,2500 at λ=10,100 (Tables 1–3). Note that CD is at most 1.12 second faster than MM-LQA for the case of Σ2 and λ=100 (Table 3).
3.2.6 Oscillation of ADMM
What is noticeable in Figs. 1–3 is the oscillation of the relative error for ADMM. This oscillatory pattern is due to that ADMM allows β (k)≠γ (k), leaving the objective function infeasible before a sufficient number of iterations.
3.2.7 Non-convergence
We found that CD does not converge when λ is small and condition number (κ) of covariance matrix is large (Table 3). For p=2500, Σ=Σ4, CD does not satisfy the convergence criterion even after 2 million iterations. To see the pattern, we investigated similar cases. For p=2500, Σ=Σ2,Σ3, λ=0.1 (Table 3), CD algorithm spends 189482.4 and 304744.0 iterations to converge. Furthermore, for p=2500, Σ=Σ4, λ=100 (Table 3), it needs 124570.6 iterations to converge. A similar phenomenon is found in low dimensions (p=100, 500). For p=500, Σ=Σ4, λ=0.1, 1 (Table 2), 1032160.8 and 231004.6 iterations are needed to converge, and 90007.4, 39520.6 iterations are needed for p=100, Σ=Σ4, λ=0.1, 1 (Table 1).
4 Application to cancer biomarker discovery
In this section, we apply the Lasso regression model to discover possible biomarkers for the lung cancer. Lung cancer is the leading cause of death from cancer and notorious for low survival rate; its 5-year survival rate is approximately 15 % [14]. Moreover, in the early stage, progression of the lung cancer varies greatly between patients. To make a proper treatment, it is important to classify the progression and metastasis for individual patient. Recently, cancer is considered as a disease of genomic alterations. Many prognostic or predictive biomarkers have been identified, which allow calculation of risk and precise classification of individual lung cancer patients [22]. For more accurate prognosis or treatments, it is useful to discover genes regulated by known biomarkers of lung cancer [5, 7]. The Lasso can be a useful tool for this purpose. To see the feasibility, we analyzed the gene expression dataset from the Lung Cancer Consortium [20].
4.1 Method
The dataset measures the gene levels in 442 lung cancer adenocarcinoma patients using the Affymetrix U133A platform. For preprocessing, we applied the robust multiarray average (RMA) algorithm and quantile-quantile normalization [13]. All gene expression values were log2 transformed. The Entrez IDs were used to map genes across microarray platforms. The annotated genes from this probe set, and their expression values were calculated using the average values of related probe sets.
To focus on the survival-related genes, we used univariate Cox regression to identify the gene sets that are significantly correlated with a patient’s overall survival time. In order to deal with the multiple comparison issue, we controlled the false discovery rate (FDR) [3], by fitting p-values using a Beta-Uniform model [18]. We identified a subgroup of 3093 genes whose expression levels had been shown to be associated with patients’ overall survival time (p-value < 0.0663, with estimated FDR <0.2).
Among the identified 3093 genes, the ROS1 gene is known as a biomarker for lung cancer [5, 7]. We consider the expression levels of the ROS1 gene as a response variable and the expression levels of the other 3092 genes as covariates. Both the response variable and covariates were standardized.
We chose the optimal regularization parameter λ that minimizes the BIC [19]
where I(⋅) denotes the indicator function, and \(\hat {\beta }_{\lambda }\) refers to the estimated Lasso coefficients with the regularization parameter λ.
4.2 Results
For the purpose of the comparison of the computational algorithms, we measured the total computation time including the selection of λ using the BIC. As mentioned in Section 2.1, the warm start strategy improves the overall computational efficiency when the solution path is sought. The results in Table 5 are made with both the warm and cold start strategies. In this Table, we see that MM-LQA shows the best performance both in convergence and computation time with the cold start strategy. On the other hand, CD with active shooting becomes the fastest when the warm start strategy is employed. This is because the active shooting strategy greatly reduces the number of variables to update for a high level of regularization.
We also plot the computation time for each tuning parameter in Fig. 4 for both strategies. It can be seen that the warm start strategy makes every algorithm faster. With increasing sparsity level, the speed of CD, CD-act, and FISTA increase; that of MM-LQA is almost static; and that of ADMM becomes worse.
In computing the BIC, each algorithm produced slightly different numbers due to the difference in convergence properties; see Fig. 5. Based on these plots, we chose λ=60 since this value is close enough to the respective optima for all the algorithms. CD, FISTA, and ADMM selected exactly the same 24 variables, while MM-LQA selected two additional variables. We report the identified 24 genes and their regression coefficients in Table 6. Among the identified genes, the HLF, the PTPN13, the SFTPD, and the SLC34A2 genes were reported as lung cancer-related or lung injury-related genes in the literature [15, 25, 29]. Interestingly, these genes exhibit relatively strong interaction with ROS1 than the other genes in Table 6.
In addition, we investigated the functionality of the 3092 genes and their association with lung-related diseases using the DAVID, a well-known gene annotation tool [10]. From this investigation we found that there are 149 genes reported to be associated with lung diseases. If we consider this as the gold standard, consequently, the recall (sensitivity) and accuracy of our analysis can be said to be 2.68 % and 94.66 %, respectively. Yet, the full functionality of the 3092 genes has not been completely understood and much research efforts are currently being made in order to establish their functions and association with the diseases. To the best of of our knowledge, the genes reported in our manuscript are the only genes known to be related with the lung cancer or lung injuries such as the pulmonary embolism. We also would like to remark that our regression analysis can only find genes associated with the known oncogene ROS1, and it is unlikely that all the genes associated with lung diseases are necessarily associated with this gene. This may explain the low true positive rate of the our study.
5 Discussion
We have compared numerical performance of the CD, MM-LQA, FISTA, and ADMM algorithms for solving high-dimensional Lasso problems. Among these, MM-LQA show stable performance in both convergence and computation time. Its convergence speed does not depend much on the numerical condition of the data matrix, or the correlations between the covariates, both of which can be high in the high-dimensional setting. This is because a preconditioner of PCG employed in the inner iteration of the algorithm reduces the effect of ill-conditioning of the matrix. We used a simple preconditioner; use of a tailored preconditioner specific to the application may yield a better performance. Still, relative insensitivity to the choice of the preconditioner makes MM-LQA an attractive option.
Similar to MM-LQA, ADMM is also immune to the covariance structure and the dimension-to-sample size ratio. However, its lack of the descent property raises a caution when applying this to problems that needs precision.
CD is simple to implement and can be considerably accelerated by the active shooting algorithm. With either relatively high level of regularization or the warm start strategy, CD is competitive when the covariance matrix is well-conditioned, but it may take too long to produce the solution when the covariance matrix is ill-conditioned.
FISTA has the fastest per-iteration computation time, but is susceptible to the regularization parameter. In addition, the inaccuracy of the solution is a weak point, and this is worsened with ill-conditioned covariance matrices.
5.1 Active set strategies
While the focus of this paper is a comparative study of the considered algorithms per se, in practice these algorithms can be combined with an active set strategy, which reduces the number of variables before the algorithm runs by setting the coefficients of the discarded variables to zero. For example, the CD algorithm with the active shooting strategy (Algorithm 2) can be considered an instance of this approach specialized to the CD algorithm. On the other hand, the “strong rules” [24] are based on the Karush-Kuhn-Tucker (KKT) optimality conditions and are independent of a particular algorithm. Thus the strong rules can be applied to all of the algorithms considered in order to reduce computation time.
To see the potential of this active set strategy, we have conducted an additional numerical study with and without a strong rule for p=2500, n=500 and Σ=Σ4 defined in Section 3.1.1. We consider the “basic strong rule”, as our interest is a fixed value of the tuning parameter λ. The basic strong rule discards variables satisfying the condition \(|{X_{j}^{T}} y| < 2 \lambda - \lambda _{\max }\), where \(\lambda _{\max } = \max _{i} |{X_{i}^{T}} y|\) is that with which all the coefficients are zero. Because the basic strong rule is valid for λ max/2<λ<λ max, we consider two large tuning parameters, 0.75λ max and 0.9λ max, which lead highly sparse estimates. We report the average computation times, objective function values, sparsity levels, and number of discarded variables by the basic strong rule in Table 7, which indicates that the basic strong rule indeed improves computational efficiency of all the algorithms.
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, pp 267–281
Beck A, Teboulle M (2009) A Fast Iterative Shrinkage-Thresholding Algorithm fo Linear Inverse Problems. SIAM J. Imaging Sciences, doi:10.1137/080716542
Bejamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Ser B 57(1):289–300
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122
Cagle PT, Allen TC, Olsen RJ (2013) Lung cancer biomarkers: present status and future developments. Arch Pathol Labor Med 137(9):1191–1198
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
El-Telbany A, Ma PC (2012) Cancer genes in lung cancer: racial disparities: are there any? Genes Cancer 3:467–480
Friedman J, Hastie T, Hofling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):307–332
Gemmeke JF, Hamme HV, Cranen B, Boves L (2010) Compressive sensing for missing data imputation in noise robust speech recognitions. J Sel Topics Signal Process 4(2):272–287
Huang DW, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nat Protoc 4(1):44–57
Hunter DR, Lange K (2000) Quantile regression via an mm algorithm. Journal of Computational and Graphical Statistics pp 60–77
Hunter DR, Li R (2005) Variable selection using mm algorithms. Ann Stat 33(4):1617–1642
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP (2003) Summaries of affymetrix genechip probe level data. Nucleic Acids Res 31(4):e15
Jemal A, Siegel R, Xu J, Ward E (2010) Cancer statistics. Cancer J Clin 60(5):277–300
Kati C, Alacam H, Duran L, Guzel A, Akdemir HU, Sisman B, Sahin C, Yavuz Y, Altintas N, Murat N, Okuyucu A (2014) The effectiveness of the serum surfactant protein d (sp-d) level to indicate lung injury in pulmonary embolism. Clin Lab 60(9):1457–1464
Parikh N, Boyd S (2013) Proximal algorithms. Found Trends Optim 1(3):123–231
Peng J, Wang P, Zhou N, Zhu J (2012) Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association
Pounds S, Morris SW (2003) Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19(10):1236–1242
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour L, Naoki K, Pennell N, Weir B, Verhaak R, Ladd-Acosta C, Golub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG (2008) Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 14(8):822–827
Shewchuk JR (1994) An introduction to the conjugate gradient method without the agonizing pain. Carnegie Mellon University, Pittsburgh, PA
Tang H, Xiao G, Behrens C, Schiller J, Allen J, Chow CW, Suraokar M, Corvalan A, Mao J, White MA, Wistuba II, Minna JD, Xie Y (2013) A 12-gene set predicts survival benefits from adjuvant chemotherapy in non-small cell lung cancer patients. Clin Cancer Res 19(6):1577–1586
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B 58:267–288
Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ (2012) Strong rules for discarding predictors in lasso-type problems. J Royal Stat Soc Ser B (Stat Methodol) 74(2):245–266
Woenckhaus M, Klein-Hitpass L, Grepmeier U, Merk J, Pfeifer M, Wild P, Bettstetter M, Wuensch P, Blaszyk H, Hartmann A, et al. (2006) Smoking and cancer-related gene expression in bronchial epithelium and non-small-cell lung cancers. J Pathol 210(2):192–204
Wright J, Yang AY, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
Wu TT, Lange K (2008) Coordinate descent algorithms for lasso penalizaed regression. Ann Appl Stat 2 (1):224–244
Yang AY, Zhou Z, Ganesh A, Shankar SS, Ma Y (2013) Fast l1-minimization algorithms for robust face recognition. IEEE Trans Image Process 22:8
Yu D, Son W, Lim J, Xiao G (2015) Statistical completion of partially identified graph with application to estimation of gene regulatory network. Biostatistics 16(4):670–685
Acknowledgments
Donghyeon Yu was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP, No. 2015R1C1A1A02036312). Joong-Ho Won was supported by the National Research Foundation of Kor ea (NRF) grant funded by the Korean government (MSIP, Nos. 2013R1A1A1057949 and 2014R1A4A1007895).
Author information
Authors and Affiliations
Corresponding author
Appendix:
Appendix:
1.1 A Preconditioned conjugate gradient (PCG) method
Conjugate gradient (CG) is a method to resolve positive definite linear equations, A x = b, applied to sparse system that has too large data to solve with Cholesky Decomposition. Instead of directly solving linear equations, this method is to minimize the following function f(x),
For a positive definite A, two nonzero vectors u, v are said to be conjugate with respect to A, if they satisfy
If P is defined as
it means the set of n number of mutually conjugate directional vectors. Thus, the set P becomes a basis of \(\mathbb {R}^{n}\) and x is represented in the form of
By multiplying both sides by matrix A, b is decomposed by
Multiplying an arbitrary directional vector p k ∈P,
Accordingly, the explicit form of α k can be derived as followed,
If mutually conjugate directional vectors are not given, conjugate gradient (CG) solves the problem iteratively. Set x 0 as an initial value of x, and a linear equation given by
becomes a target function to solve. If we regarding r k = b−A x k as k-th residual, r k becomes a negative gradient of convex function x = x k , ∇f(x) given by,
which means that conjugate gradient method moves toward the direction of r k . Since all directional vectors should satisfy the condition that all vectors are conjugate with respect to A, then k-th direction p k is given by,
Following this direction, next value of x is updated as followed,
where
Convergence rate of conjugate gradient method depends on condition number of A and especially eigenvalues of A [21]. Accordingly, A x = b problem can be regarded same as linear equation that multiply by inverse matrix of preconditioner given by
In choosing an appropriate preconditioner, it should satisfy some necessary conditions.
-
M is both symmetric and positive definite matrix.
-
M −1 A is well conditioned and hardly has extreme eigenvalues.
-
M x = b is easy to solve.
Widely used preconditioners that satisfy these conditions are the followings;
-
1)
Diagonal: M=diag(1/A 11,...,1/A n n ),
-
2)
Incomplete(approximate) Cholesky factorization: \(M=\hat {A}^{-1}\), where \(\hat {A}=\hat {L}\hat {L}^{T}\).
Rights and permissions
About this article
Cite this article
Kim, B., Yu, D. & Won, JH. Comparative study of computational algorithms for the Lasso with high-dimensional, highly correlated data. Appl Intell 48, 1933–1952 (2018). https://doi.org/10.1007/s10489-016-0850-7
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-016-0850-7