
1 Introduction

Recent applications often involve high dimensionality with a limited number of data points, leading to a number of covariance parameters that greatly exceeds the number of observations. Examples include marketing, e-commerce, and warehouse data in business; microarray and proteomics data in genomics and health sciences; and biomedical imaging, functional magnetic resonance imaging, tomography, signal processing, high-resolution imaging, and functional and longitudinal data. In biological sciences, one may want to classify diseases and predict clinical outcomes using microarray gene expression or proteomics data, in which hundreds of thousands of expression levels are potential covariates, but there are typically only tens or hundreds of subjects. Hundreds of thousands of single-nucleotide polymorphisms are potential predictors in genome-wide association studies. The dimensionality of the variable space grows rapidly when interactions of such predictors are considered.

Large-scale data analysis is also a common feature of many problems in machine learning, such as text and document classification and computer vision. For a \(p \times p\) covariance matrix \({\varvec{\Sigma }} \), there are \(p(p+1)/2\) parameters to estimate, yet the sample size n is often small. In addition, the positive-definiteness of \({\varvec{\Sigma }} \) makes the problem even more complicated. When \(n > p\), the sample covariance matrix is positive-definite and unbiased, but as the dimension p grows, i.e., when \(p \gg n\), which is the case in this study, the sample covariance matrix becomes singular, and hence unstable, and can fail to be consistent.

Due to its importance, there has been an active line of work on efficient optimization methods for solving the \(\ell _{1}\)-regularized Gaussian maximum likelihood estimation (MLE) problem. Among these methods, we cite the projected subgradient method [7], alternating linearization, the inexact interior point method, the greedy coordinate descent method, and G-LASSO, which is a block coordinate descent method. For typical high-dimensional statistical problems, such optimization methods typically suffer from sub-linear rates of convergence. This is too expensive for the Gaussian MLE problem, since the number of matrix entries scales quadratically with the number of nodes.

Sparse modeling has been widely used to deal with high dimensionality. The main assumption is that the p-dimensional parameter vector is sparse, with many components being exactly zero or negligibly small. Such an assumption is crucial for identifiability, especially when the sample size is relatively small. Although the notion of sparsity gives rise to biased estimation in general, it has proved to be effective in many applications. In particular, variable selection can increase the estimation accuracy by effectively identifying important predictors, and can improve the model interpretability.

To estimate the inverse of the covariance matrix, the following thresholding approach has been proposed:

$$\begin{aligned} \hat{{\mathbf {M}}} = \arg \min _{{\mathbf {M}}} \frac{1}{2}\Vert {\hat{{\varvec{\Sigma }} }} {\mathbf {M}} - {{\mathbf {I}}} \Vert _{F}^{2} + \lambda \Vert {\mathbf {M}}\Vert _{1}, \end{aligned}$$
(1)

where \({\mathbf {M}}\) is the precision matrix, i.e., the inverse of \({\varvec{\Sigma }} \). This formulation emphasizes the fact that the solution may not be unique. Such nonuniqueness always occurs here, since \({\hat{{\varvec{\Sigma }} }} = {{\mathbf {X}}}^T {{\mathbf {X}}}\), where \({{\mathbf {X}}} \in {{\mathbb {R}}}^{n\times p}\) is the sample data matrix and rank \(({\mathbf {X}}) \le n \ll p\), which is our case. However, there is no guarantee that the thresholding estimator is always positive definite [5]. Although the positive definite property is guaranteed in the asymptotic setting with high probability, the actual estimator can be an indefinite matrix, especially in real data analysis.

To overcome these limitations, i.e., to achieve sparsity and positive semi-definiteness simultaneously while adding structure or constraints on the coefficients of the precision matrix (in a graphical model setting, for example), a natural solution is to add the positive semi-definite constraint and to penalize the \(\ell _{1}\) norm of the product of a matrix \({\mathbf {D}} \in {\mathbb {R}}^{p \times p}\) with the precision matrix estimate \({\mathbf {M}}\), i.e.,

$$\begin{aligned} \hat{{\mathbf {M}}} = \arg \min _{{\mathbf {M}}} \frac{1}{2}\Vert {\hat{{\varvec{\Sigma }} }} {\mathbf {M}} - {{\mathbf {I}}} \Vert _{F}^{2} + \lambda \Vert {\mathbf {D}} {\mathbf {M}}\Vert _{1}. \end{aligned}$$
(2)

Before discussing our contribution in this direction, let us introduce G-LASSO, a popular penalized approach for estimating the precision matrix.

1.1 Tikhonov and G-LASSO

The log-likelihood of a sample \({{\mathbf {x}}}_i\) in a Gaussian model with mean \(\varvec{\mu }\) and precision matrix \({\mathbf {M}}\) is given by

$$\begin{aligned} {\mathcal {L}}({\mathbf {x}}_{i}, \varvec{\mu }, {\mathbf {M}} ) = \log \det {\mathbf {M}} - ({\mathbf {x}}_{i}-\varvec{\mu })^{T} {\mathbf {M}} ({\mathbf {x}}_{i}-\varvec{\mu }), \end{aligned}$$
(3)

and since the average log-likelihood depends on the data only through the sample covariance matrix \({\hat{{\varvec{\Sigma }} }}\), we define

$$\begin{aligned} {\mathcal {L}}({\mathbf {M}}, {\hat{{\varvec{\Sigma }} }}) = \frac{1}{n} \sum _{i} {\mathcal {L}}({\mathbf {x}}_{i}, \varvec{\mu }, {\mathbf {M}}) = \log \det {\mathbf {M}} - \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle . \end{aligned}$$
(4)

A very well known technique for the estimation of precision matrices is Tikhonov regularization. Given a sample covariance matrix \({\hat{{\varvec{\Sigma }} }}\succeq {{\mathbf {0}}}\), the Tikhonov-regularized problem is defined as

$$\begin{aligned} \max _{{\mathbf {M}}\succ 0} \quad {\mathcal {L}}({\mathbf {M}}, \hat{{\varvec{\Sigma }} }) - \lambda ~{\text {Trace}} ({\mathbf {M}}), \end{aligned}$$
(5)

or, equivalently,

$$\begin{aligned} \max _{{\mathbf {M}}\succ 0} \quad \log \det {\mathbf {M}} - \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle - \lambda ~{\text {Trace}} ({\mathbf {M}}). \end{aligned}$$
(6)

This can be seen as imposing a matrix variate Gaussian prior on \({\mathbf {M}}^{1/2}\), with both row and column covariance matrices equal to \({\mathbf {I}}\), to make the solution well defined, assuming \({\mathbf {M}}\) is positive definite. The optimal solution of (6) is given by

$$\begin{aligned} {{\mathbf {M}}} = { ( \hat{{\varvec{\Sigma }} } + \lambda {{\mathbf {I}}} ) }^{-1}. \end{aligned}$$
(7)
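For illustration, the following is a minimal NumPy sketch of the closed-form Tikhonov estimate (7); the function name, the toy data, and the value of \(\lambda \) are illustrative assumptions, not part of the original text.

```python
import numpy as np

def tikhonov_precision(sigma_hat, lam):
    """Closed-form Tikhonov-regularized precision estimate (7): (Sigma_hat + lam*I)^{-1}."""
    p = sigma_hat.shape[0]
    return np.linalg.inv(sigma_hat + lam * np.eye(p))

# Illustrative usage with p >> n, where the sample covariance itself is singular.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))      # n = 50 observations, p = 200 variables
sigma_hat = X.T @ X / X.shape[0]        # empirical covariance matrix
M_tik = tikhonov_precision(sigma_hat, lam=0.1)
```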

Since the \(\ell _{1}\)-norm is the tightest convex relaxation of the cardinality of a matrix, several methods based on an \(\ell _{1}\)-regularized log-determinant program have been proposed. This approach is called G-LASSO, or Graphical LASSO, after the algorithm that efficiently computes the solution,

$$\begin{aligned} \max _{{\mathbf {M}}\succ 0} \quad \log \det {\mathbf {M}} - \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle - \lambda \Vert {\mathbf {M}}\Vert _{1}. \end{aligned}$$
(8)

Similar to Eq. (8), Bien and Tibshirani in [3] suggested adding to the likelihood a LASSO penalty of the form \(\Vert {\mathbf {D}} \odot {{\varvec{\Sigma }} } \Vert _1\), where \({\mathbf {D}}\) is an arbitrary matrix with non-negative elements and \(\odot \) denotes element-wise multiplication, i.e., the Hadamard product. Applied to the precision matrix, the regularization problem becomes

$$\begin{aligned} \min _{{\mathbf {M}}\succ 0}\quad - \log \det {\mathbf {M}} + \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle + \lambda \Vert {\mathbf {D}} \odot {\mathbf {M}}\Vert _{1}, \end{aligned}$$
(9)

and one may tighten the constraint \({\mathbf {M}}\succ 0\) to \({\mathbf {M}}\succ \delta {\mathbf {I}}\) for some \(\delta > 0\), so that the above equation becomes

$$\begin{aligned} \min _{ \mathop { {\mathbf {M}}\succ \delta {\mathbf {I}}}\limits _{\delta > 0 }} \quad - \log \det {\mathbf {M}} + \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle + \lambda \Vert {\mathbf {D}} \odot {\mathbf {M}}\Vert _{1}. \end{aligned}$$
(10)

The asymptotic properties of the estimator have been studied in [13]. Cai and Zhou in [5] proposed a constrained \(\ell _{1}\)-minimization procedure for estimating sparse precision matrices by solving the optimization problem

$$\begin{aligned} \min \quad \Vert {\mathbf {M}}\Vert _{1} \quad \text {subject to} \quad \Vert \hat{{\varvec{\Sigma }} }{\mathbf {M}} - {\mathbf {I}}\Vert _{\infty } \le \lambda . \end{aligned}$$
(11)

Problem (11) can be decomposed into a series of Dantzig-selector-type problems. In the above, the symmetry condition is not enforced on \({\mathbf {M}}\), and as a result the solution is not symmetric in general. Cai and Zhou in [5] established the rates of convergence of CLIME, the algorithm derived from the above regularization problem, under both the entry-wise \(\ell _{1}\)-norm and the Frobenius norm, and proved that the CLIME estimator is positive definite with high probability. However, in practice there is no guarantee that CLIME is always positive definite, especially when applied to real applications.

This warrants our study and our contributions.

1.2 Summary of the Paper and Contributions

In this paper, we propose a new way of estimating large sparse precision matrices, a problem that is significant in machine learning, computer vision, and data mining. This requires seeking a sparse solution while also ensuring positive definiteness. We therefore present an optimization model that encourages both sparsity and positive definiteness, and propose an implementation based on an inner-outer alternating direction method of multipliers (ADMM). The method is validated using numerical experiments.

The emphasis of the paper is on introducing a new criterion that ensures the positive-definiteness of the precision matrix by adding a tuning parameter \(\delta >0\) to the constraints. This additional constraint guards against estimates that are merely positive semi-definite, i.e., singular. We add structure on the coefficients of the precision matrix and derive an efficient inner-outer ADMM to obtain an optimal solution. We perform the ADMM steps, a variant of the standard augmented Lagrangian method that uses partial updates, but with three innovations that allow us to finesse the caveats detailed above. We also show that the proposed algorithm converges with a very relaxed stopping criterion in the inner iteration. It is scalable and has better accuracy properties than existing methods.

2 The Proposed Method

Given a data matrix \({{\mathbf {X}}} \in {{\mathbb {R}}}^{n\times p}\), i.e., a sample of n realizations from a p-dimensional Gaussian distribution with zero mean and covariance matrix \({\varvec{\Sigma }} \in {{\mathbb {R}}}^{p\times p}\), with \(p \gg n\), the goal is to estimate the precision matrix \({{\mathbf {M}}}\), i.e., the inverse of the covariance matrix \({{\varvec{\Sigma }} }\). Let \({\hat{{\varvec{\Sigma }} }} = {{\mathbf {X}}}^T {{\mathbf {X}}}\) be the empirical covariance matrix. The Graphical LASSO problem minimizes the \(\ell _1\)-regularized negative log-likelihood

$$\begin{aligned} f( {{\mathbf {M}}} ) = - \log \det {{\mathbf {M}}} + \langle {{\mathbf {M}}}, {\hat{{\varvec{\Sigma }} }} \rangle + \lambda \Vert {\mathbf {M}}\Vert _{1}, \end{aligned}$$
(12)

subject to the constraint that \({{\mathbf {M}}} \succ {{\mathbf {0}}}\). This is a semi-definite programming problem in the variable \({\mathbf {M}}\). Taking the (sub)derivative of \(f( {\mathbf {M}})\) with respect to \({\mathbf {M}}\) and setting it to zero yields the optimality condition

$$\begin{aligned} -{\mathbf {M}}^{-1} + {\hat{{\varvec{\Sigma }} }} + \lambda ~ {\varvec{\Gamma }} = {{\mathbf {0}}}, \end{aligned}$$
(13)

where \({\varvec{\Gamma }} \) is a matrix of componentwise signs of \({\mathbf {M}}\), i.e.,

$$\begin{aligned} {\left\{ \begin{array}{ll} \gamma _{jk} = {\text {sign}} ( m_{jk} ) &{} {\text {if}} \,\, m_{jk} \ne 0 \\ \gamma _{jk} \in [-1, 1] &{} {\text {if}}\,\, m_{jk} = 0 \end{array}\right. } \end{aligned}$$
(14)

Since every entry of \({\varvec{\Gamma }} \) lies in \([-1, 1]\), taking the entry-wise \(\ell _{\infty }\)-norm of (13) yields

$$\begin{aligned} \Vert {{\mathbf {M}}}^{-1} - {\hat{{\varvec{\Sigma }} }} \Vert _{\infty } \le \lambda . \end{aligned}$$
(15)

Therefore, our interest is to construct a sparse precision matrix estimator via convex optimization that performs well in high-dimensional settings and is positive definite in finite samples, see e.g. [14]. Positive definiteness is desirable when the covariance estimator is applied to methods for supervised learning. Many of these methods either require a positive definite covariance estimator, or use optimization that is convex only if the covariance estimator is nonnegative definite, e.g., quadratic discriminant analysis and covariance regularized regression, see e.g. [16].

We define our estimator as a solution to the following problem

$$\begin{aligned} \hat{{\mathbf {M}}} = \arg \min _{\mathop { {\mathbf {M}}\succeq \delta {\mathbf {I}}}\limits _{\delta > 0 } } \frac{1}{2}\Vert \hat{{\varvec{\Sigma }} } {\mathbf {M}} - {\mathbf {I}}\Vert _{F}^{2} + \lambda { \Vert {\mathbf {D}} \odot {\mathbf {M}} \Vert }_1, \end{aligned}$$
(16)

where \(\hat{{\varvec{\Sigma }} }\) is the empirical covariance matrix and \({\mathbf {D}} = (d_{ij})_{1\le i,j \le p}\) is an arbitrary matrix with non-negative elements. The matrix \({\mathbf {D}}\) can take different forms: it can be the matrix of all ones, or it can be a matrix with zeros on the diagonal to avoid shrinking the diagonal elements of \({\mathbf {M}}\). Furthermore, we can take \({\mathbf {D}}\) with elements \(d_{ij} = 1_{i \ne j}|\hat{{\mathbf {M}}}_{ij}^{\text {init}}|\), where \(\hat{{\mathbf {M}}}^{\text {init}}\) is an initial estimator of \({\mathbf {M}}\). The latter choice of \({\mathbf {D}}\) corresponds to a precision matrix analogue of the Adaptive LASSO penalty. A sketch of these choices is given below.
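A minimal sketch of these three choices of \({\mathbf {D}}\) (the function name `weight_matrix` and its arguments are illustrative; `M_init` stands for an initial estimate \(\hat{{\mathbf {M}}}^{\text {init}}\)):

```python
import numpy as np

def weight_matrix(p, kind="ones", M_init=None):
    """Non-negative weight matrix D used in the penalty lambda * ||D .* M||_1."""
    if kind == "ones":                      # penalize every entry equally
        return np.ones((p, p))
    if kind == "offdiag":                   # zero diagonal: do not shrink diag(M)
        return np.ones((p, p)) - np.eye(p)
    if kind == "adaptive":                  # d_ij = 1_{i != j} * |M_init_ij|
        D = np.abs(M_init).astype(float)
        np.fill_diagonal(D, 0.0)
        return D
    raise ValueError("unknown kind: " + kind)
```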

To derive our inner-outer ADMM algorithm, as in [1], we will first introduce a new variable \({\varvec{\Theta }} \in {{\mathbb {R}}}^{p\times p}\) and an equality constraint as follows

$$\begin{aligned} \hat{{\mathbf {M}}} = \arg \min _{ {\varvec{\Theta }} , {\mathbf {M}}} \left\{ \frac{1}{2}\Vert \hat{{\varvec{\Sigma }} } {\mathbf {M}} - {\mathbf {I}}\Vert _{F}^{2} + \lambda { \Vert {\mathbf {D}} \odot {\mathbf {M}} \Vert }_1 \quad | \quad {\mathbf {M}}= {\varvec{\Theta }} , \; {\varvec{\Theta }} \succeq \delta {\mathbf {I}} \right\} . \end{aligned}$$
(17)

The solution to (17) gives the solution to (16). To deal with the problem (17), we have to minimize its augmented Lagrangian function, \({{{\mathcal {L}}}}_{\varrho } ( {\varvec{\Theta }} , {\mathbf {M}}, {\varvec{\Lambda }} )\) for some given penalty parameter \(\varrho \), i.e.,

$$\begin{aligned} {{{\mathcal {L}}}}_{\varrho } ( {\varvec{\Theta }} , {\mathbf {M}}, {\varvec{\Lambda }} ) = \frac{1}{2} \Vert \hat{{\varvec{\Sigma }} } {\mathbf {M}}- \mathbf {I} \Vert _F^2 + \lambda {\Vert \mathbf {D} \odot {\mathbf {M}}} \Vert _1 - \langle {\varvec{\Lambda }} , {\varvec{\Theta }} - {\mathbf {M}}\rangle + \frac{1}{2} \varrho {\Vert {\varvec{\Theta }} - {\mathbf {M}}}\Vert _F^2, \end{aligned}$$
(18)

where \({\varvec{\Lambda }} \in {{\mathbb {R}}}^{p\times p}\) is the matrix of Lagrange multipliers associated with the equality constraint. Hence, at iteration k, the ADMM algorithm consists of three steps, namely the \({\varvec{\Theta }} \)-Step, the \({\mathbf {M}}\)-Step, and the dual-update step.

First step: In the first step of the ADMM algorithm, we fix \({\mathbf {M}}\) and \({\varvec{\Lambda }} \) and minimize the augmented Lagrangian function over \({\varvec{\Theta }} \) to get

$$\begin{aligned} {\varvec{\Theta }} ^{k+1} = \arg \min _{{\varvec{\Theta }} \succeq \delta {\mathbf {I}}} {{{\mathcal {L}}}}_{\varrho } ( {\varvec{\Theta }} , {\mathbf {M}}^{k}, {\varvec{\Lambda }} ^{k} ) = P_{{{\mathcal {S}}}_{\delta }} ( \bar{ {\mathbf {A}} } ), \end{aligned}$$

where \(P_{{{\mathcal {S}}}_{\delta }} ( \bar{ {\mathbf {A}} } )\) denotes the projection of the matrix \(\bar{ {\mathbf {A}} } = {\mathbf {M}}^k + \frac{1}{\varrho } {\varvec{\Lambda }} ^k \) onto the convex cone \({{\mathcal {S}}}_{\delta } = \{ {\varvec{\Theta }} : {\varvec{\Theta }} \succeq \delta {\mathbf {I}} \}\).

Lemma 1

Let \(P_{{{\mathcal {S}}}_{\delta }} ( \bar{ {\mathbf {A}} } )\) be the projection of the matrix \(\bar{ {\mathbf {A}} }\) onto the convex cone \({{\mathcal {S}}}_{\delta }\). If \(\bar{ {\mathbf {A}} }\) has the eigen-decomposition \(\bar{ {\mathbf {A}} } = {{\mathbf {V}}} ~{\text {Diag}} ( \lambda _1, \lambda _2, \cdots , \lambda _p )~ {{\mathbf {V}}}^T \), then

$$\begin{aligned} P_{{{\mathcal {S}}}_{\delta }} ( \bar{ {\mathbf {A}} } ) = {{\mathbf {V}}} ~{\text {Diag}} \big ( \max (\lambda _1, \delta ), \max (\lambda _2, \delta ), \cdots , \max (\lambda _p, \delta ) \big )~ {{\mathbf {V}}}^T. \end{aligned}$$

(19)

Proof

This result can be traced back to early statisticians, e.g. see [15], who noticed that if a matrix \({{\mathbf {A}}}\) has an eigendecomposition

$$\begin{aligned} {{\mathbf {A}}} = {{\mathbf {U}}} ~{\text {Diag}} ( \mu _1, \mu _2, \cdots , \mu _n )~ {{\mathbf {U}}}^T, \end{aligned}$$
(20)

where \(\mu _1 \ge \cdots \ge \mu _n\) are the eigenvalues of \({{\mathbf {A}}}\) and \({{\mathbf {U}}}\) is the corresponding orthonormal matrix of eigenvectors; then the projection of \({{\mathbf {A}}}\) onto the closed convex cone \({{\mathcal {S}}}_{0}\), formed by the set of positive semidefinite matrices, is

$$\begin{aligned} P_{{{\mathcal {S}}}_{0}} ( {{\mathbf {A}}} ) = {{\mathbf {U}}} ~{\text {Diag}} \big ( \max (\mu _1, 0), \max (\mu _2, 0), \cdots , \max (\mu _n, 0) \big )~ {{\mathbf {U}}}^T. \end{aligned}$$

(21)

Therefore, the extension to the cone \({{\mathcal {S}}}_{\delta }\) is straightforward.   \(\square \)

Since then this projection has been widely used, e.g. see [9]. Notice that the numerical cost of computing this projection is essentially that of computing the spectral decomposition of the matrix to project, i.e., \(\bar{{\mathbf {A}}}\).
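A minimal NumPy sketch of the projection of Lemma 1, assuming a symmetric input; as noted above, the cost is dominated by the spectral decomposition (the function name is illustrative):

```python
import numpy as np

def project_to_cone(A_bar, delta):
    """Projection of a symmetric matrix onto {Theta : Theta >= delta*I} (Lemma 1):
    clip the eigenvalues at delta and recombine with the eigenvectors."""
    eigval, eigvec = np.linalg.eigh(A_bar)              # spectral decomposition (symmetric input)
    return (eigvec * np.maximum(eigval, delta)) @ eigvec.T
```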

Second step: In the second step, we fix \({\varvec{\Theta }} \) and \({\varvec{\Lambda }} \) and minimize the augmented Lagrangian over \({\mathbf {M}}\), i.e.,

$$\begin{aligned} {\mathbf {M}}^{k + 1} = \arg \min _{{\mathbf {M}}} {{{\mathcal {L}}}}_{\varrho } ( {\varvec{\Theta }} ^{k + 1}, {\mathbf {M}}, {\varvec{\Lambda }} ^{k}), \end{aligned}$$
(22)

where this second step requires special care, since the concept of an inner iteration is introduced at this level of the procedure.

Third step: The final step of the ADMM algorithm is to update the dual variable \({\varvec{\Lambda }} \), i.e.,

$$\begin{aligned} {\varvec{\Lambda }} ^{k + 1} = {\varvec{\Lambda }} ^{k} - ({\varvec{\Theta }} ^{k + 1} - {\mathbf {M}}^{k + 1}). \end{aligned}$$
(23)

2.1 The Inner Iteration

It can be shown that

$$\begin{aligned} {\mathbf {M}}^{k + 1} = \arg \min _{{\mathbf {M}}} \frac{1}{2} \left\{ \Vert \hat{{\varvec{\Sigma }} }{\mathbf {M}} \Vert _{F}^{2} + \varrho \Vert {\mathbf {M}} \Vert _{F}^{2} - 2 \langle {\mathbf {M}}, \hat{{\varvec{\Sigma }} } + \varrho {\varvec{\Theta }} ^{k + 1} - {\varvec{\Lambda }} ^{k} \rangle \right\} + \lambda {\Vert {\mathbf {D}} \odot {\mathbf {M}} }\Vert _{1}, \end{aligned}$$

for which we are not able to derive a closed form for \({\mathbf {M}}\). To overcome this problem, we propose to derive a new ADMM to update \({\mathbf {M}}\) (the inner iteration). To do this, we reparametrize \({\mathbf {D}} \odot {\mathbf {M}}\) as \({\varvec{\Gamma }} \in {{\mathbb {R}}}^{p\times p}\), add the equality constraint \({\mathbf {D}} \odot {\mathbf {M}} = {\varvec{\Gamma }} \), and then minimize the following:

$$\begin{aligned} {\left\{ \begin{array}{lll} {\text {Minimize}} &{} \frac{1}{2} \left\{ \Vert \hat{{\varvec{\Sigma }} }{\mathbf {M}} \Vert _{F}^{2} + \varrho \Vert {\mathbf {M}} \Vert _{F}^{2} - 2 \langle {\mathbf {M}}, \hat{{\varvec{\Sigma }} } + \varrho {\varvec{\Theta }} ^{k + 1} - {\varvec{\Lambda }} ^{k} \rangle \right\} + \lambda {\Vert {\varvec{\Gamma }} }\Vert _{1}, &{} (24) \\ {\text {subject to}} &{} {\mathbf {D}} \odot {\mathbf {M}} = {\varvec{\Gamma }} . &{} (25) \end{array}\right. } \end{aligned}$$

The augmented Lagrangian \({{{\mathcal {K}}}}_{\varrho } ( {\mathbf {M}}, {\varvec{\Gamma }} , {\varvec{\Delta }})\) associated with this problem is

$$\begin{aligned} {{{\mathcal {K}}}}_{\varrho } ( {\mathbf {M}}, {\varvec{\Gamma }} , {\varvec{\Delta }})&= \frac{1}{2} \left\{ \Vert \hat{{\varvec{\Sigma }} }{\mathbf {M}} \Vert _{F}^{2} + \varrho \Vert {\mathbf {M}} \Vert _{F}^{2} - 2 \langle {\mathbf {M}}, \hat{{\varvec{\Sigma }} } + \varrho {\varvec{\Theta }} ^{k + 1} - {\varvec{\Lambda }} ^{k} \rangle \right\} \nonumber \\&+ \lambda \Vert {\varvec{\Gamma }} \Vert _{1} - \langle {\varvec{\Delta }}, {\varvec{\Gamma }} - {\mathbf {D}} \odot {\mathbf {M}} \rangle + \frac{1}{2} \varrho \Vert {\varvec{\Gamma }} - {\mathbf {D}} \odot {\mathbf {M}} \Vert _F^2 \end{aligned}$$
(26)

where \({\varvec{\Delta }}\in {{\mathbb {R}}}^{p\times p}\) is the matrix of Lagrange multipliers associated with the constraint. As before, the ADMM here, i.e., the inner iteration, consists of the following three steps:

$$\begin{aligned} {\left\{ \begin{array}{lll} {\mathbf {M}}_{k}^{j + 1} &{} = \arg \min \limits _{{\mathbf {M}}} {{{\mathcal {K}}}}_{\varrho } ( {\mathbf {M}}, {\varvec{\Gamma }} ^{j}, {\varvec{\Delta }}^{j}), &{} (27) \\ {\varvec{\Gamma }} ^{j + 1} &{} = \arg \min \limits _{{\varvec{\Gamma }} } {{{\mathcal {K}}}}_{\varrho } ( {\mathbf {M}}_{k}^{j + 1}, {\varvec{\Gamma }} , {\varvec{\Delta }}^{j}), &{} (28) \\ {\varvec{\Delta }}^{j + 1} &{} = {\varvec{\Delta }}^{j} - ( {\varvec{\Gamma }} ^{j + 1} - {\mathbf {D}} \odot {\mathbf {M}}_{k}^{j + 1}). &{} (29) \end{array}\right. } \end{aligned}$$

\(\underline{\mathrm{The\ intermediate}\ \mathbf M \text {-}\mathrm{Step}}\): It can be shown that \({\mathbf {M}}_{k}^{j + 1}\) is the minimizer of \({{{\mathcal {K}}}}_{\varrho }( {\mathbf {M}}, {\varvec{\Gamma }} ^{j}, {\varvec{\Delta }}^{j})\), i.e.,

$$\begin{aligned} \frac{ \partial }{ \partial {\mathbf {M}} } \left[ {{{\mathcal {K}}}}_{\varrho }( {\mathbf {M}}_{k}^{j + 1}, {\varvec{\Gamma }} ^{j}, {\varvec{\Delta }}^{j}) \right] = {\mathbf {0}}, \end{aligned}$$
(30)

which is equivalent to

$$\begin{aligned} \left( \hat{{\varvec{\Sigma }} }^{T} \hat{{\varvec{\Sigma }} } + \varrho {\mathbf {I}} \right) {\mathbf {M}}_{k}^{j + 1} + \varrho {\mathbf {D}} \odot {\mathbf {D}} \odot {\mathbf {M}}_{k}^{j+1} + \left( {\varvec{\Delta }}^{j} - \varrho {\varvec{\Gamma }} ^{j} \right) \odot {\mathbf {D}}\nonumber \\ - \left( \hat{{\varvec{\Sigma }} } + \varrho {\varvec{\Theta }} ^{k + 1} - {\varvec{\Lambda }} ^{k} \right) = {\mathbf {0}}, \end{aligned}$$
(31)

and therefore \({\mathbf {M}}_{k}^{j + 1}\) is characterized by the linear matrix equation (31). Solving it requires additional, but straightforward, computational effort at this level; this extra step can be considered as a start-up for the original ADMM algorithm, which is very important when dealing with complex problems and large datasets. In practice, this step is solved via a fixed-point iteration, see [2]; one possible splitting is sketched below.
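As an illustration, one simple splitting of (31) moves the Hadamard term to the right-hand side and iterates, reusing a single factorization of \(\hat{{\varvec{\Sigma }} }^{T} \hat{{\varvec{\Sigma }} } + \varrho {\mathbf {I}}\). The sketch below is a hypothetical scheme of this kind, not necessarily the iteration of [2], and its convergence depends on \(\varrho \) and \({\mathbf {D}}\); all names are illustrative.

```python
import numpy as np

def intermediate_M_step(sigma_hat, Theta, Lam, Gamma, Delta, D, rho,
                        M0=None, tol=1e-3, max_iter=100):
    """Solve (31) for M by a fixed-point iteration:
       (Sigma^T Sigma + rho*I) M = C - rho * (D*D) .* M, iterating on the right-hand side."""
    p = sigma_hat.shape[0]
    A_inv = np.linalg.inv(sigma_hat.T @ sigma_hat + rho * np.eye(p))   # factor once, reuse
    C = (sigma_hat + rho * Theta - Lam) - (Delta - rho * Gamma) * D
    M = np.zeros((p, p)) if M0 is None else M0.copy()
    for _ in range(max_iter):
        M_new = A_inv @ (C - rho * (D * D) * M)
        if np.linalg.norm(M_new - M, 'fro') <= tol * max(1.0, np.linalg.norm(M, 'fro')):
            return M_new
        M = M_new
    return M
```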

\(\underline{\mathrm{The\ intermediate}\ {\varvec{\Gamma }} \text {-}\mathrm{Step}}\): To deal with this \({\varvec{\Gamma }} \)-Step, we define an entry-wise soft-thresholding rule for all the off-diagonal elements of a matrix \({\mathbf {A}}\) as

$$\begin{aligned} {{\mathcal {S}}} ( {\mathbf {A}}, \kappa ) = \big [ s(a_{jl}, \kappa ) \big ]_{1 \le j, l \le p}, \end{aligned}$$

(32)

with \(s(a_{jl}, \kappa ) = \text {sign}(a_{jl})\max (|a_{jl}| - \kappa , 0) ~1_{\{j \ne l\}}\), see e.g. [6, 11, 12, 17]. Then the \({\varvec{\Gamma }} \)-Step has a closed form given by

$$\begin{aligned} {\varvec{\Gamma }} ^{j + 1} = {{\mathcal {S}}} \Big ( {\mathbf {D}} \odot {\mathbf {M}}_{k}^{j + 1} + \frac{1}{\varrho } {\varvec{\Delta }}^{j}, \; \frac{\lambda }{\varrho } \Big ). \end{aligned}$$

(33)
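A minimal sketch of the soft-thresholding operator and the resulting \({\varvec{\Gamma }} \)-update, following the reconstruction of (32)-(33) above; the diagonal is zeroed by the indicator \(1_{\{j \ne l\}}\), so when \({\mathbf {D}}\) does not have a zero diagonal the diagonal entries may need separate handling (function names are illustrative):

```python
import numpy as np

def soft_threshold_offdiag(A, kappa):
    """Entry-wise soft-thresholding of the off-diagonal entries of A, cf. (32)."""
    S = np.sign(A) * np.maximum(np.abs(A) - kappa, 0.0)
    np.fill_diagonal(S, 0.0)                # the indicator 1_{j != l} zeroes the diagonal
    return S

def gamma_step(M, D, Delta, rho, lam):
    """Gamma-update (33): soft-threshold D .* M + Delta/rho at level lam/rho."""
    return soft_threshold_offdiag(D * M + Delta / rho, lam / rho)
```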

2.2 The Inner-Outer ADMM Algorithm

Assembling the different expressions, we get the following inner-outer alternating direction method of multipliers algorithm. The advantage of this algorithm is that there is no need to solve the inner iteration to high accuracy: it is sufficient to stop the inner iteration at an accuracy of \(10^{-3}\), whereas in the outer iteration we require a tighter accuracy, of the order of \(10^{-6}\). This saves a lot of computational time. Moreover, a further motivation for using ADMM is that it lends itself to an efficient implementation on parallel computers; both the inner and the outer iterations can easily be parallelized.

Algorithm 1. The inner-outer ADMM algorithm.
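For concreteness, the following is a minimal skeleton of how the outer and inner iterations could be assembled, reusing the hypothetical helpers project_to_cone, intermediate_M_step and gamma_step sketched above; the parameter values, initializations, and stopping rules are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def inner_outer_admm(sigma_hat, D, lam, rho=1.0, delta=1e-4,
                     outer_tol=1e-6, inner_tol=1e-3,
                     max_outer=500, max_inner=50):
    """Sketch of the inner-outer ADMM for
       min 0.5*||Sigma_hat M - I||_F^2 + lam*||D .* M||_1  s.t.  M >= delta*I."""
    p = sigma_hat.shape[0]
    M = np.eye(p); Theta = np.eye(p); Lam = np.zeros((p, p))
    for _ in range(max_outer):
        # Theta-step: project M + Lam/rho onto the cone {Theta >= delta*I} (Lemma 1)
        Theta = project_to_cone(M + Lam / rho, delta)
        # M-step: inner ADMM on (24)-(25)
        Gamma = D * M
        Delta = np.zeros((p, p))
        for _ in range(max_inner):
            M_new = intermediate_M_step(sigma_hat, Theta, Lam, Gamma, Delta, D, rho,
                                        M0=M, tol=inner_tol)
            Gamma = gamma_step(M_new, D, Delta, rho, lam)
            Delta = Delta - (Gamma - D * M_new)          # dual update (29)
            converged = np.linalg.norm(M_new - M, 'fro') <= inner_tol * max(1.0, np.linalg.norm(M, 'fro'))
            M = M_new
            if converged:
                break
        # dual update (23) and outer stopping test on the primal residual
        Lam = Lam - (Theta - M)
        if np.linalg.norm(Theta - M, 'fro') <= outer_tol * max(1.0, np.linalg.norm(M, 'fro')):
            break
    return M
```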

3 Experiments

The ADMM algorithm is known to be scalable; both the inner and outer iterations can easily be parallelized, making the proposed algorithm suitable for big data. To show the convergence and efficiency of our approach, we use simulations on synthetic data as well as a real example to compare the performance of our estimator with the Graphical LASSO.

3.1 Validation on Synthetic Data

In order to validate our approach, we used the same simulation structure as in [5]. We generated \(n = 1000\) samples from a \(p = 600\)-dimensional normal distribution with correlation structure of the form \( \sigma (x_{i}, x_{j}) = 0.6^{|i - j|} \). This model has a banded structure, and the values of the entries decay exponentially as they move away from the diagonal. We generated an independent sample of size 1000 from the same distribution for validating the tuning parameter \(\lambda \). Using the training data, we compute a series of estimators for 50 different values of \(\lambda \) and keep the one with the smallest likelihood loss on the validation sample, where the likelihood loss is defined by, see e.g. [8],

$$\begin{aligned} {{{\mathcal {L}}}} (\hat{{\varvec{\Sigma }} },{\mathbf {M}} ) = - \log \det ( {\mathbf {M}} ) + \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}} \rangle . \end{aligned}$$
(34)
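A brief sketch of this tuning step (the helper estimate_precision stands for any precision estimator computed at a given \(\lambda \) and is an assumption, as are the other names):

```python
import numpy as np

def likelihood_loss(sigma_val, M):
    """Likelihood loss (34): -log det(M) + <Sigma, M>, with <.,.> the trace inner product."""
    _, logdet = np.linalg.slogdet(M)
    return -logdet + np.sum(sigma_val * M)   # equals trace(Sigma @ M) for symmetric inputs

def select_lambda(sigma_train, sigma_val, lambdas, estimate_precision):
    """Fit one estimator per lambda on the training covariance; keep the smallest validation loss."""
    losses = [likelihood_loss(sigma_val, estimate_precision(sigma_train, lam)) for lam in lambdas]
    return lambdas[int(np.argmin(losses))]
```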

We mention that all the experiments were conducted on a PC with 4 GB of RAM and a 3 GHz CPU, using Matlab 2009a.

Results for \(\lambda = 0.01\) are summarized in Fig. 1. Convergence is achieved in 25 iterations and requires just 0.54 s. After a few iterations of fluctuation (\(\approx \) 12), the objective function stabilizes and converges to its optimal value. The eigenvalues of the precision matrix estimated by our algorithm are real and positive, which confirms the positive definiteness of the obtained precision matrix, as shown in Fig. 2.

Fig. 1. Plot of the objective function for \(\lambda = 0.01\).

Fig. 2. Eigenvalue distribution of the estimated precision matrix.

3.2 Validation on Real Data

For experimental validation, we used 4 cancer datasets publicly available at the Gene Expression Omnibus, see http://www.ncbi.nlm.nih.gov/geo/. For a fair comparison with the other method of estimating the inverse covariance matrix, we follow the same analysis scheme used by [8]. The datasets are: Liver cancer (GSE1898), Colon cancer (GSE29638), Breast cancer (GSE20194) and Prostate cancer (GSE17951), with sample sizes \(n = 182, 50, 278\) and 154, respectively, and numbers of genes \(p = 21794, 22011, 22283\) and 54675. We preprocessed the data so that each variable has zero mean and unit variance across the dataset. We performed 100 repetitions with \(50\% - 50\%\) validation and testing samples.

Since regular sparseness-promoting methods do not scale to a large number of variables, we adopted the same setup proposed by [8] and validated our method in two regimes. In the first regime, for each of the 50 repetitions, we selected \(n = 200\) variables uniformly at random and used the G-LASSO. In the second regime, we used all the variables in the dataset and applied our inner-outer ADMM algorithm. Since the whole sample covariance matrix could not fit in memory, we computed it in batches of rows as in [12]. In order to make a fair comparison, the runtime includes the time needed to produce the optimal precision matrix from a given input dataset, i.e., the time to solve each optimization problem and also the time to compute the covariance matrix (if needed). Average runtimes are summarized in Table 1; our method is considerably faster than the G-LASSO method.

Table 1. Runtimes for gene expression datasets. Our method is considerably faster than G-LASSO.

3.3 Validation on CGH Data

Alterations in the genome that lead to changes in DNA sequence copy number are a characteristic of solid tumors and are found in association with developmental abnormalities and/or mental retardation. Comparative genomic hybridization (CGH) can be used to detect and map these changes; knowledge of copy number aberrations can therefore have immediate clinical use in diagnosis and, in some cases, provide useful prognostic information.

In a typical CGH measurement, total genomic DNA is isolated from test and reference cell populations, differentially labeled, and hybridized to a representation of the genome that allows the binding of sequences at different genomic locations to be distinguished. Array CGH has been implemented using a wide variety of techniques, such as BAC arrays (produced from bacterial artificial chromosomes), cDNA microarrays (made from cDNAs), and oligo arrays (made from oligonucleotides; Affymetrix, Agilent, Illumina), to name a few. The output of an array CGH experiment is the log2 ratio of the copy number in the test versus the reference. The goal of this experiment is to identify genome regions with DNA copy number alterations.

The glioma data from Bredel et al. [4] contain samples representing primary GBMs, a particularly malignant type of brain tumor. We investigate the performance of the Fused LASSO and ADMM LASSO methods on the array CGH profiles of the GBM samples examined in Lai et al. [10]. To generate a more challenging situation, where both local amplification and a large region of loss exist in the same chromosome, we paste together the following two array regions: (1) chromosome 7 in GBM29 from 40 to 65 Mb and (2) chromosome 13 in GBM31. The performance of the two methods on this pseudo-chromosome is illustrated in Fig. 3. We can see that the proposed method using ADMM LASSO successfully identifies both the local amplification and the large region of copy number loss.

Fig. 3. Estimated copy number. Left: Fused LASSO regression, showing copy number alteration regions. Right: ADMM LASSO algorithm.

4 Conclusion and Future Work

The sparse precision matrix estimator has been shown to be useful in many applications. Penalizing the matrix is a tool with good asymptotic properties for estimating large sparse covariance and precision matrices. However, the positive definiteness property can easily be violated in practice when the structure is left unconstrained, which prevents its use in many important applications such as graphical models, financial assets, and comparative genomic hybridization.

In this paper, we have expressed the precision matrix estimation equation in a convex optimization framework and considered a natural modification that imposes positive definiteness and structural constraints. We have developed a fast alternating direction method of multipliers (ADMM) to solve the constrained optimization problem, and the resulting estimator retains the sparsity and positive definiteness properties simultaneously. We are currently working on demonstrating the general validity of the method, and its advantages over correlation networks based on competing precision matrix estimators, using computer-simulated reaction systems, in order to demonstrate strong signatures of intracellular pathways and to provide a valuable tool for the unbiased reconstruction of metabolic reactions from large-scale metabolomics data sets.