
1 Introduction

Recent applications often involve high dimensionality with a limited number of data points, leading to a number of covariance parameters that greatly exceeds the number of observations. Examples include marketing, e-commerce, and warehouse data in business; microarray and proteomics data in genomics and health sciences; and biomedical imaging, functional magnetic resonance imaging, tomography, signal processing, high-resolution imaging, and functional and longitudinal data. In biological sciences, one may want to classify diseases and predict clinical outcomes using microarray gene expression or proteomics data, in which hundreds of thousands of expression levels are potential covariates, but there are typically only tens or hundreds of subjects. Hundreds of thousands of single-nucleotide polymorphisms are potential predictors in genome-wide association studies. The dimensionality of the variable space grows rapidly when interactions of such predictors are considered.

Large-scale data analysis is also a common feature of many problems in machine learning, such as text and document classification and computer vision. For a \(p \times p\) covariance matrix \({\varvec{\Sigma }} \), there are \(p(p+1)/2\) parameters to estimate, yet the sample size n is often small. In addition, the positive-definiteness of \({\varvec{\Sigma }} \) makes the problem even more complicated. When \(n > p\), the sample covariance matrix is positive-definite and unbiased, but as the dimension p grows, i.e., when \(p \gg n\), which is the case in this study, the sample covariance matrix becomes singular, and hence unstable, and can fail to be consistent.

Due to its importance, there has been an active line of work on efficient optimization methods for solving the \(\ell _{1}\)-regularized Gaussian maximum likelihood estimation (MLE) problem. Among these methods, we cite the projected subgradient method [7], alternating linearization, the inexact interior point method, the greedy coordinate descent method, and G-LASSO, which is a block coordinate descent method. For typical high-dimensional statistical problems, such optimization methods typically suffer from sub-linear rates of convergence. This is too expensive for the Gaussian MLE problem, since the number of matrix entries scales quadratically with the number of nodes.

Sparse modeling has been widely used to deal with high dimensionality. The main assumption is that the p-dimensional parameter vector is sparse, with many components being exactly zero or negligibly small. Such an assumption is crucial for identifiability, especially when the sample size is relatively small. Although the notion of sparsity gives rise to biased estimation in general, it has proved to be effective in many applications. In particular, variable selection can increase the estimation accuracy by effectively identifying important predictors, and can improve the model interpretability.

To estimate the inverse of the covariance matrix, the following thresholding approach has been proposed:

$$\begin{aligned} \hat{{\mathbf {M}}} = \arg \min _{{\mathbf {M}}} \frac{1}{2}\Vert {\hat{{\varvec{\Sigma }} }} {\mathbf {M}} - {{\mathbf {I}}} \Vert _{F}^{2} + \lambda \Vert {\mathbf {M}}\Vert _{1}, \end{aligned}$$
(1)

where \({\mathbf {M}}\) is the precision matrix, i.e., the inverse of \({\varvec{\Sigma }} \). This formulation emphasizes the fact that the solution may not be unique. Such nonuniqueness always occurs here, since \({\hat{{\varvec{\Sigma }} }} = {{\mathbf {X}}}^T {{\mathbf {X}}}\), where \({{\mathbf {X}}} \in {{\mathbb {R}}}^{n\times p}\) is the sample data matrix and rank \(({\mathbf {X}}) \le n \ll p\), which is our case. However, there is no guarantee that the thresholding estimator is always positive definite [5]. Although the positive definite property is guaranteed in the asymptotic setting with high probability, the actual estimator can be an indefinite matrix, especially in real data analysis.

To overcome these limitations, i.e., to achieve sparsity and positive semi-definiteness simultaneously while adding structure or constraints on the coefficients of the precision matrix (in a graphical model setting, for example), a natural solution is to add the positive semi-definite constraint and to penalize the \(\ell _{1}\) norm of the product of a matrix \({\mathbf {D}} \in {\mathbb {R}}^{p \times p}\) with the precision matrix estimate \({\mathbf {M}}\), i.e.,

$$\begin{aligned} \hat{{\mathbf {M}}} = \arg \min _{{\mathbf {M}}} \frac{1}{2}\Vert {\hat{{\varvec{\Sigma }} }} {\mathbf {M}} - {{\mathbf {I}}} \Vert _{F}^{2} + \lambda \Vert {\mathbf {D}} {\mathbf {M}}\Vert _{1}. \end{aligned}$$
(2)

Before discussing our contribution in this direction, let us introduce G-LASSO, a popular penalized approach for estimating the precision matrix.

1.1 Tikhonov and G-LASSO

The log-likelihood of a sample \({{\mathbf {x}}}_i\) in a Gaussian model with mean \(\varvec{\mu }\) and precision matrix \({\mathbf {M}}\) is given by

$$\begin{aligned} {\mathcal {L}}({\mathbf {x}}_{i}, \varvec{\mu }, {\mathbf {M}} ) = \log \det {\mathbf {M}} - ({\mathbf {x}}_{i}-\varvec{\mu })^{T} {\mathbf {M}} ({\mathbf {x}}_{i}-\varvec{\mu }), \end{aligned}$$
(3)

and since the average log-likelihood depends on the data only through the sample covariance matrix \({\hat{{\varvec{\Sigma }} }}\), we define

$$\begin{aligned} {\mathcal {L}}({\mathbf {M}}, {\hat{{\varvec{\Sigma }} }}) = \frac{1}{n} \sum _{i} {\mathcal {L}}({\mathbf {x}}_{i}, \varvec{\mu }, {\mathbf {M}}) = \log \det {\mathbf {M}} - \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle . \end{aligned}$$
(4)

A very well known technique for the estimation of precision matrices is Tikhonov regularization. Given a sample covariance matrix \({\hat{{\varvec{\Sigma }} }}\succeq {{\mathbf {0}}}\), the Tikhonov-regularized problem is defined as

$$\begin{aligned} \max _{{\mathbf {M}}\succ 0} \quad {\mathcal {L}}({\mathbf {M}}, \hat{{\varvec{\Sigma }} }) - \lambda ~{\text {Trace}} ({\mathbf {M}}), \end{aligned}$$
(5)

or, equivalently,

$$\begin{aligned} \max _{{\mathbf {M}}\succ 0} \quad \log \det {\mathbf {M}} - \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle - \lambda ~{\text {Trace}} ({\mathbf {M}}). \end{aligned}$$
(6)

This can be seen as imposing a matrix variate Gaussian prior on \({\mathbf {M}}^{1/2}\), with both row and column covariance matrices equal to \({\mathbf {I}}\), to make the solution well defined, assuming \({\mathbf {M}}\) is positive definite. The optimal solution of (6) is given by

$$\begin{aligned} {{\mathbf {M}}} = { ( \hat{{\varvec{\Sigma }} } + \lambda {{\mathbf {I}}} ) }^{-1}. \end{aligned}$$
(7)
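For illustration, the following is a minimal NumPy sketch of the closed-form Tikhonov estimate (7); the function name, the toy data, and the value of \(\lambda \) are illustrative assumptions, not part of the original text.

```python
import numpy as np

def tikhonov_precision(sigma_hat, lam):
    """Closed-form Tikhonov-regularized precision estimate (7): (Sigma_hat + lam*I)^{-1}."""
    p = sigma_hat.shape[0]
    return np.linalg.inv(sigma_hat + lam * np.eye(p))

# Illustrative usage with p >> n, where the sample covariance itself is singular.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))      # n = 50 observations, p = 200 variables
sigma_hat = X.T @ X / X.shape[0]        # empirical covariance matrix
M_tik = tikhonov_precision(sigma_hat, lam=0.1)
```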

Since the \(\ell _{1}\)-norm is the tightest convex relaxation of the cardinality of a matrix, several methods based on an \(\ell _{1}\)-regularized log-determinant program have been proposed. This approach is called G-LASSO, or Graphical LASSO, after the algorithm that efficiently computes the solution,

$$\begin{aligned} \max _{{\mathbf {M}}\succ 0} \quad \log \det {\mathbf {M}} - \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle - \lambda \Vert {\mathbf {M}}\Vert _{1}. \end{aligned}$$
(8)

Similar to Eq. (8), Bien and Tibshirani in [3] suggested adding to the likelihood a LASSO penalty of the form \(\Vert {\mathbf {D}} \odot {{\varvec{\Sigma }} } \Vert _1\), where \({\mathbf {D}}\) is an arbitrary matrix with non-negative elements and \(\odot \) denotes element-wise multiplication, i.e., the Hadamard product. Applied to the precision matrix, the regularization problem becomes

$$\begin{aligned} \min _{{\mathbf {M}}\succ 0}\quad - \log \det {\mathbf {M}} + \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle + \lambda \Vert {\mathbf {D}} \odot {\mathbf {M}}\Vert _{1}, \end{aligned}$$
(9)

and one may tighten the constraint \({\mathbf {M}}\succ 0\) to \({\mathbf {M}}\succ \delta {\mathbf {I}}\) for some \(\delta > 0\), so that the above equation becomes

$$\begin{aligned} \min _{ \mathop { {\mathbf {M}}\succ \delta {\mathbf {I}}}\limits _{\delta > 0 }} \quad - \log \det {\mathbf {M}} + \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}}\rangle + \lambda \Vert {\mathbf {D}} \odot {\mathbf {M}}\Vert _{1}. \end{aligned}$$
(10)

The asymptotic properties of the estimator have been studied in [13]. Cai and Zhou in [5] proposed a constrained \(\ell _{1}\)-minimization procedure for estimating sparse precision matrices by solving the optimization problem

$$\begin{aligned} \min \quad \Vert {\mathbf {M}}\Vert _{1} \quad \text {subject to} \quad \Vert \hat{{\varvec{\Sigma }} }{\mathbf {M}} - {\mathbf {I}}\Vert _{\infty } \le \lambda . \end{aligned}$$
(11)

Problem (11) can be decomposed into a series of Dantzig-selector-type problems. In the above, the symmetry condition is not enforced on \({\mathbf {M}}\), and as a result the solution is not symmetric in general. Cai and Zhou in [5] established the rates of convergence of CLIME, the algorithm derived from the above regularization problem, under both the entry-wise \(\ell _{1}\)-norm and the Frobenius norm, and proved that the CLIME estimator is positive definite with high probability. However, in practice there is no guarantee that CLIME is always positive definite, especially when applied to real applications.

This warrants our study and our contributions.

1.2 Summary of the Paper and Contributions

In this paper, we propose a new way of estimating large sparse precision matrices, a problem that is significant in machine learning, computer vision, and data mining. This requires seeking a sparse solution while also ensuring positive definiteness. We therefore present an optimization model that encourages both sparsity and positive definiteness, and propose an implementation based on an inner-outer alternating direction method of multipliers (ADMM). The method is validated using numerical experiments.

The emphasis of the paper is on introducing a new criterion that ensures the positive-definiteness of the precision matrix by adding a tuning parameter \(\delta >0\) to the constraints. This additional constraint guards against estimates that are merely positive semi-definite, i.e., singular. We add structure on the coefficients of the precision matrix and derive an efficient inner-outer ADMM to obtain an optimal solution. We perform the ADMM steps, a variant of the standard augmented Lagrangian method that uses partial updates, but with three innovations that allow us to finesse the caveats detailed above. We also show that the proposed algorithm converges with a very relaxed stopping criterion in the inner iteration. It is scalable and has better accuracy properties than existing methods.

2 The Proposed Method

Given a data matrix \({{\mathbf {X}}} \in {{\mathbb {R}}}^{n\times p}\), i.e., a sample of n realizations from a p-dimensional Gaussian distribution with zero mean and covariance matrix \({\varvec{\Sigma }} \in {{\mathbb {R}}}^{p\times p}\), with \(p \gg n\), the goal is to estimate the precision matrix \({{\mathbf {M}}}\), i.e., the inverse of the covariance matrix \({{\varvec{\Sigma }} }\). Let \({\hat{{\varvec{\Sigma }} }} = {{\mathbf {X}}}^T {{\mathbf {X}}}\) be the empirical covariance matrix. The Graphical LASSO problem minimizes the \(\ell _1\)-regularized negative log-likelihood

$$\begin{aligned} f( {{\mathbf {M}}} ) = - \log \det {{\mathbf {M}}} + \langle {{\mathbf {M}}}, {\hat{{\varvec{\Sigma }} }} \rangle + \lambda \Vert {\mathbf {M}}\Vert _{1}, \end{aligned}$$
(12)

subject to the constraint that \({{\mathbf {M}}} \succ {{\mathbf {0}}}\). This is a semi-definite programming problem in the variable \({\mathbf {M}}\). Taking the (sub)derivative of \(f( {\mathbf {M}})\) with respect to \({\mathbf {M}}\) and setting it to zero yields the optimality condition

$$\begin{aligned} -{\mathbf {M}}^{-1} + {\hat{{\varvec{\Sigma }} }} + \lambda ~ {\varvec{\Gamma }} = {{\mathbf {0}}}, \end{aligned}$$
(13)

where \({\varvec{\Gamma }} \) is a matrix of componentwise signs of \({\mathbf {M}}\), i.e.,

$$\begin{aligned} {\left\{ \begin{array}{ll} \gamma _{jk} = {\text {sign}} ( m_{jk} ) &{} {\text {if}} \,\, m_{jk} \ne 0 \\ \gamma _{jk} \in [-1, 1] &{} {\text {if}}\,\, m_{jk} = 0 \end{array}\right. } \end{aligned}$$
(14)

Since every entry of \({\varvec{\Gamma }} \) lies in \([-1, 1]\), taking the entry-wise \(\ell _{\infty }\)-norm of (13) yields

$$\begin{aligned} \Vert {{\mathbf {M}}}^{-1} - {\hat{{\varvec{\Sigma }} }} \Vert _{\infty } \le \lambda . \end{aligned}$$
(15)

Therefore, our interest is to construct a sparse precision matrix estimator via convex optimization that performs well in high-dimensional settings and is positive definite in finite samples, see e.g. [14]. Positive definiteness is desirable when the covariance estimator is applied to methods for supervised learning. Many of these methods either require a positive definite covariance estimator, or use optimization that is convex only if the covariance estimator is nonnegative definite, e.g., quadratic discriminant analysis and covariance regularized regression, see e.g. [16].

We define our estimator as a solution to the following problem

$$\begin{aligned} \hat{{\mathbf {M}}} = \arg \min _{\mathop { {\mathbf {M}}\succeq \delta {\mathbf {I}}}\limits _{\delta > 0 } } \frac{1}{2}\Vert \hat{{\varvec{\Sigma }} } {\mathbf {M}} - {\mathbf {I}}\Vert _{F}^{2} + \lambda { \Vert {\mathbf {D}} \odot {\mathbf {M}} \Vert }_1, \end{aligned}$$
(16)

where \(\hat{{\varvec{\Sigma }} }\) is the empirical covariance matrix and \({\mathbf {D}} = (d_{ij})_{1\le i,j \le p}\) is an arbitrary matrix with non-negative elements. The matrix \({\mathbf {D}}\) can take different forms: it can be the matrix of all ones, or it can be a matrix with zeros on the diagonal to avoid shrinking the diagonal elements of \({\mathbf {M}}\). Furthermore, we can take \({\mathbf {D}}\) with elements \(d_{ij} = 1_{i \ne j}|\hat{{\mathbf {M}}}_{ij}^{\text {init}}|\), where \(\hat{{\mathbf {M}}}^{\text {init}}\) is an initial estimator of \({\mathbf {M}}\). The latter choice of \({\mathbf {D}}\) corresponds to a precision matrix analogue of the Adaptive LASSO penalty. A sketch of these choices is given below.
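A minimal sketch of these three choices of \({\mathbf {D}}\) (the function name `weight_matrix` and its arguments are illustrative; `M_init` stands for an initial estimate \(\hat{{\mathbf {M}}}^{\text {init}}\)):

```python
import numpy as np

def weight_matrix(p, kind="ones", M_init=None):
    """Non-negative weight matrix D used in the penalty lambda * ||D .* M||_1."""
    if kind == "ones":                      # penalize every entry equally
        return np.ones((p, p))
    if kind == "offdiag":                   # zero diagonal: do not shrink diag(M)
        return np.ones((p, p)) - np.eye(p)
    if kind == "adaptive":                  # d_ij = 1_{i != j} * |M_init_ij|
        D = np.abs(M_init).astype(float)
        np.fill_diagonal(D, 0.0)
        return D
    raise ValueError("unknown kind: " + kind)
```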

To derive our inner-outer ADMM algorithm, as in [1], we will first introduce a new variable \({\varvec{\Theta }} \in {{\mathbb {R}}}^{p\times p}\) and an equality constraint as follows

$$\begin{aligned} \hat{{\mathbf {M}}} = \arg \min _{ {\varvec{\Theta }} , {\mathbf {M}}} \left\{ \frac{1}{2}\Vert \hat{{\varvec{\Sigma }} } {\mathbf {M}} - {\mathbf {I}}\Vert _{F}^{2} + \lambda { \Vert {\mathbf {D}} \odot {\mathbf {M}} \Vert }_1 \quad | \quad {\mathbf {M}}= {\varvec{\Theta }} , \; {\varvec{\Theta }} \succeq \delta {\mathbf {I}} \right\} . \end{aligned}$$
(17)

The solution to (17) gives the solution to (16). To deal with the problem (17), we have to minimize its augmented Lagrangian function, \({{{\mathcal {L}}}}_{\varrho } ( {\varvec{\Theta }} , {\mathbf {M}}, {\varvec{\Lambda }} )\) for some given penalty parameter \(\varrho \), i.e.,

$$\begin{aligned} {{{\mathcal {L}}}}_{\varrho } ( {\varvec{\Theta }} , {\mathbf {M}}, {\varvec{\Lambda }} ) = \frac{1}{2} \Vert \hat{{\varvec{\Sigma }} } {\mathbf {M}}- \mathbf {I} \Vert _F^2 + \lambda {\Vert \mathbf {D} \odot {\mathbf {M}}} \Vert _1 - \langle {\varvec{\Lambda }} , {\varvec{\Theta }} - {\mathbf {M}}\rangle + \frac{1}{2} \varrho {\Vert {\varvec{\Theta }} - {\mathbf {M}}}\Vert _F^2, \end{aligned}$$
(18)

where \({\varvec{\Lambda }} \in {{\mathbb {R}}}^{p\times p}\) is the matrix of Lagrange multipliers associated with the equality constraint. Hence, at iteration k, the ADMM algorithm consists of three steps, namely the \({\varvec{\Theta }} \)-Step, the \({\mathbf {M}}\)-Step, and the dual-update step.

First step: In the first step of the ADMM algorithm, we fix \({\mathbf {M}}\) and \({\varvec{\Lambda }} \) and minimize the augmented Lagrangian function over \({\varvec{\Theta }} \) to get

$$\begin{aligned} {\varvec{\Theta }} ^{k+1} = \arg \min _{{\varvec{\Theta }} \succeq \delta {\mathbf {I}}} {{{\mathcal {L}}}}_{\varrho } ( {\varvec{\Theta }} , {\mathbf {M}}^{k}, {\varvec{\Lambda }} ^{k} ) = P_{{{\mathcal {S}}}_{\delta }} ( \bar{ {\mathbf {A}} } ), \end{aligned}$$

where \(P_{{{\mathcal {S}}}_{\delta }} ( \bar{ {\mathbf {A}} } )\) denotes the projection of the matrix \(\bar{ {\mathbf {A}} } = {\mathbf {M}}^k + \frac{1}{\varrho } {\varvec{\Lambda }} ^k \) onto the convex cone \({{\mathcal {S}}}_{\delta } = \{ {\varvec{\Theta }} : {\varvec{\Theta }} \succeq \delta {\mathbf {I}} \}\).

Lemma 1

Let \(P_{{{\mathcal {S}}}_{\delta }} ( \bar{ {\mathbf {A}} } )\) be the projection of the matrix \(\bar{ {\mathbf {A}} }\) onto the convex cone \({{\mathcal {S}}}_{\delta }\). If \(\bar{ {\mathbf {A}} }\) has the eigen-decomposition \(\bar{ {\mathbf {A}} } = {{\mathbf {V}}} ~{\text {Diag}} ( \lambda _1, \lambda _2, \cdots , \lambda _p )~ {{\mathbf {V}}}^T \), then

$$\begin{aligned} P_{{{\mathcal {S}}}_{\delta }} ( \bar{ {\mathbf {A}} } ) = {{\mathbf {V}}} ~{\text {Diag}} \big ( \max (\lambda _1, \delta ), \max (\lambda _2, \delta ), \cdots , \max (\lambda _p, \delta ) \big )~ {{\mathbf {V}}}^T. \end{aligned}$$

(19)

Proof

This result can be traced back to early statisticians, e.g. see [15], who noticed that if a matrix \({{\mathbf {A}}}\) has an eigendecomposition

$$\begin{aligned} {{\mathbf {A}}} = {{\mathbf {U}}} ~{\text {Diag}} ( \mu _1, \mu _2, \cdots , \mu _n )~ {{\mathbf {U}}}^T, \end{aligned}$$
(20)

where \(\mu _1 \ge \cdots \ge \mu _n\) are the eigenvalues of \({{\mathbf {A}}}\) and \({{\mathbf {U}}}\) is the corresponding orthonormal matrix of eigenvectors; then the projection of \({{\mathbf {A}}}\) onto the closed convex cone \({{\mathcal {S}}}_{0}\), formed by the set of positive semidefinite matrices, is

$$\begin{aligned} P_{{{\mathcal {S}}}_{0}} ( {{\mathbf {A}}} ) = {{\mathbf {U}}} ~{\text {Diag}} \big ( \max (\mu _1, 0), \max (\mu _2, 0), \cdots , \max (\mu _n, 0) \big )~ {{\mathbf {U}}}^T. \end{aligned}$$

(21)

Therefore, the extension to the cone \({{\mathcal {S}}}_{\delta }\) is straightforward.   \(\square \)

Since then this projection has been widely used, e.g. see [9]. Notice that the numerical cost of computing this projection is essentially that of computing the spectral decomposition of the matrix to project, i.e., \(\bar{{\mathbf {A}}}\).
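A minimal NumPy sketch of the projection of Lemma 1, assuming a symmetric input; as noted above, the cost is dominated by the spectral decomposition (the function name is illustrative):

```python
import numpy as np

def project_to_cone(A_bar, delta):
    """Projection of a symmetric matrix onto {Theta : Theta >= delta*I} (Lemma 1):
    clip the eigenvalues at delta and recombine with the eigenvectors."""
    eigval, eigvec = np.linalg.eigh(A_bar)              # spectral decomposition (symmetric input)
    return (eigvec * np.maximum(eigval, delta)) @ eigvec.T
```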

Second step: In the second step, we fix \({\varvec{\Theta }} \) and \({\varvec{\Lambda }} \) and minimize the augmented Lagrangian over \({\mathbf {M}}\), i.e.,

$$\begin{aligned} {\mathbf {M}}^{k + 1} = \arg \min _{{\mathbf {M}}} {{{\mathcal {L}}}}_{\varrho } ( {\varvec{\Theta }} ^{k + 1}, {\mathbf {M}}, {\varvec{\Lambda }} ^{k}), \end{aligned}$$
(22)

where this second step requires special care, since the concept of an inner iteration is introduced at this level of the procedure.

Third step: The final step of the ADMM algorithm is to update the dual variable \({\varvec{\Lambda }} \), i.e.,

$$\begin{aligned} {\varvec{\Lambda }} ^{k + 1} = {\varvec{\Lambda }} ^{k} - ({\varvec{\Theta }} ^{k + 1} - {\mathbf {M}}^{k + 1}). \end{aligned}$$
(23)

2.1 The Inner Iteration

It can be shown that

$$\begin{aligned} {\mathbf {M}}^{k + 1} = \arg \min _{{\mathbf {M}}} \frac{1}{2} \left\{ \Vert \hat{{\varvec{\Sigma }} }{\mathbf {M}} \Vert _{F}^{2} + \varrho \Vert {\mathbf {M}} \Vert _{F}^{2} - 2 \langle {\mathbf {M}}, \hat{{\varvec{\Sigma }} } + \varrho {\varvec{\Theta }} ^{k + 1} - {\varvec{\Lambda }} ^{k} \rangle \right\} + \lambda {\Vert {\mathbf {D}} \odot {\mathbf {M}} }\Vert _{1}, \end{aligned}$$

for which we are not able to derive a closed form for \({\mathbf {M}}\). To overcome this problem, we propose to derive a new ADMM to update \({\mathbf {M}}\) (the inner iteration). To do this, we reparametrize \({\mathbf {D}} \odot {\mathbf {M}}\) as \({\varvec{\Gamma }} \in {{\mathbb {R}}}^{p\times p}\), add the equality constraint \({\mathbf {D}} \odot {\mathbf {M}} = {\varvec{\Gamma }} \), and then minimize the following:

$$\begin{aligned} {\left\{ \begin{array}{lll} {\text {Minimize}} &{} \frac{1}{2} \left\{ \Vert \hat{{\varvec{\Sigma }} }{\mathbf {M}} \Vert _{F}^{2} + \varrho \Vert {\mathbf {M}} \Vert _{F}^{2} - 2 \langle {\mathbf {M}}, \hat{{\varvec{\Sigma }} } + \varrho {\varvec{\Theta }} ^{k + 1} - {\varvec{\Lambda }} ^{k} \rangle \right\} + \lambda {\Vert {\varvec{\Gamma }} }\Vert _{1}, &{} (24) \\ {\text {subject to}} &{} {\mathbf {D}} \odot {\mathbf {M}} = {\varvec{\Gamma }} . &{} (25) \end{array}\right. } \end{aligned}$$

The augmented Lagrangian \({{{\mathcal {K}}}}_{\varrho } ( {\mathbf {M}}, {\varvec{\Gamma }} , {\varvec{\Delta }})\) associated with this problem is

$$\begin{aligned} {{{\mathcal {K}}}}_{\varrho } ( {\mathbf {M}}, {\varvec{\Gamma }} , {\varvec{\Delta }})&= \frac{1}{2} \left\{ \Vert \hat{{\varvec{\Sigma }} }{\mathbf {M}} \Vert _{F}^{2} + \varrho \Vert {\mathbf {M}} \Vert _{F}^{2} - 2 \langle {\mathbf {M}}, \hat{{\varvec{\Sigma }} } + \varrho {\varvec{\Theta }} ^{k + 1} - {\varvec{\Lambda }} ^{k} \rangle \right\} \nonumber \\&+ \lambda \Vert {\varvec{\Gamma }} \Vert _{1} - \langle {\varvec{\Delta }}, {\varvec{\Gamma }} - {\mathbf {D}} \odot {\mathbf {M}} \rangle + \frac{1}{2} \varrho \Vert {\varvec{\Gamma }} - {\mathbf {D}} \odot {\mathbf {M}} \Vert _F^2 \end{aligned}$$
(26)

where \({\varvec{\Delta }}\in {{\mathbb {R}}}^{p\times p}\) is the matrix of Lagrange multipliers associated with the constraint. As before, the ADMM here, i.e., the inner iteration, consists of the following three steps:

$$\begin{aligned} {\left\{ \begin{array}{lll} {\mathbf {M}}_{k}^{j + 1} &{} = \arg \min \limits _{{\mathbf {M}}} {{{\mathcal {K}}}}_{\varrho } ( {\mathbf {M}}, {\varvec{\Gamma }} ^{j}, {\varvec{\Delta }}^{j}), &{} (27) \\ {\varvec{\Gamma }} ^{j + 1} &{} = \arg \min \limits _{{\varvec{\Gamma }} } {{{\mathcal {K}}}}_{\varrho } ( {\mathbf {M}}_{k}^{j + 1}, {\varvec{\Gamma }} , {\varvec{\Delta }}^{j}), &{} (28) \\ {\varvec{\Delta }}^{j + 1} &{} = {\varvec{\Delta }}^{j} - ( {\varvec{\Gamma }} ^{j + 1} - {\mathbf {D}} \odot {\mathbf {M}}_{k}^{j + 1}). &{} (29) \end{array}\right. } \end{aligned}$$

\(\underline{\mathrm{The\ intermediate}\ \mathbf M \text {-}\mathrm{Step}}\): It can be shown that \({\mathbf {M}}_{k}^{j + 1}\) is the minimizer of \({{{\mathcal {K}}}}_{\varrho }( {\mathbf {M}}, {\varvec{\Gamma }} ^{j}, {\varvec{\Delta }}^{j})\), i.e.,

$$\begin{aligned} \frac{ \partial }{ \partial {\mathbf {M}} } \left[ {{{\mathcal {K}}}}_{\varrho }( {\mathbf {M}}_{k}^{j + 1}, {\varvec{\Gamma }} ^{j}, {\varvec{\Delta }}^{j}) \right] = {\mathbf {0}}, \end{aligned}$$
(30)

which is equivalent to

$$\begin{aligned} \left( \hat{{\varvec{\Sigma }} }^{T} \hat{{\varvec{\Sigma }} } + \varrho {\mathbf {I}} \right) {\mathbf {M}}_{k}^{j + 1} + \varrho {\mathbf {D}} \odot {\mathbf {D}} \odot {\mathbf {M}}_{k}^{j+1} + \left( {\varvec{\Delta }}^{j} - \varrho {\varvec{\Gamma }} ^{j} \right) \odot {\mathbf {D}}\nonumber \\ - \left( \hat{{\varvec{\Sigma }} } + \varrho {\varvec{\Theta }} ^{k + 1} - {\varvec{\Lambda }} ^{k} \right) = {\mathbf {0}}, \end{aligned}$$
(31)

and therefore \({\mathbf {M}}_{k}^{j + 1}\) is characterized by the linear matrix equation (31). Solving it requires additional, but straightforward, computational effort at this level; this extra step can be considered as a start-up for the original ADMM algorithm, which is very important when dealing with complex problems and large datasets. In practice, this step is solved via a fixed-point iteration, see [2]; one possible splitting is sketched below.
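As an illustration, one simple splitting of (31) moves the Hadamard term to the right-hand side and iterates, reusing a single factorization of \(\hat{{\varvec{\Sigma }} }^{T} \hat{{\varvec{\Sigma }} } + \varrho {\mathbf {I}}\). The sketch below is a hypothetical scheme of this kind, not necessarily the iteration of [2], and its convergence depends on \(\varrho \) and \({\mathbf {D}}\); all names are illustrative.

```python
import numpy as np

def intermediate_M_step(sigma_hat, Theta, Lam, Gamma, Delta, D, rho,
                        M0=None, tol=1e-3, max_iter=100):
    """Solve (31) for M by a fixed-point iteration:
       (Sigma^T Sigma + rho*I) M = C - rho * (D*D) .* M, iterating on the right-hand side."""
    p = sigma_hat.shape[0]
    A_inv = np.linalg.inv(sigma_hat.T @ sigma_hat + rho * np.eye(p))   # factor once, reuse
    C = (sigma_hat + rho * Theta - Lam) - (Delta - rho * Gamma) * D
    M = np.zeros((p, p)) if M0 is None else M0.copy()
    for _ in range(max_iter):
        M_new = A_inv @ (C - rho * (D * D) * M)
        if np.linalg.norm(M_new - M, 'fro') <= tol * max(1.0, np.linalg.norm(M, 'fro')):
            return M_new
        M = M_new
    return M
```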

\(\underline{\mathrm{The\ intermediate}\ {\varvec{\Gamma }} \text {-}\mathrm{Step}}\): To deal with this \({\varvec{\Gamma }} \)-Step, we define an entry-wise soft-thresholding rule for all the off-diagonal elements of a matrix \({\mathbf {A}}\) as

$$\begin{aligned} {{\mathcal {S}}} ( {\mathbf {A}}, \kappa ) = \big [ s(a_{jl}, \kappa ) \big ]_{1 \le j, l \le p}, \end{aligned}$$

(32)

with \(s(a_{jl}, \kappa ) = \text {sign}(a_{jl})\max (|a_{jl}| - \kappa , 0) ~1_{\{j \ne l\}}\), see e.g. [6, 11, 12, 17]. Then the \({\varvec{\Gamma }} \)-Step has a closed form given by

$$\begin{aligned} {\varvec{\Gamma }} ^{j + 1} = {{\mathcal {S}}} \Big ( {\mathbf {D}} \odot {\mathbf {M}}_{k}^{j + 1} + \frac{1}{\varrho } {\varvec{\Delta }}^{j}, \; \frac{\lambda }{\varrho } \Big ). \end{aligned}$$

(33)
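A minimal sketch of the soft-thresholding operator and the resulting \({\varvec{\Gamma }} \)-update, following the reconstruction of (32)-(33) above; the diagonal is zeroed by the indicator \(1_{\{j \ne l\}}\), so when \({\mathbf {D}}\) does not have a zero diagonal the diagonal entries may need separate handling (function names are illustrative):

```python
import numpy as np

def soft_threshold_offdiag(A, kappa):
    """Entry-wise soft-thresholding of the off-diagonal entries of A, cf. (32)."""
    S = np.sign(A) * np.maximum(np.abs(A) - kappa, 0.0)
    np.fill_diagonal(S, 0.0)                # the indicator 1_{j != l} zeroes the diagonal
    return S

def gamma_step(M, D, Delta, rho, lam):
    """Gamma-update (33): soft-threshold D .* M + Delta/rho at level lam/rho."""
    return soft_threshold_offdiag(D * M + Delta / rho, lam / rho)
```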

2.2 The Inner-Outer ADMM Algorithm

Assembling the different expressions, we get the following inner-outer alternating direction method of multipliers algorithm. The advantage of this algorithm is that there is no need to solve the inner iteration to high accuracy: it is sufficient to stop the inner iteration at an accuracy of \(10^{-3}\), whereas in the outer iteration we require a tighter accuracy, of the order of \(10^{-6}\). This saves a lot of computational time. Moreover, a further motivation for using ADMM is that it lends itself to an efficient implementation on parallel computers; both the inner and the outer iterations can easily be parallelized.

Algorithm 1. The inner-outer ADMM algorithm.
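For concreteness, the following is a minimal skeleton of how the outer and inner iterations could be assembled, reusing the hypothetical helpers project_to_cone, intermediate_M_step and gamma_step sketched above; the parameter values, initializations, and stopping rules are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def inner_outer_admm(sigma_hat, D, lam, rho=1.0, delta=1e-4,
                     outer_tol=1e-6, inner_tol=1e-3,
                     max_outer=500, max_inner=50):
    """Sketch of the inner-outer ADMM for
       min 0.5*||Sigma_hat M - I||_F^2 + lam*||D .* M||_1  s.t.  M >= delta*I."""
    p = sigma_hat.shape[0]
    M = np.eye(p); Theta = np.eye(p); Lam = np.zeros((p, p))
    for _ in range(max_outer):
        # Theta-step: project M + Lam/rho onto the cone {Theta >= delta*I} (Lemma 1)
        Theta = project_to_cone(M + Lam / rho, delta)
        # M-step: inner ADMM on (24)-(25)
        Gamma = D * M
        Delta = np.zeros((p, p))
        for _ in range(max_inner):
            M_new = intermediate_M_step(sigma_hat, Theta, Lam, Gamma, Delta, D, rho,
                                        M0=M, tol=inner_tol)
            Gamma = gamma_step(M_new, D, Delta, rho, lam)
            Delta = Delta - (Gamma - D * M_new)          # dual update (29)
            converged = np.linalg.norm(M_new - M, 'fro') <= inner_tol * max(1.0, np.linalg.norm(M, 'fro'))
            M = M_new
            if converged:
                break
        # dual update (23) and outer stopping test on the primal residual
        Lam = Lam - (Theta - M)
        if np.linalg.norm(Theta - M, 'fro') <= outer_tol * max(1.0, np.linalg.norm(M, 'fro')):
            break
    return M
```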

3 Experiments

The ADMM algorithm is known to be scalable; both the inner and outer iterations can easily be parallelized, making the proposed algorithm suitable for big data. To show the convergence and efficiency of our approach, we use simulations on synthetic data as well as a real example to compare the performance of our estimator with the Graphical LASSO.

3.1 Validation on Synthetic Data

In order to validate our approach, we used the same simulation structure as in [5]. We generated \(n = 1000\) samples from a \(p = 600\)-dimensional normal distribution with correlation structure of the form \( \sigma (x_{i}, x_{j}) = 0.6^{|i - j|} \). This model has a banded structure, and the values of the entries decay exponentially as they move away from the diagonal. We generated an independent sample of size 1000 from the same distribution for validating the tuning parameter \(\lambda \). Using the training data, we compute a series of estimators for 50 different values of \(\lambda \) and keep the one with the smallest likelihood loss on the validation sample, where the likelihood loss is defined by, see e.g. [8],

$$\begin{aligned} {{{\mathcal {L}}}} (\hat{{\varvec{\Sigma }} },{\mathbf {M}} ) = - \log \det ( {\mathbf {M}} ) + \langle \hat{{\varvec{\Sigma }} }, {\mathbf {M}} \rangle . \end{aligned}$$
(34)
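A brief sketch of this tuning step (the helper estimate_precision stands for any precision estimator computed at a given \(\lambda \) and is an assumption, as are the other names):

```python
import numpy as np

def likelihood_loss(sigma_val, M):
    """Likelihood loss (34): -log det(M) + <Sigma, M>, with <.,.> the trace inner product."""
    _, logdet = np.linalg.slogdet(M)
    return -logdet + np.sum(sigma_val * M)   # equals trace(Sigma @ M) for symmetric inputs

def select_lambda(sigma_train, sigma_val, lambdas, estimate_precision):
    """Fit one estimator per lambda on the training covariance; keep the smallest validation loss."""
    losses = [likelihood_loss(sigma_val, estimate_precision(sigma_train, lam)) for lam in lambdas]
    return lambdas[int(np.argmin(losses))]
```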

We mention that all the experiments were conducted on a PC with 4 GB of RAM and a 3 GHz CPU, using Matlab 2009a.

Results for \(\lambda = 0.01\) are summarized in Fig. 1. Convergence is achieved in 25 iterations and requires just 0.54 s. After a few iterations of fluctuation (\(\approx \) 12), the objective function stabilizes and converges to its optimal value. The eigenvalues of the precision matrix estimated by our algorithm are real and positive, which confirms the positive definiteness of the obtained precision matrix, as shown in Fig. 2.

Fig. 1. Plot of the objective function for \(\lambda = 0.01\).

Fig. 2. Eigenvalue distribution of the estimated precision matrix.

3.2 Validation on Real Data

For experimental validation, we used 4 cancer datasets publicly available at the Gene Expression Omnibus, see http://www.ncbi.nlm.nih.gov/geo/. For a fair comparison with the other method of estimating the inverse covariance matrix, we follow the same analysis scheme used by [8]. The datasets are: Liver cancer (GSE1898), Colon cancer (GSE29638), Breast cancer (GSE20194) and Prostate cancer (GSE17951), with sample sizes \(n = 182, 50, 278\) and 154, respectively, and numbers of genes \(p = 21794, 22011, 22283\) and 54675. We preprocessed the data so that each variable has zero mean and unit variance across the dataset. We performed 100 repetitions with \(50\% - 50\%\) validation and testing samples.

Since regular sparseness-promoting methods do not scale to a large number of variables, we adopted the same setup proposed by [8] and validated our method in two regimes. In the first regime, for each of the 50 repetitions, we selected \(n = 200\) variables uniformly at random and used the G-LASSO. In the second regime, we used all the variables in the dataset and applied our inner-outer ADMM algorithm. Since the whole sample covariance matrix could not fit in memory, we computed it in batches of rows as in [12]. In order to make a fair comparison, the runtime includes the time needed to produce the optimal precision matrix from a given input dataset, i.e., the time to solve each optimization problem and also the time to compute the covariance matrix (if needed). Average runtimes are summarized in Table 1; our method is considerably faster than the G-LASSO method.

Table 1. Runtimes for gene expression datasets. Our method is considerably faster than G-LASSO.

3.3 Validation on CGH Data

Alterations in the genome that lead to changes in DNA sequence copy number are a characteristic of solid tumors and are found in association with developmental abnormalities and/or mental retardation. Comparative genomic hybridization (CGH) can be used to detect and map these changes; knowledge of copy number aberrations can therefore have immediate clinical use in diagnosis and, in some cases, provide useful prognostic information.

In a typical CGH measurement, total genomic DNA is isolated from test and reference cell populations, differentially labeled, and hybridized to a representation of the genome that allows the binding of sequences at different genomic locations to be distinguished. Array CGH has been implemented using a wide variety of techniques, such as BAC arrays (produced from bacterial artificial chromosomes), cDNA microarrays (made from cDNAs), and oligo arrays (made from oligonucleotides; Affymetrix, Agilent, Illumina), to name a few. The output of an array CGH experiment is the log2 ratio of the copy number in the test versus the reference. The goal of this experiment is to identify genome regions with DNA copy number alterations.

The glioma data from Bredel et al. [4] contain samples representing primary GBMs, a particularly malignant type of brain tumor. We investigate the performance of the Fused LASSO and ADMM LASSO methods on the array CGH profiles of the GBM samples examined in Lai et al. [10]. To generate a more challenging situation, where both local amplification and a large region of loss exist in the same chromosome, we paste together the following two array regions: (1) chromosome 7 in GBM29 from 40 to 65 Mb and (2) chromosome 13 in GBM31. The performance of the two methods on this pseudo-chromosome is illustrated in Fig. 3. We can see that the proposed method using ADMM LASSO successfully identifies both the local amplification and the large region of copy number loss.

Fig. 3. Estimated copy number. Left: Fused LASSO regression, showing copy number alteration regions. Right: ADMM LASSO algorithm.

4 Conclusion and Future Work

The sparse precision matrix estimator has been shown to be useful in many applications. Penalizing the matrix is a tool with good asymptotic properties for estimating large sparse covariance and precision matrices. However, the positive definiteness property can easily be violated in practice when the structure is left unconstrained, which prevents its use in many important applications such as graphical models, financial assets, and comparative genomic hybridization.

In this paper, we have expressed the precision matrix estimation equation in a convex optimization framework and considered a natural modification that imposes positive definiteness and structural constraints. We have developed a fast alternating direction method of multipliers (ADMM) to solve the constrained optimization problem, and the resulting estimator retains the sparsity and positive definiteness properties simultaneously. We are currently working on demonstrating the general validity of the method, and its advantages over correlation networks based on competing precision matrix estimators, using computer-simulated reaction systems, in order to demonstrate strong signatures of intracellular pathways and to provide a valuable tool for the unbiased reconstruction of metabolic reactions from large-scale metabolomics data sets.