1 Introduction

In high-dimensional data analysis, estimating the population covariance matrix \(\varvec{\varSigma }\) and the precision matrix \(\varvec{\varOmega }=\varvec{\varSigma }^{-1}\) is a fundamental problem. The covariance matrix or the precision matrix characterizes the structure of the relations among covariates. Many statistical inference procedures can benefit from these structures if they can be precisely estimated. We refer to Tong et al. (2014), Fan et al. (2016) and Cai (2017) for recent reviews.

In the high-dimensional regime where the data dimension is very large, estimating the covariance matrix or the precision matrix is challenging since the number of free parameters grows quadratically with the data dimension. In the literature, a variety of methods have been proposed to estimate \(\varvec{\varSigma }\) or \(\varvec{\varOmega }\). For the population covariance matrix \(\varvec{\varSigma }\), Bickel and Levina (2008) and El Karoui (2008) constructed consistent estimators by thresholding the sample covariance matrix. This idea was further developed by Rothman et al. (2009) and Cai and Liu (2011). For precision matrix estimation, many methods are based on penalization. Yuan and Lin (2007) considered an \(\ell _1\)-penalized Gaussian maximum likelihood estimator and Friedman et al. (2008) designed an efficient block coordinate descent algorithm for this so-called graphical Lasso method. Noting that estimating each column of the precision matrix can be formulated as a linear regression problem, Cai et al. (2011) proposed a constrained \(\ell _1\) minimization estimator (CLIME) based on the Dantzig selector (Candes and Tao 2007), and a similar estimator was also introduced by Yuan (2010). Liu and Luo (2015) further proposed a sparse column-wise inverse operator (SCIO) estimator, which is essentially a Lasso-type analog of CLIME. Zhang and Zou (2014) considered a symmetric loss function which yields a symmetric estimator directly, whereas other methods (e.g., Yuan 2010; Cai et al. 2011; Liu and Luo 2015) need an additional symmetrization step. A large number of papers address covariance and precision matrix estimation. For related methods and their connections, see the book by Wainwright (2019) (e.g., Chapters 6 and 11).

Under high-dimensional settings, to obtain a consistent estimator of the covariance matrix or the precision matrix, a sparsity condition is often imposed on the true matrix; namely, many entries of the matrix are exactly zero or nearly so. A target matrix \(\mathbf{A}=(a_{ij})\in {\mathbb {R}}^{p\times p}\) is said to be (strongly) sparse or \(\ell _0\) sparse if

$$\begin{aligned} \ell _0~\hbox {sparse:}~\max _{j=1,\ldots ,p} \sum _{i=1}^p |a_{ij}|^0 \le s. \end{aligned}$$

That is, each column has at most s nonzero elements. The \(\ell _0\) sparsity condition requires that most entries of the matrix are exactly zero. A natural relaxation is to consider the weak sparsity or \(\ell _q\) sparsity, that is

$$\begin{aligned} \ell _q ~\hbox {sparse:}~\max _{j=1,\ldots ,p} \sum _{i=1}^p |a_{ij}|^q \le s_q, \end{aligned}$$

for some \(q \in [0,1)\), where \(s_q>0\) is a radius. Note that the \(\ell _0\) sparsity condition is a special case of the \(\ell _q\) sparsity condition with \(q=0\) and \(s_q=s\). For estimating the covariance matrix \(\varvec{\varSigma }\), Bickel and Levina (2008) first provided a consistency result under the \(\ell _q\) sparsity condition. Later, Rothman et al. (2009) and Cai and Liu (2011) also studied estimators under the weak \(\ell _q\) sparsity condition. These estimators have explicit forms based on thresholding the elements of the sample covariance matrix and the theoretical analysis is relatively straightforward. As for the precision matrix, the theoretical analysis is more challenging since the estimator usually does not have an explicit form. Ravikumar et al. (2011) first established convergence rates for the graphical Lasso. In detail, under the strong sparsity condition, they derived bounds under the element-wise infinity norm, the spectral norm and the Frobenius norm. The graphical Lasso, SCIO (Liu and Luo 2015) and the D-trace method (Zhang and Zou 2014) are all based on a loss function with an \(\ell _1\) penalty term. Technically, they all used the primal-dual witness technique (Wainwright 2009) or its extensions to prove consistency under the \(\ell _0\) sparsity condition. In particular, in order to apply the primal-dual technique, an irrepresentability condition is necessary; see Assumption 1 of Ravikumar et al. (2011), Section 5.2 of Zhang and Zou (2014) and formula (4) in Liu and Luo (2015). From the perspective of variable selection or support recovery (Meinshausen and Bühlmann 2006), the strong sparsity condition is a reasonable and important criterion to evaluate the estimation. However, from the perspective of matrix estimation, this condition is too restrictive. For example, the Toeplitz matrix \(\varvec{\varOmega }=\left( \rho ^{|i-j|}\right) _{p \times p}\) for some \(\rho \in (-1,1)\) is an important matrix in statistics which does not satisfy the strong sparsity condition.
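To make the two sparsity measures concrete, the following minimal sketch (in Python with NumPy; the dimension, the value of \(\rho \) and the exponent q are illustrative choices) compares the \(\ell _0\) and \(\ell _q\) column measures for the Toeplitz precision matrix above.

```python
import numpy as np

def column_lq_measure(A, q):
    """Maximum column-wise sum of |a_ij|^q (with the convention 0^0 = 0 for q = 0)."""
    if q == 0:
        return np.max(np.sum(A != 0, axis=0))
    return np.max(np.sum(np.abs(A) ** q, axis=0))

p, rho, q = 200, 0.5, 0.5
Omega = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # Toeplitz (rho^|i-j|)

print(column_lq_measure(Omega, 0))    # equals p: every entry is nonzero, so l_0 sparsity fails
print(column_lq_measure(Omega, q))    # stays bounded as p grows
print((1 + rho**q) / (1 - rho**q))    # limiting value of the l_q column measure as p grows
```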

In this work, we focus on the \(\ell _q\) sparse or weakly sparse case. Theoretically, for the Dantzig-type methods, Yuan (2010), Cai et al. (2011) and Cai et al. (2016a) studied convergence bounds for weakly sparse matrices. In particular, they derived error bounds under the spectral norm, the matrix element-wise infinity norm and the Frobenius norm. Under the weak sparsity setting, there are few theoretical studies on Lasso-type methods for precision matrix estimation. Among them, Sun and Zhang (2013) and Ren et al. (2015) exploited the scaled Lasso (Sun and Zhang 2012) to establish the optimal convergence rate for their precision matrix estimators under the normality assumption.

In high-dimensional data analysis, there are various specific situations where we need to estimate the precision matrix and many methods have been proposed in the literature. Usually, these methods are based on the well-known CLIME procedure or the graphical Lasso. For example, to study data with heavy-tailed distributions, Avella-Medina et al. (2018) considered a robust approach to estimate the population covariance matrix and the precision matrix. In detail, they used the robust covariance matrix as a pilot estimator and implemented the CLIME method. For non-Gaussian data, Liu et al. (2009) introduced a non-paranormal graphical model to describe the correlation among covariates. Liu et al. (2012) and Xue and Zou (2012) proposed non-parametric rank-based estimators of the covariance matrix and the precision matrix of the non-paranormal graphical model. In detail, Liu et al. (2012) proposed four precision matrix estimators by plugging the rank-based estimate into the Dantzig-type method (Yuan 2010), the CLIME (Cai et al. 2011), the graphical Lasso (Yuan and Lin 2007; Friedman et al. 2008) and the neighbourhood pursuit estimator (Meinshausen and Bühlmann 2006). For matrix valued data, where the covariance matrix and the precision matrix have certain structures such as a Kronecker product, Leng and Tang (2012) and Zhou (2014) proposed feasible methods to estimate the precision matrix. In particular, Leng and Tang (2012) used the graphical Lasso and Zhou (2014) considered both the graphical Lasso and CLIME. These methods are driven by CLIME or the graphical Lasso. Computationally, it is well known that the implementation of CLIME or the graphical Lasso is time-consuming. As our recent work Wang and Jiang (2020) showed, for \(p \gg n\), the computational complexities of SCIO and D-trace are \(O(np^2)\) while that of the graphical Lasso is \(O(p^3)\) in the general case (e.g., Witten et al. 2011, Sect. 3). For CLIME, each column is a Dantzig-selector regression and the state-of-the-art algorithm is “flare” (Li et al. 2015), which is based on the linearized alternating direction method of multipliers proposed by Wang and Yuan (2012). For each column, the subprogram involves a Lasso problem whose computation is time-consuming. A detailed comparison of the computation time of these methods can be found in Sect. 3. Motivated by this appealing computational efficiency, it is natural to ask whether we can establish comparable convergence rates for Lasso-type methods under the weak sparsity case with mild conditions.

In this article, we revisit the SCIO method and generalize its theoretical properties to the \(\ell _q\) sparsity condition by a new analysis which is different from the proof of Liu and Luo (2015). In detail, we exploit the oracle inequality of the Lasso (Ye and Zhang 2010; Sun and Zhang 2012) and obtain a basic inequality for the SCIO method. We can therefore derive error bounds from the basic inequality directly and relax the common irrepresentability condition which is necessary for the primal-dual witness technique. Accordingly, we provide a unified framework for the SCIO method based on different types of covariance matrix estimation under various cases. These new SCIO-based methods achieve the same consistency results as the ones based on CLIME and can be implemented more efficiently.

The SCIO estimation can be regarded as an example of general M-estimation (Wainwright 2019, Chapter 9). Negahban et al. (2012) provided a unified framework of M-estimators with decomposable regularizers including the \(\ell _1\) penalty. They obtained error bounds under weak sparsity for several applications. However, the analysis of Negahban et al. (2012) is based on the restricted strong convexity (RSC) condition, which holds for Gaussian or sub-Gaussian distributions. For other complicated cases such as the heavy-tailed or the non-paranormal setting, it is challenging to verify the RSC condition. In contrast, our analysis is based on the basic inequality and the structure of the precision matrix, such as the symmetrization procedure (Cai et al. 2011). We can establish reliable convergence rates for the SCIO method under weak sparsity thanks to its neat structure, and provide a unified framework which is applicable to various distributional cases including the heavy-tailed and non-paranormal distributions.

The rest of the paper is organized as follows. In Sect. 2, we revisit the SCIO method and derive non-asymptotic results under the \(\ell _q\) sparsity condition. In Sect. 3, we consider four covariance matrix estimators: the common sample covariance matrix for sub-Gaussian data, a Huber-type estimator for heavy-tailed data (Avella-Medina et al. 2018), the nonparametric correlation estimator for non-paranormal data (Liu et al. 2009) and the estimator for matrix variate data (Zhou 2014). By plugging these estimators into the SCIO procedure, we can estimate the corresponding precision matrix and derive error bounds under the weak sparsity condition. Moreover, we conduct simulations to illustrate the performance of these estimators. Finally, we provide some brief comments in Sect. 4; all technical proofs of our theorems, propositions and corollaries are relegated to the Appendix.

2 Main results

We begin with some basic notations and definitions. For a vector \(\mathbf{a}=(a_1,\ldots ,a_p)^\top \in {\mathbb {R}}^p\), the vector norms are defined as follows

$$\begin{aligned} |\mathbf{a}|_\infty =\max _{1\le i \le p}{|a_i|}, ~|\mathbf{a}|_1=\sum _{i=1}^{p}|a_i|, \hbox {and}~ |\mathbf{a}|_2=\sqrt{\sum _{i=1}^{p}a_i^2}. \end{aligned}$$

For a matrix \(\mathbf{A}=(a_{ij})\in {\mathbb {R}}^{p\times q}\), we define the matrix norms:

  • The element-wise \(\ell _{\infty }\) norm \(\Vert \mathbf{A}\Vert _{\infty }=\max _{1\le i \le p, 1 \le j \le q} |a_{ij}|\);

  • The spectral norm \(\Vert \mathbf{A}\Vert _2=\sup _{|\mathbf {x}|_2\le 1}|\mathbf{A}\mathbf {x}|_2\);

  • The matrix \(\ell _1\) norm \(\Vert \mathbf{A}\Vert _{L_1}=\max _{1\le j\le q}\sum _{i=1}^{p}|a_{ij}|\);

  • The Frobenius norm \(\Vert \mathbf{A}\Vert _F=\sqrt{\sum _{i=1}^p\sum _{j=1}^{q}a_{ij}^2}\);

  • The element-wise \(\ell _1\) norm \(\Vert \mathbf{A}\Vert _1=\sum _{i=1}^p\sum _{j=1}^{q}|a_{ij}|\).

For index sets \(J\subseteq \{1,\ldots , p\}\) and \(K \subseteq \{1,\ldots , q\}\), \(\mathbf{A}_{J,\cdot }\) and \(\mathbf{A}_{\cdot ,K}\) denote the sub-matrices of \(\mathbf{A}\) with rows or columns whose indices belong to J or K, respectively. In particular, \(\mathbf{A}_{i,\cdot }\) and \(\mathbf{A}_{\cdot ,j}\) are the ith row and jth column, respectively. For a set J, |J| denotes the cardinality of J. For two real sequences \(\{a_n\}\) and \(\{b_n\}\), write \(a_n=O(b_n)\) if there exists a constant C such that \(|a_n|\le C|b_n|\) holds for large n, and \(a_n=o(b_n)\) if \(\lim _{n \rightarrow \infty }a_n/b_n=0\). The constants \(C,C_0,C_1,\ldots \) may take different values at each appearance.
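For reference, these norms can be computed directly; a minimal NumPy sketch follows (the example matrix is arbitrary).

```python
import numpy as np

A = np.array([[1.0, -2.0,  0.5],
              [0.0,  3.0, -1.0]])              # an arbitrary 2 x 3 example

elem_inf  = np.max(np.abs(A))                  # element-wise l_infinity norm
spectral  = np.linalg.norm(A, 2)               # spectral norm (largest singular value)
matrix_l1 = np.max(np.sum(np.abs(A), axis=0))  # matrix l_1 norm: maximum absolute column sum
frobenius = np.linalg.norm(A, 'fro')           # Frobenius norm
elem_l1   = np.sum(np.abs(A))                  # element-wise l_1 norm
```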

2.1 SCIO revisited

Let \(\widehat{\varvec{\varSigma }}\) be an arbitrary estimator of the population covariance matrix \(\varvec{\varSigma }\). Taking \(\widehat{\varvec{\varSigma }}\) as the common sample covariance matrix, Liu and Luo (2015) proposed the SCIO method, which estimates the precision matrix column by column. Let \(\varvec{e}_i\) be the ith column of the \(p\times p\) identity matrix and write \(\varvec{\varOmega }=(\varvec{\beta }_1,\ldots ,\varvec{\beta }_p)\). For each \(i=1,\ldots ,p\), we have

$$\begin{aligned} \varvec{\varSigma }\varvec{\beta }_i=\varvec{e}_i, \end{aligned}$$

and SCIO estimates the vector \(\varvec{\beta }_i\) via the \(\ell _1\)-penalized form

$$\begin{aligned} \widehat{\varvec{\beta }}_i={{\,\mathrm{arg\,min}\,}}_{\varvec{\beta }\in {\mathbb {R}}^{p}}\left\{ \frac{1}{2} \varvec{\beta }^\top \widehat{\varvec{\varSigma }}\varvec{\beta }-\varvec{e}_i ^\top \varvec{\beta }+\lambda |\varvec{\beta }|_{1}\right\} , \end{aligned}$$
(1)

where \(\lambda \ge 0\) is a tuning parameter. By stacking the resulting \(\widehat{\varvec{\beta }}_i\) together, we obtain the precision matrix estimator \(\widehat{\varvec{\varOmega }}=(\widehat{\varvec{\beta }}_1,\ldots ,\widehat{\varvec{\beta }}_p)\). Since \(\widehat{\varvec{\varOmega }}\) may be asymmetric, a further symmetrization step is necessary. The final SCIO estimator \(\tilde{\varvec{\varOmega }}=({\tilde{\omega }}_{i j})_{p \times p}\) is defined as

$$\begin{aligned} {\tilde{\omega }}_{i j}={\tilde{\omega }}_{j i}=\left\{ \begin{array}{ll} {\hat{\beta }}_{i j} &{} \hbox { if}\ |{\hat{\beta }}_{i j}|\le |{\hat{\beta }}_{j i}|;\\ {\hat{\beta }}_{ji} &{} \hbox {otherwise.}\end{array} \right. \end{aligned}$$
(2)
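To make the column-wise estimation (1) and the symmetrization step (2) concrete, here is a minimal sketch based on plain coordinate descent; it is an illustration rather than the algorithm of Liu and Luo (2015) or an optimized implementation, and the iteration cap and tolerance are arbitrary choices.

```python
import numpy as np

def scio_column(Sigma_hat, i, lam, max_iter=500, tol=1e-8):
    """Minimize 0.5*beta' Sigma_hat beta - e_i' beta + lam*|beta|_1 by coordinate descent."""
    p = Sigma_hat.shape[0]
    beta = np.zeros(p)
    e_i = np.zeros(p)
    e_i[i] = 1.0
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            # partial residual that excludes coordinate j, then soft-threshold
            r_j = e_i[j] - Sigma_hat[j, :] @ beta + Sigma_hat[j, j] * beta[j]
            beta[j] = np.sign(r_j) * max(abs(r_j) - lam, 0.0) / Sigma_hat[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def scio(Sigma_hat, lam):
    p = Sigma_hat.shape[0]
    B = np.column_stack([scio_column(Sigma_hat, i, lam) for i in range(p)])
    # symmetrization (2): between the (i, j) and (j, i) entries, keep the one of smaller magnitude
    return np.where(np.abs(B) <= np.abs(B.T), B, B.T)
```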

From a methodological perspective, SCIO is closely related to other popular precision matrix estimation methods. For example, the dual problem of (1) is the Dantzig-type optimization problem

$$\begin{aligned} \min |\varvec{\beta }|_{1} \quad \text{ subject } \text{ to } \quad |\widehat{\varvec{\varSigma }}\varvec{\beta }-\varvec{e}_i|_{\infty } \le \lambda , \end{aligned}$$
(3)

which is exactly the CLIME method (Cai et al. 2011). In a matrix form, (1) is equivalent to

$$\begin{aligned} \widehat{\varvec{\varOmega }}={{{\,\mathrm{arg\,min}\,}}_{\mathbf{B}\in {\mathbb {R}}^{p \times p}}}\left\{ \frac{1}{2}\hbox {tr}\left( \mathbf{B}\widehat{\varvec{\varSigma }}\mathbf{B}^\top \right) -\hbox {tr}(\mathbf{B}) +\lambda \Vert \mathbf{B}\Vert _1 \right\} . \end{aligned}$$

To address the fact that the objective function above is not symmetric in \(\mathbf{B}\), Zhang and Zou (2014) proposed the D-trace method, which uses the loss function

$$\begin{aligned} \frac{1}{4} \hbox {tr}\left( \mathbf{B}\widehat{\varvec{\varSigma }}\mathbf{B}^\top \right) +\frac{1}{4} \hbox {tr}\left( \mathbf{B}^\top \widehat{\varvec{\varSigma }}\mathbf{B}\right) -\hbox {tr}(\mathbf{B}). \end{aligned}$$

Computationally, SCIO and D-trace use quadratic loss functions which can be solved efficiently via standard optimization algorithms. More details can be found in our recent work (Wang and Jiang 2020).

From a theoretical perspective, SCIO (Liu and Luo 2015), the graphical Lasso (Ravikumar et al. 2011) and D-trace (Zhang and Zou 2014) are all based on a loss function combined with an \(\ell _1\) penalty term. To show consistency of the estimation, they all used the primal-dual witness technique (Wainwright 2009) or its extensions under the \(\ell _0\) sparsity condition. To apply the primal-dual technique, an irrepresentability condition is also necessary; see Assumption 1 of Ravikumar et al. (2011), Section 5.2 of Zhang and Zou (2014) and formula (4) in Liu and Luo (2015). In this work, we focus on exploring the theoretical properties of SCIO under the weak sparsity condition while relaxing the irrepresentability condition.

2.2 SCIO for weak sparsity

Note that Ye and Zhang (2010) and Raskutti et al. (2011) considered the Lasso under the \(\ell _q\) sparsity condition. Here we study SCIO under the weak \(\ell _q\) sparsity condition.

Before presenting the error bounds, we define the class of \(s_p\)-sparse matrices:

$$\begin{aligned} {\mathcal {U}}_q(s_p,M_p)=\left\{ \varvec{\varOmega }=(\omega _{ij})_{p\times p}:\varvec{\varOmega }\succ 0 , \max _{1 \le j \le p} \sum _{i=1}^{p} {\left| \omega _{ij}\right| ^q} \le s_{p}, \Vert \varvec{\varOmega }\Vert _{L_{1}} \le M_{p}\right\} , \end{aligned}$$

where \(s_p\) represents the column-wise sparsity of the precision matrix and \(M_p\) may grow with the data dimension p. This matrix class was defined by Cai et al. (2011); see also Bickel and Levina (2008) for a similar definition for the population covariance matrix.

The first theorem gives a convergence bound in vector norms for the optimization problem (1). This bound is established over the class of \(s_p\)-sparse matrices and is stated from a non-asymptotic viewpoint.

Theorem 1

Suppose \(\varvec{\varOmega }\in {\mathcal {U}}_q(s_p,M_p)\) for some \(0 \le q <1\). Assume that \(\lambda \ge 3\Vert \varvec{\varOmega }\Vert _{L_1}\Vert \widehat{\varvec{\varSigma }}-\varvec{\varSigma }\Vert _\infty \) and \(\Vert \varvec{\varOmega }\Vert _{L_1}^{-q}\lambda ^{1-q}s_p\le \frac{1}{2}\) hold. Then

$$\begin{aligned}&|\widehat{\varvec{\beta }}-\varvec{\beta }^*|_1\le 16\Vert \varvec{\varOmega }\Vert ^{1-q}_{L_1}s_p\lambda ^{1-q},\\&|\widehat{\varvec{\beta }}-\varvec{\beta }^*|_\infty \le 4\Vert \varvec{\varOmega }\Vert _{L_1}\lambda . \end{aligned}$$

The technical proof of Theorem 1 is based on a basic inequality analogous to the one proposed by Sun and Zhang (2012). We only consider the special case where \(w=\varvec{\beta }^*\) in Sun and Zhang (2012). Ren et al. (2015) used a similar trick and defined another notion of weak sparsity based on a capped-\(\ell _1\) measure. Their analysis depends on the Gaussian assumption and each element estimate \({\hat{\omega }}_{i j}\) requires solving a scaled Lasso problem, while SCIO can be solved efficiently and yields the precision matrix estimate directly (Wang and Jiang 2020).

Remark 1

Sun and Zhang (2013) and Ren et al. (2015) used an alternative definition of weak sparsity, i.e., the capped \(\ell _1\) measure, defined as \(s_{t}(\varvec{\varOmega })=\max _{j}\sum _{i=1}^{p}\min \{1,|\beta ^{*}_{ij}|/t\}\) for a threshold parameter t. See also the expository paper by Cai et al. (2016b). For every column j, by taking the index set \(J=\{i~|~|\beta ^*_{ij}|>t\}\), we have

$$\begin{aligned} \sum _{i=1}^{p}\min \{1,|\beta ^*_{ij}|/t\}=|J|+|\varvec{\beta }^{*}_{J^c}|_1/t. \end{aligned}$$

A slight modification of the proof in Theorem 1 yields

$$\begin{aligned} |\widehat{\varvec{\beta }}-\varvec{\beta }^*|_1\le \max \{12|\varvec{\beta }^*_{J^c}|_1, 16\lambda |J|\Vert \varvec{\varOmega }\Vert _{L_1}\}\le 16ts_{t}(\varvec{\varOmega }) \end{aligned}$$

with \(t=\lambda \Vert \varvec{\varOmega }\Vert _{L_1}\). Thus we can extend our theoretical result to the capped \(\ell _1\) measure. Moreover, by the discussion in Ren et al. (2015), our result can also be extended to the weak \(\ell _q\) ball sparsity condition.

Given the non-asymptotic bounds on \(\widehat{\varvec{\beta }}-\varvec{\beta }^*\), we remark that the symmetrization step (2) is also crucial for the precision matrix estimation. Yuan (2010) used another symmetrization procedure based on an optimization problem. In the next theorem, we present non-asymptotic bounds in matrix norms between the symmetric SCIO estimator \(\tilde{\varvec{\varOmega }}\) and the true precision matrix \(\varvec{\varOmega }\).

Theorem 2

Suppose \(\varvec{\varOmega }\in {\mathcal {U}}_q(s_p,M_p)\) for some \(0 \le q <1\). Assume that \(\lambda \ge 3\Vert \varvec{\varOmega }\Vert _{L_1}\Vert \widehat{\varvec{\varSigma }}-\varvec{\varSigma }\Vert _\infty \) and \(\Vert \varvec{\varOmega }\Vert _{L_1}^{-q}\lambda ^{1-q}s_p\le \frac{1}{2}\) hold. Then

$$\begin{aligned} \Vert \tilde{\varvec{\varOmega }}-\varvec{\varOmega }\Vert _\infty \le 4\Vert \varvec{\varOmega }\Vert _{L_1}\lambda , \end{aligned}$$
(4)

and

$$\begin{aligned}&\Vert \tilde{\varvec{\varOmega }}-\varvec{\varOmega }\Vert _{L_1}\le 66 ( \lambda \Vert \varvec{\varOmega }\Vert _{L_1} )^{1-q}s_p. \end{aligned}$$
(5)

Theorem 2 provides a unified framework for establishing convergence rates for the SCIO method. For any covariance matrix estimator \(\widehat{\varvec{\varSigma }}\) satisfying \(\Vert \widehat{\varvec{\varSigma }}-\varvec{\varSigma }\Vert _\infty =o_p(1)\), error bounds for the SCIO estimator under the matrix \(\ell _\infty \) norm and the matrix \(\ell _1\) norm are provided. Some specific examples with different choices of \(\widehat{\varvec{\varSigma }}\) will be discussed in the next section. Note that we could further refine the error bound in Theorem 2 by using the tuning parameter \(\lambda \ge 3\Vert \widehat{\varvec{\varSigma }}\varvec{\varOmega }-\mathbf {I}\Vert _{\infty }\). However, for some covariance matrix estimators \(\widehat{\varvec{\varSigma }}\), it is not trivial to bound \(\Vert \widehat{\varvec{\varSigma }}\varvec{\varOmega }-\mathbf {I}\Vert _{\infty }\). For conciseness and uniformity of our statements, we consider \(\lambda \ge 3\Vert \varvec{\varOmega }\Vert _{L_1}\Vert \widehat{\varvec{\varSigma }}-\varvec{\varSigma }\Vert _\infty \) here and will comment on achieving the optimal bound in the specific applications later. Moreover, the \(\ell _\infty \) norm bound (4) and the matrix \(\ell _1\) norm bound (5) are very useful and yield other matrix bounds directly. For example, by the Gershgorin circle theorem, we obtain the bound for the spectral norm

$$\begin{aligned} \Vert \tilde{\varvec{\varOmega }}-\varvec{\varOmega }\Vert _{2} \le \Vert \tilde{\varvec{\varOmega }}-\varvec{\varOmega }\Vert _{L_1} \le 66 ( \lambda \Vert \varvec{\varOmega }\Vert _{L_1} )^{1-q}s_p, \end{aligned}$$

and for the Frobenius norm, we have

$$\begin{aligned} \frac{1}{p}\Vert \tilde{\varvec{\varOmega }}-\varvec{\varOmega }\Vert _{F}^{2}&\le \Vert \tilde{\varvec{\varOmega }}-\varvec{\varOmega }\Vert _{L_1}\Vert \tilde{\varvec{\varOmega }}-\varvec{\varOmega }\Vert _{\infty }\\&\le 264 \lambda ^{2-q} \Vert \varvec{\varOmega }\Vert ^{1-q}_{L_1} s_p. \end{aligned}$$

The matrix \(\ell _1\) norm bound (5) also plays an important role in many statistical inference problems. For example, an appropriate matrix \(\ell _1\) bound can help establish the asymptotic distribution of the test statistics in Cai et al. (2014) or lead to the consistency of the thresholding estimation in Wang et al. (2019).

3 Applications of the unified framework

To illustrate the non-asymptotic bounds of Theorem 2, we apply the SCIO method to several covariance matrix estimators. In detail, for each plug-in covariance estimator \(\widehat{\varvec{\varSigma }}\), we derive a high-probability bound on \(\Vert \widehat{\varvec{\varSigma }}-\varvec{\varSigma }\Vert _\infty \) and apply Theorem 2 to show the consistency of the final precision matrix estimator \(\tilde{\varvec{\varOmega }}\).

3.1 Sample covariance matrix

As a motivating application, we study the sample covariance matrix

$$\begin{aligned} \widehat{\varvec{\varSigma }}_1=\frac{1}{n} \sum _{k=1}^{n}\left( \mathbf {X}_{k}-{\bar{\mathbf {X}}}\right) \left( \mathbf {X}_{k}-{\bar{\mathbf {X}}}\right) ^\top \end{aligned}$$

where \(\mathbf {X}_k \in {\mathbb {R}}^p\), \(k=1,\ldots ,n\), are independent and identically distributed (i.i.d.) samples and \({\bar{\mathbf {X}}}=n^{-1} \sum _{k=1}^{n} \mathbf {X}_{k}\) is the sample mean. Liu and Luo (2015) analyzed the sample covariance matrix and derived the consistency of SCIO under the \(\ell _0\) sparsity condition together with a related irrepresentability condition. Aiming at the more general \(\ell _q\) sparsity setting, we state the following technical conditions:

  1. (A1).

    (Sparsity restriction) Suppose that \(\varvec{\varOmega }\in {\mathcal {U}}_q(s_p,M_p)\) for a given \(q\in [0,1)\), where \(s_p\) and \(M_p\) satisfy the following assumption:

    $$\begin{aligned} s_{p}M_p^{1-2q}=o\left( \left( \frac{n}{\log p}\right) ^{\frac{1-q}{2}}\right) . \end{aligned}$$
  2. (A2).

    (Exponential-type tails) Suppose that \(\log p = o(n)\). There exist positive numbers \(\eta >0\) and \(K>0\) such that

    $$\begin{aligned} \mathbf {E}\exp \left( \eta \left( X_{i}-\mu _{i}\right) ^{2}\right) \le K, \end{aligned}$$

    for all \(1 \le i \le p\).

  3. (A3).

    (Polynomial-type tails) Suppose that \(p\le cn^{\gamma }\) for some \(\gamma , c>0\) and

    $$\begin{aligned} \mathbf {E}|X_i-\mu _i|^{4\gamma +4+\delta }\le K, \end{aligned}$$

    for some \(\delta >0\) and all \(1 \le i \le p\).

Condition (A1) is an analogue of formula (3) in Liu and Luo (2015), which corresponds to the special case \(q=0\). Conditions (A2) and (A3) are regular conditions used to control the tail probabilities of the variables; see also the assumptions of Cai et al. (2011). Under condition (A2) or (A3), Liu and Luo (2015) proved the following proposition:

Proposition 1

(Lemma 1, Liu and Luo 2015) For a given \(\tau >0\) and a sufficiently large constant C, we have

$$\begin{aligned} \mathbf {P}\left( \Vert \widehat{\varvec{\varSigma }}_1-\varvec{\varSigma }\Vert _{\infty }\ge C\sqrt{\frac{\log {p}}{n}}\right) \le O\left( p^{-\tau }\right) , \end{aligned}$$

under the assumption (A2) or

$$\begin{aligned} \mathbf {P}\left( \Vert \widehat{\varvec{\varSigma }}_1-\varvec{\varSigma }\Vert _{\infty }\ge C\sqrt{\frac{\log {p}}{n}}\right) \le O\left( p^{-\tau }+n^{-\frac{\delta }{8}}\right) , \end{aligned}$$

under the assumption (A3).

With these results, for the estimator \({\tilde{\varvec{\varOmega }}}_1\) obtained by plugging \(\widehat{\varvec{\varSigma }}_1\) into SCIO, we are ready to state our main results under the \(\ell _q\) sparsity setting.

Corollary 1

Let \(\lambda =C_0 \sqrt{\log p / n}\) with \(C_0\) being a sufficiently large number. For \(\varvec{\varOmega }\in {\mathcal {U}}_q(s_p,M_p)\), under assumptions (A1) and (A2) or (A3), we have

$$\begin{aligned}&\Vert \tilde{\varvec{\varOmega }}_1-\varvec{\varOmega }\Vert _{\infty } \le C_1 M_{p}^2 \sqrt{\frac{\log p}{n}},\\&\Vert \tilde{\varvec{\varOmega }}_1-\varvec{\varOmega }\Vert _{L_1} \le C_2 s_p M_{p}^{2-2q}\left( \frac{\log p}{n}\right) ^{\frac{1}{2}(1-q)}, \end{aligned}$$

with probability greater than \(1-O\left( p^{-\tau }\right) \) or \(1-O\left( p^{-\tau }+n^{-\frac{\delta }{8}}\right) \), respectively. Here \(C_1,C_2\) are sufficiently large constants which depend only on \(q, s_p, M_p, C_0, \eta ,K, \delta \).

By plugging the sample covariance matrix into SCIO, the estimation (1) is similar to the classical Lasso regression problem and the error bounds considered here are analogous to the prediction error bounds for the Lasso (Wainwright 2019, Theorem 7.20). We adopt an analysis different from the primal-dual witness technique considered in Liu and Luo (2015) and remove the irrepresentability condition to obtain error bounds under the \(\ell _q\) sparsity setting. Correspondingly, there are no variable selection consistency results, since the notion of variable selection is ambiguous under \(\ell _q\) sparsity. Moreover, we have the following remarks.

Remark 2

Compared to the CLIME method (Cai et al. 2011), we derive the same convergence rates under the same conditions. This verifies the dual relation between the Lasso and the Dantzig selector. Bickel et al. (2009) showed this point for the regression model, and the results here demonstrate that Lasso-type and Dantzig-type methods for precision matrix estimation also exhibit similar behavior.

Remark 3

If we impose stronger conditions on the tail distribution of \(\varvec{\varOmega }X_i\), i.e., conditions (C2) and (C2*) in Liu and Luo (2015), we can get

$$\begin{aligned} \mathbf {P}\left( \max _{1 \le i \le p}|\widehat{\varvec{\varSigma }}_1 \varvec{\beta }_{i}^{*}-\varvec{e}_{i}|_{\infty } \ge C \sqrt{\frac{\log p}{n}}\right) = O(p^{-\tau }), \end{aligned}$$

or

$$\begin{aligned} \mathbf {P}\left( \max _{1 \le i \le p}|\widehat{\varvec{\varSigma }}_1 \varvec{\beta }_{i}^{*}-\varvec{e}_{i}|_{\infty } \ge C \sqrt{\frac{\log p}{n}}\right) = O(p^{-\tau }+n^{-\frac{\delta }{8}}), \end{aligned}$$

where \(\varvec{\beta }_i^*\) is the i-th column of the true precision matrix \(\varvec{\varOmega }\). Then, with some additional effort, the error bounds

$$\begin{aligned}&\Vert \tilde{\varvec{\varOmega }}_1-\varvec{\varOmega }\Vert _{\infty } \le C_1 M_{p} \sqrt{\frac{\log p}{n}},\\&\Vert \tilde{\varvec{\varOmega }}_1-\varvec{\varOmega }\Vert _{L_1} \le C_2 s_p M_{p}^{1-q}\left( \frac{\log p}{n}\right) ^{\frac{1}{2}(1-q)}, \end{aligned}$$

hold with probability greater than \(1-O\left( p^{-\tau }\right) \) or \(1-O\left( p^{-\tau }+n^{-\frac{\delta }{8}}\right) \). These convergence rates actually achieve the minimax rate for estimating the true precision matrix \(\varvec{\varOmega }\in {\mathcal {U}}_q(s_p,M_p)\). See Cai et al. (2016a) for more details.

The main motivation to study SCIO for weak sparsity is that it is computationally more efficient than other methods such as CLIME or the graphical Lasso (e.g., Table 1 of Wang and Jiang 2020). Here, we further conduct several simulations to compare the computation time of these methods. In detail, we include the scaled Lasso method (SLasso), implemented with the R package “scalreg” provided by Sun and Zhang (2013); the CLIME method, implemented with the R package “flare” developed by Li et al. (2015); the graphical Lasso (gLasso), implemented with the R packages “glasso” and “BigQuic” or the ADMM algorithm (Boyd et al. 2011, Section 6.5); and the D-trace and SCIO methods, implemented with the R package “EQUAL” developed by Wang and Jiang (2020). SLasso uses its default settings; for all other methods, the computation time is recorded in seconds and averaged over 5 replications on a solution path with 50 \(\lambda \) values ranging from \(\lambda _{max}\) to \(\lambda _{max} \sqrt{\log {p}/n} \). Here \(\lambda _{max}\) is the maximum absolute off-diagonal element of the sample covariance matrix. All methods are evaluated on an Intel Core i7 3.3GHz under R version 4.2.1 with an optimized BLAS implementation for Mac hardware. Table 1 summarizes the computation time. Although the stopping criterion differs across methods, we can see from Table 1 the superior efficiency of the SCIO method.

Table 1 The average computation time (standard deviation) of the precision matrix estimation with \(n=200\)
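For reference, the solution path used above can be constructed as follows; this is a sketch, and since the text does not state how the 50 values are spaced, a geometric grid is assumed.

```python
import numpy as np

def lambda_path(S, n, n_lambda=50):
    """Grid of penalty values from lambda_max down to lambda_max * sqrt(log(p)/n),
    where lambda_max is the largest absolute off-diagonal entry of the sample covariance S."""
    p = S.shape[0]
    lam_max = np.max(np.abs(S - np.diag(np.diag(S))))
    lam_min = lam_max * np.sqrt(np.log(p) / n)
    return np.geomspace(lam_max, lam_min, n_lambda)  # geometric spacing is an assumption
```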

To further investigate the numerical performance of the SCIO estimation, we compare it with SLasso, gLasso, CLIME and D-trace. While SLasso is tuning-free, we implement a five-fold cross-validation procedure to select the tuning parameter \(\lambda \) for all other methods. The tuning parameter \(\lambda \) is selected from 50 different values by minimizing the quadratic loss

$$\begin{aligned} {\text {Loss}}(\varvec{\varOmega })=\frac{1}{2}\hbox {tr}\left( \varvec{\varOmega }\widehat{\varvec{\varSigma }}\varvec{\varOmega }^\top \right) -\hbox {tr}(\varvec{\varOmega }) \end{aligned}$$

where \(\varvec{\varOmega }\) is computed based on the training sample and \(\widehat{\varvec{\varSigma }}\) is the sample covariance matrix of the test sample. To alleviate the bias of the \(\ell _1\) penalty, we also include the relaxed version (Meinshausen and Bühlmann 2006; Hastie et al. 2020) of each estimator, where a two-stage refitted estimator is obtained based on the support of the original estimator. These estimators are denoted by SLasso-R, gLasso-R, CLIME-R, D-trace-R and SCIO-R. Table 2 presents the estimation error for dimensions \(p=100,200,400\) based on 100 replications. From the results in Tables 1 and 2, we conclude that SCIO enjoys comparable statistical accuracy with superior computational efficiency in comparison to existing methods.

Table 2 The average statistical error (standard deviation) of the precision matrix estimation with \(n=200\)
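A minimal sketch of the five-fold cross-validation described above might look as follows; the estimator argument (for instance the scio sketch from Sect. 2.1) and the random fold split are illustrative choices.

```python
import numpy as np

def cv_select_lambda(X, lambdas, estimator, n_folds=5, seed=0):
    """Pick the penalty minimizing 0.5*tr(Omega S_test Omega') - tr(Omega), summed over folds."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    scores = np.zeros(len(lambdas))
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        S_train = np.cov(X[train_idx], rowvar=False)   # covariance of the training sample
        S_test = np.cov(X[test_idx], rowvar=False)     # covariance of the test sample
        for k, lam in enumerate(lambdas):
            Omega = estimator(S_train, lam)
            scores[k] += 0.5 * np.trace(Omega @ S_test @ Omega.T) - np.trace(Omega)
    return lambdas[np.argmin(scores)]
```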

Next, we conduct simulations to illustrate the developed theoretical results. First, to show that the irrepresentable condition is not necessary, we revisit the diamond graph example in Ravikumar et al. (2011) and consider a block precision matrix:

$$\begin{aligned} \varvec{\varOmega }=\left( {\text {diag}}\left( \mathbf{A},\cdots ,\mathbf{A}\right) \right) ^{-1}, \end{aligned}$$

where

$$\begin{aligned} \mathbf{A}= \begin{pmatrix} 1&{}\rho &{}\rho &{}2\rho ^2 \\ \rho &{}1&{}0&{}\rho \\ \rho &{}0&{}1&{}\rho \\ 2\rho ^2&{}\rho &{}\rho &{}1 \end{pmatrix} \in {\mathbb {R}}^{4 \times 4}, \end{aligned}$$

and \(\rho \in \left( -1/\sqrt{2},1/\sqrt{2}\right) \), which ensures the positive definiteness of the covariance matrix. The irrepresentable condition of the graphical Lasso (Ravikumar et al. 2011) holds for \(|\rho |<(\sqrt{2}-1)/2\), and the irrepresentable conditions in Liu and Luo (2015) and Meinshausen and Bühlmann (2006) require that \(|\rho |<1/2\).
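For completeness, the block precision matrix of this example and Gaussian samples from it can be generated as in the following sketch; the number of blocks, the value of \(\rho \) and the sample size are illustrative choices.

```python
import numpy as np
from scipy.linalg import block_diag

def diamond_precision(rho, n_blocks):
    """Omega = (diag(A, ..., A))^{-1} for the 4 x 4 diamond-graph block A."""
    A = np.array([[1.0,          rho, rho, 2.0 * rho**2],
                  [rho,          1.0, 0.0, rho         ],
                  [rho,          0.0, 1.0, rho         ],
                  [2.0 * rho**2, rho, rho, 1.0         ]])
    Sigma = block_diag(*([A] * n_blocks))                # covariance matrix, p = 4 * n_blocks
    return np.linalg.inv(Sigma), Sigma

rng = np.random.default_rng(0)
Omega, Sigma = diamond_precision(rho=0.6, n_blocks=25)   # p = 100; |rho| > 0.5 violates the condition
X = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=200)
```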

Figure 1 shows the performance of the SCIO estimation for \(\rho \in [-0.65,0.65]\). The sample is generated from the multivariate Gaussian distribution \({\mathcal {N}}_p(0,\varvec{\varOmega }^{-1})\) with \(p=100\) and \(n=200\). We plot the spectral norm, the matrix \(\ell _1\) norm and the scaled Frobenius norm of \(\varvec{\varOmega }-\widehat{\varvec{\varOmega }}\). For brevity, the tuning parameter \(\lambda \) is chosen by minimizing the corresponding matrix norm. From these figures, we observe that all the errors vary smoothly as \(\rho \) changes. In particular, the errors do not deteriorate drastically around the critical boundary value \(|\rho |=0.5\). This phenomenon indicates that even though the irrepresentable condition fails when \(|\rho |\ge 0.5\), the performance of the SCIO estimation does not degrade drastically. Therefore, it is reasonable to relax the extra irrepresentable condition for SCIO.

Fig. 1

Plots of the estimation errors versus the parameter \(\rho \) under three norms. The dash vertical lines indicate the boundaries of the irrepresentable condition

To illustrate the consistency results for weakly sparse cases, we further conduct numerical studies where \(\varvec{\varOmega }=(\omega _{ij})_{p \times p}=(\rho ^{|i-j|})_{p \times p}\) for some \(\rho \in (0,1)\). For a fixed \(q \in (0,1)\), we know

$$\begin{aligned} \max _{1 \le j \le p} \sum _{i=1}^{p} {\left| \omega _{ij}\right| ^q} \approx 1+2 \sum _{k=1}^\infty \rho ^{kq}=\frac{1+\rho ^q}{1-\rho ^q}:=s_{p}. \end{aligned}$$

Hence the parameter \(\rho \) measures the sparsity level of the true precision matrix. When \(\rho \) is small, the decay is salient and the matrix tends to be more sparse. When \(\rho \) is large, the elements with small magnitude account for a smaller proportion of all elements.

Figure 2 reports the performance of SCIO for three different sparsity levels \(\rho =0.2\), \(\rho =0.5\) and \(\rho =0.8\). We plot the errors along the solution path for a series of tuning parameters and three methods: SCIO, D-trace and CLIME. From Figure 2, we can see that these methods present similar patterns under all three norms. In other words, SCIO performs similarly to CLIME, which has been proved consistent under the \(\ell _q\) sparsity condition.

Fig. 2

Plots of the estimation errors versus the penalty parameter \(\lambda \) under three sparsity levels based on the sample covariance matrix

3.2 Robust matrix estimation

The sub-Gaussian assumption is crucial in the analysis of the sample covariance matrix. To relax the assumption of exponential-type tails on the covariates, Avella-Medina et al. (2018) introduced a robust matrix estimator which only requires a bounded fourth moment assumption. They constructed a Huber-type estimator for the population covariance matrix and obtained a robust estimator for the precision matrix by plugging the Huber-type estimator into the adaptive CLIME procedure (Cai et al. 2016a).

Given the i.i.d. samples \(\mathbf {X}_k \in {\mathbb {R}}^p\), \(k=1,\ldots ,n\), Avella-Medina et al. (2018) proposed to estimate the covariance and the population mean based on the Huber loss function. In detail, Huber’s mean estimator \({\tilde{\mu }}_i\) satisfies the equation

$$\begin{aligned} \sum _{k=1}^{n} \psi _{H}\left( \mathbf {X}_{ki}-{\tilde{\mu }}_i\right) =0, \end{aligned}$$

and the covariance estimator \({\tilde{\sigma }}_{ij}\) is defined by the equation

$$\begin{aligned} \sum _{k=1}^{n} \psi _{H}\left( \mathbf {X}_{ki}\mathbf {X}_{kj}-({\tilde{\sigma }}_{ij}+{\tilde{\mu }}_i{\tilde{\mu }}_j )\right) =0, \end{aligned}$$

where \(\psi _{H}(x)=\min \{H, \max (-H, x)\}\) denotes the Huber function. Accordingly, we construct a robust estimator \(\tilde{\varvec{\varSigma }}=({\tilde{\sigma }}_{ij})_{p \times p}\) and further project \(\tilde{\varvec{\varSigma }}\) onto a cone of positive definite matrices

$$\begin{aligned} \widehat{\varvec{\varSigma }}_2=\underset{\varvec{\varSigma }\succeq \varepsilon \mathbf {I}}{{\text {argmin}}} \Vert \varvec{\varSigma }-\tilde{\varvec{\varSigma }}\Vert _{\infty }, \end{aligned}$$

where \(\varepsilon \) is a small positive number. This projection step can be easily implemented by the ADMM algorithm; see Datta and Zou (2017) for more details.
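The following sketch illustrates the Huber-type estimating equations via one-dimensional root finding; the root-finding brackets are crude but always contain a root, and the exact \(\ell _\infty \)-projection onto \(\{\varvec{\varSigma }\succeq \varepsilon \mathbf {I}\}\) is replaced by simple eigenvalue clipping, which is a simplification rather than the ADMM procedure of Datta and Zou (2017). The truncation level H should follow the rate \(H\asymp (n/\log p)^{1/2}\) discussed below; here it is left as a user-supplied argument.

```python
import numpy as np
from scipy.optimize import brentq

def psi_huber(x, H):
    return np.clip(x, -H, H)

def huber_location(y, H):
    """Solve sum_k psi_H(y_k - m) = 0 for m; the bracket below always brackets a root."""
    f = lambda m: np.sum(psi_huber(y - m, H))
    return brentq(f, y.min() - H - 1.0, y.max() + H + 1.0)

def huber_covariance(X, H, eps=1e-3):
    n, p = X.shape
    mu = np.array([huber_location(X[:, i], H) for i in range(p)])
    Sigma = np.empty((p, p))
    for i in range(p):
        for j in range(i, p):
            # solve sum_k psi_H(X_ki * X_kj - (sigma_ij + mu_i * mu_j)) = 0 for sigma_ij
            m = huber_location(X[:, i] * X[:, j], H)
            Sigma[i, j] = Sigma[j, i] = m - mu[i] * mu[j]
    # simplified substitute for the l_infinity projection onto {Sigma >= eps*I}: clip eigenvalues
    w, V = np.linalg.eigh(Sigma)
    return V @ np.diag(np.maximum(w, eps)) @ V.T
```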

Avella-Medina et al. (2018) proposed to use \(\widehat{\varvec{\varSigma }}_2\) as a pilot estimator and implemented the adaptive CLIME procedure to estimate the precision matrix. Here we study the SCIO method based on the robust estimator \(\widehat{\varvec{\varSigma }}_2\). Following Avella-Medina et al. (2018), a bounded fourth moment condition is needed:

(A4). Suppose that \(\log {p}=o(n)\), and there exists a positive number \(K>0\) such that

$$\begin{aligned} \mathbf {E}\left( X_i-\mathbf {\mu }_i\right) ^4\le K \end{aligned}$$

for all \(1\le i\le p\).

Compared to the polynomial-type tail assumption (A3), assumption (A4) relaxes the moment order requirement from \(4\gamma +4+\delta \) to 4 and allows the result to hold for a potentially larger p. The following proposition is from Avella-Medina et al. (2018).

Proposition 2

(Proposition 3, Avella-Medina et al. 2018) Under the assumption (A4), for a sufficiently large constant C, we have

$$\begin{aligned} \mathbf {P}\left( \Vert \widehat{\varvec{\varSigma }}_2-\varvec{\varSigma }\Vert _\infty \ge C\sqrt{\frac{\log {p}}{n}}\right) = O(p^{-\tau }) \end{aligned}$$

for some constant \(\tau >0\).

Similar to Avella-Medina et al. (2018), we can get an estimator \(\tilde{\varvec{\varOmega }}_2\) by plugging \(\widehat{\varvec{\varSigma }}_2\) into SCIO and derive the convergence rates under different matrix norms.

Corollary 2

Let \(\lambda =C_0M_p\sqrt{\frac{\log p}{n}}\), where \(C_0\) is a sufficiently large constant, and \(H=K(n / \log p)^{1 / 2}\), where K is a given constant. For \(\varvec{\varOmega }\in {\mathcal {U}}_q(s_p,M_p)\), under assumptions (A1) and (A4), there exist sufficiently large constants \(C_1,C_2\) such that

$$\begin{aligned}&\Vert \tilde{\varvec{\varOmega }}_2-\varvec{\varOmega }\Vert _{\infty } \le C_1 M_{p}^2 \sqrt{\frac{\log p}{n}},\\&\Vert \tilde{\varvec{\varOmega }}_2-\varvec{\varOmega }\Vert _{L_1} \le C_2 s_p M_{p}^{2-2q}\left( \frac{\log p}{n}\right) ^{\frac{1}{2}(1-q)}, \end{aligned}$$

with probability greater than \(1-O(p^{-\tau })\), \(\tau >0\).

Remark 4

Note that Avella-Medina et al. (2018) provided the optimal convergence rate under an additional technical assumption that the truncated population covariance matrix \(\varvec{\varSigma }_{H}=E\left\{ 1_{(\left| X_{u} X_{v}\right| \le H)} X_{u} X_{v}\right\} \) satisfies \(\Vert \varvec{\varSigma }_{H}\varvec{\varOmega }-\mathbf {I}\Vert _{\infty }\le C\sqrt{\frac{\log {p}}{n}}\). Although the convergence rate provided in Corollary 2 is not optimal, the optimal rate can be readily obtained by imposing the same condition on the truncated population covariance matrix as in Avella-Medina et al. (2018).

To demonstrate these results numerically, we repeat the second scenario in Avella-Medina et al. (2018), where the data \(\mathbf {X}\) are generated from a Student's t distribution with 3.5 degrees of freedom and infinite kurtosis, so the sub-Gaussian assumption fails. We still consider the precision matrix \(\varvec{\varOmega }=\left( 0.5^{|i-j|}\right) _{p \times p}\). The sample size n is set to 200 and the dimension p is 100. Figure 3 reports the numerical performance of SCIO, D-trace and CLIME based on the robust estimator \(\widehat{\varvec{\varSigma }}_2\). All three methods perform comparably and align well for the heavy-tailed distribution.

Fig. 3

Plots of the estimation errors versus the penalty parameter \(\lambda \) based on the Huber-type estimator

Furthermore, we conduct a numerical simulation to illustrate the robustness of our SCIO estimator for heavy-tailed distributions. In detail, we plug the sample covariance matrix \(\widehat{\varvec{\varSigma }}_1\) and the Huber-type covariance matrix estimator \(\widehat{\varvec{\varSigma }}_2\) into SCIO. We set the sample size as \(n=100\) and generate the data matrix \(\mathbf {X}\) from a multivariate t distribution with 5 degrees of freedom, zero mean and covariance matrix \(\varvec{\varSigma }=\varvec{\varOmega }^{-1}\), where \(\varvec{\varOmega }=\left( \rho ^{|i-j|}\right) _{p \times p}\) and \(\rho =0.2,0.5\). Note that this distribution only has finite moments up to the fourth order. Table 3 reports the spectral norm error for different dimensions p based on 50 replications. From Table 3, we can see that the Huber-type precision matrix estimator performs better than the one based on the sample covariance matrix. This result is consistent with our theoretical improvement from the requirement of Assumption (A3) to the milder Assumption (A4).

Table 3 Comparison of SCIO with the sample and the Huber-type covariance matrices for the heavy-tailed data

3.3 Non-parametric rank-based estimation

For Gaussian distributions, the precision matrix characterizes the conditional independence among covariates. For non-Gaussian data, Liu et al. (2009) introduced a non-paranormal graphical model. Liu et al. (2012) and Xue and Zou (2012) studied precision matrix estimation for this non-paranormal graphical model, where the precision matrix \(\varvec{\varOmega }={\varvec{\varSigma }}^{-1}\) is defined through the transformed samples and \(\varvec{\varSigma }\) is the correlation matrix. In detail, they proposed to estimate the correlation matrix by non-parametric rank-based statistics such as Spearman's rho and Kendall's tau.

Given the sample data matrix \(\left( X_{ij}\right) _{n \times p}=\left( \mathbf {X}_1,\dots ,\mathbf {X}_n\right) ^\top \), we convert it to a rank matrix \((r_{ij})_{n \times p}=\left( \mathbf{r}_1,\dots ,\mathbf{r}_n\right) ^\top \), where each column \(\left( r_{1j},\dots ,r_{nj}\right) \) is the vector of ranks of \(\left( X_{1j},\dots ,X_{nj}\right) \). Spearman's rho correlation coefficient \(\widehat{\rho }_{i j}\) is defined as the Pearson correlation between the ith and jth columns of the rank matrix, that is,

$$\begin{aligned} \hbox {Spearman's rho:}~ \widehat{\rho }_{i j}=\frac{\sum _{k=1}^{n}\left( r_{ki}-{\bar{r}}_{i}\right) \left( r_{kj}-{\bar{r}}_{j}\right) }{\sqrt{\sum _{k=1}^{n}\left( r_{ki}-{\bar{r}}_{i}\right) ^2 \cdot \sum _{k=1}^{n}\left( r_{kj}-{\bar{r}}_{j}\right) ^2}}. \end{aligned}$$

Similarly, Kendall’s tau correlation coefficient is defined by

$$\begin{aligned} \hbox {Kendall's tau:}~ \widehat{\tau }_{ij}=\frac{2}{n(n-1)} \sum _{1 \le k<k^{\prime } \le n} {\text {sign}}\{\left( X_{ki}-X_{k^{\prime }i}\right) \left( X_{kj}-X_{k^{\prime }j}\right) \}. \end{aligned}$$

Based on Spearman’s rho and Kendall’s tau correlation coefficients, we are able to construct two non-parametric estimators \({\tilde{\varvec{\varSigma }}}_{3\rho }\) and \({\tilde{\varvec{\varSigma }}}_{3\tau }\) for the correlation matrix \(\varvec{\varSigma }\), where

$$\begin{aligned} ({\tilde{\varvec{\varSigma }}}_{3\rho })_{ij}=\left\{ \begin{array}{ll} 2 \sin \left( \frac{\pi }{6} \widehat{\rho }_{ij}\right) , &{} i \ne j \\ 1, &{} i=j \end{array}\right. \end{aligned}$$

and

$$\begin{aligned} ({\tilde{\varvec{\varSigma }}}_{3\tau })_{ij}=\left\{ \begin{array}{ll} \sin \left( \frac{\pi }{2} \widehat{\tau }_{ij}\right) , &{} i \ne j \\ 1, &{} i=j \end{array}\right. . \end{aligned}$$
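A minimal sketch of the two rank-based estimates \({\tilde{\varvec{\varSigma }}}_{3\rho }\) and \({\tilde{\varvec{\varSigma }}}_{3\tau }\) (before the projection step described next) is given below; scipy.stats is used for the ranks and for Kendall's tau.

```python
import numpy as np
from scipy.stats import rankdata, kendalltau

def npn_correlations(X):
    """Rank-based estimates of the correlation matrix under a non-paranormal model."""
    n, p = X.shape
    # Spearman: Pearson correlation of the column-wise ranks, then the sin transform
    R = np.column_stack([rankdata(X[:, j]) for j in range(p)])
    rho = np.corrcoef(R, rowvar=False)
    Sigma_rho = 2.0 * np.sin(np.pi / 6.0 * rho)
    # Kendall: pairwise tau, then the sin transform
    tau = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau[i, j] = tau[j, i] = kendalltau(X[:, i], X[:, j])[0]
    Sigma_tau = np.sin(np.pi / 2.0 * tau)
    np.fill_diagonal(Sigma_rho, 1.0)   # diagonal entries equal one by definition
    np.fill_diagonal(Sigma_tau, 1.0)
    return Sigma_rho, Sigma_tau
```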

Moreover, we still need an additional projection step

$$\begin{aligned} \widehat{\varvec{\varSigma }}_3={{\,\mathrm{arg\,min}\,}}_{\varvec{\varSigma }\succeq \varepsilon \mathbf {I}} \Vert \varvec{\varSigma }-{\tilde{\varvec{\varSigma }}}_3\Vert _{\infty } \end{aligned}$$

to obtain the final positive definite estimator \(\widehat{\varvec{\varSigma }}_{3\rho }\) or \(\widehat{\varvec{\varSigma }}_{3\tau }\). Note that if \(\mathbf {X}\) follows a non-paranormal distribution, Liu et al. (2012) proved that \({\tilde{\varvec{\varSigma }}}_{3\rho }\) and \({\tilde{\varvec{\varSigma }}}_{3\tau }\) are consistent estimators of \(\varvec{\varSigma }\) under the element-wise \(\ell _\infty \) norm. The following proposition is from Liu et al. (2012).

Proposition 3

(Theorems 4.1 and 4.2, Liu et al. 2012) Suppose that \(\mathbf {X}\) follows a non-paranormal distribution. Then there exists a sufficiently large constant C such that

$$\begin{aligned}&\mathbf {P}\left( \Vert \widehat{\varvec{\varSigma }}_{3\rho }-\varvec{\varSigma }\Vert _\infty \ge C\sqrt{\frac{\log {p}}{n}}\right) = O(p^{-1}),\\&\mathbf {P}\left( \Vert \widehat{\varvec{\varSigma }}_{3\tau }-\varvec{\varSigma }\Vert _\infty \ge C\sqrt{\frac{\log {p}}{n}}\right) = O(p^{-1}). \end{aligned}$$

To estimate the sparse precision matrix, Liu et al. (2012) proposed to plug \(\widehat{\varvec{\varSigma }}_{3\rho }\) or \(\widehat{\varvec{\varSigma }}_{3\tau }\) into the graphical Dantzig selector (Yuan 2010), CLIME (Cai et al. 2011), the graphical Lasso (Friedman et al. 2008), or the neighborhood pursuit estimator (Meinshausen and Bühlmann 2006). In this part, we consider the SCIO procedure with \(\widehat{\varvec{\varSigma }}_{3\rho }\) or \(\widehat{\varvec{\varSigma }}_{3\tau }\). Denote by \(\tilde{\varvec{\varOmega }}_3\) the precision matrix estimator obtained by plugging \(\widehat{\varvec{\varSigma }}_{3\rho }\) or \(\widehat{\varvec{\varSigma }}_{3\tau }\) into SCIO. The following corollary holds for both \(\widehat{\varvec{\varSigma }}_{3\rho }\) and \(\widehat{\varvec{\varSigma }}_{3\tau }\).

Corollary 3

Let \(\lambda =C_0M_p\sqrt{\frac{\log p}{n}}\), where \(C_0\) is a sufficiently large constant. For \(\varvec{\varOmega }\in {\mathcal {U}}_q(s_p,M_p)\), under assumption (A1) and the assumption that \(\mathbf {X}\) follows a non-paranormal distribution, there exist sufficiently large constants \(C_1,C_2\) such that

$$\begin{aligned}&\Vert \tilde{\varvec{\varOmega }}_3-\varvec{\varOmega }\Vert _{\infty } \le C_1 M_{p}^2 \sqrt{\frac{\log p}{n}},\\&\Vert \tilde{\varvec{\varOmega }}_3-\varvec{\varOmega }\Vert _{L_1} \le C_2 s_p M_{p}^{2-2q}\left( \frac{\log p}{n}\right) ^{\frac{1}{2}(1-q)} \end{aligned}$$

with probability greater than \(1-O(p^{-1})\).

To conduct numerical simulations, we assume that \(\mathbf {X}\) follows a non-paranormal distribution \(f(\mathbf {X})\sim {\mathcal {N}}(0,\varvec{\varSigma })\). Following Definition 9 in Liu et al. (2009) and Definition 5.1 in Liu et al. (2012), we choose the transformation function f as the Gaussian CDF transformation with \(\mu _{g_0}=0.05\) and \(\sigma _{g_0}=0.4\). To mimic the weakly sparse case, we take \(\varvec{\varSigma }\) as the correlation matrix of \(\varvec{\varOmega }_0^{-1}\), where \(\varvec{\varOmega }_0=(0.5^{|i-j|})_{p \times p}\). We again set the sample size \(n=200\) and the dimension \(p=100\). Based on Spearman's rho estimate \(\widehat{\varvec{\varSigma }}_{3\rho }\) or Kendall's tau estimate \(\widehat{\varvec{\varSigma }}_{3\tau }\), Figure 4 reports the numerical performance of SCIO, D-trace and CLIME. Again, we can see that all three methods perform comparably.

Fig. 4

Plots of the estimation errors versus the penalty parameter \(\lambda \) based on Spearman’s rho and Kendall’s tau estimation. Here a–c are the results for the Spearman estimation and d–f are the results for the Kendall estimation

As Avella-Medina et al. (2018) showed, Proposition 3 also holds for elliptically distributed \(\mathbf {X}\), which includes multivariate t distributed random variables. Here, we evaluate the robustness of the non-parametric rank-based SCIO estimation under heavy-tailed settings. The setting is the same as that of Table 3. Table 4 shows the spectral norm error of SCIO with Pearson's correlation matrix \(\widehat{\varvec{\varSigma }}_1\), Spearman's rho correlation matrix \(\widehat{\varvec{\varSigma }}_{3\rho }\) and Kendall's tau correlation matrix \(\widehat{\varvec{\varSigma }}_{3\tau }\). We can see that SCIO with the non-parametric correlation estimators outperforms the one with Pearson's correlation matrix. Under our settings, Spearman's estimation performs slightly better than Kendall's estimation. These numerical results verify the robustness of our non-parametric rank-based SCIO estimation in the heavy-tailed case.

Table 4 Comparison of precision matrix estimation errors under the spectral norm for the non-parametric estimators over 50 replications

3.4 Matrix data estimation

Matrix variate data are frequently encountered in real applications where the covariance matrix has a Kronecker product structure \(\varvec{\varSigma }=\mathbf{A}\otimes \mathbf{B}\). To study matrix data, it is of great interest to estimate the graphical structure \(\varvec{\varOmega }=\varvec{\varSigma }^{-1}=\left( \mathbf{A}\otimes \mathbf{B}\right) ^{-1}\) (Leng and Tang 2012; Zhou 2014). For brevity, we assume \(\mathbf{A}\) and \(\mathbf{B}\) are both correlation matrices, which means their diagonal entries are ones.

Given the i.i.d. matrix samples \(\mathbf {X}(t) \in {\mathbb {R}}^{f \times m},~t=1,\ldots ,n\), Zhou (2014) developed the Gemini estimator for the precision matrix \(\mathbf{A}^{-1}\otimes \mathbf{B}^{-1}\). Writing the columns of \(\mathbf {X}(t)\) as \(x(t)^{1},\dots ,x(t)^m \in {\mathbb {R}}^f\), Zhou (2014) proposed to estimate \(\mathbf{A}\) by

$$\begin{aligned} (\widehat{\varvec{\varSigma }}_{4A})_{ij}=\frac{\sum _{t=1}^n (x(t)^{i})^\top \left( x(t)^{j}\right) }{\sqrt{\sum _{t=1}^n (x(t)^{i})^\top \left( x(t)^{i}\right) } \sqrt{\sum _{t=1}^n \left( x(t)^{j}\right) ^\top \left( x(t)^{j}\right) }} \end{aligned}$$

and the estimator \(\widehat{\varvec{\varSigma }}_{4B}\) for \(\mathbf{B}\) is constructed similarly based on the rows of \(\mathbf {X}(t)\). The final estimator of \(\varvec{\varOmega }\) is obtained by implementing the graphical Lasso or CLIME with \(\widehat{\varvec{\varSigma }}_{4A}\) and \(\widehat{\varvec{\varSigma }}_{4B}\). Zhou (2014) derived the convergence rate under the \(\ell _0\) sparsity condition for the graphical Lasso and introduced the CLIME procedure to refine the convergence rates. For \(\ell _q\) sparse matrices, Zhou (2014) did not provide explicit theoretical results.
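For concreteness, a sketch of the two Gemini correlation estimates is given below; the samples are assumed to be stored in an array of shape (n, f, m), with the n matrix observations stacked along the first axis.

```python
import numpy as np

def gemini_correlations(X):
    """Gram-based correlation estimates of A (m x m, over columns) and B (f x f, over rows).

    X has shape (n, f, m): n i.i.d. matrix observations X(t)."""
    # (i, j) entry of GA: sum_t x(t)^i' x(t)^j, where x(t)^i is the i-th column of X(t)
    GA = np.einsum('tki,tkj->ij', X, X)
    dA = np.sqrt(np.diag(GA))
    Sigma_A = GA / np.outer(dA, dA)
    # analogous Gram matrix built from the rows of X(t)
    GB = np.einsum('tik,tjk->ij', X, X)
    dB = np.sqrt(np.diag(GB))
    Sigma_B = GB / np.outer(dB, dB)
    return Sigma_A, Sigma_B
```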

In this part, we study the SCIO method based on \(\widehat{\varvec{\varSigma }}_{4A}\) and \(\widehat{\varvec{\varSigma }}_{4B}\). We first present an approximate sparsity condition for \(\mathbf{A}^{-1}\) and \(\mathbf{B}^{-1}\).

(A5). Suppose \(\mathbf{A}^{-1} \in {\mathcal {U}}_q(s_m,M_m)\) and \(\mathbf{B}^{-1} \in {\mathcal {U}}_q\left( {\tilde{s}}_f, {\tilde{M}}_f\right) \) for a given \(q \in [0,1)\). Moreover, the parameters \(s_m, {\tilde{s}}_f, M_m, {\tilde{M}}_f\) satisfy

$$\begin{aligned}&s_mM_m^{2-q}=o\left( \frac{\sqrt{nf}}{\log ^{\frac{1}{2}}(m\vee f)}\right) ,\\&{\tilde{s}}_f {\tilde{M}}_f^{2-q}=o\left( \frac{\sqrt{nm}}{\log ^{\frac{1}{2}}(m\vee f)}\right) . \end{aligned}$$

The following proposition is from Theorem 4.1 of Zhou (2014).

Proposition 4

(Theorem 4.1, Zhou 2014) For \(t=1,\ldots , n\), suppose that \({\text {vec}}(\mathbf {X}(t)) \sim {\mathcal {N}}_{f, m}(0, \mathbf{A}\otimes \mathbf {B})\). Under the assumption (A5) and the assumption (A2) in Zhou (2014), there exists a sufficiently large constant C such that

$$\begin{aligned}&\mathbf {P}\left( \Vert \widehat{\varvec{\varSigma }}_{4A}-\mathbf{A}\Vert _\infty \ge C\sqrt{\frac{\log (m\vee f)}{nf}} \right) = O\left( \left( m \vee f\right) ^{-2}\right) ,\\&\mathbf {P}\left( \Vert \widehat{\varvec{\varSigma }}_{4B}-\mathbf{B}\Vert _\infty \ge C\sqrt{\frac{\log (m\vee f)}{nm}}\right) = O\left( \left( m \vee f\right) ^{-2}\right) . \end{aligned}$$

As an application of our Theorem 2, we can derive the theoretical result of the SCIO estimator for estimating \(\varvec{\varOmega }=\mathbf{A}^{-1}\otimes \mathbf {B}^{-1}\).

Corollary 4

Suppose that \({\text {vec}}(\mathbf {X}(t)) \sim {\mathcal {N}}_{f, m}(0, \mathbf{A}_{m\times m} \otimes \mathbf {B}_{f\times f}),~t=1,\ldots , n\). Let \(\lambda _A=C_AM_m\frac{\log ^{\frac{1}{2}}(m\vee f)}{\sqrt{fn}}\) and \(\lambda _B=C_B {\tilde{M}}_f\frac{\log ^{\frac{1}{2}}(m\vee f)}{\sqrt{mn}}\), where \(C_A\) and \(C_B\) are sufficiently large constants. Under our assumption (A5) and the assumption (A2) in Zhou (2014), there exists a sufficiently large constant C such that

$$\begin{aligned} \Vert \tilde{\varvec{\varOmega }}_4-\varvec{\varOmega }\Vert _{2} \le C \left( \frac{\log (m\vee f)}{n}\right) ^{\frac{1-q}{2}} \left( s_mM_m^{2-2q}f^{(q-1)/2} + {\tilde{s}}_f {\tilde{M}}_f^{2-2q} m^{(q-1)/2}\right) \end{aligned}$$

with probability greater than \(1-O((m \vee f)^{-2})\).

Compared with Theorem 3.3 of Zhou (2014), Corollary 4 is derived under the general \(\ell _q\) sparsity condition. In particular, the special case \(q=0\) corresponds to Theorem 3.3 of Zhou (2014) and the two results are consistent due to the dual relation between the Lasso and the Dantzig selector. Moreover, our result can be generalized to the sub-Gaussian setting for matrix data (Hornstein et al. 2019); we omit the details.

To conduct the simulations, we generate the data from the matrix normal distribution \({\text {vec}}(\mathbf {X}(t)) \sim {\mathcal {N}}_{f, m}(0, \mathbf{A}_{m\times m}\otimes \mathbf {B}_{f\times f})\). To mimic the \(\ell _q\) sparsity, we choose \(\mathbf {B}\) as the correlation matrix of \(\varvec{\varPhi }^{-1}\), where \(\varvec{\varPhi }=(0.2^{|i-j|})_{f \times f}\), and \(\mathbf {A}\) as the correlation matrix of \(\varvec{\varTheta }^{-1}\), where \(\varvec{\varTheta }=(0.5^{|i-j|})_{m \times m}\). We set the dimension of \(\mathbf {A}\) to 80 and the dimension of \(\mathbf {B}\) to 40. The sample size n is taken as 3. Figure 5 reports the performance of the Gemini estimator under several matrix norms, where the penalty level \(\lambda \) for \(\mathbf {A}^{-1}\) varies and the penalty level for \(\mathbf {B}^{-1}\) is set to 0.15 for simplicity. From Fig. 5, we observe that the Gemini method based on SCIO performs similarly to the Gemini method based on CLIME, which means SCIO is also applicable to matrix data.
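The matrix normal samples can be drawn directly from the Kronecker structure, since \(\mathbf {X}(t)=\mathbf {B}^{1/2}\mathbf {Z}\mathbf {A}^{1/2}\) with \(\mathbf {Z}\) having i.i.d. standard normal entries gives \({\text {vec}}(\mathbf {X}(t))\sim {\mathcal {N}}(0,\mathbf{A}\otimes \mathbf {B})\). A minimal sketch with the Toeplitz parameters of this subsection:

```python
import numpy as np
from scipy.linalg import sqrtm

def cov2cor(S):
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

def matrix_normal_samples(A, B, n, seed=0):
    """Draw X(t) = B^{1/2} Z A^{1/2}, so that vec(X(t)) ~ N(0, A kron B)."""
    rng = np.random.default_rng(seed)
    A_half, B_half = np.real(sqrtm(A)), np.real(sqrtm(B))
    f, m = B.shape[0], A.shape[0]
    return np.stack([B_half @ rng.standard_normal((f, m)) @ A_half for _ in range(n)])

m, f = 80, 40
Theta = 0.5 ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
Phi = 0.2 ** np.abs(np.subtract.outer(np.arange(f), np.arange(f)))
A = cov2cor(np.linalg.inv(Theta))   # correlation matrix of Theta^{-1}
B = cov2cor(np.linalg.inv(Phi))     # correlation matrix of Phi^{-1}
X = matrix_normal_samples(A, B, n=3)
```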

Fig. 5

Plots of the estimation errors versus the penalty parameter \(\lambda \) based on the Gemini estimator for the matrix data

4 Discussion

This article revisits the SCIO method proposed by Liu and Luo (2015) and explores the theoretical and numerical properties of SCIO under the weak sparsity condition. Intuitively, the way we obtain the matrix estimation error bound by plugging in the sample covariance matrix is similar to the way a prediction bound is obtained in the regression setting. For the classical Lasso problem, Ye and Zhang (2010) and Sun and Zhang (2012) analyzed the Lasso method and its variants under general weak sparsity. Our technique essentially originates from the basic inequality derived in their theoretical analysis of the Lasso. The main difference is that the Lasso results rely on constant lower bounds of certain quantities such as the cone invertibility factor or the compatibility factor.

As for precision matrix estimation, error bounds under the \(\ell _q\) sparsity condition have been discussed for Dantzig-type methods such as the graphical Dantzig selector (Yuan 2010) and the CLIME method (Cai et al. 2011), and minimax convergence rates have been established for the ACLIME method (Cai et al. 2016a). The CLIME method and its variant ACLIME are frequently used to handle \(\ell _q\) sparsity in precision matrix estimation. Here, our work provides an alternative approach and shows that the Lasso-type method SCIO attains the theoretical guarantees of CLIME under the \(\ell _q\) sparsity condition. In particular, we relax the irrepresentable condition, which is commonly used for Lasso-type precision matrix estimation. In addition, the SCIO method can be efficiently implemented according to Wang and Jiang (2020), while the computation of Dantzig-type methods turns out to be slow. From this perspective, the SCIO method is more appealing for high-dimensional precision matrix estimation.

Another closely related Lasso-type method is SLasso, proposed by Sun and Zhang (2013). By introducing a noise level and estimating it iteratively, SLasso is tuning-free. For the normal distribution, Sun and Zhang (2013) derived optimal error bounds under the alternative weak sparsity condition, i.e., the capped \(\ell _1\) measure. The key ideas of SLasso and SCIO are quite similar; e.g., SLasso with a fixed noise level \(\sigma \) is the same as SCIO with \({\hat{\beta }}_{jj}=-1\). It would be interesting to compare these two methods in terms of both computational complexity and the performance of the estimators. We implemented the R package “scalreg” provided by Sun and Zhang (2013) and found it not very efficient, which prevents us from conducting a full comparison experiment. According to the original paper of Sun and Zhang (2013), SLasso improves over CLIME in most cases. This improvement is due to the adaptive choice of the penalty level for each column of the precision matrix. Actually, we can also use different tuning parameters for each column in SCIO (or CLIME), and similar improvements are expected. Another interesting question is how to exploit the noise level in more complicated cases, e.g., heavy-tailed data, non-paranormal data and matrix variate data. We leave these questions for future work.

Other Lasso-type methods such as the graphical Lasso and D-trace are not in a column-by-column form. Although they have been shown to be consistent under the \(\ell _0\) sparsity condition, the extension to \(\ell _q\) sparsity is not trivial and our current technique cannot be applied directly. It remains of interest whether optimal rates can be established in the weakly sparse case for the graphical Lasso and D-trace. Moreover, for other statistical problems such as discriminant analysis, the misclassification rate measures the performance of the method and we can use our current technique to derive error bounds under the general \(\ell _q\) sparsity condition. Specifically, it is possible to show that some Lasso-type methods for discriminant analysis such as Fan et al. (2012) or Mai et al. (2012) remain applicable in the weakly sparse case. We leave these problems for future work.