1 Introduction

Dictionary learning is a matrix factorization problem that amounts to representing a given signal \( {\mathbf{Y}} \in {\mathbb{R}}^{N \times M} \) as a linear combination of only a few atoms selected from the columns of a dictionary \( {\mathbf{D}} \in {\mathbb{R}}^{N \times K} \). In an overcomplete setting, the dictionary matrix \( {\mathbf{D}} \) has more columns than rows, \( K > N \), and the corresponding coefficient matrix \( {\mathbf{X}} \in {\mathbb{R}}^{K \times M} \) is assumed to be sparse. For most practical tasks in the presence of noise, we consider a contaminated measurement model \( {\mathbf{Y}} = {\mathbf{DX}} + {\mathbf{w}} \), where the elements of the noise \( {\mathbf{w}} \) are independent realizations from the Gaussian distribution \( \mathcal{N}(0, \sigma_{n}^{2}) \). The basic dictionary learning problem is formulated as:

$$ \mathop {\min }\limits_{{\mathbf{D}},\,{\mathbf{X}}} \left\| {\mathbf{Y}} - {\mathbf{DX}} \right\|_{F}^{2} \quad \text{s.t.} \quad \left\| {\mathbf{x}}_{i} \right\|_{0} \le L \quad \forall i $$
(1)

Therein, \( L \) is the maximal number of non-zero elements in the coefficient vector \( {\mathbf{x}}_{i} \). Starting with an initial dictionary, this minimization task can be solved by popular alternating approaches such as the method of optimal directions (MOD) [1] and K-SVD [2]. Dictionary training on noisy samples can incorporate denoising into a single iterative process [3]. For a single image, the K-SVD algorithm is adopted to train a sparsifying dictionary, and the method developed in [3] denoises the corrupted image by alternating between the update stages of the sparse representations and the dictionary. In general, the residual errors of the learning process are determined by the noise level. Noise incursion in a trained dictionary can affect the stability and accuracy of the sparse representation [4]. Hence the performance of dictionary learning depends strongly on the estimation accuracy of the unknown noise level \( \sigma_{n}^{2} \) when the noise characteristics of trained dictionaries are unavailable.
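
To make the alternating scheme concrete, the following minimal sketch (Python with numpy; the greedy pursuit and all function names are illustrative assumptions, not the exact procedures of [1, 2]) performs one MOD-style iteration: sparse coding of each sample followed by a least-squares dictionary update.

```python
import numpy as np

def omp(D, y, L):
    """Greedy orthogonal matching pursuit: approximate y with at most L atoms of D."""
    residual, support = y.copy(), []
    coeffs = np.zeros(0)
    for _ in range(L):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x = np.zeros(D.shape[1])
    x[support] = coeffs
    return x

def mod_step(Y, D, L):
    """One alternation of a MOD-style update: sparse coding, then least-squares dictionary fit."""
    X = np.column_stack([omp(D, Y[:, i], L) for i in range(Y.shape[1])])
    D_new = Y @ np.linalg.pinv(X)                          # minimizes ||Y - D X||_F^2 over D
    D_new /= np.linalg.norm(D_new, axis=0, keepdims=True)  # keep atoms unit norm
    return D_new, X
```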

The main challenge of estimating the noise level lies in effectively distinguishing the signal from the noise by exploiting sufficient prior information. Most existing methods have been developed to estimate the noise level from image signals based on specific image characteristics [5,6,7,8]. Generally, these works assume that a sufficient amount of homogeneous areas or self-similar patches is contained in natural images, so that empirical observations, singular value decomposition (SVD) or statistical properties can be applied to carefully selected patches. However, such approaches are not suitable for estimating the noise level in the dictionary update stage, because the few atoms used for sparse representation cannot guarantee these assumptions. To enable wider applications with fewer assumptions, more recent methods estimate the noise level based on principal component analysis (PCA) [9, 10]. These methods underestimate the noise level since they take only the smallest eigenvalue of the block covariance matrix. Although later work [11] has made efforts to tackle this problem by spanning a low-dimensional subspace, the optimal estimate of the true noise variance is still not achieved due to the inaccuracy of the subspace segmentation. Among noise variance estimation techniques, the scaled median absolute deviation of wavelet coefficients has been widely adopted [12]. Leveraging results from random matrix theory (RMT), the median of the sample eigenvalues has also been used as an estimator of the noise variance [13]. However, these estimators are no longer consistent and unbiased when the dictionary matrix has a high-dimensional structure.

To solve the aforementioned problems, we propose to accurately estimate the noise variance in a trained dictionary by using the exact eigenvalues of a sample covariance matrix. The proposed method can also be applied to estimate the noise level of a noisy image. As a novel contribution, we construct tight asymptotic bounds on the extreme eigenvalues to separate the signal subspace from the noise subspace based on random matrix theory (RMT). Moreover, in order to eliminate the possible bias caused by high-dimensional settings, a corrected estimator is derived to provide consistent inference on the noise variance of a trained dictionary. Based on these asymptotic results, we develop an optimal variance estimator which can deal well with settings of different sample sizes and dimensions. The practical usefulness of our method is illustrated numerically.

2 Tight Bounds for Noise Eigenvalue Distributions

In this section, we analyze the asymptotic distribution of the ratio of the extreme eigenvalues of a sample covariance matrix based on the limiting RMT law. A tight bound is then derived.

2.1 Eigenvalue Subspaces of Sample Covariance Matrix

We consider the sparse approximation of each observed sample \( {\mathbf{y}}_{i} \in {\mathbb{R}}^{N} \) with \( s \) prototype atoms selected from the learned dictionary \( {\mathbf{D}} \). With respect to the sparse model (1), we aim at estimating the noise level \( \sigma_{n}^{2} \) for an elementary trained dictionary \( {\mathbf{D}}_{s} \) containing a subset of the atoms \( \{ {\mathbf{d}}_{i} \}_{i = 1}^{s} \). Note that \( {\mathbf{D}}_{s} = {\mathbf{D}}_{S}^{0} + {\mathbf{w}}_{S} \), where \( {\mathbf{D}}_{S}^{0} \) denotes the original dictionary and \( {\mathbf{w}}_{S} \) is the additive Gaussian noise. At each iterative step, the noise level \( \sigma_{n}^{2} \) goes gradually to zero as the update moves towards the true dictionary \( {\mathbf{D}}_{S}^{0} \) [14]. Knowledge of the noise variance helps to avoid noise incursion and to determine the sample size, the sparsity degree, and even the achievable performance of the true underlying dictionary [15]. To derive the relationship between the eigenvalues and the noise level, we first construct the sample covariance matrix of the dictionary \( {\mathbf{D}}_{s} \) as follows:

$$ \varSigma_{S} = \frac{1}{s - 1}\sum\limits_{i = 1}^{s} ({\mathbf{d}}_{i} - \overline{{\mathbf{d}}})({\mathbf{d}}_{i} - \overline{{\mathbf{d}}})^{\mathrm{T}}, \qquad \overline{{\mathbf{d}}} = \frac{1}{s}\sum\limits_{i = 1}^{s} {\mathbf{d}}_{i} $$
(2)
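
As a minimal sketch (assuming numpy and that the atom subset is given as an \( N \times s \) array named D_s), the sample covariance in (2) can be formed as follows.

```python
import numpy as np

def sample_covariance(D_s):
    """Sample covariance Sigma_S of the s atoms stored in the columns of D_s (Eq. 2)."""
    N, s = D_s.shape
    d_bar = D_s.mean(axis=1, keepdims=True)      # mean atom, N x 1
    centered = D_s - d_bar
    return (centered @ centered.T) / (s - 1)     # N x N matrix
```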

According to (2), the square matrix \( \varSigma_{S} \) has dimension \( N \) under the sparse condition \( N \gg s \). Since this matrix is symmetric, it can be decomposed into the product of three matrices: an orthogonal matrix \( {\mathbf{U}} \), a diagonal matrix, and the transpose \( {\mathbf{U}}^{\mathrm{T}} \), where \( {\mathbf{U}} \) satisfies \( {\mathbf{U}}^{\mathrm{T}} {\mathbf{U}} = {\mathbf{I}} \). This transform is written as:

$$ {\mathbf{U}}^{\mathrm{T}} \varSigma_{S} {\mathbf{U}} = \operatorname{diag}(\lambda_{1}, \ldots, \lambda_{m}, \lambda_{m + 1}, \ldots, \lambda_{N}) $$
(3)

Given \( \lambda_{1} \ge \lambda_{2} \ge \ldots \ge \lambda_{N} \), we exploit the eigenvalue subspaces to enable the separation of atoms from noise. To be more specific, we divide the eigenvalues into two sets \( {\mathbf{S}} = {\mathbf{S}}_{1} \cup {\mathbf{S}}_{2} \) by finding an appropriate bound in a spiked population model [16]. Most of the structure of an atom lies in a low-dimensional subspace, and thus the leading eigenvalues in the set \( {\mathbf{S}}_{1} = \left\{ \lambda_{i} \right\}_{i = 1}^{m} \) are mainly contributed by the atoms themselves. The redundant-dimension subspace \( {\mathbf{S}}_{2} = \left\{ \lambda_{i} \right\}_{i = m + 1}^{N} \) is dominated by the noise. Because the atoms contribute very little to this latter portion, we take all the eigenvalues of \( {\mathbf{S}}_{2} \) into consideration to estimate the noise variance while eliminating the influence of the trained atoms. Moreover, the random variables \( \left\{ \lambda_{i} \right\}_{i = m + 1}^{N} \) can be considered as the eigenvalues of a pure noise covariance matrix \( \varSigma_{{\mathbf{w}}} \) of dimension \( N \).
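
Continuing the sketch above, the decomposition (3) and the split of the spectrum into \( {\mathbf{S}}_{1} \) and \( {\mathbf{S}}_{2} \) can be written as below; the boundary index \( m \) is assumed known here, and its detection is the subject of Sect. 2.2.

```python
def split_spectrum(Sigma_S, m):
    """Eigendecompose Sigma_S (Eq. 3) and split eigenvalues into signal set S1 and noise set S2."""
    eigvals = np.linalg.eigvalsh(Sigma_S)[::-1]  # descending: lambda_1 >= ... >= lambda_N
    return eigvals[:m], eigvals[m:]              # S1 (leading m), S2 (remaining N - m)
```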

2.2 Asymptotic Bounds for Noise Eigenvalues

Suppose the sample matrix \( \varSigma_{{\mathbf{w}}} \) has the form \( (s - 1)\varSigma_{{\mathbf{w}}} = {\mathbf{HH}}^{\mathrm{T}} \), where the sample entries of \( {\mathbf{H}} \) are independently generated from the distribution \( \mathcal{N}(0, \sigma_{n}^{2}) \). Then the real matrix \( {\mathbf{M}} = {\mathbf{HH}}^{\mathrm{T}} \) follows a standard Wishart distribution [17]. The ordered eigenvalues of \( {\mathbf{M}} \) are denoted by \( \bar{\lambda}_{\max}({\mathbf{M}}) \ge \cdots \ge \bar{\lambda}_{\min}({\mathbf{M}}) \). In the high-dimensional situation where \( N/s \to \gamma \in \left[0, \infty \right) \) as \( s, N \to \infty \), the Tracy-Widom law gives the limiting distribution of the largest eigenvalue of the large random matrix \( {\mathbf{M}} \) [18]. Then we have the following asymptotic expression:

$$ \Pr \left\{ \frac{\bar{\lambda}_{\max}/\sigma_{n}^{2} - \mu}{\xi} \le z \right\} \to F_{\text{TW1}}(z) $$
(4)

where \( F_{\text{TW1}}(z) \) denotes the cumulative distribution function of the Tracy-Widom random variable of order 1. In order to improve both the approximation accuracy and the convergence rate, even with only a few atom samples, we need to choose suitable centering and scaling parameters \( \mu, \xi \) [19]. By comparison between different values, these parameters are defined as

$$ \left\{ \begin{array}{l} \mu = \dfrac{1}{s}\left( \sqrt{s - 1/2} + \sqrt{N - 1/2} \right)^{2} \\[2mm] \xi = \dfrac{1}{s}\left( \sqrt{s - 1/2} + \sqrt{N - 1/2} \right)\left( \dfrac{1}{\sqrt{s - 1/2}} + \dfrac{1}{\sqrt{N - 1/2}} \right)^{1/3} \end{array} \right. $$
(5)
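
A direct transcription of the centering and scaling parameters in (5), as a small Python helper with illustrative naming, continuing the running sketch:

```python
import numpy as np

def centering_scaling(s, N):
    """Centering mu and scaling xi from Eq. (5)."""
    a, b = np.sqrt(s - 0.5), np.sqrt(N - 0.5)
    mu = (a + b) ** 2 / s
    xi = (a + b) * (1.0 / a + 1.0 / b) ** (1.0 / 3.0) / s
    return mu, xi
```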

The empirical distribution of the eigenvalues of a large sample matrix converges almost surely to the Marchenko-Pastur distribution on a finite support [20]. Based on the generalized result in [21], when \( N \to \infty \) and \( \gamma \in \left[0, \infty \right) \), with probability one, we derive the limiting value of the smallest eigenvalue as

$$ \bar{\lambda}_{\min}/\sigma_{n}^{2} \to \left( 1 - \sqrt{\gamma} \right)^{2} $$
(6)

According to the asymptotic distributions in (4) and (6), we further quantify the distribution of the ratio of the maximum eigenvalue to the minimum eigenvalue in order to detect the noise eigenvalues. Let \( T_{1} \) be a detection threshold. Then we find \( T_{1} \) from the following expression:

$$ \begin{aligned} \Pr \left\{ \frac{\bar{\lambda}_{\max}}{\bar{\lambda}_{\min}} \le T_{1} \right\} &= \Pr \left\{ \frac{\bar{\lambda}_{\max}}{\sigma_{n}^{2}} \le T_{1} \cdot \frac{\bar{\lambda}_{\min}}{\sigma_{n}^{2}} \right\} \approx \Pr \left\{ \frac{\bar{\lambda}_{\max}}{\sigma_{n}^{2}} \le T_{1} \cdot \left( 1 - \sqrt{N/s} \right)^{2} \right\} \\ &= \Pr \left\{ \frac{\bar{\lambda}_{\max}/\sigma_{n}^{2} - \mu}{\xi} \le \frac{T_{1} \cdot \left( 1 - \sqrt{N/s} \right)^{2} - \mu}{\xi} \right\} \approx F_{\text{TW1}}\left( \frac{T_{1} \cdot \left( 1 - \sqrt{N/s} \right)^{2} - \mu}{\xi} \right) \end{aligned} $$
(7)

Note that there is no closed-form expression for the function \( F_{{\text{TW} 1}} \). Fortunately, the values of \( F_{{\text{TW} 1}} \) and the inverse \( F_{{\text{TW} 1}}^{ - 1} \) can be numerically computed at certain percentile points [16]. For a required detection probability \( \alpha_{1} \), this leads to

$$ \frac{T_{1} \cdot \left( 1 - \sqrt{N/s} \right)^{2} - \mu}{\xi} = F_{\text{TW1}}^{-1}(\alpha_{1}) $$
(8)

Plugging the definitions of \( \mu \) and \( \xi \) into Eq. (8), we finally obtain the threshold

$$ T_{1} = \frac{\left( \sqrt{s - 1/2} + \sqrt{N - 1/2} \right)^{2}}{\left( \sqrt{s} - \sqrt{N} \right)^{2}} \cdot \left( \frac{\left( \sqrt{s - 1/2} + \sqrt{N - 1/2} \right)^{-2/3}}{\left( s - 1/2 \right)^{1/6} \left( N - 1/2 \right)^{1/6}} \cdot F_{\text{TW1}}^{-1}(\alpha_{1}) + 1 \right) $$
(9)
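
Since \( F_{\text{TW1}}^{-1} \) has no closed form, the threshold can be evaluated with a tabulated quantile. The sketch below, continuing the earlier helpers, solves (8) directly for \( T_{1} \), which is algebraically equivalent to (9); the hard-coded quantile values are approximate tabulated upper percentiles of the Tracy-Widom law of order 1 and should be checked against a table or a package such as RMTstat for other levels.

```python
# Approximate upper quantiles of the Tracy-Widom distribution of order 1 (tabulated values).
TW1_UPPER_QUANTILES = {0.95: 0.9793, 0.99: 2.0234}

def threshold_T1(s, N, alpha1=0.95):
    """Detection threshold T1 of Eqs. (8)-(9) for the eigenvalue-ratio test."""
    mu, xi = centering_scaling(s, N)
    edge = (1.0 - np.sqrt(N / s)) ** 2                   # limiting lambda_min / sigma_n^2 from Eq. (6)
    return (xi * TW1_UPPER_QUANTILES[alpha1] + mu) / edge
```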

Once the detection threshold \( T_{1} \) is known for a given probability, an asymptotic upper bound is also available for determining the noise eigenvalues of the matrix \( \varSigma_{{\mathbf{w}}} \), because the equality \( \lambda_{m + 1}/\lambda_{N} = \bar{\lambda}_{\max}/\bar{\lambda}_{\min} \) holds. In general, the noise eigenvalues in the set \( {\mathbf{S}}_{2} \) surround the true noise variance since the noise follows a Gaussian distribution. The estimated largest noise eigenvalue \( \lambda_{m + 1} \) should be no less than \( \sigma_{n}^{2} \), and the smallest eigenvalue \( \lambda_{N} \) is no more than \( \sigma_{n}^{2} \) by the theoretical analysis [11]. The location and value of \( \lambda_{m + 1} \) in \( {\mathbf{S}} \) are obtained by checking the bound \( \lambda_{m + 1} \le T_{1} \cdot \lambda_{N} \), which holds with high probability \( \alpha_{1} \). In addition, \( \lambda_{1} \) cannot be selected as the noise eigenvalue \( \lambda_{m + 1} \).
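
A hypothetical helper for locating \( \lambda_{m + 1} \) by the bound \( \lambda_{m + 1} \le T_{1} \cdot \lambda_{N} \); the paper does not prescribe the exact search order, so scanning the ordered spectrum from the second eigenvalue downwards is an assumption of this sketch.

```python
def detect_noise_index(eigvals, T1):
    """Return the index m such that eigvals[m:] is treated as the noise set S2.

    eigvals must be sorted in descending order; lambda_1 is never selected as noise.
    """
    lam_N = eigvals[-1]
    for i in range(1, len(eigvals)):          # skip lambda_1 (index 0)
        if eigvals[i] <= T1 * lam_N:
            return i                          # eigvals[i] plays the role of lambda_{m+1}
    return len(eigvals) - 1                   # fallback: only the smallest eigenvalue is noise
```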

3 Noise Variance Estimation Algorithm

3.1 Bounded Estimator for Noise Variance

Without requiring knowledge of the signal, the threshold \( T_{1} \) can provide good detection performance for finite \( s, N \), even when the ratio \( N/s \) is not too large. Based on this result, a more accurate estimate can be obtained by averaging all elements in \( {\mathbf{S}}_{2} \). Hence, the maximum likelihood estimator of \( \sigma_{n}^{2} \) is

$$ \hat{\sigma }_{n}^{2} = \frac{1}{N - m}\sum\limits_{j = m + 1}^{N} {\lambda_{j} } $$
(10)
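
In the running sketch, the estimator (10) is simply the mean of the detected noise eigenvalues.

```python
def ml_noise_variance(eigvals, m):
    """Maximum likelihood estimate of the noise variance (Eq. 10)."""
    return float(np.mean(eigvals[m:]))        # average of lambda_{m+1}, ..., lambda_N
```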

In the low-dimensional setting where \( N \) is relatively small compared with \( s \), the estimator \( \hat{\sigma}_{n}^{2} \) is consistent and unbiased as \( s \to \infty \). It asymptotically follows a normal distribution:

$$ \sqrt{s}\left( \hat{\sigma}_{n}^{2} - \sigma_{n}^{2} \right) \to \mathcal{N}(0, t^{2}), \qquad t^{2} = \frac{2\sigma_{n}^{4}}{N - m} $$
(11)

When \( N \) is large with respect to the sample size \( s \), the sample covariance matrix shows significant deviations from the underlying population covariance matrix. In this context, the estimator \( \hat{\sigma}_{n}^{2} \) can be biased and overestimate the true noise variance [22, 23]. We therefore investigate the distribution of another eigenvalue ratio, namely the ratio of the largest noise eigenvalue to the average of the noise eigenvalues:

$$ U = \frac{\lambda_{m + 1}}{\frac{1}{N - m}\operatorname{tr}(\varSigma_{{\mathbf{w}}})} = \frac{\lambda_{m + 1}}{\frac{1}{N - m}\sum\nolimits_{j = m + 1}^{N} \lambda_{j}} $$
(12)

According to the result in (4), the ratio \( U \) also follows a Tracy-Widom distribution as both \( N, s \to \infty \). The denominator in the definition of \( U \) is distributed as an independent \( \sigma_{n}^{2}\chi_{N}^{2}/N \) random variable, and thus has \( \text{E}(\hat{\sigma}_{n}^{2}) = \sigma_{n}^{2} \) and \( \text{Var}(\hat{\sigma}_{n}^{2}) = 2\sigma_{n}^{4}/(N \cdot s) \). It is easy to show that replacing \( \sigma_{n}^{2} \) by \( \hat{\sigma}_{n}^{2} \) results in the same limiting distribution in (4). Then we have

$$ \Pr \left\{ \frac{\lambda_{m + 1}/\hat{\sigma}_{n}^{2} - \mu}{\xi} \le z \right\} \to F_{\text{TW1}}(z) $$
(13)

Unfortunately, the asymptotic approximation presented in (13) is inaccurate for small and even moderate values of \( N \) [24], and it is not a proper distribution function. Simulation observations imply that the major factor contributing to the poor approximation is the asymptotic error caused by the constant \( \xi \) [24]. Therefore, a more accurate estimate of the standard deviation of \( \lambda_{m + 1}/\hat{\sigma}_{n}^{2} \) will provide a significant improvement. For finite samples, we have

$$ \text{E}\left( \frac{\lambda_{m + 1}}{\sigma_{n}^{2}} \right) = \mu, \qquad \text{E}\left( \frac{\lambda_{m + 1}^{2}}{\sigma_{n}^{4}} \right) = \mu^{2} + \xi^{2} $$
(14)

Using these asymptotic results, we get the corrected deviation

$$ \xi^{\prime } = \sqrt {\frac{N \cdot s}{2 + N \cdot s}(\xi^{2} - \frac{2}{N \cdot s}\mu^{2} )} $$
(15)
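
A transcription of (15) for the running sketch; the clamp to zero is an added numerical safeguard for very small \( s \), not part of the original formula.

```python
def corrected_xi(mu, xi, s, N):
    """Corrected scaling xi' of Eq. (15) for the high-dimensional setting."""
    ns = N * s
    arg = ns / (2.0 + ns) * (xi ** 2 - 2.0 / ns * mu ** 2)
    return np.sqrt(max(arg, 0.0))             # guard against a slightly negative argument
```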

Note that the formula in (15) corrects the overestimation in the high-dimensional setting; thus a better approximation for the probabilities of the ratio is

$$ \Pr \left\{ \frac{\lambda_{m + 1}/\hat{\sigma}_{n}^{2} - \mu}{\xi^{\prime}} \ge z \right\} \approx 1 - F_{\text{TW1}}(z) $$
(16)

The distribution of the ratio \( U \) is used to correct the variance estimator. In order to detect large deviations of the initial estimator \( \hat{\sigma}_{n}^{2} \), we provide a procedure to set a second threshold \( T_{2} \). Based on the result in (16), an approximate expression for the overestimation probability is given by

$$ \Pr \left\{ \frac{\hat{\sigma}_{n}^{2}}{\lambda_{m + 1}} \le T_{2} \right\} = \Pr \left\{ \frac{\lambda_{m + 1}/\hat{\sigma}_{n}^{2} - \mu}{\xi^{\prime}} \ge \frac{1/T_{2} - \mu}{\xi^{\prime}} \right\} \approx 1 - F_{\text{TW1}}\left( \frac{1/T_{2} - \mu}{\xi^{\prime}} \right) $$
(17)

Hence, for a desired probability level \( \alpha_{2} \), the above equation can be numerically inverted to find the decision threshold. After some simple manipulations, we obtain

$$ T_{2} = \frac{1}{\xi^{\prime} \cdot F_{\text{TW1}}^{-1}(1 - \alpha_{2}) + \mu} $$
(18)
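
The second threshold (18) requires a lower Tracy-Widom quantile \( F_{\text{TW1}}^{-1}(1 - \alpha_{2}) \), which must be taken from a table; when it is omitted, the sketch below falls back to the simple approximation \( T_{2} = 1/\mu \) discussed in the next paragraph.

```python
def threshold_T2(mu, xi_prime, tw_lower_quantile=None):
    """Decision threshold T2 of Eq. (18)."""
    if tw_lower_quantile is None:
        return 1.0 / mu                        # high-probability approximation
    return 1.0 / (xi_prime * tw_lower_quantile + mu)
```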

Asymptotically, the spike eigenvalue \( \lambda_{m + 1} \) converges to the right edge of the support \( \sigma_{n}^{2}(1 + \sqrt{N/s}) \) as \( N, s \) go to infinity. According to the expression in (18), this function has the simple approximation \( T_{2} = 1/\mu \) in the high-probability case. Then the upper bound \( T_{2} \cdot \lambda_{m + 1} \) for the known \( \hat{\sigma}_{n}^{2} \) yields a biased estimate. Finally, the following expectation holds:

$$ \text{E}\left( \frac{\mu \cdot T_{2} \cdot \lambda_{m + 1}}{1 + \sqrt{N/s}} \right) \approx \sigma_{n}^{2} \ll \hat{\sigma}_{n}^{2} $$
(19)

By analyzing the statistical result in (19), the corrected quantity \( \mu \cdot T_{2} \cdot \lambda_{m + 1}/(1 + \sqrt{N/s}) \) provides a better estimator than \( \hat{\sigma}_{n}^{2} \), because this bias-corrected estimator is closer to the true variance under high-dimensional conditions. If \( \hat{\sigma}_{n}^{2} \) does not exceed the bound \( T_{2} \cdot \lambda_{m + 1} \), the sample eigenvalues are consistent estimates of their population counterparts. Hence, the optimal estimator is given by

$$ \hat{\sigma}_{*}^{2} = \min \left\{ \hat{\sigma}_{n}^{2},\ \frac{\mu \cdot T_{2} \cdot \lambda_{m + 1}}{1 + \sqrt{N/s}} \right\} $$
(20)
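
In the sketch, the final estimator (20) combines the maximum likelihood value with the bias-corrected bound.

```python
def optimal_noise_variance(sigma2_ml, lam_m1, mu, T2, s, N):
    """Bias-corrected optimal estimator of Eq. (20); lam_m1 is the eigenvalue lambda_{m+1}."""
    corrected = mu * T2 * lam_m1 / (1.0 + np.sqrt(N / s))
    return min(sigma2_ml, corrected)
```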

3.2 Implementation

Based on the construction of two thresholds, we propose a noise estimation algorithm for dictionary learning as follows:

Figure a. The proposed noise variance estimation algorithm.
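
Since the algorithm figure is not reproduced here, the following end-to-end sketch ties the helpers from the previous sections together; it is an illustrative reading of the proposed procedure under the stated assumptions, not a verbatim transcription of figure a.

```python
import numpy as np

def estimate_noise_variance(D_s, alpha1=0.95, tw_lower_quantile=None):
    """Estimate sigma_n^2 from the atom subset D_s (N x s), following Eqs. (2)-(3), (9)-(10), (15), (18), (20)."""
    N, s = D_s.shape
    # Step 1: sample covariance and ordered eigenvalues (Eqs. 2-3).
    eigvals = np.linalg.eigvalsh(sample_covariance(D_s))[::-1]
    # Step 2: first threshold and detection of the noise subspace (Eqs. 8-9).
    mu, xi = centering_scaling(s, N)
    m = detect_noise_index(eigvals, threshold_T1(s, N, alpha1))
    # Step 3: maximum likelihood estimate over the noise eigenvalues (Eq. 10).
    sigma2_ml = ml_noise_variance(eigvals, m)
    # Step 4: high-dimensional bias correction (Eqs. 15, 18, 20).
    xi_prime = corrected_xi(mu, xi, s, N)
    T2 = threshold_T2(mu, xi_prime, tw_lower_quantile)
    return optimal_noise_variance(sigma2_ml, eigvals[m], mu, T2, s, N)
```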

4 Numerical Experiments

The proposed estimation method is evaluated on two benchmark datasets: Kodak [7] and TID2008 [9]. The subjective experiment compares our method with three state-of-the-art estimation methods by Liu et al. [8], Pyatykh et al. [9] and Chen et al. [11], which operate in the SVD domain. Independent white Gaussian noise with deviation levels 10 and 30, respectively, is added to the test images. We set the probabilities \( \alpha_{1} = \alpha_{2} = 0.97 \) and choose \( N = 256 \), \( s = 3 \). In general, a higher noise estimation accuracy leads to a higher denoising quality. We use the K-SVD method to denoise the images [3]. Figures 1 and 2 show that the results obtained with our method outperform those of the other competitors. Moreover, our peak signal-to-noise ratios (PSNRs) are nearest to the true values, 32.03 dB and 27.01 dB, respectively.

Fig. 1. Denoising results on the Woman image using K-SVD.

Fig. 2. Denoising results on the House image using K-SVD.

To quantitatively evaluate the accuracy of the noise estimation, the average of the estimated standard deviations, the mean square error (MSE) and the mean absolute difference (MAD) are computed over 1500 image patches randomly selected from 20 test images. The results shown in Table 1 indicate that the proposed method is more accurate and stable than the other methods. Next, we compare our optimal estimator \( \hat{\sigma}_{*}^{2} \) with \( \hat{\sigma}_{n}^{2} \) and two other existing estimators from the literature. The simulated realizations of the sample covariance matrix follow a Gaussian distribution with different variances. As presented in Table 2, the performance of \( \hat{\sigma}_{*}^{2} \) is invariably better than that of the other estimators. To test the robustness of our estimation method, we further obtain the empirical probabilities of the estimated eigenvalues at typical confidence levels. Figure 3 illustrates that the two asymptotic bounds achieve very high success probabilities.

Table 1. Estimation results of different methods (Best results are highlighted).
Table 2. Estimation results of four estimators (Best results are highlighted).
Fig. 3. Empirical probabilities of exact noise eigenvalue estimation.

5 Conclusions

In this paper, we have shown how to infer the noise level from a trained dictionary. The eigen-subspaces of the signal and the noise are transformed and separated well by determining the eigen-spectrum interval. In addition, the developed estimator can effectively eliminate the estimation bias of the noise variance in the high-dimensional context. Our noise estimation technique has low computational complexity. The experimental results demonstrate that our method outperforms the relevant existing methods over a wide range of noise levels.