1 Introduction

Vision is taken as the most advanced source of information for human beings, and images play the most important role in human vision. Noise corruption is inevitable during the sensing process, and it may heavily degrade our visual experience. Removing noise is an essential step in various image processing and vision tasks, such as image segmentation, image coding, and target detection. In particular, many image restoration problems could be addressed by solving a sub-problems of denoising, which further extends the meaning of exploring image denoising techniques.

Many classical denoising methods have emerged in recent years [1,2,3,4,5,6,7,8], such as wavelet, dictionary leaning and multi-scale feature fusion. The methods could be roughly divided into two categories: spatial domain filtering and transform domain filtering. The former directly processes the pixels of the image and the representative one is the non-local means (NLM) [1]. NLM is wonderful algorithm because of its simplicity and effectiveness, which creatively utilized the non-local self-similarity (NSS) of image for denoising task. The core of NLM is to group the similar patches and perform an average operation for noise suppression. Natural images often have NSS prior, as shown in Fig. 1, the contours are not randomly distributed and have a clear correlation/similarity at different positions. The NSS prior is widely used in many image processing methods [2,3,4,5,6], such as the benchmark denoising algorithm-block matching three-dimensional filtering (BM3D) [3].

Fig. 1
figure 1

Examples of NSS features extracted from images. a some natural images from the standard test dataset; b the contours according to a (extraction with second atom by using PCA method)

The other category mainly uses a basis function (atom) to transform the image to another domain, where separates the noise and effective information of image. The representative method is sparse representation. For natural images, the meaningful information usually possesses sparsity in a transform domain (e.g. wavelet domain), where the sparse signal is enhanced and noise remains the same, by using a simple thresholding method, the noise component could be removed and the useful information is retained, and finally the inverse transform is performed to recover the image in spatial domain. There are two typical solutions, one is based on the analytic basis [9] and the other is the dictionary learning. With the flexibility of atom design, the redundant and over-complete dictionary representation could achieve higher sparsity when compared with the analytic basis methods, and indeed obtained much better denoising performance, such as K-means Singular Value Decomposition (K-SVD) [10], Learned Simultaneous Sparse Coding (LSSC) [5], Expected Patch Log Likelihood (EPLL) [11], Nonlocally Centralized Sparse Representation (NCSR) [12]; furthermore, the visual data often has an intrinsic low-rank structure [13], the low-rankness of patch matrix could be viewed as a 2D sparsity prior as compared with the dictionary sparse representation, and based on the low rank matrix completion theory [14], some impressive restoration methods have been proposed [15,16,17,18,19,20,21,22,23,24]. The key point of sparse method is to improve the sparsity of representation, and the NSS feature is one of the most important and widely used prior for sparsity improvement, especially in the case of image corrupted by noise. Several very competitive algorithms reported [12, 16, 18], taking full advantage of the NSS prior of natural images, have proven the effectiveness of this prior.

In this work, in order to better employ the NSS prior of image and further improve the denoising performance, the centralized sparse representation (CSR) [25]-based algorithm is proposed with enhanced NSS from 3 aspects: (1) the similar patches from external images are combined with the internal noise patches to refine the dictionary; (2) a global matching strategy is adopted to facilitate the estimation of model parameters; (3) an improved similarity measurement with cosine distance is used to eliminate the influence of the dimensional difference.

Among the sparse representation methods, the benchmarks [12, 25] introduced the sparse coding noise (SCN) in the object function and minimized SCN along with the fidelity term to obtain a superior performance. However, the sub-dictionary is learned from the centralized patches from noisy image itself and neglects the external priors, and the estimation of latent clean image is performed in a local window, which limits the quality of patch grouping with NSS prior. By introducing external reference images and global matching strategies, the clustering is improved to make better dictionary learning and more accurate estimation of latent clean patches, and then, the improvement of denoising performance is expected. In particular, using Euclidean distance (ED) is easy to cause mismatch by dimensional error, as shown in Fig. 2, and we adopt the cosine distance instead of ED to remove the dimensional difference to achieve a better matching effect.

Fig. 2
figure 2

Illustration of mismatching with common-used Euclidean distance. a The reference patch; b The patch with the same intensity and different structure; c The patch with different intensity and the same structure

The structure of the paper is constructed as follows: the second part introduces the proposed method, and third part illustrates the operations of NSS enhancement as well as the principle of denoising method; the fourth part demonstrates the denoising results and the comparison with several state-of-the-art algorithms based on two kinds of datasets, one is the standard image processing pictures with added Gaussian white noise (AGWN); the other is the actual low-light image with real noise; the last part summarizes the content of the paper, points out the places need to be improved in the future.

2 The proposed method

For sparse representation, the image x ϵ RN, and its sparse coefficients x ≈ Φαx, αx ϵ RM, where M <  < N and most entries of αx are 0 by dictionary coding (dictionary Φ ϵ RMxN), the key point of the method is sparse decomposition with dictionary Φ, and generally it could be obtained by solving the l0-norm minimization problem [26,27,28], αx = argmin||α||0, s.t. ||x-Φαx||2 ≤ ε; however, the l0 norm minimization is a NP-hard problem, and often replaced by its closest convex relaxation-l1 norm minimization:

$$ \alpha_{x} = \mathop {\arg \min }\limits_{\alpha } \{ \left\| {x - \Phi \alpha } \right\|_{2}^{2} + \lambda \left\| \alpha \right\|_{1} \} $$
(1)

where the constant λ denotes the regularization parameter, which balance the fidelity term (approximation error) and the sparse prior. And Eq. (1) could be minimized efficiently by an analytic method such as iterative shrinkage algorithm [29]. In particular, coding with dictionaries learned from natural images could get a better performance than with fixed basis [30].

For image restoration, the degraded image signal y could be generally written as y = Hx + n, where H is the degradation operator and for denoising H is specified as identical matrix, x is the latent clean image and n is the additive noise, and generally set as Gaussian distribution N(0, σn2). Due to intensity difference in sparse domain between noise and image signal, denoising could be realized by a simple thresholding method, which removes the noise components and retains the structure information [31]. To recover the latent clean image x from its noisy version y with respect to the dictionary Φ, the sparse model could be written in a patched way as below:

$$ \alpha_{y} = \mathop {\arg \min }\limits_{\alpha } \{ \left\| {y - \Phi \alpha } \right\|_{2}^{2} + \lambda \sum\limits_{i} {\left\| {\alpha_{i} } \right\|_{1} } \} $$
(2)

However, simply using the sparse coding for the degraded image to recover the latent clean one is a very challenging task, for ill-posed nature of denoising for high-dimensional nature of natural images. On the other hand, it is known that the sparse coefficients of natural images are not randomly distributed, as shown in Fig. 1, and they have a NSS feature obviously. The strong correlations could be used to develop a more effective sparse model by exploring the nonlocal redundancies. Indeed, several classic methods are derived from this point; among them, the most representative one is NSCR, which introduced a sparse coding noise (SCN) to incorporate the NSS prior to help improve the image restoration performance. The objective function of NCSR [12] is:

$$ \alpha_{y} = \mathop {\arg \min }\limits_{\alpha } \{ \left\| {y - \Phi \alpha } \right\|_{2}^{2} + \lambda \sum\limits_{i} {\left\| {\alpha_{i} - \beta_{i} } \right\|_{1} } \} $$
(3)

where parameters Φ and β paly the most important role for performance improvement. Φ is indeed a sub-dictionary which is learned from each cluster grouped in a global area and β is the estimated latent clean patch used to form the SCN (α-β), obtained by the corresponding sub-dictionary coding on an averaged patch with local matching method. The regularization parameter λ could be derived from the interpretation of the model in the Maximum a Posterior way [32].

The NCSR model achieved the state-of-the-art performance at that time. However, new requirements raised later have surpassed its ability, such as the heavy noise condition where the noise components are dominating in the images, it is very difficult to extract effective characteristics if only based on the original one. There is still room for improvement of the estimated β by sub-dictionary coding on a specified global searching instead of a local window. In order to improve the accuracy of sparse coding and cover more noise levels, we propose an enhanced NSS improvement for sparse representation embed in the NCSR basic framework to pursue a better denoising performance.

3 External and global matching strategy

3.1 The effectiveness and feasibility of external images

The prior information could be extracted from different sources. In general, learning a dictionary from the natural image library will be time-consuming, however, in many occasions, a large number of similar images could be obtained and simplify the selection of external priors. With the help of the correlated images, the extracted sparse prior is much more accurate than that learned from the noisy image only. For example, in the surveillance situations, the image taken at night corrupted by a strong noise due to its low light environment, and in the same environment, images taken during the day have less noise while the major components remain the same which could be a worth reference and provide effective guidance for denoising at night.

3.2 Similarity measurement

For image processing, patch operation is widely adopted for consideration of computation efficiency. Therefore, the patch matrix M of image is constructed with each column as a patch vector. To cluster the patches, the K-means method is adopted for similarity calculation, the cosine distance is used to replace the ED since ED ignores the correlation interference between data/patch (as shown in Fig. 2), which may lower the matching accuracy and in turn affects quality of sparse representation.

The cosine distance could be viewed as a normalized ED; it is insensitive to the absolute values of patches and more closed to the humane perception under a neuroscience perspective [33]. Based on the concept of cosine distance, image block similarity measurement is proposed:

$$ d_{r,i} = 1 - \cos (y_{i} ,y_{r} ) $$
(4)

where r represents the reference image patch, i represents the test patch, dr, i is the ‘distance’ between the two patches yr, yi, function cos(·) is the cosine similarity of yr, yi as two vectors. One patch is classified into a certain class according to the size of dr, i.

3.3 Global clustering for dictionary learning

For the noisy image, external priors are added to improve the dictionary learning of corresponding features, aiming to improve the denoising performance, steps include:

(1) Image retrieval: selecting the similar external images from some ready-made datasets or on the web. For the tested dataset set12, the external image could be selected easily by using a search engine, such as TinEye or Google Images. Also, one can use a much more sophisticated method for image searching, such as [34]. For the real-noise dataset, the external similar images are captured with different conditions such as light intensity, exposure time and different view-angles.

(2) Patch matrix formation: decomposing the selected images into patches with specified size, merging the patches sequentially to form a matrix M (each column represents a patch vector);

(3) Feature extraction:

(a) Flat feature extraction, for a specific patch yi = M(:, i) in M, calculating its variance:

$$ \sigma_{i} = \frac{1}{n}\sum\limits_{p = 1}^{b} {\sum\limits_{q = 1}^{b} {(y_{i} (p,q) - \tilde{y}_{i} (p,q))^{2} } } $$
(5)

where n is the pixel number in a patch, p and q represent the pixel locations, b is the patch size (square shape). \(\tilde{y}_{i}\) is the Gaussian filtered image of patch yi. Then a threshold method is used to determine whether a patch belongs to the flat region. As for natural images, the flat area generally accounts for a large proportion and the flat structure are simple, a small local window is sufficient for its extraction as well as atom learning.

(b) Detail feature extraction: for the detail patches left in M, the K-means method is used for grouping. The similarity measurement adopts the method described in Sect. 2.2. In consideration of the instability of the global dictionary learning method [10], a sub-dictionary learning followed the Ref.12 is adopted based on PCA method [35]. The number of sub-dictionary atom is consistent with the final optimized clustering centers (assuming K). Given the flat region atom, the total atom number is K + 1, and the final dictionary is represented as:

$$ D = \{ \varphi_{0} ,\varphi_{1} ,\varphi_{3} ,...,\varphi_{K} \} $$
(6)

3.4 Global matching for sparse coefficient estimation

The method of searching similar patches in local window is widely used because it is simple and effective. However, for natural images, similar structures may be distributed at different locations across the whole image as well as similar external images. Local windows exclude a large number of high-quality structures outside the window, as shown in Fig. 3. Therefore, we propose a matching method by searching the similar patches in a global area (patch matrix M) instead of a local window.

Fig. 3
figure 3

Comparison of finding similar patches with a local and global matching: a The similar patches with local matching; b The similar patches with global matching (the reference patch is indicated by green square, and the similar patches indicated by red square); c comparison of the averaging effect for local and global matching

For the i-th patch yi of target image, first finding the class Sk it belongs to, for all the image blocks yj in the class Sk, calculating the distance di,j by using (4); then choosing the first t minimum distance patches yi,j as the similar patches of yi. The estimated patch of yi is calculated as:

$$ e_{i} = \frac{1}{{\sum\nolimits_{j = 1}^{t} {w_{i,j} } }}\sum\nolimits_{j = 1}^{t} {w_{i,j} y_{i,j} } $$
(7)

where the averaging weights of wi,j is determined by:

$$ w_{i,j} = \exp ( - d_{i,j}^{2} /h^{2} ) $$
(8)

Finally, the key parameter of sparse coefficient is calculated by the learned sub-dictionary D (φk):

$$ \beta_{i} = \varphi_{k}^{T} e_{i} $$
(9)

Additionally, the regularization parameter λ could be derived from applying the maximum posterior probability MAP [28] for NCSR model representation:

$$ \lambda_{i,j} = \frac{{2\sqrt 2 \sigma_{n}^{2} }}{{\sigma_{i,j} }} $$
(10)

where σn2 is the noise level of original image, and σi,j is the noise level of jth pixel in ith patch, and the object function is:

$$ \alpha_{y} = \mathop {\arg \min }\limits_{\alpha } \{ \left\| {y - \Phi \alpha } \right\|_{2}^{2} + \sum\limits_{i} {\sum\limits_{j} {\lambda_{i,j} \left\| {\alpha_{i} (j) - \beta_{i} (j)} \right\|_{1} } } \} $$
(11)

The global matching strategy offers several distinct advantages: (1) the external prior is used to enhance the NSS feature for sub-dictionary learning and latent clean patch estimation; (2) making the sparse coding of patch consistent with the sub-dictionary learning process in a global nature; (3) the denoising effect has been improved remarkably with different noise levels.

The object function (11) could be solved by an iterative shrinkage method efficiently [29], and the algorithm of this work combined with a global match strategy is illustrated in Algorithm 1.

figure a

4 Experimental results and discussions

4.1 Evaluation indicators and parameters setting

In order to verify the effectiveness of NSS enhancement, the proposed method will be tested on the standard test image dataset (set12) and the real-noise photograph dataset, respectively. The results are compared with the classic denoising algorithms K-SVD, EPLL, BM3D, NCSR and WNNM. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used for evaluation of image denoising performance.

The parameters in K-SVD, EPLL, BM3D, NCSR and WNNM are used as default. Except that there are several parameters to be illustrated for patch matching in NCSR and proposed method. Experimentally, the patch size is set to 6 × 6, 7 × 7, 8 × 8 and 9 × 9 for σn ≤ 20, 20 < σn ≤ 40, 40 < σn ≤ 70 and 70 < σn, respectively as well as T is set to 2, 3, 3, and 4 according to the noise levels. Additionally, the thresholding value for flat classification is proportional to the total noise level and the default scale coefficient is 1. When used for real photograph denoising, the noise level is evaluated by the method [37].

4.2 Test on standard dataset (set12)

Test is performed on 12 standard test images (as shown in Fig. 4) with additive white Gaussian noise (AWGN) for performance testing. The AWGN is set with a mean value of 0 and the standard deviation σn is 10, 30, 50, 70, and 100, respectively.

Fig. 4
figure 4

The standard test image dataset

Table 1 and Table 2 show the PSNR and SSIM results for various sparse representation algorithms, respectively, and the bold values represent the best performance in the comparison. It could be seen that the proposed method has achieved a better performance in almost all noise levels, even in the case of strong noise (σn >  = 50). The remarkable improvements under strong noise (σn >  = 50) prove the effectiveness of the NSS enhancement, since that the external images with similar details/structures merged have offered extra useful information and improved the quality of image prior, while the other methods that only rely on the single noisy image suffer from heavy degradation as the noise level increases and could not be recovered independently and effectively.

Table 1 PSNR (dB) results by different denoising methods
Table 2 SSIM results by different denoising methods

For subjective evaluation, three standard images (Boat, Monarch, Peppers) under strong noise (σn=100) were selected for display (shown in Figure 5). The details of the same area of each figure are enlarged (marked by a green square). Fig. 5 indicates that some details in the original image are almost indistinguishable due to severe damage of strong noise. By comparison, it is found that the K-SVD is overall blurred after denoising, image details are lost. As for EPLL and BM3D, some visual artifacts are suppressed, but the water ripple effect appears in the smooth area of the image, mainly caused by the patch processing of the image, while the NCSR and WNNM have an over-smooth phenomenon and some texture information is lost. Local window search limits the quantity and quality for patch matching, resulting in obvious errors in the reconstructed structures. As for the proposed method with global matching, more details of the original image are reconstructed, while the flat area is visually pleasant.

Fig. 5
figure 5

Comparison of denoising effect for subjective evaluation, the 3 selected images are boat, peppers and Monarch. The classical sparse representation algorithms are selected: K-SVD [10], EPLL [11], BM3D [3], NCSR [12], WNNM [18]. The enlarged details are indicated by green squares

4.3 Test on real photograph dataset

To further verify the effectiveness of the proposed algorithm, we applied the algorithm for real-noise image taken at low-light condition in a laboratory environment. The dataset contains 8 scenes, 5 levels of light intensity below 3 × 10−3 lx and 3 exposure times have been chosen for each scene, and 5 photos are taken under every condition. The total number of our self-made image dataset is 8 × 5 × 3 × 5 = 600. Several low-light typical scenes are shown in Fig. 6.

Fig. 6
figure 6

The real low-light photograph dataset. The pictures are collected with luma target ≤ 3 × 10–3 lx and exposure time = 1/25 s. The top row are the low-light photographs for example, the bottom row are the reference images taken at 1 lx

The overall image is dark, with low signal-to-noise ratio and dead pixels. Usually, such image needs to be pre-processed before denoising, which mainly includes defect corrections and luminance enhancement. The dataset contains images taken under different ambient light levels and under the same shooting parameters, generally, the lower the brightness, the lower the signal-to-noise ratio and stronger noise intensity after digital amplifier. Since latent truth of images is extremely difficult to obtain, we only make a subjective comparison, as shown in Figure 7.

Fig. 7
figure 7

Comparison of denoising performance for subjective evaluation on real photograph dataset. The input noisy images are after pre-processing, and 3 scenes are selected for demonstration, the enlarged details are indicated by green squares

The comparison is similar to that on the standard dataset. Overall, the EPLL and BM3D results still have slight artifacts in the flat region and edge structure; the NCSR and WNNM have a smooth trend. As for the protection of details and textures such as characters and edge lines, the EPLL, BM3D and NCSR are not as good as the proposed algorithm.

5 Conclusions

In this paper, the limitations of reported sparse representation for image denoising are studied. Based on the non-local clustering algorithm framework, we proposed a NSS enhanced denoising algorithm. By combining external reference images and global matching strategies, the quantity and quality of similar patches for sparse prior are improved, making the NSS-based method more efficient. By testing on the standard test dataset and the real low-light dataset, the results indicate that the proposed algorithm could restore the details and textures effectively during denoising and result in visually pleasant images. The method proposed surpasses the previous methods alike and achieves competitive performance even compared with remarkable low rank methods.

The method based on sparse model optimization has promoted many advanced image techniques; however, these methods suffer from shortcoming of time-consuming which limits their applications in real world. In particular, with the fast development of machine learning technology, many deep learning-based denoising methods have been invented [38,39,40,41], which are classified as the discriminative learning method with the advantages of excellent performance on specific problems and fast processing speed, and disadvantages of poor generalization ability and difficult to collect the training pairs.

On the contrary, the model optimization method, such as sparse representation, is of good generalization and provides relatively satisfactory denoising performance, and how to improve the processing speed of the method is our next focus. Inspired by the deep learning work on image denoising, a natural optimization method could be raised by converting the solution of a sparse model to a network like the deep learning network, and it is expected to take both advantages of retaining the image prior for better generalization and transferring the time-consuming work to the pre-training part. The final testing could be high efficiency with GPU hardware accelerations.