1 Introduction

Images acquired by a camera are always degraded by noise. When the scene is poorly illuminated, the camera must increase the sensitivity of its sensor, which degrades the image further. Noise removal is therefore an essential step in many image restoration tasks [1, 2]. Treating the noise as independent of the image gives the simple model \(I=S+\eta ,\) where I is the noisy image, S is the noiseless ground truth image and η is noise with standard deviation σ. An image corrupted by additive white Gaussian noise (AWGN) [3] can thus be modeled pixel-wise as the sum of the ground truth image and the noise η (Eq. 1)

$$I\left(x,y\right)=S\left(x,y\right)+\eta \left(x,y\right)$$
(1)

where \(I\left(x,y\right)\) is the noisy pixel that results when the ground truth pixel \(S\left(x,y\right)\) is corrupted by the noise \(\eta \left(x,y\right)\).
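The additive model of Eq. 1 can be illustrated with a minimal sketch. It assumes a grayscale image stored as a nested list of floats; the function name `add_awgn`, the flat test image and the seed are hypothetical choices made for illustration, not part of the original work.

```python
import random

def add_awgn(image, sigma, seed=None):
    """Return I = S + eta for a 2-D list of pixel values S (Eq. 1).

    eta is drawn independently for every pixel from N(0, sigma^2).
    """
    rng = random.Random(seed)
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]

# A flat grey 64 x 64 test image (hypothetical data, values in [0, 255]).
clean = [[128.0] * 64 for _ in range(64)]
noisy = add_awgn(clean, sigma=25.0, seed=0)

# The empirical standard deviation of I - S should be close to sigma.
residual = [n - c for n_row, c_row in zip(noisy, clean) for n, c in zip(n_row, c_row)]
mean = sum(residual) / len(residual)
std = (sum((r - mean) ** 2 for r in residual) / len(residual)) ** 0.5
print(round(std, 1))
```

Because the noise is independent of the image content, the residual I − S has the same statistics for any clean image S.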

Image denoising is often an ill-posed problem. The ill-posedness can be addressed with the maximum a posteriori (MAP) principle [4], which requires modeling the image with random variables that follow a prior distribution. The corrupted image is then reconstructed using this prior: the objective is to maximize the conditional probability of the reconstructed image given the corrupted image.

Image denoising methods can be broadly classified into two categories: model-based methods and discriminative methods. Model-based methods combine a generic prior model with an optimization algorithm. Examples include Block Matching and 3D Filtering (BM3D) [5], Expected Patch Log Likelihood (EPLL) [6] and Weighted Nuclear Norm Minimization (WNNM) [7]. These methods are computationally expensive and time consuming, and they cannot handle spatially variant noise. Discriminative methods instead learn the image prior from data; examples are the Multi-Layer Perceptron (MLP) [8], DnCNN [2] and Trainable Nonlinear Reaction Diffusion (TNRD) [9]. The major difference between the two categories is that model-based methods are flexible enough to handle several tasks, whereas discriminative methods depend on the dataset used for training, so they are task-specific and can only solve the problems they are designed for. For example, a single model-based method, NCSR [10], can perform image denoising, deblurring and super-resolution, whereas three different discriminative methods, MLP [8], SRCNN [11] and DCNN [12], are designed for image deblurring, super-resolution and denoising.

Despite their flexibility in handling multiple tasks, model-based methods are time consuming and need to be optimized with appropriate priors. Discriminative methods, on the other hand, offer fast speed and promising performance, and so far they have provided the most promising results. Discriminative methods include deep learning techniques [13,14,15] based on convolutional neural networks (CNNs) [11] and MLPs [8]. The aim of this paper is to obtain an estimate of the ground truth image and to thoroughly evaluate different receptive field sizes. The mapping from the noisy image to the estimate is written as \(F\left(I\right)=\widehat{S}\), and a deep learning approach is employed to learn it.

A major part of deep learning is based on CNNs. These networks perform very well compared to traditional algorithms and produce state-of-the-art results. CNNs were originally developed for image recognition [14] and classification [16, 17] tasks, where the network progressively reduces the image resolution. A straightforward way to preserve resolution is to remove subsampling or strides from the layers. However, this shrinks the receptive field [18] in the same proportion as it restores resolution, and such a reduction in the receptive field cannot be tolerated. A brief introduction to the receptive field is given below, together with a set of equations that explain how its size varies.

1.1 Receptive field

In a fully connected neural network, each node of a layer is connected to every node of the next layer; transferring information in this way requires an extremely large number of parameters. A CNN uses a different approach in which only a few nodes participate in the connection to each node of the next layer. This resembles the way neurons in the visual system respond to stimuli only in certain regions of the visual field; such a region is called the receptive field. By analogy, a CNN uses a receptive-field-like layout [19], where the subset of nodes of the previous layer connected to a node of the next layer is the receptive field of that node. Figure 1 shows the receptive field in traditional neural network layers and in a CNN.

Fig. 1 (a) Feature transfer in a traditional neural network between multi-channel layers n and n + 1. (b) Feature transfer in a CNN, where only a certain region, i.e., the receptive field, participates

1.2 Size of receptive field

A large receptive field lets the network perceive more information when predicting an accurate image. The receptive field therefore has to be enlarged to capture wider contextual information from the input. It can be enlarged either by increasing the kernel size or by increasing the depth of the network. Inflating the kernel increases the number of parameters, which makes this approach a computational burden. Adding layers makes the network deeper and introduces more operations. A solution to this problem is to replace convolutional layers with dilated convolution layers, which enlarge the receptive field.

The dilation layer method increases the effective kernel size by inserting blank spaces between the kernel elements. A dilated kernel reduces computation because it covers the same area with fewer parameters than a dense kernel. It also helps in detecting minute details with improved resolution. Dilated convolution has been applied in various domains such as image super-resolution [11, 21], text-to-speech [22] and language translation [23].

1.3 Equation of receptive field

Convolution and dilated convolution-based models [20] can be defined by Eqs. (2) and (3), respectively.

Let \(F:{\mathbb{Z}}^{2}\to \mathbb{R}\) be a discrete function. Let \({\Omega }_{r}={[-r,r]}^{2}\cap {\mathbb{Z}}^{2}\) and let \(k:{\Omega }_{r}\to \mathbb{R}\) be a discrete filter of size \({(2r + 1)}^{2}\). The discrete convolution operator ∗ is defined as (Eq. 2)

$$\left(F*k\right)\left(p\right)=\sum_{s+t=p}F\left(s\right)k\left(t\right)$$
(2)

Let l be a dilation factor and let \({*}_{l}\) be defined as

$$\left(F{*}_{l}k\right)\left(p\right)= \sum_{s+lt=p}F\left(s\right)k\left(t\right)$$
(3)

Setting l = 1 gives the ordinary discrete convolution, and l > 1 gives a dilated convolution. The same idea applies to a 2-dimensional dilated convolution, given by Eq. 4:

$$y\left(m,n\right)=\sum_{i=1}^{M}\sum_{j=1}^{N}x\left(m+si, n+sj\right)w\left(i, j\right)$$
(4)

where \(y\left(m,n\right)\) is the output of the dilated convolution for the input \(x\left(m, n\right)\) and a filter \(w\left(i, j\right)\) of length M and width N. Here s is the dilation rate, which plays the same role as l in Eq. 3.
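The 2-dimensional dilated convolution of Eq. 4 can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the code used in the experiments: indices are zero-based, no padding is applied (only positions where the dilated filter fits inside the input are computed), and the function name `dilated_conv2d` and the test data are hypothetical.

```python
def dilated_conv2d(x, w, d=1):
    """'Valid' 2-D dilated convolution in the correlation form of Eq. 4.

    x: 2-D list (input), w: 2-D list (filter), d: dilation rate.
    d = 1 gives the ordinary (undilated) convolution.
    """
    kh, kw = len(w), len(w[0])
    # Extent covered by the dilated filter along each axis: k + (k - 1)(d - 1).
    eh, ew = kh + (kh - 1) * (d - 1), kw + (kw - 1) * (d - 1)
    out_h, out_w = len(x) - eh + 1, len(x[0]) - ew + 1
    return [
        [
            sum(x[m + d * i][n + d * j] * w[i][j] for i in range(kh) for j in range(kw))
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]

x = [[float(r * 7 + c) for c in range(7)] for r in range(7)]  # 7 x 7 ramp image
w = [[1.0 / 9.0] * 3 for _ in range(3)]                       # 3 x 3 averaging filter
y = dilated_conv2d(x, w, d=2)
print(len(y), len(y[0]))
```

The filter still has nine weights, but with d = 2 each output value is computed from samples spread over a 5 × 5 window of the input.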

A dilation layer can be viewed as a convolution over the input with a sparsely populated filter, which expands the area covered by the filter. The expansion is controlled by a hyper-parameter d, the dilation rate: (d − 1) blank spaces are inserted between adjacent kernel elements. For a dilation rate of 1, no space is added. The effective kernel size for a kernel of size k with dilation rate d is given in [24] as

$${k}^{^{\prime}}=k+\left(k-1\right)\left(d-1\right)$$
(5)

where \({k}^{^{\prime}}\) represents the effective kernel size. Effective receptive field from Eq. 5, for a kernel (k = 3) and dilation rate d = 1, 2 and 3 is 3 × 3, 5 × 5 and 7 × 7 (shown in Fig. 2a–c).
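The sparse-filter view of dilation can be made concrete with a small sketch. It assumes a square kernel stored as a nested list; `dilate_kernel` is a hypothetical helper that inserts (d − 1) zeros between adjacent weights, so the side length of its result follows Eq. 5.

```python
def dilate_kernel(kernel, d):
    """Insert (d - 1) zeros between adjacent weights of a square k x k kernel."""
    k = len(kernel)
    k_eff = k + (k - 1) * (d - 1)  # effective kernel size from Eq. 5
    sparse = [[0.0] * k_eff for _ in range(k_eff)]
    for i in range(k):
        for j in range(k):
            sparse[d * i][d * j] = kernel[i][j]
    return sparse

dense = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]

for d in (1, 2, 3):
    sparse = dilate_kernel(dense, d)
    nonzero = sum(1 for row in sparse for v in row if v != 0.0)
    print(f"d = {d}: {len(sparse)} x {len(sparse)} window, {nonzero} weights")
```

The number of trainable weights stays at nine for every dilation rate; only the window they cover grows.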

Fig. 2 Effective receptive field: (a) 3 × 3 kernel for d = 1, (b) 5 × 5 for dilation rate 2 (d = 2), (c) 7 × 7 for dilation rate 3 (d = 3)

In addition, the receptive field of a network of depth n with kernel size 3 × 3 throughout the network (and no dilation) is (2n + 1) × (2n + 1). The relationship between the dilation rate and the output size o, as in [24], is given by

$$o=\left\lfloor \frac{i+2p-k-(k-1)(d-1)}{s}\right\rfloor +1$$
(6)

for input size i, padding p and stride s. For example, convolving a 3 × 3 kernel over a 9 × 9 input with no padding and dilation rate 2 (i.e., i = 9, k = 3, d = 2, s = 1 and p = 0) produces an output of dimension 5 × 5. Figure 3 shows the 3 × 3 output obtained for i = 7. This concludes the introduction to the receptive field.
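The two numerical examples above can be checked with a short sketch of Eq. 6. The function name `conv_output_size` is hypothetical, and the final call, which uses padding p = d to keep the size unchanged for a 3 × 3 kernel, is an illustrative assumption rather than a setting reported in this paper.

```python
def conv_output_size(i, k, d=1, s=1, p=0):
    """Eq. 6: o = floor((i + 2p - k - (k - 1)(d - 1)) / s) + 1."""
    span = i + 2 * p - k - (k - 1) * (d - 1)
    if span < 0:
        raise ValueError("dilated kernel is larger than the padded input")
    return span // s + 1

print(conv_output_size(i=9, k=3, d=2))  # the 9 x 9 example from the text
print(conv_output_size(i=7, k=3, d=2))  # the 7 x 7 example of Fig. 3

# Assumed setting: with k = 3 and s = 1, padding p = d preserves the input size.
print(conv_output_size(i=256, k=3, d=2, p=2))
```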

Fig. 3 Estimation of the dimension of the output layer: a 7 × 7 input (i = 7) convolved with a 3 × 3 kernel (k = 3) of dilation 2 (d = 2) gives an output of 3 × 3

In this paper we study the significance of the receptive field in image denoising through four study cases. Our work is summarized in Fig. 4, which shows images for the compared cases. Cases 1–3 are described in Sect. 3, and case 4 is introduced in Sect. 6. Case 4 gives our best result in terms of PSNR (Peak Signal to Noise Ratio). The rest of the paper is organized as follows. Section 2 contains a brief study of related methods. Section 3 explains our study cases. Sections 4 and 5 describe the dataset and the network structure. Section 6 presents experimental results and analysis based on PSNR values and the corresponding images; based on this analysis, another network, referred to as case 4, is introduced. The results on the test sets are discussed in Sect. 7. Lastly, the conclusion is given in Sect. 8 and future work in Sect. 9.

Fig. 4 Predicted denoised images compared by PSNR and SSIM; an inset is also shown for visual comparison. (a) Input noisy image (σ = 25). (b) Result with case 1. (c) Result with case 2. (d) Result with case 3. (e) Result with case 4

2 Related methods

Filter-based techniques were among the first methods proposed for AWGN denoising. These filters are divided into spatial domain filters and transform domain filters. Mean filtering [25], denoising with local statistics [26], the Wiener filter [27] and bilateral filtering [28] are some of the most prevalent techniques. These techniques were not sufficient to produce good-quality images.

The image prior is an important property in image denoising. In the past decade a lot of methods were proposed based on image priors. Some of them are Markov Random Field (MRF) [20, 29], BM3D [5], NCSR [10], nonlocal self-similarity (NSS) [31] and WNNM [7]. It is convenient to learn the prior model on small image patches. In EPLL [6] optimization is made on an entire image and image prior is given by the product of all patch priors. Non-local self-similarity [6, 7, 31] based methods exploit the property of repetitive patterns in natural images. These similar patches are grouped to collaboratively estimate the final image. Out of the prior based methods mentioned above, BM3D [5] and WNNM [7] are the popular ones. They are capable of handling various noise levels but they cannot be directly used for spatially variant noise [32].

CNNs are widely used in image processing tasks due to their excellent performance, and CNN-based methods have also been compared against prior-based methods. Jain et al. [33] compared the performance of a Markov random field (MRF) with convolutional neural networks. Another comparison, with BM3D, was made by Burger et al. [8].

Prior-based methods perform well but have some drawbacks. They need careful optimization, which increases the computational cost, and they rely on manual settings and tuning. To address these problems, discriminative approaches were proposed that map the noisy image directly to the ground truth image. Among discriminative learning approaches, DnCNN [2] is the most popular. It is a single method that aims to solve various image restoration (IR) tasks, i.e., blind Gaussian denoising, single image super-resolution and JPEG deblocking [34], and it produced promising results for all three problems. A similar method [1] used half quadratic splitting (HQS), a variable splitting technique, to solve image denoising, image deblurring and single image super-resolution using a deep CNN denoiser prior. Chuah et al. [35] provided a straightforward strategy of estimating the noise level before removing the noise. Wang et al. [36] achieved comparable results with reduced computational cost and a less complicated network structure by using a larger receptive field than DnCNN. Combining dilated layers with residual learning is a popular technique for the AWGN problem [2, 37, 38]. All the methods described above employ a one-to-one mapping, i.e., they require a single image as input to produce a denoised image. Zhang et al. [32] proposed a novel network design, FFDNet (Fast and Flexible Denoising Convolutional Neural Network), that takes sampled input images together with their noise level maps to produce a denoised image. In addition, recent approaches aim to deal with both spatially variant and invariant noise and work on real-world noisy images. Anwar et al. [38] incorporated feature attention for image denoising. A benchmark dataset of denoised images was created by Romano et al. [39]; its images were captured with different cameras and under different camera settings. Guo et al. [40] use the same strategy as [35] of estimating the noise before removing it. Their work is known by the name of its network architecture, CBDNet [40], which has two sub-networks: one for noise estimation and the other for non-blind denoising. Some of the methods used for comparison are categorized in Fig. 5.

Fig. 5 Categorization of image denoising methods

3 Our method

The present research studies three cases, which are as follows:

Case 1: A dilated convolutional network is used. The network structure is similar to that of Zhang et al. [1] and Peng et al. [37]: a 7-layer dilated network with a receptive field of 33 × 33. Zhang et al. [1] and Peng et al. [37] used a residual learning formulation, \(F\left(I\right)=\eta\), in which the network predicts the noise and the denoised image is obtained by subtracting the predicted noise from the noisy input image. We instead use \(F\left(I\right)=\widehat{S}\), a direct mapping from the noisy image to the noiseless image, so no intermediate step is required.

Case 2: A plain CNN with the same configuration and the same 33 × 33 receptive field is used. The same receptive field can be obtained either by using larger filters or by increasing the depth of the network; we use larger filters.

Case 3: To verify the efficacy of dilated layers in image denoising, the receptive field size is reduced. All dilated layers are replaced with plain convolutional layers, which reduces the receptive field to 15 × 15. The performance can then be compared with the two cases above. Figure 6 illustrates the three cases.

Fig. 6 Illustration of the three cases in the proposed work

The receptive field of each layer for all three cases is shown in Table 1; every case has seven layers. The term \({rcp}_{i}\) denotes the receptive field of the network up to layer \(i\), where \(i=1, 2, \dots ,7\). These cases are essential steps in verifying the effect of receptive field size on image denoising. The three cases are compared quantitatively and visually.
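The cumulative receptive field up to each layer can be computed with a short sketch. It assumes stride 1 in every layer, so each layer with kernel size k and dilation rate d enlarges the receptive field by (k − 1)·d. The dilation rates 1, 2, 3, 4, 3, 2, 1 used for case 1 are inferred from the effective kernel sizes 3, 5, 7, 9, 7, 5, 3 listed in Sect. 5, and the function name `receptive_fields` is hypothetical.

```python
def receptive_fields(kernel_sizes, dilations):
    """Cumulative receptive field after each stride-1 layer, starting from 1."""
    rf, per_layer = 1, []
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d  # growth contributed by this layer
        per_layer.append(rf)
    return per_layer

# Case 1: seven 3 x 3 layers with inferred dilation rates 1, 2, 3, 4, 3, 2, 1.
case1 = receptive_fields([3] * 7, [1, 2, 3, 4, 3, 2, 1])
# Case 3: seven plain 3 x 3 layers (dilation rate 1 everywhere).
case3 = receptive_fields([3] * 7, [1] * 7)

print(case1[-1], case3[-1])
```

Under these assumptions the final values reproduce the 33 × 33 and 15 × 15 receptive fields stated for cases 1 and 3.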

Table 1 Receptive field of the network used in case 1, case 2 and case 3

4 Datasets

We use the COCO dataset [41], an open-source online dataset that contains 5000 images in its val2016 set. After preprocessing, an augmented set of 10,000 images was created. We observed that increasing the size of the dataset beyond this does not lead to significant improvement. The images are separated into training and validation sets in the ratio 7:3 and then cropped to 256 × 256 pixels. To produce synthetic noisy images, additive white Gaussian noise (AWGN) is added to the images.

For blind denoising, noise levels (σ) are randomly selected from the range [0, 75] to create the dataset. The test sets are BSD68 [30], Set 12 [2], RNI15 [42], NC12 [42] and Nam [43]. BSD68 [30] and Set 12 [2] contain classic images that have been used extensively to evaluate numerous image processing methods. RNI15 [42] is a set of 15 real-world noisy images, which also contain spatially variant noise. NC12 [42] is a set of 12 noisy images. Because RNI15 [42] and NC12 [42] have no ground truth images, results on these sets are compared visually.
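The data preparation described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the experiments: it assumes that σ is drawn uniformly from [0, 75] for each image, that the 7:3 split is made by shuffling image identifiers, and that no clipping is applied after adding noise. The names `split_train_val` and `make_blind_pair` and the flat test patch are hypothetical.

```python
import random

def split_train_val(items, train_fraction=0.7, seed=0):
    """Shuffle the items and split them into training and validation subsets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def make_blind_pair(clean, rng, sigma_max=75.0):
    """Return (noisy, clean, sigma) with sigma sampled uniformly from [0, sigma_max]."""
    sigma = rng.uniform(0.0, sigma_max)
    noisy = [[p + rng.gauss(0.0, sigma) for p in row] for row in clean]
    return noisy, clean, sigma

# 10,000 hypothetical image identifiers, split in the ratio 7:3.
train_ids, val_ids = split_train_val(range(10_000))
print(len(train_ids), len(val_ids))

# One 256 x 256 training pair built from a flat grey patch (hypothetical data).
rng = random.Random(1)
patch = [[128.0] * 256 for _ in range(256)]
noisy, target, sigma = make_blind_pair(patch, rng)
```

Sampling a new σ for every image is what lets a single model see the whole noise range during training.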

5 Network structure

The architecture of the network used to remove Gaussian noise is shown in Fig. 7. It takes an image degraded with a certain level of AWGN as input, and this image is convolved with dilated kernels. Each layer uses filters whose effective dimension is defined by the dilation rate, and the filter weights are learned by back-propagation. No pooling is used because the input and output images must have the same dimensions. Zero padding is used to avoid boundary artifacts. Filters of dimension 3 × 3 × 32 are used, where the third dimension refers to the number of filters. A ReLU (Rectified Linear Unit) [44] is placed between consecutive convolution layers to introduce non-linearity in the network. Adam [45] is used as an adaptive optimizer with a learning rate of 0.001. The loss function is the MSE (mean squared error) shown in Eq. (7).

$$MSE= \frac{1}{n}{\sum }_{i=1}^{n}{\left({S}_{i}-{\widehat{S}}_{i}\right)}^{2}$$
(7)

where \(S\) and \(\widehat{S}\) are ground truth and predicted denoised images. The final layer gives out the output i.e., an image with reduced noise.

Fig. 7 Network architecture used in simulation, where 1D, 2D, … refer to the dilation rate

Table 2 shows the parameters and the size of the effective kernel in each layer. The receptive field (effective kernel size) of each layer, denoted by \({r}_{1},{r}_{2}, \dots ,{r}_{7}\), is 3, 5, 7, 9, 7, 5, 3 for layers 1–7 respectively, assuming \({r}_{0}=1\) in calculating the receptive field of the network.

Table 2 Network structure used in case 1

6 Experimental results and analysis

The proposed method is a plain discriminative method, since it is feasible to train a deep network with a minimal number of layers. Although the residual learning framework is not used here, equivalent results are achieved. All our models were trained on an Nvidia Tesla K80 GPU.

PSNR and SSIM are used to compare the performance of the cases described in Sect. 3. Including SSIM in the comparison accounts for image quality as perceived by humans; it estimates the correlation between two normalized images.

$$PSNR=10\,{\mathrm{log}}_{10}\left(\frac{{(255)}^{2}}{MSE}\right)$$
(8)
$$SSIM= \frac{(2{\mu }_{S}{\mu }_{\widehat{S}}+{C}_{1})(2{\sigma }_{S\widehat{S}}+{C}_{2})}{({{\mu }_{S}}^{2}+{{\mu }_{\widehat{S}}}^{2}+{C}_{1})({{\sigma }_{S}}^{2}+{{\sigma }_{\widehat{S}}}^{2}+{C}_{2})}$$
(9)

where \({\mu }_{S}\), \({{\sigma }_{S}}^{2}\) and \({\mu }_{\widehat{S}}\), \({{\sigma }_{\widehat{S}}}^{2}\) are the local mean and variance of the ground truth image and the predicted image, respectively, \({\sigma }_{S\widehat{S}}\) denotes the local covariance of the ground truth and predicted images, and \({C}_{1}\) and \({C}_{2}\) are small constants that stabilize the division.
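Both metrics can be sketched in a few lines. The PSNR below uses the standard decibel definition, 10·log10(255²/MSE). The SSIM function is a deliberate simplification: it evaluates Eq. 9 once with global image statistics instead of averaging over local windows, and the constants C1 = (0.01·255)² and C2 = (0.03·255)² are the commonly used defaults, assumed here because the text does not state them. Images are passed as flat lists of pixel values, and all function names are hypothetical.

```python
import math

def mse(s, s_hat):
    """Mean squared error between two equally long pixel lists (Eq. 7)."""
    return sum((a - b) ** 2 for a, b in zip(s, s_hat)) / len(s)

def psnr(s, s_hat, peak=255.0):
    """PSNR in dB; infinite when the two images are identical."""
    err = mse(s, s_hat)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)

def global_ssim(s, s_hat, peak=255.0):
    """Eq. 9 evaluated with global statistics (simplified, single window)."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2  # assumed default constants
    n = len(s)
    mu_s, mu_h = sum(s) / n, sum(s_hat) / n
    var_s = sum((a - mu_s) ** 2 for a in s) / n
    var_h = sum((b - mu_h) ** 2 for b in s_hat) / n
    cov = sum((a - mu_s) * (b - mu_h) for a, b in zip(s, s_hat)) / n
    return ((2 * mu_s * mu_h + c1) * (2 * cov + c2)) / (
        (mu_s ** 2 + mu_h ** 2 + c1) * (var_s + var_h + c2)
    )

truth = [50.0, 100.0, 150.0, 200.0]
pred = [v + 5.0 for v in truth]  # constant error of 5 grey levels
print(round(psnr(truth, pred), 2), round(global_ssim(truth, pred), 4))
```

A windowed SSIM, as used in practice, would apply the same expression to local patches and average the results.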

6.1 Comparison of case 1, case 2, case 3 and case 4

Our cases are compared with three prior based methods BM3D [5], WNNM [8], EPLL [7] and two discriminative methods TNRD [9] and DnCNN [2] as shown in Table 3. Comparison is based on the PSNR values obtained for Set 12 Images [2].

Table 3 PSNR results of various methods on Set 12 dataset, maximum PSNR values are highlighted as bold characters

All simulations are run with σ in the range [0, 75] and image intensities in the range [0, 255]. A single model handles the whole range of σ, which is also referred to as blind denoising. These models are referred to as case 1, case 2 and case 3. A comparison of the cases is given below:

  • Case 1 and case 2

Table 3 shows that these two cases produce approximately the same results for both σ = 15 and σ = 25. Because they have the same receptive field, they produce similar results, which supports the significance of the receptive field. Cases 1 and 2 differ in convergence rate: because case 2 uses convolutions with large filters, it converges more slowly than case 1.

  • Case 1 and case 3

Case 3 shows slightly lower PSNR values than case 1, and the difference grows as the noise level (σ) increases from 15 to 25. This difference demonstrates the need for a larger receptive field in image denoising. Case 1 converges fastest among the cases.

  • Case 2 and case 3

Case 2 outperforms case 3 significantly when the noise is high (σ = 25), although the two cases have comparable PSNR values at lower noise levels (σ < 20). Table 3 also shows that when the noise level (σ) exceeds 30, none of these cases beats the state-of-the-art methods.

These comparisons lead this work in the direction of case 1. Case 1 is modified slightly by using dilation layers with a larger receptive field. Further, previous methods validate the need for batch normalization, so batch normalization layers are also added; the resulting network is case 4.

  • Case 4

This case is the same as case 1 with some variations. The network is designed with Batch Normalization (BN) layers [15], using a Conv + BN + ReLU sequence. Batch normalization slows down convergence due to the increased number of parameters, but it is essential for reducing the problem of internal covariate shift in the network.

Given the limitations of the 7-layer network of case 1, the number of layers needs to be increased. The network with BN has 9 layers, and its receptive field is 51 × 51 (Table 4). The network structure is shown in Fig. 8. Mean squared logarithmic error (MSLE) is used as the loss in this case, unlike the MSE used in the previous cases. As the name suggests, it is the MSE computed on the logarithms of \(S\) and \(\widehat{S}\) (Eq. (10)).

$$L\left(S,\widehat{S}\right)= \frac{1}{N}{\sum }_{i=1}^{N}{\left(\mathrm{log}\left({S}_{i}+1\right)-\mathrm{log}\left({\widehat{S}}_{i}+1\right)\right)}^{2}$$
(10)

where \(S\) and \(\widehat{S}\) are the ground truth and predicted denoised images respectively and \(N\) denotes the number of samples. 1 is added to both \(S\) and \(\widehat{S}\) so that the logarithm remains defined when \(S\), \(\widehat{S}\to 0\).
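The MSLE loss of Eq. 10 can be sketched as follows. The function name `msle` is hypothetical, inputs are assumed to be flat lists of non-negative pixel values, and `math.log1p(v)` is used as a numerically careful way of computing log(v + 1).

```python
import math

def msle(s, s_hat):
    """Mean squared logarithmic error between two equally long pixel lists."""
    return sum((math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(s, s_hat)) / len(s)

# The same absolute error of 10 grey levels in a dark and in a bright region.
dark = msle([10.0], [20.0])
bright = msle([200.0], [210.0])
print(dark > bright)

# The loss stays finite at zero intensity because of the +1 offset.
print(msle([0.0, 0.0], [0.0, 1.0]))
```

Because the error is measured on a logarithmic scale, a given absolute deviation is penalized more heavily in dark regions than in bright ones, which is one way this loss differs from the MSE.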

Table 4 Receptive field of the network of case 4

Fig. 8 The network architecture with Batch Normalization

A comparison of case 4 with the state of the art is shown in Table 3, where case 4 surpasses BM3D [5], WNNM [8], MLP [8] and DnCNN [2] by a margin of at least 0.4 dB for noise levels above 25 (σ > 25). For noise levels below 25 (σ < 25), case 4 produces comparable results; this is because of the larger receptive field. Lower noise levels are removed better by methods with larger modeling capacity, e.g., DnCNN [2]. Case 4 worked well for σ = 25, 35 and 50, and even for σ = 75 it outperformed FFDNet [33] in 4 out of 12 images.

7 Discussion

According to Table 3, case 4 is our best model so far, so it is used as the prediction model for the test sets. Figure 9 shows the images predicted by case 4 for a series of noise levels (σ = 15, 25, 35, 45, 55, 75), where an image from CBSD68 [30] is corrupted with each noise level to create a set of synthetic noisy images.

Fig. 9 Image denoising results for noise levels σ = 15, 25, 35, 45, 55, 65 and 75 (written in bold). PSNR values (Noisy/Ground Truth) of the noisy and predicted images are shown; the upper row shows the noisy images and the lower row shows the predicted images

Figure 10 shows PSNR results on popular test images from BSD68 [30] and CBSD68 [30] that were used by previous methods. Our method (case 4) performs better on BSD68 [30] than on CBSD68 [30]; for color images, FFDNet works best among the compared methods. The average PSNR values computed over the 68 images of the BSD68 [30] and CBSD68 [30] datasets are shown in Table 5, with DnCNN [2] and FFDNet [33] used for comparison. Bold values mark the maximum among the compared methods: ours works best on BSD68 [30] and RIDNet [39] performs best on CBSD68 [30]. Results on the real-world dataset RNI15 [42], whose images contain spatially variant noise, were compared (Fig. 11) with DnCNN [2], FFDNet [33] and RIDNet [39]. Although our method is not specifically designed for this type of noise, it performs surprisingly well. For the Flower image, it can be clearly seen that our method outperforms RIDNet [39]. The result on the Pattern 3 image is also shown. Overall noise reduction is undoubtedly better with RIDNet, but careful analysis of these images shows that an important piece of image data is missing in RIDNet's [39] output. The inset image shown at the bottom illustrates the loss in image quality with RIDNet and FFDNet. The quality of the images is easier to compare using the zoomed versions of the resulting images. The overall performance of our model on these datasets validates the broad applicability of our method.

Fig. 10 Image denoising results on the BSD68 (grey) and CBSD68 (color) datasets for sigma values 15, 25 and 50

Table 5 Average PSNR result comparison

Fig. 11 Test results on the RNI15 and NC12 datasets

8 Conclusion

This paper presents a detailed study of receptive fields. It covers the significance of the receptive field in the image denoising problem [46, 47] and the calculation of the receptive field of a network as well as of its layers. To this end, several comparison techniques were used in network design and training, and the performance of different receptive field sizes was compared. Our work not only provides a comparative study but also addresses the problem of image denoising itself. With these variations we were able to produce results that are competitive with the state of the art. Previous methods emphasized the crucial need for residual learning, whereas the present work performs well without it. The proposed work is an end-to-end approach that requires only a noisy image. A single network works well for a wide range of sigma (σ) values on grayscale and color images. The results on real noisy images further demonstrate that our work can deliver perceptually appealing denoised results when compared with BM3D [5], WNNM [8], EPLL [7], TNRD [9] and DnCNN [2]. Although this work does not consider spatially variant noise in its methodology, it can compete with RIDNet [39], a method designed specifically for spatially variant noise.

9 Future scope

There is a continuous effort on image enhancement [48, 49] and restoration [50] in various fields; despite that, performance on real images is still lacking. This is due to the fact that any simulated noise is much simpler than real noise. In real life, factors such as illumination, camera shake and sensors are responsible for degrading the image. Thus, more powerful noise modeling is required that can handle a variety of noise types.