1 Introduction

Image denoising aims to recover a clean image from a noisy observation. It is a fundamental research topic in image processing and computer vision because it benefits many high-level applications, such as bioinformatics [14, 19, 25], image encryption [13, 15, 18], texture classification [16, 26], and many others [30,31,32,33,34]. Depending on the generation mechanism, image noise takes different forms, such as additive white Gaussian noise (AWGN), impulse noise, salt and pepper noise, and Poisson noise. In this work, we focus on removing AWGN because it is the most common noise that corrupts images in practice. Let u(x) be a clean (noise-free) image. The noisy image is generated as follows:

$$\begin{aligned} v(x)=u(x)+n(x), \end{aligned}$$
(1)

where v(x) is the noisy version of u(x) and n(x) is the noise added to u(x). Here, n(x) follows the Gaussian distribution, namely \(n(x)\sim N(\mu ,\sigma ^2)\), where \(\mu \) and \(\sigma ^2\) denote the mean and variance of the noise, respectively.
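
As a minimal illustration of Eq. (1), the sketch below adds Gaussian noise to a small grayscale image stored as a nested list. The function name `add_awgn`, the 0 to 255 intensity scale, the flat test image, and the zero-mean default are assumptions made for this example, not details taken from the paper.

```python
import random


def add_awgn(clean, sigma, mu=0.0, seed=None):
    """Return v = u + n, where n ~ N(mu, sigma^2) is drawn independently per pixel."""
    rng = random.Random(seed)
    return [[pixel + rng.gauss(mu, sigma) for pixel in row] for row in clean]


# A flat 100 x 100 test image with intensity 128 (hypothetical data).
u = [[128.0] * 100 for _ in range(100)]
v = add_awgn(u, sigma=25.0, seed=0)

# Empirical statistics of the noise n = v - u.
noise = [vp - up for v_row, u_row in zip(v, u) for vp, up in zip(v_row, u_row)]
mean = sum(noise) / len(noise)
std = (sum((n - mean) ** 2 for n in noise) / len(noise)) ** 0.5
print(f"noise mean = {mean:.2f}, noise std = {std:.2f}")
```

With 10,000 pixels, the empirical mean and standard deviation of the noise should land close to the requested \(\mu \) and \(\sigma \).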

In the last two decades, numerous and diverse image denoising algorithms have been developed from various perspectives, such as image filtering, shrinkage of coefficients, sparse representation over a learned dictionary, and non-local self-similarity statistics. Representative methods include bilateral filtering [29], non-local means (NLM) [2], block matching and 3D filtering (BM3D) [4], K-SVD [6], higher-order singular value decomposition (HOSVD) [22], and weighted nuclear norm minimization (WNNM) [7]. To further improve on these approaches, the authors in [23] proposed a scheme called cascade of shrinkage fields (CSF) for image denoising, which is a kind of unified random field model. Recently, Chen and Pock [3] developed a trainable nonlinear reaction diffusion (TNRD) model. CSF and TNRD employ a large amount of prior knowledge of images and then use forward propagation to optimize the network. Although CSF and TNRD improve the computational efficiency and the denoising quality, they need to train a specific model for each particular kind of noise, so a trained model is not universally suitable for image denoising.

Fig. 1 The main architecture of the proposed DRCNN network, which learns the latent clean image by residual learning

Inspired by the success of deep learning models in diverse vision applications, especially image super-resolution, an increasing number of researchers have attempted to employ deep learning techniques for image denoising. The multilayer perceptron (MLP) [8] and the stacked denoising auto-encoder (SDA) convolutional neural network [27] were the first two denoising algorithms to use deep learning techniques, and they achieved performance comparable to that of the representative BM3D. However, MLP and SDA networks are shallow because gradients vanish in deep networks, which limits their performance. Mao et al. [20] proposed a very deep convolutional encoder–decoder network for image restoration. By connecting the encoding and decoding layers with skip connections, the network depth reached 30 layers, and the proposed network achieved satisfactory performance in both image denoising and super-resolution. Recently, a novel method named DnCNN was proposed for image denoising; this method contains 17 convolutional layers and adopts the residual learning technique [28]. DnCNN not only converges quickly but also significantly improves on the performance of previous algorithms.

Although DnCNN has achieved impressive denoising performance, it is still not deep enough. A deeper network can potentially achieve better performance, so improving the denoising capacity calls for deepening the network. However, the gradient tends to vanish as the neural network deepens. Additional techniques are therefore needed to reconcile network depth with gradient vanishing.

Fig. 2 Comparison of different core structures. a Ours, b ResNet, and c DnCNN

To alleviate the gradient vanishing caused by deepening the network, we propose a novel image denoising method named deep residual convolutional neural network (DRCNN), which is based on the DnCNN and ResNet [9] architectures. We first simplify the network architecture by analyzing and removing unnecessary modules and then use skip connections and residual learning to alleviate gradient vanishing and accelerate the convergence of the network. The proposed method has been evaluated on publicly available benchmark datasets [21] and outperforms the current state-of-the-art approaches.

The main contributions of this work can be summarized as follows. First, we design a deep residual convolutional neural network that uses skip connections between the convolutional layers to form residual blocks, which alleviate gradient vanishing and the performance degradation caused by excessive network depth. Second, we introduce residual learning and simplify the network structure by removing the BN [10] layer. As a result, the network converges very quickly even when it is very deep. In addition, this design improves the denoising performance in terms of PSNR and yields good visual quality.

The rest of this paper is organized as follows. Section 2 provides some related works. Section 3 introduces the proposed method in detail. Several experimental results are presented in Sect. 4. Section 5 finally gives the conclusion.

2 Related work

In this section, we briefly introduce two related techniques that are used in the proposed method.

2.1 Skip connection

The receptive field is the region of the original input image from which a unit extracts its information. The deeper the network, the larger the receptive field and the better the image feature extraction. However, as the number of network layers increases, the gradient vanishes more easily while training the convolutional neural network, resulting in network degradation.
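
To make the link between depth and receptive field concrete, the following sketch computes the receptive field of a stack of convolutions. It assumes stride-1 convolutions without dilation or pooling, which this section does not state explicitly; under that assumption each \(k\times k\) layer widens the receptive field by \(k-1\) pixels. The function name `receptive_field` is introduced only for this example.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (pixels per side) of stacked stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel_size - 1)


# Depths discussed in this paper: DnCNN-17, DnCNN-22, and the 40-layer networks.
for depth in (17, 22, 40):
    side = receptive_field(depth)
    print(f"{depth} layers -> {side} x {side}")
```

Under these assumptions, 17 layers of \(3\times 3\) filters cover a \(35\times 35\) region, whereas 40 layers cover \(81\times 81\).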

To alleviate gradient vanishing, Srivastava et al. [24] put forward the skip connection, in which layer i is connected directly to layer \(i+n\) \((n>1)\), and applied it in highway networks. With skip connections, highway networks exceed 100 layers without network degradation. He et al. [9] proposed the residual network (ResNet), in which the fitted mapping, H(x), is expressed as \(H(x)=F(x)+x\), where F(x) is called the residual mapping and x is the input signal. Through the skip connection, learning H(x) is transformed into learning F(x). The authors showed that F(x) is easier to learn than H(x).
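
The effect of the identity path in \(H(x)=F(x)+x\) can be seen in a toy scalar example. The sketch below is not a model of any real network: the layer \(F(x)=w\tanh (x)\), the weight 0.5, and the input 0.5 are arbitrary choices made for illustration. It accumulates the derivative of the output with respect to the input through 40 stacked layers, with and without skip connections.

```python
import math


def layer(x, w=0.5):
    """Toy scalar layer F(x) = w * tanh(x); returns F(x) and its derivative F'(x)."""
    t = math.tanh(x)
    return w * t, w * (1.0 - t * t)


def input_gradient(x, depth, residual):
    """d(output)/d(input) of `depth` stacked layers, accumulated by the chain rule."""
    grad = 1.0
    for _ in range(depth):
        f, df = layer(x)
        if residual:
            # H(x) = F(x) + x, so H'(x) = F'(x) + 1
            x, grad = x + f, grad * (1.0 + df)
        else:
            # Plain stacking: H(x) = F(x), so H'(x) = F'(x)
            x, grad = f, grad * df
    return grad


plain = input_gradient(0.5, depth=40, residual=False)
skip = input_gradient(0.5, depth=40, residual=True)
print(f"plain: {plain:.3e}, with skip connections: {skip:.3e}")
```

Because \(F'(x)\le 0.5\) in this toy layer, the plain product shrinks at least as fast as \(0.5^{40}\), whereas every factor of the residual product is at least 1, so the gradient cannot vanish along the identity path.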

2.2 Residual learning

Fitting the clean image directly requires a low learning rate, which leads to an excessive convergence time or to difficulty in converging. VDSR [11] proposes a residual learning strategy that defines a residual image as \(r=y-x\); fitting r is quicker than fitting x, and the learning rate can be 1000 times that of SRCNN [5], which greatly accelerates convergence and performs well on super-resolution. DnCNN also uses a residual learning strategy, fitting the noise image directly, and achieves very good denoising results.

Fig. 3 Twelve widely used testing images

Fig. 4 Twelve example images from the BSD68 dataset

Fig. 5 a Loss curves of DnCNN-17, DnCNN-22, and DnCNN-40. b Loss curves of DRCNN–withBN-40 and DnCNN-40. The results are evaluated on the BSD68 dataset at a noise level of 25

3 Proposed method

This section presents the proposed DRCNN in detail, including the network structure and the training procedure of DRCNN.

3.1 Network structure

For image denoising, we use a very deep residual convolutional neural network inspired by ResNet. The configuration is outlined in Fig. 1. DRCNN is composed of three parts and has 40 convolutional layers. The first part learns features of the noisy image. It consists of a convolutional layer followed by a rectified linear unit (ReLU) activation layer, in which the convolutional layer has 64 filters of size \(3\times 3\). The second part is made up of 19 residual blocks, each of which consists of two convolutional layers, and each convolutional layer is followed by a ReLU layer for nonlinear mapping. Each of these convolutional layers has 64 filters of size \(3\times 3\). The third part is a single convolutional layer that produces the output image; it consists of one \(3\times 3\) filter, or three filters if the output is a color image.
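
The layer count above can be checked with a short bookkeeping sketch. It lists every convolution described in this subsection and counts the layers and convolution weights. The helper names, the omission of bias terms, and the choice of matching input and output channel counts (1 for grayscale, 3 for color) are assumptions of this example rather than details given in the paper.

```python
def drcnn_layers(num_blocks=19, features=64, channels=1):
    """List (name, in_channels, out_channels) for each convolution in the description."""
    layers = [("head", channels, features)]  # part 1: conv + ReLU
    for b in range(1, num_blocks + 1):  # part 2: residual blocks of two convs each
        layers.append((f"block{b}_conv1", features, features))
        layers.append((f"block{b}_conv2", features, features))
    layers.append(("tail", features, channels))  # part 3: output conv
    return layers


def weight_count(layers, kernel=3):
    """Number of convolution weights, ignoring biases (an assumption of this sketch)."""
    return sum(kernel * kernel * c_in * c_out for _, c_in, c_out in layers)


gray = drcnn_layers(channels=1)
color = drcnn_layers(channels=3)
print(len(gray), weight_count(gray), weight_count(color))
```

One head layer, 19 blocks of two layers, and one output layer give the 40 convolutional layers stated in the text.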

The input passes through many convolutional layers. As the information transmission path becomes longer, gradient vanishing or explosion occurs easily. In image denoising, the input image is very similar to the output image, so the difference between the input and output images is very small or 0 [11]. Fitting these values converges more easily than fitting the clean image directly. By subtracting the predicted noise image from the noisy image, we obtain the predicted clean image.
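
The final subtraction step can be sketched as follows. The function `denoise` takes any noise predictor; `deviation_from_mean` is a hypothetical stand-in used only to make the example runnable, and it is meaningful only for a perfectly flat clean image. In practice, the trained network would take its place.

```python
def denoise(noisy, predict_noise):
    """Return the clean estimate x_hat = y - R(y) for one image (nested lists)."""
    residual = predict_noise(noisy)
    return [[y - r for y, r in zip(y_row, r_row)]
            for y_row, r_row in zip(noisy, residual)]


def deviation_from_mean(image):
    """Hypothetical stand-in for R: treats each deviation from the image mean as noise."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[p - mean for p in row] for row in image]


# A flat clean image of value 10 plus noise values (2, -1, 0, -1) that sum to zero.
y = [[12.0, 9.0], [10.0, 9.0]]
x_hat = denoise(y, deviation_from_mean)
print(x_hat)
```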

The main structure of our network is the residual block. Residual networks exhibit excellent performance in computer vision problems, and our network structure is similar to ResNet. The difference is that we have removed the BN layers and thus simplified the network. The residual block is formed by two convolutional layers joined by a skip connection, and each convolutional layer is followed directly by a ReLU layer for nonlinear mapping. Figure 2 compares our residual block with the core structures of DnCNN and ResNet.

Fig. 6 The loss curves of DRCNN–withBN and DRCNN–withoutBN (ours). The results are evaluated on the BSD68 dataset at a noise level of 25

Fig. 7 The loss curves of DnCNN and DRCNN. The results are evaluated on the BSD68 dataset at a noise level of 50

Table 1 Denoising performance of different algorithms on the BSD68 dataset in terms of PSNR

3.2 Training

After constructing the network, we train it to optimize its parameters. In this paper, the optimization method is Adam [12], and the number of training epochs is 60, where the first 50 epochs use a learning rate of 0.001 and the last 10 epochs use a learning rate of 0.0001. We denoise at the noise levels of 25, 50, 75, and 100. The input of our DRCNN is a noisy observation, \(y=x+b\). DRCNN uses the true noise image instead of the clean image as the label. In other words, we train a network to learn the mapping \(R(y)=b\) instead of \(F(y)=x\). We use the mean squared error (MSE) as the cost function of the network, which can be represented as follows:

$$\begin{aligned} \text {loss}(\varTheta )=\frac{1}{N}\sum _{i=1}^{N}\Vert R(y_{i};\varTheta )-(y_{i}-x_{i})\Vert _{F}^{2}, \end{aligned}$$
(2)

where \(\varTheta \) represents the trainable parameters of DRCNN and \({(y_i,x_i)}\) represents the ith noisy-clean training image pair. R denotes the residual mapping that predicts the residual image, and N is the total number of training pairs.
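
The training settings and Eq. (2) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used in the experiments: epochs are counted from zero, images are nested lists, and the names `learning_rate` and `residual_loss` are introduced only for this example.

```python
def learning_rate(epoch):
    """Step schedule from the text: 0.001 for the first 50 epochs, 0.0001 for the last 10.

    Epochs are zero-indexed here (0..59), which is an assumption of this sketch.
    """
    return 1e-3 if epoch < 50 else 1e-4


def residual_loss(predicted, noisy, clean):
    """Eq. (2): mean over pairs of the squared Frobenius norm of R(y_i) - (y_i - x_i)."""
    total = 0.0
    for r_img, y_img, x_img in zip(predicted, noisy, clean):
        for r_row, y_row, x_row in zip(r_img, y_img, x_img):
            total += sum((r - (y - x)) ** 2 for r, y, x in zip(r_row, y_row, x_row))
    return total / len(noisy)


# One tiny 1 x 2 training pair (hypothetical values): true residual is (0.5, -1.0).
clean = [[[1.0, 2.0]]]
noisy = [[[1.5, 1.0]]]
perfect = [[[0.5, -1.0]]]  # prediction equal to the true noise
zeros = [[[0.0, 0.0]]]     # prediction of no noise at all
print(residual_loss(perfect, noisy, clean), residual_loss(zeros, noisy, clean))
```

A prediction that equals the true noise gives zero loss, whereas predicting no noise leaves the full noise energy as the loss.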

Table 2 Denoising performance of different algorithms on 12 widely used testing images in terms of PSNR

Fig. 8 Denoising results of the image from Set12 with a noise level of 75

Fig. 9 Denoising results of the image from Set12 with a noise level of 100

4 Experiments

In this section, we provide several experimental results to evaluate the proposed DRCNN. The setting of our experiments is first introduced. The studies of network degradation and batch normalization are then given. The comparison results with the state-of-the-art denoising algorithms are presented last.

4.1 Experimental setting

As a variant of CNNs, DRCNN involves a large number of matrix calculations, resulting in a very high computation cost. For this reason, DRCNN is trained on Tesla P100 GPUs using TensorFlow [1], a commonly used deep learning framework. As in some representative previous works, the BSD400 dataset provides the training images. We follow the procedure used in DnCNN to generate the \(40\times 40\) image patches for training DRCNN. Training takes approximately 14 h.

In our experiments, two sets of images are used as test images to study the denoising performance of the competing methods. The first set is formed by 12 widely used images, such as Lena, Cameraman, House, Peppers, and Barbara. All these images are shown in Fig. 3. The second set is BSD68, which contains 68 different images. Figure 4 shows example images randomly selected from BSD68. The images in both sets cover different image characteristics, such as textured and smooth regions; hence, they support a comprehensive study of all competing denoising methods.

In this work, the proposed DRCNN is compared with the following methods: BM3D, WNNM, EPLL, MLP, CSF, TNRD, and DnCNN. The noisy images are generated via Eq. (1), and the noise intensity \(\sigma \) of the AWGN is set to 25, 50, 75, and 100, as in the previous literature. Recovering the original noise-free image becomes quite challenging as the noise intensity grows. To quantify the denoising performance, the commonly used peak signal-to-noise ratio (PSNR) is applied.
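
For reference, PSNR follows the standard definition \(10\log _{10}(\text {peak}^2/\text {MSE})\). The sketch below implements it for images stored as nested lists; the peak value of 255 (8-bit images) and the function name `psnr` are assumptions of this example, since the paper does not state the intensity range.

```python
import math


def psnr(clean, estimate, peak=255.0):
    """PSNR in dB between two equally sized images: 10 * log10(peak^2 / MSE)."""
    diffs = [(c - e) ** 2
             for c_row, e_row in zip(clean, estimate)
             for c, e in zip(c_row, e_row)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Hypothetical 2 x 2 example in which every pixel is off by exactly 25.
clean = [[100.0, 120.0], [140.0, 160.0]]
estimate = [[125.0, 95.0], [165.0, 135.0]]
print(f"{psnr(clean, estimate):.2f} dB")
```

An error of 25 on every pixel gives an MSE of 625 and a PSNR of about 20.17 dB, which is roughly the level of an unprocessed image corrupted at \(\sigma =25\) on this scale.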

4.2 Study of network degradation

For image denoising, the deeper a plain network is, the greater its degradation due to gradient vanishing, and the worse its denoising performance. To study this effect, we use networks of the same structure with different numbers of layers. We deepened DnCNN to 22 and 40 layers and compared these variants with the original DnCNN (17 convolutional layers). Following the training details above, we trained each network on the BSD400 dataset at a noise level of 25 for 50 epochs with Adam and used BSD68 to evaluate the denoising performance. The loss curves during training are shown in Fig. 5a. From Fig. 5a, we can see that the average PSNR of DnCNN-17 is the highest. This indicates that, for this structure, the deeper the network is, the lower the denoising performance.

The skip connection alleviates the network degradation problem. To verify this, we add BN layers to DRCNN and name the resulting network DRCNN–withBN. The difference between DRCNN–withBN and DnCNN-40 is that DRCNN–withBN has skip connections in its network structure. We train DRCNN–withBN and DnCNN-40 for 50 epochs on BSD400 at a noise level of 25 and then use BSD68 to evaluate the denoising performance. The loss curves during training are shown in Fig. 5b. From Fig. 5b, we can see that the average PSNR of DRCNN–withBN is higher than that of DnCNN-40. This result indicates that the skip connection can alleviate gradient vanishing.

4.3 Study of batch normalization operation

The BN layer can accelerate convergence but also removes range flexibility from networks by normalizing the features [17]. In image denoising, the features in each layer do not need to be normalized; on the contrary, the BN layer can destroy the original image features, so removing it can improve the denoising performance. To verify this, we compare DRCNN–withoutBN (ours) with DRCNN–withBN. The only difference between the two is the presence of BN layers. As described in Sect. 3.2, we train for 60 epochs. We again train on BSD400 at the noise level of 25 and then test the PSNR values of both networks on the BSD68 dataset. The loss curves are shown in Fig. 6. As seen in Fig. 6, the average PSNR of DRCNN–withoutBN is higher than that of DRCNN–withBN. Therefore, BN is not essential in denoising, and removing the BN layer can improve the denoising performance.

4.4 Comparisons with state-of-the-art methods

The loss curves of DnCNN and DRCNN are shown in Fig. 7. From Fig. 7, we can see that our DRCNN model converges faster than the DnCNN model, and the final PSNR is higher. The average PSNR results of different methods on the BSD68 dataset are shown in Table 1. As one can see, our DRCNN model achieves the best PSNR results among the competing methods at almost every noise level. Compared with the benchmark BM3D, the PSNR value of our model is higher by approximately 0.7 dB when the noise level is 50.

Table 2 lists the PSNR results of different methods on the 12 test images shown in Fig. 3. The best PSNR result for each image at each noise level is highlighted in bold. It can be seen that the denoising results of our model are better than those of the compared methods on almost every image. At low noise levels, our denoising performance is similar to that of DnCNN-S. However, at high noise levels, the denoising performance of our model is notably better than that of the other methods. Figures 8 and 9 illustrate the visual results of different methods. It can be seen that BM3D and DnCNN-S lose more texture details, and some details become blurred when magnified.

5 Conclusion

In this work, we presented DRCNN, a novel image denoising method using very deep networks. It is difficult to train a very deep network because of the slow convergence rate, and gradient vanishing and explosion are the two largest difficulties in deepening a neural network. To address these difficulties, we use residual learning and skip connections to optimize a very deep network for DRCNN. With the methods introduced in this paper, the neural network is deepened without its denoising ability being inhibited by network degradation. Based on the experimental results, we compared the proposed method with existing denoising algorithms and demonstrated that it not only improves the PSNR but also exhibits very good visual quality.