1 Introduction

Image denoising is an important research problem in low-level vision, aiming at recovering the latent clean image \(\textit{\textbf{x}}\) from its noisy observation \(\textit{\textbf{y}}\). Despite significant advances in the past decades [8, 14, 56, 57], real image denoising remains a challenging task, due to the complicated processing steps within the camera system, such as demosaicing, gamma correction and compression [46].

Fig. 1. Illustration of our proposed dual adversarial framework. The solid lines denote the forward process, and the dotted lines mark the gradient interaction between the denoiser and generator during the backward pass.

From the Bayesian perspective, most traditional denoising methods can be interpreted within the Maximum A Posteriori (MAP) framework, i.e., \(\max _{\textit{\textbf{x}}} p(\textit{\textbf{x}}|\textit{\textbf{y}}) \propto p(\textit{\textbf{y}}|\textit{\textbf{x}})p(\textit{\textbf{x}})\), which involves one likelihood term \(p(\textit{\textbf{y}}|\textit{\textbf{x}})\) and one prior term \(p(\textit{\textbf{x}})\). Under this framework, two methodologies have been considered. The first attempts to model the likelihood term with proper distributions, e.g., Gaussian, Laplacian, MoG [33, 55, 59] and MoEP [10], which reflect different assumptions about the noise generation mechanism, while the second mainly focuses on exploiting better image priors, such as total variation [40], non-local similarity [8], low-rankness [15, 17, 47, 53] and sparsity [31, 52, 58]. Although the Bayesian framework offers better interpretability, these MAP-based methods are still limited by their manual assumptions on the noise and image priors, which may deviate largely from real images.

In recent years, deep learning (DL)-based methods have achieved impressive success in the image denoising task [4, 56, 57]. However, as is well known, training a deep denoiser requires a large number of clean-noisy image pairs, which are time-consuming and expensive to collect. To address this issue, several noise generation approaches were proposed to simulate more clean-noisy image pairs to facilitate the training of deep denoisers. The main idea behind them is to unfold the in-camera processing pipelines [7, 19], or to directly learn the distribution \(p(\textit{\textbf{y}})\) as in [11, 25] using a generative adversarial network (GAN) [16]. However, the former methods involve many hyper-parameters that need to be carefully tuned for specific cameras, and the latter ones struggle to simulate realistic noisy images with high-dimensional signal-dependent noise distributions. Besides, to the best of our knowledge, there is still no metric to quantitatively assess the quality of the generated noisy images w.r.t. the real ones.

To address these issues, we propose a new framework to model the joint distribution \(p(\textit{\textbf{x}},\textit{\textbf{y}})\) instead of only inferring the conditional posterior \(p(\textit{\textbf{x}}|\textit{\textbf{y}})\) as in the conventional MAP framework. Specifically, we first factorize the joint distribution \(p(\textit{\textbf{x}},\textit{\textbf{y}})\) from two opposite directions, i.e., \(p(\textit{\textbf{x}}|\textit{\textbf{y}})p(\textit{\textbf{y}})\) and \(\int _{\textit{\textbf{z}}}p(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}})p(\textit{\textbf{x}})p(\textit{\textbf{z}})\mathrm {d}\textit{\textbf{z}}\), which can be well approximated by an image denoiser and a noise generator, respectively. Then we simultaneously train the denoiser and generator in a dual adversarial manner as illustrated in Fig. 1. After that, the learned denoiser can either be directly used for the real noise removal task, or further enhanced with new clean-noisy image pairs simulated by the learned generator. The contributions of this work are summarized as follows:

  • Different from the traditional MAP framework, our method approximates the joint distribution \(p(\textit{\textbf{x}},\textit{\textbf{y}})\) from two different factorized forms in a dual adversarial manner, which subtly avoids the manual design of image priors and noise distributions. Moreover, the joint distribution theoretically contains more complete information about the data set than the conditional posterior \(p(\textit{\textbf{x}}|\textit{\textbf{y}})\).

  • Our proposed method can simultaneously deal with both the noise removal and noise generation tasks in one unified Bayesian framework, and achieves better performance than the state-of-the-art methods on both tasks. Moreover, the performance of our denoiser can be further improved by retraining on the training data set augmented with additional clean-noisy image pairs simulated by our learned generator.

  • In order to assess the quality of the noisy images simulated by a noise generation method, we design two metrics, which, to the best of our knowledge, are the first metrics for this purpose.

2 Related Work

2.1 Noise Removal

Image denoising is an active research topic in computer vision. Under the MAP framework, rational priors must be assumed in advance to enforce some desired properties of the recovered image. Total variation [40] was first introduced to deal with the denoising task. Later, the non-local similarity prior, meaning that the small patches in a large non-local area may share some similar patterns, was considered in NLM [8] and followed by many other denoising methods [14, 15, 28, 30]. Low-rankness [15, 17, 53, 54] and sparsity [30, 31, 50, 52, 58] are two other well-known image priors, which are often used together within dictionary learning methods. Besides, discriminative learning methods represent another research line, mainly including Markov random field (MRF) methods [6, 41, 44], cascade of shrinkage fields (CSF) methods [42, 43] and the trainable nonlinear reaction diffusion (TNRD) [12] method. Different from the above prior-based methods, noise modeling approaches focus on the other important component of MAP, i.e., the likelihood or fidelity term. For example, Meng and De La Torre [33] proposed to model the noise distribution as a mixture of Gaussians (MoG), while Zhu et al. [59] and Yue et al. [55] both introduced the non-parametric Dirichlet Process to MoG to expand its flexibility. Furthermore, Cao et al. [10] proposed the mixture of exponential power (MoEP) distributions to fit more complex noise.

In recent years, DL-based methods have achieved significant advances in the image denoising task. Jain and Seung [23] first adopted a five-layer network to deal with the denoising task. Then Burger et al. [9] obtained performance comparable to BM3D using one plain multi-layer perceptron (MLP). Later, some auto-encoder based methods [2, 49] were also proposed. It is worth mentioning that Zhang et al. [57] proposed the convolutional denoising network DnCNN and achieved state-of-the-art performance on Gaussian denoising. Following DnCNN, many different network architectures were designed to deal with the denoising task, including RED [32], MemNet [45], NLRN [29], N3Net [37], RIDNet [4] and VDN [56].

2.2 Noise Generation

As is well known, the high cost of collecting pairs of training data is a critical limitation for deep learning based denoising methods. Therefore, several methods were proposed to explore the generation mechanism of image noise to facilitate easy simulation of more training data pairs. One common idea was to generate image pairs by “unprocessing” and “processing” each step of the in-camera processing pipelines, e.g., [7, 19, 24]. However, these methods involve many hyper-parameters to be tuned for each specific camera. Another simpler way was to learn the real noise distribution directly using GAN [16] as demonstrated in [11] and [25]. Due to the complexity of real noise and the instability of GAN training, it is very difficult to train a good generator for simulating realistic noise.

3 Proposed Method

Like most supervised deep learning denoising methods, our approach is built on a given training data set containing pairs of real noisy image \(\textit{\textbf{y}}\) and clean image \(\textit{\textbf{x}}\), which are accessible thanks to the contributions of [1, 3, 51]. Instead of directly learning a mapping from \(\textit{\textbf{y}}\) to \(\textit{\textbf{x}}\), we attempt to approximate the underlying joint distribution \(p(\textit{\textbf{x}},\textit{\textbf{y}})\) of the clean-noisy image pairs. In the following, we present our method from the Bayesian perspective.

3.1 Two Factorizations of Joint Distribution

In this part, we factorize the joint distribution \(p(\textit{\textbf{x}},\textit{\textbf{y}})\) from two different perspectives, and discuss how they relate to the noise removal and noise generation tasks, respectively.

Noise Removal Perspective: The noise removal task can be considered as inferring the conditional distribution \(p(\textit{\textbf{x}}|\textit{\textbf{y}})\) under the Bayesian framework. The learned denoiser R in this task represents an implicit distribution \(p_R(\textit{\textbf{x}}|\textit{\textbf{y}})\) to approximate the true distribution \(p(\textit{\textbf{x}}|\textit{\textbf{y}})\). The output of R can be seen as an image sampled from this implicit distribution \(p_R(\textit{\textbf{x}}|\textit{\textbf{y}})\). Based on this understanding, we can obtain a pseudo clean image pair \((\hat{\textit{\textbf{x}}}, \textit{\textbf{y}})\) as follows:

$$\begin{aligned} \textit{\textbf{y}} \sim p(\textit{\textbf{y}}),~\hat{\textit{\textbf{x}}} = R(\textit{\textbf{y}})\Longrightarrow (\hat{\textit{\textbf{x}}}, \textit{\textbf{y}}), \end{aligned}$$
(1)

which can be seen as one example sampled from the following pseudo joint distribution:

$$\begin{aligned} p_R(\textit{\textbf{x}},\textit{\textbf{y}}) = p_R(\textit{\textbf{x}}|\textit{\textbf{y}})p(\textit{\textbf{y}}). \end{aligned}$$
(2)

Obviously, the better the denoiser R is, the more accurately the pseudo joint distribution \(p_R(\textit{\textbf{x}},\textit{\textbf{y}})\) can approximate the true joint distribution \(p(\textit{\textbf{x}},\textit{\textbf{y}})\).

Noise Generation Perspective: In a real camera system, image noise is derived from multiple hardware-related random noises (e.g., shot noise, thermal noise), and is further affected by in-camera processing pipelines (e.g., demosaicing, compression). After introducing an additional latent variable \(\textit{\textbf{z}}\), representing the fundamental elements that drive the hardware-related random noises, the generation process from \(\textit{\textbf{x}}\) to \(\textit{\textbf{y}}\) can be depicted by the conditional distribution \(p(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}})\). The generator G in this task expresses an implicit distribution \(p_G(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}})\) to approximate the true distribution \(p(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}})\). The output of G can be seen as an example sampled from \(p_G(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}})\), i.e., \(G(\textit{\textbf{x}},\textit{\textbf{z}})\sim p_G(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}})\). Similar to Eq. (1), a pseudo noisy image pair \((\textit{\textbf{x}},\hat{\textit{\textbf{y}}})\) is easily obtained:

$$\begin{aligned} \textit{\textbf{z}} \sim p(\textit{\textbf{z}}),~ \textit{\textbf{x}} \sim p(\textit{\textbf{x}}), ~ \hat{\textit{\textbf{y}}} = G(\textit{\textbf{x}}, \textit{\textbf{z}}) \Longrightarrow (\textit{\textbf{x}}, \hat{\textit{\textbf{y}}}), \end{aligned}$$
(3)

where \(p(\textit{\textbf{z}})\) denotes the distribution of the latent variable \(\textit{\textbf{z}}\), which can be easily set as an isotropic Gaussian distribution \(\mathcal {N}(0, \textit{\textbf{I}})\).

Theoretically, we can marginalize the latent variable \(\textit{\textbf{z}}\) to obtain the following pseudo joint distribution \(p_G(\textit{\textbf{x}},\textit{\textbf{y}})\) as an approximation to \(p(\textit{\textbf{x}},\textit{\textbf{y}})\):

$$\begin{aligned} p_G(\textit{\textbf{x}},\textit{\textbf{y}})=\int _{\textit{\textbf{z}}} p_G(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}})p(\textit{\textbf{x}})p(\textit{\textbf{z}}) \mathrm {d}\textit{\textbf{z}} \approx \frac{1}{L}\sum _{i=1}^L p_G(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}}_i)p(\textit{\textbf{x}}), \end{aligned}$$
(4)

where \(\textit{\textbf{z}}_i \sim p(\textit{\textbf{z}})\). As suggested in [27], the number of samples L can be set to 1 as long as the minibatch size is large enough. Under this setting, the pseudo noisy image pair \((\textit{\textbf{x}},\hat{\textit{\textbf{y}}})\) obtained from the generation process in Eq. (3) can be roughly regarded as an example sampled from \(p_G(\textit{\textbf{x}},\textit{\textbf{y}})\).
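The two sampling procedures of Eqs. (1) and (3) can be sketched in a few lines. The snippet below is only an illustration on 1-D toy signals: `toy_denoiser` and `toy_generator` are hypothetical stand-ins for the networks R and G (a moving average and a simple signal-dependent noise model), chosen for readability rather than taken from our implementation.

```python
import random

def toy_denoiser(y):
    """Hypothetical stand-in for R: 3-tap moving average over a 1-D signal."""
    out = []
    for i in range(len(y)):
        window = y[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out

def toy_generator(x, z, scale=0.1):
    """Hypothetical stand-in for G: signal-dependent noise driven by latent z."""
    return [xi + scale * (max(xi, 0.0) ** 0.5) * zi for xi, zi in zip(x, z)]

def sample_pair_from_denoiser(y):
    # Eq. (1): y ~ p(y), x_hat = R(y)  =>  (x_hat, y)
    return toy_denoiser(y), y

def sample_pair_from_generator(x, rng):
    # Eq. (3): z ~ N(0, I), x ~ p(x), y_hat = G(x, z)  =>  (x, y_hat)
    z = [rng.gauss(0.0, 1.0) for _ in x]
    return x, toy_generator(x, z)

rng = random.Random(0)
x_real = [0.2, 0.4, 0.6, 0.8]      # a "clean image" drawn from the data set
y_real = [0.25, 0.35, 0.65, 0.75]  # its real noisy observation

x_hat, y = sample_pair_from_denoiser(y_real)        # pseudo pair from p_R(x, y)
x, y_hat = sample_pair_from_generator(x_real, rng)  # pseudo pair from p_G(x, y)
```

Each call yields one pair, which corresponds to the single-sample setting L = 1 of Eq. (4).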

3.2 Dual Adversarial Model

In the previous subsection, we have derived two pseudo joint distributions from the perspectives of noise removal and noise generation, i.e., \(p_R(\textit{\textbf{x}},\textit{\textbf{y}})\) and \(p_G(\textit{\textbf{x}},\textit{\textbf{y}})\). Now the problem becomes how to effectively train the denoiser R and the generator G so that they approximate the joint distribution \(p(\textit{\textbf{x}},\textit{\textbf{y}})\) well. Fortunately, the tractability of the sampling processes defined in Eqs. (1) and (3) makes such training possible in an adversarial manner as in GAN [16], which gradually pushes \(p_R(\textit{\textbf{x}},\textit{\textbf{y}})\) and \(p_G(\textit{\textbf{x}},\textit{\textbf{y}})\) toward the true distribution \(p(\textit{\textbf{x}},\textit{\textbf{y}})\). Specifically, we formulate this idea as the following dual adversarial problem inspired by Triple-GAN [13],

$$\begin{aligned} \min _{R,G}\max _{D} \mathcal {L}_{\text {gan}}(R,G,D) = E_{(\textit{\textbf{x}},\textit{\textbf{y}})}[D(\textit{\textbf{x}},\textit{\textbf{y}})]&- \alpha E_{(\hat{\textit{\textbf{x}}},\textit{\textbf{y}})}[D(\hat{\textit{\textbf{x}}},\textit{\textbf{y}})] \nonumber \\&- (1-\alpha ) E_{(\textit{\textbf{x}},\hat{\textit{\textbf{y}}})}[D(\textit{\textbf{x}},\hat{\textit{\textbf{y}}})], \end{aligned}$$
(5)

where \(\hat{\textit{\textbf{x}}}=R(\textit{\textbf{y}})\), \(\hat{\textit{\textbf{y}}}=G(\textit{\textbf{x}},\textit{\textbf{z}})\), and D denotes the discriminator, which tries to distinguish the real clean-noisy image pair \((\textit{\textbf{x}},\textit{\textbf{y}})\) from the fake ones \((\hat{\textit{\textbf{x}}},\textit{\textbf{y}})\) and \((\textit{\textbf{x}},\hat{\textit{\textbf{y}}})\). The hyper-parameter \(\alpha \) controls the relative importance between the denoiser R and generator G. As in [5], we use the Wasserstein-1 distance to measure the difference between two distributions in Eq. (5).
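To make the roles in Eq. (5) concrete, the following minimal sketch estimates the objective from mini-batch discriminator scores and derives the quantity each player minimizes. The function names and the toy scores are hypothetical, and the expectations are replaced by batch means; it is an illustration of the formula, not our training code.

```python
def mean(values):
    return sum(values) / len(values)

def gan_objective(d_real, d_fake_r, d_fake_g, alpha=0.5):
    """Mini-batch estimate of Eq. (5).

    d_real   : scores D(x, y) on real pairs
    d_fake_r : scores D(x_hat, y) on pairs produced by the denoiser R
    d_fake_g : scores D(x, y_hat) on pairs produced by the generator G
    """
    return mean(d_real) - alpha * mean(d_fake_r) - (1.0 - alpha) * mean(d_fake_g)

def discriminator_loss(d_real, d_fake_r, d_fake_g, alpha=0.5):
    # D maximizes Eq. (5), i.e. it minimizes the negated objective.
    return -gan_objective(d_real, d_fake_r, d_fake_g, alpha)

def denoiser_generator_loss(d_fake_r, d_fake_g, alpha=0.5):
    # R and G minimize Eq. (5); the real-pair term does not depend on them.
    return -alpha * mean(d_fake_r) - (1.0 - alpha) * mean(d_fake_g)

# Toy scores for a batch of two pairs of each kind.
d_real, d_fake_r, d_fake_g = [1.0, 3.0], [0.0, 1.0], [-1.0, 1.0]
objective = gan_objective(d_real, d_fake_r, d_fake_g, alpha=0.5)
```

With \(\alpha = 0.5\) the two kinds of fake pairs are weighted equally, so neither R nor G dominates the signal that D provides.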

The working mechanism of our dual adversarial network can be intuitively explained with Fig. 1. On one hand, the denoiser R, delivering the knowledge of \(p_R(\textit{\textbf{x}}|\textit{\textbf{y}})\), is expected to induce the joint distribution \(p_R(\textit{\textbf{x}},\textit{\textbf{y}})\) of Eq. (2), while the noise generator G, conveying the information of \(p_G(\textit{\textbf{y}}|\textit{\textbf{x}},\textit{\textbf{z}})\), is expected to induce the joint distribution \(p_G(\textit{\textbf{x}},\textit{\textbf{y}})\) of Eq. (4). Through the adversarial effect of discriminator D, the denoiser R and generator G are both gradually optimized so as to pull \(p_R(\textit{\textbf{x}},\textit{\textbf{y}})\) and \(p_G(\textit{\textbf{x}},\textit{\textbf{y}})\) toward the true joint distribution \(p(\textit{\textbf{x}}, \textit{\textbf{y}})\) during training. On the other hand, the capabilities of R and G are mutually enhanced by the dual regularization between them. Given any real image pair \((\textit{\textbf{x}}, \textit{\textbf{y}})\) and one pseudo image pair \((\textit{\textbf{x}},\hat{\textit{\textbf{y}}})\) from generator G or \((\hat{\textit{\textbf{x}}}, \textit{\textbf{y}})\) from denoiser R, the discriminator D is updated according to the adversarial loss. Then D is fixed as a criterion to update both R and G simultaneously as illustrated by the dotted lines in Fig. 1, which means that R and G keep interacting with and guiding each other in each iteration.

Previous studies [22, 60] have shown that it is beneficial to mix the adversarial objective with traditional losses, which speeds up and stabilizes the training of GAN. For the noise removal task, we adopt the \(L_1\) loss, i.e., \(||\hat{\textit{\textbf{x}}}-\textit{\textbf{x}}||_1\), which enforces the output of denoiser R to be close to the groundtruth. For the generator G, however, a direct \(L_1\) loss would not be beneficial because of the randomness of noise. Therefore, we propose to apply the \(L_1\) constraint on the statistical features of the noise distribution:

$$\begin{aligned} ||\mathcal {GF}(\hat{\textit{\textbf{y}}}-\textit{\textbf{x}}) - \mathcal {GF}(\textit{\textbf{y}}-\textit{\textbf{x}})||_1, \end{aligned}$$
(6)

where \(\mathcal {GF}(\cdot )\) represents the Gaussian filter used to extract the first-order statistical information of noise. Integrating these two regularizers into the adversarial loss of Eq. (5), we obtain the final objective:

$$\begin{aligned} \min _{R,G}\max _{D} \mathcal {L}_{\text {gan}}(R,G,D) + \tau _1 ||\hat{\textit{\textbf{x}}}-\textit{\textbf{x}}||_1 + \tau _2 ||\mathcal {GF}(\hat{\textit{\textbf{y}}}-\textit{\textbf{x}}) - \mathcal {GF}(\textit{\textbf{y}}-\textit{\textbf{x}})||_1, \end{aligned}$$
(7)

where \(\tau _1\) and \(\tau _2\) are hyper-parameters to balance different losses. A sensitivity analysis on them is provided in Sect. 5.2.
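The regularizer of Eq. (6) compares smoothed noise maps rather than raw noise, so two independent noise realizations with similar local statistics incur only a small penalty. A minimal pure-Python sketch follows; the kernel size (5), the standard deviation (1.0) and the edge-clamping boundary rule are assumptions made for illustration, since they are not fixed in this section, and the function names are hypothetical.

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel (size and sigma are assumed values)."""
    half = size // 2
    k = [[math.exp(-(i * i + j * j) / (2.0 * sigma * sigma))
          for j in range(-half, half + 1)] for i in range(-half, half + 1)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]

def gaussian_filter(img, size=5, sigma=1.0):
    """Filter a 2-D list with a Gaussian kernel, clamping indices at the border."""
    h, w = len(img), len(img[0])
    k = gaussian_kernel(size, sigma)
    half = size // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(size):
                for j in range(size):
                    rr = min(max(r + i - half, 0), h - 1)
                    cc = min(max(c + j - half, 0), w - 1)
                    acc += k[i][j] * img[rr][cc]
            out[r][c] = acc
    return out

def noise_stat_loss(y_hat, y, x, size=5, sigma=1.0):
    """Eq. (6): L1 distance between Gaussian-filtered fake and real noise maps."""
    h, w = len(x), len(x[0])
    n_fake = [[y_hat[r][c] - x[r][c] for c in range(w)] for r in range(h)]
    n_real = [[y[r][c] - x[r][c] for c in range(w)] for r in range(h)]
    f_fake = gaussian_filter(n_fake, size, sigma)
    f_real = gaussian_filter(n_real, size, sigma)
    return sum(abs(f_fake[r][c] - f_real[r][c]) for r in range(h) for c in range(w))

# Toy 4x4 example: a constant noise offset of 0.1 (real) versus 0.3 (fake).
x = [[0.0] * 4 for _ in range(4)]
y = [[0.1] * 4 for _ in range(4)]
y_hat = [[0.3] * 4 for _ in range(4)]
loss = noise_stat_loss(y_hat, y, x)
```

In this toy case the filtered maps differ by 0.2 at each of the 16 pixels, so the loss is about 3.2; it vanishes when the fake and real noise maps coincide.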

3.3 Training Strategy

In the dual adversarial model of Eq. (7), we have three objects to be optimized, i.e., the denoiser R, generator G and discriminator D. As in most GAN-related papers [5, 13, 16], we jointly train R, G and D but update them in an alternating manner as shown in Algorithm 1. In order to stabilize the training, we adopt the gradient penalty technique of WGAN-GP [18], enforcing the discriminator to satisfy the 1-Lipschitz constraint by an extra gradient penalty term.
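For reference, the gradient penalty of WGAN-GP [18] has the standard form below, written here for image pairs. The way the interpolated pair \((\bar{\textit{\textbf{x}}},\bar{\textit{\textbf{y}}})\) is formed from a real pair and a pseudo pair, as well as the symbols \(\mathcal {L}_{\text {gp}}\) and \(\lambda \), are assumptions made for illustration, since the exact construction is not detailed in this section:

$$\begin{aligned} \mathcal {L}_{\text {gp}} = \lambda \, E_{(\bar{\textit{\textbf{x}}},\bar{\textit{\textbf{y}}})}\left[ \left( \Vert \nabla _{(\bar{\textit{\textbf{x}}},\bar{\textit{\textbf{y}}})} D(\bar{\textit{\textbf{x}}},\bar{\textit{\textbf{y}}})\Vert _2 - 1\right) ^2\right] , \quad (\bar{\textit{\textbf{x}}},\bar{\textit{\textbf{y}}}) = \epsilon (\textit{\textbf{x}},\textit{\textbf{y}}) + (1-\epsilon )(\textit{\textbf{x}}',\textit{\textbf{y}}'), ~ \epsilon \sim U[0,1], \end{aligned}$$

where \((\textit{\textbf{x}}',\textit{\textbf{y}}')\) stands for a pseudo pair, i.e., either \((\hat{\textit{\textbf{x}}},\textit{\textbf{y}})\) or \((\textit{\textbf{x}},\hat{\textit{\textbf{y}}})\), and \(\lambda \) is the penalty coefficient. The term is added to the loss minimized by D, so that D is penalized whenever the norm of its gradient at the interpolated pairs departs from 1.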

After training, the generator G is able to simulate more noisy images given any clean images, which are easily obtained from the original training data set or downloaded from the internet. Then we can retrain the denoiser R by adding more synthetic clean-noisy image pairs generated by G to the training data set. As shown in Sect. 5, this strategy can further improve the denoising performance.

3.4 Network Architecture

The denoiser R, generator G and discriminator D in our framework are all parameterized as deep neural networks due to their powerful fitting capability. As shown in Fig. 1, the denoiser R takes the noisy image \(\textit{\textbf{y}}\) as input and outputs the denoised image \(\hat{\textit{\textbf{x}}}\), while the generator G takes the concatenated clean image \(\textit{\textbf{x}}\) and latent variable \(\textit{\textbf{z}}\) as input and outputs the simulated noisy image \(\hat{\textit{\textbf{y}}}\). For both R and G, we use the UNet [39] architecture as the backbone. Besides, the residual learning strategy [57] is adopted in both of them. The discriminator D contains five strided convolutional layers to reduce the image size and one fully connected layer to fuse all the information. More details about the network architectures are provided in the supplementary material due to the page limit. It should be noted that our proposed method is a general framework that does not depend on a specific architecture; therefore, most of the network architectures commonly used in low-level vision tasks [4, 32, 57] can be substituted.

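Algorithm 1. Alternating training of the denoiser R, generator G and discriminator D (listing not reproduced).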

4 Evaluation Metrics

For the noise removal task, PSNR and SSIM [48] can be readily adopted to compare the denoising performance of different methods. However, to the best of our knowledge, no quantitative metric has yet been designed for the noise generation task. To address this issue, we propose two metrics to compare the similarity between the generated and the real noisy images as follows:

  • PGap (PSNR Gap): The main idea of PGap is to compare the synthetic and real noisy images indirectly by the performance of the denoisers trained on them. Let \(\mathcal {D}=\{(\textit{\textbf{x}}^i, \textit{\textbf{y}}^i)\}_{i=1}^N\), \(\mathcal {T}=\{(\tilde{\textit{\textbf{x}}}^j, \tilde{\textit{\textbf{y}}}^j)\}_{j=1}^S\) denote the available training and testing sets, whose noise distributions are the same or similar. Given any noisy image generator G, we can synthesize another training set:

    $$\begin{aligned} \mathcal {D}_G=\{(\textit{\textbf{x}}^i, \tilde{\textit{\textbf{y}}}^i)|\tilde{\textit{\textbf{y}}}^i=G(\textit{\textbf{x}}^i,\textit{\textbf{z}}^i), \textit{\textbf{z}}^i\sim p(\textit{\textbf{z}})\}_{i=1}^N. \end{aligned}$$
    (8)

    After training two denoisers \(R_1\) on the original data set \(\mathcal {D}\) and \(R_2\) on the generated data set \(\mathcal {D}_{G}\) under the same conditions, we can define PGap as

    $$\begin{aligned} \text {PGap} = \text {PSNR}(R_1(\mathcal {T})) - \text {PSNR}(R_2(\mathcal {T})), \end{aligned}$$
    (9)

    where \(\text {PSNR}(R_i(\mathcal {T})) (i=1, 2)\) represents the PSNR result of denoiser \(R_i\) on testing data set \(\mathcal {T}\). It is obvious that, if the generated noisy images in \(\mathcal {D}_G\) are close to the real noisy ones in \(\mathcal {D}\), the performance of \(R_2\) would be close to \(R_1\), and thus the PGap would be small.

  • AKLD (Average KL Divergence): The noise generation task aims at synthesizing a fake noisy image \(\textit{\textbf{y}}^f\) from the real clean image \(\textit{\textbf{x}}^r\) to match the real noisy image \(\textit{\textbf{y}}^r\) in distribution. Therefore, the KL divergence between the conditional distributions \(p_{\textit{\textbf{y}}^f}(\textit{\textbf{y}}|\textit{\textbf{x}})\) on the fake image pair \((\textit{\textbf{x}}^r,\textit{\textbf{y}}^f)\) and \(p_{\textit{\textbf{y}}^r}(\textit{\textbf{y}}|\textit{\textbf{x}})\) on the real image pair \((\textit{\textbf{x}}^r, \textit{\textbf{y}}^r)\) can serve as a metric. To make this conditional distribution tractable, we utilize the pixel-wise Gaussian assumption for real noise from the recent work VDN [56], i.e.,

    $$\begin{aligned} p_{\textit{\textbf{y}}^c}(\textit{\textbf{y}}|\textit{\textbf{x}}) = \mathcal {N}(\textit{\textbf{y}}|[\textit{\textbf{x}}^r], \text {diag}([\textit{\textbf{V}}^c])), ~ c \in \{f,r\}, \end{aligned}$$
    (10)

    where

    $$\begin{aligned} \textit{\textbf{V}}^c = \mathcal {GF}((\textit{\textbf{y}}^c - \textit{\textbf{x}}^r)^2), ~ c \in \{f, r\}, \end{aligned}$$
    (11)

    \([\cdot ]\) denotes the reshape operation from matrix to vector, \(\mathcal {GF}(\cdot )\) denotes the Gaussian filter, and the square in \((\textit{\textbf{y}}^c-\textit{\textbf{x}}^r)^2\) is a pixel-wise operation. Based on this explicit distribution assumption, the KL divergence between \(p_{\textit{\textbf{y}}^f}(\textit{\textbf{y}}|\textit{\textbf{x}})\) and \(p_{\textit{\textbf{y}}^r}(\textit{\textbf{y}}|\textit{\textbf{x}})\) can be regarded as an intuitive metric. To reduce the influence of randomness, we randomly generate L synthetic fake noisy images:

    $$\begin{aligned} \textit{\textbf{y}}^{f_j} = G(\textit{\textbf{x}}^r, \textit{\textbf{z}}^j), ~ \textit{\textbf{z}}^j \sim p(\textit{\textbf{z}}), ~j=1,2,\cdots ,L, \end{aligned}$$
    (12)

    for any real clean image \(\textit{\textbf{x}}^r\), and define the following average KL divergence as our metric, i.e.,

    $$\begin{aligned} \text {AKLD} = \frac{1}{L}\sum _{j=1}^L KL[p_{\textit{\textbf{y}}^{f_j}}(\textit{\textbf{y}}|\textit{\textbf{x}})||p_{\textit{\textbf{y}}^r}(\textit{\textbf{y}}|\textit{\textbf{x}})]. \end{aligned}$$
    (13)

    Evidently, the smaller AKLD is, the better the generator G is. In the following experiments, we set \(L=50\).
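Both metrics reduce to short computations once the trained denoisers and the variance maps are available. The sketch below is illustrative only: the function names are hypothetical, images are flattened lists with values in [0, 1], the toy "denoisers" are stand-ins, and the small constant `eps` is an added numerical safeguard. For AKLD it uses the closed form of the KL divergence between two diagonal Gaussians that share the mean \([\textit{\textbf{x}}^r]\) as in Eq. (10), namely \(\frac{1}{2}\sum _k (V^f_k/V^r_k - 1 + \ln (V^r_k/V^f_k))\), summed over pixels.

```python
import math

def psnr(clean, restored, peak=1.0):
    """PSNR in dB between two flattened images with values in [0, peak]."""
    mse = sum((a - b) ** 2 for a, b in zip(clean, restored)) / len(clean)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def mean_psnr(denoiser, test_set):
    """Average PSNR of a denoiser over (clean, noisy) test pairs."""
    return sum(psnr(x, denoiser(y)) for x, y in test_set) / len(test_set)

def pgap(r1, r2, test_set):
    """Eq. (9): PSNR of R1 (trained on real data) minus PSNR of R2 (generated data)."""
    return mean_psnr(r1, test_set) - mean_psnr(r2, test_set)

def kl_same_mean(var_f, var_r, eps=1e-8):
    """KL[N(m, diag(var_f)) || N(m, diag(var_r))] for a shared mean m."""
    total = 0.0
    for vf, vr in zip(var_f, var_r):
        vf, vr = vf + eps, vr + eps  # eps is an assumed safeguard against zeros
        total += 0.5 * (vf / vr - 1.0 + math.log(vr / vf))
    return total

def akld(fake_var_maps, real_var_map):
    """Eq. (13): average KL over the L variance maps of Eq. (11)."""
    return sum(kl_same_mean(v, real_var_map) for v in fake_var_maps) / len(fake_var_maps)

# Toy check with hypothetical stand-in denoisers on a single test pair.
clean = [0.5, 0.5, 0.5, 0.5]
noisy = [0.6, 0.6, 0.6, 0.6]
r1 = lambda y: [v - 0.09 for v in y]  # residual error 0.01, about 40 dB
r2 = lambda y: list(y)                # residual error 0.10, about 20 dB
gap = pgap(r1, r2, [(clean, noisy)])

# Two generated variance maps (L = 2) against one real variance map.
score = akld([[2.0, 1.0], [1.0, 1.0]], [1.0, 1.0])
```

Here the variance maps are passed in directly; in practice they would come from Eq. (11). Whether the KL divergence is summed or averaged over pixels only rescales the metric, and the sum is used here to match the closed form above.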

5 Experimental Results

In this section, we conduct a series of experiments on several real-world denoising benchmarks. Specifically, we consider two groups of experiments: the first group (Sect. 5.2) is designed to evaluate the effectiveness of our method on both the noise removal and noise generation tasks, and is carried out on one specific real benchmark containing training, validation and testing sets; the second group (Sect. 5.3) is conducted on two real benchmarks that only consist of some noisy images as the testing set, aiming at evaluating its performance on general real-world denoising tasks.

For brevity, we denote the jointly trained Dual Adversarial Network following Algorithm 1 as DANet. As discussed in Sect. 3.3, the learned generator G in DANet is able to augment the original training set by generating more synthetic clean-noisy image pairs, and the denoiser R retrained on this augmented training data set under the \(L_1\) loss is denoted as \(\text {DANet}_{+}\).

5.1 Experimental Settings

Parameter Settings and Network Training: In the training stage of DANet, the weights of R and G were both initialized according to [20], and the weights of D were initialized from a zero-centered Normal distribution with standard deviation 0.02 as in [38]. All three networks were trained with the Adam optimizer [26] with momentum terms (0.9, 0.999) for R and (0.5, 0.9) for both G and D. The learning rates were set as \(1e\text {-}4\), \(1e\text {-}4\) and \(2e\text {-}4\) for R, G and D, respectively, and linearly decayed by half every 10 epochs.
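One plausible reading of this learning-rate schedule is a step decay that halves the rate every 10 epochs. The helper below is a minimal sketch under that assumption; the function name `learning_rate` and its arguments are hypothetical and are not taken from our released code.

```python
def learning_rate(base_lr, epoch, step=10, factor=0.5):
    """Step decay: multiply base_lr by `factor` once every `step` epochs.

    Assumes epochs are counted from 0 and the decay is applied step-wise.
    """
    return base_lr * factor ** (epoch // step)

# Initial rates stated in the text for R, G and D.
base_rates = {"R": 1e-4, "G": 1e-4, "D": 2e-4}
rates_at_epoch_25 = {name: learning_rate(lr, 25) for name, lr in base_rates.items()}
```

Under this reading, the rate of D at epoch 25 has been halved twice, from \(2e\text {-}4\) to \(5e\text {-}5\).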

In each epoch, we randomly cropped \(16\times 5000\) patches with size \(128\times 128\) from the images for training. During training, we updated D three times for each update of R and G. We set \(\tau _1=1000\), \(\tau _2=10\) throughout the experiments, and the sensitivity analysis on them can be found in Sect. 5.2. As for \(\alpha \), we set it to 0.5, meaning that the denoiser R and generator G contribute equally in our model. The penalty coefficient in WGAN-GP [18] was set to 10 following its default settings. As for \(\text {DANet}_{+}\), the denoiser R was retrained with the same settings as in DANet. All the models were trained using PyTorch [35].

Table 1. The PGap and AKLD performance of the compared methods on the SIDD validation data set. The best results are highlighted in bold.
Fig. 2. PSNR results of different methods during training.

5.2 Results on SIDD Benchmark

In this part, the SIDD [1] benchmark is employed to evaluate the denoising performance and generation quality of our proposed method. The full SIDD data set contains about 24000 clean-noisy image pairs as training data, and the remaining 6000 image pairs are held out as the benchmark for testing. For fast training and evaluation, one medium training set (320 image pairs) and validation set (40 image pairs) are also provided, but the testing results can only be obtained by submission. We trained DANet and \(\text {DANet}_{+}\) on the medium version training set, and evaluated them on the validation and testing sets.

Fig. 3. Illustration of one typical example of the noisy images (1st row) generated by different methods, with the corresponding noise (2nd row) and variance map (3rd row) estimated by Eq. (11). The first column shows the real ones from the SIDD validation set.

Noise Generation: The generator G in DANet is mainly used to synthesize the corresponding noisy image given any clean one. As introduced in Sect. 4, two metrics PGap and AKLD are designed to assess the generated noisy image. Based on these two metrics, we compared DANet with three recent methods, including CBDNet  [19], ULRD  [7] and GRDN  [25]. CBDNet and ULRD both attempted to generate noisy images by simulating the in-camera processing pipelines, while GRDN directly learned the noise distribution using GAN  [16].

Table 1 lists the PGap values of different methods on SIDD validation set. For the calculation of PGap, SIDD validation set is regarded as the testing set \(\mathcal {T}\) in Eq. (9). Obviously, DANet achieves the best performance. Figure 2 displays the PSNR curves of different denoisers trained on the real training set or only the synthetic training sets generated by different methods, which gives an intuitive illustration for PGap. It can be seen that all the methods tend to gradually overfit to their own synthetic training set, especially for CBDNet. However, DANet performs not only more stably but also better than other methods.

Table 2. The PSNR and SSIM results of different methods on SIDD validation and testing sets. The best results are highlighted in bold.

The average AKLD results calculated on all the images of the SIDD validation set are also listed in Table 1. The smallest AKLD of DANet indicates that it learns a better implicit distribution to approximate the true distribution \(p(\textit{\textbf{y}}|\textit{\textbf{x}})\). Figure 3 illustrates one typical example of the real and synthetic noisy images generated by different methods, which provides an intuitive visualization for the AKLD metric. In summary, DANet outperforms other methods both quantitatively and visually, even though some of them make use of additional metadata.

Fig. 4. One typical denoising example in the SIDD validation dataset.

Table 3. The PSNR and SSIM results of DANet under different \(\tau _1\) values on SIDD validation data set.
Table 4. The PGap and AKLD results of DANet under different \(\tau _2\) values on SIDD validation data set.

Noise Removal: To verify the effectiveness of our proposed method on the real-world denoising task, we compared it with several state-of-the-art methods, including CBM3D [14], WNNM [17], DnCNN [57], CBDNet [19], RIDNet [4] and VDN [56]. Table 2 lists the PSNR and SSIM results of different methods on the SIDD validation and testing sets. It should be noted that the results on the testing set are cited from the official website, but the results on the validation set are calculated by ourselves. For fair comparison, we retrained DnCNN and CBDNet on the SIDD training set. From Table 2, it is easily observed that: 1) deep learning methods perform clearly better than the traditional methods CBM3D and WNNM due to the powerful fitting capability of DNNs; 2) DANet and \(\text {DANet}_{+}\) both outperform the state-of-the-art real-world denoising methods, substantiating their effectiveness; 3) \(\text {DANet}_{+}\) surpasses DANet by about 0.18 dB in PSNR, which indicates that the synthetic data generated by G facilitates the training of the denoiser R.

Figure 4 illustrates the visual denoising results of different methods. It can be seen that CBM3D and WNNM both fail to remove the real-world noise. DnCNN tends to produce over-smooth edges and textures due to the \(L_2\) loss. CBDNet, RIDNet and VDN alleviate this phenomenon to some extent since they adopt more robust loss functions. DANet recovers sharper edges and more details owing to the adversarial loss. After retraining with more generated image pairs, \(\text {DANet}_{+}\) obtains denoising results that are closer to the groundtruth.

Fig. 5. The real or generated noisy images (1st row) by DANet under different \(\tau _2\) values and the corresponding noise (2nd row). From left to right: (a) real case, (b) \(\tau _2=0\), (c) \(\tau _2=5\), (d) \(\tau _2=10\), (e) \(\tau _2=50\), (f) \(\tau _2=+\infty \).

Hyper-parameter Analysis: Our proposed DANet involves two hyper-parameters \(\tau _1\) and \(\tau _2\) in Eq. (7). The parameter \(\tau _1\) mainly influences the performance of denoiser R, while \(\tau _2\) directly affects the generator G.

Table 5. The comparison results of BaseD and DANet on SIDD validation set.
Table 6. The comparison results of BaseG and DANet on SIDD validation set.
Table 7. The PSNR and SSIM results of different methods on DND benchmark. The best results are highlighted as bold.

Table 3 lists the PSNR/SSIM results of DANet under different \(\tau _1\) settings, where \(\tau _1=+\infty \) represents the results of the denoiser R trained only with the \(L_1\) loss. As expected, a small \(\tau _1\) value, which gives the adversarial loss a more important role, decreases the PSNR and SSIM performance to some extent. When \(\tau _1\) is too large, however, the \(L_1\) regularizer dominates the training of the denoiser R. We therefore set \(\tau _1\) to a moderate value of \(1e\text {+}3\) throughout all the experiments, which makes the denoising results more realistic, as shown in Fig. 4, at the cost of a small loss in PSNR.
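The trade-off controlled by \(\tau _1\) can be illustrated with a small sketch. It assumes that the denoiser objective has the form "adversarial term plus \(\tau _1\) times the \(L_1\) term", which is consistent with the description above but is not a reproduction of Eq. (7). The function names and the loss magnitudes (1.0 and 0.01) are hypothetical and chosen only for illustration.

```python
import math


def denoiser_objective(adv_loss, l1_loss, tau1):
    """Assumed objective: adversarial term plus tau1 times the L1 term."""
    if math.isinf(tau1):
        # Limiting case described in the text: R is trained with the L1 loss only.
        return l1_loss
    return adv_loss + tau1 * l1_loss


def l1_share(adv_loss, l1_loss, tau1):
    """Fraction of the objective's magnitude contributed by the weighted L1 term."""
    if math.isinf(tau1):
        return 1.0
    weighted = tau1 * abs(l1_loss)
    total = abs(adv_loss) + weighted
    return weighted / total if total > 0 else 0.0


if __name__ == "__main__":
    adv, l1 = 1.0, 0.01  # hypothetical loss magnitudes, not measured values
    for tau1 in (1e1, 1e3, 1e5, math.inf):
        print(f"tau1={tau1:g}: L1 share = {l1_share(adv, l1, tau1):.3f}")
```

With these illustrative magnitudes, the \(L_1\) share grows monotonically with \(\tau _1\): the adversarial term dominates for small values, while very large values approach the pure \(L_1\) setting.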

The PGap and average AKLD results of DANet under different \(\tau _2\) values are shown in Table 4. Note that \(\tau _2=+\infty \) represents the results of the generator G trained only with the regularizer of Eq. (6). Figure 5 shows the corresponding visual results for one typical example. As can be seen, G fails to simulate the real noise with \(\tau _2=0\), while it is also difficult to train G with the regularizer of Eq. (6) alone. Taking both the quantitative and visual results into consideration, \(\tau _2\) is fixed to 10 in all our experiments.

Ablation Studies: To verify the marginal benefits brought by our dual adversarial loss, two groups of ablation experiments are designed in this part. In the first group, we train DANet without the generator and denote the trained model as BaseD. In the second group, we train DANet without the denoiser and denote the trained model as BaseG. The comparison results of these two baselines with DANet on the noise removal and noise generation tasks are listed in Table 5 and Table 6, respectively. DANet achieves better performance than both baselines on the two tasks, especially on noise generation, which illustrates the mutual guidance and amelioration between the denoiser and the generator.

5.3 Results on DND and Nam Benchmarks

In this section, we evaluate the performance of our method on two real-world benchmarks, i.e., DND  [36] and Nam  [34]. Following the experimental setting of RIDNet  [4], we trained another model using images from SIDD  [1], Poly  [51] and RENOIR  [3] for a fair comparison. To distinguish them from the models of Sect. 5.2, the models trained under this setting are denoted as GDANet and \(\text {GDANet}_{+}\), and they are intended for the general denoising task in real applications. For the training of \(\text {GDANet}_{+}\), we used the images of MIR Flickr  [21] as clean images to synthesize more training pairs using G. Note that the experimental results on the Nam benchmark are given in the supplementary material due to the page limit.
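The synthesis of additional training pairs from clean images can be sketched as follows. This is only an outline under stated assumptions: the generator is treated as a callable that takes a clean image and a random latent array of the same shape, and `toy_generator` is a hypothetical stand-in that adds signal-dependent Gaussian noise. It is not the learned G, and its noise parameters are arbitrary.

```python
import numpy as np


def toy_generator(clean, latent):
    """Hypothetical stand-in for a trained generator (NOT the learned G)."""
    sigma = 0.02 + 0.05 * clean  # arbitrary signal-dependent noise level
    return clean + sigma * latent


def synthesize_pairs(clean_images, generator, rng):
    """Return a list of (noisy, clean) pairs, one per clean image."""
    pairs = []
    for clean in clean_images:
        # Assumption: the latent variable has the same shape as the image.
        latent = rng.standard_normal(clean.shape)
        noisy = np.clip(generator(clean, latent), 0.0, 1.0)
        pairs.append((noisy, clean))
    return pairs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean_images = [rng.random((16, 16, 3)) for _ in range(4)]
    pairs = synthesize_pairs(clean_images, toy_generator, rng)
    print(len(pairs), pairs[0][0].shape)
```

In such a setup, the synthesized pairs would simply be mixed with the real clean-noisy pairs when retraining the denoiser; swapping in a trained generator only requires a callable with the same two arguments.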

Fig. 6.

Denoising results of different methods on the DND benchmark.

DND Benchmark: This benchmark contains 50 pairs of real noisy and almost noise-free images. However, the almost noise-free images are not publicly released, so the PSNR/SSIM results can only be obtained through the online submission system. Table 7 lists the PSNR/SSIM results released on the official DND benchmark website (Footnote 4). From Table 7, we make the following observations: 1) \(\text {GDANet}_{+}\) outperforms the state-of-the-art VDN by about 0.2 dB PSNR, which is a large improvement in the field of real-world denoising; 2) GDANet obtains the highest SSIM value, which means that it preserves more structural information than the other methods, as can be visually observed in Fig. 6; 3) DnCNN cannot remove most of the real noise because it overfits to the Gaussian noise case; 4) the classical CBM3D and WNNM methods cannot handle the complex real noise.

6 Conclusion

We have proposed a new Bayesian framework for real-world image denoising. Different from the traditional MAP framework, which relies on subjective assumptions about the noise and image priors, our proposed method focuses on learning the joint distribution directly from data. To estimate the joint distribution, we approximate it by its two different factorized forms in a dual adversarial manner; the two forms correspond to two tasks, i.e., noise removal and noise generation. To assess the quality of the synthetic noisy images, we have designed two applicable metrics, which are, to the best of our knowledge, the first of their kind. The proposed DANet intrinsically provides a general methodology to facilitate the study of other low-level vision tasks, such as super-resolution and deblurring.