1 Introduction

Image denoising not only enhances visual perception but also ensures the integrity and authenticity of the image, which benefits subsequent tasks such as target detection [1], image classification [2] and medical image processing [3]. With the rapid development of computer vision technology, image denoising has become increasingly important.

Image noise can be broadly classified into two types: synthetic noise and real-world noise (or real noise). Synthetic noise refers to noise that follows a known probability distribution and whose noise level can be set independently, such as Gaussian noise, salt-and-pepper noise, and gamma noise. In contrast, the type and distribution of real-world noise are often uncertain. It may originate from internal components of the system or device as internal noise, or it may be external noise caused by environmental factors. This type of noise exhibits a diverse structure that is difficult to describe with simple parameters. To solve the problem of image denoising, researchers have proposed many methods, which are generally divided into traditional methods and deep learning methods. Traditional image denoising can be roughly categorized into two types: filter-based methods and model-based methods. Filter-based methods mainly perform noise suppression in the spatial and transform domains, such as non-local means (NLM) [4] and block matching and 3D filtering (BM3D) [5]. Model-based methods mainly design regularization with prior information for image denoising, including k-means singular-value decomposition (KSVD) [6] and weighted nuclear norm minimization (WNNM) [7]. However, traditional methods still have drawbacks: they either involve cumbersome feature extraction processes, have high computational requirements, or face difficulties in directly handling complex real noise. Therefore, these methods struggle to meet current practical requirements.

In recent years, due to the success of deep learning, image denoising methods based on deep learning have achieved superior performance compared to traditional methods and have become the mainstream methods [8]. Among them, the convolutional neural network (CNN) model [9] has been widely used in image denoising. For example, DnCNN [10] was the first method to use a CNN for blind denoising. It employed a residual learning strategy, allowing the network to directly learn the difference between the noisy input image and the clean image. This enabled the network to focus more on learning the characteristics of noise, rather than the entire image content. Additionally, it applied batch normalization after each convolutional layer to accelerate the training process and improve model performance. However, DnCNN may generalize poorly to certain specific texture or detail patterns, which can result in these textures or details being smoothed away while removing noise. FFDNet [11] used adjustable noise level maps as inputs to the model to achieve non-blind denoising. Nevertheless, its noise level maps needed to be manually set based on empirical knowledge. CBDNet [12] extended FFDNet by adding a noise estimation sub-network, thereby becoming a blind denoising model. RIDNet [13] introduced feature attention modules to enhance the information interaction of the network on channel features, achieving strong denoising performance. To enhance the denoising performance of the model, MIRNet [14] designed multiple novel modules with attention mechanisms to extract multi-scale feature information. MPRNet [15] built a multi-stage architecture to exchange information across different stages, reducing the loss of detailed information and thus better balancing the competing goals of spatial details and high-level contextual information during the image restoration phase. Recently, some new denoising methods have been proposed. For example, APD-Nets [16] first attempted to introduce adaptive regularization and complementary prior information into denoising networks, thereby improving both the generalization ability and the restored image quality of denoising networks. MSIDNet [17] employed a fusion mechanism to fully exploit multi-scale features, enhancing the network's information perception and thus improving visual quality. MDRN [18] introduced a multi-scale feature extraction module alongside a dilated residual module, which are designed to extract multi-scale features, thereby enhancing the performance of image restoration. TSIDNet [19] employed a data sub-network for denoising and a feature extraction network for global feature extraction, then aggregated the information from both networks to enhance the model's robustness and denoising performance. To address the non-blind denoising and noise estimation issues of many CNN methods for real image denoising, CFNet [20] designed a new conditional filter to adaptively adjust the denoising manner and an affine transformation block for noise prediction. Although CNN-based methods can achieve noise removal, they are purely end-to-end and do not use a neural network to provide a more detailed evaluation of the denoised images during training.
As a result, these methods not only underutilize the learning capacity of CNNs and the parallel computing capability of GPUs, but may also produce denoised images of inconsistent quality.

The generative adversarial network (GAN) model [21] based on deep learning was proposed by Goodfellow. It uses neural networks to offer detailed assessments of the processed data during training, leading to superior performance compared to CNNs in specific image processing tasks, including image translation [22] and image inpainting [23,24,25]. A GAN consists of two parts: a generator and a discriminator. The role of the generator is to generate data whose distribution is as consistent as possible with that of the original data. The discriminator acts as an auxiliary to the generator; its main task is to judge whether given data is a real sample or a fake sample forged by the generator. In the training phase, the two models compete with each other through a min-max game. The objective function of the GAN model [21] is as follows:

$$\begin{aligned} \begin{aligned} \min _G\max _DV(D,G) =&E_{x\sim p_{\textrm{data}}(x)}[\log D(x)]\\&+E_{z\sim p_{z}(z)}[\log (1-D(G(z)))] \end{aligned} \end{aligned}$$
(1)

where G is the generator and D is the discriminator. \(p_{\textrm{data}}(x)\) represents the distribution of real data, and \(p_{z}(z)\) represents the distribution of random noise. \(E_{x\sim p_{\textrm{data}}(x)}[\log D(x)]\) stands for the expected loss of the discriminator on real data, and \(E_{z\sim p_{z}(z)}[\log (1-D(G(z)))]\) represents the expected loss of the discriminator on generated data. From Eq. 1, we can see that the objective function ensures that the discriminator performs well on real data (minimizing misclassification of real data), while the generator aims to produce fake data that can “fool” the discriminator (maximizing the discriminator’s misclassification of fake data). This constitutes a min-max problem, as the discriminator attempts to maximize its ability to distinguish between real and fake data, while the generator seeks to minimize the discriminator’s ability to distinguish its generated data. WGAN [26] improved the stability of the GAN model by replacing the JS or KL divergence with the Wasserstein distance to measure the difference between distributions. WGAN-GP [27] added an additional gradient penalty term to keep the gradient more stable during training, thus alleviating the mode collapse that both GAN and WGAN are prone to. At the same time, the gradient penalty introduced by WGAN-GP keeps the gradient smooth, which gives the discriminator the ability to accurately distinguish real from fake samples and makes the generated samples more realistic.

Based on the above theoretical research on GANs, many scholars have successfully applied GANs to image denoising. For example, GCBD [28] was one of the earliest methods to use a GAN to model real-world noise for constructing a noise dataset. Its generator randomly generated noise patches which, when combined with original images, produced corresponding noisy images. This augmentation of the training dataset effectively addressed the challenge of finding paired datasets in practical applications. Similarly, ADGAN [29] used a GAN to generate noisy samples for dataset augmentation and introduced a feature loss function to extract image features, improving the restoration of image details. To enhance blind denoising performance, BDGAN [30] modified the generator architecture using techniques that improve training stability and image quality, and designed multiple discriminators with different receptive fields to evaluate images at multiple scales, yielding high-quality results. GRDN [31] proposed a new GAN-based method for modeling real noise, addressing the challenge of obtaining paired datasets, and further enhanced denoising performance through extensive, hierarchical use of residual connections. Subsequently, DANet [32] introduced an innovative Bayesian framework that simultaneously completes the tasks of noise removal and noise generation using dual adversarial learning. DeGAN [33] harnessed the mutual game between the generator network and a feature extractor network, along with additional training of the feature extractor, to enable the generator to learn a direct mapping from the noisy image domain to the noise-free image domain. Although this method effectively removed mixed noise and restored damaged images, it was not effective for more complex, realistic noisy images. HI-GAN [34] built a deeper network structure through dense residuals to improve the denoising effect. Nevertheless, this approach also caused accumulation and repetition of information, reducing feature propagation efficiency and making the model harder to train. The recent DGCL [35] used two independent GANs to learn from denoised images and image datasets separately, partially addressing the complex network structures and training difficulties of single-GAN methods. However, DGCL did not perform adversarial training on the final fused results, leading to suboptimal image quality. While these GAN-based methods have opened more possibilities for image denoising, the practicability of many of these models still needs to be improved, especially on real noisy images.

To solve the above problems, in this paper we redesign the generator, discriminator and loss function of the GAN model. First, the design of the generator is mainly inspired by references [36,37,38,39,40,41]. Specifically, the literature [36, 37] proposed multi-scale network models based on convolutional neural networks, verifying that more information can be learned by using multi-scale feature extraction. MSGAN [38] designed a new multi-scale module and added it to the skip connections to improve model performance. The literature [39] extracted more high-level information by adding convolutional layers to the network model and used the residual bottleneck proposed by He et al. [40] in the denoising network to solve the gradient problems caused by overly deep networks. DCANet [41] built a dual CNN with two different branches that learn complementary features to obtain noise-estimated images. Following these studies, our generator not only extracts multi-scale features but also fuses the extracted features before passing them through the skip connections. Our network also uses two branches: one employs our improved residual module, and both branches deal with the noise directly without estimating it. Then, the design of the discriminator is primarily inspired by PatchGAN [42]. One of the key contributions of PatchGAN was designing a discriminator that evaluates images by focusing on multiple regions. Nevertheless, the original PatchGAN discriminator relied on downsampling for multi-scale feature extraction, which loses some information and thereby limits the discriminator's ability to capture and differentiate subtle details in the images. We therefore improve the discriminator to address this issue. Finally, several other works informed the design of our loss function. For example, references [29, 30] introduced the perceptual loss proposed for image super-resolution [43] to enhance the detailed information of denoised images and improve the visual effect. The total variation algorithm [44] and the literature [45] directly provided regularization loss functions with denoising capability. We adopt these loss functions to constrain the training of our network and, inspired by them, propose a new loss function to further improve the performance of the model.

Fig. 1
figure 1

The architecture of the generator in MIFGAN, where k, n and s represent the kernel size, channel number and stride in the convolutional layer respectively

In summary, to address the limited performance of most image denoising algorithms on real noise, this paper proposes a multi-scale information fusion generative adversarial network (MIFGAN). The algorithm can be integrated into machine vision software in the future to effectively eliminate real noise and thereby enhance image quality. This enhancement helps machine vision systems process and analyze image data more efficiently, ultimately improving the performance and accuracy of the overall system. The main contributions of this paper are as follows:

(1) Utilizing the generative adversarial network model and the concept of multi-scale information fusion, a novel denoising algorithm is proposed that can significantly enhance the quality of real image denoising.

(2) A novel encoder–decoder network branch is designed, which extracts and fuses multi-scale features of images with different resolutions in the encoder stage, enabling noise reduction while preserving crucial image details. Moreover, to address feature compression and gradient calculation issues, an improved residual network branch is introduced.

(3) To further improve the model's denoising performance, we design a discriminator with a richer receptive field that effectively captures the global features and context information of the image, so that the denoised image can be evaluated in greater detail.

(4) The dual denoising loss function is presented. It can be combined with other loss functions in the training phase to further optimize the performance of the model.

2 Proposed method

2.1 Generator network architecture

The generator of MIFGAN is the core of the whole framework and is an end-to-end denoising network: its input is a noisy image and its output is a clean image. The specific network architecture is shown in Fig. 1. By integrating multi-scale contextual information, more feature information can be supplemented, thereby enhancing the denoising performance of the model. Multi-scale features here mean that feature maps at different resolutions carry different information: higher-resolution images provide richer details and capture edge features more precisely, while lower-resolution images provide overall structure and global features. Therefore, we process the noisy image at three different scales: the original resolution, four-times downsampling, and eight-times downsampling. We extract features from the images at the respective scales and then combine the extracted information at appropriate positions. The specific realization process is as follows:

Owing to its excellent image reconstruction capability, scalability, and adaptability, the U-net [46] has performed well in many image processing tasks such as image super-resolution [47] and image inpainting [48]. We therefore modify U-net to serve as the backbone of the encoder–decoder branch of the generator, as sketched below. The encoder performs four downsampling operations for feature extraction, reducing the size of the feature maps by factors of two, four, eight, and sixteen. Correspondingly, the decoder performs four upsampling operations for information recovery, successively enlarging the reduced feature maps by factors of two, four, eight, and sixteen. Additionally, skip connections transfer information extracted during downsampling to the upsampling stage, enabling feature reuse. To obtain richer information such as image structure and content during downsampling, we use a \(4\times 4\) convolution with a stride of 2 in place of the pooling operation. Because deconvolution causes checkerboard artifacts during upsampling [49], we use bilinear upsampling for image reconstruction. After each downsampling and upsampling step, two \(3\times 3\) convolutions with stride 1 extract information from the feature map. In addition, we downsample the noisy image by factors of four and eight, respectively, and extract shallow-level information from these reduced images with two \(3\times 3\) convolutions. The feature maps from the different scales are then concatenated and fused at the corresponding downsampling stages of the encoder, which improves feature representation during downsampling and also passes more information to the decoder through the skip connections.
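To make the fusion concrete, the following PyTorch sketch shows the first three encoder stages under our reading of Fig. 1: \(4\times 4\)/stride-2 convolutions replace pooling, and shallow features extracted from 1/4- and 1/8-resolution copies of the noisy input are concatenated into the stages of matching resolution. The channel widths, the bilinear resizing of the input copies, and the module names (e.g., MultiScaleEncoder) are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions (stride 1) with LeakyReLU, as used after each
    # sampling step in the encoder-decoder branch.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
    )

class MultiScaleEncoder(nn.Module):
    """Illustrative encoder (first three stages only): 4x4/stride-2 convs replace
    pooling, and shallow features from 1/4- and 1/8-resolution copies of the
    noisy input are concatenated into the matching encoder stages."""
    def __init__(self, ch=32):
        super().__init__()
        self.stem = conv_block(3, ch)
        self.down1 = nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1)       # 1/2
        self.enc1 = conv_block(ch * 2, ch * 2)
        self.down2 = nn.Conv2d(ch * 2, ch * 4, 4, stride=2, padding=1)   # 1/4
        self.enc2 = conv_block(ch * 4 + ch, ch * 4)                      # fuses 1/4 branch
        self.down3 = nn.Conv2d(ch * 4, ch * 8, 4, stride=2, padding=1)   # 1/8
        self.enc3 = conv_block(ch * 8 + ch, ch * 8)                      # fuses 1/8 branch
        self.shallow4 = conv_block(3, ch)   # shallow features of the 1/4 image
        self.shallow8 = conv_block(3, ch)   # shallow features of the 1/8 image

    def forward(self, x):
        x4 = F.interpolate(x, scale_factor=0.25, mode='bilinear', align_corners=False)
        x8 = F.interpolate(x, scale_factor=0.125, mode='bilinear', align_corners=False)
        f0 = self.stem(x)
        f1 = self.enc1(self.down1(f0))
        f2 = self.enc2(torch.cat([self.down2(f1), self.shallow4(x4)], dim=1))
        f3 = self.enc3(torch.cat([self.down3(f2), self.shallow8(x8)], dim=1))
        return f0, f1, f2, f3   # skip-connection features for the decoder

feats = MultiScaleEncoder()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])
```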

Fig. 2
figure 2

a The architecture of residual bottleneck in ResNet. b The architecture of residual module in MIFGAN

Additionally, the residual bottleneck proposed in ResNet [50] can extract feature information through stacking and effectively solves the vanishing- and exploding-gradient problems caused by deepening the network during stacking. We therefore design a new residual module that builds on the advantages of the residual bottleneck. The architectures of our residual module and of the residual bottleneck from ResNet are shown in Fig. 2. In our residual module, we remove the batch normalization (BN) layer used in the original residual network. This modification speeds up training and prevents the degradation of information in denoised images caused by normalization. We also use the leaky rectified linear unit (LeakyReLU) as the activation function in the encoder–decoder branch, because the model can still be trained stably when the activation input is negative; accordingly, we change the activation function in the residual module from the rectified linear unit (ReLU) to LeakyReLU. A new residual network branch is constructed by cascading our residual modules, as illustrated in the sketch below. It achieves image denoising and restoration by learning the residual transformation between the output denoised image and the input noisy image. Unlike feature extraction based on downsampling and upsampling, our residual branch extracts shallow and deep contextual features directly on the original-size image, avoiding the loss of detailed information caused by sampling operations. Its processing flow is as follows: the noisy image first passes through a \(3\times 3\) convolution with stride 1 to change the number of channels and obtain a feature-rich map, which is then fed into the residual modules for information extraction. Throughout this stage, the size of the feature map always matches the input image. The ablation study in Sect. 3.5.1 shows that we obtain improved results when using 3 residual modules in the residual branch.
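A minimal PyTorch sketch of the BN-free residual module and the cascaded residual branch is given below. The bottleneck channel widths, exact activation placement, and class names are our assumptions for illustration; the text only specifies that BatchNorm is removed, ReLU is replaced with LeakyReLU, and three modules follow a \(3\times 3\) stem convolution at the original resolution.

```python
import torch.nn as nn

class ResidualModule(nn.Module):
    """BN-free residual bottleneck with LeakyReLU activations (channel widths
    and activation placement are illustrative assumptions)."""
    def __init__(self, channels=64, bottleneck=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1),
        )
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))   # identity skip keeps gradients stable

class ResidualBranch(nn.Module):
    """Residual branch: a 3x3 stem followed by 3 cascaded residual modules,
    operating at the original image resolution throughout."""
    def __init__(self, channels=64, n_blocks=3):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, stride=1, padding=1)
        self.blocks = nn.Sequential(*[ResidualModule(channels) for _ in range(n_blocks)])

    def forward(self, x):
        return self.blocks(self.stem(x))
```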

We restore the feature maps obtained from the two branches to clean images using \(1\times 1\) convolutions. We use \(1\times 1\) convolutions because they have fewer parameters and computations, which reduces training time and memory usage. Finally, we fuse the images output by the two branches through pixel-wise addition to obtain the final clean image. We verify the validity of each part of our generator in the ablation study in Sect. 3.5.2.

Fig. 3
figure 3

a The discriminator architecture in PatchGAN. b The discriminator architecture in MIFGAN. The kernel size, channel number and stride in the convolutional layer are denoted by k, n and s respectively

2.2 Discriminator network architecture

Our discriminator is an improvement on PatchGAN [42]. Figure 3 shows the network architectures of the two discriminators. The literature [42] confirmed that PatchGAN captures more high-frequency information and improves the image processing capability of the generator to a certain extent. The original PatchGAN can evaluate image features at multiple scales through downsampling, but important information is lost in the downsampling process. We therefore add a \(4\times 4\) convolution and an \(8\times 8\) convolution on top of PatchGAN to extract features directly from the image, producing feature maps that are four times and eight times smaller than the input image, respectively. These features are then aggregated with the feature maps at the corresponding positions of the discriminator's downsampling stages. This operation enlarges the receptive field of the discriminator, allowing it to capture more global and local information, compensating for the information lost during downsampling, and improving the discriminator's ability to judge each region of the image. Our discriminator enhances the generator through the adversarial relationship in the GAN: the new discriminator allows the generator to achieve superior denoising while also improving the restoration of structure and details in the denoised image. We validate the effectiveness of our discriminator in the ablation study in Sect. 3.5.3. A minimal sketch of this design follows.
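The sketch below illustrates this design in PyTorch: a PatchGAN-style stack of \(4\times 4\)/stride-2 convolutions, plus two direct-from-image branches (a \(4\times 4\)/stride-4 and an \(8\times 8\)/stride-8 convolution) whose outputs are concatenated into the stages of matching 1/4 and 1/8 resolution. The channel widths, the number of stages, and fusion by concatenation are our assumptions; only the added kernel sizes and target resolutions come from the text.

```python
import torch
import torch.nn as nn

class MultiReceptiveDiscriminator(nn.Module):
    """PatchGAN-style critic with two extra direct-from-image branches fused at
    the 1/4 and 1/8 resolution stages (illustrative assumption of Fig. 3b)."""
    def __init__(self, ch=64):
        super().__init__()
        def down(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.d1 = down(3, ch)                 # 1/2
        self.d2 = down(ch, ch * 2)            # 1/4
        self.d3 = down(ch * 2 + ch, ch * 4)   # 1/8 (after fusing the 1/4 branch)
        self.d4 = down(ch * 4 + ch, ch * 8)   # 1/16 (after fusing the 1/8 branch)
        self.branch4 = nn.Conv2d(3, ch, kernel_size=4, stride=4)   # image -> 1/4
        self.branch8 = nn.Conv2d(3, ch, kernel_size=8, stride=8)   # image -> 1/8
        self.head = nn.Conv2d(ch * 8, 1, kernel_size=3, padding=1) # patch-wise scores

    def forward(self, img):
        f = self.d2(self.d1(img))                                  # 1/4
        f = self.d3(torch.cat([f, self.branch4(img)], dim=1))      # 1/8
        f = self.d4(torch.cat([f, self.branch8(img)], dim=1))      # 1/16
        return self.head(f)   # realness score per patch (no sigmoid: WGAN-GP critic)
```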

2.3 Loss function

Our total objective function is composed of four loss functions: adversarial loss, content loss, total variation loss and dual denoising loss. In this section, we will provide a detailed explanation for each loss function. We will show the effects of each function in Sect. 3.5.4.

Adversarial loss: In order to make the model training stable, we use the objective function in WGAN-GP [27] as the loss function of adversarial training in our model. The mathematical formula is shown in Eq. 2.

$$\begin{aligned} \begin{aligned} \min _G\max _DL(G,D)=&E_{G(x)\sim P_{g}}[D(G(x))]\\&- E_{x\sim P_{r}}[D(x)]+ \lambda _{gp}\times {L_{gp}} \end{aligned} \end{aligned}$$
(2)

where \(E_{G(x)\sim P_{g}}[D(G(x))]\) represents the mathematical expectation of the scores given by the discriminator D to the generated images G(x) when the generated images follow the generated distribution \(P_{g}\). \( E_{x\sim P_{r}}[D(x)]\) represents the mathematical expectation of the scores given by the discriminator D to the real images x when the real images follow the real data distribution \(P_{r}\). \(L_{gp}\) denotes the gradient penalty and \(\lambda _{gp}\) is its weight. Compared to the adversarial loss in Eq. 1, the adversarial loss in Eq. 2 differs mainly in that it uses the Wasserstein distance to measure the difference between the two distributions and adds a regularization term as a gradient penalty, ensuring that the gradient changes smoothly between the real data distribution and the generated data distribution. The specific mathematical formula for the gradient penalty \(L_{gp}\) is given in Eq. 3.

$$\begin{aligned} L_{gp} = E_{\hat{x}\sim P_{\hat{x}}}[(\Vert \nabla _{\hat{x}}D(\hat{x})\Vert _2 - 1)^2] \end{aligned}$$
(3)

In Eq. 3, \(\hat{x}\) represents a linearly interpolated sample between the real data distribution \(P_r\) and the generated data distribution \(P_g\).
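For reference, the gradient penalty of Eq. 3 can be computed in PyTorch as follows; this is the standard WGAN-GP formulation [27], with the interpolation coefficient drawn uniformly per sample.

```python
import torch

def gradient_penalty(discriminator, real, fake, device="cpu"):
    """WGAN-GP gradient penalty (Eq. 3): penalize deviation of the critic's
    gradient norm from 1 at points interpolated between real and fake images."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_hat = discriminator(x_hat)
    grads = torch.autograd.grad(
        outputs=d_hat, inputs=x_hat,
        grad_outputs=torch.ones_like(d_hat),
        create_graph=True, retain_graph=True, only_inputs=True,
    )[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()
```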

According to the game idea, Eq. 2 can be divided into the adversarial loss of the generator and the adversarial loss of the discriminator. Specifically, the generator’s adversarial loss is shown in Eq. 4 and the discriminator’s adversarial loss is shown in Eq. 5.

$$\begin{aligned} \begin{aligned} L_{adv}=-E_{G(x)\sim P_{g}}[D(G(x))] \end{aligned} \end{aligned}$$
(4)

Eq. 4 demonstrates the adversarial loss of the generator. The generator’s goal is to produce samples that are as close as possible to the real data distribution.

$$\begin{aligned} \begin{aligned} L_{D} =&E_{G(x)\sim P_{g}}[D(G(x))]\\&-E_{x\sim P_{r}}[D(x)]+\lambda _{gp}\times {L_{gp}} \end{aligned} \end{aligned}$$
(5)

Eq. 5 demonstrates the adversarial loss of the discriminator. The discriminator’s goal is to correctly distinguish between real samples and generated samples. It desires a high score D(x) for real samples x (drawn from the distribution \(P_{r}\)) and a low score D(G(x)) for the generated samples G(x) from the generator G. One of the key differences between this discriminator and a traditional GAN discriminator is the introduction of a gradient penalty term in the equation, which helps stabilize the training process and prevents mode collapse.

Content loss: Our model cannot generate denoised images with rich content by relying on the adversarial loss alone. So, to constrain the image denoised by MIFGAN to reach the same standard as the ground-truth image, we construct the content loss function, as given in Eq. 6.

$$\begin{aligned} \begin{aligned} L_\text {content}=&\lambda _{pixel} \times L_{pixel}+\lambda _{edge}\times L_{edge}\\&+{\lambda _{vgg}\times L_{vgg}} \end{aligned} \end{aligned}$$
(6)

In Eq. 6, \(L_{pixel}\) represents the pixel loss, \(L_{edge}\) the edge loss and \(L_{vgg}\) the perceptual loss. The coefficients \(\lambda _{pixel}\), \(\lambda _{edge}\) and \(\lambda _{vgg}\) represent the respective weights of these three loss functions within the overall content loss.

Using the L1 loss in image denoising does not adequately protect relevant details, resulting in loss of detailed texture and edge sharpening after denoising. The L2 loss simply sums the squared differences over all pixels of the image, which tends to make the image blurred and over-smoothed. Hence neither the L1 loss nor the L2 loss is well suited to image reconstruction in our denoising task. We therefore use the Charbonnier loss proposed in the literature [51] as our pixel loss. This loss function addresses certain issues inherent in the L1 and L2 losses, resulting in improved image reconstruction. Its specific form is given in Eq. 7:

$$\begin{aligned} \begin{aligned} L_{pixel}=\frac{1}{N}\sum _{i=1}^N\sqrt{\left( y_i-G(x)_i\right) ^2+\ell ^2} \end{aligned} \end{aligned}$$
(7)

where N denotes the total number of image pixels and i denotes the pixel index. y represents the ground-truth image and G(x) represents the denoised image produced by the generator. In the equation, \(\ell \) is a small constant that controls the smoothness of the loss function. When the pixel difference is small, the loss behaves like the L2 loss; when the pixel difference is large, it behaves more like the L1 loss. In this way, the Charbonnier loss better preserves image edges and details during denoising while avoiding overly blurred results.
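A one-line PyTorch version of Eq. 7 is shown below; eps_sq plays the role of \(\ell ^2\) and is set to 0.001 as reported in Sect. 3.2.

```python
import torch

def charbonnier_loss(denoised, target, eps_sq=1e-3):
    """Charbonnier pixel loss of Eq. 7 (eps_sq corresponds to l^2 = 0.001):
    behaves like L2 for small errors and like L1 for large ones."""
    return torch.sqrt((target - denoised) ** 2 + eps_sq).mean()
```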

Preserving high-frequency texture information is what we need to focus on during image content reconstruction. We therefore introduce an edge loss [52, 53] to improve the detail representation of our denoised images. Its mathematical formula is expressed as follows:

$$\begin{aligned} \begin{aligned} L_{edge}=\sqrt{\left( \Delta (y)-\Delta (G(x))\right) ^2+\ell ^2} \end{aligned} \end{aligned}$$
(8)

In Eq. 8, the Laplacian operator [54], denoted by \(\Delta \), is first used to compute the gradient information of the ground-truth image y and of the denoised image G(x) so as to extract edge information. The difference between the two edge maps is then measured in the same Charbonnier form, i.e., the square root of the squared difference plus the smoothing constant \(\ell ^2\).
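Eq. 8 can be implemented as below. The \(3\times 3\) Laplacian kernel shown is one common discretization, and averaging over pixels is our choice; the paper does not specify which variant of the operator it uses.

```python
import torch
import torch.nn.functional as F

# One common 3x3 Laplacian discretization (an assumption; the paper does not
# state which variant it uses), applied per channel via grouped convolution.
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian(img):
    kernel = _LAPLACIAN.to(img.device, img.dtype).repeat(img.size(1), 1, 1, 1)
    return F.conv2d(img, kernel, padding=1, groups=img.size(1))

def edge_loss(denoised, target, eps_sq=1e-3):
    """Edge loss of Eq. 8: Charbonnier distance between Laplacian edge maps,
    averaged over pixels."""
    diff = laplacian(target) - laplacian(denoised)
    return torch.sqrt(diff ** 2 + eps_sq).mean()
```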

In the denoising task, better quantitative evaluation results are the goal we pursue. However, a denoised image with a high peak signal-to-noise ratio does not always look visually pleasing. Therefore, to make the denoised image more refined, its content clearer, and its appearance more consistent with human aesthetics, we introduce a perceptual loss [43] to enhance visual perception.

$$\begin{aligned} \begin{aligned} L_{vgg}=\frac{1}{C_iW_iH_i}\left\| V_{\textrm{gg}_i}(y)-V_{\textrm{gg}_i}(G(x))\right\| _2^2 \end{aligned} \end{aligned}$$
(9)

Eq. 9 gives the mathematical form of the perceptual loss. We use a pretrained VGG19 model [55] to extract features from the denoised image G(x) and the ground-truth image y, and the differences between these features are used to compute the perceptual loss. In the equation, \(V_{\textrm{gg}_i}\) denotes the ith feature layer of the VGG19 model, and \({C_i}\), \(W_i\) and \(H_i\) are the number of channels, the width and the height of that layer, respectively.
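A sketch of the perceptual loss of Eq. 9 using torchvision's pretrained VGG19 follows; the specific feature layer (here the output of features[:17]) is our assumption, as the paper only refers to the ith feature layer, and ImageNet input normalization is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGPerceptualLoss(nn.Module):
    """Perceptual loss of Eq. 9 using a frozen, pretrained VGG19.
    layer_index=17 is an illustrative choice of the i-th feature layer."""
    def __init__(self, layer_index=17):
        super().__init__()
        self.features = vgg19(pretrained=True).features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, denoised, target):
        f_d, f_t = self.features(denoised), self.features(target)
        # Mean over the feature map realizes the 1/(C_i W_i H_i) normalization.
        return torch.mean((f_t - f_d) ** 2)
```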

Total variation loss: Total variation loss [44] was successful in previous denoising work. This function can effectively remove noise and also promote the preservation of texture details. The loss function is defined as follows:

$$\begin{aligned} \begin{aligned} L_{tv}=\Vert \nabla _{h}(G(x))\Vert _{2}^{2}+\Vert \nabla _{\nu }(G(x))\Vert _{2}^{2} \end{aligned} \end{aligned}$$
(10)

where \(\nabla _{h}\) (\(\nabla _{\nu }\)) is the gradient operator along the horizontal (vertical) direction. This loss makes full use of the contextual information of the denoised image G(x): it measures the change of pixel values in the horizontal and vertical directions in the form of gradients and smooths out detected abnormal noise points. We introduce this loss to promote the spatial smoothness of the output image and avoid over-pixelation. A minimal implementation is sketched below.
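Eq. 10 can be computed with simple finite differences, as in the sketch below (we average rather than sum the squared gradients, which only changes a scale factor absorbed by \(\lambda _{tv}\)).

```python
def total_variation_loss(img):
    """Total variation loss of Eq. 10: squared horizontal and vertical
    finite-difference gradients of the denoised image, averaged over pixels."""
    dh = img[:, :, :, 1:] - img[:, :, :, :-1]   # horizontal differences
    dv = img[:, :, 1:, :] - img[:, :, :-1, :]   # vertical differences
    return (dh ** 2).mean() + (dv ** 2).mean()
```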

Dual denoising loss: To enhance the denoising capability of the generator, we conduct a thorough study of the loss function and design a novel loss function called the dual denoising loss, which aims to further constrain and guide the training process of the generator. The form of the function is as follows:

$$\begin{aligned} \begin{aligned} L_{dual}=\frac{1}{N}\sum _{i=1}^NS(y_i-G(G(x))_i) \end{aligned} \end{aligned}$$
(11)

where N denotes the total number of image pixels and i denotes the image pixel index. \(S(\cdot )\) represents the Smooth L1 Loss, whose specific calculation form is shown in Eq. 12.

$$\begin{aligned} \begin{aligned} S(m) ={\left\{ \begin{array}{ll}0.5m^2&{}\mathrm {if~}|m|<1\\ |m|-0.5&{}otherwise\end{array}\right. } \end{aligned} \end{aligned}$$
(12)

The design idea behind the dual denoising loss is as follows. Ideally, our denoising model should only process noisy images: when a clean image passes through the denoising network, the output should remain consistent with the input. If the denoised image still contains noise, we feed it back into the network, which then processes the remaining noise again. By passing the noisy image through the network twice, we can further reduce any residual noise left after the first pass. Based on this analysis, we propose the function in Eq. 11, which constrains the twice-denoised image to be closer to the standard clean image. In Eq. 11 we use the Smooth L1 loss [56] to measure the difference between the image G(G(x)) and the ground-truth image y. The Smooth L1 loss combines the advantages of the L1 and L2 losses, enabling it to address issues such as gradient vanishing and gradient explosion that arise in some special cases. Specifically, when the difference between the predicted and true values is small, the Smooth L1 loss uses the L2 loss (squared error), while for larger differences it employs the L1 loss (absolute error). This design helps stabilize gradients during training and makes the model more robust to outliers. To further show that our proposed dual denoising loss makes network training more stable and enhances denoising performance, we conduct ablation experiments on this function in Sect. 3.5.4.
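A direct PyTorch rendering of Eqs. 11 and 12 is shown below; torch.nn.functional.smooth_l1_loss implements the piecewise form of Eq. 12 with the threshold at 1.

```python
import torch.nn.functional as F

def dual_denoising_loss(generator, noisy, target):
    """Dual denoising loss of Eq. 11: pass the noisy image through the
    generator twice and measure the Smooth L1 distance (Eq. 12) between
    the twice-denoised result and the ground truth."""
    twice_denoised = generator(generator(noisy))
    return F.smooth_l1_loss(twice_denoised, target)
```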

In summary, the total loss function constraining our generator is given by Eq. 13.

$$\begin{aligned} \begin{aligned} L_G =&\lambda _{adv}\times L_{adv}+\lambda _{content}\times L_{\textrm{content}}\\&+\lambda _{tv}\times L_{tv}+\lambda _{dual}\times L_{dual} \end{aligned} \end{aligned}$$
(13)

where \(\lambda _{adv}\), \(\lambda _{content}\), \(\lambda _{tv}\) and \(\lambda _{dual}\) represent the weights of the adversarial loss, content loss, total variation loss and dual denoising loss, respectively. The weight settings for the loss function are discussed in Sect. 3.2. Our discriminator is only used for evaluation; therefore it is constrained only by the adversarial loss, and its overall loss function is the one given in Eq. 5.
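Putting the pieces together, a sketch of the total generator objective of Eq. 13 with the weights reported later in Sect. 3.2 might look as follows; the helper functions are the illustrative sketches given earlier, not the authors' code.

```python
def generator_loss(discriminator, generator, noisy, target,
                   w_adv=0.01, w_content=1.0, w_tv=0.01, w_dual=0.1,
                   w_pixel=10.0, w_edge=0.5, w_vgg=1.0, vgg_loss=None):
    """Total generator objective of Eq. 13, built from the earlier sketches
    (charbonnier_loss, edge_loss, total_variation_loss, dual_denoising_loss
    are assumed to be in scope; vgg_loss is an optional VGGPerceptualLoss)."""
    denoised = generator(noisy)
    l_adv = -discriminator(denoised).mean()                       # Eq. 4
    l_content = (w_pixel * charbonnier_loss(denoised, target)     # Eq. 6
                 + w_edge * edge_loss(denoised, target)
                 + (w_vgg * vgg_loss(denoised, target) if vgg_loss else 0.0))
    l_tv = total_variation_loss(denoised)                         # Eq. 10
    l_dual = dual_denoising_loss(generator, noisy, target)        # Eq. 11
    return w_adv * l_adv + w_content * l_content + w_tv * l_tv + w_dual * l_dual
```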

3 Experiments

3.1 Datasets

Three datasets are used in the experiments: the SIDD dataset [57], the DND dataset [58] and the PolyU dataset [59]. We use these three datasets for the following reasons. First, they are publicly available and have official evaluation criteria. Second, they are collected from complex real-world scenes and thus conform to the distribution of real noise. Third, the noise types in these datasets are complex, so the performance of denoising methods can be accurately gauged.

SIDD: The Smartphone Image Denoising Dataset (SIDD) provides 320 pairs of high-resolution color images as training data. To ensure stable and efficient training, we crop each training image pair into \(256\times 256\) patches. The dataset also provides 40 image pairs for validation, and each validation pair is evaluated on 32 image blocks of size \(256\times 256\). Therefore, validation is performed on a total of 1280 image pairs of size \(256\times 256\).

DND: The Darmstadt Noise Dataset (DND) provides 50 noisy images for testing; no clean counterparts are provided. Test results can only be obtained through the official online evaluation website. The official benchmark divides each test image into 20 bounding boxes of size \(512\times 512\), so DND evaluation ultimately covers 1000 images of size \(512\times 512\).

PolyU: The PolyU dataset is a collection of 40 indoor scenes. It provides 100 noisy images of size \(512\times 512\) for the denoising test, along with a ground-truth image for each test image.

Algorithm 1
figure c

Training Procedure of MIFGAN

3.2 Experimental settings

Training and testing are carried out with the PyTorch 1.7.0 deep learning framework on an Nvidia GeForce RTX 3090 GPU. Both the generator and discriminator use the Adam optimizer with momentum parameters \(\beta _1 = 0.9\) and \(\beta _2 = 0.999\). The learning rate is fixed at 2e-4 throughout training. The batch size is 16 and a total of 200 epochs are trained.

The parameters in the loss functions are set as follows. First, the penalty coefficient \(\lambda _{gp}\) in Eqs. 2 and 5 is set to 10 following the results in WGAN-GP [27]. Then, the parameter \(\ell ^2\) in Eqs. 7 and 8 is set to 0.001, following the default setting in the literature [51, 52]. Finally, we draw on the experience of weight settings for the pixel, edge, and perceptual losses in the literature [54,55,56], with our work focusing on pixel reconstruction. Before fully training the model, we take a small subset of the dataset and run multiple training sessions, adjusting the weights of our loss functions before each session. These experiments show that the model performs best with \(\lambda _{pixel}=10\), \(\lambda _{edge}=0.5\) and \(\lambda _{vgg} =1\), and that with \(\lambda _{adv}=0.01\), \(\lambda _{content}=1\), \(\lambda _{tv}=0.01\) and \(\lambda _{dual}=0.1\) the model can be trained stably and effectively. The overall training procedure of our proposed model is summarized in Algorithm 1, and a simplified sketch of the training loop is given below.
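As a rough illustration of Algorithm 1 (which we do not reproduce verbatim), the alternating generator/discriminator updates with the settings above can be sketched as follows; generator, discriminator and train_loader are assumed to be provided by the caller, and gradient_penalty and generator_loss refer to the earlier sketches.

```python
from torch.optim import Adam

def train_mifgan(generator, discriminator, train_loader, epochs=200, device="cuda"):
    """Alternating WGAN-GP style updates (a sketch, not the authors' Algorithm 1)."""
    opt_g = Adam(generator.parameters(), lr=2e-4, betas=(0.9, 0.999))
    opt_d = Adam(discriminator.parameters(), lr=2e-4, betas=(0.9, 0.999))
    lambda_gp = 10.0
    for epoch in range(epochs):
        for noisy, clean in train_loader:          # 256x256 patch pairs, batch size 16
            noisy, clean = noisy.to(device), clean.to(device)
            # Discriminator (critic) step, Eq. 5.
            fake = generator(noisy).detach()
            loss_d = (discriminator(fake).mean() - discriminator(clean).mean()
                      + lambda_gp * gradient_penalty(discriminator, clean, fake, device))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            # Generator step, Eq. 13.
            loss_g = generator_loss(discriminator, generator, noisy, clean)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```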

Table 1 Quantitative comparison results of different denoising methods on the SIDD dataset
Fig. 4
figure 4

Visual comparison results of different denoising methods on SIDD dataset

3.3 Evaluation metrics

The evaluation of image denoising includes quantitative and qualitative evaluation. Qualitative evaluation relies mainly on human visual intuition to judge whether the image quality matches people's aesthetic expectations. For quantitative evaluation, the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [60] are currently the most popular and widely recognized metrics.

PSNR is usually used to measure the noise removal effect of a denoising algorithm. Its mathematical formula is as follows:

$$\begin{aligned} \begin{aligned} PSNR=20\times \log _{10}\left( \frac{MAX_{pixel}}{\sqrt{MSE{(G(x),y)}}}\right) \end{aligned} \end{aligned}$$
(14)

where \(MAX_{pixel}\) represents the maximum pixel value of the image, usually 255. MSE(G(x), y) is the mean squared error between the denoised image G(x) and the clean image y, which should be as small as possible. Therefore, the larger the PSNR value, the higher the quality of the denoised image.

The SSIM index is more consistent with human visual perception of images. SSIM compares the difference between the denoised image and the clean image mainly from three aspects: structure, brightness and contrast. The formula for calculating SSIM is as follows:

$$\begin{aligned} \begin{aligned} SSIM=\frac{(2\mu _{_{G(x)}}\mu _{_{y}}+M_{_{1}})(2\sigma _{_{G(x)y}}+M_{_{2}})}{(\mu _{_{G(x)}}^2+\mu _{_{y}}^2+M_{_{1}})(\sigma _{_{G(x)}}^2+\sigma _{_{y}}^2+M_{_{2}})} \end{aligned} \end{aligned}$$
(15)

where \(\mu _{G(x)}\) and \(\mu _{y}\) represent the mean values of the denoised image G(x) and the clean image y, respectively. \(\sigma _{G(x)}^2\) and \(\sigma _{y}^2\) represent the variances of the pixel values of the denoised image G(x) and the clean image y, respectively. \(\sigma _{G(x)y}\) represents the covariance between the denoised image G(x) and the clean image y. \(M_{1}\) and \(M_{2}\) are constants to prevent the denominator from being 0 in the calculation. The SSIM is calculated between 0 and 1. The closer the evaluation result is to 1, the higher the similarity between the denoised image and the clean image, and the more favorable the image quality.

3.4 Comparison with other methods

To evaluate the effectiveness and competitiveness of our denoising algorithm, we compare it quantitatively and qualitatively with several state-of-the-art methods on the SIDD, DND and PolyU datasets.

Evaluation results on the SIDD dataset: As shown in Table 1, MIFGAN achieves superior quantitative metrics on this dataset compared with the other methods. Specifically, compared with CFNet and RIDNet, our PSNR and SSIM increase by 0.93 and 0.005, and by 1.56 and 0.046, respectively. The quantitative results of MIFGAN are also significantly better than those of BM3D, CBDNet and FFDNet. Figure 4 shows the visual comparison. BM3D and FFDNet are less effective at removing real noise, and the denoised images of RIDNet and DCANet contain artifacts. DANet and MIRNet are more blurred than MIFGAN in the floor texture details. In contrast, our method removes noise more effectively and recovers details better.

Table 2 Quantitative comparison results of different denoising methods on the DND dataset
Fig. 5
figure 5

Visual comparison results of different denoising methods on the DND dataset

Evaluation results on the DND dataset: Table 2 presents the quantitative results. Compared with the DCANet, FFDNet and MDRN methods, the PSNR and SSIM of MIFGAN increase by 0.25 and 0.001, 5.42 and 0.107, and 0.39 and 0.002, respectively. Although our SSIM is 0.004 lower than ADGAN and 0.003 lower than MSGAN, our PSNR is far higher than both. Figure 5 shows the visual results of the different methods. BM3D, CDNCNN-B, and FFDNet still leave a large amount of residual noise, and FFDNet and CBDNet produce blurred structures because of excessive smoothing during denoising, which makes texture details disappear. For the gray texture in the black area, MIFGAN retains finer and clearer texture structures than DANet and DCANet. These experimental results show that the MIFGAN algorithm is more competitive.

Evaluation results on the PolyU dataset: The quantitative results are shown in Table 3. MIFGAN demonstrates the best performance: its PSNR and SSIM improve upon TSIDNet by 1.3 and 0.027, respectively. Additionally, compared to DCANet, which performs strongly among the other methods, MIFGAN still improves PSNR and SSIM by 0.9 and 0.009, respectively. Figure 6 shows the visual comparison. For ease of observation, we enlarge the text region in the upper right corner of the image without altering the image itself. From the figures it can be observed that the text in DANet and MIRNet appears blurred, CBDNet still contains residual noise, and the text in DCANet and MPRNet is not as clear as in MIFGAN, while the texture of the red brick in MIFGAN is more detailed and clear. These results demonstrate that our algorithm possesses superior generalization ability.

Table 3 Quantitative comparison results of different denoising methods on the PolyU dataset
Fig. 6
figure 6

Visual comparison results of different denoising methods on the PolyU dataset

3.5 Ablation study

To demonstrate the effectiveness of our algorithm, we conduct ablation studies on the SIDD dataset.

3.5.1 Ablation study of the residual modules

In the generator architecture, we use multiple residual modules at the original image resolution for denoising. To determine the number of residual modules that gives the best effect, we perform ablation experiments on the number of residual modules; the detailed results are shown in Table 4. The data show that adding further residual modules beyond three does not increase PSNR or SSIM, and the metrics can even become worse. Although the PSNR increases by 0.01 when using 7 residual modules instead of 3, the number of model parameters more than doubles, which costs our model more training and testing time. Therefore, we set the number of residual modules to 3, which achieves good PSNR and SSIM with only 0.22 million parameters.

Table 4 The effect of the number of residual modules on PSNR, SSIM and Params(M)

3.5.2 Ablation study for generator

To verify the effectiveness of the generator in MIFGAN, we perform ablation experiments on the four structures in the generator that deal with different scales. The results are shown in Table 5. When only the U-net architecture is used, the denoising effect is poor. After sequentially adding the four-times and eight-times downsampling branches, both PSNR and SSIM improve gradually. When our residual network branch is added, both objective metrics reach their best values. Therefore, the generator ablation study shows that each part of the denoising network in MIFGAN plays an irreplaceable role.

Table 5 Ablation study of network architecture in generator for denoising effect, where down indicates downsampling, \(\checkmark \) indicates use and ✗ indicates non-use

3.5.3 Ablation study for discriminator

To verify the effectiveness of the discriminator in MIFGAN, we perform an ablation study on it. Specifically, we replace the discriminator in MIFGAN with those of DANet, PatchGAN and BDGAN, respectively, and retrain our network. The results of these experiments are shown in Table 6.

From the data in Table 6 we can conclude that, compared with the simple fully connected discriminator of DANet, the ordinary PatchGAN discriminator used in the literature [34], and the multiple different-scale discriminators of BDGAN, MIFGAN improves PSNR and SSIM by 2.68 and 0.009, 1.25 and 0.005, and 1.04 and 0.004, respectively. The experimental results therefore show that our proposed discriminator effectively assists the model in denoising.

Table 6 Ablation study of the effect of discriminator on model performance, where DANet, PatchGAN, BDGAN and MIFGAN refer to the discriminator in the corresponding methods

3.5.4 Ablation study for loss functions

MIFGAN uses a total of 6 loss functions to train the generator. To prove the effectiveness of each loss function, we conduct ablation studies on these 6 losses. The quantitative results are shown in Table 7. Both PSNR and SSIM are low when only the adversarial loss is used. After successively introducing the pixel loss, total variation loss, and edge loss, the quantitative results improve steadily. After introducing our proposed dual denoising loss, PSNR and SSIM reach 39.80 and 0.956, respectively. Finally, after adding the perceptual loss, PSNR and SSIM improve further to 40.27 and 0.960, respectively. Figure 7 shows the visual effects after denoising with the different loss combinations. It is clear that relying only on the adversarial loss leads to unstable training and makes it difficult to generate image structure. With the introduction of the pixel loss, the image acquires a rough structure, but residual noise and haze remain. After adding the total variation loss, the denoising quality is enhanced, and after adding the edge loss, the texture details become clearer. The introduction of the dual denoising loss strengthens the recovery of important detailed textures in the image. Finally, the addition of the perceptual loss makes the image more realistic and vivid. The experimental results thus show that each loss function in MIFGAN plays an indispensable role.

Table 7 Ablation study of the effect of loss function on denoising performance
Fig. 7
figure 7

The visual effect of the proposed network with different loss functions. a noisy. b clean. c \(L_{adv}\) loss. d \(L_{adv}+L_{pixel}\) loss. e \(L_{adv}+L_{pixel}+L_{tv}\) loss. f \(L_{adv}+L_{pixel}+L_{tv}+L_{edge}\) loss. g \(L_{adv}+L_{pixel}+L_{tv}+L_{edge}+L_{dual}\) loss. h \(L_{adv}+L_{pixel}+L_{tv}+L_{edge}+L_{dual}+L_{vgg}\) loss

4 Conclusion

In this paper, we propose a multi-scale information fusion generative adversarial network (MIFGAN) for real-world noisy image denoising. Our encoder–decoder network employs a multi-scale information fusion strategy to enhance the model's image denoising and restoration capabilities, while our customized residual network further mitigates noise and preserves image information. Additionally, by incorporating convolutional kernels of various sizes into the discriminator, we increase its receptive field, enabling it to gather more comprehensive contextual information and more effectively assist the generator in the denoising task. Our proposed dual denoising loss, combined with the other loss functions, constitutes a set of multi-scale objective functions that improves the denoising and restoration performance of the model. The experimental results show that our method effectively removes complex noise from real images and produces clearer, more realistic denoised images. Compared with other methods, our method exhibits superior practicability on real noisy images.

Our algorithm is meaningful not only for the image denoising task but can also be widely applied, by modifying the relevant model parameters, to other image enhancement tasks such as image super-resolution and image inpainting. However, the number of parameters in our model is relatively large, which may limit its deployment on some micro-embedded devices. Developing more lightweight denoising models is therefore another direction for future research.