1 Introduction

Monte Carlo (MC) path tracing [18] is a general and powerful rendering technique for simulating light transport and rendering photo-realistic images in computer graphics. Due to its generality and unbiased nature, MC path tracing has been widely used in animation production, visual effects, and video games [20]. However, it must trace a large number of ray paths per pixel to render noise-free images, which makes rendering very time-consuming. This problem has motivated researchers to develop many approaches that denoise images rendered at a reduced sample rate (e.g., 1–64 samples per pixel (spp)) with the help of auxiliary buffers (e.g., albedo, normal, and depth buffers).

Recently, ACFM [35] and DMCR [26] applied the generative adversarial network (GAN) [9] to denoising Monte Carlo renderings at an offline rate, achieving more plausible results than traditional kernel filtering [4] and CNN-based approaches [2, 31, 34, 36].

Fig. 1 K3N48S1 represents a convolution operation where the kernel size is 3, the number of feature channels is 48, and the stride is 1. a An overview of our network framework (DuRCGAN). Note that the denoisers and discriminators for the diffuse and specular inputs have different weights. The denoiser is based on a residual-in-residual (RIR) design, which stacks three dual residual groups (DRGs) and a long skip connection. The red line between the DRGs is residual connection-2 (see Fig. 2 and Sect. 3.1 for details); b auxiliary buffer encoder network. We use a multi-scale convolution dense block (MSCDB) to extract spatially precise auxiliary features by convolution dense blocks (CDBs) and obtain a complementary set of contextual information across multiple spatial scales by multi-scale feature fusion blocks (MSFFBs). The encoded auxiliary features \(F_{\mathrm{Abuf}}\) are used to modulate noisy features in DRGs; c discriminator network

However, we find three main limitations of these methods. First, most MC denoising networks stack several residual blocks to build a deeper network and thus improve denoising performance [26, 31, 35]. However, the residual connections [13] in previous MC denoising methods are all embedded within individual residual units, which ignores the interaction of features between different residual units. Second, existing works since ACFM [35] modulate noisy feature maps based on encoded auxiliary features. Provided the auxiliary buffers are encoded properly, this achieves better denoising performance than simply concatenating them with the noisy image as the network input (as previous works did [2, 31]). Existing works typically extract auxiliary features at full resolution (a single scale) via several convolution layers [26, 35] to retain fine spatial details; however, operating on a single scale fixes the receptive field in each layer, whereas it is well known in vision science that the sizes of local receptive fields in the same area differ [37]. Third, previous works often use standard convolution operations to extract features at fixed local positions, which makes the network lack flexibility when facing low-frequency and high-frequency information simultaneously.

To address the above problems, we propose a novel adversarial approach for denoising Monte Carlo renderings, called dual residual connection GAN (DuRCGAN). Specifically, as illustrated in Fig. 1, we introduce the residual-in-residual (RIR) structure. The hierarchical connections inside the RIR allow the network to have more path options, which can increase the flow of information and the chance of the optimal feature selection. Moreover, we propose a multi-scale convolution dense block (MSCDB) as an auxiliary buffer encoder. It operates on full-resolution features to extract and maintain the fine spatial details of auxiliary features. During the encoding process, additional down-sampling and up-sampling layers are used to generate low-resolution features to obtain a complementary set of features across multiple spatial scales [37]. The encoded auxiliary features are used to modulate features from noisy input inside the proposed RIR structure. Furthermore, we propose a spatial-adaptive block (SAB). It introduces deformable convolution [38] to help the network adapt to spatial variations between low-frequency and high-frequency features and thus recover more spatial details and textures. As shown in Fig. 6, DuRCGAN can achieve better visual results and quantitative metrics compared with previous state-of-the-art methods.

Table 1 The abbreviations we used in this section

2 Related works

2.1 Learning-based Monte Carlo denoising

The key idea of denoising MC rendering is to reconstruct noise-free images from noisy input with the help of auxiliary features including albedo, normal, depth, and the corresponding variance buffers [34, 39]. Recently, learning-based MC denoising approaches have leveraged deep neural networks to outperform traditional image-space methods [4].

Pixel-space reconstruction is the most common approach to learning-based MC denoising. Pixel-based denoisers use the summary statistics of per-pixel sample distributions. As pioneers, Kalantari et al. [19] used a multilayer perceptron to estimate the parameters of denoising filters. Chaitanya et al. [5] proposed a recurrent neural network (RNN) to deal with image sequences at an interactive rate. Bako et al. [2] applied convolutional neural networks to predict kernel filters. Vogels et al. [31] extended the works of Chaitanya et al. [5] and Bako et al. [2] by considering multi-scale denoising and temporal coherence. Wong et al. [34] used several residual blocks to directly generate the noise-free images instead of predicting kernel filters [2]. Kuznetsov et al. [23] divided the denoising problem into two parts: adaptive sampling and reconstruction. Hasselgren et al. [12] extended the work of Kuznetsov et al. [23] by introducing a multi-scale kernel prediction network and considering temporal denoising. Xu et al. [35] were the first to apply the generative adversarial network to this task. Moreover, they proposed the auxiliary feature conditioned modulation method to exert more additive and multiplicative interactions between the auxiliary features and the noisy input, which is more effective than naively concatenating the auxiliary buffers with the noisy image as the network input. DMCR [26] extended the work of Xu et al. [35] by introducing a residual attention network and a hierarchical feature extraction method for auxiliary buffers. Meng et al. [27] introduced the neural bilateral grid [7] to build a lightweight network for real-time denoising.

Deep learning has also been utilized for sample-based MC denoising, which works on individual samples instead of pixel aggregates. Gharbi et al. [8] proposed a novel splatting method that predicts a per-sample splatting kernel. The kernel splats each sample onto nearby pixels to produce the final result. Munkberg et al. [28] proposed a layer embedding denoising approach to speed up this operation. It separates samples into different layers and applies a splatting kernel filter in each layer. However, the memory requirements of sample-based denoising are substantial because each rendered sample carries more scalar features. Meanwhile, the rendered samples still need to be averaged into pixels and processed by a pixel-based denoiser to generate features. Hence, in this paper, we focus on the design of the pixel-based denoising network structure.

In addition, Kettunen et al. [21] and Guo et al. [11] tried to perform the screened Poisson reconstruction for gradient-domain rendering, but this requires additional input information. Vicini et al. [30] addressed the denoising of deep Monte Carlo renderings.

2.2 Generative adversarial networks

The generative adversarial network (GAN) [9] has been widely used in various image generation tasks, including image-to-image translation [32], image editing [17], and image super-resolution [33]. However, the training process of the vanilla GAN is unstable because of vanishing gradients and mode collapse. Recently, several works have focused on stabilizing GAN training and increasing sample diversity [1, 10].

For MC denoising, Xu et al. [35] used the VGG network [29] as the discriminator, but it failed to capture global information of the images. DMCR [26] introduced a multi-scale PatchGAN discriminator [32], which uses no fully connected layers and discriminates images from coarse to fine scales.

3 Denoising network structure

Similar to previous work [35], our denoising network processes the diffuse and the specular noisy images separately and synthesizes the output of two networks to obtain the final denoised results. In this section, we elaborate on our proposed residual-in-residual (RIR) module and the multi-scale convolution dense block (Table 1).

3.1 Residual-in-residual (RIR) module

To achieve better results with a deeper network, we introduce the residual-in-residual (RIR) module. It consists of a long skip connection and three dual residual groups (DRGs) (Fig. 1a). The long skip connection allows residual learning at a coarse level, which lets the network focus on learning high-frequency information. As illustrated in Fig. 2, each DRG consists of two dual residual blocks (DRBs), named \(\hbox {DRB}^l\) and \(\hbox {DRB}^{l+1}\), and a middle skip connection that takes residual learning a step further. Each DRB has two residual units (RUs) with dual residual connections (the blue and the red lines) and a short skip connection.

Fig. 2 Illustration of the dual residual group (DRG). The DRG has two dual residual blocks (DRBs), named \(\hbox {DRB}^l\) and \(\hbox {DRB}^{l+1}\), and a middle skip connection. Each DRB consists of two residual units (RUs), the dual residual connections (residual connection-1 and residual connection-2), and a short skip connection. The \(F_{\mathrm{Abuf}}\) produced by the MSCDB is used to modulate noisy features in the RU

Fig. 3 Illustration of the (\(l+1\))th dual residual block (DRB). It has two residual units (RUs) with a short skip connection. In each RU, we apply the CFM [35] to modulate noisy features with the auxiliary features \(F_{\mathrm{Abuf}}\), introduce the dual residual connections (the blue and red lines), and use the channel attention (CA) [26] to make the network exploit the inter-channel relationship of features. We further use a spatial-adaptive block (SAB) to make the network adapt to spatial variations

We now elaborate on the proposed dual residual connections. Recently, paired operations with dual residual connections have shown their effectiveness in image processing tasks [25]. In this paper, we introduce dual residual connections into MC denoising and regard the two RUs of a DRB as the paired operations. The dual residual connections consist of residual connection-1 and residual connection-2. In practice, residual connection-1 (the blue line in Fig. 2) is equivalent to the identity mapping in the standard residual unit [14] and can be viewed as the intra-RU residual connection. Besides residual connection-1, we introduce residual connection-2 (the red line in Fig. 2) into the second RU of each DRB. As shown in Figs. 2 and 3a, let \(\hbox {RU}_{2}^{l}\) and \(\hbox {RU}_{2}^{l+1}\) be the second RUs of \(\hbox {DRB}^l\) and \(\hbox {DRB}^{l+1}\), respectively. Before its ReLU function, \(\hbox {RU}_{2}^{l+1}\) receives the intermediate residual (named \(\hbox {Res}_{2}^{l}\)) from \(\hbox {RU}_{2}^{l}\). After the ReLU function, it generates a new intermediate residual \(\hbox {Res}_{2}^{l+1}\). To make full use of these intermediate residuals, we add them element-wise (\(\hbox {Res}_{2}^{l}\oplus \hbox {Res}_{2}^{l+1}\)) and use the result as the input residual of the next DRG. \(\hbox {RU}_{2}^{l}\) itself also benefits from this: it receives the intermediate residual (\(\hbox {Res}_{2}^{l-2} \oplus \hbox {Res}_{2}^{l-1}\)) from the previous DRG, as shown in Fig. 2. Therefore, the dual residual connections implicitly increase the number of potential interactions between the intra-unit and inter-unit features, which leads to better denoising results. The long, middle, and short skip connections and the dual residual connections in the RIR give the network more path options, so that more information can bypass intermediate layers through the multiple connections.
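To make the connection pattern concrete, the following plain-Python sketch traces one possible reading of the data flow through the RIR. It is an illustration under stated assumptions, not our actual implementation: features are flat lists of floats, `transform` is a hypothetical stand-in for the convolution, modulation, and attention layers inside an RU, residual connection-2 is assumed to be an element-wise addition placed right before the ReLU of the second RU, and the first DRG is assumed to receive a zero residual.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def residual_unit(x, transform, res_in=None):
    """One RU. `transform` stands in for the learned layers (hypothetical)."""
    pre = transform(x)
    if res_in is not None:
        # Residual connection-2 (assumed placement: before the ReLU).
        pre = add(pre, res_in)
    res = relu(pre)            # intermediate residual produced by this RU
    return add(x, res), res    # residual connection-1: identity shortcut


def dual_residual_block(x, transform, res_in):
    h, _ = residual_unit(x, transform)                # first RU
    h, res_out = residual_unit(h, transform, res_in)  # second RU, receives res_in
    return add(x, h), res_out                         # short skip connection


def dual_residual_group(x, transform, res_in):
    h, res_l = dual_residual_block(x, transform, res_in)   # DRB^l
    h, res_l1 = dual_residual_block(h, transform, res_l)   # DRB^(l+1)
    # Middle skip connection; the summed residuals feed the next DRG.
    return add(x, h), add(res_l, res_l1)


def rir(x, transform, num_groups=3):
    h, res = x, [0.0] * len(x)  # assumption: zero residual into the first DRG
    for _ in range(num_groups):
        h, res = dual_residual_group(h, transform, res)
    return add(x, h)            # long skip connection
```

With a `transform` that always returns zeros, only the skip paths remain, which makes the nesting easy to check by hand: each DRB doubles its input, each DRG multiplies it by five, and the long skip adds the input once more.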

Fig. 4 The implementation of the convolution dense block (CDB) and the multi-scale feature fusion block (MSFFB) in the auxiliary buffer encoder network

3.2 Multi-scale convolution dense block (MSCDB)

Auxiliary buffers are inexpensive rendering by-products, but they can provide rich geometry and texture information for noisy features, which can greatly improve denoising performance [39]. Hence, how to effectively extract these abundant features is a key problem in MC denoising.

The previous work [26] has shown the effectiveness of the convolution dense block (CDB) for extracting auxiliary features. However, we find that the CDB operates only on full-resolution (single-scale) auxiliary features; it can extract fine spatial details but fails to capture semantically reliable contextual information from multiple scales. Hence, in this paper, we propose the multi-scale convolution dense block (MSCDB) to extract and fuse diverse information from both full-resolution and low-resolution scales, as illustrated in Figs. 1b and 4. To keep the number of network parameters low, we do not apply CDBs at multiple scales. Instead, we introduce a multi-scale feature fusion block (MSFFB, Fig. 3a) after the CDB to obtain rich and semantically reliable contextual information while maintaining precise spatial features.

Specifically, the CDB operates on full-resolution representations, and the MSFFB follows the CDB to fuse contextual information. The MSFFB applies down-sampling operations to produce three resolution streams. In each resolution stream, we use one residual unit to extract features. Then, we apply up-sampling operations to the two low-resolution feature maps to return them to full resolution. Motivated by the work of Zamir et al. [37], we introduce a selective kernel feature fusion (SKFF) module to aggregate the features from the three scales instead of simply concatenating them. The SKFF performs an element-wise addition of the features from the three scales and then applies global average pooling to squeeze the spatial dimensions of the fused features, which is equivalent to computing channel-wise statistics. Next, a channel-downscaling convolution layer generates a latent vector, followed by three parallel channel-upscaling convolution layers that produce three feature descriptors. For the select operation, we apply the softmax function to obtain three attention features f1, f2, and f3. Finally, we use f1, f2, and f3 to recalibrate the input feature maps from the three scales, respectively.
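The SKFF steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than our implementation: feature maps are nested lists indexed as [channel][row][column], the \(1\times 1\) convolutions acting on the pooled vector are written as small matrices (`w_down` and `w_ups` are hypothetical names), any non-linearity after the downscaling layer is omitted, and the softmax is assumed to be taken across the three scales independently for each channel.

```python
import math


def global_avg_pool(fmap):
    """Average each channel of a [C][H][W] map, giving C channel statistics."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]


def matvec(weight, vec):
    """A 1x1 convolution on a pooled vector is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def skff(scales, w_down, w_ups):
    """Fuse same-sized feature maps; returns (attention, recalibrated maps)."""
    n_ch = len(scales[0])
    height, width = len(scales[0][0]), len(scales[0][0][0])
    # 1) Element-wise addition of the feature maps from all scales.
    fused = [[[sum(s[c][i][j] for s in scales) for j in range(width)]
              for i in range(height)] for c in range(n_ch)]
    # 2) Global average pooling squeezes the spatial dimensions.
    stats = global_avg_pool(fused)
    # 3) Channel downscaling to a latent vector, then one upscaling per scale.
    latent = matvec(w_down, stats)
    descriptors = [matvec(w_up, latent) for w_up in w_ups]
    # 4) Softmax across scales, separately for every channel.
    attention = [[0.0] * n_ch for _ in scales]
    for c in range(n_ch):
        peak = max(d[c] for d in descriptors)
        exps = [math.exp(d[c] - peak) for d in descriptors]
        total = sum(exps)
        for k, e in enumerate(exps):
            attention[k][c] = e / total
    # 5) Recalibrate each scale's feature map with its attention weights.
    recalibrated = [[[[attention[k][c] * v for v in row] for row in scale[c]]
                     for c in range(n_ch)]
                    for k, scale in enumerate(scales)]
    return attention, recalibrated
```

Because the attention weights of a channel sum to one across the scales, the module redistributes emphasis between resolutions without changing the overall feature magnitude.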

After passing through CDB and MSFFB, both full-resolution and progressive low-resolution features are extracted. We repeatedly stack them to extract deeper features. Finally, we concatenate the output of CDBs or MSFFBs and use a \(1\times 1\) convolution layer to fuse them into the final auxiliary features \(F_{\mathrm{Abuf}}\).

Fig. 5 The architecture of the spatial-adaptive block (SAB). The SAB consists of the offset block and the deformable convolution layer. The deformable convolution uses the offset values obtained by the offset block to extract non-fixed-location features

Fig. 6 We evaluate our network and compare it with the state-of-the-art methods, including NFOR [4], KPCN [2], ACFM [35], and DMCR [26], on test scenes from [3] rendered by the Tungsten renderer. For each scene, we also show two close-ups

3.3 Spatial-adaptive block (SAB)

The standard convolution operation extracts features at fixed local positions, which may cause relevant and unrelated features to be aggregated together. As illustrated in Fig. 10b, the standard convolution operation blurs the results at the junction of high-frequency and low-frequency information.

To address this problem, in this paper, we introduce a spatial-adaptive block (SAB) to help the network adapt to spatial changes. The core of SAB is the modulated deformable convolution (Fig. 5b) [6, 38]. Compared with the standard convolution, the modulated deformable convolution can change the shapes of convolutional kernels and be formulated as:

$$\begin{aligned} y(p) = \sum _{p_{i}\in N(p)} w_{i}\cdot x(p_{i}+\varDelta p_{i}) \cdot \varDelta m_{i} \end{aligned}$$

where N(p) denotes the neighborhood of location p covered by the convolutional kernel, and \(w_i\) and \(p_i\) denote the weight and the location in N(p) (green squares shown in Fig. 5a). \(\varDelta p_i\) and \(\varDelta m_i\) are the offset and modulation values obtained via the offset block. \(\varDelta p_i\) shifts the location \(p_i\) (blue squares shown in Fig. 5b), and \(\varDelta m_i\) is the modulation scalar, which lies in the range [0, 1] and recalibrates the features further. Hence, the modulated deformable convolution can adjust the spatial support regions, which helps the network deal with low-frequency and high-frequency features more effectively.
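As an illustration of the formula above, the following plain-Python sketch evaluates a modulated deformable convolution at a single output location of a single-channel image. It is a minimal sketch under stated assumptions: the kernel is \(3\times 3\), fractional sampling positions are resolved by bilinear interpolation, samples outside the image count as zero, and the offsets and modulation scalars are passed in directly (in the SAB they would come from the offset block). All function and parameter names are hypothetical.

```python
import math


def bilinear(img, y, x):
    """Sample an H x W image at a fractional (y, x); outside pixels are zero."""
    height, width = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                weight = (1.0 - abs(y - yy)) * (1.0 - abs(x - xx))
                value += weight * img[yy][xx]
    return value


def modulated_deform_conv_at(img, p, weights, offsets, masks):
    """y(p) = sum_i w_i * x(p_i + dp_i) * dm_i over a 3x3 neighbourhood.

    weights: 9 kernel weights w_i (row-major order)
    offsets: 9 (dy, dx) pairs, the offsets dp_i
    masks:   9 modulation scalars dm_i in [0, 1]
    """
    py, px = p
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for (dy, dx), w, (oy, ox), m in zip(grid, weights, offsets, masks):
        # p_i = p + (dy, dx); the learned offset shifts it to p_i + dp_i.
        out += w * bilinear(img, py + dy + oy, px + dx + ox) * m
    return out
```

With all offsets set to zero and all modulation scalars set to one, the function reduces to a standard \(3\times 3\) convolution at p, which is a convenient sanity check.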

Inspired by Chang et al. [6], in order to better estimate the current offset values, we transfer the offset values obtained in the last offset block \(\{\varDelta p^{\mathrm{last}}, \varDelta m^{\mathrm{last}}\}\) to the current offset block (the purple line in Figs. 5 and 3a). Thus, we apply several standard convolution layers to extract features from the input features \(x_s\) and aggregate them with the last offset values \(\{\varDelta p^{\mathrm{last}}, \varDelta m^{\mathrm{last}}\}\) to estimate current offset values \(\{\varDelta p^{\mathrm{curr}}, \varDelta m^{\mathrm{curr}}\}\). Moreover, we introduce the spatial attention (SA) proposed by Zamir et al. [37] to help the offset block pay attention to spatial importance.

Table 2 The statistics of numerical performance show that our method can outperform the state-of-the-art approaches at any spp. Avg. indicates the average value over the entire test set calculated by SSIM, PSNR, or RMSE metrics, and B.P. indicates the percentage of all the best results of each method to the total test set
Table 3 Average time cost and the number of parameters of each denoising approach

In order to reduce the number of network parameters and memory space occupation, we only replace the first standard convolution layer of each RU in the last DRG with the SAB (see Fig. 3a).

4 Experimental setup

4.1 Datasets

Training a robust generative adversarial network requires a large-scale and diverse dataset to avoid overfitting. We use the public denoising datasets [2, 3] rendered by the Tungsten renderer. The training datasets consist of 8 scenes, and each scene has about 200 sets of noisy inputs, auxiliary buffers, and reference images rendered with different camera parameters, materials, textures, and illumination conditions. The reference images were rendered with 32,768 samples per pixel (spp), while the noisy images and the corresponding auxiliary buffers were rendered with 32 spp. All training data are cropped into patches of size \(128\times 128\) by importance sampling [2]. We use albedo, normal, and depth maps as auxiliary buffers and scale them to the same range [0, 1]. We use the diffuse data without albedo as the network input to preserve the texture information, and we then multiply the albedo map back into the denoised result. Before using the specular or untextured diffuse RGB color buffers as the network input, we apply a logarithmic function to them to compress the high dynamic range (HDR) of the color values, i.e., \(\log (1+x)\), where x is the HDR color value.
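The color pre- and post-processing described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: buffers are flat lists of floats, the small constant `ALBEDO_EPS` that guards the division by the albedo is hypothetical, the network output is assumed to be mapped back with the inverse transform \(e^{y}-1\), and the final image is assumed to be the sum of the diffuse and specular results.

```python
import math

ALBEDO_EPS = 1e-2  # hypothetical constant that avoids division by zero


def prepare_diffuse(diffuse, albedo):
    """Divide out the albedo (untextured diffuse), then apply log(1 + x)."""
    return [math.log1p(d / (a + ALBEDO_EPS)) for d, a in zip(diffuse, albedo)]


def restore_diffuse(denoised, albedo):
    """Undo the log transform (assumed) and multiply the albedo back."""
    return [math.expm1(y) * (a + ALBEDO_EPS) for y, a in zip(denoised, albedo)]


def prepare_specular(specular):
    """Compress the HDR range of the specular buffer with log(1 + x)."""
    return [math.log1p(s) for s in specular]


def restore_specular(denoised):
    """Undo the log transform (assumed inverse: exp(y) - 1)."""
    return [math.expm1(y) for y in denoised]


def compose(diffuse, specular):
    """Combine both branches; a plain sum is assumed here."""
    return [d + s for d, s in zip(diffuse, specular)]
```

Dividing by the albedo before denoising means that the network never has to reproduce texture detail itself, since the noise-free albedo buffer reintroduces it afterwards.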

4.2 Implementation details

The discriminator network is designed in a PatchGAN [16] style, which means that no fully connected layers are used to capture global features. We concatenate the original denoised/reference image with the same image after applying a Laplacian filter and use the result as the input to the discriminator. This strategy lets the discriminator pay attention to the edge information of the image, while also prompting the generator to generate high-frequency details.
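A minimal plain-Python sketch of how such a discriminator input could be assembled is given below. The specific 4-neighbour Laplacian kernel, the zero padding at the image border, and the per-channel filtering are assumptions made for illustration; images are lists of channels, each an H x W nested list.

```python
# 4-neighbour Laplacian kernel (an assumed choice, for illustration only).
LAPLACIAN_KERNEL = ((0.0, 1.0, 0.0),
                    (1.0, -4.0, 1.0),
                    (0.0, 1.0, 0.0))


def laplacian(channel):
    """Filter one H x W channel with the Laplacian kernel (zero padding)."""
    height, width = len(channel), len(channel[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        acc += LAPLACIAN_KERNEL[di + 1][dj + 1] * channel[y][x]
            out[i][j] = acc
    return out


def discriminator_input(image):
    """Concatenate the image channels with their Laplacian-filtered copies."""
    return list(image) + [laplacian(ch) for ch in image]
```

For a three-channel image, this yields six input channels: the original colors and one edge-response map per color channel.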

Fig. 7 Illustration of the training loss

We implement our networks using PyTorch and train them on a single Titan RTX GPU. We use the Wasserstein GAN with a gradient penalty (WGAN-GP) [10] to stabilize the training process. We use the Adam solver [22] with the default parameters and a mini-batch size of 8 to train the network. The learning rate is set to \(2\times 10^{-4}\) for both the diffuse and specular branches and is halved after 5k, 10k, 15k, and 20k training iterations. Training takes about 36 h for each branch. We combine the symmetric mean absolute percentage error (SMAPE) loss and the adversarial loss for training. The SMAPE loss enforces correctness in low-frequency regions, while the adversarial loss focuses on the high-frequency details of images. We use SMAPE instead of the L1 and L2 losses since SMAPE is more stable when denoising HDR images [28, 31]. We set the ratio between the SMAPE and adversarial losses to 100:1 to make training more stable (Fig. 7).
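The learning-rate schedule and the loss weighting can be illustrated with the short plain-Python sketch below. It is a sketch under stated assumptions, not our training code: the SMAPE formula used here is a common per-element definition with a hypothetical stabilizing constant `EPS`, the halving is assumed to take effect at the milestone iteration itself, and `adversarial_term` stands for whatever value the WGAN-GP generator objective returns.

```python
BASE_LR = 2e-4
MILESTONES = (5_000, 10_000, 15_000, 20_000)
SMAPE_WEIGHT, ADV_WEIGHT = 100.0, 1.0  # the 100:1 ratio stated in the text
EPS = 1e-2  # hypothetical constant that keeps the denominator away from zero


def learning_rate(iteration):
    """Base rate halved once for every milestone that has been reached."""
    halvings = sum(1 for m in MILESTONES if iteration >= m)
    return BASE_LR * 0.5 ** halvings


def smape(denoised, reference):
    """Mean of |d - r| / (|d| + |r| + EPS) over all elements (assumed form)."""
    terms = [abs(d - r) / (abs(d) + abs(r) + EPS)
             for d, r in zip(denoised, reference)]
    return sum(terms) / len(terms)


def generator_loss(denoised, reference, adversarial_term):
    """Weighted sum of the SMAPE loss and the adversarial loss."""
    return SMAPE_WEIGHT * smape(denoised, reference) + ADV_WEIGHT * adversarial_term
```

Because every SMAPE term is normalized by the magnitudes of the two values being compared, a few very bright HDR pixels cannot dominate the loss the way they would under an L2 loss.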

5 Evaluation

5.1 Results

To evaluate our proposed network, we compare it with four state-of-the-art offline denoising methods: NFOR [4], KPCN [2], ACFM [35], and DMCR [26]. The NFOR denoiser is the state-of-the-art regression-based method and is embedded in the public Tungsten renderer, while KPCN, ACFM, and DMCR have publicly available model weights and training code.

We measure the time cost and the number of parameters of each denoising approach. The time cost is averaged over all \(1280\times 720\) test images. These statistics are presented in Table 3. NFOR only has a CPU implementation and takes 19.3 s on a 2.30 GHz Intel Xeon CPU, and KPCN takes 4.6 s since its kernel filter needs to calculate the final result pixel by pixel. Our method uses about 1.33 M parameters, fewer than DMCR and ACFM, yet it is on a par with or even better than the previous methods in terms of visual quality and quantitative metrics (see Table 2 and Fig. 6).

In addition, we choose three image quality metrics, namely relative MSE (RMSE), PSNR, and SSIM, to compare quantitative results. To compare fairly with the above methods, we use their public code to retrain the networks on the Tungsten training datasets. We run the experiments on noisy input images with different sample counts, including 4, 16, and 32 samples per pixel. Note that we only train the network on 32 samples per pixel instead of training a separate network for each spp.
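For reference, two of these metrics can be computed as in the plain-Python sketch below. The definitions are common ones and are assumptions here rather than a description of our evaluation code: the relative MSE divides each squared error by the squared reference value plus a hypothetical constant `eps`, and PSNR assumes values that have already been mapped to the range [0, `peak`]. SSIM is omitted because it requires windowed local statistics.

```python
import math


def relative_mse(denoised, reference, eps=1e-2):
    """Mean of (d - r)^2 / (r^2 + eps) over all elements (assumed form)."""
    terms = [(d - r) ** 2 / (r * r + eps) for d, r in zip(denoised, reference)]
    return sum(terms) / len(terms)


def psnr(denoised, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((d - r) ** 2 for d, r in zip(denoised, reference)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)
```

Lower relative MSE and higher PSNR both indicate a result that is closer to the reference image.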

Figure 6 shows the comparison of denoising results on four representative test scenes, including the Gray and White Room, the White Room, the Veach ajar, and the Country Kitchen. For each scene, we also show close-ups of the yellow and blue squares. NFOR leaves some splotchy artifacts in the low-frequency regions of the image, which are caused by the lack of global information when generating filter weights, and the edge details on the ceiling of the room (e.g., close-ups of the yellow squares in the Gray and White Room) are also softened. KPCN successfully denoises most low-frequency areas but fails to capture high-frequency ones, since merely stacking standard convolution operations makes the network lack flexibility when facing different features. Both ACFM and DMCR try to make the network recover as much high-frequency information as possible, but they may produce over-smoothed results at the junction of high-frequency and low-frequency areas (e.g., close-ups of the yellow squares in the Veach ajar). Our method performs on a par with or even better than previous works in terms of visual quality and quantitative metrics. Similar to ACFM [35], we calculate the average denoising performance of these methods on the entire test set, as shown in Table 2. In addition, as illustrated in Fig. 8, we also apply the denoiser network trained at 32 spp to denoise 4 spp and 16 spp noisy images.

Fig. 8 We apply the denoiser network trained at 32 spp to denoise 4 spp and 16 spp noisy images. Our method can still recover more high-frequency information and achieve better denoising performance than previous methods

6 Analysis

6.1 Ablation on multi-scale convolution dense block

As presented in Sect. 3.2, we propose a multi-scale convolution dense block (MSCDB) to aggregate both high-resolution and progressive low-resolution hierarchical features of auxiliary buffers. This can preserve high-resolution and spatially precise auxiliary features as well as receive abundant contextual information from low-resolution representations.

Fig. 9 Comparisons of different auxiliary feature encoders. From left to right: a reference; b training with 5 CDBs; c simply concatenating multiple scales for training; d training with DuRCGAN (Ours)

To demonstrate its effectiveness, we build a variant that uses only five CDBs operating on single-scale auxiliary features and ensure that it has a similar number of parameters to the MSCDB. Due to the lack of complementary contextual information, the black shadow on the pink wall and the edge of the ceiling are softened (Fig. 9b and d). Moreover, similar to Zamir et al. [37], we also analyze the influence of the SKFF module. As the alternative, we concatenate the features from the three scales and then mix them through two convolution operations with a \(1\times 1\) kernel size. The SKFF module has only about 500 parameters, while the concatenating method requires more than 4000 parameters. Despite having fewer parameters, the SKFF module used in the MSCDB to fuse multi-scale features achieves better visual quality and quantitative metrics than the concatenating method (Fig. 9c and d).

6.2 Ablation on residual-in-residual (RIR)

To demonstrate the effect of our proposed RIR structure, we conduct ablation experiments on long, short, middle skip connections and residual connection-2, respectively (the residual connection-1 in the residual unit has been proven effective in various image processing tasks, so we only focus on the impact of residual connection-2). Table 4 shows the average metrics of the network without a specific connection on the entire test set. When residual connection-2 is removed, all the three metric values worsen no matter whether other skip connections are used or not. This indicates that the network without residual connection-2 cannot achieve better denoising performance since the residual connection-2 can increase the number of potential interactions between residual units and the chance of the optimal feature selection. Moreover, residual connection-2 does not increase the number of network parameters.

Table 4 Ablation study of different RIR components

In addition, using the long, middle, and short skip connections in the RIR structure can further improve the overall performance of the denoising networks. Among them, using short skip connections in the RIR has the most obvious improvement. A direct reason is that in our proposed RIR structure, the number of short skip connections is more than the other two connection types, and this also indirectly proves the superiority of multiple path options strategies of the network. These comparisons show that long, middle, short skip connections, and dual residual connections are essential for denoising networks. They also demonstrate the effectiveness of our proposed RIR structure.

Fig. 10 Ablation on the spatial-adaptive block. From left to right: a reference; b training without deformable convolution; c training without offset transfer; d training with DuRCGAN-Full (Ours)

6.3 Ablation on spatial-adaptive block

The spatial-adaptive block (SAB) introduces deformable convolution to make the network adapt to different spatial changes, which improves its flexibility. To show the significance of deformable convolution, we conduct an ablation study in which the deformable convolution in the last DRG is replaced with standard convolution layers. As shown in Fig. 10b, this variant produces artifacts and blur at the junction of low-frequency and high-frequency information and fails to restore reflected illumination. In addition, to demonstrate the influence of reusing the last offset values in the SAB, we also remove the offset transfer between SABs (the purple line in Fig. 3) and only use the noisy features to estimate the offset values. Figure 10c and d shows that training without offset transfer may produce unrealistic details at the edges of the frame and in shadows. Therefore, the SAB helps the network adapt to changes in spatial edges and textures.

Fig. 11 The failure cases. Due to the inconsistency of the data distribution between the training and testing datasets, it is still difficult to recover high-frequency information for hair and water effects

6.4 Discussion

6.4.1 Limitations of training datasets

Large-scale and diverse datasets are essential to train a robust deep learning-based network. Our denoiser may produce poor performance and artifacts on some scenes due to lack of some special effects in the training datasets, including fog, motion blur, depth of field, smoke, etc. Figure 11 shows the limitations of our method on unknown effects, but our method can still recover some fine detail compared with previous approaches. Therefore, we would like to enlarge the training dataset to adapt to a wider range of rendering effects in the future. To make a further step, we will also extend Monte Carlo denoising to unsupervised learning like previous works [11, 24], because rendering large-scale noise-free images (e.g., 16k, 32k) consumes a lot of time.

6.4.2 Adopt different denoiser structures for diffuse and specular noise colors

Considering that the diffuse and specular components have different noise characteristics, in this paper, we separate them but use the same network structure for training. However, the specular buffer may contain extreme noise since specular light paths are difficult to sample [27], and the albedo buffer may provide little information for noisy specular colors. It would be interesting to apply different network structures to the diffuse and specular parts to address this limitation (Fig. 12).

6.4.3 Multi-scale denoising structure

Given the improvement shown by the multi-scale convolution dense block (MSCDB), we believe that it would be beneficial to further introduce the multi-scale structure into the single-scale denoising network. Compared with a network that only uses a single-scale structure, the combination of single-scale and multi-scale structures can better help the network collect contextual information to generate semantically reliable denoising results, but it leads to an increase in network parameters and inference time. The OpenImageDenoise framework [15] has been widely used as a lightweight multi-scale structure (U-net), but it does not operate on full-resolution features and is unable to capture spatial details. Hence, we will try to combine our single-scale denoiser with U-net in the future to obtain spatially accurate details and semantically reliable results.

6.4.4 Temporal denoising

In this paper, we focus on single-frame denoising at an offline rate. However, for 3D games, virtual reality, and other real-time applications, we would like to study how to denoise temporal sequences interactively at 1 spp with a generative adversarial network.

Fig. 12 Comparisons of the Intel\(\circledR \) Open Image Denoise framework and our method

7 Conclusion

We have proposed a novel GAN structure (DuRCGAN) for denoising Monte Carlo renderings. It has fewer network parameters and better denoising performance than the state-of-the-art methods.

We proposed a multi-scale convolution dense block to exploit diverse features of auxiliary buffers. It not only maintains the spatial details at high resolution but also explores contextual information at low resolution. We also proposed the dual residual connections in the residual-in-residual structure to build a deeper network and increase the number of potential interactions between residual units, which increases the flexibility of the network and allows it to have more path options. Moreover, we proposed a spatial-adaptive block that introduces deformable convolution to adapt to the spatial variations in textures and edges. Although our method has fewer network parameters and a shorter inference time than previous state-of-the-art methods, a comprehensive experimental evaluation shows that our network structure is more robust and efficient.