1 Introduction

Monte Carlo path tracing is a classical algorithm for photorealistic image rendering and computer animation: it simulates a wide variety of physics-based lighting effects and ensures the realism and consistency of the rendered results. However, its high computational cost severely limits its use in interactive applications, since high sampling rates, measured in samples per pixel (spp), are required to reach convergence. To produce high-quality images at low sampling rates, image denoising techniques are widely adopted, because they can eliminate visual noise through image reconstruction in a post-processing stage.

Fig. 1

An overview of our cascaded neural denoiser. The temporally accumulated noisy image and the auxiliary features (albedo, normal and depth) are first sent to the pixel net, which predicts an initial denoised image. This image is then concatenated with the auxiliary buffers and sent to the kernel net, which generates multi-resolution kernels whose weights we adjust with a bilateral method. Lastly, we apply multi-scale filtering and fusion to get the final denoised image

With the rapid development of deep learning, convolutional neural networks (CNNs) have shown strong performance in Monte Carlo denoising. Previous works [3, 27] proposed kernel prediction networks and multi-resolution filtering structures for denoising. Although kernel prediction has an innate advantage on low-spp noisy images because it takes nearby pixels into consideration, extremely noisy 1-spp scenes still degrade the denoiser's performance: it is hard to obtain clean results when image patches carry large amounts of noise, especially in areas that first appear in successive frames and therefore suffer from insufficient temporal accumulation. A straightforward solution is to predict large-scale kernels that collect more adequate information, but the computational cost then increases significantly, which is inefficient for interactive applications requiring quick responses. In fact, besides kernel prediction, neural networks are also good at predicting pixel values directly. It is therefore promising to explore a way to benefit from both pixel prediction and kernel prediction to efficiently denoise 1-spp consecutive frames at real-time or interactive speed.

Inspired by the above observation, in this paper we propose a novel lightweight deep learning approach to denoise 1-spp Monte Carlo images at interactive speed. The core idea is to combine a pixel-to-pixel network and a kernel-prediction network to denoise images in a cascaded manner, while the unified network can still be trained end to end like other deep learning-based methods [6, 19]. As the first step, we design a pixel-to-pixel CNN that generates initial denoised results instead of directly computing kernel weights. In this way, the denoiser can quickly remove noise since it predicts pixel values directly, and after training it can restore more plausible results in noisy areas by collecting implicit information from the auxiliary buffers. We then feed the initial results to a kernel-prediction CNN to obtain multi-resolution kernels. To denoise images more effectively, we introduce a bilateral reconstruction method that takes full advantage of the information in the auxiliary channels: we adaptively adjust the kernel weights by learning per-pixel parameters used in the bilateral filtering. Finally, we use a tiny network to combine the multi-resolution results into the denoised image.

Comprehensive experiments show that the proposed approach denoises 1-spp noisy images at interactive frame rates and achieves state-of-the-art denoising quality. To summarize, our approach makes the following contributions:

  • A novel and easy-to-train cascaded pixel-kernel prediction architecture that achieves state-of-the-art performance in 1-spp scenarios.

  • An improved kernel construction approach that reconstructs kernels via bilateral filtering to obtain more accurate results.

Fig. 2

This figure shows the details of the cascaded pixel-kernel network. We use a 4-layer U-Net architecture in both the pixel net and the kernel net. In the pixel net, upsampling modules (denoted as UM, where e, d and O represent the encoder dataflow, decoder dataflow and predicted output) in the last 3 decoder layers predict multi-scale images, and fusion modules (denoted as FM) blend the results. In the kernel net, the decoder layers predict kernels of size 5 and bilateral weights, which we use to filter the initial denoised image and blend the multi-scale results into the final denoised image. K3N40 represents a \(3 \times 3\) convolutional kernel with 40 output channels; K3N28 means 28 output channels, and so on. We use deconvolution to implement \(2 \times 2\) upsampling. In FM, the blending operation corresponds to Eq. (2), and the input \(\textbf{U}\textbf{i}^{c}\) is the upsampled coarse-resolution image

2 Related work

Monte Carlo denoising is a classic research topic in graphics and is also closely followed by industry. In this section, we mainly focus on the learning-based Monte Carlo denoising approaches most relevant to our work; we refer to the survey of Huo et al. [10] for a comprehensive study. For traditional adaptive sampling and reconstruction techniques, readers can refer to the survey of Zwicker et al. [32].

Learning-based Monte Carlo denoising. With the idea of predicting filter parameters, Kalantari et al. [12] proposed a multilayer perceptron to automatically estimate the parameters of Monte Carlo denoising filters. Bako et al. [3] and Vogels et al. [27] showed that using a CNN to predict per-pixel filtering kernels is more robust and easier to train; the latter incorporated a multi-resolution filtering architecture and temporal aggregation. Kuznetsov et al. [15] used a CNN to estimate sampling maps for adaptive sampling. In sample-based denoising, Gharbi et al. [7] proposed a kernel prediction network that splats individual samples onto nearby pixels; subsequently, Hasselgren et al. [9] and Munkberg et al. [21] extended this technique to interactive rates with an efficient way to control performance and memory characteristics. In gradient-domain rendering, Kettunen et al. [13] used a U-Net autoencoder to replace the screened Poisson solver. In addition, Xu et al. [29] first employed generative adversarial networks (GANs) for Monte Carlo denoising and designed a novel conditioned feature modulation method to utilize auxiliary buffers. Lu et al. introduced residual attention networks [18] and a dual residual connection GAN structure [17]. Lin et al. [16] proposed a three-scale network to handle different features (pixel, sample and path) in path-based denoising. To take advantage of various denoising approaches, Back et al. [2] and Zheng [31] designed unified denoising strategies for offline scenarios. In a recent study, Yu et al. [30] proposed a self-attention mechanism for denoising Monte Carlo renderings. These deep neural network approaches achieve impressive results, but they also incur large time overheads due to their delicate and complex network structures.

Interactive denoising. The goal of interactive denoising is to denoise 1-spp data under limited execution time and runtime overhead. Schied et al. [24] proposed a temporal reconstruction filter with edge-stopping functions and later improved it by estimating temporal gradients for adaptive temporal accumulation [25]. Chaitanya et al. [5] designed a deep autoencoder with recurrent connections to enhance temporal stability across successive frames and adopted an end-to-end training strategy with auxiliary buffers. Meng et al. [19] projected the noisy input image onto a bilateral grid based on a guide image learned by a neural network and then sliced the grid to obtain the denoised images. Isik et al. [11] proposed a filtering algorithm that computes pairwise affinities to quantify the relationships between per-pixel deep features. Fan et al. [6] learned lightweight importance maps and constructed multi-scale filtering kernels to reduce the time cost of the kernel prediction method. Müller et al. [20] proposed a neural radiance caching technique for path-space global illumination and denoising. Thomas et al. [26] introduced a joint network that combines denoising and supersampling in the modern rendering pipeline.

3 Methodology

3.1 Problem statement

Monte Carlo denoising can be viewed as a supervised learning problem. Our goal is to reconstruct noise-free images from 1-spp noisy input images at interactive speed (within 100 ms). In addition to the noisy images, auxiliary features (e.g., albedo, normal and depth) can be obtained as by-products of the rendering process. The dataset is an animation sequence \(\mathcal {X} = \{(x_{1},\textbf{f}_{1}), (x_{2},\textbf{f}_{2}),..., (x_{N},\textbf{f}_{N})\}\) with N frames and a corresponding ground truth sequence \(\mathcal {Y} = \{y_{1}, y_{2},..., y_{N}\}\), where \({x_{n}}\) is the 1-spp noisy image and \({\textbf{f}_{n}}\) the noise-free auxiliary features of frame n, and \(y_{n}\) is the ground truth of frame n rendered at high spp.

As in previous works [6, 19], we use accumulated 1-spp noisy images in the experiments. Please refer to Appendix 1 for details.

3.2 Cascaded pixel-kernel architecture

We propose a cascaded pixel-kernel network in this paper. As shown in Fig. 1, the input data consist of a temporally accumulated 1-spp noisy image and auxiliary features (albedo, normal and depth). The input is first sent to the pixel network, denoted as \(\mathcal {P}(\cdot )\), to generate a 3-channel RGB image as the initial denoised output. This image is then concatenated with the auxiliary features and fed to the kernel network, denoted as \(\mathcal {K}(\cdot )\), which predicts three-level per-pixel kernel weights; a bilateral method adjusts these kernel weights. Lastly, we use the kernels to filter the initial denoised image and combine the results in a multi-resolution manner to obtain the final denoised image, as sketched below.
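To make the dataflow concrete, the following Python sketch outlines the cascade under stated assumptions; `pixel_net`, `kernel_net` and `filter_and_fuse` are hypothetical callables standing in for the components detailed in the rest of this section, not our actual implementation.

```python
import numpy as np

def denoise_frame(noisy_rgb, albedo, normal, depth,
                  pixel_net, kernel_net, filter_and_fuse):
    """Cascaded denoising of one frame (sketch).

    noisy_rgb: (H, W, 3) temporally accumulated 1-spp radiance.
    albedo, normal: (H, W, 3); depth: (H, W, 1).
    """
    aux = np.concatenate([albedo, normal, depth], axis=-1)  # (H, W, 7)

    # Stage 1: pixel network P predicts the initial denoised RGB image.
    r_init = pixel_net(np.concatenate([noisy_rgb, aux], axis=-1))

    # Stage 2: kernel network K sees the initial result plus the auxiliary
    # features and predicts multi-scale 5x5 kernels and bilateral parameters.
    kernels, bilateral = kernel_net(np.concatenate([r_init, aux], axis=-1))

    # Stage 3: multi-scale filtering of r_init and coarse-to-fine fusion.
    return filter_and_fuse(r_init, kernels, bilateral)
```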

We use the U-shape convolutional neural network (U-Net) [23] as the backbone of both the pixel network \(\mathcal {P}(\cdot )\) and the kernel network \(\mathcal {K}(\cdot )\), because we adopt a multi-scale fusion strategy to generate results in both networks. The pixel network \(\mathcal {P}(\cdot )\) takes the noisy image x and its auxiliary buffers \(\textbf{f}\) as input and predicts the initial denoised image \(R^{\prime }\):

$$\begin{aligned} R^{\prime } = \mathcal {P}(x,\textbf{f}). \end{aligned}$$
(1)

More specifically, \(\mathcal {P}(\cdot )\) generates denoised images at 3 scales at the last three decoder layers. We use a lightweight fusion module (FM in Fig. 2) to combine the denoised images at different scales. The blended output is computed as:

$$\begin{aligned} \textbf{o}=\textbf{i}^{f}-\alpha \left[ \textbf{U}\textbf{D}\textbf{i}^{f}\right] +\alpha \left[ \textbf{U}\textbf{i}^{c}\right] , \end{aligned}$$
(2)

where \(\textbf{i}^{c}\) and \(\textbf{i}^{f}\) denote the coarse-resolution and fine-resolution images, respectively, \(\textbf{D}\) denotes \(2 \times 2\) downsampling, \(\textbf{U}\) denotes \(2 \times 2\) nearest upsampling, and \(\alpha \) is the per-pixel blending weight predicted by the fusion module. After conducting fusion operations progressively, \(\mathcal {P}(\cdot )\) outputs the initial denoised image \(R^{\prime }\) at the original scale. \(R^{\prime }\) is then sent to the kernel network \(\mathcal {K}(\cdot )\) together with the auxiliary features \(\textbf{f}\), and \(\mathcal {K}(\cdot )\) predicts the per-pixel filtering kernel denoted as \(\textbf{w}\):

$$\begin{aligned} \textbf{w} = \mathcal {K}(R^{\prime },\textbf{f}). \end{aligned}$$
(3)

We generate 3 kernels of the same size \(k \times k\) at the last three decoder layers to conduct multi-scale filtering and fusion. We then downsample the initial denoised image \(R^{\prime }\) and filter it with the corresponding kernels, which can be formulated as:

$$\begin{aligned} R^{\prime \prime }(x, y)=\sum _{(i,j) \in \Omega _{\langle x,y \rangle }}\textbf{w}_{\langle x,y \rangle }(i,j) R^{\prime }\left( i, j\right) , \end{aligned}$$
(4)

where \(\Omega _{\langle x,y \rangle }\) is the \(k \times k\) neighborhood centered on the pixel at coordinate \(\langle x,y \rangle \) of the image and \(\textbf{w}_{\langle x,y \rangle }\) is the corresponding kernel. We use a kernel size of 5 in our model to guarantee fast inference and apply the same kernel weight to each RGB channel of a pixel during reconstruction. The multi-scale reconstructed images are combined through fusion modules to generate the final denoised image \(R^{\prime \prime }\) in the manner of Eq. (2).
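A minimal NumPy sketch of the two reconstruction primitives, the fusion rule of Eq. (2) and the per-pixel filtering of Eq. (4); the average-pooling choice for \(\textbf{D}\), the normalization of the kernel weights and the helper names are our assumptions.

```python
import numpy as np

def down2(img):
    """2x2 downsampling; average pooling is assumed for D in Eq. (2)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def up2(img):
    """2x2 nearest-neighbor upsampling (U in Eq. (2))."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def fuse(i_fine, i_coarse, alpha):
    """Eq. (2): o = i^f - alpha U[D i^f] + alpha U[i^c].

    alpha is the (H, W, 1) per-pixel blending weight from the fusion module.
    """
    return i_fine - alpha * up2(down2(i_fine)) + alpha * up2(i_coarse)

def kernel_filter(image, weights, k=5):
    """Eq. (4): per-pixel kernel reconstruction over a k x k neighborhood.

    image:   (H, W, 3) image to filter (here the initial result R').
    weights: (H, W, k*k) per-pixel kernel; the same weight is applied to
             every RGB channel, and the weights are assumed normalized.
    """
    H, W, _ = image.shape
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(image)
    for dy in range(k):
        for dx in range(k):
            tap = weights[:, :, dy * k + dx, np.newaxis]  # (H, W, 1)
            out += tap * padded[dy:dy + H, dx:dx + W]
    return out
```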

3.3 Bilateral weight adjustment

The kernel network outputs three basic kernels as described above. Though these kernels can act directly on the noisy image, performance degrades when an adjacent pixel is assigned a weight similar to the central pixel's even though the two are actually far apart in world space. We use the auxiliary buffers to avert this problem: cross-bilateral filtering guides the kernel to effectively exploit information from the auxiliary features through a straightforward attention mechanism. Normal and depth are downsampled by average pooling and applied in the cross-bilateral filter with an exponential decay based on the squared distance to the central pixel, controlled by learnable parameters:

$$\begin{aligned} \textbf{w}^{\prime }_{\langle x,y \rangle }(i,j)&= \textbf{w}_{\langle x,y \rangle }(i,j) * D_{n}(i,j,x,y) * D_{d}(i,j,x,y) * D_{c}(i,j,x,y), \end{aligned}$$
(5)
$$\begin{aligned} D_{n}(i,j,x,y)&= \exp (-f_{1}(x,y)*[n(i,j)-n(x,y)]^{2}), \end{aligned}$$
(6)
$$\begin{aligned} D_{d}(i,j,x,y)&= \exp (-f_{2}(x,y)*[d(i,j)-d(x,y)]^{2}), \end{aligned}$$
(7)
$$\begin{aligned} D_{c}(i,j,x,y)&= \exp (-f_{3}(x,y)*[(i-x)^{2}+(j-y)^{2}]), \end{aligned}$$
(8)

where \(\langle x,y \rangle \) is the central pixel's coordinate, \(\langle i,j \rangle \) is an adjacent pixel's coordinate, n is the normal and d is the depth. \(D_n\), \(D_d\) and \(D_c\) measure the differences in normal, depth and spatial coordinates between pixels \(\langle x,y \rangle \) and \(\langle i,j \rangle \), respectively. \(f_{1}\), \(f_{2}\) and \(f_{3}\) are three predicted per-pixel 1-D features that adaptively adjust the weights in the bilateral filtering. After adjusting the kernel weights, we use \(\mathbf {w^{\prime }}\) to filter images as in Eq. (4).
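A direct NumPy transcription of Eqs. (5)-(8), looping over the \(k \times k\) taps; reading the squared normal difference in Eq. (6) as a sum over the three normal components is our interpretation.

```python
import numpy as np

def bilateral_adjust(weights, normal, depth, f1, f2, f3, k=5):
    """Eqs. (5)-(8): scale each kernel tap by normal/depth/spatial falloffs.

    weights:    (H, W, k*k) base kernel from the kernel network
    normal:     (H, W, 3); depth: (H, W, 1)
    f1, f2, f3: (H, W, 1) predicted per-pixel bandwidth parameters
    """
    H, W, _ = weights.shape
    pad = k // 2
    n_pad = np.pad(normal, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    d_pad = np.pad(depth, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    adjusted = np.empty_like(weights)
    for dy in range(k):
        for dx in range(k):
            # Squared normal difference, summed over components (Eq. (6)).
            dn = np.sum((n_pad[dy:dy + H, dx:dx + W] - normal) ** 2,
                        axis=-1, keepdims=True)
            # Squared depth difference (Eq. (7)).
            dd = (d_pad[dy:dy + H, dx:dx + W] - depth) ** 2
            # Squared pixel distance to the central tap (Eq. (8)).
            dc = (dy - pad) ** 2 + (dx - pad) ** 2
            falloff = np.exp(-f1 * dn) * np.exp(-f2 * dd) * np.exp(-f3 * dc)
            idx = dy * k + dx
            adjusted[:, :, idx:idx + 1] = weights[:, :, idx:idx + 1] * falloff
    return adjusted
```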

3.4 Training loss

We use the Symmetric Mean Absolute Percentage Error (SMAPE) to train our network. A naive spatial loss can be computed as:

$$\begin{aligned} \mathcal {L}_{s}(R, Y)= \frac{\left| R-Y\right| }{\left| R\right| +\left| Y\right| +\varepsilon }, \end{aligned}$$
(9)

where R is the denoised image, Y is the reference image, and \(\varepsilon \) is a small constant (\(10^{-2}\)). We adopt a temporal loss to enhance the denoiser's perception of temporal features:

$$\begin{aligned} \mathcal {L}_{t}(\partial R, \partial Y)=\frac{\left| \partial R-\partial Y\right| }{\left| \partial R\right| +\left| \partial Y\right| +\varepsilon }, \end{aligned}$$
(10)

where \(\partial R =\left| R_{i}-R_{i-1} \right| \) and \(\partial Y =\left| Y_{i}-Y_{i-1} \right| \) are the frame differences. \(R_{i}\) and \(Y_{i}\) are the denoised image and reference of frame i, and \(R_{i-1}\) and \(Y_{i-1}\) are those of the previous frame \(i-1\). We apply \(\mathcal {L}_{s}\) and \(\mathcal {L}_{t}\) to both the pixel network and the kernel network and train the two networks together with a total loss:

$$\begin{aligned} \mathcal {L}(R^{\prime },R^{\prime \prime },Y)&= w_{\textrm{a}} \mathcal {L}_{s}(R^{\prime }, Y)+w_{\textrm{b}} \mathcal {L}_{s}(R^{\prime \prime }, Y)\\&\quad +w_{\textrm{c}} \mathcal {L}_{t}(\partial R^{\prime },\partial Y)+w_{\textrm{d}} \mathcal {L}_{t}(\partial R^{\prime \prime },\partial Y), \end{aligned}$$
(11)

where \(R^{\prime }\) is the initial denoised image predicted by the pixel network, \(R^{\prime \prime }\) is the final denoised image reconstructed by the filtering kernels predicted by the kernel network, and Y is the reference image. We use \(w_{\textrm{a}}=w_{\textrm{b}}=1\) and \(w_{\textrm{c}}=w_{\textrm{d}}=0.1\) in the experiments.
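A TensorFlow sketch of Eqs. (9)-(11), assuming the losses are averaged over all pixels and channels; the function names are ours.

```python
import tensorflow as tf

EPS = 1e-2  # epsilon in Eqs. (9) and (10)

def smape(pred, ref):
    """Eq. (9): SMAPE, averaged over pixels and channels."""
    return tf.reduce_mean(tf.abs(pred - ref)
                          / (tf.abs(pred) + tf.abs(ref) + EPS))

def temporal_smape(pred, pred_prev, ref, ref_prev):
    """Eq. (10): SMAPE on absolute frame differences."""
    return smape(tf.abs(pred - pred_prev), tf.abs(ref - ref_prev))

def total_loss(r_init, r_final, y, r_init_prev, r_final_prev, y_prev,
               wa=1.0, wb=1.0, wc=0.1, wd=0.1):
    """Eq. (11): spatial + temporal terms on both network outputs."""
    return (wa * smape(r_init, y) + wb * smape(r_final, y)
            + wc * temporal_smape(r_init, r_init_prev, y, y_prev)
            + wd * temporal_smape(r_final, r_final_prev, y, y_prev))
```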

4 Experiment

4.1 Datasets

Dataset preparation. We conducted experiments on two datasets. First, we adopt the existing public dataset from the work of Koskela et al. [14], called the BMFR dataset, which includes 6 scenes. Each scene has 60 noisy 1-spp path-traced images with a resolution of \(720 \times 1280\), corresponding 4096-spp path-traced renderings as their ground truth, and feature buffers (albedo, normal, depth and world position) obtained in the path tracing phase. This dataset has also been adopted by Meng et al. [19] and Fan et al. [6]. However, most scenes in the BMFR dataset are relatively simple and lack complex lighting effects.

Fig. 3

An overview of the Tungsten dataset. Scenes (from top left to bottom right): Bathroom-1, Bathroom-2, Classroom-large, Dining-room, Bedroom, Kitchen, Living-room-1, Living-room-2, Living-room-3

Fig. 4

Visual quality comparisons of different methods on the 1-spp BMFR dataset. Closeups of the blue square are shown on the top row, and closeups of the green square are on the bottom row in each scene. PSNR and SSIM of each method are computed on the full resolution image. Zoom in for a better view

Table 1 Comparison of PSNR and SSIM on 1-spp BMFR dataset and Tungsten dataset (left is PSNR and right is SSIM)
Fig. 5

Visual quality comparisons of different methods on the 1-spp Tungsten dataset. Zoom in for a better view

Fig. 6

Visual comparisons of temporal profiles. The highlighted red stripes have a resolution of \(150 \times 10\), and the insets show the concatenated results of the stripes over 20 consecutive frames

For a more comprehensive test of the denoiser's performance on 1-spp scenes, we build a new 1-spp dataset rendered by the Tungsten renderer [4], which covers various 3D models, abundant material types and complicated lighting effects; we call it the Tungsten dataset. It contains 9 scenes in total, and each scene has 20-100 frames. The noisy images are rendered at 1 spp, and the ground truth images at 4096 spp. We obtain the auxiliary features (albedo, normal and depth) during rendering. All images in the Tungsten dataset have a resolution of \(720 \times 1280\). Figure 3 gives a brief overview of the Tungsten dataset.

For each test case in both datasets, we hold out one scene as test data and use the remaining scenes in the corresponding dataset as training data. The network takes 1-spp accumulated noisy images and feature buffers as input, consisting of 3-channel color, 3-channel albedo, 3-channel normal and 1-channel depth, 10 channels in total. To avoid over-blurring texture details, we separate the albedo from the noisy image before feeding it into the denoising network.
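The paper does not spell out the separation step; a common formulation, which we assume here, divides the noisy radiance by the albedo before denoising and multiplies it back afterwards. The constant `EPS` and the function names are ours.

```python
import numpy as np

EPS = 1e-3  # small stabilizer against zero albedo; the value is an assumption

def demodulate(noisy_rgb, albedo):
    """Divide out albedo so the network denoises untextured irradiance."""
    return noisy_rgb / (albedo + EPS)

def remodulate(denoised, albedo):
    """Multiply the albedo back in after denoising."""
    return denoised * (albedo + EPS)
```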

4.2 Implementation

Our denoiser is implemented in TensorFlow [1]. At test time, we implement the bilateral kernel filtering in CUDA to reduce inference time. We use a mini-batch size of 8 and train our network with the Adam optimizer at a learning rate of \(10^{-4}\). The multi-level kernels predicted by the kernel network have a size of \(5 \times 5\). As described in Sect. 4.1, each test case uses one scene as the test set and the remaining scenes as the training set. All auxiliary features in the dataset are normalized to the range [0, 1]. For each scene, we reserve the last 4 frames as the validation set and use the other frames as training data. We augment the training data by random flips and rotations of 90 degrees. To apply the temporal loss, we split the training data into groups of 8 consecutive frames. We use \(128 \times 128\) image patches for training, and at test time we use the full \(720 \times 1280\) image as input. The network is trained for 100 epochs on an NVIDIA TITAN RTX GPU.
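A sketch of this training configuration under standard Keras APIs; the training loop and model construction are elided, and the augmentation helper mirrors the flips and rotations described above.

```python
import numpy as np
import tensorflow as tf

# Hyperparameters from Sect. 4.2.
BATCH_SIZE = 8
PATCH_SIZE = 128     # 128 x 128 training patches
GROUP_LEN = 8        # consecutive frames per group for the temporal loss
EPOCHS = 100

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def augment(noisy, ref):
    """Random 90-degree rotations and flips, applied identically to a
    noisy patch and its reference so they stay aligned."""
    k = np.random.randint(4)
    noisy, ref = np.rot90(noisy, k), np.rot90(ref, k)
    if np.random.rand() < 0.5:
        noisy, ref = noisy[:, ::-1], ref[:, ::-1]
    return noisy, ref
```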

4.3 Comparison with baselines

We use PSNR and SSIM metrics for comparison. We focus our comparisons on a range of learning-based interactive denoising approaches: NBGD [19], ONND [5], and three variants of the kernel prediction network [3]: one with a small network size (KP), a multi-resolution architecture (MRKP) [27] modified by Meng et al. [19], and a weight-sharing approach (WSKP) [6].

Based on KPCN [3], MRKP adopts a multi-resolution approach for interactive denoising. WSKP (Fan et al. [6]) encodes the kernel map to speed up inference, and KP uses a network structure similar to MRKP's but only predicts a single larger kernel at the original scale. For a fair comparison, these three variants share the same U-Net backbone and have time costs close to our network's. We use the multi-scale WSKP architecture in which each output layer generates 3 encoding maps, which are then decoded into kernels of sizes 3, 5 and 7. Following the open-source code of the original paper, we compare against NBGD with its largest multi-scale 7-layer architecture with three bilateral grids. For NBGD, MRKP, KP and WSKP, we use the same training datasets as for our method. ONND is provided as a black-box module in OptiX 5.1 based on Chaitanya et al. [5]. For KP, MRKP and WSKP, we use the SMAPE loss as the image reconstruction loss; for NBGD, we keep the original L1 loss suggested by the authors.

Table 1 shows the error metrics of the different methods on all test scenes. In general, we achieve the best average quantitative results on both the BMFR dataset and the Tungsten dataset. Our method also achieves the highest PSNR and SSIM in most test cases.

BMFR dataset. We first show our denoiser's performance on the BMFR dataset in Fig. 4. Our method produces pleasant and clean visual results. It successfully recovers fine details in high-frequency areas such as edges (crop rows 3, 4, 5, 6) and produces clean, clear soft shadows (crop rows 1, 2, 7, 8). Our model also generates steadier edges than the basic kernel prediction methods (KP and MRKP) in the complex outdoor scene (crop rows 5, 6). In contrast, the results of NBGD and WSKP are noisier and fail to retain object edges, leading to ghosting and distortion. ONND generally tends to produce blurrier images and lacks details.

Tungsten dataset. Figure 5 shows visual comparisons of the denoising results on the Tungsten dataset, which has more complicated lighting effects such as refraction and specular reflection. Our model maintains remarkable denoising quality in these complex scenes. Compared to other methods, it recovers clearer details on transparent and reflective objects (crop rows 3, 7) and maintains stable edges in high-frequency areas (crop rows 5, 6, 8).

Temporal stability. We select an area in the image and scan it over consecutive frames to compare the temporal stability of the different denoisers. Figure 6 shows the temporal profiles for different scenes. Our model maintains clean and stable object boundaries and shows smooth transitions over frames, whereas the other methods suffer from temporal discontinuities and produce flickering results. See our supplementary video for more comparisons.
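For reproducibility, a small helper of the kind used to build the profiles in Fig. 6 might look as follows; the stripe placement and vertical stacking direction are our assumptions.

```python
import numpy as np

def temporal_profile(frames, y0, x0, h=10, w=150):
    """Stack a fixed 150 x 10 stripe over consecutive frames (cf. Fig. 6).

    frames: sequence of (H, W, 3) denoised images; (y0, x0) is the top-left
    corner of the stripe. A temporally stable denoiser yields smooth bands.
    """
    stripes = [f[y0:y0 + h, x0:x0 + w] for f in frames]
    return np.concatenate(stripes, axis=0)
```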

Initial and final results. Table 1 reports quantitative results for both the initial denoised images predicted by the pixel network in the first stage and the final denoised images generated by the whole pipeline; the final images achieve higher metrics in all test cases, demonstrating the effectiveness of cascaded denoising. The final images are also cleaner and more stable visually than the initial ones (Fig. 7).

Time cost. The denoising time cost of each method is shown in Table 2. WSKP is implemented in unoptimized PyTorch; KP, MRKP, NBGD and our denoiser are implemented in unoptimized TensorFlow; ONND is provided as a black-box module in OptiX 5.1. The average denoising time of our model at a resolution of \(720 \times 1280\) is 55.21 ms, which is suitable for interactive denoising. Our method achieves better visual results and the highest metrics with little additional overhead compared with WSKP and NBGD.

Fig. 7

The initial and final images generated by our denoiser

Table 2 Average time cost of each denoising approach at a resolution of \(720 \times 1280\)
Table 3 Ablation study of different methods: Kernel alone, Pixel alone, Kernel-Pixel, Kernel-Kernel and Pixel-Kernel (denoted as K, P, K+P, K+K and P+K)
Fig. 8

Visual quality comparisons of independent networks and different cascaded architectures. K and P are the kernel network and pixel network alone. \(K+P\), \(K+K\) and \(P+K\) are the kernel-pixel, kernel-kernel and pixel-kernel methods, respectively

4.4 Ablation studies

Cascaded architecture. We conducted experiments on different cascading strategies. We compare the independent kernel network (denoted as K) and pixel network (denoted as P) with three cascaded architectures: \(K+P\), which uses the kernel network to denoise the image first and then the pixel network to generate the final result; \(K+K\), which uses two kernel networks to denoise progressively; and \(P+K\), our original pixel-kernel architecture. The numbers of model parameters are shown in Table 3. In K and P, the number of feature channels is set to 64. Figure 8 compares the visual results of the above methods. The proposed \(P+K\) model produces steadier and cleaner results at the edge of the chair (row 1) and the radiator (row 2), while the results of the other methods suffer from blur and distortion. Using the kernel network or pixel network alone yields blurrier denoised results: the kernel network cannot maintain steady edges in complex scenes (rows 2, 3), and the pixel network, although it performs better than the kernel network, predicts wrong pixel values (white pixels in row 3) when objects are too small. By contrast, \(P+K\) still produces clean and clear results in these areas. In general, the \(P+K\) model achieves the best average metrics (Table 3) on the test scenes, which shows that applying a pixel network first is effective, since a small filtering kernel can collect more information from the initial denoised image than from the original 1-spp input.

Bilateral weight adjustment. We compare the denoising performance of the bilateral weight adjustment strategy against the original method that directly uses the kernels to filter the image. Table 4 shows the quantitative errors of the two methods; the proposed bilateral method achieves higher average scores. Figure 9 shows visual comparisons of denoising quality: the model with bilateral weight adjustment presents clearer edges and cleaner results.

Temporal loss. We test the influence of the temporal loss by comparing the denoising performance with and without it. Table 4 shows the quantitative comparison; the model with the temporal loss performs better. Figure 10 shows the visual comparison: with the temporal loss, the results have fewer flickers and cleaner edge areas.

Fig. 9

Visual quality comparisons on bilateral weight adjustment (denoted as BWA). Results in the third column show clearer and more stable edges

Fig. 10

Comparisons on the temporal loss. Results with the temporal loss have fewer artifacts on edges

Table 4 Ablation study of bilateral weight adjustment (denoted as BWA) and temporal loss (denoted as TL)

4.5 Limitations and future work

Generalization to unseen effects. Our cascaded denoiser robustly denoises 1-spp noisy images from the BMFR and Tungsten datasets, but it may produce artifacts when denoising specific rendering effects (e.g., motion blur, volumetric media, fur material, etc.). Figure 11 shows the limitations of our method on unseen effects. This problem can be alleviated by enlarging the diversity of the training scenes so that our model adapts to a more diverse data distribution.

Spatio-temporal architecture. In this work, we apply temporal accumulation and a temporal loss to enhance the model's awareness of spatio-temporal information. Some recent works conduct interactive denoising [11, 26] or supersampling [8, 28] with spatio-temporal neural networks that directly learn temporal coherence across adjacent frames. Hence, in future research we will try to combine our cascaded denoiser with a temporal adaptive module to take advantage of cross-frame features.

Fig. 11

Failure cases. Our model fails to reconstruct the details of hair and the effects of water due to the mismatch between the training and test data distributions. The pixel network fails to produce a denoised image (row 1) or predicts wrong pixel colors (row 2)

5 Conclusions

We propose a novel and practical cascaded neural denoiser that effectively denoises 1-spp noisy images at interactive speed. At the core of our approach is an efficient cascaded pixel-kernel prediction architecture, which first uses a pixel network to generate an initial denoised image and then utilizes a kernel network to predict per-pixel kernel weights and conduct multi-scale filtering and fusion. We also introduce a neural bilateral method to adaptively adjust the kernel weights using auxiliary features. Our experiments demonstrate that the proposed denoiser presents pleasant denoised results for 1-spp input data at a satisfactory denoising speed.