1 Introduction

Monte Carlo path tracing is a classical algorithm for photorealistic image rendering and computer animation: it simulates a wide variety of physics-based lighting effects and ensures the realism and consistency of the rendered results. However, its high computational cost severely limits its use in interactive applications, since high sampling rates, measured in samples per pixel (spp), are required to reach convergence. To produce high-quality images at low sampling rates, image denoising techniques are widely adopted, because they can eliminate visual noise through image reconstruction in a post-processing stage.

Fig. 1

An overview of our cascaded neural denoiser. The temporally accumulated noisy image and the auxiliary features (albedo, normal and depth) are first sent to the pixel net, which predicts an initial denoised image. This image is then concatenated with the auxiliary buffers and sent to the kernel net, which generates multi-resolution kernels whose weights we adjust with a bilateral method. Lastly, we apply multi-scale filtering and fusion to get the final denoised image

With the rapid development of deep learning, convolutional neural networks (CNNs) have shown strong performance in Monte Carlo denoising. Previous works [3, 27] proposed kernel prediction networks and multi-resolution filtering structures for denoising. Although kernel prediction has an innate advantage on low-spp noisy images because it takes nearby pixels into consideration, extremely noisy 1-spp scenes still degrade the denoiser's performance: it is hard to obtain clean results when image patches carry large amounts of noise, especially in areas that first appear in successive frames and therefore suffer from insufficient temporal accumulation. A straightforward solution is to predict large-scale kernels that collect more adequate information, but the computational cost then increases significantly, which is inefficient for interactive applications requiring quick responses. In fact, besides kernel prediction, neural networks are also good at predicting pixel values directly. It is therefore promising to explore a way to benefit from both pixel prediction and kernel prediction to efficiently denoise 1-spp consecutive frames at real-time or interactive speed.

Inspired by the above observation, in this paper we propose a novel lightweight deep learning approach to denoise 1-spp Monte Carlo images at interactive speed. The core idea is to combine a pixel-to-pixel network and a kernel-prediction network to denoise images in a cascaded manner, while the unified network can still be trained end to end like other deep learning-based methods [6, 19]. As the first step, we design a pixel-to-pixel CNN that generates initial denoised results instead of directly computing kernel weights. In this way, the denoiser can quickly remove noise since it predicts pixel values directly, and after training it can restore more plausible results in noisy areas by collecting implicit information from the auxiliary buffers. We then feed the initial results to a kernel-prediction CNN to obtain multi-resolution kernels. To denoise images more effectively, we introduce a bilateral reconstruction method that takes full advantage of the information in the auxiliary channels: we adaptively adjust the kernel weights by learning per-pixel parameters used in the bilateral filtering. Finally, we use a tiny network to combine the multi-resolution results into the denoised image.

Comprehensive experiments show that the proposed approach denoises 1-spp noisy images at interactive frame rates and achieves state-of-the-art denoising quality. To summarize, our approach makes the following contributions:

  • A novel and easy-to-train cascaded pixel-kernel prediction architecture that achieves state-of-the-art performance in 1-spp scenarios.

  • An improved kernel construction approach that reconstructs kernels via bilateral filtering to obtain more accurate results.

Fig. 2

This figure shows the details of the cascaded pixel-kernel network. We use a 4-layer U-Net architecture in both the pixel net and the kernel net. In the pixel net, upsampling modules (denoted as UM, where e, d and O represent the encoder dataflow, decoder dataflow and predicted output) in the last 3 decoder layers predict multi-scale images, and fusion modules (denoted as FM) blend the results. In the kernel net, the decoder layers predict kernels of size 5 and bilateral weights, which we use to filter the initial denoised image and blend the multi-scale results into the final denoised image. K3N40 represents a \(3 \times 3\) convolutional kernel with 40 output channels; K3N28 means 28 output channels, and so on. We use deconvolution to implement \(2 \times 2\) upsampling. In FM, the blending operation corresponds to Eq. (2), and the input \(\textbf{U}\textbf{i}^{c}\) is the upsampled coarse-resolution image

2 Related work

Monte Carlo denoising is a classic research topic in graphics and is also closely followed by industry. In this section, we mainly focus on the learning-based Monte Carlo denoising approaches most relevant to our work; we refer to the survey of Huo et al. [10] for a comprehensive study. For traditional adaptive sampling and reconstruction techniques, readers can refer to the survey of Zwicker et al. [32].

Learning-based Monte Carlo denoising. With the idea of predicting filter parameters, Kalantari et al. [12] proposed a multilayer perceptron to automatically estimate the parameters of Monte Carlo denoising filters. Bako et al. [3] and Vogels et al. [27] showed that using a CNN to predict per-pixel filtering kernels is more robust and easier to train; the latter incorporated a multi-resolution filtering architecture and temporal aggregation. Kuznetsov et al. [15] used a CNN to estimate sampling maps for adaptive sampling. In sample-based denoising, Gharbi et al. [7] proposed a kernel prediction network that splats individual samples onto nearby pixels; subsequently, Hasselgren et al. [9] and Munkberg et al. [21] extended this technique to interactive rates with an efficient way to control performance and memory characteristics. In gradient-domain rendering, Kettunen et al. [13] used a U-Net autoencoder to replace the screened Poisson solver. In addition, Xu et al. [29] first employed generative adversarial networks (GANs) for Monte Carlo denoising and designed a novel conditioned feature modulation method to utilize auxiliary buffers. Lu et al. introduced residual attention networks [18] and a dual residual connection GAN structure [17]. Lin et al. [16] proposed a three-scale network to handle different features (pixel, sample and path) in path-based denoising. To take advantage of various denoising approaches, Back et al. [2] and Zheng [31] designed unified denoising strategies for offline scenarios. In a recent study, Yu et al. [30] proposed a self-attention mechanism for denoising Monte Carlo renderings. These deep neural network approaches achieve impressive results, but they also incur large time overheads due to their delicate and complex network structures.

Interactive denoising. The goal of interactive denoising is to denoise 1-spp data under limited execution time and runtime overhead. Schied et al. [24] proposed a temporal reconstruction filter with edge-stopping functions and later improved it by estimating temporal gradients for adaptive temporal accumulation [25]. Chaitanya et al. [5] designed a deep autoencoder with recurrent connections to enhance temporal stability across successive frames and adopted an end-to-end training strategy with auxiliary buffers. Meng et al. [19] projected the noisy input image onto a bilateral grid based on a guide image learned by a neural network and then sliced the grid to obtain the denoised images. Isik et al. [11] proposed a filtering algorithm that computes pairwise affinities to quantify the relationships between per-pixel deep features. Fan et al. [6] learned lightweight importance maps and constructed multi-scale filtering kernels to reduce the time cost of the kernel prediction method. Müller et al. [20] proposed a neural radiance caching technique for path-space global illumination and denoising. Thomas et al. [26] introduced a joint network that combines denoising and supersampling in the modern rendering pipeline.

3 Methodology

3.1 Problem statement

Monte Carlo denoising can be viewed as a supervised learning problem. Our goal is to reconstruct noise-free images from 1-spp noisy input images at interactive speed (within 100 ms). In addition to the noisy images, auxiliary features (e.g., albedo, normal and depth) can be obtained as by-products of the rendering process. The dataset is an animation sequence \(\mathcal {X} = \{(x_{1},\textbf{f}_{1}), (x_{2},\textbf{f}_{2}),..., (x_{N},\textbf{f}_{N})\}\) with N frames and a corresponding ground truth sequence \(\mathcal {Y} = \{y_{1}, y_{2},..., y_{N}\}\), where \({x_{n}}\) is the 1-spp noisy image and \({\textbf{f}_{n}}\) the noise-free auxiliary features of frame n, and \(y_{n}\) is the ground truth of frame n rendered at high spp.

As in previous works [6, 19], we use accumulated 1-spp noisy images in the experiments. Please refer to Appendix 1 for details.

3.2 Cascaded pixel-kernel architecture

We propose a cascaded pixel-kernel network in this paper. As shown in Fig. 1, the input data consist of a temporally accumulated 1-spp noisy image and auxiliary features (albedo, normal and depth). The input is first sent to the pixel network, denoted as \(\mathcal {P}(\cdot )\), to generate a 3-channel RGB image as the initial denoised output. This image is then concatenated with the auxiliary features and fed to the kernel network, denoted as \(\mathcal {K}(\cdot )\), which predicts three-level per-pixel kernel weights; a bilateral method adjusts these kernel weights. Lastly, we use the kernels to filter the initial denoised image and combine the results in a multi-resolution manner to obtain the final denoised image, as sketched below.
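To make the dataflow concrete, the following Python sketch outlines the cascade under stated assumptions; `pixel_net`, `kernel_net` and `filter_and_fuse` are hypothetical callables standing in for the components detailed in the rest of this section, not our actual implementation.

```python
import numpy as np

def denoise_frame(noisy_rgb, albedo, normal, depth,
                  pixel_net, kernel_net, filter_and_fuse):
    """Cascaded denoising of one frame (sketch).

    noisy_rgb: (H, W, 3) temporally accumulated 1-spp radiance.
    albedo, normal: (H, W, 3); depth: (H, W, 1).
    """
    aux = np.concatenate([albedo, normal, depth], axis=-1)  # (H, W, 7)

    # Stage 1: pixel network P predicts the initial denoised RGB image.
    r_init = pixel_net(np.concatenate([noisy_rgb, aux], axis=-1))

    # Stage 2: kernel network K sees the initial result plus the auxiliary
    # features and predicts multi-scale 5x5 kernels and bilateral parameters.
    kernels, bilateral = kernel_net(np.concatenate([r_init, aux], axis=-1))

    # Stage 3: multi-scale filtering of r_init and coarse-to-fine fusion.
    return filter_and_fuse(r_init, kernels, bilateral)
```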

We use the U-shape convolutional neural network (U-Net) [23] as the backbone of both the pixel network \(\mathcal {P}(\cdot )\) and the kernel network \(\mathcal {K}(\cdot )\), because we adopt a multi-scale fusion strategy to generate results in both networks. The pixel network \(\mathcal {P}(\cdot )\) takes the noisy image x and its auxiliary buffers \(\textbf{f}\) as input and predicts the initial denoised image \(R^{\prime }\):

$$\begin{aligned} R^{\prime } = \mathcal {P}(x,\textbf{f}). \end{aligned}$$
(1)

More specifically, \(\mathcal {P}(\cdot )\) generates denoised images at 3 scales at the last three decoder layers. We use a lightweight fusion module (FM in Fig. 2) to combine the denoised images at different scales. The blended output is computed as:

$$\begin{aligned} \textbf{o}=\textbf{i}^{f}-\alpha \left[ \textbf{U}\textbf{D}\textbf{i}^{f}\right] +\alpha \left[ \textbf{U}\textbf{i}^{c}\right] , \end{aligned}$$
(2)

where \(\textbf{i}^{c}\) and \(\textbf{i}^{f}\) denote the coarse-resolution and fine-resolution images, respectively, \(\textbf{D}\) denotes \(2 \times 2\) downsampling, \(\textbf{U}\) denotes \(2 \times 2\) nearest upsampling, and \(\alpha \) is the per-pixel blending weight predicted by the fusion module. After conducting fusion operations progressively, \(\mathcal {P}(\cdot )\) outputs the initial denoised image \(R^{\prime }\) at the original scale. \(R^{\prime }\) is then sent to the kernel network \(\mathcal {K}(\cdot )\) together with the auxiliary features \(\textbf{f}\), and \(\mathcal {K}(\cdot )\) predicts the per-pixel filtering kernel denoted as \(\textbf{w}\):

$$\begin{aligned} \textbf{w} = \mathcal {K}(R^{\prime },\textbf{f}). \end{aligned}$$
(3)

We generate 3 kernels of the same size \(k \times k\) at the last three decoder layers to conduct multi-scale filtering and fusion. We then downsample the initial denoised image \(R^{\prime }\) and filter it with the corresponding kernels, which can be formulated as:

$$\begin{aligned} R^{\prime \prime }(x, y)=\sum _{(i,j) \in \Omega _{\langle x,y \rangle }}\textbf{w}_{\langle x,y \rangle }(i,j) R^{\prime }\left( i, j\right) , \end{aligned}$$
(4)

where \(\Omega _{\langle x,y \rangle }\) is the \(k \times k\) neighborhood centered on the pixel at coordinate \(\langle x,y \rangle \) of the image and \(\textbf{w}_{\langle x,y \rangle }\) is the corresponding kernel. We use a kernel size of 5 in our model to guarantee fast inference and apply the same kernel weight to each RGB channel of a pixel during reconstruction. The multi-scale reconstructed images are combined through fusion modules to generate the final denoised image \(R^{\prime \prime }\) in the manner of Eq. (2).
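A minimal NumPy sketch of the two reconstruction primitives, the fusion rule of Eq. (2) and the per-pixel filtering of Eq. (4); the average-pooling choice for \(\textbf{D}\), the normalization of the kernel weights and the helper names are our assumptions.

```python
import numpy as np

def down2(img):
    """2x2 downsampling; average pooling is assumed for D in Eq. (2)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def up2(img):
    """2x2 nearest-neighbor upsampling (U in Eq. (2))."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def fuse(i_fine, i_coarse, alpha):
    """Eq. (2): o = i^f - alpha U[D i^f] + alpha U[i^c].

    alpha is the (H, W, 1) per-pixel blending weight from the fusion module.
    """
    return i_fine - alpha * up2(down2(i_fine)) + alpha * up2(i_coarse)

def kernel_filter(image, weights, k=5):
    """Eq. (4): per-pixel kernel reconstruction over a k x k neighborhood.

    image:   (H, W, 3) image to filter (here the initial result R').
    weights: (H, W, k*k) per-pixel kernel; the same weight is applied to
             every RGB channel, and the weights are assumed normalized.
    """
    H, W, _ = image.shape
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(image)
    for dy in range(k):
        for dx in range(k):
            tap = weights[:, :, dy * k + dx, np.newaxis]  # (H, W, 1)
            out += tap * padded[dy:dy + H, dx:dx + W]
    return out
```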

3.3 Bilateral weight adjustment

The kernel network outputs three basic kernels as described above. Though these kernels can act directly on the noisy image, performance degrades when an adjacent pixel is assigned a weight similar to the central pixel's even though the two are actually far apart in world space. We use the auxiliary buffers to avert this problem: cross-bilateral filtering guides the kernel to effectively exploit information from the auxiliary features through a straightforward attention mechanism. Normal and depth are downsampled by average pooling and applied in the cross-bilateral filter with an exponential decay based on the squared distance to the central pixel, controlled by learnable parameters:

$$\begin{aligned} \textbf{w}^{\prime }_{\langle x,y \rangle }(i,j)&= \textbf{w}_{\langle x,y \rangle }(i,j) * D_{n}(i,j,x,y) * D_{d}(i,j,x,y) * D_{c}(i,j,x,y), \end{aligned}$$
(5)
$$\begin{aligned} D_{n}(i,j,x,y)&= \exp (-f_{1}(x,y)*[n(i,j)-n(x,y)]^{2}), \end{aligned}$$
(6)
$$\begin{aligned} D_{d}(i,j,x,y)&= \exp (-f_{2}(x,y)*[d(i,j)-d(x,y)]^{2}), \end{aligned}$$
(7)
$$\begin{aligned} D_{c}(i,j,x,y)&= \exp (-f_{3}(x,y)*[(i-x)^{2}+(j-y)^{2}]), \end{aligned}$$
(8)

where \(\langle x,y \rangle \) is the central pixel's coordinate, \(\langle i,j \rangle \) is an adjacent pixel's coordinate, n is the normal and d is the depth. \(D_n\), \(D_d\) and \(D_c\) measure the differences in normal, depth and spatial coordinates between pixels \(\langle x,y \rangle \) and \(\langle i,j \rangle \), respectively. \(f_{1}\), \(f_{2}\) and \(f_{3}\) are three predicted per-pixel 1-D features that adaptively adjust the weights in the bilateral filtering. After adjusting the kernel weights, we use \(\mathbf {w^{\prime }}\) to filter images as in Eq. (4).
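A direct NumPy transcription of Eqs. (5)-(8), looping over the \(k \times k\) taps; reading the squared normal difference in Eq. (6) as a sum over the three normal components is our interpretation.

```python
import numpy as np

def bilateral_adjust(weights, normal, depth, f1, f2, f3, k=5):
    """Eqs. (5)-(8): scale each kernel tap by normal/depth/spatial falloffs.

    weights:    (H, W, k*k) base kernel from the kernel network
    normal:     (H, W, 3); depth: (H, W, 1)
    f1, f2, f3: (H, W, 1) predicted per-pixel bandwidth parameters
    """
    H, W, _ = weights.shape
    pad = k // 2
    n_pad = np.pad(normal, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    d_pad = np.pad(depth, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    adjusted = np.empty_like(weights)
    for dy in range(k):
        for dx in range(k):
            # Squared normal difference, summed over components (Eq. (6)).
            dn = np.sum((n_pad[dy:dy + H, dx:dx + W] - normal) ** 2,
                        axis=-1, keepdims=True)
            # Squared depth difference (Eq. (7)).
            dd = (d_pad[dy:dy + H, dx:dx + W] - depth) ** 2
            # Squared pixel distance to the central tap (Eq. (8)).
            dc = (dy - pad) ** 2 + (dx - pad) ** 2
            falloff = np.exp(-f1 * dn) * np.exp(-f2 * dd) * np.exp(-f3 * dc)
            idx = dy * k + dx
            adjusted[:, :, idx:idx + 1] = weights[:, :, idx:idx + 1] * falloff
    return adjusted
```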

3.4 Training loss

We use the Symmetric Mean Absolute Percentage Error (SMAPE) to train our network. A naive spatial loss can be computed as:

$$\begin{aligned} \mathcal {L}_{s}(R, Y)= \frac{\left| R-Y\right| }{\left| R\right| +\left| Y\right| +\varepsilon }, \end{aligned}$$
(9)

where R is the denoised image, Y is the reference image, and \(\varepsilon \) is a small constant (\(10^{-2}\)). We adopt a temporal loss to enhance the denoiser's perception of temporal features:

$$\begin{aligned} \mathcal {L}_{t}(\partial R, \partial Y)=\frac{\left| \partial R-\partial Y\right| }{\left| \partial R\right| +\left| \partial Y\right| +\varepsilon }, \end{aligned}$$
(10)

where \(\partial R =\left| R_{i}-R_{i-1} \right| \) and \(\partial Y =\left| Y_{i}-Y_{i-1} \right| \) are the frame differences. \(R_{i}\) and \(Y_{i}\) are the denoised image and reference of frame i, and \(R_{i-1}\) and \(Y_{i-1}\) are those of the previous frame \(i-1\). We apply \(\mathcal {L}_{s}\) and \(\mathcal {L}_{t}\) to both the pixel network and the kernel network and train the two networks together with a total loss:

$$\begin{aligned} \mathcal {L}(R^{\prime },R^{\prime \prime },Y)&= w_{\textrm{a}} \mathcal {L}_{s}(R^{\prime }, Y)+w_{\textrm{b}} \mathcal {L}_{s}(R^{\prime \prime }, Y)\\&\quad +w_{\textrm{c}} \mathcal {L}_{t}(\partial R^{\prime },\partial Y)+w_{\textrm{d}} \mathcal {L}_{t}(\partial R^{\prime \prime },\partial Y), \end{aligned}$$
(11)

where \(R^{\prime }\) is the initial denoised image predicted by the pixel network, \(R^{\prime \prime }\) is the final denoised image reconstructed by the filtering kernels predicted by the kernel network, and Y is the reference image. We use \(w_{\textrm{a}}=w_{\textrm{b}}=1\) and \(w_{\textrm{c}}=w_{\textrm{d}}=0.1\) in the experiments.
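A TensorFlow sketch of Eqs. (9)-(11), assuming the losses are averaged over all pixels and channels; the function names are ours.

```python
import tensorflow as tf

EPS = 1e-2  # epsilon in Eqs. (9) and (10)

def smape(pred, ref):
    """Eq. (9): SMAPE, averaged over pixels and channels."""
    return tf.reduce_mean(tf.abs(pred - ref)
                          / (tf.abs(pred) + tf.abs(ref) + EPS))

def temporal_smape(pred, pred_prev, ref, ref_prev):
    """Eq. (10): SMAPE on absolute frame differences."""
    return smape(tf.abs(pred - pred_prev), tf.abs(ref - ref_prev))

def total_loss(r_init, r_final, y, r_init_prev, r_final_prev, y_prev,
               wa=1.0, wb=1.0, wc=0.1, wd=0.1):
    """Eq. (11): spatial + temporal terms on both network outputs."""
    return (wa * smape(r_init, y) + wb * smape(r_final, y)
            + wc * temporal_smape(r_init, r_init_prev, y, y_prev)
            + wd * temporal_smape(r_final, r_final_prev, y, y_prev))
```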

4 Experiment

4.1 Datasets

Dataset preparation. We conducted experiments on two datasets. First, we adopt the existing public dataset from the work of Koskela et al. [14], called the BMFR dataset, which includes 6 scenes. Each scene has 60 noisy 1-spp path-traced images with a resolution of \(720 \times 1280\), corresponding 4096-spp path-traced renderings as their ground truth, and feature buffers (albedo, normal, depth and world position) obtained in the path tracing phase. This dataset has also been adopted by Meng et al. [19] and Fan et al. [6]. However, most scenes in the BMFR dataset are relatively simple and lack complex lighting effects.

Fig. 3

An overview of the Tungsten dataset. Scenes (from top left to bottom right): Bathroom-1, Bathroom-2, Classroom-large, Dining-room, Bedroom, Kitchen, Living-room-1, Living-room-2, Living-room-3

Fig. 4

Visual quality comparisons of different methods on the 1-spp BMFR dataset. Closeups of the blue square are shown on the top row, and closeups of the green square are on the bottom row in each scene. PSNR and SSIM of each method are computed on the full resolution image. Zoom in for a better view

Table 1 Comparison of PSNR and SSIM on 1-spp BMFR dataset and Tungsten dataset (left is PSNR and right is SSIM)
Fig. 5

Visual quality comparisons of different methods on the 1-spp Tungsten dataset. Zoom in for a better view

Fig. 6

Visual comparisons of temporal profiles. The highlighted red stripes have a resolution of \(150 \times 10\), and the insets show the concatenated results of the stripes over 20 consecutive frames

For a more comprehensive test of the denoiser's performance on 1-spp scenes, we build a new 1-spp dataset rendered by the Tungsten renderer [4], which covers various 3D models, abundant material types and complicated lighting effects; we call it the Tungsten dataset. It contains 9 scenes in total, and each scene has 20-100 frames. The noisy images are rendered at 1 spp, and the ground truth images at 4096 spp. We obtain the auxiliary features (albedo, normal and depth) during rendering. All images in the Tungsten dataset have a resolution of \(720 \times 1280\). Figure 3 gives a brief overview of the Tungsten dataset.

For each test case in both datasets, we hold out one scene as test data and use the remaining scenes in the corresponding dataset as training data. The network takes 1-spp accumulated noisy images and feature buffers as input, consisting of 3-channel color, 3-channel albedo, 3-channel normal and 1-channel depth, 10 channels in total. To avoid over-blurring texture details, we separate the albedo from the noisy image before feeding it into the denoising network.
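The paper does not spell out the separation step; a common formulation, which we assume here, divides the noisy radiance by the albedo before denoising and multiplies it back afterwards. The constant `EPS` and the function names are ours.

```python
import numpy as np

EPS = 1e-3  # small stabilizer against zero albedo; the value is an assumption

def demodulate(noisy_rgb, albedo):
    """Divide out albedo so the network denoises untextured irradiance."""
    return noisy_rgb / (albedo + EPS)

def remodulate(denoised, albedo):
    """Multiply the albedo back in after denoising."""
    return denoised * (albedo + EPS)
```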

4.2 Implementation

Our denoiser is implemented in TensorFlow [1]. At test time, we implement the bilateral kernel filtering in CUDA to reduce inference time. We use a mini-batch size of 8 and train our network with the Adam optimizer at a learning rate of \(10^{-4}\). The multi-level kernels predicted by the kernel network have a size of \(5 \times 5\). As described in Sect. 4.1, each test case uses one scene as the test set and the remaining scenes as the training set. All auxiliary features in the dataset are normalized to the range [0, 1]. For each scene, we reserve the last 4 frames as the validation set and use the other frames as training data. We augment the training data by random flips and rotations of 90 degrees. To apply the temporal loss, we split the training data into groups of 8 consecutive frames. We use \(128 \times 128\) image patches for training, and at test time we use the full \(720 \times 1280\) image as input. The network is trained for 100 epochs on an NVIDIA TITAN RTX GPU.
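A sketch of this training configuration under standard Keras APIs; the training loop and model construction are elided, and the augmentation helper mirrors the flips and rotations described above.

```python
import numpy as np
import tensorflow as tf

# Hyperparameters from Sect. 4.2.
BATCH_SIZE = 8
PATCH_SIZE = 128     # 128 x 128 training patches
GROUP_LEN = 8        # consecutive frames per group for the temporal loss
EPOCHS = 100

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def augment(noisy, ref):
    """Random 90-degree rotations and flips, applied identically to a
    noisy patch and its reference so they stay aligned."""
    k = np.random.randint(4)
    noisy, ref = np.rot90(noisy, k), np.rot90(ref, k)
    if np.random.rand() < 0.5:
        noisy, ref = noisy[:, ::-1], ref[:, ::-1]
    return noisy, ref
```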

4.3 Comparison with baselines

We use PSNR and SSIM metrics for comparison. We focus our comparisons on a range of learning-based interactive denoising approaches: NBGD [19], ONND [5], and three variants of the kernel prediction network [3]: one with a small network size (KP), a multi-resolution architecture (MRKP) [27] modified by Meng et al. [19], and a weight-sharing approach (WSKP) [6].

Based on KPCN [3], MRKP adopts a multi-resolution approach for interactive denoising. WSKP (Fan et al. [6]) encodes the kernel map to speed up inference, and KP uses a network structure similar to MRKP's but only predicts a single larger kernel at the original scale. For a fair comparison, these three variants share the same U-Net backbone and have time costs close to our network's. We use the multi-scale WSKP architecture in which each output layer generates 3 encoding maps, which are then decoded into kernels of sizes 3, 5 and 7. Following the open-source code of the original paper, we compare against NBGD with its largest multi-scale 7-layer architecture with three bilateral grids. For NBGD, MRKP, KP and WSKP, we use the same training datasets as for our method. ONND is provided as a black-box module in OptiX 5.1 based on Chaitanya et al. [5]. For KP, MRKP and WSKP, we use the SMAPE loss as the image reconstruction loss; for NBGD, we keep the original L1 loss suggested by the authors.

Table 1 shows the error metrics of the different methods on all test scenes. In general, we achieve the best average quantitative results on both the BMFR dataset and the Tungsten dataset. Our method also achieves the highest PSNR and SSIM in most test cases.

BMFR dataset. We first show our denoiser's performance on the BMFR dataset in Fig. 4. Our method produces pleasant and clean visual results. It successfully recovers fine details in high-frequency areas such as edges (crop rows 3, 4, 5, 6) and produces clean, clear soft shadows (crop rows 1, 2, 7, 8). Our model also generates steadier edges than the basic kernel prediction methods (KP and MRKP) in the complex outdoor scene (crop rows 5, 6). In contrast, the results of NBGD and WSKP are noisier and fail to retain object edges, leading to ghosting and distortion. ONND generally tends to produce blurrier images and lacks details.

Tungsten dataset. Figure 5 shows visual comparisons of the denoising results on the Tungsten dataset, which has more complicated lighting effects such as refraction and specular reflection. Our model maintains remarkable denoising quality in these complex scenes. Compared to other methods, it recovers clearer details on transparent and reflective objects (crop rows 3, 7) and maintains stable edges in high-frequency areas (crop rows 5, 6, 8).

Temporal stability. We select an area in the image and scan it over consecutive frames to compare the temporal stability of the different denoisers. Figure 6 shows the temporal profiles for different scenes. Our model maintains clean and stable object boundaries and shows smooth transitions over frames, whereas the other methods suffer from temporal discontinuities and produce flickering results. See our supplementary video for more comparisons.
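For reproducibility, a small helper of the kind used to build the profiles in Fig. 6 might look as follows; the stripe placement and vertical stacking direction are our assumptions.

```python
import numpy as np

def temporal_profile(frames, y0, x0, h=10, w=150):
    """Stack a fixed 150 x 10 stripe over consecutive frames (cf. Fig. 6).

    frames: sequence of (H, W, 3) denoised images; (y0, x0) is the top-left
    corner of the stripe. A temporally stable denoiser yields smooth bands.
    """
    stripes = [f[y0:y0 + h, x0:x0 + w] for f in frames]
    return np.concatenate(stripes, axis=0)
```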

Initial and final results. Table 1 reports quantitative results for both the initial denoised images predicted by the pixel network in the first stage and the final denoised images generated by the whole pipeline; the final images achieve higher metrics in all test cases, demonstrating the effectiveness of cascaded denoising. The final images are also cleaner and more stable visually than the initial ones (Fig. 7).

Time cost. The denoising time cost of each method is shown in Table 2. WSKP is implemented in unoptimized PyTorch; KP, MRKP, NBGD and our denoiser are implemented in unoptimized TensorFlow; ONND is provided as a black-box module in OptiX 5.1. The average denoising time of our model at a resolution of \(720 \times 1280\) is 55.21 ms, which is suitable for interactive denoising. Our method achieves better visual results and the highest metrics with little additional overhead compared with WSKP and NBGD.

Fig. 7

The initial and final images generated by our denoiser

Table 2 Average time cost of each denoising approach at a resolution of \(720 \times 1280\)
Table 3 Ablation study of different methods: Kernel alone, Pixel alone, Kernel-Pixel, Kernel-Kernel and Pixel-Kernel (denoted as K, P, K+P, K+K and P+K)
Fig. 8

Visual quality comparisons of independent networks and different cascaded architectures. K and P are the kernel network and pixel network alone. \(K+P\), \(K+K\) and \(P+K\) are the kernel-pixel, kernel-kernel and pixel-kernel methods, respectively

4.4 Ablation studies

Cascaded architecture. We conducted experiments on different cascading strategies. We compare the independent kernel network (denoted as K) and pixel network (denoted as P) with three cascaded architectures: \(K+P\), which uses the kernel network to denoise the image first and then the pixel network to generate the final result; \(K+K\), which uses two kernel networks to denoise progressively; and \(P+K\), our original pixel-kernel architecture. The numbers of model parameters are shown in Table 3. In K and P, the number of feature channels is set to 64. Figure 8 compares the visual results of the above methods. The proposed \(P+K\) model produces steadier and cleaner results at the edge of the chair (row 1) and the radiator (row 2), while the results of the other methods suffer from blur and distortion. Using the kernel network or pixel network alone yields blurrier denoised results: the kernel network cannot maintain steady edges in complex scenes (rows 2, 3), and the pixel network, although it performs better than the kernel network, predicts wrong pixel values (white pixels in row 3) when objects are too small. By contrast, \(P+K\) still produces clean and clear results in these areas. In general, the \(P+K\) model achieves the best average metrics (Table 3) on the test scenes, which shows that applying a pixel network first is effective, since a small filtering kernel can collect more information from the initial denoised image than from the original 1-spp input.

Bilateral weight adjustment. We compare the denoising performance of the bilateral weight adjustment strategy against the original method that directly uses the kernels to filter the image. Table 4 shows the quantitative errors of the two methods; the proposed bilateral method achieves higher average scores. Figure 9 shows visual comparisons of denoising quality: the model with bilateral weight adjustment presents clearer edges and cleaner results.

Temporal loss. We test the influence of the temporal loss by comparing the denoising performance with and without it. Table 4 shows the quantitative comparison; the model with the temporal loss performs better. Figure 10 shows the visual comparison: with the temporal loss, the results have fewer flickers and cleaner edge areas.

Fig. 9

Visual quality comparisons on bilateral weight adjustment (denoted as BWA). Results in the third column show clearer and more stable edges

Fig. 10

Comparisons on the temporal loss. Results with the temporal loss have fewer artifacts on edges

Table 4 Ablation study of bilateral weight adjustment (denoted as BWA) and temporal loss (denoted as TL)

4.5 Limitations and future work

Generalization to unseen effects. Our cascaded denoiser robustly denoises 1-spp noisy images from the BMFR and Tungsten datasets, but it may produce artifacts when denoising specific rendering effects (e.g., motion blur, volumetric media, fur material, etc.). Figure 11 shows the limitations of our method on unseen effects. This problem can be alleviated by enlarging the diversity of the training scenes so that our model adapts to a more diverse data distribution.

Spatio-temporal architecture. In this work, we apply temporal accumulation and a temporal loss to enhance the model's awareness of spatio-temporal information. Some recent works conduct interactive denoising [11, 26] or supersampling [8, 28] with spatio-temporal neural networks that directly learn temporal coherence across adjacent frames. Hence, in future research we will try to combine our cascaded denoiser with a temporal adaptive module to take advantage of cross-frame features.

Fig. 11

Failure cases. Our model fails to reconstruct the details of hair and the effects of water due to the mismatch between the training and test data distributions. The pixel network fails to produce a denoised image (row 1) or predicts wrong pixel colors (row 2)

5 Conclusions

We propose a novel and practical cascaded neural denoiser that effectively denoises 1-spp noisy images at interactive speed. At the core of our approach is an efficient cascaded pixel-kernel prediction architecture, which first uses a pixel network to generate an initial denoised image and then utilizes a kernel network to predict per-pixel kernel weights and conduct multi-scale filtering and fusion. We also introduce a neural bilateral method to adaptively adjust the kernel weights using auxiliary features. Our experiments demonstrate that the proposed denoiser presents pleasant denoised results for 1-spp input data at a satisfactory denoising speed.