
1 Introduction

Image restoration is concerned with the reconstruction/estimation of the uncorrupted image from a corrupted or incomplete one. Typical corruptions include noise, blur, down-sampling, hardware constraints (e.g. Bayer pattern) and combinations of those. After decades of research there is a large literature [1] dedicated to restoration tasks, whereas the literature studying the fusion of restoration results is thin [2]. In this paper we tackle such fusion as a means for further performance improvements. Particularly, we propose a 3D convolutional fusion (3DCF) method and validate it on image denoising and single image super-resolution.

1.1 Image Denoising (DN)

Natural image denoising aims at recovering the clean image given a noisy observation. The most often studied case is when the image corruption is caused by additive white Gaussian (AWG) noise with known variance. Also, the images are assumed to be natural, capturing every-day scenes, and the quantitative measure for assessing the recovery result is the peak signal-to-noise ratio (PSNR), which stands in monotonic relation to the mean squared error (MSE).

The most successful denoising methods employ at least one of the following denoising principles, as listed in [3]: Bayesian modeling (coupled with Gaussian models for noiseless patches), transform thresholding (assuming sparsity of patches in a fixed basis), sparse coding (sparsity over a learned dictionary), and pixel or block averaging (exploiting image self-similarity).

Most denoising methods work at a single image scale, the finest one, and often a small image patch is the basic processing unit. The patch captures local image information for a central pixel together with a statistically meaningful amount of uncorrupted pixels. Zontak et al. [4] recently opened up a fresh research direction by proposing a method based on patch recurrence across scales (PRAS). Another partition of the methods is based on whether only the noisy image is used, or also learned priors and/or extra data from other (clean) natural images; this leads to internal and external methods. Some well known examples of each are:

Internal denoising methods:

NLM (non-local means) [5] reconstructs a noisy patch with a weighted average of similar patches from the same image. It uses the image self-similarity and the fact that the noise is usually uncorrelated.

BM3D (block matching 3D) [6] extends NLM and the DCT denoising method [7]. BM3D groups similar patches into a 3D block, applies 3D linear transform thresholding, and inverses the transform.

WNNM (weighted nuclear norm minimization) [8] follows the self-similarity principle, and applies WNNM to recover the noiseless patch from a matrix of stacked non-local similar patch vectors.

PRAS (patch recurrence across scales) [4] creates (an)isotropic image scale pyramids and extracts the estimated (noiseless) patch from the same corresponding position but at a different scale.

PLE (piecewise linear estimation) [9] is a Bayesian restoration model, including denoising, deblurring, and inpainting. PLE employs a set of 19 Gaussian models obtained from synthetic edge images (as priors) and an expectation-maximization iterative procedure.

External denoising methods:

EPLL (expected patch log likelihood) [10] can be seen as a shotgun extended version of PLE. It learns a Gaussian mixture model with 200 components for 2 million clean patches sampled from external natural images, and tries to maximize the expected log likelihood of any randomly chosen patch in the image.

LSSC (learned simultaneous sparse coding) [11] adapts a sparse dictionary learned over an external database by adding a grouping step on the noisy image.

MLP (multi-layer perceptron) [12] learns from an external database with clean and noisy images, and was among the first to introduce neural networks to low-level image restoration tasks.

CSF (cascade of shrinkage fields) [13] proposes shrinkage fields, combining the image model and the optimization algorithm as a whole. The time complexity is greatly reduced by inherent parallelism.

opt-MRF (Loss-Specific Training of Filter-Based MRFs) [14] revisits loss-specific training and uses bi-level optimization to solve the image restoration problem.

TRD (trained reaction diffusion) [15] extends the solving process of nonlinear reaction diffusion to a deep recurrent neural network, outperforms many of the aforementioned methods, and offers the lowest time complexity to date.

It is quite surprising that most of the recent top denoising methods (such as BM3D, LSSC, EPLL, PRAS, and even WNNM) face a plateau. They perform equally well over a large range of noise levels, despite being quite different in their formulations, assumptions, and the information they use. This is the reason behind the recent work that fuses them, pushing the limits by combining different approaches [16, 17]. We refer the reader to [2] for a study of image fusion algorithms of the past decades. Others investigated the theoretical limits for denoising with natural image patch priors [18]; at least for the lower noise levels, the gap between the most successful methods and the predicted limits seems to diminish rapidly.

Fusion methods:

PatchSNR (patch signal-to-noise ratio). Mosseri et al. [19] propose a patch-wise signal-to-noise ratio to decide whether an internal or an external denoising method should be applied. Their fused result slightly improves over the stand-alone methods.

RTF (regression tree fields). Jancsary et al. [16] observe that, depending on the image content, some methods perform better than others. They consider RTFs based on a filterbank (RTF\(_{plain}\)), with additional exploitation of BM3D’s output (RTF\(_{BM3D}\)), or a setting exploiting all the outputs of their benchmarked methods (RTF\(_{all}\)). The more methods they fuse, the better their result. The RTFs are learned on large datasets. It is also worth mentioning that, following [16], Schmidt et al. [20] propose a cascade of regression tree fields (CRTF) for deblurring and denoising and obtain good performance in both cases.

NN (neural nets/multi-layer perceptron). Burger et al. [17] build on the success of MLP [12] in denoising to learn the best fusion. They found internal denoising methods to better suit images with artificial (man-made) content, and external ones to work better for natural scenes. They argue against PatchSNR and consider that there is no trivial rule to decide between internal and external methods on a patch-by-patch basis; indeed, their NN fusion produces the best denoising results to date. Unfortunately, the learning is quite intensive.

AF (anchored fusion). Timofte [21] clusters the patch space and for each cluster learns an anchored regressor from fused methods’ patches to the fusion output.

1.2 Single Image Super-Resolution (SR)

Single image super-resolution (SR) is another active area [22,23,24,25,26,27] of image restoration aiming at recovering a high-resolution (HR) image from a low-resolution (LR) input image by inferring missing high frequency contents. We can roughly categorize the recent methods into:

Non-neural network methods:

SR (sparse representation) [28] generates a sparse representation/coding of each LR image patch, and then applies the coefficients of this representation to generate the HR image.

A+ (adjusted anchored neighborhood regression) [29], considered to be an advanced version of ANR (anchored neighborhood regression) [30], learns sparse dictionaries and regressors anchored to the dictionary atoms.

RFL (super-resolution Forests) [31] maps low to high-resolution patches using random forests and anchored regressors as in A+.

selfEx (transformed self-exemplars) [32] introduces a self-similarity based image SR algorithm by applying transformed self-exemplars.

Neural network methods:

SRCNN (convolutional neural network) [33] learns an end-to-end mapping between the low/high-resolution images by a deep convolutional neural network.

CSCN (cascade of sparse coding network) [24] combines the key ingredients of deep learning with those of the sparse coding model.

1.3 Contributions

In this paper, we study the patch-by-patch fusion of image restoration methods with particular focus on recent top methods for both DN and SR tasks. To this end, we propose a generic 3D convolutional fusion architecture (3DCF) to learn the best combination of existing methods. Our three main contributions are:

  1. We show the complementarity of different methods (e.g. internal vs. external).

  2. We demonstrate that our method learns sophisticated correlation details from top methods to achieve the best reported results on a wide range of images.

  3. We show the generality of our 3DCF method, which applies to both DN and SR.

The paper is organised as follows. Section 2 provides some insights and empirical evidence for the complementarity of the DN/SR methods and analyses oracle bounds for fusion. Section 3 motivates and introduces our novel 3DCF method with the necessary details and mathematical formulations. Section 4 presents the experiments and discusses the results. Section 5 concludes the paper.

2 Insights

Our focus is fusion for improved image restoration results and particularly for denoising in the presence of additive white Gaussian noise (AWG), with validation on single image super-resolution. Here we analyse the complementarity of the restoration methods and fusion strategies.

2.1 Complementarity of Top Methods

Jancsary et al. [16], Burger et al. [17], and Zontak et al. [4], among others, already observed that each method works best on particular image content while being outperformed by other methods on other image regions.

First, we pair-wise compare the PSNR performances of BM3D (internal method), and MLP and TRD (external methods) on 68 images from the Berkeley dataset for AWG noise with \(\sigma =50\). The relative improvements (PSNR gain) are reported in Fig. 1. MLP is better than BM3D on all images but is worse than TRD on \({\sim }40\%\) of them. Also, BM3D is better than TRD on some images. We conclude that there is no absolute winner at image level.

Second, we compare pixel-wise and patch-wise and see that within the same image there is no absolute winner always achieving the best result either. In Fig. 2, for one image corrupted with AWG noise, \(\sigma =50\), we report pixel-wise selections from BM3D (25.77 dB PSNR) and MLP (26.19 dB) to best match the ground truth image. Despite MLP being significantly better (+0.41 dB) at denoising this image, at pixel level the selections are almost equally divided between the methods. At patch level (sizes \(5\times 5\) and \(17\times 17\) pixels) we observe a similar pattern.
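As a concrete illustration of this oracle pixel-wise selection, a minimal NumPy sketch is given below; it assumes single-channel float images and is only illustrative, not part of the original evaluation code.

```python
import numpy as np

def oracle_pixelwise(out_a, out_b, gt):
    """Oracle pixel-wise selection: at every pixel keep the output of the
    method whose value is closer to the ground truth (cf. Fig. 2)."""
    return np.where(np.abs(out_a - gt) <= np.abs(out_b - gt), out_a, out_b)
```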

Fig. 1. No absolute winner. Each method is trumped by another on some image.

Fig. 2. An example of oracle pixel- and patch-wise selections from BM3D and MLP outputs and the resulting PSNRs, for AWG noise with \(\sigma =50\).

2.2 Average and Selection Fusion and Oracle Bounds

As shown in Fig. 1 for images and in Fig. 2 for patch or pixel regions, the denoising methods are complementary in their performance. Now we study a couple of fusion strategies at image level.

Average fusion directly averages the image results.

Selection of non-overlapping patches assumes that the fusion result is composed of non-overlapping (equal-size) patches, each taken from the fused method that performs best on it (see Fig. 2). One needs to learn a patch-wise classifier.

Selection of overlapping patches is similar to the above one in that a patch-wise decision is made, but this time the patches overlap. The final fusion result is obtained by averaging the patches in the overlapped areas (see Fig. 2).
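For illustration, the following NumPy sketch implements the oracle selection of overlapping patches described above; the function name, default patch size, and stride are ours, and the oracle uses the ground truth only to compute an upper bound for the strategy.

```python
import numpy as np

def oracle_overlapping_fusion(out_a, out_b, gt, patch=9, stride=3):
    """Oracle fusion: at every patch location pick the candidate patch
    (from method A or B) closer to the ground truth, then average the
    selected patches over their overlaps."""
    h, w = gt.shape
    acc = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            sl = (slice(y, y + patch), slice(x, x + patch))
            pa, pb, pg = out_a[sl], out_b[sl], gt[sl]
            best = pa if np.mean((pa - pg) ** 2) <= np.mean((pb - pg) ** 2) else pb
            acc[sl] += best
            weight[sl] += 1.0
    weight[weight == 0] = 1.0   # border pixels not covered by any patch
    return acc / weight
```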

We work with BM3D and MLP, partly because BM3D is an internal method while MLP is an external one, and partly because of the result in Fig. 1, where at image level MLP performs better than BM3D. Therefore, the results of fusing BM3D and MLP at patch level are interesting to examine.

Fig. 3. Average PSNR [dB] comparison of BM3D [6] and MLP [12], average fusion, oracle selection of (overlapping or non-overlapping) patches, and our 3DCF fusion on 68 images, with AWG noise, \(\sigma =50\).

Fig. 4. Average PSNR [dB] comparison of A+ [29] and CSCN [24], average fusion, oracle selection of (overlapping or non-overlapping) patches, and our 3DCF fusion on Set14, upscaling factor \(\times 2\).

In Fig. 3 we report how the chosen patch size affects the performance of the selection strategy, on the same Berkeley images corrupted with AWG noise, \(\sigma =50\). We report oracle results, an upper bound for such a strategy. For comparison we report the performance of the fused BM3D and MLP methods, as well as the results of the average fusion and of our proposed 3DCF method. We note that (i) overlapping patches lead to better results (while being significantly slower) than non-overlapping patches; (ii) the smaller the patch size, the better the oracle results become; (iii) the average fusion leads to poorer performance than the fused MLP method; (iv) our 3DCF fusion results are comparable with those of the oracle selection strategies for patch sizes above \(9\times 9\).

Complementarily, in Fig. 4 we start from the A+ and CSCN methods for the super-resolution (SR) task, using the Set14 images and an upscaling factor of \(\times 2\) (with the settings described in the experimental section). As in the denoising case, (i) the smaller the patch size, the better the oracle selection results get; (ii) overlapping patches lead to better fusion results. However, for SR, (iii) the average fusion improves over both fused methods; (iv) our 3DCF fusion is significantly better than the fused methods and the average fusion, and compares favorably to the oracle selection fusion for patch sizes above \(5\times 5\).

From these experiments we conclude that the average and (patch) selection strategies for fusion, while conceptually simple, either do not lead to consistently improved results (case of the average fusion) or have oracle upper bounds that are quite tight given the difficulty of accurately classifying patches (case of the selection strategy). Note that PatchSNR [19] is an example of a selection strategy and that NN [17], a neural network fusion method, reported better results than PatchSNR.

We therefore follow the combination paradigm for image fusion and design and train an end-to-end 3D convolutional network that maps the results of two methods to the targeted restored image.

3 Learning Fine Features by 3D Convolution

3.1 Motivation and Related Work

Most of the existing neural network architectures apply spatial filters to inputs such as 2D images. When it comes to videos, i.e. 3D inputs, these 2D convolutional neural networks (2DCNNs) do not exploit crucial information such as the temporal correlation. For example, in human action recognition, motion information is not captured by 2DCNNs, and Ji et al. [34] introduced a 3D convolutional neural network (3DCNN) method (see Fig. 5). The 3DCNN architecture has 1 hardwired layer, 3 convolutional layers, and 2 subsampling layers. The spatial dimensions of the \(60 \times 40\) inputs are gradually reduced to \(1 \times 1\) by going through the network, i.e. 7 input frames are converted into a 128-dimensional feature map that also captures the motion information. In the end, each element of the 128-dimensional feature map is fully connected to each unit in the last layer, and the action class is determined.

Fig. 5. 3DCNN proposed in [34] for human action recognition.

For performance improvements, a brute force approach that has proved successful is to deepen the (neural network) architecture [15, 24]. Yet, the improvements decline significantly with depth, while the training time and the demand on hardware (GPU) resources increase. For example, experiments reported in [15] demonstrate that the bulk of the performance is achieved by the first stages of their denoising TRD method, while the last 3 stages (out of 8) bring merely 0.01 dB. In [22] it is shown for SR methods that the first stages are the most important and that adding more stages only slightly improves the performance (of A+) further.

On the other hand, for image restoration tasks such as SR it is common to recover the corrupted luminance component instead of the RGB image directly, and to interpolate the chroma. However, exploiting the correlation between corrupted RGB or even extra channels such as depth (D) or near-infrared (NIR) should be beneficial to the restoration task, at the price of increased computation. For example, for denoising, Dabov et al. [35] apply the same grouping method to the chroma channels as to the luminance, and achieve better PSNR performance than by using BM3D [36] independently on the three channels. To sum up, given several highly correlated (corrupted) channels/images, we have a better chance of high quality recovery.

It follows that we can consider the outputs of state-of-the-art methods as highly correlated images, which can be treated as the starting point of our proposed novel 3D convolutional fusion (3DCF) architecture.

Fig. 6. Proposed 3D convolutional fusion method (3DCF).

3.2 Proposed Generic 3D Convolutional Fusion (3DCF)

General Architecture. As the starting point, we obtain several recovered outputs \(\{ \mathbf {I}_{i} \}_{i = 1, \dots , n}\) from the same corrupted image, with different methods. We stack those highly correlated images along the channel dimension, which brings us a multichannel image \(\mathbf {I}_{a} = [ \mathbf {I}_{1}, \mathbf {I}_{2}, \dots , \mathbf {I}_{n} ]\) (see Fig. 6).

Furthermore, directional gradient filters are sensitive to intensity changes and edges, and our task is to recover fine image details based on the results of existing methods; hence, the correlation between the recovered output image and its gradients can be exploited. To this end, we first form the naive average input image \(\bar{\mathbf {I}} = \frac{1}{n} \sum _{i = 1}^n \mathbf {I}_{i}\), then filter it with the first- and second-order gradients, in both the x and y directions,

$$\begin{aligned} \begin{aligned} \mathbf {F}_{1x}&= \begin{bmatrix} 1&-1 \end{bmatrix} = \mathbf {F}_{1y}^T,\\ \mathbf {F}_{2x}&= \begin{bmatrix} 1&-2&1 \end{bmatrix}/2 = \mathbf {F}_{2y}^T, \end{aligned} \end{aligned}$$
(1)

By stacking those gradient-filtered images and the average image along the channel dimension, we obtain another input \(\mathbf {I}_{b}\) as our second starting point,

$$\begin{aligned} \mathbf {I}_b = [ \mathbf {F}_{2x} * \bar{\mathbf {I}}, \mathbf {F}_{1x} * \bar{\mathbf {I}}, \bar{\mathbf {I}}, \mathbf {F}_{1y} * \bar{\mathbf {I}}, \mathbf {F}_{2y} * \bar{\mathbf {I}}]. \end{aligned}$$
(2)
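A small NumPy/SciPy sketch of Eqs. (1) and (2) is given below; it assumes grayscale float images and symmetric boundary handling (our choice, not specified above), and the helper name is ours.

```python
import numpy as np
from scipy.signal import convolve2d

def build_gradient_input(outputs):
    """Stack the average image and its first/second-order x/y gradients (Eqs. 1-2)."""
    avg = np.mean(np.stack(outputs, axis=0), axis=0)   # \bar{I}
    f1x = np.array([[1.0, -1.0]])                      # F_1x, with F_1y = F_1x^T
    f2x = np.array([[1.0, -2.0, 1.0]]) / 2.0           # F_2x, with F_2y = F_2x^T
    grads = [convolve2d(avg, k, mode='same', boundary='symm')
             for k in (f2x, f1x, f1x.T, f2x.T)]
    # channel order as in Eq. (2): [F2x*I, F1x*I, I, F1y*I, F2y*I]
    return np.stack([grads[0], grads[1], avg, grads[2], grads[3]], axis=0)
```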

Next, we intensively explore the correlation within \(\mathbf {I}_{a}\) and \(\mathbf {I}_{b}\) by introducing a 3D convolutional layer. Related recent works such as [15, 24, 33] mainly exploit deep features with spatial filters. In that case, given that the image has multiple channels, these are independently filtered and eventually summed up as the input for the next layer, so the correlations among the channels may not be accurately captured. That is the main reason behind our idea: to fully explore the fine details along the channel dimension. As far as we know, this is the first time that a 3D layer is introduced to address low-level image restoration tasks.

Our next step is to update the input images \(\mathbf {I}_{a,b}\) with a 3D hidden layer,

$$\begin{aligned} \mathbf {H}_{1}^{a,b}(\mathbf {I}_{a,b}) = \mathop {\text {tanh}} (\mathbf {W}_{1}^{a,b} * \mathbf {I}_{a,b}+ \mathbf {B}_{1}^{a,b}), \end{aligned}$$
(3)

where \(\mathbf {W}_1^{a,b}\) correspond to n 3D filters with kernel size \(c \times h \times w\) and \(\mathbf {B}_1^{a,b}\) are biases. In our design, due to a tradeoff between memory constraints and speed, we recommend n and \(c \times h \times w\) to be 32 and \(3 \times 5 \times 5\) for \(\mathbf {I}_{b}\), together with suitable padding, so that the output has the same spatial size as the input. The default filter sizes for \(\mathbf {I}_{a}\), shown in Fig. 6, are determined for the same reason. Besides, we use the hyperbolic tangent (tanh) as activation function because it allows negative value updates to pass through the network rather than discarding them as ReLU [37] does. In the following step, we use a naive convolutional layer with a single \(1 \times 1 \times 1\) filter, which is equivalent to summing up the inputs \(\mathbf {H}_{1}^{a,b}\)

$$\begin{aligned} \mathbf {H}_{2}^{a,b}(\mathbf {H}_{1}^{a,b}(\mathbf {I}_{a,b})) = \mathop {\text {tanh}} (w_{2}^{a,b} \sum _k \mathbf {H}_{1,k}^{a,b}(\mathbf {I}_{a,b})+ b_{2}^{a,b} \mathbf {1}), \end{aligned}$$
(4)

where \(w_{2}^{a,b}\), \(b_{2}^{a,b}\) are the scalar weights and biases, resp. We consider the above two steps as one inference stage. Another important difference between our proposed method and many other neural network methods is that we reconstruct the image residue instead of the image itself (see Fig. 6). Normally, the perturbation of image residues during the optimization is smaller than that of image values, which increases the odds that the learning process eventually converges. Secondly, residue reconstruction contributes to the robust performance of our general architecture across distinct image restoration tasks. After going through n inference stages, we come to the reconstruction stage,

$$\begin{aligned} \mathbf {R}_{a,b}(\mathbf {I}_{a,b}) = (w_{2n + 2}^{a,b} \sum \mathbf {H}_{2n + 1}^{a,b} \circ \mathbf {H}_{2n}^{a,b} \dots \mathbf {H}_{2}^{a,b} \circ \mathbf {H}_{1}^{a,b}(\mathbf {I}_{a,b}) + b_{2n+2}^{a,b} \mathbf {1}), \end{aligned}$$
(5)

where \(\mathbf {R}_{a,b}(\mathbf {I}_{a,b})\) are the image residues we want to predict. In order to robustify the performance of our network, we simply duplicate the above-mentioned process for each input image array \(\mathbf {I}_a \) and \(\mathbf {I}_b\) n times, which gives us 2n separate networks with the same architecture. In the end we sum up the residues and the average image to obtain our output image \(F(\mathbf {I}_1, \mathbf {I}_2, \dots , \mathbf {I}_n)\),

$$\begin{aligned} F(\mathbf {I}_1, \mathbf {I}_2, \dots , \mathbf {I}_n) = \frac{1}{n} \sum _k \mathbf {I}_k + \sum _{k}( c_k \mathbf {R}_{a}^{k}(\mathbf {I}_a) + d_k \mathbf {R}_{b}^{k}(\mathbf {I}_b)), \end{aligned}$$
(6)

where \(c_k, d_k\) are the coefficients to weight the residues.
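To make the architecture more tangible, the following PyTorch sketch implements one simplified branch along the lines of Eqs. (3)-(6): the stacked inputs form the depth axis of 3D convolutions, each inference stage is a 3D convolution with tanh followed by a \(1 \times 1 \times 1\) summing convolution, and the branch predicts a residue that is added to the average image. Filter counts, the final collapse of the depth axis by averaging, and all class and variable names are our own simplifications, not the released implementation.

```python
import torch
import torch.nn as nn

class Fusion3DBranch(nn.Module):
    """One simplified 3DCF branch: stacked method outputs (or gradient maps)
    are treated as the depth axis of 3D convolutions; each inference stage is
    a 3D conv + tanh (Eq. 3) followed by a 1x1x1 conv that sums the feature
    maps (Eq. 4); the branch outputs an image residue (Eq. 5)."""
    def __init__(self, n_filters=32, stages=2):
        super().__init__()
        layers = []
        for _ in range(stages):
            layers += [nn.Conv3d(1, n_filters, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                       nn.Tanh(),
                       nn.Conv3d(n_filters, 1, kernel_size=1),   # 1x1x1 "sum" layer
                       nn.Tanh()]
        self.body = nn.Sequential(*layers[:-1])   # last stage stays linear, cf. Eq. (5)

    def forward(self, stacked):
        # stacked: (batch, depth, H, W), depth = number of stacked images
        x = stacked.unsqueeze(1)                  # -> (batch, 1, depth, H, W)
        residue = self.body(x)                    # -> (batch, 1, depth, H, W)
        return residue.mean(dim=2).squeeze(1)     # collapse depth -> (batch, H, W)

def fuse(outputs, branches, coeffs):
    """Eq. (6), illustrative: average image plus weighted branch residues."""
    stacked = torch.stack(outputs, dim=1)         # (batch, depth, H, W)
    avg = stacked.mean(dim=1)
    return avg + sum(c * b(stacked) for c, b in zip(coeffs, branches))
```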

Training. Our main task is to learn the parameters \(\mathbf {\Theta } = (\mathbf {W}, \mathbf {B})\) of the non-linear map F. To this end, we minimize the loss function \(l(\mathbf {\Theta })\), which computes the Euclidean distance (mean squared error, MSE) between the output image \(F(\mathbf {I}_1^i, \mathbf {I}_2^i, \dots , \mathbf {I}_n^i)\) and the ground truth image \(\mathbf {I}_g^i\) from our training set, i.e.,

$$\begin{aligned} l(\mathbf {\Theta }) = \sum _i \Vert F(\mathbf {I}_1^i, \mathbf {I}_2^i, \dots , \mathbf {I}_n^i; \mathbf {\Theta }) - \mathbf {I}_g^i \Vert _2^2. \end{aligned}$$
(7)

The choice of the cost function is appropriate since PSNR is the main evaluation measure for image restoration tasks and stands in monotonic relation with the MSE. During the training stage, we update the weights/biases with standard back propagation [38, 39].
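For reference, a few lines of NumPy make this monotonic MSE-PSNR relation explicit; the peak value of 255 is an assumption for 8-bit images.

```python
import numpy as np

def mse(pred, gt):
    return np.mean((pred - gt) ** 2)

def psnr(pred, gt, peak=255.0):
    # Higher PSNR corresponds to lower MSE: the relation is strictly monotonic.
    return 10.0 * np.log10(peak ** 2 / mse(pred, gt))
```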

Currently, the optimization of such loss functions is dominated by the stochastic gradient descent (SGD) method [40], for example in [15, 24, 33]. Basically, at the \((t+1)\)-th iteration the parameters \(\mathbf {\Theta }_{t+1}\) are updated using the previous parameter update \(\mathbf {\Lambda }_t\) and the negative gradient \(-\nabla l(\mathbf {\Theta }_t)\),

$$\begin{aligned} \begin{aligned} \mathbf {\Lambda }_{t+1}&= a \mathbf {\Lambda }_{t} - b \nabla l(\mathbf {\Theta }_t),\\ \mathbf {\Theta }_{t+1}&= \mathbf {\Theta }_{t} + \mathbf {\Lambda }_{t+1}, \end{aligned} \end{aligned}$$
(8)

where \(a\) and \(b\) are the momentum and learning rate, resp. One weakness of SGD is that the improvements gained from the optimization decrease rapidly with a growing number of iterations. In such a case, SGD may not be able to recover accurate details from highly corrupted images. This is the main reason why we prefer adaptive moment estimation (Adam) [41] as our optimization method. The Adam update is stated as follows,

$$\begin{aligned} \begin{aligned} \mathbf {\Lambda }_{t}&= a_1 \mathbf {\Lambda }_{t-1} + (1 - a_1) \nabla l(\mathbf {\Theta }_t),\\ \mathbf {K}_{t}&= a_2 \mathbf {K}_{t-1} + (1 - a_2) \nabla l(\mathbf {\Theta }_t)^2, \end{aligned} \end{aligned}$$
(9)

where \(a_1, a_2\) are the decay rates of the moment estimates, and \(\mathbf {\Theta }_{t+1}\) is updated based on \(\mathbf {\Lambda }_{t}\) and \(\mathbf {K}_{t}\),

$$\begin{aligned} \begin{aligned} \mathbf {\Theta }_{t+1}&= \mathbf {\Theta }_{t} - b \frac{\sqrt{1 - (a_2)^t}}{1 - (a_1)^t} \frac{\mathbf {\Lambda }_{t}}{\sqrt{\mathbf {K}_{t}} + \epsilon }, \end{aligned} \end{aligned}$$
(10)

where b is the learning rate and \(\epsilon \) prevents division by zero. At the beginning of the iterations, the cost \(l(\mathbf {\Theta })\) decreases considerably faster than with SGD. Moreover, Eq. (10) shows that the magnitudes of the parameter updates are invariant to rescaling of the gradient, so the method provides relatively fast convergence even after a large number of iterations.
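A minimal NumPy version of the update in Eqs. (9) and (10), for a single parameter array, is sketched below; the function name and interface are ours.

```python
import numpy as np

def adam_step(theta, grad, lam, k, t, b=0.001, a1=0.9, a2=0.999, eps=1e-8):
    """One Adam update (Eqs. 9-10). lam/k are the running first/second
    moment estimates; t is the 1-based iteration index."""
    lam = a1 * lam + (1.0 - a1) * grad
    k = a2 * k + (1.0 - a2) * grad ** 2
    theta = theta - b * (np.sqrt(1.0 - a2 ** t) / (1.0 - a1 ** t)) * lam / (np.sqrt(k) + eps)
    return theta, lam, k
```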

4 Experiments

In the following we describe the experimental setup and datasets used to validate our 3DCF approach on both the SR and DN tasks, then discuss the results.

4.1 Experimental Setup and Datasets

DN. Like most DN-related papers, we add white Gaussian (AWG) noise to ground truth images to create our corrupted images. Three standard deviations \(\sigma \in \{ 15, 25, 50\}\) are chosen to measure the performance of 3DCF. Under these conditions, we compare our 3DCF with the state-of-the-art DN methods described in the introductory Sect. 1: BM3D [6], LSSC [11], EPLL [10], opt-MRF [14], CRTF [20], WNNM [8], CSF [13], TRD [15], MLP [12], as well as the NN [17] fusion method.

We use the same training data as in [15], i.e., 400 cropped images of size \(180 \times 180\) from the training part of the Berkeley segmentation dataset (BSD) [42]. We evaluate our method on the 68 test images as in [43], a standard benchmark employed by top methods like [13, 15].

SR. For SR we use the same 3DCF architecture as for DN and test it on the standard benchmarks Set5 [44], Set14 [45] (as proposed in [30]), and B100 [29], with 5, 14, and 100 images resp., which are widely adopted in the recent literature. To obtain the LR images, as in many SR works, we first convert the ground truth image into the YCbCr color space, then downscale the luminance channel with bicubic interpolation. Our training data is formed by the 200 training BSD images of size \(321 \times 481\), from which we extract millions of LR-HR image pairs. We report PSNR and SSIM results for the latest methods with top performance: A+ [29], SRCNN(L) [33], RFL [31], SelfEx [32], and CSCN [24].

4.2 Implementation Details

We implement our 3DCF method with Caffe [46]. 3DCF is used in the same form for both DN and SR. For clarity and ease of understanding and deployment, we prefer stacking two top methods along the channel dimension as our first starting point \(\mathbf {I}_a\). For DN we use MLP [12], an external neural network method, and BM3D [36], an internal method. Such a combination of two top methods increases our chance of taking advantage of the strengths and overcoming the weaknesses of both worlds. For SR, CSCN [24] and A+ [29] are our choices for similar reasons: one is a CNN method and the other a non-CNN method. The starting point \(\mathbf {I}_b\) is simply obtained from the average image of the two methods together with its corresponding first- and second-order gradients along the x/y directions. To enable 3DCF to recover more accurate details, we use two networks for each starting point \(\mathbf {I}_a, \mathbf {I}_b\) (see Fig. 6), while slightly perturbing the input of each activation by multiplying it by \(-1\). For the same reason we fix the coefficients \(c_1, c_2\) to 1 and 0.1; the coefficients \(d_1, d_2\) are set likewise. Now Eq. (6) becomes:

$$\begin{aligned} F(\mathbf {I}_1, \mathbf {I}_2) = \frac{1}{2} (\mathbf {I}_1 + \mathbf {I}_2) + \mathbf {R}_{a}^{1}(\mathbf {I}_a) + 0.1\mathbf {R}_{a}^{2}(\mathbf {I}_a) + \mathbf {R}_{b}^{1}(\mathbf {I}_b) + 0.1\mathbf {R}_{b}^{2}(\mathbf {I}_b). \end{aligned}$$
(11)

For the sake of time complexity and memory savings, each network shown in Fig. 6 has 4 layers, and the filter sizes \(n \times c \times h \times w\) are set to \((32 \times 3 \times 5 \times 5, 1 \times 1 \times 1 \times 1, 32 \times 3 \times 5 \times 5, 1 \times 1 \times 1 \times 1)\) for \(\mathbf {I}_a\), while \(\mathbf {I}_b\) has almost the same settings except for the 3rd layer with \(32 \times 2 \times 5 \times 5\). We also set the channel, height, and width strides to 1 for all layers. The output should be a single image with the same spatial size as the input image. To this end, the channel, height, and width padding sizes are set to \((1 \times 2 \times 2, 0 \times 0 \times 0, 0 \times 2 \times 2, 0 \times 0 \times 0)\) for \(\mathbf {I}_a\); for \(\mathbf {I}_b\) we follow the same setup except that the first layer padding is \(0 \times 2 \times 2\). We initialize the weights of the convolutional layers from a Gaussian distribution with standard deviation 0.05, set the weights to 1 for the sum layers, and set the biases to 0 in all cases.

Meanwhile, we simply use the default per-layer learning rate and decay multipliers of 1 when learning the weights/biases of each layer. Finally, for Eq. (10) the learning rate b for the whole network is set to 0.001, the moments \(a_1, a_2\) take the default values 0.9 and 0.999, and \(\epsilon \) is also set to the default \(10^{-8}\). It is worth mentioning that all the parameters are exactly the same for the two tasks, DN and SR.

4.3 Denoising Results

We evaluate our 3DCF method on 68 standard images [43] from BSD [42]. We apply the best setup for the compared methods, as described in the introductory Sect. 1: CRTF [20] has 5 cascades, and CSF [13] employs \(7 \times 7\) filters, the same as TRD [15] with 8 stages. Table 1 shows that our 3DCF method achieves top performance compared to the other methods for the 3 standard deviations. For example, if we start our method from BM3D [6] and MLP [12], we are 0.11 dB and 0.1 dB better than the top standalone method MLP for \(\sigma \in \{ 25, 50\}\). Due to the lack of an MLP model trained for \(\sigma = 15\), we use BM3D+TRD instead. Still, the performance of our 3DCF is consistent with the other cases, 0.09 dB higher than TRD, the currently best method. Interestingly, if we compare 3DCF with the NN fusion method under the same conditions, that is, with the same starting methods BM3D and MLP, the proposed method outperforms NN by 0.15 and 0.07 dB for \(\sigma \in \{ 15, 25\}\). This observation confirms the non-trivial improvements achieved by 3DCF. Moreover, Fig. 7 indicates that the naive average of MLP and BM3D is even worse than MLP alone. It is also notable from Fig. 7 that the PSNR gradually increases with the number of backpropagation steps. 3DCF is robust to the choice of fused methods: TRD+MLP leads to relative improvements comparable with those achieved starting from BM3D+MLP or BM3D+TRD.

Table 1. Average PSNR values [dB] on 68 images from the BSD dataset as in [43] for \(\sigma \in \{ 15, 25, 50\}\). The best results are in bold. The results marked with (*) are taken from [15].
Fig. 7. PSNR versus backprops on 68 images for \(\sigma \in \{25, 50\}\).

4.4 Super Resolution Results

The PSNR and SSIM results are listed in Table 2. Here our 3DCF fuses A+ [29] with CSCN [24]. Note that we modify the image downscaling steps for CSCN to be consistent with the other methods, including A+ and SRCNN(L). This is why we obtain different PSNR results for CSCN than in the original work [24]. As in the DN case, our 3DCF shows significant improvements over the starting methods. The PSNR improvements vary from 0.11 dB on (B100, \(\times 3\)) to 0.35 dB on (Set5, \(\times 2\)) over the best result of SRCNN(L), the variant with the largest model. The SSIM improvements follow the same trend. Note that for SR, the naive average fusion of the A+ and CSCN results improves over both fused methods. However, our 3DCF results are on average 0.2 dB higher than the average fusion, as shown in Fig. 8.

Table 2. Average PSNR/SSIM for upscaling factors \(\times 2\), \(\times 3\), and \(\times 4\) on the Set5, Set14, and B100 datasets. The best results are in bold.
Fig. 8. PSNR versus backprops on the Set5 dataset for upscaling factors \(\times 2\), \(\times 3\), \(\times 4\).

Fig. 9. Denoising results for \(\sigma =50\). Best zoomed on screen.

Fig. 10. Super-resolution results (\(\times 4\)). Best zoomed on screen.

4.5 Other Aspects

Visual assessment. In general, the visual results are consistent with the PSNR results. Some image results are shown in Fig. 9 for DN and in Fig. 10 for SR. We observe that the 3DCF results generally have fewer artifacts and sharper edges than the other methods.

Running time. 3DCF runs in roughly 0.04 seconds per \(321\times 480\) image on an nVidia Titan X GPU, which is quite competitive and shows that, at the price of a slight increase in processing time, one can fuse available image restoration results. 3DCF needs about 5 h of training time to obtain meaningful improvements over the fused methods, mainly due to the Adam method.

General. To summarize, our 3DCF method shows wide applicability to two important image restoration tasks, DN and SR, with non-trivial improvements. Also, the training and running times of 3DCF are competitive in comparison with other neural network architectures. For certain combinations of existing methods our proposed fusion method shows only mild improvements, for example for TRD+MLP (see Table 1). This sensitivity to the starting point means the starting methods should be chosen with care.

5 Conclusions

We propose a novel 3D convolutional fusion (3DCF) network for image restoration. With the same settings, for both single image super-resolution and image denoising, we achieve significant improvements over the fused methods and over other fusion methods on several standard benchmarks. To speed up training, we apply an adaptive moment estimation method (Adam). The testing and training times are also competitive with those of other recent deep neural networks.