
1 Introduction

Nowadays, consumer cameras are able to capture an entire series of photographs in rapid succession. Hand-held acquisition of such a burst of images is likely to suffer from blur caused by unwanted camera shake during image capture. This is particularly true for the longer exposure times needed in low-light environments.

Motion blurring due to camera shake is commonly modeled as a spatially invariant convolution of a latent sharp image X with an unknown blur kernel k

$$\begin{aligned} Y = k * X + \varepsilon , \end{aligned}$$
(1)

where \(*\) denotes the convolution operator, Y the blurred observation and \(\varepsilon \) additive noise. Single image blind deconvolution (BD), i.e. recovering X from Y without knowing k, is a highly ill-posed problem for a variety of reasons. In contrast, multi-frame blind deconvolution or burst deblurring methods aim at recovering a single sharp high-quality image from a sequence of blurry and noisy observed images \(Y_1, Y_2, \ldots , Y_N\). Accumulating information from several observations can help to solve the reconstruction problem associated with Eq. (1) more effectively.
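
For illustration, the forward model of Eq. (1) is straightforward to simulate. The following is a minimal sketch in which the kernel and noise level are arbitrary choices:

```python
import numpy as np
from scipy.signal import fftconvolve

def blur_observation(x, k, noise_std=0.01):
    """Simulate Eq. (1): Y = k * X + eps for a grayscale image x."""
    y = fftconvolve(x, k, mode="same")                # spatially invariant blur
    return y + noise_std * np.random.randn(*y.shape)  # additive noise eps

# A burst Y_1, ..., Y_N is then N such observations of the same x,
# each generated with a different kernel k_t.
```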

Traditionally, generative models have been used for blind image deconvolution. While they offer much flexibility, they are often computationally demanding and time-consuming.

Discriminative approaches, on the other hand, hold the promise of fast processing times and are particularly suited for situations where an exact modeling of the image formation process is not possible. A popular choice in this context are neural networks. They have gained momentum due to the great success of deep learning in many supervised computer vision tasks, and state-of-the-art results have also been reported for a number of low-level vision tasks [1, 2]. Our proposed method lines up with the latter approaches and makes the following main contributions:

  1. A robust state-of-the-art method for multi-image blind deconvolution, for both invariant and spatially-varying blur.

  2. A hybrid neural network architecture as a discriminative approach to image deblurring, supporting end-to-end learning in the fashion of deep learning.

  3. A neural network layer version of Fourier Burst Accumulation [3] with learnable weights.

  4. An embedding of a small neural network that allows for information sharing across the image burst early in the processing stage.

2 Related Work

Blind image deconvolution (BD) has seen considerable progress in the last decade. A comprehensive review is provided in the recent overview article by Wang and Tao [4].

Single image blind deconvolution. Approaches for single image BD that report state-of-the-art results include the methods of Sun et al. [5], and Michaeli and Irani [6], which use powerful patch-based priors for image prediction. Following the success of deep learning methods in computer vision, a number of neural network based methods have also been proposed for image restoration tasks, including non-blind deconvolution [2, 7, 8], which seeks to restore a blurred image when the blur kernel is known, as well as the more challenging task of blind deconvolution [9,10,11,12,13,14], where the blur kernel is not known a priori. Most relevant to ours is the recent work of Chakrabarti [11], which proposes a neural network trained to output the complex Fourier coefficients of a deconvolution filter. When applied to an input patch in the frequency domain, the network returns a prediction of the Fourier transform of the corresponding latent sharp image patch. For whole image restoration, the input image is cut into overlapping patches, each of which is independently processed by the network. The outputs are recomposed to yield an initial estimate of the latent sharp image, which is then used together with the blurry input image for the estimation of a single invariant blur kernel. The final result is obtained using the state-of-the-art non-blind deconvolution method of Zoran and Weiss [15].

Multi-frame blind deconvolution. Splitting an exposure budget across many photos can lead to a significant quality advantage [16], and it has been shown that multiple captured images can help alleviate the ill-posedness of the BD problem [17]. This has been exploited in several approaches [18,19,20,21,22] for multi-frame BD, i.e. combining multiple, differently blurred images into a single latent sharp image. Generative methods, which make explicit use of an image formation model, mainly differ in the prior and/or the optimization procedure they use. State-of-the-art methods use sparse priors with fast Bregman splitting techniques for optimization [20] or within a variational inference framework [18], and cross-blur penalty functions between image pairs [21, 22], also in combination with robust cost functions [19].

More recently proposed methods [23,24,25,26] also model the inter-frame motion and try to exploit the interrelation between camera motion, blur and image mis-alignment. Camera motion integrated over the exposure time of a single frame produces intra-frame motion blur, while motion during the readout time between consecutive captures leads to inter-frame mis-alignment.

All of the above-mentioned methods employ generative models and try to explicitly estimate one unknown blur kernel for each blurry input frame along with predicting the latent sharp image. A common shortcoming is the large computational burden, with typical computation times on the order of tens of minutes, which hinders their widespread use in practice.

Recently, Delbracio and Sapiro have presented a fast method that aggregates a burst of images into a single image that is both sharper and less noisy than all the images in the burst [3]. The approach is inspired by a recently proposed Lucky Imaging method [27] targeted at astronomical imaging. Traditional Lucky Imaging approaches select only a few “lucky” frames from a stack of hundreds to thousands of recorded short-exposure images and combine them via non-rigid shift-and-add techniques into a single sharp image. In contrast, the authors of [27] propose to take all images into account (rather than a small subset of carefully chosen frames), combining the least blurred content of each frame to form an improved image. This is done by computing a weighted average of the Fourier coefficients of the registered images in the burst. In [3, 28] it has been demonstrated that this approach can be adapted successfully to remove camera shake originating from hand tremor. Their Fourier Burst Accumulation (FBA) approach allows for fast processing even of megapixel images while yielding high-quality results, provided that a “lucky”, i.e. almost sharp, frame is among the captured image burst.

In our work, we not only present a learning-based variant of FBA but also show how to alleviate the drawback of requiring a sharp frame amongst the input sequence of images. To this end we combine the single image BD method of [11] with FBA in a single network architecture which facilitates end-to-end learning. To the best of our knowledge this is the first time that a fully discriminative approach has been presented for the challenging problem of multi-frame BD.

3 Method

Assume we are given a burst of observed color images \(Y_1, Y_2,\ldots , Y_N \in \mathcal {I}\) capturing the same scene \(X \in \mathcal {I}\). Assuming that each image in the captured sequence is blurred differently, our image formation model reads

$$\begin{aligned} Y_t = k_t* X + \varepsilon _t, \end{aligned}$$
(2)

where \(*\) denotes the convolution operator, \(k_t\) the blur kernel for observation \(Y_t\) and \(\varepsilon _t\) additive zero-mean Gaussian noise.

We aim at predicting a latent single sharp image \(\hat{X}\) through a deep neural network architecture, i.e.

$$\begin{aligned} \pi ^{(\theta )}:\mathcal {I}_p^N \rightarrow \mathcal {I}_p,\qquad (y_1,y_2,\ldots ,y_N)\mapsto \hat{x} = \pi ^{(\theta )}(y_1,y_2,\ldots ,y_N). \end{aligned}$$

The network operates on a patch-by-patch basis; here \(y_t \in \mathcal {I}_p\) and \(\hat{x} \in \mathcal {I}_p\) denote patches in \(Y_t\) and \(X\), respectively. The patches are chosen to be overlapping, and the network predicts a single sharp patch \(\hat{x}\in \mathcal {I}_p\) from the multiple input patches \(y_t\in \mathcal {I}_p\). All predicted patches are recomposed into the final prediction \(\hat{X}\) by averaging the predicted pixel values. During the training phase we optimize the learnable parameters \(\theta \) by directly minimizing the objective

$$\begin{aligned} \Vert \pi ^{(\theta )}(y_1,y_2,\ldots ,y_N) - x \Vert _2^2. \end{aligned}$$
(3)

In the following we describe the construction of \(\pi ^{(\theta )}(\cdot )\), the optimization of the network parameters \(\theta \) during training, and the restoration of an entire sharp image.

3.1 Network Architecture

The architecture \(\pi ^{(\theta )}(\cdot )\) consists of several stages: (a) frequency band analysis with Fourier coefficient prediction, (b) a deconvolution part and (c) image fusion. Figure 1 illustrates the first two stages of our proposed system.

Fig. 1. Frequency band analysis and deconvolution for an image burst with 3 patches \(y_1,y_2,y_3\). Following the work of Chakrabarti [11] we separate the Fourier spectrum into 4 different bands \(b_1,\ldots ,b_4\). In addition, we allow each band to interact separately across all images in one burst to support early information sharing. The predicted outputs of the deconvolution step are smaller patches \(\tilde{x}_1,\tilde{x}_2,\tilde{x}_3\).

(a) Frequency band analysis. The frequency band analysis computes the discrete Fourier transform of the observed patch \(y_t\), following the neural network approach in [11], at three different sizes (\(17\times 17, 33\times 33, 65\times 65\)) using different sample sizes; we refer to these as bands \(b_1, b_2, b_3\). In addition, band \(b_4\) represents a low-pass band containing all coefficients with \(\max \left| z \right| \le 4\) from band \(b_3\). This is depicted in Fig. 1. To enable early information sharing within one burst of patches, we allow the neural network to spread the per-band information extracted from one patch across all images of the burst using \(1\times 1\) convolutions.
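
As a rough sketch of this stage, the three bands can be obtained from center crops of increasing size, and \(b_4\) by masking the low frequencies of \(b_3\). Note that this is a simplification for illustration; the exact multi-scale sampling scheme of [11] is more involved:

```python
import numpy as np

def frequency_bands(y, sizes=(17, 33, 65), low_pass=4):
    """Compute Fourier bands b1..b3 from center crops of a 65x65 patch y,
    plus a low-pass band b4 keeping frequency indices |z| <= low_pass."""
    c = y.shape[0] // 2
    bands = []
    for s in sizes:
        h = s // 2
        crop = y[c - h:c + h + 1, c - h:c + h + 1]
        bands.append(np.fft.fft2(crop))
    b3 = bands[-1]
    f = np.fft.fftfreq(b3.shape[0]) * b3.shape[0]   # integer frequency indices
    mask = (np.abs(f)[:, None] <= low_pass) & (np.abs(f)[None, :] <= low_pass)
    b4 = b3 * mask                                  # low-pass band
    return bands + [b4]
```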

This \(1\times 1\) convolution essentially embeds a fully connected neural network for each Fourier coefficient \((f_{ij})_{t}\), with weight sharing across coefficient positions. Since we use these operations again in the image fusion stage, we elaborate on this idea in more detail.

The values of one Fourier coefficient \((f_{ij})_{t}\) at frequency position \((i,j)\) across the entire burst \(t=1,2,\ldots ,N\) can be considered as a single vector \((f_{ij})_{t=1,2,\ldots ,N}\) of dimension N (compare Fig. 2). Each of these vectors is fed through a small network of fully connected layers, labeled mlp_1 in Fig. 1. This allows the neural network to adjust the extracted Fourier coefficients right before a dimensionality reduction occurs. These modified values \((f'_{ij})_{t=1,2,\ldots ,N}\) give rise to adjusted Fourier bands \(b_1', b_2', b_3', b_4'\).
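
The following sketch illustrates this weight-sharing idea; the layer sizes are illustrative assumptions, not the trained configuration. A tensor of coefficients of shape (H, W, N) is reshaped so that the burst dimension acts as the channel dimension, and a small MLP (equivalently a stack of \(1\times 1\) convolutions) is applied at every position with the same weights:

```python
import numpy as np

def mlp_across_burst(f, weights, biases):
    """Apply the same small MLP to the burst vector (f_ij)_{t=1..N} at
    every frequency position (i, j); f has shape (H, W, N)."""
    h = f.reshape(-1, f.shape[-1])            # one N-vector per position
    layers = list(zip(weights, biases))
    for W_l, b_l in layers[:-1]:
        h = np.maximum(h @ W_l + b_l, 0.0)    # hidden layers with ReLU
    W_out, b_out = layers[-1]
    h = h @ W_out + b_out                     # linear output layer
    return h.reshape(f.shape[:2] + (h.shape[-1],))

# Illustrative two-layer MLP mapping N=3 burst values to 3 adjusted values.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 8)), rng.normal(size=(8, 3))
f = rng.normal(size=(65, 65, 3))              # e.g. real part of one band
out = mlp_across_burst(f, [W1, W2], [np.zeros(8), np.zeros(3)])
```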

Fig. 2. For arbitrary inputs (bands \(b_1,b_2,b_3,b_4\) or, later, FBA weights) we interpret each coefficient across one burst as a single vector. A transformed version of this excerpt is placed at the same location in the output patch again. To reduce the number of learnable parameters, we employ weight sharing independent of the position.

(b) Deconvolution. Pairwise merging of the adjusted bands \(b_1', b_2', b_3', b_4'\) using fully connected layers with ReLU activations entails a dimensionality reduction. The resulting 4096-dimensional feature encoding is then fed through several fully connected layers producing a 4225-dimensional (\(= 65\times 65\)) prediction of the filter coefficients of the deconvolution kernel. Applying the deconvolution kernel predicts a deconvolved patch \(\tilde{x}_t\) of size \(33\times 33\) from each input patch \(y_t\). This step is implemented as a multiplication of the predicted Wiener filter with the Fourier transform of the input patch.
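
The filter application itself is a plain per-frequency multiplication followed by cropping; a minimal sketch, where the predicted filter g stands in for the network output:

```python
import numpy as np

def apply_frequency_filter(y, g):
    """Deconvolve patch y by multiplying its Fourier transform with the
    predicted filter g (both 65x65), then crop the central 33x33 region."""
    x_full = np.real(np.fft.ifft2(np.fft.fft2(y) * g))
    c, h = y.shape[0] // 2, 33 // 2
    return x_full[c - h:c + h + 1, c - h:c + h + 1]
```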

(c) Image fusion. In the last part of our pipeline we fuse the deconvolved patches \(\tilde{x}_1,\tilde{x}_2,\ldots ,\tilde{x}_N\) by adopting the FBA approach described in [3] as a neural network component with learnable weights. The vanilla FBA algorithm computes the following weighted sum of the Fourier transforms \(\hat{\alpha }_i\) of the patches \(\alpha _i\):

$$\begin{aligned} u(\hat{\alpha })&= \mathcal {F}^{-1}\left( \sum _{i=1}^N w_i(\zeta ) \hat{\alpha }_i(\zeta )\right) (x) \end{aligned}$$
(4)
$$\begin{aligned} w_i(\zeta )&= \frac{\left| \hat{\alpha }_i(\zeta ) \right| ^p}{\sum _{j=1}^N \left| \hat{\alpha }_j(\zeta ) \right| ^p}, \end{aligned}$$
(5)

where \(w_i(\zeta )\) denotes the contribution of frequency \(\zeta \) of patch \(\alpha _i\). Note that \(u(\hat{\alpha })\) is differentiable in \(\hat{\alpha }\), which allows gradient information to be passed to previous layers through back-propagation. To incorporate this algorithm as a neural network layer into our pipeline, we replace Eq. (4) by a parametrized version

$$\begin{aligned} u(\hat{\alpha })&= \mathcal {F}^{-1}\left( \sum _{i=1}^N h_{\phi ,i}(\zeta )\, \hat{\alpha }_i(\zeta )\right) (x). \end{aligned}$$
(6)

Hence, instead of a hard-coded weighted averaging (using \(w_i\)), the network is able to learn a data-dependent weighted-averaging scheme, where \(h_{\phi ,i}(\zeta )\) denotes the learned weight for frequency \(\zeta \) of frame i. The function \(h_\phi (\cdot )\) is realized by two \(1\times 1\) convolutional layers with trainable parameters \(\phi \), following the same idea of considering the Fourier coefficients across one burst as a single vector (compare Fig. 2).
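
For reference, the vanilla FBA aggregation of Eqs. (4) and (5) is easy to state in code. This is a minimal sketch; the exponent p = 11 is an illustrative choice, and the Gaussian smoothing of the weights discussed in Sect. 3.2 is omitted:

```python
import numpy as np

def fba(patches, p=11):
    """Vanilla Fourier Burst Accumulation: per-frequency weighted average
    of the Fourier coefficients of N registered patches, Eqs. (4)-(5)."""
    F = np.stack([np.fft.fft2(a) for a in patches])      # shape (N, H, W)
    mag = np.abs(F) ** p
    w = mag / (mag.sum(axis=0, keepdims=True) + 1e-12)   # weights, Eq. (5)
    return np.real(np.fft.ifft2((w * F).sum(axis=0)))    # fusion, Eq. (4)
```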

3.2 Training

The network is trained on an artificially generated dataset obtained by applying synthetic blur kernels to patches extracted from the MS COCO dataset [29], which consists of real-world photographs collected from the internet. To increase the quality of the ground-truth patches guiding the training process, we reject patches with too small image gradients. This procedure yields 542,217 sharp patches. For a fair evaluation we use a split into a training and a validation set; the neural network parameters are optimized on the training set only. The input bursts of 14 blurry images are generated on-the-fly by applying synthetic blur kernels to the ground-truth patches. These synthetic blur kernels of sizes \(17\times 17\) and \(7\times 7\) pixels are generated using a Gaussian process with a Matérn covariance function following [9]; a random subset is shown in Fig. 3. In addition, we apply standard data augmentation methods such as rotating and mirroring to the ground-truth data. Hence, this approach yields a practically unlimited amount of training data. We also add zero-mean Gaussian noise with variance 0.1. The validation data is precomputed to ensure a fair evaluation during training.
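
One possible way to generate such kernels is sketched below, under assumptions: we draw a 2-dimensional camera path from a Matérn-3/2 Gaussian process in the spirit of [9] and rasterize it into a kernel grid; the length scale and number of path samples are illustrative choices:

```python
import numpy as np

def sample_psf(size=17, n=200, length_scale=0.3, seed=None):
    """Sample a synthetic motion-blur PSF: draw x/y trajectories from a
    Matern-3/2 Gaussian process and accumulate them on a size x size grid."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, n)
    d = np.abs(t[:, None] - t[None, :]) / length_scale
    K = (1 + np.sqrt(3) * d) * np.exp(-np.sqrt(3) * d)   # Matern-3/2 covariance
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    path = L @ rng.normal(size=(n, 2))                   # 2-D camera trajectory
    path -= path.mean(axis=0)
    path = path / (np.abs(path).max() + 1e-12) * (size // 2 - 1)
    psf = np.zeros((size, size))
    for x, y in path + size // 2:
        psf[int(round(y)), int(round(x))] += 1.0         # rasterize the path
    return psf / psf.sum()                               # normalize to sum 1
```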

Unfortunately, sophisticated step-size heuristics like Adam [30] or Adagrad [31] failed to provide stable training; we suspect that the large range of values in the Fourier domain misleads these heuristics. Instead, we use stochastic gradient descent with momentum (\(\beta =0.9\)), a batch size of 32 and an initial learning rate of \(\eta =2\), which decreases every 5000 steps by a factor of 0.8. Training the neural network took 6 days using TensorFlow [32] on an NVIDIA Titan X.
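
The step decay schedule itself is straightforward; a sketch of the stated configuration (the surrounding training loop is omitted):

```python
def learning_rate(step, eta0=2.0, decay=0.8, every=5000):
    """Initial rate eta0, multiplied by `decay` every `every` steps."""
    return eta0 * decay ** (step // every)

assert learning_rate(0) == 2.0
assert abs(learning_rate(5000) - 1.6) < 1e-12
```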

Fig. 3. Some of the synthetically generated PSFs using a Gaussian process for generating training examples on-the-fly.

Fig. 4. Deblurring a burst of degraded images from a ground-truth image (left) results in a desaturated image (middle). Therefore we correct those colors (right) using color transfer. (Color figure online)

The FBA approach [3] applies Gaussian smoothing to the weights \(w_i\) to account for the fact that small camera shakes are likely to vary the Fourier spectrum in a smooth way. While this removes strong artefacts in the restored recomposed image, it prevents the network from converging during training. Following this idea, we tried both a fixed Gaussian blur with parameters set to the values reported in [3] and a blur kernel (initialized as a Gaussian) learned during training; in both cases we observed no convergence. Therefore, we apply this smoothing only in the final application of the neural network.

3.3 Deployment

During deployment we feed input patches of size \(65 \times 65\) into our neural network with a stride of 5. Using overlapping patches allows averaging multiple predictions. For the recombination of overlapping patches we apply a 2-dimensional Hanning window to each patch to favour pixel values in the patch center and down-weight information at the patch border.
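
Recombination then amounts to windowed accumulation followed by normalization; a minimal sketch, where the patch positions follow the stride-5 grid described above:

```python
import numpy as np

def recombine(patches, positions, image_shape, psize=33):
    """Blend overlapping predicted patches with a 2-D Hanning window."""
    w = np.outer(np.hanning(psize), np.hanning(psize))
    out = np.zeros(image_shape)
    norm = np.zeros(image_shape)
    for p, (r, c) in zip(patches, positions):
        out[r:r + psize, c:c + psize] += w * p
        norm[r:r + psize, c:c + psize] += w
    return out / np.maximum(norm, 1e-12)   # guard against zero weight sums
```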

While the predicted images \(\hat{X}\) generated by our neural network contain well-defined sharp edges, we observed a desaturation in color contrast. To correct the colors of the predicted image we replace its ab-channels in the Lab color space with the ab-channels of the FBA result (compare Fig. 4).
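
This color transfer is a simple channel swap; a sketch using scikit-image, assuming RGB inputs in [0, 1]:

```python
import numpy as np
from skimage import color

def transfer_colors(prediction_rgb, fba_rgb):
    """Keep the luminance (L) of the network prediction but take the
    color channels (a, b) from the FBA result."""
    lab_pred = color.rgb2lab(prediction_rgb)
    lab_fba = color.rgb2lab(fba_rgb)
    lab_pred[..., 1:] = lab_fba[..., 1:]     # replace a and b channels
    return np.clip(color.lab2rgb(lab_pred), 0.0, 1.0)
```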

Regarding runtime, the most expensive step is the frequency band analysis. For a burst of 14 images of size \(1000 \times 700\) pixels, the entire reconstruction process takes roughly 5 min per channel with our unoptimized implementation.

Fig. 5. Comparison to state-of-the-art multi-frame blind deconvolution algorithms on real-world data. See the supplementary material for high-resolution images. Note that our approach produces the sharpest results except for the last scene, which could be caused by the color transfer described in Sect. 3.3. (Color figure online)

4 Experiments

To evaluate and validate our approach, we conduct several experiments, including a comprehensive comparison with state-of-the-art techniques on a real-world dataset and a performance evaluation on a synthetic dataset that tests the robustness of our approach to varying image quality of the input sequence.

4.1 Comparison on Real-World Dataset

We compare the restored images with other state-of-the-art multi-image blind deconvolution algorithms. In particular, we compare with the multichannel blind deconvolution method of Šroubek et al. [21], the sparse-prior method of [33] and the FBA method proposed in [3]. We used the data provided by [3], which contains typical photographs captured with hand-held cameras (iPad back camera, Canon 400D). As they were captured under various challenging lighting conditions, they exhibit both noise and saturated pixels. As shown in [3], the FBA algorithm demonstrated superior performance compared to previous state-of-the-art multi-image blind deconvolution algorithms [21, 33] in both reconstruction quality and runtime. Figure 5 shows crops of the deblurred results on these images. The high-resolution images are enclosed in the supplementary material. Our trained neural network featuring the FBA-like averaging yields comparable if not superior results to previous approaches [3, 21, 33]. In direct comparison to the FBA results, our method is better at removing blur due to our additional prepended deconvolution module.

4.2 Deblurring Bursts with Varying Number of Frames and Quality

Here, we analyse the performance of our approach depending on the burst “quality”. Sorting the images provided by [3] within each burst according to their PSNR, beginning with strongly blurred images and consecutively adding sharper shots, gives a series of bursts ranging from bursts of poor quality only up to bursts containing at least one close-to-sharp shot. Since our architecture is trained for deblurring bursts with exactly 14 input images, we duplicated images of bursts with fewer frames. Figure 6 clearly indicates good performance of our neural network even for a relatively small number of input images with strong blur artifacts.
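
The burst construction used in this experiment can be sketched as follows; PSNR is computed against the ground truth, and the duplication step pads bursts to the 14 frames expected by the network:

```python
import numpy as np

def psnr(img, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((img - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def growing_bursts(frames, ref, n_expected=14):
    """Yield bursts of increasing quality: frames sorted from most to least
    blurry (ascending PSNR), duplicated to always contain n_expected images."""
    order = sorted(frames, key=lambda f: psnr(f, ref))
    for k in range(1, len(order) + 1):
        burst = order[:k]
        reps = -(-n_expected // k)             # ceiling division
        yield (burst * reps)[:n_expected]
```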

Fig. 6. FBA and our algorithm are compared on bursts with a growing number of images of increasing quality. The individual images are sorted according to their PSNR, starting with the most blurry images. The input images were taken from [3].

4.3 Deblurring Image Bursts Without Reasonable Sharp Frames

To further challenge our neural network approach, we artificially sampled image bursts from unseen images taken from the MS COCO validation set and blurred them with synthetic blur kernels of size \(14\times 14\). The images restored from input bursts of 14 artificially blurred frames in the absence of a close-to-sharp frame (best shot) are depicted in Fig. 7. As the experiments indicate, the explicit deconvolution step in our approach is essential for handling this kind of snapshot and removing blur artifacts. In contrast, while FBA [3] stands out for its small memory footprint and fast processing times, it clearly fails to recover sharp images when no reasonably sharp frame is available in the input sequence.

Fig. 7. Comparing FBA (third column) and our trained neural network (fourth column) against the best shot and a typical shot. These images are taken from the validation set. For image bursts without a single sharp frame, lucky imaging approaches fail due to the missing explicit deconvolution step, while our approach gives reasonable results.

4.4 Comparing to a Baseline Version

One might ask how our trained neural network compares to an approach that applies the methods of Chakrabarti [11] and Delbracio and Sapiro [3] sequentially, each as a separate step. We fine-tuned the weights provided by [11] in combination with our FBA layer. Figure 8 shows the training progress for an exemplary patch, where the improvement in sharpness is clearly visible.

In addition, we ran the entire pipeline of Chakrabarti [11], including the costly non-blind EPLL deconvolution step, followed by FBA. This approach is significantly slower and results in less sharp reconstructions (see Fig. 9).

Fig. 8. The combination of the works of Chakrabarti [11] and Delbracio and Sapiro [3] can be considered a baseline version of our neural network. We fine-tuned the published weights from [11] in an end-to-end fashion in combination with our FBA layer. The left-most patch is the ground-truth patch. Note how the sharpness continuously increases with training.

Fig. 9. Comparison to a baseline approach of simply stacking [3, 11]. Without end-to-end training, ringing artifacts are clearly visible on the blue roof; they are significantly dampened after training. (Color figure online)

4.5 Spatially-Varying Blur

To test whether our network is also able to deal with spatially-varying blur, we generated a burst of images degraded by non-stationary blur. To this end, we took one of the recorded camera trajectories of [34] that are provided on the project webpage. The camera trajectory was recorded with a Vicon system at 500 fps and represents the camera motion during a slightly longer-exposed shot (1/30 s). The trajectory comprises a 6-dimensional time series with 167 time samples, which we divided into 8 fragments of approximately equal length.

With a Matlab script (see supplemental material), 8 spatially-varying PSFs are generated, as shown at the bottom of Fig. 10.

The spatially-varying kernels of size \(17 \times 17\) pixels are applied using the Efficient Filter Flow (EFF) model of [35]. The results of our network along with the results of FBA for three example images are shown in Fig. 10. Our results are consistently sharper and demonstrate that our approach is also able to correct for spatially-varying blur.
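
For completeness, the EFF forward model expresses spatially-varying blur as a sum of locally invariant convolutions over overlapping windows. A minimal sketch follows; the window layout and weights are simplified, see [35] for the exact construction:

```python
import numpy as np
from scipy.signal import fftconvolve

def eff_blur(x, kernels, windows):
    """Efficient Filter Flow style forward model: the image is decomposed
    by smooth windows w_r (assumed to sum to one pixel-wise), each region
    is convolved with its own kernel k_r, and the results are added up."""
    y = np.zeros_like(x)
    for k_r, w_r in zip(kernels, windows):
        y += fftconvolve(w_r * x, k_r, mode="same")
    return y
```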

Fig. 10. Comparison to FBA on image sequences with spatially-varying blur. Our approach is able to reconstruct consistently sharper images.

5 Conclusion, Limitations and Future Work

We presented a discriminative approach to multi-frame blind deconvolution (BD), posing it as a nonlinear regression problem. As function approximator we use a deep layered neural network whose optimal parameters are learned from artificially generated data. Our proposed network architecture draws inspiration from two recent works: (a) the neural network approach to single image blind deconvolution of Chakrabarti [11], and (b) the Fourier Burst Accumulation (FBA) algorithm of Delbracio and Sapiro [3]. The latter takes a burst of images as input and combines them through a weighted average in the frequency domain into a single sharp image. We reformulated FBA as a learning method and cast it into a deep layered neural network. Instead of resorting to heuristics and hand-tuned parameters for weight computation, we learn optimal weights as network parameters through end-to-end training.

By prepending parts of the network of Chakrabarti to our FBA network we are able to extend its applicability by alleviating the necessity of a close-to-sharp frame being amongst the image burst. Our system is trained end-to-end on a set of artificially generated training examples, enabling competitive performance in multi-frame BD, both with respect to quality and runtime. Due to its novel information sharing in the frequency band analysis stage and its explicit deconvolution step, our network outperforms state-of-the-art techniques like FBA [3] especially for bursts with few severely degraded images.

Our contribution resides at the experimental level, and despite results competitive with the state of the art, our proposed approach is subject to a number of limitations. At the same time, these open up several exciting directions for future research:

  • Our proposed approach does not exploit the temporal structure of the input image sequence, which encodes valuable information about intra-frame blur and inter-frame image mis-alignment [24,25,26]. Embedding our described network into a network architecture akin to the spatio-temporal auto-encoder of Pătrăucean et al. [36] might enable such non-trivial inference.

  • Our current model assumes a static scene and is not able to handle object motion. Inserting a Spatial Transformer Network layer [37], which also facilitates optical flow estimation [36], could be an interesting avenue to capture and correct for object motion occurring between consecutive frames.