1 Introduction

With the increasing prevalence of digital cameras and mobile phones, a huge number of high-resolution images are taken every day [2, 8, 11, 18, 51]; for example, the latest Huawei Mate20 series mobile phones have over 60 megapixels. However, sensor shake is often inevitable and results in undesirable motion blur. Although sharp images might be obtained by fixing the device or retaking the image, in many situations neither is possible, for example in remote sensing [15], video surveillance [44], medical imaging [14] and related fields. Obtaining sharp images from blurred ones has therefore attracted researchers in many fields for many years, but the problem is still not well solved because of the complexity of the motion blur process and, most importantly, because high-resolution natural images often contain rich details. Most existing methods may not produce satisfactory results, as shown in Fig. 1.

Image deblurring is an image degradation problem, which can be expressed as

$$\begin{aligned} I^\mathrm{blur} = A(I^\mathrm{sharp}) + n, \end{aligned}$$
(1)

where \(I^\mathrm{blur}\) is the given blurred image and \(I^\mathrm{sharp}\) is the sharp image. A is a degradation function, and n denotes possible noise. In this work, we shall focus on the cases where the degradation process is shift invariant; thereby, the generation process of a blurred image is given by

$$\begin{aligned} I^\mathrm{blur} = I^\mathrm{sharp} * k + n, \end{aligned}$$
(2)

where \(*\) denotes 2D convolution and k is the blur kernel. Commonly used approaches for estimating the sharp image and the blur kernel simultaneously are MAP [3, 52] and variational Bayes [9, 28], and many such methods have been explored in the literature. For example, Chan et al. [3] proposed total variation to regularize the gradients of the sharp image. Zhang et al. [52] used a sparse coding method for sharp image recovery. Cai et al. [1] applied sparse representation to estimate the sharp image and the blur kernel at the same time. Although these methods obtain moderately good results, they cannot be applied to real applications and, most importantly, cannot handle high-frequency features well.
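The shift-invariant model in Eq. (2) can be illustrated with a minimal pure-Python sketch. The function names `convolve2d_same` and `blur`, the zero padding at the image border and the Gaussian form of the noise n are assumptions made for illustration only; they are not taken from the implementation used in this work.

```python
import random


def convolve2d_same(image, kernel):
    """2D convolution with zero padding; the output has the same size as `image`."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # True convolution flips the kernel in both directions.
                    yy, xx = y + ph - i, x + pw - j
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def blur(sharp, kernel, noise_std=0.0, seed=0):
    """I_blur = I_sharp * k + n, with n assumed to be zero-mean Gaussian noise."""
    rng = random.Random(seed)
    blurred = convolve2d_same(sharp, kernel)
    if noise_std == 0.0:
        return blurred
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in blurred]


# A single bright pixel smeared by a horizontal 1x3 box kernel.
impulse = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box_kernel = [[1 / 3, 1 / 3, 1 / 3]]
smeared = blur(impulse, box_kernel)
```

Because the kernel sums to one, the total intensity of the impulse is preserved while it is spread along the direction of the simulated motion.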

Fig. 1 Results on a challenging motion-blurred image. From left to right and top to bottom: blurred image, the result by Pan et al. [35], the result by Xu et al. [48] and ours. The results show that our model distinctly outperforms the two competing methods

To achieve fast image deblurring, it is natural to consider deep learning, in which network models are pre-trained on plenty of training examples. Although the training process is computationally expensive, deep learning methods can process test images very efficiently, as they only need to pass an image through the learned network. Most existing deep learning-based methods are built upon the well-known convolutional neural network (CNN) [12, 34]. However, CNNs tend to suppress the high-frequency details in images. To relieve this issue, the generative adversarial network (GAN) [10] is one of the promising choices. Kupyn et al. [25] proposed a GAN-based method with a ResBlock architecture as the generator. Pan et al. [34] used a GAN-based method to extract intrinsic physical features in images.

In this paper, we propose a cycle GAN-based method for image deblurring. Specifically, we use an encoder–decoder network as the generator and a classification network as the discriminator. The method uses a cycle-consistent training strategy that requires training two generators (one for blurring a sharp image and one for sharpening a blurred image) and two discriminators (one for classifying blurred images and one for sharp images). We also propose a novel loss function. For the cycle loss, which aims to make the reconstructed images and the input images as close as possible under some measure, there are several classical choices: L1 loss, mean square loss, least square loss and perceptual loss.

Through comparison experiments, we demonstrate that perceptual loss can capture high-frequency features in image deblurring, so perceptual loss is used in all experiments. We then show that a U-net-based architecture with an L2 norm and perceptual objective performs better on the image deblurring problem. We also find that training with unpaired images under the cycle-consistent training strategy can improve performance in image deblurring tasks. In summary, the contributions of this paper are as follows:

  1. A novel cycle GAN-based architecture is presented for image deblurring.

  2. We propose a new loss function in our architecture.

  3. A novel training strategy is proposed to tackle the image deblurring problem.

Fig. 2 Network architecture with cycle-consistent training strategy, including two cycle generators and two discriminators

2 Related works

Image deblurring is a classical problem in image processing and computer vision. Existing approaches can be divided into learning-based methods and learning-free methods.

Among learning-free methods, most existing works assume that the blur is shift invariant and caused by motion [4, 21, 29], so that deblurring can be treated as a deconvolution problem [22, 29, 46, 53]. There are many ways to solve this problem; Liu et al. [29] used Bayesian estimation, that is,

$$\begin{aligned} p(I^\mathrm{sharp},k|I^\mathrm{blur})\propto P(I^\mathrm{blur}|I^\mathrm{sharp},k)P(I^\mathrm{sharp})P(k). \end{aligned}$$
(3)

One commonly used deblurring method is the maximum a posteriori (MAP) framework, where the latent sharp image \(I^\mathrm{sharp}\) and the blur kernel k can be obtained by [7],

$$\begin{aligned} \begin{aligned} \arg \max _{I^\mathrm{sharp}, k} P(I^\mathrm{blur}|I^\mathrm{sharp}, k)P(I^\mathrm{sharp})P(k). \end{aligned} \end{aligned}$$
(4)
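To make the link between Eq. (4) and the regularization-based methods discussed in this section explicit, the MAP problem can be rewritten as an energy minimization by taking the negative logarithm. The following is only a sketch under additional assumptions that are not part of the formulation above: the noise n in Eq. (2) is i.i.d. Gaussian with variance \(\sigma _n^2\), and the priors have the form \(P(I^\mathrm{sharp}) \propto \exp (-\lambda \phi (I^\mathrm{sharp}))\) and \(P(k) \propto \exp (-\gamma \psi (k))\), where \(\phi \), \(\psi \), \(\lambda \) and \(\gamma \) are illustrative symbols.

$$\begin{aligned} \arg \min _{I^\mathrm{sharp}, k} \frac{1}{2\sigma _n^2}\Vert I^\mathrm{sharp} * k - I^\mathrm{blur}\Vert _2^2 + \lambda \phi (I^\mathrm{sharp}) + \gamma \psi (k). \end{aligned}$$

Under these assumptions, the first term is the data fidelity implied by Eq. (2), and different MAP-based methods differ mainly in the choice of the regularizers \(\phi \) and \(\psi \).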

Chan et al. [3] used a robust total variation minimization method, which is effective for regularizing the edges of the sharp images. Zhang et al. [52] used a sparse coding method for sharp image recovery, which assumes that a natural image patch can be sparsely represented by an over-complete dictionary. Cai et al. [1] applied sparse representation to estimate the sharp image and the blur kernel at the same time. Krishnan et al. [23] found that, for many existing methods, the minimum of the loss function does not correspond to the true sharp image, so they used a normalized sparsity prior to tackle this problem. Michaeli et al. [31] found that multi-scale properties can also be used for blind deblurring, and they regard self-similarity as an image prior. Ren et al. [38] used a low rank prior for both raw pixels and their gradients.

Another common approach to estimating the motion blur process is to maximize the marginal posterior of the kernel, obtained by integrating the sharp image out of the joint posterior in Eq. (3):

$$\begin{aligned} p(k|I^\mathrm{blur})&= \int p(k, I^\mathrm{sharp}|I^\mathrm{blur})dI^\mathrm{sharp} \nonumber \\&\propto \int P(I^\mathrm{blur}|I^\mathrm{sharp}, k)P(I^\mathrm{sharp})P(k)dI^\mathrm{sharp}. \end{aligned}$$
(5)

Fergus et al. [9] proposed a motion deblurring method based on variational Bayes. Levin et al. [28] used an expectation–maximization (EM) method to estimate the blur process. These two approaches have some drawbacks: they are hard to optimize, time-consuming and cannot handle high-frequency features well.

Learning-based methods use deep learning techniques, which aim to find intrinsic features by themselves through the learning process. Deep learning [27] has boosted research in related fields such as image recognition [24] and image segmentation [13]. For deblurring with deep learning techniques, Kupyn et al. [25] trained a CNN architecture to learn the mapping function from blurred images to sharp ones, and Pan et al. [34] used a CNN architecture with a physics-based image prior to learn the mapping function.

Table 1 Model parameters of discriminators
Fig. 3 Generator network built in this work. We use an encoder–decoder network architecture with skip connections and a cycle-consistent training objective, which gives competitive results in image deblurring tasks

One of the novel deep learning techniques is the generative adversarial network (GAN), introduced by Goodfellow et al. [10] and inspired by the zero-sum game in game theory proposed by Nash et al. [33]. GANs have achieved many exciting results in image in-painting [50] and style transfer [16, 17, 54], and can even be used in other fields such as material science [40]. The system includes a generator and a discriminator. The generator tries to capture the latent real data distribution and to output new data sampled from it, while the discriminator tries to decide whether the input data come from the real data distribution or not. Both the generator and the discriminator can be built from convolutional neural networks [27] and trained based on the above ideas.

Instead of feeding only random noise to the generator as in the original generative adversarial nets [10], conditional GAN [6] feeds random noise together with discrete labels or even images [16]. Zhu et al. [54] went a step further, using conditional GAN with unlabeled data, which gives more realistic images in style transfer tasks. Inspired by this idea, Isola et al. [16] proposed one of the first image deblurring models based on generative adversarial nets [10].

While numerous learning-based methods have been proposed, most of them need paired training data [25, 32], which are hard to collect in practice, and the strong supervision of these methods may cause over-fitting.

3 Proposed method

The goal of blind image deblurring is to recover the sharp images given only the blurred images, with no information about the blurring process. We introduce a GAN-based model with a novel objective and training strategy to tackle this problem. The whole model architecture is shown in Fig. 2.

3.1 Model architecture

For the discriminator architecture, we use a slightly modified version of the PatchGAN architecture [16]; the model parameters are shown in Table 1. Instead of classifying the whole image as sharp or not sharp, a PatchGAN-based discriminator classifies each image patch of the whole image, which gives better results in image deblurring problems. Experiments show that a PatchGAN-based architecture can achieve good results if the image patches are a quarter of the size of the input images [16], so in this work we choose an image patch size of \(70\times 70\) in all experiments.
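The \(70\times 70\) patch size corresponds to the receptive field of one output unit of the discriminator. As a hedged illustration, the sketch below computes the receptive field of a stack of convolution layers; the layer list `PATCHGAN_LAYERS` is an assumption based on the commonly used PatchGAN configuration (three stride-2 and two stride-1 convolutions with \(4\times 4\) kernels) and may differ from the modified parameters in Table 1.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of (kernel, stride) conv layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        # Each layer widens the field by (kernel - 1) input steps at the current jump.
        rf += (kernel - 1) * jump
        # The jump is the distance in input pixels between adjacent units.
        jump *= stride
    return rf


# Assumed layer stack (kernel size, stride); not taken from Table 1.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
patch_size = receptive_field(PATCHGAN_LAYERS)
```

With this assumed stack, each discriminator output depends on a \(70\times 70\) region of the input image, so the discriminator scores local patches rather than the whole image.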

As the sharp images and the corresponding blurred images are similar in pixel values, it is more effective to judge membership of the blur domain and of the sharp domain separately, so we build two discriminators as shown in Table 1. We also report quantitative results (see Sect. 4 for more detail).

For the generator, Ronneberger et al. [39] used an encoder–decoder architecture and Kupyn et al. [25] used a ResBlock architecture for image deblurring. Our generator architecture is shown in Fig. 3. The network consists only of convolution and transpose convolution layers with instance normalization. After each convolution layer we apply leaky ReLU activation, and after each transpose convolution layer we apply ReLU activation. In the encoder part, each block consists of a downsampling convolution layer with stride 2, which halves the height and width and doubles the number of channels, \([H, W, C] \rightarrow [H/2, W/2, C\times 2]\). In the decoder part, each block inverts the effect of downsampling, \([H, W, C] \rightarrow [H \times 2, W \times 2, C /2]\). We use a filter size of \(4\times 4\) in all convolution and transpose convolution blocks.
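The shape bookkeeping of the encoder and decoder blocks can be traced with a short sketch. The function name `encoder_decoder_shapes`, the example input size and the depth of three blocks are hypothetical and chosen only for illustration; the actual layer configuration is the one given in Fig. 3 and Table 4.

```python
def encoder_decoder_shapes(height, width, channels, depth):
    """Trace [H, W, C] through `depth` encoder blocks and `depth` decoder blocks."""
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")
    shapes = [(height, width, channels)]
    h, w, c = height, width, channels
    for _ in range(depth):
        # Encoder block: stride-2 convolution halves H and W, doubles C.
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    for _ in range(depth):
        # Decoder block: transpose convolution doubles H and W, halves C.
        h, w, c = h * 2, w * 2, c // 2
        shapes.append((h, w, c))
    return shapes


# Hypothetical example: a 256x256 feature map with 64 channels and three blocks.
trace = encoder_decoder_shapes(256, 256, 64, depth=3)
```

The decoder returns the feature map to its original size, which is what allows skip connections between encoder and decoder blocks of matching resolution.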

Table 2 A quantitative evaluation on the effectiveness of cycle consistency

3.2 Training

Our goal is to learn the mapping function between the blur domain B and the sharp domain S given samples \(\left\{ \hbox {blur}_i \right\} ^M_{i=1}\) where \(\hbox {blur}_i\in B \) and \(\left\{ \hbox {sharp}_j \right\} ^N_{j=1}\) where \( \hbox {sharp}_j\in S \). A combination of the following losses is used as the objective:

$$\begin{aligned} {\mathcal {L}}(D_A,D_B,G_{B2S},G_{S2B}) = {\mathcal {L}}_\mathrm{adv} + \alpha {\mathcal {L}}_\mathrm{cycle}, \end{aligned}$$
(6)

where \({\mathcal {L}}\), \({\mathcal {L}}_\mathrm{adv}\), \({\mathcal {L}}_\mathrm{cycle}\) and \(\alpha \) are the total loss function, the adversarial loss, the cycle loss and the weight of the cycle loss, respectively. The adversarial loss tries to make the deblurred images as realistic as possible; the cycle loss tries to ensure that the deblurred images can be transferred back to the blur domain, which also helps to make the deblurred images as realistic as possible.

The two mapping functions \(G_{S2B} : I^\mathrm{sharp} \rightarrow I^\mathrm{blur}\) and \(G_{B2S} : I^\mathrm{blur} \rightarrow I^\mathrm{sharp}\) aim to transfer the sharp images to the blur domain and the blurred images to the sharp domain, respectively. The adversarial loss is as follows:

$$\begin{aligned} {\mathcal {L}}_\mathrm{adv} = {\mathcal {L}}_\mathrm{adv1} + {\mathcal {L}}_\mathrm{adv2}, \end{aligned}$$
(7)

where

$$\begin{aligned} {\mathcal {L}}_\mathrm{adv1}= & {} {\mathbb {E}}_{I^\mathrm{blur} \sim p_\mathrm{data}(I^\mathrm{blur})}[\log D_A(G_{B2S}(I^\mathrm{blur}))], \end{aligned}$$
(8)
$$\begin{aligned} {\mathcal {L}}_\mathrm{adv2}= & {} {\mathbb {E}}_{I^\mathrm{sharp} \sim p_\mathrm{data}(I^\mathrm{sharp})}[\log (1 - D_B(G_{S2B}(I^\mathrm{sharp})))],\nonumber \\ \end{aligned}$$
(9)

where the two discriminators \(D_A, D_B\) try to distinguish whether the input images are blurred or not and sharp or not, respectively. The generators \(G_{S2B}\) and \(G_{B2S}\) try to fool the discriminators by generating images of the target domain that are as realistic as possible.
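As a numerical illustration of Eqs. (7)–(9), the sketch below estimates the two expectations by averaging over a batch of discriminator outputs. It assumes that the discriminator outputs are probabilities in [0, 1]; the function name `adversarial_losses`, its argument names and the small constant `eps` (added for numerical stability) are hypothetical and not part of the formulation above.

```python
import math


def adversarial_losses(d_a_on_deblurred, d_b_on_reblurred, eps=1e-12):
    """Batch estimates of L_adv1 (Eq. 8), L_adv2 (Eq. 9) and their sum (Eq. 7).

    d_a_on_deblurred: values D_A(G_B2S(I_blur)) for a batch of blurred inputs.
    d_b_on_reblurred: values D_B(G_S2B(I_sharp)) for a batch of sharp inputs.
    """
    adv1 = sum(math.log(p + eps) for p in d_a_on_deblurred) / len(d_a_on_deblurred)
    adv2 = sum(math.log(1.0 - p + eps) for p in d_b_on_reblurred) / len(d_b_on_reblurred)
    return adv1, adv2, adv1 + adv2


# Example with undecided discriminators (all outputs equal to 0.5).
adv1, adv2, adv = adversarial_losses([0.5, 0.5], [0.5, 0.5])
```

In a real training loop these values would be computed on tensors by the deep learning framework; the sketch only shows how the terms in Eqs. (8) and (9) combine into Eq. (7).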

Table 3 A quantitative evaluation on the effectiveness of two discriminators
Table 4 Model parameters of generators
Table 5 Comparisons of our model with other four methods on the five images shown in Fig. 4

Isola et al. [16] and Zhu et al. [54] showed that least square loss [30] can perform better than mean square loss in style transfer tasks, and Kupyn et al. [25] used least square loss [30] for image deblurring tasks. Since it is not known in advance which objective performs better in image deblurring problems, mean square loss or least square loss [30], we conducted experiments to compare them; see Sect. 4 for more detail.

For the cycle loss, which aims to make the reconstructed images and the input images as close as possible under some measure, there are several classical choices: L1 loss, mean square loss, least square loss [30] and perceptual loss [17]. The experiments show that perceptual loss [17] can capture high-frequency features in the image deblurring task, which gives more texture and details, so perceptual loss is used in all experiments. The cycle loss is as follows:

$$\begin{aligned} {\mathcal {L}}_\mathrm{cycle} = {\mathcal {L}}_\mathrm{cycle1} + {\mathcal {L}}_\mathrm{cycle2}, \end{aligned}$$
(10)

where

$$\begin{aligned} {\mathcal {L}}_\mathrm{cycle1}= & {} \frac{1}{N^{(i,j)} M^{(i, j)}} \nonumber \\&\sum _{x=1}^{N^{(i,j)}} \sum _{y=1}^{M^{(i,j)}} (\sigma _{i,j}(I^\mathrm{sharp})_{x,y}-\sigma _{i,j}(G_{B2S}(I^\mathrm{blur}))_{x,y})^2,\nonumber \\ \end{aligned}$$
(11)
$$\begin{aligned} {\mathcal {L}}_\mathrm{cycle2}= & {} \frac{1}{N^{(i,j)} M^{(i, j)}} \nonumber \\&\sum _{x=1}^{N^{(i,j)}}\sum _{y=1}^{M^{(i,j)}}(\sigma _{i,j}(I^\mathrm{blur})_{x,y}-\sigma _{i,j}(G_{S2B}(I^\mathrm{sharp}))_{x,y})^2,\nonumber \\ \end{aligned}$$
(12)

where \(\sigma _{i,j}\) is the feature map obtained from the i-th max pooling layer after the j-th convolution layer of the VGG-19 network, and \(N^{(i,j)}, M^{(i,j)}\) are the dimensions of the corresponding feature map. The perceptual loss can capture high-level intrinsic features, which has been shown to work well in image deblurring [25] and some other image processing tasks [16, 54].
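The computation in Eqs. (11) and (12) reduces to a mean squared difference between two feature maps, which is then combined with the adversarial loss as in Eq. (6). The sketch below assumes that the VGG-19 feature maps have already been extracted and are given as two-dimensional lists; the function names `perceptual_mse` and `total_loss` are hypothetical, and the feature extraction itself is deliberately left out.

```python
def perceptual_mse(features_a, features_b):
    """Mean squared difference between two feature maps of identical N x M size."""
    n, m = len(features_a), len(features_a[0])
    if len(features_b) != n or any(len(row) != m for row in features_a + features_b):
        raise ValueError("feature maps must have the same N x M shape")
    total = 0.0
    for row_a, row_b in zip(features_a, features_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
    return total / (n * m)


def total_loss(adv, cycle1, cycle2, alpha):
    """Eq. (6): adversarial loss plus the alpha-weighted cycle loss of Eq. (10)."""
    return adv + alpha * (cycle1 + cycle2)


# Toy feature maps standing in for the outputs of the feature extractor.
feat_target = [[1.0, 2.0], [3.0, 4.0]]
feat_generated = [[1.0, 2.0], [3.0, 6.0]]
cycle1 = perceptual_mse(feat_target, feat_generated)
```

Because the comparison is made in feature space rather than pixel space, two images can differ pixel by pixel and still incur a small loss if their textures and structures match.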

In summary, we aim to optimize the following objective function:

$$\begin{aligned} G_{B2S}^* = \arg \max _{D_A,D_B} \min _{G_{B2S}, G_{S2B}} {\mathcal {L}}(D_A,D_B,G_{B2S},G_{S2B}) \end{aligned}$$
(13)

We train the network with a batch size of 2 for 100 epochs over the training data. The reconstructed images are regularized with the cycle-consistent objective with a strength of 10. No dropout is used since the model does not overfit within 100 epochs.

For the optimization procedure, we perform ten steps on \(G_{S2B}\) and \(G_{B2S}\), and then one step on \(D_A\) and \(D_B\). We use the Adam [19] optimizer with a learning rate of \(2 \times 10^{-3}\) in the first 80 epochs, and then linearly decay the learning rate to zero over the remaining epochs to ensure convergence. The whole training process is shown in Fig. 2.
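The schedule described above can be sketched in a few lines. The function names `learning_rate` and `update_schedule`, the 0-based epoch index and the exact form of the linear decay (reaching zero at epoch 100) are assumptions made for illustration; only the constants (base rate, 80 constant epochs, 100 epochs in total, ten generator steps per discriminator step) come from the text.

```python
def learning_rate(epoch, base_lr=2e-3, constant_epochs=80, total_epochs=100):
    """Constant rate for the first epochs, then linear decay to zero (0-based epoch)."""
    if epoch < constant_epochs:
        return base_lr
    remaining = max(total_epochs - epoch, 0)
    return base_lr * remaining / (total_epochs - constant_epochs)


def update_schedule(num_rounds, generator_steps=10):
    """Yield 'G' for each generator update and 'D' for each discriminator update."""
    for _ in range(num_rounds):
        for _ in range(generator_steps):
            yield "G"  # one step on G_S2B and G_B2S
        yield "D"      # one step on D_A and D_B


rates = [learning_rate(e) for e in (0, 79, 80, 90, 100)]
first_round = list(update_schedule(1))
```

In practice, such a schedule would typically be passed to the training framework as a callback or scheduler object rather than evaluated by hand.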

Fig. 4 Some visual comparison results of our model and four other approaches. From left to right: blurred images, the results by Pan et al. [37], Pan et al. [35], Xu et al. [48], Kupyn et al. [25] and the results produced by the proposed method

The key point is to train the model in one scene for a given number of epochs, and then move to another scene. When the input is from the blur domain, the starting point and the cycle training process are on the left-hand side of Fig. 2. When the input is from the sharp domain, the starting point and the cycle training process are on the right-hand side. Notice that there is only one model during training; for inputs from different domains, only the starting point of the model differs.

4 Experiments

We implement our model with the Keras [5] library. All the experiments are performed on a workstation with an NVIDIA Tesla K80 GPU.

4.1 Network analysis

Cycle consistency. Cycle consistency ensures that the deblurred images can be transferred to the blur domain and that blurred images can be transferred back to the sharp domain, which makes sure that our model learns what “blur” means and gives more realistic results. We report a quantitative result in Table 2 to demonstrate the advantage of using cycle consistency.

Skip connections. Skip connections are widely used to combine different levels of information, which can also benefit back propagation. Inspired by Ronneberger et al. [39] and the success of skip connections, we use a U-net-based architecture with skip connections, as shown in Fig. 3.

Table 6 Quantitative evaluations for our model in GoPRO dataset using different evaluations

Motion blur generation. Kupyn et al. [25] proposed a method that generates realistic random motion trajectories. We use this method to generate the blur kernels used to blur images.

Architecture selection. We conduct experiments, reported in Table 3, to demonstrate the effectiveness of using two discriminators. The results show that using two discriminators can significantly improve performance.

We also conduct experiments to find the optimal choice of generator and training objective, reported in Table 4. The results show that, for the image deblurring task, the optimal generator architecture is the U-net-based architecture, and the optimal optimization objective is the least square loss. The generator architectures are shown in Fig. 3 and Table 4.

4.2 Results analysis

4.2.1 GoPRO dataset

The proposed GoPRO dataset consists of 21,000 images, including 11,000 blurred images and 10,000 sharp images. We use 20,000 images for training and 1,000 images for testing.

For image evaluation, most works use the full-reference measurements PSNR and SSIM in all their experiments [34, 35, 36, 37]. Tofighi et al. [45] used SNR and ISNR for evaluation. Among other image assessments, VIF [41] captures wavelet features, which focus on high-frequency content, and IFC [42] puts more weight on edge features. Lai et al. [26] pointed out that the full-reference image assessments VIF and IFC are better than PSNR and SSIM.
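For reference, PSNR follows the standard definition \(10\log _{10}(\mathrm{MAX}^2/\mathrm{MSE})\). The sketch below is a minimal pure-Python version operating on flattened pixel sequences; the function name `psnr` and the default peak value of 255 (8-bit images) are assumptions for illustration, and this is not the evaluation code used to produce the tables in this section.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 25.5 gray levels on an 8-bit scale.
score = psnr([0.0] * 4, [25.5] * 4)
```

Since PSNR depends only on the pixel-wise mean squared error, it is insensitive to where the error occurs, which is one reason why measures such as VIF and IFC are reported alongside it.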

For a fair comparison, we choose the learning-free methods proposed by Pan et al. [37] and Pan et al. [35] and the learning-based methods proposed by Xu et al. [48] and Kupyn et al. [25].

Some example results are shown in Table 5 and Fig. 4, with the salient regions marked in each image. We also report the average quantitative results on this dataset in Table 6. We observe that our method outperforms many competitive methods in PSNR, SSIM, MS-SSIM, IFC and VIF. We also observe that our method recovers more textures, sharper edges and more background detail, with fewer artifacts.

Table 7 Quantitative evaluations of three methods in terms of PSNR and SSIM on the Köhler et al. [20] dataset

4.2.2 Köhler dataset

This dataset consists of four ground-truth images and 12 blurry images for each of them. The blurs are produced by replaying recorded 6D camera motion. We report the quantitative results on this dataset in comparison with Tofighi et al. [45] in Table 7. From this table, we can see that our method performs better than that of Tofighi et al. [45].

Table 8 Quantitative evaluations for our model in dataset of Sun et al. [43]
Table 9 Quantitative evaluations for our model in dataset of Wieschollek et al. [47]

4.2.3 Dataset of Sun et al.

This dataset consists of 640 images generated from 80 natural images and eight blur kernels. We add 1% Gaussian noise as done in Xu et al. [49]. We report the quantitative results on this dataset in comparison with Xu et al. [49] in terms of average PSNR in Table 8. The method proposed by Xu et al. [49] suppresses extraneous textures while enhancing salient edges during training, which gives better PSNR results.

4.2.4 Dataset of Wieschollek et al.

In the method of Wieschollek et al. [47], the authors use 720p high-resolution video from YouTube to generate the dataset. For a fair comparison, we report the quantitative results on this dataset in comparison with Wieschollek et al. [47] in Table 9. The method proposed by Wieschollek et al. [47] uses a recurrent neural network architecture with multi-scale paired input, which can achieve state-of-the-art performance on video deblurring and burst deblurring problems.

Table 10 Performance of different generator architectures dealing with deblurring problem in GoPRO dataset

5 Conclusions

In this paper, we provide an unsupervised method for the blind motion deblurring problem. We build the network with a cycle training strategy. We use two discriminators to distinguish separately whether the input is blurred or not and sharp or not, which performs better in image deblurring tasks. We show that an encoder–decoder-based architecture gives better results. For the optimization objective, least square loss performs better than mean square loss.

During training, the experiments show that this model can deal with the image deblurring task well without being given any domain-specific knowledge. It can recover more high-frequency textures and details, and it outperforms many competitive methods not only in many different image quality assessments but also in human visual evaluation (Table 10).

We show that the key point is to train the model in one scene for a given number of epochs, and then move to another scene, which can ensure that the model learns exactly what “blur” and “sharp” mean.

We also show that our model can handle blur caused by motion or camera shake; the recovered image has fewer artifacts compared to many existing methods. We conduct extensive experiments on three other datasets and report quantitative results.