1 Introduction

Recently, deep neural networks trained with adversarial learning have become a prevalent technique in generative image modelling and have made remarkable advances. In topics such as image super-resolution, in-painting, synthesis, and image-to-image translation, numerous adversarial-learning-based methods already demonstrate the effectiveness of GANs in generating realistic, plausible and conceptually convincing images [8, 14, 21, 25,26,27, 29].

In this paper we address image enhancement problems such as blind single-image deblurring by casting them as a special case of image-to-image translation under the adversarial learning framework. A straightforward way to realise quality improvement in image restoration is to incorporate image quality measures as constraints when training GANs. The objective function of a GAN defines the gradient scale and direction for network optimization. The adversarial loss, an indispensable component, encourages generated images to be as realistic as possible. However, it does not fully recover the details and textures in generated images, which are critical for the human visual system to perceive image quality. Image quality measures that compensate for these overlooked perceptual features are therefore needed to help guide gradient optimization during training.

An image quality based loss is therefore proposed and added to the objective function of GANs. Three common quality measures can be adopted; we investigate their effects on restored image quality, compared with a baseline model trained without any quality loss. The rest of this paper is structured as follows. Section 2 describes related work. Section 3 introduces the proposed method, followed by experimental settings, results and discussion in Sect. 4. Section 5 concludes the findings and suggests possible future work.

2 Related Work

2.1 Generative Adversarial Networks

A GAN consists of a generative model and a discriminative model. The two models are trained simultaneously by means of adversarial learning, a process that significantly improves generation performance. Adversarial learning sets up a competition between the generator and the discriminator: the generator is trained to produce increasingly convincing fake samples to fool the discriminator, until they are indistinguishable from real samples.

For the standard GAN (a.k.a. the vanilla GAN) proposed by Goodfellow et al. [6], the generator G receives noise as input and generates fake samples from the model distribution \(p_{g}\), while the discriminator D classifies whether its input is real. A great number of GAN variants have been proposed since, such as the conditional GAN (cGAN) [17], least squares GAN (LSGAN) [16], Wasserstein GAN (WGAN) [1], and Wasserstein GAN with gradient penalty (WGAN-GP) [7].

2.2 Image Deblurring

Image deblurring is a perennial and challenging problem in image processing; its aim is to recover clean, sharp images from degraded observations. The recovery process often utilises image statistics and prior knowledge of the imaging system and degradation process, and adopts a deconvolution algorithm to estimate the latent image. In practical situations, however, prior knowledge of the degradation model is generally unavailable; this case is categorized as blind image deblurring (BID). Most conventional BID algorithms make estimations from image statistics and heuristics. Fergus et al. [5] proposed a spatial-domain prior assuming a uniform camera blur kernel and camera rotation. Li et al. [24] created a maximum-a-posteriori (MAP) based framework and adopted an iterative approach for motion deblurring. Recent approaches have turned to deep learning for improved performance. Xu et al. [23] adapted the convolutional kernels in convolutional neural networks (CNNs) to blur kernels. Schuler et al. [20] built stacked CNNs that pack feature extraction, kernel estimation and image estimation modules. Chakrabarti [3] proposed to predict the complex Fourier coefficients of motion kernels using neural networks.

2.3 Image Quality Measures

Image quality assessment (IQA) is a critical and necessary step to provide quantitative objective measures of visual quality for image processing tasks. IQA methods have been an important and active research topic. Here we focus on four commonly used IQA methods: PSNR, SSIM, FSIM and GMSD.

Peak signal-to-noise ratio (PSNR) is a simple signal fidelity measure defined from the ratio between the maximum possible pixel value of the image and the mean squared error (MSE) between the distorted and reference images, expressed on a logarithmic (decibel) scale.
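
For concreteness, a minimal NumPy sketch of PSNR is given below (the function name and the 8-bit dynamic range are our own illustrative choices):

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * np.log10(max_val ** 2 / mse)
```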

Structural similarity index measure (SSIM) treats image quality degradation as a perceived change of structural information in the image. Since structural information is independent of luminance and contrast [22], the SSIM index combines three relatively independent terms: the luminance comparison l(x, y), the contrast comparison c(x, y) and the structure comparison s(x, y). The measure is computed on local patches of two aligned images, because luminance and contrast vary across the image. To avoid blocking artifacts in the resulting SSIM index map, an \(11\times 11\) circular-symmetric Gaussian weighting function is applied before computation. The patch-based SSIM index is defined in Eq. 1, while for the entire image it is common to use the mean SSIM (MSSIM) as the overall quality metric (Eq. 2).

$$\begin{aligned} \begin{aligned} SSIM(x,y)&= l(x,y) \cdot c(x,y) \cdot s(x,y) \\&=\dfrac{(2\mu _{x} \mu _{y}+c_{1})(2\sigma _{x y}+c_{2})}{(\mu _{x}^2+\mu _{y}^2+c_{1})(\sigma _{x}^2+\sigma _{y}^2+c_{2})} \end{aligned} \end{aligned}$$
(1)
$$\begin{aligned} MSSIM(X,Y)=\frac{1}{M} \sum _{m=1}^M SSIM(x_m,y_m) \end{aligned}$$
(2)

where x and y are two local windows from the two aligned images X and Y, \(\mu _{x}\) and \(\mu _{y}\) are the means of x and y, \(\sigma _{x}^2\) and \(\sigma _{y}^2\) are their variances, and \(\sigma _{x y}\) is the covariance of x and y. The constants \(c_{1}\) and \(c_{2}\) are conventionally set to 0.0001 and 0.0009 (for pixel values scaled to [0, 1]) to stabilize the division. M is the total number of windows.
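
The following sketch illustrates Eqs. 1 and 2 for grayscale images scaled to [0, 1], approximating the \(11\times 11\) Gaussian window with scipy's Gaussian filter at \(\sigma = 1.5\); it is a simplified reading of the measure, not a reference implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mssim(x: np.ndarray, y: np.ndarray, c1: float = 1e-4, c2: float = 9e-4) -> float:
    """Mean SSIM (Eq. 2) for grayscale images in [0, 1]."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = gaussian_filter(x, 1.5), gaussian_filter(y, 1.5)
    var_x = gaussian_filter(x * x, 1.5) - mu_x ** 2      # local variances
    var_y = gaussian_filter(y * y, 1.5) - mu_y ** 2
    cov_xy = gaussian_filter(x * y, 1.5) - mu_x * mu_y   # local covariance
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))  # Eq. 1
    return float(ssim_map.mean())                        # Eq. 2
```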

Feature similarity index measure (FSIM) is based on the similarity of salient low-level visual features, i.e. phase congruency (PC). High PC indicates highly informative features, where the Fourier waves at different frequencies have congruent phases [28]. To compensate for the contrast information to which the primary feature PC is invariant, the gradient magnitude is added as a secondary feature for computing the FSIM index.

First, the PC map of an image is computed by generalizing the method proposed in [12] from 1-D signals to 2-D grayscale images by applying a Gaussian spreading function. A 2-D log-Gabor filter extracts a quadrature pair of even-symmetric and odd-symmetric filter responses \([e_{n,\theta _j} (a), o_{n, \theta _j} (a)]\) at pixel a on scale n. Its transfer function is formulated as follows,

$$\begin{aligned} G(\omega , \theta _j) = \exp \left( - \frac{(\log (\frac{\omega }{\omega _0}))^2}{2\sigma _r^2}\right) \cdot \exp \left( - \frac{(\theta -\theta _j)^2}{2\sigma _\theta ^2}\right) \end{aligned}$$
(3)

where \(\omega \) represents the frequency, \(\theta _j=\frac{j\pi }{J}\) (\(j=\{ 0,1,\dots ,J-1\}\)) represents the orientation angle of the filter, and J is the number of orientations. \(\omega _0\) is the filter center frequency, \(\sigma _r\) is the filter bandwidth, and \(\sigma _\theta \) is the filter angular bandwidth. The PC at pixel a is then defined as,

$$\begin{aligned} PC(a) = \frac{\sum _j E_{\theta _j} (a)}{\epsilon + \sum _n \sum _j A_{n,\theta _j}(a)} \end{aligned}$$
(4)
$$\begin{aligned}&E_{\theta _j} (a) = \sqrt{F_{\theta _j}(a)^2 + H_{\theta _j} (a)^2} \end{aligned}$$
(5)
$$\begin{aligned} F_{\theta _j}(a)&= \sum _n e_{n,\theta _j} (a), \; H_{\theta _j} (a) = \sum _n o_{n,\theta _j} (a) \end{aligned}$$
(6)
$$\begin{aligned} A_{n,\theta _j}(a) = \sqrt{e_{n,\theta _j} (a)^2 + o_{n, \theta _j} (a)^2} \end{aligned}$$
(7)

where \(E_{\theta _j} (a)\) is the local energy function along orientation \(\theta _j\) and \(\epsilon \) is a small positive constant that avoids division by zero.

Gradient magnitude (GM) computation follows the traditional definition: partial derivatives \(G_h (a)\) and \(G_v (a)\) along the horizontal and vertical directions are computed using gradient operators, and GM is defined as \(G(a)=\sqrt{G_h (a) ^2+ G_v (a) ^2}\).

To calculate the FSIM index between X and Y, the PC and GM similarity measures between the two images are computed as follows,

$$\begin{aligned} S_{PC} (a)&= \frac{2PC_X(a) \cdot PC_Y (a) + T_1}{PC_X^2 (a) + PC_Y^2 (a) +T_1}\end{aligned}$$
(8)
$$\begin{aligned} S_G (a)&= \frac{2G_X(a) \cdot G_Y (a) + T_2}{G_X^2 (a) + G_Y^2 (a) +T_2}\end{aligned}$$
(9)
$$\begin{aligned} S_L (a)&= S_{PC} (a) \cdot S_G (a) \end{aligned}$$
(10)

where \(T_1\) and \(T_2\) are positive constants depending on dynamic range of PC and GM values respectively. Based on similarity measure \(S_L (a)\), the FSIM index is defined as,

$$\begin{aligned} FSIM(X,Y) = \frac{\sum _{a \in \varOmega } S_L (a) \cdot PC_m (a)}{\sum _{a \in \varOmega } PC_m (a)} \end{aligned}$$
(11)

where \(PC_m (a) = \max (PC_X (a), PC_Y (a))\) weights the local similarity between X and Y by its perceptual importance, and \(\varOmega \) is the entire spatial domain of the image. As introduced in [28], \(FSIM_c\) extends the index to colour images by incorporating chromatic information.
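
Assuming the PC maps (Eqs. 3-7) and GM maps have already been computed, the pooling of Eqs. 8-11 reduces to a few array operations. A sketch follows; the constants \(T_1\) and \(T_2\) use the defaults suggested in [28]:

```python
import numpy as np

def fsim_pool(pc_x, pc_y, g_x, g_y, t1: float = 0.85, t2: float = 160.0) -> float:
    """FSIM index (Eq. 11) from precomputed PC and GM maps of images X and Y."""
    s_pc = (2 * pc_x * pc_y + t1) / (pc_x ** 2 + pc_y ** 2 + t1)  # Eq. 8
    s_g = (2 * g_x * g_y + t2) / (g_x ** 2 + g_y ** 2 + t2)       # Eq. 9
    s_l = s_pc * s_g                                              # Eq. 10
    pc_m = np.maximum(pc_x, pc_y)                                 # pooling weight
    return float((s_l * pc_m).sum() / pc_m.sum())                 # Eq. 11
```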

Gradient magnitude similarity deviation (GMSD) derives a quality measure mainly from feature properties in the image gradient domain: it calculates the standard deviation of the gradient magnitude similarity map. The Prewitt filter is commonly adopted as the gradient operator. As for the FSIM index, the GM similarity measure is first computed using Eq. 9; the difference lies in the pooling step, given in Eq. 12. A smaller GMSD indicates higher perceptual image quality.

$$\begin{aligned} GMSD(X,Y) = \sqrt{\frac{1}{N} \sum _{a\in \varOmega } (S_G (a) - \overline{S_G})^2} \end{aligned}$$
(12)

where N is the total number of pixels in the image and \(\overline{S_G}=\frac{1}{N} \sum _{a \in \varOmega } S_G(a)\) is the mean of the GM similarity map.
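
A compact sketch of GMSD with Prewitt gradients is shown below; the constant c in the similarity term is a positive stabilizer analogous to \(T_2\), and its value here is only indicative:

```python
import numpy as np
from scipy.ndimage import convolve

def gmsd(x: np.ndarray, y: np.ndarray, c: float = 170.0) -> float:
    """GMSD (Eq. 12) for grayscale images, using Prewitt gradient operators."""
    ph = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=np.float64) / 3.0
    pv = ph.T
    g_x = np.hypot(convolve(x, ph), convolve(x, pv))       # GM map of X
    g_y = np.hypot(convolve(y, ph), convolve(y, pv))       # GM map of Y
    s_g = (2 * g_x * g_y + c) / (g_x ** 2 + g_y ** 2 + c)  # Eq. 9
    return float(s_g.std())  # population standard deviation = Eq. 12
```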

3 The Proposed Method

We propose modified GAN models that blindly restore sharp latent images with better quality from single blurred images. The quality improvement of restored images is realized by adding a quality loss to the training objective function. We compare three quality losses, based on SSIM, FSIM and MSE respectively, and apply each to two types of GAN model, LSGAN and WGAN-GP.

3.1 Loss Function

For simplicity, we first define variables and terms as follows: batch size m, input blurred image samples \(\{ {I_B^{(i)}} \} ^m_{i=1}\), restored image samples \(\{ {I_R^{(i)}} \} ^m_{i=1}\), and original sharp image samples \(\{ {I_S^{(i)}} \} ^m_{i=1}\). The adversarial loss \(\mathcal {L}_\mathrm {ad}\), content loss \(\mathcal {L}_\mathrm {X}\) and quality loss \(\mathcal {L}_\mathcal {Q}\) are defined below. Adversarial Loss. For LSGAN,

$$\begin{aligned} G: \mathcal {L}_\mathrm {ad}&= \frac{1}{m} \sum _{i=1}^{m} \frac{1}{2} \, (D(G(I_B^{(i)}))-1)^2 \end{aligned}$$
(13)
$$\begin{aligned} D: \mathcal {L}_\mathrm {ad}&= \frac{1}{m} \sum _{i=1}^{m} \frac{1}{2} \, [(D(I_S^{(i)})-1)^2 + D(G(I_B^{(i)}))^2] \end{aligned}$$
(14)

For WGAN-GP,

$$\begin{aligned} \mathcal {L}_\mathrm {ad} = \frac{1}{m} \sum _{i=1}^{m} \big [ D(I_S^{(i)})-D(G(I_B^{(i)}))+ \lambda \, (\Vert \nabla _{\tilde{x}} D({\tilde{x}}) \Vert - 1)^2 \big ] \end{aligned}$$
(15)
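
The gradient penalty term in Eq. 15 is evaluated at random interpolates \(\tilde{x}\) between sharp and restored samples, following [7]. A PyTorch sketch is given below (with \(\lambda = 10\) as in [7]; the variable names are ours):

```python
import torch

def gradient_penalty(critic, real, fake, lam: float = 10.0) -> torch.Tensor:
    """Penalty term of Eq. 15 at interpolates between real and fake batches."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = critic(x_hat)
    grads, = torch.autograd.grad(outputs=d_out, inputs=x_hat,
                                 grad_outputs=torch.ones_like(d_out),
                                 create_graph=True)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```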

Content Loss. \(\mathcal {L}_\mathrm {X}\) is an \(L_{2}\) loss based on the difference between the VGG19 feature maps of the generated image and the sharp image. As proposed in [9], the VGG19 network is pretrained on ImageNet [4]. \(\mathcal {L}_\mathrm {X}\) is formulated as,

$$\begin{aligned} \mathcal {L}_\mathrm {X} = \frac{1}{W_{j,k}H_{j,k}} \sum _{x=1}^{W_{j,k}} \sum _{y=1}^{H_{j,k}} (\phi _{j,k}(I_S^{(i)})_{x,y} - \phi _{j,k}(G(I_B^{(i)}))_{x,y})^2 \end{aligned}$$
(16)

where \(\phi _{j,k}\) is the feature map of the k-th convolution before the j-th max-pooling layer in the VGG19 network, and \(W_{j,k}\) and \(H_{j,k}\) are the dimensions of the feature maps.
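
A sketch of the content loss in PyTorch follows; the cut-off index selecting \(\phi _{j,k}\) is an assumption on our part (here the conv3_3 activation), and input normalization is omitted for brevity:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

class ContentLoss(torch.nn.Module):
    """L_X of Eq. 16: MSE between frozen VGG19 feature maps."""
    def __init__(self, cutoff: int = 15):  # features[:15] ends at conv3_3 (assumed)
        super().__init__()
        self.features = vgg19(pretrained=True).features[:cutoff].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)  # VGG19 stays fixed during training

    def forward(self, restored: torch.Tensor, sharp: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.features(restored), self.features(sharp))
```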

Quality Loss. Based on SSIM and FSIM, quality losses are defined as in Eqs. 17 and 18. In addition, we experiment with an MSE-based quality loss (Eq. 19), computed between \(I_R^{(i)}\) and \(I_S^{(i)}\), which we name the pixel loss.

$$\begin{aligned} \text {SSIM loss:} \quad \mathcal {L}_\mathcal {Q}&= 1 - SSIM(I_R^{(i)},I_S^{(i)})\end{aligned}$$
(17)
$$\begin{aligned} \text {FSIM loss:} \quad \mathcal {L}_\mathcal {Q}&= 1 - FSIM(I_R^{(i)},I_S^{(i)})\end{aligned}$$
(18)
$$\begin{aligned} \text {Pixel loss:} \quad \mathcal {L}_\mathcal {Q}&= MSE(I_R^{(i)},I_S^{(i)}) \end{aligned}$$
(19)

Combining the adversarial loss \(\mathcal {L}_\mathrm {ad}\), content loss \(\mathcal {L}_\mathrm {X}\) and image quality loss \(\mathcal {L}_\mathcal {Q}\), the overall loss function is formulated as,

$$\begin{aligned} \mathcal {L} = \mathcal {L}_\mathrm {ad} +100 \mathcal {L}_\mathrm {X} + \mathcal {L}_\mathcal {Q} \end{aligned}$$
(20)
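
Putting the pieces together, the generator objective can be sketched as follows; here ssim and fsim stand for differentiable implementations of the measures described above and are assumed given:

```python
import torch.nn.functional as F

def quality_loss(kind: str, restored, sharp):
    """Quality term L_Q, one of Eqs. 17-19."""
    if kind == "ssim":
        return 1.0 - ssim(restored, sharp)   # Eq. 17
    if kind == "fsim":
        return 1.0 - fsim(restored, sharp)   # Eq. 18
    return F.mse_loss(restored, sharp)       # Eq. 19, pixel loss

def total_loss(adv, content, quality):
    """Overall objective of Eq. 20 with the content term weighted by 100."""
    return adv + 100.0 * content + quality
```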

3.2 Network Architecture

We adopted the network architecture proposed in [13]. The generator has two strided convolution blocks, nine residual blocks and two transposed convolution blocks. Each residual block consists of a convolution layer, an instance normalization layer and ReLU activation, with dropout regularization at a rate of 50%. A global skip connection lets the network learn a residual image, which is added to the input image to constitute the final restored image \(I_R\). The discriminator is a \(70\times 70\) PatchGAN [8] containing four convolutional layers, each followed by BatchNorm and LeakyReLU with \(\alpha = 0.2\), except for the first layer.
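
A sketch of one residual block under this description is shown below; the channel width (256) is an illustrative assumption, not a value taken from [13]:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block as described above: conv + instance norm + ReLU + 50% dropout."""
    def __init__(self, channels: int = 256):  # channel width assumed
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
        )

    def forward(self, x):
        return x + self.body(x)  # local skip connection
```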

4 Experiments

4.1 Datasets

The training dataset was sampled from the train set of the Microsoft Common Objects in COntext (MS COCO) dataset [15], which contains over 330,000 images covering 91 common object categories in natural context. We adopted the method in [2] to synthesize motion blur kernels. The kernel size was set to \(31\times 31\), and the motion parameters followed the default settings in the original paper. In total, we generated 250 kernels to randomly blur MS COCO images. We randomly selected 6000 images from the MS COCO train set for training and 1000 from the test set for evaluation. Trained models were additionally tested on two other datasets, the GoPro dataset [18] and the Kohler dataset [11].
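
Kernel synthesis follows [2] and is not reproduced here; given a generated kernel, blurring an image reduces to a per-channel convolution, as in this sketch:

```python
import numpy as np
from scipy.signal import convolve2d

def blur(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Blur an HxWx3 image with a 31x31 motion kernel."""
    kernel = kernel / kernel.sum()  # normalize to preserve overall brightness
    return np.stack([convolve2d(image[..., c], kernel, mode="same", boundary="symm")
                     for c in range(image.shape[-1])], axis=-1)
```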

The GoPro dataset has 3214 pairs of realistic blurry images and their sharp versions at \(1280\times 720\) resolution. The images come from 240 fps video sequences captured with a GoPro Hero 4 camera in various daily and natural scenes; blurry images are produced by averaging varying numbers of consecutive frames, synthesizing motion blur of varying degrees. This is a common benchmark for image motion deblurring. We randomly selected 1000 pairs for evaluation.

The Kohler dataset contains four original images and 48 blurred images, generated by applying 12 approximations of human camera shake to each original image. The dataset is also considered a benchmark for the evaluation of blind deblurring algorithms.

4.2 Implementation

We performed experiments using PyTorch [19] on an Nvidia Titan V GPU. All images were scaled to \(640 \times 360\) and randomly cropped to patches of size \(256 \times 256\). Networks were optimized with the Adam solver [10]; the initial learning rate was \(10^{-4}\) for both generator and critic. For LSGAN models, the learning rate remained unchanged for the first 150 epochs and linearly decayed to zero over the remaining 150 epochs; training took around 6 days. For WGAN-GP models, the learning rate was maintained for 50 epochs and then linearly decreased to zero over another 50 epochs; training took around 3 days to converge.
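
The LSGAN schedule (constant for 150 epochs, then linear decay to zero) can be expressed with a standard PyTorch scheduler; generator stands for the model defined in Sect. 3.2:

```python
import torch

optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)
# Factor 1.0 for the first 150 epochs, then linear decay to 0 at epoch 300.
decay = lambda epoch: 1.0 if epoch < 150 else max(0.0, 1.0 - (epoch - 150) / 150.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=decay)
# scheduler.step() is called once at the end of each epoch.
```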

4.3 Results and Analysis

We refer to the model trained without a quality loss as the baseline model. Evaluation metrics include PSNR, SSIM, FSIM and GMSD. Quantitative performances are given in Tables 1, 2 and 3; examples of resulting images are shown in Figs. 1, 2 and 3.

MS COCO Dataset. From Table 1 we observe that the WGAN-GP model with the SSIM loss performs best on all four measures, with the FSIM-loss variant achieving comparable values. The significant improvements over the baseline model, which includes no quality loss, demonstrate the usefulness of the SSIM and FSIM losses. In the examples shown in Fig. 1, images restored with the SSIM and FSIM losses contain visibly more detail and also score better on the quantitative metrics than their counterparts.

Table 1. Model performance evaluation measures averaged on 1000 images of the MS COCO dataset.

GoPro Dataset. Performance is similar to that on the MS COCO dataset, even though training used only synthetically blurred MS COCO images. WGAN-GP again gives the better performance: in terms of PSNR and SSIM, the WGAN-GP model with the SSIM loss ranks first, while the FSIM loss yields better results on the FSIM and GMSD metrics.

Fig. 1. Results generated by the WGAN-GP model with various loss functions on the MS COCO dataset.

Table 2. Model performance evaluation measures averaged on 1000 images of the GoPro dataset.

Fig. 2. Results generated by the WGAN-GP model with different loss functions on the GoPro dataset.

Kohler Dataset. Compared with the results on the above two datasets, the results in Table 3 are generally lower. Since the images in the Kohler dataset approximate real human camera shake, models trained on synthetically blurred images generalize to them only to a limited extent. Nevertheless, the SSIM and FSIM losses still demonstrate their effectiveness in improving image quality, as shown in Table 3 and Fig. 3, even though the example in Fig. 3 is a challenging one to restore.

Table 3. Model performance evaluation measures averaged on 48 images of the Kohler dataset.

Fig. 3. Results generated by the WGAN-GP model with different loss functions on the Kohler dataset.

As Tables 1, 2 and 3 show, the quantitative results indicate that image quality based loss functions are effective components for further improving the quality of images generated by GANs. Among the three losses, models trained with the SSIM loss and the FSIM loss perform comparably and generate the best results, ahead of the baseline model and the model trained with the pixel loss. Experimentation on three different datasets with two types of GAN model demonstrates the effectiveness of including such quality losses.

In visual comparisons, models with the SSIM or FSIM loss restore images with better texture details and edges. However, a careful look at the generated image patches shows that the SSIM loss produces window artifacts, while models with the FSIM loss produce smoother details when zoomed in. This is because the SSIM loss is computed over local windows of the image, whereas the FSIM loss is computed pixel by pixel. In the images generated by models trained with the pixel loss, details remain blurred, illustrating that an L2 loss in the spatial domain contributes little to image quality improvement. Overall, compared with the FSIM loss, the SSIM loss has the advantage of computational efficiency and stable performance across quantitative metrics and visual quality.

It is also notable that WGAN-GP generates better results than LSGAN and converges faster. However, training a WGAN-GP model is more difficult: during our experimentation, its training diverged more often than LSGAN's. Parameter tuning therefore becomes a crucial step when training WGAN-GP, and finding a feasible model structure and network parameters is very time-consuming.

5 Conclusion

In this paper, we tackled the problem of image deblurring within the framework of adversarial learning. Losses based on image quality measures were proposed as additional components of the training objective of GAN models. Experimental results on several benchmark datasets demonstrate the effectiveness of adding such image quality losses and their potential for improving the quality of generated images.

For future work, the training data could include more diverse datasets to improve the generalization ability of the network. The weightings of the various losses in the overall objective have not yet been fine-tuned; further experiments could examine whether tuning these parameters improves performance. Moreover, given the flexibility and adaptability of these image quality losses, applying them to other image enhancement and restoration tasks would be worth investigating.