1 Introduction

Motion blur is one of the most common types of image blurring. Long exposure times combined with fast-moving objects or camera shake cause motion blur in the captured image, which degrades its visual quality and hinders both the transmission of image information and post-processing [1, 2]. In computer vision, motion blur reduces the accuracy and efficiency of image recognition and classification. Therefore, the restoration of motion-blurred images is of great significance. Most early blurred-image restoration approaches are based on the following blur model [3,4,5,6]:

$$\begin{array}{*{20}c} {I_{\text{B}} = K*I_{\text{S}} + N} \\ \end{array}$$
(1)

where \(I_{\text{B}}\), \(I_{\text{S}}\), \(K\), and \(N\) are the blurred image, the latent sharp image, the blur kernel, and noise, respectively, and \(*\) denotes the convolution operation. The blurred image can thus be seen as the convolution of the sharp image with a blur kernel, followed by the addition of random noise. Image restoration algorithms can be divided into blind and non-blind restoration depending on whether the blur kernel is known. A non-blind restoration algorithm recovers the blurred image by estimating the inverse process of (1) with the known blur kernel; classical approaches include the Lucy–Richardson (LR) algorithm, the Wiener filter, and the Kalman filter [7]. A blind restoration algorithm, in contrast, reconstructs the image by first estimating the blur kernel. Estimating an unknown blur kernel is time-consuming and inefficient, because the blur encountered in practice is complex and uncertain in most cases.
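
As an illustration of the blur model in (1), the following Python sketch synthesises a blurred image by convolving a sharp image with a simple kernel and adding Gaussian noise; the kernel and noise level are illustrative assumptions, not values taken from any dataset used in this paper.

```python
import numpy as np
from scipy.ndimage import convolve

def blur_image(sharp, kernel, noise_sigma=0.01):
    """Eq. (1): I_B = K * I_S + N (convolution with a blur kernel plus additive noise)."""
    blurred = convolve(sharp, kernel / kernel.sum(), mode='reflect')
    noise = np.random.normal(0.0, noise_sigma, size=sharp.shape)
    return np.clip(blurred + noise, 0.0, 1.0)

# a crude horizontal motion kernel (object moving 9 pixels to the right)
K = np.zeros((9, 9))
K[4, :] = 1.0

sharp = np.random.rand(128, 128)   # stand-in for a greyscale image in [0, 1]
blurry = blur_image(sharp, K)
```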

Generative adversarial networks (GANs) [8] are a rapidly developing research direction in the field of deep learning. GANs have been proven to have excellent performance in image super-resolution, image style transfer, and high-precision sample estimation. Image deblurring can be considered a type of image style transfer, that is, the change of image from “blurred” style to “sharp” style.

Given GANs’ excellent performance on image-to-image translation tasks and the remaining shortcomings of existing deblurring algorithms, applying GANs to motion-blurred image restoration is a promising new direction. In this paper, an enhanced GAN model, named EDGAN, is proposed that achieves higher PSNR and SSIM on motion blur removal; in addition, the deblurred images have better visual appearance and quality. Section 2 discusses related work on image restoration. Section 3 describes the proposed method and algorithm and elaborates the mathematical analysis. Section 4 validates the proposed method experimentally, comparing results on benchmark datasets and on simulated images from the Chang’e 3 space mission. Sections 5 and 6 give the discussion, conclusions, and implications for the design of the Chang’e missions’ camera pointing systems. The contributions of this paper are as follows:

  1. A new network structure and a hybrid loss function are proposed, which can recover motion-blurred images efficiently;

  2. Experiments on different datasets demonstrate that the algorithm can be applied to real scenes.

2 Related work and background

2.1 Related study

In recent years, numerous blurred-image restoration approaches based on deep learning have been proposed. Sun et al. [9] used a CNN to predict the probability distribution of motion blur in each image block and restored the image from these per-block distributions. Nah et al. [10] used a multi-scale CNN to restore the sharp image directly; however, that algorithm has high complexity and low efficiency, so it cannot process images quickly. Ramakrishnan et al. [11] proposed a novel network structure to improve network efficiency while maintaining the deblurring effect; although efficiency improved, the sharpness of the generated images was similar to that of Nah’s method. Kupyn et al. [12] proposed a motion blur removal algorithm based on conditional generative adversarial networks. This approach uses WGAN-GP [13] and perceptual loss [14] as the final loss function and achieves a strong image restoration effect. Deep learning has also been applied to motion-blurred video restoration. Su et al. [15] adopted a CNN to aggregate multiple frames and generate a sharp output. Zhang et al. [16] used a GAN with 3D convolutions to capture the spatial and temporal information encoded in neighbouring frames and restore blurred video. Chen et al. [17] fine-tuned existing deblurring networks in a self-supervised fashion, which improves the performance of video deblurring algorithms.

2.2 GANs

GANs consist of a generator and a discriminator. The basic principle is that the generator receives a random noise signal and generates new data samples, while the discriminator determines whether a sample comes from the real sample set or was produced by the generator. The purpose of the generator is to generate samples so close to the distribution of the real data that the discriminator cannot determine their source. The objective function of GANs is as follows:

$$\begin{array}{*{20}c} {\mathop {\hbox{min} }\limits_{G} \mathop {\hbox{max} }\limits_{D} V\left( {D,G} \right) = {\mathbb{E}}_{x\sim p\left( x \right)} \left[ {\log D\left( x \right)} \right] + {\mathbb{E}}_{{z\sim p_{z} \left( z \right)}} \left[ {\log \left( {1 - D\left( {G\left( z \right)} \right)} \right)} \right]} \\ \end{array}$$
(2)

where \(p\left( x \right)\), \(p_{z} \left( z \right)\), \(G\left( {z;\,\theta_{g} } \right)\), and \(D\left( x \right)\) are the data distribution, a predefined noise distribution, the mapping from noise space to data space, and the probability that \(x\) comes from the real sample set, respectively. \({\mathbb{E}}\) denotes mathematical expectation. The discriminator is trained to maximise \(\log D\left( x \right)\), and the generator is trained to minimise \(\log \left( {1 - D\left( {G\left( z \right)} \right)} \right)\). To enable the network to produce outputs matching given expectations, researchers added extra conditional information \(y\) to the original GANs, where \(y\) can be any kind of information [18]. The objective function of conditional generative adversarial networks is as follows:

$$\begin{array}{*{20}c} {\mathop {\hbox{min} }\limits_{G} \mathop {\hbox{max} }\limits_{D} V\left( {D,G} \right) = {\mathbb{E}}_{x\sim p\left( x \right)} \left[ {\log D\left( {x|y} \right)} \right] + {\mathbb{E}}_{{z\sim p_{z} \left( z \right)}} \left[ {\log \left( {1 - D\left( {G\left( {z|y} \right)} \right)} \right)} \right]} \\ \end{array}$$
(3)

However, training the original GANs suffers from many challenges, such as mode collapse, training instability, and vanishing gradients. The reason for these problems is that the original GANs use the Jensen–Shannon (JS) divergence to measure the difference between the two distributions: whenever the two distributions do not overlap, the JS divergence equals the constant \(\log 2\), so it provides no useful gradient. Thus, the Wasserstein GAN (WGAN) [19] uses the Wasserstein distance to measure the difference between the two distributions.

$$\begin{array}{*{20}c} {W\left( {P_{r} ,P_{g} } \right) = \mathop {\inf }\limits_{{\gamma \in \prod \left( {P_{r} ,P_{g} } \right)}} {\mathbb{E}}_{{\left( {x,y} \right)\sim \gamma }} \left[ {\left\| {x - y} \right\|} \right]} \\ \end{array}$$
(4)

For two distributions \(P_{r}\) and \(P_{g}\), \(\prod \left( {P_{r} ,P_{g} } \right)\) denotes the set of all joint distributions whose marginals are \(P_{r}\) and \(P_{g}\). For each joint distribution \(\gamma\), the expected distance between paired samples \(x\) and \(y\) drawn from \(\gamma\) is computed, and the infimum over all joint distributions is taken as the Wasserstein distance. Compared with the JS divergence, the Wasserstein distance varies continuously regardless of whether the two distributions overlap.

3 Proposed method

The proposed method improves the performance of an existing end-to-end adversarial deblurring network by giving the network the ability to learn the weights of feature channels and by removing draughtboard artefacts. The key idea is to extract image features with a novel convolution unit that combines squeeze-and-excitation networks with residual networks. In addition, resize convolution is used as the up-sampling method; comparative experiments show that it effectively removes draughtboard artefacts.

3.1 Generator

The generator structure is shown in Fig. 1, which is similar to the structure used by Kupyn et al. [12]. It contains two strided convolution blocks with stride two, nine residual blocks, and two transposed convolution blocks. To improve the quality of the image, some improvements are made on the basis of the network structure of Kupyn et al. [12] and are elaborated as follows.

Fig. 1
figure 1

Generator network structure

3.1.1 Without batch normalisation

Traditional neural networks only normalise the data before it is fed into the network, whereas batch normalisation [20] layers normalise the inputs of the hidden layers. Batch normalisation can alleviate the vanishing and exploding gradient problems in back-propagation and accelerate the convergence of the network. Nevertheless, the batch normalisation layers are removed from the network, following Nah et al. [10] and Bee et al. [21]: because batch normalisation normalises the features, it limits the network’s flexibility.

3.1.2 SE-ResBlock

The SENet (squeeze-and-excitation network) proposed by Hu et al. [22] is used to improve the performance of the residual network. In a CNN, the convolution kernel aggregates spatial information and channel-wise feature information. Many approaches improve CNN performance along the spatial dimension, such as the Inception module [20, 23,24,25]. In contrast, SENet improves performance along the feature-channel dimension: it learns the relationships between feature channels, obtains a weight for each channel, and uses these weights to promote useful features and suppress features that are less useful for the current task.

SENet involves three key operations: squeeze, excitation, and reweight. The squeeze operation compresses each two-dimensional feature channel into a single real number. The excitation operation assigns a weight to each feature channel through a parameter \(W\) that captures the correlation between channels. In the reweight operation, the output of excitation is treated as the importance of each feature channel and is multiplied onto the original features, re-calibrating them channel by channel. SENet has been proven to perform excellently in image classification. In this paper, SENet is applied to image restoration, because the adopted approach reconstructs the entire image.

In the reconstruction process, the importance of different feature channels should be taken into account. Therefore, a combined convolution unit of SENet and ResNet (residual networks) [26], named SE-ResBlock, is used, as shown in Fig. 2. Global average pooling serves as the squeeze operation. The excitation operation computes the inter-channel correlation using two fully connected layers that form a bottleneck structure. The reweight operation scales the original feature channels by the normalised weights.

Fig. 2
figure 2

SE-ResBlock network structure
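
A minimal PyTorch sketch of such an SE-ResBlock is given below; the channel count and reduction ratio are assumptions for illustration rather than the exact values used in the paper.

```python
import torch
import torch.nn as nn

class SEResBlock(nn.Module):
    """Residual block followed by squeeze-excitation channel reweighting (no batch norm)."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        # residual branch: two 3x3 convolutions
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # squeeze: global average pooling; excitation: bottleneck of two FC layers
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        res = self.body(x)
        b, c, _, _ = res.shape
        w = self.squeeze(res).view(b, c)       # squeeze each channel to one value
        w = self.excite(w).view(b, c, 1, 1)    # per-channel weights in (0, 1)
        return x + res * w                     # reweight the residual, then add the skip
```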

3.1.3 Resize convolution [27]

Using a CNN to generate images is a process that transforms low-resolution feature maps into a high-resolution image, and it is usually realised by deconvolution. Owing to the “uneven overlap” in the deconvolution process, artefacts resembling a draughtboard pattern appear in the image details; these are called “draughtboard artefacts”. One way to mitigate them is to make the kernel size divisible by the stride, but draughtboard artefacts still arise easily. Instead, resize convolution is used as the up-sampling method in place of deconvolution. Resize convolution is implicitly weight-tying in a way that discourages high-frequency artefacts. It first resizes the image (using nearest-neighbour or bilinear interpolation) and then applies a convolutional layer, as shown in Fig. 3.

Fig. 3
figure 3

Deconvolution and resize convolution
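
A minimal PyTorch sketch of a resize-convolution block is given below; the kernel size and default interpolation mode are assumptions. Each transposed-convolution block in the generator of Fig. 1 would be replaced by such a unit.

```python
import torch.nn as nn

class ResizeConv(nn.Module):
    """Up-sample by interpolation, then convolve, instead of transposed convolution."""
    def __init__(self, in_ch, out_ch, scale=2, mode='nearest'):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode=mode)   # nearest or bilinear resize
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))   # no uneven overlap, so no draughtboard artefacts
```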

3.2 Discriminator

Here, the Markovian discriminator [28], PatchGAN, is used as the discriminator of EDGAN. Since the content loss (the combination of perceptual loss and gradient loss used in this paper) already handles the low-frequency components of the image well, the discriminator only needs to process the high-frequency components. Therefore, the receptive field of the discriminator output does not need to cover the whole input image. The output can instead be a feature map in which the receptive field of each pixel corresponds to a patch of the input image, which accelerates the discriminator while still yielding high-quality images. Following the experiments of Isola et al. [29], the output of the discriminator is set to \(50 \times 50\), which balances image quality and network speed. The model parameters of the discriminator are shown in Table 1.

Table 1 Model parameters of discriminator
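
For illustration, a generic PatchGAN-style critic can be sketched in PyTorch as follows; the layer widths and depth below are assumptions rather than the exact parameters of Table 1, and no sigmoid is applied because WGAN-GP expects an unconstrained critic output.

```python
import torch
import torch.nn as nn

class PatchCritic(nn.Module):
    """Markovian (PatchGAN-style) critic: outputs a score map, one score per image patch."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        def block(cin, cout, stride):
            return [nn.Conv2d(cin, cout, 4, stride=stride, padding=1),
                    nn.LeakyReLU(0.2, inplace=True)]
        layers = block(in_ch, base, 2)
        layers += block(base, base * 2, 2)
        layers += block(base * 2, base * 4, 2)
        layers += block(base * 4, base * 8, 1)
        layers += [nn.Conv2d(base * 8, 1, 4, stride=1, padding=1)]  # no sigmoid: WGAN-GP critic
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)   # roughly a 30x30 score map for a 256x256 input in this sketch
```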

3.3 Loss function

The generator loss function is a combination of adversarial loss, perceptual loss, and gradient loss, formulated as follows:

$$\begin{array}{*{20}c} {l = \ell_{{{\text{WGAN}} - {\text{GP}}}}^{\text{Generator}} + \alpha \cdot \ell_{\text{percep}} + \beta \cdot \ell_{\text{grad}} } \\ \end{array}$$
(5)

where \(\alpha\) and \(\beta\) are the weight parameters of the perceptual loss and the gradient loss, respectively.
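
As a reading aid, a minimal PyTorch-style sketch of how these three terms might be combined is given below; the individual terms are defined in the following subsections, and the helper names (`critic`, `perceptual_loss`, `gradient_loss`) are hypothetical wrappers around those definitions rather than code from the original work.

```python
# alpha and beta follow Eq. (5); the values 1 and 5 are reported in Section 4.1
def generator_loss(fake, sharp, critic, perceptual_loss, gradient_loss,
                   alpha=1.0, beta=5.0):
    adv = -critic(fake).mean()   # Eq. (6): adversarial term from the WGAN-GP critic
    return adv + alpha * perceptual_loss(fake, sharp) + beta * gradient_loss(fake, sharp)
```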

3.3.1 Adversarial loss

Although WGAN overcomes the problems of the original GANs and reduces the difficulty of training, it can still fail to converge under certain conditions, and the generated images do not always meet expectations. WGAN clips the weights to a fixed range after each iteration, whereas WGAN-GP computes the gradient of the discriminator with respect to its input and penalises it according to the gradient norm. WGAN-GP effectively resolves these problems of WGAN. The adversarial loss is calculated as follows, where \(I^{B}\) is the blurry image:

$$\begin{array}{*{20}c} {\ell_{{{\text{WGAN}} - {\text{GP}}}}^{\text{Generator}} = \mathop \sum \limits_{n = 1}^{N} - D_{{\theta_{D} }} \left( {G_{{\theta_{G} }} \left( {I^{B} } \right)} \right)} \\ \end{array}$$
(6)

3.3.2 Perceptual loss

The basic idea is to use the features extracted by a CNN as part of the objective function. By minimising the Euclidean distance between the CNN feature maps of the generated image and of the target image, the generated image becomes more consistent with the target than it would under a pixel-level loss function. Perceptual loss is a kind of high-level loss, defined as follows, where \(I^{S}\) is the sharp image:

$$\begin{array}{*{20}c} {\ell_{\text{percep}} = \frac{1}{{W_{i,j} H_{i,j} }}\mathop \sum \limits_{x = 1}^{{W_{i,j} }} \mathop \sum \limits_{y = 1}^{{H_{i,j} }} \left( {\phi_{i,j} \left( {I^{S} } \right)_{x,y} - \phi_{i,j} \left( {G_{{\theta_{G} }} \left( {I^{B} } \right)} \right)_{x,y} } \right)^{2} } \\ \end{array}$$
(7)

where \(W_{i,j}\) and \(H_{i,j}\) are the dimensions of the feature maps; \(\phi_{i,j}\) is the feature map obtained by the j-th convolution before the i-th maxpooling layer within a CNN. The CNN used in this paper is a VGG19 (layer 1–14) network [31], pretrained on ImageNet [32].
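
A possible PyTorch sketch of this perceptual loss, using torchvision's pretrained VGG19 truncated after its early feature layers, is shown below; the mapping from "layer 1–14" to a torchvision slice index and the input preprocessing are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Mean squared error between VGG19 feature maps of the restored and sharp images."""
    def __init__(self, layer=14):
        super().__init__()
        # newer torchvision versions use the weights= argument instead of pretrained=True
        vgg = models.vgg19(pretrained=True).features[:layer]
        for p in vgg.parameters():
            p.requires_grad = False            # VGG is kept fixed during training
        self.vgg = vgg.eval()
        self.mse = nn.MSELoss()

    def forward(self, restored, sharp):
        # inputs are assumed to be normalised as VGG expects (ImageNet mean/std)
        return self.mse(self.vgg(restored), self.vgg(sharp))
```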

3.3.3 Gradient loss [33]

In addition, image information in the gradient domain is also leveraged as a high-level loss term as follows:

$$\begin{array}{*{20}c} {\ell_{\text{grad}} = \frac{1}{2N}\mathop \sum \limits_{i = 1}^{N} \left( {\left| {\nabla_{h} \left( {I^{S} } \right) - \nabla_{h} \left( {G_{{\theta_{G} }} \left( {I^{B} } \right)} \right)} \right| + \left| {\nabla_{v} \left( {I^{S} } \right) - \nabla_{v} \left( {G_{{\theta_{G} }} \left( {I^{B} } \right)} \right)} \right|} \right)} \\ \end{array}$$
(8)

where \(\nabla_{h}\) and \(\nabla_{v}\) indicate the horizontal and vertical gradients. N indicates the number of training image pairs.
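
A per-batch PyTorch sketch of this gradient loss, using finite differences for \(\nabla_{h}\) and \(\nabla_{v}\), might look as follows; averaging over the batch stands in for the sum over the N training pairs.

```python
import torch

def gradient_loss(restored, sharp):
    """Eq. (8): L1 distance between horizontal and vertical image gradients."""
    def grads(img):
        dh = img[..., :, 1:] - img[..., :, :-1]   # horizontal finite difference
        dv = img[..., 1:, :] - img[..., :-1, :]   # vertical finite difference
        return dh, dv
    dh_s, dv_s = grads(sharp)
    dh_r, dv_r = grads(restored)
    return 0.5 * (torch.abs(dh_s - dh_r).mean() + torch.abs(dv_s - dv_r).mean())
```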

WGAN-GP is used as the critic function; the discriminator loss is as follows:

$$\begin{array}{*{20}c} {\ell_{{{\text{WGAN}} - {\text{GP}}}}^{\text{Discriminator}} = \mathop {\mathbb{E}}\limits_{{\widetilde{x}\sim P_{g} }} \left[ {D\left( {\widetilde{x}} \right)} \right] - \mathop {\mathbb{E}}\limits_{{x\sim P_{r} }} \left[ {D\left( x \right)} \right] + \lambda \mathop {\mathbb{E}}\limits_{{\widehat{x}\sim P_{{\widehat{x}}} }} \left[ {\left( {\left\| {\nabla_{{\widehat{x}}} D\left( {\widehat{x}} \right)} \right\|_{2} - 1} \right)^{2} } \right]} \\ \end{array}$$
(9)

where \(\lambda\) is the penalty coefficient, \(p_{g}\) is the sample distribution of the generator, \(p_{r}\) is the sample distribution of the sharp image, and \(p_{{\widehat{x}}}\) is the distribution uniformly sampled along the straight lines between \(p_{g}\) and \(p_{r}\).
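
A PyTorch sketch of this critic loss with gradient penalty is given below; the penalty coefficient of 10 is the value commonly used for WGAN-GP and is an assumption, since the paper does not state \(\lambda\).

```python
import torch

def critic_loss(critic, sharp, fake, lam=10.0):
    """Eq. (9): Wasserstein critic loss with gradient penalty (lam is the penalty weight)."""
    # fake is assumed to be detached from the generator graph
    loss = critic(fake).mean() - critic(sharp).mean()
    # sample x_hat uniformly along straight lines between real and generated samples
    eps = torch.rand(sharp.size(0), 1, 1, 1, device=sharp.device)
    x_hat = (eps * sharp + (1 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grad = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                               grad_outputs=torch.ones_like(d_hat),
                               create_graph=True)[0]
    penalty = ((grad.view(grad.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
    return loss + lam * penalty
```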

4 Experimental evaluation

4.1 Experimental settings

The following describes the validation of the proposed method and algorithms. First, benchmark datasets are used to test performance; the method is then tested on images obtained from the Chang’e 3 space mission. All proposed models are implemented with the PyTorch deep learning framework, and training and testing are carried out on an NVIDIA GTX 1070 GPU and an Intel i7-4790 CPU. The first model, which uses SE-ResBlock, is referred to as \({\text{EDGAN}}_{\text{SE}}\); the second, which adds gradient loss, as \({\text{EDGAN}}_{{{\text{SE}} - {\text{G}}}}\); and the third, which uses SE-ResBlock together with resize convolution, as \({\text{EDGAN}}_{{{\text{SE}} - {\text{R}}}}\). Random crops of \(250 \times 250\) pixels from the GOPRO dataset [10] are used to train the proposed models. The optimisation algorithm is ADAM [34] with a learning rate of \(10^{ - 4}\). To speed up convergence, the learning rate is kept fixed for the first 150 epochs and then gradually reduced to zero over the following 150 epochs. To balance the generator and the discriminator, the discriminator performs five gradient descent steps for every generator step. The weights \(\alpha\) and \(\beta\) of the generator loss function are set to 1 and 5, respectively. The approaches proposed in this paper do not adopt batch normalisation layers. Other specific differences are shown in Table 2.
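
As an illustration of this training schedule, the following PyTorch sketch sets up the ADAM optimiser with the constant-then-linear-decay learning rate described above; the helper name and the LambdaLR wiring are implementation assumptions, not taken from the original code.

```python
import torch

def make_optimizer_and_scheduler(model, lr=1e-4, fixed_epochs=150, decay_epochs=150):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    def lr_lambda(epoch):
        # constant learning rate for the first 150 epochs, then linear decay to zero
        if epoch < fixed_epochs:
            return 1.0
        return max(0.0, 1.0 - (epoch - fixed_epochs) / float(decay_epochs))
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# as noted above, the discriminator is stepped five times for every generator step
```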

Table 2 Configuration differences of the proposed models

4.2 Datasets

4.2.1 GOPRO dataset

The GOPRO dataset consists of 3214 image pairs, of which 2103 pairs form the training set and 1111 pairs the test set. The dataset is built from 240 fps high-frame-rate video captured with a GoPro Hero 4 camera: groups of about 7–14 consecutive frames are averaged to obtain a blurry image, and the middle frame of each group is taken as the sharp label image. Since this dataset has been used in many related studies, it is also used in this paper for ease of comparison. The proposed model is compared with Kupyn et al. [12], and the results are shown in Table 3; image details are shown in Fig. 4. The results show that EDGAN restores image details better. The comparison between \({\text{EDGAN}}_{\text{SE}}\) and \({\text{EDGAN}}_{{{\text{SE}} - {\text{R}}}}\) is shown in Fig. 5: deconvolution causes abnormal colour artefacts in the image texture, which resize convolution eliminates.
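
The frame-averaging procedure used to build such blurry/sharp pairs can be sketched as follows; this is a simplified illustration that ignores camera response function handling, and the helper name is hypothetical.

```python
import numpy as np

def synthesize_blur_pair(frames):
    """GOPRO-style pair: average a group of consecutive frames to get the blurry image,
    and use the middle frame as the sharp label (frames: list of HxWx3 arrays in [0, 1])."""
    blurry = np.mean(np.stack(frames, axis=0), axis=0)
    sharp = frames[len(frames) // 2]
    return blurry, sharp
```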

Table 3 Mean PSNR and SSIM on GOPRO dataset of 1111 images
Fig. 4
figure 4

Comparison of deblurred images by the proposed model and DeblurGAN [12] on one of the images taken from GOPRO dataset

Fig. 5
figure 5

Comparison of deblurred images by deconvolution and resize convolution

4.2.2 Lai dataset

The Lai dataset includes a real subset and a synthetic subset. The real subset consists of 100 blurred images collected from real-world scenes, captured with different devices, settings, and subjects. The synthetic subset includes 100 blurred images generated by convolving 25 sharp images collected from the Internet with four different blur kernels; the blur kernels are generated from recorded 6D camera trajectories. Only the synthetic subset is used in the experiments, and the results are shown in Table 4. EDGAN shows superior results both qualitatively and quantitatively. Deblurred images from tests on the Lai dataset are shown in Fig. 6.

Table 4 Mean PSNR and SSIM on Lai dataset of 100 images
Fig. 6
figure 6

Comparison of deblurred images by the proposed model and DeblurGAN [12] on one of the images taken from Lai dataset

4.2.3 Chang’e 3 space mission images [35]

The impact of the proposed algorithm is also evaluated using the open-source image data available from the Chang’e 3 space mission. China’s Chang’e 3 mission was launched successfully, and its rover “Yutu” landed on the lunar surface in December 2013; it is one of many successful Chinese space missions of recent decades. One of the projects led by an author of this paper, Prof. K. L. Yung [1, 2, 36], and his team at the Hong Kong Polytechnic University was the design and development of the camera pointing system mounted on the lander. The equipment operated on the moon for over 3 years before being retired in 2017.

The following simulated experiments were conducted to test the algorithm, which can be used to improve the design and implementation of future Chang’e missions. Ten pictures were randomly selected from the Chang’e 3 and rover “Yutu” image databases, and the corresponding blurred images were artificially generated and then restored. The blurred images were generated by convolving a sharp image with a motion blur operator. The motion blur operator has two parameters, the angle \(\theta\) and the pixel number \(n\); the convolution simulates the photographed object moving by \(n\) pixels at an angle of \(\theta\) degrees counter-clockwise. The results are shown in Tables 5, 6, 7, and 8, and the deblurred images in Figs. 7, 8, 9, and 10.
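
A possible NumPy/SciPy sketch of this motion blur operator, parameterised by the pixel number \(n\) and angle \(\theta\), is shown below; the construction via a rotated line kernel is an assumption about implementation details.

```python
import numpy as np
from scipy.ndimage import convolve, rotate

def motion_kernel(n=20, theta=15.0):
    """Linear motion blur operator: movement of n pixels at theta degrees counter-clockwise."""
    k = np.zeros((n, n))
    k[n // 2, :] = 1.0                            # horizontal streak of length n
    k = rotate(k, theta, reshape=False, order=1)  # rotate counter-clockwise by theta degrees
    return k / k.sum()

def apply_motion_blur(image, n=20, theta=15.0):
    """Convolve each channel of an HxWx3 image in [0, 1] with the motion kernel."""
    k = motion_kernel(n, theta)
    return np.stack([convolve(image[..., c], k, mode='reflect') for c in range(3)], axis=-1)
```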

Table 5 Mean PSNR and SSIM on Chang’e 3 image database of 10 images (\(n = 20, \theta = 15\))
Table 6 Mean PSNR and SSIM on Chang’e 3 image database of 10 images (\(n = 10, \theta = 15\))
Table 7 Mean PSNR and SSIM on rover “Yutu” image database of 10 images (\(n = 20, \theta = 15\))
Table 8 Mean PSNR and SSIM on rover “Yutu” image database of 10 images (\(n = 10, \theta = 15\))
Fig. 7
figure 7

Chang’e 3 image database \(n = 20, \theta = 15\)

Fig. 8
figure 8

Chang’e 3 image database \(n = 10, \theta = 15\)

Fig. 9
figure 9

Rover “Yutu” image database \(n = 20, \theta = 15\)

Fig. 10
figure 10

Rover “Yutu” image database \(n = 10, \theta = 15\)

Two conclusions are drawn from the simulated experiments: (1) although resize convolution eliminates draughtboard artefacts, it causes PSNR and SSIM to decrease; and (2) when the blur is severe (\(n = 20\), \(\theta = 15\)), DeblurGAN does not restore image details well, especially line structures, whereas SE-ResBlock restores these details better.

5 Discussion

This paper demonstrates through comparative experiments on different datasets that the images generated by the proposed algorithm are sharper. Although the algorithm performs well on image deblurring, it cannot process high-resolution images in real time because the network structure has not been further optimised for efficiency. In addition, for image frames taken from video, the proposed algorithm does not support multiple images as input, so interframe information cannot be exploited. Future work will focus on improving the speed of the algorithm and on extracting interframe information.

6 Conclusion

In this paper, EDGAN, a novel model designed to restore motion-blurred images, is proposed. EDGAN sets a new state of the art on public benchmark datasets in terms of the PSNR and SSIM metrics [37]. In addition, it is confirmed that using resize convolution as the up-sampling method effectively eliminates “draughtboard artefacts” in the generated images, although resize convolution reduces the quality of image details; eliminating colour artefacts without reducing the performance of EDGAN will be part of future work. Moreover, the method is tested and evaluated with future Chang’e space missions in mind. The camera pointing system developed by the Hong Kong Polytechnic University in early 2013 was used to capture images of the moon as well as the movement of the rover. It was capable of 360-degree image capture, as well as positioning and navigation of the rover. Based on this experience, a new deep learning algorithm for future space missions is proposed. The results indicate that deep learning can achieve good performance in high-precision image restoration and can be incorporated into the design of the camera pointing system’s cylindrical projection of sequential images for image construction, as well as into feature recognition in future deep space exploration.