
1 Introduction

Image restoration is required to handle various and complicated degradation patterns produced in different degradation processes. The degradation representations act as a crucial component to model the degradation processes and handle complicated degradation patterns, such as different noise levels in image denoising [8, 21, 51] and different combinations of Gaussian blurs and motion blurs in blind super-resolution [41]. However, the degradation representations are less exploited in learning-based deblurring methods and have not been well integrated into state-of-the-art deblurring networks.

The general blurring process can be formulated as

$$\begin{aligned} y = F(x, k) + \eta , \end{aligned}$$
(1)

where x and y are the sharp image and the blurry image, respectively, F(x, k) is usually modeled as a blurring operator with kernel k, and \(\eta \) represents additive Gaussian noise.
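For illustration, here is a minimal Python sketch of Eq. (1) under the conventional assumption that F is a plain 2D convolution; the image, kernel, and noise level are placeholder values, not quantities from the paper.

```python
# Minimal sketch of the conventional blur model y = F(x, k) + eta (Eq. 1),
# where F is a simple 2D convolution -- the assumption later relaxed in this work.
import numpy as np
from scipy.signal import convolve2d

def blur(x: np.ndarray, k: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Convolve a grayscale image x with kernel k and add Gaussian noise."""
    y = convolve2d(x, k, mode="same", boundary="symm")  # F(x, k) as convolution
    return y + np.random.normal(0.0, sigma, y.shape)    # + eta

x = np.random.rand(64, 64)   # stand-in sharp image
k = np.ones((9, 9)) / 81.0   # a simple 9x9 box (uniform) blur kernel
y = blur(x, k)
```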

A popular paradigm for image deblurring is based on the Maximum A Posterior (MAP) estimate framework,

$$\begin{aligned} (k, x) = \arg \max \mathbb {P}(y|x,k)\mathbb {P}(x)\mathbb {P}(k), \end{aligned}$$
(2)

where \(\mathbb {P}(x)\) and \(\mathbb {P}(k)\) model the priors of the clean images and the blur kernels. Many handcrafted priors for modeling \(\mathbb {P}(x)\) and \(\mathbb {P}(k)\) have been proposed [1, 11, 18, 26], but most of them are insufficient for characterizing clean images and blur kernels accurately. Furthermore, the operator F is generally modeled as a convolution in conventional MAP frameworks, which does not hold in practice and causes unpleasant artifacts in real-world challenging cases.

Given the limitations of kernel-based blur modeling, a series of kernel-free approaches [3, 4, 6, 38, 47] have been proposed to directly learn the mapping from blurry images to their corresponding sharp images. While those methods outperform previous deblurring methods significantly, their performance is still limited on complicated blur patterns due to the lack of explicit modeling of the degradation process. Unlike in denoising, where the noise level might be similar across different images, the blur of different images generally exhibits totally different patterns and cannot be well handled by fixed-weight networks that ignore the degradation process. To combine degradation modeling with learning-based deblurring, recent works [29, 39] propose to learn explicit degradation representations by using the Deep Image Prior (DIP) [40] to reparameterize the kernel k and the sharp image x. This inevitably involves the time-consuming iterative inverse optimization and hyperparameter tuning of DIP to adapt to the deblurring process. Moreover, degradation representations have not been adopted as a common component in SOTA deblurring methods [3, 47].

In this paper, we propose to learn explicit degradation representations with a novel framework that jointly learns sharp-to-blurry image reblurring and blurry-to-sharp image deblurring. Specifically, the degradation representations are learned in the process of sharp-to-blurry image reblurring. The process takes a blurry image as input and learns the degradation representations as a multi-channel spatial latent map that encodes the spatially varying blur patterns, replacing the conventional convolutional blur kernels or the DIP prior. A reblurring generator then takes the latent degradation map and the original sharp image as input and reblurs the sharp image back to its corresponding blurry image.

To effectively integrate the learned degradation representations into the reblurring process, we introduce a multi-scale degradation injection network (MSDI-Net) for conditional image reblurring. The network adopts a U-Net-like architecture [31]. The sharp image is fed into the encoder of the U-Net, while the degradation map is injected into the encoder-decoder skip-connections to modulate the shortcut encoder feature maps. Specifically, the latent degradation map is gradually upsampled to multiple resolutions via nearest-neighbor interpolation followed by a convolution layer, and is then used to predict spatially varying weighting and bias parameters for the shortcut feature maps at each corresponding resolution. In this way, the learning of the latent degradation representation is supervised by the original blurry image through sharp-to-blurry image reblurring.

To make the learned degradation representations contribute to image deblurring, another blurry-to-sharp generator network is introduced, which shares the same MSDI-Net architecture but not its weights with the reblurring generator. The learned degradation representations are processed similarly to deblur a blurry input image: the encoder takes the blurry image as input, while the latent degradation representations modulate the blurry image's encoder-decoder shortcut feature maps at multiple resolutions. With the help of the learned degradation map, the deblurring generator can handle complicated spatially varying blur patterns, which are adaptively learned from the data to optimize both the reblurring and deblurring tasks.

The main contributions of this work are two-fold: 1) We propose a novel joint framework for learning both sharp-to-blurry image reblurring and blurry-to-sharp image deblurring to adaptively encode spatially varying degradations and model the image blurring process, which in turn, benefits the image deblurring performance. 2) The proposed joint reblurring and deblurring framework outperforms state-of-the-art image deblurring methods on the widely used GoPro [24] and RealBlur [30] datasets.

2 Related Work

In this section, we briefly review related work on image restoration with degradation representations and on different image deblurring methods.

Image Restoration with Degradation Representations. Image restoration tasks are usually required to handle different and complicated degradations in real-world applications. Degradation representations have been exploited as a crucial component in several image restoration tasks, such as image denoising and image super-resolution. In image denoising, [8, 21, 51] take the noise variance as a network input to adaptively handle various noise strengths. Several practical denoising methods stabilize the noise variance caused by varying ISO [43] and the properties of the Poisson-Gaussian distribution [15, 35]. Similarly, blind image super-resolution must handle various degradations (different Gaussian blurs, motion blurs, and noises) in real-world applications. [41] proposes an unsupervised scheme for learning degradation representations based on the assumption that the degradation is uniform within an image but can vary across images. However, this assumption does not hold in image deblurring.

Optimization-Based Deblurring. A popular approach to image deblurring is based on the Maximum A Posterior (MAP) framework. Most MAP-based methods focus on finding good priors for sharp images and blur kernels, including total variation (TV) [1], the hyper-Laplacian prior [11], the \(l_0\)-norm gradient prior [46], and sparse image priors [14]. They all assume the blur kernel is linear and uniform and can be represented as a convolution kernel. However, this assumption does not hold for real-world blurring with non-uniform kernels. Some non-uniform deblurring methods [5, 23, 33, 44] have been proposed based on the assumption that the blur is locally uniform, but they remain impractical and incur high computational costs.

Learning-Based Deblurring. Many deep deblurring models have been proposed over the past few years. Earlier attempts [32, 37] utilize deep convolutional neural networks to facilitate blur kernel estimation. However, kernel estimation has several limitations [24]: 1) simple kernel convolution is not practical in real-world challenging cases; 2) incorrect kernel estimates, caused by noise and large motions, may produce unpleasant artifacts; 3) estimating spatially varying kernels requires a huge amount of computation. To avoid these limitations, a series of kernel-free methods [3, 4, 6, 12, 13, 24, 27, 36, 38, 47, 49, 52] have been proposed with much better performance. Nah et al. [24] propose a multi-scale network for image deblurring. Similarly, Tao et al. [38] propose a scale-recurrent structure for image deblurring. Adversarial training has also been introduced for image deblurring [12, 13]. Chen et al. [2] introduce a reblur2deblur framework for video deblurring. Zhang et al. [52] propose a reblurring network to synthesize additional blurry training images. The reblurring and deblurring networks are kept separate in both methods, whereas in this work we combine them to learn the degradation representations. Recently, multi-stage approaches [3, 36, 47, 49] have achieved impressive performance over previous methods. While these deblurring networks outperform traditional deblurring methods significantly, their performance is still limited by the lack of explicit degradation modeling.

Learning Blurring Degradation Representations. While explicit degradation representations have shown convincing improvements on many low-level vision tasks, they are rarely explored in learning-based deblurring methods. SelfDeblur [29] introduces the Deep Image Prior (DIP) [40] to model the clean images and the kernels separately, but it assumes the blur kernels are linear and uniform, which is not practical in complicated real-world scenes. Tran et al. [39] address this limitation by introducing explicit representations for the blur kernels and blur operators and reparameterizing the degradation and the sharp image with DIP. This inevitably involves time-consuming alternating optimization. The learned representations cannot directly improve the performance of existing deblurring networks, and their application is limited to time-consuming DIP-based optimization methods.

Fig. 1. Learning degradation representations with reblurring and deblurring.

3 Methodology

In this section, we first introduce a joint learning framework for both sharp-to-blurry image reblurring and blurry-to-sharp image deblurring to encode latent spatially varying degradation representations from blurry images. A blur-aware loss is introduced to enhance the image deblurring performance. To more effectively integrate the latent degradation representations for reblurring and deblurring, a multi-scale degradation injection network is proposed for both tasks.

3.1 Learning Degradation Representations from Joint Reblurring and Deblurring

Most existing deblurring methods [1, 11, 18, 26, 32] take the blur kernels as the degradation representations and model the blurring process as a convolution on the input image. However, simple kernel convolution is not practical in real-world challenging cases, and it is usually difficult to estimate the blur kernels under large motions and spatially varying blur. We propose to encode latent degradation representations from blurry images via the joint learning of image reblurring and deblurring. As shown in Fig. 1, we introduce an encoder E to encode the blurry image y into the degradation representations E(y), modeled as a multi-channel latent map encoding 2D spatially varying blurring degradation in a latent space. An image reblurring generator \(G_r\) and an image deblurring generator \(G_d\) then generate the reblurred image \(\acute{y}\) and the deblurred image \(\acute{x}\), respectively. The degradation representations not only help the reblurring generator \(G_r\) model the degradation process, but also help the deblurring network handle complicated spatially varying degradation patterns. The joint training of reblurring and deblurring strengthens the expressiveness of the learned degradation representations.

Sharp-to-Blurry Image Reblurring. Different from modeling the blurring process as a convolution, our generator \(G_r\) models the degradation process via learning to generate the blurry image \(\acute{y}\), given the sharp image x and the corresponding degradation representations E(y). In addition, instead of generating the whole blurry image from scratch, the reblurring generator \(G_r\) learns to predict the residual between the sharp image x and the blurry image y. The learning of the blurring degradation process in our framework is therefore formulated as generating a blurry image \(\acute{y}\) from its clean image x as

$$\begin{aligned} \acute{y} = x + G_r(x, E(y)), \end{aligned}$$
(3)

where \(\acute{y}\) is the reblurred image conditioned on the sharp image x and degradation representation E(y). With the residual learning, the encoder E is encouraged to neglect the contents of the blurry image and to focus on disentangling the content-independent degradation representation E(y) from the blurry image y.
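The residual formulation of Eq. (3) can be sketched in a few lines of PyTorch. The Encoder and ReblurGenerator below are toy stand-ins for the actual MSDI-Net modules of Sect. 3.3; their depths and widths are illustrative assumptions, not the paper's architecture.

```python
# Minimal PyTorch sketch of the residual reblurring of Eq. (3).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Blurry image -> multi-channel spatial degradation map E(y)."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1))
    def forward(self, y):
        return self.net(y)

class ReblurGenerator(nn.Module):
    """(sharp image, degradation map) -> blur residual."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1))
    def forward(self, x, deg):
        # resize the latent map to the image resolution for this toy fusion
        deg = nn.functional.interpolate(deg, size=x.shape[-2:], mode="nearest")
        return self.net(torch.cat([x, deg], dim=1))

E, G_r = Encoder(), ReblurGenerator()
x = torch.rand(1, 3, 64, 64)   # sharp image
y = torch.rand(1, 3, 64, 64)   # its blurry counterpart
y_hat = x + G_r(x, E(y))       # Eq. (3): residual reblurring conditioned on E(y)
```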

Note that most existing encoder-decoder networks, such as VAE [10], can also learn implicit image representations via image reconstruction. However, they aim at encoding the whole-image contents into the latent representations, whereas our framework aims at encoding only the degradation information by predicting the residuals between the sharp and blurry images. Such a task is easier than reconstructing all contents of an input image and therefore leads to better blurring degradation representations.

We first tried the mainstream \(L_1\) distance as the loss function. However, as shown in our supplementary material, the \(L_1\) loss merely measures the pixel-wise distance and cannot properly describe the similarity of the blur patterns of two images. The reblurring generator \(G_r\) therefore cannot generate convincing blurry images under the \(L_1\) loss alone, which harms the learning of degradation representations. We thus resort to the perceptual loss [9] and adversarial training [7] to distinguish different degradation patterns during training. The adversarial loss is applied to the output of the generator \(G_r\) to distinguish real from fake blurry images, so that the generator can model the blurring process well and improve the expressiveness of the degradation representations. We adopt the hinge loss [17, 22, 28, 48] as the adversarial loss and train the reblurring generator \(G_r\) with the multi-scale discriminator D used in [42]. The training objective for image reblurring is formulated as

$$\begin{aligned} \begin{aligned} L_G&= - E_{x\sim p_{\textrm{data}}} D(x, \acute{y}) + \lambda _1 L_{\textrm{perceptual}}(y,\acute{y}), \\ L_D&= - E_{(x,y)\sim p_{\textrm{data}}} [\textrm{min}(0, -1 + D(x,y))] - E_{x\sim p_{\textrm{data}}} [\textrm{min} (0, -1-D(x, \acute{y}))], \end{aligned} \end{aligned}$$
(4)

where \(\lambda _1\) balances the perceptual loss and the adversarial loss for image reblurring, and D is a conditional discriminator conditioned on the sharp image x. Conditioned on x, the discriminator can focus on whether the reblurred image \(\acute{y}\) has the same image contents as the corresponding sharp image x. The reblurring thus helps extract content-independent degradation information (shown in Fig. 6), which is different from the content-dependent conditional networks [45, 50, 53].
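For concreteness, a hedged PyTorch sketch of the hinge objective in Eq. (4) follows. Here D is assumed to be a callable conditional discriminator D(x, img) returning per-sample scores, and perceptual is assumed to be a perceptual-loss callable; both are placeholders for the actual multi-scale discriminator [42] and perceptual loss [9].

```python
# Sketch of the hinge adversarial losses in Eq. (4).
import torch
import torch.nn.functional as F

def d_loss(D, x, y, y_hat):
    """Discriminator hinge loss: real pairs (x, y), fake pairs (x, y_hat)."""
    loss_real = F.relu(1.0 - D(x, y)).mean()               # -min(0, -1 + D(x, y))
    loss_fake = F.relu(1.0 + D(x, y_hat.detach())).mean()  # -min(0, -1 - D(x, y_hat))
    return loss_real + loss_fake

def g_loss(D, x, y, y_hat, perceptual, lambda_1: float = 30.0):
    """Generator loss: adversarial term plus weighted perceptual term."""
    return -D(x, y_hat).mean() + lambda_1 * perceptual(y, y_hat)
```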

Blurry-to-Sharp Image Deblurring. To make the learned degradation representations contribute to image deblurring, we model the image deblurring process jointly with the image reblurring process. The image deblurring generator \(G_d\) follows a similar design to that of \(G_r\): given the blurry image y and the learned degradation representations E(y), it only needs to predict the deblurring residual between y and its corresponding sharp image x. The blurry-to-sharp image deblurring is modeled as

$$\begin{aligned} \acute{x} = y + G_d(y, E(y)). \end{aligned}$$
(5)

Thanks to the 2D learnable degradation representations, the image deblurring generator \(G_d\) is aware of the spatially varying blur patterns and can thus adaptively handle various and complicated degradation patterns. Learning image deblurring, in turn, adapts the learned degradation representations to the deblurring task. The loss function for image deblurring is formulated as

$$\begin{aligned} L_d(x, \acute{x}) = \lambda _2 L_1(x, \acute{x}). \end{aligned}$$
(6)

Discussion of the Learned Degradation Representation. The learned degradation representation has two main advantages over conventional kernel modeling: 1) Our degradation representations can learn non-uniform, spatially varying degradations effectively. Figures 1 and 6 show that the encoder can distinguish different degradation representations. 2) Interpolating in the latent space of representations can generate blurry images with controllable blur levels (as shown in Fig. 5). The representations are also content-independent (as shown in Fig. 6), which is different from previous conditional networks [45, 50, 53]. Built on this latent space, the representations show better interpretability and expressiveness.

3.2 Image Deblurring with Learned Degradation Representations

After obtaining the pre-trained encoder E, we freeze it and re-train the deblurring generator \(G_d\) to show that our improvement comes not from a complicated framework but from the learned degradation representations. To further demonstrate the generality of the learned degradation representations, we also train the deblurring generator \(G_d\) on the RealBlur dataset [30] with the encoder trained on the GoPro dataset [24].

Following HINet [3], we select the Peak Signal-to-Noise Ratio (PSNR) loss as the main supervision. We also utilize a blur-aware loss function as extra supervision. A well-trained encoder E should be sensitive to various and even subtle blur patterns. We therefore define the blur-aware loss as the distance between the degradation encoder features of the ground-truth sharp image x and the estimated sharp image \(\acute{x}\). The deviation \(\Vert E(x)- E(\acute{x})\Vert _1\) between the encoder feature maps of x and \(\acute{x}\) gives more weight to the remaining blurry regions of the network output \(\acute{x}\). Similar to the perceptual loss [9], the \(L_1\) distances between the encoder feature maps can be computed at multiple scales. The blur-aware loss \(L_{\textrm{blur}}\) is therefore formulated as

$$\begin{aligned} L_{\textrm{blur}}(\acute{x},x) = \sum _{i=1}^N \frac{1}{|E^{(i)}|} [||E^{(i)}(\acute{x}) - E^{(i)}(x)||_1], \end{aligned}$$
(7)

where \(E^{(i)}\) denotes the i-th layer of the encoder E and \(|E^{(i)}|\) denotes the number of pixels in feature map \(E^{(i)}(x)\). Using the blur-aware loss makes the deblurring generator \(G_d\) pay more attention to the remaining blurry regions of the network output \(\acute{x}\). Then the deblurring objective is formulated as

$$\begin{aligned} L(x,\acute{x})= \textrm{PSNR}(x,\acute{x}) + \lambda _3 L_{\textrm{blur}}(x,\acute{x}). \end{aligned}$$
(8)
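A minimal sketch of the blur-aware loss of Eq. (7) and the objective of Eq. (8) might look as follows; the feature lists are assumed to come from the frozen encoder's intermediate layers, and the PSNR loss is written as the negative PSNR (up to a constant) for images in [0, 1].

```python
# Sketch of the blur-aware loss (Eq. 7) and the deblurring objective (Eq. 8).
import torch

def blur_aware_loss(feats_pred, feats_gt):
    """Mean L1 distance between encoder features E^(i)(x_hat) and E^(i)(x),
    summed over the N scales; .mean() handles the 1/|E^(i)| normalization."""
    return sum(torch.abs(fp - fg).mean() for fp, fg in zip(feats_pred, feats_gt))

def psnr_loss(x_hat, x, eps: float = 1e-8):
    """Negative PSNR up to a constant (for [0,1] images): minimizing this
    maximizes PSNR."""
    mse = torch.mean((x_hat - x) ** 2)
    return 10.0 * torch.log10(mse + eps)

def deblur_objective(x_hat, x, feats_pred, feats_gt, lambda_3: float = 1.0):
    return psnr_loss(x_hat, x) + lambda_3 * blur_aware_loss(feats_pred, feats_gt)
```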
Fig. 2. The network structure of the multi-scale degradation injection network for image deblurring. Image reblurring shares the same structure.

3.3 Multi-scale Degradation Injection Network

To effectively integrate the degradation representations E(y) into predicting the reblurring and deblurring residuals, we propose a multi-scale degradation injection network (MSDI-Net) for both reblurring and deblurring. Our reblurring generator \(G_r\) and deblurring generator \(G_d\) each consist of two MSDI-Nets, stacked as in HINet [3]. For simplicity, the overview of one MSDI-Net is shown in Fig. 2. The details of the architectures are given in the supplementary materials.

The MSDI-Net consists of an encoder, a decoder, concatenation-based skip-connections, and a multi-scale degradation injection module. The first three components are widely adopted in U-Net-like architectures [31]. The multi-scale degradation injection module modulates the feature of each skip-connection spatially, based on the learned degradation representation. Spatially variant modulation has been explored in image synthesis [28] and image super-resolution [16]; we adopt it as the key to connecting the learnable degradation representations to reblurring and deblurring.

Let \(f_i \in \mathbb {R}^{C_i \times H_i \times W_i}\) denote the feature extracted at scale \(i = 1, \dots , 5\) in the encoder. The extracted feature \(f_i\) is passed to the decoder through the skip connection at scale i, and our MSDI-Net integrates the degradation representations into these concatenation-based skip-connections. At scale i, we obtain the degradation map \(M_i \in \mathbb {R}^{C_i \times H_i \times W_i}\) from the previous degradation map \(M_{i-1} \in \mathbb {R}^{C_{i-1} \times H_{i-1} \times W_{i-1}}\) (\(H_{i} = 2 \times H_{i-1}, W_{i} = 2 \times W_{i-1}\)) via a convolution-based upsampling block, implemented as nearest-neighbor interpolation followed by a convolution layer to avoid checkerboard artifacts [25]. We then utilize a spatially adaptive modulation (SAM) module to modulate the skip-connection feature \(f_i\) at scale i. The spatially adaptive modulation adjusts the feature map channels in a spatially varying manner with predicted scaling and bias terms. At the skip connection of scale i, we use several \(3\times 3\) convolution layers on the degradation map \(M_i\) to predict the modulating parameters \(\gamma _i \in \mathbb {R}^{C_i \times H_i \times W_i}\) and \(\beta _i \in \mathbb {R}^{C_i \times H_i \times W_i}\), which modulate the feature map \(f_i\) as

$$\begin{aligned} F_i = \gamma _i \odot f_i + \beta _i, \end{aligned}$$
(9)

where \(F_i\) is the modulated skip-connection features. Since the modulation parameters are predicted from the degradation representations, the learnable modulation of the feature channels makes the deblurring network aware of spatially varying degradations. The degradation-aware feature \(F_i\) is then concatenated with the decoder feature at scale i and the last decoder layer predicts the image residuals for both image reblurring and deblurring.
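A possible PyTorch sketch of one SAM block at a single skip connection is given below; the exact convolution depths are assumptions, but the structure follows the text: nearest-neighbor upsampling plus a convolution to grow the degradation map, then \(3\times 3\) convolutions predicting \(\gamma _i\) and \(\beta _i\) for Eq. (9).

```python
# Sketch of spatially adaptive modulation (SAM) at one skip connection.
import torch
import torch.nn as nn

class SAM(nn.Module):
    def __init__(self, deg_ch: int, feat_ch: int):
        super().__init__()
        # upsample M_{i-1} -> M_i: nearest-neighbor + conv avoids
        # checkerboard artifacts [25]
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(deg_ch, feat_ch, 3, padding=1))
        self.to_gamma = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)

    def forward(self, f_i: torch.Tensor, m_prev: torch.Tensor):
        m_i = self.up(m_prev)                           # M_i at scale i
        gamma, beta = self.to_gamma(m_i), self.to_beta(m_i)
        return gamma * f_i + beta, m_i                  # Eq. (9): F_i = gamma * f_i + beta

# usage at one skip connection: f_i is the encoder feature at scale i
sam = SAM(deg_ch=16, feat_ch=32)
f_i = torch.rand(1, 32, 64, 64)
m_prev = torch.rand(1, 16, 32, 32)
F_i, m_i = sam(f_i, m_prev)  # F_i is then concatenated with the decoder feature
```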

Injecting the degradation representations enables the networks to handle various and complicated degradation patterns adaptively. Injection at multiple scales improves the expressiveness of degradation representations by strengthening the connections between the degradation representations and two generators.

4 Experiments

4.1 Dataset and Implementation Details

We train and evaluate our method on the GoPro [24] and RealBlur [30] datasets. The GoPro dataset consists of 2,103 pairs of blurry and sharp images for training and 1,111 pairs for testing. The RealBlur dataset consists of 3,758 pairs for training and 980 pairs for testing. We first train the joint reblurring and deblurring framework on the GoPro dataset [24]. We apply horizontal flipping and rotation as data augmentation and crop image patches of size \(256 \times 256\) for training. \(\lambda _1\) is set to 30 and \(\lambda _2\) to 10. The whole framework is trained with a batch size of 32 for 200k iterations. We then freeze the weights of the well-trained encoder E and train the deblurring generator \(G_d\) on the GoPro dataset [24] and the RealBlur dataset [30], respectively, with \(\lambda _3\) set to 1. The deblurring generator \(G_d\) is trained with a batch size of 64 for 400k iterations. We use the Adam optimizer with the learning rate initialized to \(3 \times 10^{-4}\) and decreased to \(1 \times 10^{-7}\) following the cosine annealing strategy [19].
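The optimization schedule described above can be sketched as follows; the placeholder model stands in for the actual generators, and the iteration count matches the 400k deblurring iterations.

```python
# Sketch of the Adam + cosine-annealing schedule (3e-4 decayed to 1e-7).
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # placeholder for the generators
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=400_000, eta_min=1e-7)  # cosine annealing [19]

for step in range(400_000):
    optimizer.zero_grad()
    # forward pass, loss computation, and loss.backward() would go here
    optimizer.step()
    scheduler.step()
```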

Table 1. Deblurring comparisons on GoPro [24] dataset. Best and second best scores are highlighted and underlined.
Table 2. Detailed comparisons of MPRNet [47] and our method on GoPro test dataset [24]. MACs are estimated with the input size of \(3\times 256\times 256\).
Fig. 3. Visual comparisons for image deblurring on the GoPro test dataset [24]. From top-left to bottom-right: blurry images, ground-truth images, and results obtained by MIMO-UNet [4], HINet [3], MPRNet [47], and our proposed method.

4.2 Performance Comparison

We compare our method with state-of-the-art deblurring methods [3, 4, 47] on the GoPro test dataset [24]. The quantitative results are reported in Table 1. For testing, we slice the whole image into several \(256\times 256\) patches and test all patches to report the results of HINet [3], MPRNet-patch256 [47], and our method. Our method achieves a 0.45 dB PSNR improvement over the previous best-performing method, HINet [3]. To evaluate the effectiveness and generality of the learned degradation representations, we also evaluate our method on the RealBlur dataset. As listed in Table 3, our method achieves the best performance in terms of PSNR and SSIM. Since HINet [3] does not release a model for the RealBlur dataset [30], we train the HINet model on RealBlur using their released training code. Note that we apply the degradation representations trained on the GoPro dataset [24] directly to the RealBlur dataset [30]. Our method outperforms the previous SOTA HINet [3] by 0.23 dB PSNR, which demonstrates the generality of the learnable degradation representations.
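The patch-wise testing protocol can be sketched as follows; overlap handling and border padding are omitted for brevity, so this simplified version assumes image sizes divisible by the patch size.

```python
# Sketch of patch-wise testing: slice into 256x256 patches, deblur each,
# and stitch the outputs back into the full image.
import torch

def test_by_patches(net, img: torch.Tensor, p: int = 256) -> torch.Tensor:
    """img: (1, 3, H, W) with H and W divisible by p in this simplified sketch."""
    _, _, h, w = img.shape
    out = torch.zeros_like(img)
    for i in range(0, h, p):
        for j in range(0, w, p):
            out[:, :, i:i + p, j:j + p] = net(img[:, :, i:i + p, j:j + p])
    return out
```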

Fig. 4. Visual comparisons for image deblurring on the RealBlur test dataset [30]. From top-left to bottom-right: blurry images, ground-truth images, and results obtained by DeblurGANv2, HINet [3], MIMO-UNet [4], and our proposed method.

Table 3. Deblurring comparisons on the RealBlur test dataset [30].

In Table 2, we provide detailed comparisons between our method and MPRNet [47]. Following [34], we divide the GoPro test dataset into the blurriest 10% and the sharpest 10% of images. The main gain of our method is the 0.34 dB PSNR improvement on the blurriest 10%, which demonstrates the advantage of the proposed degradation learning. Moreover, our method's computational cost is less than 50% of MPRNet's [47].

Figures 3 and 4 show example deblurred results from the GoPro [24] and RealBlur [30] test sets by the evaluated approaches. Our method produces sharper images and recovers more details in the regions of texts and moving objects, compared with other methods.

4.3 Interpolation and Decoupling of Degradation Representations

To demonstrate the effectiveness of the learned degradation representations, we study their interpolation and decoupling properties. Given a blurry image y and its corresponding sharp image x, we can obtain two degradation representations E(y) and E(x). We then obtain several intermediate degradation representations by interpolating from E(y) to E(x); the corresponding output of the decoder changes smoothly from sharp to blurry. The blur interpolation in Fig. 5 shows that our degradation representations live in a latent space that accurately reflects different degradations. We further empirically validate the decoupling of blur and image contents on the GoPro dataset. We divide the GoPro test dataset into 555 pairs of images. For each pair of sharp images \(\{A,B\}\), the corresponding blurry images are \(\{\textrm{blur}(A),\textrm{blur}(B)\}\) and the degradation representations are \(\{\textrm{deg}_A, \textrm{deg}_B\}\). The sharp image A can be reblurred according to \(\textrm{deg}_B\) to obtain \(\textrm{ReBlur}(A, \textrm{deg}_B)\). We use the average contextual similarity \(\textrm{CX}\) [20] to measure pairwise image similarity. Averaged over all pairs, we have \(\textrm{CX}(\textrm{blur}(A),A)= 2.72\), \(\textrm{CX}(\textrm{ReBlur}(A,\textrm{deg}_B),A)=2.65\), \(\textrm{CX}(\textrm{blur}(A),B)=5.43\), and \(\textrm{CX}(\textrm{ReBlur}(A,\textrm{deg}_B),B)=5.39\). These similarities empirically show that \(\textrm{ReBlur}(A, \textrm{deg}_B)\) has contents similar to A (the former two values) but not to B (the latter two values). Visual examples of degradation decoupling are provided in Fig. 6. The decoupling of the learned degradation representations distinguishes our method from content-dependent conditional networks [45, 50, 53] and shows its capacity to serve as a general operator replacing the conventional blurring process.
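The interpolation experiment can be sketched as follows, assuming trained E and \(G_r\) modules; interpolate_blur is a hypothetical helper written for illustration, not part of the released method.

```python
# Sketch of the degradation interpolation experiment: blend E(x) and E(y)
# linearly and reblur the sharp image with each intermediate representation.
import torch

def interpolate_blur(E, G_r, x, y, steps: int = 5):
    """Returns reblurred images whose blur level moves from sharp to blurry."""
    deg_sharp, deg_blur = E(x), E(y)
    outputs = []
    for t in torch.linspace(0.0, 1.0, steps):
        deg_t = (1.0 - t) * deg_sharp + t * deg_blur  # latent interpolation
        outputs.append(x + G_r(x, deg_t))             # Eq. (3) with blended E
    return outputs
```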

Fig. 5. Generating blurry images with linearly interpolated degradation representations. From left to right: the blur level goes from sharp to blurry.

Fig. 6. Reblurring \(A_1,A_2,A_3\) with the degradation representation of \(\textrm{Blur}(B)\).

Table 4. The ablation study of image deblurring on GoPro test dataset [24].

4.4 Ablation Study

We evaluate the effectiveness of the learned degradation representations and the multi-scale degradation injection network by revising one component of our model at a time. Table 4 lists the performance of the different settings on the GoPro test set [24]. We first remove the degradation encoder and make \(G_d\) take only the blurry image as input (denoted as “Ours w/o degradation”). The performance suffers a large drop of 0.47 dB PSNR. We then remove the image reblurring generator but retain the encoder to provide additional encodings of the blurry images (denoted as “Ours w/o reblurring”). This causes a drop of 0.19 dB PSNR, demonstrating that reblurring indeed contributes to learning better degradation representations. Next, we remove the blur-aware loss (Eq. (7)), so the training objective for deblurring becomes only the PSNR loss (denoted as “Ours w/o blur loss”). The performance drops slightly, by about 0.07 dB PSNR. We further experiment with different ways of integrating the degradation representations into the generators. When we remove the injection at multiple scales, the degradation representation is integrated only at the lowest-resolution skip connection (denoted as “Ours injection w/o multi-scale”). The performance suffers a significant drop of 0.48 dB PSNR, which shows that injecting the degradation representation at a single scale degrades the deblurring performance considerably. We then replace the modulation with concatenating the latent degradation map to the skip-connection feature maps (denoted as “Ours w/ concat injection”); for a fair comparison of computational cost, we add two-layer residual blocks after the feature concatenation. This causes a drop of 0.23 dB PSNR. Finally, we remove the injection entirely and concatenate the upsampled degradation maps with the blurry image at the network entrance (denoted as “Ours input w/ concat”), which is the mainstream design in image denoising [8, 21, 51]. Its performance drops by about 0.3 dB PSNR. All the ablation studies demonstrate the effectiveness of our proposed learnable degradation representations and MSDI-Net for image deblurring.

5 Conclusions

In this paper, we propose a framework for learning degradation representations through joint image reblurring and deblurring. We first utilize an encoder to learn the degradation representations explicitly. We then propose a multi-scale degradation injection network to effectively integrate the degradation representations into reblurring and deblurring. With the degradation representations, our networks are aware of and adaptively handle spatially varying degradation patterns. Experimental results demonstrate that our method outperforms other methods by a clear margin.