1 Introduction

Motion blur is usually caused by the motion of objects in the captured scene or by camera shake. Image deblurring, whose goal is to recover an image with sharp details from a given blurred image, has long been a challenging problem in computer vision and image processing. Blurry images not only degrade the quality of human visual perception but also harm the performance of vision tasks such as object detection [1] and face recognition [2]. Therefore, although image deblurring is a low-level computer vision task, an efficient deblurring algorithm that recovers image structure and texture is of great practical significance.

Mathematically, the image degradation model due to blurring can be expressed as follows:

$$\begin{aligned} x = F(y,k) + n \end{aligned}$$
(1)

where x and y denote the blurry image and the clear latent image, respectively, F(y, k) usually denotes the 2D convolution of y with the blur kernel k, and n represents additive random noise. Since only the blurry image x is given and all other quantities are unknown, Eq. (1) admits infinitely many solutions, so image deblurring is a highly ill-posed problem. Traditional methods usually employ blur kernel estimation or natural image priors to deal with this ill-posedness [3,4,5,6,7,8]. Thanks to the rapid development of deep learning techniques and the availability of large-scale datasets, a large number of learning-based methods employ end-to-end deep convolutional neural networks to learn the mapping between blurry and clear images [9,10,11,12,13,14,15,16,17]. Despite their excellent performance, these learning-based methods are still insufficient at recovering texture details. Methods based on generative adversarial networks [18,19,20,21] can enhance perceptual quality, but they sometimes generate deblurred results with artifacts, as shown in Fig. 1.
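
For concreteness, the following minimal PyTorch sketch simulates Eq. (1) under the simplifying assumption of a single spatially uniform blur kernel (real motion blur is generally spatially varying); the function and its arguments are illustrative only.

```python
import torch
import torch.nn.functional as F

def degrade(y, k, noise_std=0.01):
    """Simulate Eq. (1): x = F(y, k) + n for a spatially uniform blur.

    y: sharp image tensor of shape (B, C, H, W)
    k: blur kernel tensor of shape (kh, kw), assumed normalized to sum to 1
    """
    b, c, h, w = y.shape
    kh, kw = k.shape
    # Apply the same 2D kernel to every channel (depthwise convolution).
    weight = k.view(1, 1, kh, kw).repeat(c, 1, 1, 1).to(y)
    x = F.conv2d(y, weight, padding=(kh // 2, kw // 2), groups=c)
    # Additive random noise n.
    return x + noise_std * torch.randn_like(x)
```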

Recently, the field of image super-resolution [22,23,24,25,26,27,28] has introduced additional reference images to super-resolve low-resolution (LR) images, compensating for the information lost in LR images and achieving excellent performance. This approach, called reference-based super-resolution (RefSR), aims to transfer relevant texture details from a reference image to the LR image. Inspired by RefSR, we explore a new approach that utilizes reference images to help deblur degraded images. Reference images and blurry images share similar content and textures; they can be captured from different camera angles or obtained from video frames. Compared with feeding only one blurry image to the neural network, an additional reference image provides complementary information, which alleviates the ill-posedness of the image deblurring problem to a certain extent. The recent deblurring algorithm EFNet [29] also exploits additional information and achieves state-of-the-art performance. Reference-based methods thus have great potential for image restoration, but applying them to image deblurring is still difficult: the key challenge is establishing the correspondence between the blurry image and the reference image, that is, how to transfer the relevant texture information to facilitate the reconstruction of the deblurred image. Specifically, the complementary information in the reference image is misaligned with the blurry image due to different camera angles or object movements. To transfer high-quality textures efficiently, the reference image must be aligned with the blurry image. Most RefSR methods adopt spatial alignment [23] or patch alignment [24, 26] for this purpose. However, unlike downsampled LR images, blurry images suffer from severe blur degradation, which makes it difficult to directly establish correspondences between the blurry image and the reference image. In addition, inaccurate alignment transfers irrelevant textures to the deblurred image, severely degrading deblurring performance. Hence, we need a framework that establishes correspondences between blurry images and reference images and adaptively transfers the relevant texture features.

Fig. 1
figure 1

One example. a Input blurry image. b Result of DeblurGAN-v2 [19]. c Result of MIMO-Unet+ [14]. d Our result

To address the issues mentioned above, we propose a novel reference-based image deblurring framework, which consists of two tasks: single image deblurring and reference feature transfer. First, we perform coarse deblurring on the blurry input, which alleviates the difficulty of matching between the blurry image and the reference image. Given the result of the single image deblurring task, the reference feature transfer task establishes the corresponding matching relationship and transfers textures from the reference image to the deblurred image. In addition, to transfer features from reference images stably and efficiently, we propose a reference alignment module that extracts high-quality features using both patch alignment and deformable alignment. Finally, we fuse the features and reconstruct the deblurred image with adaptive feature fusion. We conduct extensive experiments on synthetic and real-world datasets; quantitative and visual results demonstrate that our method achieves state-of-the-art performance.

The main contributions of this paper are summarized as follows:

  • We propose a novel reference-based dual-task framework for image deblurring, which consists of a single image deblurring task and a reference feature transfer task.

  • We propose the reference alignment module and adaptive feature fusion module, which effectively utilize the texture features of the reference image and refine the single image deblurring results.

  • Extensive experiments on benchmark datasets show that our framework achieves excellent deblurring performance. Moreover, our framework remains effective even when the reference image is dissimilar to the blurry input.

2 Related work

In this section, we briefly review some works related to our research, including learning-based deblurring methods and reference-based methods.

2.1 Learning-based deblurring methods

Learning-based methods have made significant progress in recent years. Sun et al. [30] utilized convolutional neural networks to estimate spatially varying motion blur kernels from local patches and then obtained deblurred results by deconvolving the blurry images. Nah et al. [9] proposed an end-to-end multi-scale network that gradually restores clear images from coarse to fine. Similarly, Tao et al. [10] proposed a scale-recurrent structure on the multi-scale basis to reduce the number of parameters. Meanwhile, generative adversarial networks [18, 19] have also been employed to improve the perceptual quality of deblurring results. Zhang et al. [12] divided the image into multiple patches as network input and aggregated the patches at different stages for better performance. Park et al. [31] proposed incremental temporal training, which uses temporal information to gradually restore blurred images. Li et al. [32] introduced a lightweight global context refinement module into image deblurring to enrich global feature details. Cho et al. [33] proposed a fully convolutional deblurring network with multiple inputs and multiple outputs that fuses features of different scales to achieve excellent performance. Niu et al. [33] extracted spatio-temporal information from the blurry input to assist deblurring. Although the methods mentioned above are effective for image deblurring, they rely on a single blurry image alone. Severe degradation removes both low-frequency and high-frequency information, which correspond to image structure and texture details, respectively. Lacking sufficient information, these methods struggle to recover structures and textures from heavily blurred images, which limits their deblurring performance. In contrast, the reference images we introduce provide additional information and facilitate the restoration of image structure and texture.

2.2 Reference-based super-resolution

RefSR super-resolves an LR image with the help of an additional reference image; its purpose is to align the LR image with the reference image and then extract and transfer texture information from the reference. Zheng et al. [23] estimated the optical flow between the reference image and the low-resolution image and then aligned them with the flow; optical flow estimation is widely used in computer vision [34,35,36]. Inspired by video super-resolution [37, 38], Shim et al. [39] further utilized deformable convolution (DCN) to extract relevant reference features. However, these alignment methods usually fail to construct long-range correspondences between image pairs. Therefore, patch-alignment-based methods [24,25,26, 40] were proposed. Zhang et al. [24] fused reference features in a multi-scale feature space by computing the similarity between patches. Yang et al. [25] proposed a texture transformer in which hard and soft attention are used to extract and fuse textures. Lu et al. [26] proposed a coarse-to-fine patch correspondence matching scheme that significantly reduces the computational complexity. Wang et al. [40] first generalized RefSR to real-world dual cameras, super-resolving wide-angle images with telephoto images and obtaining high-fidelity results. Huang et al. [28] proposed a reference-based dual-task [41] framework, which achieved state-of-the-art performance.

The above RefSR methods suggest another route for image deblurring. To this end, we explore new ways of utilizing an additional sharp reference image to assist image deblurring. Inspired by previous methods [25, 39], we adopt patch alignment and deformable alignment [42] strategies to handle the different fields of view between blurry and reference images. We also equip our framework with an adaptive fusion module designed to aggregate the reference and deblurred features effectively and efficiently.

Fig. 2
figure 2

The overview of the proposed deblurring framework, which contains a single image deblurring task and a reference feature transfer task. The single image deblurring network deblurs the blurry input \({I_\mathrm{{Input}}}\) into the deblurred image \({I_\mathrm{{Deblur}}}\) and obtains its feature \({F_\mathrm{{Deblur}}}\). The reference feature transfer task first calculates the cosine similarity matrix between \({I_\mathrm{{Deblur}}}\) and the reference image \({I_\mathrm{{Ref}}}\). The reference feature \({F_\mathrm{{Ref}}}\) is then warped by the reference alignment module. After the final adaptive feature fusion with \({F_\mathrm{{Deblur}}}\), the framework yields a clear output \({I_\mathrm{{Output}}}\)

3 Method

In this work, to obtain deblurred images with fine textures, we propose a dual-task framework to fully utilize the additional reference images. We apply separate tasks to the blurry input and the reference image. To be specific, the blurry input provides content and structural information for the deblurred result, while the reference image is expected to provide texture details. To this end, we process the blurry input \({I_\mathrm{{Input}}}\) and the reference image \({I_\mathrm{{Ref}}}\) separately and then perform adaptive fusion. As shown in Fig. 2, our framework mainly consists of two parts:

For the single image deblurring task, we roughly deblur the blurry input to reduce its blur degradation degree and obtain the single image deblurring result \({I_\mathrm{{Deblur}}}\) and the deblurring feature \({F_\mathrm{{Deblur}}}\):

$$\begin{aligned} {F_\mathrm{{Deblur}}} = {\mathcal{F}_{ID}}({I_\mathrm{{Input}}}) \end{aligned}$$
(2)

where \({\mathcal{F}_{ID}}\) represents the single image deblurring model, and here we use BAM [43] as the main building block of the model.

For the reference feature transfer task, we extract well-aligned features from the reference image and transfer them to the deblurred feature \({F_\mathrm{{Deblur}}}\). We first map \({I_\mathrm{{Ref}}}\) and \({I_\mathrm{{Deblur}}}\) into feature maps using a shared pre-trained VGG19 model. We then use the normalized inner product [24] in the feature space to calculate the cosine similarity matrix \({M_{i,j}}\) between \({I_\mathrm{{Ref}}}\) and \({I_\mathrm{{Deblur}}}\). Then, we compute the index map P and the confidence map C from the matrix \({M_{i,j}}\) for the following two modules:

$$\begin{aligned} {P_i}= & {} \mathop {\arg \max }\limits _j {M_{i,j}} \end{aligned}$$
(3)
$$\begin{aligned} {C_i}= & {} \mathop {\max }\limits _j {M_{i,j}} \end{aligned}$$
(4)

where Eq. (3) takes, for each row of the cosine similarity matrix M, the index of its maximum value, and Eq. (4) takes the maximum value itself. The index map P is used to warp the reference image so that it aligns with the blurry input, and the confidence map C is used to weight the relevant reference features. Next, we use the reference alignment module to extract the well-aligned reference feature \({F_\mathrm{{Aligned}}}\):

$$\begin{aligned} {F_\mathrm{{Aligned}}} = {\mathcal{F}_\mathrm{{RA}}}(P,{I_\mathrm{{Ref}}},{I_\mathrm{{Deblur}}}) \end{aligned}$$
(5)

where \({\mathcal{F}_\mathrm{{RA}}}\) represents the reference alignment module. Note that \({I_\mathrm{{Deblur}}}\) is only used to calculate the cosine similarity matrix with the reference image. Then, \({F_\mathrm{{Aligned}}}\) is aggregated with the deblurring feature \({F_\mathrm{{Deblur}}}\) through the adaptive feature fusion module, and the final deblurring result \({I_\mathrm{{Output}}}\) is obtained:

$$\begin{aligned} {I_\mathrm{{Output}}} = {\mathcal{F}_\mathrm{{AFF}}}({F_\mathrm{{Aligned}}},{F_\mathrm{{Deblur}}},C) \end{aligned}$$
(6)

where \({\mathcal{F}_\mathrm{{AFF}}}\) represents the adaptive feature fusion module.
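
As a rough illustration of the matching step (Eqs. (3)–(4)), the following hedged PyTorch sketch computes the cosine similarity matrix with normalized inner products over unfolded feature patches and extracts P and C; the function name and shapes are ours, not the authors', and computing the full-resolution matrix is memory-intensive, as discussed in Sect. 5.

```python
import torch
import torch.nn.functional as F

def match(feat_deblur, feat_ref, patch_size=3):
    """Compute the index map P (Eq. 3) and confidence map C (Eq. 4).

    feat_deblur, feat_ref: VGG features of I_Deblur and I_Ref, shape (1, C, H, W).
    Returns P and C with one entry per query patch position.
    """
    # Unfold both feature maps into (num_patches, C * patch_size**2) patch vectors.
    q = F.unfold(feat_deblur, kernel_size=patch_size, padding=patch_size // 2)
    k = F.unfold(feat_ref, kernel_size=patch_size, padding=patch_size // 2)
    q = F.normalize(q.squeeze(0).t(), dim=1)   # (N_q, D)
    k = F.normalize(k.squeeze(0).t(), dim=1)   # (N_k, D)
    # Cosine similarity matrix M_{i,j} via normalized inner products.
    M = q @ k.t()                              # (N_q, N_k)
    C, P = torch.max(M, dim=1)                 # per-row max (C) and argmax (P)
    return P, C
```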

Fig. 3
figure 3

Illustration of the reference alignment module and adaptive feature fusion module

3.1 Feature extraction with reference alignment

The purpose of the reference alignment module is to obtain reference features aligned with the blurry input, which are used in the subsequent adaptive feature fusion. Inspired by reference-based and video super-resolution, as shown in Fig. 3a, we adopt a combination of patch alignment and deformable alignment. The key to patch alignment is to use the index map P computed from the cosine similarity matrix to select high-quality features from the reference feature \({F_\mathrm{{Ref}}}\). After obtaining the index map P, we therefore unfold the reference feature \({F_\mathrm{{Ref}}}\) into patches using the Unfold operation of the PyTorch [44] framework, with a 3\(\times \)3 sliding window and a stride of 1. Then, we warp the reference feature \({F_\mathrm{{Ref}}}\) with the index map P, followed by a folding operation, to obtain the aligned feature \({F_\mathrm{{Paligned}}}\):

$$\begin{aligned} {F_\mathrm{{Paligned}}} = \mathcal{W}({F_\mathrm{{Ref}}},P) \end{aligned}$$
(7)

where \(\mathcal{W}\) represents the spatial warping operation. This warping step transfers reference feature patches according to the index map P, i.e., according to the index of the most relevant position. Patch alignment can stably find similar textures in the reference features, but simple patch-level alignment alone cannot fully exploit the similar features of the reference image [40]. Previous studies [38, 42, 45] have shown that deformable alignment has superior alignment performance; therefore, we introduce deformable alignment into the reference alignment module to align reference features and blurry features adaptively at the feature level [37]. Here we use DCNv2 [47]. Specifically, we first concatenate the reference feature \({F_\mathrm{{Ref}}}\) and the deblurred feature \({F_\mathrm{{Deblur}}}\) to predict the offset o and modulation mask m:

$$\begin{aligned} o= & {} {\mathcal{E}_o}([{F_\mathrm{{Ref}}},{F_\mathrm{{Deblur}}}]) \end{aligned}$$
(8)
$$\begin{aligned} m= & {} \sigma ({\mathcal{E}_m}([{F_\mathrm{{Ref}}},{F_\mathrm{{Deblur}}}])) \end{aligned}$$
(9)

where \({\mathcal{E}_o}\) and \({\mathcal{E}_m}\) represent stacked convolutional layers, \(\sigma\) represents the sigmoid activation function, and [, ] represents the concatenation operation. With the help of the mask, we can perform the alignment operation adaptively even if the reference image and the blurry image do not depict the same scene. Then, we use the deformable convolution \(\mathcal{D}\) to compute the aligned feature \({F_\mathrm{{Daligned}}}\):

$$\begin{aligned} {F_\mathrm{{Daligned}}} = \mathcal{D}({F_\mathrm{{Ref}}},o,m) \end{aligned}$$
(10)

Thanks to the offset diversity of deformable convolution, the filter implicitly captures motion information by sampling local positions. After obtaining the deformably aligned reference feature \({F_\mathrm{{Daligned}}}\), we concatenate it with the deblurred feature \({F_\mathrm{{Deblur}}}\) and then use residual blocks R [47] for further feature aggregation. Finally, we fuse the aggregated feature with the feature \({F_\mathrm{{Paligned}}}\) through a convolutional layer to obtain the final aligned feature \({F_\mathrm{{Aligned}}}\):

$$\begin{aligned} {F_\mathrm{{Aligned}}} = Conv({F_\mathrm{{Paligned}}} + R([{F_\mathrm{{Daligned}}},{F_\mathrm{{Deblur}}}])) \end{aligned}$$
(11)

where Conv represents a convolutional layer. The reference alignment module thus combines the stable explicit alignment of patch matching with the implicit alignment of deformable convolution; the combination of the two yields better-aligned features.
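
The following hypothetical PyTorch sketch outlines the two alignment branches described above, assuming torchvision's DeformConv2d for the DCNv2-style step and single-scale features; the channel sizes, module names, and the simple convolution stack standing in for the residual blocks R are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class ReferenceAlignment(nn.Module):
    """Hypothetical sketch of the reference alignment module (Fig. 3a)."""

    def __init__(self, channels=64, patch_size=3):
        super().__init__()
        self.patch_size = patch_size
        # DCNv2-style offset / modulation-mask predictors (Eqs. (8)-(9)).
        self.offset_conv = nn.Conv2d(2 * channels, 2 * 3 * 3, 3, padding=1)
        self.mask_conv = nn.Conv2d(2 * channels, 3 * 3, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)
        # Simple convolution stack standing in for the residual blocks R.
        self.res = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def patch_align(self, f_ref, P):
        """Eq. (7): warp reference patches to the positions given by index map P."""
        b, c, h, w = f_ref.shape
        ks = self.patch_size
        patches = F.unfold(f_ref, ks, padding=ks // 2)            # (B, C*ks*ks, H*W)
        idx = P.view(b, 1, -1).expand(-1, patches.size(1), -1)    # most similar positions
        warped = torch.gather(patches, 2, idx)
        # Fold sums overlapping patches; divide to roughly average the overlaps.
        return F.fold(warped, (h, w), ks, padding=ks // 2) / (ks * ks)

    def forward(self, f_ref, f_deblur, P):
        f_paligned = self.patch_align(f_ref, P)
        feat = torch.cat([f_ref, f_deblur], dim=1)
        offset = self.offset_conv(feat)                           # Eq. (8)
        mask = torch.sigmoid(self.mask_conv(feat))                # Eq. (9)
        f_daligned = self.dcn(f_ref, offset, mask)                # Eq. (10)
        agg = self.res(torch.cat([f_daligned, f_deblur], dim=1))
        return self.fuse(f_paligned + agg)                        # Eq. (11)
```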

3.2 Adaptive feature fusion

Although the reference feature \({F_\mathrm{{Aligned}}}\) obtained by the reference alignment module has content similar to the deblurred feature \({F_\mathrm{{Deblur}}}\), simply concatenating or adding them is suboptimal, because the alignment is not always precise and may introduce noise. To combine the deblurred feature \({F_\mathrm{{Deblur}}}\) and the aligned feature \({F_\mathrm{{Aligned}}}\) effectively, we propose an adaptive feature fusion module, as shown in Fig. 3b. We first concatenate the two features and then use the confidence map C to guide the fusion adaptively. To suppress inaccurately aligned features and emphasize high-quality aligned features, the confidence map C is also passed through a set of convolutional layers that aggregate information from adjacent patches. Formally, we have:

$$\begin{aligned} {F_c} = Conv([{F_\mathrm{{Deblur}}},{F_\mathrm{{Aligned}}}]) \otimes Conv(C) \end{aligned}$$
(12)

Then, we use skip connections to synthesize the fused feature \(F_\mathrm{{fusion}}\):

$$\begin{aligned} {F_\mathrm{{fusion}}} = {F_c} + {F_\mathrm{{Deblur}}} \end{aligned}$$
(13)

Finally, after reconstruction by the decoder, we obtain the deblurred result \(I_\mathrm{{Output}}\):

$$\begin{aligned} {I_\mathrm{{Output}}} = \mathrm{{Decoder}}({F_\mathrm{{fusion}}}) \end{aligned}$$
(14)
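
A minimal sketch of Eqs. (12)–(13), assuming the confidence map C has been reshaped into a one-channel map of the same spatial size as the features; the layer widths and the sigmoid gating at the end of the confidence branch are our assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureFusion(nn.Module):
    """Hypothetical sketch of the adaptive feature fusion module (Fig. 3b)."""

    def __init__(self, channels=64):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, 3, padding=1)
        # Convolutions over the confidence map to aggregate neighbouring patches.
        self.conf = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, f_deblur, f_aligned, C):
        # Eq. (12): weight the concatenated features with the processed confidence map.
        f_c = self.merge(torch.cat([f_deblur, f_aligned], dim=1)) * self.conf(C)
        return f_c + f_deblur   # Eq. (13): skip connection back to the deblurred feature
```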

3.3 Implementation details

Our framework is trained and optimized in two stages. In the first stage, we train the single image deblurring network separately, with 10 BAM basic blocks. The goal is to preserve the spatial structure and content information of the deblurring result, so we use only the L1 loss to minimize the pixel-wise distance between the output \({I_\mathrm{{Deblur}}}\) of the single image deblurring task and the ground truth image \({I_\mathrm{{GT}}}\):

$$\begin{aligned} {\mathcal{L}_1} = {\left\| {{I_\mathrm{{GT}}} - {I_\mathrm{{Deblur}}}} \right\| _1}, \end{aligned}$$
(15)

where \({\left\| \cdot \right\| _1}\) denotes the L1-norm. After training, we fix the single image deblurring network for the second stage of training.

In the second training stage, the output \({I_\mathrm{{Deblur}}}\) of the single image deblurring network is used to compute the cosine similarity matrix with the reference image \({I_\mathrm{{Ref}}}\), and the deblurring feature \({F_\mathrm{{Deblur}}}\) is used for feature fusion to obtain the final deblurred output \(I_\mathrm{{Output}}\). We adopt the reconstruction loss proposed in [40], which compares the deblurred output \(I_\mathrm{{Output}}\) with the ground truth image \({I_\mathrm{{GT}}}\) in both the low- and high-frequency domains. The loss is defined as:

$$\begin{aligned} {\mathcal{L}_\mathrm{{rec}}} = \left\| {I_\mathrm{{Output}}^\mathrm{{filter}} - I_\mathrm{{GT}}^\mathrm{{filter}}} \right\| + \sum \limits _i {{\delta _i}({I_\mathrm{{Output}}},{I_\mathrm{{GT}}})} \end{aligned}$$
(16)

where the superscript filter on the first term denotes filtering with a 3\(\times \)3 Gaussian kernel. The second term \({\delta _i}(X,Y) = {\min _j}\mathbb {D}({x_i},{y_j})\) is the contextual loss, which minimizes the distance between each pixel \({x_i}\) in \(I_\mathrm{{Output}}\) and its most relevant pixel \({y_j}\) in \(I_\mathrm{{GT}}\) under the perceptual distance \(\mathbb {D}\) [48,49,50]. The first term makes the deblurred output stably follow the low-frequency structure of the ground truth image, while the second term flexibly maximizes the similarity between \(I_\mathrm{{Output}}\) and \({I_\mathrm{{GT}}}\), improving the perceptual quality.
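
A hedged sketch of this loss: the first (low-frequency) term is written out directly, while the contextual term is treated as an external callable, since its exact implementation follows [48,49,50]; the mean reduction and the particular 3×3 Gaussian weights are our assumptions.

```python
import torch
import torch.nn.functional as F

# 3x3 Gaussian kernel used for the low-frequency (first) term of Eq. (16).
_GAUSS = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0

def low_freq_l1(output, gt):
    """|| filter(I_Output) - filter(I_GT) ||_1 with a 3x3 Gaussian filter."""
    c = output.size(1)
    k = _GAUSS.view(1, 1, 3, 3).repeat(c, 1, 1, 1).to(output)
    blur = lambda x: F.conv2d(x, k, padding=1, groups=c)
    return F.l1_loss(blur(output), blur(gt))

def reconstruction_loss(output, gt, contextual_loss):
    """Eq. (16); `contextual_loss` is any contextual-loss implementation
    (e.g. one following [48-50]), passed in as a callable."""
    return low_freq_l1(output, gt) + contextual_loss(output, gt)
```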

Training the single image deblurring task took about 150 h. We trained the entire framework using the Adam optimizer [51] with \(\beta _1 = 0.9\) and \(\beta _2 = 0.999\); it took about 80 h for the framework to converge. The batch size is 12. The initial learning rate was 0.0001 and was gradually decreased with a cosine annealing strategy. The first four layers of the VGG19 pre-trained model were used for feature extraction. We built and trained the framework using PyTorch [44] on an NVIDIA TITAN RTX GPU.
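
The reported hyper-parameters translate into roughly the following PyTorch setup; the model, the random data, and the iteration count below are placeholders, and only the optimizer, scheduler, and batch-size settings reflect the text.

```python
import torch
import torch.nn.functional as F

# Placeholder model and data; only the hyper-parameters mirror the paper.
model = torch.nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

for step in range(10):
    blurry = torch.rand(12, 3, 64, 64)     # batch size 12, dummy crops
    gt = torch.rand(12, 3, 64, 64)
    loss = F.l1_loss(model(blurry), gt)    # the full framework would use Eq. (16)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```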

4 Experiments

4.1 Datasets and metrics

Datasets. In the experiments, we use the widely used GoPro [9] dataset. Its training and test sets are generated in the same way: the ground truth images are captured with a GoPro4 high-frame-rate camera, and the corresponding blurry images are generated by averaging consecutive ground truth frames. The dataset contains 3214 pairs of sharp and blurry images, of which 2103 pairs are used for training and 1111 pairs for testing. The resolution of the blurry and ground truth images is \(1280\times 720\). To evaluate the generalization ability of the model, we also test on the HIDE [52] test set, which consists of 2025 images. To evaluate the ability of the model to handle real-world blur, we choose the high-quality RealBlur-J [53] test set, which contains 980 blurry images taken in low light. For the sharp reference images, we choose frames adjacent to the ground truth images as references to facilitate the restoration of the blurry images; the same applies to the HIDE and RealBlur-J datasets. All models in the experiments were trained on the GoPro training set.
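
For reference, the frame-averaging blur synthesis can be sketched as below; note that [9] additionally corrects for the camera response function, which this simplified version omits.

```python
import torch

def synthesize_blur(sharp_frames):
    """GoPro-style blur synthesis: average a window of consecutive sharp frames.

    sharp_frames: tensor of shape (T, C, H, W) holding consecutive sharp frames.
    """
    return sharp_frames.mean(dim=0)
```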

Table 1 Quantitative evaluation results on the GoPro and HIDE test sets. The best scores are shown in bold. \(*\) indicates that the authors do not release source code
Table 2 Quantitative evaluation results on the RealBlur test set. The best scores are shown in bold
Fig. 4
figure 4

Visual comparison on the GoPro test dataset (top three examples) and HIDE test dataset (the fourth and fifth examples). Our method recovers fine textures and major structures in text, textures, moving objects and human faces. Zoom-in for details

Evaluation metrics. Like existing baseline deblurring methods, we use PSNR and structural similarity (SSIM) [54] to evaluate all experimental results. In general, larger PSNR and SSIM values indicate higher-quality restored images. All PSNR and SSIM values in the experiments are computed with the built-in functions of MATLAB R2018a.
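
An equivalent Python evaluation could look as follows; values may differ slightly from the MATLAB built-ins, and scikit-image ≥ 0.19 is assumed for the channel_axis argument.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(restored, gt):
    """restored, gt: uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, restored, data_range=255)
    ssim = structural_similarity(gt, restored, channel_axis=-1, data_range=255)
    return psnr, ssim
```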

Fig. 5
figure 5

Visual comparison on the RealBlur test dataset. Our method recovers more realistic details than other methods on blurry images in low light. Zoom-in for details

4.2 Comparisons with state-of-the-art methods

We first quantitatively compare the proposed method with several state-of-the-art methods, including DeepDeblur [9], SRN [10], DeblurGAN-v2 [19], DSD [11], MTRNN [31], DMPHN [12], DBGAN [55], RADN [16], SAPHN [56], SimpleNet [32], MPRNet [15], MIMO-UNet+ [14] and HINet [13]. The test models of all these methods are trained on the GoPro training set and tested on the GoPro, HIDE and RealBlur test sets. In addition, we conduct qualitative evaluations and a user study to assess the visual quality of the different methods.

Quantitative evaluations. Table 1 presents quantitative results on the GoPro and HIDE test sets. Our method achieves the highest PSNR and SSIM scores, surpassing all the other methods by at least 0.6 dB on the GoPro test set. This shows that, with the help of reference images, our framework achieves state-of-the-art deblurring performance and generalization ability: single image deblurring methods cannot utilize additional reference information, whereas our method adaptively exploits the useful information in the reference image. The quantitative results on RealBlur-J are shown in Table 2. Because the training and test set data are generated in different ways, the metrics of the other state-of-the-art methods drop sharply, but our method still achieves the highest performance. In addition, our framework does not require complex multi-scale, multi-stage or multi-patch strategies. These quantitative comparisons show that our approach achieves the best performance.

Qualitative evaluations. We further compare the proposed method with other state-of-the-art methods in terms of visual quality. Among the best-performing deblurring methods, we choose MPRNet [15], MIMO-UNet+ [14], SRN [10], DeblurGAN-v2 [19], MTRNN [31] and DMPHN [12] as comparison methods; the source codes released by their authors allow a fair comparison. The visual comparisons on the GoPro and HIDE test sets are shown in Fig. 4. All five examples pose considerable restoration challenges that current single image deblurring algorithms struggle to handle. Specifically, the top three examples are selected from the GoPro test set; they contain text with large motion blur, textures, and objects in high-speed motion. In the first example, the results of the other methods exhibit severe artifacts and some are barely recognizable, yet our method still recovers legible digital text. The second example contains texture details: DeblurGAN-v2 [19] produces severe artifacts, MIMO-UNet+ [14] introduces a certain scale distortion, and MPRNet [15], although better than the other methods, is still inferior to our result. Our method effectively suppresses blur diffusion and artifacts. The third example involves high-speed motion blur, and our method still produces the result closest to the ground truth image. The last two examples are selected from the HIDE test set, where our method again achieves good visual quality. For instance, in the last example, our method better reconstructs the face shape and details, whereas the other methods struggle to recover clear facial information because the severe motion shake destroys the high-frequency details. Figure 5 shows the visual comparison on the RealBlur test set; our method recovers more details in low-light blurry regions than the other methods. Overall, with the help of reference images, our method recovers major structures and fine details on both synthetic and real-world datasets.

User study. To further evaluate the perceptual quality of the deblurring results, we conduct a user study comparing our results with three state-of-the-art methods: DMPHN [12], MIMO [14] and MPRNet [15]. The study involved 30 users with normal vision who were unaware of the experimental details, ensuring objectivity. The selected images come from different scenes of the GoPro test set and are representative. In each trial, we show a user two images (ours and a baseline) and ask them to choose, without a time limit, the one they consider most realistic; they do not know which algorithm produced which image. We collected 300 valid responses for each set of comparisons. As shown in Fig. 7, more than 80\(\%\) of the choices favored our deblurring results, which again demonstrates that our method has better subjective quality.

Table 3 Quantitative evaluation for ablation study of reference feature transfer task. The PSNR is computed on GoPro
Fig. 6
figure 6

Ablation study on reference feature transfer task. Note that "+" denotes "with" and "GT" stands for ground truth

Fig. 7
figure 7

User study results on GoPro dataset

4.3 Ablation study

4.3.1 Effect of reference feature transfer

Based on the single image deblurring task, we introduce a reference feature transfer task to improve the texture details of the deblurring results. Within the reference feature transfer task, the reference alignment module and the adaptive feature fusion module are essential, so we conduct ablation experiments on these two key modules. We retrained five network variants by adding the important components one by one: (1) the single image deblurring task without a reference image; (2) the reference alignment module with patch alignment only; (3) the reference alignment module with deformable alignment only; (4) the reference alignment module with both patch alignment and deformable alignment; (5) the entire framework with adaptive feature fusion. Each ablation experiment was trained for about 80 h.

Reference Alignment Module. Table 3 evaluates the effectiveness of the reference alignment module. Compared with the baseline (1), variants (2) and (3) demonstrate the performance gain of using only patch alignment or only deformable alignment. Variant (3) performs better than (2), which indicates that deformable alignment warps reference features better than patch alignment. As shown in Fig. 6, patch alignment and deformable alignment have different deblurring effects on the edges of license plate numbers, so we combine them in the reference alignment module. Variant (4) achieves better performance, revealing that patch alignment and deformable alignment are complementary and yield a larger gain together; Fig. 6 also shows that variant (4) produces a clearer deblurred result. In summary, the reference alignment module combines the stability of patch alignment with the superior performance of deformable alignment to achieve better reference alignment.

The last row of Table 3 shows the performance with adaptive feature fusion. For the other control groups, we perform element-wise summation of the deblurred feature \({F_\mathrm{{Deblur}}}\) and the aligned feature \({F_\mathrm{{Aligned}}}\) without using the confidence map. As shown in Table 3, adaptive feature fusion brings a gain of 0.12 dB in PSNR, which indicates that the guidance of the confidence map is beneficial. As shown in Fig. 6, the adaptive feature fusion module further improves the deblurring effect, producing sharper structures and more realistic textures (Fig. 7).

Fig. 8
figure 8

Ablation result of using different reference images on the GoPro dataset. The proposed framework generates sharper results and fewer artifacts than the result without reference. Zoom-in for details

Fig. 9
figure 9

The result of object detection. The first column is the detection result of the blurry input, and the second column is the detection result after deblurring by our method

Fig. 10
figure 10

Object detection performance evaluation

Table 4 Ablation study on different reference images

4.3.2 Robustness to different reference images

Previous experiments have demonstrated that sharp reference images bring large gains in deblurring performance. We use the similar texture information in the reference image to help image restoration, but we have not yet analyzed how reference images from different scenes affect the deblurring results. To answer this question, we experiment with different reference images on the GoPro test set. As shown in Fig. 8, Reference 1 comes from the same scene as the blurry input and is highly similar to it, while the other reference images are randomly selected from other GoPro scenes. With the help of any of these reference images, our method achieves superior deblurring results. This is because patch alignment acts as a special attention mechanism that finds and weights the most relevant texture features under the guidance of the index map P and the confidence map C.

We further conduct quantitative comparisons. As shown in Table 4, even when the reference image and the blurry image originate from different scenes, our method suffers only a slight performance degradation. This means that our framework can robustly utilize reference images from different scenes to facilitate the restoration of deblurred images.

Table 5 Performance and efficiency comparison on the GoPro test dataset. We tested all the experimental metrics using an Nvidia Titan RTX GPU
Fig. 11
figure 11

Influence of different batch sizes and initial learning rates. a Influence of batch size. b Influence of learning rate

4.4 Object detection performance evaluation

As mentioned in the introduction, image blur can seriously affect other computer vision tasks, and object detection is one of them. As one of the most fundamental and challenging problems in computer vision, object detection [57, 58] has received extensive attention in recent years, and deep learning-based methods have improved it significantly. Among them, the YOLO [59] series of algorithms has gradually become an industry benchmark owing to its good performance. However, most object detection algorithms assume that the input image is sharp, so they suffer severe performance degradation when blurred images are used as input. Image deblurring can therefore be applied before object detection to remove image blur and improve detection accuracy. As shown in Fig. 9, we use the YOLO V5 object detector on the blurry images and on the corresponding deblurred images. The first column shows the detection results on the blurry images, and the second column shows the results after deblurring with our method. The first column clearly contains many missed and false detections; in contrast, these false negatives are successfully detected in the second column.
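
A hedged sketch of this detection check using the public YOLOv5 hub interface; the specific variant (yolov5s) and the file names are placeholders, as the paper does not specify them.

```python
import torch

# Hypothetical check: run YOLOv5 on a blurry frame and on its deblurred version.
detector = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
for name in ['blurry.png', 'deblurred.png']:   # placeholder image paths
    results = detector(name)
    print(name, results.pandas().xyxy[0][['name', 'confidence']])
```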

Moreover, we further compare the proposed method with other deblurring methods in terms of the improvement they bring to object detection. Since the objects in the GoPro dataset are mostly people, cars, and potted plants, we measure the average precision of only these three classes. As shown in Fig. 10, our method yields the largest improvement in object detection performance, demonstrating its superiority.

4.5 Performance and efficiency comparison

In addition to the quantitative and qualitative comparisons, we also compare the number of parameters, FLOPs and running time with state-of-the-art methods; Table 5 shows the results. Compared with other single image deblurring methods, our method has relatively large FLOPs and parameter counts because it integrates two different tasks. In addition, at test time we divide the entire image into multiple patches, run inference on each patch separately, and then stitch the outputs back into a whole image. Compared with methods that feed the entire image into the network at once, this patch-based testing increases the inference time to a certain extent. Nevertheless, our method achieves the best performance while maintaining acceptable efficiency.

4.6 Influence of the batch size and initial learning rate

During the training phase of the whole framework, the batch size and the initial learning rate are important hyper-parameters that affect model performance. Increasing the batch size within an appropriate range improves the stability of model convergence; an initial learning rate that is too large prevents the model from converging, while one that is too small makes convergence very slow or prevents learning altogether. We therefore analyze the influence of these two hyper-parameters on the convergence of the model on the GoPro dataset; Fig. 11 shows the results.

Figure 11a shows the impact of batch size on the convergence speed of the model. It can be seen that as the batch size increases, the convergence speed of the model gradually increases. Therefore, we adjust the batch size to 12 to fully utilize the performance of the GPU.

Figure 11b shows the impact of the initial learning rate on model convergence. The model is sensitive to the initial learning rate: values that are too large or too small seriously harm performance, so we set the initial learning rate to \(1 \times {10^{ - 4}}\).

5 Conclusion

In this paper, we propose an effective framework for deblurring with reference images. The framework mainly includes a single image deblurring task and a reference feature transfer task. The single image deblurring task recovers a rough deblurred image from the blurry input, which will be used to compute the cosine similarity matrix with the reference image. The reference feature transfer task finally synthesizes high-quality deblurred results with the help of a well-designed reference alignment module and an adaptive feature fusion module. Quantitative and qualitative experimental results on synthetic and real-world datasets demonstrate that our framework achieves superior performance.

Limitations and Future Work. Although our framework achieves state-of-the-art performance, computing the cosine similarity matrix between the reference image and the single image deblurring result is memory-intensive. In addition, compared to a single image deblurring task alone, integrating the single image deblurring task and the reference feature transfer task increases computation and slows inference. In the future, we are interested in exploring lighter-weight single image deblurring models to trade off performance against computational speed.