1 Introduction

Image super-resolution (SR) is a traditional yet popular research topic in low-level vision. It can be divided into single image super-resolution (SISR) and reference-based image super-resolution (RefSR). In recent years, research on SISR has made great progress: convolutional neural networks (CNNs) [1] have greatly advanced the state of the art (SOTA) of SISR. However, due to the ill-posed nature of SISR, some essential image details are irrecoverably lost, and if the texture or content of an image does not occur in the training dataset, the performance of the network degrades. From this point of view, RefSR has a stronger theoretical basis: it allows an algorithm to transfer similar features from reference (Ref) images to the SR output, giving such methods better generalization ability.

Compared with SISR, there are few studies on RefSR. The main idea of these works is to utilize high-resolution (HR) textures from a given Ref image to produce visually pleasing results. However, existing approaches have several limitations. Zhang et al. [3] adopt a feature space defined by a pre-trained classification model to search and transfer textures between the LR and Ref images. As a follow-up work, Yang et al. [4] propose a learnable texture extractor and use hard- and soft-attention modules to transfer and fuse texture features. Nevertheless, these approaches only use the Ref image to extract texture features and thus do not take full advantage of this important training data. Moreover, they train a texture extractor whose structure is completely different from that of the SR network, so network utilization is low and the texture extractor does not directly provide what the SR network needs. To address these problems, this paper proposes a novel image super-resolution network that integrates texture extraction and super-resolution. Specifically, several innovative designs are introduced to overcome the limitations of TTSR [4] and SRNTT [3]. These improvements enable our method to make full use of the Ref image and achieve better visual results than SOTA approaches. The main contributions of this paper are as follows.

To the best of our knowledge, we are among the first to use the same network architecture for both super-resolution and texture transfer. This shared architecture implicitly increases the amount of training data. We train a reverse network that maps reference images to the corresponding texture features, and these features are then used to carry out texture transfer. This design lets the HR features provide the SR network with exactly what it needs.

2 Related work

In this section, we review previous works on SISR and RefSR that are most relevant to our work.

2.1 Single image super-resolution

SISR has been studied for a long time. With the development of deep learning, SISR methods have improved greatly over traditional non-learning-based methods. They can be grouped according to the most distinctive features of their model designs, such as linear networks, residual networks, and attention-based networks. For more details, we refer the reader to Anwar et al. [5] and Wang et al. [6].

Super-resolution CNN (SRCNN) [7], proposed by Dong et al., first adopts deep learning for SISR by using a three-layer CNN to represent the mapping function. To speed up the SR process, Dong et al. [8] replace the interpolated LR input with the original LR image and adopt deconvolution at the last layer to enlarge the feature map. VDSR [9], DnCNN [10], and other later models are linear networks like SRCNN. The enhanced deep residual network (EDSR) [12] modifies the ResNet architecture [13] for the SR task. The deeply-recursive convolutional network (DRCN) [15] utilizes recursive learning, with the motivation of progressively breaking the harder SR problem down into a set of simpler, easier-to-solve ones.

Many related techniques have been proposed to solve the SR problem. The residual channel attention network (RCAN) [17] designs a channel attention mechanism for each local residual block. To improve the perceptual quality of the results, Johnson et al. [19] introduce perceptual loss into SR tasks. The SR generative adversarial network (SRGAN) [20] adopts generative adversarial networks (GANs) [22] and introduces an adversarial loss to improve SR results. Guo et al. [23] propose a novel dual regression scheme for paired and unpaired data, providing a new solution to the SR problem.

2.2 Reference-based image super-resolution

Compared to SISR, there are few studies on RefSR, which can obtain more accurate details from a Ref image. Early work relied on image alignment or patch matching. Image alignment methods, such as [2, 28], require good alignment between the LR and Ref images, and the time-consuming alignment is adverse to real applications. Vivek et al. [25] and Zheng et al. [26] used patch matching to search for proper reference information. More recently, SR by neural texture transfer (SRNTT) [3] applies patch matching between VGG [27] features of the LR and Ref images to swap similar texture features. In SRNTT, the VGG features are not trainable, and all swapped features are fed equally into the main network. As a follow-up to SRNTT, the texture transformer network for image SR (TTSR) [4] uses a learnable texture extractor and applies an attention mechanism to texture-feature fusion. Because its texture extractor has a different structure, the extracted features cannot fully meet the requirements of the main network. Moreover, these methods only treat the Ref image as a reference and do not make the most of it. To address these problems, we use the same network structure for the main network and the texture extractor; the texture extractor shares parameters with the main network during training, which makes better use of the Ref image.

3 Approach

In this section, we introduce the proposed SR-USRN. Our method is trained in three steps. The process will be discussed in Sects. 3.1, 3.2 and 3.3.

Fig. 1

The proposed novel RefSR structure. The orange dotted lines indicate computations performed only during training. The SR-MainNet and ReverseNet blocks at different positions share parameters, respectively. Q, K and V are the texture features extracted from the LR image, the down-sampled Ref image and the original Ref image, respectively

3.1 SR-MainNet

As shown in Fig. 1, the SR-MainNet is the main component of the whole method. Here, we use the same network for the SR-MainNet and the texture extractor. In Fig. 1, \(I^{\mathrm{LR}}\), \(I^{\mathrm{Ref}}\), \(I^{\mathrm{HR}}\) and \(I^{{\mathrm{Ref}}\downarrow }\) represent the input LR image, the reference image, the ground truth of the input image and the 4\(\times \) bicubic-downsampled reference image, respectively. In this step, we train two parallel paths: \(I^{\mathrm{LR}}\) and \(I^{{\mathrm{Ref}}\downarrow }\) are fed to the SR-MainNet as follows:

$$\begin{aligned} I^{\mathrm{SR}}&={\text {SR-MainNet}}\big (I^{\mathrm{LR}}\big ) \end{aligned}$$
(1)
$$\begin{aligned} I^{\mathrm{RefSR}}&={\text {SR-MainNet}}\big (I^{{\mathrm{Ref}}\downarrow }\big ) \end{aligned}$$
(2)

where \(I^{\mathrm{SR}}\) and \(I^{\mathrm{RefSR}}\) denote the predictions of the HR image \(I^{\mathrm{HR}}\) and the Ref image \(I^{\mathrm{Ref}}\), respectively. In this step, the overall loss function consists of three components:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{overall}}=\lambda _{\mathrm{rec}} {\mathcal {L}}_{\mathrm{rec}}+\lambda _{\mathrm{adv}} {\mathcal {L}}_{\mathrm{adv}}+\lambda _{\mathrm{per}} {\mathcal {L}}_{\mathrm{per}} \end{aligned}$$
(3)
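
As a minimal sketch, the weighted combination in Eq. (3) could be computed as follows; the weights are those reported in Sect. 3.4, and `loss_rec`, `loss_adv`, `loss_per` are assumed to be computed as in Eqs. (4)–(6) below:

```python
# Hypothetical combination of the three loss terms in Eq. (3);
# the weights follow the values reported in Sect. 3.4.
lambda_rec, lambda_adv, lambda_per = 1.0, 1e-3, 1e-2
loss_overall = (lambda_rec * loss_rec
                + lambda_adv * loss_adv
                + lambda_per * loss_per)
```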

The reconstruction loss is essential:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{rec}}= & {} \frac{1}{{CHW}}\left\| I^{\mathrm{HR}}-I^{\mathrm{SR}}\right\| _{1} + \frac{1}{{CHW}}\left\| I^{\mathrm{Ref}}-I^{\mathrm{RefSR}}\right\| _{1} \end{aligned}$$
(4)

where (C, H, W) is the size of the HR and Ref images. For sharper results and easier convergence, we use the \(L_1\) loss rather than the \(L_2\) loss. Because our method has two parallel paths, this loss has two terms, and the following loss functions likewise have two terms each. An adversarial loss is adopted to increase the naturalness of the SR image; as discussed in [4], we adopt WGAN-GP [28]. This loss can be interpreted as:

$$\begin{aligned} {\mathcal {L}}_{D}= & {} \underset{{\tilde{x}} \sim {\mathbb {P}}_{g}}{{\mathbb {E}}}[D({\tilde{x}})]-\underset{x \sim {\mathbb {P}}_{r}}{{\mathbb {E}}}[D(x)]\nonumber \\&+\,\lambda \underset{{\hat{x}} \sim {\mathbb {P}}_{{\hat{x}}}}{{\mathbb {E}}}\left[ \left( \left\| \nabla _{{\hat{x}}} D({\hat{x}})\right\| _{2}-1\right) ^{2}\right] \nonumber \\ {\mathcal {L}}_{G}= & {} -\underset{{\tilde{x}} \sim {\mathbb {P}}_{g}}{{\mathbb {E}}}[D({\tilde{x}})] \end{aligned}$$
(5)

where \({\mathbb {P}}_{g}\) and \({\mathbb {P}}_{r}\) are the model distribution and the real data distribution, respectively. The aim of the adversarial loss is to make the model distribution approximate the real distribution. For more details, we refer the reader to [28].
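
A minimal PyTorch sketch of the WGAN-GP objective in Eq. (5) is given below; the discriminator `D`, the image tensors and the penalty weight `lambda_gp = 10` (the common default of [28]) are assumptions for illustration, not details fixed by this paper.

```python
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    # Sample x_hat on straight lines between real and generated images.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=D(x_hat).sum(), inputs=x_hat,
                                create_graph=True)[0]
    grads = grads.reshape(grads.size(0), -1)
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

def discriminator_loss(D, real, fake):
    # E[D(x_tilde)] - E[D(x)] + gradient penalty (first part of Eq. (5)).
    return (D(fake.detach()).mean() - D(real).mean()
            + gradient_penalty(D, real, fake.detach()))

def generator_loss(D, fake):
    # -E[D(x_tilde)] (second part of Eq. (5)).
    return -D(fake).mean()
```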

This paper applies a perceptual loss [3, 20, 29] to improve the visual quality of the SR and RefSR images. The perceptual loss reduces the difference between the predicted and target images in a feature space. It can be interpreted as:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{per}}= & {} \frac{1}{C_{i} H_{i} W_{i}}\left\| \phi _{i}^{\mathrm{vgg}}\left( I^{\mathrm{SR}}\right) -\phi _{i}^{\mathrm{vgg}}\left( I^{\mathrm{HR}}\right) \right\| _{2}^{2}\nonumber \\&+ \frac{1}{C_{i} H_{i} W_{i}}\left\| \phi _{i}^{\mathrm{vgg}}\left( I^{\mathrm{RefSR}}\right) -\phi _{i}^{\mathrm{vgg}}\left( I^{\mathrm{Ref}}\right) \right\| _{2}^{2} \end{aligned}$$
(6)

where \(\phi _{i}^{\mathrm{vgg}}(\cdot )\) denotes the i-th layer’s feature map of VGG19, and \(\left( C_{i}, H_{i}, W_{i}\right) \) represents the shape of the feature map at that layer.
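
A minimal sketch of Eq. (6), assuming a single frozen VGG19 layer (relu5_1 here; the exact layer index i and the input normalization are assumptions):

```python
import torch.nn.functional as F
from torchvision import models

# Frozen VGG19 feature extractor up to relu5_1 (an assumed choice of phi_i).
vgg = models.vgg19(pretrained=True).features[:30].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(sr, hr, ref_sr, ref):
    # Mean squared feature distance, i.e. the 1/(C_i H_i W_i) ||.||_2^2 terms.
    def dist(pred, target):
        return F.mse_loss(vgg(pred), vgg(target))
    return dist(sr, hr) + dist(ref_sr, ref)
```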

In this step, the preliminary SR-MainNet is trained without reference, and the HR and Ref images are treated equally. This step serves two purposes. First, we obtain an SR-MainNet that works without a reference image. Second, the SR-MainNet also works as a texture extractor, providing the features of the LR and Ref\(\downarrow \) images.
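
The way the SR-MainNet doubles as a texture extractor can be sketched with PyTorch forward hooks; the sub-module name `body` and the tensor names are hypothetical, since the exact layer from which features are captured is not specified here.

```python
import torch

def extract_sr_and_feature(net, x, layer_name='body'):
    """Run `net` on `x` and also return the intermediate activation of the
    sub-module named `layer_name` (a hypothetical name), which serves as the
    texture feature F in Fig. 1."""
    captured = {}
    handle = getattr(net, layer_name).register_forward_hook(
        lambda module, inputs, output: captured.update(feat=output))
    y = net(x)
    handle.remove()
    return y, captured['feat']

# Usage (assumed model and tensors):
# sr,     f_lr       = extract_sr_and_feature(sr_mainnet, lr)        # Eq. (1), F^LR
# ref_sr, f_ref_down = extract_sr_and_feature(sr_mainnet, ref_down)  # Eq. (2), F^Ref_down
```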

3.2 ReverseNet

Step 1 provides the features of the LR and Ref\(\downarrow \) images. To realize texture transfer as in [4], we also need the features of the Ref image, so in step 2 the ReverseNet is trained to obtain them. The structure of the ReverseNet is shown in Fig. 2. This paper chooses VGG19 as the backbone of the ReverseNet; the choice of backbone is discussed further in Sect. 4.3. Three levels of features of the Ref image are generated by VGG19 and fed into several convolution layers, i.e., the interface module in Fig. 2. The output of the ReverseNet is a set of Ref-image features with the same form as the features of the LR and Ref\(\downarrow \) images. To train the ReverseNet, the output of the SR-MainNet is fed into it as input. The training process is as follows:

$$\begin{aligned} F^{\mathrm{SR}}= & {} {\text {ReverseNet}}\big (I^{\mathrm{SR}}\big ) \end{aligned}$$
(7)
$$\begin{aligned} F^{\mathrm{RefSR}}= & {} {\text {ReverseNet}}\big (I^{\mathrm{RefSR}}\big ) \end{aligned}$$
(8)

where \(F^{\mathrm{SR}}\) and \(F^{\mathrm{RefSR}}\) are the outputs of the ReverseNet for the two inputs; each has three levels. Our goal is to make Ref images produce the same type of feature maps as the LR and Ref\(\downarrow \) images. To constrain \(F^{\mathrm{SR}}\) and \(F^{\mathrm{RefSR}}\), the features captured from the SR-MainNet are treated as ground truth:

$$\begin{aligned}&F^{\mathrm{LR}}\overset{c}{=}{\text {SR-MainNet}}\big (I^{\mathrm{LR}}\big ) \end{aligned}$$
(9)
$$\begin{aligned}&F^{{\mathrm{Ref}} \downarrow }\overset{c}{=}{\text {SR-MainNet}}\big (I^{{\mathrm{Ref}}\downarrow }\big ) \end{aligned}$$
(10)

where \(\overset{c}{=}\) denotes that \(F^{\mathrm{LR}}\) and \(F^{{\mathrm{Ref}} \downarrow }\) are feature maps captured from inside the SR-MainNet rather than its final output. In this step, the parameters of the SR-MainNet are fixed. \(F^{\mathrm{LR}}\) and \(F^{\mathrm{SR}}\) have the same size at each level, as do \(F^{{\mathrm{Ref}} \downarrow }\) and \(F^{\mathrm{RefSR}}\). The loss function is:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{rev}}= & {} \frac{1}{C H W}\left\| F^{\mathrm{LR}}-F^{\mathrm{SR}}\right\| _{1} \nonumber \\&+ \frac{1}{C H W}\left\| F^{{\mathrm{Ref}}\downarrow }-F^{\mathrm{RefSR}}\right\| _{1} \end{aligned}$$
(11)

where (C, H, W) is the size of the feature maps of SR and RefSR. This loss enables the ReverseNet to recover the SR-MainNet's intermediate feature maps from its output. Thus, when the Ref image is fed into the ReverseNet, the output features have the same form as the features of the LR and Ref\(\downarrow \) images.
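
A minimal sketch of a ReverseNet with a VGG19 backbone and a convolutional interface module, as in Fig. 2, is given below. The cut points (the 2nd, 4th and 8th ReLU layers, cf. Sect. 4.3.2), the output channel width and the interface depth are assumptions made for illustration.

```python
import torch.nn as nn
from torchvision import models

class ReverseNet(nn.Module):
    def __init__(self, out_channels=64):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features
        # Three backbone stages ending at the 2nd, 4th and 8th ReLU layers,
        # yielding features F1, F2, F3 at full, 1/2 and 1/4 resolution.
        self.stage1 = vgg[:4]
        self.stage2 = vgg[4:9]
        self.stage3 = vgg[9:18]
        # Interface module: map each scale to the SR-MainNet feature width.
        self.interface1 = nn.Conv2d(64, out_channels, 3, padding=1)
        self.interface2 = nn.Conv2d(128, out_channels, 3, padding=1)
        self.interface3 = nn.Conv2d(256, out_channels, 3, padding=1)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # L1, L2, L3 texture features fed to the SR-MainNet (Fig. 2).
        return self.interface1(f1), self.interface2(f2), self.interface3(f3)
```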

Fig. 2

ReverseNet structure. The blue box is the backbone of the ReverseNet. The interface module consists of several convolution layers that process the feature maps from the backbone. F1, F2, F3 are three scales of feature maps captured from the backbone; L1, L2, L3 are three levels of texture features that are fed to the SR-MainNet

3.3 Combine

In the last two steps, we trained the SR-MainNet without Ref and the ReverseNet. In this step, the two networks are combined to fine-tune the final SR-MainNet with Ref. The ReverseNet trained in step 2 takes the Ref image \(I^{\mathrm{Ref}}\) as input and outputs the V (value) features. The Q (query) and K (key) features, i.e., \(F^{\mathrm{LR}}\) and \(F^{{\mathrm{Ref}} \downarrow }\), are captured from the SR-MainNet with inputs \(I^{\mathrm{LR}}\) and \(I^{{\mathrm{Ref}}\downarrow }\):

$$\begin{aligned}&Q\overset{c}{=}{\text {SR-MainNet}}\big (I^{\mathrm{LR}}\big ) \end{aligned}$$
(12)
$$\begin{aligned}&K\overset{c}{=}{\text {SR-MainNet}}\big (I^{{\mathrm{Ref}}\downarrow }\big ) \end{aligned}$$
(13)
$$\begin{aligned}&V={\text {ReverseNet}}\big (I^{\mathrm{Ref}}\big ) \end{aligned}$$
(14)

where \(\overset{c}{=}\) denotes that Q and K are feature maps captured from inside the SR-MainNet rather than its final output. As shown in Fig. 1, the SR-MainNet and the ReverseNet generate the features Q, K and V, respectively, which are then fed into the texture transformer module. In the texture transformer, this paper applies a transfer strategy similar to [4], but without TTSR's combined hard-and-soft attention module; only a hard index map is used (see Sect. 3.4).

To embed the relevance between the LR and Ref images, the similarity between Q and K is calculated. Q and K are unfolded into patches, denoted \(q_{i}\left( i \in \left[ 1, H_{\mathrm{LR}} \times W_{\mathrm{LR}}\right] \right) \) and \(k_{j}\left( j \in \left[ 1, H_{\mathrm{Ref}} \times W_{\mathrm{Ref}}\right] \right) \). A normalized inner product is then used to calculate the relevance \(r_{i, j}\) between each \(q_{i}\) and \(k_{j}\):

$$\begin{aligned} r_{i, j}=\left\langle \frac{q_{i}}{\left\| q_{i}\right\| }, \frac{k_{j}}{\left\| k_{j}\right\| }\right\rangle \end{aligned}$$
(15)
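
Eq. (15) can be sketched with `torch.nn.functional.unfold` as below; the 3×3 patch size and unit stride are assumptions.

```python
import torch
import torch.nn.functional as F

def relevance(q_feat, k_feat, patch=3):
    """Return r[b, i, j] = <q_i/||q_i||, k_j/||k_j||> for unfolded patches of Q and K."""
    q = F.unfold(q_feat, kernel_size=patch, padding=patch // 2)  # (B, C*p*p, N_LR)
    k = F.unfold(k_feat, kernel_size=patch, padding=patch // 2)  # (B, C*p*p, N_Ref)
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    return torch.bmm(q.transpose(1, 2), k)                       # (B, N_LR, N_Ref)
```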

From the relevance we then construct an index map M, whose i-th element \(m_{i}\left( i \in \left[ 1, H_{\mathrm{LR}} \times W_{\mathrm{LR}}\right] \right) \) is calculated as:

$$\begin{aligned} m_{i}=\mathop {\arg \max }_{j} r_{i, j} \end{aligned}$$
(16)

The map M thus records, for each position in the LR image, the most relevant position in the Ref image. We then obtain the transferred HR features T from the Ref image by applying an index selection operation to the unfolded patches of V, using M as the index:

$$\begin{aligned} t_{i}=v_{m_{i}} \end{aligned}$$
(17)

where \(t_{i}\) denotes the value of T at the i-th position, selected from position \(m_{i}\) of V. After this transfer, the SR-MainNet receives the transferred features T as reference features. The fusion operation can be represented as:

$$\begin{aligned} F_{\mathrm{out}}=F+{\text {Conv}}({\text {Concat}}(F, T)) \end{aligned}$$
(18)

where \(F_{\mathrm{out}}\) indicates the synthesized output features, and Conv and Concat denote a convolutional layer and the concatenation operation, respectively. Note that the formula still holds without T, in which case it reduces to:

$$\begin{aligned} F_{\mathrm{out}}=F \end{aligned}$$
(19)

The SR-MainNet without reference corresponds to this case without T, so the parameters of the SR-MainNet trained earlier can be reused and fine-tuned here. Moreover, because this paper uses the SR-MainNet itself as the texture extractor, the distributions of the extracted features are similar, which improves the quality of fusing the Ref and LR features.
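
A minimal sketch of Eqs. (16)–(18) follows. For simplicity it assumes Q, K and V are unfolded on the same spatial grid with 3×3 patches (in practice V comes from the HR Ref image and would use proportionally larger patches); the fusion layer `conv` and all tensor names are assumptions, e.g. `conv = nn.Conv2d(2 * c, c, 3, padding=1)` if V and F share the channel width c.

```python
import torch
import torch.nn.functional as F

def transfer_and_fuse(r, v_feat, f, conv, patch=3):
    """r: relevance from Eq. (15), shape (B, N_LR, N_Ref); v_feat: V features;
    f: LR features F; conv: a Conv2d taking the concatenation of F and T."""
    b, c, h, w = f.shape
    v = F.unfold(v_feat, kernel_size=patch, padding=patch // 2)   # (B, C*p*p, N_Ref)
    m = r.argmax(dim=-1)                                          # Eq. (16): index map M
    idx = m.unsqueeze(1).expand(-1, v.size(1), -1)                # broadcast over channels
    t_patches = v.gather(dim=-1, index=idx)                       # Eq. (17): t_i = v_{m_i}
    # Fold the selected patches back into a dense map T (overlaps averaged).
    t = F.fold(t_patches, (h, w), kernel_size=patch, padding=patch // 2)
    overlap = F.fold(torch.ones_like(t_patches), (h, w),
                     kernel_size=patch, padding=patch // 2)
    t = t / overlap
    return f + conv(torch.cat([f, t], dim=1))                     # Eq. (18)
```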

In the training process, this step has the same loss function as Sect. 3.1.

3.4 Implementation details

For the SR-MainNet, we adopt the same network used in TTSR [4] except for its hard-and-soft attention module; to simplify the calculation, this paper only uses the hard-attention module to generate the transform map. We implement the model in PyTorch on an NVIDIA 1080Ti GPU. During training, we augment the training images by random horizontal and vertical flips followed by random rotations of 90°, 180° and 270°. We use Adam [30] with \(\beta _{1}=0.9 \) and \(\beta _{2}=0.999 \) as the optimizer. The learning rate is set to 1e−4 and the batch size is 6 images. In the training stage, the size of the input LR image is \(40\times 40\), and the size of the Ref image and the output SR image is \(160\times 160\). The weight coefficients for \({\mathcal {L}}_{\mathrm{rec}}\), \({\mathcal {L}}_{\mathrm{adv}}\) and \({\mathcal {L}}_{\mathrm{per}}\) are 1, 1e−3 and 1e−2, respectively, in Sects. 3.1 and 3.3. The SR-MainNet without reference (Sect. 3.1) is trained for 30 epochs, the ReverseNet (Sect. 3.2) for 15 epochs, and finally the combined SR-MainNet and ReverseNet (Sect. 3.3) for another 50 epochs. In the testing stage, there is no size requirement for the LR and Ref images, except that they are cropped to a multiple of 4.
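
The augmentation and optimizer settings above can be sketched as follows; the model instance `sr_mainnet` and the data-loading code are assumed, and only settings stated in this section are reproduced.

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(*imgs):
    """Apply the same random flips / 90-degree rotations to LR, Ref and HR crops."""
    if random.random() < 0.5:
        imgs = tuple(TF.hflip(im) for im in imgs)
    if random.random() < 0.5:
        imgs = tuple(TF.vflip(im) for im in imgs)
    k = random.randint(0, 3)                      # 0, 90, 180 or 270 degrees
    return tuple(torch.rot90(im, k, dims=(-2, -1)) for im in imgs)

# Adam optimizer with the reported hyper-parameters (`sr_mainnet` is assumed).
optimizer = torch.optim.Adam(sr_mainnet.parameters(),
                             lr=1e-4, betas=(0.9, 0.999))
```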

Table 1 PSNR/SSIM comparison among different SR methods on four different datasets

4 Experiments

4.1 Datasets and metrics

Following the same setting as [3, 4], we train and test our model on the recently proposed RefSR dataset CUFED5 [3]. To evaluate the generalization capacity of the model trained on CUFED5, we also test it on Sun80 [31] and Urban100 [32]. To evaluate the performance of the methods, we calculate PSNR and SSIM on the Y channel of YCbCr space as the quantitative criteria.
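
For clarity, a minimal sketch of this evaluation protocol (the BT.601 Y-channel conversion and the scikit-image metric calls are standard; uint8 RGB arrays in the 0–255 range are assumed):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def rgb_to_y(img):
    """Convert an HxWx3 uint8 RGB image to the Y channel of YCbCr (ITU-R BT.601)."""
    img = img.astype(np.float64) / 255.0
    return 16.0 + 65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]

def psnr_ssim_y(sr, hr):
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    psnr = peak_signal_noise_ratio(y_hr, y_sr, data_range=255)
    ssim = structural_similarity(y_hr, y_sr, data_range=255)
    return psnr, ssim
```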

Fig. 3

Visual comparison among different SR methods on the CUFED5 testing set (first example), Urban100 (second and third examples) and Sun80 (fourth example)

4.2 Evaluation

To evaluate the effectiveness of SR-USRN, our model is compared with other SOTA SISR and RefSR methods, as shown in Table 1. These experiments use a 4\(\times \) scaling factor between the LR and HR images.

Following the setting in SRNTT [3] and TTSR [4], this paper trains all methods on the CUFED5 training set and tests on the CUFED5 testing set, Sun80 and Urban100. Although adversarial and perceptual losses improve the visual quality of SR images, they lower the PSNR and SSIM scores. Therefore, for a fair comparison on PSNR and SSIM, we train another version of our model, named SR-USRN-rec, optimized only with the reconstruction loss; SRNTT-rec and TTSR-rec denote SRNTT and TTSR optimized only with the reconstruction loss as well.

The detailed evaluation results are shown in Table 1. SR-USRN-rec obtains the second-highest scores on all three datasets. When trained with the adversarial and perceptual losses, our method performs better than SRNTT and TTSR and achieves the highest scores on CUFED5 and Sun80; on Urban100, SR-USRN and SRNTT perform similarly. Overall, our method achieves SOTA SR performance both with and without the adversarial loss.

We also compare the number of network parameters and the inference time with other RefSR methods in Table 2. The inference time is measured on a \(128\times 128\) LR input and a \(512 \times 512\) Ref image. Our approach shows a slight increase in parameters and inference time over TTSR, which is acceptable given the performance gain.

To compare visual quality, we show the results of different methods in Fig. 3. Our method achieves the best visual quality among the compared SISR and RefSR approaches: SR-USRN recovers texture better than SRNTT and avoids some of the unpleasant texture transfer artifacts produced by TTSR.

Table 2 Number of parameters and inference time of different algorithms

4.3 Ablation study

4.3.1 The use of the parallel structure and ReverseNet

SR-USRN mainly contains two components: the ReverseNet and the parallel SR-MainNet structure. To verify the effectiveness of each, we train two ablated versions, each using only one of them; the results are shown in Table 3.

Table 3 Ablation study on the use of parallel structure and ReverseNet

The version without the parallel SR-MainNet means that the parallel structure is not used: only the original input images, without reference images, are used to train the SR-MainNet and ReverseNet. The version without the ReverseNet means that the second step is skipped and a ReverseNet initialized from VGG19 is fine-tuned directly in the third step. As shown in Table 3, the network with both parts achieves the highest score, which shows that both parts are indispensable.

4.3.2 The structure of the ReverseNet

The purpose of the ReverseNet is to transform the Ref image into the corresponding features of the SR-MainNet, so its ability to recover the MainNet features is important for the texture transformer. Here, we investigate the impact of the ReverseNet structure on the test results, considering several typical backbones: VGG19, ResNet18 and ResNet34. As shown in Fig. 2, part of VGG19, ResNet18 or ResNet34 is taken as the backbone of the ReverseNet, and several convolution layers follow the three levels of features from the main structure. For VGG19, the second, fourth and eighth ReLU layers are used for feature swapping. In ResNet18 and ResNet34, the first layer is a convolution with a \(7\times 7\) kernel and stride 2, and the third layer is a max-pooling layer with stride 2; to avoid reducing the image size too early, this paper changes the first convolution to a \(3\times 3\) kernel with stride 1 and removes the first max-pooling layer. As shown in Table 4, VGG19 obtains the best result. ResNet34 is deeper than ResNet18, which in turn is deeper than the truncated VGG19, yet the order of the PSNR/SSIM scores is the opposite. Because of the limited data volume, a ReverseNet with a deeper structure cannot recover the features in the SR-MainNet, which degrades the results. In future work, we will pay more attention to the design of the ReverseNet.
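
For reference, a minimal sketch of the ResNet stem modification described above (channel widths match torchvision's ResNet18 stem; whether bias is used is an assumption):

```python
import torch.nn as nn
from torchvision import models

# Replace the 7x7 stride-2 stem convolution with a 3x3 stride-1 one and drop the
# first max-pooling layer, so the ReverseNet backbone does not downsample too early.
backbone = models.resnet18(pretrained=True)
backbone.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
backbone.maxpool = nn.Identity()
```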

4.3.3 Effect of reference similarity

As discussed in [3, 4], the similarity between the LR and Ref images is a key factor in the performance of RefSR methods. This paper investigates the performance of SOTA RefSR methods and our proposed method at different reference levels. Table 5 lists the results of several algorithms at five levels of reference similarity. Comparing the PSNR/SSIM scores in Table 5, it is obvious that a higher relevance level yields better scores for each method. Our method follows this trend as well, which means it also has the ability to distinguish the relevance of image content and texture.

Table 4 Ablation study on different structure of ReverseNet
Table 5 Ablation study on reference images of different similarity

4.3.4 Ability to find right texture

Our method uses the SR-MainNet to transfer texture from the Ref image to the LR image. In Fig. 4, we use the transform map to transform the original image directly. As the similarity decreases, the quality of the transformed image declines, which shows that the SR-MainNet expresses the features correctly. In Fig. 5, there is obviously only one girl in the right image. Three boxes with different colors mark the corresponding index positions in the two images: the boxes on the faces of different persons all map to the face of the girl on the right. This shows that the SR-MainNet can find similar texture even when the image content is different.

4.3.5 Generalization performance of SR-USRN

To demonstrate the generalization performance of SR-USRN, several images captured in daily life with an iPhone 11 are super-resolved. In Fig. 6, images c and f show more regular and realistic textures than b and e. This experiment demonstrates the good generalization performance of SR-USRN.

Fig. 4

Using reference images to recover the ground truth. Images with different similarities are shown in the first row; the corresponding transformed images and the ground truth are shown in the second row.

Fig. 5

Corresponding positions in the LR and reference images. Image a is the LR image and b is the reference image. Boxes of the same color in the two images mark corresponding positions in the transform map

Fig. 6

Images a and d are original images; b and e are processed by DRN; c and f are processed by SR-USRN

5 Conclusion

In this paper, we propose SR-USRN, which uses the same network structure to realize texture transfer and super-resolution. SR-USRN is trained in three steps: training the SR-MainNet without reference, training the ReverseNet to recover the SR-MainNet features from the Ref image, and combining the SR-MainNet and ReverseNet to train the final RefSR network. The Ref image is used to train the SR-MainNet in the first step, and the parameters are shared between super-resolution and texture transfer. This design makes full use of the Ref image, and the shared network structure lets the texture transformer know what the SR network really needs. Extensive experiments demonstrate that SR-USRN achieves significant improvements over SOTA approaches in both quantitative and qualitative evaluations. We also find that the ReverseNet structure affects the extraction of Ref-image features and the final SR results; in future work, we will design a more efficient ReverseNet based on the SR-MainNet structure.